Towards the end of 2023, I applied for and was accepted into the MSc Data Science program at the University of Cape Town. This new journey inspired me to embark on a little side project of my own: a task that would involve all stages of the data analysis pipeline, from data collection to data cleaning to visualising the data and, finally, using it to tell a story. I also envisioned it as an opportunity to put into practice what I would learn throughout the year as part of my MSc.
I chose the focus of my project to be analysing the times I clocked in and out at the office. I figured this would be interesting to analyse, especially considering that as part of my MSc I would have lectures that would sometimes eat into my usual working hours. That, coupled with other commitments that come up in life, sounded like it would make for an interesting analysis and that’s how this project was born.
This notebook demonstrates how I went about cleaning, analysing and plotting the data. A more detailed post will follow on my blog, delving into the data collection process and taking a more narrative, storytelling approach.
Read in data
The data come from a Notion database, exported as a CSV, and contain fields for the date of interest, the times I clocked in and out on that date, and notes detailing any events of interest that occurred on that date, e.g. a lecture, a pre-work commitment, a conference, et cetera.
We begin by reading in the data while preventing automatic casting, as this leads to the time fields being cast to a difftime object, which isn't conducive to our analysis. We therefore use the col_types argument to retain all data as character vectors. We also convert the data to a tibble for the nice qualities that tibbles provide.[1]
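```r
# Read data
data <- read_csv("data/data.csv", col_types = "cccc") %>%
  as_tibble()

# View data
data %>% head

# Structure of the data
str(data)

# Unique event options
unique(data$Notes)
```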
The data contain 250 entries, one per day, and all fields are stored as character vectors. The notes field is a multi-select field whose values are a comma-separated character vector of any number of the following:
Table of event descriptions

| Event | Description |
|----------------------|-----------------------------------------------------------|
| **Events where I am in office** | |
| 11:00 lecture | A day where I had an 11:00 lecture |
| 14:00 lecture | A day where I had a 14:00 lecture |
| 16:00 lecture | A day where I had a 16:00 lecture |
| Study leave | A day or part of a day where I was on study leave |
| Pre-work commitment | A day where I had some commitment to tend to before going into the office |
| Post-work commitment | A day where I had some commitment to tend to after going into the office |
| Standard | A day where there were none of the above events |
| **Events where I am not in office** | |
| Work from home | A day where I worked from home |
| Annual leave | A day where I was on annual leave |
| Sick leave | A day where I was on sick leave |
| Public holiday | A day that was a public holiday |
| Conference | A day or part of a day where I was away at a conference |
Prepare data for analysis
In order to prepare the data for analysis, a few cleaning steps are necessary. We begin by renaming the columns to make them easier to work with and more descriptive of the data they contain.
```r
# Rename columns
data <- data %>%
  rename(
    date = Date,
    clock.in = "Clock In Time",
    clock.out = "Clock Out Time",
    event = Notes
  )

data %>% head
```
```
# A tibble: 6 × 4
  date                       clock.in clock.out event
  <chr>                      <chr>    <chr>     <chr>
1 Monday, 8 January 2024     09:02    16:49     <NA>
2 Tuesday, 9 January 2024    09:22    16:36     <NA>
3 Wednesday, 10 January 2024 09:21    14:25     <NA>
4 Thursday, 11 January 2024  09:19    16:31     <NA>
5 Friday, 12 January 2024    <NA>     <NA>      Work from home
6 Monday, 15 January 2024    08:45    16:51     <NA>
```
We also add an ID column to make identifying rows easier.
```r
# Add id column
data <- data %>%
  mutate(id = as.numeric(rownames(data)))

data %>% head
```
```
# A tibble: 6 × 5
  date                       clock.in clock.out event             id
  <chr>                      <chr>    <chr>     <chr>          <dbl>
1 Monday, 8 January 2024     09:02    16:49     <NA>               1
2 Tuesday, 9 January 2024    09:22    16:36     <NA>               2
3 Wednesday, 10 January 2024 09:21    14:25     <NA>               3
4 Thursday, 11 January 2024  09:19    16:31     <NA>               4
5 Friday, 12 January 2024    <NA>     <NA>      Work from home     5
6 Monday, 15 January 2024    08:45    16:51     <NA>               6
```
We then manually cast fields to appropriate data types. Here we take advantage of Lubridate's excellent datetime casting capabilities, using the parse_date_time function to cast both the date and the clock in/out time fields, which demonstrates the versatility of this function and the library. This casts the fields to the POSIXct class, which allows us to perform various datetime operations on them. However, here we run into a characteristic of R that's less than ideal.
When working with dates and times in R, the aforementioned POSIXct class is used, and this class mandates having both a date portion and a time portion. This means that date fields are required to have some arbitrary time portion and, similarly, time fields are required to have some arbitrary date portion. This is why you'll often see time fields beginning with 1970-01-01, the Unix epoch (a reference point for measuring time in seconds).[2] (Lubridate, as we'll see below, uses 0000-01-01 instead.) This ultimately has no effect on functionality but is a nuisance when it comes to display, as it requires reformatting to exclude the unwanted portion.
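As a quick illustration of this quirk, here is a minimal snippet (using Lubridate, as we do below):

```r
library(lubridate)

# Parsing a bare time yields a full datetime with a placeholder date portion
t <- parse_date_time("09:02", orders = "%H:%M")
t                   # "0000-01-01 09:02:00 UTC"

# Reformatting drops the unwanted date portion for display
format(t, "%H:%M")  # "09:02"
```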
```r
# Cast date, clock.in and clock.out using Lubridate
data <- data %>%
  mutate(
    date = parse_date_time(date, orders = "dmy"),
    clock.in = parse_date_time(clock.in, orders = "%H:%M"),
    clock.out = parse_date_time(clock.out, orders = "%H:%M"),
    # Convert event names to programmatically friendly strings
    event = gsub(" ", "_", event),
    event = gsub(":", "_", event),
    event = gsub("-", "_", event),
    # Replace NA events
    event = replace(event, is.na(event), "Standard")
  )

data %>% head
```
```
# A tibble: 6 × 5
  date                clock.in            clock.out           event           id
  <dttm>              <dttm>              <dttm>              <chr>        <dbl>
1 2024-01-08 00:00:00 0000-01-01 09:02:00 0000-01-01 16:49:00 Standard         1
2 2024-01-09 00:00:00 0000-01-01 09:22:00 0000-01-01 16:36:00 Standard         2
3 2024-01-10 00:00:00 0000-01-01 09:21:00 0000-01-01 14:25:00 Standard         3
4 2024-01-11 00:00:00 0000-01-01 09:19:00 0000-01-01 16:31:00 Standard         4
5 2024-01-12 00:00:00 NA                  NA                  Work_from_h…     5
6 2024-01-15 00:00:00 0000-01-01 08:45:00 0000-01-01 16:51:00 Standard         6
```
Tidy data
With those basic steps complete, we now begin moving towards getting the data into a “tidy” format. Tidy data is a standard way of structuring data that makes it easier to analyse, visualise, and model. The key principle of tidy data is that each variable forms a column, each observation forms a row, and each type of observational unit forms a table.[3]
For our use case, tidying the data will mean dealing with the event column by extracting the different possible events into their own columns. We start by splitting the fields in the event column for cases where there were multiple events in one day. The day with ID 24 shown below is an example of such a day.
```r
# Separate multi-event entries into one row per event
data.tidy <- data %>%
  separate_rows(event, sep = ",_")

data.tidy %>% filter(id == 24) %>% head
```
Next we create binary indicator columns for each type of event. This moves us closer to tidy data, with one column per event that takes the value true or false depending on whether that event occurred on that day.
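```r
# Create indicator columns for event options
data.tidy <- data.tidy %>%
  mutate(
    event_Work_from_home = if_else(event == "Work_from_home", TRUE, FALSE),
    event_Post_Work_Commitment = if_else(event == "Post_Work_Commitment", TRUE, FALSE),
    event_Pre_Work_Commitment = if_else(event == "Pre_Work_Commitment", TRUE, FALSE),
    event_16_00_lecture = if_else(event == "16_00_lecture", TRUE, FALSE),
    event_14_00_lecture = if_else(event == "14_00_lecture", TRUE, FALSE),
    event_11_00_lecture = if_else(event == "11_00_lecture", TRUE, FALSE),
    event_Annual_leave = if_else(event == "Annual_leave", TRUE, FALSE),
    event_Sick_leave = if_else(event == "Sick_leave", TRUE, FALSE),
    event_Study_leave = if_else(event == "Study_leave", TRUE, FALSE),
    event_Public_Holiday = if_else(event == "Public_Holiday", TRUE, FALSE),
    event_Conference = if_else(event == "Conference", TRUE, FALSE)
  )

data.tidy %>% select(starts_with("event_")) %>% head
```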
We replace any NAs in the indicator columns with 0 because those values mean the same thing (that event did not occur on that day) and the data should clearly reflect that.
```r
# Replace NAs with 0s in indicator columns
data.tidy <- data.tidy %>%
  mutate(across(starts_with("event_"), ~ replace(., is.na(.), 0)))

data.tidy %>% select(starts_with("event_")) %>% head
```
We then collapse the data back into the original number of rows, in keeping with the requirements of tidy data. We do this by grouping by the ID and applying the max aggregate function to the event columns, which means we lose the other columns. We remedy this by then performing a full join with the original data.
```r
# Collapse into original number of rows
data.collapsed <- data.tidy %>%
  group_by(id) %>%
  summarise(
    event_Work_from_home = max(event_Work_from_home),
    event_Post_Work_Commitment = max(event_Post_Work_Commitment),
    event_Pre_Work_Commitment = max(event_Pre_Work_Commitment),
    event_16_00_lecture = max(event_16_00_lecture),
    event_14_00_lecture = max(event_14_00_lecture),
    event_11_00_lecture = max(event_11_00_lecture),
    event_Annual_leave = max(event_Annual_leave),
    event_Sick_leave = max(event_Sick_leave),
    event_Study_leave = max(event_Study_leave),
    event_Public_Holiday = max(event_Public_Holiday),
    event_Conference = max(event_Conference),
    .groups = "drop"
  )

data.collapsed %>% str
```
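```r
# Combine dataframes
data.combined <- full_join(data.collapsed, data, by = c("id"))

data.combined %>% str
```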
Compile summaries of data
With the data now in a tidy format, we are ready to proceed to plotting and analysing. For the first analysis, we will look at clock in and out times per event. To do this, we need to pivot the data longer so that there is one clock in/out event per row, which makes the analysis easier.
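```r
# Pivot longer
data.long <- data.combined %>%
  gather(key = "event.type", value = "time", clock.in, clock.out) %>%
  mutate(
    event.type = ifelse(event.type == "clock.in", "Clock In", "Clock Out"),
    event = str_split(event, ",_")
  ) %>%
  unnest(event)

data.long %>% str
```
Per event summaries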
```r
# Get per event summaries
time.summaries <- data.long %>%
  group_by(event, event.type) %>%
  summarise(
    num = n(),
    mean.time = format(mean(time, na.rm = T), "%H:%M"),
    .groups = "keep"
  )

datatable(time.summaries)
```
From this we can see that the times largely correspond with the event as expected. We will move on to look at the overall clock in/out time stats.
Overall time stats
To do this, we could have used the built-in base R summary function; however, when working with time fields, it formats them in a less than ideal way, as alluded to earlier. To remedy this, we built a custom summary function that calculates the statistics, then tabulates and formats them before returning the table.
```r
time.summary <- function(time.field) {
  # Ensure the input is in POSIXct format
  time.field <- as.POSIXct(time.field, format = "%H:%M")

  # Sort times
  sorted.times <- sort(time.field)

  # Calculate minimum and maximum
  min.time <- min(time.field, na.rm = T)
  max.time <- max(time.field, na.rm = T)

  # Calculate median
  median.time <- median(time.field, na.rm = T)

  # Calculate mode
  mode.time <- as.POSIXct(names(sort(table(sorted.times), decreasing = T)[1]),
                          origin = "0000-01-01", tz = "UTC")

  # Calculate mean
  mean.time <- mean(time.field, na.rm = T)

  # Compile output
  formatted.summary <- tibble(
    statistic = c("Min", "Max", "Median", "Mode", "Mean"),
    time = c(format(min.time, "%H:%M"),
             format(max.time, "%H:%M"),
             format(median.time, "%H:%M"),
             format(mode.time, "%H:%M"),
             format(mean.time, "%H:%M"))
  )

  return(formatted.summary)
}
```
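We apply this function to the clock in and clock out times in turn.

```r
# Clock in summary
clock.in.summary <- time.summary(data$clock.in)
datatable(clock.in.summary)

# Clock out summary
clock.out.summary <- time.summary(data$clock.out)
datatable(clock.out.summary)
```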
The clock in/out summaries surface some useful insights. Using the mean clock in/out times, we can then determine how much time I spend in the office on average.
```r
# Mean time spent at office
mean.clock.in <- parse_date_time(clock.in.summary$time[5], orders = "%H:%M")
mean.clock.out <- parse_date_time(clock.out.summary$time[5], orders = "%H:%M")

mean.office.time <- mean.clock.out - mean.clock.in
mean.office.time
```
```
Time difference of 6.133333 hours
```
Plot some graphs
Finally, we can visualise the data. We start with a scatter plot that shows the time spent in office throughout the year. The mean time per event and the overall mean are also overlaid on the plot.
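```r
# Calculate time spent in office
daily.hours <- data.tidy %>%
  dplyr::group_by(date) %>%
  mutate(office.time = as.numeric(difftime(clock.out, clock.in, units = "hours"))) %>%
  select(c(clock.in, clock.out, office.time, event, date))

# Define plot colours
plot.colours <- c("Post_Work_Commitment" = "green",
                  "Pre_Work_Commitment" = "grey",
                  "16_00_lecture" = "blue",
                  "14_00_lecture" = "red",
                  "11_00_lecture" = "black",
                  "Study_leave" = "turquoise",
                  "Conference" = "purple",
                  "Standard" = "orange")

# Define nicer labels
plot.labels <- c("Work_from_home" = "Work from home",
                 "Post_Work_Commitment" = "Post-work commitment",
                 "Pre_Work_Commitment" = "Pre-work commitment",
                 "16_00_lecture" = "16:00 lecture",
                 "14_00_lecture" = "14:00 lecture",
                 "Public_Holiday" = "Public holiday",
                 "Annual_leave" = "Annual leave",
                 "11_00_lecture" = "11:00 lecture",
                 "Study_leave" = "Study leave",
                 "Conference" = "Conference",
                 "Sick_leave" = "Sick leave",
                 "Standard" = "Standard")

# Plot graph
office.time.plot <- ggplot(daily.hours, aes(x = date, y = office.time, colour = event)) +
  geom_point(size = 3, alpha = 0.7, na.rm = T) +
  geom_hline(yintercept = mean.office.time, linetype = "dashed", linewidth = 1) +
  scale_color_manual(values = plot.colours, labels = plot.labels) +
  labs(title = "Time Spent in Office for 2024",
       x = "Date",
       y = "Number of Hours",
       color = "Legend") +
  theme_minimal() +
  theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5))
```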
```r
# Get per event mean office time
event.time <- daily.hours %>%
  dplyr::group_by(event) %>%
  summarise(mean.hours = mean(office.time, na.rm = T))

# Add mean office time per event to plot
office.time.plot +
  geom_hline(data = event.time,
             aes(yintercept = mean.hours, colour = event),
             linetype = "dashed",
             alpha = 0.7,
             na.rm = T)
```
Next we plot a line graph that shows the clock in/out times over the year. This visualisation is useful for seeing the day-to-day trends, but it is not as effective because of the breaks that occur on days when I didn't go into the office.
```r
# Line graph of clock in and out times
min.y.time <- floor_date(min(data.combined$clock.in, na.rm = T), unit = "hour")
max.y.time <- max(data.combined$clock.out, na.rm = T)

ggplot(data.combined, aes(x = date)) +
  geom_line(aes(y = clock.in, color = "Clock In"), linewidth = 1, na.rm = T) +
  geom_line(aes(y = clock.out, color = "Clock Out"), linewidth = 1, na.rm = T) +
  scale_color_manual(values = c("Clock In" = "blue", "Clock Out" = "red")) +
  scale_y_datetime(labels = scales::date_format("%H:%M"),
                   breaks = seq(from = min.y.time, to = max.y.time, by = "1 hour")) +
  labs(title = "Clock in and Clock Out Times for 2024",
       x = "Date",
       y = "Time",
       color = "Legend") +
  theme_minimal() +
  theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5))
```
A scatter plot works better in this scenario: it still shows the general trend but does not look as disjointed on the days without data.
```r
# Scatter plot of clock in and out times
ggplot(data.combined, aes(x = date)) +
  geom_point(aes(y = clock.in, color = "Clock In"), na.rm = T) +
  geom_point(aes(y = clock.out, color = "Clock Out"), na.rm = T) +
  scale_color_manual(values = c("Clock In" = "blue", "Clock Out" = "red")) +
  scale_y_datetime(labels = scales::date_format("%H:%M"),
                   breaks = seq(from = min.y.time, to = max.y.time, by = "1 hour")) +
  labs(title = "Clock in and Clock Out Times for 2024",
       x = "Date",
       y = "Time",
       color = "Legend") +
  theme_minimal() +
  theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5))
```
However, we can fit a lot more data onto this plot. In the next plot, we draw the same scatter plot as before; however, we differentiate between clock in and clock out points using shape instead of colour. This frees us up to use colour for the different types of event. We also then add the mean clock in/out times as horizontal lines.
We use geom_jitter to disperse points so it is visible when there are multiple events on one day. We also use transparency to help show when this is the case.
```r
# Plot more detailed plot
set.seed(2025)  # Set seed for jitter

clock.in.out.detail <- ggplot(data.long,
                              aes(x = date, y = time,
                                  colour = event, fill = event, shape = event.type)) +
  geom_jitter(size = 3, width = 250, height = 250, alpha = 0.7, na.rm = T) +
  scale_color_manual(values = plot.colours, labels = plot.labels) +
  scale_fill_manual(values = plot.colours, labels = plot.labels) +
  scale_shape_manual(values = c("Clock In" = 21, "Clock Out" = 25)) +
  scale_y_datetime(labels = scales::date_format("%H:%M"),
                   breaks = seq(from = min.y.time, to = max.y.time, by = "1 hour")) +
  labs(title = "Clock in and Clock Out Times for 2024",
       x = "Date",
       y = "Time",
       color = "Event",
       fill = "Event",
       shape = "Event Type") +
  theme_minimal() +
  theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5))
```
```r
# Add mean times to plot
clock.in.out.detail.static <- clock.in.out.detail +
  geom_hline(yintercept = mean.clock.in, color = "blue", linetype = "dashed") +
  geom_hline(yintercept = mean.clock.out, color = "red", linetype = "dashed") +
  geom_text(aes(x = as.POSIXct("2024-01-01"), y = mean.clock.in,
                label = paste(format(mean.clock.in, "%H:%M"))),
            color = "blue", vjust = -0.5, hjust = 1.2) +
  geom_text(aes(x = as.POSIXct("2024-01-01"), y = mean.clock.out,
                label = paste(format(mean.clock.out, "%H:%M"))),
            color = "red", vjust = -0.5, hjust = 1.2)

clock.in.out.detail.static
```
Finally, an interactive version of the above plot using Plotly.
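```r
ggplotly(clock.in.out.detail, tooltip = c("x", "y", "colour", "shape"))
```

Improvements and future work
This has been an insightful project to work on, but there are several ideas for improvement as I resume collecting data for 2025.
- I usually begin working from home around 08:00 and only leave for the office later, partly to reduce commute time. This could be integrated into the analysis.
- Incorporating the times I leave home and arrive at the office (and vice versa) would give insights into time spent commuting.
- There are several improvements to the data collection process, which will be detailed in the blog post to follow.
- I also invite you to suggest improvements and things I could have done differently.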
1. Müller, K., & Wickham, H. tibble: Simple data frames [R package]. https://tibble.tidyverse.org/
2. R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
3. Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23.