Towards the end of 2023, I applied for and was accepted into the MSc Data Science program at the University of Cape Town. This new journey inspired me to embark on a little side project of my own: a task that would involve all stages of the data analysis pipeline, from data collection to data cleaning to visualising the data and, finally, using it to tell a story. I also envisioned it as an opportunity to put into practice what I would learn throughout the year as part of my MSc.
I chose the focus of my project to be analysing the times I clocked in and out at the office. I figured this would be interesting to analyse, especially considering that as part of my MSc I would have lectures that would sometimes eat into my usual working hours. That, coupled with other commitments that come up in life, sounded like it would make for an interesting analysis and that’s how this project was born.
This notebook demonstrates how I went about cleaning, analysing and plotting the data. A more detailed post will follow on my blog, delving into the data collection process and taking a more narrative, storytelling approach.
Read in data
The data come from a Notion database, exported as a CSV, and contain fields for the date of interest, the times I clocked in and out on that date, and notes detailing any events of interest that occurred on that date, e.g. a lecture, a pre-work commitment, a conference, et cetera.
We begin by reading in the data while preventing automatic casting, as this leads to the time fields being cast to a difftime object, which isn't conducive to our analysis. We therefore use the col_types argument to retain all data as character vectors. We also convert the data to a tibble for the nice qualities that tibbles provide.[1]
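```r
# Read data
data <- read_csv("data/data.csv", col_types = "cccc") %>%
  as_tibble()

# View data
data %>% head

# Structure of the data
str(data)

# Unique event options
unique(data$Notes)
```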
The data contain 250 entries, one per day, and all fields are stored as character vectors. The notes field is a multi-select field whose values are a comma-separated character vector of any number of the following:
Table of event descriptions

| Event | Description |
|----------------------|-----------------------------------------------------------|
| **Events where I am in office** | |
| 11:00 lecture | A day where I had an 11:00 lecture |
| 14:00 lecture | A day where I had a 14:00 lecture |
| 16:00 lecture | A day where I had a 16:00 lecture |
| Study leave | A day or part of a day where I was on study leave |
| Pre-work commitment | A day where I had some commitment to tend to before going into the office |
| Post-work commitment | A day where I had some commitment to tend to after going into the office |
| Standard | A day where there were none of the above events |
| **Events where I am not in office** | |
| Work from home | A day where I worked from home |
| Annual leave | A day where I was on annual leave |
| Sick leave | A day where I was on sick leave |
| Public holiday | A day that was a public holiday |
| Conference | A day or part of a day where I was away at a conference |
Prepare data for analysis
In order to prepare the data for analysis, a few cleaning steps are necessary. We begin by renaming the columns to make them easier to work with and more descriptive of the data they contain.
```r
# Rename columns
data <- data %>%
  rename(
    date = Date,
    clock.in = "Clock In Time",
    clock.out = "Clock Out Time",
    event = Notes
  )

data %>% head
```
```
# A tibble: 6 × 4
  date                       clock.in clock.out event
  <chr>                      <chr>    <chr>     <chr>
1 Monday, 8 January 2024     09:02    16:49     <NA>
2 Tuesday, 9 January 2024    09:22    16:36     <NA>
3 Wednesday, 10 January 2024 09:21    14:25     <NA>
4 Thursday, 11 January 2024  09:19    16:31     <NA>
5 Friday, 12 January 2024    <NA>     <NA>      Work from home
6 Monday, 15 January 2024    08:45    16:51     <NA>
```
We also add an ID column to make identifying rows easier.
```r
# Add id column
data <- data %>%
  mutate(id = as.numeric(rownames(data)))

data %>% head
```
```
# A tibble: 6 × 5
  date                       clock.in clock.out event             id
  <chr>                      <chr>    <chr>     <chr>          <dbl>
1 Monday, 8 January 2024     09:02    16:49     <NA>               1
2 Tuesday, 9 January 2024    09:22    16:36     <NA>               2
3 Wednesday, 10 January 2024 09:21    14:25     <NA>               3
4 Thursday, 11 January 2024  09:19    16:31     <NA>               4
5 Friday, 12 January 2024    <NA>     <NA>      Work from home     5
6 Monday, 15 January 2024    08:45    16:51     <NA>               6
```
We then manually cast fields to appropriate data types. Here we take advantage of Lubridate's excellent datetime casting capabilities, using the parse_date_time function to cast both the date and the clock in/out time fields, which demonstrates the versatility of this function and the library. This casts the fields to the POSIXct class, which allows us to perform various datetime operations on them. However, here we run into a characteristic of R that's less than ideal.
When working with dates and times in R, the aforementioned POSIXct class is used, and this class mandates having both a date portion and a time portion. This means that date fields are required to have some arbitrary time portion and, similarly, time fields are required to have some arbitrary date portion. This is why you'll often see time fields beginning with 1970-01-01, the Unix epoch (a reference point for measuring time in seconds).[2] (Lubridate, as we'll see below, uses 0000-01-01 instead.) This ultimately has no effect on functionality but is a nuisance when it comes to display, as it requires reformatting to exclude the unwanted portion.
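As a quick illustration of this quirk, here is a minimal snippet (using Lubridate, as we do below):

```r
library(lubridate)

# Parsing a bare time yields a full datetime with a placeholder date portion
t <- parse_date_time("09:02", orders = "%H:%M")
t                   # "0000-01-01 09:02:00 UTC"

# Reformatting drops the unwanted date portion for display
format(t, "%H:%M")  # "09:02"
```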
```r
# Cast date, clock.in and clock.out using Lubridate
data <- data %>%
  mutate(
    date = parse_date_time(date, orders = "dmy"),
    clock.in = parse_date_time(clock.in, orders = "%H:%M"),
    clock.out = parse_date_time(clock.out, orders = "%H:%M"),
    # Convert event names to programmatically friendly strings
    event = gsub(" ", "_", event),
    event = gsub(":", "_", event),
    event = gsub("-", "_", event),
    # Replace NA events
    event = replace(event, is.na(event), "Standard")
  )

data %>% head
```
```
# A tibble: 6 × 5
  date                clock.in            clock.out           event           id
  <dttm>              <dttm>              <dttm>              <chr>        <dbl>
1 2024-01-08 00:00:00 0000-01-01 09:02:00 0000-01-01 16:49:00 Standard         1
2 2024-01-09 00:00:00 0000-01-01 09:22:00 0000-01-01 16:36:00 Standard         2
3 2024-01-10 00:00:00 0000-01-01 09:21:00 0000-01-01 14:25:00 Standard         3
4 2024-01-11 00:00:00 0000-01-01 09:19:00 0000-01-01 16:31:00 Standard         4
5 2024-01-12 00:00:00 NA                  NA                  Work_from_h…     5
6 2024-01-15 00:00:00 0000-01-01 08:45:00 0000-01-01 16:51:00 Standard         6
```
Tidy data
With those basic steps complete, we now begin moving towards getting the data into a “tidy” format. Tidy data is a standard way of structuring data that makes it easier to analyse, visualise, and model. The key principle of tidy data is that each variable forms a column, each observation forms a row, and each type of observational unit forms a table.[3]
For our use case, tidying the data will mean dealing with the event column by extracting the different possible events into their own columns. We start by splitting the fields in the event column for cases where there were multiple events in one day. The day with ID 24 shown below is an example of such a day.
```r
# Separate multi-event entries into one row per event
data.tidy <- data %>%
  separate_rows(event, sep = ",_")

data.tidy %>% filter(id == 24) %>% head
```
Next we create binary indicator columns for each type of event. This moves us closer to tidy data, with one column per event that takes the value true or false depending on whether that event occurred on that day.
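```r
# Create indicator columns for event options
data.tidy <- data.tidy %>%
  mutate(
    event_Work_from_home = if_else(event == "Work_from_home", TRUE, FALSE),
    event_Post_Work_Commitment = if_else(event == "Post_Work_Commitment", TRUE, FALSE),
    event_Pre_Work_Commitment = if_else(event == "Pre_Work_Commitment", TRUE, FALSE),
    event_16_00_lecture = if_else(event == "16_00_lecture", TRUE, FALSE),
    event_14_00_lecture = if_else(event == "14_00_lecture", TRUE, FALSE),
    event_11_00_lecture = if_else(event == "11_00_lecture", TRUE, FALSE),
    event_Annual_leave = if_else(event == "Annual_leave", TRUE, FALSE),
    event_Sick_leave = if_else(event == "Sick_leave", TRUE, FALSE),
    event_Study_leave = if_else(event == "Study_leave", TRUE, FALSE),
    event_Public_Holiday = if_else(event == "Public_Holiday", TRUE, FALSE),
    event_Conference = if_else(event == "Conference", TRUE, FALSE)
  )

data.tidy %>% select(starts_with("event_")) %>% head
```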
We replace any NAs in the indicator columns with 0 because those values mean the same thing (that event did not occur on that day) and the data should clearly reflect that.
```r
# Replace NAs with 0s in indicator columns
data.tidy <- data.tidy %>%
  mutate(across(starts_with("event_"), ~ replace(., is.na(.), 0)))

data.tidy %>% select(starts_with("event_")) %>% head
```
We then collapse the data back into the original number of rows, in keeping with the requirements of tidy data. We do this by grouping by the ID and applying the max aggregate function to the event columns, which means we lose the other columns. We remedy this by then performing a full join with the original data.
```r
# Collapse into original number of rows
data.collapsed <- data.tidy %>%
  group_by(id) %>%
  summarise(
    event_Work_from_home = max(event_Work_from_home),
    event_Post_Work_Commitment = max(event_Post_Work_Commitment),
    event_Pre_Work_Commitment = max(event_Pre_Work_Commitment),
    event_16_00_lecture = max(event_16_00_lecture),
    event_14_00_lecture = max(event_14_00_lecture),
    event_11_00_lecture = max(event_11_00_lecture),
    event_Annual_leave = max(event_Annual_leave),
    event_Sick_leave = max(event_Sick_leave),
    event_Study_leave = max(event_Study_leave),
    event_Public_Holiday = max(event_Public_Holiday),
    event_Conference = max(event_Conference),
    .groups = "drop"
  )

data.collapsed %>% str
```
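```r
# Combine dataframes
data.combined <- full_join(data.collapsed, data, by = c("id"))

data.combined %>% str
```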
Compile summaries of data
With the data now in a tidy format, we are ready to proceed to plotting and analysing. For the first analysis, we will look at clock in and out times per event. To do this, we need to pivot the data longer so that there is one clock in/out event per row, which makes the analysis easier.
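```r
# Pivot longer
data.long <- data.combined %>%
  gather(key = "event.type", value = "time", clock.in, clock.out) %>%
  mutate(
    event.type = ifelse(event.type == "clock.in", "Clock In", "Clock Out"),
    event = str_split(event, ",_")
  ) %>%
  unnest(event)

data.long %>% str
```
Per event summaries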
```r
# Get per event summaries
time.summaries <- data.long %>%
  group_by(event, event.type) %>%
  summarise(
    num = n(),
    mean.time = format(mean(time, na.rm = T), "%H:%M"),
    .groups = "keep"
  )

datatable(time.summaries)
```
From this we can see that the times largely correspond with the event as expected. We will move on to look at the overall clock in/out time stats.
Overall time stats
To do this, we could have used the built-in base R summary function; however, when working with time fields, it formats them in a less than ideal way, as alluded to earlier. To remedy this, we built a custom summary function that calculates the statistics, then tabulates and formats them before returning the table.
```r
time.summary <- function(time.field) {
  # Ensure the input is in POSIXct format
  time.field <- as.POSIXct(time.field, format = "%H:%M")

  # Sort times
  sorted.times <- sort(time.field)

  # Calculate minimum and maximum
  min.time <- min(time.field, na.rm = T)
  max.time <- max(time.field, na.rm = T)

  # Calculate median
  median.time <- median(time.field, na.rm = T)

  # Calculate mode
  mode.time <- as.POSIXct(names(sort(table(sorted.times), decreasing = T)[1]),
                          origin = "0000-01-01", tz = "UTC")

  # Calculate mean
  mean.time <- mean(time.field, na.rm = T)

  # Compile output
  formatted.summary <- tibble(
    statistic = c("Min", "Max", "Median", "Mode", "Mean"),
    time = c(format(min.time, "%H:%M"),
             format(max.time, "%H:%M"),
             format(median.time, "%H:%M"),
             format(mode.time, "%H:%M"),
             format(mean.time, "%H:%M"))
  )

  return(formatted.summary)
}
```
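We apply this function to the clock in and clock out times in turn.

```r
# Clock in summary
clock.in.summary <- time.summary(data$clock.in)
datatable(clock.in.summary)

# Clock out summary
clock.out.summary <- time.summary(data$clock.out)
datatable(clock.out.summary)
```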
The clock in/out summaries surface some useful insights. Using the mean clock in/out times, we can then determine how much time I spend in the office on average.
```r
# Mean time spent at office
mean.clock.in <- parse_date_time(clock.in.summary$time[5], orders = "%H:%M")
mean.clock.out <- parse_date_time(clock.out.summary$time[5], orders = "%H:%M")

mean.office.time <- mean.clock.out - mean.clock.in
mean.office.time
```
```
Time difference of 6.133333 hours
```
Plot some graphs
Finally, we can visualise the data. We start with a scatter plot that shows the time spent in office throughout the year. The mean time per event and the overall mean are also overlaid on the plot.
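```r
# Calculate time spent in office
daily.hours <- data.tidy %>%
  dplyr::group_by(date) %>%
  mutate(office.time = as.numeric(difftime(clock.out, clock.in, units = "hours"))) %>%
  select(c(clock.in, clock.out, office.time, event, date))

# Define plot colours
plot.colours <- c("Post_Work_Commitment" = "green",
                  "Pre_Work_Commitment" = "grey",
                  "16_00_lecture" = "blue",
                  "14_00_lecture" = "red",
                  "11_00_lecture" = "black",
                  "Study_leave" = "turquoise",
                  "Conference" = "purple",
                  "Standard" = "orange")

# Define nicer labels
plot.labels <- c("Work_from_home" = "Work from home",
                 "Post_Work_Commitment" = "Post-work commitment",
                 "Pre_Work_Commitment" = "Pre-work commitment",
                 "16_00_lecture" = "16:00 lecture",
                 "14_00_lecture" = "14:00 lecture",
                 "Public_Holiday" = "Public holiday",
                 "Annual_leave" = "Annual leave",
                 "11_00_lecture" = "11:00 lecture",
                 "Study_leave" = "Study leave",
                 "Conference" = "Conference",
                 "Sick_leave" = "Sick leave",
                 "Standard" = "Standard")

# Plot graph
office.time.plot <- ggplot(daily.hours, aes(x = date, y = office.time, colour = event)) +
  geom_point(size = 3, alpha = 0.7, na.rm = T) +
  geom_hline(yintercept = mean.office.time, linetype = "dashed", linewidth = 1) +
  scale_color_manual(values = plot.colours, labels = plot.labels) +
  labs(title = "Time Spent in Office for 2024",
       x = "Date",
       y = "Number of Hours",
       color = "Legend") +
  theme_minimal() +
  theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5))
```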
```r
# Get per event mean office time
event.time <- daily.hours %>%
  dplyr::group_by(event) %>%
  summarise(mean.hours = mean(office.time, na.rm = T))

# Add mean office time per event to plot
office.time.plot +
  geom_hline(data = event.time,
             aes(yintercept = mean.hours, colour = event),
             linetype = "dashed",
             alpha = 0.7,
             na.rm = T)
```
Next we plot a line graph that shows the clock in/out times over the year. This visualisation is useful for seeing the day-to-day trends, but it is not as effective because of the breaks that occur on days when I didn't go into the office.
```r
# Line graph of clock in and out times
min.y.time <- floor_date(min(data.combined$clock.in, na.rm = T), unit = "hour")
max.y.time <- max(data.combined$clock.out, na.rm = T)

ggplot(data.combined, aes(x = date)) +
  geom_line(aes(y = clock.in, color = "Clock In"), linewidth = 1, na.rm = T) +
  geom_line(aes(y = clock.out, color = "Clock Out"), linewidth = 1, na.rm = T) +
  scale_color_manual(values = c("Clock In" = "blue", "Clock Out" = "red")) +
  scale_y_datetime(labels = scales::date_format("%H:%M"),
                   breaks = seq(from = min.y.time, to = max.y.time, by = "1 hour")) +
  labs(title = "Clock in and Clock Out Times for 2024",
       x = "Date",
       y = "Time",
       color = "Legend") +
  theme_minimal() +
  theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5))
```
A scatter plot works better in this scenario: it still shows the general trend but does not look as disjointed on the days without data.
```r
# Scatter plot of clock in and out times
ggplot(data.combined, aes(x = date)) +
  geom_point(aes(y = clock.in, color = "Clock In"), na.rm = T) +
  geom_point(aes(y = clock.out, color = "Clock Out"), na.rm = T) +
  scale_color_manual(values = c("Clock In" = "blue", "Clock Out" = "red")) +
  scale_y_datetime(labels = scales::date_format("%H:%M"),
                   breaks = seq(from = min.y.time, to = max.y.time, by = "1 hour")) +
  labs(title = "Clock in and Clock Out Times for 2024",
       x = "Date",
       y = "Time",
       color = "Legend") +
  theme_minimal() +
  theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5))
```
However, we can fit a lot more data onto this plot. In the next plot, we draw the same scatter plot as before; however, we differentiate between clock in and clock out points using shape instead of colour. This frees us up to use colour for the different types of event. We also then add the mean clock in/out times as horizontal lines.
We use geom_jitter to disperse points so it is visible when there are multiple events on one day. We also use transparency to help show when this is the case.
```r
# Plot more detailed plot
set.seed(2025)  # Set seed for jitter

clock.in.out.detail <- ggplot(data.long,
                              aes(x = date, y = time,
                                  colour = event, fill = event, shape = event.type)) +
  geom_jitter(size = 3, width = 250, height = 250, alpha = 0.7, na.rm = T) +
  scale_color_manual(values = plot.colours, labels = plot.labels) +
  scale_fill_manual(values = plot.colours, labels = plot.labels) +
  scale_shape_manual(values = c("Clock In" = 21, "Clock Out" = 25)) +
  scale_y_datetime(labels = scales::date_format("%H:%M"),
                   breaks = seq(from = min.y.time, to = max.y.time, by = "1 hour")) +
  labs(title = "Clock in and Clock Out Times for 2024",
       x = "Date",
       y = "Time",
       color = "Event",
       fill = "Event",
       shape = "Event Type") +
  theme_minimal() +
  theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5))
```
```r
# Add mean times to plot
clock.in.out.detail.static <- clock.in.out.detail +
  geom_hline(yintercept = mean.clock.in, color = "blue", linetype = "dashed") +
  geom_hline(yintercept = mean.clock.out, color = "red", linetype = "dashed") +
  geom_text(aes(x = as.POSIXct("2024-01-01"), y = mean.clock.in,
                label = paste(format(mean.clock.in, "%H:%M"))),
            color = "blue", vjust = -0.5, hjust = 1.2) +
  geom_text(aes(x = as.POSIXct("2024-01-01"), y = mean.clock.out,
                label = paste(format(mean.clock.out, "%H:%M"))),
            color = "red", vjust = -0.5, hjust = 1.2)

clock.in.out.detail.static
```
Finally, an interactive version of the above plot using Plotly.
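```r
ggplotly(clock.in.out.detail, tooltip = c("x", "y", "colour", "shape"))
```

Improvements and future work
This has been an insightful project to work on, but there are several ideas for improvement as I resume collecting data for 2025.
- I usually begin working from home around 08:00 and only leave for the office later, partly to reduce commute time. This could be integrated into the analysis.
- Incorporating the times I leave home and arrive at the office (and vice versa) would give insights into time spent commuting.
- There are several improvements to the data collection process, which will be detailed in the blog post to follow.
- I also invite you to suggest improvements and things I could have done differently.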
1. Müller, K., & Wickham, H. tibble: Simple data frames [R package]. https://tibble.tidyverse.org/
2. R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
3. Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23.