Data Quality Problems in Process Mining and What To Do About Them — Part 8: Different Clocks
This is the eighth article in our series on data quality problems for process mining. You can find an overview of all articles in the series here.
In previous articles we have seen how wrong timestamps can mess up everything in process mining: The process flows, the variants, and time measurements like case durations and waiting times in the process map.
One particularly tricky reason for timestamp errors is that the timestamps in your data set may have been recorded by multiple computers that run on different clocks. For example, in this case study at a security services company operators logged their actions when they arrived on-site, identified the problem, etc. on their hand-held devices. These mobile devices sometimes had different local times from the server as well as from each other.
If you look at the scenario below you can see why that is a problem: Let’s say a new incident is reported at the headquarters at 1:30 PM. Five minutes later, a mobile operator responds to the request and indicates that they will go to the location to fix it. However, because the clock on their mobile device is running 10 minutes late, the recorded timestamp indicates 1:25 PM.
When you then combine all the different timestamps in your data set to perform a process mining analysis, you will actually see the response of the operator show up before the initial incident report. Not only does this create incorrect flows in your process map and variants, but when you try to measure the time between the raising of the incident and the first response it will actually give you a negative time.
So, what can you do when you have data that has this problem?
First, investigate the problem to see whether the clock drift is consistent over time and which activities are affected. Then, you have the following options.
How to fix:
If the clock difference is consistent enough you can correct it in your source data. For example, in the scenario above you could add 10 minutes to the timestamps from the local operator.
If an overall correction is not possible, you can try to clean your data by removing cases that show up in the wrong order. Note that the Follower filter in Disco also allows you to remove cases, where more or less than a specified amount of time has passed between two activities. This way, you can separate minor clock drift glitches (typically the differences are just a few seconds) from cases where two activities were indeed recorded with a significant time difference. Make sure that the remaining data set is still representative after the cleaning.
If nothing helps, you might have to go back to your data collection system and set up a clock synchronization mechanism to constantly measure the time differences between the networked devices and get the correct timestamps while recording the data along the way.