Data Quality Problems In Process Mining And What To Do About Them — Part 3: Zero Timestamps
This week, we are moving to the timestamp problems. Timestamps are really the Achilles heel of data quality in process mining. Everything is based on the timestamps: Not just the performance measurements but also the process flows and variant sequences themselves. So, over the next weeks we will look at the most typical timestamp-related issues.
Zero timestamps (or future timestamps)
One data problem that you will most certainly encounter at some point in time are so-called zero timestamps, or other kind of default timestamps that are given by the system. Often, zero timestamps were initially set as an empty value by the programmer of the information system. They can either be a mistake or indicate that the real timestamp has not yet been provided (for example, because an expected process step has not happened yet). Another reason can be typos in manually entered data.
These Zero timestamps typically take the form of 1 January 1900, the Unix epoch timestamp 1 January 1970, or some future timestamp (like 2100).
To find out whether you have Zero timestamps in your data, you can best go to the Overview statistics and take a look at the earliest and the latest timestamps in the data set. For example, in the screenshot below we can see that there is at least one 1900 timestamp in the imported data (click on the screenshot to see a larger version).
You should know what timeframe you are expecting for your data set and then verify that the earliest and latest timestamp confirm the expected time period. Be aware that if you do not address a problem like the 1900 timestamp in the picture above, you may end up with case durations of more than 100 years!
How to fix: You can remove Zero timestamps using the Timeframe filter in Disco (see instructions below).
You may also want to communicate your findings back to the system administrator to find out how these Zero timestamps can be avoided in the future.
To understand the impact of the Zero timestamps, you first need to investigate in more detail what is going on.
You want to find out whether just a few cases are affected by the Zero timestamps, or whether this is a wide-spread problem. For example, if Zero timestamps are recorded in the system for all activities that have not happened yet, you will see them in all open cases.
To investigate the cases that have Zero timestamps, add a Timeframe filter and use the ‘Intersecting timeframe’ mode while focusing on the problematic time period. This will keep all those cases that contain at least one Zero timestamp. Then use the ‘Copy and filter’ button to create a new data set focusing on the Zero timestamp cases (see screenshot below).
As a result, you will see just the cases that have Zero timestamps in them. You can see how many there are. Furthermore, you can inspect a few example cases to see whether the problem is always in the same place or whether multiple activities are affected. In our example, just two cases contain Zero timestamps (see below).
Now, let’s move on to fix the Zero timestamp problem in the data set.
Then: Remove cases or Zero timestamps only
Depending on whether Zero timestamps are a wide-spread problem or not you can take two different actions:
If only a few cases are affected, you can best remove these cases altogether. This way, they will not disturb your analysis. At the same time you will not be left with partial cases that miss some activities because of data issues.
If many cases are affected, like in the situation that Zero timestamps were recorded for activities that have not happened yet, you can better remove just the events that have Zero timestamps and keep the rest of these cases for your analysis.
In our example, just two cases are affected and we will remove these cases altogether. To do this, add a Timeframe filter and choose the ‘Contained in timeframe’ option while focusing your selection on the expected timeframe. This will remove all cases that have any events outside the chosen timeframe (see screenshot below).
If you just want to remove the activities that have Zero timestamps, choose the ‘Trim to timeframe’ option instead. This will “cut off” all events outside of the chosen timeframe and keep the rest of these cases in your data (see below)
Note that if your Zero timestamps indicate that certain activities have not happened yet, it would be better to keep the timestamp cells in the source data empty, rather than filling in a 1900 or 1970 timestamp value (see example below).
Events with empty timestamps will not be imported in Disco, because they cannot be placed in the sequence of activities for the case. So, keeping the timestamp cell empty for activities that have not occurred yet will save you this extra clean-up step in the future.
Finally: Make a clean copy
Once you have cleaned up the Zero timestamps from your data, you can best make a new copy using the ‘Apply filters permanently’ option to get a fresh start (see screenshot below). The result will be a new (cleaned) data set, which can now serve as the starting point for your analysis.
That’s it! You have successfully removed your Zero timestamps and any new filters that you add from now an will be based on your cleaned data.