Data Quality Problems In Process Mining And What To Do About Them — Part 13: Missing Complete Timestamps for Ongoing Activities
This is the 13th article in our series on data quality problems for process mining. You can find an overview of all articles in the series here.
If you have ‘start’ and ‘complete’ timestamps in your data set, you can sometimes encounter situations where the ‘complete’ timestamp is missing for activities that are still running.
For example, take a look at the data snippet below. Two process steps were performed for case ID 1938. The second activity that was recorded for this case is ‘Analyze Purchase Requisition’. It has a ‘start’ timestamp but the ‘complete’ timestamp is empty, because the activity has not yet completed (it is ongoing).
In principle, this is not a problem. After importing the data set, you can simply analyze the process map, the variants, etc., as you would usually do. When you look at a concrete case, the duration of each activity that has not completed yet is shown as “instant” (see the history for case ID 1938 in the screenshot below).
However, this does become a problem when you analyze the activity duration statistics (see screenshot below). The “instant” durations of the ongoing activities are included in the calculation and pull down the mean and the median duration of the activity. So, you want to exclude the activities that are still ongoing from the calculation of the activity duration statistics.
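To see how much the “instant” entries can distort the statistics, here is a minimal Python sketch. The durations are made-up illustration values (in minutes), not taken from the data set above, and the sketch assumes that an “instant” activity counts as a duration of zero.

```python
from statistics import mean, median

# Hypothetical durations in minutes for one activity.
# None marks an ongoing activity whose 'complete' timestamp is missing.
durations = [30, 45, 60, None, None]

# Assumption: an ongoing activity shown as "instant" counts as 0 minutes.
with_instant = [d if d is not None else 0 for d in durations]

# Excluding ongoing activities keeps only the real measurements.
completed_only = [d for d in durations if d is not None]

skewed_mean = mean(with_instant)        # 27
skewed_median = median(with_instant)    # 30
clean_mean = mean(completed_only)       # 45
clean_median = median(completed_only)   # 45

print(f"With instant entries: mean={skewed_mean}, median={skewed_median}")
print(f"Completed only:       mean={clean_mean}, median={clean_median}")
```

In this toy example, two ongoing activities out of five are enough to lower the mean from 45 to 27 minutes and the median from 45 to 30 minutes.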
How to fix:
Import your data set again and configure only the complete timestamp as a ‘Timestamp’ column (keep the start timestamp column as an attribute via the ‘Other’ configuration). This will remove all events where the complete timestamp is missing.
Your activity duration statistics will now only be based on those activities that actually have both a start and a complete timestamp.
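If you would rather check outside the tool which events are affected, the same filtering idea can be sketched in Python. This is an illustration only, not part of the fix described above: the column names, the sample rows, and the helper `drop_ongoing` are hypothetical and need to be adapted to your own data set.

```python
import csv
import io

def drop_ongoing(rows, complete_col="Complete Timestamp"):
    """Keep only rows whose complete timestamp is non-empty."""
    return [row for row in rows if (row.get(complete_col) or "").strip()]

# Hypothetical sample data; the first activity name and all timestamps
# are placeholders for illustration.
sample = io.StringIO(
    "Case ID,Activity,Start Timestamp,Complete Timestamp\n"
    "1938,First Step (placeholder),2011-01-03 09:00,2011-01-03 09:20\n"
    "1938,Analyze Purchase Requisition,2011-01-03 10:00,\n"
)

rows = list(csv.DictReader(sample))
completed = drop_ongoing(rows)

print(f"{len(rows) - len(completed)} ongoing event(s) excluded")
for row in completed:
    print(row["Case ID"], row["Activity"])
```

Rows with an empty complete timestamp are dropped, and all other rows keep both of their timestamps.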