Data Quality Problems in Process Mining and What To Do About Them — Part 10: Missing Timestamps For Activity Repetitions
This is the tenth article in our series on data quality problems for process mining. You can find an overview of all articles in the series here.
Last week, we looked at missing activities and missing timestamps. Today, we will discuss another common data quality problem that most of you will probably encounter at some point.
Take a look at the following data snippet. In this data set, you can see three cases (Case ID 1, 2, and 3). If you compare this data set with a typical process mining data set, you can see the following differences:
There is just one row per case (see case 1 highlighted). Normally, you would have multiple rows: one row for each event in the case. In addition, the activities are in columns (here, activities A, B, C, D, and E), with the dates or timestamps recorded in the cell content.
When you encounter such a data set, you will have to re-format it into the process mining format in the following way (see screenshot below):
Add a row for each activity (again, case 1 is highlighted). Create an activity column and a timestamp column to capture the name and the time of each activity.
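The re-formatting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names ("Case ID", "A" to "E") follow the description in this article, but the dates, the date format, and the helper name `to_event_log` are made up for the example, and empty cells are assumed to mean that no timestamp was recorded for that activity.

```python
from datetime import datetime

# Hypothetical column-based data: one row per case, one column per activity.
# The dates are invented for illustration; an empty string means "no timestamp".
wide_rows = [
    {"Case ID": "1", "A": "2016-01-04", "B": "2016-01-05", "C": "2016-01-07", "D": "2016-01-08", "E": "2016-01-11"},
    {"Case ID": "2", "A": "2016-01-05", "B": "2016-01-06", "C": "2016-01-08", "D": "", "E": "2016-01-12"},
    {"Case ID": "3", "A": "2016-01-06", "B": "", "C": "2016-01-09", "D": "", "E": "2016-01-13"},
]

ACTIVITY_COLUMNS = ["A", "B", "C", "D", "E"]


def to_event_log(rows, case_column="Case ID", activity_columns=ACTIVITY_COLUMNS):
    """Turn one-row-per-case data into one-row-per-event data."""
    events = []
    for row in rows:
        for activity in activity_columns:
            value = row.get(activity, "")
            if not value:
                # No timestamp in this cell, so no event row is created.
                continue
            events.append({
                "Case ID": row[case_column],
                "Activity": activity,  # the column header becomes the activity name
                "Timestamp": datetime.strptime(value, "%Y-%m-%d"),
            })
    # Order the events by case and then by time.
    events.sort(key=lambda e: (e["Case ID"], e["Timestamp"]))
    return events


event_log = to_event_log(wide_rows)
for event in event_log:
    print(event["Case ID"], event["Activity"], event["Timestamp"].date())
```

Each non-empty cell becomes one event row with a case ID, an activity name, and a timestamp. Note that the sketch only changes the layout of the data: it can produce at most one event per activity and case, because that is all the column-based format contains.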
However, the important thing to realize here is that this is not purely a formatting problem. The column-based format is not suitable to…