
Top 5 Data Quality Problems for Process Mining

“Garbage in, garbage out” – Most of you will know this phrase. For any data analysis technique the quality of the underlying data is important. Otherwise you run the risk of drawing the wrong conclusions.

In this post, I want to go over the five biggest data problems that you might encounter in a process mining project.

1. Incorrect logging

In the process mining world most people use the term “Noise” for exceptional behavior, not for incorrect logging. This means that if a process discovery algorithm is said to be able to deal with noise, then it can abstract from infrequent behavior by only showing the main process flow. The reason is simple: It is impossible for discovery algorithms to distinguish incorrect logging from exceptional events.

What incorrect logging means is that the recorded data is wrong. The problem is that in such a situation the data does not reflect “the Truth” but instead provides wrong information about reality.

Here are two true stories of incorrect data:

In an ERP system, data entries from invoice documents had been scanned automatically. However, because of a mistake in the scanning procedure the invoice ID was interpreted as the invoice date for some of the cases. As a result, activities with a timestamp of the year 2020 appeared in the log data.
In a process improvement project in a hospital, the data showed low utilization rates. As a consequence, the hospital closed two wards but had to re-open them shortly afterwards. When consultants looked into the problem, they found out that the data was at fault: patient admissions were registered manually one day later than the patients had actually arrived.

The message here is to be careful with manually created data because it is usually less reliable than automatically registered data. If there are doubts about the trustworthiness of the data, then the data quality should be examined first before proceeding with the analysis.
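A first plausibility check can be as simple as flagging events whose timestamps fall outside the period the extract is supposed to cover. The following is only a minimal sketch: the field names (`case_id`, `activity`, `timestamp`), the sample events, and the chosen time window are all assumptions made for illustration, not taken from a real system.

```python
from datetime import datetime

def find_implausible_events(events, earliest, latest):
    """Return the events whose timestamp lies outside [earliest, latest]."""
    return [e for e in events if not (earliest <= e["timestamp"] <= latest)]

# Hypothetical sample data: the second event carries a date far outside
# the assumed extraction window, as in the scanned-invoice story above.
events = [
    {"case_id": "INV-1", "activity": "Scan invoice", "timestamp": datetime(2010, 3, 1, 9, 0)},
    {"case_id": "INV-2", "activity": "Scan invoice", "timestamp": datetime(2020, 1, 1, 0, 0)},
]

suspicious = find_implausible_events(
    events,
    earliest=datetime(2009, 1, 1),   # assumed start of the extraction window
    latest=datetime(2010, 12, 31),   # assumed end of the extraction window
)
for e in suspicious:
    print(e["case_id"], e["timestamp"])
```

A check like this only catches values that are obviously out of range. A systematic one-day delay, as in the hospital story, would pass it and has to be uncovered by talking to the people who enter the data.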

Another example is inconsistent logging due to human differences: For example, one person may hit the “completed” button in a workflow system at the beginning of a task and another person at the end. Only when you are aware of such inconsistencies can they be factored in during the analysis.

2. Insufficient logging

While incorrect logging is about wrong data, insufficient logging is about missing data. The minimum requirements for process mining are a case ID, an activity name, and a timestamp per event to reconstruct the history of each process instance.
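To see whether an extract meets these minimum requirements, one can scan it for events that lack one of the three fields. The sketch below assumes the events are available as dictionaries with the hypothetical keys `case_id`, `activity`, and `timestamp`; real extracts will use their own column names.

```python
# Assumed names for the three minimum fields; adapt to the actual extract.
REQUIRED_FIELDS = ("case_id", "activity", "timestamp")

def missing_fields(event):
    """Return the required fields that are absent or empty in one event."""
    return [f for f in REQUIRED_FIELDS if event.get(f) in (None, "")]

def incomplete_events(events):
    """Return (position, missing fields) for every event that is incomplete."""
    report = []
    for position, event in enumerate(events):
        missing = missing_fields(event)
        if missing:
            report.append((position, missing))
    return report

# Hypothetical sample data with one empty timestamp and one missing case ID.
events = [
    {"case_id": "C1", "activity": "Register", "timestamp": "2010-03-01T09:00:00"},
    {"case_id": "C1", "activity": "Approve", "timestamp": ""},
    {"activity": "Archive", "timestamp": "2010-03-02T10:00:00"},
]

problems = incomplete_events(events)
print(problems)
```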

Typical problems with missing data are:

Fields in the database of the information system are simply overwritten. So, old entries are lost and the database only provides information about the current status, but not the overall history of what happened in the past.
Some systems employ “batch logging” procedures, where, for example, activities are logged once a day (all at once). This way, all changes in between are lost, and the order in which things happened can no longer be reconstructed.

Typical OLAP and data mining techniques do not require the whole history of a process, and therefore data warehouses often do not contain all the data that is needed for process mining.

Another problem is that, ironically, logging too much data can mean that there is not enough data. I have heard of more than one SAP or enterprise service bus system that does not keep logs longer than one month because of the sheer amount of data that would accumulate otherwise. But processes often run longer than one month and, therefore, logs from a larger timeframe would be needed.

Finally, for specific types of analysis additional data is required. For example, to calculate execution times for activities both start and completion timestamps must be available in the data. For an organizational analysis, the person or the department that performed an activity should be included in the log extract, and so forth.
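The execution-time case can be illustrated with a small sketch. It assumes each event carries hypothetical `start` and `complete` fields; events that lack either one are skipped, because no execution time can be derived for them.

```python
from datetime import datetime, timedelta

def execution_times(events):
    """Return (case_id, activity, duration) for events with both timestamps."""
    result = []
    for e in events:
        start, complete = e.get("start"), e.get("complete")
        if start is None or complete is None:
            continue  # execution time cannot be calculated for this event
        result.append((e["case_id"], e["activity"], complete - start))
    return result

# Hypothetical sample data: the second event has no start timestamp.
events = [
    {"case_id": "C1", "activity": "Check invoice",
     "start": datetime(2010, 3, 1, 9, 0), "complete": datetime(2010, 3, 1, 9, 30)},
    {"case_id": "C1", "activity": "Pay invoice",
     "start": None, "complete": datetime(2010, 3, 2, 14, 0)},
]

durations = execution_times(events)
for case_id, activity, duration in durations:
    print(case_id, activity, duration)
```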

3. Semantics

One of the biggest challenges can be to find the right information and to understand what it means.

In fact, figuring out the semantics of existing IT logs can be anything between really easy and incredibly complicated. It largely depends on how distant the logs are from the actual business logic. For example, the performed business process steps may be recorded directly with their activity name, or you might need a mapping between some kind of cryptic action code and the actual business activity.
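Where such a mapping is needed, it can be kept as a simple lookup table that is filled in together with the people who know the system. The codes and activity names below are invented for illustration. The sketch also collects codes that have no mapping yet, since those are the ones to ask about.

```python
# Hypothetical mapping from cryptic action codes to business activities.
ACTION_CODE_TO_ACTIVITY = {
    "A01": "Create purchase order",
    "A02": "Approve purchase order",
}

def translate(events, mapping):
    """Add an 'activity' field to each event; collect codes without a mapping."""
    translated, unknown = [], set()
    for e in events:
        code = e["action_code"]
        if code in mapping:
            translated.append({**e, "activity": mapping[code]})
        else:
            unknown.add(code)  # needs clarification with an IT specialist
    return translated, unknown

# Hypothetical sample data containing one unmapped code.
events = [
    {"case_id": "C1", "action_code": "A01"},
    {"case_id": "C1", "action_code": "X99"},
]

translated, unknown = translate(events, ACTION_CODE_TO_ACTIVITY)
print(translated)
print("Unmapped codes:", sorted(unknown))
```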

It is best to work together with an IT specialist who can help you extract the right data and explain the meaning of the different fields. In terms of process mining it helps not to try to understand everything at once. Instead, focus first on the three essential elements:

How to differentiate process instances,
Where to find the activity logs, and
The start and/or completion timestamps for activities.

In the next phase, one can look further for additional data that would enhance the analysis from a business perspective.

4. Correlation

Because process mining is based on the history of a process, the individual process instances need to be reconstructed from the log data. Correlation is about stitching everything together in the correct way:

Business processes often span multiple IT systems, and usually each IT system has its own local IDs. One needs to correlate these local process IDs to combine log fragments from the different systems (local ID from system No. 1 and local ID from system No. 2) in order to get a full picture of the process from start to end.
Even within the same system correlation may be necessary. For example, in an ERP purchase-to-pay process purchase orders are identified by purchase order IDs and later on the invoices are characterized by invoice IDs. To get an end-to-end process perspective, the corresponding purchase order IDs and invoice IDs need to be matched.
Sometimes, there are hierarchical processes and then activity instances need to be distinguished to correlate lower-level events that belong to these (activity) sub processes.
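The purchase order and invoice example can be sketched as follows, assuming a link table that says which invoice belongs to which purchase order is available. The field names, the link table, and the choice to use the purchase order ID as the end-to-end case ID are all assumptions for illustration.

```python
from datetime import datetime

def correlate(po_events, invoice_events, invoice_to_po):
    """Merge both event lists under the purchase order ID as the case ID.

    Invoice events whose invoice ID has no known purchase order are
    returned separately instead of being dropped silently.
    """
    merged, unmatched = [], []
    for e in po_events:
        merged.append({"case_id": e["po_id"], "activity": e["activity"],
                       "timestamp": e["timestamp"]})
    for e in invoice_events:
        po_id = invoice_to_po.get(e["invoice_id"])
        if po_id is None:
            unmatched.append(e)
        else:
            merged.append({"case_id": po_id, "activity": e["activity"],
                           "timestamp": e["timestamp"]})
    merged.sort(key=lambda e: (e["case_id"], e["timestamp"]))
    return merged, unmatched

# Hypothetical sample data: invoice INV-9 has no matching purchase order.
po_events = [
    {"po_id": "PO-7", "activity": "Create purchase order", "timestamp": datetime(2010, 3, 1)},
]
invoice_events = [
    {"invoice_id": "INV-3", "activity": "Receive invoice", "timestamp": datetime(2010, 3, 5)},
    {"invoice_id": "INV-9", "activity": "Receive invoice", "timestamp": datetime(2010, 3, 6)},
]
invoice_to_po = {"INV-3": "PO-7"}  # assumed link table

merged, unmatched = correlate(po_events, invoice_events, invoice_to_po)
for e in merged:
    print(e["case_id"], e["activity"], e["timestamp"])
```

Keeping the unmatched events visible is a deliberate choice here: a large number of them is itself a data quality finding.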

Overall, it is best to start simple (and ideally with one system) to pick the low-hanging fruit first and demonstrate the value of process mining.

5. Timing

Precisely because process mining evaluates the history of performed process instances, the timing is very important for ordering the events within each sequence. If the timestamps are wrong or not precise enough, then it is difficult to create the correct order of events in the history.

Some of the problems I have seen with timestamps are:

Timestamp resolution is too low. For example, only the date of a performed activity (but not the time) is recorded. But even if the time is recorded, it may be necessary to record it at least with millisecond accuracy if many events follow each other in automated systems.
Different timestamp granularities on different systems. For example, the timestamps in one system may be rounded to minutes. Another system (which is also executing a part of the process) records events with 1-second resolution. When put together, the order of some of the events may be wrong due to the granularity difference.
Different clocks on different systems. If multiple computers record data, then these computers can have different system clocks. In the merged log, these time differences then create problems, since they destroy the correct order of events.

Ideally, timestamps should be precise, not be rounded up or down, and synchronized (if there are multiple systems). If there are differences, it may help to work with offsets. If too many events have the same timestamp, one can try to use the original sequence of events.
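Both remedies can be combined in a small sketch: a per-system offset corrects a known clock difference, and a stable sort keeps events with identical timestamps in the order in which they were originally recorded. The system names, the offset value, and the field names are assumptions for illustration. In practice the offset has to be determined by comparing the clocks.

```python
from datetime import datetime, timedelta

def merge_logs(logs, offsets):
    """Merge per-system event lists into one log ordered by corrected time.

    logs:    system name -> events in their original recorded order
    offsets: system name -> timedelta added to that system's timestamps
    """
    merged = []
    for system, events in logs.items():
        offset = offsets.get(system, timedelta(0))
        for e in events:
            merged.append({**e, "system": system,
                           "timestamp": e["timestamp"] + offset})
    # list.sort is stable, so events with equal timestamps keep the
    # relative order in which they were appended (the original sequence).
    merged.sort(key=lambda e: e["timestamp"])
    return merged

# Hypothetical sample data: the clock of system A is assumed to run
# two minutes ahead, and its two events share the same timestamp.
logs = {
    "A": [
        {"activity": "Register", "timestamp": datetime(2010, 3, 1, 10, 2)},
        {"activity": "Check", "timestamp": datetime(2010, 3, 1, 10, 2)},
    ],
    "B": [
        {"activity": "Approve", "timestamp": datetime(2010, 3, 1, 10, 1)},
    ],
}
offsets = {"A": timedelta(minutes=-2)}

merged = merge_logs(logs, offsets)
print([e["activity"] for e in merged])
```

Without the offset, the event from system B would be sorted before the two events from system A, which is the kind of ordering error described above.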

Too many problems?

If all this sounds terrible, do not despair. Not all data are bad, and starting simple helps. Furthermore, it is surprising how many valuable results can be obtained from existing log data that were not even created with analysis purposes in mind.

Insight into data quality problems and bad data is often one of the first good results. Improving data is important as analyzability becomes more and more relevant. I liked what Mark Norton wrote in his comment on a recent blog post about the monetary value of data by Forrester Analyst Rob Karel:

If you don’t have the data, decisions can’t be made (by definition), and if decisions can’t be made, the organization cannot create value. So there is also an opportunity cost associated with non-existent or bad data.

What are your experiences with bad data?