Top 5 Data Quality Problems for Process Mining

“Garbage in, garbage out” – Most of you will know this phrase. For any data analysis technique the quality of the underlying data is important. Otherwise you run the risk of drawing the wrong conclusions.

In this post, I want to go over the five biggest data problems that you might encounter in a process mining project.

1. Incorrect logging

In the process mining world most people use the term “Noise” for exceptional behavior, not for incorrect logging. This means that if a process discovery algorithm is said to be able to deal with noise, then it can abstract from infrequent behavior by only showing the main process flow. The reason is simple: discovery algorithms cannot distinguish incorrect logging from exceptional events.
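
To illustrate what "abstracting from infrequent behavior" can look like, here is a minimal sketch in Python. It is not how any particular discovery algorithm works; the function name `frequent_variants`, the `min_share` threshold, and the sample activities are all assumptions made up for this example.

```python
from collections import Counter

def frequent_variants(traces, min_share=0.1):
    """Keep only variants (distinct activity sequences) that cover
    at least min_share of all cases."""
    counts = Counter(tuple(trace) for trace in traces)
    total = sum(counts.values())
    return {variant: n for variant, n in counts.items() if n / total >= min_share}

# Hypothetical traces, one list of activities per case.
traces = [
    ["Register", "Check", "Approve"],
    ["Register", "Check", "Approve"],
    ["Register", "Check", "Approve"],
    ["Register", "Check", "Reject"],
    ["Register", "Check", "Reject"],
    ["Register", "Approve", "Check"],  # rare: a real exception or a logging error?
]

main_flow = frequent_variants(traces, min_share=0.3)
for variant, n in main_flow.items():
    print(n, " -> ".join(variant))
```

Note that the filter drops the rare variant purely because of its frequency. It has no way of knowing whether that case really happened or was logged incorrectly.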

Incorrect logging means that the recorded data is wrong. In such a situation the data does not reflect “the Truth” but instead provides wrong information about reality.

Here are two true stories of incorrect data:

The message here is to be careful with manually created data because it is usually less reliable than automatically registered data. If there are doubts about the trustworthiness of the data, then the data quality should be examined first before proceeding with the analysis.

Another example is inconsistent logging due to human differences: one person may hit the “completed” button in a workflow system at the beginning of a task and another person at the end. Only when you are aware of such inconsistencies can they be factored in during the analysis.

2. Insufficient logging

While incorrect logging is about wrong data, insufficient logging is about missing data. The minimum requirements for process mining are a case ID, an activity name, and a timestamp per event to reconstruct the history of each process instance.
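
A quick check for these minimum requirements could look like the following sketch. The field names `case_id`, `activity`, and `timestamp` are assumptions; your log extract will most likely use different column names.

```python
# Assumed field names; adapt them to the columns in your own extract.
REQUIRED_FIELDS = ("case_id", "activity", "timestamp")

def find_incomplete_events(events):
    """Return (index, missing_fields) for each event that lacks a required value."""
    problems = []
    for i, event in enumerate(events):
        missing = [f for f in REQUIRED_FIELDS if event.get(f) in (None, "")]
        if missing:
            problems.append((i, missing))
    return problems

# Hypothetical sample events.
events = [
    {"case_id": "C1", "activity": "Register", "timestamp": "2024-01-05T09:00:00"},
    {"case_id": "C1", "activity": "", "timestamp": "2024-01-05T09:30:00"},
    {"case_id": None, "activity": "Approve"},
]

problems = find_incomplete_events(events)
for index, missing in problems:
    print(f"event {index} is missing: {', '.join(missing)}")
```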

Typical problems with missing data are:

Typical OLAP and data mining techniques do not require the whole history of a process, and therefore data warehouses often do not contain all the data that is needed for process mining.

Another problem is that, ironically, logging too much data can leave you with not enough data. I have heard of more than one SAP or enterprise service bus system that does not keep logs longer than one month because of the sheer amount of data that would otherwise accumulate. But processes often run longer than one month and, therefore, logs from a larger timeframe would be needed.

Finally, for specific types of analysis additional data is required. For example, to calculate execution times for activities both start and completion timestamps must be available in the data. For an organizational analysis, the person or the department that performed an activity should be included in the log extract, and so forth.
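
As a small illustration of why both timestamps are needed, the sketch below computes the execution time of a single activity. The field names `start` and `complete` and the ISO timestamp format are assumptions for the example.

```python
from datetime import datetime

def execution_time(event):
    """Execution time of one activity as a timedelta.
    Requires both a start and a completion timestamp."""
    start = datetime.fromisoformat(event["start"])
    complete = datetime.fromisoformat(event["complete"])
    return complete - start

# Hypothetical event with both timestamps recorded.
event = {
    "activity": "Check request",
    "start": "2024-01-05T09:00:00",
    "complete": "2024-01-05T09:45:00",
}

duration = execution_time(event)
print(event["activity"], "took", duration)
```

If only completion timestamps are logged, you can still measure the time between two consecutive activities, but you cannot tell how much of it was waiting and how much was actual work.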

3. Semantics

One of the biggest challenges can be to find the right information and to understand what it means.

In fact, figuring out the semantics of existing IT logs can be anything between really easy and incredibly complicated. It largely depends on how distant the logs are from the actual business logic. For example, the performed business process steps may be recorded directly with their activity name, or you might need a mapping between some kind of cryptic action code and the actual business activity.
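
If the log only contains action codes, the mapping step might be sketched as follows. The codes and activity names are invented for illustration; in practice the mapping table would come from the IT specialist or the system documentation.

```python
# Hypothetical mapping from cryptic action codes to business activities.
ACTION_CODES = {
    "A01": "Register request",
    "A17": "Check request",
    "Z99": "Approve request",
}

def translate(events, mapping):
    """Add a readable activity name to each event and collect unmapped codes."""
    translated, unknown = [], set()
    for event in events:
        code = event["action_code"]
        if code in mapping:
            translated.append({**event, "activity": mapping[code]})
        else:
            unknown.add(code)
    return translated, unknown

events = [
    {"case_id": "C1", "action_code": "A01"},
    {"case_id": "C1", "action_code": "A17"},
    {"case_id": "C1", "action_code": "Q42"},  # not in the mapping table
]

translated, unknown = translate(events, ACTION_CODES)
print([e["activity"] for e in translated])
print("codes that still need an explanation:", sorted(unknown))
```

Collecting the unknown codes separately, instead of silently dropping them, gives you a concrete list of questions to take back to the people who know the system.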

It is best to work together with an IT specialist who can help you extract the right data and explain the meaning of the different fields. In terms of process mining it helps not to try to understand everything at once. Instead, focus first on the three essential elements:

  1. How to differentiate process instances,

  2. Where to find the activity logs, and

  3. Which fields hold the start and/or completion timestamps for activities.

In the next phase, one can look further for additional data that would enhance the analysis from a business perspective.

4. Correlation

Because process mining is based on the history of a process, the individual process instances need to be reconstructed from the log data. Correlation is about stitching everything together in the correct way:

Overall, it is best to start simple (and ideally with one system) to pick the low-hanging fruit first and demonstrate the value of process mining.

5. Timing

Precisely because process mining evaluates the history of performed process instances, the timing is very important for ordering the events within each sequence. If the timestamps are wrong or not precise enough, then it is difficult to create the correct order of events in the history.

Some of the problems I have seen with timestamps are:

Ideally, timestamps should be precise, not be rounded up or down, and synchronized (if there are multiple systems). If there are differences, it may help to work with offsets. If too many events have the same timestamp, one can try to use the original sequence of events.
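
Both workarounds can be combined in one ordering step, as in the following sketch. The system names, the two-minute offset, and the field names are assumptions for the example; real offsets have to be determined for your own systems.

```python
from datetime import datetime, timedelta

# Assumed clock corrections per source system: system_b runs two minutes fast.
CLOCK_OFFSETS = {
    "system_a": timedelta(0),
    "system_b": timedelta(minutes=-2),
}

def order_events(events):
    """Sort events by offset-corrected timestamp; events with identical
    timestamps keep their original position in the log."""
    def sort_key(indexed_event):
        position, event = indexed_event
        corrected = datetime.fromisoformat(event["timestamp"]) + CLOCK_OFFSETS[event["system"]]
        return (corrected, position)
    return [event for _, event in sorted(enumerate(events), key=sort_key)]

# Hypothetical events of one case, in the order they appear in the log.
events = [
    {"activity": "Register", "system": "system_a", "timestamp": "2024-01-05T09:00:00"},
    {"activity": "Approve", "system": "system_a", "timestamp": "2024-01-05T09:02:00"},
    {"activity": "Archive", "system": "system_a", "timestamp": "2024-01-05T09:02:00"},
    {"activity": "Check", "system": "system_b", "timestamp": "2024-01-05T09:03:00"},
]

ordered = order_events(events)
print([e["activity"] for e in ordered])
```

Without the offset, "Check" would appear after "Approve" and "Archive". The position in the original log is only a fallback: it helps when the log was written in the order things happened, which is itself an assumption worth verifying.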

Too many problems?

If all this sounds terrible, do not despair. Not all data are bad, and starting simple helps. Furthermore, it is surprising how many valuable results can be obtained from existing log data that were not even created with analysis purposes in mind.

Insight into data quality problems and bad data is often one of the first good results. Improving data is important as analyzability becomes more and more relevant. I liked what Mark Norton wrote in his comment on a recent blog post about the monetary value of data by Forrester Analyst Rob Karel:

If you don’t have the data, decisions can’t be made (by definition), and if decisions can’t be made, the organization cannot create value. So there is also an opportunity cost associated with non-existent or bad data.

What are your experiences with bad data?
