Do I Need to Remove Outliers for My Process Mining Analysis?

Outliers in Process Mining

A data point that is significantly different from other data points in a data set is considered an outlier. If you find an outlier in your event log, should you remove it before you continue with your process mining analysis?

In process mining terms, an outlier can mean many different things:

In machine learning, outliers are sometimes removed from the data sample during a cleaning step to improve the model. So, what about process mining: Should you remove such outliers when you find them to better represent the mainstream behavior of your process?

It depends.

First, you need to check whether the outlier is a data quality problem or whether it really happened in the process. As a rule of thumb, remove outliers that are caused by data quality issues and keep the ones that truly happened.

For example, one reason that a case has a much longer duration than others could be that it contains an event with a zero timestamp (such as 1900, 1970, or 2999). Zero timestamps can be errors or indicate that an activity has not happened yet. Either way, they do not reflect the actual time of the activity and, therefore, are misleading.
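As a minimal sketch of such a check, the snippet below scans an event log for timestamps that fall in one of the placeholder years mentioned above. The tuple layout (case ID, activity, timestamp), the function name, and the sample events are assumptions made for illustration; your own data may use different placeholder values.

```python
from datetime import datetime

# Placeholder years taken from the examples in the text (an assumption:
# your system may use other values for "zero" timestamps).
PLACEHOLDER_YEARS = {1900, 1970, 2999}


def cases_with_placeholder_timestamps(events, placeholder_years=PLACEHOLDER_YEARS):
    """Return the IDs of cases with at least one event in a placeholder year."""
    return {
        case_id
        for case_id, _activity, timestamp in events
        if timestamp.year in placeholder_years
    }


# Hypothetical event log: (case ID, activity, timestamp)
events = [
    ("case-1", "Create order", datetime(2023, 3, 1, 9, 0)),
    ("case-1", "Ship order", datetime(2023, 3, 4, 15, 30)),
    ("case-2", "Create order", datetime(2023, 3, 2, 10, 0)),
    ("case-2", "Ship order", datetime(1970, 1, 1, 0, 0)),  # zero timestamp
]

suspicious = cases_with_placeholder_timestamps(events)
print(sorted(suspicious))
```

Cases found this way are candidates for a closer look: you still need to decide whether to remove the whole case or only the affected event.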

Another reason could be that the one case that took 20 times as long as you would expect (for example, 20 months instead of 4 weeks) really belongs to a crazy customer case that took multiple rounds, lots of ping pong between different departments, and simply an unusually long time to resolve. This is part of the process reality.
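To find such cases in the first place, you can compare each case duration with a typical duration. The sketch below uses the median as the typical value and a factor of 10 as the threshold; both choices, as well as the function names and the sample data, are assumptions for illustration. The result only tells you which cases to inspect, not whether they should be removed.

```python
from datetime import datetime
from statistics import median


def case_durations_in_days(events):
    """Duration per case: time between its first and its last event, in days."""
    first, last = {}, {}
    for case_id, _activity, ts in events:
        first[case_id] = min(ts, first.get(case_id, ts))
        last[case_id] = max(ts, last.get(case_id, ts))
    return {c: (last[c] - first[c]).total_seconds() / 86400 for c in first}


def flag_long_cases(durations, factor=10):
    """Return the cases that took more than `factor` times the median duration."""
    typical = median(durations.values())
    return {c: d for c, d in durations.items() if d > factor * typical}


# Hypothetical event log: four cases of about 4 weeks, one of about 20 months
events = [
    ("A", "Start", datetime(2022, 1, 3)), ("A", "End", datetime(2022, 1, 31)),
    ("B", "Start", datetime(2022, 2, 1)), ("B", "End", datetime(2022, 2, 28)),
    ("C", "Start", datetime(2022, 3, 1)), ("C", "End", datetime(2022, 3, 31)),
    ("D", "Start", datetime(2022, 4, 1)), ("D", "End", datetime(2022, 4, 29)),
    ("E", "Start", datetime(2022, 1, 10)), ("E", "End", datetime(2023, 9, 10)),
]

durations = case_durations_in_days(events)
flagged = flag_long_cases(durations)
print(flagged)
```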

When you should remove outliers

You should clean up your outliers in the following situations:

Be mindful of how much data you remove in the cleaning process. If you remove too much, the remaining data set may no longer be representative.
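One simple way to keep an eye on this is to track the share of cases that your cleaning steps remove. The following is a minimal sketch; the function name and the numbers are hypothetical, and what counts as "too much" depends on your process and your analysis question.

```python
def removal_share(all_case_ids, removed_case_ids):
    """Fraction of all cases that were removed during cleaning."""
    all_cases = set(all_case_ids)
    # Ignore removed IDs that are not part of the original data set.
    removed = set(removed_case_ids) & all_cases
    return len(removed) / len(all_cases) if all_cases else 0.0


# Hypothetical example: 200 cases in total, 7 removed due to data quality issues
all_ids = range(200)
removed_ids = [3, 17, 42, 99, 150, 151, 199]

share = removal_share(all_ids, removed_ids)
print(f"{share:.1%} of all cases were removed")
```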

And keep in mind that not all data quality problems are outliers! For example, the recorded timestamps may not reflect the actual time of activities but look entirely normal.

When you should keep outliers

The idea behind keeping outliers that reflect what really happened is that you want to see the whole picture of the process. Sometimes, exceptions in the process are the most interesting result of your analysis, especially when they point to compliance issues or security risks in the process (say, a violation of the segregation of duties rule).

For example, you should keep outliers in the following situations:

At the same time, there are reasons to specifically address – and sometimes even remove – outliers even though they are “real”. For example:

So, if outliers really happened in the process, you generally want to keep them, because you want to see everything that is really there (just like you don’t need a minimum number of data points to perform a process mining analysis). But you want to be aware of them in the analysis.
