How Much Data Do You Need For Your Process Mining Project?
Here is another question I am frequently asked once people are eager to get started with the data extraction phase of their process mining project.
FAQ #3: Which timeframe should my log cover?
As a rule of thumb, I usually recommend trying to get data for at least 3 months. Depending on the run time of a single process instance, it may be better to get data for up to a year. For example, if your process usually needs 5–6 months to complete (think of a public building permit process), a 3-month sample will not give you even one complete process instance.
How long are your cases?
So, it really depends on how long a case in your process typically runs. You want a representative set of cases, and you need to leave some room to catch the usual few long-running instances as well.
If you are still unsure how much data you need to extract, use the following formula based on the expected throughput time for your process:
timeframe = expected case completion time * 4 * 5
The baseline is the expected process completion time for a typical case. The factor 4 ensures that you have enough data to see four cases that were started and completed one after another (of course there will be others in between). The factor 5 accounts for the occasional long-running cases (80/20 rule) and makes sure that cases taking up to five times longer still fall within the extracted time window.
For example, if the expected completion time of a typical case in your process is 5 days, then the formula yields 100 days = 5 days * 4 * 5, which is approximately 3 months of data. If, however, a typical process is completed in just a few minutes, then extracting a couple of hours of data may be enough.
Please take the formula with a grain of salt. It has worked well for me, but the more you know about your process the better you will be able to judge the amount of data you should extract.
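The rule of thumb above can be sketched in a few lines of code. This is just an illustration of the formula; the function name and parameter names are my own, not part of any process mining tool:

```python
from datetime import timedelta

def extraction_timeframe(expected_completion: timedelta,
                         sequence_factor: int = 4,
                         long_running_factor: int = 5) -> timedelta:
    """Rule-of-thumb extraction window:
    expected case completion time * 4 (consecutive cases) * 5 (long runners)."""
    return expected_completion * sequence_factor * long_running_factor

# A typical case that completes in 5 days suggests a window of
# 5 days * 4 * 5 = 100 days, i.e. roughly 3 months of data:
window = extraction_timeframe(timedelta(days=5))
print(window)  # 100 days, 0:00:00
```

The same function also covers the fast-process case from the example: with `timedelta(minutes=10)` it suggests an extraction window of just a few hours.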
Two ways to extract data
Another way to make sure you get a good data sample is to choose a timeframe that you want to analyze (say, April of this year) and then extract all events for the cases that were started in that month. This way, you can catch long-running instances even though you are focusing on a shorter timeframe for your analysis.
The picture below illustrates the difference. Every horizontal bar represents one case over time. The highlighted area stands for the selected timeframe, and the dark blue areas are the events that are covered by the data extraction method.
On the left, all events outside of the chosen timeframe are ignored, which leads to incomplete cases in your data set. These incomplete cases can be easily filtered out and are not a problem as long as you have enough data.
On the right, the events for all cases that are started within the chosen timeframe are kept, even if they fall outside the selected time period. This leads to a greater number of completed cases and can be useful if the chosen timeframe is short.
If the end date of your timeframe is today, then there is no difference between the two methods: Cases may always be incomplete because they are still running.
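The difference between the two extraction methods can be shown on a toy event log. The data below is made up for illustration, and each event is a simple `(case_id, activity, timestamp)` tuple; in practice you would run the equivalent filters in SQL or in your extraction tool:

```python
from datetime import datetime

# Hypothetical event log: (case_id, activity, timestamp)
events = [
    ("c1", "start",  datetime(2024, 3, 28)),
    ("c1", "finish", datetime(2024, 4, 10)),
    ("c2", "start",  datetime(2024, 4, 5)),
    ("c2", "finish", datetime(2024, 5, 2)),
    ("c3", "start",  datetime(2024, 4, 20)),
    ("c3", "finish", datetime(2024, 4, 25)),
]

window_start = datetime(2024, 4, 1)
window_end = datetime(2024, 5, 1)

# Method on the left: keep only events inside the timeframe.
# Cases c1 and c2 become incomplete (their start or finish is cut off).
by_event = [e for e in events if window_start <= e[2] < window_end]

# Method on the right: keep all events of cases that *started* in the
# timeframe (case start = earliest event of the case).
case_start = {}
for case, _, ts in events:
    case_start[case] = min(case_start.get(case, ts), ts)

started_in_window = {c for c, t in case_start.items()
                     if window_start <= t < window_end}
by_case = [e for e in events if e[0] in started_in_window]
# by_case now contains every event of c2 and c3, including c2's
# finish on May 2, which lies outside the selected timeframe.
```

Note how `by_event` contains fragments of three cases, while `by_case` contains two fully complete cases (and drops c1, which started before the window).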
It also depends on your questions
The amount of data you should extract also depends on the questions that you want to answer. For example, if you want to understand the regular process, then adding more data at a certain point won’t give you any more insights.
However, if you are looking for exceptions or irregularities that are important from a compliance angle, you probably want to check the data of the whole audit year to catch everything that went wrong in the audited period.
What is your experience with the amount of data that needs to be extracted? Let us know in the comments.
Note, however, that any activity from earlier cases (those started before the selected time period) will not be visible with this extraction method.