Privacy, Security and Ethics in Process Mining — Part 3: Anonymization
This is the 3rd article in our series on privacy, security and ethics in process mining. You can find an overview of all articles in the series here.
If you have sensitive information in your data set, instead of removing it you can also consider the use of anonymization techniques. When you anonymize a set of values, then the actual values (for example, the employee names Mary Jones, Fred Smith, etc.) will be replaced by another value (for example, Resource 1, Resource 2, etc.).
If the same original value appears multiple times in the data set, then it will be replaced with the same replacement value (Mary Jones will always be replaced by Resource 1). This way, anonymization allows you to obfuscate the original data but it preserves the patterns in the data set for your analysis. For example, you will still be able to analyze the workload distribution across all employees without seeing the actual names.
Some process mining tools (Disco and ProM) include anonymization functionality. This means that you can import your data into the process mining tool and select which data fields should be anonymized. For example, you can choose to anonymize just the Case IDs, the resource name, attribute values, or the timestamps. Then you export the anonymized data set and you can distribute it among your team for further analysis.
Determine which data fields are sensitive and need to be anonymized (see also the list of common process mining attributes and how they are impacted if anonymized below).
Keep in mind that despite the anonymization certain information may still be identifiable. For example, there may be just one patient having a very rare disease, or the birthday information of your customer combined with their place of birth may narrow down the set of possible people so much that the data is not anonymous anymore.
Anonymize the data before you have cleaned your data, because after the anonymization the data cleaning may not be possible anymore. For example, imagine that slightly different customer category names are used in different regions but they actually mean the same. You would like to merge these different names in a data cleaning step. However, after you have anonymized the names as Category 1, Category 2, etc. the data cleaning cannot be done anymore.
Anonymize fields that do not need to be anonymized. While anonymization can help to preserve patterns in your data, you can easily lose relevant information. For example, if you anonymize the Case ID in your incident management process, then you cannot look up the ticket number of the incident in the service desk system anymore. By establishing a collaborative culture around your process mining initiative (see part 4) and by working in a responsible, goal-oriented way, you can often work openly with the original data that you have within your team.
Anonymization of Common Process Mining Fields
Here is an overview of the typical process mining attributes and why you might want (or might not want) to anonymize them:
Removing the names of the employees working in the process is one of the more common anonymization steps. It can help to decrease friction and put employees more at ease when you involve them in a joint analysis workshop. Anonymizing employee names certainly is a must if you make your data publicly available in some form.
Be aware that it may still be possible to trace back individual employees. For example, if you look up a concrete case based on the case ID in the operational system, you will see the actual resource names there.
Finally, keep in mind that anonymizing employee names for an internal process mining analysis also removes valuable information. For example, if you identify process deviations or an interesting process pattern, normally the first step is to speak with the employees who were involved in this case to understand what happened and learn from them.
Anonymizing the case ID is a must if it contains sensitive information. For example, if you analyze the income tax return process at the tax office, then the case ID will be a combination of the social security number of the citizen and the year of the tax declaration. You will have to replace the social security information for obvious reasons.
However, for data sets where the case ID is less sensitive it is a good idea to keep it in place as it is. The benefit will be that you can look up individual cases in the operational system to verify your analysis or obtain additional information. Losing this link will limit your ability to perform root cause analyses and take action on the process problems that you discover.
Normally, you would not anonymize the activity name itself. The activities are the process steps that appear in the process map and in the variant sequences in the Process Mining tool. The reason why you do not want to replace the activity names by, for example, Activity 1, Activity 2, Activity 3, etc., is that most processes become very complex very quickly and without the activity names you have no chance to build a mental model and understand the process flows you are analyzing. Your analysis becomes useless.
Keeping the activity names in full is usually not a problem, because they describe a generic process step (like Email sent). However, especially if you have many different activity names in your data, you should review them to ensure they contain no confidential information (e.g., Email sent by lawyer X).
Sensitive information is often contained in additional attribute columns. For example, even if you are analyzing an internal ordering process, there might be additional data fields revealing information about the customer.
You can either completely remove data columns that you don’t need, or you can anonymize their values. Keep the attribute columns that are not sensitive in their original form, because they can contain important context information when you inspect individual cases during your Process Mining analysis.
Finally, be aware that sensitive information can also be hidden in a Notes attribute or some other kind of free-text field, where the employees write down additional information about the case or the process step. Simply anonymizing such a free-text field would be useless, because the whole text would be replaced by Value 1, Value 2, etc. To preserve the usefulness of the free-text field while removing sensitive information requires more work in the data pre-processing step and is not something that process mining tools can do for you automatically.
Sometimes, the time at which a particular activity happened already reveals too much information and would make it possible to identify one of your business entities in an unwanted way. In such situations, you can anonymize the timestamps by applying an offset. This means that a certain number of days, hours, and minutes will be added to the actual timestamps to create new (now anonymized) timestamps.
Keep in mind that some of the process patterns may change when you analyze data sets with anonymized timestamps. For example, you might see activities appear on other times of the day than you would see in the original data set. For this reason, timestamp anonymization is mostly used if data sets are prepared for public release and not if you analyze a process within your company.