Data Preparation for Process Mining — Part II: Timestamp Headaches and Cures
This is a guest post by Nicholas Hartman (see further information about the author at the bottom of the page). It is Part II of a series of posts highlighting lessons learned from conducting process mining projects within large organizations (read Part I here).
If you have a process mining article or case study that you would like to share as well, please contact us at email@example.com.
Timestamps are core to any process mining effort. However, complex real-world datasets frequently present a range of challenges in analyzing and interpreting timestamp data. Sloppy system implementations often create a real mess for a data scientist looking to analyze timestamps within event logs. Fortunately, a few simple techniques can tackle most of the common challenges one will face when handling such datasets.
In this post I’ll discuss a few key points relating to timestamps and process mining datasets, including:
- Reading timestamps with code
- Useful time functions (time shifts and timestamp arithmetic)
- Understanding the meaning of timestamps in your dataset
Note that in this post all code samples will be in Python, although the concepts and similar functions will apply across just about any programming language, including various flavors of SQL.
Reading timestamps with code
As a data type, timestamps present two distinct challenges:
- The same data can appear in many different formats
- Concepts like time zones and daylight saving time mean that the same point in real time can be represented by entirely different numbers
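Both challenges can be illustrated with a minimal sketch using only Python's standard `datetime` module. The sample strings, the format patterns, and the assumption that the log is recorded in UTC are all hypothetical. The fixed offsets labeled `EST` and `EDT` are a simplification standing in for real time zone rules.

```python
from datetime import datetime, timedelta, timezone

# Challenge 1: hypothetical spellings of one wall-clock time in three formats.
samples = [
    ("2014-01-15 12:00:00", "%Y-%m-%d %H:%M:%S"),
    ("15/01/2014 12:00", "%d/%m/%Y %H:%M"),
    ("20140115T120000", "%Y%m%dT%H%M%S"),
]
parsed = [datetime.strptime(text, fmt) for text, fmt in samples]
same_wall_clock = all(p == parsed[0] for p in parsed)

# Challenge 2: one instant written with different offsets.
# Assumption: the log was recorded in UTC.
utc_instant = parsed[0].replace(tzinfo=timezone.utc)

# Fixed offsets used as stand-ins for standard time and daylight saving time.
est = timezone(timedelta(hours=-5), "EST")
edt = timezone(timedelta(hours=-4), "EDT")
as_est = utc_instant.astimezone(est)
as_edt = utc_instant.astimezone(edt)

# Aware datetimes compare by instant, not by the numbers on the clock face.
same_instant = utc_instant == as_est == as_edt

print(same_wall_clock, same_instant)
print(utc_instant.isoformat())
print(as_est.isoformat())
print(as_edt.isoformat())
```

The three parsed values are equal even though the source strings look nothing alike, and the three offset-aware values compare equal even though their hour fields differ. In a real dataset the offset would normally come from a time zone database rather than a hard-coded value.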
To a computer time is a…