Process Mining for ERP Systems
Description
Presentation held at the 1st Workshop for Data- and Artifact-Centric Processes, co-located with BPM 2012, September 2012.
Transcript
Erik Nooijen, Boudewijn v. Dongen, Dirk Fahland
Process Mining for ERP Systems

PAGE 1: Process Discovery
[Figure: process → event log → process discovery algorithm → model]
event log:
c1: A B C D E
c2: A C B D E
c3: A F D E
assumptions
• case = sequence of events of this case
• cases are isolated: event A in c1 happens only in c1 (and not in c2)
…
• cases of the same process
• one unique case id
• each event associated to exactly one case id

PAGE 2: Typical Process in an ERP System
[Figure: customers Alice and Bob order products X and Y from a build-to-order manufacturer; the manufacturer orders Materials A, B, and C from the suppliers ACME Inc. and Mega Corp.]

PAGE 3: n-to-m relations
[Figure: database → process discovery algorithm → process model; table annotations: id attributes, data attributes, time-stamp attributes, relations]

ProductOrder
poID  cust.  …  created      processed    built        shipped
po1   Alice  …  30-08 9:22   30-08 13:12  01-09 15:12  03-09 10:15
po2   Bob    …  30-08 10:15  30-08 13:14  01-09 16:13  03-09 17:18

Customer
cust.  address  …
Alice  …        …
Bob    …        …

OrderedMaterial
poID  moID  type  added
po1   mo3   B     30-08 13:13
po1   mo4   A     30-08 13:14
po2   mo3   B     30-08 13:15
po2   mo4   C     30-08 13:16

MaterialOrder
moID  suppl.  …  completed    sent         received
mo3   ACME    …  30-08 13:15  30-08 14:15  01-09 9:05
mo4   MEGA    …  30-08 13:17  30-08 16:12  01-09 10:13

PAGE 4-5: Process Discovery for ERP Systems
[Figure: process discovery algorithm → process model; class diagram of the database]
• Customer (cust, …)
• ProductOrder (poID, cust, created, processed, built, shipped)
• OrderedMat. (poID, moID, type, added)
• MaterialOrder (moID, supplier, completed, sent, received)
• Customer 1 to 0..* ProductOrder; ProductOrder 1 to 1..* OrderedMat.; OrderedMat. 1..* to 1 MaterialOrder
reality: data in a relational DB
• events stored as time-stamped attributes in tables
• multiple primary keys → multiple notions of case
• tables are related → one event related to multiple cases

PAGE 6: Outline
[Figure: decompose by primary keys into a log for PO and a log for MO; discovery yields a model for PO and a model for MO; process model = models related by primary foreign-key relations]

PAGE 7: Find Artifact Schemas
[Same figure as PAGE 6]

PAGE 8: Step 0: discover database schema
documented schema vs. actual schema
identify
• column types (esp. time-stamped columns)
• primary keys
• foreign keys
various (non-trivial) techniques available
key discovery is NP-complete in the size of the table(s)
result: [Figure: discovered schema]

PAGE 9: Step 1: decompose schema into processes
= schema summarization
find:
1. sets of corresponding tables
2. links between those
[Figure: ProductOrder, MaterialOrder]

PAGE 10: Automatic Schema Summarization
= group similar tables through clustering
define a distance between any 2 tables
• by relations
• by information content
tables that are close to each other → same cluster
# of clusters: user input

PAGE 11-12: Automatic Schema Summarization
1. structural distance between tables
fanout ~ avg. # of child records related to the same parent record
matched fraction ~ 1 / (fraction of records in parent with matching child record)
example: parent table with column A and records 1, 2; three child tables with columns A, B

child records (A, B)          fanout         matched fraction (m.fr)
(1,X) (2,Y)                   1              1
(1,X) (1,Y)                   1 = (2+0)/2    2 = 1/(1/2)
(1,X) (1,Y) (2,Z) (2,U)       2              1

PAGE 13: Grouping by Clustering
1. structural distance
2. information distance
   importance of each table = entropy (is maximal if all records are different)
   distance: 2 tables with high entropies → large distance
3. weighted distance by structure + information
4. k-means clustering: k clusters based on weighted distance
   most important table of cluster = table with least distance to all
   → key attribute of the cluster

PAGE 14: Artifact Schema → Artifact Log
[Same figure as PAGE 6]

PAGE 15-19: Log Extraction
cluster = set of related tables + primary key of most important table
• primary key → case id
• time-stamped attribute → event
• related attributes → event attributes

log for PO, extracted from:

poID  cust.  …  created      processed    built        shipped
po1   Alice  …  30-08 9:22   30-08 13:12  01-09 15:12  03-09 10:15
po2   Bob    …  30-08 10:15  30-08 13:14  01-09 16:13  03-09 17:18

poID  moID  type  added
po1   mo3   B     30-08 13:13
po1   mo4   A     30-08 13:14
po2   mo3   B     30-08 13:15
po2   mo4   C     30-08 13:16

po1: (created, poID=po1, time=30-08 9:22, cust.=Alice, …)
     (processed, poID=po1, time=30-08 13:12, …)
     (added, poID=po1, time=30-08 13:13, moID=mo3, …)   refers to artifact “MaterialOrder”
po2:

PAGE 20: Outline
[Figure as on PAGE 6, here with artifacts "order" and "quote": decompose by primary keys into a log for order and a log for quote; discovery yields a model for order and a model for quote; compose by primary foreign-key relations into the process model]

PAGE 21: Resulting Model(s)
[Figure: two lifecycle models related 1..* to 1..*]
• Product Order: create, processed, added, built, shipped
• Material Order: added, completed, sent, received
(added, poID=po1, …, moID=mo3)

PAGE 22: Implementation & Evaluation
prototype tool
• input: relational database (via JDBC), .csv tables
• steps
  − discover database schema (types, keys, relations)
  − discover artifact schema
    − by k-means clustering
    − by user picking tables
  − extract logs → ProM

PAGE 23: Evaluation: SAP System of Sligro
> 300 tables, > 40 GiB of data
schema extraction
• time-stamp attributes: 15 hrs
• primary keys: 4 hrs
• foreign keys: 5 hrs (single col.) / 6 days (double col.)
clustering
• entropies: 17 hrs
• table distances: 5 hrs
• clustering: a few seconds
• ~20 different artifacts found
• largest: 47 tables, 869 columns
log extraction
• extract 1000 traces of > 246,000 events
• query database: 1 hr
• write log file: 32 hrs

PAGE 24: Sligro: Artikel lifecycle model
[Figure: discovered lifecycle model]

PAGE 25: Open issues
performance
• key discovery: NP-complete in R (# of records)
• foreign key discovery: NP-complete in R²
• problem is in the “hard part” of NP
• sampling of data, domain knowledge, semi-automatic
requires good database structure
• proper relations, proper keys
• otherwise wrong clusters are formed
• events don’t get right attributes
• semi-automatic approach
events shared by multiple cases… working on it…
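As a minimal sketch of the two structural measures defined on the structural-distance slides (PAGE 11 and PAGE 12), the following Python computes fanout and matched fraction for a parent key column and the foreign-key column of a child table. The function names and the list-based table representation are assumptions made for illustration; they are not taken from the prototype tool described in the talk.

```python
from collections import Counter


def fanout(parent_keys, child_fks):
    """Average number of child records per parent record.

    Parents without any child count as 0, matching the slide's (2+0)/2 example.
    """
    counts = Counter(child_fks)
    return sum(counts[k] for k in parent_keys) / len(parent_keys)


def matched_fraction(parent_keys, child_fks):
    """1 / (fraction of parent records that have at least one child record)."""
    referenced = set(child_fks)
    matched = sum(1 for k in parent_keys if k in referenced)
    if matched == 0:
        return float("inf")  # assumption: no matching child means "infinitely far"
    return len(parent_keys) / matched


# Parent table with key values 1 and 2, and the three child tables of the example
# (only the foreign-key column A of each child table is needed).
parent = [1, 2]
child_a = [1, 2]        # records (1,X), (2,Y)
child_b = [1, 1]        # records (1,X), (1,Y)
child_c = [1, 1, 2, 2]  # records (1,X), (1,Y), (2,Z), (2,U)

for name, child in [("a", child_a), ("b", child_b), ("c", child_c)]:
    print(name, fanout(parent, child), matched_fraction(parent, child))
```

How these two numbers are combined into a single structural distance is not spelled out in the transcript, so the sketch stops at the two measures.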
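The clustering slide (PAGE 13) states that the importance of a table is its entropy, which is maximal if all records are different. The transcript does not give the exact formula, so the sketch below assumes plain Shannon entropy over the empirical distribution of whole records; `table_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def table_entropy(rows):
    """Shannon entropy (in bits) of the distribution of records in a table.

    Assumption: each record is treated as one symbol; identical records
    share a symbol. The talk may use a different, e.g. per-column, variant.
    """
    n = len(rows)
    if n == 0:
        return 0.0
    counts = Counter(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# All records different: entropy reaches its maximum, log2(number of records).
distinct = [("po1", "Alice"), ("po2", "Bob"), ("po3", "Carol"), ("po4", "Dave")]
# All records identical: entropy is zero, the table carries no information.
identical = [("x", "y")] * 4

print(table_entropy(distinct))
print(table_entropy(identical))
```

The customer names beyond Alice and Bob are made-up sample values, used only to show the two extreme cases.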
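The log-extraction rule from the slides (primary key of the most important table becomes the case id, every time-stamped attribute becomes an event, the remaining attributes of the row become event attributes) can be sketched in Python on the example tables. The dictionary-based table layout and the names `ts_key` and `extract_trace` are assumptions for illustration; the prototype itself reads from a relational database and writes logs for ProM.

```python
def ts_key(ts):
    """Turn a 'DD-MM H:MM' string into a sortable (month, day, hour, minute) tuple."""
    date, clock = ts.split()
    day, month = date.split("-")
    hour, minute = clock.split(":")
    return (int(month), int(day), int(hour), int(minute))


def extract_trace(case_id, key, tables):
    """Collect the events of one case from (rows, timestamp_columns) pairs."""
    events = []
    for rows, ts_columns in tables:
        for row in rows:
            if row.get(key) != case_id:
                continue
            # Non-timestamp columns of the row become event attributes.
            attrs = {c: v for c, v in row.items() if c not in ts_columns}
            for column in ts_columns:
                if row.get(column):
                    events.append({"event": column, "time": row[column], **attrs})
    return sorted(events, key=lambda e: ts_key(e["time"]))


product_order = [
    {"poID": "po1", "cust": "Alice", "created": "30-08 9:22",
     "processed": "30-08 13:12", "built": "01-09 15:12", "shipped": "03-09 10:15"},
    {"poID": "po2", "cust": "Bob", "created": "30-08 10:15",
     "processed": "30-08 13:14", "built": "01-09 16:13", "shipped": "03-09 17:18"},
]
ordered_material = [
    {"poID": "po1", "moID": "mo3", "type": "B", "added": "30-08 13:13"},
    {"poID": "po1", "moID": "mo4", "type": "A", "added": "30-08 13:14"},
    {"poID": "po2", "moID": "mo3", "type": "B", "added": "30-08 13:15"},
    {"poID": "po2", "moID": "mo4", "type": "C", "added": "30-08 13:16"},
]

PO_TS = ["created", "processed", "built", "shipped"]
OM_TS = ["added"]

trace = extract_trace("po1", "poID", [(product_order, PO_TS), (ordered_material, OM_TS)])
for event in trace:
    print(event)
```

The "added" events keep their moID attribute, which is what lets them refer to the "MaterialOrder" artifact, as noted on the slide.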