Process Mining
An index to the state of the art and an outline of open
research challenges at DIAG
Claudio Di Ciccio, Massimo Mecella
Seminars in Software and Services for the Information Society
Process Mining
Definition
• Process Mining [Aalst2011.book], also referred to
as Workflow Mining, is the set of techniques that
allow the extraction of process descriptions,
stemming from a set of recorded real executions
(logs).
• ProM [AalstEtAl2009] is one of the most used plugin based software environment for implementing
workflow mining (and more) techniques.
• The new version 6.0 is available for download at
www.processmining.org
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 2
Process Mining
Definition
• Process Mining involves:
• Process discovery
• Control flow mining, organizational mining, decision mining;
• workflow
• Conformance checking
• Operational support
• We will focus on the control flow mining
• Many control flow mining algorithms proposed
•
•
•
•
•
•
α [AalstEtAl2004], α++ [WenEtAl2007] and α# [WenEtAl2010]
Fuzzy [GüntherAalst2007]
Heuristic [WeijtersEtAl2001]
Genetic [MedeirosEtAl2007]
Two-step [AalstEtAl2010]
…
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 3
Process Mining
Further reading
• The rest of the lesson is based on the following
material:
• Van der Aalst, W. M. P.: “Process Discovery: An
Introduction”
• Available at
http://www.processmining.org/_media/processminingbook/process_
mining_chapter_05_process_discovery.pdf
• From the teaching material for [Aalst2011.book]
• De Medeiros, A. K. A.: “Process Mining: Control-Flow
Mining Algorithms”
• Available at
http://www.processmining.org/_media/courses/processmining/lecture3_controlfl
owminingalgorithms.ppt
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 4
A different context for process mining
Artful processes and knowledge workers
•
Artful processes [HillEtAl06]
– informal processes typically
carried out by those people
whose work is mental rather
than physical (managers,
professors, researchers,
engineers, etc.)
• “knowledge workers”
[ACTIVE09]
•
Knowledge workers create artful
processes “on the fly”
•
Though artful processes are
frequently repeated, they are not
exactly reproducible, even by their
originators, nor can they be easily
shared
– Loosely structured
– Highly flexible
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 5
A different context (2)
Email conversations
•
In collaborative contexts,
knowledge workers share
their information and
outcomes with other
knowledge workers
– E.g., a software
development mgr.
•
Typically, by means of
several email
conversations
– Email conversations
are actual traces of
running processes
that knowledge
workers adhere to
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 6
A different context (4)
Some areas of applicability
• Personal information management (PIM)
– how to organize one’s own activities, contacts, etc. through the
usage of software
• Information warfare
– in supporting anti-crime intelligence agencies
• Enterprise engineering
– for knowledge-heavy industries, where preserving documents
making up product data is not enough
• eHealth
– for the automatic discovery of medical treatment procedures on
top of patient health records
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 7
MailOfMine
What is MailOfMine?
MailOfMine is the approach and the implementation
of a collection of techniques, the aim of which is to
is to automatically build, on top of a collection of
email messages, a set of workflow models that
represent the artful processes laying behind the
knowledge workers’ activities.
[DiCiccioEtAl11]
[DiCiccioMecella12]
[DiCiccioMecella/TR12]
[DiCiccioMecella13]
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 8
On the visualization of processes
The imperative model
• Represents the
whole process
at once
• The most used
notation is
based on a
subclass of
Petri Nets
(namely, the
Workflow
Nets)
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 9
On the visualization of processes
The declarative model
•
If A is performed,
B must be perfomed,
no matter
before or afterwards
(responded existence)
Rather than using a procedural
language for expressing the
allowed sequence of activities, it is
based on the description of
workflows through the usage of
constraints
•
•
the idea is that every task can be
performed, except the ones which
do not respect such constraints
this technique fits with processes
that are highly flexible and subject
to changes, such as artful
processes
The notation here is based on [AalstEtAl06,
MaggiEtAl11] (DecSerFlow, Declare)
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
Whenever B is performed,
C must be performed
afterwards
and B can not be repeated
until C is done
(alternate response)
P. 10
On the visualization of processes
Imperative vS declarative
Declarative
Imperative
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
Declarative models work
better in presence of a
partial specification of
the process scheme
P. 11
A real discovered process model
“Spaghetti process” [Aalst2011.book]
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 12
Declare constraint templates
Existence templates
Existence(n, A)
Activity A occurs at least n times in the process instance
BCAAC ✓
BCAAAC ✓
BCAC ✗ (for n = 2)
Absence(A)
Activity A does not occur in the process instance
BCC ✓
BCAC ✗
Absence(n+1, A)
Activity A occurs at most n+1 times in the process instance
BCAAC ✗
BCAC ✓
BCC ✓ (for n = 2)
Exactly(n, A)
Activity A occurs exactly n times in the process instance
BCAAC ✗
BCAAAC ✗
BCAC ✗ (for n = 2)
Init(A)
Activity A is the first to occur in each process instance
BCAAC ✗
ACAAAC ✓
BCC ✗
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 13
Declare constraint templates
Relation templates
RespondedExistence(A, B)
If A occurs in the process instance, then B occurs as well
CAC ✗
CAACB ✓
BCAC ✓
BCC ✓
Response(A, B)
If A occurs in the process instance, then B occurs after A
BCAAC ✗
CAACB ✓
CAC ✗
BCC ✓
AlternateResponse(A, B)
Each time A occurs in the process instance, then B occurs
afterwards, before A recurs
BCAAC ✗
CAACB ✗
CACB ✓
CABCA ✗
BCC ✓
CACBBAB ✓
ChainResponse(A, B)
Each time A occurs in the process instance, then B occurs
immediately afterwards
BCAAC ✗
BCAABC ✗
BCABABC ✓
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 14
Declare constraint templates
Relation templates
RespondedExistence(B, A)
If B occurs in the process instance, then A occurs as well
CAC ✓
CAACB ✓
BCAC ✓
BCC ✗
Precedence(A, B)
B occurs in the process instance only if preceded by A
BCAAC ✗
CAACB ✓
CAC ✓
BCC ✓
AlternatePrecedence(A, B)
Each time B occurs in the process instance, it is preceded by A
and no other B can recur in between
BCAAC ✗
CAACB ✓
CACB ✓
CABCA ✓
BCC ✗
CACBAB ✓
ChainPrecedence(A, B)
Each time B occurs in the process instance, then B occurs
immediately beforehand
BCAAC ✗
BCAABC ✗
CABABCA ✓
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 15
Declare constraint templates
Relation templates
CoExistence(A, B)
If B occurs in the process instance, then A occurs, and viceversa
CAC ✗
CAACB ✓
BCAC ✓
BCC ✗
Succession(A, B)
A occurs if and only if it is followed by B in the process instance
BCAAC ✗
CAACB ✓
CAC ✗
BCC ✗
AlternateSuccession(A, B)
A and B occur in the process instance if and only if the latter
follows the former, and they alternate each other in the trace
BCAAC ✗
CAACB ✗
CACB ✓
CABCA ✗
BCC ✗
CACBAB ✓
ChainSuccession(A, B)
A and B occur in the process instance if and only if the latter
immediately follows the former
BCAAC ✗
BCAABC ✗
CABABC ✓
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 16
Declare constraint templates
Negative relation templates
NotCoExistence(A, B)
A and B never occur together in the process instance
CAC ✓
CAACB ✗
BCAC ✗
BCC ✓
NotSuccession(A, B)
A can never occur before B in the process instance
BCAAC ✓
CAACB ✗
CAC ✓
BCC ✓
NotChainSuccession(A, B)
A and B occur in the process instance if and only if the latter
does not immediately follows the former
BCAAC ✓
BCAABC ✗
CBACBA ✓
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 17
Relation constraint templates subsumption
Constraint templates are not independent of each other
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 18
MINERful++
Our mining algorithm
 MINERful++ is the workflow discovery algorithm of
MailOfMine
 Its input is a collection of strings T and an alphabet
ΣT
 Each string t is a trace
 Each character is an event (enacted
task)
 The collection represents the log
 Its output is a declarative process model
 What is a declarative process model?
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 19
The declarative specification of an artful process (e.g.)
Scenario
A project meeting is scheduled
We suppose that a final agenda will be committed
(“confirmAgenda”) after that requests for a new
proposal (“requestAgenda”), proposals themselves
(“proposeAgenda”) and comments (“commentAgenda”)
have been circulated.
Shortcuts for tasks (process alphabet):
–
–
–
–
p
r
c
n
(“proposeAgenda”)
(“requestAgenda”)
(“commentAgenda”)
(“confirmAgenda”)
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 20
The declarative specification of an artful process (e.g.)
Constraints on activities
Existence constraints
1.
2.
3.
Participation(n)
Uniqueness(n)
End(n)
Relation constraints
4.
5.
6.
Response(r, p)
RespondedExistence(c, p)
Succession(p, n)
The agenda
1.
2.
3.
must be confirmed,
only once:
it is the last thing to do.
During the compilation:
4.
the proposal follows a
request;
5. if a comment circulates,
there has been / will be a
proposal;
6. after the proposal, there
will be a confirmation, and
there can be no
confirmation without a
preceding proposal.
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 21
MINERful++
Workflow discovery by constraints inference
 MINERful++ is a two-step algorithm
1. Construction of a Knowledge Base
2. Constraints inference by means of queries evaluated on
the KB
 This allows the discovery of constraints through a
faster procedure on data which are smaller in size
than the whole input
 Returned constraints are weighted with their
support
 the normalized fraction of cases in which
the constraint is verified over the set of
input traces
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 22
MINERful++ by example
Workflow discovery by constraints inference
 In order to see how it works now
 we see a run of MINERful++ over a string,
compliant with the previous example:
rrpcrpcrcpcn
 We start with the construction of the algorithm’s
Knowledge Base
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 23
MINERful++ by example
Building the “ownplay” of p and n
p
– p occurred 3 times in 1 string
γp(3) = 1
rrpcrpcrcpcn
• For each m ≠ 3
γp(m) = 0
– p did not occur as the first nor as the last character
gi(p) = 0
gl(p) = 0
n
rrpcrpcrcpcn
– γn(1) = 1
– For each m ≠ 1, γn(m) = 0
– n occurred as the last character in 1 string
gi(n) = 0
gl(n) = 1
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 24
MINERful++ by example
Building the “interplay” of p and n
With respect to the
occurrence of p, n
occurred…
i.
ii.
iii.
iv.
v.
Never before: 3 times
δp,n(-∞) = 3
2 char’s after: 1 time
δp,n(2) = 1
6 char’s after: 1 time
δp,n(6) = 1
9 char’s after: 1 time
δp,n(9) = 1
Repetitions in-between:
i.
ii.
Looking at the string
i.
rrpcrpcrcpcn
ii.
rrpcrpcrcpcn
iii.
rrpcrpcrcpcn
iv.
rrpcrpcrcpcn
v.
onwards: 2 times
b→p,n = 2
backwards: never
b←p,n = 0
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
i.
rrpcrpcrcpcn
ii.
rrpcrpcrcpcn
P. 25
MINERful++ by example
Building the “interplay” of r and p
δr,p
-∞
-5
-2
+1
+2
+4
+5
+8
+9
2
1
2
2
2
1
2
1
1
b→r,p = 1
b←r,p = 0
rrpcrpcrcpcn
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 26
MINERful++
Computing the support of constraints
 Let Gr and Gp be the number of times in which r and p
respectively appear in the log
G x = åg x (n)× n
we have, e.g.,
n>0
support for Response(r, p)
 hint: how many times p was not read in the traces
d (+¥)
after r occurred?
1- r, p
In those cases, Response(r, p) does not hold
G
r
support for Succession(r, p)
 how many times p was not read in the traces
after r occurred, nor r was read before p
dr, p (+¥) + d p,r (-¥)
1occurred?
Gr + G p
In those cases, Succession(r, p) does not hold
…
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 27
MINERful++
Computing the support of constraints
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 28
MINERful++ by example
Computing the support of constraints
1.
rrpcrpcrcpcn
2.
rccrppcpcrcppn
3.
cccccrrpcrrprrprppp
n
4.
cprpn
5.
rppn
6.
rrrcrcpn
7.
rpcn
8.
crprrccppn
9.
pccn
10.
crrcccrrpcpccpn
Support for…
– Response(r, p)
1.0
– Succession(r, p)
0.96364
– Precedence(c, n)
0.9
The support can be used to prune out
those constraints falling under a
given threshold
–
E.g., 0.95
A threshold equal to 1.0 selects those
constraints which are always valid on
the log
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 29
Relation constraint templates subsumption
Constraint templates are not independent of each other
E.g.,
– A trace like
ababcabcc
satisfies (w.r.t. b and a):
• RespondedExistence(a, b), RespondedExistence(b, a),
CoExistence(a, b), CoExistence(b, a), Response(a, b), AlternateResponse(a, b),
ChainResponse(a, b), Precedence(a, b), AlternatePrecedence(a, b),
ChainPrecedence(a, b),
Succession(a, b), AlternateSuccession(a, b),
ChainSuccession(a, b)
– The mining algorithm would have to show the most
strict constraint only (ChainSuccession(a, b))
MINERful++ faces this issue, by pruning the returned
constraints on the basis of the subsumption hierarchy
of constraints
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 30
Relation constraint templates subsumption
Constraint templates are not independent of each other
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 31
Relation constraint templates subsumption
A hint on the pruning procedure
✖
✖
✖
✖
1.0
1.0
1.0
1.0
✖
✖
✖
1.0
1.0
1.0
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
✖
1.0
✖
✖
1.0
1.0
1.0
P. 32
Relation constraint templates subsumption
A hint on the pruning procedure
S
u
p
p
o
r
t
✖
✖
✖
0.9
0.9
0.9
✖ 0.9
0.9
✖
?
✖
✖
?
0.9
0.7
0.7
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
?
✖
✖
?
0.9
0.8
0.8
P. 33
Evaluation
Characteristics of MINERful++
MINERful++ is
–
–
–
–
Independent on the formalism used for expressing constraints
Modular (two-phase)
Capable of eliminating redundancy in the process model
Fast
Sony VAIO VGN-FE11H
Intel Core Duo T2300 1.66 GHz
2 MB L2 cache
2 GB of DDR2 RAM @ 667 Mhz
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 34
Evaluation
150000
100000
50000
16000
6
9
12
15
Average string length
12000
50
46
48
44
8000
42
15000
40
4000
4000
8000
12000
16000
Number of traces
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
Execution time [msec]
Total execution time [msec]
Linear w.r.t. the
number of traces in
the log
|T|
Quadratic w.r.t. the
size of traces in the
log
|tmax|
Quadratic w.r.t. the
size of the alphabet
|ΣT|
Hence, polynomial
in the size of the
input
O(|T|·|tmax |2·|ΣT |2)
Total execution time [msec]
On the complexity of MINERful++
Alphabet size
50.0
47.5
45.0
10000
42.5
40.0
5000
0
50000
100000
150000
200000
Processed events
P. 35
Preliminary results
MINERful++ on real data
Logs extracted from 2
mailboxes
0/100
3.211 %
[7]
– [DiCiccioMecella2012]
5 traces
– 34.75 events each on average
– 139 events read in total
Evaluation of inferred
constraints conducted with
an expert domain
Precision ≈ 0.794
20.642 %
[ 45 ]
Result evaluation
Noticeably right
75
6.422 %
[ 14 ]
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
25
Right
Utterly wrong
Wrong
69.725 %
[ 173 ]
50
P. 36
Future work
Research in progress
Integrate MINERful++ with ProM
– MINERful++ is already capable of reading/writing XES logs
Study the effects of errors in logs on the inferred workflow
– Error-injected synthetic logs can help us conduct an automated
analysis on the quality of results
Refine the estimation calculi for support
Auto-tune the support threshold
– Depending on the constraint template, which can be more or
less “robust”
Enlarge the set of discovered constraint templates to the branching
Declare constraints
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 37
References
Cited articles and resources, in order of appearance
•
[Aalst2011.book] van der Aalst, W.M.P.: Process Mining: Discovery, Conformance and Enhancement of
Business Processes. Springer (2011).
•
[AalstEtAl2009] van der Aalst, W.M.P., van Dongen, B.F., Güther, C.W., Rozinat, A., Verbeek, E.,
Weijters, T.: Prom: The process mining toolkit. In de Medeiros, A.K.A., Weber, B., eds.: BPM (Demos).
Volume 489 of CEUR Workshop Proceedings., CEUR-WS.org (2009)
•
[AalstEtAl2004] van der Aalst, W.M.P., Weijters, T., Maruster, L.: Workflow mining: Discovering process
models from event logs. IEEE Trans. Knowl. Data Eng. 16(9) (2004) 1128–1142.
•
[WenEtAl2007] Wen, L., van der Aalst, W.M.P., Wang, J., Sun, J.: Mining process models with non-freechoice constructs. Data Min. Knowl. Discov. 15(2) (2007) 145–180.
•
[WenEtAl2010] Wen, L., Wang, J., van der Aalst, W. M. P., Huang, B., Sun, J.: Mining process models
with prime invisible tasks. Data Knowl. Eng., 69 (2010), 999-1021.
•
[GüntherEtAl2007] Günther, C.W., van der Aalst, W.M.P.: Fuzzy Mining - Adaptive Process
Simplification Based on Multi-perspective Metrics. Proc. BPM 2007.
•
[WeijtersEtAl2001] Weijters, A., van der Aalst, W.: Rediscovering workflow models from event-based
data using little thumb. Integrated Computer-Aided Engineering 10 (2001) 2003.
•
[MedeirosEtAl2007] Medeiros, A.K., Weijters, A.J., Aalst, W.M.: Genetic process mining: an experimental evaluation. Data Min. Knowl. Discov. 14(2) (2007) 245–304.
•
[AalstEtAl2010] van der Aalst, W., Rubin, V., Verbeek, H., van Dongen, B., Kindler, E., Gnther, C.:
Process mining: a two-step approach to balance between underfitting and overfitting. Software and
Systems Modeling 9 (2010) 87–111 10
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 38
References
Cited articles and resources, in order of appearance
•
[HillEtAl06] Hill, C., Yates, R., Jones, C., Kogan, S.L.: Beyond predictable workflows: Enhancing
productivity in artful business processes. IBM Systems Journal 45(4), 663–682 (2006)
•
[ACTIVE09] Warren, P., Kings, N., et al.: Improving knowledge worker productivity - the ACTIVE
integrated approach. BT Technology Journal 26(2), 165–176 (2009)
•
[DiCiccioEtAl11] Di Ciccio, C., Mecella, M., Catarci, T.: Representing and visualizing mined artful
processes in MailOfMine. Proc. USAB 2011:83-94
•
[DiCiccioMecella12] Di Ciccio, C., Mecella,M.: Mining constraints for artful processes. Proc. BIS 2012
•
[DiCiccioMecella/TR12] Di Ciccio, C., Mecella, M.: MINERful, a mining algorithm for declarative process
constraints in MailOfMine. Technical report, Dipartimento di Ingegneria Informatica, Automatica e
Gestionale “Antonio Ruberti” – SAPIENZA, Universita` di Roma (2012).
•
[DiCiccioMecella13] Di Ciccio, C., Mecella, M.: A Two-Step Fast Algorithm for the Automated Discovery
of Declarative Workflows. Proc. CIDM 2013
•
[AalstEtAl06] van der Aalst, W.M.P., Pesic, M.: Decserflow: Towards a truly declarative service flow
language. Proc. WS-FM 2006
•
[MaggiEtAl11] Maggi, F.M., Mooij, A.J., van der Aalst, W.M.P.: User-guided discovery of declar- ative
process models. Proc. CIDM 2011
Process Mining
Claudio Di Ciccio (DIAG, SAPIENZA – Università di Roma)
P. 39
Scarica

Representing and Visualizing Mined Artful Processes in MailOfMine