Dependability & Maintainability
Theory and Methods
5. Markov Models
Andrea Bobbio
Dipartimento di Informatica
Università del Piemonte Orientale, “A. Avogadro”
15100 Alessandria (Italy)
[email protected] - http://www.mfn.unipmn.it/~bobbio/IFOA
A. Bobbio
IFOA, Reggio
Emilia,
June 2003
17-18, 2003
Reggio Emilia,
June 17-18,
1
State-Space-Based Models
States and labeled state transitions
State can keep track of:
– Number of functioning resources of each type
– States of recovery for each failed resource
– Number of tasks of each type waiting at each
resource
– Allocation of resources to tasks
A transition:
– Can occur from any state to any other state
– Can represent a simple or a compound event
A. Bobbio
Reggio Emilia, June 17-18, 2003
2
State-Space-Based Models (Continued)
Transitions between states represent the change of the
system state due to the occurrence of an event
Drawn as a directed graph
Transition label:
– Probability: homogeneous discrete-time Markov
chain (DTMC)
– Rate: homogeneous continuous-time Markov chain
(CTMC)
– Time-dependent rate: non-homogeneous CTMC
– Distribution function: semi-Markov process (SMP)
A. Bobbio
Reggio Emilia, June 17-18, 2003
3
Modeler’s Options
Should I Use Markov Models?
State-Space-Based Methods
+ Model Dependencies
+ Model Fault-Tolerance and Recovery/Repair
+ Model Contention for Resources
+ Model Concurrency and Timeliness
+ Generalize to Markov Reward Models for Modeling
Degradable Performance
A. Bobbio
Reggio Emilia, June 17-18, 2003
4
Modeler’s Options
Should I Use Markov Models?
+ Generalize to Markov Regenerative Models for
Allowing Generally Distributed Event Times
+ Generalize to Non-Homogeneous Markov Chains for
Allowing Weibull Failure Distributions
+ Performance, Availability and Performability Modeling
Possible
- Large (Exponential) State Space
A. Bobbio
Reggio Emilia, June 17-18, 2003
5
In order to fulfil our goals
Modeling Performance, Availability and Performability
Modeling Complex Systems
We Need
Automatic Generation and Solution of Large Markov
Reward Models
A. Bobbio
Reggio Emilia, June 17-18, 2003
6
Model-based evaluation
Choice of the model type is dictated by:
– Measures of interest
– Level of detailed system behavior to be
represented
– Ease of model specification and solution
– Representation power of the model type
– Access to suitable tools or toolkits
A. Bobbio
Reggio Emilia, June 17-18, 2003
7
State space
models
xi
s
s’
A transition represents the change of state of a
single component
Z(t) is the stochastic process
Pr {Z(t) = s} is the probability of finding Z(t) in
state s at time t.
Pr {s  s’, t} = Pr {Z(t+ t) = s’ | Z(t) = s}
A. Bobbio
Reggio Emilia, June 17-18, 2003
8
State space
models
xi
s
s’
If s  s’ represents a failure event:
Pr {s  s’, t} =
= Pr {Z(t+ t) = s’ | Z(t) = s} =  i t
If s  s’ represents a repair event:
Pr {s  s’, t} =
= Pr {Z(t+ t) = s’ | Z(t) = s} =  i t
A. Bobbio
Reggio Emilia, June 17-18, 2003
9
Markov Process:
definition
A. Bobbio
Reggio Emilia, June 17-18, 2003
10
Transition Probability Matrix
initial
State Probability Vector
Chapman-Kolmogorov
Equations
Time-homogeneous CTMC
Time-homogeneous CTMC
The transition rate matrix
C-K Equations for CTMC
Solution equations
Transient analysis
Given that the initial state of the Markov chain,
then the system of differential Equations is written
based on:
rate of buildup = rate of flow in - rate of flow out
for each state (continuity equation).
Steady-state condition
If the process reaches a steady state condition, then:
Steady-state analysis
(balance equation)
The steady-state equation can be written as a flow balance
equation with a normalization condition on the state
probabilities.
(rate of buildup) = rate of flow in - rate of flow out
rate of flow in = rate of flow out
for each state (balance equation).
State Classification
2-component system
A. Bobbio
Reggio Emilia, June 17-18, 2003
23
2-component system
A. Bobbio
Reggio Emilia, June 17-18, 2003
24
2-component system
A. Bobbio
Reggio Emilia, June 17-18, 2003
25
2-component series system
A1
A2
2-component parallel system
A1
A2
A. Bobbio
Reggio Emilia, June 17-18, 2003
26
2-component stand-by system
A
B
A. Bobbio
Reggio Emilia, June 17-18, 2003
27
Markov Models
Repairable systems Availability
Repairable system:
Availability
A. Bobbio
Reggio Emilia, June 17-18, 2003
29
Repairable system:
2 identical
components
A. Bobbio
Reggio Emilia, June 17-18, 2003
30
Repairable system:
2 identical
components
A. Bobbio
Reggio Emilia, June 17-18, 2003
31
2-component Markov availability
model
 Assume we have a two-component parallel
redundant system with repair rate .
 Assume that the failure rate of both the components
is .
 When both the components have failed, the system
is considered to have failed.
A. Bobbio
Reggio Emilia, June 17-18, 2003
32
Markov availability model
 Let the number of properly functioning
components be the state of the system.
 The state space is {0,1,2} where 0 is the system
down state.
 We wish to examine effects of shared vs. nonshared repair.
A. Bobbio
Reggio Emilia, June 17-18, 2003
33
Markov availability model
2

2
1

2
2

2
1

A. Bobbio
0
0
Non-shared (independent)
repair
Shared repair

Reggio Emilia, June 17-18, 2003
34
Markov availability model
Note: Non-shared case can be modeled &
solved using a RBD or a FTREE but
shared case needs the use of Markov
chains.
A. Bobbio
Reggio Emilia, June 17-18, 2003
35
Steady-state balance equations
For any state:
Rate of flow in = Rate of flow out
Considering the shared case
2 2   1
(   ) 1  2 2   0
 1   0
i: steady state probability that system is in state i
A. Bobbio
Reggio Emilia, June 17-18, 2003
36
Steady-state balance equations
Hence
Since
We have
Or
A. Bobbio


2 
1 1   0
2

 0  1   2  1

    
 0   0     0  1

   2 
0 
1
 
1 
Reggio Emilia, June 17-18, 2003
 22
2
37
Steady-state balance equations
(Continued)
Steady-state Unavailability:
For the Shared Case = 0 = 1 - Ashared
Similarly, for the Non-Shared Case,
Steady-state Unavailability = 1 - Anon-shared
1  Anonshared 
1
2

1
 2


2
Downtime in minutes per year = (1 - A)* 8760*60
A. Bobbio
Reggio Emilia, June 17-18, 2003
38
Steady-state balance equations
A. Bobbio
Reggio Emilia, June 17-18, 2003
39
Absorbing states
MTTF
A. Bobbio
Reggio Emilia, June 17-18, 2003
40
Absorbing states - MTTF
zi , j 


0
Pi , j ( )d ,
(i, j )  B
zi , j
A. Bobbio
Reggio Emilia, June 17-18, 2003
41
Markov Reliability Model
with Imperfect Coverage
Markov model
with imperfect coverage
Next consider a modification of the 2-component
parallel system proposed by Arnold as a model of
duplex processors of an electronic switching system.
We assume that not all faults are recoverable and that c
is the coverage factor which denotes the conditional
probability that the system recovers given that a fault
has occurred.
The state diagram is now given by the following
picture:
A. Bobbio
Reggio Emilia, June 17-18, 2003
43
Now allow for Imperfect
coverage
c
A. Bobbio
Reggio Emilia, June 17-18, 2003
44
Markov model
with imperfect coverage
Assume that the initial state is 2 so that:
P2 (0)  1, P0 (0)  P1 (0)  0
Then the system of differential equations are:
dP2 (t )
 2cP2 (t )  2 (1  c) P2 (t )  P1 (t )
dt
dP1 (t )
 2cP2 (t )  (   ) P1 (t )
dt
dP0 (t )
 2 (1  c) P2 (t )  P1 (t )
dt
A. Bobbio
Reggio Emilia, June 17-18, 2003
45
Markov model
with imperfect coverage
After solving the differential equations we obtain:
R(t)=P2(t) + P1(t)
From R(t), we can obtain system MTTF:
 (1  2c)  
MTTF 
2[   (1  c)]
It should be clear that the system MTTF and system
reliability are critically dependent on the coverage factor.
A. Bobbio
Reggio Emilia, June 17-18, 2003
46
Source of fault coverage data
Measurement data from an operational system
Large amount of data needed
Improved instrumentation needed
Fault-injection experiments
Expensive but badly needed
Tools from CMU,Illinois, LAAS (Toulouse)
A fault/error handling submodel (FEHM)
Phases: detection, location, retry, reconfig,
reboot
Estimate duration and probability of success of
each phase
A. Bobbio
Reggio Emilia, June 17-18, 2003
47
Redundant System with Finite
Detection Switchover Time
 Modify the Markov model with imperfect
coverage to allow for finite time to detect as
well as imperfect detection.
 You will need to add an extra state, say D.
 The rate at which detection occurs is  .
 Draw the state diagram and investigate the
effects of detection delay on system reliability
and mean time to failure.
A. Bobbio
Reggio Emilia, June 17-18, 2003
48
Redundant System with Finite
Detection Switchover Time
Assumptions:
 Two units have the same MTTF and MTTR;
 Single shared repair person;
 Average detection/switchover time tsw=1/;
 We need to use a Markov model.
A. Bobbio
Reggio Emilia, June 17-18, 2003
49
Redundant System with Finite
Detection Switchover Time

2
2
1D

1


0

MTTF  1 / 
MTTR  1 / 
A. Bobbio
Reggio Emilia, June 17-18, 2003
50
Redundant System with Finite
Detection Switchover Time
After solving the Markov model, we obtain
steady-state probabilities:
 2 ,  1D ,  1 ,  0
Asys   2   1 (or   1D )
A. Bobbio
Reggio Emilia, June 17-18, 2003
51
Closed-form
 
2
2
 

] 1


 0 [1 
 (     ) 22     
    
0 
1
E
1
 

     E
1
2
 1D 
 (     ) E
1
 
2
2 
22      E
1 
A   2   1  r 1D
2  
  
2
( 2

r
)/ E
2           
 (     )
A. Bobbio
Reggio Emilia, June 17-18, 2003
52
WFS Example
A. Bobbio
Reggio Emilia, June 17-18, 2003
53
A Workstations-Fileserver Example
Computing system consisting of:
– A file-server
– Two workstations
– Computing network connecting them
System operational as long as:
– One of the Workstations
and
– The file-server are operational
Computer network is assumed to be fault-free
A. Bobbio
Reggio Emilia, June 17-18, 2003
54
The WFS Example
A. Bobbio
Reggio Emilia, June 17-18, 2003
55
Markov Chain for WFS Example
Assuming exponentially distributed times to failure
– w : failure rate of workstation
– f : failure rate of file-server
Assume that components are repairable
– w: repair rate of workstation
– f: repair rate of file-server
File-server has priority for repair over workstations
(such repair priority cannot be captured by non-statespace models)
A. Bobbio
Reggio Emilia, June 17-18, 2003
56
Markov Availability Model for WFS
w
2w
2,1
1,1
0,1
w
f
f
w
f
f
f
w
2w
2,0
f
1,0
0,0
Since all states are reachable from every other states, the
CTMC is irreducible. Furthermore, all states are positive
recurrent.
A. Bobbio
Reggio Emilia, June 17-18, 2003
57
Markov Availability Model for
WFS (Continued)
In the figure, the label (i,j) of each state is
interpreted as follows:
 i represents the number of workstations that are
still functioning
 j is 1 or 0 depending on whether the file-server is
up or down respectively.
A. Bobbio
Reggio Emilia, June 17-18, 2003
58
Markov Availability Model for
WFS (Continued)
For the example problem, with the states ordered as
(2,1), (2,0), (1,1), (1,0), (0,1), (0,0) the Q matrix is
given by:
Q=
f
2w
0
0
0 
 ( f  2w )




(


2

)
0
2

0
0
f
f
w
w



w
0
 (  w   f  w )
f
w
0 


0
0


(



)
0

f
f
w
w 


0
0
w
0
 ( w   f )  f 


0
0
0
0




f
f 

A. Bobbio
Reggio Emilia, June 17-18, 2003
59
Markov Model (steady-state)
 : Steady-state probability vector
Q  0,

i
i
1
  ( 21,  20 ,  11,  10 ,  01,  00 )
These are called steady-state balance equations
rate of flow in = rate of flow out
after solving for  ,obtain Steady-state
availability
ASS   21   11
A. Bobbio
Reggio Emilia, June 17-18, 2003
60
Markov Availability Model
We compute the availability of the system:
System is available as long as it is in states
(2,1) and (1,1).
Instantaneous availability of the system:
A(t )  P( 2,1) (t )  P(1,1) (t )
lim A(t )  Ass
t 
A. Bobbio
Reggio Emilia, June 17-18, 2003
61
Markov Availability Model
(Continued)
w  0.0001 hr 1 ,  f  0.00005 hr 1 , w  1.0 hr 1 ,  f  0.5 hr 1
A. Bobbio
Ass  0.9999
Reggio Emilia, June 17-18, 2003
62
Markov Reliability Model with
Repair
Assume that the computer system does not recover if
both workstations fail, or if the file-server fails
A. Bobbio
Reggio Emilia, June 17-18, 2003
63
Markov Reliability Model with Repair
States (0,1), (1,0) and (2,0) become absorbing states while (2,1) and (1,1)
are transient states.
Note: we have made a simplification that, once the CTMC reaches a system
failure state, we do not allow any more transitions.
A. Bobbio
Reggio Emilia, June 17-18, 2003
64
Markov Model with Absorbing States
If we solve for P2,1(t) and P1,1(t) then
R(t)=P2,1(t) + P1,1(t)
For a Markov chain with absorbing states:
A: the set of absorbing states
B =  - A: the set of remaining states
zi,j: Mean time spent in state i,j until absorption
zi , j 
A. Bobbio


0
Pi , j ( )d ,
(i, j )  B
z QB   PB (0)
Reggio Emilia, June 17-18, 2003
65
Markov Model with Absorbing
States (Continued)
QB derived from Q by restricting it to only
states in B
Mean time to absorption MTTA is given as:
MTTA 
z
( i , j )B
A. Bobbio
Reggio Emilia, June 17-18, 2003
(i , j )
66
Markov Reliability Model with
Repair (Continued)
 ( f  2w )
[
QB 
w
First solve
dP1,1
dt
A. Bobbio
dP2,1
dt
]
2w
 (  w   f  w )
 2w P2,1 (t )   w P1,1 (t )
 (  w   f  w ) P1,1 (t )  2w P2,1 (t )
Reggio Emilia, June 17-18, 2003
67
Markov Reliability Model with
Repair (Continued)
Then :
R(t )  P2,1 (t )  P1,1 (t )
next solve z 2,1 (( f  2w ))  z1,1 w  1
z 2,1 2w  z1,1 (  w   f  w )  0
Then :
MTTF  z 2,1  z1,1
Mean time to failure is 19992 hours.
A. Bobbio
Reggio Emilia, June 17-18, 2003
68
Markov Reliability Model
without Repair
Assume that neither workstations nor fileserver is repairable
A. Bobbio
Reggio Emilia, June 17-18, 2003
69
Markov Reliability Model
without Repair (Continued)
States (0,1), (1,0) and (2,0)
become absorbing states
A. Bobbio
Reggio Emilia, June 17-18, 2003
70
Markov Reliability Model
without Repair (Continued)
 ( f  2w )
2w
QB 
0
 ( f   w )
[
]
R(t )  P2,1 (t )  P1,1 (t )
MTTF  z2,1  z1,1
Mean time to failure is 9333 hours.
A. Bobbio
Reggio Emilia, June 17-18, 2003
71
Scarica

ifoa-markov