Dependability & Maintainability Theory and Methods 5. Markov Models Andrea Bobbio Dipartimento di Informatica Università del Piemonte Orientale, “A. Avogadro” 15100 Alessandria (Italy) [email protected] - http://www.mfn.unipmn.it/~bobbio/IFOA A. Bobbio IFOA, Reggio Emilia, June 2003 17-18, 2003 Reggio Emilia, June 17-18, 1 State-Space-Based Models States and labeled state transitions State can keep track of: – Number of functioning resources of each type – States of recovery for each failed resource – Number of tasks of each type waiting at each resource – Allocation of resources to tasks A transition: – Can occur from any state to any other state – Can represent a simple or a compound event A. Bobbio Reggio Emilia, June 17-18, 2003 2 State-Space-Based Models (Continued) Transitions between states represent the change of the system state due to the occurrence of an event Drawn as a directed graph Transition label: – Probability: homogeneous discrete-time Markov chain (DTMC) – Rate: homogeneous continuous-time Markov chain (CTMC) – Time-dependent rate: non-homogeneous CTMC – Distribution function: semi-Markov process (SMP) A. Bobbio Reggio Emilia, June 17-18, 2003 3 Modeler’s Options Should I Use Markov Models? State-Space-Based Methods + Model Dependencies + Model Fault-Tolerance and Recovery/Repair + Model Contention for Resources + Model Concurrency and Timeliness + Generalize to Markov Reward Models for Modeling Degradable Performance A. Bobbio Reggio Emilia, June 17-18, 2003 4 Modeler’s Options Should I Use Markov Models? + Generalize to Markov Regenerative Models for Allowing Generally Distributed Event Times + Generalize to Non-Homogeneous Markov Chains for Allowing Weibull Failure Distributions + Performance, Availability and Performability Modeling Possible - Large (Exponential) State Space A. Bobbio Reggio Emilia, June 17-18, 2003 5 In order to fulfil our goals Modeling Performance, Availability and Performability Modeling Complex Systems We Need Automatic Generation and Solution of Large Markov Reward Models A. Bobbio Reggio Emilia, June 17-18, 2003 6 Model-based evaluation Choice of the model type is dictated by: – Measures of interest – Level of detailed system behavior to be represented – Ease of model specification and solution – Representation power of the model type – Access to suitable tools or toolkits A. Bobbio Reggio Emilia, June 17-18, 2003 7 State space models xi s s’ A transition represents the change of state of a single component Z(t) is the stochastic process Pr {Z(t) = s} is the probability of finding Z(t) in state s at time t. Pr {s s’, t} = Pr {Z(t+ t) = s’ | Z(t) = s} A. Bobbio Reggio Emilia, June 17-18, 2003 8 State space models xi s s’ If s s’ represents a failure event: Pr {s s’, t} = = Pr {Z(t+ t) = s’ | Z(t) = s} = i t If s s’ represents a repair event: Pr {s s’, t} = = Pr {Z(t+ t) = s’ | Z(t) = s} = i t A. Bobbio Reggio Emilia, June 17-18, 2003 9 Markov Process: definition A. Bobbio Reggio Emilia, June 17-18, 2003 10 Transition Probability Matrix initial State Probability Vector Chapman-Kolmogorov Equations Time-homogeneous CTMC Time-homogeneous CTMC The transition rate matrix C-K Equations for CTMC Solution equations Transient analysis Given that the initial state of the Markov chain, then the system of differential Equations is written based on: rate of buildup = rate of flow in - rate of flow out for each state (continuity equation). Steady-state condition If the process reaches a steady state condition, then: Steady-state analysis (balance equation) The steady-state equation can be written as a flow balance equation with a normalization condition on the state probabilities. (rate of buildup) = rate of flow in - rate of flow out rate of flow in = rate of flow out for each state (balance equation). State Classification 2-component system A. Bobbio Reggio Emilia, June 17-18, 2003 23 2-component system A. Bobbio Reggio Emilia, June 17-18, 2003 24 2-component system A. Bobbio Reggio Emilia, June 17-18, 2003 25 2-component series system A1 A2 2-component parallel system A1 A2 A. Bobbio Reggio Emilia, June 17-18, 2003 26 2-component stand-by system A B A. Bobbio Reggio Emilia, June 17-18, 2003 27 Markov Models Repairable systems Availability Repairable system: Availability A. Bobbio Reggio Emilia, June 17-18, 2003 29 Repairable system: 2 identical components A. Bobbio Reggio Emilia, June 17-18, 2003 30 Repairable system: 2 identical components A. Bobbio Reggio Emilia, June 17-18, 2003 31 2-component Markov availability model Assume we have a two-component parallel redundant system with repair rate . Assume that the failure rate of both the components is . When both the components have failed, the system is considered to have failed. A. Bobbio Reggio Emilia, June 17-18, 2003 32 Markov availability model Let the number of properly functioning components be the state of the system. The state space is {0,1,2} where 0 is the system down state. We wish to examine effects of shared vs. nonshared repair. A. Bobbio Reggio Emilia, June 17-18, 2003 33 Markov availability model 2 2 1 2 2 2 1 A. Bobbio 0 0 Non-shared (independent) repair Shared repair Reggio Emilia, June 17-18, 2003 34 Markov availability model Note: Non-shared case can be modeled & solved using a RBD or a FTREE but shared case needs the use of Markov chains. A. Bobbio Reggio Emilia, June 17-18, 2003 35 Steady-state balance equations For any state: Rate of flow in = Rate of flow out Considering the shared case 2 2 1 ( ) 1 2 2 0 1 0 i: steady state probability that system is in state i A. Bobbio Reggio Emilia, June 17-18, 2003 36 Steady-state balance equations Hence Since We have Or A. Bobbio 2 1 1 0 2 0 1 2 1 0 0 0 1 2 0 1 1 Reggio Emilia, June 17-18, 2003 22 2 37 Steady-state balance equations (Continued) Steady-state Unavailability: For the Shared Case = 0 = 1 - Ashared Similarly, for the Non-Shared Case, Steady-state Unavailability = 1 - Anon-shared 1 Anonshared 1 2 1 2 2 Downtime in minutes per year = (1 - A)* 8760*60 A. Bobbio Reggio Emilia, June 17-18, 2003 38 Steady-state balance equations A. Bobbio Reggio Emilia, June 17-18, 2003 39 Absorbing states MTTF A. Bobbio Reggio Emilia, June 17-18, 2003 40 Absorbing states - MTTF zi , j 0 Pi , j ( )d , (i, j ) B zi , j A. Bobbio Reggio Emilia, June 17-18, 2003 41 Markov Reliability Model with Imperfect Coverage Markov model with imperfect coverage Next consider a modification of the 2-component parallel system proposed by Arnold as a model of duplex processors of an electronic switching system. We assume that not all faults are recoverable and that c is the coverage factor which denotes the conditional probability that the system recovers given that a fault has occurred. The state diagram is now given by the following picture: A. Bobbio Reggio Emilia, June 17-18, 2003 43 Now allow for Imperfect coverage c A. Bobbio Reggio Emilia, June 17-18, 2003 44 Markov model with imperfect coverage Assume that the initial state is 2 so that: P2 (0) 1, P0 (0) P1 (0) 0 Then the system of differential equations are: dP2 (t ) 2cP2 (t ) 2 (1 c) P2 (t ) P1 (t ) dt dP1 (t ) 2cP2 (t ) ( ) P1 (t ) dt dP0 (t ) 2 (1 c) P2 (t ) P1 (t ) dt A. Bobbio Reggio Emilia, June 17-18, 2003 45 Markov model with imperfect coverage After solving the differential equations we obtain: R(t)=P2(t) + P1(t) From R(t), we can obtain system MTTF: (1 2c) MTTF 2[ (1 c)] It should be clear that the system MTTF and system reliability are critically dependent on the coverage factor. A. Bobbio Reggio Emilia, June 17-18, 2003 46 Source of fault coverage data Measurement data from an operational system Large amount of data needed Improved instrumentation needed Fault-injection experiments Expensive but badly needed Tools from CMU,Illinois, LAAS (Toulouse) A fault/error handling submodel (FEHM) Phases: detection, location, retry, reconfig, reboot Estimate duration and probability of success of each phase A. Bobbio Reggio Emilia, June 17-18, 2003 47 Redundant System with Finite Detection Switchover Time Modify the Markov model with imperfect coverage to allow for finite time to detect as well as imperfect detection. You will need to add an extra state, say D. The rate at which detection occurs is . Draw the state diagram and investigate the effects of detection delay on system reliability and mean time to failure. A. Bobbio Reggio Emilia, June 17-18, 2003 48 Redundant System with Finite Detection Switchover Time Assumptions: Two units have the same MTTF and MTTR; Single shared repair person; Average detection/switchover time tsw=1/; We need to use a Markov model. A. Bobbio Reggio Emilia, June 17-18, 2003 49 Redundant System with Finite Detection Switchover Time 2 2 1D 1 0 MTTF 1 / MTTR 1 / A. Bobbio Reggio Emilia, June 17-18, 2003 50 Redundant System with Finite Detection Switchover Time After solving the Markov model, we obtain steady-state probabilities: 2 , 1D , 1 , 0 Asys 2 1 (or 1D ) A. Bobbio Reggio Emilia, June 17-18, 2003 51 Closed-form 2 2 ] 1 0 [1 ( ) 22 0 1 E 1 E 1 2 1D ( ) E 1 2 2 22 E 1 A 2 1 r 1D 2 2 ( 2 r )/ E 2 ( ) A. Bobbio Reggio Emilia, June 17-18, 2003 52 WFS Example A. Bobbio Reggio Emilia, June 17-18, 2003 53 A Workstations-Fileserver Example Computing system consisting of: – A file-server – Two workstations – Computing network connecting them System operational as long as: – One of the Workstations and – The file-server are operational Computer network is assumed to be fault-free A. Bobbio Reggio Emilia, June 17-18, 2003 54 The WFS Example A. Bobbio Reggio Emilia, June 17-18, 2003 55 Markov Chain for WFS Example Assuming exponentially distributed times to failure – w : failure rate of workstation – f : failure rate of file-server Assume that components are repairable – w: repair rate of workstation – f: repair rate of file-server File-server has priority for repair over workstations (such repair priority cannot be captured by non-statespace models) A. Bobbio Reggio Emilia, June 17-18, 2003 56 Markov Availability Model for WFS w 2w 2,1 1,1 0,1 w f f w f f f w 2w 2,0 f 1,0 0,0 Since all states are reachable from every other states, the CTMC is irreducible. Furthermore, all states are positive recurrent. A. Bobbio Reggio Emilia, June 17-18, 2003 57 Markov Availability Model for WFS (Continued) In the figure, the label (i,j) of each state is interpreted as follows: i represents the number of workstations that are still functioning j is 1 or 0 depending on whether the file-server is up or down respectively. A. Bobbio Reggio Emilia, June 17-18, 2003 58 Markov Availability Model for WFS (Continued) For the example problem, with the states ordered as (2,1), (2,0), (1,1), (1,0), (0,1), (0,0) the Q matrix is given by: Q= f 2w 0 0 0 ( f 2w ) ( 2 ) 0 2 0 0 f f w w w 0 ( w f w ) f w 0 0 0 ( ) 0 f f w w 0 0 w 0 ( w f ) f 0 0 0 0 f f A. Bobbio Reggio Emilia, June 17-18, 2003 59 Markov Model (steady-state) : Steady-state probability vector Q 0, i i 1 ( 21, 20 , 11, 10 , 01, 00 ) These are called steady-state balance equations rate of flow in = rate of flow out after solving for ,obtain Steady-state availability ASS 21 11 A. Bobbio Reggio Emilia, June 17-18, 2003 60 Markov Availability Model We compute the availability of the system: System is available as long as it is in states (2,1) and (1,1). Instantaneous availability of the system: A(t ) P( 2,1) (t ) P(1,1) (t ) lim A(t ) Ass t A. Bobbio Reggio Emilia, June 17-18, 2003 61 Markov Availability Model (Continued) w 0.0001 hr 1 , f 0.00005 hr 1 , w 1.0 hr 1 , f 0.5 hr 1 A. Bobbio Ass 0.9999 Reggio Emilia, June 17-18, 2003 62 Markov Reliability Model with Repair Assume that the computer system does not recover if both workstations fail, or if the file-server fails A. Bobbio Reggio Emilia, June 17-18, 2003 63 Markov Reliability Model with Repair States (0,1), (1,0) and (2,0) become absorbing states while (2,1) and (1,1) are transient states. Note: we have made a simplification that, once the CTMC reaches a system failure state, we do not allow any more transitions. A. Bobbio Reggio Emilia, June 17-18, 2003 64 Markov Model with Absorbing States If we solve for P2,1(t) and P1,1(t) then R(t)=P2,1(t) + P1,1(t) For a Markov chain with absorbing states: A: the set of absorbing states B = - A: the set of remaining states zi,j: Mean time spent in state i,j until absorption zi , j A. Bobbio 0 Pi , j ( )d , (i, j ) B z QB PB (0) Reggio Emilia, June 17-18, 2003 65 Markov Model with Absorbing States (Continued) QB derived from Q by restricting it to only states in B Mean time to absorption MTTA is given as: MTTA z ( i , j )B A. Bobbio Reggio Emilia, June 17-18, 2003 (i , j ) 66 Markov Reliability Model with Repair (Continued) ( f 2w ) [ QB w First solve dP1,1 dt A. Bobbio dP2,1 dt ] 2w ( w f w ) 2w P2,1 (t ) w P1,1 (t ) ( w f w ) P1,1 (t ) 2w P2,1 (t ) Reggio Emilia, June 17-18, 2003 67 Markov Reliability Model with Repair (Continued) Then : R(t ) P2,1 (t ) P1,1 (t ) next solve z 2,1 (( f 2w )) z1,1 w 1 z 2,1 2w z1,1 ( w f w ) 0 Then : MTTF z 2,1 z1,1 Mean time to failure is 19992 hours. A. Bobbio Reggio Emilia, June 17-18, 2003 68 Markov Reliability Model without Repair Assume that neither workstations nor fileserver is repairable A. Bobbio Reggio Emilia, June 17-18, 2003 69 Markov Reliability Model without Repair (Continued) States (0,1), (1,0) and (2,0) become absorbing states A. Bobbio Reggio Emilia, June 17-18, 2003 70 Markov Reliability Model without Repair (Continued) ( f 2w ) 2w QB 0 ( f w ) [ ] R(t ) P2,1 (t ) P1,1 (t ) MTTF z2,1 z1,1 Mean time to failure is 9333 hours. A. Bobbio Reggio Emilia, June 17-18, 2003 71