Rollback-Retry Techniques & Checnkpointing Protocols Software Faults are Soft • Most hardware faults are soft: most hardware faults are trainsient. Retries like checksum and retransmissions are standard way to deal with hardware faults. • Conjecture (J. Gray 85). Also software faults are soft. If some software operation fails and the application is restarted from a quiescent state, the operation will usually not fail the second time Bohrbugs vs. Heisenbugs • Considering a software systems which has gone through structured design, design review, quality assurance, alpha test etc. • After all these phases most of the bugs that always fail on retry are gone (Bohrbugs) • For the other bugs (heisenbugs) retrying techniques will work as the system model have changed since one minute ago!!! • During the operational phase of a software system, 1 out of 150 software faults is a bohrbug (1985). The importance of checkpointing • Checkpointing is the quiscent state from which a computation can be restarted after a failure • Checkpointing save the state of a process into stable storage • What does it means checkpointing a distributed application? Consistent System States • A global state of a message-passing system consists of: – individual states of all processes – the states of communication channels • A consistent global state is a global state in which if a process’s state reflects a message receipt, then the state of the corresponding sender reflects sending that message Consistent System States (2) • Intuitively, a consistent global state is one that may occur during a failure-free, correct execution of a distributed computation. • The goal of a rollback-retry protocol is to bring the system into a consistent state • Problem taking local ckpt that cannot belong to any consistent global checkpoint Consistent Global Checkpoint • A local scheckpoint is a local state saved onto stable storagr • A global checkpoint is a set of local checkpoints one for each process • A global checkpoint is consistent is no local checkpoint “happens-before” the other (i.e., there are no missing messages) • A local checkpoint that does not belong to any global checkpoint is useless. Checkpoint and Communication Patterns Z-Cycles and Z-Paths • A Z-path (zigzag path) is a special sequence of messages that connects two checkpoints. • Let denote Lamport’s happen-before relation. • Let ci,x denote the xth checkpoint of process Pi. • Define the execution portion between two consecutive checkpoints on the same process to be the checkpoint interval (starting with the earlier checkpoint). • Let sendi and deliveri be the communication events by process Pi. Definition of Z-Path Given two checkpoints ci,x and cj,y, a Z-path exists between ci,x and cj,y if and only if one of the following two conditions holds: 1. 2. x < y and i = j; or There exists a sequence of messages [m0, m1,…, mn], n 0, such that: ci,x sendi(m0); l < n, either deliverk(ml) and sendk(ml+1) are in the same checkpoint interval, or deliverk(ml) sendk(ml+1); and deliverj(mn) cj,y Z-Cycles and Z-Paths (2) [m1, m2] and [m3, m4] are Z-paths between c0,1 and c2,2 • Z-cycle is a Z-path that begins and ends with the same checkpoint. – Above, [m5, m4, m3] is a Z-cycle that start and ends at checkpoint c2,2. • c2,2 is involved in a z-cycle. It is useless (i.e., it cannot belong to any consistent global checkpoint) Rollback Propagation and The Domino Effect • Upon a failure of one or more processes, the dependencies induced by messages may force some of the processes that did not fail to roll back. – This is commonly called rollback propagation. – If the processes have to roll back to the beginning of the computation, this is called the domino effect. Failure of P2 causes rollback to the beginning of the computation Classification of Checkpointbased Protocols 1. Uncoordinated checkpointing – each process takes its checkpoints independently 2. Coordinated checkpointing – processes coordinate their checkpoints in order to save a system-wide consistent state 3. Communication-induced checkpointing – forces each process to take checkpoints based on information piggybacked on the application messages it receives from other processes. Communication-induced Checkpointing • Balances between uncoordinated and coordinated checkpointing – Allows processes to take some checkpoints independently. These checkpoints are called local checkpoints – Guarantees the eventual progress of the recovery line by forcing processes to take additional checkpoints, called forced checkpoints. Communication-induced Checkpointing (2) • Communication-induced checkpointing piggybacks protocol-related information on each application message. – In contrast with coordinated checkpointing, no special coordination messages are exchanged. • The receiver of each application message uses the piggybacked information to determine if it has to take a forced checkpoint to advance the global recovery line. • The forced checkpoint must be taken before the application may process the contents of the message. – high latency and overhead – need to reduce the number of forced checkpoints Protocols • CBR (Checkpoint before receive) • NRAS (No receive after send) • FDAS (Fixed dependency after send) CBR Checkpoint and communication pattern Checkpoint before receive The case of two processes esite Z-cicle sequenza SR dentro almeno 1 intervallo compreso nello Z-cicle Quindi una condizione sufficiente per prevenire Z-cicle è Prevenire pattern SR in tutti gli intervalli non esistono Z-cicle NRAS upon the receive of message m IF after_first_send THEN Take_CKPT() ELSE deliver(m) when sending message m after_first_send=TRUE Take_CKPT() CKPT // salva chek point su disco after_first_send=FALSE FDAS • Pi invia msg m (vedi handler: when sending message m): carica su m un timestamp D che è il VC corrente (Di) del processo mittente all’atto dell’invio. • Pi prende un ckpt locale (vedi funzione: Take_CKPT()): incremento VC locale (Di[i]++). Devo tener conto dei ckpt (locali o forzati dall’algoritmo) che prendo. upon the receive of message m IF (after_first_send k: m.D[k]>Di[k])THEN Take_CKPT() for each K Di[k]:=max (Di[k], m.D[k]) //propago la conoscenza transitiva deliver(m) when sending message m after_first_send:=TRUE send (m.D) Take_CKPT() CKPT // salva chek point su disco after_first_send=FALSE Di[i]:=Di[i]+1