Rollback-Retry Techniques &
Checnkpointing Protocols
Software Faults are Soft
• Most hardware faults are soft: most hardware
faults are trainsient. Retries like checksum
and retransmissions are standard way to deal
with hardware faults.
• Conjecture (J. Gray 85). Also software faults
are soft. If some software operation fails and
the application is restarted from a quiescent
state, the operation will usually not fail the
second time
Bohrbugs vs. Heisenbugs
• Considering a software systems which has gone
through structured design, design review, quality
assurance, alpha test etc.
• After all these phases most of the bugs that always
fail on retry are gone (Bohrbugs)
• For the other bugs (heisenbugs) retrying
techniques will work as the system model have
changed since one minute ago!!!
• During the operational phase of a software system,
1 out of 150 software faults is a bohrbug (1985).
The importance of checkpointing
• Checkpointing is the quiscent state from
which a computation can be restarted after a
failure
• Checkpointing save the state of a process
into stable storage
• What does it means checkpointing a
distributed application?
Consistent System States
• A global state of a message-passing system
consists of:
– individual states of all processes
– the states of communication channels
• A consistent global state is a global state in
which if a process’s state reflects a message
receipt, then the state of the corresponding
sender reflects sending that message
Consistent System States (2)
• Intuitively, a consistent global state is one that may
occur during a failure-free, correct execution of a
distributed computation.
• The goal of a rollback-retry protocol is to bring the system
into a consistent state
• Problem taking local ckpt that cannot belong to any
consistent global checkpoint
Consistent Global Checkpoint
• A local scheckpoint is a local state saved onto
stable storagr
• A global checkpoint is a set of local checkpoints
one for each process
• A global checkpoint is consistent is no local
checkpoint “happens-before” the other (i.e., there
are no missing messages)
• A local checkpoint that does not belong to any
global checkpoint is useless.
Checkpoint and Communication
Patterns
Z-Cycles and Z-Paths
• A Z-path (zigzag path) is a special sequence of
messages that connects two checkpoints.
• Let  denote Lamport’s happen-before relation.
• Let ci,x denote the xth checkpoint of process Pi.
• Define the execution portion between two
consecutive checkpoints on the same process to be
the checkpoint interval (starting with the earlier
checkpoint).
• Let sendi and deliveri be the communication events
by process Pi.
Definition of Z-Path
Given two checkpoints ci,x and cj,y, a Z-path exists
between ci,x and cj,y if and only if one of the following
two conditions holds:
1.
2.
x < y and i = j; or
There exists a sequence of messages [m0, m1,…, mn],
n  0, such that:
 ci,x  sendi(m0);
  l < n, either deliverk(ml) and sendk(ml+1) are in
the same checkpoint interval, or deliverk(ml) 
sendk(ml+1); and
 deliverj(mn)  cj,y
Z-Cycles and Z-Paths (2)
[m1, m2] and [m3, m4]
are Z-paths between
c0,1 and c2,2
• Z-cycle is a Z-path that begins and ends with the
same checkpoint.
– Above, [m5, m4, m3] is a Z-cycle that start and ends at
checkpoint c2,2.
• c2,2 is involved in a z-cycle. It is useless (i.e., it
cannot belong to any consistent global checkpoint)
Rollback Propagation and The
Domino Effect
• Upon a failure of one or more processes, the dependencies
induced by messages may force some of the processes that
did not fail to roll back.
– This is commonly called rollback propagation.
– If the processes have to roll back to the beginning of the
computation, this is called the domino effect.
Failure of P2 causes
rollback to the
beginning of the
computation
Classification of Checkpointbased Protocols
1. Uncoordinated checkpointing – each process takes
its checkpoints independently
2. Coordinated checkpointing – processes coordinate
their checkpoints in order to save a system-wide
consistent state
3. Communication-induced checkpointing – forces
each process to take checkpoints based on
information piggybacked on the application
messages it receives from other processes.
Communication-induced
Checkpointing
• Balances between uncoordinated and
coordinated checkpointing
– Allows processes to take some checkpoints
independently. These checkpoints are called
local checkpoints
– Guarantees the eventual progress of the
recovery line by forcing processes to take
additional checkpoints, called forced
checkpoints.
Communication-induced
Checkpointing (2)
• Communication-induced checkpointing piggybacks
protocol-related information on each application message.
– In contrast with coordinated checkpointing, no special coordination
messages are exchanged.
• The receiver of each application message uses the
piggybacked information to determine if it has to take a
forced checkpoint to advance the global recovery line.
• The forced checkpoint must be taken before the application
may process the contents of the message.
– high latency and overhead
– need to reduce the number of forced checkpoints
Protocols
• CBR (Checkpoint before receive)
• NRAS (No receive after send)
• FDAS (Fixed dependency after send)
CBR
Checkpoint and communication pattern
Checkpoint before receive
The case of two processes
esite Z-cicle  sequenza SR dentro almeno 1 intervallo compreso nello Z-cicle
Quindi una condizione sufficiente per prevenire Z-cicle è
Prevenire pattern SR in tutti gli intervalli  non esistono Z-cicle
NRAS
upon the receive of message m
IF after_first_send
THEN Take_CKPT()
ELSE deliver(m)
when sending message m
after_first_send=TRUE
Take_CKPT()
CKPT
// salva chek
point su disco
after_first_send=FALSE
FDAS
• Pi invia msg m (vedi handler:
when sending message m):
carica su m un timestamp D che
è il VC corrente (Di) del
processo mittente all’atto
dell’invio.
• Pi prende un ckpt locale (vedi
funzione: Take_CKPT()):
incremento VC locale
(Di[i]++). Devo tener conto dei
ckpt (locali o forzati
dall’algoritmo) che prendo.
upon the receive of message m
IF (after_first_send  k: m.D[k]>Di[k])THEN Take_CKPT()
for each K Di[k]:=max (Di[k], m.D[k])
//propago la conoscenza transitiva
deliver(m)
when sending message m
after_first_send:=TRUE
send (m.D)
Take_CKPT()
CKPT
// salva chek point su disco
after_first_send=FALSE
Di[i]:=Di[i]+1
Scarica

Rollback-Recovery Protocol in Message