An introduction to
NETWORK RESILIENCY
Giorgio Ventre & Stefano Avallone
COMICS Group
Dipartimento di Informatica e Sistemistica
Università di Napoli Federico II
Dipartimento di Informatica e Sistemistica,
University of Napoli Federico II – Comics Group
References
• Jean-Philippe Vasseur, Mario Pickavet, Piet Demeester. “Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS”. Morgan Kaufmann.
• AA. VV. “Building Survivable Networks”, Feature Issue of IEEE Network Magazine, March/April 2004.
Communication Networks Relevance
Communication Networks are becoming
fundamental infrastructures:
the amount of data carried by Communication
Networks has grown considerably in recent years;
many social and economic activities depend on
Communication Networks;
many safety-critical activities depend on
Communication Networks.
Reliability is an essential feature of today's
Communication Networks!
Network Reliability: definition[1]
The (a) ability of a network to maintain or
restore an acceptable level of performance
during network failures by applying various
restoration techniques, and (b) mitigation or
prevention of service outages from network
failures by applying preventive techniques.
Synonym: Network Survivability.
[1] Alliance for Telecommunications Industry Solutions (ATIS)
http://www.atis.org/tg2k/_network_reliability.html
Network Reliability: related concepts
There are many concepts that are related to
Network Reliability, for example:
network element reliability: the probability that a
network element is fully operational during a
certain period of time;
network element availability: the probability that a
network element is in an up-state at a given
instant of time t;
network element fault: the inability of a network
element to perform a required action;
....
Which failures may occur ?
The ability of a network to provide required
services may be compromised by different
failures:
planned or unplanned failures;
internal or external failures;
software or hardware failures;
malicious or accidental failures
....
Accounted Failures
• Providing actions to address every failure that may occur in a Communication Network is unfeasible.
• Network providers and ISPs normally provide action plans to address the most frequent failures.
• These failures are called Accounted Failures.
• The most common types of Accounted Failure are:
single link failure;
single node failure.
Failures' Impact
In today's Communication Networks a single
failure may produce a major disruption in
network availability.
A single cut in an optical cable may drop
thousands of logical network connections.
On July 5, 2002 a submarine cable break affected
the Asia Pacific Cable Network 2 (APCN 2), causing a
considerable slowdown in all the network
connections among Japan, China, South Korea, etc.
Failures' Impact: ATC systems
• Press Releases (http://www.natca.org/mediacenter/press-releasedetail.aspx?id=394)
• MASSIVE POWER, COMMUNICATIONS FAILURE AT MAJOR AIR TRAFFIC CONTROL CENTER PUTS CONTROLLERS IN DARK, FLIGHTS IN JEOPARDY
• 07/19/2006 Bob Marks
PALMDALE, Calif. – A massive power and communications failure late Tuesday at the Los Angeles Air Route Traffic Control Center left air traffic controllers scrambling to deal with a nightmare scenario – how to keep dozens of flights away from each other above a large swath of the Southwestern United States despite the inability to see them, talk to them or relay crucial instructions for 15 excruciatingly long minutes.
• Every ounce of skill, heart and determination that controllers bring into the control room every day was put to the test during one of the worst outages to ever hit the facility. It was so bad, controllers say, that the only thing they had of use to aid the situation that actually worked was their cell phones – devices which the Federal Aviation Administration, inexplicably, has barred from control rooms, further impeding the safety of the system.
• More details in http://themainbang.typepad.com/blog/2006/07/complete_failur.html
Network Reliability Parameters
Some parameters that may be used to
characterize the reliability of a network may be
found in ITU-T Recommendation G.911:
“Parameters and Calculation Methodologies for
Reliability and Availability of Fibre Optic
Systems”.
In the following slides some of the parameters
defined in ITU-T G.911 are introduced.
Failure in Time (FITs) and Maintenance Time
Failure in Time (FIT):
is the number of device failures occurring in a
specific time interval;
it is normally expressed as failures per billion (10^9)
device-hours.
Maintenance Time:
the time interval during which a maintenance
action is performed on an item either manually or
automatically, ...
Mean Time Between Failure (MTBF)
The Mean Time Between Failures (MTBF) is the
steady-state expectation of the time between failures.
Mathematically, the MTBF (in years per failure)
is related to the failure rate F (in FITs, i.e. failures per 10^9
hours) as follows:
$$\mathrm{MTBF} = \frac{1.14 \times 10^{5}}{F}$$
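The constant 1.14 × 10^5 is simply 10^9 hours expressed in years. A minimal sketch of the conversion, assuming a year of 8,760 hours; the function name `mtbf_years_from_fits` is hypothetical and not part of G.911:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def mtbf_years_from_fits(fits: float) -> float:
    """Return the MTBF in years for a failure rate given in FITs
    (failures per 1e9 device-hours)."""
    if fits <= 0:
        raise ValueError("failure rate must be positive")
    # 1e9 device-hours per `fits` failures, converted from hours to years.
    return 1e9 / (fits * HOURS_PER_YEAR)


# A device rated at 1,000 FITs fails, on average, about once every 114 years.
print(f"{mtbf_years_from_fits(1000):.1f} years")
```

With F = 1 the function returns about 114,155 years, which the slide rounds to 1.14 × 10^5.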
Mean Time To Repair (MTTR)
The Mean Time To Repair (MTTR) is defined as
total corrective maintenance time divided by the
total number of corrective maintenance actions
during a period of time.
Given the definitions of MTBF and MTTR, the
availability A of an item may be derived as:
$$A = 1 - \frac{\mathrm{MTTR}}{\mathrm{MTBF}}$$
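A short sketch of the availability relation A = 1 − MTTR/MTBF. The function name and the sample figures (an MTBF of 10,000 hours and an MTTR of 4 hours) are illustrative assumptions, not values from the slides:

```python
def availability(mtbf: float, mttr: float) -> float:
    """Availability A = 1 - MTTR / MTBF; both arguments in the same time unit."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return 1.0 - mttr / mtbf


# Hypothetical item: one failure every 10,000 hours, 4 hours to repair.
a = availability(mtbf=10_000, mttr=4)
print(f"A = {a:.4%}")
```

Halving the MTTR has the same effect on A as doubling the MTBF, which is why fast recovery matters as much as failure prevention.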
Users, services and reliability requirements
Network reliability is a “relative concept”.
The reliability requirements of a communication
network depend on:
the user type;
the service type.
Different user-service combinations lead to
diverse requirements in terms of MTBF and
MTTR.
User classification
• According to their reliability requirements, network users may be classified in the following categories:
Safety critical users. Users for which service
interruptions are unacceptable.
Business critical users. Users for which any service
interruption leads to a high financial loss.
Low cost users. Users for which service interruptions
cause only discomfort.
Basic level users. Users for which service reliability is
only a side effect.
Availability: Impact of Outages
[Figure: service outage impact (social/business impact) plotted against restoration time after failure detection. Time axis marks: 0, 50 ms, 200 ms, 2 sec, 10 sec, 5 min, 15 min, 30 min. Ranges shown: Protection Switching Range; 1st, 2nd, 3rd and 4th Restoration Target Ranges; Undesirable; Unacceptable.]
Impacts annotated in the figure:
• Service “hit” (reframes)
• Potential voiceband disconnects (<5%)
• Trigger changeover of CCS7 STP signaling links
• May drop voice band calls depending on channel bank vintage
• Drop all circuit switched connections
• PL disconnects
• Potential packet (X.25) disconnects
• Potential data session time-outs
• Packet (X.25) disconnects
• Data session time-outs
• Network congestion
• Effect cell rerouting process
• Minor social/business impacts
• Potentially FCC reportable
• Major social/business impacts
Ref: “Service Applications for SONET DCS Distribution Restoration”, IEEE J. Special Areas in Comm, Jan 94
Market Drivers for Survivability
• Customer Relations
• Competitive Advantage
• Revenue
  – Negative - Tariff Rebates
  – Positive - Premium Services
    • Business Customers
    • Medical Institutions
    • Government Agencies
• Impact on Operations
• Minimize Liability
Network Survivability
• Availability: 99.999% (5 nines) => about 5 min of downtime per year
• Since a network is made up of several components, the ONLY way to reach 5 nines is to add survivability in the face of failures…
• Survivability = continued services in the presence of failures
• Protection switching or restoration: mechanisms used to ensure survivability
  – Add redundant capacity, detect faults and automatically re-route traffic around the failure
• Restoration: related term, but slower time-scale
• Protection: fast time-scale: 10s-100s of ms…
  – implemented in a distributed manner to ensure fast restoration
Failure Types & Other Motivations
• Types of failure:
  – Components: links, nodes, channels in WDM, active components, software…
  – Human error: backhoe fiber cut
    • Fiber inside oil/gas pipelines less likely to be cut
  – Systems: Entire COs can fail due to catastrophic events
• Protection allows easy maintenance and upgrades:
  – E.g.: switch over traffic when servicing a link…
• Single failure vs multiple concurrent failures…
• Goal: mean repair time << mean time between failures…
• Protection also depends upon kind of application.
• Survivability may hence be provided at several layers
Network Survivability Architectures
[Figure: taxonomy of Network Survivability Architectures. Node labels: Restoration; Protection; Self-healing Network; Re-Configurable Network; Protection Switching; Mesh Restoration Architectures; Linear Protection Architectures; Ring Protection Architectures.]
Network Availability & Survivability
Availability is the probability that an item will be able to
perform its designed functions at the stated
performance level, within the stated conditions and in
the stated environment when called upon to do so.
Availability = Reliability / (Reliability + Recovery)
Quantification of Availability
Percent Availability   N-Nines   Downtime (Minutes/Year)
99%                    2-Nines   5,000 Min/Yr
99.9%                  3-Nines   500 Min/Yr
99.99%                 4-Nines   50 Min/Yr
99.999%                5-Nines   5 Min/Yr
99.9999%               6-Nines   0.5 Min/Yr
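The downtime figures quoted for each number of nines are rounded. A small sketch that computes the exact yearly downtime, assuming a 365-day year; the function name is hypothetical:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes (365-day year assumed)


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected downtime per year for a given availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for nines in range(2, 7):
    # 2 nines -> 99%, 3 nines -> 99.9%, ..., 6 nines -> 99.9999%
    percent = 100 - 10 ** (2 - nines)
    minutes = downtime_minutes_per_year(percent)
    print(f"{nines}-Nines: {minutes:,.2f} min/yr")
```

Five nines therefore allows roughly 5.26 minutes of downtime per year, slightly more than the rounded 5 minutes.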
PSTN
• Individual elements have an availability of 99.99%
• One cut-off call in 8000 calls (3 min for average call). Five ineffective calls in every 10,000 calls.
PSTN End-2-End Availability: 99.94%
[Figure: end-to-end PSTN connection through the elements NI, AN, LE and LD and the facility entrances, annotated with per-element unavailability values of 0.005% (four times), 0.01% (twice) and 0.02% (once).]
NI : Network Interface
AN : Access Network
LE : Local Exchange
LD : Long Distance
Source : http://www.packetcable.com/downloads/specs/pkt-tr-voipar-v01-001128.pdf
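The 99.94% end-to-end figure can be checked by treating the elements as a series system. The sketch below assumes the seven unavailability values annotated in the figure belong to elements in series that fail independently; the function name is hypothetical:

```python
from math import prod


def series_availability(unavailabilities_percent):
    """End-to-end availability (%) of independent elements in series,
    given each element's unavailability in percent."""
    return 100 * prod(1 - u / 100 for u in unavailabilities_percent)


# Per-element unavailability values annotated in the PSTN figure (%).
elements = [0.005, 0.005, 0.01, 0.005, 0.01, 0.005, 0.02]
print(f"{series_availability(elements):.2f}%")  # 99.94%
```

Because each unavailability is tiny, the product is almost exactly 100% minus the sum of the unavailabilities (0.06%).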
IP Network Expectations
Service                                       Delay  Jitter  Loss  Availability
Real Time Interactive (VOIP, Cell Relay ..)   L      L       L     H
Layer 2 & Layer 3 VPN’s (FR/Ethernet/AAL5)    M      L       L     H
Internet Service                              H      H       M     L
Video Services                                L      M       M     H
L : Low
M : Medium
H : High
Measuring Availability: The Port Method
• Based on Port count in Network
Availability (%) = [(Total # of Ports × Sample Period) − (Number of Impacted Ports × Outage Duration)] / (Total # of Ports × Sample Period) × 100
• Does not take into account the Bandwidth of ports, e.g. OC-192 and 64k are both ports
• Good for dedicated Access service because ports are tied to customers.
The Port Method Example
• Network with 10,000 active access ports
• An Access Router with 100 access ports fails for 30 minutes.
– Total Available Port-Hours = 10,000*24 = 240,000
– Total Down Port-Hours = 100*.5 = 50
– Availability for a Single Day = (240,000-50)/240,000*100 = 99.979166 %
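The port-method arithmetic above can be reproduced with a few lines. This is only a sketch; the function name and the `(impacted_ports, outage_hours)` tuple layout are assumptions made for illustration:

```python
def port_method_availability(total_ports, sample_hours, outages):
    """Port-method availability in percent.

    `outages` is an iterable of (impacted_ports, outage_hours) pairs.
    """
    total_port_hours = total_ports * sample_hours
    down_port_hours = sum(ports * hours for ports, hours in outages)
    return (total_port_hours - down_port_hours) / total_port_hours * 100


# Slide example: 10,000 ports over one day, 100 ports down for 30 minutes.
result = port_method_availability(10_000, 24, [(100, 0.5)])
print(f"{result:.6f} %")  # 99.979167 % (the slide truncates to 99.979166)
```

Several outages in the same sample period are handled by adding more pairs to the list.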
The Bandwidth Method
• Based on Amount of Bandwidth available in Network
Availability (%) = [(Total Amount of BW × Sample Period) − (Amount of BW Impacted × Outage Duration)] / (Total Amount of BW in Network × Sample Period) × 100
• Takes into account the Bandwidth of ports
• Good for Core Routers
The Bandwidth Method Example
• Total capacity of network 100 Gigabits/sec
• An Access Router with 1 Gigabit/sec BW fails for 30 minutes.
– Total BW available in network for a day = 100*24 = 2,400 Gigabit/sec-hours
– Total BW lost in outage = 1*.5 = 0.5 Gigabit/sec-hours
– Availability for a Single Day = ((2,400-0.5)/2,400)*100 = 99.979166 %
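Unlike the port method, the bandwidth method weights each outage by the capacity lost. The sketch below reproduces the example and contrasts two hypothetical 30-minute outages, a 64 kbit/s port and an OC-192 port (about 9.95 Gbit/s); the function name and the comparison scenario are assumptions:

```python
def bandwidth_method_availability(total_bw_gbps, sample_hours, outages):
    """Bandwidth-method availability in percent.

    `outages` is an iterable of (impacted_bw_gbps, outage_hours) pairs.
    """
    total = total_bw_gbps * sample_hours
    lost = sum(bw * hours for bw, hours in outages)
    return (total - lost) / total * 100


# Slide example: 100 Gbit/s network, 1 Gbit/s lost for 30 minutes in one day.
print(f"{bandwidth_method_availability(100, 24, [(1, 0.5)]):.6f} %")

# Hypothetical contrast: the same 30-minute outage on two very different ports.
small_port = bandwidth_method_availability(100, 24, [(64e-6, 0.5)])   # 64 kbit/s
large_port = bandwidth_method_availability(100, 24, [(9.95328, 0.5)])  # OC-192
print(f"64k outage: {small_port:.6f} %, OC-192 outage: {large_port:.6f} %")
```

The port method would score both of these outages identically, which is why the bandwidth method suits core routers better.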
Basic Ideas: Working and Protect Fibers
Service classification (1/2)
Communication networks are used to carry
many different services.
Different services may have diverse reliability
requirements.
Reliability requirements of such services are
related to QoS parameters:
Bit Rate;
Delay;
Jitter;
...
Service classification (2/2)
Application                  Bit Rate           Bit Rate Variation  Delay Sensitivity  Need for Recovery
Plain Old Telephone Service  31-32 Kbps         Constant            5                  5
Voice Over IP                8-32 Kbps          Constant            5                  5
Video-telephony              256-1920 Kbps      High                5                  5
Videoconferencing            at least 256 Kbps  High                5                  5
Teleworking                  64 Kbps – 2 Mbps   Very High           5                  4
TV broadcast                 2-8 Mbps           High                4                  3
Distance Learning            64 Kbps – 2 Mbps   Very High           5                  5
Movies on Demand             750 Kbps – 4 Mbps  High                4                  5
News on Demand               64 Kbps            Very High           2                  2
Internet Access              64 Kbps – 2 Mbps   Very High           1                  2
Teleshopping                 64 Kbps – 2 Mbps   Very High           2                  2
[2] A. Lason, et al., “Network Scenarios and Requirements”, European IST project Layers Internetworking in Optical Network (LION), deliverable D6, September 1999.
How to increase network reliability ?
Prevent network failure:
put network cables deeper in the ground;
more testing for hardware and software;
.....
Duplicate vulnerable network elements:
dual homing.
Regardless of these measures, network
failures still occur.
There is a need for network recovery or resilience
schemes!
Network recovery basic idea
• Build networks to have alternate paths
• Design systems to have alternate entities
• Monitor for possible failures
• Manage networks proactively
Network recovery requirements
Network recovery imposes several requirements.
For example:
there should be backup capacity to create a
recovery path;
the backup capacity must be enough to ensure QoS
constraints;
single points of failure must be avoided;
.....
Recovery and reversion cycles
Recovery Cycle
Reversion Cycle
Recovery mechanisms
A wide variety of recovery mechanisms exists.
Every mechanism has advantages and
drawbacks.
In the following slides some criteria that may be
used to evaluate and classify recovery
mechanisms are reported [3, 4].
[3] V. Sharma et al., “Framework for MPLS-based recovery”, RFC 3469, IETF web site, Feb 2003
[4] K. Owens, V. Sharma, M. Oommen, and F. Hellstrand, “Network Survivability Considerations for Traffic
Engineered IP Networks”, Internet draft: draft-owens-te-network-survivability-03, May 2002. Available at:
www.ietf.org. Accessed July 2005
Backup Capacity
Dedicated
one-to-one relationship between the backup resources
and the working path;
the simplest solution;
an inefficient solution.
Shared
the backup resources are shared among different
working paths;
a more complex solution;
a more efficient solution.
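The efficiency difference can be illustrated with a toy calculation. The sketch assumes that all backup paths cross one common link and that each accounted failure scenario hits a known set of working paths; the demand values and names are hypothetical:

```python
def dedicated_backup_capacity(demands):
    """Dedicated: every working path reserves its own backup capacity."""
    return sum(demands.values())


def shared_backup_capacity(demands, failure_scenarios):
    """Shared: reserve only what the worst accounted failure scenario needs."""
    return max(sum(demands[path] for path in affected)
               for affected in failure_scenarios)


# Hypothetical demands (in Gbit/s) of three disjoint working paths.
demands = {"P1": 10, "P2": 10, "P3": 5}
# Accounted failures: a single failure affects exactly one working path.
scenarios = [{"P1"}, {"P2"}, {"P3"}]

print(dedicated_backup_capacity(demands))          # 25
print(shared_backup_capacity(demands, scenarios))  # 10
```

The saving comes at the price of extra bookkeeping: the shared scheme has to know which working paths can fail together.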
Recovery Path
Preplanned
recovery paths for all accounted failure scenarios are
calculated in advance;
allows fast recovery from failures;
lacks flexibility for unaccounted failure scenarios.
Dynamic
the recovery path is calculated “on the fly” when the
failure is detected;
may also be used to search for recovery paths in
unaccounted failure scenarios.
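The two options can be contrasted in a small sketch: a table of recovery paths computed in advance for every accounted single-link failure, and an on-the-fly search for a failure that was not accounted for. The five-node topology, the breadth-first search and all names are illustrative assumptions:

```python
from collections import deque


def shortest_path(adjacency, source, target, failed_links=frozenset()):
    """Breadth-first shortest path that avoids failed links (None if unreachable)."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            link = frozenset((node, neighbor))
            if link not in failed_links and neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical topology with working path A-B-C and two alternatives.
adjacency = {"A": ["B", "D", "E"], "B": ["A", "C"], "C": ["B", "D", "E"],
             "D": ["A", "C"], "E": ["A", "C"]}
working_path = ["A", "B", "C"]

# Preplanned: one recovery path per accounted (single-link) failure, stored in advance.
accounted = [frozenset(pair) for pair in zip(working_path, working_path[1:])]
preplanned = {link: shortest_path(adjacency, "A", "C", {link}) for link in accounted}
print(preplanned[frozenset(("A", "B"))])  # ['A', 'D', 'C']

# Dynamic: an unaccounted double failure is handled by searching when it happens.
double_failure = {frozenset(("A", "B")), frozenset(("D", "C"))}
print(shortest_path(adjacency, "A", "C", double_failure))  # ['A', 'E', 'C']
```

The preplanned table answers with a lookup, whereas the dynamic search costs computation time after the failure but still finds a path the table does not contain.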
Recovery Approaches
Protection
the recovery paths are preplanned and fully signaled before
a failure occurs;
when a failure occurs no additional signaling is needed to
establish the recovery path;
is the faster solution.
Restoration
the recovery paths may be preplanned or dynamically
allocated but are not signaled in advance;
when a failure occurs additional signaling is needed to
establish the recovery path;
is the more flexible solution.
Protection Variants (1/2)
• 1+1 Protection (Dedicated Protection)
  – there is exactly one dedicated recovery path for each working segment;
  – the traffic is permanently duplicated on both the working path and the recovery path;
  – is a quite expensive solution.
• 1:1 Protection (Dedicated Protection with extra traffic)
  – there is exactly one dedicated recovery path for each working segment;
  – the traffic is transmitted over only one path at a time;
  – it is possible to transport extra traffic along the recovery path in failure-free conditions.
Protection Variants (2/2)
• 1:N (Shared Recovery With Extra Traffic)
each recovery entity is used to protect N working
entities;
it is possible to use the recovery entity to transport extra
traffic in failure-free conditions.
• M:N (M ≤ N)
a set of M recovery entities is used to protect a set of N
working entities;
it is possible to use the recovery entities to transport extra
traffic in failure-free conditions.
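A toy model of M:N protection, covering 1:N as the case M = 1. It assumes failures are served in the order they occur and that a recovery entity drops its extra traffic as soon as it takes over a working entity; the function and entity names are hypothetical:

```python
def assign_protection(m, failed_working):
    """Assign up to `m` recovery entities to failed working entities.

    Returns (recovered, unrecovered, preempted_extra_traffic).
    """
    recovered = failed_working[:m]      # first come, first served (assumption)
    unrecovered = failed_working[m:]    # no recovery entity left for these
    # Every recovery entity put into service must drop the extra traffic it carried.
    preempted_extra_traffic = len(recovered)
    return recovered, unrecovered, preempted_extra_traffic


# 1:4 protection: a single failure is recovered, a second one is not.
print(assign_protection(1, ["W2"]))
print(assign_protection(1, ["W2", "W4"]))
# 2:4 protection survives the same double failure.
print(assign_protection(2, ["W2", "W4"]))
```

Real schemes usually add priorities among the working entities, which this sketch leaves out.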
Recovery Extent (1/2)
Local Recovery
in failure conditions only the affected network elements
are bypassed using the recovery path;
the recovery head-end (RHE) and recovery tail-end (RTE) are close to the failure, so they
may detect the failure quickly, leading to a shorter
recovery time;
in case of failure the route followed by the traffic may
not be optimal (e.g. the same traffic may cross a link
twice!);
if two successive nodes fail, local recovery will fail.
Recovery Extent (2/2)
• Global Recovery
in failure conditions the complete working path between
source and destination is bypassed;
the recovery time is greater than that of local recovery;
an optimal recovery path is used in case of failure;
if two successive nodes fail, global recovery can still resolve the
problem;
may generate more “state overhead” than the local
approach.
• An intermediate solution between the Local and Global
approaches may be adopted!
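The difference in route quality between the two extents can be shown on a tiny example. The sketch assumes a working path A-B-C, a local detour B-A-D-C around link B-C and a global recovery path A-D-C; the topology and function names are hypothetical:

```python
def local_recovery(working_path, failed_link, detours):
    """Splice the detour around the failed link into the working path."""
    head, tail = failed_link
    index = working_path.index(head)
    if working_path[index + 1] != tail:
        raise ValueError("failed link is not on the working path")
    # The detour runs from `head` to `tail`, so it replaces both nodes.
    return working_path[:index] + detours[failed_link] + working_path[index + 2:]


def repeated_links(path):
    """Links that the path traverses more than once (in either direction)."""
    seen, repeated = set(), set()
    for a, b in zip(path, path[1:]):
        link = frozenset((a, b))
        if link in seen:
            repeated.add(link)
        seen.add(link)
    return repeated


working = ["A", "B", "C"]
local = local_recovery(working, ("B", "C"), {("B", "C"): ["B", "A", "D", "C"]})
global_path = ["A", "D", "C"]  # end-to-end recovery path chosen at the source

print(local, repeated_links(local))              # crosses link A-B twice
print(global_path, repeated_links(global_path))  # no repeated link
```

Local recovery reacts at node B, next to the failure, but sends the traffic back over A-B; global recovery needs the failure to be notified to A and then uses the shorter route.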
Control of Recovery Mechanisms (1/2)
Centralized
a central controller determines the action to take in
case of failure;
the central controller also determines when and
where a fault has occurred;
the central controller is a single point of failure;
is generally an efficient approach;
in principle is a simpler approach, but
the central controller may become a very complex
system.
Control of Recovery Mechanisms (2/2)
Distributed
there is no centralized controller; all the network
elements are capable of reacting autonomously to
failures;
with this approach there is no global view of the
network condition;
the network elements may have to exchange
information to keep a consistent view of the
network;
is a more scalable approach.
Protection Topologies - Ring
• Two or more nodes connected to each other with a ring of links
[Figure: ring of nodes with Working and Protect links.]
Protection Topologies - Mesh
• Three or more nodes connected to each other
  – Can be sparse or complete meshes
  – Spans may be individually protected with linear protection
  – Overall edge-to-edge connectivity is protected through multiple paths
[Figure: mesh of nodes with Working and Protect links.]
Protection Switching Terminology
• 1+1 architectures - permanent bridge at the source - select at sink
• m:n architectures - m entities provide protection for n working entities where m is less than or equal to n
allows unprotected extra traffic
most common - SONET linear 1:1 and 1:n
1+1 vs 1:n
[Figure: Working and Protect links in the 1+1 and 1:n arrangements.]
SONET Linear 1+1 APS
BR = Bridge
SW = Switch
TX = Transmitter
RX = Receiver
[Figure: two nodes connected by Working and Protection lines; in each direction a bridge (BR) at the transmit side and a switch (SW) at the receive side.]
SONET 1:1 Linear APS
BR = Bridge
SW = Switch
TX = Transmitter
RX = Receiver
[Figure: two nodes connected by Working and Protection lines, with bridges (BR), switches (SW) and an APS Channel.]
Protection Switching: Terminology
• Dedicated vs Shared: working connection assigned dedicated or shared protection bandwidth
  – 1+1 is dedicated, 1:n is shared
• Revertive vs Non-revertive: after failure is fixed, traffic is automatically or manually switched back
  – Shared protection schemes are usually revertive
• Uni-directional or bi-directional protection:
  – Uni: each direction of traffic is handled independently of the other.
  – Fiber cut => only one direction switched over to protection. Usually done with dedicated protection; no signaling required.
  – Bi-directional transmission on fiber (full duplex) => requires bi-directional switching and signaling
Mesh Restoration
[Figure: mesh of DCS nodes showing a Working Path, Line or Link Restoration, and Path Restoration.]
• Control: Centralized or Distributed
• Route Calculation: Preplanned or Dynamic
• Type of Alternate Routing: Line or Path
Link vs. Path restoration
• Link restoration
  – Requires the ability to identify the failed link at both ends.
  – Cannot protect against node failure.
  – Link-based mesh (generalized loopback): insensitive to additions to the network, hence scalable; backup path can be pre-computed, hence fast recovery; dynamic rerouting
• Path restoration
  – More resilient than link restoration.
  – Reroutes the traffic from the primary path to a Shared Risk Group (SRG) disjoint backup path.
  – Protects both end-to-end paths and single links.
• Preferred: Path Based
Link vs. Path restoration
[Figure: network with nodes A, B, C, D, E, F carrying Flow 1 (A-C-D) and Flow 2 (E-C-D-F). Panels: Fault: Link Cut; Link (Generalized Loopback) Restoration; Path Restoration.]
Pre-compute vs. Real-time
• Pre-computed
  – Calculates restoration paths before a failure happens.
  – Allows prior availability of reroute information to the nodes where actions need to be taken after failure is detected.
  – Enables fast restoration.
• Real-time
  – Calculates restoration paths after a failure happens.
  – Restoration is slower.
  – Enables more efficient capacity utilization.
• Preferred: Pre-computed
Centralized vs. Distributed
• Centralized restoration:
  – Computes restoration and primary paths for all demands with up-to-date information
  – Routes may then be downloaded into nodal databases.
  – Effectiveness?
    • More capacity efficient
    • Possibly slow (but may be executed in the background)
    • Scalability in question.
• Distributed restoration
  – Source and destination nodes dynamically search for the protection wavelengths required to reestablish the disrupted lightpath
  – Because a node lacks knowledge of the sharing databases of other OXCs, it may not be able to determine backup sharability for any given primary path
• Preferred:
  – Central path determination
  – Distributed Restoration
Protection Topologies - Linear
 Two nodes connected to each other with two or
more sets of links
[Figure: Working and Protect links between two nodes in the 1+1 and 1:n arrangements.]
Mesh Restoration vs Ring/Linear Protection
Attributes                         Linear APS  Ring PS         Mesh Restoration
Spare Capacity Needed              Most        Moderate        Least
Fiber Counts                       Highest     Moderate        Moderate
Restoration Time                   <50 ms      <50 ms          2-10 seconds
Software Complexity                Least       Moderate        Most
Protection Against Major Failures  Worst       Medium          Best
Planning/Operations Complexity     Least       Moderate/least  Most
Extracted from: T-H. Wu, Emerging Technologies for Fiber Network Survivability, See References
IP layer restoration
• IP Layer Restoration (real-time)
  – Achieved by exchanging control messages between adjacent routers
    • Re-determine the affected route
    • Update routing tables
    • Propagate changes (OSPF, BGP-4)
  – Capable of recovery from multiple faults
  – Slow (10s of seconds to minutes – Fumagalli); requires online processing upon failure
    • Fault discovery:
      explicitly: ICMP messaging
      implicitly: expiring of timers
  – Guarantees networkwide survivability
  – Independent of underlying physical network
[Figure: protocol stack (Application, Presentation, Session, Transport, Network (IP), Data Link, Physical).]
MPLS layer restoration
MPLS Layer Protection
• Real-time or pre-computed
• Line or path level protection
  – Protection path is node and link disjoint from the primary path.
  – Protection path may be allocated to low-priority traffic in the absence of network failure.
• Faster than dynamic IP rerouting
• Working LSPs have pre-established node/link disjoint protection paths
[Figure: protocol stack (Application, Presentation, Session, Transport, MPLS, Network, Data Link, Physical).]
Optical layer restoration
• Real-time or pre-computed
• Ring protection or mesh restoration
• No visibility into higher layer operations.
• May be wasteful use of resources.
  – For ring protection, there is over 100% capacity redundancy
  – For mesh restoration, 60-80% physical redundancy level is typical.
• Not recommended for node (or software) failures
• Faster than higher layer restorations (??)
[Figure: protocol stack (Application, Presentation, Session, Transport, Network (IP), DWDM (Optical), Physical).]
Multilayer Recovery (1/2)
In a multilayer network it is possible to imagine
a situation in which each layer has its own
recovery mechanisms.
Not every failure in a particular layer may be
resolved in the same layer.
If a failure may be resolved in several layers,
uncoordinated actions may produce inefficient
results.
Coordination among the layers is needed!
Multilayer Recovery (2/2)
• Sequential Approach [1]
using a hold-off time, a chronological order is imposed among the
recovery mechanisms adopted in the different layers;
alternatively a “token” may be used to impose a sequential order
among the different layers.
• Integrated Approach [1]
there is a recovery scheme that has a full overview of all the
layers;
the recovery scheme may decide when and in which layer
(or layers) the recovery actions must be taken.
[1] D. Colle, et al., “Data-centric optical networks and their survivability”, IEEE Journal on Selected Areas in Communications, Volume 20, Issue 1, Jan. 2002, Page(s): 6 - 20
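The hold-off idea of the sequential approach can be sketched as a simple decision rule. The sketch assumes a bottom-up order (the lower layer tries first, the upper layer waits for the hold-off time) and uses made-up timings; none of the names or values come from [1]:

```python
def sequential_recovery(lower_recovery_ms, hold_off_ms, upper_recovery_ms):
    """Return (recovering_layer, total_recovery_time_ms).

    `lower_recovery_ms` is None when the lower layer cannot repair the failure.
    """
    if lower_recovery_ms is not None and lower_recovery_ms <= hold_off_ms:
        # The lower layer finishes before the hold-off timer expires,
        # so the upper layer never starts its own recovery.
        return ("lower", lower_recovery_ms)
    # Hold-off timer expired with the failure still present: the upper layer acts.
    return ("upper", hold_off_ms + upper_recovery_ms)


# Hypothetical timings: optical protection in 50 ms, MPLS recovery in 200 ms,
# and a 100 ms hold-off timer in the MPLS layer.
print(sequential_recovery(50, 100, 200))    # ('lower', 50)
print(sequential_recovery(None, 100, 200))  # ('upper', 300)
```

The second case shows the cost of the approach: when the lower layer cannot help, the hold-off time is added to the upper layer's recovery time.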