CAPI'04
Milano, 24-25 November 2004
Experiences with numerical meteorological systems in service configuration based on high-performance computing at APAT
Attilio Colagrossi (APAT)
Franco Valentinotti (Quadrics Ltd.)
Outline

- APAT and meteorological services: history and current status
- The service-research dualism: criteria and architectures
- The computing systems
  - The operational chain and the complexity of the models
  - The BOLAM meteorological model
  - QBolam and the APE100-based system
  - PBolam and the ALTIX350-based system
  - Experiences on other computing systems
- Conclusions
APAT and meteorological services: history and current status

APAT: Agenzia per la Protezione dell'Ambiente e i Servizi Tecnici (Agency for Environmental Protection and Technical Services)

Established in 2002.

It carries out technical and scientific activities of national interest for the protection of the environment, water and soil.

It incorporates the competences previously assigned to ANPA and to the Department of National Technical Services: the National Hydrographic and Mareographic Service, the National Geological Service, and the Library.
APAT and meteorological services: history and current status

1998: the Department of National Technical Services launches the Idro-Meteo-Mare Project in collaboration with ISAC-CNR and ENEA.

OBJECTIVES
Analysis and forecasting of the meteorological situation over the national territory and of the state of the Mediterranean Sea.
Real-time monitoring, production of analyses and forecasts of the fields of interest, assessment of hydro-meteorological phenomena and of the associated risks.
APAT and meteorological services: history and current status

MODELS USED
- BOLAM (initialized on the ECMWF analyses)
- WAM
- POM
- FEM
APAT and meteorological services: history and current status

Fundamental requirement: execution of the models in SERVICE CONFIGURATION.

[Diagram: the operational chain, running since 2001, linking ECMWF, BOLAM, WAM, POM and FEM]
APAT and meteorological services: history and current status

Computing environment based on high-performance computers:
initially..... APE 100
now........... ALTIX 350
The service-research dualism: criteria and architectures

[System architecture diagram: ECMWF link via router; AlphaServer 4100; Sun SPARC station; APE100; building LAN; storage unit; Internet; Alpha; ADSL connection to Venezia for the Venice Lagoon Service]
The service-research dualism: criteria and architectures

now....
low-level scripts, file system, 'ad hoc' processing
soon.....
Open technologies: Linux, Apache, MySQL, PHP, Java
The operational chain: the models

The 3D meteorological model BOLAM, running at two different resolutions:
• High Resolution: 30 km grid spacing
• Very High Resolution: 10 km grid spacing

3 ocean models:
• WAM: a 2D model for the prediction of amplitude, frequency and direction of the sea waves;
• POM: a shallow-water circulation model for the prediction of surface elevation and horizontal velocities;
• VL-FEM: a 2D high-resolution circulation model using finite elements to better describe the Venice Lagoon morphology.
The computational domain

• H.R. BOLAM: coarse grid with 160×98×40 points and 30 km resolution.
• V.H.R. BOLAM: fine grid with 386×210×40 points and 10 km resolution.
• WAM: grid covering the whole Mediterranean Sea with about 3000 points and 30 km resolution.
• POM: grid covering the whole Adriatic Sea with about 4000 points and a variable resolution, with grid size decreasing when approaching Venice (from 10 to 1 km).
• VL-FEM: mesh covering the whole Venice Lagoon with more than 7500 elements and a spatial resolution varying from 1 km to 40 m.
The operational requirement
2 days of forecast in ~1 hour
The BOLAM computational cost
V.H.R. BOLAM:
• ~10^3 flop / grid point / time step
• ~3·10^6 grid points
• time step of 80 s

~7 TFlop per 2-day forecast
~2 GFlops sustained
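These estimates follow directly from the figures above; a quick check, using the 2-day / ~1-hour operational requirement:

\[
\begin{aligned}
W_{\mathrm{step}} &\approx 10^{3}\ \mathrm{flop/pt} \times 3\cdot10^{6}\ \mathrm{pts} = 3\cdot10^{9}\ \mathrm{flop},\\
N_{\mathrm{steps}} &= \frac{2 \times 86400\ \mathrm{s}}{80\ \mathrm{s}} = 2160,\\
W_{\mathrm{2\,days}} &\approx 2160 \times 3\cdot10^{9}\ \mathrm{flop} \approx 6.5\ \mathrm{TFlop} \approx 7\ \mathrm{TFlop},\\
R_{\mathrm{sustained}} &\approx \frac{6.5\cdot10^{12}\ \mathrm{flop}}{3600\ \mathrm{s}} \approx 1.8\ \mathrm{GFlops} \approx 2\ \mathrm{GFlops}.
\end{aligned}
\]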
The Meteorological Model BOLAM
GENERAL FEATURES
• A 3D primitive-equation model (momentum, mass continuity, energy conservation) in the hydrostatic limit
• Prognostic variables: U, V, T, Q, Ps

NUMERICAL SCHEME
• Finite-difference technique in time and space
• Advection: Forward-Backward Advection Scheme (FBAS), explicit, 2 time levels, centered in space
• Diffusion:
  - horizontal: 4th-order hyperdiffusion on U, V, T, Q; 2nd-order divergence damping on U, V
  - vertical: implicit scheme on U, V, T, Q
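For reference, and as an illustration only (these are the standard textbook forms of the horizontal terms, not taken from the BOLAM documentation), acting on a generic prognostic field ψ and on the horizontal wind v:

\[
\left(\frac{\partial \psi}{\partial t}\right)_{\mathrm{hdiff}} = -K_{4}\,\nabla^{4}\psi ,
\qquad
\left(\frac{\partial \mathbf{v}}{\partial t}\right)_{\mathrm{damp}} = \nu\,\nabla\!\left(\nabla\cdot\mathbf{v}\right).
\]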
PHYSICS ROUTINES
• They only involve computations along the vertical direction.
Year 1997: the QBolam and APE100 choice
General features
• SIMD: Single Instruction Multiple Data
• Topology: 3D cubic mesh
• Module: 2 × 2 × 2 processors
• Scalability: from 8 to 2048 processors
• Connections: 3D first neighbours, periodic at the boundaries

Processor
• MAD (Multiplier & Adder Device): pipelined, 50 MFlops of peak
• Memory: 4 MByte per processor (distributed)

Quadrics QH1
• 128 processors
• 6.4 GFlops
• 512 MByte

Master Controller
• Z-CPU: integer operations, memory addressing
• Server DEC 4100
The parallel code QBolam
The BOLAM code has been redesigned for the SIMD architecture and rewritten in the TAO language.

Data Distribution Strategy
• Static Domain Decomposition
  - number of subdomains = number of PEs
  - subdomains of the same shape and dimensions
• Connection between subdomains using the Frame Method
  - boundary data of the neighbouring subdomains are copied into the frame of the local domain
• Column Data Type Structure
  - "ad hoc" libraries for communications and arithmetical operations between columns

[Figure: the physical subdomain of the central PE surrounded by a frame containing data from the first-neighbouring and corner PEs; grid boxes and the boundaries between PEs are shown]
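A minimal sketch of the Frame Method in Python/NumPy (illustrative only; the real QBolam code is written in TAO and operates on the column data type, and the array and function names used here are hypothetical):

import numpy as np

def make_framed_subdomain(physical, f=1):
    """Allocate the local subdomain surrounded by a frame of width f."""
    ny, nx = physical.shape
    framed = np.zeros((ny + 2 * f, nx + 2 * f), dtype=physical.dtype)
    framed[f:-f, f:-f] = physical          # interior = physical subdomain
    return framed

def fill_frame(framed, north, south, west, east, f=1):
    """Copy the boundary data received from the first-neighbouring PEs into
    the frame of the local domain (corner cells, coming from the corner PEs,
    are omitted here for brevity)."""
    framed[:f, f:-f] = north               # row(s) from the northern neighbour
    framed[-f:, f:-f] = south              # row(s) from the southern neighbour
    framed[f:-f, :f] = west                # column(s) from the western neighbour
    framed[f:-f, -f:] = east               # column(s) from the eastern neighbour
    return framed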
QBolam Performance on Quadrics/APE100

machine type               QH1           QH1           QH4*
N. of processors           128           128           512
QBolam model               HR            VHR           VHR
resolution                 30 km         10 km         10 km
N. of ops. / time step     0.57 GFlop    2.90 GFlop    2.90 GFlop
time step                  240 s         80 s          80 s
execution time/time step   0.297 s       1.333 s       0.392 s
performance                1.92 GFlops   2.12 GFlops   7.21 GFlops
% of peak performance      30 %          33 %          28 %
days of simulation         2.5 days      2 days        2 days
elapsed time               8' 16''       1h 53' 35''   48' 25''

* Measurements carried out on the Quadrics/APE100 QH4 of the ENEA Casaccia computing centre, Rome
Year 2004: PBolam and Cluster Linux

The goal of the project is to replace the existing operational forecasting system with a new one in the near future:

• Simplify the operational chain: all models and interfaces will be executed on one machine only

• Parallel architecture upgrade:
  - 4 dual-CPU nodes
  - 1.4 GHz Itanium 2
  - 44.8 GFlops of peak
  - 8 GByte of memory (physically distributed)
  - SMP thanks to the NUMAflex technology (6.4 GByte/s): "single system image"
  - OpenMP, MPI
  - Linux cluster, Open Source
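The quoted peak is consistent with 4 floating-point operations per cycle per Itanium 2 CPU (an assumption about the two FMA units, not stated on the slide):

\[
4\ \mathrm{nodes} \times 2\ \mathrm{CPUs} \times 4\ \mathrm{flop/cycle} \times 1.4\ \mathrm{GHz} = 44.8\ \mathrm{GFlops}.
\]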
• Simulation model upgrade: the first result is the development of the PBolam code, a parallel meteorological model.

SGI Altix 350
The parallel code PBolam
PBolam is a parallel version of the meteorological model BOLAM for distributed-memory architectures.

General Features
• Portable: Fortran90, MPI, standard POSIX
• Versatile: any number of processors, any number of grid points
• Easy to maintain: same data type structure as BOLAM, same variable and subroutine names as BOLAM

Parallelization strategy
• Static Domain Decomposition
  - the number of subdomains equals the number of processes, but is not fixed as in QBolam
  - parallelepiped subdomains, which may however have different shapes and dimensions
• Data Distribution Strategy (see the sketch below)
  - all vertical levels on the same process
  - subdivision on the horizontal: NLon/PLon × NLat/PLat × NLev, where PLon × PLat = P (the number of processes) is chosen to minimize the communication time
• Frame Method
  - boundary data of the neighbouring subdomains are copied into the frame of the local domain: exchange in the North-South / East-West directions
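A minimal sketch of the idea behind the choice of PLon × PLat (illustrative Python, not the actual Fortran90/MPI code of PBolam; the function name and the simple perimeter cost model are assumptions): for a given P, the factorisation whose subdomains have the smallest halo perimeter exchanges the least frame data per step.

def best_decomposition(n_lon, n_lat, n_proc):
    """Return the (p_lon, p_lat) factorisation of n_proc that minimises the
    perimeter of the local subdomain, i.e. the North-South / East-West
    frame data exchanged at every time step."""
    best = None
    for p_lon in range(1, n_proc + 1):
        if n_proc % p_lon:
            continue                         # only exact factorisations
        p_lat = n_proc // p_lon
        loc_lon = -(-n_lon // p_lon)         # ceil: largest local extent in lon
        loc_lat = -(-n_lat // p_lat)         # ceil: largest local extent in lat
        perimeter = 2 * (loc_lon + loc_lat)  # ~ data exchanged per step
        if best is None or perimeter < best[0]:
            best = (perimeter, p_lon, p_lat)
    return best[1], best[2]

# Example: on the VHR grid (386 x 210 horizontal points) with 8 processes,
# this simple model prefers a 4 x 2 decomposition over 8 x 1, 2 x 4 or 1 x 8.
print(best_decomposition(386, 210, 8))       # -> (4, 2)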
VHR PBolam performance on Altix
The execution time was measured for all possible PLon × PLat combinations with P ∈ [2,8].
[Plot: Execution time vs. number of processes/data distribution. Y axis: execution time (s), 0.0 to 1.8; X axis: number of processes (P_lon x P_lat), from 1x2 to 8x1; series: Step, Physics, Comm.]

• The execution time of one step decreases when P increases
• The communication time is quite constant when P increases
• The execution time of the physics phase is quite constant, for a fixed P
• The execution time of one step, for a fixed P, is minimum when the communication time is also minimum
VHR PBolam performance on Altix
Execution time vs. P for the best data distribution
[Plot: execution time per step, communication time (log scale, 0.01 to 10.00 s) and percentage of communication vs. number of processes (2 to 10), for the best data distribution; and the corresponding speed-up curve compared with the ideal one]

• The communication time increases slowly when P increases, because the total amount of data involved in the exchange also increases slowly
• Since the total execution time scales roughly as 1/P, the communication time grows from 1% to 10% of the total

This behaviour is also evident in the speed-up curve:
S = time(NP=1) / time(NP=P)
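A simple model consistent with these observations (illustrative only, with the computational part assumed to divide evenly among the processes and the exchange time taken as roughly constant, although it actually grows slowly as noted above):

\[
T(P) \approx \frac{T_{\mathrm{comp}}}{P} + T_{\mathrm{comm}},
\qquad
\frac{T_{\mathrm{comm}}}{T(P)} \approx \frac{P\,T_{\mathrm{comm}}}{T_{\mathrm{comp}} + P\,T_{\mathrm{comm}}},
\]

so the communication fraction grows with P even if the exchange time itself barely changes, which explains the rise from about 1% to 10% and the departure of the speed-up curve from the ideal one.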
AMD Cluster with QsNetII
• 8 dual-CPU nodes
• 2.2 GHz Opteron
• 70.4 GFlops of peak performance
• 8 GByte of distributed memory
• QsNetII interconnect
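The quoted peak is consistent with 2 double-precision floating-point operations per cycle per Opteron CPU (an assumption, not stated on the slide):

\[
8\ \mathrm{nodes} \times 2\ \mathrm{CPUs} \times 2\ \mathrm{flop/cycle} \times 2.2\ \mathrm{GHz} = 70.4\ \mathrm{GFlops}.
\]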
MPI latency: 1.8 µs
MPI bandwidth: 900 MB/s

[Plot: MPI bandwidth (MBytes/s, 0 to 2000) vs. message size (4 Bytes to 1 MByte) for QsNet, 2 x QsNet, 4 x QsNet, QsNetII and 2 x QsNetII; pictures of the Elan4 NIC and the Elite4 switch]
Altix vs. AMD Cluster
[Plots: execution time per step, communication time and percentage of communication vs. number of processes (up to 16) for both the Altix and the AMD cluster, and the corresponding speed-up curves (AMD, Altix, Ideal)]

• A smaller increase of the percentage of communication over the total time gives a better speed-up curve, especially when the number of processes grows
• Itanium is faster than Opteron (preliminary results show a factor of about 1.5)
• QsNetII shows better performance
Conclusions

• The VHR execution time has been reduced with the Altix 350: PBolam performance (6.3 GFlops, 14% of peak) is 3 times the QBolam performance, and the elapsed time has dropped from 100 min. to 20 min., including I/O
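These figures are mutually consistent with the numbers reported earlier:

\[
0.14 \times 44.8\ \mathrm{GFlops} \approx 6.3\ \mathrm{GFlops},
\qquad
\frac{6.3\ \mathrm{GFlops}}{2.12\ \mathrm{GFlops}} \approx 3.
\]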
• APAT now has a parallel meteorological code, PBolam, portable to several Linux clusters
• In the near future, the whole forecasting chain will be simplified, because all models and interfaces will be executed on one machine only
• The software architecture is now more suitable for performing both research and service activities