CAPI'04, Milan, 24-25 November 2004
Experiences with numerical meteorological systems in service configuration based on high-performance computing at APAT
Attilio Colagrossi (APAT), Franco Valentinotti (Quadrics Ltd.)

Outline
• APAT and meteorological services: history and current status
• The service-research dualism: criteria and architectures
• The computing systems:
  - the operational chain and the complexity of the models
  - the meteorological model BOLAM
  - QBolam and the APE100-based system
  - PBolam and the Altix 350-based system
  - experiences on other computing systems
• Conclusions

APAT and meteorological services: history and current status
APAT (Agenzia per la Protezione dell'Ambiente e i Servizi Tecnici), established in 2002, carries out technical and scientific activities of national interest for the protection of the environment, water and soil. It incorporates the competences previously assigned to ANPA and to the Dipartimento dei Servizi Tecnici Nazionali (Servizio Idrografico e Mareografico Nazionale, Servizio Geologico Nazionale, Library).

In 1998 the Dipartimento dei Servizi Tecnici Nazionali launched the Idro-Meteo-Mare Project in collaboration with ISAC-CNR and ENEA.
Objectives:
• analysis and forecast of the meteorological situation over the national territory and of the state of the Mediterranean Sea;
• real-time monitoring, production of analyses and forecasts of the fields of interest, assessment of hydro-meteorological phenomena and of the associated risks.

Models used: BOLAM (initialized from the ECMWF analyses), WAM, POM, FEM.

Fundamental requirement: execution of the models in SERVICE CONFIGURATION. The chain has been operational since 2001: the ECMWF analyses feed BOLAM, which in turn drives WAM, POM and FEM.

The computing environment is based on high-performance computers: initially APE100, now Altix 350.

The service-research dualism: criteria and architectures
[Diagram of the initial architecture: ECMWF link, router, AlphaServer 4100, Sun SPARCstation, APE100, building LAN, storage unit, Internet access, and an Alpha machine reached via ADSL for the Venice Lagoon service in Venice.]
Now: low-level scripts, file system, "ad hoc" processing.
Soon: open technologies (Linux, Apache, MySQL, PHP, Java).

The operational chain: the models
The 3D meteorological model BOLAM, running at two different resolutions:
• High Resolution (HR): 30 km grid spacing
• Very High Resolution (VHR): 10 km grid spacing
Three ocean models:
• WAM: a 2D model for the prediction of amplitude, frequency and direction of sea waves;
• POM: a shallow-water circulation model for the prediction of surface elevation and horizontal velocities;
• VL-FEM: a 2D high-resolution circulation model using finite elements to better describe the Venice Lagoon morphology.

The computational domain
• HR BOLAM: coarse grid with 160 × 98 × 40 points and 30 km resolution.
• VHR BOLAM: fine grid with 386 × 210 × 40 points and 10 km resolution.
• WAM: grid covering the whole Mediterranean Sea, about 3000 points, 30 km resolution.
• POM: grid covering the whole Adriatic Sea, about 4000 points, variable resolution, with the grid size decreasing from 10 km to 1 km when approaching Venice.
• VL-FEM: mesh covering the whole Venice Lagoon, more than 7500 elements, spatial resolution varying from 1 km to 40 m.

The operational requirement: 2 days of forecast in ~1 hour.

The BOLAM computational cost (VHR BOLAM)
• ~10³ flop per grid point per time step
• 3·10⁶ grid points
• time step of 80 s
• → ~7 TFlop per 2-day forecast → ~2 GFlops sustained (see the check below).
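The three figures above combine directly into the quoted totals. As a quick back-of-the-envelope check (the symbols W for the total work of a 2-day forecast and R for the sustained rate needed to fit the ~1 hour window are ours, not from the slides):

    \[
    \begin{aligned}
    N_{\mathrm{steps}} &= \frac{2 \times 86400\ \mathrm{s}}{80\ \mathrm{s}} = 2160,\\
    W &\approx 2160 \times 3\cdot10^{6}\ \mathrm{points} \times 10^{3}\ \tfrac{\mathrm{flop}}{\mathrm{point}\cdot\mathrm{step}}
       \approx 6.5\cdot10^{12}\ \mathrm{flop} \approx 7\ \mathrm{TFlop},\\
    R &\approx \frac{6.5\cdot10^{12}\ \mathrm{flop}}{3600\ \mathrm{s}} \approx 1.8\ \mathrm{GFlops} \approx 2\ \mathrm{GFlops\ sustained}.
    \end{aligned}
    \]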
The meteorological model BOLAM
General features:
• A 3D primitive-equations model (momentum, mass continuity, energy conservation) in the hydrostatic limit.
• Prognostic variables: U, V, T, Q, Ps.
Numerical scheme:
• Finite-difference technique in time and space.
• Advection: Forward-Backward Advection Scheme (FBAS), explicit, 2 time levels, centered in space.
• Diffusion:
  - horizontal: 4th-order hyperdiffusion on U, V, T, Q and 2nd-order divergence damping on U, V;
  - vertical: implicit scheme on U, V, T, Q.
Physics routines:
• They only involve computations along the vertical direction.

Year 1997: the QBolam and APE100 choice
General features of APE100:
• SIMD: Single Instruction Multiple Data
• Topology: 3D cubic mesh
• Module: 2 × 2 × 2 processors
• Scalability: from 8 to 2048 processors
• Connections: 3D first neighbours, periodic at the boundaries
• Processor: MAD (Multiplier & Adder Device), pipelined, 50 MFlops of peak performance
• Memory: 4 MByte per processor (distributed)
• Master controller: Z-CPU (integer operations, memory addressing)
• Server: DEC 4100
Quadrics QH1: 128 processors, 6.4 GFlops of peak performance, 512 MByte of memory.

The parallel code QBolam
The BOLAM code has been redesigned for the SIMD architecture and rewritten in the TAO language.
Data distribution strategy: static domain decomposition
• Number of subdomains = number of PEs.
• Subdomains of the same shape and dimensions.
Connection between subdomains using the frame method
• Boundary data of the neighbouring subdomains are copied into the frame of the local domain.
[Figure: the physical subdomain of the central PE surrounded by a frame containing data from the first-neighbouring and corner PEs; grid boxes and boundaries between PEs are marked.]
Column data type structure
• "Ad hoc" libraries for communications and arithmetical operations between columns.

QBolam performance on Quadrics/APE100

  machine type                  QH1           QH1           QH4*
  N. of processors              128           128           512
  QBolam model                  HR            VHR           VHR
  resolution                    30 km         10 km         10 km
  N. of ops. / time step        0.57 GFlop    2.90 GFlop    2.90 GFlop
  time step                     240 s         80 s          80 s
  execution time / time step    0.297 s       1.333 s       0.392 s
  performance                   1.92 GFlops   2.12 GFlops   7.21 GFlops
  % of peak performance         30 %          33 %          28 %
  days of simulation            2.5 days      2 days        2 days
  elapsed time                  8' 16''       1h 53' 35''   48' 25''

  * Measurements carried out on the Quadrics/APE100 QH4 of the ENEA Casaccia computing centre, Rome.

Year 2004: PBolam and the Linux cluster
The goal of the project is to replace the existing operational forecasting system with a new one in the near future.
• Simplify the operational chain: all models and interfaces will be executed on one machine only.
• Parallel architecture upgrade: SGI Altix 350
  - 4 dual-CPU nodes, 1.4 GHz Itanium II
  - 44.8 GFlops of peak performance
  - 8 GByte of memory (physically distributed)
  - SMP thanks to the NUMAflex technology (6.4 GBytes/s), "single system image"
  - OpenMP, MPI
  - Linux cluster, open source
• Simulation model upgrade: the first result is the development of PBolam, a parallel meteorological model.

The parallel code PBolam
PBolam is a parallel version of the meteorological model BOLAM for distributed-memory architectures.
General features:
• Portable: Fortran 90, MPI, standard POSIX.
• Versatile: any number of processors, any number of grid points.
• Easy to maintain: same data type structure as BOLAM, same variable and subroutine names as BOLAM.
Parallelization strategy: static domain decomposition
• Number of subdomains equal to the number of processes, but not fixed as in QBolam.
• Parallelepiped subdomains, which may however have different shapes and dimensions.
Data distribution strategy
• All vertical levels on the same process.
• Subdivision on the horizontal: each process holds a block of (NLon/PLon) × (NLat/PLat) × NLev grid points, where PLon × PLat = P (the number of processes) is chosen to minimize the communication time.
Frame method
• Boundary data of the neighbouring subdomains are copied into the frame of the local domain: exchanges in the North-South and East-West directions, as sketched below.
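To make the frame method concrete, the sketch below shows a one-point frame (halo) exchange on a PLon × PLat Cartesian process grid in C with MPI. It is only an illustration of the technique, not the PBolam implementation (which is written in Fortran 90); all names and sizes (field, NX, NY, ...) are invented for the example.

    /* Minimal sketch of a frame (halo) exchange on a PLon x PLat process
     * grid, in the spirit of the PBolam frame method.  Illustrative only:
     * the real PBolam is written in Fortran 90; names and sizes are invented. */
    #include <mpi.h>
    #include <stdlib.h>

    #define NX 64   /* local interior points along longitude */
    #define NY 48   /* local interior points along latitude  */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Build a PLon x PLat Cartesian grid of processes (non-periodic). */
        int dims[2] = {0, 0}, periods[2] = {0, 0};
        MPI_Dims_create(nprocs, 2, dims);
        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

        int west, east, south, north;
        MPI_Cart_shift(cart, 0, 1, &west, &east);    /* neighbours along lon */
        MPI_Cart_shift(cart, 1, 1, &south, &north);  /* neighbours along lat */

        /* Local field with a one-point frame on every side:
         * (NX+2) x (NY+2) doubles, row-major, accessed as field[i][j]. */
        double (*field)[NY + 2] = calloc(NX + 2, sizeof *field);
        for (int i = 1; i <= NX; ++i)
            for (int j = 1; j <= NY; ++j)
                field[i][j] = (double)rank;          /* dummy interior data */

        /* Strided type for one latitude column (frame rows included, so that
         * corner values are forwarded after the East-West exchange). */
        MPI_Datatype col_t;
        MPI_Type_vector(NX + 2, 1, NY + 2, MPI_DOUBLE, &col_t);
        MPI_Type_commit(&col_t);

        /* East-West exchange: interior rows i=1 and i=NX fill the frames. */
        MPI_Sendrecv(&field[NX][1], NY, MPI_DOUBLE, east, 0,
                     &field[0][1],  NY, MPI_DOUBLE, west, 0,
                     cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&field[1][1],      NY, MPI_DOUBLE, west, 1,
                     &field[NX + 1][1], NY, MPI_DOUBLE, east, 1,
                     cart, MPI_STATUS_IGNORE);

        /* North-South exchange: interior columns j=1 and j=NY. */
        MPI_Sendrecv(&field[0][NY], 1, col_t, north, 2,
                     &field[0][0],  1, col_t, south, 2,
                     cart, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&field[0][1],      1, col_t, south, 3,
                     &field[0][NY + 1], 1, col_t, north, 3,
                     cart, MPI_STATUS_IGNORE);

        MPI_Type_free(&col_t);
        free(field);
        MPI_Finalize();
        return 0;
    }

Performing the East-West exchange first and including the frame rows in the column datatype is one simple way to propagate the corner values without a separate diagonal exchange.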
VHR PBolam performance on Altix
The execution time was measured for every possible decomposition PLon × PLat = P, with P ranging from 2 to 8.
[Figure: execution time per time step (step, physics and communication contributions) for each decomposition PLon × PLat, grouped by number of processes.]
• The execution time of one step decreases when P increases.
• The communication time stays roughly constant when P increases.
• The execution time of the physics phase is roughly constant for a fixed P.
• For a fixed P, the execution time of one step is minimum when the communication time is also minimum.

VHR PBolam performance on Altix: execution time vs. P for the best data distribution
[Figure: step time, communication time and percentage of communication versus number of processes (log scale), and speedup versus number of processes compared with the ideal curve.]
• The communication time increases slowly with P, because the total amount of data involved in the exchange also increases slowly.
• Since the total execution time scales as 1/P, the share of communication grows from 1% to 10%.
• This behaviour is also evident in the speedup curve: S = time(NP = 1) / time(NP = P).

AMD cluster with QsNetII
• 8 dual-CPU nodes, 2.2 GHz Opteron
• 70.4 GFlops of peak performance
• 8 GByte of distributed memory
• QsNetII interconnect (Elan4 NIC, Elite4 switch): MPI latency 1.8 µs, MPI bandwidth 900 MB/s
[Figure: bandwidth (MBytes/s) versus message size (4 bytes to 1 MByte) for QsNet, 2 × QsNet, 4 × QsNet, QsNetII and 2 × QsNetII.]

Altix vs. AMD cluster
[Figure: step time, communication time and percentage of communication versus number of processes for Altix and AMD, and the corresponding speedup curves against the ideal one.]
• A smaller growth of the communication share of the total time means a better speedup curve, especially as the number of processes grows.
• Itanium is faster than Opteron (preliminary results show a factor of about 1.5).
• QsNetII shows better performance.

Conclusions
• The VHR execution time has been reduced with the Altix 350: PBolam performance (6.3 GFlops, 14% of peak) is 3 times the QBolam performance, and the elapsed time drops from about 100 minutes to about 20 minutes, including I/O.
• APAT now has a parallel meteorological code, PBolam, portable across several Linux clusters.
• In the near future the whole forecasting chain will be simplified, because all models and interfaces will be executed on one machine only.
• A software architecture more suitable for carrying out both research and service activities.
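A quick check of the figures quoted in the conclusions, using only numbers already given above (44.8 GFlops Altix 350 peak, 2.12 GFlops QBolam VHR sustained):

    \[
    \frac{6.3\ \mathrm{GFlops}}{44.8\ \mathrm{GFlops}} \approx 14\%\ \text{of peak},
    \qquad
    \frac{6.3\ \mathrm{GFlops}}{2.12\ \mathrm{GFlops}} \approx 3.0\ \text{times the QBolam sustained rate}.
    \]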