Trends and Perspectives for HPC Infrastructures
Carlo Cavazzoni, CINECA

Outline
- HPC resources in Europe (PRACE)
- Today's HPC architectures
- Technology trends
- CINECA roadmaps (toward 50 PFlops)
- EuroExa project

The PRACE RI provides access to distributed, persistent, pan-European, world-class HPC computing and data management resources and services. Expertise in the efficient use of these resources is available through participating centers throughout Europe. Available resources are announced for each Call for Proposals.

- Tier-0: European
- Tier-1: National
- Tier-2: Local

Peer-reviewed open access
- PRACE Projects (Tier-0)
- PRACE Preparatory (Tier-0)
- DECI Projects (Tier-1)

Tier-0 systems, PRACE regular calls
- CURIE (GENCI, France): Bull cluster, Intel Xeon, Nvidia cards, InfiniBand network
- FERMI (CINECA, Italy) and JUQUEEN (Jülich, Germany): IBM BG/Q, Power processors, custom 5D torus network
- MARENOSTRUM (BSC, Spain): IBM iDataPlex, Intel Xeon nodes, InfiniBand network
- HERMIT (HLRS, Germany): Cray XE6, AMD processors, custom 3D torus network, 1 PFlops
- SuperMUC (LRZ, Germany): IBM iDataPlex, Intel Xeon nodes, InfiniBand network

Tier-1 systems, DECI calls

DECI site                 Machine name System type             Chip                         Peak (Tflops)  GPU cards
Bulgaria (NCSA)           EA "ECNIS"   IBM BG/P                PowerPC 450                  27
Czech Republic (VSB-TUO)  Anselm       Bull Bullx              Intel Sandy Bridge-EP        66             23 nVIDIA Tesla, 4 Intel Xeon Phi P5110
Finland (CSC)             Sisu         Cray XC30               Intel Sandy Bridge           244.9
France (CINES)            Jade         SGI ICE EX8200          Intel Quad-Core E5472/X5560  267.88
France (IDRIS)            Babel        IBM BG/P                PowerPC 450                  139
Germany (Jülich)          JuRoPA       Intel cluster           Intel Xeon X5570             207
Germany (RZG)             Genius       IBM BG/P                PowerPC 450                  54
Germany (RZG)                          iDataPlex               Intel Sandy Bridge           200
Ireland (ICHEC)           Stokes       SGI ICE 8200EX          Intel Xeon E5650
Italy (CINECA)            PLX          iDataPlex               Intel Westmere               293            548 nVIDIA Tesla M2070/M2070Q
Norway (SIGMA)            Abel         MegWare cluster         Intel Sandy Bridge           260
Poland (WCSS)             Supernova    Cluster                 Intel Westmere-EP            51.58
Poland (PSNC)             chimera      SGI UV1000              Intel Xeon E7-8837           21.8
Poland (PSNC)             cane         cluster AMD&GPU         AMD Opteron™ 6234            224.3          334 NVIDIA Tesla M2050
Poland (ICM)              boreasz      IBM Power 775 (Power7)  IBM Power7                   74.5
Poland (Cyfronet)         Zeus-gpgpu   Linux Cluster           Intel Xeon X5670/E5645       136.8          48 M2050 / 160 M2090
Spain (BSC)               MinoTauro    Bull Cuda Cluster       Intel Xeon E5649             182            256 nVIDIA Tesla M2090
Sweden (PDC)              Lindgren     Cray XE6                AMD Opteron                  305
Switzerland (CSCS)        Monte Rosa   Cray XE6                AMD Opteron                  402
The Netherlands (SARA)    Huygens      IBM pSeries 575         Power 6                      65
Turkey (UYBHM)            Karadeniz    HP Cluster              Intel Xeon 5550              2.5
UK (EPCC)                 HECToR       Cray XE6                AMD Opteron                  829.03
UK (ICE-CSE)              ICE Advance  IBM BG/Q                PowerPC A2                   1250

HPC Architectures: two models
Hybrid:
- Server-class processors: server-class nodes, special-purpose nodes
- Accelerator devices: Nvidia, Intel, AMD, FPGA
Homogeneous:
- Server-class nodes: standard processors
- Special-purpose nodes: special-purpose processors

Networks
- Standard/switched: InfiniBand
- Special purpose/topology: BG/Q, Cray, Tofu (Fujitsu), TH Express-2 (Tianhe-2)

Programming Models
- Fundamental paradigms: message passing, multi-threading
- Consolidated standards: MPI and OpenMP
- New task-based programming models
- Special purpose for accelerators: CUDA, Intel offload directives, OpenACC, OpenCL, etc. (no consolidated standard)
- Scripting: Python

Roadmap to Exascale (architectural trends)

Dennard scaling law (downscaling), from the old VLSI generation to the new VLSI generation:
  L' = L / 2
  V' = V / 2
  F' = F * 2
  D' = 1 / L'^2 = 4 * D
  P' = P
These relations do not hold anymore!

Core frequency and performance no longer grow following Moore's law:
  L' = L / 2
  V' = ~V
  F' = ~F * 2
  D' = 1 / L'^2 = 4 * D
  P' = 4 * P
The power crisis!

The number of cores is increased to keep the evolution of architectures on Moore's law.
Programming crisis!

Moore's Law
Economic and market law

Stacy Smith, Intel’s chief financial officer, later gave some more detail on the economic benefits of staying on the Moore’s Law race. The cost per chip “is going down more than the capital intensity is going up,” Smith said, suggesting Intel’s profit margins should not suffer because of heavy capital spending. “This is the economic beauty of Moore’s Law.” And Intel has a good handle on the next production shift, shrinking circuitry to 10 nanometers. Holt said the company has test chips running on that technology. “We are projecting similar kinds of improvements in cost out to 10 nanometers,” he said. So, despite the challenges, Holt could not be induced to say there’s any looming end to Moore’s Law, the invention race that has been a key driver of electronics innovation since first defined by Intel’s co-founder in the mid-1960s. (From WSJ)

It is all about the number of chips per Si wafer!

But!
- 14 nm VLSI
- 0.54 nm Si lattice
- 300 atoms!

There will still be 4 to 6 cycles (or technology generations) left until we reach 11 to 5.5 nm technologies, at which point we will reach the downscaling limit, in some year between 2020 and 2030 (H. Iwai, IWJT2008).

What about Applications?
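
As a minimal sketch of the Amdahl's law bound that frames this question, the snippet below evaluates the speedup for the slide's own numbers (1,000,000 cores, P = 0.999999). The function names `amdahl_speedup` and `amdahl_limit` are illustrative choices, not taken from the slides, and the model assumes a perfectly parallel fraction with no communication overhead.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup on `cores` cores when `parallel_fraction` of the runtime scales perfectly."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def amdahl_limit(parallel_fraction):
    """Upper bound on the speedup as the number of cores grows without limit: 1 / (1 - P)."""
    return 1.0 / (1.0 - parallel_fraction)


# Values quoted on the slide.
P = 0.999999
CORES = 1_000_000

speedup = amdahl_speedup(P, CORES)
limit = amdahl_limit(P)
efficiency = speedup / CORES

print(f"speedup on {CORES} cores: {speedup:.0f}")
print(f"asymptotic limit 1/(1-P): {limit:.0f}")
print(f"parallel efficiency:      {efficiency:.2f}")
```

Under these assumptions the finite-core speedup is about 500,000, roughly half of the asymptotic bound of 1,000,000: even a serial fraction of one part in a million costs half the machine at this scale.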
In a massively parallel context, the upper limit on the scalability of a parallel application is set by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).

The maximum speedup tends to 1 / (1 − P), where P is the parallel fraction.
Example: 1,000,000 cores require P = 0.999999 (serial fraction = 0.000001).

Architectural trends
- Peak performance: Moore's law
- FPU performance: Dennard's law
- Number of FPUs: Moore + Dennard
- Application parallelism: Amdahl's law

HPC Architectures: two models
- Hybrid, but…
- Homogeneous, but…

Which 100 PFlops systems will we see? My guess:
- IBM (hybrid): Power8 + Nvidia GPU
- Cray (homo/hybrid): with Intel only!
- Intel (hybrid): Xeon + MIC
- ARM (homo): only ARM chips, but…
- Nvidia/ARM (hybrid): ARM + Nvidia
- Fujitsu (homo): SPARC, high density, low power
- China (homo/hybrid): with Intel only
- Room for AMD console chips

Chip Architecture
Strongly market driven: mobile, TV sets, screens, video/image processing
- Intel: new architecture to compete with ARM; less Xeon, but Phi
- ARM: main focus on low-power mobile chips (Qualcomm, Texas Instruments, Nvidia, ST, etc.); new HPC market, server market
- NVIDIA: GPU alone will not last long; ARM+GPU, Power+GPU
- Power: embedded market; Power+GPU, only chance for HPC
- AMD: console market; still some chance for HPC

CINECA Roadmaps
Roadmap 50 PFlops
- Power consumption: EURORA 50 kW, PLX 350 kW, BG/Q 1000 kW + ENI; EURORA or PLX upgrade 400 kW, BG/Q 1000 kW, data repository 200 kW; ENI
- R&D: Eurora; EuroExa STM/ARM board
- Deployment: Eurora industrial prototype, 150 TF; Eurora or PLX upgrade, 1 PF peak, 350 TF scalar
- Time line: 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020
- Milestones: EuroExa STM/ARM prototype; PCP prototype, 1 PF in a rack; EuroExa STM/ARM PF platform; multi-petaflop system; ETP prototype towards exascale board; Tier-1 towards exascale; Tier-0 50 PF

Tier 1 CINECA
Procurement Q2014
High-level system requirements
- Electrical power consumption: 400 kW
- Physical size of the system: 5 racks
- System peak performance (CPU+GPU): on the order of 1 PFlops
- System peak performance (CPU only): on the order of 300 TFlops
- CPU architecture: Intel Xeon Ivy Bridge
- Cores per CPU: 8 @ >3 GHz, or 12 @ 2.4 GHz. The choice of frequency and core count depends on the socket TDP, the system density, and the cooling capacity.
- Number of servers: 500 to 600 (Peak perf = 600 * 2 sockets * 12 cores * 3 GHz * 8 Flop/clk ≈ 345 TFlops). The number of servers may depend on cost or on the geometry of the configuration, in terms of the number of CPU-only nodes and CPU+GPU nodes.
- GPU architecture: Nvidia K40
- Number of GPUs: >500 (Peak perf = 700 * 1.43 TFlops ≈ 1 PFlops). The number of GPU cards may depend on cost or on the geometry of the configuration, in terms of the number of CPU-only nodes and CPU+GPU nodes.
- Identified vendors: IBM, Eurotech
- DRAM memory: 1 GByte/core. The option of a subset of nodes with a larger amount of memory will be requested.
- Local non-volatile memory: >500 GByte SSD/HD, depending on cost and system configuration
- Cooling: liquid cooling system with a free-cooling option
- Scratch disk space: >300 TByte (provided by CINECA)

Thank you
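
The peak-performance figures quoted in the Tier-1 requirements can be re-derived with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: the helper names `node_gflops` and `system_tflops` are hypothetical, and the inputs (600 servers, 2 sockets, 8 Flop/clk, 700 GPUs at 1.43 TFlops each) are the ones written on the slides.

```python
def node_gflops(sockets, cores_per_socket, ghz, flop_per_clk):
    """Peak GFlops of one CPU node: sockets * cores * GHz * Flop per clock."""
    return sockets * cores_per_socket * ghz * flop_per_clk


def system_tflops(units, gflops_per_unit):
    """Peak TFlops of `units` identical nodes or cards."""
    return units * gflops_per_unit / 1000.0


# CPU partition as written in the slide formula (12 cores at 3 GHz).
cpu_as_quoted = system_tflops(600, node_gflops(2, 12, 3.0, 8))

# The two core/frequency options listed in the requirements.
cpu_12c_2p4ghz = system_tflops(600, node_gflops(2, 12, 2.4, 8))
cpu_8c_3ghz = system_tflops(600, node_gflops(2, 8, 3.0, 8))

# GPU partition: 700 cards at 1.43 TFlops (1430 GFlops) each.
gpu_quoted = system_tflops(700, 1430.0)

print(f"CPU, slide formula:   {cpu_as_quoted:.1f} TFlops")
print(f"CPU, 12 cores @ 2.4:  {cpu_12c_2p4ghz:.1f} TFlops")
print(f"CPU, 8 cores @ 3.0:   {cpu_8c_3ghz:.1f} TFlops")
print(f"GPU, 700 cards:       {gpu_quoted:.1f} TFlops")
```

The slide formula combines 12 cores with 3 GHz, which gives about 345.6 TFlops. The two options actually listed (12 cores at 2.4 GHz, 8 cores at 3 GHz) give roughly 276 and 230 TFlops for 600 servers, so the CPU-only target of about 300 TFlops sits between the formula and the listed options. The GPU figure of 700 cards comes out at about 1001 TFlops, consistent with the 1 PFlops target.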