ALICE: how to move from owned resources to shared resources
Stefano Bagnasco, INFN Torino
Workshop sulle Problematiche di Calcolo e Reti nell’INFN, Castiadas (CA), 24-29 May 2004
Workshop sul Calcolo nell’INFN, Castiadas, 27 May 2004

Outline of the talk
● ALICE computing: AliRoot and AliEn
● From AliEn to shared resources
● First test: PDC2004
● Conclusions

AliRoot processing chain
Diagram labels (ALICE Offline Framework): Root hits structures; C++ macro; Root digit structures; C++ macro; Root tracks; C++ macro; Physics results; Persistency; Root output file; ROOT data visualisation
● ROOT & C++: strategic decision in 1998

AliRoot layout
Diagram labels: AliRoot; Virtual MC; G3; G4; FLUKA; EVGEN; ISAJET; HIJING; MEVSIM; PYTHIA6; PDF; HBTAN; HBTP; STEER; AliSimulation; AliReconstruction; AliAnalysis; ESD; STRUCT; PMD; EMCAL; CRT; TRD; ITS; START; PHOS; FMD; TOF; MUON; ZDC; TPC; RICH; RALICE; AliEn; ROOT

AliEn

The AliEn Philosophy
● Standards are now emerging for the basic building blocks of a GRID
    There are millions of lines of code in the open-source (OS) domain dealing with these issues
● Why not use these to build the minimal GRID that does the job?
    Fast development of a prototype
    Hundreds of users and developers
    Immediate adoption of emerging standards
Diagram labels (AliEn architecture, from low level to high level): External software; External Libraries; Perl Core; Perl Modules; RDBMS (MySQL); DBD; DBI; ADBI; Database Proxy; LDAP; SOAP/XML; AliEn Core Components & services; File & Metadata Catalogue; Authentication; Config Mgr; Package Mgr; Logger; RB; CE; SE; (…); Interfaces; API (C/C++/perl); FS; CLI; GUI; Web Portal; User Interface; User Application; V.O. Packages & Commands

AliEn Timeline
Timeline labels: Start; 2001; 2002; 2003; 2004; 2005 ?
Milestones shown: First production (distributed simulation); Physics Performance Report (mixing & reconstruction); 10% Data Challenge (analysis)
Development focus shown: Functionality + Simulation; Interoperability + Reconstruction; Performance, Scalability, Standards + Analysis

(Very) few details on AliEn
● Workload management is “pull-model”: a server holds a master queue of jobs, and it is up to the CE that provides the CPU cycles to call the server and ask for a job
● The system is integrated with a large-scale job submission and bookkeeping system “tuned” for Data Challenge productions, with job splitting, statistics, pie charts, automatic resubmissions, etc.
● The job monitoring model requires no “sensors” installed on the WN: the job wrapper itself talks to the server

Being bold: from AliEn to a Meta-Grid
● Several Grid infrastructures are becoming available:
    LCG, Grid.it, possibly others, maybe in the U.S.
    Lots of resources but, in principle, different middleware
● The pull model is well suited to implementing higher-level submission systems, since it does not require knowledge about the periphery, which may be very complex
    “A Grid is a system that […] coordinates resources that are not subject to centralized control […] using standard, open, general-purpose protocols and interfaces […] to deliver nontrivial qualities of service.”
    I. Foster, “What is the Grid? A Three Point Checklist”, Grid Today (2002)

From AliEn to a Meta-Grid (cont’d)
Design strategy:
● Use AliEn as a general front-end
    Owned and shared resources are exploited transparently
● Minimize points of contact between the systems
    No need to reimplement services etc.
    No special services required to run on remote CE/WNs
● Make full use of provided services: data catalogues, scheduling, monitoring…
    Let the Grids do their jobs (they should know how)
● Use high-level tools and APIs to access Grid resources
    Developers put a lot of abstraction effort into hiding the complexity and shielding the user from implementation changes

Available resources
● Several AliEn “native” sites (some rather large)
    CERN, CNAF, Catania, Cyfronet, FZK, JINR, LBL, Lyon, OSC, Prague, Torino
● LCG-2 core sites
    CERN, CNAF, FZK, NIKHEF, RAL, Taiwan (more than 1000 CPUs)
● GRID.IT sites
    LNL.INFN, PD.INFN and several smaller ones (about 400 CPUs, not including CNAF)
● Implementation: manage LCG resources through a “gateway”, an AliEn client (CE+SE) sitting on top of an LCG User Interface
    The whole of LCG computing is seen as a single, large AliEn CE associated with a single, large SE

Interfacing AliEn and LCG
Diagram labels: Server; Job submission; Status report; Interface Site; AliEn CE; AliEn SE; LCG UI; LCG RB; LCG Site; EDG CE; WN; LCG SE; Data Registration (twice); Data Catalogue; Replica Catalogue; LFN (twice); PFN = LFN; AliEn PFN

Production on two grids
Diagram labels: Server; Master job queue; Submission; AliEn CE/SE (three); AliEn CE; LCG UI; LCG RB; LCG CE/SE (three); Catalogue (two)
With this structure:
● All resources are in full competitive mode
● If LCG works well, it will gobble up a large number of jobs and will be used heavily
● If LCG does not work well, AliEn will favour other resources and LCG will be used less
● In all cases we try to use LCG-2 and Grid.it as much as possible
● We need not take any a priori decision: the performance of the systems will decide for us

Cheating: two grids, same resources!
● “Double access” for selected sites (CNAF and CT.INFN)
Diagram labels: A user submits jobs; Server; Submission; AliEn CE/SE; AliEn CE; LCG UI; LCG RB; LCG CE/SE; WN (five)

Software installation
● Both AliEn and AliRoot are installed via LCG jobs
    Do some checks, download tarballs, uncompress, build the environment script and publish the relevant tags
    A single command is available to get the list of available sites, send the jobs everywhere and wait for completion
    A full update on LCG-2 + GRID.IT (16 sites) takes ~30 minutes
    Manual intervention is still needed at a few sites (e.g. CERN/LSF)
    Ready for integration into the AliEn automatic installation system
● Misconfiguration of the experiment software shared area caused most of the trouble in the beginning
Diagram labels: LCG-UI; installAlice.sh; installAlice.jdl; installAliEn.sh; installAliEn.jdl; target sites NIKHEF, Taiwan, RAL, …, CNAF, TO.INFN

ALICE Physics Data Challenges
Period (milestone) | Fraction of the final capacity (%) | Physics objective
06/01-12/01 | 1% | pp studies, reconstruction of TPC and ITS
06/02-12/02 | 5% | First test of the complete chain from simulation to reconstruction for the PPR; simple analysis tools; digits in ROOT format
01/04-06/04 | 10% | Complete chain used for trigger studies; prototype of the analysis tools; comparison with parameterised MonteCarlo; simulated raw data
01/06-06/06 | 20% | Test of the final system for reconstruction and analysis

PDC2004
● Phase 1: Production of RAW and shipment to CERN
    Large output files (up to 1 GB/event in ~25 files)
    1a: Central events (long jobs, large files): DONE
    1b: Peripheral events (short jobs, smaller files): one week to go
● Phase 2: Merging + reconstruction in all T1s
    Events are redistributed to remote sites before merging and reconstruction
    Smaller merged output (~100 MB/event)
    Figure labels: Signal-free event; Mixed signal
● Phase 3: Distributed analysis
    Will need interactivity
    Will need direct file access
    Towards the ARDA prototype…

PDC2004 - Status
● Up to 1800 CPUs simultaneously under AliEn control
    1400 running jobs + 400 saving
    Two interface sites deployed (to LCG-2@CERN, to GRID.IT)
    About half “native AliEn”, half LCG-2 + GRID.IT
● Statistics after round 1 (ended 4 April): job distribution
    Alice::CERN::LCG is the interface to LCG-2
    Alice::Torino::LCG is the interface to GRID.IT

AliEn vs. AliEn+LCG
● LCG-2 jobs seen through AliEn MonALISA monitoring
    The ramp-up slope shows no major performance degradation
    Plot legend: AliEn native site; LCG-2; GRID.IT

Grid.it starting up
    Larger sites are filled first

Phase II: Reconstruction
● Will need local storage for all sites
    So we will need to use native LCG storage
● Interface system available, installed on the EIS testbed
    SRM everywhere would simplify things a lot
    Use of GUIDs for files also simplified things
● Next week we’ll start trying to use it in production
    Site managers, brace for the hit!

Phase III: Analysis
PROOF uses the AliEn Grid File Catalogue and Data Management to map LFNs to a chain of PFNs, and Workload Management to detect which nodes in a cluster can be used in a parallel session
Diagram labels: SITE A, SITE B, SITE C; PROXY MUX (one per site); API; ALIEN PROXY CONTROLLER + PROOF SERVER; USER SESSION

Phase III: Analysis (workflow)
Diagram labels, in reading order:
    USER provides: Analysis Macro; Input Files (? Query for Input Data)
    new TAliEnAnalysis Object
    produces: List of Input Data + Locations
    Job Splitting: IO Object 1 for Site A; IO Object 2 for Site A; IO Object 1 for Site B; IO Object 1 for Site C
    Job Submission: Job Object 1 for Site A; Job Object 2 for Site A; Job Object 1 for Site B; Job Object 1 for Site C
    Execution
    Results: Histogram Merging; Tree Chaining

Lessons learned
● Remote site configuration is the major source of problems on the LCG side
    Software management tools are still rudimentary
    Large sites often have tighter security restrictions and other idiosyncrasies
    Investigating and fixing problems is hard and time-consuming
● The most difficult part of the management is monitoring LCG through a “keyhole”
    Only integrated information is available natively
    MonALISA for AliEn, GridICE for LCG
● For short jobs, submission time (and thus the performance of the interface system) can limit the number of jobs
    But the system is inherently scalable
● Our usage of CPUs has been partially limited by lack of storage availability…
    … but then, there was no usable SE deployed!

Conclusions
● Migration to common resources was made smooth by using AliEn as a common front-end, but we could not use shared distributed storage yet
● First GRID production with fully transparent common access to different middleware (AliEn & LCG)
    Apart from some glitches, AliEn was robust enough
    Up to 1800 simultaneous jobs running
    As of today, more than 53,000 jobs completed
    Very small crew!
● The LCG Workload Management seems stable enough to manage a production of this scale
    We were limited by storage issues (e.g. the number of files turned out to be more critical than their size!)
    Some pauses for fixes and improvements on the AliEn side
    We’ll see about the storage and data transfer infrastructure
● Huge steps forward, but distributed analysis poses many more challenges!
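
The “pull-model” workload management and the “full competitive mode” described in the talk can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not AliEn code: the class `MasterQueue`, the function `run_production`, the per-cycle slot counts and the name `native-site` are all hypothetical. Only the CE names Alice::CERN::LCG and Alice::Torino::LCG come from the slides. The point it shows is that the server never needs to know anything about the periphery: each CE simply asks for work, so a CE that can take more jobs ends up with a larger share.

```python
from collections import Counter, deque


class MasterQueue:
    """Toy stand-in for the server that holds the master job queue."""

    def __init__(self, n_jobs):
        self._jobs = deque(range(n_jobs))

    def pull(self):
        """Hand out the next job, or None when the queue is empty."""
        return self._jobs.popleft() if self._jobs else None


def run_production(n_jobs, ce_slots, n_cycles):
    """Simulate CEs pulling jobs from a single master queue.

    ce_slots maps a CE name to the number of jobs it asks for per cycle
    (a made-up measure of how well that CE is working).
    Returns a Counter with the number of jobs each CE obtained.
    """
    queue = MasterQueue(n_jobs)
    done = Counter()
    for _ in range(n_cycles):
        for ce, slots in ce_slots.items():
            for _ in range(slots):
                job = queue.pull()  # the CE calls the server, never the reverse
                if job is None:     # queue exhausted: production is over
                    return done
                done[ce] += 1
    return done


if __name__ == "__main__":
    # Hypothetical capacities; "native-site" is an invented placeholder name.
    shares = run_production(
        n_jobs=1000,
        ce_slots={"Alice::CERN::LCG": 6, "Alice::Torino::LCG": 3, "native-site": 1},
        n_cycles=50,
    )
    for ce, n in shares.items():
        print(ce, n)
```

With these made-up capacities the 500 jobs pulled in 50 cycles split 300 / 150 / 50: no a priori decision about where to send jobs is encoded anywhere, which is the behaviour the “Production on two grids” slide relies on.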