ALICE: how to move from owned resources
to common resources
Stefano Bagnasco, INFN Torino
Workshop on Computing and Networking
Issues at INFN
Castiadas (CA), 24-29 May 2004
Outline of the talk
● ALICE computing: AliRoot and AliEn
● From AliEn to shared resources
● First test: PDC2004
● Conclusions
Workshop on Computing at INFN
Castiadas, 27 May 2004
AliRoot processing chain
[Diagram: Alice Offline Framework. C++ macros produce, in turn, Root hits structures, Root digit structures and Root tracks, leading to physics results. Persistency: Root output file. ROOT data visualisation.]
ROOT & C++: strategic decision in 1998
AliRoot layout
[Diagram. Box labels: AliRoot, ROOT, AliEn, Virtual MC, G3, G4, FLUKA, STEER, AliSimulation, AliReconstruction, AliAnalysis, ESD, EVGEN, ISAJET, HIJING, MEVSIM, PYTHIA6, PDF, HBTP, HBTAN, RALICE]
[Detector modules: PMD, STRUCT, EMCAL, CRT, TRD, ITS, START, PHOS, FMD, TOF, MUON, ZDC, TPC, RICH]
AliEn
The AliEn Philosophy
● Standards are now emerging for the basic building blocks of a GRID
  - There are millions of lines of code in the open-source (OS) domain dealing with these issues
● Why not use these to build the minimal GRID that does the job?
  - Fast development of a prototype
  - Hundreds of users and developers
  - Immediate adoption of emerging standards
[Diagram: AliEn architecture, layered from low level to high level]
External software: Perl Core, Perl Modules, External Libraries, LDAP, RDBMS (MySQL), V.O. Packages & Commands
AliEn Core Components & services: Database Proxy, ADBI, DBI, DBD, Authentication, File & Metadata Catalogue, RB, CE, SE, Config Mgr, Package Mgr, Logger, SOAP/XML, (…)
Interfaces: API (C/C++/perl), FS, CLI, GUI, Web Portal
User Interface, User Application
AliEn Timeline
[Timeline figure with year axis 2001, 2002, 2003, 2004, 2005, ?; a “Start” marker is shown]
Milestones: First production (distributed simulation); Physics Performance Report (mixing & reconstruction); 10% Data Challenge (analysis)
Development focus: Functionality + Simulation; Interoperability + Reconstruction; Performance, Scalability, Standards + Analysis
(Very) few details on AliEn
● The Workload Management is “pull-model”: a server
holds a master queue of jobs and it is up to the CE that
provides the CPU cycles to call it and ask for a job
● The system is integrated with a large-scale job
submission and bookkeeping system “tuned” for Data
Challenge productions, with job splitting, statistics, pie
charts, automatic resubmissions, etc.
● The Job Monitoring model requires no “sensors” installed
on the WN. It is the jobwrapper itself that talks to the
server.
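The pull model described above can be illustrated with a minimal Python sketch. It is not AliEn code: the class and method names (MasterQueue, ComputingElement, request_job, poll) are hypothetical, and the only point is that the server never pushes work; each CE asks for a job when it has a free slot.

```python
from collections import deque


class MasterQueue:
    """Central server holding the master queue of jobs (hypothetical API)."""

    def __init__(self, jobs):
        self._jobs = deque(jobs)

    def request_job(self, ce_name, free_slots):
        """Called BY a CE; a job is handed out only if the CE has a free slot."""
        if free_slots > 0 and self._jobs:
            return self._jobs.popleft()
        return None

    def pending(self):
        return len(self._jobs)


class ComputingElement:
    """A site that provides CPU slots and pulls work from the server."""

    def __init__(self, name, slots):
        self.name = name
        self.slots = slots
        self.running = []

    def poll(self, server):
        """Keep asking for jobs until the slots are full or the queue is empty."""
        while True:
            free = self.slots - len(self.running)
            job = server.request_job(self.name, free)
            if job is None:
                break
            self.running.append(job)
        return list(self.running)


server = MasterQueue([f"job-{i}" for i in range(5)])
big_site = ComputingElement("site-A", slots=3)
small_site = ComputingElement("site-B", slots=1)
big_site.poll(server)
small_site.poll(server)
```

In this toy run the larger site ends up with three jobs, the smaller one with a single job, and one job stays in the master queue until some CE asks again. The server needs no knowledge of how many sites exist or how large they are.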
Being bold: from AliEn to a Meta-Grid
● Several Grid infrastructures are becoming available: LCG, Grid.it, possibly others, maybe in the U.S.
● Lots of resources but, in principle, different middlewares
● Pull-model is well-suited for implementing higher-level
submission systems, since it does not require knowledge
about the periphery, which may be very complex
“A Grid is a system that […] coordinates resources that are not subject to centralized control […] using standard, open, general-purpose protocols and interfaces […] to deliver nontrivial qualities of service.”
I. Foster, “What is the Grid? A Three Point Checklist”, Grid Today (2002)
From AliEn to a Meta-Grid – cont’d
Design strategy:
● Use AliEn as a general front-end
  - Owned and shared resources are exploited transparently
● Minimize points of contact between the systems
  - No need to reimplement services etc.
  - No special services required to run on remote CE/WNs
● Make full use of provided services: Data Catalogues, scheduling, monitoring…
  - Let the Grids do their jobs (they should know how)
● Use high-level tools and APIs to access Grid resources
  - Developers put a lot of abstraction effort into hiding the complexity and shielding the user from implementation changes
Available resources
● Several AliEn “native” sites (some rather large)
  - CERN, CNAF, Catania, Cyfronet, FZK, JINR, LBL, Lyon, OSC, Prague, Torino
● LCG-2 core sites
  - CERN, CNAF, FZK, NIKHEF, RAL, Taiwan (more than 1000 CPUs)
● GRID.IT sites
  - LNL.INFN, PD.INFN and several smaller ones (about 400 CPUs, not including CNAF)
● Implementation: manage LCG resources through a “gateway”: an AliEn client (CE+SE) sitting on top of an LCG User Interface
The whole of LCG computing is seen as a single, large AliEn CE associated with a single, large SE
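As a rough illustration of the “gateway” idea, the Python sketch below wraps a whole grid behind one CE-like object. Everything here is an assumption made for illustration: GridGateway, forward, report_done and fake_submit are hypothetical names, and fake_submit merely stands in for whatever submission goes through the LCG User Interface.

```python
import itertools


class GridGateway:
    """Makes a whole grid look like one CE to the front-end (illustrative only)."""

    def __init__(self, submit, max_in_flight):
        self._submit = submit          # stand-in for submission through the grid's UI
        self._max = max_in_flight      # jobs the gateway keeps on the grid at once
        self._in_flight = {}           # front-end job id -> grid job id

    def free_slots(self):
        """What the gateway advertises to the front-end, like any other CE."""
        return self._max - len(self._in_flight)

    def forward(self, job_id, payload):
        """Hand a front-end job over to the underlying grid."""
        if self.free_slots() <= 0:
            raise RuntimeError("gateway full")
        grid_id = self._submit(payload)
        self._in_flight[job_id] = grid_id
        return grid_id

    def report_done(self, job_id):
        """Free the slot once the grid job has finished."""
        return self._in_flight.pop(job_id)


_counter = itertools.count(1)


def fake_submit(payload):
    # Hypothetical replacement for a real grid submission command.
    return f"grid-job-{next(_counter)}"


gateway = GridGateway(fake_submit, max_in_flight=2)
first = gateway.forward("alien-1", "simulate.sh")
second = gateway.forward("alien-2", "simulate.sh")
slots_when_full = gateway.free_slots()
gateway.report_done("alien-1")
slots_after_done = gateway.free_slots()
```

The front-end sees only free_slots() and the job identifiers it handed over. Which remote site actually runs the job is left entirely to the grid behind the gateway.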
Interfacing AliEn and LCG
[Diagram, job submission: Server → AliEn CE on the interface site → LCG UI → LCG RB → EDG CE and WN at an LCG site, with status reports flowing back]
[Diagram, data registration: output on an LCG SE is registered in the LCG Replica Catalogue (LFN) and, through the AliEn SE, in the AliEn Data Catalogue (LFN), where AliEn PFN = LCG LFN]
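The double registration shown in the diagram (the AliEn PFN is the LCG LFN) can be sketched with two dictionaries standing in for the two catalogues. This is a toy model under stated assumptions: the function names, file names and storage URL are invented for illustration and do not reflect either catalogue's real interface.

```python
# Two toy catalogues; real ones are database-backed services.
lcg_replica_catalogue = {}   # LCG LFN -> list of physical replicas on LCG SEs
alien_catalogue = {}         # AliEn LFN -> AliEn PFN


def register_output(alien_lfn, lcg_lfn, storage_url):
    """Register a file on the LCG side, then point the AliEn entry at it."""
    lcg_replica_catalogue.setdefault(lcg_lfn, []).append(storage_url)
    # The AliEn "physical" name is just the LCG logical name.
    alien_catalogue[alien_lfn] = lcg_lfn


def locate(alien_lfn):
    """Resolve an AliEn LFN to physical replicas in two steps."""
    lcg_lfn = alien_catalogue[alien_lfn]
    return lcg_replica_catalogue[lcg_lfn]


# Hypothetical names, for illustration only.
register_output(
    "/alice/sim/run1/galice.root",
    "lfn:alice-run1-galice.root",
    "se.example.org:/data/f001",
)
replicas = locate("/alice/sim/run1/galice.root")
```

With this indirection the AliEn catalogue never needs to know where a file physically sits on LCG; replica management stays with the LCG catalogue.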
Production on two grids
[Diagram: the server's master job queue feeds, through job submission, several native AliEn CE/SEs and one AliEn CE sitting on an LCG UI, which passes jobs through the LCG RB to several LCG CE/SEs. Each grid has its own catalogue.]
With this structure:
● All resources in full competitive mode
● If LCG works well, it will gobble up a large number
of jobs and will be used heavily
● If LCG does not work well, AliEn will favour
other resources, and LCG will be used less
● In all cases we try to use LCG-2 and Grid.it as much as
possible
● We need not take any a priori decision: the performance
of the systems will decide for us
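The “competitive mode” argument can be checked with a small simulation. The sketch below is a toy model, not a description of the real production: the site names, slot counts and job durations are hypothetical, and each free slot simply pulls the next job from a shared queue.

```python
import heapq


def simulate(sites, n_jobs):
    """sites: {name: (slots, minutes_per_job)}. Returns jobs taken per site."""
    # One heap entry per slot: (time at which the slot is free, site name).
    free_at = [
        (0.0, name)
        for name, (slots, _) in sorted(sites.items())
        for _ in range(slots)
    ]
    heapq.heapify(free_at)
    taken = {name: 0 for name in sites}
    for _ in range(n_jobs):
        # The slot that becomes free first pulls the next job from the queue.
        now, name = heapq.heappop(free_at)
        taken[name] += 1
        heapq.heappush(free_at, (now + sites[name][1], name))
    return taken


# Hypothetical numbers: same number of slots, one grid three times slower per job.
shares = simulate({"fast-grid": (10, 60.0), "slow-grid": (10, 180.0)}, n_jobs=400)
```

With equal slot counts, the grid that turns jobs around faster ends up taking most of them, without any site ranking being configured in advance.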
Cheating: two grids, same resources!
● “Double access” for selected sites (CNAF and CT.INFN)
[Diagram: a user submits jobs to the server; the same worker nodes (WN) can be reached either through a native AliEn CE/SE or through the AliEn CE on the LCG UI, the LCG RB and an LCG CE/SE]
Software installation
● Both AliEn and AliRoot installed via LCG jobs
  - Each job does some checks, downloads the tarballs, uncompresses them, builds the environment script and publishes the relevant tags
  - A single command gets the list of available sites, sends the jobs everywhere and waits for completion. A full update on LCG-2 + GRID.IT (16 sites) takes ~30 minutes
  - Manual intervention still needed at a few sites (e.g. CERN/LSF)
  - Ready for integration into the AliEn automatic installation system
● Misconfiguration of the experiment software shared area caused most of the trouble in the beginning
[Diagram: from the LCG-UI, installAliEn.sh / installAliEn.jdl and installAlice.sh / installAlice.jdl are sent to the sites: NIKHEF, Taiwan, RAL, …, CNAF, TO.INFN]
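The “single command” that sends installation jobs everywhere and waits for completion could look roughly like the Python sketch below. It is only an outline under assumptions: install_everywhere and fake_install are hypothetical names, and fake_install replaces the real step of submitting an installation job to a site and waiting for its result.

```python
from concurrent.futures import ThreadPoolExecutor


def install_everywhere(sites, submit_and_wait, max_workers=8):
    """Run one installation job per site in parallel and collect the outcomes."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `sites`, so zip pairs them correctly.
        results = list(pool.map(submit_and_wait, sites))
    return dict(zip(sites, results))


def fake_install(site):
    # Hypothetical stand-in: a real version would submit the install job
    # to the site and return True once the software tag is published.
    return site != "site-needing-manual-fix"


report = install_everywhere(
    ["site-a", "site-b", "site-needing-manual-fix"], fake_install
)
failed = sorted(site for site, ok in report.items() if not ok)
```

The list of failed sites is what would then be handed over for manual intervention.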
ALICE Physics Data Challenges
Period (milestone) | Fraction of the final capacity (%) | Physics objective
06/01-12/01 | 1% | pp studies, reconstruction of TPC and ITS
06/02-12/02 | 5% | First test of the complete chain from simulation to reconstruction for the PPR; simple analysis tools; digits in ROOT format
01/04-06/04 | 10% | Complete chain used for trigger studies; prototype of the analysis tools; comparison with parameterised Monte Carlo; simulated raw data
01/06-06/06 | 20% | Test of the final system for reconstruction and analysis
PDC2004 - 1
● Phase 1: Production of RAW and shipment to CERN
  - Large output files (up to 1 GB/event in ~25 files)
  - 1a: Central events (long jobs, large files): done
  - 1b: Peripheral events (short jobs, smaller files): one week to go
PDC2004 - 2
● Phase 2: Merging + Reconstruction in all T1s
  - Events are redistributed to remote sites before merging and reconstruction
  - Smaller merged output (~100 MB/event)
[Diagram labels: signal-free event, mixed signal]
PDC2004 - 3
● Phase 3: Distributed Analysis
  - Will need interactivity
  - Will need direct file access
  - Towards the ARDA prototype…
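The per-event figures quoted for Phase 1 (up to 1 GB per event in about 25 files) translate into file counts and volumes as in the short Python calculation below. The number of events is a hypothetical input chosen for illustration; the talk does not state it.

```python
def phase1_output(n_events, gb_per_event=1.0, files_per_event=25):
    """Upper-bound estimate of Phase 1 output from the per-event figures."""
    n_files = n_events * files_per_event
    total_gb = n_events * gb_per_event
    mean_mb_per_file = 1000.0 * gb_per_event / files_per_event
    return n_files, total_gb, mean_mb_per_file


# 1000 events is an assumed figure, not taken from the talk.
n_files, total_gb, mean_mb = phase1_output(n_events=1000)
```

For the assumed 1000 events this gives 25,000 files of about 40 MB each on average, which shows how quickly the number of files grows compared with their individual size.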
● Up to 1800 CPUs simultaneously under AliEn control
  - 1400 running jobs + 400 saving
  - Two interface sites deployed (one to LCG-2 at CERN, one to GRID.IT at Torino)
  - About half “native AliEn”, half LCG-2+GRID.IT
PDC2004 - Status
● Statistics after round 1 (ended 4 April): job distribution
  - Alice::CERN::LCG is the interface to LCG-2
  - Alice::Torino::LCG is the interface to GRID.IT
AliEn vs. AliEn+LCG
● LCG-2 jobs seen through AliEn MonALISA monitoring
  - Ramp-up slope shows no major performance degradation
[Plot legend: AliEn native site, LCG-2, GRID.IT]
Grid.it starting up
Larger sites filled first
Phase II: Reconstruction
● Will need local storage for all sites
  - So will need use of native LCG storage
● Interface system available, installed on the EIS testbed
  - SRM everywhere would simplify things a lot
  - Use of GUIDs for files also simplified things
● Next week we’ll start trying to use it in production. Site managers, brace for the hit!
Phase III: Analysis
PROOF uses the AliEn Grid File Catalogue and Data Management to map LFNs to a chain of PFNs, and Workload Management to detect which nodes in a cluster can be used in a parallel session
[Diagram labels: USER SESSION, PROOF SERVER, ALIEN API, PROXY CONTROLLER, and a PROXY/MUX at each of SITE A, SITE B and SITE C]
Phase III: Analysis
[Diagram: analysis workflow]
● The user provides an analysis macro and the input files (or a query for input data) to a new TAliEnAnalysis object
● The query produces a list of input data + locations
● Job splitting: IO Objects 1 and 2 for Site A, IO Object 1 for Site B, IO Object 1 for Site C
● Job submission: Job Objects 1 and 2 for Site A, Job Object 1 for Site B, Job Object 1 for Site C
● Execution
● Results: histogram merging, tree chaining
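The splitting and merging steps of the workflow above can be sketched in a few lines of Python. This is not the TAliEnAnalysis implementation: split_by_site and merge_histograms are hypothetical helpers, histograms are plain lists of bin contents, and the file names and sites are invented.

```python
from collections import defaultdict


def split_by_site(file_locations, max_files_per_job):
    """Group input files by the site that holds them, then cut into jobs."""
    by_site = defaultdict(list)
    for lfn, site in file_locations:
        by_site[site].append(lfn)
    jobs = []
    for site in sorted(by_site):
        files = by_site[site]
        for start in range(0, len(files), max_files_per_job):
            jobs.append((site, files[start:start + max_files_per_job]))
    return jobs


def merge_histograms(histograms):
    """Merge per-job histograms with identical binning by summing bin contents."""
    return [sum(bins) for bins in zip(*histograms)]


# Hypothetical catalogue answer: (logical file name, site holding it).
locations = [
    ("f1.root", "A"), ("f2.root", "A"), ("f3.root", "A"),
    ("f4.root", "B"), ("f5.root", "C"),
]
jobs = split_by_site(locations, max_files_per_job=2)
merged = merge_histograms([[1, 0, 2], [0, 3, 1], [4, 1, 0], [2, 2, 2]])
```

With these inputs the split yields two jobs for site A and one each for sites B and C, mirroring the diagram, so each job runs where its data already sits.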
Lessons learned
● Remote site configuration is the major source of problems on the LCG side
  - Software management tools are still rudimentary
  - Large sites often have tighter security restrictions & other idiosyncrasies
  - Investigating and fixing problems is hard and time-consuming
● The most difficult part of the management is monitoring LCG through a “keyhole”
  - Only integrated information available natively
  - MonALISA for AliEn, GridICE for LCG
● For short jobs, submission time (and thus the interface system performance) can limit the number of jobs
  - But the system is inherently scalable
● Our usage of CPUs has been partially limited by lack of storage availability…
  - … but then, there was no usable SE deployed!
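The remark about short jobs can be made quantitative with a simple steady-state estimate: if one interface submits jobs one at a time, the number of jobs running at once cannot exceed the job duration divided by the time per submission. The Python sketch below encodes that estimate; the 30-second submission time and the job lengths are hypothetical values, not measurements from the production.

```python
def max_concurrent_jobs(job_minutes, submit_seconds):
    """Upper bound on simultaneously running jobs fed by one serial submitter."""
    return (job_minutes * 60.0) / submit_seconds


# Hypothetical figures for illustration only.
long_jobs = max_concurrent_jobs(job_minutes=480, submit_seconds=30)
short_jobs = max_concurrent_jobs(job_minutes=20, submit_seconds=30)
```

Under these assumed numbers a single interface could keep up to 960 eight-hour jobs running but only 40 twenty-minute jobs. Adding interface sites raises the bound proportionally, which is one way to read the claim that the system is inherently scalable.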
Conclusions
● Migration to common resources was made smooth by using AliEn as a common front-end, but we could not use shared distributed storage yet
● First GRID production with fully transparent common access to different middlewares (AliEn & LCG)
  - Apart from some glitches, AliEn was robust enough
  - Up to 1800 simultaneous jobs running
  - As of today, more than 53,000 jobs completed
  - Very small crew!
● The LCG Workload Management seems stable enough to manage a production of this scale
  - We were limited by storage issues (e.g. the number of files turned out to be more critical than their size!)
  - Some pauses for fixes and improvements AliEn-side
  - We’ll see about the storage and data transfer infrastructure
● Huge steps forward, but distributed analysis poses many more challenges!