LHC Computing Referee Report
WLCG
INFNGRID
Tier1
Tier2
Francesco Forti, Università e INFN – Pisa
For the referee group:
F. Bossi, C. Bozzi, R. Carlin, R. Ferrari, F.F., M. Morandin, M. Taiuti

LCG Comprehensive Review

LHCC of September 2006

- Two days of LCG review (LCG Phase 2 = WLCG)
- Referee presentations:

LCG Organisation – Phase 2

[Organisation chart, reconstructed as a list:]
- LHC Committee (LHCC): scientific review
- Computing Resources Review Board (C-RRB): funding agencies
- Collaboration Board (CB): experiments and regional centres
- Overview Board (OB)
- Management Board (MB): management of the project
  - Architects Forum: coordination of common applications
  - Grid Deployment Board: coordination of grid operation
- Activity areas: Physics Applications Software; Distributed Analysis & Grid Support; Grid Deployment; Computing Fabric

WLCG Infrastructure

- Based on two major science grid infrastructures:
  - EGEE (Enabling Grids for E-science): phase 2 approved after the last CR, funded until Apr 2008
  - OSG (Open Science Grid): 5-year funding cycle pending approval with DOE/NSF, (positive) decision expected in a few months
- At the time of the 2005 CR, interoperability between the grids was a major concern
  - This issue has been worked on in the meantime: authentication, job submission, and mass storage access across grids show progress, though no common interface is in sight
- [Charts: jobs per day on the EGEE grid, Jun 2005 - Aug 2006, broken down by VO (alice, atlas, cms, lhcb, geant4, dteam, non-LHC), on a 0-60 k jobs/day scale; jobs per day on the OSG grid: ~10 k]

Metrics & Monitoring

- Monitoring of availability & reliability has been a major milestone
  - For T-1 centers it is now done regularly (fails at some sites)
  - Still below the MoU level (~74%)
- Monitoring of job failures at the application level is much harder
  - Experiment dashboards; analysis of job logs, still much manual work
  - A reliable automated system for job failure classification is not around the corner
  - Key point for sustained reliability → should be pursued with priority

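Why automated failure classification is "not around the corner" can be made concrete with a toy sketch. This is illustrative only: the log signatures, category names and rules below are hypothetical, not taken from the experiment dashboards. Rule-based matching covers only the patterns someone has already encoded, which is why so much manual log analysis remains.

```python
import re

# Hypothetical log signatures -> failure categories (not from any real dashboard).
FAILURE_PATTERNS = [
    (re.compile(r"srm.*timeout|castor.*unavailable", re.I), "mass storage"),
    (re.compile(r"no compatible resources|match.*failed", re.I), "brokering"),
    (re.compile(r"proxy.*expired|authentication failed", re.I), "security/proxy"),
    (re.compile(r"segmentation fault|exit code [1-9]", re.I), "application"),
]

def classify(log_text: str) -> str:
    """Return the first matching failure category, else 'unclassified'."""
    for pattern, category in FAILURE_PATTERNS:
        if pattern.search(log_text):
            return category
    return "unclassified"  # the bucket that still needs a human

# Tally categories over a batch of job logs.
logs = ["SRM request timeout on put", "user proxy expired at 03:12"]
counts: dict[str, int] = {}
for text in logs:
    label = classify(text)
    counts[label] = counts.get(label, 0) + 1
print(counts)  # {'mass storage': 1, 'security/proxy': 1}
```
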
Accounting: CERN + T-1s

- Full accounting data for CERN + T-1s available since ~4 months
  - Allows comparison with installed & pledged resources
- Monthly use relatively low
  - Related to the present use pattern (testing/commissioning/challenges)
- No indication that performance bottlenecks may be due to resource limitations

Impact of Schedule Change

- Reminder: running scenarios assumed for the TDR requirements:
  - 50 days of physics in 2007
  - 10^7 s pp + 10^6 s AA in subsequent years
- New scenario after the revision of the schedule:
  - Experiments will provide revised estimated requirements by the beginning of October → WLCG & funding agencies
  - Preliminary (non-endorsed) numbers exist from ALICE, ATLAS & LHCb
- NOTE: the revision process is still ongoing

- Assuming the preliminary numbers (!) from the experiments' revised requirements estimate:
  - The shortfall of 13.9 MCHF for phase 2 (as of Apr 2006) is reduced to 3.4 MCHF

WLCG Personnel

- Much depends on a suitable successor project to EGEE-II from Apr 2008 onwards
  - 15 FTEs at stake at CERN alone
  - Similarly crucial for external centers
- → This is a point of concern. WLCG should strive for consolidation into a more structural project, in particular also at the level beyond T-0.

Commissioning Schedule

- Still an ambitious programme ahead
- Timely testing of the full data chain from DAQ to T-2 was a major item from the last CR
  - DAQ → T-0 still largely untested

Middleware

- Very significant progress during the last year on middleware and grid activities by the different experiments.
- A system is in place and works in scheduled production periods. It has been used by the experiments and, if/when stable and reliable, it should meet their needs. Robustness and stability are now the key to making sure the system survives heavy (unscheduled) use as LHC startup approaches.
- Many important aspects are still not fully accomplished (remote site monitoring, accounting, job priorities & user tools), yet essential in a realistic system for a running experiment.
- Fundamental to allocate the required level of manpower beyond 2008 to maintain basic functionality, user support, upgrades and interoperability among grids.
  - Interoperability is essential to make use of all available resources

EGEE Middleware Development

- gLite 3.0
  - Successfully deployed in May 2006
  - Debugging of the various components still continuing
  - Reliability, reliability, reliability
  - 50% of resources spent on user support of the existing infrastructure and software bug fixing
- Current activities (triggered by the experiments)
  - Security
  - Data Management
  - Usage Accounting
  - Job Priorities (new GP-Box project… one-year time scale)
- Job priorities: absolutely non-trivial in a decentralized system. The experiments should carefully develop and manage this (perhaps starting from existing examples in running experiments).

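To illustrate why job priorities are non-trivial in a decentralized system, here is a minimal fair-share sketch (purely hypothetical; it does not reproduce GP-Box or any real scheduler policy): each site ranks VOs from its own local usage history, so two sites with identical agreed shares can rank the same VOs differently unless shares and usage accounting are synchronized grid-wide.

```python
# Toy fair-share priority: rank VOs by how far their recent usage falls
# below their agreed share. Illustrative only; not GP-Box or a real scheduler.

def priorities(shares: dict, usage: dict) -> dict:
    """shares: VO -> agreed fraction; usage: VO -> recent CPU-hours."""
    total = sum(usage.values()) or 1.0
    # A VO that used less than its share gets boosted, and vice versa.
    return {vo: shares[vo] - usage.get(vo, 0.0) / total for vo in shares}

# Two sites with the same agreed shares but different local usage histories
# rank the VOs differently -- the crux of decentralized priorities.
shares = {"alice": 0.25, "atlas": 0.35, "cms": 0.25, "lhcb": 0.15}
site_a = {"alice": 900.0, "atlas": 100.0, "cms": 0.0, "lhcb": 0.0}
site_b = {"alice": 0.0, "atlas": 0.0, "cms": 800.0, "lhcb": 200.0}

for name, usage in [("site A", site_a), ("site B", site_b)]:
    prio = priorities(shares, usage)
    order = sorted(prio, key=prio.get, reverse=True)
    print(name, order)  # highest-priority VO first; the two orders differ
```
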
Application Area Projects

- SPI – Software process infrastructure (A. Pfeiffer)
  - Software and development services: external libraries, Savannah, software distribution, support for build, test, QA, etc.
- ROOT – Core Libraries and Services (R. Brun)
  - Foundation class libraries, math libraries, framework services, dictionaries, scripting, GUI, graphics, SEAL libraries, etc.
- POOL – Persistency Framework (D. Duellmann)
  - Storage manager, file catalogs, event collections, relational access layer, conditions database, etc.
- SIMU – Simulation project (G. Cosmo)
  - Simulation framework, physics validation studies, MC event generators, Garfield, participation in Geant4, Fluka.

AA Example - PROOF

[Chart: PROOF relative speed-up]

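For reference, the relative speed-up plotted for PROOF is the standard parallel-scaling quantity; a generic definition (not taken from the missing figure):

```latex
% T(N): wall-clock time of the same analysis on N worker nodes
S(N) = \frac{T(1)}{T(N)}, \qquad 0 < S(N) \le N,
\quad S(N) = N \ \text{for ideal scaling}
```
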
AA – CONCLUSIVE REMARKS 1/2

- Lots of work
- The Simulation project
  - Important progress and achievements
  - Managerial difficulties due to the project fragmentation
  - Some difficulties in interfacing some Monte Carlo generators to the LCG simulation infrastructure
- ROOT project
  - Properly managed; appropriate manpower resources
  - Achievements: consolidation, fast access to data
  - Merging of SEAL (Shared Environment for Applications at LHC) progressing successfully
  - Important progress of PROOF, a powerful tool to extend ROOT to run on a distributed, heterogeneous system
    - ALICE, CMS and LHCb are expressing interest in using PROOF
    - Clear decisions by the experiments are needed

AA – CONCLUSIVE REMARKS 2/2

- Persistency framework project
  - Key ingredient for LHC computing
  - Difficult to assess the progress level
- Important effort by AA to keep the link with the experiments and the users strong and effective
  - LCG Generator monthly meetings
  - Architects Forum, AA meetings every 2 weeks
  - Savannah portal
- Manpower
  - Present level globally very near to the needs
  - Some reassignment can cure the limitations affecting individual projects
  - Possible manpower crisis in 2008 (retirements and contract ends)
  - Appropriate action should be taken in 2007 to guarantee an adequate manpower level in 2008 and beyond

Computing fabric - CERN

- T0 and CAF are well on track
  - Still slightly underfunded despite recent improvements
- Impressive empty space in the computer center
  - Building, cooling and power upgrades planned as required
- T0 well understood
  - Demonstrated capabilities in a full-scale ATLAS test
  - But the aggregate capacity for 4 experiments has not been demonstrated
- CAF requirements still not well defined
  - CERN Analysis Facility or Calibration and Alignment Facility?
  - Experiments need to deliver well in advance → keep in mind purchasing cycles of 6+ months
- Storage systems have improved performance
  - Still adding features, need ongoing attention
- Manpower tight; need perspective with the EGEE successor
- Scalability: still an order of magnitude to go
  - CASTOR2 and the Directory service are critical

Computing fabric – storage

- Storage Resource Manager (SRM) v2.2
  - WLCG standard storage interface, defined in May 2006
  - A hybrid between 2.1 and 3.0
  - Implementation is essential for the LCG service
- Castor2
  - Deployment at T0 successful, well integrated
  - Inherent performance problems hit ATLAS and CMS, fix underway
  - Tier sites had problems → high support load for CERN
  - Review in June positive towards the project, but: "many years of […] periods of operational distress"
- dCache
  - Project manpower has improved: 1 FTE for dCache user support now
  - No clear deadline for implementing SRM v2.2, but seems to be on track
  - Community support: OSG fund their own requests
- DPM – Disk Pool Manager
  - In widespread use at 50+ smaller sites
  - Will be late in implementing SRM v2.2
  - Serious manpower troubles
  - Not an issue for T0 and CAF; indirect issue for T1s (transfers to/from T2s with DPM)

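The role of SRM as the single WLCG storage interface over Castor2, dCache and DPM can be sketched abstractly. A minimal illustration, assuming hypothetical method names and URL mappings (real SRM v2.2 is a web-service protocol, with calls such as srmPrepareToGet and srmPrepareToPut, not a Python API): experiments code against one contract, and each storage system supplies its own implementation.

```python
# "One interface, many storage backends" -- the idea behind SRM v2.2.
# Method names loosely echo the spec; everything here is a toy model.
from abc import ABC, abstractmethod

class StorageInterface(ABC):
    @abstractmethod
    def prepare_to_get(self, surl: str) -> str:
        """Stage a file and return a transfer URL."""

    @abstractmethod
    def prepare_to_put(self, surl: str) -> str:
        """Allocate space and return a transfer URL for upload."""

class CastorBackend(StorageInterface):
    def prepare_to_get(self, surl):
        return f"rfio://castor{surl}"    # hypothetical URL mapping
    def prepare_to_put(self, surl):
        return f"rfio://castor{surl}"

class DPMBackend(StorageInterface):
    def prepare_to_get(self, surl):
        return f"gsiftp://dpm{surl}"     # hypothetical URL mapping
    def prepare_to_put(self, surl):
        return f"gsiftp://dpm{surl}"

def copy(src: StorageInterface, dst: StorageInterface, path: str) -> None:
    # An experiment's transfer code sees only the common interface.
    print(f"{src.prepare_to_get(path)} -> {dst.prepare_to_put(path)}")

copy(CastorBackend(), DPMBackend(), "/grid/atlas/file.root")
```
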
Fabric – Distributed Deployment of Databases (3D)

- Provides database infrastructure and replication
  - e.g. for detector conditions and geometry, file catalogues, event tags
- Initially set up at CERN and 6 "phase 1" Tier1 sites
  - To do: monitoring and (at some sites) backup systems
- Replication performance sufficient for conditions and catalogues
  - T0->T1 replication at 50% of T0->T0 rates; more optimisation possible
- Moving from R&D to service phase
- Experiment data rates and access patterns still not fully known
  - All experiments are testing real applications on the real infrastructure
  - Tier1 resources should be adequate
- CNAF was one of the first sites online for 3D

TIER1

[Chart: Availability of WLCG Tier-1 Sites + CERN, August 2006; daily availability and reliability per site from SAM monitoring. Sites shown: CERN-PROD, FZK-LCG2, IN2P3-CC, INFN-T1, RAL-LCG2, SARA-MATRIX, Taiwan-LCG2, TRIUMF-LCG2, PIC, USCMS-FNAL-WC1; BNL and NDGF (availability/reliability n/a) were not integrated into the Site Availability Monitoring (SAM) system and are not included in the overall averages. Monthly availability per site ranges from ~4% to ~97%.]
- Average (all sites): availability 74%, reliability 75%
- Average (8 best sites): availability 85%, reliability 86%
- Target: 88% (site colour coding flags averages below 90% of target)
- Site availability and reliability as agreed in the WLCG MB on 11 July 2006; scheduled interruptions are excluded when calculating reliability
- All sites assumed up while SAM had problems on 1, 3 and 4 August
- At one site, SAM tests fail due to a dCache function failure that does not affect CMS jobs; the problem is understood and is being worked on

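A minimal sketch of the bookkeeping behind the chart legend (illustrative; the real SAM aggregation and test definitions are not reproduced): availability counts all time in the denominator, while reliability excludes scheduled interruptions.

```python
# Availability vs. reliability in the spirit of the SAM-based WLCG metric.
# States per hour (hypothetical encoding):
#   "up"        = SAM tests passed
#   "down"      = tests failed (unscheduled)
#   "scheduled" = scheduled interruption

def availability(hours: list) -> float:
    return hours.count("up") / len(hours)

def reliability(hours: list) -> float:
    # Scheduled downtime is excluded from the denominator.
    unscheduled = [h for h in hours if h != "scheduled"]
    return hours.count("up") / len(unscheduled)

# Example month: 600 h up, 100 h unscheduled down, 44 h scheduled down.
month = ["up"] * 600 + ["down"] * 100 + ["scheduled"] * 44
print(f"availability: {availability(month):.0%}")  # 81%
print(f"reliability:  {reliability(month):.0%}")   # 86%
```
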
Tier1 Issues and recommendations…

- Tier1s need to know the consequences of the schedule change
  - Required resource changes affect procurement
- Tier1s must be fully integrated in the experiments' planning and decision process
- Communication with the experiments is vital to bridge the "culture gap"
  - Recommend liaison officers in both the Tier1s and the experiments
    - Meet regularly
    - Tier1 liaisons should attend experiment computing meetings
  - Experiment monitoring should be available to the Tier1s

…Tier1 Issues and recommendations

- 24x7 operation
  - Still not at all Tier1s
  - Requires an "on-call" service
    - Can never have all experts on call all the time
    - Not all problems are resolvable by the on-call responsibles
    - Can reduce outages, but some are still unavoidable
  - Coordinate with the experiments to avoid scheduling outages at multiple Tier1s at the same time
- Stability of middleware is crucial
  - Both problems and upgrades lead to down-time
  - Especially an issue with core MW upgrades
  - Developers need to concentrate on reliability over functionality, and on very well-tested releases

7x24 Operations (K. Woller's view)

- What we have: 1 FTE = 230 days x 8 hours = 1840 h/year
- What people suggest: 24/7 expert service = 8760 h/year = 4.8 FTE
- There's no way to have experts 24x7 (x52)
- Need to design services to survive trivial failures
  - Commercially available load balancers may help
- Need to design for increased reaction times
  - By building service-level redundancy where possible
- For rare complex problems, an "on duty" coordinator may help getting the required experts together fast

Tier2 Summary

- Tier2s are a very diverse bunch
  - 400-2500 CPUs, 50-800 TB, 1-4 experiments (also non-LHC)
  - 1-13 staff FTE (mostly ~5); mostly 1 Gb/s network, and no MSS (tape)
  - Most Tier2s participated in SC4 - critical for the experiments
  - Funding uncertainties
- Some Tier2s are federations
  - Up to 8 geographical sites
  - Mostly 1 CE/SE per site (i.e. the middleware sees them as separate)
  - Share experience and some services, allow small sites to participate
  - Can work well, but requires close cooperation
- Collaboration with the "local" Tier1 is essential
  - Data transfers
  - Tier1 can provide advice and perhaps some services
  - CMS Europe: not enough Tier1s for all Tier2s
- 2/3 of Tier2s rely on DPM
  - Concern for support and future compatibility (e.g. SRM 2.2)
  - The DPM support team is undermanned

Service Challenge

- Old data, new limit
  - Like atrazine (when the data do not reach the limit, change the limit)
- [Chart: data transfer rates; old target 1.6 GB/s, new target 1.3 GB/s]

SC4 - What were the problems?

- No simple answer
  - Many, many individual one-off problems were mentioned
  - Little quantitative information was presented
- Many reports of instabilities
  - T1 sites (ATLAS reports that all 9 T1s were simultaneously available for only a few hours per month)
  - Hardware failures
  - SRM/mass storage: Castor/dCache
  - File catalogues
  - Site differences: firewalls, badly configured nodes/sites
  - EGEE software: file access (GFAL), file transfer (FTS)

SC4 – How to improve?

- Many comments that manual intervention was required
  - "Heroic efforts"; "at the limit of what the system can do"
  - Have to live with this level of problems; just get more efficient at overcoming them when they occur
  - Castor is a notable exception
- Need for communication improvements and problem reporting between the sites
  - Error reporting, tutorials, phone meetings, workshops, Wikis, etc.
  - He sees this as the way to improve performance and reliability
- However, must also put a lot of effort into bug fixing
  - Not "sexy"; may need to push to keep the effort in the right direction
  - Effectively a division of effort between maintenance and development
  - Important to get the balance of effort right here

SC4 – Other Comments

- Experiments will not ramp up to nominal rates by Jul 07
  - E.g. ATLAS simulation is x10 below right now
  - Most are aiming for this around early 2008
  - No direct DAQ output has been included yet
  - Hence, the service commissioning period will not be based on realistic loads
  - Should commissioning targets be relaxed for 2007, given the LHC schedule? Only makes sense if it frees up effort to use elsewhere; not clear if true
- Almost all service performance reported as data transfer rates
  - Obviously critical to get data out, both for storage and analysis
  - Some information given on job performance
  - Very little on CPU usage efficiency; CPU seems to be underutilised
- Scheduled outages can be worse than unscheduled ones
  - They hit more than one site simultaneously
  - More than one item tends to be removed from service
- Overall: a usable albeit imperfect service

Coordination and communication

[Diagram (Alberto Aimar, CERN – LCG): coordination meeting structure, reconstructed as a list:]
- ECM, Experiments Coordination Meeting: SC Team with EIS + experiments; feedback, issues with services, resource requirements
- SCM, Services Coordination Meeting: SC Team with the services; status of services, progress, service updates, changes of requirements
- OPS, Operations Meeting: SC Team with the sites; operating issues, site reports

Service coordination

- Meeting structure set up to ensure communication
- Service Coordination Meetings should ideally be held regularly at each Tier1 site in addition to CERN
- Clear need for a service operation coordinator who acts as a central collection point for everything that is going on
  - Make sure experiment and site representatives have enough authority
  - Discussion on the length of term for the operation coordinator appointment: it should be reasonably long (>2-3 months)
- Need to continue to increase the involvement of remote sites in the decision, planning, and monitoring process
  - Develop realistic plans and adhere to them
  - Convince remote sites that the plans are real
  - Keep everybody in the system consistently informed
  - Be careful to keep the bureaucracy under control and the reporting load at acceptable levels

SUMMARY
Middleware/deployment/Service Challenges

- Stability needs to be improved. No new functionality; need a stable, running service
- Experiments need to start using all the features of gLite 3 to find the new problems
- Need to keep developers fixing bugs and making the system stable rather than developing nice new functionality
- Analysis of job failure rates still needs improvement
- The user support model needs to be revisited
  - Maybe a first line of defense internal to the experiment
- Target performance goals not quite reached
- Continuous unattended operation still a long way off
- A full-scale test of the entire chain starting from the experiments' DAQ is still missing

SUMMARY
Fabric

- Technologically there doesn't seem to be an issue
  - Some scalability issues with LSF and service machines
- CERN T0 still needs to demonstrate the full aggregated capacity for 4 experiments
- CASTOR2 still an issue - critical item
  - Is it going to be supported in the long term? If yes, it needs manpower.
  - 24x7 operation and staffing at external sites very difficult
- SRM 2.2
  - Essential, but not yet ready nor deployed
  - dCache a bit late in developing SRM 2.2
  - Mixed level of readiness
- DPM - essential for small sites
  - Is the manpower sufficient?
  - Issue of external sites support
- To PROOF or not to PROOF
  - Encourage the experiments to take a clear stand on whether they want it, since it has broad implications.

SUMMARY
Management and global issues

- Involvement of external sites has improved, but keep going
  - Communication, communication, communication
- Experiments' involvement is essential
  - At CERN as well as at the Tier1 sites
- Staffing problem if there is no EGEE-III
  - How to make the transition to structural operation staffing
- The modification of the LHC schedule somewhat reduces the gap between needed and available resources. There should be no temptation for the funding agencies to reduce the level of funding.

In Italy: INFN GRID

- EGEE operation: the Italian Regional Operation Center guarantees
  - The day-to-day running of the European e-infrastructure
  - Support for many VOs (Virtual Organizations) on the same multi-science infrastructure
- Development and maintenance of middleware: gLite
  - Ensuring the evolution of the open-source grid middleware towards international standards: OMII Europe
  - Availability of the MW in an efficient repository: ETICS
- Participation in computing R&D activities
  - GRIDCC (real-time applications and apparatus control)
  - BioinfoGRID (bioinformatics; coordinated by CNR)
  - LIBI (MIUR; bioinformatics in Italy)
  - Cyclops (civil protection)
- Coordinating the worldwide expansion of EGEE
  - EUMedGrid, Eu-IndiaGrid (MoU…)
  - EUChinaGrid (Argo…), EELA (LHC-B…)
- Supporting the enlargement of EGEE to new scientific communities
  - Gilda: dissemination activities
- Ensuring the future sustainability of the e-infrastructures through consortia et al.
  - At the EU level: EGEE II -> EGI
  - At the national level: IGI
  - At the EU middleware level: OMII EU
  - At the national middleware level: c-OMEGA
- Coordinating the participation in the Open Grid Forum (formerly GGF)

CNAF

- Focal point of all the INFNGRID activities
  - Manpower funded by the projects
- Tier1 for the LHC experiments
  - Operational and fully used
  - INFN manpower severely lacking, both for operations and for the infrastructure upgrade
- Referees' recommendation:
  - Concentrate on the Core GRID activities needed for the computing of the LHC experiments
  - A delicate matter, given the contributions to international projects already approved by INFN
- Development plan still under discussion. Elements to be defined:
  - Needs of the experiments in 2007-2008
  - Expandability of CNAF in 2007
    - The existing infrastructure shows severe limits both in cooling and in electrical distribution
    - Urgent interventions planned in parallel with the complete infrastructure upgrade (which will not be finished before spring 2008)
    - The resources pledged to WLCG for 2007 do not appear attainable

Resources provided vs. pledged

[Chart: resources provided vs. pledged]

Allocation of CNAF resources

- Management committee
  - Group charged with defining the allocation of resources and the other operational choices of the centre
- The new coordinator of the experiments' requests is Umberto Marconi (thanks Paolo)
- Requests from CSN2 experiments
  - Argo, Virgo, Pamela, Opera, Magic
  - Mostly disk space: critical because the experiments are taking data
    - To be used in a systematic and continuous way
    - Disk purchases to be favoured; disk also consumes less power...
  - Meeting with Commissione II tomorrow
- All the experiments that compute at CNAF should be refereed within the same group
  - Currently BaBar and CDF are handled separately

Tier2

- The Tier2s are funded for 2007 exclusively with SJ funds.
- The experiments are preparing a detailed plan of activities for 2007, so as to define the release of the SJ funds
  - Ready towards the end of the year
- The projects of Roma (Torino) and Pisa were examined in detail
  - CCR refereeing report (26/10/06): R. Gomezel, M. Morandin, L. Pellegrino, R. Ricci, R. Stroili
- In general, considerable progress has been made
  - Usage efficiency and collaboration among the TIER2s have grown considerably

T2-Roma

- The project has reached a globally adequate level of technical detail, sufficient to allow the transition to the executive phase.
- The Roma group has exploited existing local expertise, has relied on the LNL service, and will moreover use an already identified specialist company for the executive design.
- The technical question that proved hardest to solve, namely the placement of the external condenser units, now seems resolved.
- The critical technical questions have been addressed, and the technical options presented do not carry particular risks.
- We note the lack of a preliminary design document for the electrical systems, which was instead made available for the air-conditioning part.
- To do:
  - Preliminary design of the critical-failure management system
  - Integration of the room's fire-suppression system into the installations
  - Updated design of the local network

T2-Pisa

- The project has undergone major changes with respect to its first formulation. In its current version, with the transfer of part of the machines to a new room, considerable simplifications and savings have been obtained.
- The main elements of the project have been defined, and the proposed technical solutions seem in general adequate.
- The documents provided examine the critical aspects in detail but, to serve as the basis for an executive design, they should be further supplemented with the information that is currently missing.
- The group relies on the technologists available in the Sezione and does not plan to involve external professionals to draw up the executive design.
- To do:
  - Air conditioning:
    - Detailed document for contracting out the design
    - Optimisation and redundancy of the system
    - Critical-failure analysis
  - Electrical systems:
    - Table of the electrical loads including all users (SNS; Department)
    - Reconsider the power margins, which seem very tight
    - Analysis of the reliability of the system
    - Reconsider the choice of not using a UPS for part of the machines

TIER2 plan

- The detailed TIER2 plan will be examined in January
  - We want to fund activities, not tables
- For the TIER2s still SJ (Milano and Pisa)
  - As said in September, the technical conditions are reasonably satisfied
    - (At times they work better than the approved Tier2s)
  - The reference communities are active
  - The general question of the actual need for LHC computing remains
  - It is necessary to fund the SJ T2s too, at a level sufficient to survive

Conclusions

- The tangle of LHC computing is certainly intricate
- The GRID has shown that it can potentially solve the problem, even if its performance is not yet sufficient
- Compared to the past, greater emphasis on reliability and availability rather than on new features
- INFN is centrally placed in this activity and contributes a great deal
- It is essential to resolve CNAF's infrastructure difficulties as soon as possible, so that it can operate at full speed
- It is necessary to focus all forces on the realization of LHC computing, even if this may limit other interesting activities
- The Sezioni are a vital reservoir of ideas and people to advance the programme, and they must be fully involved