Anaphe - OO Libraries for Data Analysis using C++ and Python
AIDA - Abstract Interfaces for Data Analysis
Gran Sasso Lab, Jul 2002
Andreas Pfeiffer, CERN/IT-API, [email protected]

Outline
• Motivation
• Anaphe Components (C++)
• Lizard: Interactive Data Analysis (Python)
• Software quality control
• Summary

LHC Computing challenge

LHC & The Alps
• Interaction points ~100 m deep
• 27 km circumference

LHC Computing Challenge
• 4 experiments will create a huge amount of data
  • > 1 PetaByte/year for each experiment!
  • 10^15 Bytes = 1,000 TeraBytes
  • 20,000 Redwood tapes
  • 100,000 dual-sided DVD-RAM disks
  • 1,500,000 sets of the Encyclopaedia Britannica (w/o photos)
• Need lots of CPU power to reconstruct/analyse
  • about 1000 PC boxes per experiment (2005 ones!)
  • 40,000 of today’s boxes (dual P-III 800 MHz)
• Complex data models
• Reconstruction s/w is also used for online filtering
  • needs high-quality s/w in order not to waste beam time

Lifetime of LHC software = 25 yrs
• SPS (1969)
• Unix V6, first public version (1975)
• K&R C (1978)
• IBM PC (1981)
• W and Z (1983)
• Ethernet standard (1983)
• C++ (1985)
• LEP (1989)
• WWW
• Linux V 0.01 (1991)
• Intel Pentium (1992)
• Java (1995)
• XML 1.0 (1997)
• LEP ends (2000)

Technology (R)Evolution
• 10 yrs major cycle length (HW, SW, OS)
  • ~12 evolutionary changes in the market
  • 1 revolutionary change
  • towards greater diversity
• Don’t forget changes of requirements
• Consequences
  • s/w written today will most probably be rewritten tomorrow
  • we must anticipate changes

Anaphe: what it is
• Analysis for physics experiments
• Modular (OO/C++) replacement of CERNLIB functionality for use in HEP experiments
  • memory management
  • I/O
  • foundation classes
  • histogramming
  • minimizing/fitting
  • visualization
  • interactive data analysis
• Trying to use standards wherever possible
• Trying to re-use existing class libraries

Anaphe Components
  Data Analysis            Lizard - AIDA
  Custom graphics (2-D)    Qt - Qplotter
  Basic graphics (3-D)     OpenInventor - OpenGL
  Basic math               NAG C
  HEP foundation           CLHEP
  Minimization/Fitting     FML - Gemini
  Histograms               HTL
  Database                 HepODBMS
  Persistency              ODMG/Objectivity DB
  C++                      Standard Libraries

AIDA: Abstract Interfaces for Data Analysis
• next talk

Anaphe components

‘Layered’ Approach
• Basic functionalities (histograms, fitting, etc.)
  are available as individual C++ class libraries
  • Easy to replace one part without throwing away everything
  • Objectivity/DB to provide persistence
  • HepODBMS library (“insulating layer”, “tags”)
  • Histogram library (HTL)
  • Fitting libraries (Gemini, HepFitting)
  • Graphics libraries (Qt, Qplotter)
• Insulate components through Abstract Interfaces
  • “wrapper” layer to implement Interfaces in terms of existing libs
• Apply s/w quality control tools
  • code checking, testing

ANAPHE Components
[Diagram: layered view of the components]
• User Interface, using Abstract Types: Lizard (interactive commands), Python / SWIG
• Abstract types, AIDA (Abstract Interfaces for Data Analysis): Histograms, NTuples, Fitting, Plotting, VectorOfPoints, Functions, Analyzer
• Implementations (HEP-specific): HTL, Tags (HepODBMS), Gemini/HepFitting, Qplotter, VectorOfPoints, CLHEP (Class Libraries for HEP)
• Non-HEP components: Objectivity/DB | HBook, NAG-C | Minuit, Qt (free edition)

Basic 3D Graphic Libraries
• OpenGL (basic graphics)
  • De-facto industry standard for basic 3D graphics
  • Used in CAD/CAE, games, VR, medical imaging
• OpenInventor (scene mgmt.)
  • OO 3D toolkit for graphics
  • Cubes, polygons, text, materials
  • Cameras, lights, picking
  • 3D viewers/editors, animation
  • Based on OpenGL/MesaGL

2D Graphics libraries
• Qt
  • multi-platform C++ GUI toolkit
  • C++ class library, not a wrapper around C libs
  • superset of Motif and MFC
  • available on Unix and MS Windows
    • no change for developer
  • commercial but with public domain version
  • www.troll.no
• Qplotter
  • “add-on” functionality for HEP
  • “HIGZ/HPLOT”

Mathematical Libraries
• NAG (Numerical Algorithms Group) C Library
  • Covers a broad range of functionality
    • Linear algebra
    • differential equations
    • quadrature, etc.
  • Special functions of CERNLIB added to Mark-6 release
    • mostly for theory and accelerator
  • Quality assurance
    • extensive testing done by NAG
  • www.nag.com

CLHEP - foundation classes
• HEP foundation class library
  • Random number generators
  • Physics vectors
    • 3- and 4-vectors
  • Geometry
  • Linear algebra
  • System of units
  • more packages recently added
• will continue to evolve
• wwwinfo.cern.ch/asd/lhc++/clhep/

Histograms: the HTL package
• Histograms are the basic tool for physics analysis
  • Statistical information of density distributions
• Histogram Template Library (HTL)
  • design based on C++ templates
  • Modular: separation between sampling and display
  • Extensible: open for user-defined binning systems
  • Flexible: support transient/persistent at the same time
  • Open: large use of abstract interfaces
  • recent addition: 3D histograms

Fitting and Minimization
• Fitting and Minimization Library (FML)
  • common OO interface
    • NAG-C, MINUIT
  • based on Abstract Interfaces
    • IVector, IModelFunction, …
  • fitting as a special case of minimization
    • minimize “distance” between data and model
  • replacement for HepFitting (and Gemini)
• Gemini
  • common interface to minimizer engine
  • very thin layer

Opening bracket: Persistency

Object persistency
• Two concepts: serial and page I/O
• “Sequential access to objects” (streaming)
  • good in networking context or serial writes to file(s)
  • much like “good old Fortran”
  • often perceived to be “simpler” to implement (“<<”, “>>”)
• “Navigational access to objects” (buffered)
  • I/O on demand for complex data models
  • location transparent (for user) access to object
    • typically by de-referencing of a smart pointer
  • optimized for (random)
    disk access (disks deliver pages)
  • sequential write to file(s) still ok
• Both concepts need to take care of changes in the internal structure of the objects (schema evolution)

Architectural Issue: Persistency (“Object-I/O”)
• Brings a completely new quality into the design
• Objects now have a lifetime
  • don’t “delete” until you are really sure you want to
  • persistency is a kind of “intended memory leak”
  • would like to see no difference between memory and disk
• “Layout” of objects may change during their (extended) life
  • “schema evolution”
  • additions/deletions of attributes
  • changes of inheritance relations

Architectural Issue: Persistency (“Object-I/O”) (II)
• Objects can be placed (“clustering”)
  • de-coupling of logical and physical view of data
• Special care needed to ensure consistency in the data set
  • avoid reading a group of objects (tracks, events, ...) for which writing/updating is not (yet) complete
  • clean up if only part of the objects are written
  • typically taken care of by using transactions
• Complications possible in distributed computing
• Need to protect disk access now like memory access in the past (“Segmentation violation”)

Physical Model and Logical Model
• Physical model may be changed to optimise performance
• Existing applications continue to work transparently!
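The separation of logical and physical view, combined with navigational access through a smart pointer, can be illustrated with a small Python sketch. Everything in it (the PageStore and Ref classes, the recluster method, the page numbers) is hypothetical and invented for this illustration; it is not the Objectivity/DB or HepODBMS API, only a toy model of the idea under those assumptions.

```python
class PageStore:
    """Toy stand-in for a paged object store (hypothetical, for illustration)."""

    def __init__(self):
        self._pages = {}       # page number -> list of stored objects
        self._location = {}    # logical object id -> (page, slot)
        self.page_reads = 0    # counts how often an object was fetched

    def put(self, oid, obj, page):
        """Store obj under logical id oid on the given physical page."""
        slots = self._pages.setdefault(page, [])
        slots.append(obj)
        self._location[oid] = (page, len(slots) - 1)

    def recluster(self, oid, new_page):
        """Move an object to another page; its logical id does not change."""
        page, slot = self._location[oid]
        obj = self._pages[page][slot]
        self._pages[page][slot] = None   # free the old slot
        self.put(oid, obj, new_page)

    def load(self, oid):
        """Resolve the logical id to its current physical location."""
        page, slot = self._location[oid]
        self.page_reads += 1
        return self._pages[page][slot]


class Ref:
    """Smart-pointer-like handle: keeps only the logical id, loads on demand."""

    def __init__(self, store, oid):
        self._store = store
        self._oid = oid
        self._obj = None

    def get(self):
        # I/O happens on first de-reference only; later calls use the cache.
        if self._obj is None:
            self._obj = self._store.load(self._oid)
        return self._obj


store = PageStore()
store.put("track-1", {"pt": 12.5}, page=0)
ref = Ref(store, "track-1")

store.recluster("track-1", new_page=7)   # physical model changes
value = ref.get()["pt"]                  # logical access is unchanged
ref.get()                                # second access does not hit the store
```

The application code only ever holds the logical id, so moving the object between pages does not require any change on the user side.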

Object Model
Thanks to Vincenzo Innocente (CMS)

Physical clustering
Thanks to Vincenzo Innocente (CMS)

Closing bracket: Persistency

“Tags”, Ntuples and Events
• Tags: a special kind of Ntuple
  • Always associated with an underlying persistent store
• Tags may be used to store “ntuple-like” data extracted from all over the event
  • minPt, maxEmiss, nJets, nMuon, trigger, …
• Main use: speed up data selection for analysis …
  • Tag simplifies selection without losing complexity
• Events are more complex than a tree structure (“CWN”)
  • lots of cross-references between classes, containers
• Association from the Tag to the Event may be used to navigate to any other part of the Event
  • even from an interactive visualization program

Anaphe components

Anaphe Internals: (Abstract) Interfaces

AIDA compliance of Anaphe
• Presently (Anaphe 3.x) only AIDA 1.0 compliant
• Plan to implement AIDA 2.2 Interfaces by end 2001 (Anaphe 4.x)
  • initially as wrappers to existing interfaces/packages
• Will maintain 3.x for some time
  • ensures stability for users
• Development will concentrate on 4.x while AIDA will evolve further
• Similar time schedule as
  • JAS (Tony Johnson)
  • OpenScientist (Guy Barrand) already there

Lizard: a tool for Interactive Data Analysis

Interactive Data Analysis
• Aim: “OO replacement for PAW” (at least)
  • analysis of “ntuple-like
    data” (“Tags”, “Ntuples”, …)
  • visualisation of data (Histograms, scatter-plot, “Vectors”)
  • fitting of histograms (and other data)
  • access to experiment specific data/code
• Maximize flexibility and re-use
• Foresee customization/integration
  • allow use from within experiment’s s/w
• Plan for extensions
  • “code for now, design for the future”
• Ensure maintainability
  • use of s/w quality control tools

Scripting - why
• Typical use of scripting is quite different from programming (reconstruction, analysis, ...)
  • history: “go back to where I was before”
  • repetition/looping, with “modifiable parameters”
• Avoid “one size fits all” or “using power-tool as hammer”
  • rapid prototyping in “scripting language”
    • quick turn-around times
  • performance critical code in “core language”
    • exploit richer set of features/functionality (e.g. templates in C++)
• Scripting languages are usually less susceptible to changes than “mainstream languages”
  • potentially longer lifetimes

Python - why
• Python: OO (scripting) language
  • no “strange $!%-variables”
  • sensitive to indentation
  • easier for users than Java
• Lots of user-supplied modules available and ready for use
  • scientific, numerics, graphics, GUI, network, OS, games, DBs, …
  • example: http://www.vex.net/parnassus/
    • Parnassus Totals: 1173 items in 49 categories.
• Also usable in Java (Jython)
  • used in JAS for scripting
  • minimize changes needed within AIDA compliant environments

Python - how
• SWIG to (semi-)automatically create connection to chosen scripting language
  • allows flexibility to choose amongst several scripting languages
    • Python, Perl, Tcl, Guile, Ruby, (Java) …
• Very easy to use
  • swig -c++ -python -shadow -c myClass.h
  • create shared lib from myClass.cpp and myClass_wrap.c
  • start python and “import myClass” to use it
• Very easy to extend
  • simply inherit from “swiggified” class in python
  • modifications can later be fed back into C++
    • performance, type safety, special language features (templates), …

PAW -> Lizard translation
• Ntuple projection, Lizard (the cut can be any valid C++ expression)
    lizard --useHBook
    :-) nt = ntm.findNtuple("higgscand.hbk::cands")
    :-) nplot1D(nt, "mass", "quality==5 && cut > 198")
• Ntuple projection, PAW
    pawX11
    paw> h/file 1 higgscand.hbk
    paw> nt/pl 10.mass quality=5.and.cut>198
• Assuming file higgscand.hbk contains an ntuple with number 10 and title cands

Tutorials and Examples available

Users and Collaborations
• AIDA spoken here!
  • IGUANA (CMS visualization)
  • GAUDI (LHCb/HARP) framework
  • ATHENA (Atlas) framework
  • Analyzer modules in Geant 4
  • JAS
  • Open Scientist
  • …you?

Software quality control

Software quality control
• Using tools for testing/checking has started
  • Insure++, CodeWizard
• Package dependencies: Ignominy
  • Set of perl and shell scripts by Lassi Tuura (CMS)
• Ignominy scans…
  ignominy: dishonour, disgrace, shame; infamy; the condition of being in disgrace, etc.
  (Oxford English Dictionary)
  • Make dependency data produced by the compilers (*.d files)
  • Source code for #includes (resolved against the ones actually seen)
  • Shared library dependencies (“ldd” output)
  • Defined and required symbols (“nm” output)
• And maps…
  • Source code and binaries into packages
  • #include dependencies into package dependencies
  • Unresolved/defined symbols into package dependencies

Ignominy Analysis of Anaphe
• Distribution of tools and utilities for LHC era physics
  • Combination of commercial, free and HEP software
  • Claims to be a toolkit
• Seems to live up to its toolkit claims
  • Good work on modularity
  • Clean design is evident in many places
  • Dependency diagrams often split naturally into functional units
Thanks to Lassi Tuura (CMS)

Package Metrics

Project      Release   Packages  Avg # direct deps  Cycles (pkgs involved)  # levels  ACD*  CCD*   NCCD*  Size
Anaphe       3.6.1     31        2.6                -                       8         5.4   167    1.3    630/170k
ATLAS        1.3.2     230       6.3                2 (92)                  96        70    16211  10     1350k
ATLAS        1.3.7     236       7.0                2 (92)                  97        77    18263  11     1350k
CMS/ORCA     4.6.0     199       7.4                7 (22)                  35        24    4815   3.6    420k
CMS/COBRA    5.2.0     87        6.7                4 (10)                  19        15    1312   2.7    180k
CMS/IGUANA   2.4.2     35        3.9                -                       6         5.0   174    1.2    150/38k
Geant4       4.3.2     108       7.0                3 (12)                  21        16    1765   2.8    680k
ROOT         2.25/05   30        6.4                1 (19)                  22        19    580    4.7    660k

*) John Lakos, Large-Scale C++ Software Design
Size = total amount of source code (not normalised across projects!)
ACD = average component dependency (~ libraries linked in)
CCD = sum of single-package component dependencies over whole release
  • Indicates testing/integration cost
NCCD = measure of CCD compared to a balanced binary tree
  • A good toolkit’s NCCD will be close to 1.0
  • < 1.0: structure is flatter than a binary tree (= independent packages)
  • > 1.0: structure is more strongly coupled (vertical or cyclic)
  • Aim: NCCD ~ 1 for given software/functionality
Thanks to Lassi Tuura (CMS)

Metrics: NCCD vs Cycles
[Figure: scatter plot, toolkits & frameworks. x-axis: Fraction of Packages in Cycles (0% to 70%); y-axis: NCCD (0 to 12). Points: ATLAS, ROOT, ORCA, G4, COBRA, Anaphe, IGUANA. Annotation: “Includes Fortran”.]
• NCCD (“spaghetti index”)
  • 1.0: good toolkit
  • < 1.0: indep. packages
  • > 1.0: strongly-coupled
Thanks to Lassi Tuura (CMS)

History
• Started after CHEP-2000
• Full version out since June 2001
• Established functionality exceeding PAW
  • Analyzer component giving direct access to data and libraries of the experiment framework
  • Based on Abstract Interfaces
    • Flexible and extensible
• Established parallel development of “license free” version while re-using existing libraries
  • Direct reading/writing of HBook files as an alternative to Objectivity/DB based persistency
  • Use of Minuit as a replacement for the minimizer of NAG-C

Ongoing activities
• Persistency
  • De-emphasize Objectivity/DB (in coordination with experiments, IT/DB and LCG)
    • Use of HBook ntuples
    • Text files (using AIDA defined XML format)
  • Planning to use LCG persistency (POOL)
  • Investigating direct reading of ROOT files
• Fitting
  • Implementing minimizer from GSL
• Discussing with the IGUANA team (CMS) to integrate their GUI components
• Looking forward to confirmation and/or re-direction of our efforts following the SC2 (RTAGs)

Future enhancements
• Access to other implementations of components
  • HBOOK CWNtuples
  • Communication with Java tools/packages (JAS, Wired) via AIDA
• Reading of ROOT (> V3.0) files
  • similar to Tony Johnson’s (Java) RootIO package
  • depends on “stability” of Root file format
• AIDA Ntuple/Histo store
  • optimized for Ntuples, Histograms as (compressed) XML
• Adding other “scripting” languages
  • Perl, Tcl, cint?

Challenge: Distributed Computing
• Motivation
  • move code to data
  • parallel analysis
• Techniques
  • services via Abstract Interfaces (AI)
  • late binding
  • plug-in architecture
• End-user (Lizard)
  • look-and-feel of local analysis
• R&D started and first prototype available soon
  • CORBA based

Summary
• The architecture of Anaphe shows some important items for flexible and modular data analysis:
  • weak coupling between components through use of Abstract Interfaces
  • basic functionality is covered by individual C++ class libraries
  • emphasis on usability and maintainability
• Major criteria are flexibility, extensibility and interoperability
  • Recent example: GEANT-4 examples (based on AIDA)
• Lizard is an Interactive Data Analysis Tool based on Anaphe components and the Python scripting language (through SWIG)
• Lizard is young but has a very solid base in mature Anaphe libraries
  • real plug-in structure
• Software quality control is important
  • tools help to optimize dependencies / minimize maintenance effort

More information
• cern.ch/Anaphe
• cern.ch/Anaphe/Lizard
• aida.freehep.org/
• cern.ch/DB
• wwwinfo.cern.ch/asd/lhc++/clhep/

Additional slides

Analysis of Geant4
• Fairly large C++ project
• Very fine-grained (and multi-level)
  package structuring
• Seems quite clean from the preliminary analysis
  • Fine package subdivision helps in many ways but makes analysis and code understanding more complicated
  • One subsystem seems strongly coupled and needs attention
• Need to study the use of the internal command system
Thanks to Lassi Tuura (CMS)

Analysis of ROOT
• ROOT developers have done a formidable job of breaking binary (shared library) dependencies, but…
• For example: by static analysis, nothing seems to use the postscript package directly (no incoming dependencies), but there is this code:

    void TPad::Print (const char *filename, Option_t *option)
    {
      […]
      TVirtualPS *psave = gVirtualPS;
      if (gROOT->LoadClass("TPostScript","Postscript")) return;
      gROOT->ProcessLineFast("new TPostScript()");
      gVirtualPS->Open(psname,pstype);
      gVirtualPS->SetBit(kPrintingPS);
      […]
    }

• Taking these and global objects into account makes the dependency diagrams very different
  • Sign of fast growth? Need a “next evolutionary step”?
  • So “coherent” that replacing parts could get painful…
Thanks to Lassi Tuura (CMS)

Analysis of ROOT…
[Figure: dependency diagrams. Panels: “Binary only”; “Binary + Source + Logical = Real”.]
Thanks to Lassi Tuura (CMS)

Metrics: NCCD vs ACD
[Figure: scatter plot, toolkits & frameworks. x-axis: Av. Component Deps (Fraction of Packages), 0% to 70%; y-axis: NCCD, 0 to 12. Points: ATLAS, ROOT, ORCA, G4, COBRA, IGUANA, Anaphe.]
Thanks to Lassi Tuura (CMS)

Metrics: NCCD vs Size
[Figure: scatter plot, toolkits & frameworks. x-axis: Size (k-lines of source [files]), 0 to 1600; y-axis: NCCD, 0 to 12. Points: ATLAS, ROOT, ORCA, G4, COBRA, IGUANA, Anaphe.]
Thanks to Lassi Tuura (CMS)

Metrics: NCCD vs AID
[Figure: scatter plot, toolkits & frameworks. x-axis: Av. Immediate Deps (Fraction of Packages), 0% to 25%; y-axis: NCCD, 0 to 12. Points: ATLAS, ROOT, ORCA, COBRA, G4, Anaphe, IGUANA.]
Thanks to Lassi Tuura (CMS)

Metrics: Packages vs Size
[Figure: scatter plot, toolkits & frameworks. x-axis: Size (Own Only), 0 to 1600; y-axis: Packages, 0 to 250. Points: ATLAS, ORCA, G4, COBRA, IGUANA, Anaphe, ROOT.]
Thanks to Lassi Tuura (CMS)

Metrics: Packages vs Size
[Figure: scatter plot, toolkits & frameworks. x-axis: Size (All), 0 to 1600; y-axis: Packages, 0 to 250. Points: ATLAS, ORCA, G4, COBRA, IGUANA, Anaphe, ROOT.]
Thanks to Lassi Tuura (CMS)

Example script (ntuple)

    # get list of names of all tuples from tuplemanager
    ntm.listTuples()
    nt1 = ntm.findNtuple("Charm1")   # retrieve tuple by name
    # create 1D histos to project into
    h1 = hm.create1D(10, "mass", 100, 0., 5000.)
    h2 = hm.create1D(20, "mass for pt1>10", 100, 0., 5000.)
    # project the attribute "MASS" into histo h1 without cut ("")
    nt1.project1D(h1, "", "MASS")
    # project the attribute "MASS" into histo h2 with cut ("PT1>10")
    nt1.project1D(h2, "PT1>10", "MASS")
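
What a projection with a cut does can be sketched in plain Python, without Lizard. The function project_1d and the sample rows below are hypothetical stand-ins made up for this illustration; they are not the Lizard or AIDA API, and the cut is passed as a Python callable instead of a string expression. The binning (100 bins from 0 to 5000) mirrors the script above.

```python
def project_1d(rows, attribute, cut, nbins, lo, hi):
    """Fill a fixed-width 1D histogram with `attribute` for rows passing `cut`.

    rows      : iterable of dicts (one dict per ntuple row)
    attribute : name of the column to histogram
    cut       : callable taking a row and returning True/False, or None
    Returns a list of bin counts; values outside [lo, hi) are ignored.
    """
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for row in rows:
        if cut is not None and not cut(row):
            continue                      # row rejected by the cut
        x = row[attribute]
        if lo <= x < hi:
            # guard against rounding pushing the index to nbins
            index = min(int((x - lo) / width), nbins - 1)
            counts[index] += 1
    return counts


# Made-up sample data standing in for the "Charm1" ntuple.
rows = [
    {"MASS": 1250.0, "PT1": 4.0},
    {"MASS": 1300.0, "PT1": 12.0},
    {"MASS": 3100.0, "PT1": 25.0},
    {"MASS": 4990.0, "PT1": 8.0},
]

# No cut: every row is projected.
h1 = project_1d(rows, "MASS", None, 100, 0.0, 5000.0)
# Cut PT1 > 10: only rows passing the cut are projected.
h2 = project_1d(rows, "MASS", lambda r: r["PT1"] > 10, 100, 0.0, 5000.0)
```

Here h1 receives all four rows while h2 only receives the two rows with PT1 above 10, which is the difference between the two project1D calls in the script.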