Consorzio COMETA “Progetto PI2S2” FESR

Resource Broker Management
Giuseppe Platania, INFN Catania
Tutorial for Site Administrators, Progetto PI2S2
Messina, 9-11.07.2007

Tuesday, 8 November 2005 - Giuseppe Platania - Tutorial for Site Administrators, Progetto PI2S2, Messina 9-11.07.2007


Outline

– Components
    ● Network Server
    ● Workload Manager
    ● Job Controller
    ● Logging & Bookkeeping
    ● Log Monitor
– Management of RB
– Troubleshooting


RB components: NS

● Authentication
    – the user's CA is authorized
    – the user's DN is in the grid-mapfile
    – the user's certificate has not been revoked by its CA
● Authorization
    – Pool accounts mapping (LCMAPS)
    – Sandbox disk space
    – Size of input sandbox (< MAX_INPUT_SB_SIZE)
    – Creation of the sandbox dir with input files and user proxy


RB components: WM and JC

● Workload Manager
    – Receives the job submission command and puts the user request in the WM queue
    – Matchmaking: CE choice
    – Creates the job file (job wrapper) that is sent to the JC
● Job Controller
    – Submits the job to the chosen CE via Condor-G
    – Copies the input sandbox to the chosen CE


RB components: LB, Locallogger and LM

● Logging & Bookkeeping
    – Logs all job events in its database
● Locallogger
    – Acts as the LB proxy: it stores all job events even if the LB is not working
● Log Monitor
    – Parses the Condor log and writes to the LB database
    – If needed, resubmits the job to the WM queue

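One of the Network Server authentication checks listed above is that the user's DN appears in the grid-mapfile. The following is only a minimal sketch of how an administrator might test that by hand. The function name `dn_in_gridmap` is hypothetical, and the default path `/etc/grid-security/grid-mapfile` and the one-entry-per-line `"<DN>" <account>` layout are assumptions about the local installation.

```shell
#!/bin/sh
# Hypothetical helper (not part of the middleware): succeed if a DN has an
# entry in a grid-mapfile.
# Assumed file layout, one entry per line:  "<DN>" <local account or pool>
dn_in_gridmap() {
    dn=$1
    mapfile=${2:-/etc/grid-security/grid-mapfile}   # assumed default location
    # -F: match the quoted DN as a fixed string, not as a regular expression.
    # -q: print nothing, only set the exit status.
    grep -F -q -- "\"$dn\"" "$mapfile"
}
```

Called as `dn_in_gridmap "<user DN>" && echo mapped || echo missing`, it only reports whether an entry exists. It does not verify the CA or the revocation status.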
Logs

● Log files can be found in /var/edgwl/
    logging
    proxyrenewal
    logmonitor
    SandboxDir
    jobcontrol
    networkserver
    workload_manager
● Init scripts can be found in /etc/init.d/edg-wl-*


Possible job flags

    Flag        Meaning
    SUBMITTED   submission logged in the LB
    WAIT        matchmaking for resources in progress
    READY       job being sent to the executing CE
    SCHEDULED   job scheduled in the CE queue manager
    RUNNING     job executing on a WN of the selected CE queue
    DONE        job terminated without grid errors
    CLEARED     job output retrieved
    ABORT       job aborted by the middleware, check the reason


Management of RB

● Checks to do
    – CA updates and CRL fetching (fetch-crl cron job)
    – VOMS servers' certificates
    – Date (NTP synchronization)
    – Size of SandboxDir
    – MySQL status
    – Daemons status
        for daemon in /etc/init.d/edg-wl-* ; do "$daemon" status ; done
    – Configuration file /opt/edg/etc/edg_wl.conf
        ● Check that the II_Contact string points to the right top BDII
    – GRIS: globus-mds status (test it by running ldapsearch)
● All job status reasons are logged in /var/edgwl/logmonitor/log/events.log

● Suggestions
    – Create a separate partition for the /var/edgwl directory
    – Back up the lbserver20 database every day and store the file on a log server, e.g.
        mysqldump --databases lbserver20 --password=(your password) > `hostname -s`_databases_`date +%y-%m-%d`.sql
    – If needed, remove by hand the old job directories stored under /var/edgwl/SandboxDir (the purge cron job often does not work well)
    – Anyone who wishes to view the full contents of the lbserver20 database can install the phpMyAdmin tool and restrict access to trusted machines (and users)


Troubleshooting


Troubleshooting /1

If the edg-job-submit or edg-job-list-match command returns the following error message:

    **** Error: API_NATIVE_ERROR ****
    Error while calling the "NSClient::multi" native api
    AuthenticationException: Failed to establish security context...
    **** Error: UI_NO_NS_CONTACT ****
    Unable to contact any Network Server

it means that there are authentication problems between the UI and the Network Server.


Solution (I)

● Check your proxy.
● You may not have a valid proxy. Remember to initialize the proxy with the VOMS extensions.

    $ voms-proxy-info --all
    subject  : /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/[email protected]/CN=proxy
    issuer   : /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/[email protected]
    identity : /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/[email protected]
    type     : proxy
    strength : 512 bits
    path     : /tmp/x509up_u512
    timeleft : 11:59:55

  In this output there are no VOMS extensions!

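The proxy check above can be scripted. The sketch below reads the text printed by `voms-proxy-info --all` on standard input. The function names are hypothetical, and the `=== VO <name> extension information ===` header that `proxy_has_voms` looks for is an assumption about how a proxy with VOMS extensions is displayed, so verify it against your own client before relying on it.

```shell
#!/bin/sh
# Hypothetical helpers that parse `voms-proxy-info --all` output from stdin.

# Succeed only if the output contains a VO extension block.
# Assumption: such a block starts with "=== VO <name> extension information ===".
proxy_has_voms() {
    grep -q '^=== VO .* extension information ==='
}

# Print the remaining proxy lifetime in seconds, taken from the first
# "timeleft : HH:MM:SS" line.
proxy_timeleft_seconds() {
    awk '$1 == "timeleft" {
             split($3, t, ":")
             print t[1] * 3600 + t[2] * 60 + t[3]
             exit
         }'
}
```

For example, `voms-proxy-info --all | proxy_has_voms || echo "no VOMS extensions, re-create the proxy"` flags the situation shown on the slide, where the proxy is valid but carries no VO information.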
Solution (II)

Verify the time synchronization between the UI and the WMS.
Check that ntpd is running

    /etc/init.d/ntpd status
    ntpd (pid 1742) is running...

and that the date is correct!


● Inspect the log file /var/edgwl/networkserver/log/events.log

    05 Sep, 17:00:43 -F- "NS2WM::convertProtocol": Converted String: [ arguments = [ ad = [
        requirements = ( other.GlueCEStateStatus == "Production" ) && ( other.GlueCEStateStatus == "Production" );
        RetryCount = 3;
        Arguments = "-f";
        JobType = "normal ";
        Executable = "/bin/hostname";
        CertificateSubject ="/C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/Email=giuseppe.plataniact.infn.it";
        StdOutput = "hostname.out";
        X509UserProxy = "/tmp/user.proxy.0xb74f6768.20060905170043677437";
        OutputSandbox = { "hostname.err","hostname.out" };
        VirtualOrganisation = "gilda";
        rank = other.GlueCEStateEstimatedResponseTime;
        Type = "job";
        StdError = "hostname.err";

    05 Sep, 17:01:49 -F- "Manager::run": Exception Caught during Client Authentication.

  Cause: no CRL installed or CA not supported.

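To see how often the Network Server has rejected clients in this way, one can count the matching lines in its events log. This is a minimal sketch: `ns_auth_failures` is a hypothetical helper name, the default path is the log file named on the slide, and the search string is the message text shown in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: count the client authentication failures recorded in
# the Network Server events log.
ns_auth_failures() {
    logfile=${1:-/var/edgwl/networkserver/log/events.log}
    # grep -c prints the number of matching lines; "|| true" keeps the exit
    # status at zero when the count is 0.
    grep -c 'Exception Caught during Client Authentication' "$logfile" || true
}
```

A count that keeps growing after the CRLs have been refreshed suggests that the cause lies elsewhere, for example in an unsupported CA.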
● Inspect the log file /var/edgwl/networkserver/log/events.log

    05 Sep, 17:00:43 -F- "NS2WM::convertProtocol": Converted String: [ arguments = [ ad = [
        requirements = ( other.GlueCEStateStatus == "Production" ) && ( other.GlueCEStateStatus == "Production" );
        RetryCount = 3;
        Arguments = "-f";
        JobType = "normal ";
        Executable = "/bin/hostname";
        CertificateSubject ="/C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/[email protected]";
        StdOutput = "hostname.out";
        X509UserProxy = "/tmp/user.proxy.0xb74f6768.20060905170043677437";
        OutputSandbox = { "hostname.err","hostname.out" };
        VirtualOrganisation = "gilda";
        rank = other.GlueCEStateEstimatedResponseTime;
        Type = "job";
        StdError = "hostname.err";

    05 Sep, 17:01:49 -F- "Manager::run": Can’t authorize /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/[email protected] .

  Cause: the user DN is not in the grid-mapfile.


Troubleshooting /2

If the edg-job-status command returns

    Aborted reason: Cannot read JobWrapper output, both from Condor and from Maradona

the possible causes are:

● The job did not start:
    – batch system submission problem (e.g. batch system in a bad state)
    – WN disk full
    – home directory absent or unwritable
    – time not synchronized between CE and WN
    – mismatch between forward and reverse DNS for the CE name/IP address
    – WN cannot globus-url-copy from/to the CE
    – WN cannot scp to/from the CE


Troubleshooting /2

● The job did finish:
    – the WN could not do a globus-url-copy to the RB
    – Globus could not send back the job wrapper stdout, e.g. because it was not copied back from the WN to the CE, or because globus-url-copy does not work from the CE to the RB.

This combined set of problems can still have a single cause:
    – a firewall limiting outgoing connections (to ports 20000-25000)
    – some CRLs out of date on both the CE and the WN
    – some CA files absent
    – time (zone) wrong on the CE and the WN


Troubleshooting /2

If the edg-job-status command returns

    *************************************************************
    BOOKKEEPING INFORMATION:
    Printing status info for the Job : https://wn-02-32a.cr.cnaf.infn.it:9000/LHrUgJsLYN4q0VHnJNuz0Q
    Current Status: Aborted
    Status Reason: Cannot plan (a helper failed)
    reached on: Fri Sep 19 10:51:48 2003
    *************************************************************

it means that matchmaking failed because no suitable resources were found for the job.

Troubleshooting /2

● A middleware failure is due to Information Service problems:
    – the service is down
    – the information database is not updated and does not contain all the required information
● Application software unavailable:
    – the JDL requires a wrong/unsupported software version
    – the site does not support the requested software
    – the required version is new and the site has not yet updated the application software area
● A wrong user request occurs when the user asks for:
    – an unsupported CPU type
    – an unsupported operating system
    – unavailable memory
    – an unsupported VO


Link for RB troubleshooting

http://goc.grid.sinica.edu.tw/gocwiki/SiteProblemsFollowUpFaq#head-2c6e726a9368ae7ac0e052ce15fd52c7d3f600ef


Questions…
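As a recap of the two abort reasons discussed in the troubleshooting slides, the sketch below maps the "Status Reason" text reported by edg-job-status to the relevant checklist. `abort_hint` is a hypothetical helper, not a middleware command, and the hint strings simply summarize the causes listed in this tutorial.

```shell
#!/bin/sh
# Hypothetical triage helper: turn an abort reason into a pointer to the
# checklist that covers it in this tutorial.
abort_hint() {
    case "$1" in
        *"Cannot read JobWrapper output"*)
            echo "sandbox transfer: check copies between WN, CE and RB, firewall ports 20000-25000, CRLs, CA files, clocks" ;;
        *"Cannot plan"*)
            echo "matchmaking: check the Information Service, the requested software and the JDL requirements" ;;
        *)
            echo "unknown: inspect /var/edgwl/logmonitor/log/events.log" ;;
    esac
}
```

For instance, `abort_hint "Cannot plan (a helper failed)"` prints the matchmaking hint. Any reason not covered here falls through to the Log Monitor events log named earlier in the tutorial.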