Consorzio COMETA “Progetto PI2S2”
FESR
Gestione del Resource Broker
Giuseppe Platania, INFN Catania
Tutorial for Site Administrators, Progetto PI2S2
Messina, 9-11.07.2007
Tuesday, 8 November 2005
Giuseppe Platania - Tutorial for Site Administrators, Progetto PI2S2, Messina 9-11.07.2007
Outline
– Components
  ● Network Server
  ● Workload Manager
  ● Job Controller
  ● Logging & Bookkeeping
  ● Log Monitor
– Management of RB
– Troubleshooting
RB components: NS
● Authentication
  – the user's CA is authorized
  – the user's DN is in the grid-mapfile
  – the user's certificate has not been revoked by its CA
● Authorization
  – Pool account mapping (LCMAPS)
  – Sandbox disk space
  – Size of the input sandbox (< MAX_INPUT_SB_SIZE)
  – Creation of the sandbox directory with the input files and the user proxy
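
The grid-mapfile check listed under Authentication can be reproduced by hand. The following is a minimal sketch, assuming the usual grid-mapfile layout (one double-quoted DN per line, followed by a local or pool account name). The function name `dn_in_gridmap` and the default path are illustrative assumptions, not something taken from the RB itself.

```shell
# Hypothetical helper: succeed if the DN appears as a quoted entry
# in a grid-mapfile.
# Usage: dn_in_gridmap "<DN>" [grid-mapfile]
dn_in_gridmap() {
    dn=$1
    gridmap=${2:-/etc/grid-security/grid-mapfile}   # assumed default location
    # -F matches the DN literally; the surrounding quotes make sure the
    # whole DN matches and not just a prefix of a longer one.
    grep -qF -- "\"$dn\"" "$gridmap"
}
```

Called as `dn_in_gridmap "<user DN>" && echo mapped`, it gives a quick yes/no answer before looking at the Network Server log.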
RB components: WM and JC
● Workload Manager
  – Receives the job submission request and puts it in the WM queue
  – Matchmaking: choice of the CE
  – Creation of the job file (job wrapper) that is sent to the JC
● Job Controller
  – Job submission to the chosen CE via Condor-G
  – Copy of the input sandbox to the chosen CE
RB components: LB, Locallogger and LM
● Logging & Bookkeeping
  – Logs all job events in its database
● Locallogger
  – Acts as a proxy for the LB: it stores all job events even if the LB is not working
● Log Monitor
  – Parses the Condor log and writes to the LB database
  – If needed, resubmits the job to the WM queue
Logs
● Log files can be found in /var/edgwl/
  – logging
  – proxyrenewal
  – logmonitor
  – SandboxDir
  – jobcontrol
  – networkserver
  – workload_manager
● Init scripts can be found in /etc/init.d/edg-wl-*
Possible job flags

Flag        Meaning
SUBMITTED   submission logged in the LB
WAITING     matchmaking for resources in progress
READY       job being sent to the executing CE
SCHEDULED   job scheduled in the CE queue manager
RUNNING     job executing on a WN of the selected CE queue
DONE        job terminated without grid errors
CLEARED     job output retrieved
ABORTED     job aborted by the middleware, check the reason
Management of RB
● Checks to do
  – CA updates and CRL fetching (fetch-crl cron job)
  – VOMS servers' certificates
  – Date (NTP synchronization)
  – Size of SandboxDir
  – MySQL status
  – Daemons status
    ( for daemon in /etc/init.d/edg-wl-* ; do "$daemon" status ; done )
  – Configuration file /opt/edg/etc/edg_wl.conf
    Check that the II_Contact string points to the right top BDII
  – GRIS: globus-mds status (test it by running ldapsearch)
● The status reasons of all jobs are logged in /var/edgwl/logmonitor/log/events.log
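
The daemon status check can be wrapped so that it prints only the services that need attention. This is a sketch under one stated assumption: that each edg-wl-* init script returns a non-zero exit code from `status` when its daemon is stopped (the common init-script convention, which should be verified on the RB). The function name `check_wl_daemons` is hypothetical.

```shell
# Hypothetical helper: run "status" on every edg-wl-* init script and
# report the ones whose exit code is not 0.
# Usage: check_wl_daemons [init-dir]
check_wl_daemons() {
    initdir=${1:-/etc/init.d}
    failed=0
    for script in "$initdir"/edg-wl-*; do
        [ -x "$script" ] || continue      # also skips an unexpanded glob
        if ! "$script" status >/dev/null 2>&1; then
            echo "NOT RUNNING: ${script##*/}"
            failed=1
        fi
    done
    # 0 if every daemon answered "status" successfully, 1 otherwise
    return "$failed"
}
```

Because the function returns 1 as soon as one daemon fails, it can be called from a cron job that sends mail only on failure.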
● Suggestions
  – Create a separate partition for the /var/edgwl directory
  – Back up the lbserver20 database every day and store the file on a log server
    (e.g. mysqldump --databases lbserver20 --password=(your password) > `hostname -s`_databases_`date +%y-%m-%d`.sql )
  – If needed, remove by hand the old job directories stored under /var/edgwl/SandboxDir (the purge cron job often does not work well)
  – Anyone who wishes to view the full contents of the lbserver20 database can install the phpMyAdmin tool and restrict access to trusted machines (and users)
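
Before removing old job directories by hand, it helps to list the candidates first. A minimal sketch, assuming that the job directories sit directly under the sandbox root and that 30 days is an acceptable retention period. Both are assumptions to adapt to the site (adjust `-mindepth`/`-maxdepth` if the directories are nested deeper); the function name `list_old_sandboxes` is hypothetical.

```shell
# Hypothetical helper: print the directories older than N days so they can
# be reviewed before removal. It only lists; nothing is deleted.
# Usage: list_old_sandboxes [sandbox-root] [days]
list_old_sandboxes() {
    root=${1:-/var/edgwl/SandboxDir}
    days=${2:-30}                          # assumed retention period
    # -mtime +N selects entries last modified more than N days ago
    find "$root" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" -print
}
```

Listing and deleting are kept as separate steps on purpose, since a directory may still belong to a job whose output has not been retrieved.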
Troubleshooting
Troubleshooting /1
If the edg-job-submit or edg-job-list-match command returns the following error message:
    **** Error: API_NATIVE_ERROR ****
    Error while calling the "NSClient::multi" native api
    AuthenticationException: Failed to establish security context...
    **** Error: UI_NO_NS_CONTACT ****
    Unable to contact any Network Server
it means that there are authentication problems between the UI and the Network Server.
Solution (I)
• Check your proxy.
• You may not have a valid proxy. Remember to initialize the proxy with the VOMS extensions.
    $ voms-proxy-info --all
    subject  : /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/[email protected]/CN=proxy
    issuer   : /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/[email protected]
    identity : /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/[email protected]
    type     : proxy
    strength : 512 bits
    path     : /tmp/x509up_u512
    timeleft : 11:59:55
  No VOMS extensions!
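
Spotting a proxy without VOMS extensions can be scripted. The sketch below assumes that, when extensions are present, `voms-proxy-info --all` prints an attribute section containing a line of the form `VO : <name>`. The output above has no such line. The function name `has_voms_extension` is hypothetical, and the exact output layout should be checked against the installed client.

```shell
# Hypothetical helper: read the output of "voms-proxy-info --all" on stdin
# and succeed only if it contains a VOMS attribute section.
has_voms_extension() {
    # Assumption: with VOMS extensions the output has a line "VO : <name>".
    grep -Eq '^VO[[:space:]]*:'
}
```

On a UI it would be used as `voms-proxy-info --all | has_voms_extension || echo "No VOMS extensions"`, after which the proxy can be recreated with `voms-proxy-init --voms <vo>`.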
Solution (II)
Verify the time synchronization between the UI and the WMS.
Check that ntpd is running:
    /etc/init.d/ntpd status
    ntpd (pid 1742) is running...
and that the date is correct!
• Inspect the log file /var/edgwl/networkserver/log/events.log
    05 Sep, 17:00:43 -F- "NS2WM::convertProtocol": Converted String: [ arguments =
    [ ad = [ requirements = ( other.GlueCEStateStatus == "Production" ) && ( other.GlueCEStateStatus == "Production" );
    RetryCount = 3; Arguments = "-f"; JobType = "normal"; Executable = "/bin/hostname";
    CertificateSubject = "/C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/Email=giuseppe.plataniact.infn.it";
    StdOutput = "hostname.out"; X509UserProxy = "/tmp/user.proxy.0xb74f6768.20060905170043677437";
    OutputSandbox = { "hostname.err","hostname.out" }; VirtualOrganisation = "gilda";
    rank = other.GlueCEStateEstimatedResponseTime; Type = "job"; StdError = "hostname.err";
    05 Sep, 17:01:49 -F- "Manager::run": Exception Caught during Client Authentication.
  Cause: no CRL installed, or the user's CA is not supported.
• Inspect the log file /var/edgwl/networkserver/log/events.log
    05 Sep, 17:00:43 -F- "NS2WM::convertProtocol": Converted String: [ arguments =
    [ ad = [ requirements = ( other.GlueCEStateStatus == "Production" ) && ( other.GlueCEStateStatus == "Production" );
    RetryCount = 3; Arguments = "-f"; JobType = "normal"; Executable = "/bin/hostname";
    CertificateSubject = "/C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/[email protected]";
    StdOutput = "hostname.out"; X509UserProxy = "/tmp/user.proxy.0xb74f6768.20060905170043677437";
    OutputSandbox = { "hostname.err","hostname.out" }; VirtualOrganisation = "gilda";
    rank = other.GlueCEStateEstimatedResponseTime; Type = "job"; StdError = "hostname.err";
    05 Sep, 17:01:49 -F- "Manager::run": Can't authorize /C=IT/O=GILDA/OU=Personal Certificate/L=INFN Catania/CN=Giuseppe Platania/[email protected] .
  Cause: the user DN is not in the grid-mapfile.
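
In both log excerpts the decisive line is the one tagged "Manager::run", while the long "NS2WM::convertProtocol" dump only echoes the job description. A small sketch for pulling out just those lines; the function name `ns_auth_failures` is hypothetical and the filter relies only on the tag visible in the excerpts.

```shell
# Hypothetical helper: print only the "Manager::run" lines of the Network
# Server log, where the authentication/authorization failures are reported.
# Usage: ns_auth_failures [logfile]
ns_auth_failures() {
    log=${1:-/var/edgwl/networkserver/log/events.log}
    # -F: treat the tag as a fixed string, quotes included
    grep -F '"Manager::run"' "$log"
}
```

If a message wraps over several lines in the log, only its first line is printed, which is normally enough to tell the two causes apart.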
Troubleshooting /2
If the edg-job-status command returns the Aborted reason:
    Cannot read JobWrapper output, both from Condor and from Maradona
• the job did not start:
  • batch system submission problem (e.g. batch system in a crazy state)
  • WN disk full
  • home directory absent or unwritable
  • time not synchronized between CE and WN
  • mismatch between forward and reverse DNS for the CE name/IP address
  • WN cannot globus-url-copy from/to the CE
  • WN cannot scp to/from the CE
Troubleshooting /2
• the job did finish:
  • the WN could not do a globus-url-copy to the RB
  • Globus could not send back the job wrapper stdout, e.g. because it was not copied back from the WN to the CE, or because globus-url-copy does not work from the CE to the RB. This combined set of problems can still have a single cause:
    - a firewall limiting outgoing connections (to ports 20000-25000)
    - some CRLs out of date on both CE and WN
    - some CA files absent
    - time (zone) on CE and WN wrong
Troubleshooting /2
If the edg-job-status command returns the Aborted reason:
    *************************************************************
    BOOKKEEPING INFORMATION:
    Printing status info for the Job : https://wn-02-32a.cr.cnaf.infn.it:9000/LHrUgJsLYN4q0VHnJNuz0Q
    Current Status: Aborted
    Status Reason: Cannot plan (a helper failed)
    reached on: Fri Sep 19 10:51:48 2003
    *************************************************************
it means that the matchmaking failed because no suitable resources were found for the job.
Troubleshooting /2
• middleware failure due to Information Service problems:
  • the service is down
  • the information database is not updated and does not contain all the required information
• application software unavailable:
  • the JDL requires a wrong/unsupported software version
  • the site does not support the requested software
  • the required version is new and the site has not yet updated the application software area
• wrong user request, which occurs when the user asks for:
  • an unsupported CPU type
  • an unsupported operating system
  • unavailable memory
  • an unsupported VO
Link for RB troubleshooting
http://goc.grid.sinica.edu.tw/gocwiki/SiteProblemsFollowUpFaq#head-2c6e726a9368ae7ac0e052ce15fd52c7d3f600ef
Questions…