The SAGA use case FAQ

  1. How to run a job on the Purdue Condor pool using the condor adaptor?
  2. How to run a job on the TG clusters using Condor-G and the condor adaptor?
  3. How to run a Condor-G job on the TG clusters without changing SAGA ini files?

  1. How to run a job on the Purdue Condor pool using the condor adaptor?
    At the time of writing this FAQ, the Purdue Condor pool (tg-condor.purdue.teragrid.org) runs Condor 7.2.3. All command-line tools required by SAGA (condor_submit, condor_rm, condor_q) are located in /opt/condor/bin/. To set up your environment to use Condor, add the appropriate softenv key:
    $ soft add +condor-current
    
    Be aware that the condor_submit you now invoke on the command line is not the original condor_submit but a wrapper located in /opt/condor/wrapper/. The wrapper calls the original condor_submit and appends the "-a" ("-append") option to the command line to specify the user's allocation. SAGA can only use Condor command-line tools from a single directory, which is configured in $SAGA_LOCATION/share/saga/saga_adaptor_condor.ini. The Purdue wrapper therefore causes no problem here, but you do have to tell condor_submit about your allocation yourself. To do that, specify your allocation in the [saga.adaptors.condor_job.default_attributes] section of the ini file mentioned above, e.g.:
    +TGProject = "TG-STA080000N"
    
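    For illustration, the relevant fragment of the ini file would then look roughly like this (the section name is taken from the paragraph above and the allocation value is just the example used throughout this FAQ):
    [saga.adaptors.condor_job.default_attributes]
      +TGProject = "TG-STA080000N"
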
    The SAGA condor adaptor adds two extra parameters to the condor_submit command line, e.g.:
    -append "log = /tmp/saga-condor-log-qToJzf" -append "log_xml = True"
    
    Thanks to that, both SAGA and the user can easily track the state of submitted jobs.
    After the above preparations you should be able to successfully submit a simple job:
    # job_condor.py
    
    import saga
    
    try:
      job_service_url = saga.url("condor://localhost/")
      job_service = saga.job.service(job_service_url)
    
      job_description = saga.job.description()
      job_description.executable = "/bin/hostname"
      job_description.arguments = ["--fqdn"]
    
      my_job = job_service.create_job(job_description)
      my_job.run()
    
      print my_job.get_job_id()
    
    except saga.exception, e:
      print "SAGA Error: ", e
    
    lacinski@tg-condor ~ $ python job_condor.py
    [condor://localhost/]-[2458734]
    
    Once the job is queued, you can check how the job's requirements match the machines in the Condor pool and inspect the log file in the /tmp/ directory:
    lacinski@tg-condor ~ $ condor_q 2458734 -better-analyze
    
    
    -- Submitter: tg-condor.rcac.purdue.edu : <128.211.128.45:53085> : tg-condor.rcac.purdue.edu
    ---
    2458734.000:  Run analysis summary.  Of 1828 machines,
         16 are rejected by your job's requirements
         89 reject your job because of their own requirements
        159 match but are serving users with a better priority in the pool
       1510 match but reject the job for unknown reasons
         27 match but will not currently preempt their existing job
         27 are available to run your job
    
    The Requirements expression for your job is:
    
    ( ( JobUniverse == 7 || JobUniverse == 9 || JobUniverse == 12 ) || ( TGProject isnt undefined ) ) &&
    ( target.Arch == "X86_64" ) && ( target.OpSys == "LINUX" ) &&
    ( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) &&
    ( ( target.HasFileTransfer ) || ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )
    
        Condition                         Machines Matched    Suggestion
        ---------                         ----------------    ----------
    1   ( target.Arch == "X86_64" )       1812                 
    2   ( target.OpSys == "LINUX" )       1828                 
    3   ( target.Disk >= 17 )             1828                 
    4   ( ( 1024 * target.Memory ) >= 17 )1828                 
    5   ( ( target.HasFileTransfer ) || ( TARGET.FileSystemDomain == "rcac.purdue.edu" ) )
                                          1828                 
    
    The following attributes are missing from the job ClassAd:
    
    CheckpointPlatform
    lacinski@tg-condor ~ $ cat /tmp/saga-condor-log-HoWlw6
    <c>
        <a n="MyType"><s>SubmitEvent</s></a>
        <a n="EventTypeNumber"><i>0</i></a>
        <a n="MyType"><s>SubmitEvent</s></a>
        <a n="EventTime"><s>2009-10-08T00:48:56</s></a>
        <a n="Cluster"><i>2458736</i></a>
        <a n="Proc"><i>0</i></a>
        <a n="Subproc"><i>0</i></a>
        <a n="SubmitHost"><s><128.211.128.45:53085></s></a>
    </c>
    lacinski@tg-condor ~ $
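
    Since the log is plain ClassAd XML, it can also be inspected programmatically. The script below is only a minimal sketch (it is not part of SAGA or Condor; the log file name is the example from above, so use the name printed for your own job):
    # saga_condor_events.py

    import re

    LOG = "/tmp/saga-condor-log-HoWlw6"   # example name, adjust to your own log

    # The log is a sequence of <c> ... </c> event records.
    for record in open(LOG).read().split("</c>"):
      event = re.search(r'<a n="MyType"><s>([^<]+)</s></a>', record)
      time = re.search(r'<a n="EventTime"><s>([^<]+)</s></a>', record)
      if event and time:
        print time.group(1), event.group(1)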
    

  2. How to run a job on the TG clusters using Condor-G and the condor adaptor?
    Condor-G, which should be available on all TG clusters, makes it possible to submit jobs to different clusters in the same way as to single nodes in a traditional Condor pool. All requests sent to Condor-G are forwarded by it to the gatekeepers running on the TG clusters. The available pool of gatekeepers can be listed with the condor_status command:
    lacinski@qb1 ~$ condor_status
    
    Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
    
    ncsa.cobalt.debug  LINUX      IA64   Unclaimed Idle     0.000  109967  0+00:00:00
    ncsa.cobalt.extend LINUX      IA64   Unclaimed Idle     0.000  109967  0+00:00:00
    ncsa.cobalt.long   LINUX      IA64   Unclaimed Idle     1.230  109967  0+00:00:00
    ncsa.cobalt.standa LINUX      IA64   Unclaimed Idle     1.601  109967  0+00:00:00
    ncsa.cobalt.workq  LINUX      IA64   Unclaimed Idle     0.000  109967  0+00:00:00
    tacc.lonestar.deve LINUX      X86_64 Unclaimed Idle     0.000  1118433  0+00:00:01
    tacc.lonestar.high LINUX      X86_64 Unclaimed Idle     100.000  1118433  0+00:00:01
    tacc.lonestar.norm LINUX      X86_64 Unclaimed Idle     2.377  1118433  0+00:00:01
    tacc.lonestar.seri LINUX      X86_64 Unclaimed Idle     4.500  1118433  0+00:00:01
    tacc.ranger.develo LINUX      X86_64 Unclaimed Idle     1.118  55308288  0+00:00:00
    tacc.ranger.large  LINUX      X86_64 Unclaimed Idle     100.000  55308288  0+00:00:01
    tacc.ranger.long   LINUX      X86_64 Unclaimed Idle     2.596  55308288  0+00:00:00
    tacc.ranger.normal LINUX      X86_64 Unclaimed Idle     6.809  55308288  0+00:00:00
    tacc.ranger.serial LINUX      X86_64 Unclaimed Idle     0.000  55308288  0+00:00:01
    ncsa.abe.cap1      LINUX      X86_64 Unclaimed Idle     0.000  8901860  0+00:00:00
    ncsa.abe.debug     LINUX      X86_64 Unclaimed Idle     0.000  8901860  0+00:00:00
    ncsa.abe.lincoln   LINUX      X86_64 Unclaimed Idle     0.000  8901860  0+00:00:00
    ncsa.abe.long      LINUX      X86_64 Unclaimed Idle     1.721  8901860  0+00:00:00
    ncsa.abe.normal    LINUX      X86_64 Unclaimed Idle     3.982  8901860  0+00:00:00
    ncsa.abe.wide      LINUX      X86_64 Unclaimed Idle     100.000  8901860  0+00:00:00
    loni-lsu.queenbee. LINUX      X86_64 Unclaimed Idle     1.827  502110  0+00:00:00
    loni-lsu.queenbee. LINUX      X86_64 Unclaimed Idle     0.000  502110  0+00:00:00
    loni-lsu.queenbee. LINUX      X86_64 Unclaimed Idle     2.771  502110  0+00:00:00
    
                         Total Owner Claimed Unclaimed Matched Preempting Backfill
    
              IA64/LINUX     5     0       0         5       0          0        0
            X86_64/LINUX    18     0       0        18       0          0        0
    
                   Total    23     0       0        23       0          0        0
    lacinski@qb1 ~$
    
    A job description file for Condor-G should define "grid" as the universe and, if the user wants to submit the job to a particular cluster, a gatekeeper. If the "grid_resource" attribute is not defined, Condor-G will submit the job to one of all the gatekeepers on the list. Using matchmaking, a user can restrict the job to a subset of gatekeepers (a sketch of this follows the example below). A simple job description file can look like the one below:
    executable = /bin/hostname
    arguments = --fqdn
    output = condor7-gt2.$(CLUSTER).$(PROCESS).out
    error = condor7-gt2.$(CLUSTER).$(PROCESS).err
    notification = NEVER
    universe = grid
    grid_resource = gt2 queenbee.loni-lsu.teragrid.org/jobmanager-pbs
    globus_rsl = (project=TG-STA080000N)(maxWallTime=10)(jobType=single)
    x509userproxy = /home/lacinski/.globus/userproxy.pem
    queue
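
    The matchmaking mentioned above works by letting Condor-G substitute values from the matched gatekeeper ad into the submit file. The fragment below is only a rough sketch: the "gatekeeper_url" attribute name is hypothetical and depends on how the gatekeeper ads are published at your site, so inspect them with "condor_status -l" before relying on it:
    universe      = grid
    # $$() substitutes the contact string from the matched gatekeeper ad;
    # "gatekeeper_url" is a hypothetical attribute name, check the ads with
    # condor_status -l to see what your site actually publishes.
    grid_resource = gt2 $$(gatekeeper_url)
    # Restrict matching to a subset of gatekeepers using attributes that
    # condor_status displays (Arch, OpSys, ...).
    requirements  = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX")
    # remaining attributes (executable, globus_rsl, x509userproxy, ...) as above
    queue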
    
    Not all of these attributes can be set from a SAGA program, but they can be set in the [saga.adaptors.condor_job.default_attributes] section of the saga_adaptor_condor_job.ini file, e.g.:
    # Universe = Vanilla
    universe = grid
    grid_resource = gt2 queenbee.loni-lsu.teragrid.org/jobmanager-pbs
    globus_rsl = (project=TG-STA080000N)(maxWallTime=10)(jobType=single)
    x509userproxy = /home/lacinski/.globus/userproxy.pem
    
    Make sure that a valid credential is stored in the file whose name you assign to the x509userproxy attribute; you can check it with the grid-proxy-info command. The SAGA program itself looks almost the same as before:
    # job_condor.py
    
    import saga
    
    try:
      job_service_url = saga.url("condor://localhost/")
      job_service = saga.job.service(job_service_url)
    
      job_description = saga.job.description()
      job_description.executable = "/bin/hostname"
      job_description.arguments = ["--fqdn"]
      job_description.output = "condor7-gt2.$(CLUSTER).$(PROCESS).out"
      job_description.error = "condor7-gt2.$(CLUSTER).$(PROCESS).err"
    
      my_job = job_service.create_job(job_description)
      my_job.run()
    
      print my_job.get_job_id()
    
    except saga.exception, e:
      print "SAGA Error: ", e
    
    The status of a submitted job can be tracked using the condor_q command and the /tmp/saga-condor-log-* file, the same as for regular Condor jobs.

  3. How to run a Condor-G job on the TG clusters without changing SAGA ini files?
    Both methods presented above have a certain drawback: settings written in SAGA ini files cannot be changed from a program. It would be more convenient to be able to change the grid_resource attribute dynamically. There are several solutions to that problem; one of them, based on wrappers, is presented below. First, create a separate directory that will hold all the Condor command-line tools used by SAGA:
    lacinski@qb1 ~$ mkdir condor
    lacinski@qb1 ~$ cd condor
    lacinski@qb1 ~$ ln -s /usr/local/packages/condor-7.2.1-r1/bin/condor_q .
    lacinski@qb1 ~$ ln -s /usr/local/packages/condor-7.2.1-r1/bin/condor_rm .
    lacinski@qb1 ~$ cat > condor_submit
    #!/bin/bash

    HOME=/home/lacinski
    ATTR=$HOME/condor/condor_attr
    JOB=$HOME/condor/condor_job

    # Start the generated job description with the extra attributes.
    cat $ATTR > $JOB

    # Append the job description read from standard input (the one generated
    # by SAGA), skipping any line that starts with "universe" so it does not
    # override the universe set in $ATTR.
    while read l; do
        if [[ $l == ${l/#universe/} ]]; then
            echo "$l" >> $JOB
        fi
    done

    # Run the original condor_submit with SAGA's "-append" options and the
    # generated job description file.
    exec /usr/local/packages/condor-7.2.1-r1/bin/condor_submit "$1" "$2" "$3" "$4" $JOB
    <Ctrl+d>
    lacinski@qb1 ~$ chmod +x condor_submit
    
    The condor_submit wrapper builds the "condor_job" job description file: it first copies in the entire "condor_attr" file, i.e. all the extra attributes that were previously written in the SAGA condor ini file, and then appends the job description it reads from standard input, dropping its "universe" line so it does not override the one from "condor_attr". At the end the original condor_submit is executed with SAGA's "-append" options and the generated file. Remember that SAGA has to be pointed at this new directory (the directory with the Condor tools is configured in $SAGA_LOCATION/share/saga/saga_adaptor_condor.ini, as described in the first answer) so that it picks up the wrapper instead of the original condor_submit. Now the "condor_attr" file can be created and changed dynamically from a program:
    # job_condor.py
    
    import saga
    import os
    
    try:
      js_url = saga.url("condor://localhost/")
      job_service = saga.job.service(js_url)
    
      job_desc = saga.job.description()
      job_desc.executable = "/bin/hostname"
      job_desc.arguments = ["--fqdn"]
      job_desc.output = "condor7-gt2.$(CLUSTER).$(PROCESS).out"
      job_desc.error = "condor7-gt2.$(CLUSTER).$(PROCESS).err"
    
      home = os.environ.get("HOME")
      attr = open(home + "/condor/condor_attr", "w")
      attr.write("log = condor7-gt2.$(CLUSTER).$(PROCESS).log\n")
      attr.write("universe = grid\n")
      attr.write("grid_resource = gt2 queenbee.loni-lsu.teragrid.org/jobmanager-pbs\n")
      attr.write("globus_rsl = (project=TG-STA080000N)(maxWallTime=10)(jobType=single)\n")
      attr.write("x509userproxy = " + home + "/.globus/userproxy.pem\n")
      attr.close()
    
      my_job = job_service.create_job(job_desc)
      my_job.run()
    
    except saga.exception, e: 
      print "SAGA Error: ", e
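
    Since the wrapper re-reads the "condor_attr" file at every submission, the attributes can now differ from job to job. The program below is only an illustrative sketch (the write_condor_attr() helper and the second gatekeeper contact string are hypothetical, not part of SAGA or Condor); it shows how one program can now submit jobs through different gatekeepers:
    # job_condor_multi.py  (illustrative sketch)

    import saga
    import os

    def write_condor_attr(gatekeeper):
      # Rewrite ~/condor/condor_attr so the wrapper targets the given gatekeeper.
      home = os.environ.get("HOME")
      attr = open(home + "/condor/condor_attr", "w")
      attr.write("log = condor7-gt2.$(CLUSTER).$(PROCESS).log\n")
      attr.write("universe = grid\n")
      attr.write("grid_resource = gt2 " + gatekeeper + "\n")
      attr.write("globus_rsl = (project=TG-STA080000N)(maxWallTime=10)(jobType=single)\n")
      attr.write("x509userproxy = " + home + "/.globus/userproxy.pem\n")
      attr.close()

    try:
      js_url = saga.url("condor://localhost/")
      job_service = saga.job.service(js_url)

      job_desc = saga.job.description()
      job_desc.executable = "/bin/hostname"
      job_desc.arguments = ["--fqdn"]

      # Submit the same job through two gatekeepers; the second contact
      # string is only a placeholder, not a real TG gatekeeper.
      for gk in ["queenbee.loni-lsu.teragrid.org/jobmanager-pbs",
                 "another.gatekeeper.example.org/jobmanager-pbs"]:
        write_condor_attr(gk)
        my_job = job_service.create_job(job_desc)
        my_job.run()
        print my_job.get_job_id()

    except saga.exception, e:
      print "SAGA Error: ", e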