Job monitoring

Once you submit your job request, the job is handed off to Tapis's backend execution service. Your job may run right away, or it may wait in a batch queue on the execution system until the required resources are available. Either way, execution occurs completely asynchronously from submission. To monitor the status of your job, Tapis supports two different mechanisms: polling and webhooks.

Note:
 For the sake of brevity, we placed a detailed explanation of the job lifecycle in a separate, aptly titled post, The Job Lifecycle. There you will find detailed information about how, when, and why everything moves from place to place, and how you can peek behind the curtain.

Polling

If you have ever taken a long road trip with children, you are probably painfully aware of how polling works. Starting several minutes from the time you leave the house, a child asks, "Are we there yet?" You reply, "No." Several minutes later the child again asks, "Are we there yet?" You again reply, "No." This process continues until you finally arrive at your destination. This is called polling, and polling is bad.

Polling for your job status works the same way. After submitting your job, you start a loop that queries the Jobs service for your job status until it detects that the job has reached a terminal state. The following three URLs all return the status of your job: the first returns a list of abbreviated job descriptions, the second a full description of the job with the given $JOB_ID, exactly like the one returned when the job was submitted, and the third a much smaller response object containing only the $JOB_ID and status.

Show curl
curl -sk -H "Authorization: Bearer $ACCESS_TOKEN" https://agave.iplantc.org/jobs/v2/?pretty=true
curl -sk -H "Authorization: Bearer $ACCESS_TOKEN" https://agave.iplantc.org/jobs/v2/$JOB_ID
curl -sk -H "Authorization: Bearer $ACCESS_TOKEN" https://agave.iplantc.org/jobs/v2/$JOB_ID/status

Show json response
{
"id" : "$JOB_ID",
"name" : "$USERNAME-$APP_ID",
"owner" : "$USERNAME",
"appId" : "$APP_ID",
"executionSystem" : "$PUBLIC_EXECUTION_SYSTEM",
"batchQueue": "normal",
"nodeCount": 1,
"processorsPerNode": 16,
"memoryPerNode": 32,
"maxRunTime": "01:00:00",
"archive": false,
"retries": 0,
"localId": "659413",
"created": "2018-01-26T15:08:02.000-06:00",
"lastUpdated": "2018-01-26T15:09:55.000-06:00",
"outputPath": "$USERNAME/$JOB_ID-$APP_ID",
"status": "FINISHED",
"submitTime": "2018-01-26T15:09:45.000-06:00",
"startTime": "2018-01-26T15:09:53.000-06:00",
"endTime": "2018-01-26T15:09:55.000-06:00",
"inputs": {
  "inputBam": [
    "agave://data.iplantcollaborative.org/shared/iplantcollaborative/example_data/Samtools_mpileup/ex1.bam"
  ]
},
"parameters": {
  "nameSort": true,
  "maxMemSort": 800000000
},
"_links": {
  "self": {
    "href": "https://api.tacc.utexas.edu/jobs/v2/$JOB_ID"
  },
  "app": {
    "href": "https://api.tacc.utexas.edu/apps/v2/$APP_ID"
  },
  "executionSystem": {
    "href": "https://api.tacc.utexas.edu/systems/v2/$PUBLIC_EXECUTION_SYSTEM"
  },
  "archiveSystem": {
    "href": "https://api.tacc.utexas.edu/systems/v2/$PUBLIC_EXECUTION_SYSTEM"
  },
  "archiveData": {
    "href": "https://api.tacc.utexas.edu/jobs/v2/$JOB_ID/outputs/listings"
  },
  "owner": {
    "href": "https://api.tacc.utexas.edu/profiles/v2/$USERNAME"
  },
  "permissions": {
    "href": "https://api.tacc.utexas.edu/jobs/v2/$JOB_ID/pems"
  },
  "history": {
    "href": "https://api.tacc.utexas.edu/jobs/v2/$JOB_ID/history"
  },
  "metadata": {
    "href": "https://api.tacc.utexas.edu/meta/v2/data/?q=%7B%22associationIds%22%3A%22462259152402771480-242ac113-0001-007%22%7D"
  },
  "notifications": {
    "href": "https://api.tacc.utexas.edu/notifications/v2/?associatedUuid=$JOB_ID"
  }
}
}

The list of all possible job statuses is given in table 2.

Event Description
PENDING Job accepted and queued for submission
PROCESSING_INPUTS Identifying input files for staging
STAGING_INPUTS Transferring job input data to execution system
STAGED Job inputs staged to execution system
STAGING_JOB Job inputs staged to execution system
SUBMITTING Preparing job for execution and staging binaries to execution system
QUEUED Job successfully placed into queue
RUNNING Job started running
PAUSED Job execution paused by user
CLEANING_UP Job completed execution
ARCHIVING Transferring job output to archive system
ARCHIVING_FINISHED Job archiving complete
ARCHIVING_FAILED Job archiving failed
FINISHED Job complete
KILLED Job execution killed at user request
STOPPED Job execution intentionally stopped
FAILED Job failed
HEARTBEAT Job heartbeat received
CREATED The job was created
UPDATED The job was updated
DELETED The job was deleted
PERMISSION_GRANT User permission was granted
PERMISSION_REVOKE Permission was removed for a user on this job

Table 2. Job statuses listed in progressive order from job submission to completion, followed by terminal states and other job events.

Polling is a simple and effective approach, but it is bad practice for two reasons. First, it does not scale well: querying for one job status every few seconds does not take much effort, but querying for 100 takes quite a bit of time and puts unnecessary load on Tapis's servers. Second, polling provides what is effectively a binary response: it tells you whether a job is done or not done, but gives you no information about what is actually going on with the job or where it is in the overall execution process.

The job history URL provides much more detailed information on the various state changes, system messages, and progress information associated with data staging. The syntax of the job history URL is as follows:

Show curl
curl -sk -H "Authorization: Bearer $ACCESS_TOKEN" https://agave.iplantc.org/jobs/v2/$JOB_ID/history?pretty=true

Show json response
{
"status":"success",
"message":null,
"version":"2.1.0-r6d11c",
"result":[
  {
    "created":"2014-10-24T04:47:45.000-05:00",
    "status":"PENDING",
    "description":"Job accepted and queued for submission."
  },
  {
    "created":"2014-10-24T04:47:47.000-05:00",
    "status":"PROCESSING_INPUTS",
    "description":"Attempt 1 to stage job inputs"
  },
  {
    "created":"2014-10-24T04:47:47.000-05:00",
    "status":"PROCESSING_INPUTS",
    "description":"Identifying input files for staging"
  },
  {
    "created":"2014-10-24T04:47:48.000-05:00",
    "status":"STAGING_INPUTS",
    "description":"Staging agave://$PUBLIC_STORAGE_SYSTEM/$API_USERNAME/inputs/pyplot/testdata.csv to remote job directory"
  },
  {
    "progress":{
      "averageRate":0,
      "totalFiles":1,
      "source":"agave://$PUBLIC_STORAGE_SYSTEM/$API_USERNAME/inputs/pyplot/testdata.csv",
      "totalActiveTransfers":0,
      "totalBytes":3212,
      "totalBytesTransferred":3212
    },
    "created":"2014-10-24T04:47:48.000-05:00",
    "status":"STAGING_INPUTS",
    "description":"Copy in progress"
  },
  {
    "created":"2014-10-24T04:47:50.000-05:00",
    "status":"STAGED",
    "description":"Job inputs staged to execution system"
  },
  {
    "created":"2014-10-24T04:47:55.000-05:00",
    "status":"SUBMITTING",
    "description":"Preparing job for submission."
  },
  {
    "created":"2014-10-24T04:47:55.000-05:00",
    "status":"SUBMITTING",
    "description":"Attempt 1 to submit job"
  },
  {
    "created":"2014-10-24T04:48:08.000-05:00",
    "status":"RUNNING",
    "description":"Job started running"
  },
  {
    "created":"2014-10-24T04:48:12.000-05:00",
    "status":"CLEANING_UP"
  },
  {
    "created":"2014-10-24T04:48:15.000-05:00",
    "status":"FINISHED",
    "description":"Job completed. Skipping archiving at user request."
  }
]
}

Depending on the nature of your job and the reliability of the underlying systems, the response from this service can grow rather large, so it is important to be aware that this query can be an expensive call for your client application to make. Everything we said before about polling job status applies to polling job history with the additional caveat that you can chew through quite a bit of bandwidth polling this service, so keep that in mind if your application is bandwidth starved.

Oftentimes, however, polling is unavoidable. In these situations, we recommend using an exponential backoff to check job status. An exponential backoff is an algorithm that increases the time between successive checks as the number of unsuccessful attempts grows.
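The loop below is a minimal sketch of that idea. To keep it self-contained, the status check is simulated; a real client would replace the simulated check with a call to the status endpoint shown earlier and parse the returned JSON.

```shell
# Sketch of polling with exponential backoff. The status check is simulated
# so the script runs without a live tenant; a real client would instead use:
#   curl -sk -H "Authorization: Bearer $ACCESS_TOKEN" \
#     https://agave.iplantc.org/jobs/v2/$JOB_ID/status
delay=1        # seconds to wait after the first check
max_delay=60   # cap so a long queue wait does not push checks hours apart
checks=0
status="RUNNING"

while true; do
  checks=$((checks + 1))
  # --- simulated status check: pretend the job finishes on the third poll ---
  if [ "$checks" -ge 3 ]; then
    status="FINISHED"
  fi
  # Stop once the job reaches a terminal state.
  case "$status" in
    FINISHED|FAILED|KILLED|STOPPED) break ;;
  esac
  sleep "$delay"
  delay=$((delay * 2))            # double the wait after every check
  if [ "$delay" -gt "$max_delay" ]; then
    delay=$max_delay
  fi
done

echo "polled $checks times; final status: $status"
```

Doubling the delay keeps the load on the Jobs service roughly logarithmic in the job's runtime, while the cap ensures you still notice completion within about a minute of it happening.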

Webhooks

Webhooks are the alternative, and preferred, way for your application to monitor the status of asynchronous actions in Tapis. If you are a Gang of Four disciple, webhooks are a mechanism for implementing the Observer pattern. They are widely used across the web, and chances are that something you are using right now leverages them. In the context of Tapis, a webhook is a URL that you give to Tapis in advance of an event; when that event occurs, Tapis sends a POST request to that URL. A webhook can be any web-accessible URL.

Note:
 For more information about webhooks, events, and notifications in Tapis, please see the Notifications and Events Guides.

The Jobs service provides several template variables for constructing dynamic URLs. Template variables can be included anywhere in your URL by writing the variable name as ${VARIABLE_NAME}. When an event of interest occurs, the variables are resolved and the resulting URL is called.
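As a purely hypothetical illustration (the host, path, and resolved values are placeholders, not part of the API), a templated webhook URL and the URL Tapis would call when the job starts running might look like:

```
# Registered webhook URL, using two template variables:
https://example.com/hooks/jobs?id=${JOB_ID}&status=${JOB_STATUS}

# URL invoked when the RUNNING event fires:
https://example.com/hooks/jobs?id=$JOB_ID&status=RUNNING
```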

The full list of template variables is given in the following table.

Variable Description
UUID The UUID of the job
EVENT The event which occurred
JOB_STATUS The status of the job at the time the event occurs
JOB_URL The url of the job within the API
JOB_ID The unique id used to reference the job within Tapis.
JOB_SYSTEM ID of the job execution system (e.g., ssh.execute.example.com)
JOB_NAME The user-supplied name of the job
JOB_START_TIME The time when the job started running in ISO8601 format.
JOB_END_TIME The time when the job stopped running in ISO8601 format.
JOB_SUBMIT_TIME The time when the job was submitted to Tapis for execution by the user in ISO8601 format.
JOB_ARCHIVE_PATH The path on the archive system where the job output will be staged.
JOB_ARCHIVE_URL The Tapis URL for the archived data.
JOB_ERROR The error message explaining why a job failed. Null if completed successfully.

Table 3. Template variables available for use when defining webhooks for your job.
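To subscribe a webhook, you include a notifications array in your job request. The sketch below assumes the Tapis v2 job request format; the URL is a placeholder, and "*" subscribes to every event.

```
"notifications": [
  {
    "url": "https://example.com/hooks/jobs?id=${JOB_ID}&status=${JOB_STATUS}",
    "event": "*",
    "persistent": true
  }
]
```

Setting persistent to true keeps the subscription alive after the first matching event fires, so your webhook is called for every event in the job's lifecycle rather than just the first.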

Email

In situations where you do not have a persistent web address, or access to a backend service, you may find it more convenient to subscribe for email notifications rather than providing a webhook. Tapis supports email notifications as well. Simply specify a valid email address in the url field of your job submission's notification object, and an email will be sent to that address when a relevant event occurs. A sample email message is given below.
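A minimal sketch of such a notification object (the address is a placeholder, and the field names assume the same Tapis v2 request format used for webhooks):

```
"notifications": [
  {
    "url": "jdoe@example.com",
    "event": "FINISHED",
    "persistent": false
  }
]
```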