Execution Systems¶
In contrast to storage systems, execution systems specify compute resources where application binaries can be run. In addition to the storage
attribute found in storage systems, execution systems also have a login
attribute describing how to connect to the remote system to submit jobs as well as several other attributes that allow Tapis to determine how to stage data and run software on the system. The full list of execution system attributes is given in the following tables.
Name | Type | Description |
---|---|---|
available | boolean | Whether the system is currently available for use in the API. Unavailable systems will not be visible to anyone but the owner. This differs from the status attribute in that a system may be UP, but not available for use in Tapis. Defaults to true |
description | string | Verbose description of this system. |
environment | String | List of key-value pairs that will be added to the environment prior to execution of any command. |
executionType | HPC, CONDOR, CLI | Required: Specifies how jobs should be submitted to the system. HPC and CONDOR will leverage a batch scheduler. CLI will fork processes. |
id | string | Required: A unique identifier you assign to the system. A system id must be globally unique across a tenant and cannot be reused once deleted. |
maxSystemJobs | integer | Maximum number of jobs that can be queued or running on a system across all queues at a given time. Defaults to unlimited. |
maxSystemJobsPerUser | integer | Maximum number of jobs that can be queued or running on a system for an individual user across all queues at a given time. Defaults to unlimited. |
name | string | Required: Common display name for this system. |
queues | JSON Array | An array of batch queue definitions providing descriptive and quota information about the queues you want to expose on your system. If not specified, no other system queues will be available to jobs submitted using this system. |
scheduler | LSF, LOADLEVELER, PBS, SGE, CONDOR, FORK, COBALT, TORQUE, MOAB, SLURM, CUSTOM_LSF, CUSTOM_LOADLEVELER, CUSTOM_PBS, CUSTOM_SGE, CUSTOM_CONDOR, CUSTOM_COBALT, CUSTOM_TORQUE, CUSTOM_MOAB, CUSTOM_SLURM, UNKNOWN | Required: The type of batch scheduler available on the system. This only applies to systems with executionType HPC or CONDOR. The CUSTOM_* version of each scheduler provides a mechanism for you to override the default scheduler directives added by Tapis and explicitly add your own through the customDirectives field in each of the batchQueue definitions for your system. |
scratchDir | string | Path to use for a job scratch directory. This value is the first choice for creating a job's working directory at runtime. The path will be resolved relative to the rootDir value in the storage config if it begins with a "/", and relative to the system homeDir otherwise. |
site | string | The site associated with this system. Primarily for logical grouping. |
startupScript | String | Path to a script that will be run prior to execution of any command on this system. The path will be a standard path on the remote system. A limited set of system macros are supported in this field: rootDir, homeDir, systemId, and workDir. The standard set of runtime job attributes are also supported. Between the two sets of macros, you should be able to construct distinct paths per job, user, and app. Any environment variables defined in the system description will be added after this script is sourced. If this script fails, output will be logged to the .agave.log file in your job directory. Job submission will still continue regardless of the exit code of the script. |
status | UP, DOWN, MAINTENANCE, UNKNOWN | The functional status of the system. Systems must be in UP status to be used. |
storage | JSON Object | Required: Storage configuration describing the storage config defining how to connect to this system for data staging. |
type | STORAGE, EXECUTION | Required: Must be EXECUTION. |
workDir | string | Path to use for a job working directory. This value will be used if no scratchDir is given. The path will be resolved relative to the rootDir value in the storage config if it begins with a "/", and relative to the system homeDir otherwise. |
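Tying these attributes together, a minimal execution system definition might look like the following sketch. The hostnames, credentials, and paths are placeholders, not working values:

```json
{
  "id": "demo.execute.example.com",
  "name": "Example SSH Execution Host",
  "type": "EXECUTION",
  "executionType": "HPC",
  "scheduler": "SLURM",
  "maxSystemJobs": 50,
  "maxSystemJobsPerUser": 5,
  "scratchDir": "/scratch",
  "workDir": "/work",
  "queues": [
    {
      "name": "normal",
      "maxJobs": 10,
      "maxUserJobs": 5,
      "maxNodes": 16,
      "maxMemoryPerNode": "64GB",
      "maxProcessorsPerNode": 12,
      "maxRequestedTime": "24:00:00",
      "customDirectives": null,
      "default": true
    }
  ],
  "login": {
    "host": "execute.example.com",
    "port": 22,
    "protocol": "SSH",
    "auth": {
      "username": "systest",
      "password": "changeit",
      "type": "PASSWORD"
    }
  },
  "storage": {
    "host": "execute.example.com",
    "port": 22,
    "protocol": "SFTP",
    "rootDir": "/",
    "homeDir": "/home/systest",
    "auth": {
      "username": "systest",
      "password": "changeit",
      "type": "PASSWORD"
    }
  }
}
```

The remainder of this section walks through the login, queue, and directory attributes shown here in more detail.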
The startupScript attribute¶
Every time Tapis establishes a connection to an execution system, local or remote, it will attempt to source the startupScript provided in your system definition. The value of startupScript may be an absolute path on the system (e.g., “/usr/local/bin/common_aliases.sh”, “/home/nryan/.bashrc”) or a path relative to the physical home directory of the account used to authenticate to the system (e.g., “.bashrc”, “.profile”, “agave/scripts/startup.sh”).
The startupScript field supports the use of template variables which Tapis will resolve at runtime before establishing a connection. If you would prefer to specify the startup script as a virtualized path on the system, prepend ${SYSTEM_ROOT_DIR} to the path. If the system will be made public, you can specify a file relative to the home directory of the calling user by prefixing your startupScript value with ${SYSTEM_ROOT_DIR}/${SYSTEM_HOME_DIR}/${USERNAME}.
A full list of the variables available is given in the following table.
Variable | Description |
---|---|
SYSTEM_ID | ID of the system (ex. ssh.execute.example.com) |
SYSTEM_UUID | The UUID of the system |
SYSTEM_STORAGE_PROTOCOL | The protocol used to move data to and from this system |
SYSTEM_STORAGE_HOST | The storage host for this system |
SYSTEM_STORAGE_PORT | The storage port for this system |
SYSTEM_STORAGE_RESOURCE | The system resource for iRODS systems |
SYSTEM_STORAGE_ZONE | The system zone for iRODS systems |
SYSTEM_STORAGE_ROOTDIR | The virtual root directory exposed on this system |
SYSTEM_STORAGE_HOMEDIR | The home directory on this system, relative to the SYSTEM_STORAGE_ROOTDIR |
SYSTEM_STORAGE_AUTH_TYPE | The storage authentication method for this system |
SYSTEM_STORAGE_CONTAINER | The object store bucket in which the rootDir resides |
SYSTEM_LOGIN_PROTOCOL | The protocol used to establish a session with this system (e.g. SSH, GSISSH). *NOTE: OpenSSH keys are not supported.* |
SYSTEM_LOGIN_HOST | The login host for this system |
SYSTEM_LOGIN_PORT | The login port for this system |
SYSTEM_LOGIN_AUTH_TYPE | The login authentication method for this system |
SYSTEM_OWNER | The username of the user who created the system |
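For example, on a public system you could source each user's own shell profile by following the prefix convention described above. The `.bashrc` filename here is illustrative, not required:

```json
{
  "startupScript": "${SYSTEM_ROOT_DIR}/${SYSTEM_HOME_DIR}/${USERNAME}/.bashrc"
}
```

Because ${USERNAME} resolves to the calling user at runtime, each user's own profile is sourced rather than that of the account used to authenticate.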
executionType | scheduler | Description |
---|---|---|
HPC | LSF, LOADLEVELER, PBS, SGE, COBALT, TORQUE, MOAB, SLURM | Jobs will be submitted to the local scheduler using the appropriate scheduler commands. Systems with this execution type will not allow forked jobs. |
CONDOR | CONDOR | Jobs will be submitted to the condor scheduler running locally on the remote system. Tapis will not do any installation for you, so the setup and administration of the Condor server is up to you. |
CLI | FORK | Jobs will be started as a forked process and monitored using the system process id. |
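For example, an on-demand VM host that runs jobs as forked processes would pair the two values like so. This is a minimal fragment, not a complete system definition:

```json
{
  "executionType": "CLI",
  "scheduler": "FORK"
}
```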
When you are describing your system, consider the policies put in place by your system administrators. If the system you are defining has a scheduler, chances are they want you to use it.
Defining batch queues¶
Tapis supports the notion of multiple submit queues. On HPC systems, queues should map to actual batch scheduler queues on the target server. Additionally, Tapis uses queues as a mechanism for implementing quotas on job throughput within a given queue or across an entire system. Queues are defined as a JSON array of objects assigned to the queues attribute. The following table summarizes all supported queue parameters.
Name | Type | Description |
---|---|---|
name | string | Arbitrary name for the queue. This will be used in the job submission process, so it should line up with the name of an actual queue on the execution system. |
maxJobs | integer | Maximum number of jobs that can be queued or running within this queue at a given time. Defaults to 10. -1 for no limit |
maxUserJobs | integer | Maximum number of jobs that can be queued or running by any single user within this queue at a given time. Defaults to 10. -1 for no limit |
maxNodes | integer | Maximum number of nodes that can be requested for any job in this queue. -1 for no limit |
maxProcessorsPerNode | integer | Maximum number of processors per node that can be requested for any job in this queue. -1 for no limit |
maxMemoryPerNode | string | Maximum memory per node for jobs submitted to this queue in ###.#[E|P|T|G]B format. |
maxRequestedTime | string | Maximum run time for any job in this queue given in hh:mm:ss format. |
customDirectives | string | Arbitrary text that will be appended to the end of the scheduler directives in a batch submit script. This could include a project number, system-specific directives, etc. |
default | boolean | True if this is the default queue for the system, false otherwise. |
Configuring quotas¶
In the batch queues table above, several attributes exist to specify limits on the number of total jobs and user jobs in a given queue. Corresponding attributes exist in the execution system to specify limits on the number of total and user jobs across an entire system. These attributes, when used appropriately, can be used to tell Tapis how to enforce limits on the concurrent activity of any given user. They can also ensure that Tapis will not unfairly monopolize your systems as your application usage grows.
If you have ever used a shared HPC system before, you should be familiar with batch queue quotas. If not, the important thing to understand is that they are a critical tool to ensure fair usage of any shared resource. As the owner/administrator for your registered system, you can use the batch queues you define to enforce whatever usage policy you deem appropriate.
Consider an example where you are using a VM to run image analysis routines on demand through Tapis. Your server will become memory bound and experience performance degradation if too many processes run at once. To avoid this, you can use a batch queue configuration to limit the number of simultaneous tasks that can run on your server.
Another example where quotas can be helpful is in properly partitioning your system resources. Consider a user analyzing unstructured data. The problem is computationally and memory intensive. To preserve resources, you could create one queue with a moderate maxJobs value and conservative maxMemoryPerNode, maxProcessorsPerNode, and maxNodes values to allow good throughput of small jobs. You could then create another queue with large maxMemoryPerNode, maxProcessorsPerNode, and maxNodes values while only allowing a single job to run at a time. This gives you both high throughput and high capacity on a single system.
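A sketch of such a pair of queues follows. The names and limit values are illustrative; tune them to your own hardware and usage policy:

```json
[
  {
    "name": "high_throughput",
    "maxJobs": 50,
    "maxUserJobs": 10,
    "maxNodes": 1,
    "maxMemoryPerNode": "4GB",
    "maxProcessorsPerNode": 4,
    "maxRequestedTime": "01:00:00",
    "customDirectives": null,
    "default": true
  },
  {
    "name": "high_capacity",
    "maxJobs": 1,
    "maxUserJobs": 1,
    "maxNodes": 64,
    "maxMemoryPerNode": "2TB",
    "maxProcessorsPerNode": 16,
    "maxRequestedTime": "48:00:00",
    "customDirectives": null,
    "default": false
  }
]
```

Many small jobs flow through the first queue, while the second reserves the system's full memory and node count for one large job at a time.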
The following sample queue definitions illustrate some other interesting use cases.
{
  "name": "short_job",
  "mappedName": null,
  "maxJobs": 100,
  "maxUserJobs": 10,
  "maxNodes": 32,
  "maxMemoryPerNode": "64GB",
  "maxProcessorsPerNode": 12,
  "maxRequestedTime": "00:15:00",
  "customDirectives": null,
  "default": true
}
System login protocols¶
As with storage systems, Tapis supports several different protocols and mechanisms for job submission. We already covered scheduler and queue support. Here we illustrate the different login configurations possible. For brevity, only the value of the login JSON object is shown.
{
  "host": "execute.example.com",
  "port": 22,
  "protocol": "SSH",
  "auth": {
    "username": "systest",
    "password": "changeit",
    "type": "PASSWORD"
  }
}
The full list of login configuration options is given in the following table. We omit the details of the login.auth and login.proxy attributes as they are identical to those used in the storage config.
Attribute | Type | Description |
---|---|---|
auth | JSON object | Required: A JSON object describing the default login authentication credential for this system. |
host | string | Required: The hostname or ip address of the server where the job will be submitted. |
port | int | The port number of the server where the job will be submitted. Defaults to the default port of the protocol used. |
protocol | SSH, GSISSH, LOCAL | Required: The protocol used to submit jobs for execution. *NOTE: OpenSSH Keys are not supported. |
proxy | JSON Object | The proxy server through which Tapis will tunnel when submitting jobs. Currently, proxy servers use the same authentication mechanism as the target server. |
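A sketch of a login config that tunnels through a gateway host follows. The hostnames are placeholders, and the proxy object is assumed to have the same shape as in the storage config:

```json
{
  "host": "compute.internal.example.com",
  "port": 22,
  "protocol": "SSH",
  "proxy": {
    "name": "Example gateway",
    "host": "gateway.example.com",
    "port": 22
  },
  "auth": {
    "username": "systest",
    "password": "changeit",
    "type": "PASSWORD"
  }
}
```

Here Tapis would open a tunnel through gateway.example.com to reach the otherwise unreachable internal compute host, authenticating to both with the same credential.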
Scratch and work directories¶
In the Job Management tutorial we will dive into how Tapis manages the end-to-end lifecycle of running a job. Here we point out two relevant attributes that control where data is staged and where your job will physically run. The scratchDir and workDir attributes control where the working directories for each job will be created on an execution system. The following table summarizes the decision-making process Tapis uses to determine where the working directories should be created.
rootDir value | homeDir value | scratchDir value | Effective system path for job working directories |
---|---|---|---|
/ | / | — | / |
/ | / | / | / |
/ | / | /scratch | /scratch |
/ | /home/nryan | — | /home/nryan |
/ | /home/nryan | / | / |
/ | /home/nryan | /scratch | /scratch |
/home/nryan | / | — | /home/nryan |
/home/nryan | / | / | /home/nryan |
/home/nryan | / | /scratch | /home/nryan/scratch |
/home/nryan | /home | — | /home/nryan/home |
/home/nryan | /home | / | /home/nryan |
/home/nryan | /home | /scratch | /home/nryan/scratch |
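For example, the configuration fragment below corresponds to the last row of the table: the scratchDir begins with a "/", so it is resolved against the storage rootDir, and job working directories are created under /home/nryan/scratch on the host.

```json
{
  "scratchDir": "/scratch",
  "storage": {
    "rootDir": "/home/nryan",
    "homeDir": "/home"
  }
}
```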
While it is not required, it is a best practice to always specify scratchDir and workDir values for your execution systems and, whenever possible, place them outside of the system homeDir to ensure data privacy. The reason for this is that the file system available on many servers is actually a combination of physically attached storage, mounted volumes, and network mounts. Often, your home directory will have a very conservative quota while the mounted storage is essentially quota free. As the above table shows, when you do not specify a scratchDir or workDir, Tapis will attempt to create your job work directories in your system homeDir. It is very likely that, in the course of running simulations, you will reach the quota on your home directory, causing that job and all future jobs to fail on the system until you clear up more space. To avoid this, we recommend specifying a location with sufficient available space to handle the work you want to do.
Another common error that arises from not specifying thoughtful scratchDir and workDir values for your execution systems is jobs failing with “permission denied” errors. This often happens when your scratchDir and/or workDir resolve to the actual system root. Usually the account you are using to access the system will not have permission to write to /, so all attempts to create a job working directory fail, accurately, with a “permission denied” error.
Creating a new execution system¶
tapis systems create -v -F ssh-password.json
curl -sk -H "Authorization: Bearer $ACCESS_TOKEN" -F "fileToUpload=@ssh-password.json" https://api.tacc.utexas.edu/systems/v2
The response from the server will be similar to the following
{
"id":"demo.execute.example.com",
"uuid":"0001323106792914-5056a550b8-0001-006",
"name":"Example SSH Execution Host",
"status":"UP",
"type":"EXECUTION",
"description":"My example system using ssh to submit jobs used for testing.",
"site":"example.com",
"revision":1,
"public":false,
"lastModified":"2013-07-02T10:16:11.000-05:00",
"executionType":"HPC",
"scheduler":"SGE",
"environment":null,
"startupScript":"./bashrc",
"maxSystemJobs":100,
"maxSystemJobsPerUser":10,
"workDir":"/work",
"scratchDir":"/scratch",
"queues":[
{
"name":"normal",
"maxJobs":100,
"maxUserJobs":10,
"maxNodes":32,
"maxMemoryPerNode":"64GB",
"maxProcessorsPerNode":12,
"maxRequestedTime":"48:00:00",
"customDirectives":null,
"default":true
},
{
"name":"largemem",
"maxJobs":25,
"maxUserJobs":5,
"maxNodes":16,
"maxMemoryPerNode":"2TB",
"maxProcessorsPerNode":4,
"maxRequestedTime":"96:00:00",
"customDirectives":null,
"default":false
}
],
"login":{
"host":"texas.rangers.mlb.com",
"port":22,
"protocol":"SSH",
"proxy":null,
"auth":{
"type":"PASSWORD"
}
},
"storage":{
"host":"texas.rangers.mlb.com",
"port":22,
"protocol":"SFTP",
"rootDir":"/home/nryan",
"homeDir":"",
"proxy":null,
"auth":{
"type":"PASSWORD"
}
}
}