
Platform as a Service doesn’t have to be limited to just deploying and scaling single applications at a time.  Now you can deploy an entire cluster with workload management, on demand, with a single API call or a few clicks in an interactive portal.  You can easily test or deploy applications that already interface directly with workload managers, specifically the very popular (open source) TORQUE Resource Manager.

Dynamic TORQUE Cluster

Configuring and deploying TORQUE is not rocket science, especially if your resources are known ahead of time and you don’t need to do this too often.  In a commodity Infrastructure as a Service delivery model, you would need to do this each time you provisioned your resources, because cloud computing environments generally offer ephemeral “instances” rather than fully persistent ones.  Even if full persistence is available, it’s usually an expensive and complex option compared to the ephemeral variety, due to the dynamic, elastic nature of cloud computing.  This is especially true in a public cloud setting, where your resources are typically provisioned on demand and may have been running entirely different workloads just moments before.

Beyond persistent versus ephemeral, what if you need to rapidly deploy multiple TORQUE clusters?  Clusters of different sizes (in number of nodes)?  Or even TORQUE clusters with different hardware types (e.g. different core counts, GPU counts, etc.)?  In most cloud computing environments, each of these would mean a ground-up configuration effort.

To be truly elastic and dynamic, a TORQUE cluster “as a service” must be effectively self-configuring.  The exact same code must handle a 4-node configuration with 12 cores per node in one “instance”, and a 32-node configuration with 16 cores per node in another.  Otherwise you lose the velocity and agility that cloud computing promises, and in fact you pay for it in time wasted on configuration effort.
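To make the self-configuring idea concrete, here is a minimal Python sketch that generates the contents of a TORQUE `server_priv/nodes` file from whatever hosts happen to be provisioned.  It only illustrates the principle and is not how JARVICE does it internally; the function name `render_nodes_file` and the hostnames are made up for this example.

```python
def render_nodes_file(hostnames, cores_per_node, gpus_per_node=0):
    """Return the text of a TORQUE server_priv/nodes file for identical nodes.

    Each line has the form "<hostname> np=<cores>", with an optional
    "gpus=<count>" attribute when the node type has GPUs.
    """
    lines = []
    for host in hostnames:
        entry = f"{host} np={cores_per_node}"
        if gpus_per_node:
            entry += f" gpus={gpus_per_node}"
        lines.append(entry)
    return "\n".join(lines) + "\n"


# The same code handles two differently shaped clusters.
# Hostnames are hypothetical placeholders.
small = render_nodes_file([f"node{i:02d}" for i in range(1, 5)], cores_per_node=12)
large = render_nodes_file([f"node{i:02d}" for i in range(1, 33)], cores_per_node=16)

print(small, end="")
```

Nothing in the function depends on the cluster shape, which is the property a cluster “as a service” needs: the node count and core count are inputs discovered at launch time rather than values baked into a configuration file.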

API to Provision the TORQUE Cluster

So what does the API call look like to launch such a cluster?  Once you’ve used the JARVICE Create command to create your own copy of the CentOS-6-TORQUE-Server template NAE, all that’s left is to decide how many nodes and what type of hardware to use.  You can of course layer your own applications on top of the template once you’ve created your own copy, and save your changes at any time for future use.  Like all JARVICE parallel NAEs (e.g. the Hadoop Cluster on demand), the “master” node can be snapshotted, while the “slave” nodes are purely ephemeral.  In this example, we launch our NAE (which was created from the CentOS-6-TORQUE-Server template) on 4 nodes using the 12-core node type:

TORQUE Cluster API - 4 node

If we want eight 16-core Ivy Bridge nodes with GPUs instead, the API call is almost identical; only two key parameters change, the node count and the node type:

TORQUE Cluster API - 8 node
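As a rough illustration of how little changes between the two launches, the Python sketch below builds both requests and reports which parameters differ.  The endpoint URL, the parameter names (`nae`, `nodes`, `type`) and their values are hypothetical placeholders chosen for this example, not the actual JARVICE API; consult the JARVICE Platform Documentation for the real names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical endpoint, for illustration only (not the real JARVICE URL).
BASE_URL = "https://api.example.com/submit"


def build_launch_url(nae, nodes, node_type):
    """Build a launch request URL from hypothetical parameter names."""
    params = {"nae": nae, "nodes": nodes, "type": node_type}
    return f"{BASE_URL}?{urlencode(params)}"


def changed_params(url_a, url_b):
    """Return the sorted names of query parameters whose values differ."""
    a = dict(parse_qsl(urlsplit(url_a).query))
    b = dict(parse_qsl(urlsplit(url_b).query))
    return sorted(key for key in a.keys() | b.keys() if a.get(key) != b.get(key))


# Placeholder NAE name and node type labels.
four_node = build_launch_url("my-torque-nae", nodes=4, node_type="12-core")
eight_node = build_launch_url("my-torque-nae", nodes=8, node_type="16-core-gpu")

print(changed_params(four_node, eight_node))
```

Everything else about the request, including the NAE being launched, stays the same, so resizing the cluster or switching hardware is a matter of changing two values.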

Note that once you submit this API call, JARVICE spins up the TORQUE cluster and shortly thereafter allows you to log into the “master” node.  The “master” node in the TORQUE cluster on demand hosts the TORQUE server and scheduler, as well as the node manager (MOM).  The slaves run only the MOM.  In the default configuration, the “master” node is capable of executing work as well.  JARVICE also creates a default queue, so the cluster is ready to accept jobs with no further setup.  Once you log into the master, you can see all the nodes configured:

TORQUE Cluster - pbsnodes
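If you would rather check the node list from a script than by eye, a small parser for `pbsnodes -a` style output might look like the sketch below.  The sample text is hand-written to resemble typical output (a node name followed by indented `key = value` attributes); it was not captured from a real JARVICE cluster, and the hostnames are placeholders.

```python
def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style output into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current node's attribute block.
            current = None
        elif not line[0].isspace():
            # An unindented line starts a new node record.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and " = " in line:
            key, _, value = line.strip().partition(" = ")
            nodes[current][key] = value
    return nodes


# Illustrative sample only; real output carries more attributes per node.
SAMPLE = """\
node01
     state = free
     np = 12
     ntype = cluster

node02
     state = free
     np = 12
     ntype = cluster
"""

nodes = parse_pbsnodes(SAMPLE)
total_cores = sum(int(attrs["np"]) for attrs in nodes.values())
print(len(nodes), "nodes,", total_cores, "cores")
```

On the master node, the same function could be fed the real command output, for example via `subprocess.run(["pbsnodes", "-a"], capture_output=True, text=True).stdout`.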

And submitting jobs can be done as easily as this:

TORQUE Cluster - qsub
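Submission can also be scripted.  The sketch below, with made-up helper names `build_job_script` and `submit`, assembles a small job script with standard `#PBS` directives and pipes it to `qsub` on standard input.  The job name, resource request and command are example values; `submit` assumes it runs on the master node with `qsub` on the `PATH`, so it is defined here but not called.

```python
import subprocess


def build_job_script(name, nodes, ppn, command):
    """Return a TORQUE job script requesting `nodes` nodes with `ppn` cores each."""
    return "\n".join([
        "#!/bin/bash",
        f"#PBS -N {name}",
        f"#PBS -l nodes={nodes}:ppn={ppn}",
        "cd $PBS_O_WORKDIR",  # start in the directory qsub was run from
        command,
    ]) + "\n"


def submit(script):
    """Pipe the script to qsub on stdin and return the job ID it prints."""
    result = subprocess.run(
        ["qsub"], input=script, text=True, capture_output=True, check=True
    )
    return result.stdout.strip()


# Example values only; adjust to match the cluster you launched.
script = build_job_script(
    "hello", nodes=4, ppn=12, command="echo hello from $(hostname)"
)
print(script, end="")

# On the master node: job_id = submit(script)
```

Keeping the script generation separate from the submission step makes it easy to match the `nodes` and `ppn` request to whatever cluster shape was provisioned.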

As with other JARVICE parallel NAEs, InfiniBand is available on most hardware types for message passing between processes.  See the JARVICE Platform Documentation for more information.

Whether you need to deploy an application that already works directly with a workload manager, or you are interested in this technology from a purely academic perspective, JARVICE’s new TORQUE Cluster on demand is the fastest way to provision a fully functional HPC cluster on any hardware configuration and scale in the cloud today.