High Speed BWA in the Nimbix Cloud


Written by: on July 25, 2012

Hundreds of organizations around the world are working to align and map raw sequence data, and many have turned to the cloud to augment computing capacity for their analysis pipelines. While a number of commercial alignment and mapping applications can help with the challenge, one popular open source option is BWA (the Burrows-Wheeler Aligner).

When people think of running BWA in the cloud, most think of Amazon, Rackspace, or other commodity cloud infrastructure providers, where virtual machines are provisioned and billed by the hour. This is certainly an option for on-demand compute capacity, but provisioning it for the first time can be time consuming. What if you simply wanted a cloud-based BWA pipeline ready to run your sequence data as fast as possible?

At Nimbix, the cloud is all about the workload and not the machines. The following is only one example, but running high speed BWA on paired-end sequence data is as simple as making this API call to the Nimbix Accelerated Compute Cloud:

{
  "api-version" : "2.0",
  "customer" : {
    "username" : "nimbixusername",
    "api-key" : "************************************"
  },
  "application" : {
    "name" : "bwa",
    "command" : "paired-end",
    "parameters" : {
      "dbfile" : "input1-file1",
      "inputseqfile1" : "input2-file1",
      "inputseqfile2" : "input2-file2",
      "sub-commands" : {
        "aln" : {},
        "sampe" : {}
      }
    }
  },
  "files" : [
    {
      "files" : { "input1-file1" : "human" },
      "method" : "nimbixfiles"
    },
    {
      "files" : {
        "input2-file1" : "MyIlluminaData_1.fastq.gz",
        "input2-file2" : "MyIlluminaData_2.fastq.gz"
      },
      "method" : "sftp",
      "address" : "mysftpserver.location",
      "username" : "mysftpusername",
      "password" : "*****************************"
    }
  ]
}

For human reference alignments, simply replace the placeholder values (the Nimbix username and API key, the FASTQ file names, and the SFTP address and credentials) with your own and post the request to the Nimbix cloud. Your pipeline runs automatically and your SAM/BAM files are generated. Since Nimbix operates machines optimized for its bioinformatics processing tasks, users can generally expect results 5 to 15 times faster than any other cloud solution. Other available reference genomes can be selected by changing the reference name in the API call.
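As a rough illustration, the request body above can also be assembled programmatically rather than edited by hand. The sketch below is only that: the helper name build_bwa_job and its arguments are hypothetical, and it stops at producing the JSON text because the address to post to is not given here.

```python
import json


def build_bwa_job(username, api_key, reference, fastq_1, fastq_2,
                  sftp_address, sftp_username, sftp_password):
    """Return the paired-end BWA job description as a Python dict.

    Hypothetical helper: the keys mirror the example request above,
    only the user-specific values are passed in as arguments.
    """
    return {
        "api-version": "2.0",
        "customer": {"username": username, "api-key": api_key},
        "application": {
            "name": "bwa",
            "command": "paired-end",
            "parameters": {
                # These values are labels that refer to entries in "files".
                "dbfile": "input1-file1",
                "inputseqfile1": "input2-file1",
                "inputseqfile2": "input2-file2",
                "sub-commands": {"aln": {}, "sampe": {}},
            },
        },
        "files": [
            # Reference genome, as named in the example request.
            {"files": {"input1-file1": reference}, "method": "nimbixfiles"},
            # Paired-end reads fetched from the user's own SFTP server.
            {
                "files": {"input2-file1": fastq_1, "input2-file2": fastq_2},
                "method": "sftp",
                "address": sftp_address,
                "username": sftp_username,
                "password": sftp_password,
            },
        ],
    }


job = build_bwa_job(
    username="nimbixusername",
    api_key="REPLACE_WITH_API_KEY",
    reference="human",
    fastq_1="MyIlluminaData_1.fastq.gz",
    fastq_2="MyIlluminaData_2.fastq.gz",
    sftp_address="mysftpserver.location",
    sftp_username="mysftpusername",
    sftp_password="REPLACE_WITH_PASSWORD",
)

# Serialize to the JSON text that would be sent as the request body.
payload = json.dumps(job, indent=2)
print(payload)
```

Building the request from a function keeps the fixed parts of the job description in one place, so running a second sample only means changing the two FASTQ file names.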

For more information on making the above API call using curl, wfetch, perl or python, have a look at Josh Devinney’s blog post, Programmatic Job Posting to NACC.  If you need an account to try out the above, you can sign up on the Nimbix portal.

Comparing Costs: Dedicated HPC Cloud versus On-Demand Cluster


Written by: on July 5, 2012

When evaluating options for cloud-based clusters for use in HPC applications, cost is often a major consideration. For the occasional HPC processing task, preparing a cluster from the instance up (not always a trivial task) can be a cost-effective way to solve those compute problems. But what if the HPC processing task is more than occasional? What if it is part of your ongoing business process? At what point does it make sense to consider deployment alternatives?

To take a more quantitative view, let’s start by looking at the inputs and cost components of a deployment:

  1. Average walltime for HPC Job on fixed cluster size
  2. Jobs required per month
  3. Software licensing costs (if applicable)
  4. Machine cost (purchased)
  5. Depreciation
  6. Power/Cooling/Space 
  7. Staffing

The costs may vary from organization to organization depending on datacenter location, cost of electricity, type of cooling deployed, number of support staff, and so on, but in any deployment scenario, understanding these inputs and factors is important.

From the cloud perspective, this can be fairly straightforward, since all costs are abstracted to an hourly or monthly rate. Let’s take a theoretical example of an application that runs on a 12-node cluster requiring 16 CPU cores per node and 3-4 GB of RAM per core. Let’s assume that the application has an average run time of 5 hours.

The simplest cost to calculate is a single run using on-demand cloud resources. Let’s assume that the hourly rate for a compute instance with the above attributes (excluding data transfer and any cluster creation setup costs) is $2.20/hour. This means the total hourly cost for the cluster is $2.20/hr x 12 nodes = $26.40/hr. A single 5-hour job run would cost $26.40/hr x 5 hours = $132.00. Keeping the analysis simple, if a user only needed to run 1 job per month, using an on-demand cluster is likely the way to go. But what if s/he needed to run more than one job per month, or actually install a workload manager/job scheduler and enable multi-user job submissions? What do costs look like if some jobs fail?
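The arithmetic above is easy to script for other job counts. The following is a minimal sketch using the figures assumed in this example ($2.20 per node-hour, 12 nodes, 5-hour jobs); the function name on_demand_cost and the 10-job case are hypothetical, and a failed job is treated simply as one more job that has to be paid for.

```python
from decimal import Decimal

# Figures assumed in the example above.
RATE_PER_NODE_HOUR = Decimal("2.20")   # dollars per node per hour
NODES = 12
WALLTIME_HOURS = 5                     # average run time of one job


def on_demand_cost(jobs, rate=RATE_PER_NODE_HOUR, nodes=NODES,
                   walltime_hours=WALLTIME_HOURS):
    """Cost in dollars of running `jobs` jobs on an on-demand cluster.

    Excludes data transfer and cluster setup, as in the text.
    Reruns of failed jobs count as additional jobs.
    """
    cluster_hourly = rate * nodes
    return cluster_hourly * walltime_hours * jobs


cluster_hourly = RATE_PER_NODE_HOUR * NODES
single_job = on_demand_cost(1)
ten_jobs = on_demand_cost(10)

print(f"Cluster rate:  ${cluster_hourly}/hr")
print(f"1 job/month:   ${single_job}")
print(f"10 jobs/month: ${ten_jobs}")
```

Decimal is used instead of float so that currency amounts such as 2.20 multiply out exactly.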

Considering the other extreme, let’s suppose the cluster was needed for a full month. The total cost to operate the on-demand cluster becomes $26.40/hr x 720 hours = $19,008 per month, a pretty expensive endeavor.

Turning to dedicated HPC clouds for a moment, let’s assume that renting the same type of cluster costs $7,200.00 per month. In the above scenario, the break-even point between the two deployment approaches is at about 273 hours of cluster usage per month ($7,200.00 / $26.40 per hour). If the HPC processing tasks require more than this, dedicated is the way to go.
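The break-even point can be worked out in a few lines. This is a sketch under the assumptions used in this example only ($26.40/hr on demand, $7,200.00 per month dedicated, 5-hour jobs, a 720-hour month); the variable names are illustrative.

```python
import math
from decimal import Decimal

# Figures assumed in the example above.
ON_DEMAND_HOURLY = Decimal("26.40")      # 12 nodes x $2.20/hr
DEDICATED_MONTHLY = Decimal("7200.00")   # flat monthly rental
HOURS_PER_MONTH = 720
WALLTIME_HOURS = 5                       # average run time of one job

# Cost of keeping the on-demand cluster up for the whole month.
on_demand_full_month = ON_DEMAND_HOURLY * HOURS_PER_MONTH

# Hours of usage at which on-demand spending equals the dedicated rental.
break_even_hours = DEDICATED_MONTHLY / ON_DEMAND_HOURLY

# Smallest whole number of 5-hour jobs that costs more on demand
# than the dedicated rental does.
break_even_jobs = math.ceil(break_even_hours / WALLTIME_HOURS)

print(f"On-demand, full month: ${on_demand_full_month}")
print(f"Break-even usage:      {break_even_hours:.1f} hours/month")
print(f"Break-even job count:  {break_even_jobs} jobs/month")
```

Under these assumptions, a workload of 55 or more 5-hour jobs per month (275 hours, or $7,260.00 on demand) is cheaper on the dedicated cluster.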

While the above example is simplistic, it does highlight a quantitative approach to selecting cost-optimized HPC cloud deployment models. Other factors can weigh in as well, such as software license management, user location, cluster management support, data storage, node attributes, interconnect, security, and walltime variance between virtualized and bare-metal clusters. Ultimately, these factors must be reviewed by the consumer and the most efficient path selected.