Hadoop Cluster in the Cloud: Snapshot a Supercomputer!


What if you could quickly build a Hadoop cluster in the cloud, and then snapshot it for later use on demand?

Snapshotting VMs has become routine in cloud computing. A user can install applications on a virtual machine, run their workloads, and then save the image off for later use, paying only for the time the machine was powered on.

As HPC has become more mainstream over the last few years, there has been plenty of experimentation with running workloads in the cloud and orchestrating VMs to do the processing. As Nimbix has continued to add features to JARVICE, our HPC cloud platform, we discovered an interesting new capability: snapshotting a supercomputer.

While the functionality is novel, what's the benefit and use case? Well, imagine that you are a student or post-doctoral researcher who needs access to a certain class of supercomputing resources to get your work done. For some, grant proposals have to be written or budgets scraped together just to acquire the hardware to build a supercomputer. I recall my brother's work as a post-doctoral chemical oceanographer at Texas A&M. He literally had to build his own computing environment, which took him several weeks of working with hardware suppliers, getting machine specs, allocating funds, and assembling the environment before he could even start his science.

Building a Cloud Supercomputer

There is an alternative. Of course, for available cloud HPC applications, users can simply submit jobs to NACC, but with JARVICE this hypothetical researcher could construct a cloud supercomputer in minutes, complete with GPUs and an InfiniBand interconnect! Here's an example:

{
  "files": [],
  "application": {
    "parameters": {
      "USER_NAE": "my_cloud_supercomputer",
      "qsub-nodes": 32,
      "sub-commands": {}
    },
    "name": "nae_16c32-2m2090-ib",
    "command": "start"
  },
  "customer": {
    "username": "naccuserid",
    "email": "email@emailaddress.net",
    "notifications": {
      "sms": {},
      "email": {
        "email@emailaddress.net": {
          "messages": {}
        }
      }
    }
  },
  "api-version": "2.1"
}

The above request, submitted to the Nimbix cloud, would construct a 32-node (512-core) system with dual NVIDIA M2090 cards, interconnected with 56Gbps FDR InfiniBand. The system is provisioned almost instantaneously, with one master (head node) and the rest as slave compute nodes in the cluster.
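For users scripting against the API rather than hand-writing JSON, a request like the one above can be assembled programmatically. The sketch below just builds and serializes the payload; the submission endpoint and authentication details are intentionally omitted, and the field values are the illustrative ones from the example, not real account data.

```python
import json

# Build the job request shown above. Field names mirror the sample
# payload; values are placeholders, not real credentials.
request = {
    "files": [],
    "application": {
        "parameters": {
            "USER_NAE": "my_cloud_supercomputer",
            "qsub-nodes": 32,  # number of nodes to provision
            "sub-commands": {},
        },
        # 16-core nodes with dual M2090 GPUs and InfiniBand
        "name": "nae_16c32-2m2090-ib",
        "command": "start",
    },
    "customer": {
        "username": "naccuserid",
        "email": "email@emailaddress.net",
        "notifications": {
            "sms": {},
            "email": {"email@emailaddress.net": {"messages": {}}},
        },
    },
    "api-version": "2.1",
}

# Serialize for submission; this is the string the API would receive.
payload = json.dumps(request, indent=2)
print(payload)
```

Building the request as a dictionary and serializing it with `json.dumps` guarantees the payload is well-formed, which is easy to get wrong when editing nested JSON by hand.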

Once the supercomputer is provisioned, the user can install their preferred workload management software, applications and other tools used to manage the cluster. After customizing the environment, the head node’s Nimbix Application Environment (NAE) can be saved or “snapshotted” for provisioning at a later time or cloning to build a second cloud supercomputer.

Building a Hadoop Cluster in the Cloud

I have personally been playing with this functionality to experiment with building a small Hadoop cluster. Since the capability described above is currently available to advanced users willing to work with the API, I used our NACC CLI tool to submit an API call similar to the one above and build a 4-node cluster. I created a Nimbix Application Environment on a NAE_16C32-M2090 and installed Apache Hadoop with InfiniBand RDMA support, available from Ohio State (http://hadoop-rdma.cse.ohio-state.edu). While I'm not a Hadoop cluster expert, I was amazed at how quickly it provisioned with my installed stack. With minor initial configuration, I was ready to run benchmarks like TestDFSIO on my cloud Hadoop cluster. Running the default benchmark with RDMA enabled, I found it completed almost 2x faster than over TCP. When I was finished, since I wasn't going to come back to it for a few days, I took a snapshot, then terminated the Hadoop cluster with a mouse click and it was deprovisioned. I can now re-launch it at any time for further benchmarking. Pretty cool!
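For anyone wanting to try a similar run, the TestDFSIO invocations look roughly like the following. The jar file's name and location vary by Hadoop version and distribution, so treat the path here as a placeholder to adjust for your install.

```shell
# HDFS write throughput test: 16 files of 1000 MB each.
# Adjust the jar path for your Hadoop version/distribution.
hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO \
    -write -nrFiles 16 -fileSize 1000

# Corresponding read throughput test over the files just written
hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO \
    -read -nrFiles 16 -fileSize 1000

# Clean up the benchmark's HDFS output when done
hadoop jar $HADOOP_HOME/hadoop-*-test.jar TestDFSIO -clean
```

Each run appends its measured throughput and average I/O rate to a local results log, which makes it straightforward to compare an RDMA-enabled run against a TCP baseline.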

This capability is not limited to Hadoop clusters, of course. We think there is a lot of potential when it is put in the hands of smart HPC users around the world. What kinds of environments can we build? What kinds of HPL benchmark results can be achieved? How much more efficient can we make researchers and data scientists? There are tremendous opportunities for accelerating innovation!