Big data is becoming much more than just widespread distribution of cheap storage and cheap computation on commodity hardware. Big data analytics may soon become the new “killer app” for high performance computing (HPC).
There is more to big data than large amounts of information. It also pertains to massive distributed activities such as complex queries and computations (a.k.a. analytics). In other words, deriving value through computation is just as “big” as the size of the data sets themselves. In fact, the analyst firm IDC has already coined a term for big data on HPC: High Performance Data Analysis.
HPC is well positioned to enable big data use cases through all three phases of a typical workflow: data capture and filtering, analytics, and results visualization. Across all three phases, the speed of computation matters just as much as the scale. In order to unlock the full potential of big data, we have to pair it with “big compute,” or HPC. Few industries can benefit as much from converged big data and HPC as the life sciences, where the data sets are enormous, the queries and comparisons intensive, and the visualizations complex.
Here are three ways big data and HPC are converging and how life sciences organizations can take full advantage of the phenomenon right now to improve large-scale processing.
1. Hadoop Meets InfiniBand
Many consider InfiniBand, the most commonly used interconnect technology in supercomputers, as basic a requirement for HPC as bare-metal processing. If you can’t move information back and forth between nodes quickly, you limit the horizontal scalability you can achieve. RDMA for Apache Hadoop provides an excellent high-speed, low-latency interconnect option for big data platforms, and you can even provision a Hadoop cluster in the cloud that leverages RDMA in very little time. Consider that 56Gbps FDR InfiniBand can be many times faster than even 10Gbps Ethernet: it offers 5.6 times the raw bandwidth, plus a latency advantage that matters most for workloads dominated by small, frequent transfers. Short of using very expensive custom bus fabrics, this is the fastest way to distribute data and processing across computational nodes. Finally, you can scale that big data platform to the size it deserves without worrying nearly as much about bottlenecks. Imagine being able to quickly connect phenotypes to annotations for a given gene, without worrying about the network impact of distributing the data to all the computational nodes. Not only would you obtain results faster, but the setup time would be far lower than with commodity networking technology.
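To make the bandwidth and latency comparison concrete, here is a minimal back-of-the-envelope sketch in Python. It uses a simple "latency plus size divided by bandwidth" cost model for a single message. The bandwidth figures are the nominal link rates quoted above; the latency figures (50 microseconds for Ethernet, 1 microsecond for InfiniBand with RDMA) are illustrative assumptions, not measurements, and real numbers depend heavily on hardware, protocol stack, and tuning.

```python
def transfer_time_s(message_bytes, bandwidth_gbps, latency_us):
    """Estimate the time to send one message: fixed latency plus size / bandwidth."""
    bits = message_bytes * 8
    return latency_us * 1e-6 + bits / (bandwidth_gbps * 1e9)


# Nominal link rates from the article; latencies are ASSUMED for illustration only.
ETHERNET_10G = {"bandwidth_gbps": 10.0, "latency_us": 50.0}
INFINIBAND_FDR = {"bandwidth_gbps": 56.0, "latency_us": 1.0}


def speedup(message_bytes):
    """Ratio of estimated Ethernet time to estimated InfiniBand time for one message."""
    ethernet = transfer_time_s(message_bytes, **ETHERNET_10G)
    infiniband = transfer_time_s(message_bytes, **INFINIBAND_FDR)
    return ethernet / infiniband


if __name__ == "__main__":
    # Compare a tiny message, a mid-sized block, and a 1 GiB bulk transfer.
    for size in (64, 64 * 1024, 1024 ** 3):
        print(f"{size:>12} bytes: estimated speedup {speedup(size):.1f}x")
```

Under this model, the advantage for very large transfers approaches the bandwidth ratio (56 / 10 = 5.6), while for very small messages it approaches the ratio of the assumed latencies. How much faster InfiniBand is in practice therefore depends on the message sizes your workload actually generates.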
2. Hadoop Meets Accelerators
Another key feature of HPC is the use of popular coprocessors and accelerators, such as passively cooled NVIDIA Tesla GPUs based on the Kepler architecture. Just as these technologies greatly assist technical computing solutions, they can also help big data and analytics in bioinformatics, much as they already do for sequencing and alignment.
Hadoop jobs that leverage GPU programming frameworks such as CUDA and OpenCL can boost big data performance significantly. All other things being equal, higher-performance big data platforms and technologies such as Hadoop, Spark, and MapReduce lead to faster results for complex analytics. In fact, the only way to keep up with the growing amount of data we are collecting is to increase computation speed at the same time. Pairing big data with coprocessors and accelerators is an important way for HPC to make a big impact in this space.
In bioinformatics, there is already a rich tradition of leveraging GPU and FPGA technology to accelerate sequence alignment. Imagine all the benefits to be gained by extending these high-performance capabilities to complex queries and comparisons as well!
3. Big Data and HPC Converge in the Cloud
As big data fuels public cloud growth faster than any other application, HPC on demand is an emerging force ready to meet this challenge. The more data we collect, the more computational capacity we need to analyze it. Simply stated, big data and HPC growth in the cloud go hand in hand. The only way to provide enough scale to keep up with demand is to deploy HPC-class assets to increase processing performance and density.
Thanks to the marriage of big data platforms with supercomputing technologies such as high-speed interconnects and coprocessors, life sciences organizations can deploy HPC-on-demand services designed to enable the next major wave of analytics innovation. The same computational power that accelerates sequencing and alignment today can vastly improve queries and comparisons in the future. With distributed file systems such as Hadoop’s HDFS in place of expensive, traditional HPC parallel storage, the economics become more attractive. Finally, with the time to value and elastic scale only possible in the public cloud, scientific researchers can now focus on their work rather than wrestling with IT platforms.
Given the convergence of big data and HPC on demand, we can’t wait to see the benefits that researchers in the life sciences sector will enjoy. With the scale and availability of computation in the public cloud, the only things standing between big data and scientific breakthroughs are the limitations of the human mind.
About the author: Leo Reiter is CTO of Nimbix, a provider of cloud-based High Performance Computing infrastructure and applications that help Life Sciences organizations crunch data faster and more easily.