Last week we presented our latest Compute30™ webinar focused on scaling HPC applications…
We began with the premise that increasing performance of HPC applications in today’s world is about scalability across cores and nodes, not just pure algorithmic efficiency. As we’ve seen in recent years, CPU clock speeds are not really increasing anymore given they are now thousands of times higher than they were in the 1980’s. Instead, chip vendors, whether CPU or GPU, are packing more cores into the silicon. This requires HPC applications to be highly parallelized in order to realize the performance capabilities of the underlying hardware. We’ve also seen that cores within a single node alone are not enough to solve today’s toughest problems – they are just way too complex. Whether it’s Big Data Analytics, CAE, Life Sciences, or another other compute-intensive domain, 16 CPU cores plus 3000 CUDA cores just aren’t enough if you are trying to solve problems quickly. This is why most HPC applications are installed on clusters instead of single servers. Whether cloud-delivered or on premises, clusters provide a platform for HPC applications to scale horizontally.
Parallel HPC Applications
The challenge today comes not from how big and fast the cluster can be, but how the HPC applications themselves can take advantage of all the resources. Just like algorithmic efficiency can make a huge difference with lots of iterations, so can the distributed or parallel architecture used to build HPC applications. In no way do we stop optimizing loops and other code constructs, but now we have to pay attention to how we can distribute the work across many, many nodes in order to reduce the time it takes to solve big problems. The “holy grail” of course is linear scalability, which means that the amount of time a problem takes is exactly inversely proportional to the number of nodes it runs across. So, for example, if the problem takes 12 minutes on a single compute node, it would take only 6 minutes if run on 2 compute nodes at the same time, and only 3 minutes if scaled across 4 compute nodes. This all works great in theory, but in practice, it’s not currently possible. Even if you’re scaling across cores within the same node, there are many factors, such as bus bandwidth, context switching overhead, etc., that impact the ability to scale in a linear fashion.
So, most of us are forced to accept that “linear scale” really means “near linear scale”. If for example the problem that takes 12 minutes on a single node takes 6 minutes and 10 seconds when run parallel on 2 nodes, then we probably did a pretty good job with the software architecture. How much of that extra 10 seconds we want to reduce is typically related to how much time and money we have in our budget. That then becomes a business problem rather than a technical one!
There are a number of ways that HPC applications can function in parallel across resources, but here are two basic examples:
- The data to process is divided across the processing nodes, so that the same code processes its own section in parallel. Software designed like this typically has a “setup” phase where the data is actually divided, and a “convergence” phase where the results are collated for the end user. Both of these phases of course affect linear scalability, but can usually be kept very minimal to not impact the overall processing time too much. The “heavy lifting” in this model is done at the data processing level, in order to realize maximum parallelism.
- The data must be processed in various “parameter sweeps”. For example, an application may need to apply different variables to the same data in order to calculate the final result. This method is very straightforward, because the sweeps themselves can be divided across the compute nodes. For example, if there are 8 sweeps to do and 2 compute nodes, each node would process 4 sweeps. Typically there are still setup and convergence phases just like in the first method, so the “heavy lifting” is done at the data processing level again.
Given these 2 designs, how do we achieve maximum parallelism?
Design Rules for Parallel HPC Applications
Here are some general rules to keep in mind when designing parallel HPC applications, as explained in our webinar:
- Be stateless: applications that must store persistent state during processing typically introduce contention between threads or even processes running on other nodes. For example, if there is some element that must be updated on shared storage, this can have a major impact. State also affects something else: what if a user (or users) wants to run several instances of your application? Keep in mind this is a common use case of cloud-based Software-as-a-Service. Much like global variables in ordinary programs affect the ability of multiple execution threads to consume the same functions or methods, persistent state during runs can cause conflict if there are many instances of the same application running across nodes. Designing scalable HPC applications starts with keeping only transient state during processing, typically confined to each node. Only at convergence time, where the results are collated, should any persistent data be stored.
- Use “scratch” space: most compute nodes will have local storage which can be dozens or hundreds of times faster than persistent user storage. Use this space for transient state or temporary files to achieve maximum parallelism and improve performance of individual I/O-heavy iterations. In some cases, it is best to copy the data sets you will process into this scratch space rather than read directly from the persistent source. The time spent copying the data in the “setup” phase can, in many cases, be a fraction of the aggregate time spent on accessing it directly from the persistent storage in an iterative fashion.
- Use high performance interconnects such as RDMA. Remember that TCP/IP and UDP/IP were designed to transmit packets across large networks, and even in their simplest forms have serious performance overhead thanks to the underlying Ethernet technology below. If you’re running an application on a local cluster of compute nodes, especially if that cluster or cloud is purpose built for HPC, you will likely have access to Infiniband. Instead of relying on the fast but not optimal IP over Infiniband (IPoIB for short), design your application to use RDMA. Instead of sending packets (and all their related overhead), you simply mark regions of memory as shared between nodes and Infiniband does the rest. The RDMA APIs are simple to use, and the mechanism is extremely fast. There is no “transmission” overhead since this is not really a transfer protocol. It’s literally shared memory between nodes. Put data in one place, and it shows up automatically on another node. With 56Gbps and less than 2 microseconds of latency, FDR Infiniband, such as that offered in JARVICE, is hard to beat.
Any way you slice it, the key to performance in HPC applications is scalability, both across resources within a compute node, and horizontally across compute nodes. Follow a few simple rules and you’ll solve problems quicker than ever before.