
Benchmarking GPUs in the Nimbix Cloud for Deep Learning is easy and straightforward on JARVICE 2.0. All of our GPU offerings are available on-demand, and your applications can run on re-configurable hardware. In this post, we evaluate the performance of the Titan X, K40, and K80 GPUs in deep learning. In our earlier post, we announced turn-key availability of the NVIDIA DIGITS application on JARVICE 2.0. You can get started training your deep learning models with GPUs directly in the Nimbix Cloud in under five minutes, and developers can customize their own environments to run other deep learning applications and frameworks with re-configurable hardware accelerators directly in the cloud. Now that you are up and running GPU-accelerated applications, we would like to benchmark the performance of our available GPUs. We compare the elapsed run times and the economics of training deep learning models in NVIDIA DIGITS on the well-known MNIST handwritten digits data set.

Nimbix’s industry-leading pricing with to-the-minute billing granularity is summarized on our pricing page. We offer heterogeneous cloud computing environments with accelerators that include a host of GPU offerings, Xilinx FPGAs (dynamic silicon, on-demand), and compute nodes with the latest-generation Intel Haswell processors. Each compute node contains up to 256 GB of RAM, and all of our servers are connected with a high-speed 56/100 Gbps InfiniBand cluster fabric. InfiniBand is an important technology for minimizing latency in modern HPC environments, and it is a standard feature of our purpose-built high performance computing cloud.

To perform the NVIDIA DIGITS deep learning benchmarks in the cloud, we use the stock MNIST handwritten digits data set. You can easily follow along and repeat these benchmarks on your own, or perform similar benchmarks with your own data and algorithms.

Hardware Overview for Deep Learning

We would like to compare our three GPU offerings, the NVIDIA Titan X, the NVIDIA Tesla K40, and the NVIDIA Tesla K80, for deep learning using NVIDIA DIGITS. Among these are two distinct GPU architectures, Maxwell and Kepler. The Titan X is the top of the line in NVIDIA's Maxwell series, boasting 3072 CUDA cores, 12 GB of GDDR5 memory, 336.5 GB/s of memory bandwidth, and a clock speed of 1000 MHz (1075 MHz boost). The Kepler architecture, which includes the Tesla K40 and K80 GPUs, offers more advanced memory management (optional ECC) and much stronger double-precision support. The K40, powered by the GK110B chip, has 2880 CUDA cores, a clock speed of 745 MHz (875 MHz boost), 12 GB of GDDR5 memory, and 288 GB/s of memory bandwidth. The K80 is powered by two GK210 chips, giving it a total of 4992 CUDA cores, 12 GB of GDDR5 memory per chip (24 GB total), and 2 x 240 GB/s of memory bandwidth.
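If you want to confirm which of these GPUs your JARVICE job actually landed on, you can query the NVIDIA driver from inside the environment. The Python sketch below simply shells out to nvidia-smi; it assumes nvidia-smi is on the PATH in your GPU-enabled environment, and the particular query fields chosen are just a convenient selection, not anything specific to JARVICE.

```python
import subprocess

def list_gpus():
    """Query the NVIDIA driver for the GPUs visible to this job."""
    # nvidia-smi's CSV query mode; name, memory.total, and clocks.max.sm
    # are standard query-gpu fields (availability can vary by driver).
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,memory.total,clocks.max.sm",
         "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]

if __name__ == "__main__":
    for idx, gpu in enumerate(list_gpus()):
        print(f"GPU {idx}: {gpu}")
```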

Database Set Up for Classification

How does this hardware stack up for deep learning? We compare training times for AlexNet and GoogLeNet across single- and dual-GPU configurations of the Titan X, K40, and K80. Under the hood, DIGITS uses caffe-nv, NVIDIA's patched version of Caffe with multi-GPU support, along with cuDNN and other CUDA tools. Since we are comparing hardware, we stick to the default settings for training on the MNIST data set, using /db/mnist/train and /db/mnist/test for the training and testing data sets respectively. These databases are pre-downloaded into the DIGITS environment for your convenience.
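If you would like to sanity-check the pre-staged data before training, a quick count of images per class is enough; the MNIST training and test sets should contain 60,000 and 10,000 images respectively. The sketch below assumes the /db/mnist/train and /db/mnist/test paths hold one subdirectory per digit class filled with image files, which is the usual layout for DIGITS classification datasets, so verify against your own environment.

```python
from collections import Counter
from pathlib import Path

# Paths as used in the post; the one-subdirectory-per-class layout is
# an assumption based on the usual DIGITS classification convention.
DATA_DIRS = {"train": Path("/db/mnist/train"), "test": Path("/db/mnist/test")}

def count_images(root: Path) -> Counter:
    """Count image files per class subdirectory under root."""
    counts = Counter()
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1 for f in class_dir.iterdir()
            if f.suffix.lower() in {".png", ".jpg"}
        )
    return counts

if __name__ == "__main__":
    for split, root in DATA_DIRS.items():
        counts = count_images(root)
        print(f"{split}: {sum(counts.values())} images "
              f"across {len(counts)} classes")
```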

MNIST data setup for Deep Learning

It takes a few minutes to perform this one-time database setup, but once it has completed, we are ready to begin running our tests. Keep in mind that your DIGITS environment is ephemeral, while your data is stored persistently in your user account: once you terminate the environment and restart it, your databases and results will still be accessible through the DIGITS panel. If you run multiple environments simultaneously, concurrent training sessions will not be visible to each other, but those sessions (which are recorded in the DIGITS web interface) will be accessible in future DIGITS jobs launched from the JARVICE platform page.

Training Settings

We use the default training settings when creating a new classification model; the only adjustments we make are the type and quantity of GPUs. We train AlexNet and GoogLeNet. Here is an example screenshot of the default settings. We kick this off in single and dual configurations for the Titan X, K40, and K80. Machine types and their current on-demand prices can be found on the Nimbix On-Demand Cloud Pricing page.
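We read elapsed training time straight off the DIGITS job page, but if you prefer to script a comparison across machine types yourself, a minimal wall-clock wrapper along these lines will do. The command being timed here is only a placeholder; substitute whatever Caffe or framework invocation you are actually benchmarking.

```python
import subprocess
import time

def time_training(cmd):
    """Run a training command and return elapsed wall-clock minutes."""
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return (time.monotonic() - start) / 60.0

if __name__ == "__main__":
    # Placeholder command -- swap in your real training invocation
    # (for example, a caffe train call with your solver file).
    elapsed = time_training(["sleep", "2"])
    print(f"Training took {elapsed:.2f} minutes")
```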

Training for Deep Learning

Benchmark Results

After running all of our test cases, we can see a quick overview of the results inside the DIGITS web interface. Here are the results as reported by the DIGITS application. The fastest, and also lowest-cost, training goes to the Titan X.

For AlexNet, the dual Titan X at 48.43 minutes edges out the single Titan X at 49.37 minutes. The Tesla family lags behind, with the third-best time going to the single K80 at 57.55 minutes. Overall, the difference between single- and dual-GPU performance, with the exception of the K80, is small for AlexNet. The lowest training cost for AlexNet is the single Titan X at $2.26; the dual Titan X configuration ran just under a minute faster but cost $0.16 more per training run.

For GoogLeNet, every dual-GPU configuration outperforms the single-GPU configuration of the same hardware. The Maxwell-architecture Titan X still wins, with the fastest runtime of 85.95 minutes to train on the MNIST data set. The best-case training cost also belongs to the best total runtime for GoogLeNet: the dual Titan X configuration at $4.30.

AlexNet GPU Training Comparison
GoogLeNet GPU Training Summary
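The training costs above follow directly from elapsed time and the hourly machine rate under to-the-minute billing. The sketch below shows the arithmetic; the hourly rates in the example are placeholder values rather than published prices, so substitute the current figures from the Nimbix pricing page.

```python
import math

def training_cost(minutes, rate_per_hour):
    """Cost of a run, assuming whole minutes are billed (to-the-minute granularity)."""
    return math.ceil(minutes) * rate_per_hour / 60.0

if __name__ == "__main__":
    # Elapsed times from the AlexNet runs above; the hourly rates are
    # hypothetical placeholders -- check the Nimbix pricing page.
    runs = {
        "Single Titan X": (49.37, 2.75),
        "Dual Titan X": (48.43, 3.00),
    }
    for name, (minutes, rate) in runs.items():
        print(f"{name}: ${training_cost(minutes, rate):.2f}")
```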

How do your algorithms stack up against these? We hope these results help you select the best GPU for your deep learning problems. You can rerun these benchmarks, modify the Caffe files, and choose which GPU best fits your workload. With re-configurable accelerators, you can run your image simultaneously on different hardware configurations to perform your own benchmarks or run your pipelines.

Want More?

Feel free to reach out to us with comments or tweet @Nimbix. We are always excited to learn how you integrate GPU and HPC jobs into your production pipelines.