In the investment community, the time value of money holds that money is more valuable today than in the future. This is largely because interest and dividends grow wealth over time, so it is always better to put money to work sooner rather than later.
In the world of data, there is a similar concept known as the time value of data, generally associated with business intelligence. It states that customer data decays over time, so it is best to inspect it sooner rather than later to extract insight.
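The decay idea can be made concrete with a toy model. This is purely illustrative: the exponential shape and the 30-day half-life are assumptions for the sake of the sketch, not properties of any real dataset.

```python
def insight_value(initial_value, age_days, half_life_days=30):
    """Toy model: the value of an insight halves every `half_life_days`
    after the data is captured. The half-life is an illustrative guess."""
    return initial_value * 0.5 ** (age_days / half_life_days)

# An insight worth 100 units when fresh, inspected at different ages:
for age in (0, 30, 90):
    print(age, insight_value(100, age))  # 100.0, 50.0, 12.5
```

Whatever the real decay curve looks like for a given business, the point stands: the same data yields less value the longer it sits unexamined.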
Over the last couple of years, I’ve spoken often about “the time value of big data” and how the convergence of Big Data and HPC can help optimize it. It brings together two concepts: the first is processing large, unstructured datasets using analytics rather than traditional indexing and search (the “big data” part); the second is doing it fast enough for the derived insight to be meaningful.
Example of Time Value of Big Data
Here is a simple (albeit morbid) example: imagine we are facing an epidemic that could wipe out humanity. Finding the cure involves many different things. We obviously need to determine the origin, which means combing through exabytes of information from various sources to identify patterns using analytics. It also involves computational sequencing and simulation, because let’s face it, we don’t have time for trial and error on live subjects to “see what happens”. The analytics feed the simulations, creating what we call “heterogeneous workflows”. All of this repeats, over and over, until we find the magical cure that keeps us all alive.
Failure is not an option, and we have a virtually unlimited budget as long as we spend it on cloud computing (we obviously don’t have time for months and months of procurement and system builds; we need the capacity immediately). The time value of big data is stark here: the cure is only valuable if it’s found in time to save us. Found even 10 minutes too late, it’s worthless, even though the data and results would be identical. The correct answer means nothing if we’re all too dead (or undead, depending on your favorite apocalypse) to use it!
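The analytics-feeding-simulation loop described above can be sketched as a simple pipeline. Everything here is a hypothetical stand-in: the `analyze` and `simulate` functions, the candidate scoring, and the threshold are toy placeholders, not any real epidemiology workload or JARVICE API.

```python
import random

def analyze(candidates):
    """Toy analytics stage: keep the more promising half of the candidates."""
    ranked = sorted(candidates, key=lambda c: c["signal"], reverse=True)
    return ranked[: max(1, len(ranked) // 2)]

def simulate(candidate):
    """Toy simulation stage: score a candidate (stand-in for sequencing runs)."""
    return candidate["signal"] * random.uniform(0.8, 1.2)

def heterogeneous_workflow(candidates, threshold=0.9, max_rounds=10):
    """Analytics feeds simulation, round after round, until a candidate
    clears the threshold -- the 'find the cure' loop from the example."""
    for round_no in range(1, max_rounds + 1):
        shortlist = analyze(candidates)          # analytics stage
        for c in shortlist:
            if simulate(c) >= threshold:         # simulation stage
                return c, round_no
        candidates = shortlist                   # survivors feed the next round
    return None, max_rounds

random.seed(42)
pool = [{"id": i, "signal": random.random()} for i in range(16)]
cure, rounds = heterogeneous_workflow(pool)
```

The key structural point is the alternation: each round’s simulation results determine what the next round’s analytics chew on, which is exactly why these workflows are hard to split across disconnected systems.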
How We Find the Cure
Let’s pretend that it’s 2016 and the debate about storing data in the cloud is over: most of the data we need to find the cure is already there. We want to know who people associate with, where they go (and have been), what they buy, what their family history is, and even what their DNA looks like. Companies like 23andMe and Ancestry.com have joined the fight, because let’s face it, they’re out of business if everyone’s dead. The data points by themselves don’t have much value; the value comes from processing them at scale with complex analytics and then running simulations on the derivatives. So there is no real security concern here, even before you count the pressure of saving the human race as an extra motivator for collaboration.
There are two possible approaches. The first is to leverage the commodity public cloud, which means ephemeral virtual machine instances, archival storage, and ordinary (read: slow) networks. Sure, it’s cheap on a per-unit basis, but we need tons of units to get anything meaningful done, and because the networks are slow, we can’t scale those units very far. Let’s also not forget the setup time: commodity cloud requires us to install, configure, test, and troubleshoot software before we can do anything meaningful with it. Let’s hope we managed to keep some system administrators alive, because we’ll need them. My guess is that we’re all dead before the first analytics job even runs.
The second option is to use a purpose-built platform for High Performance Data Analysis (HPDA), like the Nimbix Cloud powered by JARVICE. Here we have bare-metal machines running existing workflows, low-latency interconnects (like InfiniBand), and high-performance scalable storage (JARVICE Vaults), so we don’t have to constantly manage data and move it around. Thankfully, in this case there is no setup, because the workflows are already there and ready to run. And when we do run, the scaling is linear and the processing is fast, so we need far fewer units than with commodity cloud and actually save money, not just time. Yes, our budget is unlimited, but it’s nice to have some left in the bank after we save the world. In this scenario, we find the cure, deploy it in the water supply, and we all live. Our biggest problem afterward is moderating the debate over what to do with the budget surplus.
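The scaling argument above can be put in rough numbers with Amdahl’s-law arithmetic. The figures below are made up purely for illustration: the serial fractions, node counts, and baseline hours are assumptions, not measurements from either platform. The premise is that slow commodity networking leaves a larger fraction of the workload effectively serial than a low-latency fabric does.

```python
def time_to_result(serial_fraction, nodes, single_node_hours):
    """Amdahl's-law estimate of wall-clock hours on `nodes` machines:
    the serial fraction doesn't parallelize; the rest scales linearly."""
    return single_node_hours * (serial_fraction + (1 - serial_fraction) / nodes)

def cost(nodes, hours, hourly_rate_per_node):
    return nodes * hours * hourly_rate_per_node

# Illustrative assumptions: a 1000-hour single-node workload on 100 nodes,
# with 20% effectively serial on slow networks vs. 2% on a fast fabric.
commodity_hours = time_to_result(0.20, 100, 1000)  # ~208 hours
hpc_hours = time_to_result(0.02, 100, 1000)        # ~30 hours
```

Under these (assumed) numbers, the faster fabric finishes roughly seven times sooner, and because cost scales with node-hours at the same rate, the bill is proportionally smaller too: faster and cheaper, not a trade-off.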
Not all Big Data problems are as morbid as our epidemic example, but more and more of them will certainly be time sensitive, where speed translates into competitive advantage, more sales, and better customer relationships. Shouldn’t your Big Data platform be able to perform on the compute side as well? We think so.