If you dive into the field of Supercomputing and Big Data, you will begin to run across blog posts talking about the “V’s” of the field: the six, the eight, the ten, the twelve, and so forth. No, we’re not talking about engines; we’re talking about lists of nouns that name aspects or properties of Big Data or Supercomputing that need to be balanced or optimized. The list of eight strikes a balance between being complete and remaining concise. The higher-numbered lists tend to veer off into data governance issues that we need not concern ourselves with at this point.
The eight V’s: Volume, Velocity, Variety, Veracity, Vocabulary, Vagueness, Viability and Value
Most of these are pretty self-explanatory, but let’s go through them just for drill.
Volume: The amount of data needing to be processed at a given time. This can manifest either as an amount over time or as an amount that needs to be processed all at once. Doing a matrix operation on a 1 billion by 1 billion matrix and scanning the contents of every newspaper published in a day for keywords are both examples of volume that can constrain computing.
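To put a number on the 1 billion by 1 billion matrix mentioned above, here is a rough back-of-the-envelope sketch. It assumes a dense matrix stored as 8-byte (64-bit) floating-point values; the function name `dense_matrix_bytes` is made up for this illustration, and a sparse or lower-precision representation would need far less space.

```python
def dense_matrix_bytes(rows: int, cols: int, bytes_per_element: int = 8) -> int:
    """Bytes needed to hold a dense rows x cols matrix in memory.

    Assumption: every element is stored explicitly at a fixed width
    (8 bytes by default, i.e. a 64-bit float).
    """
    return rows * cols * bytes_per_element


n = 1_000_000_000  # 1 billion rows and 1 billion columns
total_bytes = dense_matrix_bytes(n, n)
exabytes = total_bytes / 10**18  # decimal exabytes

print(f"{total_bytes:,} bytes = {exabytes:.0f} EB")
# prints "8,000,000,000,000,000,000 bytes = 8 EB"
```

Under those assumptions the matrix alone occupies about 8 exabytes before any computation starts, which is why a problem like this cannot simply be loaded onto a single commodity machine.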
Velocity: Similar to Volume, this has to do with the speed of the data coming in and the speed of the transformed data leaving the compute. An example of a high velocity requirement is telemetry that needs to be analyzed in real time for a self-driving car. The enemy of velocity is latency.
Variety: The spice of life, or the bane of computing? In the computing context we are discussing, this term refers to heterogeneous data sources that need to be identified and normalized before the compute can occur. In data science, this is often referred to as data cleaning. This operation is frequently the most labor-intensive, as it involves all of the pre-work required to set up the high-performance compute. This is where the vast majority of errors and issues with data are found, and it is the fundamental bottleneck in high-performance computing.
Vocabulary: This term has two meanings. The first is less a computing issue than a communication issue between provider and customer, and it has to do with the language used to describe the desired outcome of an analysis. For example, the term “accuracy” or “performance” may have a different meaning in the context of structural engineering than it does in rendering animation. The second meaning branches into semantic searching and operations within a semantic space. Here we are dealing with controlled vocabularies (ontologies), in which each term carries a specific definition but also a relatedness to other terms. For example, the term “child” implies that there is a “parent” and so forth. This term architecture is very important when working with clients in the artificial intelligence space, where search and retrieval are used to uncover unknown relationships. As it turns out, the strength of the ontology largely determines the relative success or failure of projects that mine with semantic-based technologies.
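The “child” implies “parent” idea can be sketched in a few lines. This is only a toy illustration, not how any particular ontology tool works: the relation names `child_of` and `parent_of`, the `INVERSE` table, the `with_inverses` helper and the names in the example are all hypothetical.

```python
# Hypothetical mini-ontology: each relation declares its inverse relation.
INVERSE = {"child_of": "parent_of", "parent_of": "child_of"}


def with_inverses(triples):
    """Return the given (subject, relation, object) triples plus every
    statement implied by a declared inverse relation."""
    closed = set(triples)
    for subject, relation, obj in triples:
        inverse = INVERSE.get(relation)
        if inverse is not None:
            closed.add((obj, inverse, subject))
    return closed


# We only state one fact explicitly...
facts = {("alice", "child_of", "bob")}
knowledge = with_inverses(facts)

# ...but the controlled vocabulary lets us retrieve the implied one too.
print(("bob", "parent_of", "alice") in knowledge)  # prints "True"
```

The point of the sketch is that the relatedness lives in the vocabulary itself, so a search for parents can find a record that was only ever entered as a child.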
Vagueness: This term describes an interpretation issue with the results being returned. Douglas Adams articulated this beautifully in “The Hitchhiker’s Guide to the Galaxy”, where the answer to the Ultimate Question of Life, the Universe, and Everything turned out to be the number 42. This is a bit tongue-in-cheek, but it is a very real problem with scientific and big data computes. These computes are able to marshal and transform huge oceans of data, but what does it mean? What do I do with the answer? We see the same issue in statistics when we do correlation studies. A famous example is the direct correlation between sales of chocolate ice cream and violent crime in Cleveland. So, what does this mean? Does it mean that there is something in chocolate ice cream that makes people violent? As a well-meaning city official, you might consider banning the sale of chocolate ice cream, but you’d look foolish. Here’s why: correlation does not imply causation. As it turns out, both ice cream sales and violent crime spike in the summer due to heat and a lack of central air conditioning. This is vagueness. Correlations produced by computes are often misinterpreted as causation, and more data doesn’t necessarily mean better or more accurate results. This is something that we all need to keep in the back of our minds when dealing with clients.
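The ice cream example can be reproduced with a small simulation. The numbers below are entirely synthetic (they are not Cleveland data), and the coefficients, noise levels and variable names are assumptions chosen only to illustrate the effect: temperature drives both series, and neither series influences the other.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs
    (first-order partial correlation)."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic daily temperatures; both outcomes depend on temperature only.
temperature = [rng.uniform(0, 35) for _ in range(2000)]
ice_cream = [10 + 2.0 * t + rng.gauss(0, 5) for t in temperature]
crime = [5 + 0.5 * t + rng.gauss(0, 2) for t in temperature]

raw = pearson(ice_cream, crime)
controlled = partial_corr(ice_cream, crime, temperature)
print(f"raw correlation:            {raw:.2f}")
print(f"controlling for temperature: {controlled:.2f}")
```

With this setup the raw correlation between ice cream sales and crime should come out strongly positive, while the correlation that remains after controlling for temperature should sit close to zero. The compute did its job correctly in both cases; only the second number answers the question the city official actually cares about.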
Viability: This refers to a model’s ability to represent reality. Models, by their very nature, are idealized approximations of reality. Some are very good; others are dangerously flawed. Frequently, model builders simplify their models in order to keep them computationally tractable. With hardware acceleration, we can remove these shackles from the model builder and let them simulate closer to reality.
Value: This term is defined as whatever is important to the customer. Another way to define value is the removal of obstacles in the customer’s path so that they can reach their stated destination. We often think of value in terms of cost, but we can also think of Value in terms of enablement and what that is worth to the customer.
Here are some relationships between these terms that might be helpful…
As the first six V’s increase for any given problem, the problem outstrips the ability and capacity of commodity hardware and leads to a decrease in Viability and Value from that compute on commodity hardware.
Hardware deals primarily with Volume and Velocity as these are physical constraints of the data.
Software deals primarily with Variety, Veracity, Vocabulary, and Vagueness as these are logical or organizational constraints upon the data.
Artificial Intelligence/Machine Learning can be described as any technology that contains logic that discriminates between two or more classifications (member or non-member, odd or even, etc.). These systems deal primarily with controlling or limiting Vocabulary and Vagueness, and they add Value and Viability through this control.
From these eight V’s and their relationships to hardware, software and artificial intelligence/machine learning, we now have a lens through which we can examine our customers’ requirements and determine a measure of Value for the service that we provide.