Tool Your HPC Systems for Data Analytics

Single-Node Runs

Although it is not true of all DA workloads, most applications are designed for single-node systems. The code might be threaded, but very little of it runs in parallel across multiple nodes, communicating while it works on a single data set. MapReduce jobs split the data into pieces and perform roughly the same operation on each piece, with no communication between tasks. In other words, they are a collection of single-node runs with no real coupling or communication between them as part of the computation.
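To make the "collection of single-node runs" idea concrete, here is a minimal sketch of the map and reduce steps in plain Python. The word-count task, the helper names (split, map_task, reduce_task), and the sample data are assumptions made for illustration; a real MapReduce framework would schedule each map task on its own node.

```python
from collections import Counter
from functools import reduce

def split(data, n_pieces):
    """Split a list into n_pieces contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), n_pieces)
    chunks, start = [], 0
    for i in range(n_pieces):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def map_task(chunk):
    """Process one piece of the data; needs nothing from any other task."""
    return Counter(chunk)

def reduce_task(left, right):
    """Combine two partial results into one."""
    return left + right

words = "a b a c b a d a".split()

# Each map_task call is independent, so each could be its own single-node run.
partials = [map_task(chunk) for chunk in split(words, 3)]
totals = reduce(reduce_task, partials, Counter())

print(totals.most_common(2))  # [('a', 4), ('b', 2)]
```

The only point at which results meet is the final reduce; the map tasks never exchange data with one another.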

Just because most code executes as single-node runs doesn't mean data analytics doesn't run many jobs at the same time. In some cases, the analysis is repeated with a variety of starting points or conditions. In others, you make different assumptions in the analysis or use different analysis techniques. Either way, the result is a large number of single-node runs.
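A quick sketch shows how fast those choices multiply. The starting points, assumptions, and technique names below are hypothetical placeholders, not taken from any particular study.

```python
from itertools import product

# Hypothetical analysis choices; a real study would define its own.
starting_points = [0.0, 0.5, 1.0, 1.5, 2.0]
assumptions = ["gaussian_noise", "heavy_tailed_noise", "no_noise"]
techniques = ["pca", "clustering", "regression", "robust_regression"]

# One independent single-node run per combination of choices.
runs = [
    {"run_id": i, "start": s, "assumption": a, "technique": t}
    for i, (s, a, t) in enumerate(product(starting_points, assumptions, techniques))
]

print(len(runs))  # 5 * 3 * 4 = 60
```

Each entry in runs could be handed to a job scheduler as a separate single-node job, because none of the runs depends on another.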

An example of a data analysis technique is principal component analysis [26] (PCA), in which a data set is decomposed into components ordered by how much of the variance in the data each one explains. The resulting decomposition can then be used to reduce the data set or problem to a smaller one that is more easily analyzed. However, choosing how much of the data set to remove is something of an art, so the amount of reduction might have to be varied and the analysis on the reduced data set repeated many times. The end result is that the overall analysis could take a while and use a great deal of computation.
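The following sketch illustrates why the amount of reduction is a judgment call. It computes the variance explained by each principal component of a small synthetic data set and reports how many components are needed to reach several thresholds. Everything here is an assumption for illustration: the data is made up, the function names are hypothetical, and power iteration with deflation stands in for the optimized eigensolver or SVD routine a numerical library would normally provide.

```python
import math
import random

def covariance_matrix(rows):
    """Sample covariance matrix; each row is one observation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(matrix, vec):
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]

def top_eigenpair(matrix, iterations=500, seed=0):
    """Dominant eigenvalue and unit eigenvector by power iteration."""
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in matrix]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # matrix maps vec to zero; nothing left to extract
        vec = [x / norm for x in nxt]
    value = sum(v * w for v, w in zip(vec, mat_vec(matrix, vec)))
    return value, vec

def explained_variance(rows):
    """Variance along each principal component, largest first."""
    cov = covariance_matrix(rows)
    d = len(cov)
    values = []
    for _ in range(d):
        value, vec = top_eigenpair(cov)
        values.append(max(value, 0.0))
        # Deflation: remove the component just found, then repeat.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return values

def components_needed(values, threshold):
    """Smallest number of components whose variance share reaches threshold."""
    total = sum(values)
    running = 0.0
    for k, value in enumerate(values, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(values)

# Synthetic 3D data: most of the variance lies along one direction.
rng = random.Random(0)
rows = []
for _ in range(200):
    t = rng.gauss(0.0, 3.0)
    rows.append([t + rng.gauss(0.0, 0.3), 2 * t + rng.gauss(0.0, 0.3),
                 rng.gauss(0.0, 0.5)])

values = explained_variance(rows)
for threshold in (0.90, 0.99, 0.999):
    print(threshold, components_needed(values, threshold))
```

Each threshold is a different answer to "how much of the data set can I remove?", and each answer implies rerunning the downstream analysis on a differently reduced data set.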

Other workloads do use multiple nodes, typically when a NoSQL database is involved. I've seen situations in which users have their data in a NoSQL database distributed across many nodes and then run MapReduce jobs across all of the nodes. The computations in the map phase are dominated by floating-point operations. The reduce phase also has some fairly heavy floating-point math, along with many integer operations. All of these operations can also have a heavy I/O component. The situation I'm familiar with was not that large in HPC terms, perhaps 12 to 20 nodes. However, it was just a ``sample'' problem. The complete problem would have started with 128 nodes and quickly moved to several thousand nodes. Although these are not MPI or closely coupled workloads, they can stress the capabilities of a large number of nodes.


A key aspect of data analytics is interacting with the analysis itself. Interactivity can mean multiple things. One form involves manual (i.e., interactive) preparation of the data for analysis, such as examining the data for outliers.

An outlier can be one data point or several data points that lie outside the expected range of data. A more mathematical way of stating this is that an outlier is an observation that is distant from other observations (you can substitute the word ``measurement'' for ``observation''). The sources of outliers vary and include experimental error, instrument error, and human error. Typically, you either use robust analysis methods that can tolerate outliers or remove the outliers from the data. The process of removing outliers is something of an art that requires scrutinizing the data, often with the use of visual analysis methods.
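As one possible starting point for that scrutiny, the sketch below flags observations that are distant from the rest by using a modified z-score based on the median absolute deviation (MAD). The choice of method, the conventional 3.5 cutoff, the function name find_outliers, and the sample readings are all assumptions for illustration; the flagged values are candidates for a closer look, not an automatic removal list.

```python
import statistics

def find_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds threshold."""
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    if mad == 0:
        # At least half the values equal the median, so the score is undefined.
        return []
    # 0.6745 scales the MAD so the score is comparable to a standard z-score
    # when the data is normally distributed.
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]

# Hypothetical instrument readings with one suspicious value.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 42.0, 9.7]
print(find_outliers(readings))  # [42.0]
```

Because the score is built from medians rather than the mean and standard deviation, a single extreme reading cannot hide itself by inflating the yardstick used to measure it.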

Sometimes, outliers are retained in the data set, which requires that you use a new or different set of tools. A simple example uses the idea that the median of a data set is more robust [27] than the mean (average). If you take a value in a data set and make it extremely large (approaching infinity), then the mean obviously changes. However, the median value changes very little. This doesn't mean the median is necessarily a better statistic than the mean, just that it is more robust in the presence of outliers. It may require several different analyses of the data set to understand the effect of outliers on the analysis. Robust computations might have to be employed if the presence of outliers greatly affects non-robust analysis. In either case, it takes a fair amount of analysis either to find the outliers or to determine what analysis techniques are most appropriate.
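The robustness of the median is easy to check numerically. In this small sketch, with made-up readings, one value is pushed toward ever larger magnitudes while the mean and median are recomputed each time.

```python
import statistics

# Hypothetical readings; the last one is replaced by a growing "bad" value.
readings = [9.8, 10.1, 10.0, 9.9, 10.2]

means, medians = [], []
for bad_value in (10.2, 1e3, 1e6, 1e9):
    sample = readings[:-1] + [bad_value]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))
    print(f"bad value {bad_value:>14.1f}  mean {means[-1]:>16.2f}  median {medians[-1]:.1f}")
```

The mean follows the corrupted value upward, whereas the median stays at 10.0 in every case because the middle of the sorted sample never changes.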

Visualization is a key component in data analytics. The human mind is very good at finding patterns, especially in visual information, so researchers plot various aspects of data to get a visual representation that can lead to some sort of understanding or knowledge. During data analysis, the plots used are not always the same, because the plots need to adapt to the data itself. Researchers might want to try several kinds of charts on a single data set to better understand the data. Unfortunately, it is difficult to know a priori which charts to use. Whatever the case, DA systems need compute nodes with graphics cards.

HPC systems are not typically equipped with graphics cards for visualizing data; they are used for computation, with the results pulled back to either an interactive system or a user's workstation. Some HPC systems offer what is termed ``remote visualization.'' The Texas Advanced Computing Center (TACC) at the University of Texas at Austin has a system named Longhorn [28] that was designed to provide remote visualization and computation in one system. HPC systems like this allow researchers to get an immediate view of their work, so they can either make changes to the computations or change the analysis.

Data analytics is not like a typical HPC application that is run in batch mode with no user interaction. On the contrary, data analytics may require a great deal of interactivity.