Tool your HPC systems for data analytics
Get a Grip
I was very hesitant to use the phrase "Big Data" in the title, because it's somewhat ill defined (plus some HPC people would have given me no end of grief), so I chose to use the more generic "data analytics." I prefer this term because the basic definition refers to the process, not the size of the data or the "three Vs" [1] of Big Data: velocity, volume, and variety.
The definition I tend to favor is from TechTarget [2]: "Data Analytics is the science of examining raw data with the purpose of drawing conclusions about that information." It doesn't mention the amount of data, although the implication is that there has to be enough to be meaningful. It doesn't say anything about velocity, variety, or volume in the definition. It simply communicates the high-level process.
Another way to think of data analytics is as the combination of two concepts: "data analysis" and "analytics." Data analysis [3] is very similar to data analytics, in that it is the process of massaging data with the goal of discovering useful information that can suggest conclusions and support decision making. Analytics [4], on the other hand, is the discovery and communication of meaningful patterns in data. Even though one could argue that analytics is really a subset of data analysis, I prefer to combine the two terms, so the combination covers everything from collecting the data in raw form to examining it with algorithms or mathematics (typically implying computation) in search of possible information. I'm sure some people will disagree with me, and that's perfectly fine. We're blind men trying to define something we can't easily see and that isn't easy to define even if you can see it. (Think of defining "art," and you get the idea.)
Data analytics is the coming storm across the Oklahoma plains. You can see it miles away, and you had better get ready for it, or it will land on you with a pretty hard thump. The potential of data analytics (DA) has been fairly well documented. An easy example is PayPal, which uses DA and HPC for real-time fraud detection by continually adapting its algorithms and throwing a great deal of computational horsepower at the problem. I don't want to dive into the mathematics, statistics, or machine learning of DA tools; instead, I want to take a different approach and discuss some aspects of data analytics that affect one of the audiences of this magazine – HPC people.
Specifically, I want to discuss some of the characteristics or tendencies of DA applications, methods, and tools, since these workloads are finding their way into HPC systems. I know one director of a large HPC center who gets at least three or four requests a week from users who want to perform data analytics on the HPC systems. Digging a little deeper, the HPC staff finds that the users are mostly not "classic" HPC users, and they have their own tools and methods. Integrating their needs into existing HPC systems has proven to be more difficult than expected. As a result, it might be a good idea to present some of the characteristics or tendencies of these tools and users so you can be prepared when the users start knocking on your door and sending you email. By the way, these tools might be running on your systems already and you don't even know it.
Workload Characteristics
Before jumping in with both feet and listing all of the things that are needed in DA workloads, I think it's far better first to describe or characterize the bulk of DA workloads, which might reveal some indicators for what is needed. With these characteristics, I'm aiming for the "center of mass." I'm sure many people can come up with counterexamples, as can I, but I'm trying to develop some generalizations that can be used as a starting point.
In the subsequent sections, I'll describe some major workload characteristics, and I'll begin with the languages used in data analytics.
New Languages
The classic languages of HPC, Fortran and C/C++, are used somewhat in data analytics, but a whole host of new languages and tools are used as well. A wide variety of languages show up, particularly because Big Data is so hyped, which means everyone is trying to jump in with their particular favorite. However, a few have risen to the top:
- R [5]
- Python [6]
- Julia (up and coming) [7]
- Java [8]
- Matlab [9] and Matlab-compatible tools (Octave [10], Scilab [11], etc.)
Java is the lingua franca of MapReduce [12] and Hadoop [13]. Many toolkits for data analytics are written in Java, with the ability to be interfaced into other languages.
Because data analytics is, for the most part, about statistical methods, R, the language of statistics, is very popular. If you know a little Python or some Perl or some Matlab, then learning R is not too difficult. It has a tremendous number of built-in functions, primarily for statistics, and great graphics for making charts. Several of its libraries also are appropriate for data analytics (Table 1).
Table 1: R Libraries and Helpers

| Software | Description | Source |
|----------|-------------|--------|
| Analytics libraries | | |
| R/parallel | Add-on package that extends R with parallel computing capabilities | http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2557021/ |
| Rmpi | Wrapper to MPI | http://www.stats.uwo.ca/faculty/yu/Rmpi/ |
| HPC tools | R with BLAS, LAPACK, and MPI in Linux | http://lostingeospace.blogspot.com/2012/06/randhpcblasmpiinlinuxenvironment.html |
| RHadoop [14] | R packages to manage and analyze data with Hadoop | https://github.com/RevolutionAnalytics/RHadoop/wiki |
| Database tools [15] | | |
| RSQLite [16] | R driver for SQLite | http://cran.r-project.org/web/packages/RSQLite/index.html |
| rhbase [17] | Connectivity to HBase | https://github.com/RevolutionAnalytics/rhbase |
| graph | Package to handle graph data structures | http://www.bioconductor.org/packages/devel/bioc/html/graph.html |
| neuralnet [18] | Training neural networks | http://cran.r-project.org/web/packages/neuralnet/ |
Python is becoming one of the most popular programming languages. The language is well suited for numerical analysis and general programming. Although it comes with a great deal of capability, lots of add-ons extend Python in the DA kingdom (Table 2).
Table 2: Python Add-Ons

| Software | Description | Source |
|----------|-------------|--------|
| Pandas | Data analysis library for data analytics | http://pandas.pydata.org |
| scikit-learn | Machine learning tools | http://scikit-learn.org/stable/ |
| SciPy | Open source software for mathematics, science, and engineering | http://www.scipy.org |
| NumPy | Library for array objects, including tools for integrating C/C++ and Fortran code, linear algebra computations, Fourier transforms, and random number capabilities | http://www.numpy.org |
| matplotlib | Plotting library | http://matplotlib.org |
| Database tools | | |
| sqlite3 [19] | SQLite database interface | https://docs.python.org/2/library/sqlite3.html |
| PostgreSQL [20] | Drivers for PostgreSQL | https://wiki.postgresql.org/wiki/Python |
| MySQL-Python [21] | MySQL interface | http://mysql-python.sourceforge.net |
| HappyBase | Library to interact with Apache HBase | http://happybase.readthedocs.org/en/latest/ |
| NoSQL | List of NoSQL packages | http://nosql-database.org |
| PyBrain | Modular machine learning library | http://pybrain.org |
| ffnet | Feed-forward neural network | http://ffnet.sourceforge.net |
| Disco | Framework for distributed computing based on the MapReduce paradigm | http://discoproject.org |
| Hadoopy [22] | Wrapper for Hadoop using Cython | http://www.hadoopy.com/en/latest/ |
| Graph libraries | | |
| NetworkX | Package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks | http://networkx.github.io |
| igraph | Network analysis | http://igraph.org |
| python-graph | Library for working with graphs | https://code.google.com/p/python-graph/ |
| pydot | Interface to Graphviz's Dot language [23] | https://code.google.com/p/pydot/ |
| graph-tool | Manipulation and statistical analysis of graphs (networks) | http://graph-tool.skewed.de |
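To give a flavor of how a few of these Python pieces fit together, here is a minimal sketch using NumPy, Pandas, and scikit-learn. The data set is synthetic (a noisy linear relationship generated on the fly), and the column names are invented for illustration; a real workload would load data from files or a database.

```python
# A tiny end-to-end data-analytics sketch: generate data, summarize it
# with pandas, and fit a simple model with scikit-learn.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + rng.normal(0.0, 1.0, size=200)  # noisy linear relationship

df = pd.DataFrame({"x": x, "y": y})

# Quick summary statistics with pandas
print(df.describe())

# Fit a linear model; the estimated slope should be close to 3.0
model = LinearRegression()
model.fit(df[["x"]], df["y"])
print("estimated slope:", model.coef_[0])
```

Nothing here is specific to big data, which is exactly the point: these same few lines scale from a laptop experiment up to one of the many independent runs described later in this article.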
Julia is an up-and-coming language for HPC, but it is also drawing in DA researchers and practitioners. Julia is still a very young language; nonetheless, it has some very useful packages for data analytics (Table 3).
Table 3: Julia Packages

| Software | Description | Source |
|----------|-------------|--------|
| MLBase.jl | Functions to support the development of machine learning algorithms | https://github.com/JuliaStats/MLBase.jl |
| StatsBase.jl | Basic statistics | https://github.com/JuliaStats/StatsBase.jl |
| Distributions.jl | Probability distributions and associated functions | https://github.com/JuliaStats/Distributions.jl |
| Optim.jl | Optimization functions | https://github.com/JuliaOpt/Optim.jl |
| DataFrames.jl | Library for working with tabular data | https://github.com/JuliaStats/DataFrames.jl |
| Gadfly.jl | Crafty statistical graphics | https://github.com/dcjones/Gadfly.jl |
| PyPlot.jl | Interface to matplotlib [24] | https://github.com/stevengj/PyPlot.jl |
Matlab is a popular language in science and engineering, so it's natural for people to use it for data analytics. In general, a great deal of code is available for Matlab and Matlab-like applications (e.g., Octave and Scilab). Matlab and similar tools have data and matrix manipulation tools already built in, as well as graphics tools for plotting the results. You can write code in the respective languages of the different tools to create new functions and capabilities. The languages are reasonably close to each other, making portability easier than you might think, with the exception of graphical interfaces. Table 4 lists a few Matlab toolboxes from MathWorks and an open source toolbox for running parallel Matlab jobs. Octave and Scilab have similar functionality, but it might be spread across multiple toolboxes or come with the tool itself.
Table 4: Matlab Toolboxes

| Software | Description | Source |
|----------|-------------|--------|
| Statistics | Analyze and model data using statistics and machine learning | http://www.mathworks.com/products/statistics/ |
| Data Acquisition | Connect to data acquisition cards, devices, and modules | http://www.mathworks.com/products/daq/ |
| Image Processing | Image processing, analysis, and algorithm development | http://www.mathworks.com/products/image/ |
| Econometrics | Model and analyze financial and economic systems using statistical methods | http://www.mathworks.com/products/econometrics/ |
| System Identification | Linear and nonlinear dynamic system models from measured input-output data | http://www.mathworks.com/products/sysid/ |
| Database | Exchange data with relational databases | http://www.mathworks.com/products/database/ |
| Clustering and Data Analysis | Clustering and data analysis algorithms | http://www.mathworks.com/matlabcentral/fileexchange/7473-clustering-and-data-analysis-toolbox |
| pMatlab [25] | Parallel Matlab toolbox | http://www.ll.mit.edu/mission/cybersec/softwaretools/pmatlab/pmatlab.html |
These are just a few of the links to DA libraries, modules, addons, toolboxes, or what have you for languages that are increasingly popular in the DA world.
Single-Node Runs
Although not true for all DA workloads, the majority of applications tend to be designed for single-node systems. The code might be threaded, but very little of it runs in parallel, communicating across multiple nodes on a single data set. MapReduce jobs split the data into pieces and perform roughly the same operation on each piece, but with no communication between tasks. In other words, they are a collection of single-node runs with no real coupling or communication between them as part of the computations.
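This "independent map tasks, one reduce step" pattern can be sketched in a few lines of Python. The word-count workload and the three text chunks are hypothetical stand-ins for data split across nodes; `multiprocessing.Pool` stands in for the per-node tasks.

```python
# Sketch of the MapReduce pattern: each map task processes its chunk
# independently, with no communication between tasks until the reduce step.
from collections import Counter
from multiprocessing import Pool

def map_task(chunk):
    # Each task sees only its own piece of the data.
    return Counter(chunk.split())

def reduce_counts(partials):
    # The only "coupling" happens here, after all map tasks finish.
    total = Counter()
    for p in partials:
        total.update(p)
    return total

if __name__ == "__main__":
    chunks = ["the quick brown fox", "the lazy dog", "the fox again"]
    with Pool(processes=3) as pool:
        partials = pool.map(map_task, chunks)
    print(reduce_counts(partials)["the"])  # 3
```

Because the map tasks never talk to each other, scheduling them looks to an HPC system like a batch of unrelated single-node jobs rather than one tightly coupled parallel job.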
Just because most code is executed as single-node runs doesn't mean data analytics doesn't involve a large volume of runs at the same time. In some cases, the analysis is performed from a variety of starting points or conditions. Sometimes you make different assumptions in the analysis or use different analysis techniques, resulting in the need for a large number of single-node runs.
An example of a data analysis technique is principal component analysis [26] (PCA), in which a data set is decomposed. The resulting decomposition can then be used to reduce the data set or problem to a smaller one that is more easily analyzed. However, choosing how much of the data set to remove is something of an art, so the reduction might need to be repeated with the amount of data reduction varied, and the analysis on each reduced data set performed many times. The end result is that the overall analysis can take a while and use a great deal of computation.
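The repeated-reduction idea can be illustrated with NumPy's SVD, one common way to compute a PCA. The data here is random, purely for illustration; the point is that the choice of how many components k to keep is a judgment call, so the projection is tried at several cutoffs.

```python
# PCA via singular value decomposition: decompose once, then try
# several levels of data reduction, as described in the text.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 10))      # 100 samples, 10 features
centered = data - data.mean(axis=0)    # PCA operates on centered data

U, s, Vt = np.linalg.svd(centered, full_matrices=False)
variance = s**2 / (s**2).sum()         # fraction of variance per component

# How much reduction is "enough" is an art, so the analysis on the
# reduced data set is repeated for several choices of k.
for k in (2, 4, 6):
    reduced = centered @ Vt[:k].T      # project onto the top k components
    kept = variance[:k].sum()
    print(f"k={k}: reduced shape {reduced.shape}, variance kept {kept:.2f}")
```

Each choice of k produces a new, smaller data set to analyze, which is exactly why a single "analysis" can fan out into many independent single-node runs.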
Other examples use multiple nodes when NoSQL databases are involved. I've seen situations in which users have their data in a NoSQL database distributed across many nodes and then run MapReduce jobs across all of the nodes. The computations employed in the map phase place much emphasis on floating-point operations. The reduce phase has some fairly heavy floating-point math, but also many integer operations. All of these operations can also have a heavy I/O component. The situation I'm familiar with was not that large in HPC terms – perhaps 12 to 20 nodes – but it was just a "sample" problem. The complete problem would have started with 128 nodes and quickly moved to several thousand nodes. Although these are not MPI or closely coupled workloads, they can stress the capabilities of a large number of nodes.