Lead Image © Sergey Nivens, 123RF.com

Tool your HPC systems for data analytics

Get a Grip

Article from ADMIN 22/2014
As data analytics workloads become more common, administrators need to assess their hardware, software, and processes.

I was very hesitant to use the phrase "Big Data" in the title, because it's somewhat ill defined (plus some HPC people would have given me no end of grief), so I chose to use the more generic "data analytics." I prefer this term because the basic definition refers to the process, not the size of the data or the "three Vs" [1] of Big Data: velocity, volume, and variety.

The definition I tend to favor is from TechTarget [2]: "Data Analytics is the science of examining raw data with the purpose of drawing conclusions about that information." It doesn't mention the amount of data, although the implication is that there has to be enough to be meaningful. Nor does it say anything about velocity, variety, or volume. It simply communicates the high-level process.

Another way to think of data analytics is as the combination of two concepts: "data analysis" and "analytics." Data analysis [3] is very similar to data analytics, in that it is the process of massaging data with the goal of discovering useful information that can be used for suggesting conclusions and supporting decision making. Analytics [4], on the other hand, is the discovery and communication of meaningful patterns in data. Even though one could argue that analytics is really a subset of data analysis, I prefer to combine the two terms, so that the result covers everything from collecting the data in raw form to examining the data with algorithms or mathematics (typically implying computations) to look for possible information. I'm sure some people will disagree with me, and that's perfectly fine. We're blind men trying to define something that we can't easily see and that isn't easy to define, even if you can see it. (Think of defining "art," and you get the idea.)

Data analytics is the coming storm across the Oklahoma plains. You can see it miles away, and you had better get ready for it, or it will land on you with a pretty hard thump. The potential of data analytics (DA) has been fairly well documented. An easy example is PayPal, which uses DA and HPC for real-time fraud detection by continually adapting its algorithms and throwing a great deal of computational horsepower at the problem. I don't want to dive into the mathematics, statistics, or machine learning of DA tools; instead, I want to take a different approach and discuss some aspects of data analytics that affect one of the audiences of this magazine – HPC people.

Specifically, I want to discuss some of the characteristics or tendencies of DA applications, methods, and tools, since these workloads are finding their way into HPC systems. I know one director of a large HPC center who gets at least three or four requests a week from users who want to perform data analytics on the HPC systems. Digging a little deeper, the HPC staff finds that the users are mostly not "classic" HPC users, and they have their own tools and methods. Integrating their needs into existing HPC systems has proven to be more difficult than they thought. As a result, it might be a good idea to present some of the characteristics or tendencies of these tools and users so you can be prepared when the users start knocking on your door and sending you email. By the way, these tools might be running on your systems already and you don't even know it.

Workload Characteristics

Before jumping in with both feet and listing all of the things that are needed in DA workloads, I think it's far better first to describe or characterize the bulk of DA workloads, which might reveal some indicators for what is needed. With these characteristics, I'm aiming for the "center of mass." I'm sure many people can come up with counterexamples, as can I, but I'm trying to develop some generalizations that can be used as a starting point.

In the subsequent sections, I'll describe some major workload characteristics, and I'll begin with the languages used in data analytics.

New Languages

The classic languages of HPC, Fortran and C/C++, are used somewhat in data analytics, but a whole host of new languages and tools are used as well. A wide variety of languages show up, particularly because Big Data is so hyped, which means everyone is trying to jump in with their particular favorite. However, a few have risen to the top:

  • R [5]
  • Python [6]
  • Julia (up and coming) [7]
  • Java [8]
  • Matlab [9] and Matlab-compatible tools (Octave [10], Scilab [11], etc.)

Java is the lingua franca of MapReduce [12] and Hadoop [13]. Many toolkits for data analytics are written in Java and can be interfaced with other languages.

Because data analytics is, for the most part, about statistical methods, R, the language of statistics, is very popular. If you know a little Python or some Perl or some Matlab, then learning R is not too difficult. It has a tremendous number of built-in functions, primarily for statistics, and great graphics for making charts. Several of its libraries also are appropriate for data analytics (Table 1).

Table 1

R Libraries and Helpers

Software | Description | Source
Analytics libraries
R/parallel | Add-on package extends R by adding parallel computing capabilities | http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2557021/
Rmpi | Wrapper to MPI | http://www.stats.uwo.ca/faculty/yu/Rmpi/
HPC tools | R with BLAS, LAPACK, and MPI in Linux | http://lostingeospace.blogspot.com/2012/06/r-and-hpc-blas-mpi-in-linux-environment.html
RHadoop [14] | R packages to manage and analyze data with Hadoop | https://github.com/RevolutionAnalytics/RHadoop/wiki
Database tools [15]
RSQLite [16] | R driver for SQLite | http://cran.r-project.org/web/packages/RSQLite/index.html
rhbase [17] | Connectivity to HBASE | https://github.com/RevolutionAnalytics/rhbase
graph | Package to handle graph data structures | http://www.bioconductor.org/packages/devel/bioc/html/graph.html
neuralnet [18] | Training neural networks | http://cran.r-project.org/web/packages/neuralnet/

Python is becoming one of the most popular programming languages. The language is well suited for numerical analysis and general programming. Although it comes with a great deal of capability, lots of add-ons extend Python in the DA kingdom (Table 2).

Table 2

Python Add-Ons

Software | Description | Source
Pandas | Data analysis library for data analytics | http://pandas.pydata.org
scikit-learn | Machine learning tools | http://scikit-learn.org/stable/
SciPy | Open source software for mathematics, science, and engineering | http://www.scipy.org
NumPy | A library for array objects including tools for integrating C/C++ and Fortran code, linear algebra computations, Fourier transforms, and random number capabilities | http://www.numpy.org
matplotlib | Plotting library | http://matplotlib.org
Database tools
sqlite3 [19] | SQLite database interface | https://docs.python.org/2/library/sqlite3.html
PostgreSQL [20] | Drivers for PostgreSQL | https://wiki.postgresql.org/wiki/Python
MySQL-Python [21] | MySQL interface | http://mysql-python.sourceforge.net
HappyBase | Library to interact with Apache HBase | http://happybase.readthedocs.org/en/latest/
NoSQL | List of NoSQL packages | http://nosql-database.org
PyBrain | Modular machine learning library | http://pybrain.org
ffnet | Feed-forward neural network | http://ffnet.sourceforge.net
Disco | Framework for distributed computing based on the MapReduce paradigm | http://discoproject.org
Hadoopy [22] | Wrapper for Hadoop using Cython | http://www.hadoopy.com/en/latest/
Graph libraries
NetworkX | Package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks | http://networkx.github.io
igraph | Network analysis | http://igraph.org
python-graph | Library for working with graphs | https://code.google.com/p/python-graph/
pydot | Interface to Graphviz's Dot language [23] | https://code.google.com/p/pydot/
graph-tool | Manipulation and statistical analysis of graphs (networks) | http://graph-tool.skewed.de

Julia is an up-and-coming language for HPC, but it is also drawing in DA researchers and practitioners. Julia is still a very young language; nonetheless, it has some very useful packages for data analytics (Table 3).

Table 3

Julia Packages

Software | Description | Source
MLBase.jl | Functions to support the development of machine learning algorithms | https://github.com/JuliaStats/MLBase.jl
StatsBase.jl | Basic statistics | https://github.com/JuliaStats/StatsBase.jl
Distributions.jl | Probability distributions and associated functions | https://github.com/JuliaStats/Distributions.jl
Optim.jl | Optimization functions | https://github.com/JuliaOpt/Optim.jl
DataFrames.jl | Library for working with tabular data | https://github.com/JuliaStats/DataFrames.jl
Gadfly.jl | Crafty statistical graphics | https://github.com/dcjones/Gadfly.jl
PyPlot.jl | Interface to matplotlib [24] | https://github.com/stevengj/PyPlot.jl

Matlab is a popular language in science and engineering, so it's natural for people to use it for data analytics. In general, a great deal of code is available for Matlab and Matlab-like applications (e.g., Octave and Scilab). Matlab and similar tools have data and matrix manipulation tools already built in, as well as graphics tools for plotting the results. You can write code in the respective languages of the different tools to create new functions and capabilities. The languages are reasonably close to each other, making portability easier than you might think, with the exception of graphical interfaces. Table 4 lists a few Matlab toolboxes from MathWorks and an open source toolbox for running parallel Matlab jobs. Octave and Scilab have similar functionality, but it might be spread across multiple toolboxes or come with the tool itself.

Table 4

Matlab Toolboxes

Software | Description | Source
Statistics | Analyze and model data using statistics and machine learning | http://www.mathworks.com/products/statistics/?s_cid=sol_des_sub2_relprod3_statistics_toolbox
Data Acquisition | Connect to data acquisition cards, devices, and modules | http://www.mathworks.com/products/daq/
Image Processing | Image processing, analysis, and algorithm development | http://www.mathworks.com/products/image/
Econometrics | Model and analyze financial and economic systems using statistical methods | http://www.mathworks.com/products/econometrics/?s_cid=HP_FP_ML_EconometricsToolbox
System Identification | Linear and nonlinear dynamic system models from measured input-output data | http://www.mathworks.com/products/sysid/
Database | Exchange data with relational databases | http://www.mathworks.com/products/database/
Clustering and Data Analysis | Clustering and data analysis algorithms | http://www.mathworks.com/matlabcentral/fileexchange/7473-clustering-and-data-analysis-toolbox
pMatlab [25] | Parallel Matlab toolbox | http://www.ll.mit.edu/mission/cybersec/softwaretools/pmatlab/pmatlab.html

These are just a few of the links to DA libraries, modules, add-ons, toolboxes, or what have you for languages that are increasingly popular in the DA world.

Single-Node Runs

Although not true for all DA workloads, the majority of applications tend to be designed for single-node systems. The code might be threaded, but very little code runs in parallel, communicating across multiple nodes on a single data set. MapReduce jobs split the data into pieces and roughly perform the same operation on each piece of the data, but with no communication between tasks. In other words, they are a collection of single-node runs with no real coupling or communication between them as part of the computations.
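
To make the "split, apply, combine" pattern concrete, here is a minimal Python sketch of the idea. It is not Hadoop code and does not use any MapReduce framework; the function names (split, map_task, reduce_task, mean_by_mapreduce) and the sample readings are invented for illustration. The point is only that each map task sees its own piece of the data and never talks to the others.

```python
from functools import reduce


def split(data, n_pieces):
    """Split a list into n_pieces contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), n_pieces)
    chunks, start = [], 0
    for i in range(n_pieces):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def map_task(chunk):
    """Same operation on every piece: a partial (count, sum).

    The task only sees its own chunk, so it could run on any node
    without communicating with the other tasks.
    """
    return (len(chunk), sum(chunk))


def reduce_task(a, b):
    """Combine two partial results into one."""
    return (a[0] + b[0], a[1] + b[1])


def mean_by_mapreduce(data, n_pieces=4):
    # Map phase: independent tasks (run serially here for simplicity).
    partials = [map_task(chunk) for chunk in split(data, n_pieces)]
    # Reduce phase: fold the partial results together.
    count, total = reduce(reduce_task, partials, (0, 0.0))
    return total / count


# Made-up sample data, purely for illustration.
readings = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
print(mean_by_mapreduce(readings, n_pieces=3))
```

In this sketch the map tasks run one after another in a single process; in a real deployment, the framework would place them on different nodes, but the lack of coupling between tasks is the same.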

Just because most code is executed as single-node runs doesn't mean that data analytics doesn't involve a large number of runs at the same time. In some cases, the analysis is performed with a variety of starting points or conditions. Sometimes you make different assumptions in the analysis or use different analysis techniques, resulting in the need to run a large number of single-node runs.
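
A sweep of this kind might look roughly like the following Python sketch, which uses only the standard library. Everything in it is a stand-in: single_run is a toy "analysis" on synthetic data, the seeds play the role of starting conditions, and the two estimators play the role of different techniques. A thread pool is used only to show that the runs are independent; on an HPC system, each case would more likely be a separate job handed to the resource manager.

```python
import itertools
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def single_run(params):
    """One self-contained run: its own starting condition and technique."""
    seed, technique = params
    rng = random.Random(seed)  # the seed stands in for a starting condition
    # Synthetic data: 1,000 draws from a normal distribution (mean 10, sd 2).
    sample = [rng.gauss(10.0, 2.0) for _ in range(1000)]
    if technique == "mean":
        estimate = statistics.mean(sample)
    else:
        estimate = statistics.median(sample)
    return {"seed": seed, "technique": technique, "estimate": estimate}


def sweep(seeds, techniques, max_workers=4):
    """Run every (seed, technique) combination; no run depends on another."""
    cases = list(itertools.product(seeds, techniques))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(single_run, cases))


results = sweep(seeds=range(5), techniques=["mean", "median"])
for result in results:
    print(result)
```

Five seeds and two techniques already produce 10 runs; adding a third dimension to the sweep multiplies the count again, which is how a single-node workload can still fill many nodes.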

An example of a data analysis technique is principal component analysis [26] (PCA), in which a data set is decomposed. The resulting decomposition can then be used to reduce the data set or problem to a smaller one that is more easily analyzed. However, choosing how much of the data set to remove is something of an art, so the reduction might need to be performed several times with the amount of data reduction varied. It also means the analysis on the reduced data set might have to be performed many times. The end result is that the overall analysis could take a while and use a great deal of computation.
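
As a rough illustration of why the amount of reduction ends up being varied, the following Python sketch performs PCA with NumPy's singular value decomposition and then tries every possible number of retained components. The data set is synthetic (five features driven by two hidden factors plus a little noise), and the helper names pca and reduce_data are invented for this example; a real analysis would use real data and probably a library such as scikit-learn.

```python
import numpy as np


def pca(data):
    """Decompose mean-centered data with an SVD.

    Returns the centered data, the principal components (one per row),
    and the fraction of total variance explained by each component.
    """
    centered = data - data.mean(axis=0)
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    return centered, components, variance / variance.sum()


def reduce_data(centered, components, k):
    """Project the centered data onto the first k principal components."""
    return centered @ components[:k].T


# Synthetic data (an assumption for this sketch): 200 samples, 5 features,
# generated from 2 latent factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
data = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

centered, components, ratio = pca(data)

# Vary the amount of reduction; each k would feed a separate downstream analysis.
for k in range(1, 6):
    reduced = reduce_data(centered, components, k)
    print(k, reduced.shape, round(float(ratio[:k].sum()), 4))
```

With data built this way, two components should capture almost all of the variance, but real data sets rarely have such a clean cutoff, so each candidate value of k tends to become its own run.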

Other workloads do use multiple nodes, particularly when NoSQL databases are involved. I've seen situations in which users have their data in a NoSQL database distributed across many nodes and then run MapReduce jobs across all of the nodes. The computations employed in the map phase place much emphasis on floating-point operations. The reduce phase has some fairly heavy floating-point math, but also many integer operations. All of these operations can also have a heavy I/O component. The situation I'm familiar with was not that large in HPC terms – perhaps 12 to 20 nodes. However, it was just a "sample" problem. The complete problem would have started with 128 nodes and quickly moved to several thousand nodes. Although these are not MPI or closely coupled workloads, they can stress the capabilities of a large number of nodes.
