Tool Your HPC Systems for Data Analytics

Data Analytics Pipeline

Data analytics is a young and growing field that is trying to mature rapidly. As a result, many computations might be needed to arrive at some final information or answers. To get there, the computations are typically done in what is called a ``pipeline'' (also called a workflow). The phrase is commonly used in biological sequence analysis referring to a series of sequential steps taken to arrive at an answer. The output from one step is the input to the next step. Data analytics is headed rapidly in this same direction, but perhaps without explicitly stating it. Data analytics starts with raw data and then massages and manipulates it to get it ready for analysis. The analysis typically consists of a series of computational steps to arrive at an answer. Sounds like a pipeline to me.

Some of the steps in a DA pipeline might require visual output, others an interactive capability, or you might need both. Each step may require a different language or tool, as well. The consequence is that data analytics could be very complex from a job flow perspective. This can have a definite effect on how you design HPC systems and how you set up the system for users.

These characteristics -- new languages, single-node runs (but lots of them), interactivity, and DA pipelines -- are common to the majority of data analytics. Keep these concepts in mind when you deal with DA workloads.

Lots of Rapidly Changing Tools

Two features of DA tools is that they vary widely and change rapidly. A number of these tools are Java based, so HPC admins will have to contend with finding a good Java runtime environment for Linux. Moreover, given the fairly recent security concerns, you might have to patch Java fairly frequently.

Other language tools experience fairly rapid changes, as well. For example, the current two streams of Python are the 2.x and the 3.x series. Python 3.0 chose to drop certain features from Python 2.x and add new features. Some toolkits still work with Python 2.x and some work with Python 3.x. Some even work with both, although the code is a bit different for each. As an HPC admin, you will have to have several different versions of tools available on every node for DA users. (Hint: You should start thinking about using Environment Modules [29] or Lmod [30] to your advantage.)

One general class of tools that changes fairly rapidly is NoSQL databases. As the name implies, NoSQL databases don't follow SQL design guidelines and do things differently. For example, they can store data differently to adapt to a type of data and a type of analysis. They also forgo consistency in favor of availability and partition tolerance. The focus of NoSQL databases is on simplicity of design, horizontal scaling, and finer control over availability.

A large number of NoSQL databases are available with different characteristics (Table 5). Depending on the type of data analytics being performed, several of the databases can be running at the same time. As an administrator, you need to define how these databases are architected. For example, do you create dedicated nodes to store the database, or do you store the database on a variety of nodes that are then allocated by the resource manager to a user who needs the resource? If so, then you need to define these nodes with special properties for the resource manager.

Table 5: Databases
Name URL
Wide column stores
HBase []
Hypertable []
Document stores  
MongoDB []
Elasticsearch []
CouchDB []
Key value stores
DynamoDB []
Riak []
Berkeley DB []
Oracle NoSQL []
MemcacheDB []
PickleDB []
OpenLDAP []
HyperGraphDB []
GraphBase []
Bigdata []
AllegroGraph []
SciDB []