Tool Your HPC Systems for Data Analytics

Storage and Compute -- Hadoop

One of the tools to which you will really have to pay attention is Hadoop. By Hadoop, I mean not only the Hadoop filesystem (HDFS) [31], but also the idea of MapReduce and how you write applications using MapReduce concepts. These are really two different things, but to make my life easier, I will use ``Hadoop'' to mean both, unless I specifically refer to one or the other.
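To make the MapReduce side of this concrete, here is a minimal word-count sketch in the Hadoop Streaming style: a Python mapper and a reducer that read from stdin and write to stdout. The word-count task and the file names (mapper.py, reducer.py) are mine for illustration and aren't tied to any particular Hadoop distribution.

# mapper.py -- the "map" step: read raw text from stdin and
# emit one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

# reducer.py -- the "reduce" step: Hadoop sorts the mapper output by key,
# so identical words arrive together; sum the counts for each word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Because Streaming jobs are just pipes, you can sanity check the logic with cat input.txt | python mapper.py | sort | python reducer.py before going anywhere near the cluster; under Hadoop, the framework takes care of splitting the input, the sort (the ``shuffle''), and gathering the output.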

Using Hadoop can greatly complicate your life as an HPC administrator. Embodied in Hadoop is the concept of moving the compute to where the data is located. Consequently, submitting a DA job that uses Hadoop is a bit more complicated, because only certain nodes will contain the needed data. Although HDFS can copy the data to other nodes, that's not really the thrust of Hadoop. Therefore, your resource manager needs to be ``data aware,'' so that it can either find the nodes where the data is located or copy the data to nodes that are available.
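To give you a feel for what ``data aware'' means in practice, the sketch below asks HDFS which datanodes hold the blocks of a given file -- exactly the information a job script or scheduler plugin needs to steer work toward the data. It assumes the hdfs client is in your PATH; the regex is deliberately naive and tied to the fsck output of whatever Hadoop release you run, and the file path is hypothetical.

# Sketch: list the datanodes that hold the blocks of an HDFS file.
import re
import subprocess

def datanodes_for(path):
    out = subprocess.check_output(
        ["hdfs", "fsck", path, "-files", "-blocks", "-locations"],
        universal_newlines=True)
    # fsck reports each block replica's datanode as "ip:port";
    # collect the unique IP addresses.
    return sorted(set(m.group(1)
                      for m in re.finditer(r"(\d+\.\d+\.\d+\.\d+):\d+", out)))

if __name__ == "__main__":
    print(datanodes_for("/user/jeff/input.csv"))    # hypothetical path

A data-aware scheduler performs essentially this lookup for you and then tries to place tasks on, or near, those nodes.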

Another complication is that Hadoop 2, the current version of Hadoop, uses something called YARN [32]. YARN stands for Yet Another Resource Negotiator. Fundamentally, it is a resource manager similar to Slurm, Moab, OGE, or Torque. If you have DA applications that depend on Hadoop and YARN within an HPC system that already has a resource manager, you end up in a ``Who's on First?'' situation -- that is, which resource manager ``owns'' or ``controls'' which specific resources (nodes)? I think all HPC administrators know that you can't have two resource managers trying to manage the same nodes; you will have lots of problems very fast.

Most compute nodes in HPC systems either have no disk (diskless) or a single disk, so it is really difficult to use them as Hadoop nodes. You have several options at this point. One option is to give all or a portion of the compute nodes in the cluster a fair amount of local disk for storing data. If all of the nodes get local disks, the resource manager's life is a bit easier, but you will need more racks, and the overall cost of the system will go up. If you make only a portion of the compute nodes appropriate for Hadoop (lots of local disks), then you need to tell the resource manager that these nodes have different properties (e.g., ``Hadoop'') and set up the resource scheduling appropriately. Although this is cheaper than stuffing all the nodes with disks, it is a bit more complicated.
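With Slurm, for example, you could tag the disk-heavy nodes with a feature (say, ``hadoop'') in slurm.conf and have jobs request that feature at submission time. The little wrapper below is only a sketch of that idea: the feature name, node count, and job script are hypothetical, whereas --nodes and --constraint are standard sbatch options.

# Sketch: submit a job that must land on nodes the site has tagged
# with a "hadoop" feature (e.g., nodes with extra local disks).
import subprocess

def submit_hadoop_job(script, nodes=4, feature="hadoop"):
    cmd = ["sbatch",
           "--nodes=%d" % nodes,          # how many Hadoop-capable nodes to ask for
           "--constraint=%s" % feature,   # only nodes carrying this feature qualify
           script]
    return subprocess.check_output(cmd, universal_newlines=True).strip()

print(submit_hadoop_job("run_mapreduce.sh"))    # hypothetical job script

Other resource managers have equivalent mechanisms (node properties in Torque/Moab, resource attributes in OGE), so the same idea carries over.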

A second option is to build HDFS on some sort of centralized storage. HDFS is a meta-filesystem, in that it is a filesystem built on top of other filesystems (usually ``local'' filesystems). This means you can build HDFS on top of almost any storage you want. For example, if you have centralized storage such as Lustre, you can simply build HDFS on top of it [33]. However, this approach neither takes full advantage of the centralized storage nor allows HDFS to be used effectively, because the data is no longer local to the compute nodes.
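If you do take this route, the setup largely comes down to pointing the HDFS daemons at directories on the shared filesystem. The hdfs-site.xml excerpt below is a hedged example only: the Lustre mount point and per-node directory layout are assumptions, and replication is dropped to one because the data already lives on centralized, redundant storage.

<!-- hdfs-site.xml (excerpt): each datanode stores its blocks in its own
     directory on the Lustre mount; the paths here are hypothetical. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/lustre/hadoop/node01/dfs/data</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

Whether a replication count of one is wise depends on how much you trust the underlying storage, but keeping three copies of data that already sits on a redundant back end mostly wastes capacity.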

As a third alternative, Intel has created some tools for the Intel Enterprise Edition of Lustre (IEEL) that allow MapReduce applications to write directly to Lustre [34]. These tools also allow the ``shuffle'' phase of MapReduce to be skipped. The current version of IEEL, version 2.0, includes a beta tool that replaces YARN with your existing resource manager, allowing you to use a single resource manager within your HPC system.

Summary

Data analytics is probably the fastest growing computational workload today. Relative to HPC, it is still done on a somewhat small scale, although companies such as PayPal are proving the need for larger scale computations. Naturally, the desire is for these computations to be done on HPC systems to avoid the cost of a second system. However, data analytics is a different workload from what you have experienced in the HPC world to date.

In this article, I've reviewed some aspects of data analytics workloads. Be ready for:

  • Lots of new languages, including interfaces to traditional databases and NoSQL databases
  • Lots of single-node runs (possibly lots of memory)
  • Interactivity
  • Interactive login
  • Visualization
  • Graphics cards in nodes
  • Data analytics pipelines
  • Lots of rapidly changing tools
  • SQL tools
  • NoSQL tools
  • Hadoop and storage
  • Hadoop moves computation to storage (most of the time)
  • Hadoop uses local storage
  • Hadoop 2 uses its own resource manager, YARN, which can easily conflict with the HPC system's existing resource manager

If you read through these highlights and talk to your DA users, you will see that you might need to add or change some of your processes, and you might need to add new hardware. If you don't have DA users today, then I suggest you look a little closer, or be ready for the data analytics wave to overtake you.