OpenStack Sahara brings Hadoop as a Service

Computing Machine

Intelligent Provisioning

At first glance, Sahara seems to duplicate many components that OpenStack already provides. For example, Keystone already handles user management, and Heat, OpenStack's template engine, appears to be reproduced in Sahara's provisioning engine.

The key element in Sahara is the provisioning engine (i.e., the part that launches Hadoop environments). Nothing works in Sahara without it. If the administrator invokes the command to start a Hadoop cluster via a web interface or command line, the command initially reaches the provisioning engine, where it is then processed.

The service has several tasks. First, it interprets the user's input. Regardless of whether the user invoked the command from a GUI or at the command line, the command contains several configuration parameters that control Sahara (Figure 3). If the user does not specify a required parameter, the Sahara plugin for Horizon or the command-line client falls back on appropriate defaults. The most important parameters describe the Hadoop distribution and the underlying Linux distribution. The administrator also sets the topology of the cluster (i.e., determines how many nodes each group will have initially).

Figure 3: Anyone building a template for a Hadoop cluster will find the appropriate parameters in the dashboard.
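The way user input and defaults come together can be pictured with a minimal Python sketch. It is not Sahara code: the class names, field names, and default values (`vanilla`, `ubuntu`, one master and three workers) are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class NodeGroup:
    """One group of identical nodes in the cluster (hypothetical structure)."""
    name: str
    count: int


def default_node_groups():
    """Illustrative default topology; not taken from Sahara."""
    return [NodeGroup("master", 1), NodeGroup("worker", 3)]


@dataclass
class ClusterRequest:
    """Parameters a user might pass; all names and defaults are assumptions."""
    hadoop_distribution: str = "vanilla"
    base_image: str = "ubuntu"
    node_groups: list = field(default_factory=default_node_groups)


def build_request(user_input: dict) -> ClusterRequest:
    """Apply user-supplied values and fall back on defaults for the rest."""
    valid = {f.name for f in fields(ClusterRequest)}
    unknown = set(user_input) - valid
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return ClusterRequest(**user_input)


# The user only names the distribution; image and topology use the defaults.
request = build_request({"hadoop_distribution": "hortonworks"})
print(request.hadoop_distribution, request.base_image,
      [(g.name, g.count) for g in request.node_groups])
```

Rejecting unknown keys early means a mistyped parameter is reported to the user instead of being silently replaced by a default.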

Then, the party begins: The provisioning scheduler is responsible for transmitting the necessary commands to the other OpenStack services, including the OpenStack authenticator Keystone and the Glance image service. Because the service that handles authentication in Sahara draws on Keystone, Sahara does not implement a Keystone replacement. The same applies to Sahara's own image registry: It is based on Glance's functions but adds Hadoop-specific functions missing in Glance for image management.

The provisioning engine itself does not duplicate code needlessly. Earlier, I referred to the different groups that belong to a Hadoop cluster (master, core workers, workers); however, Heat, which is responsible for central orchestration in OpenStack, is not familiar with these divisions. Heat always assumes that the administrator uses templates to define the individual VMs of a cluster (a stack), which Heat then launches.

For Heat and Sahara's provisioning engine to communicate meaningfully with one another, the engine in Sahara contains a built-in "interpreter" that generates templates with which Heat can work. From the administrator's perspective, it is sufficient to define templates in Sahara that basically describe a Hadoop cluster. Sahara takes care of the rest internally.
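To illustrate the idea of such an interpreter, the following sketch expands a group-based cluster description into a Heat-style template that lists every VM individually. It is an assumption-laden illustration rather than Sahara's actual generator: the function name, the input layout, and the image and flavor names are hypothetical, and only the `heat_template_version` key and the `OS::Nova::Server` resource type come from Heat's template format.

```python
import json


def render_heat_template(cluster_name, image, node_groups):
    """Expand node groups into one server resource per VM (illustrative only)."""
    resources = {}
    for group in node_groups:
        for index in range(group["count"]):
            resource_key = f"{cluster_name}_{group['name']}_{index}"
            resources[resource_key] = {
                "type": "OS::Nova::Server",
                "properties": {
                    "name": resource_key.replace("_", "-"),
                    "image": image,
                    "flavor": group["flavor"],
                },
            }
    return {
        "heat_template_version": "2013-05-23",
        "description": f"Hadoop cluster {cluster_name}",
        "resources": resources,
    }


# Hypothetical cluster description with the groups named in the text.
template = render_heat_template(
    "demo",
    "hadoop-image",
    [
        {"name": "master", "count": 1, "flavor": "m1.large"},
        {"name": "worker", "count": 2, "flavor": "m1.medium"},
    ],
)
print(json.dumps(template, indent=2))
```

The group concept exists only on the input side; the generated template contains nothing but individual servers, which is the form Heat expects.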

Tastes Differ

Hadoop is a concrete implementation of Google's MapReduce algorithm. However, in recent years, as with Linux distros, several providers have begun to offer users separate Hadoop distributions. Apache's Hadoop might always be at the core, but it is extended to include various patches and is supplied to the user with all sorts of more or less useful extensions.

When working on Sahara, the developers had to decide whether to support only the original Hadoop or to provide support for other Hadoop implementations as well. They chose the second variant. Using plugins, Sahara can be extended to support either the original Hadoop version or another distribution, such as Hortonworks or Cloudera.

Sahara is therefore very flexible. Depending on the setup, the administrator can even set several plugins to be active at the same time. The user is then left to choose a preferred flavor.
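A plugin mechanism of this kind can be pictured as a simple registry from which the administrator enables a subset. The sketch below is hypothetical throughout: the class names, the registry, and the plugin identifiers are invented for illustration and do not reflect Sahara's real plugin interface.

```python
class HadoopPlugin:
    """Hypothetical base class for a distribution-specific plugin."""
    name = "base"
    title = "Base"


class VanillaPlugin(HadoopPlugin):
    name = "vanilla"
    title = "Apache Hadoop"


class HortonworksPlugin(HadoopPlugin):
    name = "hortonworks"
    title = "Hortonworks"


class ClouderaPlugin(HadoopPlugin):
    name = "cloudera"
    title = "Cloudera"


# Everything that is installed; the administrator enables a subset.
AVAILABLE_PLUGINS = {
    cls.name: cls for cls in (VanillaPlugin, HortonworksPlugin, ClouderaPlugin)
}


def load_plugins(enabled_names):
    """Instantiate only the plugins the administrator has switched on."""
    plugins = {}
    for name in enabled_names:
        if name not in AVAILABLE_PLUGINS:
            raise KeyError(f"no plugin registered for {name!r}")
        plugins[name] = AVAILABLE_PLUGINS[name]()
    return plugins


# Two plugins active at once; the user later picks one of them.
active = load_plugins(["vanilla", "cloudera"])
for name, plugin in active.items():
    print(name, "->", plugin.title)
```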

Analytics as a Service

Until now, this article has been concerned with Sahara's ability to launch clusters using Hadoop. However, Sahara is not confined just to that task. The tool also implements comprehensive job management for Hadoop in the background. Basically, Sahara as an external component wants to know what is happening with Hadoop in the VMs. In this alternative mode, the user would not launch any VMs.

To use this Sahara functionality meaningfully, which its developers describe as "Analytics as a Service," the user would instead register the computing job directly with Sahara. To do this, the user needs to set some parameters: To which of the categories specified in Hadoop does the task belong? Which script will be run for the task, and where will Hadoop find the matching data? Where should Sahara store the results of the computations and the logs?
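The questions above map naturally onto a small job description. The following sketch shows one way such a description might be modeled and checked before submission. The field names, the set of job categories, and the example locations are assumptions for illustration, not Sahara's actual job format.

```python
from dataclasses import dataclass

# Illustrative subset of job categories; the real list depends on the setup.
ALLOWED_JOB_TYPES = {"MapReduce", "Pig", "Hive"}


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical description of a computing job registered by a user."""
    job_type: str          # category the task belongs to
    script: str            # script to run for the task
    input_location: str    # where the job finds its data
    output_location: str   # where the results should be stored
    log_location: str      # where the logs should be stored

    def __post_init__(self):
        if self.job_type not in ALLOWED_JOB_TYPES:
            raise ValueError(f"unsupported job type: {self.job_type!r}")
        for name in ("script", "input_location",
                     "output_location", "log_location"):
            if not getattr(self, name):
                raise ValueError(f"{name} must not be empty")


# Placeholder values; none of these paths refers to a real resource.
job = JobRequest(
    job_type="Pig",
    script="wordcount.pig",
    input_location="example-container/input",
    output_location="example-container/output",
    log_location="example-container/logs",
)
print(job)
```

Validating the description up front means an incomplete job is rejected before any cluster is started on its behalf.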

As soon as Sahara is familiar with the values, it takes care of the rest automatically. It starts a Hadoop cluster, performs the corresponding tasks in it, and only provides the administrator the results of these calculations at the end. Using elastic data processing (EDP), the administrator has the option of saving the final results in an object store in line with the OpenStack Swift standard.

When faced with this type of task, Sahara forms a second abstraction layer (in addition to the layer for starting VMs) between the user and the software. Administrators can thus still use Hadoop, even if they do not want to deal with the details of Hadoop.
