Lead Image © Sergei Tryapitsyn, 123RF.com

Lead Image © Sergei Tryapitsyn, 123RF.com

Apache Storm

Analyzing large volumes of data with Apache Storm

Article from ADMIN 29/2015
By
We take you through the installation of a Storm cluster and discuss how to create your own topologies.

Huge amounts of data that are barely manageable are created in corporate environments every day. This data includes information from a variety of sources such as business metrics, network nodes, or social networking. Comprehensive real-time analysis and evaluation are required to ensure smooth operation and as a basis for business-critical decisions. A big data specialist such as Apache Storm is necessary to organize such amounts of data. In this article, I will walk you through the installation of a Storm cluster and touch on the subject of creating your own topologies.

Whether your company is in production or the service industry, the volumes of data that need to be processed keep growing from year to year. Today, many different sources deliver huge volumes of information to data centers and staff computers. Thus, the focus is on big data – a buzzword that seems to electrify the IT industry.

Big data concerns the economically meaningful production and use of relevant findings from qualitatively different and structurally highly diverse information. To make matters worse, this raw data is often subject to rapid change. Big data requires concepts, methods, technologies, IT architectures, and tools that companies can use to control this flood of information in a meaningful way.

Storm at a Glance

Storm was originally developed by Twitter and has been maintained under the aegis of the Apache Software Foundation since 2013. It is a scalable open source tool that focuses on real-time analysis of large amounts of data. Whereas Hadoop primarily relies on batch processing, Storm is a distributed, fault-tolerant system which – like Hadoop – specializes in processing very large amounts of data. However, the crucial difference lies in real-time processing.

Another feature is its high scalability: Storm uses Hadoop ZooKeeper for cluster coordination and is therefore highly scalable. Storm clusters are also easier to manage. Storm is designed so that all incoming information is processed. Topologies can, in principle, be defined in any programming language, although Storm typically uses Java.

A Storm environment usually consists of several components. A Storm cluster resembles a Hadoop cluster in many ways. Whereas MapReduce jobs are run in Hadoop, Storm uses the aforementioned topologies. Additionally, MapReduce operations are used by Hadoop for data consolidation; Storm again uses topologies. They are both very similar, except that MapReduce operations terminate on completion, whereas Storm constantly runs its topologies.

You will encounter two types of nodes in a Storm cluster: master and worker nodes (Figure 1). The Nimbus daemon, whose functionality is comparable to that of the JobTracker in the Hadoop environment, runs on the master node. Nimbus is responsible for distributing the code among the cluster nodes. It assigns tasks to the available nodes and monitors their accessibility and availability.

Figure 1: The architecture of a simple Storm topology.

The supervisor daemon, which waits for work from Nimbus, runs on each worker node. Each worker process executes a topology subset. Conversely, this means that multiple worker processes (which are usually distributed over different computers) are responsible for executing a topology.

ZooKeeper is responsible for the liaison and coordination between Nimbus and the supervisor processes (Figure 2). The daemons are fail-fast systems and are therefore designed so that you can detect errors at an early stage and counter them. Because the daemon status information is managed in ZooKeeper, or alternatively on a local drive, they are extremely fail-safe. If the Nimbus or several supervisors fail, they are restarted automatically, as if nothing had happened.

Figure 2: ZooKeeper mediates between Nimbus and supervisors.

Topologies and Streams

Distributed data processing is referred to as a topology in the Storm big data environment; it consists of streams, spouts, and bolts. Storm topologies are very similar to legacy batch mechanisms; however, unlike batch processing, they do not have a beginning and endpoint but instead keep running until they are finished.

The central data structure in Storm is the tuple, which – in simplified terms – consists of a list of key-value pairs. A stream consists of an unlimited number of tuples. The tuples are comparable to the events from the field of Complex Event Processing (CEP).

Spouts are the steam sources. You can understand them as adapters to the output sources, which convert the source data into tuples and then output these tuples as streams. Storm provides a simple API for implementing the spouts. Possible sources of data can include:

  • Output of (network) sensors
  • Social media feeds
  • Click streams from web-based and mobile applications
  • Application event logs

Because spouts do not typically use specific business logic, they can be used as often as you like in different topologies.

You can imagine topologies as a network of spouts and bolts. The bolts comprise the processing mechanism that receives the incoming streams, processes them, and then generates one or more output streams from them. Bolts can apply various actions to the incoming information. Typical functions can include:

  • Filtering tuples
  • Carrying out joins and aggregations
  • Simple and complex calculations
  • Reading and writing from/to databases

In principle, all processing steps are possible. Predefined topologies exist for typical processing steps; alternatively, adapting sample topologies is quite simple.

Setting Up Your Own Storm Cluster

Like Hadoop, Storm uses a typical master-slave environment but with slightly different semantics. In a classic master-slave system, the central server is usually fixed or set dynamically in the configuration. Storm uses a slightly different approach and is regarded as extremely fail-safe, thanks to the use of Apache ZooKeeper.

Storm is a Java-based environment, and all Storm demons are controlled by a Python file. Before actually installing Storm, you must first ensure that the necessary interpreters are correctly installed on the system concerned. You can also run the various Storm components on one system to familiarize yourself with them.

Storm was originally designed to run on Linux, but there are now also packages for Windows. Using an Ubuntu server is advisable if you want to evaluate Storm, because both Storm and the required components can be installed easily this way. If you are setting up a new Linux system, you should also select the OpenSSH server in the package selection.

The first step is the installation of Java components. The easiest way to obtain them is via the Apt package management system:

sudo apt-get update
sudo apt-get --yes install openjdk-6-jdk

To start, the simplest way is set up a single-node pseudo-cluster on which ZooKeeper and the different Storm components can be run side by side. Storm requires the use of ZooKeeper from version 3.3.x onward. You can install the latest version 3.4.6 using the following command:

sudo apt-get --yes install zookeeper=3.4.6 zookeeperd=3.4.6

This command sets up the ZooKeeper binaries and the service scripts for starting and stopping ZooKeeper.

Next, you can turn to the actual installation of Storm. You will find the current archive on the Internet [1]. Start with the Storm users and the associated group. You can create this as follows:

sudo groupadd storm
sudo useradd --gid storm --home /home/storm --create-home --shell /bin/bash storm

After downloading Storm, install the big data environment in the /usr/share directory and create a symlink to the /usr/share/storm directory. The advantage of this approach is that you can easily set up newer versions and only need to change a single symbolic link. Additionally, you can link the storm executables to /usr/bin/storm:

sudo wget <Download-URL>
sudo unzip -o apache-storm-0.9.2-incubating.zip -d /usr/share/
sudo ln -s /usr/share/apache-storm-0.9.2-incubating /usr/share/storm
sudo ln -s /usr/share/storm/bin/storm /usr/bin/storm

Storm writes its logfiles in the /usr/share/storm/logs directory by default instead of /var/log, the default log directory for most Unix variants. To change this, create a subdirectory for Storm in which the log data can be written. To do so, enter the following commands:

sudo mkdir /var/log/storm
sudo chown storm:storm /var/log/storm
sudo sed -i 's/${storm.home}\/logs/\/var\/log\/storm/g' /usr/share/storm/log4j/storm.log.properties

Finally, move the Storm configuration file to /etc/storm and create a symbolic link:

sudo mkdir /etc/storm
sudo chown storm:storm /etc/storm
sudo mv /usr/share/storm/conf/storm.yaml /etc/storm/
sudo ln -s /etc/storm/storm.yaml /usr/share/storm/conf/storm.yaml

This completes the installation of Storm, and you can now configure it and ensure the Storm daemon runs automatically.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy ADMIN Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

comments powered by Disqus

SysAdmin Day 2017!

  • Happy SysAdmin Day 2017!

    Download a free gift to celebrate SysAdmin Day, a special day dedicated to system administrators around the world. The Linux Professional Institute (LPI) and Linux New Media are partnering to provide a free digital special edition for the tireless and dedicated professionals who keep the networks running: “10 Terrific Tools."

Special Edition

Newsletter

Subscribe to ADMIN Update for IT news and technical tips.

ADMIN Magazine on Twitter

Follow us on twitter