Big data tools for midcaps and others

Arithmagician

Distributions

An entire ecosystem of specialized solutions has emerged around Hadoop. Apache's own Hadoop distribution is aimed primarily at providers of big data tools that build their own (commercial) solutions on top of it. This category includes, among others, Cloudera, Hortonworks, IBM, SAP, and EMC.

For mission-critical use of Hadoop, a distribution with 24/7 support from a service provider such as Cloudera [5] or Hortonworks [6] can be worthwhile. However, these providers make you pay quite handsomely for the privilege. If you do not need a service-level agreement, you can choose the free versions of these distributions instead. Additionally, some Hadoop distributions have been created specifically for small to mid-sized companies, such as Stratosphere by the Technical University of Berlin [7].

Stratosphere

Stratosphere combines easy installation with ease of use and high performance. The platform also scales to large clusters, uses multicore processors, and supports in-memory data processing. It also features advanced analytics functionality and lets users program jobs in Java and Scala.

Stratosphere is developed under the leadership of Professor Volker Markl from TU Berlin's Department of Database Systems and Information Management (DIMA). Stratosphere runs both on-premises and in the cloud (e.g., on Amazon EC2).

Hadoop Services

Growth-oriented midcaps can choose from a wide range of Hadoop services. Amazon offers Elastic MapReduce (EMR) [8], an implementation of Hadoop with support for Hadoop 2.2 and HBase 0.94.7, as well as the MapR M7, M5, and M3 Hadoop distributions by MapR Technologies [9]. The service targets companies, researchers, data analysts, and developers in the fields of web indexing, data mining, logfile analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research.

For customers who want to implement HBase, Elastic MapReduce with M7 provides seamless splits without compactions, instant error recovery, point-in-time recovery, full HA, mirroring, and consistently low latency. This version does involve some additional costs, however. Google (with Compute Engine) and Microsoft (with Azure) have their own implementations of Hadoop.

Using Hadoop as a service in the cloud means less capital outlay on hardware and avoids delays in deploying infrastructure, along with other expenses. Amazon EMR is a good example because of its clear pricing structure. In EMR, you can set up a temporary Hadoop cluster that automatically dissolves after analyzing your data, thus avoiding additional charges. Prices start at US$ 0.015 per hour per instance for the EMR service, plus the EC2 costs for each instance of the selected type (from US$ 0.06 per instance), which are also billed on an hourly basis.

Thus, for one hour with 100 instances, you would pay US$ 1.50 for the EMR service (100 x US$ 0.015) and up to US$ 6.00 for the on-demand EC2 instances (100 x US$ 0.06). The bottom line is a bill of US$ 7.50 per hour for 100 small instances. To keep costs down even further, you could reserve these instances for up to three years.
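
The arithmetic above is easy to reproduce for other cluster sizes. The following Python sketch is only an illustration: the function name and its parameters are hypothetical, the two rates are the starting prices quoted in this article, and it assumes whole billed hours and on-demand pricing with no reserved-instance discount.

```python
from decimal import Decimal

# Starting rates quoted in the article (US$ per instance per hour).
# Actual rates depend on instance type and region; treat these as assumptions.
EMR_RATE = Decimal("0.015")  # EMR service charge
EC2_RATE = Decimal("0.06")   # on-demand EC2 charge for a small instance


def hourly_cluster_cost(instances, hours=1, emr_rate=EMR_RATE, ec2_rate=EC2_RATE):
    """Return (emr_cost, ec2_cost, total) in US$ for a temporary cluster.

    `hours` is the number of whole billed hours per instance.
    """
    emr_cost = emr_rate * instances * hours
    ec2_cost = ec2_rate * instances * hours
    return emr_cost, ec2_cost, emr_cost + ec2_cost


emr, ec2, total = hourly_cluster_cost(100)
print(f"EMR: US$ {emr:.2f}  EC2: US$ {ec2:.2f}  total: US$ {total:.2f}")
# prints "EMR: US$ 1.50  EC2: US$ 6.00  total: US$ 7.50"
```

Decimal is used instead of floating-point numbers so that fractions of a cent, such as the US$ 0.015 service charge, add up exactly.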
