Distributed Linear Algebra with Apache Mahout

Matrix Math

Engine Agnosticism (and Why It Matters)

Engine agnosticism is important to many organizations, many of whom don't even realize it. In just the 2000s, clusters have moved from Hadoop to Spark, and from Spark to Kubernetes. Teams and organizations often find themselves in an uncomfortable position of being pinned to outmoded technology, rather than bringing in costly consultants to port business-critical algorithms and methods from an old system to a new one.

The Mahout project lived through the migration from Hadoop to Spark and incorporated the lessons learned into its very fabric, making it simple to port algorithms from any arbitrary platform to any other (Figures 3 and 4).

Figure 3: Matrix math with the Apache Flink engine.
Figure 4: Matrix math with the Apache Spark engine.

Note that the code after the import statements requires no changes. Therefore, a team that uses one back-end engine is able to migrate code onto another engine, without having to change the code performing the mathematical operations and avoiding the more error-prone part of porting code from one platform to another.

Mahout Use Cases

Mahout is for organizations who have statistical methods to run on distributed datasets but want to minimize their exposure to the technical debt that arises from writing algorithms against a specific engine that may or may not have a successful future. Mahout allows organizations to switch their systems of record while having a minimal effect on their data outputs.

Mahout also lets users add linear algebra concepts to data stores that have either weak or nonexistent implementations for linear algebra concepts, such as Apache Spark. (Spark's linear algebra works fine in single-node deployments but has issues scaling to larger distributed datasets.)

Mahout in the Wild

A major used car marketplace in North America used the Mahout codebase when creating their car recommendation system. This recommendation engine is based on Mahout's Correlated Cross-Occurrence (CCO) analysis. The CCO algorithm is very similar to the more popular co-occurence (CO) algorithm, but it also incorporates other attributes of the user into its recommendations; in more technical parlance, it is multimodal.

Mahout has been used in many situations where customer privacy and intellectual property concerns keep them from being published, but many researchers and practitioners have built recommenders, similarity engines, and other predictive models at scale with the use of its tools.

A paper written by T. Grant [2] illustrates another Mahout use case. During the outset of the COVID-19 pandemic, CT scans were shown to be as good as, or in some cases superior to, RT-PCR tests. A major issue however was the high dose of radiation they delivered. With Mahout, Grant showed how "noisier," "low-dose" CT scans could be quickly and easily denoised, with about five lines of Mahout code.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy ADMIN Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Distributed Linear Algebra with Mahout

    The Apache Mahout distributed linear algebra framework delivers new tools and methods for performing data analysis, building machine learning data pipelines, and implementing machine learning models in production.

comments powered by Disqus
Subscribe to our ADMIN Newsletters
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs



Support Our Work

ADMIN content is made possible with support from readers like you. Please consider contributing when you've found an article to be beneficial.

Learn More”>
	</a>

<hr>		    
			</div>
		    		</div>

		<div class=