Lead Image © Luxuz, photocase.com

A Hands-on Look at Kubernetes with OpenAI

Learning in Containers

Article from ADMIN 41/2017

By Jonas Schneider

For research into deep learning algorithms that automatically acquire new skills, OpenAI operates some of the largest Kubernetes clusters worldwide, with up to 36,000 CPU cores. We look at some practical experience with the container management system.

OpenAI [1] is a non-profit, privately funded research institution in San Francisco, where I work with about 60 other employees on machine learning and artificial intelligence. In concrete terms, my colleagues are examining how a computer can learn new behavioral patterns through experience, without being explicitly programmed to handle a task. Financial backers include Elon Musk (Tesla, SpaceX) [2] and Sam Altman (Y Combinator) [3], among others.

OpenAI staff publish academic papers on the web, give presentations at conferences, and release software for researchers and developers. In this article, I show how OpenAI prepares its Kubernetes cluster to run artificial intelligence experiments across thousands of computers.

Go Deep

The organization's main focus is on deep learning – that is, research into large neural networks with many layers. In recent years, deep learning has gained importance because it has proved able to solve extremely complicated problems.

For example, the AlphaGo bot, developed by Google's DeepMind, has learned how to play the Chinese board game Go, which is considered to be extremely complicated, with a far wider range of moves than chess. Go experts had agreed that it would take another 20 to 30 years before a computer could beat the best human Go players. However, in the spring of 2016, the AlphaGo deep-learning software defeated Lee Sedol, the top Go player at the time, and in 2017 it even beat a team of five world champions.

OpenAI, in contrast, researches algorithms that learn not just a single game, as AlphaGo does, but a wide range of tasks. The company recently developed a robot that needs to observe only once how a person performs a task previously unknown to it; then, Fetch (the robot's internal name) is able

...
