
Application-aware batch scheduler
Eruption
The basic Kubernetes scheduler – kube-scheduler – does a great job of bin-packing pods into nodes, but it can result in a scheduling deadlock for the more complicated multipod jobs created by analytics and artificial intelligence and machine learning (AI/ML) frameworks such as Apache Spark and PyTorch. That means expensive cluster resources, such as GPUs, sit idle, unavailable to any workload.
Volcano scheduler is a Cloud Native Computing Foundation (CNCF) project that introduces the Queue and PodGroup custom resources to enable gang scheduling (i.e., the simultaneous scheduling of multiple related objects) and facilitate more efficient use of the cluster. Complex jobs run more reliably, and data engineers become more productive.
In this article, I demonstrate default Kubernetes scheduling behavior with short-lived single-pod jobs, show how multipod jobs from Apache Spark and PyTorch can trigger a scheduling deadlock, and use Volcano to run the same jobs smoothly and predictably. The Git repository [1] gives full details of how to create a test Kubernetes cluster on DigitalOcean and run all the examples.
Kubernetes for Analytics and ML
Kubernetes, often considered the operating system of the cloud, is usually thought of in terms of distributed microservices – in other words, client-server applications with an indefinite lifespan, decomposed into smaller service components (e.g., database, business logic, web front end) for containerized deployment in a way that makes each part redundant, scalable, and easy to upgrade. In that use case, the Kubernetes cluster was most likely designed and scaled with the application's resource requirements in mind, but Kubernetes also lends itself to the "batch" use case – that is, running resource-intensive jobs of finite duration, particularly in the fields of Big Data analytics, high-performance computing, and training of AI models. In these applications, expensive GPUs need to be shared by multiple users to be used to best effect. The GPUs do not need to be available to everyone immediately; the users just need to be able to submit their workloads to the cluster and get the results (e.g., a trained AI model) within a reasonable time frame.
Kubernetes' inherent scheduling and bin-packing capabilities – by which it can make efficient use of the compute and memory resources provided by its worker nodes – make it well suited to running such short-lived tasks. However, the Kubernetes implementations of many widely used AI/ML frameworks require the near-simultaneous creation of multiple pods for a job to complete successfully, as the examples in this article will show. If multiple jobs are submitted together, you end up with a situation in which some of the pods from one job are running, some pods from another job are also running, but the cluster has insufficient resources remaining to schedule the remaining pods from either job. As a result, both jobs are indefinitely stuck and the cluster is in a state of "scheduling deadlock."
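The deadlock described above can be sketched with a toy model. This is not Volcano's or kube-scheduler's actual code; the job names and the 6-CPU capacity are illustrative numbers chosen to mirror the test cluster, and each "job" is a gang of three 2-CPU pods that must all run for the job to make progress:

```python
def greedy_schedule(capacity_cpus, jobs):
    """Place pods one at a time, interleaving jobs, as a per-pod
    scheduler with a mixed pending queue might.
    jobs: dict of job name -> (pods_needed, cpus_per_pod)."""
    free = capacity_cpus
    placed = {name: 0 for name in jobs}
    progress = True
    while progress:
        progress = False
        for name, (pods_needed, cpu) in jobs.items():
            if placed[name] < pods_needed and free >= cpu:
                placed[name] += 1
                free -= cpu
                progress = True
    return placed, free

def gang_schedule(capacity_cpus, jobs):
    """Admit a job only if all of its pods fit at once
    (the essence of gang scheduling)."""
    free = capacity_cpus
    placed = {name: 0 for name in jobs}
    for name, (pods_needed, cpu) in jobs.items():
        if free >= pods_needed * cpu:
            placed[name] = pods_needed
            free -= pods_needed * cpu
    return placed, free

jobs = {"spark": (3, 2), "pytorch": (3, 2)}  # 3 pods x 2 CPUs each
print(greedy_schedule(6, jobs))  # ({'spark': 2, 'pytorch': 1}, 0)
print(gang_schedule(6, jobs))    # ({'spark': 3, 'pytorch': 0}, 0)
```

With greedy per-pod placement, both jobs get some pods, the cluster is full, and neither job can ever complete: a deadlock. With gang admission, the first job runs whole and finishes, after which the second can be admitted.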
Scheduling a Kubernetes Pod
A pod is the basic unit of a Kubernetes workload (deriving from a "pod" of containers sharing a common namespace on a node). Its specification includes an array of container images, commands to execute in each, storage and configuration details, and quantities of CPUs, GPUs, and memory for each container. The Kubernetes scheduler (kube-scheduler), one of the control plane's essential services, decides which node, if any, a pod is scheduled on (or bound to). In Kubernetes terminology, a pod is said to be "scheduled" when it is assigned to a worker node and can be created; despite what the word suggests, it does not mean the pod runs at a particular time. (Kubernetes includes a CronJob object for that purpose.)
The Kubernetes scheduler operates on a queue of pods that are still in the Pending state (i.e., all the pods that ought to be scheduled but haven't yet been). That queue is populated by kube-controller-manager, another essential control plane service, whose job is to continually reconcile the difference between the declared state of the cluster's objects and what is actually running on the cluster's nodes.
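The reconcile pattern can be reduced to a minimal sketch (the pod names and data shapes below are invented for illustration): compare what is declared with what is observed, and the difference is the set of pods that must be created, which then enter the scheduler's Pending queue.

```python
def reconcile(declared_pods, running_pods):
    """Return the names of pods that are declared but not yet
    running -- the work the controller hands to the scheduler."""
    return set(declared_pods) - set(running_pods)

declared = {"wordcount-1", "wordcount-2", "wordcount-3"}
running = {"wordcount-1"}
print(sorted(reconcile(declared, running)))  # ['wordcount-2', 'wordcount-3']
```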
The scheduler follows a two-step process: (1) It filters the list of nodes to exclude those on which the pod cannot run. For example, if the pod specification requests two CPUs for a container, it excludes all nodes that don't have two CPUs' worth of free capacity; if the pod requests a GPU, it excludes all nodes that don't have an unused GPU; and if some nodes are tainted to prevent the scheduling of pods matching a certain specification, kube-scheduler excludes those nodes too. (2) It ranks the remaining nodes according to several criteria (e.g., the node with the most available capacity) and binds the pod to the highest scoring node. Figure 1 illustrates this process, starting with a user submitting a workload to the cluster's API server (kube-apiserver).
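The filter-then-score decision can be sketched in a few lines. The node and pod dictionaries below are invented shapes, and the scoring rule (most free CPU wins) is just one of the criteria the real scheduler combines:

```python
def schedule(pod, nodes):
    """Toy two-step scheduler: filter out infeasible nodes, then
    bind to the highest scoring one (here, most free CPUs)."""
    feasible = [n for n in nodes
                if n["free_cpus"] >= pod["cpus"]
                and n["free_gpus"] >= pod["gpus"]
                and not n.get("tainted")]
    if not feasible:
        return None  # the pod stays Pending
    return max(feasible, key=lambda n: n["free_cpus"])["name"]

nodes = [
    {"name": "node-a", "free_cpus": 1, "free_gpus": 0},
    {"name": "node-b", "free_cpus": 4, "free_gpus": 1},
    {"name": "node-c", "free_cpus": 6, "free_gpus": 0, "tainted": True},
]
print(schedule({"cpus": 2, "gpus": 0}, nodes))  # node-b
```

node-a is filtered out for lack of CPU, node-c for its taint, and the pod is bound to node-b. A pod requesting a GPU on these nodes would land on node-b as well, since it is the only untainted node with a free GPU.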
Limited Resources
In the case of simple batch jobs that each create a single pod, the standard Kubernetes scheduler queues the pending pods and schedules them in turn as resources are released by the preceding, completed pods. I'll demonstrate this by submitting (as simultaneously as possible) 20 two-CPU pods to the test cluster described in the GitHub repo [1]. Each pod runs a Python word count program on the war-and-peace.txt file mentioned in the "Deploying MinIO" boxout.
Deploying MinIO
The examples in this article use a large text file (war-and-peace.txt from Project Gutenberg [2]) saved in a local Amazon Simple Storage Service (S3)-compatible object storage (Figure 2). This type of object storage, being easy to deploy and access, is a great choice for distributed storage in hybrid environments, where jobs are initiated from a local workstation but executed in a public cloud environment. You can install MinIO with a single helm command:
helm -n minio install objectstore oci://registry-1.docker.io/bitnamicharts/minio --create-namespace
Follow the instructions printed out by the successful Helm deployment to see the MinIO admin password, and then use the kubectl port-forward command in the same output to access the web UI of your object store. Create a bucket path called data/events and upload the war-and-peace.txt file [2] to it. Export the MinIO admin password in your local environment as MINIO_PASSWORD for later use by the example pods. As a more reliable alternative to kubectl port-forward, change the MinIO service type to make it accessible from outside the cluster via a load balancer, then watch the service to discover the public IP address assigned to your MinIO service:
kubectl -n minio patch svc objectstore-minio -p '{"spec": {"type": "LoadBalancer"}}'
watch kubectl -n minio get svc -o wide
The test cluster has only six CPUs available for workloads, so it obviously won't be able to schedule all the pods at the time they are submitted to the API server. The example creates pods directly for the sake of clarity; however, in a production scenario, you'd probably use a Kubernetes Job object, which can be viewed as an overlay to the Pod object that lets you specify how many attempts will be made to run the specified pod to completion and how many completed pod attempts to retain so that their logs can be inspected.
The following commands download the .yaml file, submit it to the cluster 20 times (with envsubst to create a unique name for each pod and populate the MinIO password into the pod spec), and watch the progress of the resulting pod statuses:
curl -LO https://raw.githubusercontent.com/datadoc24/admin-volcano-article/refs/heads/main/single-pod-wordcount-job/wordcount-pod.yaml
for n in $(seq 1 20); do i=$n envsubst < wordcount-pod.yaml | kubectl apply -f -; done
watch kubectl get po
In the resulting output shown in Figure 3, you can see each pod transitioning one by one from Pending to Running to Completed. If you want to double-check that your pods really did some useful work, the kubectl logs command shows that each pod (even in its Completed state) retains the output of the word count program.
