Application-aware batch scheduler

Eruption

Article from ADMIN 86/2025
Volcano optimizes high-performance workloads on Kubernetes to avoid deadlocks.

The basic Kubernetes scheduler – kube-scheduler – does a great job of bin-packing pods onto nodes, but it can end up in a scheduling deadlock with the more complicated multipod jobs created by analytics and artificial intelligence and machine learning (AI/ML) frameworks such as Apache Spark and PyTorch. The result is expensive cluster resources, such as GPUs, sitting idle and unavailable to any workload.

Volcano scheduler is a Cloud Native Computing Foundation (CNCF) project that introduces the Queue and PodGroup custom resources to enable gang scheduling (i.e., the simultaneous scheduling of multiple related objects) and facilitate more efficient use of the cluster. Complex jobs run more reliably, and data engineers become more productive.
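To give an idea of what these resources look like, here is a minimal sketch of a Queue and a PodGroup. The names and figures are illustrative only and are not taken from the examples in the Git repository [1]; check the Volcano documentation for the exact fields supported by your release:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: test-queue
spec:
  weight: 1                  # relative share of cluster capacity among queues
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: wordcount-group
spec:
  queue: test-queue
  minMember: 3               # gang scheduling: start all three pods or none at all

Pods opt in to Volcano by setting schedulerName: volcano in their spec and carrying an annotation that names their pod group; only when minMember pods of a group can all be placed does Volcano bind any of them.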

In this article, I demonstrate default Kubernetes scheduling behavior with the use of short-lived single-pod jobs, show how multipod jobs from Apache Spark and PyTorch can trigger a scheduling deadlock, and use Volcano to run the same jobs smoothly and predictably. The Git repository [1] gives full details of how to create a test Kubernetes cluster on DigitalOcean and run all the examples.

Kubernetes for Analytics and ML

Kubernetes, often considered the operating system of the cloud, is usually thought of in terms of distributed microservices – that is, client-server applications with an indefinite lifespan, decomposed into smaller service components (e.g., database, business logic, web front end) for containerized deployment in a way that makes each part redundant, scalable, and easy to upgrade. In that use case, the Kubernetes cluster was most likely designed and scaled with the application's resource requirements in mind. However, Kubernetes also lends itself to the "batch" use case – that is, to running resource-intensive jobs of finite duration, particularly in the fields of Big Data analytics, high-performance computing, and training of AI models. In these applications, expensive GPUs need to be shared by multiple users to make the best use of them. The GPUs do not need to be available to everyone immediately; the users just need to be able to submit their workloads to the cluster and get the results (e.g., a trained AI model) within a reasonable time frame.

Kubernetes' inherent scheduling and bin-packing capabilities – by which it makes efficient use of the compute and memory resources provided by its worker nodes – make it well suited to running such short-lived tasks. However, the Kubernetes implementations of many widely used AI/ML frameworks require the near-simultaneous creation of multiple pods for a job to complete successfully, as the examples in this article will show. If multiple jobs are submitted together, you can end up with a situation in which some of the pods from one job are running, some pods from another job are also running, but the cluster has insufficient resources left to schedule the rest of the pods from either job. As a result, both jobs are stuck indefinitely and the cluster is in a state of "scheduling deadlock."

Scheduling a Kubernetes Pod

A pod is the basic unit of a Kubernetes workload (the name deriving from a "pod" of containers that share a common namespace on a node). Its specification includes an array of container images, commands to execute in each, storage and configuration details, and the quantities of CPUs, GPUs, and memory for each container. The Kubernetes scheduler (kube-scheduler), one of the control plane's essential services, decides which node, if any, a pod is scheduled on (or bound to). In Kubernetes terminology, a pod is "scheduled" when it is assigned to a worker node and can be created; despite what the word suggests, it does not mean the pod runs at a particular time. (Kubernetes includes a separate CronJob object for that purpose.)
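As a minimal illustration of such a specification (the image, command, and quantities here are placeholders, not the article's example manifest), a pod requesting two CPUs and a GPU might look roughly like this:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: worker
    image: python:3.12-slim                  # placeholder image
    command: ["python", "-c", "print('hello')"]
    resources:
      requests:
        cpu: "2"                             # scheduler only considers nodes with 2 free CPUs
        memory: 1Gi
      limits:
        nvidia.com/gpu: 1                    # GPUs are requested as an extended resource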

The Kubernetes scheduler operates on a queue of pods that are still in the Pending state (i.e., all the pods that ought to be scheduled, but haven't yet been). That queue is populated by kube-controller-manager, another essential control plane service, whose job is to reconcile continually the difference between the declarative state of the cluster's objects and what is actually running on the cluster's nodes in real life.
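You can peek at that queue indirectly by listing the pods that are still Pending across all namespaces:

kubectl get pods --all-namespaces --field-selector=status.phase=Pending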

The scheduler follows a two-step process: (1) It filters the list of nodes to exclude those on which the pod cannot run. For example, if the pod specification requests two CPUs for a container, the scheduler excludes all nodes that don't have two CPUs' worth of free capacity; if the pod requests a GPU, it excludes all nodes without an unused GPU; and if some nodes are tainted to prevent the scheduling of pods matching a certain specification, kube-scheduler excludes those nodes, too. (2) It ranks the remaining nodes according to several criteria (e.g., which node has the most available capacity) and binds the pod to the highest scoring node. Figure 1 illustrates this process, starting with a user submitting a workload to the cluster's API server (kube-apiserver).

Figure 1: Overview of how kube-scheduler filters and scores nodes for pod binding.
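If the filtering step leaves no suitable node, the pod simply stays Pending; the Events section of kubectl describe then tells you why the scheduler could not place it (for example, insufficient CPU on every node):

kubectl describe pod <pod-name>      # the Events section at the bottom explains scheduling failures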

Limited Resources

In the case of simple batch jobs that each create a single pod, the standard Kubernetes scheduler queues the pending pods and schedules them in turn as resources are released by the preceding, completed pods. I'll demonstrate this by submitting (as simultaneously as possible) 20 two-CPU pods to the test cluster described in the GitHub repo [1]. Each pod runs a Python word count program on the war-and-peace.txt file mentioned in the "Deploying MinIO" boxout.

Deploying MinIO

The examples in this article use a large text file (war-and-peace.txt from Project Gutenberg [2]) stored in a local Amazon Simple Storage Service (S3)-compatible object store (Figure 2). This type of object storage, being easy to deploy and access, is a great choice for distributed storage in hybrid environments, where jobs are initiated from a local workstation but executed in a public cloud environment. You can install MinIO with a single helm command:

helm -n minio install objectstore oci://registry-1.docker.io/bitnamicharts/minio --create-namespace
Figure 2: Uploading the text dataset to the MinIO web user interface (UI).

Follow the instructions printed out by the successful Helm deployment to see the MinIO admin password, and then use the

kubectl port-forward

command in the same output to access the web UI of your object store. Create a bucket path called data/events and upload the war-and-peace.txt file [2] to it. Export the MinIO admin password in your local environment as MINIO_PASSWORD for later use by the example pods. As a more reliable alternative to kubectl port-forward, change the MinIO service type to make it accessible from outside the cluster through a load balancer and watch the service to discover the public IP address that is assigned to your MinIO service:

kubectl -n minio patch svc objectstore-minio -p '{"spec": {"type": "LoadBalancer"}}'
watch kubectl -n minio get svc -o wide
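As a sketch of the password export mentioned above – assuming the Bitnami chart has stored the admin credentials in a secret named objectstore-minio under the key root-password (the Helm output from your deployment gives the authoritative names) – you could run:

export MINIO_PASSWORD=$(kubectl -n minio get secret objectstore-minio -o jsonpath='{.data.root-password}' | base64 -d)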

The test cluster has only six CPUs available for workloads, so it obviously won't be able to schedule all the pods at the time they are submitted to the API server. The example creates pods directly for the sake of clarity; in a production scenario, you'd probably use a Kubernetes Job object, which can be viewed as a wrapper around the Pod object: it lets you specify how many attempts should be made to get the specified pod to run to completion, and it retains the pods from past attempts so that their logs can be inspected.
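For comparison, here is a hedged sketch of the same word count wrapped in a Job (the image, command, and retry settings are illustrative only; the article's examples create bare pods, and the real manifests live in the repo [1]):

apiVersion: batch/v1
kind: Job
metadata:
  name: wordcount-job
spec:
  backoffLimit: 3                    # number of retries before the Job is marked failed
  ttlSecondsAfterFinished: 3600      # keep the finished Job and its pods for an hour so logs can be read
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: wordcount
        image: python:3.12-slim                # placeholder image
        command: ["python", "wordcount.py"]    # placeholder command
        resources:
          requests:
            cpu: "2"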

The following commands download the .yaml file, submit it to the cluster 20 times (with envsubst to create a unique name for each pod and populate the MinIO password into the pod spec), and watch the progress of the resulting pod statuses:

curl -LO https://raw.githubusercontent.com/datadoc24/admin-volcano-article/refs/heads/main/single-pod-wordcount-job/wordcount-pod.yaml
for n in $(seq 1 20);do i=$n envsubst < wordcount-pod.yaml | kubectl apply -f -;done
watch kubectl get po

In the resulting output shown in Figure 3, you can see each pod transitioning one by one from Pending to Running to Completed. If you want to double-check that your pods really did some useful work, use the kubectl logs command to confirm that each pod (even in its Completed state) retains the output of the word count program in its logs.

Figure 3: The Kubernetes scheduler schedules single-pod jobs in turn, and no cluster resources are locked.
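As a quick spot check – assuming the loop above created pods named wordcount-1 through wordcount-20 (the actual naming comes from the manifest in the repo [1]) – you could print the tail of every pod's log:

for n in $(seq 1 20); do kubectl logs wordcount-$n | tail -3; done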
