
Application-aware batch scheduler
Eruption
The basic Kubernetes scheduler – kube-scheduler – does a great job of bin-packing pods into nodes, but it can result in a scheduling deadlock for the more complicated multipod jobs created by analytics and artificial intelligence and machine learning (AI/ML) frameworks such as Apache Spark and PyTorch. That means expensive cluster resources, such as GPUs, sit idle, unavailable to any workload.
Volcano scheduler is a Cloud Native Computing Foundation (CNCF) project that introduces the Queue and PodGroup custom resources to enable gang scheduling (i.e., the simultaneous scheduling of multiple related objects) and facilitate more efficient use of the cluster. Complex jobs run more reliably, and data engineers become more productive.
In this article, I demonstrate default Kubernetes scheduling behavior with short-lived single-pod jobs, show how multipod jobs from Apache Spark and PyTorch can trigger a scheduling deadlock, and use Volcano to run the same jobs smoothly and predictably. The Git repository [1] gives full details of how to create a test Kubernetes cluster on DigitalOcean and run all the examples.
Kubernetes for Analytics and ML
Kubernetes, often considered the operating system of the cloud, is usually thought of in terms of distributed microservices – in other words, client-server applications with an indefinite lifespan, decomposed into smaller service components (e.g., database, business logic, web front end) for containerized deployment in a way that makes each part redundant, scalable, and easy to upgrade. In that use case, the Kubernetes cluster was most likely designed and scaled with the application's resource requirements in mind, but Kubernetes also lends itself to the "batch" use case – that is, running resource-intensive jobs of finite duration, particularly in the fields of Big Data analytics, high-performance computing, and training of AI models. In these applications, expensive GPUs need to be shared by multiple users to be used to best effect. The GPUs do not need to be available to everyone immediately; the users just need to be able to submit their workloads to the cluster and get the results (e.g., a trained AI model) within a reasonable time frame.
Kubernetes' inherent scheduling and bin-packing capabilities – by which it can make efficient use of the compute and memory resources provided by its worker nodes – make it well suited to running such short-lived tasks. However, the Kubernetes implementations of many widely used AI/ML frameworks require the near-simultaneous creation of multiple pods for a job to complete successfully, as the examples in this article will show. If multiple jobs are submitted together, you end up with a situation in which some of the pods from one job are running, some pods from another job are also running, but the cluster has insufficient resources remaining to schedule the remaining pods from either job. As a result, both jobs are indefinitely stuck and the cluster is in a state of "scheduling deadlock."
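The deadlock described above can be sketched with a toy model. This is not Volcano's or kube-scheduler's actual code; the job names and the 6-CPU capacity are illustrative numbers chosen to mirror the test cluster, and each "job" is a gang of three 2-CPU pods that must all run for the job to make progress:

```python
def greedy_schedule(capacity_cpus, jobs):
    """Place pods one at a time, interleaving jobs, as a per-pod
    scheduler with a mixed pending queue might.
    jobs: dict of job name -> (pods_needed, cpus_per_pod)."""
    free = capacity_cpus
    placed = {name: 0 for name in jobs}
    progress = True
    while progress:
        progress = False
        for name, (pods_needed, cpu) in jobs.items():
            if placed[name] < pods_needed and free >= cpu:
                placed[name] += 1
                free -= cpu
                progress = True
    return placed, free

def gang_schedule(capacity_cpus, jobs):
    """Admit a job only if all of its pods fit at once
    (the essence of gang scheduling)."""
    free = capacity_cpus
    placed = {name: 0 for name in jobs}
    for name, (pods_needed, cpu) in jobs.items():
        if free >= pods_needed * cpu:
            placed[name] = pods_needed
            free -= pods_needed * cpu
    return placed, free

jobs = {"spark": (3, 2), "pytorch": (3, 2)}  # 3 pods x 2 CPUs each
print(greedy_schedule(6, jobs))  # ({'spark': 2, 'pytorch': 1}, 0)
print(gang_schedule(6, jobs))    # ({'spark': 3, 'pytorch': 0}, 0)
```

With greedy per-pod placement, both jobs get some pods, the cluster is full, and neither job can ever complete: a deadlock. With gang admission, the first job runs whole and finishes, after which the second can be admitted.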
Scheduling a Kubernetes Pod
A pod is the basic unit of a Kubernetes workload (deriving from a "pod" of containers sharing a common namespace on a node). Its specification includes an array of container images, commands to execute in each, storage and configuration details, and quantities of CPUs, GPUs, and memory for each container. The Kubernetes scheduler (kube-scheduler), one of the control plane's essential services, decides which node, if any, a pod is scheduled on (or bound to). In Kubernetes terminology, a pod is said to be "scheduled" when it is assigned to a worker node and can be created; despite what the word suggests, it does not mean the pod runs at a particular time. (Kubernetes includes a CronJob object for that purpose.)
The Kubernetes scheduler operates on a queue of pods that are still in the Pending state (i.e., all the pods that ought to be scheduled but haven't yet been). That queue is populated by kube-controller-manager, another essential control plane service, whose job is to continually reconcile the difference between the declared state of the cluster's objects and what is actually running on the cluster's nodes.
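The reconcile pattern can be reduced to a minimal sketch (the pod names and data shapes below are invented for illustration): compare what is declared with what is observed, and the difference is the set of pods that must be created, which then enter the scheduler's Pending queue.

```python
def reconcile(declared_pods, running_pods):
    """Return the names of pods that are declared but not yet
    running -- the work the controller hands to the scheduler."""
    return set(declared_pods) - set(running_pods)

declared = {"wordcount-1", "wordcount-2", "wordcount-3"}
running = {"wordcount-1"}
print(sorted(reconcile(declared, running)))  # ['wordcount-2', 'wordcount-3']
```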
The scheduler follows a two-step process: (1) It filters the list of nodes to exclude those on which the pod cannot run. For example, if the pod specification requests two CPUs for a container, it excludes all nodes that don't have two CPUs' worth of free capacity; if the pod requests a GPU, it excludes all nodes that don't have an unused GPU; and if some nodes are tainted to prevent the scheduling of pods matching a certain specification, kube-scheduler excludes those nodes too. (2) It ranks the remaining nodes according to several criteria (e.g., the node with the most available capacity) and binds the pod to the highest scoring node. Figure 1 illustrates this process, starting with a user submitting a workload to the cluster's API server (kube-apiserver).
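The filter-then-score decision can be sketched in a few lines. The node and pod dictionaries below are invented shapes, and the scoring rule (most free CPU wins) is just one of the criteria the real scheduler combines:

```python
def schedule(pod, nodes):
    """Toy two-step scheduler: filter out infeasible nodes, then
    bind to the highest scoring one (here, most free CPUs)."""
    feasible = [n for n in nodes
                if n["free_cpus"] >= pod["cpus"]
                and n["free_gpus"] >= pod["gpus"]
                and not n.get("tainted")]
    if not feasible:
        return None  # the pod stays Pending
    return max(feasible, key=lambda n: n["free_cpus"])["name"]

nodes = [
    {"name": "node-a", "free_cpus": 1, "free_gpus": 0},
    {"name": "node-b", "free_cpus": 4, "free_gpus": 1},
    {"name": "node-c", "free_cpus": 6, "free_gpus": 0, "tainted": True},
]
print(schedule({"cpus": 2, "gpus": 0}, nodes))  # node-b
```

node-a is filtered out for lack of CPU, node-c for its taint, and the pod is bound to node-b. A pod requesting a GPU on these nodes would land on node-b as well, since it is the only untainted node with a free GPU.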
Limited Resources
In the case of simple batch jobs that each create a single pod, the standard Kubernetes scheduler queues the pending pods and schedules them in turn as resources are released by the preceding, completed pods. I'll demonstrate this by submitting (as simultaneously as possible) 20 two-CPU pods to the test cluster described in the GitHub repo [1]. Each pod runs a Python word count program on the war-and-peace.txt file mentioned in the "Deploying MinIO" boxout.
Deploying MinIO
The examples in this article use a large text file (war-and-peace.txt from Project Gutenberg [2]) saved in a local Amazon Simple Storage Service (S3)-compatible object storage (Figure 2). This type of object storage, being easy to deploy and access, is a great choice for distributed storage in hybrid environments, where jobs are initiated from a local workstation but executed in a public cloud environment. You can install MinIO with a single helm command:
helm -n minio install objectstore oci://registry-1.docker.io/bitnamicharts/minio --create-namespace
Follow the instructions printed out by the successful Helm deployment to see the MinIO admin password, and then use the kubectl port-forward command in the same output to access the web UI of your object store. Create a bucket path called data/events and upload the war-and-peace.txt file [2] to it. Export the MinIO admin password in your local environment as MINIO_PASSWORD for later use by the example pods. As a more reliable alternative to kubectl port-forward, change the MinIO service type to make it accessible from outside the cluster via a load balancer, then watch the service to discover the public IP address assigned to your MinIO service:
kubectl -n minio patch svc objectstore-minio -p '{"spec": {"type": "LoadBalancer"}}'
watch kubectl -n minio get svc -o wide
The test cluster has only six CPUs available for workloads, so it obviously won't be able to schedule all the pods at the time they are submitted to the API server. The example creates pods directly for the sake of clarity; however, in a production scenario, you'd probably use a Kubernetes Job object, which can be viewed as an overlay to the Pod object that lets you specify how many attempts will be made to run the specified pod to completion and how many completed pod attempts to retain so that their logs can be inspected.
The following commands download the .yaml file, submit it to the cluster 20 times (with envsubst to create a unique name for each pod and populate the MinIO password into the pod spec), and watch the progress of the resulting pod statuses:
curl -LO https://raw.githubusercontent.com/datadoc24/admin-volcano-article/refs/heads/main/single-pod-wordcount-job/wordcount-pod.yaml
for n in $(seq 1 20); do i=$n envsubst < wordcount-pod.yaml | kubectl apply -f -; done
watch kubectl get po
In the resulting output shown in Figure 3, you can see each pod transitioning one by one from Pending to Running to Completed. If you want to double-check that your pods really did some useful work, the kubectl logs command shows that each pod (even in its Completed state) retains the output of the word count program.
