
Choosing the Right Orchestration Tool for ML Workloads: Slurm vs. Kubernetes

Introduction

If you’re looking to scale your machine learning workload, you will eventually require resource orchestration. 

Orchestration is the execution of multiple automated tasks or processes, typically applied across multiple systems, to ensure deployment, configuration and management are performed efficiently. 

As environments become more complex, automating tasks improves efficiency, but scaling this automation brings its own challenges. When dealing with a web of individual tasks, you need to ensure those tasks work together, so that once one task completes, it triggers the next.

When training large models, you will face a variety of challenges, such as:

  • High-Cost GPUs: Training large models requires the right computing power. In today's market, GPUs are scarce and expensive.
  • Infrastructure Management: Running training on one cloud platform whilst performing inference on another can create technical issues and significantly increase operational overhead. 
  • Performance Optimisation: High data throughput and low latency are key to AI training; GPUs must be kept fed with data to minimise training times.
  • Hardware Failures: Distributed training is highly sensitive to hardware failures, which put training progress held in GPU memory at risk.
  • Data Lifecycle: Large datasets and the ability to present these to nodes in a cluster come with specific architectural considerations.

In this blog, we will dive into Slurm and Kubernetes, two orchestration tools that can improve your workflow when scaling machine learning workloads.

Why is Job Scheduling Important in AI and ML?

At any given moment in an HPC system, thousands of jobs can be running across thousands of nodes. Without job scheduling, tasks cannot be correctly matched to the available resources. The purpose of job schedulers is to minimise queue lengths, keep jobs running concurrently wherever possible, and optimise resource usage to maximise ROI.

Considering the cost of implementing large-scale HPC systems, job scheduling ensures that expensive hardware rarely sits idle, so overall efficiency increases and the investment pays for itself sooner.

What is Slurm?

Slurm, originally known as the Simple Linux Utility for Resource Management, is a highly scalable cluster management and job scheduling tool. Slurm primarily focuses on three key functions:

  1. Allocating access to compute nodes for users
  2. Providing a framework for starting, executing and monitoring jobs on allocated nodes
  3. Managing a queue of pending work to arbitrate contention for resources

Slurm is well-established and mature, having been popularised by the HPC industry as a tool that lets researchers express their job requirements with relative simplicity, while providing a hugely powerful and configurable set of mechanisms for carefully coordinating resource requirements where necessary.
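To give a feel for that simplicity, here is a minimal sketch of a batch script for a single-node GPU training job. The partition name and training script are illustrative placeholders; adjust them to match your cluster.

#!/bin/bash
#SBATCH --job-name=train-model
#SBATCH --partition=gpu          # hypothetical partition name
#SBATCH --nodes=1
#SBATCH --gres=gpu:4             # request four GPUs on the node
#SBATCH --time=12:00:00          # wall-clock limit for the job
#SBATCH --output=train_%j.log    # %j expands to the job ID

srun python train.py             # placeholder training script

Submitting it is a single command, sbatch train.sbatch, and Slurm queues the job until the requested resources become available.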

Architecture

The Slurm scheduler has a modular architecture which allows you to customise your deployment to suit your infrastructure. Its main component is the centralised manager (slurmctld), which monitors work and resources and can be backed by a failover copy to ensure continued operation.

Each compute node in the system runs a daemon (slurmd) which is controlled by the manager. It functions like a remote shell and provides hierarchical, fault-tolerant communication with other nodes and the manager. An optional database daemon (slurmdbd) records accounting information across one or more clusters.

[Figure: Slurm architecture]
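Users rarely interact with these daemons directly; instead, a handful of commands query them. As a quick sketch (the node name is illustrative):

sinfo                                         # node and partition state, via slurmctld
squeue                                        # pending and running jobs in the queue
scontrol show node node01                     # detailed resources of a single node
sacct --format=JobID,JobName,Elapsed,State    # job accounting, if slurmdbd is configured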

What is Kubernetes?

Kubernetes, often shortened to K8s, is an open-source container orchestration system for automating software deployment, scaling, and operations. Originating from Google, where it drew heavily on lessons from their global-scale ‘Borg’ cluster manager, it was initially designed for stateless microservices but has evolved over the last decade into a sophisticated, comprehensive, enterprise-grade mechanism for orchestrating containerised applications.

Before getting into Kubernetes, let’s quickly define a container. A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. 

Kubernetes provides a framework to run distributed systems of all shapes and sizes, taking care of automatic scaling and failover for your application, efficient resource management, comprehensive admission control, and more.
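As a minimal sketch of that self-healing and scaling in action, the Deployment below (the name and image are hypothetical) declares three replicas of a service; Kubernetes replaces any replica that fails, and the desired state can be changed with a single command.

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server                          # hypothetical name
spec:
  replicas: 3                                     # desired state: three identical Pods
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
      - name: server
        image: registry.example.com/inference:latest  # placeholder image
        ports:
        - containerPort: 8080
EOF

# Scale out by declaring a new desired state; the control plane reconciles it:
kubectl scale deployment/inference-server --replicas=10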

Finally, Kubernetes unifies access to cloud infrastructure resources and helps with the portability of deployments. All things being equal, it should be straightforward to run a workload using a cluster on one provider and then to be able to redeploy on another with little to no changes to the deployment definitions.

Kubernetes itself might seem like the ‘new kid on the block’, but having recently celebrated its 10th birthday, it can also be considered mature and feature-rich, especially given the project’s explosive growth and adoption.

Architecture

Unsurprisingly, given these design goals, Kubernetes has more responsibilities than Slurm, and its control plane is more complex.

In Kubernetes, workloads run on one or more nodes (either physical or virtual machines), with scheduling handled by a set of services which typically comprise a ‘control plane’. These control plane components work to reconcile the desired state of the system and handle aspects such as resource allocation as well as self-healing and scaling.

Each worker node hosts applications as Pods, the smallest deployable units in Kubernetes, each running one or more containers. Kubernetes typically includes its own cluster networking abstraction, giving Pods access to a shared network configured to provide the right level of flexibility as well as security controls.

[Figure: Kubernetes architecture]
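To make this concrete, here is a sketch of a Pod requesting a single GPU. It assumes the AMD GPU device plugin is installed (which exposes amd.com/gpu as a schedulable resource); the Pod name and image are illustrative.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test             # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: rocm
    image: rocm/dev-ubuntu-22.04   # illustrative ROCm image
    command: ["rocm-smi"]          # list visible GPUs, then exit
    resources:
      limits:
        amd.com/gpu: 1             # requires the AMD GPU device plugin
EOF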

Head-to-Head Comparison: Slurm vs Kubernetes

When choosing between Slurm and Kubernetes, you need to take a variety of factors into account. Although both are orchestration tools, they are designed to serve different purposes and have their individual strengths and weaknesses.

When Should I Use Slurm?

Slurm is ideal for traditional HPC workloads, particularly those requiring the Message Passing Interface (MPI), as it offers native MPI integration. This makes it a great choice for complex computational tasks requiring efficient inter-node communication. It is also highly suitable for large-scale, training-focused AI jobs demanding substantial computational resources. Slurm clusters are well suited to high-throughput jobs running known applications, as these applications are pre-installed by administrators, eliminating the delays associated with pulling them from a registry.

When more advanced scheduling is necessary, Slurm excels with its fine-grained resource control, allowing precise and efficient management of memory and GPUs. Organisations with existing Slurm knowledge and using established submission scripts will benefit from its seamless integration and a reduced learning curve, ensuring smooth and efficient job scheduling and resource management.
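As a sketch of that fine-grained control, the directives below describe a two-node distributed job down to the tasks, cores, memory and GPUs per node (all values are illustrative):

#SBATCH --nodes=2                # span two nodes
#SBATCH --ntasks-per-node=4      # one task per GPU
#SBATCH --gres=gpu:4             # four GPUs per node
#SBATCH --cpus-per-task=8        # CPU cores reserved for each task
#SBATCH --mem=256G               # memory reserved on each node

srun python train.py             # placeholder distributed training script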

Slurm clusters tend to be more static, and when there are enough jobs to fill the cluster, a vast array of policy options is available to define how jobs from different users are prioritised and queued. In contrast, Kubernetes often operates with the expectation that the cluster will scale to accommodate workloads rather than making them wait.

When Should I Use Kubernetes?

While Kubernetes can be more complex, it offers more flexibility and should be considered over Slurm when dealing with Cloud Native applications where features such as dynamic resource allocation, seamless scalability, and self-healing are required. It excels in environments where containerisation and microservices are prevalent, making it ideal for AI and machine learning workflows involving large-scale training, serving and model deployment. 

Kubernetes’ extensive ecosystem of tools and frameworks supports continuous integration and deployment, ensuring efficient updates and high availability.
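For batch-style training on Kubernetes, the natural fit is a Job, which runs Pods to completion and retries them on failure. A minimal sketch, with a hypothetical image and entrypoint, again assuming the AMD GPU device plugin:

kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: train-run                                 # hypothetical name
spec:
  backoffLimit: 2                                 # retry a failed Pod up to twice
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest  # placeholder image
        command: ["python", "train.py"]             # placeholder entrypoint
        resources:
          limits:
            amd.com/gpu: 4                        # assumes the AMD GPU device plugin
EOF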

Why Not Both?

Given the extensibility of Kubernetes, the obvious question is: why not both? This is a question Nscale is engineering an answer to with our SLONK project, which will give users the ability to extend their Kubernetes deployment with a set of custom resources that, in turn, allow full lifecycle management of Slurm clusters, including comprehensive “day 2” operational considerations.

It should also be mentioned that the Kubernetes community is doing a lot of work to bridge the gap in functionality; for workloads that require a more expressive method of managing jobs in Kubernetes, there are projects such as Volcano and Kueue.
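As a brief sketch of the Kueue approach, a standard Job is labelled with its target queue and created suspended; Kueue admits and unsuspends it once the queue has capacity. This assumes Kueue is installed and a LocalQueue named team-queue already exists.

kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: queued-train-run                  # hypothetical name
  labels:
    kueue.x-k8s.io/queue-name: team-queue # assumed pre-existing LocalQueue
spec:
  suspend: true                           # Kueue unsuspends the Job on admission
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox                    # trivial placeholder workload
        command: ["sleep", "30"]
EOF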

Conclusion

Whether it’s Kubernetes, Slurm or a blend of both, we’ve got you covered with rapid access to MI250X and MI300X GPUs optimised for AI and HPC workloads. Nscale’s flexible AI cloud platform seamlessly integrates into existing workflows, delivering high performance with the support of AMD AI experts.

Reserve your GPU cluster now and take your machine learning projects to the next level. 

Nick Jones
Head of Engineering

Nick is an experienced engineering leader with a career spanning more than two decades across a wide variety of industries and sectors. As an OpenUK and CNCF Ambassador, he’s heavily active in the Cloud Native and Open Infrastructure communities. He’s passionate about new technologies and methodologies, especially those relating to Open Source, virtualisation, orchestration, automation, and all forms of cloud computing.
