Migrating LLM Training Workloads from Nvidia to AMD

Published on July 2, 2024
Author: Amin Sabet

AMD showcased its leadership roadmap at Computex 2024, featuring the new AMD Instinct MI325X and 5th Gen EPYC processors. AMD GPUs such as the MI250 are gaining popularity for their strong performance and cost-effectiveness. However, many users are still unfamiliar with their capabilities for fine-tuning large language model (LLM) workloads.

In this blog, we will demonstrate how easy it is to move LLM training from Nvidia to AMD GPUs, with no changes needed to the core training logic. To showcase the ease of migration, we provide a practical example using the widely available AMD GPU instances at Nscale.

Introduction:

The following demonstrates training a GPT-2 model using DeepSpeed and Flash Attention on a single GPU. DeepSpeed, an optimisation library, enhances the efficiency of training large-scale models, providing features such as mixed-precision training, gradient checkpointing, and ZeRO (Zero Redundancy Optimiser). These features collectively reduce memory consumption and improve computational efficiency, enabling faster and more resource-efficient training of larger models.
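
Below, DeepSpeed is driven through the Hugging Face Trainer, but it helps to see its core API in isolation. The following is a minimal sketch, assuming a toy linear model and an inline config similar in spirit to the ds_config.json used later; it is illustrative only, and should be launched with the DeepSpeed launcher so the distributed environment is set up for you.

import torch
import deepspeed

# Any torch.nn.Module will do; a small linear classifier keeps the sketch self-contained.
model = torch.nn.Linear(128, 2)

# deepspeed.initialize wraps the model and builds the optimiser described in the config.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# One training step: the engine handles loss scaling, gradient accumulation and ZeRO.
inputs = torch.randn(8, 128, device=engine.device, dtype=torch.float16)
labels = torch.randint(0, 2, (8,), device=engine.device)
loss = torch.nn.functional.cross_entropy(engine(inputs), labels)
engine.backward(loss)
engine.step()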

Flash Attention optimises the speed and memory usage of attention mechanisms in your model. By replacing standard attention with Flash Attention, you can achieve faster training times and reduce memory overhead when dealing with large models and datasets.
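
To get a feel for what the library provides, here is a minimal sketch that calls flash-attn's fused kernel directly on query, key and value tensors. The shapes and dtypes follow flash-attn's convention (batch, sequence length, heads, head dimension, in fp16 or bf16 on the GPU); in the training script below, we let the framework handle the integration rather than calling the kernel by hand.

import torch
from flash_attn import flash_attn_func

# flash-attn expects (batch, seqlen, nheads, headdim) tensors in fp16/bf16 on the GPU.
batch, seqlen, nheads, headdim = 2, 1024, 12, 64
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused causal attention: softmax(QK^T / sqrt(d)) V is computed without materialising
# the full attention matrix, which is where the speed and memory savings come from.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)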

Prerequisites:

Ensure you have the following hardware prerequisites:

  • Nvidia GPU: An Nvidia GPU with more than 15 GB of memory
  • AMD GPU: An AMD MI250 GPU

Training GPT-2 on an Nvidia GPU

Starting from an Nvidia Docker Image with PyTorch Installed

First, we will start a Docker container with PyTorch pre-installed for Nvidia GPUs. The Nvidia Container Toolkit needs to be installed on the host. This ensures that all necessary dependencies for GPU-accelerated training are readily available, simplifying the process of training the GPT-2 model.

sudo docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/pytorch:24.04-py3
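
Inside the container, a quick sanity check confirms that the GPU is visible to both the driver and PyTorch before going any further:

nvidia-smi
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"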

Install Dependencies

Inside the Docker container, install the required libraries:

pip install transformers datasets evaluate accelerate deepspeed mpi4py flash-attn --no-build-isolation
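
Optionally, confirm the installation with DeepSpeed's environment report, which prints the detected PyTorch and CUDA versions and which DeepSpeed ops can be built:

ds_report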

Training GPT-2 Model on a Single GPU

Save the following code in a file named train.py:

import os
import numpy as np
import torch
from datasets import load_dataset
from transformers import (
    Trainer,
    TrainingArguments,
    AutoTokenizer,
    AutoModelForCausalLM,
)

# Load the dataset
dataset = load_dataset("squad")

# Initialise the tokeniser
tokeniser = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 has no pad token, so reuse the end-of-sequence token for padding
tokeniser.pad_token = tokeniser.eos_token

# Tokenisation function
def tokenise_function(examples):
    inputs = tokeniser(examples["question"], examples["context"], truncation=True, padding="max_length", max_length=32)
    inputs["labels"] = inputs["input_ids"].copy()
    return inputs

# Prepare the training dataset
small_train_dataset = dataset["train"].select(range(1000)).map(tokenise_function, batched=True)

# Initialise the model with Flash Attention enabled.
# GPT-2 does not use torch.nn.MultiheadAttention, so patching modules by hand does not
# work; recent transformers releases can instead route attention through the installed
# flash-attn kernels via attn_implementation (fp16 or bf16 weights are required).
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)

# The Hugging Face Trainer integrates DeepSpeed automatically when the `deepspeed`
# argument is set in TrainingArguments below, so no custom Trainer subclass or manual
# call to deepspeed.initialize is needed.

# Define training arguments
training_args = TrainingArguments(
    output_dir="test_trainer",
    evaluation_strategy="no",
    report_to="none",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=8,
    fp16=True,
    learning_rate=1e-5,
    num_train_epochs=3,
    warmup_steps=1000,
    weight_decay=0.01,
    logging_dir="test_trainer/logs",
    logging_steps=10,
    save_strategy="epoch",
    save_total_limit=3,
    deepspeed='ds_config.json',  # Path to DeepSpeed config file
)

# Initialise Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
)

# Start training
trainer.train()

DeepSpeed Configuration File (ds_config.json)

Save the following DeepSpeed configuration as ds_config.json:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimiser": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 1e-05,
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimisation": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "cpu_offload": true
    },
    "activation_checkpointing": {
        "partition_activations": true,
        "cpu_checkpointing": true,
        "number_checkpoints": 4,
        "synchronise_checkpoint_boundary": false,
        "contiguous_memory_optimisation": false
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

Running the Training

To run the training, execute the following command:

deepspeed train.py 
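
If the machine has more than one GPU and you want to pin training to a single device, the DeepSpeed launcher accepts the --num_gpus and --include flags, for example:

deepspeed --num_gpus=1 train.py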

This setup ensures efficient and optimised training of the GPT-2 model on an Nvidia GPU using DeepSpeed and Flash Attention.

Training GPT-2 on an AMD MI250 GPU

Using an AMD Docker Image with ROCm and PyTorch Installed

We will start by using AMD's latest Docker image, which includes ROCm 6.1.2 and PyTorch 2.1.2. This setup ensures all necessary dependencies for GPU-accelerated training on AMD hardware are in place.

sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE -v $(pwd):/var/lib/jenkins --security-opt seccomp=unconfined docker.io/rocm/pytorch:rocm6.1.2_ubuntu20.04_py3.9_pytorch_staging
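
Once inside the container, it is worth confirming that PyTorch can see the MI250. ROCm builds of PyTorch expose the familiar torch.cuda API, and torch.version.hip reports the HIP/ROCm version:

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0), torch.version.hip)"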

Updating pip

Upgrade pip to the latest version:

pip install --upgrade pip

Installing Required Packages

Install the necessary Python libraries:

pip install deepspeed datasets numpy==1.26.4 transformers evaluate accelerate 

Installing Flash Attention with ROCm Support

To install Flash Attention with ROCm support, we cannot simply run pip install flash-attn, as the PyPI package is not compatible with AMD GPUs. Instead, we need to clone AMD's flash-attention repository and build it from source.

git clone --recursive https://github.com/ROCm/flash-attention.git
cd flash-attention
MAX_JOBS=$((`nproc` - 1)) pip install -v .
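
The build can take a while. Once it completes, a quick import check confirms that the ROCm build of flash-attn is usable from Python:

python -c "import flash_attn; print(flash_attn.__version__)"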

Running the Training

Once Flash Attention is installed, run the same training script as before, with no changes to the training logic or the DeepSpeed configuration:

deepspeed train.py

By following these steps, you can seamlessly convert your LLM training from Nvidia to AMD GPUs, leveraging the full capabilities of AMD hardware for efficient and effective model training.

Conclusion

Migrating LLM training workloads from Nvidia to AMD GPUs is a straightforward and beneficial process. With AMD’s advanced MI250 GPUs and the support of DeepSpeed and Flash Attention, you can achieve efficient, cost-effective, and high-performance training.

At Nscale, we provide rapid access to MI250X and MI300X GPUs optimised for AI and HPC workloads. Nscale's flexible AI cloud platform seamlessly integrates into existing workflows, delivering high performance with the support of AMD AI experts.

Reserve your GPU cluster now!

