Introduction
I'm sure you've heard the term "LLM fine-tuning" before and have probably even tried fine-tuning a small model yourself! With the recent release of the powerful DeepSeek-R1 model, there's a growing interest in customising advanced open-source models for specific applications.
DeepSeek-R1, derived from the V3 foundational model, is an open-source reasoning model designed to tackle tasks requiring logical inference, mathematical problem-solving, and real-time decision-making. Its reasoning capabilities set it apart from traditional language models, positioning it alongside other advanced systems like OpenAI's o1.
However, fine-tuning large models like DeepSeek-R1, which has 671 billion parameters, presents unique challenges, especially when the model's size exceeds the capacity of a single GPU or even a single node. Efficiently distributing the model across multiple GPUs and nodes becomes essential to manage computational and memory constraints.
In this blog, we're going big! We'll explore strategies for distributing your model across multiple GPUs and nodes, enabling you to run a distributed fine-tuning job effectively. So, stay with me! First, though, let's clarify some definitions.
What is AI Model Fine-Tuning?
AI model fine-tuning involves adjusting a model that has already been pre-trained on a large dataset so that it fits your data better and performs a specific task. It is an excellent technique for getting the most out of your data.
However, to understand fine-tuning, we need to understand transfer learning. Transfer learning is a machine learning technique in which a model trained on one task is repurposed for a second related task.
By making small adjustments to a pre-trained model's parameters, you can transform a generalist AI model into a specialist.
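To make the idea concrete, here is a tiny, purely illustrative PyTorch sketch of the transfer-learning pattern: the pre-trained part is frozen and only a small task-specific head is trained. The "backbone" below is just a stand-in module, not a real pre-trained model.
import torch
from torch import nn
# Stand-in for a pre-trained model; in practice this would be loaded from a checkpoint.
backbone = nn.Sequential(nn.Linear(768, 768), nn.ReLU())
for param in backbone.parameters():
    param.requires_grad = False  # keep the general, pre-trained knowledge frozen
# Small task-specific head: the only part whose parameters we adjust.
head = nn.Linear(768, 3)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
# One toy training step on random data, just to show the flow.
x, y = torch.randn(4, 768), torch.randint(0, 3, (4,))
loss = nn.functional.cross_entropy(head(backbone(x)), y)
loss.backward()
optimizer.step()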
Why you need to fine-tune your AI model
Fine-tuning can significantly improve a model's performance, from accuracy to precision. Tailoring your AI model's knowledge to your unique needs results in more relevant and accurate outputs for your task.
Efficiency
Training a model from scratch can be time-consuming and resource-intensive, especially when you can access open-source models online. Fine-tuning allows you to leverage the knowledge already embedded in the pre-trained model and optimise it to make it more specific to your needs, improving efficiency and saving time.
Data efficiency
When working with pre-trained models, you already have the advantage of not having to train your model with new data from scratch. Fine-tuning allows you to use less data to optimise your model more effectively without having to search for scarce or expensive data.
Performance
We’re all looking for new ways to improve our model's performance, and fine-tuning is one of the best techniques to help you achieve this. If the pre-trained model was already trained on a similar task or domain, your fine-tuning process can significantly improve its performance.
When should I fine-tune my AI model?
Before you go straight into fine-tuning your AI model, it is important to know when to fine-tune it. Here are a few use cases when fine-tuning your model would be appropriate:
Limited data
Data is food for models; the more they have, the more they appreciate it. However, if you’re working with a small dataset, you may encounter overfitting issues. This occurs when an algorithm fits too closely or even precisely to its training data, resulting in it performing poorly on unseen data. Fine-tuning your model on limited data allows you to leverage the strengths of your pre-trained model and adapt these to your specific task to achieve accurate outputs without the need for new data.
Domain specialisation
The purpose of pre-trained models is to provide you with a stepping stone in your AI initiatives. Pre-trained models perform well on their related tasks. However, they may not perform as well on your specific task at hand. This is where fine-tuning comes in to bridge the gap and create a model that performs well in a specialised task.
What is distributed AI model training, and why do we need it?
In recent years, AI models have grown significantly in size and complexity, making it nearly impossible to even fine-tune these high-performing models on a single GPU or server. Distributed AI model training addresses this challenge by splitting the model and data across multiple GPUs or servers, enabling faster training times and the capacity to handle large datasets. Here are three examples of distributed training methods:
Distributed Data Parallelism (DDP)
Distributed Data Parallelism splits the data into batches and distributes them across multiple GPUs. Each GPU trains on its own batch independently and then synchronises gradients with the other GPUs. In DDP, each GPU keeps a full copy of the model's parameters.
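For reference, this is roughly what DDP looks like in plain PyTorch when launched with torchrun; a minimal sketch with a toy model, not LLaMA-Factory's internals:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
# Every rank holds a full copy of the model; only the data differs per rank.
model = torch.nn.Linear(1024, 1024).to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])
out = ddp_model(torch.randn(8, 1024, device=local_rank))
out.sum().backward()  # gradients are all-reduced across ranks automatically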
Fully Sharded Data Parallelism (FSDP)
Fully Sharded Data Parallelism shards the model's parameters, gradients, and optimiser states across multiple GPUs, in addition to splitting the data. Each GPU therefore stores only a fraction of the model's state, ultimately reducing memory consumption per GPU.
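And the equivalent idea with PyTorch's built-in FSDP wrapper, again as a toy sketch launched with torchrun:
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
# Each rank stores only a shard of the parameters, gradients and optimizer states;
# full parameters are gathered on the fly for the forward and backward passes.
model = torch.nn.Linear(1024, 1024).to(local_rank)
fsdp_model = FSDP(model)
out = fsdp_model(torch.randn(8, 1024, device=local_rank))
out.sum().backward()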
DeepSpeed
DeepSpeed is a distributed training library developed by Microsoft that combines DDP with other techniques, such as ZeRO (Zero Redundancy Optimiser), to help larger models train efficiently. ZeRO comes in three stages (a minimal usage sketch follows the list):
- ZeRO 1: Only the optimizer states are partitioned across the processes, so that each process updates only its partition.
- ZeRO 2: The gradients for updating the model weights are also partitioned, such that each process retains only the gradients corresponding to its portion of the optimizer states.
- ZeRO 3: The model parameters are also partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes. In addition, ZeRO-3 includes the infinity offload engine to form ZeRO-Infinity, which can offload to both CPU and NVMe memory for huge memory savings.
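As a rough illustration of how a ZeRO stage is selected, here is a minimal DeepSpeed sketch in Python, meant to be run via the deepspeed launcher or torchrun; the toy model, batch size, and optimizer settings are made up for the example, and the full JSON config we actually use with LLaMA-Factory appears later in this post.
import torch
import deepspeed
# The "stage" value (1, 2 or 3) controls how much state is partitioned across processes.
ds_config = {
    "train_batch_size": 8,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 3},
}
model = torch.nn.Linear(1024, 1024)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)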
Each distributed AI training method has advantages and is suitable for different scenarios and requirements, including model size, dataset size, and available hardware resources.
Now, let’s start fine-tuning a 70B-parameter model using all the above distribution methods!
Fine-tuning DeepSeek-R1-Distill-Llama-70B
NOTE: The purpose here is not to optimise the training loss of the fine-tuning, but to show the process of setting up and running a distributed fine-tuning job.
Pre-requisites
The following examples have been run on 1, 2, and 4 nodes, each with 8x 64GB GPUs.
Installing the libraries and tools
In this blog, we’ll use LLaMA-Factory for our fine-tuning process, so let’s first see how we set everything up!
- Make sure you have the latest drivers for your underlying hardware. This includes GPU drivers, as well as network drivers for the multi-node use cases we’re going to use later.
- Install python3, python3-venv and python3-dev:
TIP: It’s always useful to have the dev package installed as well, in case you need to compile some libraries for your environment.
# In Ubuntu
sudo apt update
sudo apt install python3 python3-dev python3-venv
# In RHEL/Rocky
sudo dnf install python3 python3-devel python3-pip
- Create and activate a python venv for all your libraries:
TIP: Do this in a directory that’s shared across all nodes
cd /data # This is a NFS shared directory in my cluster
mkdir venvs && cd venvs
python3 -m venv llama-factory
- Download LLaMA-Factory and install it in your venv:
cd /data
source /data/venvs/llama-factory/bin/activate
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
pip install deepspeed # Needed for one of our distributed methods
pip install bitsandbytes # Needed for model quantization
pip install huggingface-hub # Needed to download models and datasets
- Dataset: LLaMA-Factory ships with a number of pre-configured datasets and also lets you bring your own. In all the following examples, we use the Identity and the Alpaca English datasets.
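Before launching anything heavy, it's worth a quick sanity check that PyTorch can actually see the GPUs on each node (this also works for ROCm builds, which expose the same torch.cuda API). Run this inside the activated venv:
import torch
# Expect True and 8 on each of our nodes; anything else points to a driver or install issue.
print(torch.cuda.is_available(), torch.cuda.device_count())
print(torch.cuda.get_device_name(0))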
Our first try: DDP fine-tuning on a single GPU node!
To start our fine-tuning, we need a yaml file with our model and training details. Here are the contents of deepseekr1-D-llama70b_lora_sft-1node.yaml, which is the one I’m using for my single-node DDP training (note that LLaMA-Factory uses DDP by default). I created this file under /data/LLaMA-Factory/examples/train_nscale/.
### model
model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
### dataset
dataset: identity,alpaca_en_demo
template: deepseek3
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: /data/saves/deepseek-r1-D-llama70b/lora-1node/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
Note some important values in our file above:
- We grab the model name from Hugging Face
- We use LoRA supervised fine-tuning (sft)
- We use the Identity and the Alpaca English datasets to fine-tune our model
The output of our fine-tuning will be saved under the /data/saves/deepseek-r1-D-llama70b/lora-1node/sft directory.
Time to log in to huggingface and run our fine-tuning command! How excited are you?!
huggingface-cli login
# follow the prompt to login
llamafactory-cli train \
examples/train_nscale/deepseekr1-D-llama70b_lora_sft-1node.yaml
The last command above will download the model to your HF_HOME directory if it’s not there already. This process takes a while since our model takes around 132GB of disk space!
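If you’d rather pre-fetch the weights into shared storage before launching the training job, the huggingface-hub package we installed earlier can do that for you. A small, optional sketch:
from huggingface_hub import snapshot_download
# Downloads all model files into the HF cache (HF_HOME) so the training job can start immediately.
snapshot_download(repo_id="deepseek-ai/DeepSeek-R1-Distill-Llama-70B")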
Now, depending on the node you’re running this on and the GPUs you have access to, you’ve probably hit the same issue as I did and seen the following error:
[rank0]: torch.OutOfMemoryError: HIP/CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 288.00 MiB is free. Of the allocated memory 62.96 GiB is allocated by PyTorch, and 815.50 KiB is reserved by PyTorch but unallocated.
That can only mean one thing: the model doesn’t fit in the GPU memory we have available! With DDP, every GPU has to hold a full copy of the model, and 70 billion parameters in bf16 already come to roughly 140GB of weights alone, well beyond the 64GB each of our GPUs offers. It's not the greatest news, I know, but don’t worry; we have ways to get around this!
QLoRA: Reducing the memory footprint
In the example above, we used the LoRA technique to try to fine-tune our model. This technique reduces the GPU memory footprint compared to full fine-tuning, but it wasn’t enough! So, let’s quantize our model and use QLoRA instead!
LoRA (Low-Rank Adaptation)
- Uses low-rank matrices to adapt the weights of a pre-trained model to a new task
- Efficient and effective, especially for small datasets and few-shot learning tasks
- Reduces the number of trainable parameters, leading to faster training and lower memory requirements
- Preserves the pre-trained model's knowledge while adapting to new tasks
QLoRA (Quantized LoRA)
- Extends LoRA by quantizing the frozen base-model weights (typically to 4 bits) while training the low-rank adapters in higher precision, giving a much more compact representation
- Particularly useful for on-device fine-tuning and deployment of large language models
- Achieves similar performance to LoRA with a significantly reduced memory footprint
- Enables fine-tuning on consumer hardware with limited resources
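LLaMA-Factory handles all of this through its yaml config, but under the hood it builds on the Hugging Face peft and bitsandbytes libraries. Here is a rough standalone sketch of the same idea; note that the lora_alpha value and the target-module choice below are illustrative assumptions, not taken from our config:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
# Quantize the frozen base weights to 4-bit while computing in bf16.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    quantization_config=bnb_config,
    device_map="auto",
)
# Attach small trainable LoRA adapters (rank 8, matching our yaml's lora_rank).
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules="all-linear", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of parameters is trainable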
To retry our fine-tuning process with QLoRA, we need to add two lines to our yaml file. Here is the new file, which I’ve named deepseekr1-D-llama70b_qlora-4bit_sft-1node.yaml:
### model
model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
quantization_bit: 4
quantization_method: bitsandbytes
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
### dataset
dataset: identity,alpaca_en_demo
template: deepseek3
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: /data/saves/deepseek-r1-D-llama70b/qlora-4bit-1node/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
The two lines we added are these:
quantization_bit: 4
quantization_method: bitsandbytes
These instruct LLaMA-Factory to run a 4-bit quantized QLoRA fine-tuning on our next attempt.
So, let’s try again!
llamafactory-cli train \
examples/train_nscale/deepseekr1-D-llama70b_qlora-4bit_sft-1node.yaml
And boom! Your fine-tuning should complete successfully this time! If you were monitoring your GPU memory on both runs, you’d have noticed that QLoRA needed noticeably less memory than our first LoRA attempt.
In the final lines of the output, you should get a nice, friendly message along with some training metrics, which are also saved in your output dir. For example:
Training completed. Do not forget to share your model on huggingface.co/models =)
***** train metrics *****
epoch = 2.8759
total_flos = 195489829GF
train_loss = 1.0757
train_runtime = 0:16:23.42
train_samples_per_second = 3.328
train_steps_per_second = 0.052
cat /data/saves/deepseek-r1-D-llama70b/qlora-4bit-1node/sft/all_results.json
{
"epoch": 2.875912408759124,
"total_flos": 2.099056055639081e+17,
"train_loss": 1.075687511294496,
"train_runtime": 983.4243,
"train_samples_per_second": 3.328,
"train_steps_per_second": 0.052
}
DeepSpeed ZeRO-3: Distributing the model across nodes
Another way to successfully fine-tune our model while still using LoRA is to shard its parameters and distribute them, not just across multiple GPUs but across multiple nodes as well! In this run, we’re going to use DeepSpeed ZeRO-3, which we explained earlier on.
To do this, we first need a DeepSpeed JSON config file, which I’ve named ds_z3_config.json. FYI, LLaMA-Factory provides this file as well, in its examples/deepspeed directory. Here it is:
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
Then, we add this config file to our fine-tuning yaml file, which, in this run, I’ve named deepseekr1-D-llama70b_lora_sft-2node_ds.yaml. Here it is:
### model
model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json
### dataset
dataset: identity,alpaca_en_demo
template: deepseek3
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: /data/saves/deepseek-r1-D-llama70b/lora-2node_ds/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
Note this new line in our yaml file, which references our deepspeed json config file:
deepspeed: examples/deepspeed/ds_z3_config.json
Now that we have our config files, we need to configure some environment variables, which will help us with the node-to-node communication!
Remember, we’re using two nodes in this example, so we need to ensure that these nodes have their networks properly configured, can talk to each other and ideally have RDMA configured, either with RoCE or with IB.
In my case, each node has 4x 200Gb Ethernet interfaces and we’ve configured RoCE, so here are the environment variables we need to set on both nodes:
export NCCL_IB_GID_INDEX=3
export NCCL_NET_GDR_LEVEL=3
export NCCL_SOCKET_IFNAME=enp37s0np0,enp12s0np0,enp139s0np0,enp180s0np0
export NCCL_IB_HCA=bnxt_re0:1,bnxt_re1:1,bnxt_re2:1,bnxt_re3:1
source /data/venvs/llama-factory/bin/activate && cd /data/LLaMA-Factory/
NOTE: You’ll need to adjust the NCCL_SOCKET_IFNAME and NCCL_IB_HCA environment variables to match your interface names and RDMA device names accordingly.
Now, you need to run the following command on both nodes, changing only the NODE_RANK variable.
On the first node:
cd /data/LLaMA-Factory
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=0 MASTER_ADDR=172.16.17.206 MASTER_PORT=29500 llamafactory-cli train examples/train_nscale/deepseekr1-D-llama70b_lora_sft-2node_ds.yaml
On the second node:
cd /data/LLaMA-Factory
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NPROC_PER_NODE=8 FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=1 MASTER_ADDR=172.16.17.206 MASTER_PORT=29500 llamafactory-cli train examples/train_nscale/deepseekr1-D-llama70b_lora_sft-2node_ds.yaml
Here is some explanation of the environment variables we’re using here:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
- Specifies which GPUs are visible to the current process. In this case, all 8 GPUs (0-7) are visible.
NPROC_PER_NODE=8
- Specifies the number of processes to launch per node. In this case, 8 processes will be launched on each node.
FORCE_TORCHRUN=1
- Forces LLaMA-Factory to launch the training job with torchrun, which is what enables distributed training here.
NNODES=2
- Specifies the number of nodes to use for distributed training. In this case, 2 nodes will be used.
NODE_RANK
- Specifies the rank of the current node: 0 on the first (master) node and 1 on the second.
MASTER_ADDR=172.16.17.206
- Specifies the IP address of the master node. In this case, the master node has the IP address 172.16.17.206.
MASTER_PORT=29500
- Specifies the port on which the master node is listening for connections. In this case, the master node is listening on port 29500.
That gives us another successful fine-tuning run, this time across two nodes! Here are the results in my case:
cat /data/saves/deepseek-r1-D-llama70b/lora-2node_ds/sft/all_results.json
{
"epoch": 2.927536231884058,
"total_flos": 153117065740288.0,
"train_loss": 1.0262513289264603,
"train_runtime": 1077.0014,
"train_samples_per_second": 3.039,
"train_steps_per_second": 0.095
}
FSDP + QLoRA combination
The other two-node distribution technique we’ll try is FSDP, and we’ll pair it with QLoRA as well! For this, we need an FSDP config file (read by accelerate), which I’ve named fsdp_config-2nodes.yaml and placed under examples/accelerate/. Here it is:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_forward_prefetch: false
fsdp_cpu_ram_efficient_loading: true
fsdp_offload_params: false # offload may affect training speed
fsdp_sharding_strategy: FULL_SHARD
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_sync_module_states: true
fsdp_use_orig_params: true
main_training_function: main
mixed_precision: fp16 # or bf16
num_machines: 2 # the number of nodes
num_processes: 16 # the number of GPUs in all nodes
main_process_ip: 172.16.17.206
main_process_port: 29600
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Now, we also need our fine-tuning yaml definition file, which, in this example, I’ve named deepseekr1-D-llama70b_qlora-4bit_sft-2node_fsdp.yaml. Here it is:
### model
model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
quantization_bit: 4
quantization_method: bitsandbytes
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
### dataset
dataset: identity,alpaca_en_demo
template: deepseek3
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: /data/saves/deepseek-r1-D-llama70b/qlora-4bit-2node_fsdp/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
In this case, we’ll use the accelerate library to launch our distributed job, to show another way of doing this. Here are the commands to run on each node, along with the same environment variables we used before.
On the first node:
export NCCL_IB_GID_INDEX=3
export NCCL_NET_GDR_LEVEL=3
export NCCL_SOCKET_IFNAME=enp37s0np0,enp12s0np0,enp139s0np0,enp180s0np0
export NCCL_IB_HCA=bnxt_re0:1,bnxt_re1:1,bnxt_re2:1,bnxt_re3:1
source /data/venvs/llama-factory/bin/activate && cd /data/LLaMA-Factory/
cd /data/LLaMA-Factory
accelerate-launch --config_file examples/accelerate/fsdp_config-2nodes.yaml --multi_gpu --gpu_ids 0,1,2,3,4,5,6,7 --machine_rank 0 src/train.py examples/train_nscale/deepseekr1-D-llama70b_qlora-4bit_sft-2node_fsdp.yaml
On the second node:
export NCCL_IB_GID_INDEX=3
export NCCL_NET_GDR_LEVEL=3
export NCCL_SOCKET_IFNAME=enp37s0np0,enp12s0np0,enp139s0np0,enp180s0np0
export NCCL_IB_HCA=bnxt_re0:1,bnxt_re1:1,bnxt_re2:1,bnxt_re3:1
source /data/venvs/llama-factory/bin/activate && cd /data/LLaMA-Factory/
cd /data/LLaMA-Factory
accelerate-launch --config_file examples/accelerate/fsdp_config-2nodes.yaml --multi_gpu --gpu_ids 0,1,2,3,4,5,6,7 --machine_rank 1 src/train.py examples/train_nscale/deepseekr1-D-llama70b_qlora-4bit_sft-2node_fsdp.yaml
And that should lead to our final successful run! Let’s check the results as well:
cat /data/saves/deepseek-r1-D-llama70b/qlora-4bit-2node_fsdp/sft/all_results.json
{
"epoch": 2.927536231884058,
"total_flos": 2.10647323867349e+17,
"train_loss": 1.2693326274553935,
"train_runtime": 499.7013,
"train_samples_per_second": 6.55,
"train_steps_per_second": 0.048
}
Compared to the single-node QLoRA run, the runtime has roughly been cut in half and the train samples per second have doubled!
I’ve also run this across 4 nodes and got equally great results:
cat /data/saves/deepseek-r1-D-llama70b/qlora-4bit-4node_fsdp/sft/all_results.json
{
"epoch": 2.4571428571428573,
"total_flos": 1.8300294296030413e+17,
"train_loss": 1.3717289964358013,
"train_runtime": 229.628,
"train_samples_per_second": 14.253,
"train_steps_per_second": 0.052
}
Again, we see near-linear scaling: compared to the two-node run, the runtime is halved once more and the samples per second are doubled!
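If you want to pull these numbers together programmatically, the all_results.json files we’ve been cat-ing make that easy. Here is a small, optional helper; the paths match the runs above:
import json
runs = {
    "QLoRA, 1 node (DDP)": "/data/saves/deepseek-r1-D-llama70b/qlora-4bit-1node/sft/all_results.json",
    "QLoRA, 2 nodes (FSDP)": "/data/saves/deepseek-r1-D-llama70b/qlora-4bit-2node_fsdp/sft/all_results.json",
    "QLoRA, 4 nodes (FSDP)": "/data/saves/deepseek-r1-D-llama70b/qlora-4bit-4node_fsdp/sft/all_results.json",
}
baseline = None
for name, path in runs.items():
    with open(path) as f:
        metrics = json.load(f)
    baseline = baseline or metrics["train_runtime"]  # first entry is the baseline run
    print(f"{name}: {metrics['train_runtime']:.0f}s, "
          f"{metrics['train_samples_per_second']:.2f} samples/s, "
          f"speedup x{baseline / metrics['train_runtime']:.2f}")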
Results table
Here is a table with all the results we have so far:
| Run | Nodes x GPUs | Train runtime (s) | Samples/s |
| --- | --- | --- | --- |
| LoRA, DDP | 1 x 8 | Out of memory | - |
| QLoRA, DDP | 1 x 8 | 983.4 | 3.33 |
| LoRA, DeepSpeed ZeRO-3 | 2 x 8 | 1077.0 | 3.04 |
| QLoRA, FSDP | 2 x 8 | 499.7 | 6.55 |
| QLoRA, FSDP | 4 x 8 | 229.6 | 14.25 |
Conclusion
We’ve learned what fine-tuning is, why and when we should use it, and how to do it in detail, using the DeepSeek-R1-Distill-Llama-70B model and LLaMA-Factory as our main tools!
We explained how to overcome memory constraints using quantization (QLoRA), as well as two different multi-GPU and multi-node distribution methods, DeepSpeed ZeRO-3 and FSDP! The scaling results are great; we’ll test even more on larger models in the future!
Fine-tuning has become an essential technique in AI. It enables developers to build high-performing models with limited data and resources for a wide range of tasks. By leveraging the power of pre-trained models, fine-tuning democratizes access to AI and accelerates the development of innovative applications across various domains.
Next Steps
If you need to fine-tune your models with your datasets, don’t hesitate to contact us by clicking here. We’ll happily provide you with a cluster of GPU nodes so you can follow the steps we discussed above.
Finally, after you’ve fine-tuned your models, I’m sure you’d like them deployed so you can start using them immediately! Nscale provides a fast and affordable serverless inference platform based on auto-scaling GPU compute!
Join our waitlist by clicking here.