Engineering

CPU Inferencing: Why not?

Let me start by saying that Nscale is a hyperscaler engineered for AI. Yes, we provide bare metal and virtualised GPU nodes, but that doesn't mean we ignore the potential of CPUs.

CPUs are still a key component in modern computing systems. With the transition to AI-specific systems, GPUs are undoubtedly the way to go for training and inference. At Nscale, we’re asking an important question: Could the use of CPUs evolve alongside this trend?

  1. Could we use CPUs for small model inferencing? 
  2. Is it more cost/energy efficient than GPU inferencing?

Let us answer these questions for you.

Can the CPU handle small model inferencing?

With AI becoming integral to every stack, the question arises whether CPUs can handle small-scale inferencing.

Our test case

While GPUs are the go-to when it comes to high-demand scenarios, CPUs also have their time to shine in specific applications where scale and speed aren’t as critical. For instance, Retrieval-Augmented Generation (RAG) systems can benefit from the CPU's lower power consumption. This is why we decided to compare CPU/GPU inference for RAG. 

Technical details of our set-up:

  • RAG system built with LangChain
  • Embedding model: bge-m3 on CPU
  • LLM: served with Llamafile, so the same model could be run on both the GPU and the CPU (a minimal sketch of this set-up follows the list)
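For the curious, here is a minimal sketch of what such a pipeline can look like. It is illustrative rather than our exact harness: the corpus path, chunking parameters, and model name are assumptions, and we rely on Llamafile's OpenAI-compatible API at its default address of http://localhost:8080/v1.

```python
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Embeddings run on the CPU: bge-m3 via sentence-transformers.
embeddings = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={"device": "cpu"},
)

# Hypothetical corpus: load, chunk, and index a local text file.
docs = TextLoader("docs/knowledge_base.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)
store = FAISS.from_documents(chunks, embeddings)

# Llamafile exposes an OpenAI-compatible API (default http://localhost:8080/v1).
# The model name is illustrative; Llamafile serves whatever model it was launched with.
llm = ChatOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
    model="mistral-7b-instruct",
    temperature=0,
)

question = "What does the knowledge base say about CPU inference?"
context = "\n\n".join(d.page_content for d in store.similarity_search(question, k=3))
print(llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}").content)
```

Because Llamafile speaks the OpenAI API, switching between a GPU-backed and a CPU-backed server requires no client-side changes.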

When it comes to small models, CPU inference can deliver impressive results. To back this up, we ran tests comparing GPU and CPU performance, with human reading speed as a reference.

  • Human reading speed: < 10 tokens/s
  • Mistral 7B Instruct model:
    • GPU average: 62.8 tokens/s
    • CPU average: 17.65 tokens/s
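As an aside, here is one simple way to measure generation speed in tokens per second. This is a sketch, not the exact benchmark harness we used; it assumes a Llamafile server on localhost:8080 and the openai Python client.

```python
import time
from openai import OpenAI

# Llamafile serves an OpenAI-compatible API; URL and model name are assumptions.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

t0 = time.perf_counter()
tokens = 0
stream = client.chat.completions.create(
    model="mistral-7b-instruct",
    messages=[{"role": "user", "content": "Explain RAG in three sentences."}],
    stream=True,
)
for chunk in stream:
    # Each streamed chunk carries roughly one token of generated text.
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1

elapsed = time.perf_counter() - t0
print(f"{tokens} tokens in {elapsed:.1f} s -> {tokens / elapsed:.1f} tokens/s")
```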

Looking at the graph below, the majority of small models can benefit from CPU inference. Take TinyLlama as an example: its CPU inference speed is high enough that it could even allow for scaling out. However, we wouldn't recommend running larger models on CPU.

Generation speed comparison

CPU vs GPU inference energy consumption

Energy consumption is a critical aspect that AI companies need to take into account, and when we compare CPUs and GPUs there is a significant difference in their efficiency on AI inference tasks.

GPUs are the OGs: designed for parallel processing, they are the go-to for large-scale model inference. That performance comes at a high energy cost, though; their specialised hardware and high computational throughput draw considerably more power. Part of the inefficiency is also on us as developers: implementations often fail to fully exploit the GPU architecture. By optimising our code to leverage it, we can reduce energy consumption and improve overall performance (FlashAttention is a good example).
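As a small illustration of what such an optimisation buys, the sketch below (assuming a CUDA- or ROCm-enabled PyTorch install; the shapes are arbitrary) compares naive attention with PyTorch's built-in scaled_dot_product_attention, which can dispatch to a fused FlashAttention-style kernel:

```python
import torch
import torch.nn.functional as F

# Arbitrary illustrative shapes: batch 8, 16 heads, 1024 tokens, head dim 64.
# "cuda" also addresses AMD GPUs on ROCm builds of PyTorch.
q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Naive attention materialises the full 1024 x 1024 score matrix per head...
scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
naive_out = torch.softmax(scores, dim=-1) @ v

# ...while the fused kernel computes the same result without ever storing it,
# cutting memory traffic and, with it, energy per token.
fused_out = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive_out, fused_out, atol=1e-2))
```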

CPUs, on the other hand, are slower but more energy efficient. That brings us to the next graph.

CPU and GPU inference energy consumption

This graph speaks for itself: the difference between CPU consumption and GPU consumption is quite significant. Pair that with the price difference between a system with GPUs and one without, and in some cases you'll start to think "Why not?".
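For those who want to reproduce this kind of measurement on the CPU side, the sketch below estimates the energy of a single run. It assumes a Linux host that exposes a RAPL package-energy counter through the powercap interface; the exact zone path varies per machine, and counter wrap-around is ignored here.

```python
import time

# Assumed RAPL package-energy counter exposed via Linux powercap;
# the zone path varies per machine (check /sys/class/powercap/).
RAPL_ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj() -> int:
    """Read the cumulative package energy counter in microjoules."""
    with open(RAPL_ENERGY) as f:
        return int(f.read())

def measure_cpu_energy(fn, *args, **kwargs):
    """Run fn and report wall-clock time, joules, and average watts."""
    e0, t0 = read_energy_uj(), time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    joules = (read_energy_uj() - e0) / 1e6  # wrap-around ignored in this sketch
    print(f"{elapsed:.1f} s, {joules:.1f} J, {joules / elapsed:.1f} W average")
    return result

# Usage: wrap any inference call, e.g. measure_cpu_energy(run_inference)
# where run_inference is your own (hypothetical) benchmark function.
```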

That brings us to the next section.

CPU-GPU System vs CPU-only System

To understand the true potential and efficiency of CPU inference, let’s compare two systems:

The CPU-GPU system:
  • Barebone: Supermicro GPU A+ Server 4125GS-TNRT - 4U 8 GPU Server - 24x 2.5" Hot-swap - Dual 10-Gigabit Ethernet - 2000W (3+1) Redundant
  • Processor: 2 x AMD EPYC™ 9684X Processor 96-core 2.55GHz 1152MB Cache (400W)
  • Memory: 24 x 128GB PC5-38400 4800MHz DDR5 ECC RDIMM
  • Hard Drive: 3.2TB Micron 7450 MAX Series U.3 PCIe 4.0 x4 NVMe Solid State Drive (15mm)
  • GPU Accelerator: 2 x AMD Instinct™ MI210 Accelerator - 64GB HBM2e - PCIe 4.0 x16 - Passive Cooling
  • Network Adapter: Broadcom NetXtreme 25-Gigabit Ethernet Network Adapter P225P - PCIe 3.0 x8 - 2x SFP28

The CPU-only system:
  • Barebone: Supermicro Hyper Server 1125HS-TNR - 1U - Up to 12x NVMe/SATA/SAS - 2x M.2 - 1x AIOM - 1200W (1+1) Redundant
  • Processor: 2 x AMD EPYC™ 9684X Processor 96-core 2.55GHz 1152MB Cache (400W)
  • Memory: 24 x 128GB PC5-38400 4800MHz DDR5 ECC RDIMM
  • Hard Drive: 3.2TB Micron 7450 MAX Series U.3 PCIe 4.0 x4 NVMe Solid State Drive (15mm)
  • Network Adapter: Broadcom NetXtreme 25-Gigabit Ethernet Network Adapter P225P - PCIe 3.0 x8 - 2x SFP28

Both systems are identical apart from the chassis and the two MI210 accelerators: the processors, memory, storage, and network adapters all match.

The table below highlights the key differences between the two systems:

From the table, it is clear that a CPU-only system can provide significant cost savings while still delivering adequate performance for the right workloads.

Conclusion

As the AI landscape evolves, CPUs remain an important consideration for certain inferencing tasks. While GPUs excel in large-scale model training and inference, CPUs can offer cost-effective and energy-efficient solutions for smaller models and less intensive applications.

At Nscale, we focus on using the right tools for the right tasks.

Kian Mohadjerin
Head of AI

With an extensive background in ML, Python, and HIP/C++, Kian Mohadjerin leads AI innovation at Nscale. He has developed deep expertise through hands-on development and leadership across various AI initiatives. His team is currently pioneering 'Paiton', a proprietary library for optimising AI models on AMD GPUs, showcasing Nscale's commitment to pushing the boundaries of AMD technology. Their work focuses not only on advancing technical capabilities but also on applying these advancements to real-world applications, significantly enhancing model efficiency and performance on AMD GPUs with the CDNA architecture.

Cédric Bulteel
Junior Python Developer

A biomedical engineer specialising in AI and data science, Cédric wrote his thesis on using Transformers for classification problems and has significant experience combining his knowledge of biomedicine with current AI technologies.
