In recent years, we have witnessed how Artificial Intelligence (AI) has rapidly advanced and impacted our day-to-day lives, from converting speech to text and generating images to identifying pedestrians in self-driving cars. Tools such as ChatGPT and Sora have shown us the tangible benefits of AI in a user-friendly format. However, behind these easily accessible applications lies a complex process of developing and training AI models before they are ready for production.
Enter the practice of MLOps…
One of the main components of the MLOps process (bringing AI models into production) is Inference - the process where trained machine learning (ML) models make predictions or decisions based on new input data. In other words, inference is an ML model’s moment of truth, its time to shine. It puts the ML model to the test on how well it can apply the information learned during the training phase and make a prediction on unseen data.
Will it be able to flag the email as spam accurately? Or will it throw it in your inbox and frustrate you even more in the morning? That will depend on how well the model was trained.
What is AI Inference?
Before inference, AI models learn from large, labelled datasets, picking up the patterns and relationships they need to make predictions. For example, in vehicle recognition, the AI model would have analysed many types of vehicles and their distinguishing features, such as shape or size, before it could accurately output the make and model of a vehicle it identifies.
The machine learning life cycle includes two main phases, and although both are integral to the ML life cycle, they serve different purposes:
- ML model training - During this phase, the ML model is fed labelled data to find patterns and make predictions. After the training phase, you then test and validate the model by running it on test data. The training and test data are split from the dataset at a ratio such as 70/30 (70% training data, 30% test data) or 80/20 - see the sketch after this list.
- ML model inference - This is the second phase, in which the trained ML model applies its knowledge to live unseen data to make predictions.
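As a rough illustration of the training phase and the split described above, here is a minimal sketch using scikit-learn. The dataset and model choice are purely illustrative:

```python
# Minimal sketch of the training phase: a 70/30 split, model fitting and
# validation on held-out test data. The dataset and model are illustrative.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # any labelled dataset

# 70% of the data for training, 30% held back for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                            # training phase
print("test accuracy:", model.score(X_test, y_test))   # validation on test data
```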
Model training and inference can be thought of as the difference between learning and putting what you learned into practice. The ML model training phase is typically computationally intensive and time-consuming - similar to when you’re learning something new. The ML model inference phase is much faster and more efficient - showing that your learning has paid off.
Another way of seeing the AI inference stage is to call it the “action phase” of AI, where models apply their learned knowledge to real-world scenarios.
How Does AI Inference Work?
A simplified view of the inference pipeline (a minimal sketch follows the list):
- User Input/Request: A user or system sends data (e.g. a prompt or an image) to the model to perform a task or prediction.
- Host System: The ML model’s host system (in the cloud or on-prem) receives the input data and feeds it into the ML model, which is run by an inference engine.
- Output/Prediction: The ML model applies the model weights to unseen input data to generate the predicted output.
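Put together, the pipeline can be as simple as a small web service sitting in front of a trained model. The sketch below uses Flask and scikit-learn; the /predict route, the "features" field and the model itself are illustrative assumptions, not any particular product’s API:

```python
# Illustrative sketch of the request -> host system -> prediction flow.
# The /predict route, "features" field and model are assumptions.
from flask import Flask, jsonify, request
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

app = Flask(__name__)

# Training happens ahead of time; only the trained model is served.
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

@app.route("/predict", methods=["POST"])
def predict():
    # The host system receives the user's input data...
    features = request.get_json()["features"]  # e.g. [5.1, 3.5, 1.4, 0.2]
    # ...and the model applies its learned weights to produce an output.
    prediction = model.predict([features])[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(port=8000)
```

A client would then POST, for example, {"features": [5.1, 3.5, 1.4, 0.2]} to /predict and receive the model’s prediction back as JSON.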
What is an Inference Engine?
An ML inference engine is a specialised computing environment designed specifically to host and run ML model inference. The inference engine works by accepting input data from the data source, passing it into a trained ML model and returning the inference output. These engines can run on-premises, in data centres or in cloud environments.
Just like everything else in the AI world, inference engines require ML models to be exported in a specific format that the engine can understand. These days, a few common formats are in use - for example, Hugging Face Safetensors. A good overview of the different formats can be found here.
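As a hedged illustration, the sketch below loads a model exported to ONNX (another widely used interchange format) into the ONNX Runtime inference engine and runs a single prediction. The file name and input shape are assumptions about whatever model you have exported:

```python
# Sketch: loading an exported model into an inference engine (ONNX Runtime)
# and running one prediction. "model.onnx" and the (1, 4) input shape are
# placeholders for your own exported model.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")       # the engine hosts the model

input_name = session.get_inputs()[0].name          # discover the model's input name
sample = np.random.rand(1, 4).astype(np.float32)   # stand-in for real input data

outputs = session.run(None, {input_name: sample})  # run inference
print(outputs[0])
```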
Common AI Inference Metrics
The most common AI inference metrics are listed below (a rough measurement sketch follows the list):
- Latency: This refers to the speed at which an ML model can complete inference. Latency depends on how much compute resource is available for a specific task, how fine-tuned the model is and how optimised it is for that particular task. It is typically measured in milliseconds (ms), and in an ideal world it is as close to zero as possible.
- Throughput: Throughput is the number of predictions an ML model can produce within a specific period without failing. It is typically measured in predictions per second (PPS), requests per second (RPS) or, for large language models, tokens per second, and ideally it is as high as possible.
- Accuracy: A well-known performance metric in machine learning, accuracy in inference doesn’t measure the quality of the inference process itself so much as the quality of your model’s outputs. It is typically reported as a value between 0 (highly inaccurate) and 1 (highly accurate), or equivalently as a percentage.
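Here is a rough sketch of how you might estimate these three metrics for a simple scikit-learn model. A real benchmark would use warm-up runs, larger workloads and production hardware; the numbers from this snippet are only indicative:

```python
# Rough estimates of latency, throughput and accuracy for a small model.
# Real benchmarks use warm-up runs, larger workloads and production hardware.
import time

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Latency: time for a single prediction, in milliseconds
start = time.perf_counter()
model.predict(X_test[:1])
latency_ms = (time.perf_counter() - start) * 1000

# Throughput: predictions completed per second over many single requests
n_requests = 500
start = time.perf_counter()
for i in range(n_requests):
    model.predict(X_test[i % len(X_test): i % len(X_test) + 1])
throughput_pps = n_requests / (time.perf_counter() - start)

# Accuracy: quality of the outputs on held-out data, between 0 and 1
accuracy = model.score(X_test, y_test)

print(f"latency: {latency_ms:.1f} ms | throughput: {throughput_pps:.0f} PPS | accuracy: {accuracy:.2f}")
```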
Have a read of our updated inference benchmarks: AMD MI300X GPUs with GEMM tuning using Gradlib and textprompt.
ML Inference Challenges
Up to this point, you were probably thinking this all sounds great. Unfortunately, every good thing has its drawbacks, and ML model inference comes with three primary challenges.
Infrastructure Cost
ML model inference is designed to be fast and efficient. However, fast and efficient comes at a price: inference is a computationally intensive task that requires GPUs and CPUs running in data centres or cloud environments.
The cost of inference is therefore a key factor to consider in your ML model operations. To minimise the cost per inference, the inference workload must fully utilise the underlying hardware infrastructure - the back-of-the-envelope sketch below shows why.
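The hourly price and throughput below are purely illustrative placeholders, but the calculation shows how utilisation drives the cost per inference:

```python
# Back-of-the-envelope cost-per-inference calculation.
# Both input numbers are illustrative placeholders, not real pricing.
gpu_cost_per_hour = 2.50   # assumed hourly cost of a GPU instance (USD)
throughput_rps = 100       # assumed sustained requests per second

inferences_per_hour = throughput_rps * 3600
cost_per_1k = gpu_cost_per_hour / inferences_per_hour * 1000

print(f"${cost_per_1k:.4f} per 1,000 inferences")
# Doubling the sustained throughput on the same hardware halves this figure,
# which is why full hardware utilisation matters.
```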
Latency
Latency is the time delay between when an AI system receives an input and when it generates the output. In simple terms, it measures how long a single request takes to travel from input to prediction.
Minimal latency is a common requirement for inference systems. Mission-critical applications such as autonomous navigation, for example, often require real-time inference and therefore very low latency. Other use cases, such as offline data analytics, do not require immediate responses.
Compatibility
Different inference engines implement the operations a model needs in different ways, which can increase or decrease the model's inference speed. These operations are implemented for specific GPUs, mostly AMD or Nvidia, so the efficiency of an inference engine depends on the underlying GPU and on which GPU the engine has been optimised for - the sketch below shows one way to check what is available.
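As a hedged example, ONNX Runtime exposes which hardware backends (execution providers) its installed build can use; which ones appear depends entirely on your build, GPU and drivers, and "model.onnx" is again a placeholder:

```python
# Sketch: checking which hardware backends (execution providers) an
# ONNX Runtime build supports, and preferring a GPU backend when present.
# The providers listed depend on the installed build, GPU and drivers.
import onnxruntime as ort

available = ort.get_available_providers()
print("available providers:", available)

# Prefer CUDA (Nvidia) or ROCm (AMD) if present, otherwise fall back to CPU
preferred = [p for p in ("CUDAExecutionProvider", "ROCMExecutionProvider") if p in available]
providers = preferred + ["CPUExecutionProvider"]

# "model.onnx" is a placeholder for your exported model
session = ort.InferenceSession("model.onnx", providers=providers)
print("session is using:", session.get_providers())
```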
Why is AI inference important?
Without AI inference, we wouldn’t be able to benefit from the development of AI. It brings decades of AI discovery into the real world - from gathering and cleaning data to training and optimising the model - to provide accurate, tangible outcomes. The faster, more efficient and more accurate an AI inference system is, the more impact it has.
Here are some points on why AI inference is important:
- Real-time decision-making: Inference gives AI systems the ability to respond to new data in real time.
- Scalability: A well-trained model backed by an efficient inference setup can be deployed at scale across various platforms and applications.
- Cost-effective: Inference ensures you get the most out of your AI model without breaking the bank. Fast, low-latency predictions can reduce the cost per inference, making your model more accessible and useful to different industries.
- Accelerates innovation: AI is continuously learning and becoming part of the real world, so AI inference constantly drives innovation across industries and pushes boundaries.
Wrapping it up
In a nutshell, AI inference is where the development and ability of machine learning are truly showcased - taking what was learned during the training phase and applying it to new, unseen data. This is the true worth of AI models and where businesses, as well as end-users, see the value of intelligent systems.