Serverless

Serverless endpoints for LLM inference

Instantly access popular Generative AI models without the need to manage infrastructure. Only pay for what you use and scale indefinitely with Nscale Serverless.

Models & Pricing

Prices for Chat, Multimodal, Language, and Code models are per 1 million tokens, covering both input and output tokens. Image model pricing is based on image size.
Get Started
| Serverless Endpoint | Type | Price |
| --- | --- | --- |
| Llama 4 Scout 17B 16E Instruct | Text Generation | $0.09 input / $0.29 output per 1M tokens |
| Qwen QwQ 32B | Text Generation | $0.18 input / $0.20 output per 1M tokens |
| Qwen 2.5 Coder 32B Instruct | Text Generation | $0.06 input / $0.20 output per 1M tokens |
| Qwen 2.5 Coder 7B Instruct | Text Generation | $0.01 input / $0.03 output per 1M tokens |
| Qwen 2.5 Coder 3B Instruct | Text Generation | $0.01 input / $0.03 output per 1M tokens |
| DeepSeek R1 671B | Text Generation | $0.80 input / $2.40 output per 1M tokens |
| DeepSeek R1 Distill Llama 70B | Text Generation | $0.75 per 1M tokens |
| DeepSeek R1 Distill Llama 8B | Text Generation | $0.05 per 1M tokens |
| DeepSeek R1 Distill Qwen 32B | Text Generation | $0.30 per 1M tokens |
| DeepSeek R1 Distill Qwen 14B | Text Generation | $0.14 per 1M tokens |
| DeepSeek R1 Distill Qwen 7B (Math) | Text Generation | $0.40 per 1M tokens |
| DeepSeek R1 Distill Qwen 1.5B (Math) | Text Generation | $0.18 per 1M tokens |
| Llama 3.3 70B Instruct | Text Generation | $0.40 per 1M tokens |
| Llama 3.1 8B Instruct | Text Generation | $0.06 per 1M tokens |
| Mixtral 8x22B Instruct v0.1 | Text Generation | $1.20 per 1M tokens |
| Flux.1 [schnell] | Text-to-Image | $0.0013 per megapixel (@ 4 steps) |
| Stable Diffusion XL 1.0 | Text-to-Image | $0.003 per megapixel (@ 20 steps) |
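As an illustration, image-model pricing can be estimated by multiplying the output resolution in megapixels by the per-megapixel rate from the table. The sketch below uses the Flux.1 [schnell] rate and assumes billing scales linearly with resolution.

```python
# Estimate the cost of one generated image at a per-megapixel rate.
# Rate taken from the pricing table ($0.0013/megapixel @ 4 steps);
# linear scaling with resolution is an assumption.
RATE_PER_MEGAPIXEL = 0.0013  # USD

def image_cost(width: int, height: int, rate: float = RATE_PER_MEGAPIXEL) -> float:
    """Return the estimated cost in USD for a single generated image."""
    megapixels = (width * height) / 1_000_000
    return megapixels * rate

# A 1024x1024 image is ~1.05 megapixels.
cost = image_cost(1024, 1024)
```

So a single 1024x1024 Flux.1 [schnell] image comes out to roughly $0.0014.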
Need dedicated infrastructure?
Contact Sales

Instant access to leading models

No idle costs, infrastructure headaches, or cold starts: get instant access to all models in the Nscale ecosystem and scale as much as you need.
Access now

Built on high-performance GPU compute

Our inference service is built on the latest AMD Instinct-series GPU accelerators. Combined with high-speed networking and fast storage, we deliver unmatched computational power for AI workloads.
Learn More
OUR HARDWARE

Access cutting-edge hardware

Run your workloads on the latest AMD and NVIDIA accelerators, available instantly or as dedicated capacity.

Get Started
AMD MI300X

Harness the power of AMD's MI300X GPUs for unparalleled compute performance and efficiency.

Contact Sales
AMD MI250X

Instant access to AMD MI250X GPUs to drive results for all your computational needs.

Contact Sales
NVIDIA H100

Experience the pinnacle of AI performance with Nvidia H100 GPUs available instantly.

Contact Sales

Performance

80% LOWER COST
More performance for less
Nscale delivers an average 80% cost saving compared to hyperscalers.
30% FASTER
On time to insights
Nscale Cloud accelerates time to insights by up to 30% thanks to its AI-optimised stack.
+40% EFFICIENCY
Improved resource utilisation
Up to 40% improvement in resource utilisation.
100%
Renewable Energy
Our data centres operate on 100% renewable power generated by hydropower dams.

Get access to a fully integrated suite of AI services and compute

Reduce costs, grow revenue, and run your AI workloads more efficiently on a fully integrated platform. Whether you're using Nscale's built-in AI/ML tools or your own, our platform is designed to simplify the journey from development to production.

Serverless · Marketplace · Training · Inference · GPU nodes

Platform capabilities include an LLM library, pre-configured software and infrastructure, job management and scheduling, container orchestration, and optimised libraries, compilers, tools, and runtimes.

Nscale's datacentres are powered by 100% renewable energy.

FAQs

What is Nscale Serverless Inference?

Nscale Serverless Inference is a fully managed platform that enables AI model inference without requiring complex infrastructure management. It provides instant access to leading Generative AI models with a simple pay-per-use pricing model.

Who is this service for?

This service is designed for developers, startups, enterprises, and research teams who want to deploy AI-powered applications quickly and cost-effectively without handling infrastructure complexities.

What AI models are available?

At launch, Nscale supports popular open-source models for text generation, image generation, and computer vision. We continuously expand our offerings based on user feedback.

How does the pricing work?

Nscale follows a pay-per-request model:
- Text models: Billed based on input and output tokens.
- Image models: Pricing depends on output image resolution.
- Vision models: Charged based on processing requirements.

New users also receive free credits to explore the platform.
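As a worked example, per-request text-generation cost follows directly from the per-million-token rates in the pricing table; the helper below assumes that models listed with a single price bill input and output tokens at the same rate.

```python
from typing import Optional

def text_cost(input_tokens: int, output_tokens: int,
              input_rate: float, output_rate: Optional[float] = None) -> float:
    """Cost in USD given per-1M-token rates; pass one rate for flat-priced models."""
    if output_rate is None:
        output_rate = input_rate  # flat-rate model: one price for input and output
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Llama 4 Scout 17B 16E Instruct: $0.09 input / $0.29 output per 1M tokens.
scout = text_cost(2_000, 500, 0.09, 0.29)

# Llama 3.1 8B Instruct: flat $0.06 per 1M tokens.
llama8b = text_cost(2_000, 500, 0.06)
```

A request with 2,000 input and 500 output tokens costs about $0.000325 on Llama 4 Scout and $0.00015 on Llama 3.1 8B.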

What are the key benefits of using Nscale Serverless Inference?

- No infrastructure hassles: We handle scaling, monitoring, and resource allocation.
- Cost-effective: Our vertically integrated stack minimises compute costs.
- Scalable & Reliable: Automatic scaling ensures optimal performance.
- Secure & Private: No request or response data is logged or used for training.
- OpenAI API & SDK compatibility: Easily integrate with existing tools.
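Because the service advertises OpenAI API compatibility, a request can be sketched as a standard chat-completions payload. The base URL, model identifier, and environment variable below are illustrative assumptions, not documented values; any OpenAI-compatible client should work the same way.

```python
import json
import os

# Sketch of an OpenAI-compatible chat-completions request.
# NOTE: BASE_URL, the model name, and NSCALE_API_KEY are hypothetical
# placeholders, not documented values.
BASE_URL = "https://inference.example.com/v1"
API_KEY = os.environ.get("NSCALE_API_KEY", "sk-placeholder")

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed identifier
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
body = json.dumps(payload)
# POST `body` to f"{BASE_URL}/chat/completions" with any HTTP client,
# or point the official OpenAI SDK at BASE_URL via its base_url option.
```

Because the payload matches the OpenAI chat-completions schema, switching an existing integration over should only require changing the base URL, API key, and model name.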

How does scaling work?

Nscale automatically adjusts capacity based on real-time demand. There’s no need for manual configuration, making it easy to scale applications seamlessly.

Access thousands of GPUs tailored to your requirements.