4 out of 5 developers ranked us as the most cost-effective GenAI inference provider - with access to popular models and zero rate limits.
![“A dark grid showcasing AI models categorized by their functionality. Top row: ‘Text Generation’ with ‘LLAMA 3.2 11B Instruct’ by Meta, ‘LLAMA 3 70B Instruct’ by Meta, and ‘Mixtral 8x22B Instruct’ by Mistral AI. Bottom row: ‘Text Generation’ with ‘AMD LLAMA 135M’ by AMD, ‘Text-to-Image’ with ‘Stable Diffusion 3 Medium’ by Stability AI, and ‘Text-to-Image’ with ‘Flux.1 [Schnell]’ by Black Forest Labs.”](https://cdn.prod.website-files.com/666078e26595dfe9b1e8171f/6737719462cb3cb2d0e304b7_severless-feature-1.avif)

No rate limits, no cold starts, and no waiting - just fast, reliable inference with automatic scaling built to handle any AI workload. We handle scaling, monitoring, and operations behind the scenes, so your team can focus on building.
Nscale Serverless Inference is a fully managed platform that enables AI model inference without requiring complex infrastructure management. It provides instant access to leading Generative AI models with a simple pay-per-use pricing model.
This service is designed for developers, startups, enterprises, and research teams who want to deploy AI-powered applications quickly and cost-effectively without handling infrastructure complexities.
Nscale follows a pay-per-request model (a worked cost example follows the list):
- Text models: Billed based on input and output tokens.
- Image models: Pricing depends on output image resolution.
- Vision models: Charged based on processing requirements.
- New users receive free credits to explore the platform.
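As a rough illustration of how per-token billing adds up, here is a small sketch using hypothetical prices; the actual per-model rates are the ones published by Nscale:

```python
# Hypothetical per-token rates for illustration only; substitute the
# published rates for the model you are calling.
INPUT_PRICE_PER_M = 0.20   # USD per 1M input tokens (assumed)
OUTPUT_PRICE_PER_M = 0.60  # USD per 1M output tokens (assumed)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one text-generation request under pay-per-token billing."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 1,500-token prompt producing a 500-token completion.
print(f"${request_cost(1_500, 500):.6f}")  # -> $0.000600
```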
- No infrastructure hassles: We handle scaling, monitoring, and resource allocation.
- Cost-effective: Our vertically integrated stack minimises compute costs.
- Scalable & reliable: Automatic scaling ensures optimal performance.
- Secure & private: No request or response data is logged or used for training.
- OpenAI API & SDK compatibility: Easily integrate with existing tools.
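Because the service is OpenAI-compatible, existing code built on the OpenAI SDK can be repointed at Nscale by changing only the base URL and API key. The sketch below uses the OpenAI Python SDK; the base URL and model identifier are placeholders to be replaced with the values from your Nscale dashboard and the model catalogue:

```python
# Minimal sketch using the OpenAI Python SDK against an OpenAI-compatible endpoint.
# The base URL and model ID below are placeholders, not confirmed Nscale values.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.api.nscale.com/v1",  # assumed serverless endpoint
    api_key="YOUR_NSCALE_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarise serverless inference in one sentence."}],
)
print(response.choices[0].message.content)
```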
Nscale automatically adjusts capacity based on real-time demand. There’s no need for manual configuration, making it easy to scale applications seamlessly.
Cost per token is an important metric for evaluating AI inference TCO. It is the measure of what your infrastructure actually delivers. Input metrics like hourly GPU pricing or FLOPs per dollar tell you what you're spending or what's theoretically possible, but cost per token captures a broader picture: hardware performance, software optimization, and real-world utilization in a single number. Nscale's full-stack approach is designed to maximize token throughput across every deployment model, from multi-year private cloud to self-serve on-demand, giving you more useful output from your budget.
Nscale's vertically integrated, full-stack approach is engineered to maximize delivered token output. Built on the latest architectures, including the NVIDIA Blackwell and NVIDIA Blackwell Ultra platforms, Nscale combines infrastructure efficiency with software optimization to drive down cost per token at every layer. For consumption-based customers, this translates directly into better economics per token. For reserved deployments, it means more useful output from every GPU-hour under contract.
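To see why cost per token, rather than hourly GPU price, is the number worth comparing, the back-of-the-envelope sketch below converts an hourly rate into cost per million delivered tokens. All figures are illustrative assumptions, not Nscale prices or benchmarks:

```python
# Back-of-the-envelope conversion from GPU-hour price to delivered cost per token.
# All numbers are illustrative assumptions, not Nscale pricing or benchmark data.
def cost_per_million_tokens(gpu_hour_price: float,
                            tokens_per_second: float,
                            utilization: float) -> float:
    """USD per 1M delivered tokens for one GPU at the given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hour_price / tokens_per_hour * 1_000_000

# Same hourly price, very different economics: throughput and utilization,
# not the sticker price, determine the delivered cost per token.
print(cost_per_million_tokens(2.50, tokens_per_second=1_000, utilization=0.5))  # ~1.39
print(cost_per_million_tokens(2.50, tokens_per_second=3_000, utilization=0.9))  # ~0.26
```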