Maximum Throughput

Minimum Cost

A universal inference platform enabling you to run any model, on any GPU, anywhere. Xinference helps you lower your GPU costs while delivering higher throughput and faster responses.


Proven in production

DataCore saves 55% · NeuralSoft saves 48% · CloudScale saves 99% · QuantumAI saves 72%

Choose Your Plan

From startups to large enterprises — we scale with you

Open Source Support

US$3k / year

Best for startups and dev teams building a proof of concept

  • Expert-led support
  • Priority debugging support
  • Direct developer access
  • NVIDIA GPU compatible
Get Started

Xinference + Xagent Bundle

Enquire Here

Your full AI stack — private inference + Xagent, on your own infrastructure.

  • Full data privacy — no external API calls
  • Lower inference costs on your own GPUs
  • SOC 2 compliant · RBAC & audit logs
  • On-Prem, Cloud, or Hybrid deployment

Xagent

Build and deploy enterprise-grade agents with everyday simplicity.

+

Xinference

Run any model, any GPU, anywhere.
  • Heterogeneous GPU & hardware abstraction
  • GPU optimisation
  • Model lifecycle management
  • Autoscaling & performance optimisation
  • Enterprise security
[Architecture diagram: email, voice, outbound, inbound, chatbot, social, content, and collateral use cases across Sales, Support, and Marketing, built on the Xagent agent platform and powered by the Xinference inference engine]
OpenAI · Claude · Gemini · DeepSeek · Qwen · Grok

300+ LLMs · multimodal · embeddings

Frequently Asked Questions

Everything you need to know about Xinference and how it fits into your AI stack.

What is Xinference and how does it work?

Xinference is an open-source platform that lets you deploy and serve large language models, embedding models, image models, and more — all through a unified API. It abstracts away the complexity of model loading, hardware management, and scaling so your team can focus on building applications.
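
In practice, serving a model takes only a few lines with the open-source Python client. A minimal sketch, assuming a recent Xinference release running locally on its default port; the model name and exact client calls are illustrative and vary by version:

    from xinference.client import Client

    # Connect to a locally running Xinference server (default port shown; adjust as needed).
    client = Client("http://localhost:9997")

    # Launch a model by name; Xinference handles download, loading, and hardware placement.
    model_uid = client.launch_model(model_name="qwen2.5-instruct")

    # Talk to the running model through the same unified interface.
    model = client.get_model(model_uid)
    print(model.chat(messages=[{"role": "user", "content": "What is Xinference?"}]))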

How does Xinference compare to running models via cloud providers?

Cloud providers charge you for every token processed through their managed AI services, and your data passes through their infrastructure. With Xinference, you deploy models on your own infrastructure — cloud, on-prem, or hybrid.

Xinference is a unified, production-ready inference platform that gives you full control over which models to run, which GPUs to use, and where to deploy, all while delivering best-in-class performance and cost optimisation.

How does pricing work?

Pricing is based on the number of nodes per cluster. Xinference Enterprise costs US$15k per node per cluster.

For example, a small deployment of 2 nodes (usually ~16 GPUs) would cost US$30k / annum, while a larger deployment of 250 nodes (usually ~2,000 GPUs) would cost US$3.75m / annum. Multiple clusters are billed separately, per cluster.
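
The arithmetic is simple enough to sketch (illustrative only, using the list price above):

    PRICE_PER_NODE_USD = 15_000  # Xinference Enterprise: per node, per cluster, per year

    def annual_cost(nodes_per_cluster: list[int]) -> int:
        # Each cluster is billed separately: its node count times the list price.
        return sum(nodes * PRICE_PER_NODE_USD for nodes in nodes_per_cluster)

    print(annual_cost([2]))       # 30000    -> US$30k for one 2-node cluster
    print(annual_cost([250]))     # 3750000  -> US$3.75m for one 250-node cluster
    print(annual_cost([2, 250]))  # 3780000  -> two clusters, each billed separately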

What is the difference between the open-source and Enterprise solutions?

Xinference Enterprise delivers better performance and enterprise-grade reliability. Our customers choose Enterprise because it delivers comprehensive hardware compatibility, enables running multiple models on a single GPU, and supercharges performance with up to 2x greater throughput.

Most importantly, Xinference Enterprise comes with critical enterprise management features like RBAC, audit logs, a unified management console and SLA guarantees.

How does Xinference handle data privacy?

With Xinference, you can choose to run your models on your own infrastructure — cloud or on-premises — so your prompts and data never leave your environment. This makes Xinference purpose-built for industries with strict data requirements like finance and healthcare.

Can Xinference integrate with our existing MLOps stack?

Xinference provides a RESTful API compatible with OpenAI's protocol, meaning any tool already built around OpenAI's API works with Xinference by changing a single line of code. Xinference integrates with popular third-party libraries including LangChain, LlamaIndex, Dify, and Chatbox. Kubernetes deployment via Helm is also supported for teams running containerised infrastructure.
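
The "single line" is typically the client's base URL. A minimal sketch using the official OpenAI Python SDK, assuming a local Xinference endpoint and an already-launched model (the host, port, and model name are illustrative):

    from openai import OpenAI

    # Point the standard OpenAI client at a local Xinference server instead of api.openai.com.
    client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-used")

    response = client.chat.completions.create(
        model="qwen2.5-instruct",  # whichever model you launched in Xinference
        messages=[{"role": "user", "content": "Summarise our deployment options."}],
    )
    print(response.choices[0].message.content)

Everything else in an existing OpenAI-based pipeline (LangChain, LlamaIndex, and similar tools) stays unchanged, which is what makes the migration a one-line change.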