# vLLM (/docs/providers/vllm)


## Overview [#overview]

[vLLM](https://docs.vllm.ai) is a high-throughput inference engine designed for serving Large Language Models in production with exceptional performance through PagedAttention.

**Official Website:** [https://vllm.ai](https://vllm.ai)
**Documentation:** [https://docs.vllm.ai](https://docs.vllm.ai)

## Key Features [#key-features]

* **PagedAttention** — Efficient memory management
* **High Throughput** — Continuous batching delivers 10-20x higher throughput than naive serving
* **OpenAI-Compatible** — Drop-in API compatibility
* **Multi-GPU** — Tensor and pipeline parallelism
* **HuggingFace Models** — Serve most Transformers-compatible models from the Hub
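The multi-GPU entry above maps to launch-time flags. A minimal sketch of sharding a model across four GPUs with tensor parallelism, assuming a single node with four visible devices (the model name is a placeholder):

```bash
# Shard the model across 4 GPUs via tensor parallelism (model name is illustrative)
vllm serve model-name \
  --tensor-parallel-size 4 \
  --port 8000
```

For multi-node setups, `--pipeline-parallel-size` can be combined with tensor parallelism to split layers across nodes.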

## Usage Example [#usage-example]

```bash
# Start vLLM server
vllm serve model-name

# Use via API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model-name",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
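Because the server exposes an OpenAI-compatible API, any HTTP client works. A minimal Python sketch using only the standard library; the URL and model name assume the default local server started above:

```python
import json
from urllib import request

# Default vLLM endpoint; adjust host/port if the server was started elsewhere.
VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> request.Request:
    """Build an OpenAI-compatible chat completion request for a local vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("model-name", "Hello!")
# request.urlopen(req) returns the JSON completion once the server is running.
```

The response follows the OpenAI chat completion schema, so existing OpenAI SDK clients can also be pointed at the server by overriding their base URL.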

## Available Models [#available-models]

vLLM supports most HuggingFace Transformers-compatible models. Specify the model's Hub ID when starting the server.
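For example, passing a Hub ID downloads and serves that model directly (the ID below is illustrative; substitute the model you want):

```bash
# Serve a model by its HuggingFace Hub ID
vllm serve mistralai/Mistral-7B-Instruct-v0.2
```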

<Callout type="info">
  vLLM is a self-hosted solution. Models and capabilities depend on your hardware configuration.
</Callout>

## Official Resources [#official-resources]

* [vLLM Website](https://vllm.ai)
* [Documentation](https://docs.vllm.ai)
* [GitHub](https://github.com/vllm-project/vllm)
