vLLM Inference Service Guide
Comprehensive guide to deploying and managing OpenAI-compatible LLM inference services using vLLM in FetchML.
Overview
The vLLM plugin provides high-performance LLM inference with:
- OpenAI-Compatible API: Drop-in replacement for OpenAI's API
- Advanced Scheduling: Continuous batching for throughput optimization
- GPU Optimization: Tensor parallelism and quantization support
- Model Management: Automatic model downloading and caching
- Quantization: AWQ, GPTQ, FP8, and SqueezeLLM support
Quick Start
Start vLLM Service
```bash
# Start development stack
make dev-up

# Start vLLM service with default model
./cli/zig-out/bin/ml service start vllm --name llm-server --model meta-llama/Llama-2-7b-chat-hf

# Or with specific GPU requirements
./cli/zig-out/bin/ml service start vllm \
  --name llm-server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --gpu-count 1 \
  --quantization awq

# Access the API
open http://localhost:8000/docs
```
Using the API
```python
import openai

# Point to local vLLM instance
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM ignores the key unless API key auth is enabled
)

# Chat completion
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ]
)
print(response.choices[0].message.content)
```
Service Management
Creating vLLM Services
```bash
# Create basic vLLM service
./cli/zig-out/bin/ml service start vllm --name my-llm

# Create with specific model
./cli/zig-out/bin/ml service start vllm \
  --name my-llm \
  --model microsoft/DialoGPT-medium

# Create with resource constraints
./cli/zig-out/bin/ml service start vllm \
  --name production-llm \
  --model meta-llama/Llama-2-13b-chat-hf \
  --gpu-count 2 \
  --quantization gptq \
  --max-model-len 4096

# List all vLLM services
./cli/zig-out/bin/ml service list

# Service details
./cli/zig-out/bin/ml service info my-llm
```
Service Configuration
Resource Allocation:
```yaml
# vllm-config.yaml
resources:
  gpu_count: 1
  gpu_memory: 24gb
  cpu: 4
  memory: 16g

model:
  name: "meta-llama/Llama-2-7b-chat-hf"
  quantization: "awq"          # Options: awq, gptq, squeezellm, fp8
  trust_remote_code: false
  max_model_len: 4096

serving:
  port: 8000
  host: "0.0.0.0"
  tensor_parallel_size: 1
  dtype: "auto"                # auto, half, bfloat16, float

optimization:
  enable_prefix_caching: true
  swap_space: 4                # GB
  max_num_batched_tokens: 4096
  max_num_seqs: 256
```
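The batching limits above interact: vLLM generally needs `max_num_batched_tokens` to be at least `max_model_len` so a full-length prompt can be scheduled (exact validation varies by version). A minimal sketch of that sanity check; the `check_batching` helper is illustrative, not FetchML's validator:

```python
def check_batching(max_model_len: int, max_num_batched_tokens: int) -> list[str]:
    """Return a list of configuration warnings (empty if the values look sane)."""
    warnings = []
    # Leave room to prefill one full-length prompt per batch.
    if max_num_batched_tokens < max_model_len:
        warnings.append(
            f"max_num_batched_tokens ({max_num_batched_tokens}) is smaller than "
            f"max_model_len ({max_model_len}); long prompts cannot be scheduled"
        )
    return warnings

# The sample config above uses 4096 for both, which passes:
print(check_batching(4096, 4096))  # []
```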
Environment Variables:
```bash
# Model cache location
export VLLM_MODEL_CACHE=/models

# HuggingFace token for gated models
export HUGGING_FACE_HUB_TOKEN=your_token_here

# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1
```
Service Lifecycle
```bash
# Start a service
./cli/zig-out/bin/ml service start vllm --name my-llm

# Stop a service (graceful shutdown)
./cli/zig-out/bin/ml service stop my-llm

# Restart a service
./cli/zig-out/bin/ml service restart my-llm

# Remove a service (stops and deletes)
./cli/zig-out/bin/ml service remove my-llm

# View service logs
./cli/zig-out/bin/ml service logs my-llm --follow

# Check service health
./cli/zig-out/bin/ml service health my-llm
```
Model Management
Supported Models
vLLM supports most HuggingFace Transformers models:
- Llama 2/3: `meta-llama/Llama-2-7b-chat-hf`, `meta-llama/Llama-2-70b-chat-hf`
- Mistral: `mistralai/Mistral-7B-Instruct-v0.2`
- Mixtral: `mistralai/Mixtral-8x7B-Instruct-v0.1`
- Falcon: `tiiuae/falcon-7b-instruct`
- CodeLlama: `codellama/CodeLlama-7b-hf`
- Phi: `microsoft/phi-2`
- Qwen: `Qwen/Qwen-7B-Chat`
- Gemma: `google/gemma-7b-it`
Model Caching
Models are automatically cached to avoid repeated downloads:
```bash
# Default cache location
~/.cache/huggingface/hub/

# Custom cache location
export VLLM_MODEL_CACHE=/mnt/fast-storage/models

# Pre-download models
./cli/zig-out/bin/ml service prefetch --model meta-llama/Llama-2-7b-chat-hf
```
Quantization
Quantization reduces memory usage and can improve inference speed, at a small cost in output quality:
```bash
# AWQ (4-bit quantization)
./cli/zig-out/bin/ml service start vllm \
  --name llm-awq \
  --model TheBloke/Llama-2-7B-AWQ \
  --quantization awq

# GPTQ (4-bit quantization)
./cli/zig-out/bin/ml service start vllm \
  --name llm-gptq \
  --model TheBloke/Llama-2-7B-GPTQ \
  --quantization gptq

# FP8 (8-bit floating point)
./cli/zig-out/bin/ml service start vllm \
  --name llm-fp8 \
  --model meta-llama/Llama-2-7b-chat-hf \
  --quantization fp8
```
Quantization Comparison:
| Method | Bits | Memory Reduction | Speed Impact | Quality |
|---|---|---|---|---|
| None (FP16) | 16 | 1x | Baseline | Best |
| FP8 | 8 | 2x | Faster | Excellent |
| AWQ | 4 | 4x | Fast | Very Good |
| GPTQ | 4 | 4x | Fast | Very Good |
| SqueezeLLM | 4 | 4x | Fast | Good |
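The memory-reduction column follows directly from the bit width; a quick sketch of the arithmetic for a 7B-parameter model (weights only, excluding KV cache and activations):

```python
PARAMS = 7e9  # 7B parameters

def weight_gb(bits: int, params: float = PARAMS) -> float:
    """Approximate weight memory in (decimal) GB at a given bit width."""
    return params * bits / 8 / 1e9

for name, bits in [("FP16", 16), ("FP8", 8), ("AWQ", 4), ("GPTQ", 4)]:
    print(f"{name}: {weight_gb(bits):.1f} GB ({16 // bits}x reduction)")
# FP16: 14.0 GB (1x reduction)
# FP8: 7.0 GB (2x reduction)
# AWQ: 3.5 GB (4x reduction)
# GPTQ: 3.5 GB (4x reduction)
```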
API Reference
OpenAI-Compatible Endpoints
vLLM provides OpenAI-compatible REST API endpoints:
Chat Completions:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
Completions (Legacy):
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "prompt": "The capital of France is",
    "max_tokens": 10
  }'
```
Embeddings (note: this endpoint only returns vectors when the service is running an embedding model; a generative chat model will return an error):
```bash
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "input": "Hello world"
  }'
```
List Models:
```bash
curl http://localhost:8000/v1/models
```
Streaming Responses
Enable streaming for real-time token generation:
```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True,
    max_tokens=200
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
Advanced Parameters
```python
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=messages,
    # Standard OpenAI generation parameters
    max_tokens=500,
    temperature=0.7,
    top_p=0.9,
    # Repetition and penalties
    frequency_penalty=0.5,
    presence_penalty=0.5,
    # Sampling
    seed=42,
    stop=["END", "STOP"],
    # vLLM-specific parameters go in extra_body; the OpenAI client
    # rejects them as top-level keyword arguments
    extra_body={
        "top_k": 40,
        "repetition_penalty": 1.1,
        "best_of": 1,
        "use_beam_search": False,
    },
)
```
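To build intuition for `temperature`: it rescales logits before the softmax, so low values sharpen the distribution toward the top token and high values flatten it toward uniform. A toy sketch of the math (not vLLM's actual sampler):

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits / temperature, with the usual max-subtraction for stability."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.2)  # near-greedy
hot = softmax_with_temperature(logits, 2.0)   # closer to uniform
print(max(cold) > max(hot))  # True: low temperature concentrates probability
```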
GPU Quotas and Resource Management
Per-User GPU Limits
The scheduler enforces GPU quotas for vLLM services:
```yaml
# scheduler-config.yaml
scheduler:
  plugin_quota:
    enabled: true
    total_gpus: 16
    per_user_gpus: 4
    per_user_services: 2
    per_plugin_limits:
      vllm:
        max_gpus: 8
        max_services: 4
    user_overrides:
      admin:
        max_gpus: 8
        max_services: 5
        allowed_plugins: ["vllm", "jupyter"]
```
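The effect of these limits can be modeled as a simple admission check. This is an illustrative sketch of the policy, not the scheduler's actual code:

```python
def admit(user_gpus_in_use: int, requested_gpus: int,
          per_user_gpus: int,
          plugin_gpus_in_use: int, plugin_max_gpus: int) -> bool:
    """True if a new vLLM service request fits both the per-user and per-plugin caps."""
    if user_gpus_in_use + requested_gpus > per_user_gpus:
        return False  # would exceed the user's GPU quota
    if plugin_gpus_in_use + requested_gpus > plugin_max_gpus:
        return False  # would exceed the vllm plugin's GPU cap
    return True

# With per_user_gpus=4 and vllm max_gpus=8, as in the config above:
print(admit(2, 2, 4, 6, 8))  # True: exactly hits both caps
print(admit(2, 3, 4, 6, 8))  # False: would exceed the per-user limit
```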
Resource Monitoring
```bash
# Check GPU allocation for your user
./cli/zig-out/bin/ml service quota

# View current usage
./cli/zig-out/bin/ml service usage

# Monitor service resource usage
./cli/zig-out/bin/ml service stats my-llm
```
Multi-GPU and Distributed Inference
Tensor Parallelism
For large models that don't fit on a single GPU:
```bash
# 70B model across 4 GPUs
./cli/zig-out/bin/ml service start vllm \
  --name llm-70b \
  --model meta-llama/Llama-2-70b-chat-hf \
  --gpu-count 4 \
  --tensor-parallel-size 4
```
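Tensor parallelism shards the weights, so each GPU holds roughly 1/N of them. For the 70B example above, the rough per-GPU arithmetic (weights only, FP16, ignoring KV cache and activation overhead) looks like:

```python
def per_gpu_weight_gb(params: float, bytes_per_param: int, tp: int) -> float:
    """Approximate weight memory per GPU under tensor parallelism, in decimal GB."""
    return params * bytes_per_param / tp / 1e9

# 70B parameters at FP16 (2 bytes/param) sharded across 4 GPUs:
print(per_gpu_weight_gb(70e9, 2, 4))  # 35.0 -> needs 40 GB+ cards
```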
Pipeline Parallelism
For very large models with pipeline stages:
```yaml
# Pipeline parallelism config
model:
  name: "meta-llama/Llama-2-70b-chat-hf"

serving:
  tensor_parallel_size: 2
  pipeline_parallel_size: 2  # Total 4 GPUs
```
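The GPU count a configuration needs is simply the product of the two parallelism degrees, since each (tensor, pipeline) rank pair occupies one GPU. A one-line check:

```python
def world_size(tensor_parallel: int, pipeline_parallel: int) -> int:
    """Total GPUs required: one rank per (tp, pp) pair."""
    return tensor_parallel * pipeline_parallel

print(world_size(2, 2))  # 4
```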
Integration with Experiments
Using vLLM from Training Jobs
```python
# In your training script
import requests

# Call the local vLLM service
response = requests.post(
    "http://vllm-service:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Summarize this text"}]
    },
    timeout=60,  # long generations can take a while; avoid hanging forever
)
result = response.json()
summary = result["choices"][0]["message"]["content"]
```
Linking with Experiments
```bash
# Start vLLM service linked to experiment
./cli/zig-out/bin/ml service start vllm \
  --name llm-exp-1 \
  --model meta-llama/Llama-2-7b-chat-hf \
  --experiment experiment-id

# View linked services
./cli/zig-out/bin/ml service list --experiment experiment-id
```
Security and Access Control
Network Isolation
```bash
# Restrict to internal network only
./cli/zig-out/bin/ml service start vllm \
  --name internal-llm \
  --model meta-llama/Llama-2-7b-chat-hf \
  --host 10.0.0.1 \
  --port 8000
```
API Key Authentication
```yaml
# vllm-security.yaml
auth:
  api_key_required: true
  allowed_ips:
    - "10.0.0.0/8"
    - "192.168.0.0/16"

rate_limit:
  requests_per_minute: 60
  tokens_per_minute: 10000
```
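The `requests_per_minute` limit behaves like a token bucket: short bursts pass, sustained traffic above the rate is throttled. A toy model of that behavior (illustrative only, not the server's implementation):

```python
class TokenBucket:
    """Refills at rate_per_min tokens per minute, up to a burst capacity."""
    def __init__(self, rate_per_min: float, capacity: float):
        self.rate = rate_per_min / 60.0  # tokens per second
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill for elapsed time, then spend one token if available.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate_per_min=60, capacity=5)  # 60 req/min with a burst of 5
burst = [bucket.allow(0.0) for _ in range(6)]
print(burst)  # first 5 allowed, 6th rejected
```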
Audit Trail
All API calls are logged for compliance:
```bash
# View audit log
./cli/zig-out/bin/ml service audit my-llm

# Export audit report
./cli/zig-out/bin/ml service audit my-llm --export=csv

# Check access patterns
./cli/zig-out/bin/ml service audit my-llm --summary
```
Monitoring and Troubleshooting
Health Checks
```bash
# Check service health
./cli/zig-out/bin/ml service health my-llm

# Detailed diagnostics
./cli/zig-out/bin/ml service diagnose my-llm

# View service status
./cli/zig-out/bin/ml service status my-llm
```
Performance Monitoring
```bash
# Real-time metrics
./cli/zig-out/bin/ml service monitor my-llm

# Performance report
./cli/zig-out/bin/ml service report my-llm --format=html

# GPU utilization
./cli/zig-out/bin/ml service stats my-llm --gpu
```
Common Issues
Out of Memory:
```bash
# Reduce batch size
./cli/zig-out/bin/ml service update my-llm --max-num-seqs 128

# Enable quantization
./cli/zig-out/bin/ml service update my-llm --quantization awq

# Reduce GPU memory fraction
export VLLM_GPU_MEMORY_FRACTION=0.85
```
Model Download Failures:
```bash
# Set HuggingFace token
export HUGGING_FACE_HUB_TOKEN=your_token

# Use mirror
export HF_ENDPOINT=https://hf-mirror.com

# Pre-download with retry
./cli/zig-out/bin/ml service prefetch --model meta-llama/Llama-2-7b-chat-hf --retry
```
Slow Inference:
```bash
# Enable prefix caching
./cli/zig-out/bin/ml service update my-llm --enable-prefix-caching

# Increase batch size
./cli/zig-out/bin/ml service update my-llm --max-num-batched-tokens 8192

# Check GPU utilization
nvidia-smi dmon -s u
```
Best Practices
Resource Planning
- GPU Memory Calculation: Model size × precision × overhead (1.2-1.5x)
- Batch Size Tuning: Balance throughput vs. latency
- Quantization: Use AWQ/GPTQ for production, FP16 for best quality
- Prefix Caching: Enable for chat applications with repeated prompts
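The first bullet can be turned into a quick estimator. The 1.2-1.5x overhead factor is a rough allowance for KV cache, activations, and CUDA context; treat the result as a planning figure, not a guarantee:

```python
def estimate_gpu_memory_gb(params_billion: float, bytes_per_param: float,
                           overhead: float = 1.3) -> float:
    """Rough GPU memory need: model size x precision x overhead."""
    return params_billion * bytes_per_param * overhead

# 7B model at FP16 (2 bytes/param) with 1.3x overhead:
print(round(estimate_gpu_memory_gb(7, 2.0), 1))  # 18.2 -> fits a 24 GB card
```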
Production Deployment
- Load Balancing: Deploy multiple vLLM instances behind a load balancer
- Health Checks: Configure Kubernetes liveness/readiness probes
- Autoscaling: Scale based on queue depth or GPU utilization
- Monitoring: Track tokens/sec, queue depth, and error rates
Security
- Network Segmentation: Isolate vLLM on internal network
- Rate Limiting: Prevent abuse with per-user quotas
- Input Validation: Sanitize prompts to prevent injection attacks
- Audit Logging: Enable comprehensive audit trails
CLI Reference
Service Commands
```bash
# Start a service
ml service start vllm [flags]
  --name string                Service name (required)
  --model string               Model name or path (default: "meta-llama/Llama-2-7b-chat-hf")
  --gpu-count int              Number of GPUs (default: 1)
  --quantization string        Quantization method (awq, gptq, fp8, squeezellm)
  --port int                   Service port (default: 8000)
  --max-model-len int          Maximum sequence length
  --tensor-parallel-size int   Tensor parallelism degree

# List services
ml service list [flags]
  --format string              Output format (table, json)
  --all                        Show all users' services (admin only)

# Service operations
ml service stop <name>
ml service start <name>        # Restart a stopped service
ml service restart <name>
ml service remove <name>
ml service logs <name> [flags]
  --follow                     Follow log output
  --tail int                   Number of lines to show (default: 100)
ml service info <name>
ml service health <name>
```
See Also
- Testing Guide - Testing vLLM services
- Deployment Guide - Production deployment
- Security Guide - Security best practices
- Scheduler Architecture - How vLLM integrates with scheduler
- CLI Reference - Command-line tools
- Jupyter Workflow - Jupyter integration with vLLM