vLLM Inference Service Guide

Comprehensive guide to deploying and managing OpenAI-compatible LLM inference services using vLLM in FetchML.

Overview

The vLLM plugin provides high-performance LLM inference with:

  • OpenAI-Compatible API: Drop-in replacement for OpenAI's API
  • Advanced Scheduling: Continuous batching for throughput optimization
  • GPU Optimization: Tensor parallelism and quantization support
  • Model Management: Automatic model downloading and caching
  • Quantization: AWQ, GPTQ, FP8, and SqueezeLLM support

Quick Start

Start vLLM Service

# Start development stack
make dev-up

# Start vLLM service with default model
./cli/zig-out/bin/ml service start vllm --name llm-server --model meta-llama/Llama-2-7b-chat-hf

# Or with specific GPU requirements
./cli/zig-out/bin/ml service start vllm \
  --name llm-server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --gpu-count 1 \
  --quantization awq

# Access the API
open http://localhost:8000/docs

Using the API

import openai

# Point to local vLLM instance
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Chat completion
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ]
)

print(response.choices[0].message.content)

Service Management

Creating vLLM Services

# Create basic vLLM service
./cli/zig-out/bin/ml service start vllm --name my-llm

# Create with specific model
./cli/zig-out/bin/ml service start vllm \
  --name my-llm \
  --model microsoft/DialoGPT-medium

# Create with resource constraints
./cli/zig-out/bin/ml service start vllm \
  --name production-llm \
  --model meta-llama/Llama-2-13b-chat-hf \
  --gpu-count 2 \
  --quantization gptq \
  --max-model-len 4096

# List all vLLM services
./cli/zig-out/bin/ml service list

# Service details
./cli/zig-out/bin/ml service info my-llm

Service Configuration

Resource Allocation:

# vllm-config.yaml
resources:
  gpu_count: 1
  gpu_memory: 24gb
  cpu: 4
  memory: 16g

model:
  name: "meta-llama/Llama-2-7b-chat-hf"
  quantization: "awq"  # Options: awq, gptq, squeezellm, fp8
  trust_remote_code: false
  max_model_len: 4096

serving:
  port: 8000
  host: "0.0.0.0"
  tensor_parallel_size: 1
  dtype: "auto"  # auto, half, bfloat16, float

optimization:
  enable_prefix_caching: true
  swap_space: 4  # GB
  max_num_batched_tokens: 4096
  max_num_seqs: 256

Environment Variables:

# Model cache location
export VLLM_MODEL_CACHE=/models

# HuggingFace token for gated models
export HUGGING_FACE_HUB_TOKEN=your_token_here

# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1

Service Lifecycle

# Start a service
./cli/zig-out/bin/ml service start vllm --name my-llm

# Stop a service (graceful shutdown)
./cli/zig-out/bin/ml service stop my-llm

# Restart a service
./cli/zig-out/bin/ml service restart my-llm

# Remove a service (stops and deletes)
./cli/zig-out/bin/ml service remove my-llm

# View service logs
./cli/zig-out/bin/ml service logs my-llm --follow

# Check service health
./cli/zig-out/bin/ml service health my-llm

Model Management

Supported Models

vLLM supports most HuggingFace Transformers models:

  • Llama 2/3: meta-llama/Llama-2-7b-chat-hf, meta-llama/Llama-2-70b-chat-hf
  • Mistral: mistralai/Mistral-7B-Instruct-v0.2
  • Mixtral: mistralai/Mixtral-8x7B-Instruct-v0.1
  • Falcon: tiiuae/falcon-7b-instruct
  • CodeLlama: codellama/CodeLlama-7b-hf
  • Phi: microsoft/phi-2
  • Qwen: Qwen/Qwen-7B-Chat
  • Gemma: google/gemma-7b-it

Model Caching

Models are automatically cached to avoid repeated downloads:

# Default cache location
~/.cache/huggingface/hub/

# Custom cache location
export VLLM_MODEL_CACHE=/mnt/fast-storage/models

# Pre-download models
./cli/zig-out/bin/ml service prefetch --model meta-llama/Llama-2-7b-chat-hf

Quantization

Quantization reduces memory usage and can improve inference throughput, at a small cost in output quality:

# AWQ (4-bit quantization)
./cli/zig-out/bin/ml service start vllm \
  --name llm-awq \
  --model TheBloke/Llama-2-7B-AWQ \
  --quantization awq

# GPTQ (4-bit quantization)
./cli/zig-out/bin/ml service start vllm \
  --name llm-gptq \
  --model TheBloke/Llama-2-7B-GPTQ \
  --quantization gptq

# FP8 (8-bit floating point)
./cli/zig-out/bin/ml service start vllm \
  --name llm-fp8 \
  --model meta-llama/Llama-2-7b-chat-hf \
  --quantization fp8

Quantization Comparison:

| Method      | Bits | Memory Reduction | Speed Impact | Quality   |
|-------------|------|------------------|--------------|-----------|
| None (FP16) | 16   | 1x               | Baseline     | Best      |
| FP8         | 8    | 2x               | Faster       | Excellent |
| AWQ         | 4    | 4x               | Fast         | Very Good |
| GPTQ        | 4    | 4x               | Fast         | Very Good |
| SqueezeLLM  | 4    | 4x               | Fast         | Good      |

API Reference

OpenAI-Compatible Endpoints

vLLM provides OpenAI-compatible REST API endpoints:

Chat Completions:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Completions (Legacy):

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "prompt": "The capital of France is",
    "max_tokens": 10
  }'

Embeddings:

Note: the /v1/embeddings endpoint is only served when the running model is an embedding model; chat models such as meta-llama/Llama-2-7b-chat-hf will not return embeddings. The model below (intfloat/e5-mistral-7b-instruct) is one example of an embedding model supported by vLLM.

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "intfloat/e5-mistral-7b-instruct",
    "input": "Hello world"
  }'

List Models:

curl http://localhost:8000/v1/models

Streaming Responses

Enable streaming for real-time token generation:

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True,
    max_tokens=200
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Advanced Parameters

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=messages,

    # Standard OpenAI generation parameters
    max_tokens=500,
    temperature=0.7,
    top_p=0.9,

    # Repetition and penalties
    frequency_penalty=0.5,
    presence_penalty=0.5,

    # Sampling
    seed=42,
    stop=["END", "STOP"],

    # vLLM-specific sampling parameters go through extra_body,
    # because the OpenAI client rejects unknown keyword arguments
    extra_body={
        "top_k": 40,
        "repetition_penalty": 1.1,
        "best_of": 1,
        "use_beam_search": False,
    },
)

GPU Quotas and Resource Management

Per-User GPU Limits

The scheduler enforces GPU quotas for vLLM services:

# scheduler-config.yaml
scheduler:
  plugin_quota:
    enabled: true
    total_gpus: 16
    per_user_gpus: 4
    per_user_services: 2
    per_plugin_limits:
      vllm:
        max_gpus: 8
        max_services: 4
    user_overrides:
      admin:
        max_gpus: 8
        max_services: 5
        allowed_plugins: ["vllm", "jupyter"]

Resource Monitoring

# Check GPU allocation for your user
./cli/zig-out/bin/ml service quota

# View current usage
./cli/zig-out/bin/ml service usage

# Monitor service resource usage
./cli/zig-out/bin/ml service stats my-llm

Multi-GPU and Distributed Inference

Tensor Parallelism

For large models that don't fit on a single GPU:

# 70B model across 4 GPUs
./cli/zig-out/bin/ml service start vllm \
  --name llm-70b \
  --model meta-llama/Llama-2-70b-chat-hf \
  --gpu-count 4 \
  --tensor-parallel-size 4

Pipeline Parallelism

For very large models with pipeline stages:

# Pipeline parallelism config
model:
  name: "meta-llama/Llama-2-70b-chat-hf"
  
serving:
  tensor_parallel_size: 2
  pipeline_parallel_size: 2  # Total 4 GPUs
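
The two degrees multiply: a service needs tensor_parallel_size × pipeline_parallel_size GPUs in total, since every pipeline stage holds a full tensor-parallel group. A minimal sketch of that arithmetic (the helper name is illustrative):

```python
def total_gpus(tensor_parallel_size: int, pipeline_parallel_size: int = 1) -> int:
    """Each pipeline stage holds a complete tensor-parallel group,
    so the GPU requirement is the product of the two degrees."""
    return tensor_parallel_size * pipeline_parallel_size

# The config above: 2-way tensor x 2-way pipeline -> 4 GPUs
print(total_gpus(2, 2))  # -> 4
```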

Integration with Experiments

Using vLLM from Training Jobs

# In your training script
import requests

# Call local vLLM service
response = requests.post(
    "http://vllm-service:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Summarize this text"}]
    }
)

result = response.json()
summary = result["choices"][0]["message"]["content"]
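
The snippet above assumes the service answers on the first try. In practice a training job may start before the LLM finishes loading, so a sketch with a timeout and simple exponential backoff can help (the function name and retry policy are illustrative, not part of FetchML):

```python
import time

import requests


def chat_with_retry(url: str, payload: dict, retries: int = 3,
                    timeout: float = 30.0, backoff: float = 2.0) -> dict:
    """POST to a vLLM chat endpoint, retrying on transient network or
    HTTP errors with exponential backoff. Returns the parsed JSON body."""
    for attempt in range(retries):
        try:
            resp = requests.post(url, json=payload, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # out of retries, surface the last error
            time.sleep(backoff ** attempt)
```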

Linking with Experiments

# Start vLLM service linked to experiment
./cli/zig-out/bin/ml service start vllm \
  --name llm-exp-1 \
  --model meta-llama/Llama-2-7b-chat-hf \
  --experiment experiment-id

# View linked services
./cli/zig-out/bin/ml service list --experiment experiment-id

Security and Access Control

Network Isolation

# Restrict to internal network only
./cli/zig-out/bin/ml service start vllm \
  --name internal-llm \
  --model meta-llama/Llama-2-7b-chat-hf \
  --host 10.0.0.1 \
  --port 8000

API Key Authentication

# vllm-security.yaml
auth:
  api_key_required: true
  allowed_ips:
    - "10.0.0.0/8"
    - "192.168.0.0/16"
  
rate_limit:
  requests_per_minute: 60
  tokens_per_minute: 10000
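
When api_key_required is enabled, clients must present the key on every request. A minimal sketch, assuming the key is sent as a standard Bearer token (the VLLM_API_KEY variable name is illustrative, not a documented setting):

```python
import os


def authed_headers() -> dict:
    """Build request headers for a deployment with api_key_required: true.
    The key is read from an environment variable rather than hard-coded;
    pass these headers to any HTTP client (requests, httpx, ...)."""
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('VLLM_API_KEY', '')}",
    }
```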

Audit Trail

All API calls are logged for compliance:

# View audit log
./cli/zig-out/bin/ml service audit my-llm

# Export audit report
./cli/zig-out/bin/ml service audit my-llm --export=csv

# Check access patterns
./cli/zig-out/bin/ml service audit my-llm --summary

Monitoring and Troubleshooting

Health Checks

# Check service health
./cli/zig-out/bin/ml service health my-llm

# Detailed diagnostics
./cli/zig-out/bin/ml service diagnose my-llm

# View service status
./cli/zig-out/bin/ml service status my-llm
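
For scripted checks, upstream vLLM's OpenAI-compatible server also exposes a plain GET /health endpoint that returns 200 once the model is loaded. A small polling helper might look like this (assuming the FetchML deployment passes that endpoint through):

```python
import time

import requests


def wait_until_healthy(base_url: str, timeout_s: float = 300,
                       interval_s: float = 5) -> bool:
    """Poll the server's /health endpoint until it returns HTTP 200
    or the deadline passes. Useful before sending the first request
    to a freshly started service, since model loading can take minutes."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server still starting up; keep polling
        time.sleep(interval_s)
    return False
```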

Performance Monitoring

# Real-time metrics
./cli/zig-out/bin/ml service monitor my-llm

# Performance report
./cli/zig-out/bin/ml service report my-llm --format=html

# GPU utilization
./cli/zig-out/bin/ml service stats my-llm --gpu

Common Issues

Out of Memory:

# Reduce batch size
./cli/zig-out/bin/ml service update my-llm --max-num-seqs 128

# Enable quantization
./cli/zig-out/bin/ml service update my-llm --quantization awq

# Reduce GPU memory fraction
export VLLM_GPU_MEMORY_FRACTION=0.85

Model Download Failures:

# Set HuggingFace token
export HUGGING_FACE_HUB_TOKEN=your_token

# Use mirror
export HF_ENDPOINT=https://hf-mirror.com

# Pre-download with retry
./cli/zig-out/bin/ml service prefetch --model meta-llama/Llama-2-7b-chat-hf --retry

Slow Inference:

# Enable prefix caching
./cli/zig-out/bin/ml service update my-llm --enable-prefix-caching

# Increase batch size
./cli/zig-out/bin/ml service update my-llm --max-num-batched-tokens 8192

# Check GPU utilization
nvidia-smi dmon -s u

Best Practices

Resource Planning

  1. GPU Memory Calculation: parameter count × bytes per parameter × overhead factor (1.2-1.5x for KV cache and activations)
  2. Batch Size Tuning: Balance throughput vs. latency
  3. Quantization: Use AWQ/GPTQ for production, FP16 for best quality
  4. Prefix Caching: Enable for chat applications with repeated prompts
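
The memory rule of thumb in step 1 can be sketched as a quick calculation (the overhead factor is the 1.2-1.5x fudge from above; actual usage also depends on context length and batch size):

```python
def estimate_gpu_memory_gb(params_billion: float, bits: int = 16,
                           overhead: float = 1.3) -> float:
    """Rough GPU memory estimate: parameter count x bytes per parameter
    x an overhead factor for KV cache and activations. Illustrative only."""
    weight_gb = params_billion * (bits / 8)
    return weight_gb * overhead

fp16_7b = estimate_gpu_memory_gb(7, bits=16)  # ~18.2 GB
awq_7b = estimate_gpu_memory_gb(7, bits=4)    # ~4.6 GB
```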

Production Deployment

  1. Load Balancing: Deploy multiple vLLM instances behind a load balancer
  2. Health Checks: Configure Kubernetes liveness/readiness probes
  3. Autoscaling: Scale based on queue depth or GPU utilization
  4. Monitoring: Track tokens/sec, queue depth, and error rates

Security

  1. Network Segmentation: Isolate vLLM on internal network
  2. Rate Limiting: Prevent abuse with per-user quotas
  3. Input Validation: Sanitize prompts to prevent injection attacks
  4. Audit Logging: Enable comprehensive audit trails
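
For step 3, a minimal prompt-validation sketch (the length cap and character policy are illustrative, not FetchML defaults):

```python
MAX_PROMPT_CHARS = 8000  # illustrative limit


def validate_prompt(prompt: str) -> str:
    """Basic input validation before forwarding a prompt: enforce a
    length cap and strip non-printable control characters that could be
    used to smuggle content into logs or prompt templates."""
    if not prompt or len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt empty or too long")
    return "".join(ch for ch in prompt if ch.isprintable() or ch in "\n\t")
```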

CLI Reference

Service Commands

# Start a service
ml service start vllm [flags]
  --name string           Service name (required)
  --model string          Model name or path (default: "meta-llama/Llama-2-7b-chat-hf")
  --gpu-count int         Number of GPUs (default: 1)
  --quantization string   Quantization method (awq, gptq, fp8, squeezellm)
  --port int             Service port (default: 8000)
  --max-model-len int    Maximum sequence length
  --tensor-parallel-size int  Tensor parallelism degree

# List services
ml service list [flags]
  --format string    Output format (table, json)
  --all             Show all users' services (admin only)

# Service operations
ml service stop <name>
ml service start <name>      # Start a previously stopped service
ml service restart <name>
ml service remove <name>
ml service logs <name> [flags]
  --follow          Follow log output
  --tail int        Number of lines to show (default: 100)
ml service info <name>
ml service health <name>

See Also