vLLM Inference Service Guide
Comprehensive guide to deploying and managing OpenAI-compatible LLM inference services using vLLM in FetchML.
Overview
The vLLM plugin provides high-performance LLM inference with:
- OpenAI-Compatible API: Drop-in replacement for OpenAI's API
- Advanced Scheduling: Continuous batching for throughput optimization
- GPU Optimization: Tensor parallelism and quantization support
- Model Management: Automatic model downloading and caching
- Quantization: AWQ, GPTQ, FP8, and SqueezeLLM support
Quick Start
Start vLLM Service
```bash
# Start development stack
make dev-up

# Start vLLM service with default model
./cli/zig-out/bin/ml service start vllm --name llm-server --model meta-llama/Llama-2-7b-chat-hf

# Or with specific GPU requirements
./cli/zig-out/bin/ml service start vllm \
  --name llm-server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --gpu-count 1 \
  --quantization awq

# Access the API
open http://localhost:8000/docs
```
Using the API
```python
import openai

# Point to local vLLM instance
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM ignores the key unless API key auth is enabled
)

# Chat completion
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ]
)
print(response.choices[0].message.content)
```
Service Management
Creating vLLM Services
```bash
# Create basic vLLM service
./cli/zig-out/bin/ml service start vllm --name my-llm

# Create with specific model
./cli/zig-out/bin/ml service start vllm \
  --name my-llm \
  --model microsoft/DialoGPT-medium

# Create with resource constraints
./cli/zig-out/bin/ml service start vllm \
  --name production-llm \
  --model meta-llama/Llama-2-13b-chat-hf \
  --gpu-count 2 \
  --quantization gptq \
  --max-model-len 4096

# List all vLLM services
./cli/zig-out/bin/ml service list

# Service details
./cli/zig-out/bin/ml service info my-llm
```
Service Configuration
Resource Allocation:
```yaml
# vllm-config.yaml
resources:
  gpu_count: 1
  gpu_memory: 24gb
  cpu: 4
  memory: 16g

model:
  name: "meta-llama/Llama-2-7b-chat-hf"
  quantization: "awq"          # Options: awq, gptq, squeezellm, fp8
  trust_remote_code: false
  max_model_len: 4096

serving:
  port: 8000
  host: "0.0.0.0"
  tensor_parallel_size: 1
  dtype: "auto"                # auto, half, bfloat16, float

optimization:
  enable_prefix_caching: true
  swap_space: 4                # GB
  max_num_batched_tokens: 4096
  max_num_seqs: 256
```
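The batching limits above interact: vLLM generally needs `max_num_batched_tokens` to be at least `max_model_len` so a full-length prompt can be scheduled (exact validation varies by version). A minimal sketch of that sanity check; the `check_batching` helper is illustrative, not FetchML's validator:

```python
def check_batching(max_model_len: int, max_num_batched_tokens: int) -> list[str]:
    """Return a list of configuration warnings (empty if the values look sane)."""
    warnings = []
    # Leave room to prefill one full-length prompt per batch.
    if max_num_batched_tokens < max_model_len:
        warnings.append(
            f"max_num_batched_tokens ({max_num_batched_tokens}) is smaller than "
            f"max_model_len ({max_model_len}); long prompts cannot be scheduled"
        )
    return warnings

# The sample config above uses 4096 for both, which passes:
print(check_batching(4096, 4096))  # []
```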
Environment Variables:
```bash
# Model cache location
export VLLM_MODEL_CACHE=/models

# HuggingFace token for gated models
export HUGGING_FACE_HUB_TOKEN=your_token_here

# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1
```
Service Lifecycle
```bash
# Start a service
./cli/zig-out/bin/ml service start vllm --name my-llm

# Stop a service (graceful shutdown)
./cli/zig-out/bin/ml service stop my-llm

# Restart a service
./cli/zig-out/bin/ml service restart my-llm

# Remove a service (stops and deletes)
./cli/zig-out/bin/ml service remove my-llm

# View service logs
./cli/zig-out/bin/ml service logs my-llm --follow

# Check service health
./cli/zig-out/bin/ml service health my-llm
```
Model Management
Supported Models
vLLM supports most HuggingFace Transformers models:
- Llama 2/3: `meta-llama/Llama-2-7b-chat-hf`, `meta-llama/Llama-2-70b-chat-hf`
- Mistral: `mistralai/Mistral-7B-Instruct-v0.2`
- Mixtral: `mistralai/Mixtral-8x7B-Instruct-v0.1`
- Falcon: `tiiuae/falcon-7b-instruct`
- CodeLlama: `codellama/CodeLlama-7b-hf`
- Phi: `microsoft/phi-2`
- Qwen: `Qwen/Qwen-7B-Chat`
- Gemma: `google/gemma-7b-it`
Model Caching
Models are automatically cached to avoid repeated downloads:
```bash
# Default cache location
~/.cache/huggingface/hub/

# Custom cache location
export VLLM_MODEL_CACHE=/mnt/fast-storage/models

# Pre-download models
./cli/zig-out/bin/ml service prefetch --model meta-llama/Llama-2-7b-chat-hf
```
Quantization
Quantization reduces memory usage and can improve inference speed, at a small cost in output quality:
```bash
# AWQ (4-bit quantization)
./cli/zig-out/bin/ml service start vllm \
  --name llm-awq \
  --model TheBloke/Llama-2-7B-AWQ \
  --quantization awq

# GPTQ (4-bit quantization)
./cli/zig-out/bin/ml service start vllm \
  --name llm-gptq \
  --model TheBloke/Llama-2-7B-GPTQ \
  --quantization gptq

# FP8 (8-bit floating point)
./cli/zig-out/bin/ml service start vllm \
  --name llm-fp8 \
  --model meta-llama/Llama-2-7b-chat-hf \
  --quantization fp8
```
Quantization Comparison:
| Method | Bits | Memory Reduction | Speed Impact | Quality |
|---|---|---|---|---|
| None (FP16) | 16 | 1x | Baseline | Best |
| FP8 | 8 | 2x | Faster | Excellent |
| AWQ | 4 | 4x | Fast | Very Good |
| GPTQ | 4 | 4x | Fast | Very Good |
| SqueezeLLM | 4 | 4x | Fast | Good |
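The memory-reduction column follows directly from the bit width; a quick sketch of the arithmetic for a 7B-parameter model (weights only, excluding KV cache and activations):

```python
PARAMS = 7e9  # 7B parameters

def weight_gb(bits: int, params: float = PARAMS) -> float:
    """Approximate weight memory in (decimal) GB at a given bit width."""
    return params * bits / 8 / 1e9

for name, bits in [("FP16", 16), ("FP8", 8), ("AWQ", 4), ("GPTQ", 4)]:
    print(f"{name}: {weight_gb(bits):.1f} GB ({16 // bits}x reduction)")
# FP16: 14.0 GB (1x reduction)
# FP8: 7.0 GB (2x reduction)
# AWQ: 3.5 GB (4x reduction)
# GPTQ: 3.5 GB (4x reduction)
```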
API Reference
OpenAI-Compatible Endpoints
vLLM provides OpenAI-compatible REST API endpoints:
Chat Completions:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
Completions (Legacy):
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "prompt": "The capital of France is",
    "max_tokens": 10
  }'
```
Embeddings (note: this endpoint only returns vectors when the service is running an embedding model; a generative chat model will return an error):
```bash
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "input": "Hello world"
  }'
```
List Models:
```bash
curl http://localhost:8000/v1/models
```
Streaming Responses
Enable streaming for real-time token generation:
```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True,
    max_tokens=200
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
Advanced Parameters
```python
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=messages,
    # Standard OpenAI generation parameters
    max_tokens=500,
    temperature=0.7,
    top_p=0.9,
    # Repetition and penalties
    frequency_penalty=0.5,
    presence_penalty=0.5,
    # Sampling
    seed=42,
    stop=["END", "STOP"],
    # vLLM-specific parameters go in extra_body; the OpenAI client
    # rejects them as top-level keyword arguments
    extra_body={
        "top_k": 40,
        "repetition_penalty": 1.1,
        "best_of": 1,
        "use_beam_search": False,
    },
)
```
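To build intuition for `temperature`: it rescales logits before the softmax, so low values sharpen the distribution toward the top token and high values flatten it toward uniform. A toy sketch of the math (not vLLM's actual sampler):

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits / temperature, with the usual max-subtraction for stability."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.2)  # near-greedy
hot = softmax_with_temperature(logits, 2.0)   # closer to uniform
print(max(cold) > max(hot))  # True: low temperature concentrates probability
```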
GPU Quotas and Resource Management
Per-User GPU Limits
The scheduler enforces GPU quotas for vLLM services:
```yaml
# scheduler-config.yaml
scheduler:
  plugin_quota:
    enabled: true
    total_gpus: 16
    per_user_gpus: 4
    per_user_services: 2
    per_plugin_limits:
      vllm:
        max_gpus: 8
        max_services: 4
    user_overrides:
      admin:
        max_gpus: 8
        max_services: 5
        allowed_plugins: ["vllm", "jupyter"]
```
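The effect of these limits can be modeled as a simple admission check. This is an illustrative sketch of the policy, not the scheduler's actual code:

```python
def admit(user_gpus_in_use: int, requested_gpus: int,
          per_user_gpus: int,
          plugin_gpus_in_use: int, plugin_max_gpus: int) -> bool:
    """True if a new vLLM service request fits both the per-user and per-plugin caps."""
    if user_gpus_in_use + requested_gpus > per_user_gpus:
        return False  # would exceed the user's GPU quota
    if plugin_gpus_in_use + requested_gpus > plugin_max_gpus:
        return False  # would exceed the vllm plugin's GPU cap
    return True

# With per_user_gpus=4 and vllm max_gpus=8, as in the config above:
print(admit(2, 2, 4, 6, 8))  # True: exactly hits both caps
print(admit(2, 3, 4, 6, 8))  # False: would exceed the per-user limit
```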
Resource Monitoring
```bash
# Check GPU allocation for your user
./cli/zig-out/bin/ml service quota

# View current usage
./cli/zig-out/bin/ml service usage

# Monitor service resource usage
./cli/zig-out/bin/ml service stats my-llm
```
Multi-GPU and Distributed Inference
Tensor Parallelism
For large models that don't fit on a single GPU:
```bash
# 70B model across 4 GPUs
./cli/zig-out/bin/ml service start vllm \
  --name llm-70b \
  --model meta-llama/Llama-2-70b-chat-hf \
  --gpu-count 4 \
  --tensor-parallel-size 4
```
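Tensor parallelism shards the weights, so each GPU holds roughly 1/N of them. For the 70B example above, the rough per-GPU arithmetic (weights only, FP16, ignoring KV cache and activation overhead) looks like:

```python
def per_gpu_weight_gb(params: float, bytes_per_param: int, tp: int) -> float:
    """Approximate weight memory per GPU under tensor parallelism, in decimal GB."""
    return params * bytes_per_param / tp / 1e9

# 70B parameters at FP16 (2 bytes/param) sharded across 4 GPUs:
print(per_gpu_weight_gb(70e9, 2, 4))  # 35.0 -> needs 40 GB+ cards
```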
Pipeline Parallelism
For very large models with pipeline stages:
```yaml
# Pipeline parallelism config
model:
  name: "meta-llama/Llama-2-70b-chat-hf"

serving:
  tensor_parallel_size: 2
  pipeline_parallel_size: 2  # Total 4 GPUs
```
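The GPU count a configuration needs is simply the product of the two parallelism degrees, since each (tensor, pipeline) rank pair occupies one GPU. A one-line check:

```python
def world_size(tensor_parallel: int, pipeline_parallel: int) -> int:
    """Total GPUs required: one rank per (tp, pp) pair."""
    return tensor_parallel * pipeline_parallel

print(world_size(2, 2))  # 4
```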
Integration with Experiments
Using vLLM from Training Jobs
```python
# In your training script
import requests

# Call the local vLLM service
response = requests.post(
    "http://vllm-service:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Summarize this text"}]
    },
    timeout=60,  # long generations can take a while; avoid hanging forever
)
result = response.json()
summary = result["choices"][0]["message"]["content"]
```
Linking with Experiments
```bash
# Start vLLM service linked to experiment
./cli/zig-out/bin/ml service start vllm \
  --name llm-exp-1 \
  --model meta-llama/Llama-2-7b-chat-hf \
  --experiment experiment-id

# View linked services
./cli/zig-out/bin/ml service list --experiment experiment-id
```
Security and Access Control
Network Isolation
```bash
# Restrict to internal network only
./cli/zig-out/bin/ml service start vllm \
  --name internal-llm \
  --model meta-llama/Llama-2-7b-chat-hf \
  --host 10.0.0.1 \
  --port 8000
```
API Key Authentication
```yaml
# vllm-security.yaml
auth:
  api_key_required: true
  allowed_ips:
    - "10.0.0.0/8"
    - "192.168.0.0/16"

rate_limit:
  requests_per_minute: 60
  tokens_per_minute: 10000
```
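The `requests_per_minute` limit behaves like a token bucket: short bursts pass, sustained traffic above the rate is throttled. A toy model of that behavior (illustrative only, not the server's implementation):

```python
class TokenBucket:
    """Refills at rate_per_min tokens per minute, up to a burst capacity."""
    def __init__(self, rate_per_min: float, capacity: float):
        self.rate = rate_per_min / 60.0  # tokens per second
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill for elapsed time, then spend one token if available.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate_per_min=60, capacity=5)  # 60 req/min with a burst of 5
burst = [bucket.allow(0.0) for _ in range(6)]
print(burst)  # first 5 allowed, 6th rejected
```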
Audit Trail
All API calls are logged for compliance:
```bash
# View audit log
./cli/zig-out/bin/ml service audit my-llm

# Export audit report
./cli/zig-out/bin/ml service audit my-llm --export=csv

# Check access patterns
./cli/zig-out/bin/ml service audit my-llm --summary
```
Monitoring and Troubleshooting
Health Checks
```bash
# Check service health
./cli/zig-out/bin/ml service health my-llm

# Detailed diagnostics
./cli/zig-out/bin/ml service diagnose my-llm

# View service status
./cli/zig-out/bin/ml service status my-llm
```
Performance Monitoring
```bash
# Real-time metrics
./cli/zig-out/bin/ml service monitor my-llm

# Performance report
./cli/zig-out/bin/ml service report my-llm --format=html

# GPU utilization
./cli/zig-out/bin/ml service stats my-llm --gpu
```
Common Issues
Out of Memory:
```bash
# Reduce batch size
./cli/zig-out/bin/ml service update my-llm --max-num-seqs 128

# Enable quantization
./cli/zig-out/bin/ml service update my-llm --quantization awq

# Reduce GPU memory fraction
export VLLM_GPU_MEMORY_FRACTION=0.85
```
Model Download Failures:
```bash
# Set HuggingFace token
export HUGGING_FACE_HUB_TOKEN=your_token

# Use mirror
export HF_ENDPOINT=https://hf-mirror.com

# Pre-download with retry
./cli/zig-out/bin/ml service prefetch --model meta-llama/Llama-2-7b-chat-hf --retry
```
Slow Inference:
```bash
# Enable prefix caching
./cli/zig-out/bin/ml service update my-llm --enable-prefix-caching

# Increase batch size
./cli/zig-out/bin/ml service update my-llm --max-num-batched-tokens 8192

# Check GPU utilization
nvidia-smi dmon -s u
```
Best Practices
Resource Planning
- GPU Memory Calculation: Model size × precision × overhead (1.2-1.5x)
- Batch Size Tuning: Balance throughput vs. latency
- Quantization: Use AWQ/GPTQ for production, FP16 for best quality
- Prefix Caching: Enable for chat applications with repeated prompts
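The first bullet can be turned into a quick estimator. The 1.2-1.5x overhead factor is a rough allowance for KV cache, activations, and CUDA context; treat the result as a planning figure, not a guarantee:

```python
def estimate_gpu_memory_gb(params_billion: float, bytes_per_param: float,
                           overhead: float = 1.3) -> float:
    """Rough GPU memory need: model size x precision x overhead."""
    return params_billion * bytes_per_param * overhead

# 7B model at FP16 (2 bytes/param) with 1.3x overhead:
print(round(estimate_gpu_memory_gb(7, 2.0), 1))  # 18.2 -> fits a 24 GB card
```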
Production Deployment
- Load Balancing: Deploy multiple vLLM instances behind a load balancer
- Health Checks: Configure Kubernetes liveness/readiness probes
- Autoscaling: Scale based on queue depth or GPU utilization
- Monitoring: Track tokens/sec, queue depth, and error rates
Security
- Network Segmentation: Isolate vLLM on internal network
- Rate Limiting: Prevent abuse with per-user quotas
- Input Validation: Sanitize prompts to prevent injection attacks
- Audit Logging: Enable comprehensive audit trails
CLI Reference
Service Commands
```bash
# Start a service
ml service start vllm [flags]
  --name string                Service name (required)
  --model string               Model name or path (default: "meta-llama/Llama-2-7b-chat-hf")
  --gpu-count int              Number of GPUs (default: 1)
  --quantization string        Quantization method (awq, gptq, fp8, squeezellm)
  --port int                   Service port (default: 8000)
  --max-model-len int          Maximum sequence length
  --tensor-parallel-size int   Tensor parallelism degree

# List services
ml service list [flags]
  --format string              Output format (table, json)
  --all                        Show all users' services (admin only)

# Service operations
ml service stop <name>
ml service start <name>        # Restart a stopped service
ml service restart <name>
ml service remove <name>
ml service logs <name> [flags]
  --follow                     Follow log output
  --tail int                   Number of lines to show (default: 100)
ml service info <name>
ml service health <name>
```
See Also
- Testing Guide - Testing vLLM services
- Deployment Guide - Production deployment
- Security Guide - Security best practices
- Scheduler Architecture - How vLLM integrates with scheduler
- CLI Reference - Command-line tools
- Jupyter Workflow - Jupyter integration with vLLM