docs: add vLLM workflow and cross-link documentation

- Add new vLLM workflow documentation (vllm-workflow.md)
- Update scheduler-architecture.md with Plugin GPU Quota and audit logging
- Add See Also sections to jupyter-workflow.md, quick-start.md,
  configuration-reference.md for better navigation
- Update landing page and index with vLLM and scheduler links
- Cross-link all documentation for improved discoverability
This commit is contained in:
Jeremie Fraeys 2026-02-26 13:04:39 -05:00
parent 8f2495deb0
commit 90ea18555c
9 changed files with 999 additions and 29 deletions


@@ -40,12 +40,14 @@ make test-unit
- [Environment Variables](environment-variables.md) - Configuration options
- [Smart Defaults](smart-defaults.md) - Default configuration settings
### Development
- [Architecture](architecture.md) - System architecture and design
- [CLI Reference](cli-reference.md) - Command-line interface documentation
- [Testing Guide](testing.md) - Testing procedures and guidelines
- [Jupyter Workflow](jupyter-workflow.md) - CLI and Jupyter integration
- [Queue System](queue.md) - Job queue implementation
### 🛠️ Development
- **[Architecture](architecture.md)** - System architecture and design
- **[Scheduler Architecture](scheduler-architecture.md)** - Job scheduler and service management
- **[CLI Reference](cli-reference.md)** - Command-line interface documentation
- **[Testing Guide](testing.md)** - Testing procedures and guidelines
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter notebook services
- **[vLLM Workflow](vllm-workflow.md)** - LLM inference services
- **[Queue System](queue.md)** - Job queue implementation
### Production Deployment
- [Deployment Guide](deployment.md) - Production deployment instructions


@@ -244,6 +244,72 @@ Plugins can be configured via worker configuration under `plugins`, including:
- `mode`
- per-plugin paths/settings (e.g., artifact base path, log base path)
## Plugin GPU Quota System
The scheduler includes a GPU quota management system for plugin-based services (Jupyter, vLLM, etc.) that controls resource allocation across users and plugins.
### Quota Enforcement
The quota system enforces limits at multiple levels:
1. **Global GPU Limit**: Total GPUs available across all plugins
2. **Per-User GPU Limit**: Maximum GPUs a single user can allocate
3. **Per-User Service Limit**: Maximum number of service instances per user
4. **Plugin-Specific Limits**: Separate limits for each plugin type
5. **User Overrides**: Custom limits for specific users with allowed plugin restrictions
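The five levels above compose into a single admission check. The following Python sketch is illustrative only — field names mirror the configuration keys, but the real manager is the Go `PluginQuotaManager`:

```python
class QuotaExceeded(Exception):
    pass

class QuotaSketch:
    """Illustrative in-memory tracker, not the real Go implementation."""
    def __init__(self, total_gpus=0, per_user_gpus=0, per_user_services=0,
                 per_plugin_gpus=None, overrides=None):
        # 0 means unlimited, matching the configuration defaults
        self.cfg = dict(total=total_gpus, user_gpus=per_user_gpus,
                        user_svcs=per_user_services)
        self.per_plugin_gpus = per_plugin_gpus or {}
        self.overrides = overrides or {}
        self.total = 0
        self.by_user = {}    # user_id -> (gpus, services)
        self.by_plugin = {}  # plugin  -> gpus

    def check_and_record(self, user, plugin, gpus):
        ov = self.overrides.get(user, {})
        # 5. User overrides: allowed-plugin restriction
        allowed = ov.get("allowed_plugins") or []
        if allowed and plugin not in allowed:
            raise QuotaExceeded(f"plugin {plugin!r} not allowed for {user}")
        u_gpus, u_svcs = self.by_user.get(user, (0, 0))
        limit_user = ov.get("max_gpus", self.cfg["user_gpus"])
        limit_svcs = ov.get("max_services", self.cfg["user_svcs"])
        # 1. Global GPU limit
        if self.cfg["total"] and self.total + gpus > self.cfg["total"]:
            raise QuotaExceeded("global GPU limit")
        # 2. Per-user GPU limit (override takes precedence over default)
        if limit_user and u_gpus + gpus > limit_user:
            raise QuotaExceeded("per-user GPU limit")
        # 3. Per-user service count limit
        if limit_svcs and u_svcs + 1 > limit_svcs:
            raise QuotaExceeded("per-user service limit")
        # 4. Plugin-specific GPU limit
        plim = self.per_plugin_gpus.get(plugin, 0)
        if plim and self.by_plugin.get(plugin, 0) + gpus > plim:
            raise QuotaExceeded("per-plugin GPU limit")
        # All checks passed: record usage
        self.total += gpus
        self.by_user[user] = (u_gpus + gpus, u_svcs + 1)
        self.by_plugin[plugin] = self.by_plugin.get(plugin, 0) + gpus
```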
### Architecture
```mermaid
graph TB
    subgraph "Plugin Quota System"
        Submit[Job Submission] --> CheckQuota{Check Quota}
        CheckQuota -->|Within Limits| Accept[Accept Job]
        CheckQuota -->|Exceeded| Reject[Reject with Error]
        Accept --> RecordUsage[Record Usage]
        RecordUsage --> Assign[Assign to Worker]
        Complete[Job Complete] --> ReleaseUsage[Release Usage]
        subgraph "Quota Manager"
            Global[Global GPU Counter]
            PerUser[Per-User Tracking]
            PerPlugin[Per-Plugin Tracking]
            Overrides[User Overrides]
        end
        CheckQuota --> Global
        CheckQuota --> PerUser
        CheckQuota --> PerPlugin
        CheckQuota --> Overrides
    end
```
### Components
- **PluginQuotaConfig**: Configuration for all quota limits and overrides
- **PluginQuotaManager**: Thread-safe manager for tracking and enforcing quotas
- **Integration Points**:
- `SubmitJob()`: Validates quotas before accepting service jobs
- `handleJobAccepted()`: Records usage when jobs are assigned
- `handleJobResult()`: Releases usage when jobs complete
### Usage
Jobs must include `user_id` and `plugin_name` metadata for quota tracking:
```go
spec := scheduler.JobSpec{
    Type:     scheduler.JobTypeService,
    UserID:   "user123",
    GPUCount: 2,
    Metadata: map[string]string{
        "plugin_name": "jupyter",
    },
}
```
## Zig CLI Architecture
### Component Structure
@@ -865,3 +931,13 @@ graph TB
---
This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.
## See Also
- **[Scheduler Architecture](scheduler-architecture.md)** - Detailed scheduler design and protocols
- **[Security Guide](security.md)** - Security architecture and best practices
- **[Configuration Reference](configuration-reference.md)** - Configuration options and environment variables
- **[Deployment Guide](deployment.md)** - Production deployment architecture
- **[Performance & Monitoring](performance-monitoring.md)** - Metrics and observability
- **[Research Runner Plan](research-runner-plan.md)** - Roadmap and implementation phases
- **[Native Libraries](native-libraries.md)** - C++ performance optimizations


@@ -664,6 +664,45 @@ api_key = "<analyst-api-key>"
| `resources.podman_cpus` | string | "2" | CPU limit for Podman containers |
| `resources.podman_memory` | string | "4Gi" | Memory limit for Podman containers |
### Plugin GPU Quotas
Control GPU allocation for plugin-based services (Jupyter, vLLM, etc.).
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `scheduler.plugin_quota.enabled` | bool | false | Enable plugin GPU quota enforcement |
| `scheduler.plugin_quota.total_gpus` | int | 0 | Global GPU limit across all plugins (0 = unlimited) |
| `scheduler.plugin_quota.per_user_gpus` | int | 0 | Default per-user GPU limit (0 = unlimited) |
| `scheduler.plugin_quota.per_user_services` | int | 0 | Default per-user service count limit (0 = unlimited) |
| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_gpus` | int | 0 | Plugin-specific GPU limit |
| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_services` | int | 0 | Plugin-specific service count limit |
| `scheduler.plugin_quota.user_overrides.{user}.max_gpus` | int | 0 | Per-user GPU override |
| `scheduler.plugin_quota.user_overrides.{user}.max_services` | int | 0 | Per-user service limit override |
| `scheduler.plugin_quota.user_overrides.{user}.allowed_plugins` | array | [] | Plugins user is allowed to use (empty = all) |
**Example configuration:**
```yaml
scheduler:
  plugin_quota:
    enabled: true
    total_gpus: 16
    per_user_gpus: 4
    per_user_services: 2
    per_plugin_limits:
      vllm:
        max_gpus: 8
        max_services: 4
      jupyter:
        max_gpus: 4
        max_services: 10
    user_overrides:
      admin:
        max_gpus: 8
        max_services: 5
        allowed_plugins: ["jupyter", "vllm"]
```
### Redis
| Option | Type | Default | Description |
@@ -746,4 +785,16 @@ go run cmd/api-server/main.go --config configs/api/dev.yaml --validate
# Test CLI configuration
./cli/zig-out/bin/ml status --debug
```
---
## See Also
- **[Architecture](architecture.md)** - System architecture overview
- **[Scheduler Architecture](scheduler-architecture.md)** - Scheduler configuration details
- **[Environment Variables](environment-variables.md)** - Additional environment variable documentation
- **[Security Guide](security.md)** - Security-related configuration
- **[Deployment Guide](deployment.md)** - Production configuration guidance
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter service configuration
- **[vLLM Workflow](vllm-workflow.md)** - vLLM service configuration


@@ -620,8 +620,10 @@ Common error codes in binary responses:
## See Also
- **[vLLM Workflow](vllm-workflow.md)** - LLM inference services (complementary to Jupyter)
- **[Scheduler Architecture](scheduler-architecture.md)** - How Jupyter services are scheduled
- **[Configuration Reference](configuration-reference.md)** - Service configuration options
- **[Testing Guide](testing.md)** - Testing Jupyter workflows
- **[Deployment Guide](deployment.md)** - Production deployment
- **[Security Guide](security.md)** - Security best practices
- **[API Reference](api-key-process.md)** - API documentation
- **[CLI Reference](cli-reference.md)** - Command-line tools


@@ -43,9 +43,11 @@ make test-unit
### 🛠️ Development
- [**Architecture**](architecture.md) - System architecture and design
- [**Scheduler Architecture**](scheduler-architecture.md) - Job scheduler and service management
- [**CLI Reference**](cli-reference.md) - Command-line interface documentation
- [**Testing Guide**](testing.md) - Testing procedures and guidelines
- [**Jupyter Workflow**](jupyter-workflow.md) - CLI and Jupyter integration
- [**Jupyter Workflow**](jupyter-workflow.md) - Jupyter notebook services
- [**vLLM Workflow**](vllm-workflow.md) - LLM inference services
- [**Queue System**](queue.md) - Job queue implementation
### 🏭 Production Deployment


@@ -329,4 +329,13 @@ make help # Show all available commands
---
*Ready in minutes!*
## See Also
- **[Architecture](architecture.md)** - System architecture overview
- **[Scheduler Architecture](scheduler-architecture.md)** - Job scheduling and service management
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter notebook services
- **[vLLM Workflow](vllm-workflow.md)** - LLM inference services
- **[Configuration Reference](configuration-reference.md)** - Configuration options
- **[Security Guide](security.md)** - Security best practices


@@ -169,6 +169,52 @@ The scheduler persists state for crash recovery:
State is replayed on startup via `StateStore.Replay()`.
## Service Templates
The scheduler provides built-in service templates for common ML workloads:
### Available Templates
| Template | Description | Default Port Range |
|----------|-------------|-------------------|
| **JupyterLab** | Interactive Jupyter environment | 8000-9000 |
| **Jupyter Notebook** | Classic Jupyter notebooks | 8000-9000 |
| **vLLM** | OpenAI-compatible LLM inference server | 8000-9000 |
### Port Allocation
Dynamic port management for service instances:
```go
type PortAllocator struct {
    startPort int               // Default: 8000
    endPort   int               // Default: 9000
    allocated map[int]time.Time // Port -> allocation time
}
```
**Features:**
- Automatic port selection from configured range
- TTL-based port reclamation
- Thread-safe concurrent allocations
- Exhaustion handling with clear error messages
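The features above can be sketched in a minimal Python analogue of the allocator; the actual implementation is the Go struct shown, so the TTL value and method names here are assumptions:

```python
import time

class PortAllocatorSketch:
    """Illustrative analogue of the Go PortAllocator above."""
    def __init__(self, start=8000, end=9000, ttl_secs=3600):
        self.start, self.end, self.ttl = start, end, ttl_secs
        self.allocated = {}  # port -> allocation time

    def allocate(self, now=None):
        now = time.time() if now is None else now
        # TTL-based reclamation: drop allocations older than the TTL
        self.allocated = {p: t for p, t in self.allocated.items()
                          if now - t < self.ttl}
        # Automatic selection of the first free port in range
        for port in range(self.start, self.end + 1):
            if port not in self.allocated:
                self.allocated[port] = now
                return port
        # Exhaustion handling with a clear error message
        raise RuntimeError(f"port range {self.start}-{self.end} exhausted")

    def release(self, port):
        self.allocated.pop(port, None)
```

Thread safety is omitted here; the Go version guards the map with a mutex.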
### Template Variables
Service templates support dynamic variable substitution:
| Variable | Description | Example |
|----------|-------------|---------|
| `{{SERVICE_PORT}}` | Allocated port for the service | `8080` |
| `{{WORKER_ID}}` | ID of the assigned worker | `worker-1` |
| `{{TASK_ID}}` | Unique task identifier | `task-abc123` |
| `{{SECRET:xxx}}` | Secret reference from keychain | `api-key-value` |
| `{{MODEL_NAME}}` | ML model name (vLLM) | `llama-2-7b` |
| `{{GPU_COUNT}}` | Number of GPUs allocated | `2` |
| `{{GPU_DEVICES}}` | Specific GPU device IDs | `0,1` |
| `{{MODEL_CACHE}}` | Path to model cache directory | `/models` |
| `{{WORKSPACE}}` | Working directory path | `/workspace` |
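Substitution can be sketched as a small renderer. The `render_template` helper, its regex, and the sample `vllm serve` command line below are illustrative — the real substitution happens inside the Go scheduler:

```python
import re

def render_template(template, variables, secrets):
    """Hypothetical renderer for the {{VAR}} / {{SECRET:name}} syntax."""
    def sub(match):
        key = match.group(1)
        if key.startswith("SECRET:"):
            # Secret references are resolved from the keychain store
            return secrets[key[len("SECRET:"):]]
        return str(variables[key])
    return re.sub(r"\{\{([A-Za-z0-9_:-]+)\}\}", sub, template)

cmd = render_template(
    "vllm serve {{MODEL_NAME}} --port {{SERVICE_PORT}} "
    "--api-key {{SECRET:vllm-key}}",
    {"MODEL_NAME": "llama-2-7b", "SERVICE_PORT": 8080},
    {"vllm-key": "api-key-value"},
)
```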
## API Methods
```go
@@ -188,7 +234,52 @@ func (h *SchedulerHub) Start() error
func (h *SchedulerHub) Stop()
```
## Configuration
## Audit Integration
The scheduler integrates with the audit logging system for security and compliance:
### Audit Logger Integration
```go
type SchedulerHub struct {
    // ... other fields ...
    auditor *audit.Logger // Security audit logger
}
```
**Initialization:**
```go
auditor := audit.NewLogger(audit.Config{
    LogPath: "/var/log/fetch_ml/scheduler_audit.log",
    Enabled: true,
})
hub, err := scheduler.NewHub(config, auditor)
```
### Audit Events
The scheduler logs the following audit events:
| Event | Description | Fields Logged |
|-------|-------------|---------------|
| `job_submitted` | New job queued | job_id, user_id, job_type, gpu_count |
| `job_assigned` | Job assigned to worker | job_id, worker_id, assignment_time |
| `job_accepted` | Worker accepted job | job_id, worker_id, acceptance_time |
| `job_completed` | Job finished successfully | job_id, worker_id, duration |
| `job_failed` | Job failed | job_id, worker_id, error_code |
| `job_cancelled` | Job cancelled | job_id, cancelled_by, reason |
| `worker_registered` | Worker connected | worker_id, capabilities, timestamp |
| `worker_disconnected` | Worker disconnected | worker_id, duration_connected |
| `quota_exceeded` | GPU quota violation | user_id, plugin_name, requested, limit |
### Tamper-Evident Logging
Audit logs use chain hashing for integrity:
- Each event includes SHA-256 hash of previous event
- Chain verification detects log tampering
- Separate log file from operational logs
### Configuration
```go
type HubConfig struct {
@@ -203,6 +294,7 @@ type HubConfig struct {
    GangAllocTimeoutSecs  int               // Multi-node allocation timeout
    AcceptanceTimeoutSecs int               // Job acceptance timeout
    WorkerTokens          map[string]string // Authentication tokens
    PluginQuota           PluginQuotaConfig // Plugin GPU quota configuration
}
```
@@ -214,6 +306,11 @@ Process management is abstracted for Unix/Windows:
## See Also
- `internal/scheduler/hub.go` - Core implementation
- `tests/fixtures/scheduler_fixture.go` - Test infrastructure
- `docs/src/native-libraries.md` - Native C++ performance libraries
- **[Architecture Overview](architecture.md)** - High-level system architecture
- **[Security Guide](security.md)** - Audit logging and security features
- **[Configuration Reference](configuration-reference.md)** - Plugin GPU quotas and scheduler config
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter service integration with scheduler
- **[vLLM Workflow](vllm-workflow.md)** - vLLM service integration with scheduler
- **[Testing Guide](testing.md)** - Testing the scheduler
- **`internal/scheduler/hub.go`** - Core implementation
- **`tests/fixtures/scheduler_fixture.go`** - Test infrastructure


@@ -112,27 +112,164 @@ The system detects and rejects plaintext secrets using:
### HIPAA-Compliant Audit Logging
**Tamper-Evident Logging:**
FetchML implements comprehensive HIPAA-compliant audit logging with tamper-evident chain hashing for healthcare and regulated environments.
**Architecture:**
```go
// Each event includes chain hash for integrity
audit.Log(audit.Event{
// Audit logger initialization
auditor := audit.NewLogger(audit.Config{
Enabled: true,
LogPath: "/var/log/fetch_ml/audit.log",
})
// Logging an event
auditor.Log(audit.Event{
EventType: audit.EventFileRead,
UserID: "user1",
Resource: "/data/file.txt",
UserID: "user123",
Resource: "/data/patient_records/file.txt",
IPAddress: "10.0.0.5",
Success: true,
Metadata: map[string]any{
"file_size": 1024,
"checksum": "abc123...",
},
})
```
**Event Types:**
- `file_read` - File access logged
- `file_write` - File modification logged
- `file_delete` - File deletion logged
- `auth_success` / `auth_failure` - Authentication events
- `job_queued` / `job_started` / `job_completed` - Job lifecycle
**Chain Hashing:**
- Each event includes SHA-256 hash of previous event
**Tamper-Evident Chain Hashing:**
- Each event includes SHA-256 hash of the previous event (PrevHash)
- Event hash covers all fields including PrevHash (chaining)
- Modification of any log entry breaks the chain
- `VerifyChain()` function detects tampering
- Separate `VerifyChain()` function detects tampering
- Monotonic sequence numbers prevent deletion attacks
```go
// Verify audit chain integrity
valid, err := audit.VerifyChain("/var/log/fetch_ml/audit.log")
if err != nil || !valid {
    log.Fatal("AUDIT TAMPERING DETECTED")
}
```
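The chain construction itself can be sketched in a few lines of Python. Field names follow the log format used in this guide, but this is an illustrative model of the technique, not the FetchML implementation:

```python
import hashlib
import json

def append_event(log, event):
    """Append an event, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["event_hash"] if log else "0" * 64
    # Monotonic sequence numbers make silent deletion detectable
    entry = dict(event, sequence_num=len(log), prev_hash=prev_hash)
    payload = json.dumps(entry, sort_keys=True).encode()
    # The event hash covers all fields including prev_hash (chaining)
    entry["event_hash"] = hashlib.sha256(payload).hexdigest()
    log.append(entry)

def verify_chain(log):
    """Recompute every hash; any modified entry breaks the chain."""
    prev_hash = "0" * 64
    for i, entry in enumerate(log):
        if entry["sequence_num"] != i or entry["prev_hash"] != prev_hash:
            return False
        body = {k: v for k, v in entry.items() if k != "event_hash"}
        payload = json.dumps(body, sort_keys=True).encode()
        if hashlib.sha256(payload).hexdigest() != entry["event_hash"]:
            return False
        prev_hash = entry["event_hash"]
    return True
```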
**HIPAA-Specific Event Types:**
| Event Type | HIPAA Relevance | Fields Logged |
|------------|-----------------|---------------|
| `file_read` | Access to PHI | user_id, file_path, ip_address, timestamp, checksum |
| `file_write` | Modification of PHI | user_id, file_path, bytes_written, prev_checksum, new_checksum |
| `file_delete` | Deletion of PHI | user_id, file_path, deletion_type (soft/hard) |
| `dataset_access` | Bulk data access | user_id, dataset_id, record_count, access_purpose |
| `authentication_success` | Access control | user_id, auth_method, ip_address, mfa_used |
| `authentication_failure` | Failed access attempts | attempted_user, ip_address, failure_reason, attempt_count |
| `job_queued` | Processing PHI | user_id, job_id, input_data_classification |
| `job_started` | PHI processing begun | job_id, worker_id, data_accessed |
| `job_completed` | PHI processing complete | job_id, output_location, data_disposition |
**Standard Event Types:**
| Event Type | Description | Use Case |
|------------|-------------|----------|
| `authentication_attempt` | Login attempt (pre-validation) | Brute force detection |
| `authentication_success` | Successful login | Access tracking |
| `authentication_failure` | Failed login | Security monitoring |
| `job_queued` | Job submitted to queue | Workflow tracking |
| `job_started` | Job execution begun | Performance monitoring |
| `job_completed` | Job finished successfully | Completion tracking |
| `job_failed` | Job execution failed | Error tracking |
| `jupyter_start` | Jupyter service started | Resource tracking |
| `jupyter_stop` | Jupyter service stopped | Session tracking |
| `experiment_created` | Experiment initialized | Provenance tracking |
| `experiment_deleted` | Experiment removed | Data lifecycle |
**Scheduler Audit Integration:**
The scheduler automatically logs these events:
- `job_submitted` - Job queued (includes user_id, job_type, gpu_count)
- `job_assigned` - Job assigned to worker (worker_id, assignment_time)
- `job_accepted` - Worker confirmed job execution
- `job_completed` / `job_failed` / `job_cancelled` - Job terminal states
- `worker_registered` - Worker connected to scheduler
- `worker_disconnected` - Worker disconnected
- `quota_exceeded` - GPU quota violation attempt
**Audit Log Format:**
```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "event_type": "file_read",
  "user_id": "researcher1",
  "ip_address": "10.0.0.5",
  "resource": "/data/experiments/run_001/results.csv",
  "action": "read",
  "success": true,
  "sequence_num": 15423,
  "prev_hash": "a1b2c3d4...",
  "event_hash": "e5f6a7b8...",
  "metadata": {
    "file_size": 1048576,
    "checksum": "sha256:abc123...",
    "access_duration_ms": 150
  }
}
```
**Log Storage and Rotation:**
- Default location: `/var/log/fetch_ml/audit.log`
- Automatic rotation by size (100MB) or time (daily)
- Retention policy: Configurable (default: 7 years for HIPAA)
- Immutable storage: Append-only with filesystem-level protection
**Compliance Features:**
- **User Identification**: Every event includes `user_id` for accountability
- **Timestamp Precision**: RFC3339 nanosecond precision timestamps
- **IP Address Tracking**: Source IP for all network events
- **Success/Failure Tracking**: Boolean success field for all operations
- **Metadata Flexibility**: Extensible key-value metadata for domain-specific data
- **Immutable Logging**: Append-only files with filesystem protections
- **Chain Verification**: Cryptographic proof of log integrity
- **Sealed Logs**: Optional GPG signing for regulatory submissions
**Audit Log Analysis:**
```bash
# View recent audit events
tail -f /var/log/fetch_ml/audit.log | jq '.'
# Search for specific user activity
grep '"user_id":"researcher1"' /var/log/fetch_ml/audit.log | jq '.'
# Find all file access events
jq 'select(.event_type == "file_read")' /var/log/fetch_ml/audit.log
# Detect failed authentication attempts
jq 'select(.event_type == "authentication_failure")' /var/log/fetch_ml/audit.log
# Verify audit chain integrity
./cli/zig-out/bin/ml audit verify /var/log/fetch_ml/audit.log
# Export audit report for compliance
./cli/zig-out/bin/ml audit export --start 2024-01-01 --end 2024-01-31 --format csv
```
**Regulatory Compliance:**
| Regulation | Requirement | FetchML Implementation |
|------------|-------------|------------------------|
| **HIPAA** | Access logging, tamper evidence | Chain hashing, file access events, user tracking |
| **GDPR** | Data subject access, right to deletion | Full audit trail, deletion events with chain preservation |
| **SOX** | Financial controls, audit trail | Immutable logs, separation of duties via RBAC |
| **21 CFR Part 11** | Electronic records integrity | Tamper-evident logging, user authentication, timestamps |
| **PCI DSS** | Access logging, data protection | Audit trails, encryption, access controls |
**Best Practices:**
1. **Enable Audit Logging**: Always enable in production
2. **Separate Storage**: Store audit logs on separate volume from application data
3. **Regular Verification**: Run chain verification daily
4. **Backup Strategy**: Include audit logs in backup procedures
5. **Access Control**: Restrict audit log access to security personnel only
6. **Monitoring**: Set up alerts for suspicious patterns (multiple failed logins, after-hours access)
---
@@ -420,3 +557,16 @@ All API access is logged with:
- **Security Issues**: Report privately via email
- **Questions**: See documentation or create issue
- **Updates**: Monitor releases for security patches
---
## See Also
- **[Privacy & Security](privacy-security.md)** - PII detection and privacy controls
- **[Multi-Tenant Security](multi-tenant-security.md)** - Tenant isolation and cross-tenant access prevention
- **[API Key Process](api-key-process.md)** - Generate and manage API keys
- **[User Permissions](user-permissions.md)** - Role-based access control
- **[Runtime Security](runtime-security.md)** - Container sandboxing and seccomp profiles
- **[Scheduler Architecture](scheduler-architecture.md)** - Audit integration in the scheduler
- **[Configuration Reference](configuration-reference.md)** - Security-related configuration options
- **[Deployment Guide](deployment.md)** - Production security hardening

docs/src/vllm-workflow.md (new file, 581 lines)

@@ -0,0 +1,581 @@
# vLLM Inference Service Guide
Comprehensive guide to deploying and managing OpenAI-compatible LLM inference services using vLLM in FetchML.
## Overview
The vLLM plugin provides high-performance LLM inference with:
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI's API
- **Advanced Scheduling**: Continuous batching for throughput optimization
- **GPU Optimization**: Tensor parallelism and quantization support
- **Model Management**: Automatic model downloading and caching
- **Quantization**: AWQ, GPTQ, FP8, and SqueezeLLM support
## Quick Start
### Start vLLM Service
```bash
# Start development stack
make dev-up
# Start vLLM service with default model
./cli/zig-out/bin/ml service start vllm --name llm-server --model meta-llama/Llama-2-7b-chat-hf
# Or with specific GPU requirements
./cli/zig-out/bin/ml service start vllm \
--name llm-server \
--model meta-llama/Llama-2-7b-chat-hf \
--gpu-count 1 \
--quantization awq
# Access the API
open http://localhost:8000/docs
```
### Using the API
```python
import openai
# Point to local vLLM instance
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)
# Chat completion
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ]
)
print(response.choices[0].message.content)
```
## Service Management
### Creating vLLM Services
```bash
# Create basic vLLM service
./cli/zig-out/bin/ml service start vllm --name my-llm
# Create with specific model
./cli/zig-out/bin/ml service start vllm \
--name my-llm \
--model microsoft/DialoGPT-medium
# Create with resource constraints
./cli/zig-out/bin/ml service start vllm \
--name production-llm \
--model meta-llama/Llama-2-13b-chat-hf \
--gpu-count 2 \
--quantization gptq \
--max-model-len 4096
# List all vLLM services
./cli/zig-out/bin/ml service list
# Service details
./cli/zig-out/bin/ml service info my-llm
```
### Service Configuration
**Resource Allocation:**
```yaml
# vllm-config.yaml
resources:
  gpu_count: 1
  gpu_memory: 24gb
  cpu: 4
  memory: 16g
model:
  name: "meta-llama/Llama-2-7b-chat-hf"
  quantization: "awq"  # Options: awq, gptq, squeezellm, fp8
  trust_remote_code: false
  max_model_len: 4096
serving:
  port: 8000
  host: "0.0.0.0"
  tensor_parallel_size: 1
  dtype: "auto"  # auto, half, bfloat16, float
optimization:
  enable_prefix_caching: true
  swap_space: 4  # GB
  max_num_batched_tokens: 4096
  max_num_seqs: 256
```
**Environment Variables:**
```bash
# Model cache location
export VLLM_MODEL_CACHE=/models
# HuggingFace token for gated models
export HUGGING_FACE_HUB_TOKEN=your_token_here
# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1
```
### Service Lifecycle
```bash
# Start a service
./cli/zig-out/bin/ml service start vllm --name my-llm
# Stop a service (graceful shutdown)
./cli/zig-out/bin/ml service stop my-llm
# Restart a service
./cli/zig-out/bin/ml service restart my-llm
# Remove a service (stops and deletes)
./cli/zig-out/bin/ml service remove my-llm
# View service logs
./cli/zig-out/bin/ml service logs my-llm --follow
# Check service health
./cli/zig-out/bin/ml service health my-llm
```
## Model Management
### Supported Models
vLLM supports most HuggingFace Transformers models:
- **Llama 2/3**: `meta-llama/Llama-2-7b-chat-hf`, `meta-llama/Llama-2-70b-chat-hf`
- **Mistral**: `mistralai/Mistral-7B-Instruct-v0.2`
- **Mixtral**: `mistralai/Mixtral-8x7B-Instruct-v0.1`
- **Falcon**: `tiiuae/falcon-7b-instruct`
- **CodeLlama**: `codellama/CodeLlama-7b-hf`
- **Phi**: `microsoft/phi-2`
- **Qwen**: `Qwen/Qwen-7B-Chat`
- **Gemma**: `google/gemma-7b-it`
### Model Caching
Models are automatically cached to avoid repeated downloads:
```bash
# Default cache location
~/.cache/huggingface/hub/
# Custom cache location
export VLLM_MODEL_CACHE=/mnt/fast-storage/models
# Pre-download models
./cli/zig-out/bin/ml service prefetch --model meta-llama/Llama-2-7b-chat-hf
```
### Quantization
Quantization reduces memory usage and improves inference speed:
```bash
# AWQ (4-bit quantization)
./cli/zig-out/bin/ml service start vllm \
--name llm-awq \
--model TheBloke/Llama-2-7B-AWQ \
--quantization awq
# GPTQ (4-bit quantization)
./cli/zig-out/bin/ml service start vllm \
--name llm-gptq \
--model TheBloke/Llama-2-7B-GPTQ \
--quantization gptq
# FP8 (8-bit floating point)
./cli/zig-out/bin/ml service start vllm \
--name llm-fp8 \
--model meta-llama/Llama-2-7b-chat-hf \
--quantization fp8
```
**Quantization Comparison:**
| Method | Bits | Memory Reduction | Speed Impact | Quality |
|--------|------|------------------|--------------|---------|
| None (FP16) | 16 | 1x | Baseline | Best |
| FP8 | 8 | 2x | Faster | Excellent |
| AWQ | 4 | 4x | Fast | Very Good |
| GPTQ | 4 | 4x | Fast | Very Good |
| SqueezeLLM | 4 | 4x | Fast | Good |
## API Reference
### OpenAI-Compatible Endpoints
vLLM provides OpenAI-compatible REST API endpoints:
**Chat Completions:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{"role": "user", "content": "Hello!"}
],
"max_tokens": 100,
"temperature": 0.7
}'
```
**Completions (Legacy):**
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"prompt": "The capital of France is",
"max_tokens": 10
}'
```
**Embeddings:**
```bash
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"input": "Hello world"
}'
```
**List Models:**
```bash
curl http://localhost:8000/v1/models
```
### Streaming Responses
Enable streaming for real-time token generation:
```python
import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True,
    max_tokens=200
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
### Advanced Parameters
```python
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=messages,
    # Standard OpenAI generation parameters
    max_tokens=500,
    temperature=0.7,
    top_p=0.9,
    frequency_penalty=0.5,
    presence_penalty=0.5,
    # Sampling
    seed=42,
    stop=["END", "STOP"],
    # vLLM-specific options go through extra_body, since the OpenAI
    # client rejects unknown keyword arguments
    extra_body={
        "top_k": 40,
        "repetition_penalty": 1.1,
        "best_of": 1,
        "use_beam_search": False,
    },
)
```
## GPU Quotas and Resource Management
### Per-User GPU Limits
The scheduler enforces GPU quotas for vLLM services:
```yaml
# scheduler-config.yaml
scheduler:
  plugin_quota:
    enabled: true
    total_gpus: 16
    per_user_gpus: 4
    per_user_services: 2
    per_plugin_limits:
      vllm:
        max_gpus: 8
        max_services: 4
    user_overrides:
      admin:
        max_gpus: 8
        max_services: 5
        allowed_plugins: ["vllm", "jupyter"]
```
### Resource Monitoring
```bash
# Check GPU allocation for your user
./cli/zig-out/bin/ml service quota
# View current usage
./cli/zig-out/bin/ml service usage
# Monitor service resource usage
./cli/zig-out/bin/ml service stats my-llm
```
## Multi-GPU and Distributed Inference
### Tensor Parallelism
For large models that don't fit on a single GPU:
```bash
# 70B model across 4 GPUs
./cli/zig-out/bin/ml service start vllm \
--name llm-70b \
--model meta-llama/Llama-2-70b-chat-hf \
--gpu-count 4 \
--tensor-parallel-size 4
```
### Pipeline Parallelism
For very large models with pipeline stages:
```yaml
# Pipeline parallelism config
model:
  name: "meta-llama/Llama-2-70b-chat-hf"
serving:
  tensor_parallel_size: 2
  pipeline_parallel_size: 2  # Total 4 GPUs
```
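As a sizing sanity check: the GPU count a service needs is the product of the two parallelism degrees.

```python
def total_gpus(tensor_parallel_size, pipeline_parallel_size=1):
    """GPUs required = tensor-parallel degree x pipeline-parallel degree."""
    return tensor_parallel_size * pipeline_parallel_size

# The config above: tensor_parallel_size=2, pipeline_parallel_size=2
needed = total_gpus(2, 2)  # 4 GPUs total
```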
## Integration with Experiments
### Using vLLM from Training Jobs
```python
# In your training script
import requests
# Call local vLLM service
response = requests.post(
    "http://vllm-service:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Summarize this text"}]
    }
)
result = response.json()
summary = result["choices"][0]["message"]["content"]
```
### Linking with Experiments
```bash
# Start vLLM service linked to experiment
./cli/zig-out/bin/ml service start vllm \
--name llm-exp-1 \
--model meta-llama/Llama-2-7b-chat-hf \
--experiment experiment-id
# View linked services
./cli/zig-out/bin/ml service list --experiment experiment-id
```
## Security and Access Control
### Network Isolation
```bash
# Restrict to internal network only
./cli/zig-out/bin/ml service start vllm \
--name internal-llm \
--model meta-llama/Llama-2-7b-chat-hf \
--host 10.0.0.1 \
--port 8000
```
### API Key Authentication
```yaml
# vllm-security.yaml
auth:
  api_key_required: true
  allowed_ips:
    - "10.0.0.0/8"
    - "192.168.0.0/16"
rate_limit:
  requests_per_minute: 60
  tokens_per_minute: 10000
```
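A `requests_per_minute` limit like the one above is typically enforced with a token bucket. This is a minimal sketch of that technique, not FetchML's actual limiter:

```python
import time

class TokenBucket:
    """Refills at per_minute/60 tokens per second, up to per_minute."""
    def __init__(self, per_minute):
        self.capacity = per_minute
        self.tokens = float(per_minute)
        self.rate = per_minute / 60.0  # refill rate per second
        self.last = time.monotonic()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill based on elapsed time, capped at bucket capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # request should be rejected with 429
```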
### Audit Trail
All API calls are logged for compliance:
```bash
# View audit log
./cli/zig-out/bin/ml service audit my-llm
# Export audit report
./cli/zig-out/bin/ml service audit my-llm --export=csv
# Check access patterns
./cli/zig-out/bin/ml service audit my-llm --summary
```
## Monitoring and Troubleshooting
### Health Checks
```bash
# Check service health
./cli/zig-out/bin/ml service health my-llm
# Detailed diagnostics
./cli/zig-out/bin/ml service diagnose my-llm
# View service status
./cli/zig-out/bin/ml service status my-llm
```
### Performance Monitoring
```bash
# Real-time metrics
./cli/zig-out/bin/ml service monitor my-llm
# Performance report
./cli/zig-out/bin/ml service report my-llm --format=html
# GPU utilization
./cli/zig-out/bin/ml service stats my-llm --gpu
```
### Common Issues
**Out of Memory:**
```bash
# Reduce batch size
./cli/zig-out/bin/ml service update my-llm --max-num-seqs 128
# Enable quantization
./cli/zig-out/bin/ml service update my-llm --quantization awq
# Reduce GPU memory fraction
export VLLM_GPU_MEMORY_FRACTION=0.85
```
**Model Download Failures:**
```bash
# Set HuggingFace token
export HUGGING_FACE_HUB_TOKEN=your_token
# Use mirror
export HF_ENDPOINT=https://hf-mirror.com
# Pre-download with retry
./cli/zig-out/bin/ml service prefetch --model meta-llama/Llama-2-7b-chat-hf --retry
```
**Slow Inference:**
```bash
# Enable prefix caching
./cli/zig-out/bin/ml service update my-llm --enable-prefix-caching
# Increase batch size
./cli/zig-out/bin/ml service update my-llm --max-num-batched-tokens 8192
# Check GPU utilization
nvidia-smi dmon -s u
```
## Best Practices
### Resource Planning
1. **GPU Memory Calculation**: Model size × precision × overhead (1.2-1.5x)
2. **Batch Size Tuning**: Balance throughput vs. latency
3. **Quantization**: Use AWQ/GPTQ for production, FP16 for best quality
4. **Prefix Caching**: Enable for chat applications with repeated prompts
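The memory rule of thumb in point 1 can be turned into a quick estimator. The 1.3x overhead factor is an illustrative pick from the stated 1.2-1.5x range, and KV cache is extra:

```python
def weight_memory_gb(params_billion, bits, overhead=1.3):
    """Model size x precision x overhead: params * (bits/8) bytes each."""
    return params_billion * bits / 8 * overhead

fp16 = weight_memory_gb(7, 16)  # ~18.2 GB -> needs a 24 GB GPU
awq = weight_memory_gb(7, 4)    # ~4.6 GB -> fits on much smaller cards
```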
### Production Deployment
1. **Load Balancing**: Deploy multiple vLLM instances behind a load balancer
2. **Health Checks**: Configure Kubernetes liveness/readiness probes
3. **Autoscaling**: Scale based on queue depth or GPU utilization
4. **Monitoring**: Track tokens/sec, queue depth, and error rates
### Security
1. **Network Segmentation**: Isolate vLLM on internal network
2. **Rate Limiting**: Prevent abuse with per-user quotas
3. **Input Validation**: Sanitize prompts to prevent injection attacks
4. **Audit Logging**: Enable comprehensive audit trails
## CLI Reference
### Service Commands
```bash
# Start a service
ml service start vllm [flags]
  --name string                Service name (required)
  --model string               Model name or path (default: "meta-llama/Llama-2-7b-chat-hf")
  --gpu-count int              Number of GPUs (default: 1)
  --quantization string        Quantization method (awq, gptq, fp8, squeezellm)
  --port int                   Service port (default: 8000)
  --max-model-len int          Maximum sequence length
  --tensor-parallel-size int   Tensor parallelism degree

# List services
ml service list [flags]
  --format string   Output format (table, json)
  --all             Show all users' services (admin only)

# Service operations
ml service stop <name>
ml service start <name>      # Restart a stopped service
ml service restart <name>
ml service remove <name>
ml service logs <name> [flags]
  --follow     Follow log output
  --tail int   Number of lines to show (default: 100)
ml service info <name>
ml service health <name>
```
## See Also
- **[Testing Guide](testing.md)** - Testing vLLM services
- **[Deployment Guide](deployment.md)** - Production deployment
- **[Security Guide](security.md)** - Security best practices
- **[Scheduler Architecture](scheduler-architecture.md)** - How vLLM integrates with scheduler
- **[CLI Reference](cli-reference.md)** - Command-line tools
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter integration with vLLM