From 90ea18555c03e6c6b1255977839db223346571bd Mon Sep 17 00:00:00 2001 From: Jeremie Fraeys Date: Thu, 26 Feb 2026 13:04:39 -0500 Subject: [PATCH] docs: add vLLM workflow and cross-link documentation - Add new vLLM workflow documentation (vllm-workflow.md) - Update scheduler-architecture.md with Plugin GPU Quota and audit logging - Add See Also sections to jupyter-workflow.md, quick-start.md, configuration-reference.md for better navigation - Update landing page and index with vLLM and scheduler links - Cross-link all documentation for improved discoverability --- docs/src/_index.md | 14 +- docs/src/architecture.md | 76 ++++ docs/src/configuration-reference.md | 53 ++- docs/src/jupyter-workflow.md | 4 +- docs/src/landing.md | 4 +- docs/src/quick-start.md | 11 +- docs/src/scheduler-architecture.md | 105 ++++- docs/src/security.md | 180 ++++++++- docs/src/vllm-workflow.md | 581 ++++++++++++++++++++++++++++ 9 files changed, 999 insertions(+), 29 deletions(-) create mode 100644 docs/src/vllm-workflow.md diff --git a/docs/src/_index.md b/docs/src/_index.md index 981d6a5..46aebe3 100644 --- a/docs/src/_index.md +++ b/docs/src/_index.md @@ -40,12 +40,14 @@ make test-unit - [Environment Variables](environment-variables.md) - Configuration options - [Smart Defaults](smart-defaults.md) - Default configuration settings -### Development -- [Architecture](architecture.md) - System architecture and design -- [CLI Reference](cli-reference.md) - Command-line interface documentation -- [Testing Guide](testing.md) - Testing procedures and guidelines -- [Jupyter Workflow](jupyter-workflow.md) - CLI and Jupyter integration -- [Queue System](queue.md) - Job queue implementation +### 🛠️ Development +- **[Architecture](architecture.md)** - System architecture and design +- **[Scheduler Architecture](scheduler-architecture.md)** - Job scheduler and service management +- **[CLI Reference](cli-reference.md)** - Command-line interface documentation +- **[Testing Guide](testing.md)** - 
Testing procedures and guidelines +- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter notebook services +- **[vLLM Workflow](vllm-workflow.md)** - LLM inference services +- **[Queue System](queue.md)** - Job queue implementation ### Production Deployment - [Deployment Guide](deployment.md) - Production deployment instructions diff --git a/docs/src/architecture.md b/docs/src/architecture.md index 7e4fbcc..8f8a95f 100644 --- a/docs/src/architecture.md +++ b/docs/src/architecture.md @@ -244,6 +244,72 @@ Plugins can be configured via worker configuration under `plugins`, including: - `mode` - per-plugin paths/settings (e.g., artifact base path, log base path) +## Plugin GPU Quota System + +The scheduler includes a GPU quota management system for plugin-based services (Jupyter, vLLM, etc.) that controls resource allocation across users and plugins. + +### Quota Enforcement + +The quota system enforces limits at multiple levels: + +1. **Global GPU Limit**: Total GPUs available across all plugins +2. **Per-User GPU Limit**: Maximum GPUs a single user can allocate +3. **Per-User Service Limit**: Maximum number of service instances per user +4. **Plugin-Specific Limits**: Separate limits for each plugin type +5. 
**User Overrides**: Custom limits for specific users with allowed plugin restrictions + +### Architecture + +```mermaid +graph TB + subgraph "Plugin Quota System" + Submit[Job Submission] --> CheckQuota{Check Quota} + CheckQuota -->|Within Limits| Accept[Accept Job] + CheckQuota -->|Exceeded| Reject[Reject with Error] + + Accept --> RecordUsage[Record Usage] + RecordUsage --> Assign[Assign to Worker] + + Complete[Job Complete] --> ReleaseUsage[Release Usage] + + subgraph "Quota Manager" + Global[Global GPU Counter] + PerUser[Per-User Tracking] + PerPlugin[Per-Plugin Tracking] + Overrides[User Overrides] + end + + CheckQuota --> Global + CheckQuota --> PerUser + CheckQuota --> PerPlugin + CheckQuota --> Overrides + end +``` + +### Components + +- **PluginQuotaConfig**: Configuration for all quota limits and overrides +- **PluginQuotaManager**: Thread-safe manager for tracking and enforcing quotas +- **Integration Points**: + - `SubmitJob()`: Validates quotas before accepting service jobs + - `handleJobAccepted()`: Records usage when jobs are assigned + - `handleJobResult()`: Releases usage when jobs complete + +### Usage + +Jobs must include `user_id` and `plugin_name` metadata for quota tracking: + +```go +spec := scheduler.JobSpec{ + Type: scheduler.JobTypeService, + UserID: "user123", + GPUCount: 2, + Metadata: map[string]string{ + "plugin_name": "jupyter", + }, +} +``` + ## Zig CLI Architecture ### Component Structure @@ -865,3 +931,13 @@ graph TB --- This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity. 
+ +## See Also + +- **[Scheduler Architecture](scheduler-architecture.md)** - Detailed scheduler design and protocols +- **[Security Guide](security.md)** - Security architecture and best practices +- **[Configuration Reference](configuration-reference.md)** - Configuration options and environment variables +- **[Deployment Guide](deployment.md)** - Production deployment architecture +- **[Performance & Monitoring](performance-monitoring.md)** - Metrics and observability +- **[Research Runner Plan](research-runner-plan.md)** - Roadmap and implementation phases +- **[Native Libraries](native-libraries.md)** - C++ performance optimizations diff --git a/docs/src/configuration-reference.md b/docs/src/configuration-reference.md index d39f878..32e5ba7 100644 --- a/docs/src/configuration-reference.md +++ b/docs/src/configuration-reference.md @@ -664,6 +664,45 @@ api_key = "" | `resources.podman_cpus` | string | "2" | CPU limit for Podman containers | | `resources.podman_memory` | string | "4Gi" | Memory limit for Podman containers | +### Plugin GPU Quotas + +Control GPU allocation for plugin-based services (Jupyter, vLLM, etc.). 
+ +| Option | Type | Default | Description | +|--------|------|---------|-------------| +| `scheduler.plugin_quota.enabled` | bool | false | Enable plugin GPU quota enforcement | +| `scheduler.plugin_quota.total_gpus` | int | 0 | Global GPU limit across all plugins (0 = unlimited) | +| `scheduler.plugin_quota.per_user_gpus` | int | 0 | Default per-user GPU limit (0 = unlimited) | +| `scheduler.plugin_quota.per_user_services` | int | 0 | Default per-user service count limit (0 = unlimited) | +| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_gpus` | int | 0 | Plugin-specific GPU limit | +| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_services` | int | 0 | Plugin-specific service count limit | +| `scheduler.plugin_quota.user_overrides.{user}.max_gpus` | int | 0 | Per-user GPU override | +| `scheduler.plugin_quota.user_overrides.{user}.max_services` | int | 0 | Per-user service limit override | +| `scheduler.plugin_quota.user_overrides.{user}.allowed_plugins` | array | [] | Plugins user is allowed to use (empty = all) | + +**Example configuration:** + +```yaml +scheduler: + plugin_quota: + enabled: true + total_gpus: 16 + per_user_gpus: 4 + per_user_services: 2 + per_plugin_limits: + vllm: + max_gpus: 8 + max_services: 4 + jupyter: + max_gpus: 4 + max_services: 10 + user_overrides: + admin: + max_gpus: 8 + max_services: 5 + allowed_plugins: ["jupyter", "vllm"] +``` + ### Redis | Option | Type | Default | Description | @@ -746,4 +785,16 @@ go run cmd/api-server/main.go --config configs/api/dev.yaml --validate # Test CLI configuration ./cli/zig-out/bin/ml status --debug -``` \ No newline at end of file +``` + +--- + +## See Also + +- **[Architecture](architecture.md)** - System architecture overview +- **[Scheduler Architecture](scheduler-architecture.md)** - Scheduler configuration details +- **[Environment Variables](environment-variables.md)** - Additional environment variable documentation +- **[Security Guide](security.md)** - Security-related 
configuration +- **[Deployment Guide](deployment.md)** - Production configuration guidance +- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter service configuration +- **[vLLM Workflow](vllm-workflow.md)** - vLLM service configuration \ No newline at end of file diff --git a/docs/src/jupyter-workflow.md b/docs/src/jupyter-workflow.md index 361326b..174b8d8 100644 --- a/docs/src/jupyter-workflow.md +++ b/docs/src/jupyter-workflow.md @@ -620,8 +620,10 @@ Common error codes in binary responses: ## See Also +- **[vLLM Workflow](vllm-workflow.md)** - LLM inference services (complementary to Jupyter) +- **[Scheduler Architecture](scheduler-architecture.md)** - How Jupyter services are scheduled +- **[Configuration Reference](configuration-reference.md)** - Service configuration options - **[Testing Guide](testing.md)** - Testing Jupyter workflows - **[Deployment Guide](deployment.md)** - Production deployment - **[Security Guide](security.md)** - Security best practices -- **[API Reference](api-key-process.md)** - API documentation - **[CLI Reference](cli-reference.md)** - Command-line tools \ No newline at end of file diff --git a/docs/src/landing.md b/docs/src/landing.md index 6f44890..ecd6ed0 100644 --- a/docs/src/landing.md +++ b/docs/src/landing.md @@ -43,9 +43,11 @@ make test-unit ### 🛠️ Development - [**Architecture**](architecture.md) - System architecture and design +- [**Scheduler Architecture**](scheduler-architecture.md) - Job scheduler and service management - [**CLI Reference**](cli-reference.md) - Command-line interface documentation - [**Testing Guide**](testing.md) - Testing procedures and guidelines -- [**Jupyter Workflow**](jupyter-workflow.md) - CLI and Jupyter integration +- [**Jupyter Workflow**](jupyter-workflow.md) - Jupyter notebook services +- [**vLLM Workflow**](vllm-workflow.md) - LLM inference services - [**Queue System**](queue.md) - Job queue implementation ### 🏭 Production Deployment diff --git a/docs/src/quick-start.md 
b/docs/src/quick-start.md index 9d41d47..a8e819b 100644 --- a/docs/src/quick-start.md +++ b/docs/src/quick-start.md @@ -329,4 +329,13 @@ make help # Show all available commands --- -*Ready in minutes!* \ No newline at end of file +*Ready in minutes!* + +## See Also + +- **[Architecture](architecture.md)** - System architecture overview +- **[Scheduler Architecture](scheduler-architecture.md)** - Job scheduling and service management +- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter notebook services +- **[vLLM Workflow](vllm-workflow.md)** - LLM inference services +- **[Configuration Reference](configuration-reference.md)** - Configuration options +- **[Security Guide](security.md)** - Security best practices \ No newline at end of file diff --git a/docs/src/scheduler-architecture.md b/docs/src/scheduler-architecture.md index d6532d8..89ac003 100644 --- a/docs/src/scheduler-architecture.md +++ b/docs/src/scheduler-architecture.md @@ -169,6 +169,52 @@ The scheduler persists state for crash recovery: State is replayed on startup via `StateStore.Replay()`. 
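The replay-on-startup pattern described above is standard event sourcing. The sketch below illustrates the idea with hypothetical event and field names — the real `StateStore` API and event types may differ:

```go
package main

import "fmt"

// Event is a minimal stand-in for a persisted scheduler state event;
// the names here are illustrative, not the actual StateStore types.
type Event struct {
	Kind  string // e.g. "job_submitted", "job_completed"
	JobID string
}

// StateStore appends events to a log and replays them on startup.
type StateStore struct {
	log []Event
}

// Append persists one event (in the real system this is written to disk).
func (s *StateStore) Append(e Event) { s.log = append(s.log, e) }

// Replay applies every persisted event in order to rebuild in-memory state.
func (s *StateStore) Replay(apply func(Event)) {
	for _, e := range s.log {
		apply(e)
	}
}

// rebuildPending reconstructs the set of still-pending jobs from the log.
func rebuildPending(s *StateStore) map[string]bool {
	pending := map[string]bool{}
	s.Replay(func(e Event) {
		switch e.Kind {
		case "job_submitted":
			pending[e.JobID] = true
		case "job_completed":
			delete(pending, e.JobID)
		}
	})
	return pending
}

func main() {
	store := &StateStore{}
	store.Append(Event{Kind: "job_submitted", JobID: "job-1"})
	store.Append(Event{Kind: "job_completed", JobID: "job-1"})

	// After a crash, replaying the log recovers the same state.
	fmt.Println(len(rebuildPending(store))) // prints 0
}
```

Because state is derived purely from the ordered event log, the hub never needs to checkpoint derived structures: replaying the log after a crash reconstructs exactly the same pending-job set.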
+## Service Templates + +The scheduler provides built-in service templates for common ML workloads: + +### Available Templates + +| Template | Description | Default Port Range | +|----------|-------------|-------------------| +| **JupyterLab** | Interactive Jupyter environment | 8000-9000 | +| **Jupyter Notebook** | Classic Jupyter notebooks | 8000-9000 | +| **vLLM** | OpenAI-compatible LLM inference server | 8000-9000 | + +### Port Allocation + +Dynamic port management for service instances: + +```go +type PortAllocator struct { + startPort int // Default: 8000 + endPort int // Default: 9000 + allocated map[int]time.Time // Port -> allocation time +} +``` + +**Features:** +- Automatic port selection from configured range +- TTL-based port reclamation +- Thread-safe concurrent allocations +- Exhaustion handling with clear error messages + +### Template Variables + +Service templates support dynamic variable substitution: + +| Variable | Description | Example | +|----------|-------------|---------| +| `{{SERVICE_PORT}}` | Allocated port for the service | `8080` | +| `{{WORKER_ID}}` | ID of the assigned worker | `worker-1` | +| `{{TASK_ID}}` | Unique task identifier | `task-abc123` | +| `{{SECRET:xxx}}` | Secret reference from keychain | `api-key-value` | +| `{{MODEL_NAME}}` | ML model name (vLLM) | `llama-2-7b` | +| `{{GPU_COUNT}}` | Number of GPUs allocated | `2` | +| `{{GPU_DEVICES}}` | Specific GPU device IDs | `0,1` | +| `{{MODEL_CACHE}}` | Path to model cache directory | `/models` | +| `{{WORKSPACE}}` | Working directory path | `/workspace` | + ## API Methods ```go @@ -188,7 +234,52 @@ func (h *SchedulerHub) Start() error func (h *SchedulerHub) Stop() ``` -## Configuration +## Audit Integration + +The scheduler integrates with the audit logging system for security and compliance: + +### Audit Logger Integration + +```go +type SchedulerHub struct { + // ... other fields ... 
+ auditor *audit.Logger // Security audit logger +} +``` + +**Initialization:** +```go +auditor := audit.NewLogger(audit.Config{ + LogPath: "/var/log/fetch_ml/scheduler_audit.log", + Enabled: true, +}) +hub, err := scheduler.NewHub(config, auditor) +``` + +### Audit Events + +The scheduler logs the following audit events: + +| Event | Description | Fields Logged | +|-------|-------------|---------------| +| `job_submitted` | New job queued | job_id, user_id, job_type, gpu_count | +| `job_assigned` | Job assigned to worker | job_id, worker_id, assignment_time | +| `job_accepted` | Worker accepted job | job_id, worker_id, acceptance_time | +| `job_completed` | Job finished successfully | job_id, worker_id, duration | +| `job_failed` | Job failed | job_id, worker_id, error_code | +| `job_cancelled` | Job cancelled | job_id, cancelled_by, reason | +| `worker_registered` | Worker connected | worker_id, capabilities, timestamp | +| `worker_disconnected` | Worker disconnected | worker_id, duration_connected | +| `quota_exceeded` | GPU quota violation | user_id, plugin_name, requested, limit | + +### Tamper-Evident Logging + +Audit logs use chain hashing for integrity: +- Each event includes SHA-256 hash of previous event +- Chain verification detects log tampering +- Separate log file from operational logs + +## Configuration ```go type HubConfig struct { @@ -203,6 +294,7 @@ type HubConfig struct { GangAllocTimeoutSecs int // Multi-node allocation timeout AcceptanceTimeoutSecs int // Job acceptance timeout WorkerTokens map[string]string // Authentication tokens + PluginQuota PluginQuotaConfig // Plugin GPU quota configuration } ``` @@ -214,6 +306,11 @@ Process management is abstracted for Unix/Windows: ## See Also -- `internal/scheduler/hub.go` - Core implementation -- `tests/fixtures/scheduler_fixture.go` - Test infrastructure -- `docs/src/native-libraries.md` - Native C++ performance libraries +- **[Architecture Overview](architecture.md)** - High-level system 
architecture +- **[Security Guide](security.md)** - Audit logging and security features +- **[Configuration Reference](configuration-reference.md)** - Plugin GPU quotas and scheduler config +- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter service integration with scheduler +- **[vLLM Workflow](vllm-workflow.md)** - vLLM service integration with scheduler +- **[Testing Guide](testing.md)** - Testing the scheduler +- **`internal/scheduler/hub.go`** - Core implementation +- **`tests/fixtures/scheduler_fixture.go`** - Test infrastructure diff --git a/docs/src/security.md b/docs/src/security.md index e29056a..278d8c4 100644 --- a/docs/src/security.md +++ b/docs/src/security.md @@ -112,27 +112,164 @@ The system detects and rejects plaintext secrets using: ### HIPAA-Compliant Audit Logging -**Tamper-Evident Logging:** +FetchML implements comprehensive HIPAA-compliant audit logging with tamper-evident chain hashing for healthcare and regulated environments. + +**Architecture:** ```go -// Each event includes chain hash for integrity -audit.Log(audit.Event{ +// Audit logger initialization +auditor := audit.NewLogger(audit.Config{ + Enabled: true, + LogPath: "/var/log/fetch_ml/audit.log", +}) + +// Logging an event +auditor.Log(audit.Event{ EventType: audit.EventFileRead, - UserID: "user1", - Resource: "/data/file.txt", + UserID: "user123", + Resource: "/data/patient_records/file.txt", + IPAddress: "10.0.0.5", + Success: true, + Metadata: map[string]any{ + "file_size": 1024, + "checksum": "abc123...", + }, }) ``` -**Event Types:** -- `file_read` - File access logged -- `file_write` - File modification logged -- `file_delete` - File deletion logged -- `auth_success` / `auth_failure` - Authentication events -- `job_queued` / `job_started` / `job_completed` - Job lifecycle - -**Chain Hashing:** -- Each event includes SHA-256 hash of previous event +**Tamper-Evident Chain Hashing:** +- Each event includes SHA-256 hash of the previous event (PrevHash) +- Event hash covers 
all fields including PrevHash (chaining) - Modification of any log entry breaks the chain -- `VerifyChain()` function detects tampering +- Separate `VerifyChain()` function detects tampering +- Monotonic sequence numbers prevent deletion attacks + +```go +// Verify audit chain integrity +valid, err := audit.VerifyChain("/var/log/fetch_ml/audit.log") +if err != nil || !valid { + log.Fatal("AUDIT TAMPERING DETECTED") +} +``` + +**HIPAA-Specific Event Types:** + +| Event Type | HIPAA Relevance | Fields Logged | +|------------|-----------------|---------------| +| `file_read` | Access to PHI | user_id, file_path, ip_address, timestamp, checksum | +| `file_write` | Modification of PHI | user_id, file_path, bytes_written, prev_checksum, new_checksum | +| `file_delete` | Deletion of PHI | user_id, file_path, deletion_type (soft/hard) | +| `dataset_access` | Bulk data access | user_id, dataset_id, record_count, access_purpose | +| `authentication_success` | Access control | user_id, auth_method, ip_address, mfa_used | +| `authentication_failure` | Failed access attempts | attempted_user, ip_address, failure_reason, attempt_count | +| `job_queued` | Processing PHI | user_id, job_id, input_data_classification | +| `job_started` | PHI processing begun | job_id, worker_id, data_accessed | +| `job_completed` | PHI processing complete | job_id, output_location, data_disposition | + +**Standard Event Types:** + +| Event Type | Description | Use Case | +|------------|-------------|----------| +| `authentication_attempt` | Login attempt (pre-validation) | Brute force detection | +| `authentication_success` | Successful login | Access tracking | +| `authentication_failure` | Failed login | Security monitoring | +| `job_queued` | Job submitted to queue | Workflow tracking | +| `job_started` | Job execution begun | Performance monitoring | +| `job_completed` | Job finished successfully | Completion tracking | +| `job_failed` | Job execution failed | Error tracking | +| `jupyter_start` 
| Jupyter service started | Resource tracking | +| `jupyter_stop` | Jupyter service stopped | Session tracking | +| `experiment_created` | Experiment initialized | Provenance tracking | +| `experiment_deleted` | Experiment removed | Data lifecycle | + +**Scheduler Audit Integration:** + +The scheduler automatically logs these events: +- `job_submitted` - Job queued (includes user_id, job_type, gpu_count) +- `job_assigned` - Job assigned to worker (worker_id, assignment_time) +- `job_accepted` - Worker confirmed job execution +- `job_completed` / `job_failed` / `job_cancelled` - Job terminal states +- `worker_registered` - Worker connected to scheduler +- `worker_disconnected` - Worker disconnected +- `quota_exceeded` - GPU quota violation attempt + +**Audit Log Format:** +```json +{ + "timestamp": "2024-01-15T10:30:00Z", + "event_type": "file_read", + "user_id": "researcher1", + "ip_address": "10.0.0.5", + "resource": "/data/experiments/run_001/results.csv", + "action": "read", + "success": true, + "sequence_num": 15423, + "prev_hash": "a1b2c3d4...", + "event_hash": "e5f6a7b8...", + "metadata": { + "file_size": 1048576, + "checksum": "sha256:abc123...", + "access_duration_ms": 150 + } +} +``` + +**Log Storage and Rotation:** +- Default location: `/var/log/fetch_ml/audit.log` +- Automatic rotation by size (100MB) or time (daily) +- Retention policy: Configurable (default: 7 years for HIPAA) +- Immutable storage: Append-only with filesystem-level protection + +**Compliance Features:** + +- **User Identification**: Every event includes `user_id` for accountability +- **Timestamp Precision**: RFC3339 nanosecond precision timestamps +- **IP Address Tracking**: Source IP for all network events +- **Success/Failure Tracking**: Boolean success field for all operations +- **Metadata Flexibility**: Extensible key-value metadata for domain-specific data +- **Immutable Logging**: Append-only files with filesystem protections +- **Chain Verification**: Cryptographic proof of 
log integrity +- **Sealed Logs**: Optional GPG signing for regulatory submissions + +**Audit Log Analysis:** + +```bash +# View recent audit events +tail -f /var/log/fetch_ml/audit.log | jq '.' + +# Search for specific user activity +grep '"user_id":"researcher1"' /var/log/fetch_ml/audit.log | jq '.' + +# Find all file access events +jq 'select(.event_type == "file_read")' /var/log/fetch_ml/audit.log + +# Detect failed authentication attempts +jq 'select(.event_type == "authentication_failure")' /var/log/fetch_ml/audit.log + +# Verify audit chain integrity +./cli/zig-out/bin/ml audit verify /var/log/fetch_ml/audit.log + +# Export audit report for compliance +./cli/zig-out/bin/ml audit export --start 2024-01-01 --end 2024-01-31 --format csv +``` + +**Regulatory Compliance:** + +| Regulation | Requirement | FetchML Implementation | +|------------|-------------|------------------------| +| **HIPAA** | Access logging, tamper evidence | Chain hashing, file access events, user tracking | +| **GDPR** | Data subject access, right to deletion | Full audit trail, deletion events with chain preservation | +| **SOX** | Financial controls, audit trail | Immutable logs, separation of duties via RBAC | +| **21 CFR Part 11** | Electronic records integrity | Tamper-evident logging, user authentication, timestamps | +| **PCI DSS** | Access logging, data protection | Audit trails, encryption, access controls | + +**Best Practices:** + +1. **Enable Audit Logging**: Always enable in production +2. **Separate Storage**: Store audit logs on separate volume from application data +3. **Regular Verification**: Run chain verification daily +4. **Backup Strategy**: Include audit logs in backup procedures +5. **Access Control**: Restrict audit log access to security personnel only +6. 
**Monitoring**: Set up alerts for suspicious patterns (multiple failed logins, after-hours access) --- @@ -420,3 +557,16 @@ All API access is logged with: - **Security Issues**: Report privately via email - **Questions**: See documentation or create issue - **Updates**: Monitor releases for security patches + +--- + +## See Also + +- **[Privacy & Security](privacy-security.md)** - PII detection and privacy controls +- **[Multi-Tenant Security](multi-tenant-security.md)** - Tenant isolation and cross-tenant access prevention +- **[API Key Process](api-key-process.md)** - Generate and manage API keys +- **[User Permissions](user-permissions.md)** - Role-based access control +- **[Runtime Security](runtime-security.md)** - Container sandboxing and seccomp profiles +- **[Scheduler Architecture](scheduler-architecture.md)** - Audit integration in the scheduler +- **[Configuration Reference](configuration-reference.md)** - Security-related configuration options +- **[Deployment Guide](deployment.md)** - Production security hardening diff --git a/docs/src/vllm-workflow.md b/docs/src/vllm-workflow.md new file mode 100644 index 0000000..a915849 --- /dev/null +++ b/docs/src/vllm-workflow.md @@ -0,0 +1,581 @@ +# vLLM Inference Service Guide + +Comprehensive guide to deploying and managing OpenAI-compatible LLM inference services using vLLM in FetchML. 
+ +## Overview + +The vLLM plugin provides high-performance LLM inference with: +- **OpenAI-Compatible API**: Drop-in replacement for OpenAI's API +- **Advanced Scheduling**: Continuous batching for throughput optimization +- **GPU Optimization**: Tensor parallelism and quantization support +- **Model Management**: Automatic model downloading and caching +- **Quantization**: AWQ, GPTQ, FP8, and SqueezeLLM support + +## Quick Start + +### Start vLLM Service + +```bash +# Start development stack +make dev-up + +# Start vLLM service with default model +./cli/zig-out/bin/ml service start vllm --name llm-server --model meta-llama/Llama-2-7b-chat-hf + +# Or with specific GPU requirements +./cli/zig-out/bin/ml service start vllm \ + --name llm-server \ + --model meta-llama/Llama-2-7b-chat-hf \ + --gpu-count 1 \ + --quantization awq + +# Access the API +open http://localhost:8000/docs +``` + +### Using the API + +```python +import openai + +# Point to local vLLM instance +client = openai.OpenAI( + base_url="http://localhost:8000/v1", + api_key="not-needed" +) + +# Chat completion +response = client.chat.completions.create( + model="meta-llama/Llama-2-7b-chat-hf", + messages=[ + {"role": "user", "content": "Explain quantum computing in simple terms"} + ] +) + +print(response.choices[0].message.content) +``` + +## Service Management + +### Creating vLLM Services + +```bash +# Create basic vLLM service +./cli/zig-out/bin/ml service start vllm --name my-llm + +# Create with specific model +./cli/zig-out/bin/ml service start vllm \ + --name my-llm \ + --model microsoft/DialoGPT-medium + +# Create with resource constraints +./cli/zig-out/bin/ml service start vllm \ + --name production-llm \ + --model meta-llama/Llama-2-13b-chat-hf \ + --gpu-count 2 \ + --quantization gptq \ + --max-model-len 4096 + +# List all vLLM services +./cli/zig-out/bin/ml service list + +# Service details +./cli/zig-out/bin/ml service info my-llm +``` + +### Service Configuration + +**Resource 
Allocation:** +```yaml +# vllm-config.yaml +resources: + gpu_count: 1 + gpu_memory: 24gb + cpu: 4 + memory: 16g + +model: + name: "meta-llama/Llama-2-7b-chat-hf" + quantization: "awq" # Options: awq, gptq, squeezellm, fp8 + trust_remote_code: false + max_model_len: 4096 + +serving: + port: 8000 + host: "0.0.0.0" + tensor_parallel_size: 1 + dtype: "auto" # auto, half, bfloat16, float + +optimization: + enable_prefix_caching: true + swap_space: 4 # GB + max_num_batched_tokens: 4096 + max_num_seqs: 256 +``` + +**Environment Variables:** +```bash +# Model cache location +export VLLM_MODEL_CACHE=/models + +# HuggingFace token for gated models +export HUGGING_FACE_HUB_TOKEN=your_token_here + +# CUDA settings +export CUDA_VISIBLE_DEVICES=0,1 +``` + +### Service Lifecycle + +```bash +# Start a service +./cli/zig-out/bin/ml service start vllm --name my-llm + +# Stop a service (graceful shutdown) +./cli/zig-out/bin/ml service stop my-llm + +# Restart a service +./cli/zig-out/bin/ml service restart my-llm + +# Remove a service (stops and deletes) +./cli/zig-out/bin/ml service remove my-llm + +# View service logs +./cli/zig-out/bin/ml service logs my-llm --follow + +# Check service health +./cli/zig-out/bin/ml service health my-llm +``` + +## Model Management + +### Supported Models + +vLLM supports most HuggingFace Transformers models: + +- **Llama 2/3**: `meta-llama/Llama-2-7b-chat-hf`, `meta-llama/Llama-2-70b-chat-hf` +- **Mistral**: `mistralai/Mistral-7B-Instruct-v0.2` +- **Mixtral**: `mistralai/Mixtral-8x7B-Instruct-v0.1` +- **Falcon**: `tiiuae/falcon-7b-instruct` +- **CodeLlama**: `codellama/CodeLlama-7b-hf` +- **Phi**: `microsoft/phi-2` +- **Qwen**: `Qwen/Qwen-7B-Chat` +- **Gemma**: `google/gemma-7b-it` + +### Model Caching + +Models are automatically cached to avoid repeated downloads: + +```bash +# Default cache location +~/.cache/huggingface/hub/ + +# Custom cache location +export VLLM_MODEL_CACHE=/mnt/fast-storage/models + +# Pre-download models 
+./cli/zig-out/bin/ml service prefetch --model meta-llama/Llama-2-7b-chat-hf +``` + +### Quantization + +Quantization reduces memory usage and improves inference speed: + +```bash +# AWQ (4-bit quantization) +./cli/zig-out/bin/ml service start vllm \ + --name llm-awq \ + --model TheBloke/Llama-2-7B-AWQ \ + --quantization awq + +# GPTQ (4-bit quantization) +./cli/zig-out/bin/ml service start vllm \ + --name llm-gptq \ + --model TheBloke/Llama-2-7B-GPTQ \ + --quantization gptq + +# FP8 (8-bit floating point) +./cli/zig-out/bin/ml service start vllm \ + --name llm-fp8 \ + --model meta-llama/Llama-2-7b-chat-hf \ + --quantization fp8 +``` + +**Quantization Comparison:** + +| Method | Bits | Memory Reduction | Speed Impact | Quality | +|--------|------|------------------|--------------|---------| +| None (FP16) | 16 | 1x | Baseline | Best | +| FP8 | 8 | 2x | Faster | Excellent | +| AWQ | 4 | 4x | Fast | Very Good | +| GPTQ | 4 | 4x | Fast | Very Good | +| SqueezeLLM | 4 | 4x | Fast | Good | + +## API Reference + +### OpenAI-Compatible Endpoints + +vLLM provides OpenAI-compatible REST API endpoints: + +**Chat Completions:** +```bash +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Llama-2-7b-chat-hf", + "messages": [ + {"role": "user", "content": "Hello!"} + ], + "max_tokens": 100, + "temperature": 0.7 + }' +``` + +**Completions (Legacy):** +```bash +curl http://localhost:8000/v1/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Llama-2-7b-chat-hf", + "prompt": "The capital of France is", + "max_tokens": 10 + }' +``` + +**Embeddings:** +```bash +curl http://localhost:8000/v1/embeddings \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Llama-2-7b-chat-hf", + "input": "Hello world" + }' +``` + +**List Models:** +```bash +curl http://localhost:8000/v1/models +``` + +### Streaming Responses + +Enable streaming for real-time token generation: + 
+```python +import openai + +client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed") + +stream = client.chat.completions.create( + model="meta-llama/Llama-2-7b-chat-hf", + messages=[{"role": "user", "content": "Write a poem about AI"}], + stream=True, + max_tokens=200 +) + +for chunk in stream: + if chunk.choices[0].delta.content: + print(chunk.choices[0].delta.content, end="") +``` + +### Advanced Parameters + +Standard OpenAI parameters are passed directly; vLLM-specific sampling options are not part of the OpenAI client signature and must go in `extra_body`: + +```python +response = client.chat.completions.create( + model="meta-llama/Llama-2-7b-chat-hf", + messages=messages, + + # Generation parameters + max_tokens=500, + temperature=0.7, + top_p=0.9, + + # Repetition and penalties + frequency_penalty=0.5, + presence_penalty=0.5, + + # Sampling + seed=42, + stop=["END", "STOP"], + + # vLLM extensions (top_k, repetition penalty, beam search) via extra_body + extra_body={ + "top_k": 40, + "repetition_penalty": 1.1, + "best_of": 1, + "use_beam_search": False, + }, +) +``` + +## GPU Quotas and Resource Management + +### Per-User GPU Limits + +The scheduler enforces GPU quotas for vLLM services: + +```yaml +# scheduler-config.yaml +scheduler: + plugin_quota: + enabled: true + total_gpus: 16 + per_user_gpus: 4 + per_user_services: 2 + per_plugin_limits: + vllm: + max_gpus: 8 + max_services: 4 + user_overrides: + admin: + max_gpus: 8 + max_services: 5 + allowed_plugins: ["vllm", "jupyter"] +``` + +### Resource Monitoring + +```bash +# Check GPU allocation for your user +./cli/zig-out/bin/ml service quota + +# View current usage +./cli/zig-out/bin/ml service usage + +# Monitor service resource usage +./cli/zig-out/bin/ml service stats my-llm +``` + +## Multi-GPU and Distributed Inference + +### Tensor Parallelism + +For large models that don't fit on a single GPU: + +```bash +# 70B model across 4 GPUs +./cli/zig-out/bin/ml service start vllm \ + --name llm-70b \ + --model meta-llama/Llama-2-70b-chat-hf \ + --gpu-count 4 \ + --tensor-parallel-size 4 +``` + +### Pipeline Parallelism + +For very large models with pipeline stages: + +```yaml +# Pipeline parallelism config +model: 
+  name: "meta-llama/Llama-2-70b-chat-hf"
+
+serving:
+  tensor_parallel_size: 2
+  pipeline_parallel_size: 2  # Total 4 GPUs
+```
+
+## Integration with Experiments
+
+### Using vLLM from Training Jobs
+
+```python
+# In your training script
+import requests
+
+# Call the local vLLM service; set a timeout and fail fast on HTTP errors
+response = requests.post(
+    "http://vllm-service:8000/v1/chat/completions",
+    json={
+        "model": "meta-llama/Llama-2-7b-chat-hf",
+        "messages": [{"role": "user", "content": "Summarize this text"}]
+    },
+    timeout=60,
+)
+response.raise_for_status()
+
+result = response.json()
+summary = result["choices"][0]["message"]["content"]
+```
+
+### Linking with Experiments
+
+```bash
+# Start vLLM service linked to experiment
+./cli/zig-out/bin/ml service start vllm \
+    --name llm-exp-1 \
+    --model meta-llama/Llama-2-7b-chat-hf \
+    --experiment experiment-id
+
+# View linked services
+./cli/zig-out/bin/ml service list --experiment experiment-id
+```
+
+## Security and Access Control
+
+### Network Isolation
+
+```bash
+# Restrict to internal network only
+./cli/zig-out/bin/ml service start vllm \
+    --name internal-llm \
+    --model meta-llama/Llama-2-7b-chat-hf \
+    --host 10.0.0.1 \
+    --port 8000
+```
+
+### API Key Authentication
+
+```yaml
+# vllm-security.yaml
+auth:
+  api_key_required: true
+  allowed_ips:
+    - "10.0.0.0/8"
+    - "192.168.0.0/16"
+
+rate_limit:
+  requests_per_minute: 60
+  tokens_per_minute: 10000
+```
+
+### Audit Trail
+
+All API calls are logged for compliance:
+
+```bash
+# View audit log
+./cli/zig-out/bin/ml service audit my-llm
+
+# Export audit report
+./cli/zig-out/bin/ml service audit my-llm --export=csv
+
+# Check access patterns
+./cli/zig-out/bin/ml service audit my-llm --summary
+```
+
+## Monitoring and Troubleshooting
+
+### Health Checks
+
+```bash
+# Check service health
+./cli/zig-out/bin/ml service health my-llm
+
+# Detailed diagnostics
+./cli/zig-out/bin/ml service diagnose my-llm
+
+# View service status
+./cli/zig-out/bin/ml service status my-llm
+```
+
+### Performance Monitoring
+
+```bash
+# 
Real-time metrics
+./cli/zig-out/bin/ml service monitor my-llm
+
+# Performance report
+./cli/zig-out/bin/ml service report my-llm --format=html
+
+# GPU utilization
+./cli/zig-out/bin/ml service stats my-llm --gpu
+```
+
+### Common Issues
+
+**Out of Memory:**
+```bash
+# Reduce the number of concurrent sequences (batch size)
+./cli/zig-out/bin/ml service update my-llm --max-num-seqs 128
+
+# Enable quantization
+./cli/zig-out/bin/ml service update my-llm --quantization awq
+
+# Reduce GPU memory fraction
+export VLLM_GPU_MEMORY_FRACTION=0.85
+```
+
+**Model Download Failures:**
+```bash
+# Set HuggingFace token
+export HUGGING_FACE_HUB_TOKEN=your_token
+
+# Use mirror
+export HF_ENDPOINT=https://hf-mirror.com
+
+# Pre-download with retry
+./cli/zig-out/bin/ml service prefetch --model meta-llama/Llama-2-7b-chat-hf --retry
+```
+
+**Slow Inference:**
+```bash
+# Enable prefix caching
+./cli/zig-out/bin/ml service update my-llm --enable-prefix-caching
+
+# Increase batch size
+./cli/zig-out/bin/ml service update my-llm --max-num-batched-tokens 8192
+
+# Check GPU utilization
+nvidia-smi dmon -s u
+```
+
+## Best Practices
+
+### Resource Planning
+
+1. **GPU Memory Calculation**: weights take parameter count × bytes per parameter (2 for FP16), plus a 1.2-1.5x overhead for KV cache, activations, and CUDA context
+2. **Batch Size Tuning**: Balance throughput vs. latency
+3. **Quantization**: Use AWQ/GPTQ for production, FP16 for best quality
+4. **Prefix Caching**: Enable for chat applications with repeated prompts
+
+### Production Deployment
+
+1. **Load Balancing**: Deploy multiple vLLM instances behind a load balancer
+2. **Health Checks**: Configure Kubernetes liveness/readiness probes
+3. **Autoscaling**: Scale based on queue depth or GPU utilization
+4. **Monitoring**: Track tokens/sec, queue depth, and error rates
+
+### Security
+
+1. **Network Segmentation**: Isolate vLLM on internal network
+2. **Rate Limiting**: Prevent abuse with per-user quotas
+3. **Input Validation**: Sanitize prompts to prevent injection attacks
+4. 
**Audit Logging**: Enable comprehensive audit trails
+
+## CLI Reference
+
+### Service Commands
+
+```bash
+# Start a service
+ml service start vllm [flags]
+  --name string                Service name (required)
+  --model string               Model name or path (default: "meta-llama/Llama-2-7b-chat-hf")
+  --gpu-count int              Number of GPUs (default: 1)
+  --quantization string        Quantization method (awq, gptq, fp8, squeezellm)
+  --port int                   Service port (default: 8000)
+  --max-model-len int          Maximum sequence length
+  --tensor-parallel-size int   Tensor parallelism degree
+
+# List services
+ml service list [flags]
+  --format string   Output format (table, json)
+  --all             Show all users' services (admin only)
+
+# Service operations
+ml service stop <name>
+ml service start <name>       # Restart a stopped service
+ml service restart <name>
+ml service remove <name>
+ml service logs <name> [flags]
+  --follow     Follow log output
+  --tail int   Number of lines to show (default: 100)
+ml service info <name>
+ml service health <name>
+```
+
+## See Also
+
+- **[Testing Guide](testing.md)** - Testing vLLM services
+- **[Deployment Guide](deployment.md)** - Production deployment
+- **[Security Guide](security.md)** - Security best practices
+- **[Scheduler Architecture](scheduler-architecture.md)** - How vLLM integrates with the scheduler
+- **[CLI Reference](cli-reference.md)** - Command-line tools
+- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter integration with vLLM
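+
+## Appendix: Kubernetes Probe Example
+
+The liveness/readiness probe recommendation under Production Deployment can be sketched as below. This fragment is illustrative, not generated by the ml CLI; the port matches the default service port (8000), and `/health` is the health endpoint exposed by vLLM's OpenAI-compatible server:
+
+```yaml
+# Illustrative probes for a vLLM container
+livenessProbe:
+  httpGet:
+    path: /health
+    port: 8000
+  initialDelaySeconds: 120   # model download/load can take minutes
+  periodSeconds: 10
+readinessProbe:
+  httpGet:
+    path: /health
+    port: 8000
+  initialDelaySeconds: 30
+  periodSeconds: 5
+```
+
+Tune `initialDelaySeconds` to your model's load time; a liveness probe that fires before the weights finish loading will restart the container in a loop.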