docs: add vLLM workflow and cross-link documentation

- Add new vLLM workflow documentation (vllm-workflow.md)
- Update scheduler-architecture.md with Plugin GPU Quota and audit logging
- Add See Also sections to jupyter-workflow.md, quick-start.md,
  configuration-reference.md for better navigation
- Update landing page and index with vLLM and scheduler links
- Cross-link all documentation for improved discoverability
This commit is contained in:
Jeremie Fraeys 2026-02-26 13:04:39 -05:00
parent 8f2495deb0
commit 90ea18555c
9 changed files with 999 additions and 29 deletions


@@ -40,12 +40,14 @@ make test-unit
- [Environment Variables](environment-variables.md) - Configuration options
- [Smart Defaults](smart-defaults.md) - Default configuration settings
### Development
- [Architecture](architecture.md) - System architecture and design
- [CLI Reference](cli-reference.md) - Command-line interface documentation
- [Testing Guide](testing.md) - Testing procedures and guidelines
- [Jupyter Workflow](jupyter-workflow.md) - CLI and Jupyter integration
- [Queue System](queue.md) - Job queue implementation
### 🛠️ Development
- **[Architecture](architecture.md)** - System architecture and design
- **[Scheduler Architecture](scheduler-architecture.md)** - Job scheduler and service management
- **[CLI Reference](cli-reference.md)** - Command-line interface documentation
- **[Testing Guide](testing.md)** - Testing procedures and guidelines
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter notebook services
- **[vLLM Workflow](vllm-workflow.md)** - LLM inference services
- **[Queue System](queue.md)** - Job queue implementation
### Production Deployment
- [Deployment Guide](deployment.md) - Production deployment instructions


@@ -244,6 +244,72 @@ Plugins can be configured via worker configuration under `plugins`, including:
- `mode`
- per-plugin paths/settings (e.g., artifact base path, log base path)
## Plugin GPU Quota System
The scheduler includes a GPU quota management system for plugin-based services (Jupyter, vLLM, etc.) that controls resource allocation across users and plugins.
### Quota Enforcement
The quota system enforces limits at multiple levels:
1. **Global GPU Limit**: Total GPUs available across all plugins
2. **Per-User GPU Limit**: Maximum GPUs a single user can allocate
3. **Per-User Service Limit**: Maximum number of service instances per user
4. **Plugin-Specific Limits**: Separate limits for each plugin type
5. **User Overrides**: Custom limits for specific users with allowed plugin restrictions
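The five levels above compose into a single admission check. The following Python sketch is illustrative only — field names mirror the configuration keys, but the real manager is the Go `PluginQuotaManager`:

```python
class QuotaExceeded(Exception):
    pass

class QuotaSketch:
    """Illustrative in-memory tracker, not the real Go implementation."""
    def __init__(self, total_gpus=0, per_user_gpus=0, per_user_services=0,
                 per_plugin_gpus=None, overrides=None):
        # 0 means unlimited, matching the configuration defaults
        self.cfg = dict(total=total_gpus, user_gpus=per_user_gpus,
                        user_svcs=per_user_services)
        self.per_plugin_gpus = per_plugin_gpus or {}
        self.overrides = overrides or {}
        self.total = 0
        self.by_user = {}    # user_id -> (gpus, services)
        self.by_plugin = {}  # plugin  -> gpus

    def check_and_record(self, user, plugin, gpus):
        ov = self.overrides.get(user, {})
        # 5. User overrides: allowed-plugin restriction
        allowed = ov.get("allowed_plugins") or []
        if allowed and plugin not in allowed:
            raise QuotaExceeded(f"plugin {plugin!r} not allowed for {user}")
        u_gpus, u_svcs = self.by_user.get(user, (0, 0))
        limit_user = ov.get("max_gpus", self.cfg["user_gpus"])
        limit_svcs = ov.get("max_services", self.cfg["user_svcs"])
        # 1. Global GPU limit
        if self.cfg["total"] and self.total + gpus > self.cfg["total"]:
            raise QuotaExceeded("global GPU limit")
        # 2. Per-user GPU limit (override takes precedence over default)
        if limit_user and u_gpus + gpus > limit_user:
            raise QuotaExceeded("per-user GPU limit")
        # 3. Per-user service count limit
        if limit_svcs and u_svcs + 1 > limit_svcs:
            raise QuotaExceeded("per-user service limit")
        # 4. Plugin-specific GPU limit
        plim = self.per_plugin_gpus.get(plugin, 0)
        if plim and self.by_plugin.get(plugin, 0) + gpus > plim:
            raise QuotaExceeded("per-plugin GPU limit")
        # All checks passed: record usage
        self.total += gpus
        self.by_user[user] = (u_gpus + gpus, u_svcs + 1)
        self.by_plugin[plugin] = self.by_plugin.get(plugin, 0) + gpus
```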
### Architecture
```mermaid
graph TB
    subgraph "Plugin Quota System"
        Submit[Job Submission] --> CheckQuota{Check Quota}
        CheckQuota -->|Within Limits| Accept[Accept Job]
        CheckQuota -->|Exceeded| Reject[Reject with Error]
        Accept --> RecordUsage[Record Usage]
        RecordUsage --> Assign[Assign to Worker]
        Complete[Job Complete] --> ReleaseUsage[Release Usage]
        subgraph "Quota Manager"
            Global[Global GPU Counter]
            PerUser[Per-User Tracking]
            PerPlugin[Per-Plugin Tracking]
            Overrides[User Overrides]
        end
        CheckQuota --> Global
        CheckQuota --> PerUser
        CheckQuota --> PerPlugin
        CheckQuota --> Overrides
    end
```
### Components
- **PluginQuotaConfig**: Configuration for all quota limits and overrides
- **PluginQuotaManager**: Thread-safe manager for tracking and enforcing quotas
- **Integration Points**:
- `SubmitJob()`: Validates quotas before accepting service jobs
- `handleJobAccepted()`: Records usage when jobs are assigned
- `handleJobResult()`: Releases usage when jobs complete
### Usage
Jobs must include `user_id` and `plugin_name` metadata for quota tracking:
```go
spec := scheduler.JobSpec{
    Type:     scheduler.JobTypeService,
    UserID:   "user123",
    GPUCount: 2,
    Metadata: map[string]string{
        "plugin_name": "jupyter",
    },
}
```
## Zig CLI Architecture
### Component Structure
@@ -865,3 +931,13 @@ graph TB
---
This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.
## See Also
- **[Scheduler Architecture](scheduler-architecture.md)** - Detailed scheduler design and protocols
- **[Security Guide](security.md)** - Security architecture and best practices
- **[Configuration Reference](configuration-reference.md)** - Configuration options and environment variables
- **[Deployment Guide](deployment.md)** - Production deployment architecture
- **[Performance & Monitoring](performance-monitoring.md)** - Metrics and observability
- **[Research Runner Plan](research-runner-plan.md)** - Roadmap and implementation phases
- **[Native Libraries](native-libraries.md)** - C++ performance optimizations


@@ -664,6 +664,45 @@ api_key = "<analyst-api-key>"
| `resources.podman_cpus` | string | "2" | CPU limit for Podman containers |
| `resources.podman_memory` | string | "4Gi" | Memory limit for Podman containers |
### Plugin GPU Quotas
Control GPU allocation for plugin-based services (Jupyter, vLLM, etc.).
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `scheduler.plugin_quota.enabled` | bool | false | Enable plugin GPU quota enforcement |
| `scheduler.plugin_quota.total_gpus` | int | 0 | Global GPU limit across all plugins (0 = unlimited) |
| `scheduler.plugin_quota.per_user_gpus` | int | 0 | Default per-user GPU limit (0 = unlimited) |
| `scheduler.plugin_quota.per_user_services` | int | 0 | Default per-user service count limit (0 = unlimited) |
| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_gpus` | int | 0 | Plugin-specific GPU limit |
| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_services` | int | 0 | Plugin-specific service count limit |
| `scheduler.plugin_quota.user_overrides.{user}.max_gpus` | int | 0 | Per-user GPU override |
| `scheduler.plugin_quota.user_overrides.{user}.max_services` | int | 0 | Per-user service limit override |
| `scheduler.plugin_quota.user_overrides.{user}.allowed_plugins` | array | [] | Plugins user is allowed to use (empty = all) |
**Example configuration:**
```yaml
scheduler:
  plugin_quota:
    enabled: true
    total_gpus: 16
    per_user_gpus: 4
    per_user_services: 2
    per_plugin_limits:
      vllm:
        max_gpus: 8
        max_services: 4
      jupyter:
        max_gpus: 4
        max_services: 10
    user_overrides:
      admin:
        max_gpus: 8
        max_services: 5
        allowed_plugins: ["jupyter", "vllm"]
```
### Redis
| Option | Type | Default | Description |
@@ -746,4 +785,16 @@ go run cmd/api-server/main.go --config configs/api/dev.yaml --validate
# Test CLI configuration
./cli/zig-out/bin/ml status --debug
```
---
## See Also
- **[Architecture](architecture.md)** - System architecture overview
- **[Scheduler Architecture](scheduler-architecture.md)** - Scheduler configuration details
- **[Environment Variables](environment-variables.md)** - Additional environment variable documentation
- **[Security Guide](security.md)** - Security-related configuration
- **[Deployment Guide](deployment.md)** - Production configuration guidance
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter service configuration
- **[vLLM Workflow](vllm-workflow.md)** - vLLM service configuration


@@ -620,8 +620,10 @@ Common error codes in binary responses:
## See Also
- **[vLLM Workflow](vllm-workflow.md)** - LLM inference services (complementary to Jupyter)
- **[Scheduler Architecture](scheduler-architecture.md)** - How Jupyter services are scheduled
- **[Configuration Reference](configuration-reference.md)** - Service configuration options
- **[Testing Guide](testing.md)** - Testing Jupyter workflows
- **[Deployment Guide](deployment.md)** - Production deployment
- **[Security Guide](security.md)** - Security best practices
- **[API Reference](api-key-process.md)** - API documentation
- **[CLI Reference](cli-reference.md)** - Command-line tools


@@ -43,9 +43,11 @@ make test-unit
### 🛠️ Development
- [**Architecture**](architecture.md) - System architecture and design
- [**Scheduler Architecture**](scheduler-architecture.md) - Job scheduler and service management
- [**CLI Reference**](cli-reference.md) - Command-line interface documentation
- [**Testing Guide**](testing.md) - Testing procedures and guidelines
- [**Jupyter Workflow**](jupyter-workflow.md) - CLI and Jupyter integration
- [**Jupyter Workflow**](jupyter-workflow.md) - Jupyter notebook services
- [**vLLM Workflow**](vllm-workflow.md) - LLM inference services
- [**Queue System**](queue.md) - Job queue implementation
### 🏭 Production Deployment


@@ -329,4 +329,13 @@ make help # Show all available commands
---
*Ready in minutes!*
## See Also
- **[Architecture](architecture.md)** - System architecture overview
- **[Scheduler Architecture](scheduler-architecture.md)** - Job scheduling and service management
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter notebook services
- **[vLLM Workflow](vllm-workflow.md)** - LLM inference services
- **[Configuration Reference](configuration-reference.md)** - Configuration options
- **[Security Guide](security.md)** - Security best practices


@@ -169,6 +169,52 @@ The scheduler persists state for crash recovery:
State is replayed on startup via `StateStore.Replay()`.
## Service Templates
The scheduler provides built-in service templates for common ML workloads:
### Available Templates
| Template | Description | Default Port Range |
|----------|-------------|-------------------|
| **JupyterLab** | Interactive Jupyter environment | 8000-9000 |
| **Jupyter Notebook** | Classic Jupyter notebooks | 8000-9000 |
| **vLLM** | OpenAI-compatible LLM inference server | 8000-9000 |
### Port Allocation
Dynamic port management for service instances:
```go
type PortAllocator struct {
    startPort int               // Default: 8000
    endPort   int               // Default: 9000
    allocated map[int]time.Time // Port -> allocation time
}
```
**Features:**
- Automatic port selection from configured range
- TTL-based port reclamation
- Thread-safe concurrent allocations
- Exhaustion handling with clear error messages
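The features above can be sketched in a minimal Python analogue of the allocator; the actual implementation is the Go struct shown, so the TTL value and method names here are assumptions:

```python
import time

class PortAllocatorSketch:
    """Illustrative analogue of the Go PortAllocator above."""
    def __init__(self, start=8000, end=9000, ttl_secs=3600):
        self.start, self.end, self.ttl = start, end, ttl_secs
        self.allocated = {}  # port -> allocation time

    def allocate(self, now=None):
        now = time.time() if now is None else now
        # TTL-based reclamation: drop allocations older than the TTL
        self.allocated = {p: t for p, t in self.allocated.items()
                          if now - t < self.ttl}
        # Automatic selection of the first free port in range
        for port in range(self.start, self.end + 1):
            if port not in self.allocated:
                self.allocated[port] = now
                return port
        # Exhaustion handling with a clear error message
        raise RuntimeError(f"port range {self.start}-{self.end} exhausted")

    def release(self, port):
        self.allocated.pop(port, None)
```

Thread safety is omitted here; the Go version guards the map with a mutex.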
### Template Variables
Service templates support dynamic variable substitution:
| Variable | Description | Example |
|----------|-------------|---------|
| `{{SERVICE_PORT}}` | Allocated port for the service | `8080` |
| `{{WORKER_ID}}` | ID of the assigned worker | `worker-1` |
| `{{TASK_ID}}` | Unique task identifier | `task-abc123` |
| `{{SECRET:xxx}}` | Secret reference from keychain | `api-key-value` |
| `{{MODEL_NAME}}` | ML model name (vLLM) | `llama-2-7b` |
| `{{GPU_COUNT}}` | Number of GPUs allocated | `2` |
| `{{GPU_DEVICES}}` | Specific GPU device IDs | `0,1` |
| `{{MODEL_CACHE}}` | Path to model cache directory | `/models` |
| `{{WORKSPACE}}` | Working directory path | `/workspace` |
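Substitution can be sketched as a small renderer. The `render_template` helper, its regex, and the sample `vllm serve` command line below are illustrative — the real substitution happens inside the Go scheduler:

```python
import re

def render_template(template, variables, secrets):
    """Hypothetical renderer for the {{VAR}} / {{SECRET:name}} syntax."""
    def sub(match):
        key = match.group(1)
        if key.startswith("SECRET:"):
            # Secret references are resolved from the keychain store
            return secrets[key[len("SECRET:"):]]
        return str(variables[key])
    return re.sub(r"\{\{([A-Za-z0-9_:-]+)\}\}", sub, template)

cmd = render_template(
    "vllm serve {{MODEL_NAME}} --port {{SERVICE_PORT}} "
    "--api-key {{SECRET:vllm-key}}",
    {"MODEL_NAME": "llama-2-7b", "SERVICE_PORT": 8080},
    {"vllm-key": "api-key-value"},
)
```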
## API Methods
```go
@@ -188,7 +234,52 @@ func (h *SchedulerHub) Start() error
func (h *SchedulerHub) Stop()
```
## Configuration
## Audit Integration
The scheduler integrates with the audit logging system for security and compliance:
### Audit Logger Integration
```go
type SchedulerHub struct {
    // ... other fields ...
    auditor *audit.Logger // Security audit logger
}
```
**Initialization:**
```go
auditor := audit.NewLogger(audit.Config{
    LogPath: "/var/log/fetch_ml/scheduler_audit.log",
    Enabled: true,
})
hub, err := scheduler.NewHub(config, auditor)
```
### Audit Events
The scheduler logs the following audit events:
| Event | Description | Fields Logged |
|-------|-------------|---------------|
| `job_submitted` | New job queued | job_id, user_id, job_type, gpu_count |
| `job_assigned` | Job assigned to worker | job_id, worker_id, assignment_time |
| `job_accepted` | Worker accepted job | job_id, worker_id, acceptance_time |
| `job_completed` | Job finished successfully | job_id, worker_id, duration |
| `job_failed` | Job failed | job_id, worker_id, error_code |
| `job_cancelled` | Job cancelled | job_id, cancelled_by, reason |
| `worker_registered` | Worker connected | worker_id, capabilities, timestamp |
| `worker_disconnected` | Worker disconnected | worker_id, duration_connected |
| `quota_exceeded` | GPU quota violation | user_id, plugin_name, requested, limit |
### Tamper-Evident Logging
Audit logs use chain hashing for integrity:
- Each event includes SHA-256 hash of previous event
- Chain verification detects log tampering
- Separate log file from operational logs
### Configuration
```go
type HubConfig struct {
@@ -203,6 +294,7 @@ type HubConfig struct {
    GangAllocTimeoutSecs  int               // Multi-node allocation timeout
    AcceptanceTimeoutSecs int               // Job acceptance timeout
    WorkerTokens          map[string]string // Authentication tokens
    PluginQuota           PluginQuotaConfig // Plugin GPU quota configuration
}
```
@@ -214,6 +306,11 @@ Process management is abstracted for Unix/Windows:
## See Also
- `internal/scheduler/hub.go` - Core implementation
- `tests/fixtures/scheduler_fixture.go` - Test infrastructure
- `docs/src/native-libraries.md` - Native C++ performance libraries
- **[Architecture Overview](architecture.md)** - High-level system architecture
- **[Security Guide](security.md)** - Audit logging and security features
- **[Configuration Reference](configuration-reference.md)** - Plugin GPU quotas and scheduler config
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter service integration with scheduler
- **[vLLM Workflow](vllm-workflow.md)** - vLLM service integration with scheduler
- **[Testing Guide](testing.md)** - Testing the scheduler
- **`internal/scheduler/hub.go`** - Core implementation
- **`tests/fixtures/scheduler_fixture.go`** - Test infrastructure


@@ -112,27 +112,164 @@ The system detects and rejects plaintext secrets using:
### HIPAA-Compliant Audit Logging
**Tamper-Evident Logging:**
FetchML implements comprehensive HIPAA-compliant audit logging with tamper-evident chain hashing for healthcare and regulated environments.
**Architecture:**
```go
// Each event includes chain hash for integrity
audit.Log(audit.Event{
// Audit logger initialization
auditor := audit.NewLogger(audit.Config{
Enabled: true,
LogPath: "/var/log/fetch_ml/audit.log",
})
// Logging an event
auditor.Log(audit.Event{
EventType: audit.EventFileRead,
UserID: "user1",
Resource: "/data/file.txt",
UserID: "user123",
Resource: "/data/patient_records/file.txt",
IPAddress: "10.0.0.5",
Success: true,
Metadata: map[string]any{
"file_size": 1024,
"checksum": "abc123...",
},
})
```
**Event Types:**
- `file_read` - File access logged
- `file_write` - File modification logged
- `file_delete` - File deletion logged
- `auth_success` / `auth_failure` - Authentication events
- `job_queued` / `job_started` / `job_completed` - Job lifecycle
**Chain Hashing:**
- Each event includes SHA-256 hash of previous event
**Tamper-Evident Chain Hashing:**
- Each event includes SHA-256 hash of the previous event (PrevHash)
- Event hash covers all fields including PrevHash (chaining)
- Modification of any log entry breaks the chain
- `VerifyChain()` function detects tampering
- Separate `VerifyChain()` function detects tampering
- Monotonic sequence numbers prevent deletion attacks
```go
// Verify audit chain integrity
valid, err := audit.VerifyChain("/var/log/fetch_ml/audit.log")
if err != nil || !valid {
    log.Fatal("AUDIT TAMPERING DETECTED")
}
```
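The chain construction itself can be sketched in a few lines of Python. Field names follow the log format used in this guide, but this is an illustrative model of the technique, not the FetchML implementation:

```python
import hashlib
import json

def append_event(log, event):
    """Append an event, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["event_hash"] if log else "0" * 64
    # Monotonic sequence numbers make silent deletion detectable
    entry = dict(event, sequence_num=len(log), prev_hash=prev_hash)
    payload = json.dumps(entry, sort_keys=True).encode()
    # The event hash covers all fields including prev_hash (chaining)
    entry["event_hash"] = hashlib.sha256(payload).hexdigest()
    log.append(entry)

def verify_chain(log):
    """Recompute every hash; any modified entry breaks the chain."""
    prev_hash = "0" * 64
    for i, entry in enumerate(log):
        if entry["sequence_num"] != i or entry["prev_hash"] != prev_hash:
            return False
        body = {k: v for k, v in entry.items() if k != "event_hash"}
        payload = json.dumps(body, sort_keys=True).encode()
        if hashlib.sha256(payload).hexdigest() != entry["event_hash"]:
            return False
        prev_hash = entry["event_hash"]
    return True
```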
**HIPAA-Specific Event Types:**
| Event Type | HIPAA Relevance | Fields Logged |
|------------|-----------------|---------------|
| `file_read` | Access to PHI | user_id, file_path, ip_address, timestamp, checksum |
| `file_write` | Modification of PHI | user_id, file_path, bytes_written, prev_checksum, new_checksum |
| `file_delete` | Deletion of PHI | user_id, file_path, deletion_type (soft/hard) |
| `dataset_access` | Bulk data access | user_id, dataset_id, record_count, access_purpose |
| `authentication_success` | Access control | user_id, auth_method, ip_address, mfa_used |
| `authentication_failure` | Failed access attempts | attempted_user, ip_address, failure_reason, attempt_count |
| `job_queued` | Processing PHI | user_id, job_id, input_data_classification |
| `job_started` | PHI processing begun | job_id, worker_id, data_accessed |
| `job_completed` | PHI processing complete | job_id, output_location, data_disposition |
**Standard Event Types:**
| Event Type | Description | Use Case |
|------------|-------------|----------|
| `authentication_attempt` | Login attempt (pre-validation) | Brute force detection |
| `authentication_success` | Successful login | Access tracking |
| `authentication_failure` | Failed login | Security monitoring |
| `job_queued` | Job submitted to queue | Workflow tracking |
| `job_started` | Job execution begun | Performance monitoring |
| `job_completed` | Job finished successfully | Completion tracking |
| `job_failed` | Job execution failed | Error tracking |
| `jupyter_start` | Jupyter service started | Resource tracking |
| `jupyter_stop` | Jupyter service stopped | Session tracking |
| `experiment_created` | Experiment initialized | Provenance tracking |
| `experiment_deleted` | Experiment removed | Data lifecycle |
**Scheduler Audit Integration:**
The scheduler automatically logs these events:
- `job_submitted` - Job queued (includes user_id, job_type, gpu_count)
- `job_assigned` - Job assigned to worker (worker_id, assignment_time)
- `job_accepted` - Worker confirmed job execution
- `job_completed` / `job_failed` / `job_cancelled` - Job terminal states
- `worker_registered` - Worker connected to scheduler
- `worker_disconnected` - Worker disconnected
- `quota_exceeded` - GPU quota violation attempt
**Audit Log Format:**
```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "event_type": "file_read",
  "user_id": "researcher1",
  "ip_address": "10.0.0.5",
  "resource": "/data/experiments/run_001/results.csv",
  "action": "read",
  "success": true,
  "sequence_num": 15423,
  "prev_hash": "a1b2c3d4...",
  "event_hash": "e5f6a7b8...",
  "metadata": {
    "file_size": 1048576,
    "checksum": "sha256:abc123...",
    "access_duration_ms": 150
  }
}
```
**Log Storage and Rotation:**
- Default location: `/var/log/fetch_ml/audit.log`
- Automatic rotation by size (100MB) or time (daily)
- Retention policy: Configurable (default: 7 years for HIPAA)
- Immutable storage: Append-only with filesystem-level protection
**Compliance Features:**
- **User Identification**: Every event includes `user_id` for accountability
- **Timestamp Precision**: RFC3339 nanosecond precision timestamps
- **IP Address Tracking**: Source IP for all network events
- **Success/Failure Tracking**: Boolean success field for all operations
- **Metadata Flexibility**: Extensible key-value metadata for domain-specific data
- **Immutable Logging**: Append-only files with filesystem protections
- **Chain Verification**: Cryptographic proof of log integrity
- **Sealed Logs**: Optional GPG signing for regulatory submissions
**Audit Log Analysis:**
```bash
# View recent audit events
tail -f /var/log/fetch_ml/audit.log | jq '.'
# Search for specific user activity
grep '"user_id":"researcher1"' /var/log/fetch_ml/audit.log | jq '.'
# Find all file access events
jq 'select(.event_type == "file_read")' /var/log/fetch_ml/audit.log
# Detect failed authentication attempts
jq 'select(.event_type == "authentication_failure")' /var/log/fetch_ml/audit.log
# Verify audit chain integrity
./cli/zig-out/bin/ml audit verify /var/log/fetch_ml/audit.log
# Export audit report for compliance
./cli/zig-out/bin/ml audit export --start 2024-01-01 --end 2024-01-31 --format csv
```
**Regulatory Compliance:**
| Regulation | Requirement | FetchML Implementation |
|------------|-------------|------------------------|
| **HIPAA** | Access logging, tamper evidence | Chain hashing, file access events, user tracking |
| **GDPR** | Data subject access, right to deletion | Full audit trail, deletion events with chain preservation |
| **SOX** | Financial controls, audit trail | Immutable logs, separation of duties via RBAC |
| **21 CFR Part 11** | Electronic records integrity | Tamper-evident logging, user authentication, timestamps |
| **PCI DSS** | Access logging, data protection | Audit trails, encryption, access controls |
**Best Practices:**
1. **Enable Audit Logging**: Always enable in production
2. **Separate Storage**: Store audit logs on separate volume from application data
3. **Regular Verification**: Run chain verification daily
4. **Backup Strategy**: Include audit logs in backup procedures
5. **Access Control**: Restrict audit log access to security personnel only
6. **Monitoring**: Set up alerts for suspicious patterns (multiple failed logins, after-hours access)
---
@@ -420,3 +557,16 @@ All API access is logged with:
- **Security Issues**: Report privately via email
- **Questions**: See documentation or create issue
- **Updates**: Monitor releases for security patches
---
## See Also
- **[Privacy & Security](privacy-security.md)** - PII detection and privacy controls
- **[Multi-Tenant Security](multi-tenant-security.md)** - Tenant isolation and cross-tenant access prevention
- **[API Key Process](api-key-process.md)** - Generate and manage API keys
- **[User Permissions](user-permissions.md)** - Role-based access control
- **[Runtime Security](runtime-security.md)** - Container sandboxing and seccomp profiles
- **[Scheduler Architecture](scheduler-architecture.md)** - Audit integration in the scheduler
- **[Configuration Reference](configuration-reference.md)** - Security-related configuration options
- **[Deployment Guide](deployment.md)** - Production security hardening

docs/src/vllm-workflow.md (new file, 581 lines)

@@ -0,0 +1,581 @@
# vLLM Inference Service Guide
Comprehensive guide to deploying and managing OpenAI-compatible LLM inference services using vLLM in FetchML.
## Overview
The vLLM plugin provides high-performance LLM inference with:
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI's API
- **Advanced Scheduling**: Continuous batching for throughput optimization
- **GPU Optimization**: Tensor parallelism and quantization support
- **Model Management**: Automatic model downloading and caching
- **Quantization**: AWQ, GPTQ, FP8, and SqueezeLLM support
## Quick Start
### Start vLLM Service
```bash
# Start development stack
make dev-up
# Start vLLM service with default model
./cli/zig-out/bin/ml service start vllm --name llm-server --model meta-llama/Llama-2-7b-chat-hf
# Or with specific GPU requirements
./cli/zig-out/bin/ml service start vllm \
--name llm-server \
--model meta-llama/Llama-2-7b-chat-hf \
--gpu-count 1 \
--quantization awq
# Access the API
open http://localhost:8000/docs
```
### Using the API
```python
import openai
# Point to local vLLM instance
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)
# Chat completion
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ]
)
print(response.choices[0].message.content)
```
## Service Management
### Creating vLLM Services
```bash
# Create basic vLLM service
./cli/zig-out/bin/ml service start vllm --name my-llm
# Create with specific model
./cli/zig-out/bin/ml service start vllm \
--name my-llm \
--model microsoft/DialoGPT-medium
# Create with resource constraints
./cli/zig-out/bin/ml service start vllm \
--name production-llm \
--model meta-llama/Llama-2-13b-chat-hf \
--gpu-count 2 \
--quantization gptq \
--max-model-len 4096
# List all vLLM services
./cli/zig-out/bin/ml service list
# Service details
./cli/zig-out/bin/ml service info my-llm
```
### Service Configuration
**Resource Allocation:**
```yaml
# vllm-config.yaml
resources:
  gpu_count: 1
  gpu_memory: 24gb
  cpu: 4
  memory: 16g
model:
  name: "meta-llama/Llama-2-7b-chat-hf"
  quantization: "awq"  # Options: awq, gptq, squeezellm, fp8
  trust_remote_code: false
  max_model_len: 4096
serving:
  port: 8000
  host: "0.0.0.0"
  tensor_parallel_size: 1
  dtype: "auto"  # auto, half, bfloat16, float
optimization:
  enable_prefix_caching: true
  swap_space: 4  # GB
  max_num_batched_tokens: 4096
  max_num_seqs: 256
```
**Environment Variables:**
```bash
# Model cache location
export VLLM_MODEL_CACHE=/models
# HuggingFace token for gated models
export HUGGING_FACE_HUB_TOKEN=your_token_here
# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1
```
### Service Lifecycle
```bash
# Start a service
./cli/zig-out/bin/ml service start vllm --name my-llm
# Stop a service (graceful shutdown)
./cli/zig-out/bin/ml service stop my-llm
# Restart a service
./cli/zig-out/bin/ml service restart my-llm
# Remove a service (stops and deletes)
./cli/zig-out/bin/ml service remove my-llm
# View service logs
./cli/zig-out/bin/ml service logs my-llm --follow
# Check service health
./cli/zig-out/bin/ml service health my-llm
```
## Model Management
### Supported Models
vLLM supports most HuggingFace Transformers models:
- **Llama 2/3**: `meta-llama/Llama-2-7b-chat-hf`, `meta-llama/Llama-2-70b-chat-hf`
- **Mistral**: `mistralai/Mistral-7B-Instruct-v0.2`
- **Mixtral**: `mistralai/Mixtral-8x7B-Instruct-v0.1`
- **Falcon**: `tiiuae/falcon-7b-instruct`
- **CodeLlama**: `codellama/CodeLlama-7b-hf`
- **Phi**: `microsoft/phi-2`
- **Qwen**: `Qwen/Qwen-7B-Chat`
- **Gemma**: `google/gemma-7b-it`
### Model Caching
Models are automatically cached to avoid repeated downloads:
```bash
# Default cache location
~/.cache/huggingface/hub/
# Custom cache location
export VLLM_MODEL_CACHE=/mnt/fast-storage/models
# Pre-download models
./cli/zig-out/bin/ml service prefetch --model meta-llama/Llama-2-7b-chat-hf
```
### Quantization
Quantization reduces memory usage and improves inference speed:
```bash
# AWQ (4-bit quantization)
./cli/zig-out/bin/ml service start vllm \
--name llm-awq \
--model TheBloke/Llama-2-7B-AWQ \
--quantization awq
# GPTQ (4-bit quantization)
./cli/zig-out/bin/ml service start vllm \
--name llm-gptq \
--model TheBloke/Llama-2-7B-GPTQ \
--quantization gptq
# FP8 (8-bit floating point)
./cli/zig-out/bin/ml service start vllm \
--name llm-fp8 \
--model meta-llama/Llama-2-7b-chat-hf \
--quantization fp8
```
**Quantization Comparison:**
| Method | Bits | Memory Reduction | Speed Impact | Quality |
|--------|------|------------------|--------------|---------|
| None (FP16) | 16 | 1x | Baseline | Best |
| FP8 | 8 | 2x | Faster | Excellent |
| AWQ | 4 | 4x | Fast | Very Good |
| GPTQ | 4 | 4x | Fast | Very Good |
| SqueezeLLM | 4 | 4x | Fast | Good |
## API Reference
### OpenAI-Compatible Endpoints
vLLM provides OpenAI-compatible REST API endpoints:
**Chat Completions:**
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{"role": "user", "content": "Hello!"}
],
"max_tokens": 100,
"temperature": 0.7
}'
```
**Completions (Legacy):**
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"prompt": "The capital of France is",
"max_tokens": 10
}'
```
**Embeddings:**
```bash
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"input": "Hello world"
}'
```
**List Models:**
```bash
curl http://localhost:8000/v1/models
```
### Streaming Responses
Enable streaming for real-time token generation:
```python
import openai
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True,
    max_tokens=200
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
### Advanced Parameters
```python
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=messages,
    # Standard OpenAI generation parameters
    max_tokens=500,
    temperature=0.7,
    top_p=0.9,
    frequency_penalty=0.5,
    presence_penalty=0.5,
    # Sampling
    seed=42,
    stop=["END", "STOP"],
    # vLLM-specific options go through extra_body, since the OpenAI
    # client rejects unknown keyword arguments
    extra_body={
        "top_k": 40,
        "repetition_penalty": 1.1,
        "best_of": 1,
        "use_beam_search": False,
    },
)
```
## GPU Quotas and Resource Management
### Per-User GPU Limits
The scheduler enforces GPU quotas for vLLM services:
```yaml
# scheduler-config.yaml
scheduler:
  plugin_quota:
    enabled: true
    total_gpus: 16
    per_user_gpus: 4
    per_user_services: 2
    per_plugin_limits:
      vllm:
        max_gpus: 8
        max_services: 4
    user_overrides:
      admin:
        max_gpus: 8
        max_services: 5
        allowed_plugins: ["vllm", "jupyter"]
```
### Resource Monitoring
```bash
# Check GPU allocation for your user
./cli/zig-out/bin/ml service quota
# View current usage
./cli/zig-out/bin/ml service usage
# Monitor service resource usage
./cli/zig-out/bin/ml service stats my-llm
```
## Multi-GPU and Distributed Inference
### Tensor Parallelism
For large models that don't fit on a single GPU:
```bash
# 70B model across 4 GPUs
./cli/zig-out/bin/ml service start vllm \
--name llm-70b \
--model meta-llama/Llama-2-70b-chat-hf \
--gpu-count 4 \
--tensor-parallel-size 4
```
### Pipeline Parallelism
For very large models with pipeline stages:
```yaml
# Pipeline parallelism config
model:
  name: "meta-llama/Llama-2-70b-chat-hf"
serving:
  tensor_parallel_size: 2
  pipeline_parallel_size: 2  # Total 4 GPUs
```
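As a sizing sanity check: the GPU count a service needs is the product of the two parallelism degrees.

```python
def total_gpus(tensor_parallel_size, pipeline_parallel_size=1):
    """GPUs required = tensor-parallel degree x pipeline-parallel degree."""
    return tensor_parallel_size * pipeline_parallel_size

# The config above: tensor_parallel_size=2, pipeline_parallel_size=2
needed = total_gpus(2, 2)  # 4 GPUs total
```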
## Integration with Experiments
### Using vLLM from Training Jobs
```python
# In your training script
import requests
# Call local vLLM service
response = requests.post(
    "http://vllm-service:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Summarize this text"}]
    }
)
result = response.json()
summary = result["choices"][0]["message"]["content"]
```
### Linking with Experiments
```bash
# Start vLLM service linked to experiment
./cli/zig-out/bin/ml service start vllm \
--name llm-exp-1 \
--model meta-llama/Llama-2-7b-chat-hf \
--experiment experiment-id
# View linked services
./cli/zig-out/bin/ml service list --experiment experiment-id
```
## Security and Access Control
### Network Isolation
```bash
# Restrict to internal network only
./cli/zig-out/bin/ml service start vllm \
--name internal-llm \
--model meta-llama/Llama-2-7b-chat-hf \
--host 10.0.0.1 \
--port 8000
```
### API Key Authentication
```yaml
# vllm-security.yaml
auth:
  api_key_required: true
  allowed_ips:
    - "10.0.0.0/8"
    - "192.168.0.0/16"
rate_limit:
  requests_per_minute: 60
  tokens_per_minute: 10000
```
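A `requests_per_minute` limit like the one above is typically enforced with a token bucket. This is a minimal sketch of that technique, not FetchML's actual limiter:

```python
import time

class TokenBucket:
    """Refills at per_minute/60 tokens per second, up to per_minute."""
    def __init__(self, per_minute):
        self.capacity = per_minute
        self.tokens = float(per_minute)
        self.rate = per_minute / 60.0  # refill rate per second
        self.last = time.monotonic()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill based on elapsed time, capped at bucket capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # request should be rejected with 429
```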
### Audit Trail
All API calls are logged for compliance:
```bash
# View audit log
./cli/zig-out/bin/ml service audit my-llm
# Export audit report
./cli/zig-out/bin/ml service audit my-llm --export=csv
# Check access patterns
./cli/zig-out/bin/ml service audit my-llm --summary
```
## Monitoring and Troubleshooting
### Health Checks
```bash
# Check service health
./cli/zig-out/bin/ml service health my-llm
# Detailed diagnostics
./cli/zig-out/bin/ml service diagnose my-llm
# View service status
./cli/zig-out/bin/ml service status my-llm
```
### Performance Monitoring
```bash
# Real-time metrics
./cli/zig-out/bin/ml service monitor my-llm
# Performance report
./cli/zig-out/bin/ml service report my-llm --format=html
# GPU utilization
./cli/zig-out/bin/ml service stats my-llm --gpu
```
### Common Issues
**Out of Memory:**
```bash
# Reduce batch size
./cli/zig-out/bin/ml service update my-llm --max-num-seqs 128
# Enable quantization
./cli/zig-out/bin/ml service update my-llm --quantization awq
# Reduce GPU memory fraction
export VLLM_GPU_MEMORY_FRACTION=0.85
```
**Model Download Failures:**
```bash
# Set HuggingFace token
export HUGGING_FACE_HUB_TOKEN=your_token
# Use mirror
export HF_ENDPOINT=https://hf-mirror.com
# Pre-download with retry
./cli/zig-out/bin/ml service prefetch --model meta-llama/Llama-2-7b-chat-hf --retry
```
**Slow Inference:**
```bash
# Enable prefix caching
./cli/zig-out/bin/ml service update my-llm --enable-prefix-caching
# Increase batch size
./cli/zig-out/bin/ml service update my-llm --max-num-batched-tokens 8192
# Check GPU utilization
nvidia-smi dmon -s u
```
## Best Practices
### Resource Planning
1. **GPU Memory Calculation**: Model size × precision × overhead (1.2-1.5x)
2. **Batch Size Tuning**: Balance throughput vs. latency
3. **Quantization**: Use AWQ/GPTQ for production, FP16 for best quality
4. **Prefix Caching**: Enable for chat applications with repeated prompts
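The memory rule of thumb in point 1 can be turned into a quick estimator. The 1.3x overhead factor is an illustrative pick from the stated 1.2-1.5x range, and KV cache is extra:

```python
def weight_memory_gb(params_billion, bits, overhead=1.3):
    """Model size x precision x overhead: params * (bits/8) bytes each."""
    return params_billion * bits / 8 * overhead

fp16 = weight_memory_gb(7, 16)  # ~18.2 GB -> needs a 24 GB GPU
awq = weight_memory_gb(7, 4)    # ~4.6 GB -> fits on much smaller cards
```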
### Production Deployment
1. **Load Balancing**: Deploy multiple vLLM instances behind a load balancer
2. **Health Checks**: Configure Kubernetes liveness/readiness probes
3. **Autoscaling**: Scale based on queue depth or GPU utilization
4. **Monitoring**: Track tokens/sec, queue depth, and error rates
### Security
1. **Network Segmentation**: Isolate vLLM on internal network
2. **Rate Limiting**: Prevent abuse with per-user quotas
3. **Input Validation**: Sanitize prompts to prevent injection attacks
4. **Audit Logging**: Enable comprehensive audit trails
## CLI Reference
### Service Commands
```bash
# Start a service
ml service start vllm [flags]
  --name string                Service name (required)
  --model string               Model name or path (default: "meta-llama/Llama-2-7b-chat-hf")
  --gpu-count int              Number of GPUs (default: 1)
  --quantization string        Quantization method (awq, gptq, fp8, squeezellm)
  --port int                   Service port (default: 8000)
  --max-model-len int          Maximum sequence length
  --tensor-parallel-size int   Tensor parallelism degree

# List services
ml service list [flags]
  --format string   Output format (table, json)
  --all             Show all users' services (admin only)

# Service operations
ml service stop <name>
ml service start <name>      # Restart a stopped service
ml service restart <name>
ml service remove <name>
ml service logs <name> [flags]
  --follow     Follow log output
  --tail int   Number of lines to show (default: 100)
ml service info <name>
ml service health <name>
```
## See Also
- **[Testing Guide](testing.md)** - Testing vLLM services
- **[Deployment Guide](deployment.md)** - Production deployment
- **[Security Guide](security.md)** - Security best practices
- **[Scheduler Architecture](scheduler-architecture.md)** - How vLLM integrates with scheduler
- **[CLI Reference](cli-reference.md)** - Command-line tools
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter integration with vLLM