docs: add vLLM workflow and cross-link documentation
Some checks failed
Security Scan / Security Analysis (push) Waiting to run
Security Scan / Native Library Security (push) Waiting to run
Verification & Maintenance / V.1 - Schema Drift Detection (push) Waiting to run
Verification & Maintenance / V.4 - Custom Go Vet Analyzers (push) Waiting to run
Verification & Maintenance / V.7 - Audit Chain Integrity (push) Waiting to run
Verification & Maintenance / V.6 - Extended Security Scanning (push) Waiting to run
Verification & Maintenance / V.10 - OpenSSF Scorecard (push) Waiting to run
Verification & Maintenance / Verification Summary (push) Blocked by required conditions
Build Pipeline / Build Binaries (push) Failing after 2m4s
Build Pipeline / Build Docker Images (push) Has been skipped
Build Pipeline / Sign HIPAA Config (push) Has been skipped
Build Pipeline / Generate SLSA Provenance (push) Has been skipped
Checkout test / test (push) Successful in 5s
CI Pipeline / Test (push) Failing after 1s
CI Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI Pipeline / Security Scan (push) Has been skipped
CI Pipeline / Test Scripts (push) Has been skipped
CI Pipeline / Test Native Libraries (push) Has been skipped
CI Pipeline / Native Library Build Matrix (push) Has been skipped
Contract Tests / Spec Drift Detection (push) Failing after 16s
Contract Tests / API Contract Tests (push) Has been skipped
Deploy API Docs / Build API Documentation (push) Failing after 5s
Deploy API Docs / Deploy to GitHub Pages (push) Has been skipped
Documentation / build-and-publish (push) Failing after 44s
CI Pipeline / Trigger Build Workflow (push) Failing after 0s
- Add new vLLM workflow documentation (vllm-workflow.md)
- Update scheduler-architecture.md with Plugin GPU Quota and audit logging
- Add See Also sections to jupyter-workflow.md, quick-start.md, configuration-reference.md for better navigation
- Update landing page and index with vLLM and scheduler links
- Cross-link all documentation for improved discoverability
This commit is contained in:
parent
8f2495deb0
commit
90ea18555c
9 changed files with 999 additions and 29 deletions
@@ -40,12 +40,14 @@ make test-unit

- [Environment Variables](environment-variables.md) - Configuration options
- [Smart Defaults](smart-defaults.md) - Default configuration settings

### 🛠️ Development

- **[Architecture](architecture.md)** - System architecture and design
- **[Scheduler Architecture](scheduler-architecture.md)** - Job scheduler and service management
- **[CLI Reference](cli-reference.md)** - Command-line interface documentation
- **[Testing Guide](testing.md)** - Testing procedures and guidelines
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter notebook services
- **[vLLM Workflow](vllm-workflow.md)** - LLM inference services
- **[Queue System](queue.md)** - Job queue implementation

### Production Deployment

- [Deployment Guide](deployment.md) - Production deployment instructions
@@ -244,6 +244,72 @@ Plugins can be configured via worker configuration under `plugins`, including:

- `mode`
- per-plugin paths/settings (e.g., artifact base path, log base path)

## Plugin GPU Quota System

The scheduler includes a GPU quota management system for plugin-based services (Jupyter, vLLM, etc.) that controls resource allocation across users and plugins.

### Quota Enforcement

The quota system enforces limits at multiple levels:

1. **Global GPU Limit**: Total GPUs available across all plugins
2. **Per-User GPU Limit**: Maximum GPUs a single user can allocate
3. **Per-User Service Limit**: Maximum number of service instances per user
4. **Plugin-Specific Limits**: Separate limits for each plugin type
5. **User Overrides**: Custom limits for specific users with allowed plugin restrictions
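The enforcement order above can be sketched in a few lines of Go. This is a minimal, self-contained illustration of checking global and per-user limits in sequence; the type and method names are assumptions, not the actual `PluginQuotaManager` API:

```go
package main

import (
	"fmt"
	"sync"
)

// QuotaManager is an illustrative sketch of multi-level quota checking.
// A limit of 0 means unlimited, matching the configuration defaults.
type QuotaManager struct {
	mu          sync.Mutex
	totalGPUs   int
	perUserGPUs int
	usedTotal   int
	usedByUser  map[string]int
}

func NewQuotaManager(total, perUser int) *QuotaManager {
	return &QuotaManager{totalGPUs: total, perUserGPUs: perUser, usedByUser: map[string]int{}}
}

// Check applies the global limit first, then the per-user limit; the
// first exceeded limit rejects the request, otherwise usage is recorded.
func (q *QuotaManager) Check(user string, gpus int) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.totalGPUs > 0 && q.usedTotal+gpus > q.totalGPUs {
		return fmt.Errorf("global GPU limit exceeded (%d in use, limit %d)", q.usedTotal, q.totalGPUs)
	}
	if q.perUserGPUs > 0 && q.usedByUser[user]+gpus > q.perUserGPUs {
		return fmt.Errorf("per-user GPU limit exceeded for %s", user)
	}
	q.usedTotal += gpus
	q.usedByUser[user] += gpus
	return nil
}

// Release returns GPUs to the pool when a job completes.
func (q *QuotaManager) Release(user string, gpus int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.usedTotal -= gpus
	q.usedByUser[user] -= gpus
}

func main() {
	q := NewQuotaManager(4, 2)
	fmt.Println(q.Check("alice", 2)) // within both limits: <nil>
	fmt.Println(q.Check("alice", 1)) // per-user limit exceeded: error
}
```

The real manager additionally consults per-plugin limits and user overrides, but each extra level is just another check in the same sequence.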

### Architecture

```mermaid
graph TB
    subgraph "Plugin Quota System"
        Submit[Job Submission] --> CheckQuota{Check Quota}
        CheckQuota -->|Within Limits| Accept[Accept Job]
        CheckQuota -->|Exceeded| Reject[Reject with Error]

        Accept --> RecordUsage[Record Usage]
        RecordUsage --> Assign[Assign to Worker]

        Complete[Job Complete] --> ReleaseUsage[Release Usage]

        subgraph "Quota Manager"
            Global[Global GPU Counter]
            PerUser[Per-User Tracking]
            PerPlugin[Per-Plugin Tracking]
            Overrides[User Overrides]
        end

        CheckQuota --> Global
        CheckQuota --> PerUser
        CheckQuota --> PerPlugin
        CheckQuota --> Overrides
    end
```

### Components

- **PluginQuotaConfig**: Configuration for all quota limits and overrides
- **PluginQuotaManager**: Thread-safe manager for tracking and enforcing quotas
- **Integration Points**:
  - `SubmitJob()`: Validates quotas before accepting service jobs
  - `handleJobAccepted()`: Records usage when jobs are assigned
  - `handleJobResult()`: Releases usage when jobs complete

### Usage

Jobs must include `user_id` and `plugin_name` metadata for quota tracking:

```go
spec := scheduler.JobSpec{
    Type:     scheduler.JobTypeService,
    UserID:   "user123",
    GPUCount: 2,
    Metadata: map[string]string{
        "plugin_name": "jupyter",
    },
}
```

## Zig CLI Architecture

### Component Structure
@@ -865,3 +931,13 @@ graph TB

---

This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.

## See Also

- **[Scheduler Architecture](scheduler-architecture.md)** - Detailed scheduler design and protocols
- **[Security Guide](security.md)** - Security architecture and best practices
- **[Configuration Reference](configuration-reference.md)** - Configuration options and environment variables
- **[Deployment Guide](deployment.md)** - Production deployment architecture
- **[Performance & Monitoring](performance-monitoring.md)** - Metrics and observability
- **[Research Runner Plan](research-runner-plan.md)** - Roadmap and implementation phases
- **[Native Libraries](native-libraries.md)** - C++ performance optimizations
@@ -664,6 +664,45 @@ api_key = "<analyst-api-key>"

| `resources.podman_cpus` | string | "2" | CPU limit for Podman containers |
| `resources.podman_memory` | string | "4Gi" | Memory limit for Podman containers |

### Plugin GPU Quotas

Control GPU allocation for plugin-based services (Jupyter, vLLM, etc.).

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `scheduler.plugin_quota.enabled` | bool | false | Enable plugin GPU quota enforcement |
| `scheduler.plugin_quota.total_gpus` | int | 0 | Global GPU limit across all plugins (0 = unlimited) |
| `scheduler.plugin_quota.per_user_gpus` | int | 0 | Default per-user GPU limit (0 = unlimited) |
| `scheduler.plugin_quota.per_user_services` | int | 0 | Default per-user service count limit (0 = unlimited) |
| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_gpus` | int | 0 | Plugin-specific GPU limit |
| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_services` | int | 0 | Plugin-specific service count limit |
| `scheduler.plugin_quota.user_overrides.{user}.max_gpus` | int | 0 | Per-user GPU override |
| `scheduler.plugin_quota.user_overrides.{user}.max_services` | int | 0 | Per-user service limit override |
| `scheduler.plugin_quota.user_overrides.{user}.allowed_plugins` | array | [] | Plugins user is allowed to use (empty = all) |

**Example configuration:**

```yaml
scheduler:
  plugin_quota:
    enabled: true
    total_gpus: 16
    per_user_gpus: 4
    per_user_services: 2
    per_plugin_limits:
      vllm:
        max_gpus: 8
        max_services: 4
      jupyter:
        max_gpus: 4
        max_services: 10
    user_overrides:
      admin:
        max_gpus: 8
        max_services: 5
        allowed_plugins: ["jupyter", "vllm"]
```

### Redis

| Option | Type | Default | Description |
@@ -746,4 +785,16 @@ go run cmd/api-server/main.go --config configs/api/dev.yaml --validate

# Test CLI configuration
./cli/zig-out/bin/ml status --debug
```

---

## See Also

- **[Architecture](architecture.md)** - System architecture overview
- **[Scheduler Architecture](scheduler-architecture.md)** - Scheduler configuration details
- **[Environment Variables](environment-variables.md)** - Additional environment variable documentation
- **[Security Guide](security.md)** - Security-related configuration
- **[Deployment Guide](deployment.md)** - Production configuration guidance
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter service configuration
- **[vLLM Workflow](vllm-workflow.md)** - vLLM service configuration
@@ -620,8 +620,10 @@ Common error codes in binary responses:

## See Also

- **[vLLM Workflow](vllm-workflow.md)** - LLM inference services (complementary to Jupyter)
- **[Scheduler Architecture](scheduler-architecture.md)** - How Jupyter services are scheduled
- **[Configuration Reference](configuration-reference.md)** - Service configuration options
- **[Testing Guide](testing.md)** - Testing Jupyter workflows
- **[Deployment Guide](deployment.md)** - Production deployment
- **[Security Guide](security.md)** - Security best practices
- **[API Reference](api-key-process.md)** - API documentation
- **[CLI Reference](cli-reference.md)** - Command-line tools
@@ -43,9 +43,11 @@ make test-unit

### 🛠️ Development
- [**Architecture**](architecture.md) - System architecture and design
- [**Scheduler Architecture**](scheduler-architecture.md) - Job scheduler and service management
- [**CLI Reference**](cli-reference.md) - Command-line interface documentation
- [**Testing Guide**](testing.md) - Testing procedures and guidelines
- [**Jupyter Workflow**](jupyter-workflow.md) - Jupyter notebook services
- [**vLLM Workflow**](vllm-workflow.md) - LLM inference services
- [**Queue System**](queue.md) - Job queue implementation

### 🏭 Production Deployment
@@ -329,4 +329,13 @@ make help # Show all available commands

---

*Ready in minutes!*

## See Also

- **[Architecture](architecture.md)** - System architecture overview
- **[Scheduler Architecture](scheduler-architecture.md)** - Job scheduling and service management
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter notebook services
- **[vLLM Workflow](vllm-workflow.md)** - LLM inference services
- **[Configuration Reference](configuration-reference.md)** - Configuration options
- **[Security Guide](security.md)** - Security best practices
@@ -169,6 +169,52 @@ The scheduler persists state for crash recovery:

State is replayed on startup via `StateStore.Replay()`.

## Service Templates

The scheduler provides built-in service templates for common ML workloads:

### Available Templates

| Template | Description | Default Port Range |
|----------|-------------|-------------------|
| **JupyterLab** | Interactive Jupyter environment | 8000-9000 |
| **Jupyter Notebook** | Classic Jupyter notebooks | 8000-9000 |
| **vLLM** | OpenAI-compatible LLM inference server | 8000-9000 |

### Port Allocation

Dynamic port management for service instances:

```go
type PortAllocator struct {
    startPort int               // Default: 8000
    endPort   int               // Default: 9000
    allocated map[int]time.Time // Port -> allocation time
}
```

**Features:**
- Automatic port selection from configured range
- TTL-based port reclamation
- Thread-safe concurrent allocations
- Exhaustion handling with clear error messages

### Template Variables

Service templates support dynamic variable substitution:

| Variable | Description | Example |
|----------|-------------|---------|
| `{{SERVICE_PORT}}` | Allocated port for the service | `8080` |
| `{{WORKER_ID}}` | ID of the assigned worker | `worker-1` |
| `{{TASK_ID}}` | Unique task identifier | `task-abc123` |
| `{{SECRET:xxx}}` | Secret reference from keychain | `api-key-value` |
| `{{MODEL_NAME}}` | ML model name (vLLM) | `llama-2-7b` |
| `{{GPU_COUNT}}` | Number of GPUs allocated | `2` |
| `{{GPU_DEVICES}}` | Specific GPU device IDs | `0,1` |
| `{{MODEL_CACHE}}` | Path to model cache directory | `/models` |
| `{{WORKSPACE}}` | Working directory path | `/workspace` |
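Substitution of these variables can be sketched with a `strings.Replacer`. This is a hedged illustration, not the actual template engine; in particular, resolving `{{SECRET:xxx}}` references would need an extra keychain lookup pass that is omitted here:

```go
package main

import (
	"fmt"
	"strings"
)

// renderTemplate substitutes scheduler-provided variables into a
// service command template. Illustrative sketch only.
func renderTemplate(tmpl string, vars map[string]string) string {
	pairs := make([]string, 0, len(vars)*2)
	for k, v := range vars {
		pairs = append(pairs, "{{"+k+"}}", v)
	}
	return strings.NewReplacer(pairs...).Replace(tmpl)
}

func main() {
	// Hypothetical vLLM launch template using the variables above.
	cmd := "serve {{MODEL_NAME}} --port {{SERVICE_PORT}} --tensor-parallel-size {{GPU_COUNT}}"
	out := renderTemplate(cmd, map[string]string{
		"MODEL_NAME":   "llama-2-7b",
		"SERVICE_PORT": "8080",
		"GPU_COUNT":    "2",
	})
	fmt.Println(out) // serve llama-2-7b --port 8080 --tensor-parallel-size 2
}
```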

## API Methods

```go
@@ -188,7 +234,52 @@ func (h *SchedulerHub) Start() error
func (h *SchedulerHub) Stop()
```

## Audit Integration

The scheduler integrates with the audit logging system for security and compliance:

### Audit Logger Integration

```go
type SchedulerHub struct {
    // ... other fields ...
    auditor *audit.Logger // Security audit logger
}
```

**Initialization:**
```go
auditor := audit.NewLogger(audit.Config{
    LogPath: "/var/log/fetch_ml/scheduler_audit.log",
    Enabled: true,
})
hub, err := scheduler.NewHub(config, auditor)
```

### Audit Events

The scheduler logs the following audit events:

| Event | Description | Fields Logged |
|-------|-------------|---------------|
| `job_submitted` | New job queued | job_id, user_id, job_type, gpu_count |
| `job_assigned` | Job assigned to worker | job_id, worker_id, assignment_time |
| `job_accepted` | Worker accepted job | job_id, worker_id, acceptance_time |
| `job_completed` | Job finished successfully | job_id, worker_id, duration |
| `job_failed` | Job failed | job_id, worker_id, error_code |
| `job_cancelled` | Job cancelled | job_id, cancelled_by, reason |
| `worker_registered` | Worker connected | worker_id, capabilities, timestamp |
| `worker_disconnected` | Worker disconnected | worker_id, duration_connected |
| `quota_exceeded` | GPU quota violation | user_id, plugin_name, requested, limit |

### Tamper-Evident Logging

Audit logs use chain hashing for integrity:
- Each event includes SHA-256 hash of previous event
- Chain verification detects log tampering
- Separate log file from operational logs

### Configuration

```go
type HubConfig struct {
```

@@ -203,6 +294,7 @@ type HubConfig struct {

```go
    GangAllocTimeoutSecs  int               // Multi-node allocation timeout
    AcceptanceTimeoutSecs int               // Job acceptance timeout
    WorkerTokens          map[string]string // Authentication tokens
    PluginQuota           PluginQuotaConfig // Plugin GPU quota configuration
}
```
@@ -214,6 +306,11 @@ Process management is abstracted for Unix/Windows:

## See Also

- **[Architecture Overview](architecture.md)** - High-level system architecture
- **[Security Guide](security.md)** - Audit logging and security features
- **[Configuration Reference](configuration-reference.md)** - Plugin GPU quotas and scheduler config
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter service integration with scheduler
- **[vLLM Workflow](vllm-workflow.md)** - vLLM service integration with scheduler
- **[Testing Guide](testing.md)** - Testing the scheduler
- **`internal/scheduler/hub.go`** - Core implementation
- **`tests/fixtures/scheduler_fixture.go`** - Test infrastructure
@@ -112,27 +112,164 @@ The system detects and rejects plaintext secrets using:

### HIPAA-Compliant Audit Logging

FetchML implements comprehensive HIPAA-compliant audit logging with tamper-evident chain hashing for healthcare and regulated environments.

**Architecture:**
```go
// Audit logger initialization
auditor := audit.NewLogger(audit.Config{
    Enabled: true,
    LogPath: "/var/log/fetch_ml/audit.log",
})

// Logging an event
auditor.Log(audit.Event{
    EventType: audit.EventFileRead,
    UserID:    "user123",
    Resource:  "/data/patient_records/file.txt",
    IPAddress: "10.0.0.5",
    Success:   true,
    Metadata: map[string]any{
        "file_size": 1024,
        "checksum":  "abc123...",
    },
})
```

**Tamper-Evident Chain Hashing:**
- Each event includes SHA-256 hash of the previous event (PrevHash)
- Event hash covers all fields including PrevHash (chaining)
- Modification of any log entry breaks the chain
- Separate `VerifyChain()` function detects tampering
- Monotonic sequence numbers prevent deletion attacks

```go
// Verify audit chain integrity
valid, err := audit.VerifyChain("/var/log/fetch_ml/audit.log")
if err != nil || !valid {
    log.Fatal("AUDIT TAMPERING DETECTED")
}
```
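To make the chaining concrete, here is a minimal, self-contained sketch of how hash chaining can detect tampering. The event fields and the exact hash encoding are assumptions for illustration, not the real `audit` package:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// event is an illustrative subset of an audit event; the real
// audit.Event carries many more fields.
type event struct {
	Seq      int
	UserID   string
	Resource string
	PrevHash string
}

// hashEvent covers all fields including PrevHash, so every entry
// commits to the entire history before it.
func hashEvent(e event) string {
	h := sha256.Sum256([]byte(fmt.Sprintf("%d|%s|%s|%s", e.Seq, e.UserID, e.Resource, e.PrevHash)))
	return hex.EncodeToString(h[:])
}

// verifyChain recomputes every hash and checks each PrevHash link.
func verifyChain(events []event, hashes []string) bool {
	prev := ""
	for i, e := range events {
		if e.PrevHash != prev || hashEvent(e) != hashes[i] {
			return false
		}
		prev = hashes[i]
	}
	return true
}

func main() {
	e1 := event{Seq: 1, UserID: "u1", Resource: "/a", PrevHash: ""}
	hashes := []string{hashEvent(e1)}
	e2 := event{Seq: 2, UserID: "u1", Resource: "/b", PrevHash: hashes[0]}
	events := []event{e1, e2}
	hashes = append(hashes, hashEvent(e2))

	fmt.Println(verifyChain(events, hashes)) // true

	events[0].Resource = "/tampered" // editing any entry breaks the chain
	fmt.Println(verifyChain(events, hashes)) // false
}
```

Because each hash feeds into the next, an attacker who modifies one entry would have to rewrite every subsequent hash, which the out-of-band verification catches.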

**HIPAA-Specific Event Types:**

| Event Type | HIPAA Relevance | Fields Logged |
|------------|-----------------|---------------|
| `file_read` | Access to PHI | user_id, file_path, ip_address, timestamp, checksum |
| `file_write` | Modification of PHI | user_id, file_path, bytes_written, prev_checksum, new_checksum |
| `file_delete` | Deletion of PHI | user_id, file_path, deletion_type (soft/hard) |
| `dataset_access` | Bulk data access | user_id, dataset_id, record_count, access_purpose |
| `authentication_success` | Access control | user_id, auth_method, ip_address, mfa_used |
| `authentication_failure` | Failed access attempts | attempted_user, ip_address, failure_reason, attempt_count |
| `job_queued` | Processing PHI | user_id, job_id, input_data_classification |
| `job_started` | PHI processing begun | job_id, worker_id, data_accessed |
| `job_completed` | PHI processing complete | job_id, output_location, data_disposition |

**Standard Event Types:**

| Event Type | Description | Use Case |
|------------|-------------|----------|
| `authentication_attempt` | Login attempt (pre-validation) | Brute force detection |
| `authentication_success` | Successful login | Access tracking |
| `authentication_failure` | Failed login | Security monitoring |
| `job_queued` | Job submitted to queue | Workflow tracking |
| `job_started` | Job execution begun | Performance monitoring |
| `job_completed` | Job finished successfully | Completion tracking |
| `job_failed` | Job execution failed | Error tracking |
| `jupyter_start` | Jupyter service started | Resource tracking |
| `jupyter_stop` | Jupyter service stopped | Session tracking |
| `experiment_created` | Experiment initialized | Provenance tracking |
| `experiment_deleted` | Experiment removed | Data lifecycle |

**Scheduler Audit Integration:**

The scheduler automatically logs these events:
- `job_submitted` - Job queued (includes user_id, job_type, gpu_count)
- `job_assigned` - Job assigned to worker (worker_id, assignment_time)
- `job_accepted` - Worker confirmed job execution
- `job_completed` / `job_failed` / `job_cancelled` - Job terminal states
- `worker_registered` - Worker connected to scheduler
- `worker_disconnected` - Worker disconnected
- `quota_exceeded` - GPU quota violation attempt

**Audit Log Format:**
```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "event_type": "file_read",
  "user_id": "researcher1",
  "ip_address": "10.0.0.5",
  "resource": "/data/experiments/run_001/results.csv",
  "action": "read",
  "success": true,
  "sequence_num": 15423,
  "prev_hash": "a1b2c3d4...",
  "event_hash": "e5f6g7h8...",
  "metadata": {
    "file_size": 1048576,
    "checksum": "sha256:abc123...",
    "access_duration_ms": 150
  }
}
```

**Log Storage and Rotation:**
- Default location: `/var/log/fetch_ml/audit.log`
- Automatic rotation by size (100MB) or time (daily)
- Retention policy: Configurable (default: 7 years for HIPAA)
- Immutable storage: Append-only with filesystem-level protection

**Compliance Features:**

- **User Identification**: Every event includes `user_id` for accountability
- **Timestamp Precision**: RFC3339 nanosecond precision timestamps
- **IP Address Tracking**: Source IP for all network events
- **Success/Failure Tracking**: Boolean success field for all operations
- **Metadata Flexibility**: Extensible key-value metadata for domain-specific data
- **Immutable Logging**: Append-only files with filesystem protections
- **Chain Verification**: Cryptographic proof of log integrity
- **Sealed Logs**: Optional GPG signing for regulatory submissions

**Audit Log Analysis:**

```bash
# View recent audit events
tail -f /var/log/fetch_ml/audit.log | jq '.'

# Search for specific user activity
grep '"user_id":"researcher1"' /var/log/fetch_ml/audit.log | jq '.'

# Find all file access events
jq 'select(.event_type == "file_read")' /var/log/fetch_ml/audit.log

# Detect failed authentication attempts
jq 'select(.event_type == "authentication_failure")' /var/log/fetch_ml/audit.log

# Verify audit chain integrity
./cli/zig-out/bin/ml audit verify /var/log/fetch_ml/audit.log

# Export audit report for compliance
./cli/zig-out/bin/ml audit export --start 2024-01-01 --end 2024-01-31 --format csv
```

**Regulatory Compliance:**

| Regulation | Requirement | FetchML Implementation |
|------------|-------------|------------------------|
| **HIPAA** | Access logging, tamper evidence | Chain hashing, file access events, user tracking |
| **GDPR** | Data subject access, right to deletion | Full audit trail, deletion events with chain preservation |
| **SOX** | Financial controls, audit trail | Immutable logs, separation of duties via RBAC |
| **21 CFR Part 11** | Electronic records integrity | Tamper-evident logging, user authentication, timestamps |
| **PCI DSS** | Access logging, data protection | Audit trails, encryption, access controls |

**Best Practices:**

1. **Enable Audit Logging**: Always enable in production
2. **Separate Storage**: Store audit logs on a separate volume from application data
3. **Regular Verification**: Run chain verification daily
4. **Backup Strategy**: Include audit logs in backup procedures
5. **Access Control**: Restrict audit log access to security personnel only
6. **Monitoring**: Set up alerts for suspicious patterns (multiple failed logins, after-hours access)

---
@@ -420,3 +557,16 @@ All API access is logged with:

- **Security Issues**: Report privately via email
- **Questions**: See documentation or create an issue
- **Updates**: Monitor releases for security patches

---

## See Also

- **[Privacy & Security](privacy-security.md)** - PII detection and privacy controls
- **[Multi-Tenant Security](multi-tenant-security.md)** - Tenant isolation and cross-tenant access prevention
- **[API Key Process](api-key-process.md)** - Generate and manage API keys
- **[User Permissions](user-permissions.md)** - Role-based access control
- **[Runtime Security](runtime-security.md)** - Container sandboxing and seccomp profiles
- **[Scheduler Architecture](scheduler-architecture.md)** - Audit integration in the scheduler
- **[Configuration Reference](configuration-reference.md)** - Security-related configuration options
- **[Deployment Guide](deployment.md)** - Production security hardening
581 docs/src/vllm-workflow.md Normal file

@@ -0,0 +1,581 @@
# vLLM Inference Service Guide

Comprehensive guide to deploying and managing OpenAI-compatible LLM inference services using vLLM in FetchML.

## Overview

The vLLM plugin provides high-performance LLM inference with:
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI's API
- **Advanced Scheduling**: Continuous batching for throughput optimization
- **GPU Optimization**: Tensor parallelism and quantization support
- **Model Management**: Automatic model downloading and caching
- **Quantization**: AWQ, GPTQ, FP8, and SqueezeLLM support

## Quick Start

### Start vLLM Service

```bash
# Start development stack
make dev-up

# Start vLLM service with default model
./cli/zig-out/bin/ml service start vllm --name llm-server --model meta-llama/Llama-2-7b-chat-hf

# Or with specific GPU requirements
./cli/zig-out/bin/ml service start vllm \
    --name llm-server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --gpu-count 1 \
    --quantization awq

# Access the API
open http://localhost:8000/docs
```

### Using the API

```python
import openai

# Point to local vLLM instance
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Chat completion
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ]
)

print(response.choices[0].message.content)
```

## Service Management

### Creating vLLM Services

```bash
# Create basic vLLM service
./cli/zig-out/bin/ml service start vllm --name my-llm

# Create with specific model
./cli/zig-out/bin/ml service start vllm \
    --name my-llm \
    --model microsoft/DialoGPT-medium

# Create with resource constraints
./cli/zig-out/bin/ml service start vllm \
    --name production-llm \
    --model meta-llama/Llama-2-13b-chat-hf \
    --gpu-count 2 \
    --quantization gptq \
    --max-model-len 4096

# List all vLLM services
./cli/zig-out/bin/ml service list

# Service details
./cli/zig-out/bin/ml service info my-llm
```

### Service Configuration

**Resource Allocation:**
```yaml
# vllm-config.yaml
resources:
  gpu_count: 1
  gpu_memory: 24gb
  cpu: 4
  memory: 16g

model:
  name: "meta-llama/Llama-2-7b-chat-hf"
  quantization: "awq" # Options: awq, gptq, squeezellm, fp8
  trust_remote_code: false
  max_model_len: 4096

serving:
  port: 8000
  host: "0.0.0.0"
  tensor_parallel_size: 1
  dtype: "auto" # auto, half, bfloat16, float

optimization:
  enable_prefix_caching: true
  swap_space: 4 # GB
  max_num_batched_tokens: 4096
  max_num_seqs: 256
```

**Environment Variables:**
```bash
# Model cache location
export VLLM_MODEL_CACHE=/models

# HuggingFace token for gated models
export HUGGING_FACE_HUB_TOKEN=your_token_here

# CUDA settings
export CUDA_VISIBLE_DEVICES=0,1
```

### Service Lifecycle

```bash
# Start a service
./cli/zig-out/bin/ml service start vllm --name my-llm

# Stop a service (graceful shutdown)
./cli/zig-out/bin/ml service stop my-llm

# Restart a service
./cli/zig-out/bin/ml service restart my-llm

# Remove a service (stops and deletes)
./cli/zig-out/bin/ml service remove my-llm

# View service logs
./cli/zig-out/bin/ml service logs my-llm --follow

# Check service health
./cli/zig-out/bin/ml service health my-llm
```

## Model Management

### Supported Models

vLLM supports most HuggingFace Transformers models:

- **Llama 2/3**: `meta-llama/Llama-2-7b-chat-hf`, `meta-llama/Llama-2-70b-chat-hf`
- **Mistral**: `mistralai/Mistral-7B-Instruct-v0.2`
- **Mixtral**: `mistralai/Mixtral-8x7B-Instruct-v0.1`
- **Falcon**: `tiiuae/falcon-7b-instruct`
- **CodeLlama**: `codellama/CodeLlama-7b-hf`
- **Phi**: `microsoft/phi-2`
- **Qwen**: `Qwen/Qwen-7B-Chat`
- **Gemma**: `google/gemma-7b-it`

### Model Caching

Models are automatically cached to avoid repeated downloads:

```bash
# Default cache location
~/.cache/huggingface/hub/

# Custom cache location
export VLLM_MODEL_CACHE=/mnt/fast-storage/models

# Pre-download models
./cli/zig-out/bin/ml service prefetch --model meta-llama/Llama-2-7b-chat-hf
```

### Quantization

Quantization reduces memory usage and improves inference speed:

```bash
# AWQ (4-bit quantization)
./cli/zig-out/bin/ml service start vllm \
    --name llm-awq \
    --model TheBloke/Llama-2-7B-AWQ \
    --quantization awq

# GPTQ (4-bit quantization)
|
||||
./cli/zig-out/bin/ml service start vllm \
|
||||
--name llm-gptq \
|
||||
--model TheBloke/Llama-2-7B-GPTQ \
|
||||
--quantization gptq
|
||||
|
||||
# FP8 (8-bit floating point)
|
||||
./cli/zig-out/bin/ml service start vllm \
|
||||
--name llm-fp8 \
|
||||
--model meta-llama/Llama-2-7b-chat-hf \
|
||||
--quantization fp8
|
||||
```
|
||||
|
||||
**Quantization Comparison:**
|
||||
|
||||
| Method | Bits | Memory Reduction | Speed Impact | Quality |
|
||||
|--------|------|------------------|--------------|---------|
|
||||
| None (FP16) | 16 | 1x | Baseline | Best |
|
||||
| FP8 | 8 | 2x | Faster | Excellent |
|
||||
| AWQ | 4 | 4x | Fast | Very Good |
|
||||
| GPTQ | 4 | 4x | Fast | Very Good |
|
||||
| SqueezeLLM | 4 | 4x | Fast | Good |
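
As a rough rule of thumb behind the table's memory-reduction column, weight memory is parameter count × bytes per weight, plus extra headroom for the KV cache, activations, and CUDA runtime. The overhead factor below is an illustrative assumption (commonly cited as 1.2-1.5x), not a vLLM formula:

```python
def estimate_gpu_memory_gb(num_params: float, bits: int, overhead: float = 1.3) -> float:
    """Rough GPU memory estimate: weights at the given precision,
    scaled by a fudge factor for KV cache, activations, and runtime."""
    weight_bytes = num_params * bits / 8
    return weight_bytes * overhead / 1e9

# A 7B model at FP16 vs. AWQ 4-bit (the table's 4x reduction):
fp16 = estimate_gpu_memory_gb(7e9, bits=16)  # ~18 GB
awq = estimate_gpu_memory_gb(7e9, bits=4)    # ~4.6 GB
print(f"FP16: {fp16:.1f} GB, AWQ: {awq:.1f} GB")
```

In practice, also budget KV-cache memory proportional to `max_model_len` and `max_num_seqs`; the estimate above only bounds the weights.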

## API Reference

### OpenAI-Compatible Endpoints

vLLM provides OpenAI-compatible REST API endpoints:

**Chat Completions:**
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

**Completions (Legacy):**
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "prompt": "The capital of France is",
    "max_tokens": 10
  }'
```

**Embeddings:**
```bash
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "input": "Hello world"
  }'
```

**List Models:**
```bash
curl http://localhost:8000/v1/models
```

### Streaming Responses

Enable streaming for real-time token generation:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True,
    max_tokens=200
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

### Advanced Parameters

Standard OpenAI parameters are passed directly; vLLM-specific sampling parameters (such as `top_k` or `repetition_penalty`) are not part of the OpenAI client signature and must go through `extra_body`:
```python
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=messages,

    # Standard OpenAI generation parameters
    max_tokens=500,
    temperature=0.7,
    top_p=0.9,

    # Repetition and penalties
    frequency_penalty=0.5,
    presence_penalty=0.5,

    # Sampling
    seed=42,
    stop=["END", "STOP"],

    # vLLM-specific parameters go via extra_body
    extra_body={
        "top_k": 40,
        "repetition_penalty": 1.1,
        "best_of": 1,
        "use_beam_search": False,
    },
)
```

## GPU Quotas and Resource Management

### Per-User GPU Limits

The scheduler enforces GPU quotas for vLLM services:

```yaml
# scheduler-config.yaml
scheduler:
  plugin_quota:
    enabled: true
    total_gpus: 16
    per_user_gpus: 4
    per_user_services: 2
    per_plugin_limits:
      vllm:
        max_gpus: 8
        max_services: 4
    user_overrides:
      admin:
        max_gpus: 8
        max_services: 5
        allowed_plugins: ["vllm", "jupyter"]
```
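
The admission check this config implies can be sketched in a few lines. This is an illustration, not the scheduler's actual code; the field names mirror the `plugin_quota` section above, and the deny-if-any-limit-exceeded logic is an assumption:

```python
def can_start_service(quota: dict, user: str, current_gpus: int,
                      current_services: int, requested_gpus: int) -> bool:
    """Return True if a new vLLM service fits the per-user quota.

    `quota` mirrors the plugin_quota config; a user_overrides entry,
    when present, replaces the global per-user limits.
    """
    override = quota.get("user_overrides", {}).get(user, {})
    max_gpus = override.get("max_gpus", quota["per_user_gpus"])
    max_services = override.get("max_services", quota["per_user_services"])
    return (current_gpus + requested_gpus <= max_gpus
            and current_services + 1 <= max_services)

quota = {"per_user_gpus": 4, "per_user_services": 2,
         "user_overrides": {"admin": {"max_gpus": 8, "max_services": 5}}}
print(can_start_service(quota, "alice", current_gpus=2, current_services=1, requested_gpus=4))  # False
print(can_start_service(quota, "admin", current_gpus=2, current_services=1, requested_gpus=4))  # True
```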

### Resource Monitoring

```bash
# Check GPU allocation for your user
./cli/zig-out/bin/ml service quota

# View current usage
./cli/zig-out/bin/ml service usage

# Monitor service resource usage
./cli/zig-out/bin/ml service stats my-llm
```

## Multi-GPU and Distributed Inference

### Tensor Parallelism

For large models that don't fit on a single GPU:

```bash
# 70B model across 4 GPUs
./cli/zig-out/bin/ml service start vllm \
  --name llm-70b \
  --model meta-llama/Llama-2-70b-chat-hf \
  --gpu-count 4 \
  --tensor-parallel-size 4
```

### Pipeline Parallelism

For very large models with pipeline stages:

```yaml
# Pipeline parallelism config
model:
  name: "meta-llama/Llama-2-70b-chat-hf"

serving:
  tensor_parallel_size: 2
  pipeline_parallel_size: 2  # Total 4 GPUs
```

## Integration with Experiments

### Using vLLM from Training Jobs

```python
# In your training script
import requests

# Call local vLLM service
response = requests.post(
    "http://vllm-service:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Summarize this text"}]
    }
)

result = response.json()
summary = result["choices"][0]["message"]["content"]
```
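
Training jobs should tolerate transient failures, e.g. while the service restarts or its request queue is full. A minimal retry-with-backoff sketch (the backoff schedule is an arbitrary choice, and the commented endpoint is the one used above):

```python
import time

def post_with_retry(do_post, retries: int = 3, base_delay: float = 1.0):
    """Call do_post() (e.g. a lambda wrapping requests.post), retrying
    on exceptions with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(retries):
        try:
            return do_post()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts, surface the error
            time.sleep(base_delay * 2 ** attempt)

# Usage (hypothetical):
# response = post_with_retry(lambda: requests.post(
#     "http://vllm-service:8000/v1/chat/completions", json=payload, timeout=60))
```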

### Linking with Experiments

```bash
# Start vLLM service linked to experiment
./cli/zig-out/bin/ml service start vllm \
  --name llm-exp-1 \
  --model meta-llama/Llama-2-7b-chat-hf \
  --experiment experiment-id

# View linked services
./cli/zig-out/bin/ml service list --experiment experiment-id
```

## Security and Access Control

### Network Isolation

```bash
# Restrict to internal network only
./cli/zig-out/bin/ml service start vllm \
  --name internal-llm \
  --model meta-llama/Llama-2-7b-chat-hf \
  --host 10.0.0.1 \
  --port 8000
```

### API Key Authentication

```yaml
# vllm-security.yaml
auth:
  api_key_required: true
  allowed_ips:
    - "10.0.0.0/8"
    - "192.168.0.0/16"

rate_limit:
  requests_per_minute: 60
  tokens_per_minute: 10000
```
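
To make the `rate_limit` settings concrete, here is a minimal token-bucket sketch of how a 60-requests-per-minute limit behaves: bursts up to the bucket capacity pass, then requests are admitted at the refill rate. This illustrates the policy only, not the gateway's actual implementation:

```python
class TokenBucket:
    """Allow rate_per_minute requests per minute, refilled continuously."""
    def __init__(self, rate_per_minute: int, now: float = 0.0):
        self.capacity = float(rate_per_minute)
        self.tokens = float(rate_per_minute)
        self.refill_per_sec = rate_per_minute / 60.0
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate_per_minute=60)
burst = [bucket.allow(now=0.0) for _ in range(61)]
print(sum(burst))  # 60 requests pass, the 61st is rejected
```

A production limiter would also track tokens_per_minute per API key; the shape of the check is the same.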

### Audit Trail

All API calls are logged for compliance:

```bash
# View audit log
./cli/zig-out/bin/ml service audit my-llm

# Export audit report
./cli/zig-out/bin/ml service audit my-llm --export=csv

# Check access patterns
./cli/zig-out/bin/ml service audit my-llm --summary
```

## Monitoring and Troubleshooting

### Health Checks

```bash
# Check service health
./cli/zig-out/bin/ml service health my-llm

# Detailed diagnostics
./cli/zig-out/bin/ml service diagnose my-llm

# View service status
./cli/zig-out/bin/ml service status my-llm
```

### Performance Monitoring

```bash
# Real-time metrics
./cli/zig-out/bin/ml service monitor my-llm

# Performance report
./cli/zig-out/bin/ml service report my-llm --format=html

# GPU utilization
./cli/zig-out/bin/ml service stats my-llm --gpu
```

### Common Issues

**Out of Memory:**
```bash
# Reduce batch size
./cli/zig-out/bin/ml service update my-llm --max-num-seqs 128

# Enable quantization
./cli/zig-out/bin/ml service update my-llm --quantization awq

# Reduce GPU memory fraction
export VLLM_GPU_MEMORY_FRACTION=0.85
```

**Model Download Failures:**
```bash
# Set HuggingFace token
export HUGGING_FACE_HUB_TOKEN=your_token

# Use mirror
export HF_ENDPOINT=https://hf-mirror.com

# Pre-download with retry
./cli/zig-out/bin/ml service prefetch --model meta-llama/Llama-2-7b-chat-hf --retry
```

**Slow Inference:**
```bash
# Enable prefix caching
./cli/zig-out/bin/ml service update my-llm --enable-prefix-caching

# Increase batch size
./cli/zig-out/bin/ml service update my-llm --max-num-batched-tokens 8192

# Check GPU utilization
nvidia-smi dmon -s u
```

## Best Practices

### Resource Planning

1. **GPU Memory Calculation**: Model size × precision × overhead (1.2-1.5x)
2. **Batch Size Tuning**: Balance throughput vs. latency
3. **Quantization**: Use AWQ/GPTQ for production, FP16 for best quality
4. **Prefix Caching**: Enable for chat applications with repeated prompts

### Production Deployment

1. **Load Balancing**: Deploy multiple vLLM instances behind a load balancer
2. **Health Checks**: Configure Kubernetes liveness/readiness probes
3. **Autoscaling**: Scale based on queue depth or GPU utilization
4. **Monitoring**: Track tokens/sec, queue depth, and error rates
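
The queue-depth heuristic in point 3 can be sketched as: choose enough replicas that waiting requests per replica stay under a target. The target and replica bounds below are illustrative assumptions, not tuned values:

```python
import math

def desired_replicas(queue_depth: int, target_per_replica: int = 16,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Scale so no replica has more than target_per_replica waiting requests,
    clamped to [min_replicas, max_replicas]."""
    needed = math.ceil(queue_depth / target_per_replica) if queue_depth else min_replicas
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(queue_depth=50))  # 4
print(desired_replicas(queue_depth=0))   # 1
```

In practice, add hysteresis (scale down more slowly than up) to avoid flapping when the queue oscillates around a threshold.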

### Security

1. **Network Segmentation**: Isolate vLLM on internal network
2. **Rate Limiting**: Prevent abuse with per-user quotas
3. **Input Validation**: Sanitize prompts to prevent injection attacks
4. **Audit Logging**: Enable comprehensive audit trails
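
Point 3 can be sketched as a small pre-flight check in front of the API. The length limit and the characters stripped are illustrative assumptions, and a check like this is one layer, not a complete defense against prompt injection:

```python
MAX_PROMPT_CHARS = 8_000  # assumed limit for this sketch

def validate_prompt(prompt: str) -> str:
    """Reject oversized or empty prompts and strip non-printable
    control characters (newlines and tabs are kept)."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt too long")
    cleaned = "".join(ch for ch in prompt if ch.isprintable() or ch in "\n\t")
    if not cleaned.strip():
        raise ValueError("empty prompt")
    return cleaned

print(validate_prompt("Hello\x00 world"))  # "Hello world"
```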

## CLI Reference

### Service Commands

```bash
# Start a service
ml service start vllm [flags]
  --name string                Service name (required)
  --model string               Model name or path (default: "meta-llama/Llama-2-7b-chat-hf")
  --gpu-count int              Number of GPUs (default: 1)
  --quantization string        Quantization method (awq, gptq, fp8, squeezellm)
  --port int                   Service port (default: 8000)
  --max-model-len int          Maximum sequence length
  --tensor-parallel-size int   Tensor parallelism degree

# List services
ml service list [flags]
  --format string   Output format (table, json)
  --all             Show all users' services (admin only)

# Service operations
ml service stop <name>
ml service start <name>      # Restart a stopped service
ml service restart <name>
ml service remove <name>
ml service logs <name> [flags]
  --follow          Follow log output
  --tail int        Number of lines to show (default: 100)
ml service info <name>
ml service health <name>
```

## See Also

- **[Testing Guide](testing.md)** - Testing vLLM services
- **[Deployment Guide](deployment.md)** - Production deployment
- **[Security Guide](security.md)** - Security best practices
- **[Scheduler Architecture](scheduler-architecture.md)** - How vLLM integrates with the scheduler
- **[CLI Reference](cli-reference.md)** - Command-line tools
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter integration with vLLM