# Scheduler Architecture
The FetchML Scheduler manages distributed job scheduling across workers via WebSocket connections.
## Overview
The scheduler consists of:
- **SchedulerHub**: Core scheduling engine (`internal/scheduler/hub.go`)
- **PriorityQueue**: Heap-based job queues for batch and service jobs
- **WorkerConn**: WebSocket connection handling per worker
- **StateStore**: Persistent state for crash recovery
- **ServiceManager**: Long-running service lifecycle management
## Key Components
### SchedulerHub
```go
type SchedulerHub struct {
    workers           map[string]*WorkerConn    // Active worker connections
    readyWorkers      map[string]*WorkerConn    // Workers ready for jobs
    batchQueue        *PriorityQueue            // Batch job queue
    serviceQueue      *PriorityQueue            // Service job queue
    reservations      map[string]*Reservation   // Job reservations
    multiNodePending  map[string]*MultiNodeJob  // Multi-node gang allocations
    pendingAcceptance map[string]*JobAssignment // Jobs awaiting acceptance
    state             *StateStore               // Persistent state
}
```
### Job Types
| Type | Description | Scheduling |
|------|-------------|------------|
| **Batch** | Finite training jobs | FIFO with priority aging |
| **Service** | Long-running inference | Dedicated slots, health checks |
| **Multi-node** | Distributed training | Gang allocation across workers |
## Protocol
### Unified WSS Protocol
All communication uses a single WebSocket Secure (WSS) endpoint:
- Workers connect to `wss://scheduler:port/ws/worker`
- Metrics clients connect with `metrics-` prefixed token
### Message Types
```go
const (
    // Worker → Scheduler
    MsgRegister       = "register"
    MsgHeartbeat      = "heartbeat"
    MsgReadyForWork   = "ready_for_work"
    MsgJobAccepted    = "job_accepted"
    MsgJobResult      = "job_result"
    MsgServiceHealth  = "service_health"
    MsgMetricsRequest = "metrics_request" // Metrics over WSS

    // Scheduler → Worker
    MsgJobAssign       = "job_assign"
    MsgNoWork          = "no_work"
    MsgJobCancel       = "job_cancel"
    MsgPrewarmHint     = "prewarm_hint"
    MsgAck             = "ack"
    MsgMetricsResponse = "metrics_response" // Metrics over WSS
)
```
### Metrics Over WSS
Metrics are retrieved via WSS using a special client token:
```go
// Connect with a metrics-prefixed token
conn, err := scheduler.DialWSS("scheduler:8443", "ca.crt", "metrics-scraper-1")
if err != nil {
    log.Fatalf("dial failed: %v", err)
}
defer conn.Close()

// Request metrics
if err := conn.WriteJSON(scheduler.Message{Type: scheduler.MsgMetricsRequest}); err != nil {
    log.Fatalf("write failed: %v", err)
}

// Receive metrics
var msg scheduler.Message
if err := conn.ReadJSON(&msg); err != nil {
    log.Fatalf("read failed: %v", err)
}
// msg.Type == MsgMetricsResponse
// msg.Payload contains the metrics map
```
**Metrics payload:**
```json
{
    "workers_connected": 5,
    "queue_depth_batch": 12,
    "queue_depth_service": 3,
    "jobs_completed": 142,
    "jobs_failed": 2,
    "jobs_cancelled": 0,
    "worker_slots": {
        "worker-1": {"batch_total": 4, "batch_in_use": 2, ...}
    }
}
```
## Features
### Priority Aging
Prevents starvation by increasing priority of long-waiting jobs:
```go
effective_priority = base_priority + (wait_time * aging_rate)
```
### Gang Allocation
Multi-node jobs are allocated atomically across workers:
1. Job submitted with `NodeCount > 1`
2. Scheduler waits for required workers
3. All nodes assigned simultaneously
4. Timeout handling for partial allocations
### Starvation Prevention
Tracks job wait times and triggers priority boosts:
```go
if wait_time > starvation_threshold {
    effective_priority += boost_amount
}
```
### Worker Mode Switching
Workers can switch between batch and service modes:
- Batch mode: processes training jobs
- Service mode: runs long-lived inference services
## Testing
### Test Infrastructure
All tests use shared fixtures in `tests/fixtures/`:
- `SchedulerTestFixture`: Common setup/teardown
- `MockWorker`: Simulated worker connections
### Test Categories
| Category | Count | Files |
|----------|-------|-------|
| Unit | 17+ | `tests/unit/scheduler/` |
| Integration | 6 | `tests/integration/scheduler/` |
| E2E | 6 | `tests/e2e/scheduler/` |
### Running Tests
```bash
make test # All tests
make test-unit # Unit tests only
make test-integration # Integration tests only
go test ./tests/e2e/... # E2E tests
```
## State Persistence
The scheduler persists state for crash recovery:
- Job queue state
- Task assignments
- Worker registrations
- Lease timestamps
State is replayed on startup via `StateStore.Replay()`.
## Service Templates
The scheduler provides built-in service templates for common ML workloads:
### Available Templates
| Template | Description | Default Port Range |
|----------|-------------|-------------------|
| **JupyterLab** | Interactive Jupyter environment | 8000-9000 |
| **Jupyter Notebook** | Classic Jupyter notebooks | 8000-9000 |
| **vLLM** | OpenAI-compatible LLM inference server | 8000-9000 |
### Port Allocation
Dynamic port management for service instances:
```go
type PortAllocator struct {
    startPort int                // Default: 8000
    endPort   int                // Default: 9000
    allocated map[int]time.Time  // Port -> allocation time
}
```
**Features:**
- Automatic port selection from configured range
- TTL-based port reclamation
- Thread-safe concurrent allocations
- Exhaustion handling with clear error messages
### Template Variables
Service templates support dynamic variable substitution:
| Variable | Description | Example |
|----------|-------------|---------|
| `{{SERVICE_PORT}}` | Allocated port for the service | `8080` |
| `{{WORKER_ID}}` | ID of the assigned worker | `worker-1` |
| `{{TASK_ID}}` | Unique task identifier | `task-abc123` |
| `{{SECRET:xxx}}` | Secret reference from keychain | `api-key-value` |
| `{{MODEL_NAME}}` | ML model name (vLLM) | `llama-2-7b` |
| `{{GPU_COUNT}}` | Number of GPUs allocated | `2` |
| `{{GPU_DEVICES}}` | Specific GPU device IDs | `0,1` |
| `{{MODEL_CACHE}}` | Path to model cache directory | `/models` |
| `{{WORKSPACE}}` | Working directory path | `/workspace` |
## API Methods
```go
// SubmitJob submits a job to the scheduler
func (h *SchedulerHub) SubmitJob(spec JobSpec) error

// GetTask retrieves a task by ID
func (h *SchedulerHub) GetTask(taskID string) *Task

// Addr returns the scheduler's listen address
func (h *SchedulerHub) Addr() string

// Start begins the scheduler
func (h *SchedulerHub) Start() error

// Stop shuts down the scheduler
func (h *SchedulerHub) Stop()
```
## Audit Integration
The scheduler integrates with the audit logging system for security and compliance:
### Audit Logger Integration
```go
type SchedulerHub struct {
    // ... other fields ...
    auditor *audit.Logger // Security audit logger
}
}
```
**Initialization:**
```go
auditor := audit.NewLogger(audit.Config{
    LogPath: "/var/log/fetch_ml/scheduler_audit.log",
    Enabled: true,
})
hub, err := scheduler.NewHub(config, auditor)
if err != nil {
    log.Fatalf("failed to create scheduler hub: %v", err)
}
```
### Audit Events
The scheduler logs the following audit events:
| Event | Description | Fields Logged |
|-------|-------------|---------------|
| `job_submitted` | New job queued | job_id, user_id, job_type, gpu_count |
| `job_assigned` | Job assigned to worker | job_id, worker_id, assignment_time |
| `job_accepted` | Worker accepted job | job_id, worker_id, acceptance_time |
| `job_completed` | Job finished successfully | job_id, worker_id, duration |
| `job_failed` | Job failed | job_id, worker_id, error_code |
| `job_cancelled` | Job cancelled | job_id, cancelled_by, reason |
| `worker_registered` | Worker connected | worker_id, capabilities, timestamp |
| `worker_disconnected` | Worker disconnected | worker_id, duration_connected |
| `quota_exceeded` | GPU quota violation | user_id, plugin_name, requested, limit |
### Tamper-Evident Logging
Audit logs use chain hashing for integrity:
- Each event includes SHA-256 hash of previous event
- Chain verification detects log tampering
- Separate log file from operational logs
### Configuration
```go
type HubConfig struct {
    BindAddr                string            // Listen address
    CertFile                string            // TLS certificate
    KeyFile                 string            // TLS key
    StateDir                string            // State persistence dir
    DefaultBatchSlots       int               // Default batch slots per worker
    DefaultServiceSlots     int               // Default service slots per worker
    StarvationThresholdMins float64           // Starvation detection threshold
    PriorityAgingRate       float64           // Priority increase rate
    GangAllocTimeoutSecs    int               // Multi-node allocation timeout
    AcceptanceTimeoutSecs   int               // Job acceptance timeout
    WorkerTokens            map[string]string // Authentication tokens
    PluginQuota             PluginQuotaConfig // Plugin GPU quota configuration
}
```
## Cross-Platform Support
Process management is abstracted for Unix/Windows:
- `service_manager_unix.go`: POSIX process groups
- `service_manager_windows.go`: Windows job objects
## See Also
- **[Architecture Overview](architecture.md)** - High-level system architecture
- **[Security Guide](security.md)** - Audit logging and security features
- **[Configuration Reference](configuration-reference.md)** - Plugin GPU quotas and scheduler config
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter service integration with scheduler
- **[vLLM Workflow](vllm-workflow.md)** - vLLM service integration with scheduler
- **[Testing Guide](testing.md)** - Testing the scheduler
- **`internal/scheduler/hub.go`** - Core implementation
- **`tests/fixtures/scheduler_fixture.go`** - Test infrastructure