fetch_ml/docs/src/scheduler-architecture.md
Jeremie Fraeys 90ea18555c
docs: add vLLM workflow and cross-link documentation
- Add new vLLM workflow documentation (vllm-workflow.md)
- Update scheduler-architecture.md with Plugin GPU Quota and audit logging
- Add See Also sections to jupyter-workflow.md, quick-start.md,
  configuration-reference.md for better navigation
- Update landing page and index with vLLM and scheduler links
- Cross-link all documentation for improved discoverability
2026-02-26 13:04:39 -05:00

Scheduler Architecture

The FetchML Scheduler manages distributed job scheduling across workers via WebSocket connections.

Overview

The scheduler consists of:

  • SchedulerHub: Core scheduling engine (internal/scheduler/hub.go)
  • PriorityQueue: Heap-based job queues for batch and service jobs
  • WorkerConn: WebSocket connection handling per worker
  • StateStore: Persistent state for crash recovery
  • ServiceManager: Long-running service lifecycle management

Key Components

SchedulerHub

type SchedulerHub struct {
    workers           map[string]*WorkerConn     // Active worker connections
    readyWorkers      map[string]*WorkerConn     // Workers ready for jobs
    batchQueue        *PriorityQueue             // Batch job queue
    serviceQueue      *PriorityQueue             // Service job queue
    reservations      map[string]*Reservation    // Job reservations
    multiNodePending  map[string]*MultiNodeJob   // Multi-node gang allocations
    pendingAcceptance map[string]*JobAssignment  // Jobs awaiting acceptance
    state             *StateStore                // Persistent state
}

Job Types

| Type | Description | Scheduling |
|------|-------------|------------|
| Batch | Finite training jobs | FIFO with priority aging |
| Service | Long-running inference | Dedicated slots, health checks |
| Multi-node | Distributed training | Gang allocation across workers |

Protocol

Unified WSS Protocol

All communication uses a single WebSocket Secure (WSS) endpoint:

  • Workers connect to wss://scheduler:port/ws/worker
  • Metrics clients connect using a token prefixed with metrics-

Message Types

const (
    // Worker → Scheduler
    MsgRegister       = "register"
    MsgHeartbeat      = "heartbeat"
    MsgReadyForWork   = "ready_for_work"
    MsgJobAccepted    = "job_accepted"
    MsgJobResult      = "job_result"
    MsgServiceHealth  = "service_health"
    MsgMetricsRequest = "metrics_request"  // Metrics over WSS

    // Scheduler → Worker
    MsgJobAssign       = "job_assign"
    MsgNoWork          = "no_work"
    MsgJobCancel       = "job_cancel"
    MsgPrewarmHint     = "prewarm_hint"
    MsgAck             = "ack"
    MsgMetricsResponse = "metrics_response"  // Metrics over WSS
)
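
As a rough illustration of how these message types might travel over the socket, the sketch below wraps a payload in a JSON envelope and dispatches on its type. The Type and Payload field names follow the usage shown elsewhere in this document; the JSON tags, the heartbeatPayload type, and the encode/decode helpers are assumptions made for this example and are not taken from the scheduler's source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MsgHeartbeat is copied from the constant list above.
const MsgHeartbeat = "heartbeat"

// Message is an assumed envelope: a type tag plus an opaque payload.
type Message struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload,omitempty"`
}

// heartbeatPayload is a hypothetical payload used only for this example.
type heartbeatPayload struct {
	WorkerID string `json:"worker_id"`
}

// encode wraps a payload in an envelope and serializes it.
func encode(msgType string, payload any) ([]byte, error) {
	raw, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	return json.Marshal(Message{Type: msgType, Payload: raw})
}

// decode parses an envelope; the caller dispatches on Type.
func decode(data []byte) (Message, error) {
	var m Message
	err := json.Unmarshal(data, &m)
	return m, err
}

func main() {
	data, err := encode(MsgHeartbeat, heartbeatPayload{WorkerID: "worker-1"})
	if err != nil {
		panic(err)
	}
	msg, err := decode(data)
	if err != nil {
		panic(err)
	}
	switch msg.Type {
	case MsgHeartbeat:
		var hb heartbeatPayload
		if err := json.Unmarshal(msg.Payload, &hb); err != nil {
			panic(err)
		}
		fmt.Println("heartbeat from", hb.WorkerID)
	default:
		fmt.Println("unhandled message type:", msg.Type)
	}
}
```

Keeping the payload as json.RawMessage lets the receiver defer decoding until it knows the message type.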

Metrics Over WSS

Metrics are retrieved via WSS using a special client token:

// Connect with a metrics token (any token with the metrics- prefix)
conn, err := scheduler.DialWSS("scheduler:8443", "ca.crt", "metrics-scraper-1")
if err != nil {
    return err
}

// Request metrics
if err := conn.WriteJSON(scheduler.Message{
    Type: scheduler.MsgMetricsRequest,
}); err != nil {
    return err
}

// Receive metrics
var msg scheduler.Message
if err := conn.ReadJSON(&msg); err != nil {
    return err
}
// msg.Type == scheduler.MsgMetricsResponse
// msg.Payload contains the metrics map

Metrics payload:

{
  "workers_connected": 5,
  "queue_depth_batch": 12,
  "queue_depth_service": 3,
  "jobs_completed": 142,
  "jobs_failed": 2,
  "jobs_cancelled": 0,
  "worker_slots": {
    "worker-1": {"batch_total": 4, "batch_in_use": 2, ...}
  }
}
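
A client might decode this payload into typed structs before using it. The sketch below is one way to do that, assuming only the keys shown above; the Metrics and SlotUsage type names and the batchUtilization helper are hypothetical, and the per-worker fields elided in the sample are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SlotUsage covers only the two per-worker fields shown in the sample payload.
type SlotUsage struct {
	BatchTotal int `json:"batch_total"`
	BatchInUse int `json:"batch_in_use"`
}

// Metrics mirrors the documented top-level keys.
type Metrics struct {
	WorkersConnected  int                  `json:"workers_connected"`
	QueueDepthBatch   int                  `json:"queue_depth_batch"`
	QueueDepthService int                  `json:"queue_depth_service"`
	JobsCompleted     int                  `json:"jobs_completed"`
	JobsFailed        int                  `json:"jobs_failed"`
	JobsCancelled     int                  `json:"jobs_cancelled"`
	WorkerSlots       map[string]SlotUsage `json:"worker_slots"`
}

// parseMetrics decodes a metrics payload.
func parseMetrics(data []byte) (Metrics, error) {
	var m Metrics
	err := json.Unmarshal(data, &m)
	return m, err
}

// batchUtilization returns the fraction of batch slots in use across all workers.
func batchUtilization(m Metrics) float64 {
	total, used := 0, 0
	for _, s := range m.WorkerSlots {
		total += s.BatchTotal
		used += s.BatchInUse
	}
	if total == 0 {
		return 0
	}
	return float64(used) / float64(total)
}

func main() {
	payload := []byte(`{
		"workers_connected": 5,
		"queue_depth_batch": 12,
		"worker_slots": {"worker-1": {"batch_total": 4, "batch_in_use": 2}}
	}`)
	m, err := parseMetrics(payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("workers=%d batch_queue=%d utilization=%.2f\n",
		m.WorkersConnected, m.QueueDepthBatch, batchUtilization(m))
}
```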

Features

Priority Aging

Priority aging prevents starvation by raising the priority of a job the longer it waits:

effective_priority = base_priority + (wait_time * aging_rate)
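
The formula can be sketched in Go as follows. This is a minimal illustration, assuming the aging rate is expressed in priority points per minute of waiting; the job type and the pickNext helper are hypothetical and stand in for the scheduler's heap-based queue.

```go
package main

import (
	"fmt"
	"time"
)

// job is a hypothetical queue entry used only for this example.
type job struct {
	id           string
	basePriority float64
	waited       time.Duration // time spent in the queue so far
}

// effectivePriority applies the formula above.
// agingRate is assumed to be priority points gained per minute of waiting.
func effectivePriority(j job, agingRate float64) float64 {
	return j.basePriority + j.waited.Minutes()*agingRate
}

// pickNext returns the job with the highest effective priority.
// It assumes jobs is non-empty.
func pickNext(jobs []job, agingRate float64) job {
	best := jobs[0]
	for _, j := range jobs[1:] {
		if effectivePriority(j, agingRate) > effectivePriority(best, agingRate) {
			best = j
		}
	}
	return best
}

func main() {
	jobs := []job{
		{id: "fresh-high", basePriority: 10, waited: 0},
		{id: "old-low", basePriority: 5, waited: 30 * time.Minute},
	}
	next := pickNext(jobs, 0.5)
	fmt.Println(next.id, effectivePriority(next, 0.5))
}
```

With a rate of 0.5 points per minute, the low-priority job that has waited 30 minutes reaches 5 + 30 × 0.5 = 20 and overtakes the fresh job at priority 10.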

Gang Allocation

Multi-node jobs are allocated atomically across workers:

  1. A job is submitted with NodeCount > 1
  2. The scheduler waits until the required number of workers is available
  3. All nodes are assigned simultaneously
  4. A timeout (GangAllocTimeoutSecs) bounds how long a partial allocation may wait
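
The all-or-nothing behaviour in these steps can be sketched as below. This is an illustrative model, not the scheduler's implementation: tryGang, the outcome values, and the use of a plain map for ready workers are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type outcome int

const (
	pending   outcome = iota // keep waiting for more workers
	allocated                // every node was assigned in one step
	timedOut                 // gave up; nothing was assigned
)

// tryGang assigns nodeCount workers all at once or none at all.
// ready is the set of idle worker IDs; chosen workers are removed from it.
// waited is how long the job has been waiting for its gang.
func tryGang(ready map[string]bool, nodeCount int, waited, timeout time.Duration) ([]string, outcome) {
	if len(ready) >= nodeCount {
		ids := make([]string, 0, len(ready))
		for id := range ready {
			ids = append(ids, id)
		}
		sort.Strings(ids) // deterministic choice for the example
		chosen := ids[:nodeCount]
		for _, id := range chosen {
			delete(ready, id)
		}
		return chosen, allocated
	}
	if waited >= timeout {
		return nil, timedOut
	}
	return nil, pending
}

func main() {
	ready := map[string]bool{"worker-1": true, "worker-2": true}
	timeout := 60 * time.Second

	// Two workers are not enough for three nodes, so none are taken.
	_, state := tryGang(ready, 3, 10*time.Second, timeout)
	fmt.Println(state == pending, len(ready))

	ready["worker-3"] = true
	nodes, state := tryGang(ready, 3, 20*time.Second, timeout)
	fmt.Println(state == allocated, nodes)
}
```

The key property is that a failed attempt leaves the ready set untouched, so a waiting gang never holds workers that other jobs could use.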

Starvation Prevention

Tracks job wait times and triggers priority boosts:

if wait_time > starvation_threshold {
    effective_priority += boost_amount
}
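
A runnable Go rendering of the pseudocode above might look like this. The function names are hypothetical; thresholdFromMinutes shows one way a fractional minute value such as StarvationThresholdMins could be turned into a time.Duration.

```go
package main

import (
	"fmt"
	"time"
)

// thresholdFromMinutes converts a fractional minute count
// (as in StarvationThresholdMins) into a time.Duration.
func thresholdFromMinutes(mins float64) time.Duration {
	return time.Duration(mins * float64(time.Minute))
}

// boostIfStarved returns priority plus boost once wait exceeds threshold,
// and the unchanged priority otherwise.
func boostIfStarved(priority float64, wait, threshold time.Duration, boost float64) float64 {
	if wait > threshold {
		return priority + boost
	}
	return priority
}

func main() {
	threshold := thresholdFromMinutes(30)
	fmt.Println(boostIfStarved(10, 45*time.Minute, threshold, 100)) // waited past the threshold
	fmt.Println(boostIfStarved(10, 5*time.Minute, threshold, 100))  // still within the threshold
}
```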

Worker Mode Switching

Workers can switch between batch and service modes:

  • Batch mode: processes training jobs
  • Service mode: runs long-lived inference services

Testing

Test Infrastructure

All tests use shared fixtures in tests/fixtures/:

  • SchedulerTestFixture: Common setup/teardown
  • MockWorker: Simulated worker connections

Test Categories

| Category | Count | Files |
|----------|-------|-------|
| Unit | 17+ | tests/unit/scheduler/ |
| Integration | 6 | tests/integration/scheduler/ |
| E2E | 6 | tests/e2e/scheduler/ |

Running Tests

make test                 # All tests
make test-unit            # Unit tests only
make test-integration     # Integration tests only
go test ./tests/e2e/...   # E2E tests

State Persistence

The scheduler persists state for crash recovery:

  • Job queue state
  • Task assignments
  • Worker registrations
  • Lease timestamps

State is replayed on startup via StateStore.Replay().
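
This document does not describe the on-disk format, so the following is only a sketch of the general replay idea: events are appended to a log and folded back into in-memory state on startup. The event type, its JSON-lines encoding, and the appendEvent and replay helpers are all assumptions for illustration and do not reflect how StateStore is actually implemented.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
)

// event is a hypothetical persisted record.
type event struct {
	Kind     string `json:"kind"` // "queued", "assigned", or "completed"
	TaskID   string `json:"task_id"`
	WorkerID string `json:"worker_id,omitempty"`
}

// appendEvent adds one event to the log as a JSON line.
func appendEvent(log []byte, e event) ([]byte, error) {
	line, err := json.Marshal(e)
	if err != nil {
		return log, err
	}
	return append(append(log, line...), '\n'), nil
}

// replay rebuilds the task -> worker map ("" means queued but unassigned).
func replay(log []byte) (map[string]string, error) {
	tasks := make(map[string]string)
	sc := bufio.NewScanner(bytes.NewReader(log))
	for sc.Scan() {
		var e event
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			return nil, err
		}
		switch e.Kind {
		case "queued":
			tasks[e.TaskID] = ""
		case "assigned":
			tasks[e.TaskID] = e.WorkerID
		case "completed":
			delete(tasks, e.TaskID)
		}
	}
	return tasks, sc.Err()
}

func main() {
	var log []byte // stands in for a file under StateDir
	var err error
	for _, e := range []event{
		{Kind: "queued", TaskID: "task-1"},
		{Kind: "queued", TaskID: "task-2"},
		{Kind: "assigned", TaskID: "task-1", WorkerID: "worker-1"},
		{Kind: "completed", TaskID: "task-1"},
	} {
		if log, err = appendEvent(log, e); err != nil {
			panic(err)
		}
	}
	tasks, err := replay(log)
	if err != nil {
		panic(err)
	}
	fmt.Println(tasks) // only task-2 remains, still unassigned
}
```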

Service Templates

The scheduler provides built-in service templates for common ML workloads.

Available Templates

| Template | Description | Default Port Range |
|----------|-------------|--------------------|
| JupyterLab | Interactive Jupyter environment | 8000-9000 |
| Jupyter Notebook | Classic Jupyter notebooks | 8000-9000 |
| vLLM | OpenAI-compatible LLM inference server | 8000-9000 |

Port Allocation

Dynamic port management for service instances:

type PortAllocator struct {
    startPort int               // Default: 8000
    endPort   int               // Default: 9000
    allocated map[int]time.Time // Port -> allocation time
}

Features:

  • Automatic port selection from configured range
  • TTL-based port reclamation
  • Thread-safe concurrent allocations
  • Exhaustion handling with clear error messages
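
One way these features could fit together is sketched below. The three documented fields are kept; the mutex, the ttl field, the inclusive treatment of endPort, and the allocate/release method names are assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var errPortsExhausted = errors.New("port allocator: no free ports in range")

// portAllocator follows the struct above; mu and ttl are assumed additions.
type portAllocator struct {
	mu        sync.Mutex
	startPort int
	endPort   int // treated as inclusive in this sketch
	ttl       time.Duration
	allocated map[int]time.Time // port -> allocation time
}

func newPortAllocator(start, end int, ttl time.Duration) *portAllocator {
	return &portAllocator{startPort: start, endPort: end, ttl: ttl, allocated: make(map[int]time.Time)}
}

// allocate reclaims expired ports, then returns the lowest free one.
func (p *portAllocator) allocate(now time.Time) (int, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for port, at := range p.allocated {
		if now.Sub(at) >= p.ttl {
			delete(p.allocated, port) // deleting while ranging over a map is allowed in Go
		}
	}
	for port := p.startPort; port <= p.endPort; port++ {
		if _, used := p.allocated[port]; !used {
			p.allocated[port] = now
			return port, nil
		}
	}
	return 0, errPortsExhausted
}

// release frees a port before its TTL expires.
func (p *portAllocator) release(port int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.allocated, port)
}

func main() {
	alloc := newPortAllocator(8000, 8001, time.Minute)
	now := time.Now()
	a, _ := alloc.allocate(now)
	b, _ := alloc.allocate(now)
	_, err := alloc.allocate(now) // range is full
	fmt.Println(a, b, err)

	c, _ := alloc.allocate(now.Add(2 * time.Minute)) // both allocations have expired
	fmt.Println(c)
}
```

Passing the current time into allocate keeps the TTL logic easy to test without waiting on a real clock.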

Template Variables

Service templates support dynamic variable substitution:

| Variable | Description | Example |
|----------|-------------|---------|
| {{SERVICE_PORT}} | Allocated port for the service | 8080 |
| {{WORKER_ID}} | ID of the assigned worker | worker-1 |
| {{TASK_ID}} | Unique task identifier | task-abc123 |
| {{SECRET:xxx}} | Secret reference from keychain | api-key-value |
| {{MODEL_NAME}} | ML model name (vLLM) | llama-2-7b |
| {{GPU_COUNT}} | Number of GPUs allocated | 2 |
| {{GPU_DEVICES}} | Specific GPU device IDs | 0,1 |
| {{MODEL_CACHE}} | Path to model cache directory | /models |
| {{WORKSPACE}} | Working directory path | /workspace |
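
Substitution of the plain variables can be illustrated with a small sketch. The render function and the run-service command line are hypothetical, and resolving {{SECRET:xxx}} references against a keychain is deliberately left out.

```go
package main

import (
	"fmt"
	"strings"
)

// render replaces each {{NAME}} placeholder with its value from vars.
// Placeholders with no entry in vars are left untouched.
func render(template string, vars map[string]string) string {
	pairs := make([]string, 0, 2*len(vars))
	for name, value := range vars {
		pairs = append(pairs, "{{"+name+"}}", value)
	}
	return strings.NewReplacer(pairs...).Replace(template)
}

func main() {
	// Hypothetical command template; not an actual built-in template.
	tmpl := "run-service --model {{MODEL_NAME}} --port {{SERVICE_PORT}} --gpus {{GPU_DEVICES}}"
	vars := map[string]string{
		"MODEL_NAME":   "llama-2-7b",
		"SERVICE_PORT": "8080",
		"GPU_DEVICES":  "0,1",
	}
	fmt.Println(render(tmpl, vars))
}
```

Because every placeholder is closed by `}}`, no placeholder is a prefix of another, so the result does not depend on map iteration order.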

API Methods

// SubmitJob submits a job to the scheduler
func (h *SchedulerHub) SubmitJob(spec JobSpec) error

// GetTask retrieves a task by ID
func (h *SchedulerHub) GetTask(taskID string) *Task

// Addr returns the scheduler's listen address
func (h *SchedulerHub) Addr() string

// Start begins the scheduler
func (h *SchedulerHub) Start() error

// Stop shuts down the scheduler
func (h *SchedulerHub) Stop()

Audit Integration

The scheduler integrates with the audit logging system for security and compliance.

Audit Logger Integration

type SchedulerHub struct {
    // ... other fields ...
    auditor *audit.Logger  // Security audit logger
}

Initialization:

auditor := audit.NewLogger(audit.Config{
    LogPath: "/var/log/fetch_ml/scheduler_audit.log",
    Enabled: true,
})
hub, err := scheduler.NewHub(config, auditor)

Audit Events

The scheduler logs the following audit events:

| Event | Description | Fields Logged |
|-------|-------------|---------------|
| job_submitted | New job queued | job_id, user_id, job_type, gpu_count |
| job_assigned | Job assigned to worker | job_id, worker_id, assignment_time |
| job_accepted | Worker accepted job | job_id, worker_id, acceptance_time |
| job_completed | Job finished successfully | job_id, worker_id, duration |
| job_failed | Job failed | job_id, worker_id, error_code |
| job_cancelled | Job cancelled | job_id, cancelled_by, reason |
| worker_registered | Worker connected | worker_id, capabilities, timestamp |
| worker_disconnected | Worker disconnected | worker_id, duration_connected |
| quota_exceeded | GPU quota violation | user_id, plugin_name, requested, limit |

Tamper-Evident Logging

Audit logs use chain hashing for integrity:

  • Each event includes the SHA-256 hash of the previous event
  • Chain verification detects log tampering
  • Separate log file from operational logs
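
The chaining idea can be shown with a short sketch. The entry layout and the exact bytes fed to the hash are assumptions for illustration; the audit package's real record format is not described in this document.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// entry is a hypothetical audit record: the event text plus chain hashes.
type entry struct {
	Event    string
	PrevHash string // hash of the previous entry ("" for the first)
	Hash     string // hash over PrevHash and Event
}

// hashEntry computes SHA-256 over the previous hash and the event text.
func hashEntry(prevHash, event string) string {
	sum := sha256.Sum256([]byte(prevHash + "\n" + event))
	return hex.EncodeToString(sum[:])
}

// appendEntry links a new event to the end of the chain.
func appendEntry(chain []entry, event string) []entry {
	prev := ""
	if len(chain) > 0 {
		prev = chain[len(chain)-1].Hash
	}
	return append(chain, entry{Event: event, PrevHash: prev, Hash: hashEntry(prev, event)})
}

// verify returns the index of the first broken entry, or -1 if the chain is intact.
func verify(chain []entry) int {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || e.Hash != hashEntry(prev, e.Event) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var chain []entry
	for _, ev := range []string{"job_submitted job-1", "job_assigned job-1", "job_completed job-1"} {
		chain = appendEntry(chain, ev)
	}
	fmt.Println("intact chain:", verify(chain))

	chain[1].Event = "job_assigned job-2" // simulate tampering
	fmt.Println("after tampering:", verify(chain))
}
```

Editing any entry changes its hash, which no longer matches the value stored in the entry after it, so verification fails at the first altered record.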

Configuration

type HubConfig struct {
    BindAddr                string            // Listen address
    CertFile                string            // TLS certificate
    KeyFile                 string            // TLS key
    StateDir                string            // State persistence dir
    DefaultBatchSlots       int               // Default batch slots per worker
    DefaultServiceSlots     int               // Default service slots per worker
    StarvationThresholdMins float64           // Starvation detection threshold
    PriorityAgingRate       float64           // Priority increase rate
    GangAllocTimeoutSecs    int               // Multi-node allocation timeout
    AcceptanceTimeoutSecs   int               // Job acceptance timeout
    WorkerTokens            map[string]string // Authentication tokens
    PluginQuota             PluginQuotaConfig // Plugin GPU quota configuration
}

Cross-Platform Support

Process management is abstracted for Unix/Windows:

  • service_manager_unix.go: POSIX process groups
  • service_manager_windows.go: Windows job objects

See Also