jfraeysd/fetch_ml


Jeremie Fraeys 90ea18555c

docs: add vLLM workflow and cross-link documentation

- Add new vLLM workflow documentation (vllm-workflow.md)
- Update scheduler-architecture.md with Plugin GPU Quota and audit logging
- Add See Also sections to jupyter-workflow.md, quick-start.md,
  configuration-reference.md for better navigation
- Update landing page and index with vLLM and scheduler links
- Cross-link all documentation for improved discoverability

2026-02-26 13:04:39 -05:00


Scheduler Architecture

The FetchML Scheduler manages distributed job scheduling across workers via WebSocket connections.

Overview

The scheduler consists of:

SchedulerHub: Core scheduling engine (internal/scheduler/hub.go)
PriorityQueue: Heap-based job queues for batch and service jobs
WorkerConn: WebSocket connection handling per worker
StateStore: Persistent state for crash recovery
ServiceManager: Long-running service lifecycle management

Key Components

SchedulerHub

type SchedulerHub struct {
    workers           map[string]*WorkerConn     // Active worker connections
    readyWorkers      map[string]*WorkerConn     // Workers ready for jobs
    batchQueue        *PriorityQueue             // Batch job queue
    serviceQueue      *PriorityQueue             // Service job queue
    reservations      map[string]*Reservation    // Job reservations
    multiNodePending  map[string]*MultiNodeJob   // Multi-node gang allocations
    pendingAcceptance map[string]*JobAssignment  // Jobs awaiting acceptance
    state             *StateStore                // Persistent state
}

Job Types

| Type | Description | Scheduling |
|------|-------------|------------|
| Batch | Finite training jobs | FIFO with priority aging |
| Service | Long-running inference | Dedicated slots, health checks |
| Multi-node | Distributed training | Gang allocation across workers |

Protocol

Unified WSS Protocol

All communication uses a single WebSocket Secure (WSS) endpoint:

Workers connect to wss://scheduler:port/ws/worker
Metrics clients connect with a token that starts with the metrics- prefix

Message Types

const (
    // Worker → Scheduler
    MsgRegister       = "register"
    MsgHeartbeat      = "heartbeat"
    MsgReadyForWork   = "ready_for_work"
    MsgJobAccepted    = "job_accepted"
    MsgJobResult      = "job_result"
    MsgServiceHealth  = "service_health"
    MsgMetricsRequest = "metrics_request"  // Metrics over WSS

    // Scheduler → Worker
    MsgJobAssign       = "job_assign"
    MsgNoWork          = "no_work"
    MsgJobCancel       = "job_cancel"
    MsgPrewarmHint     = "prewarm_hint"
    MsgAck             = "ack"
    MsgMetricsResponse = "metrics_response"  // Metrics over WSS
)
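To make the message flow concrete, here is a minimal sketch of how a typed envelope could be encoded and decoded. It is not taken from the scheduler source: only the `Type` and `Payload` field names appear in this document, while the JSON tags, the `Encode`/`Decode` helpers, and the `job_id` payload key are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message is a hypothetical envelope. Only the Type and Payload field names
// appear in this document; the JSON tags are assumptions.
type Message struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload,omitempty"`
}

// Encode wraps an optional payload in an envelope of the given type.
func Encode(msgType string, payload any) ([]byte, error) {
	var raw json.RawMessage
	if payload != nil {
		b, err := json.Marshal(payload)
		if err != nil {
			return nil, err
		}
		raw = b
	}
	return json.Marshal(Message{Type: msgType, Payload: raw})
}

// Decode parses an envelope and leaves the payload undecoded so the caller
// can choose a concrete payload type based on Type.
func Decode(data []byte) (Message, error) {
	var m Message
	err := json.Unmarshal(data, &m)
	return m, err
}

func main() {
	// "job_id" is an illustrative payload key, not a documented field.
	b, err := Encode("job_assign", map[string]string{"job_id": "job-1"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))

	m, err := Decode(b)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Type, string(m.Payload))
}
```

Keeping the payload as raw JSON until the type is known lets one read loop dispatch every message type without a large union struct.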

Metrics Over WSS

Metrics are retrieved via WSS using a special client token:

// Connect with a metrics token (any token prefixed with "metrics-")
conn, err := scheduler.DialWSS("scheduler:8443", "ca.crt", "metrics-scraper-1")
if err != nil {
    return err
}

// Request metrics
if err := conn.WriteJSON(scheduler.Message{
    Type: scheduler.MsgMetricsRequest,
}); err != nil {
    return err
}

// Receive metrics
var msg scheduler.Message
if err := conn.ReadJSON(&msg); err != nil {
    return err
}
// msg.Type == scheduler.MsgMetricsResponse
// msg.Payload contains the metrics map

Metrics payload:

{
  "workers_connected": 5,
  "queue_depth_batch": 12,
  "queue_depth_service": 3,
  "jobs_completed": 142,
  "jobs_failed": 2,
  "jobs_cancelled": 0,
  "worker_slots": {
    "worker-1": {"batch_total": 4, "batch_in_use": 2, ...}
  }
}
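A scraper will usually want the payload in a typed form. The sketch below decodes the fields shown above; the `Metrics` struct, `ParseMetrics`, and `FailureRate` are hypothetical names, and `worker_slots` is kept as a generic map because the per-worker fields are elided in this document.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metrics mirrors the keys of the payload shown in this document.
// The struct itself is an assumption, not the scheduler's own type.
type Metrics struct {
	WorkersConnected  int `json:"workers_connected"`
	QueueDepthBatch   int `json:"queue_depth_batch"`
	QueueDepthService int `json:"queue_depth_service"`
	JobsCompleted     int `json:"jobs_completed"`
	JobsFailed        int `json:"jobs_failed"`
	JobsCancelled     int `json:"jobs_cancelled"`
	// Per-worker slot counters; a generic map avoids guessing elided fields.
	WorkerSlots map[string]map[string]int `json:"worker_slots"`
}

// ParseMetrics decodes a metrics payload.
func ParseMetrics(data []byte) (Metrics, error) {
	var m Metrics
	err := json.Unmarshal(data, &m)
	return m, err
}

// FailureRate returns failed / (completed + failed), or 0 with no finished jobs.
func (m Metrics) FailureRate() float64 {
	finished := m.JobsCompleted + m.JobsFailed
	if finished == 0 {
		return 0
	}
	return float64(m.JobsFailed) / float64(finished)
}

// sample contains only the fields that are visible in the documented payload.
const sample = `{
  "workers_connected": 5,
  "queue_depth_batch": 12,
  "queue_depth_service": 3,
  "jobs_completed": 142,
  "jobs_failed": 2,
  "jobs_cancelled": 0,
  "worker_slots": {"worker-1": {"batch_total": 4, "batch_in_use": 2}}
}`

func main() {
	m, err := ParseMetrics([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Printf("workers=%d batch_queue=%d failure_rate=%.3f\n",
		m.WorkersConnected, m.QueueDepthBatch, m.FailureRate())
}
```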

Features

Priority Aging

Priority aging prevents starvation by raising the priority of jobs the longer they wait:

effective_priority = base_priority + (wait_time * aging_rate)
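The formula can be sketched in Go as follows. This is an illustration under stated assumptions: wait time is measured in minutes, the `Job` struct and `PickNext` are hypothetical, and a linear scan stands in for the heap-based queue to keep the example short.

```go
package main

import "fmt"

// Job is a hypothetical queue entry used only for this sketch.
type Job struct {
	ID           string
	BasePriority float64
	WaitMinutes  float64 // assumed unit: minutes spent in the queue
}

// EffectivePriority applies base_priority + (wait_time * aging_rate).
func EffectivePriority(j Job, agingRate float64) float64 {
	return j.BasePriority + j.WaitMinutes*agingRate
}

// PickNext returns the job with the highest effective priority.
// Ties keep the earlier entry, which preserves FIFO order among equals.
func PickNext(jobs []Job, agingRate float64) (Job, bool) {
	if len(jobs) == 0 {
		return Job{}, false
	}
	best := jobs[0]
	for _, j := range jobs[1:] {
		if EffectivePriority(j, agingRate) > EffectivePriority(best, agingRate) {
			best = j
		}
	}
	return best, true
}

func main() {
	jobs := []Job{
		{ID: "fresh-high", BasePriority: 5, WaitMinutes: 0},
		{ID: "old-low", BasePriority: 1, WaitMinutes: 60},
	}
	next, _ := PickNext(jobs, 0.1)
	fmt.Println("next job:", next.ID)
}
```

With an aging rate of 0.1 per minute, the low-priority job that has waited an hour overtakes the fresh high-priority one; with a rate of 0 the queue falls back to plain priority order.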

Gang Allocation

Multi-node jobs are allocated atomically across workers:

Job submitted with NodeCount > 1
Scheduler waits for required workers
All nodes assigned simultaneously
Timeout handling for partial allocations
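The all-or-nothing behaviour described above could look roughly like this. The `GangRequest` type and both helpers are hypothetical; the real scheduler tracks pending gangs in `multiNodePending`, which this sketch does not model.

```go
package main

import "fmt"

// GangRequest is a hypothetical description of a pending multi-node job.
type GangRequest struct {
	JobID      string
	NodeCount  int
	WaitedSecs int // how long the request has been waiting for workers
}

// Allocate assigns NodeCount workers at once, or none at all.
// It never hands out a partial set of workers.
func Allocate(req GangRequest, ready []string) (assigned, stillReady []string, ok bool) {
	if req.NodeCount <= 0 || len(ready) < req.NodeCount {
		return nil, ready, false
	}
	assigned = append([]string(nil), ready[:req.NodeCount]...)
	stillReady = append([]string(nil), ready[req.NodeCount:]...)
	return assigned, stillReady, true
}

// TimedOut reports whether a pending request has waited past the timeout
// (compare GangAllocTimeoutSecs in the configuration).
func TimedOut(req GangRequest, timeoutSecs int) bool {
	return req.WaitedSecs >= timeoutSecs
}

func main() {
	req := GangRequest{JobID: "train-1", NodeCount: 3}

	// Only two workers are ready: nothing is assigned.
	_, _, ok := Allocate(req, []string{"worker-1", "worker-2"})
	fmt.Println("allocated with 2 ready workers:", ok)

	// A third worker becomes ready: all nodes are assigned together.
	assigned, rest, ok := Allocate(req, []string{"worker-1", "worker-2", "worker-3", "worker-4"})
	fmt.Println(ok, assigned, rest)
}
```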

Starvation Prevention

The scheduler tracks job wait times and boosts the priority of jobs that wait longer than the starvation threshold:

if wait_time > starvation_threshold {
    effective_priority += boost_amount
}

Worker Mode Switching

Workers can switch between batch and service modes:

Batch mode: processes training jobs
Service mode: runs long-lived inference services

Testing

Test Infrastructure

All tests use shared fixtures in tests/fixtures/:

SchedulerTestFixture: Common setup/teardown
MockWorker: Simulated worker connections

Test Categories

| Category | Count | Files |
|----------|-------|-------|
| Unit | 17+ | `tests/unit/scheduler/` |
| Integration | 6 | `tests/integration/scheduler/` |
| E2E | 6 | `tests/e2e/scheduler/` |

Running Tests

make test                    # All tests
make test-unit              # Unit tests only
make test-integration       # Integration tests only
go test ./tests/e2e/...     # E2E tests

State Persistence

The scheduler persists state for crash recovery:

Job queue state
Task assignments
Worker registrations
Lease timestamps

State is replayed on startup via StateStore.Replay().

Service Templates

The scheduler provides built-in service templates for common ML workloads:

Available Templates

| Template | Description | Default Port Range |
|----------|-------------|--------------------|
| JupyterLab | Interactive Jupyter environment | 8000-9000 |
| Jupyter Notebook | Classic Jupyter notebooks | 8000-9000 |
| vLLM | OpenAI-compatible LLM inference server | 8000-9000 |

Port Allocation

Dynamic port management for service instances:

type PortAllocator struct {
    startPort int    // Default: 8000
    endPort   int    // Default: 9000
    allocated map[int]time.Time  // Port -> allocation time
}

Features:

Automatic port selection from configured range
TTL-based port reclamation
Thread-safe concurrent allocations
Exhaustion handling with clear error messages
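A minimal sketch of how these features could fit together is shown below. It reuses the field names from the struct above, but the mutex, the `ttl` field, the constructor, and the method names are assumptions. It also assumes the range is inclusive and that a port held longer than the TTL may be handed out again.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrPortsExhausted is a hypothetical sentinel error for a full range.
var ErrPortsExhausted = errors.New("no free ports in configured range")

// PortAllocator follows the documented fields; mu and ttl are assumptions.
type PortAllocator struct {
	mu        sync.Mutex
	startPort int
	endPort   int
	ttl       time.Duration
	allocated map[int]time.Time // port -> allocation time
}

// NewPortAllocator builds an allocator for the inclusive range [start, end].
func NewPortAllocator(start, end, ttlSeconds int) *PortAllocator {
	return &PortAllocator{
		startPort: start,
		endPort:   end,
		ttl:       time.Duration(ttlSeconds) * time.Second,
		allocated: make(map[int]time.Time),
	}
}

// Allocate returns the lowest port that is free or whose TTL has expired.
func (p *PortAllocator) Allocate() (int, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	now := time.Now()
	for port := p.startPort; port <= p.endPort; port++ {
		if at, held := p.allocated[port]; held && now.Sub(at) < p.ttl {
			continue // still held and not yet expired
		}
		p.allocated[port] = now
		return port, nil
	}
	return 0, fmt.Errorf("%w: %d-%d", ErrPortsExhausted, p.startPort, p.endPort)
}

// Release frees a port before its TTL expires.
func (p *PortAllocator) Release(port int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.allocated, port)
}

func main() {
	alloc := NewPortAllocator(8000, 8001, 3600)
	a, _ := alloc.Allocate()
	b, _ := alloc.Allocate()
	_, err := alloc.Allocate()
	fmt.Println(a, b, err)
}
```

A single mutex around the map is the simplest way to make concurrent allocations safe; the scan is linear in the range size, which is small for the default 8000-9000 range.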

Template Variables

Service templates support dynamic variable substitution:

| Variable | Description | Example |
|----------|-------------|---------|
| `{{SERVICE_PORT}}` | Allocated port for the service | `8080` |
| `{{WORKER_ID}}` | ID of the assigned worker | `worker-1` |
| `{{TASK_ID}}` | Unique task identifier | `task-abc123` |
| `{{SECRET:xxx}}` | Secret reference from keychain | `api-key-value` |
| `{{MODEL_NAME}}` | ML model name (vLLM) | `llama-2-7b` |
| `{{GPU_COUNT}}` | Number of GPUs allocated | `2` |
| `{{GPU_DEVICES}}` | Specific GPU device IDs | `0,1` |
| `{{MODEL_CACHE}}` | Path to model cache directory | `/models` |
| `{{WORKSPACE}}` | Working directory path | `/workspace` |
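Substitution of these variables can be sketched with a regular expression. The `Render` function, the placeholder pattern, and the `run-service` command line are hypothetical. Secrets are looked up in the same map under a `SECRET:name` key purely for illustration; the scheduler itself resolves them from the keychain.

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholder matches {{NAME}} and {{SECRET:name}} style variables.
var placeholder = regexp.MustCompile(`\{\{([A-Z_]+(?::[^}]+)?)\}\}`)

// Render replaces every placeholder with its value from vars.
// It fails if any placeholder has no value, instead of leaving it in place.
func Render(tmpl string, vars map[string]string) (string, error) {
	var missing []string
	out := placeholder.ReplaceAllStringFunc(tmpl, func(m string) string {
		key := m[2 : len(m)-2] // strip the surrounding {{ and }}
		v, ok := vars[key]
		if !ok {
			missing = append(missing, key)
			return m
		}
		return v
	})
	if len(missing) > 0 {
		return "", fmt.Errorf("unresolved template variables: %v", missing)
	}
	return out, nil
}

func main() {
	// run-service is a made-up command used only to show substitution.
	tmpl := "run-service --model {{MODEL_NAME}} --port {{SERVICE_PORT}} --key {{SECRET:api_key}}"
	out, err := Render(tmpl, map[string]string{
		"MODEL_NAME":     "llama-2-7b",
		"SERVICE_PORT":   "8080",
		"SECRET:api_key": "api-key-value",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Failing on an unresolved variable is a deliberate choice in this sketch: a service started with a literal `{{SERVICE_PORT}}` in its arguments is harder to diagnose than an early error.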

API Methods

// SubmitJob submits a job to the scheduler
func (h *SchedulerHub) SubmitJob(spec JobSpec) error

// GetTask retrieves a task by ID
func (h *SchedulerHub) GetTask(taskID string) *Task

// Addr returns the scheduler's listen address
func (h *SchedulerHub) Addr() string

// Start begins the scheduler
func (h *SchedulerHub) Start() error

// Stop shuts down the scheduler
func (h *SchedulerHub) Stop()

Audit Integration

The scheduler integrates with the audit logging system for security and compliance:

Audit Logger Integration

type SchedulerHub struct {
    // ... other fields ...
    auditor *audit.Logger  // Security audit logger
}

Initialization:

auditor := audit.NewLogger(audit.Config{
    LogPath: "/var/log/fetch_ml/scheduler_audit.log",
    Enabled: true,
})
hub, err := scheduler.NewHub(config, auditor)

Audit Events

The scheduler logs the following audit events:

| Event | Description | Fields Logged |
|-------|-------------|---------------|
| `job_submitted` | New job queued | job_id, user_id, job_type, gpu_count |
| `job_assigned` | Job assigned to worker | job_id, worker_id, assignment_time |
| `job_accepted` | Worker accepted job | job_id, worker_id, acceptance_time |
| `job_completed` | Job finished successfully | job_id, worker_id, duration |
| `job_failed` | Job failed | job_id, worker_id, error_code |
| `job_cancelled` | Job cancelled | job_id, cancelled_by, reason |
| `worker_registered` | Worker connected | worker_id, capabilities, timestamp |
| `worker_disconnected` | Worker disconnected | worker_id, duration_connected |
| `quota_exceeded` | GPU quota violation | user_id, plugin_name, requested, limit |

Tamper-Evident Logging

Audit logs use chain hashing for integrity:

Each event includes the SHA-256 hash of the previous event
Chain verification detects log tampering
Audit events are written to a log file separate from operational logs
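The chaining idea can be illustrated with a short sketch. The `Event` layout, the hash input format, and the `Append`/`Verify` helpers are assumptions made for this example and do not describe the actual `audit` package.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Event is a hypothetical audit record that links to its predecessor.
type Event struct {
	Type     string
	Data     string
	PrevHash string // hash of the previous event, empty for the first one
	Hash     string // SHA-256 over PrevHash, Type, and Data
}

// hashEvent computes the hex-encoded SHA-256 of one event's contents.
func hashEvent(prevHash, typ, data string) string {
	sum := sha256.Sum256([]byte(prevHash + "\n" + typ + "\n" + data))
	return hex.EncodeToString(sum[:])
}

// Append adds an event whose hash covers the previous event's hash.
func Append(chain []Event, typ, data string) []Event {
	prev := ""
	if len(chain) > 0 {
		prev = chain[len(chain)-1].Hash
	}
	return append(chain, Event{Type: typ, Data: data, PrevHash: prev, Hash: hashEvent(prev, typ, data)})
}

// Verify returns the index of the first event that breaks the chain, or -1.
func Verify(chain []Event) int {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || e.Hash != hashEvent(e.PrevHash, e.Type, e.Data) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var chain []Event
	chain = Append(chain, "job_submitted", "job_id=job-1")
	chain = Append(chain, "job_assigned", "job_id=job-1 worker_id=worker-1")
	chain = Append(chain, "job_completed", "job_id=job-1")
	fmt.Println("intact chain:", Verify(chain))

	chain[1].Data = "job_id=job-1 worker_id=worker-9" // simulate tampering
	fmt.Println("after tampering:", Verify(chain))
}
```

Because each hash covers the previous one, editing or deleting an event in the middle of the log invalidates that event and every later link unless the whole tail is rewritten.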

Configuration

type HubConfig struct {
    BindAddr                string            // Listen address
    CertFile                string            // TLS certificate
    KeyFile                 string            // TLS key
    StateDir                string            // State persistence dir
    DefaultBatchSlots       int               // Default batch slots per worker
    DefaultServiceSlots     int               // Default service slots per worker
    StarvationThresholdMins float64           // Starvation detection threshold
    PriorityAgingRate       float64           // Priority increase rate
    GangAllocTimeoutSecs    int               // Multi-node allocation timeout
    AcceptanceTimeoutSecs   int               // Job acceptance timeout
    WorkerTokens            map[string]string // Authentication tokens
    PluginQuota             PluginQuotaConfig // Plugin GPU quota configuration
}

Cross-Platform Support

Process management is abstracted for Unix/Windows:

service_manager_unix.go: POSIX process groups
service_manager_windows.go: Windows job objects
