
---
title: Homelab Architecture
url: /architecture/
weight: 1
---

Homelab Architecture

Simple, secure architecture for ML experiments in your homelab.

Components Overview

graph TB
    subgraph "Homelab Stack"
        CLI[Zig CLI]
        API["API Server (HTTPS + WebSocket)"]
        REDIS[Redis Cache]
        DB[(SQLite/PostgreSQL)]
        FS[Local Storage]
        WORKER[Worker Service]
        PODMAN[Podman/Docker]
    end
    
    CLI --> API
    API --> REDIS
    API --> DB
    API --> FS
    WORKER --> API
    WORKER --> REDIS
    WORKER --> FS
    WORKER --> PODMAN

Core Services

API Server

  • Purpose: Secure HTTPS API for ML experiments
  • Port: 9101 (HTTPS only)
  • Auth: API key authentication
  • Security: Rate limiting, IP whitelisting

Redis

  • Purpose: Caching and job queuing
  • Port: 6379 (localhost only)
  • Storage: Temporary data only
  • Persistence: Local volume

Zig CLI

  • Purpose: High-performance experiment management
  • Language: Zig, chosen for fast startup and fine-grained memory control
  • Features:
    • Content-addressed storage with deduplication
    • SHA256-based commit ID generation
    • WebSocket communication for real-time updates
    • Rsync-based incremental file transfers
    • Multi-threaded operations
    • Secure API key authentication
    • Auto-sync monitoring with file system watching
    • Priority-based job queuing
    • Memory-efficient operations with arena allocators

Security Architecture

graph LR
    USER[User] --> AUTH[API Key Auth]
    AUTH --> RATE[Rate Limiting]
    RATE --> WHITELIST[IP Whitelist]
    WHITELIST --> API[Secure API]
    API --> AUDIT[Audit Logging]

Security Layers

  1. API Key Authentication - Hashed keys with roles
  2. Rate Limiting - 30 requests/minute
  3. IP Whitelisting - Local networks only
  4. Fail2Ban - Automatic IP blocking
  5. HTTPS/TLS - Encrypted communication
  6. Audit Logging - Complete action tracking

Data Flow

sequenceDiagram
    participant CLI
    participant API
    participant Redis
    participant Storage
    
    CLI->>API: HTTPS + WebSocket request
    API->>API: Validate Auth
    API->>Redis: Cache/Queue
    API->>Storage: Experiment Data
    Storage->>API: Results
    API->>CLI: Response

Deployment Options

services:
  redis:
    image: redis:7-alpine
    ports: ["127.0.0.1:6379:6379"]  # bind to localhost only, per the network architecture
    volumes: [redis_data:/data]
    
  api-server:
    build: .
    ports: ["9101:9101"]
    depends_on: [redis]

Local Setup

docker-compose -f deployments/docker-compose.dev.yml up -d

Network Architecture

  • Private Network: Docker internal network
  • Localhost Access: Redis only on localhost
  • HTTPS API: Port 9101, TLS encrypted
  • No External Dependencies: Everything runs locally

Storage Architecture

data/
├── experiments/     # Experiment definitions, run manifests, and artifacts
├── tracking/        # Tracking tool state (e.g., MLflow/TensorBoard), when enabled
├── .prewarm/        # Best-effort prewarm staging (snapshots/env/datasets), when enabled
├── cache/           # Temporary caches (best-effort)
└── backups/         # Local backups

logs/
├── app.log         # Application logs
├── audit.log       # Security events
└── access.log      # API access logs

Monitoring Architecture

Simple, lightweight monitoring:

  • Health Checks: Service availability
  • Log Files: Structured logging
  • Prometheus Metrics: Worker and API metrics (including prewarm hit/miss/timing)
  • Security Events: Failed auth, rate limits

Homelab Benefits

  • Simple Setup: One-command installation
  • Local Only: No external dependencies
  • Secure by Default: HTTPS, auth, rate limiting
  • Low Resource: Minimal CPU/memory usage
  • Easy Backup: Local file system
  • Privacy: Everything stays on your network

High-Level Architecture

graph TB
    subgraph "Client Layer"
        CLI[CLI Tools]
        TUI[Terminal UI]
        API[WebSocket API]
    end
    
    subgraph "Authentication Layer"
        Auth[Authentication Service]
        RBAC[Role-Based Access Control]
        Perm[Permission Manager]
    end
    
    subgraph "Core Services"
        Worker[ML Worker Service]
        DataMgr[Data Manager Service]
        Queue[Job Queue]
    end
    
    subgraph "Storage Layer"
        Redis[(Redis Cache)]
        DB[(SQLite/PostgreSQL)]
        Files[File Storage]
    end
    
    subgraph "Container Runtime"
        Podman[Podman/Docker]
        Containers[ML Containers]
    end
    
    CLI --> Auth
    TUI --> Auth
    API --> Auth
    
    Auth --> RBAC
    RBAC --> Perm
    
    Worker --> Queue
    Worker --> DataMgr
    Worker --> Podman
    
    DataMgr --> DB
    DataMgr --> Files
    
    Queue --> Redis
    
    Podman --> Containers

Tracking & Plugin System

fetch_ml includes an optional tracking plugin system that can provision sidecar tools and/or pass through environment variables for common research tracking stacks.

Tracking modes

Tracking tools support the following modes:

  • sidecar: provision a local sidecar container per task (best-effort).
  • remote: point to an externally managed instance (no local provisioning).
  • disabled: disable the tool entirely.

How it works

  • The worker maintains a tracking registry and provisions tools during task startup.
  • Provisioned plugins return environment variables that are injected into the task container.
  • Some plugins also require host paths (e.g., TensorBoard log directory); these are mounted into the task container and sanitized to avoid leaking host paths.

Built-in plugins

The worker ships with built-in plugins:

  • mlflow: can run an MLflow server as a sidecar or use a remote MLFLOW_TRACKING_URI.
  • tensorboard: runs a TensorBoard sidecar and mounts a per-job log directory.
  • wandb: does not provision a sidecar; it forwards configuration via environment variables.

Configuration

Plugins can be configured via worker configuration under plugins, including:

  • enabled
  • image
  • mode
  • per-plugin paths/settings (e.g., artifact base path, log base path)
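
A hedged sketch of what such a worker config could look like. The key names `enabled`, `image`, and `mode` come from the list above; the nesting, path keys, and values are illustrative assumptions, not the actual schema.

```yaml
plugins:
  mlflow:
    enabled: true
    mode: sidecar            # sidecar | remote | disabled
    image: ghcr.io/mlflow/mlflow:latest
    artifact_base_path: /data/tracking/mlflow
  tensorboard:
    enabled: true
    mode: sidecar
    log_base_path: /data/tracking/tensorboard
  wandb:
    enabled: false           # forwards env vars only; no sidecar
```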

Zig CLI Architecture

Component Structure

graph TB
    subgraph "Zig CLI Components"
        Main[main.zig] --> Commands[commands/]
        Commands --> Config[config.zig]
        Commands --> Utils[utils/]
        Commands --> Net[net/]
        Commands --> Errors[errors.zig]
        
        subgraph "Commands"
            Init[init.zig]
            Sync[sync.zig]
            Queue[queue.zig]
            Watch[watch.zig]
            Status[status.zig]
            Monitor[monitor.zig]
            Cancel[cancel.zig]
            Prune[prune.zig]
        end
        
        subgraph "Utils"
            Crypto[crypto.zig]
            Storage[storage.zig]
            Rsync[rsync.zig]
        end
        
        subgraph "Network"
            WS[ws.zig]
        end
    end

Performance Optimizations

Content-Addressed Storage

  • Deduplication: Files stored by SHA256 hash
  • Space Efficiency: Shared files across experiments
  • Fast Lookup: Hash-based file retrieval

Memory Management

  • Arena Allocators: Efficient bulk allocation
  • Zero-Copy Operations: Minimized memory copying
  • Automatic Cleanup: Resource deallocation

Network Communication

  • WebSocket Protocol: Real-time bidirectional communication
  • Connection Pooling: Reused connections
  • Binary Messaging: Efficient data transfer

Security Implementation

graph LR
    subgraph "CLI Security"
        Config[Config File] --> Hash[SHA256 Hashing]
        Hash --> Auth[API Authentication]
        Auth --> SSH[SSH Transfer]
        SSH --> WS[WebSocket Security]
    end

Core Components

1. Authentication & Authorization

graph LR
    subgraph "Auth Flow"
        Client[Client] --> APIKey[API Key]
        APIKey --> Hash[Hash Validation]
        Hash --> Roles[Role Resolution]
        Roles --> Perms[Permission Check]
        Perms --> Access[Grant/Deny Access]
    end
    
    subgraph "Permission Sources"
        YAML[YAML Config]
        Inline[Inline Fallback]
        Roles --> YAML
        Roles --> Inline
    end

Features:

  • API key-based authentication
  • Role-based access control (RBAC)
  • YAML-based permission configuration
  • Fallback to inline permissions
  • Admin wildcard permissions

2. Worker Service

graph TB
    subgraph "Worker Architecture"
        API[HTTP API] --> Router[Request Router]
        Router --> Auth[Auth Middleware]
        Auth --> Queue[Job Queue]
        Queue --> Processor[Job Processor]
        Processor --> Runtime[Container Runtime]
        Runtime --> Storage[Result Storage]
        
        subgraph "Job Lifecycle"
            Submit[Submit Job] --> Queue
            Queue --> Execute[Execute]
            Execute --> Monitor[Monitor]
            Monitor --> Complete[Complete]
            Complete --> Store[Store Results]
        end
    end

Responsibilities:

  • HTTP API for job submission
  • Job queue management
  • Container orchestration
  • Result collection and storage
  • Metrics and monitoring

3. Data Manager Service

graph TB
    subgraph "Data Management"
        API[Data API] --> Storage[Storage Layer]
        Storage --> Metadata[Metadata DB]
        Storage --> Files[File System]
        Storage --> Cache[Redis Cache]
        
        subgraph "Data Operations"
            Upload[Upload Data] --> Validate[Validate]
            Validate --> Store[Store]
            Store --> Index[Index]
            Index --> Catalog[Catalog]
        end
    end

Features:

  • Data upload and validation
  • Metadata management
  • File system abstraction
  • Caching layer
  • Data catalog

4. Terminal UI (TUI)

graph TB
    subgraph "TUI Architecture"
        UI[UI Components] --> Model[Data Model]
        Model --> Update[Update Loop]
        Update --> Render[Render]
        
        subgraph "UI Panels"
            Jobs[Job List]
            Details[Job Details]
            Logs[Log Viewer]
            Status[Status Bar]
        end
        
        UI --> Jobs
        UI --> Details
        UI --> Logs
        UI --> Status
    end

Components:

  • Bubble Tea framework
  • Component-based architecture
  • Real-time updates
  • Keyboard navigation
  • Theme support

Data Flow

Job Execution Flow

sequenceDiagram
    participant Client
    participant Auth
    participant Worker
    participant Queue
    participant Container
    participant Storage
    
    Client->>Auth: Submit job with API key
    Auth->>Client: Validate and return job ID
    
    Client->>Worker: Execute job request
    Worker->>Queue: Queue job
    Queue->>Worker: Job ready
    Worker->>Container: Start ML container
    Container->>Worker: Execute experiment
    Worker->>Storage: Store results
    Worker->>Client: Return results

Authentication Flow

sequenceDiagram
    participant Client
    participant Auth
    participant PermMgr
    participant Config
    
    Client->>Auth: Request with API key
    Auth->>Auth: Validate key hash
    Auth->>PermMgr: Get user permissions
    PermMgr->>Config: Load YAML permissions
    Config->>PermMgr: Return permissions
    PermMgr->>Auth: Return resolved permissions
    Auth->>Client: Grant/deny access

Security Architecture

Defense in Depth

graph TB
    subgraph "Security Layers"
        Network[Network Security]
        Auth[Authentication]
        AuthZ[Authorization]
        Container[Container Security]
        Data[Data Protection]
        Audit[Audit Logging]
    end
    
    Network --> Auth
    Auth --> AuthZ
    AuthZ --> Container
    Container --> Data
    Data --> Audit

Security Features:

  • API key authentication
  • Role-based permissions
  • Container isolation
  • File system sandboxing
  • Comprehensive audit logs
  • Input validation and sanitization

Container Security

graph TB
    subgraph "Container Isolation"
        Host[Host System]
        Podman[Podman Runtime]
        Network[Network Isolation]
        FS[File System Isolation]
        User[User Namespaces]
        ML[ML Container]
        
        Host --> Podman
        Podman --> Network
        Podman --> FS
        Podman --> User
        User --> ML
    end

Isolation Features:

  • Rootless containers
  • Network isolation
  • File system sandboxing
  • User namespace mapping
  • Resource limits

Configuration Architecture

Configuration Hierarchy

graph TB
    subgraph "Config Sources"
        Env[Environment Variables]
        File[Config Files]
        CLI[CLI Flags]
        Defaults[Default Values]
    end
    
    subgraph "Config Processing"
        Merge[Config Merger]
        Validate[Schema Validator]
        Apply[Config Applier]
    end
    
    Env --> Merge
    File --> Merge
    CLI --> Merge
    Defaults --> Merge
    
    Merge --> Validate
    Validate --> Apply

Configuration Priority:

  1. CLI flags (highest)
  2. Environment variables
  3. Configuration files
  4. Default values (lowest)

Scalability Architecture

Horizontal Scaling

graph TB
    subgraph "Scaled Architecture"
        LB[Load Balancer]
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker N]
        Redis[Redis Cluster]
        Storage[Shared Storage]
        
        LB --> W1
        LB --> W2
        LB --> W3
        
        W1 --> Redis
        W2 --> Redis
        W3 --> Redis
        
        W1 --> Storage
        W2 --> Storage
        W3 --> Storage
    end

Scaling Features:

  • Stateless worker services
  • Shared job queue (Redis)
  • Distributed storage
  • Load balancer ready
  • Health checks and monitoring

Technology Stack

Backend Technologies

| Component      | Technology        | Purpose            |
|----------------|-------------------|--------------------|
| Language       | Go 1.25+          | Core application   |
| Web Framework  | Standard library  | HTTP server        |
| Authentication | Custom            | API key + RBAC     |
| Database       | SQLite/PostgreSQL | Metadata storage   |
| Cache          | Redis             | Job queue & caching|
| Containers     | Podman/Docker     | Job isolation      |
| UI Framework   | Bubble Tea        | Terminal UI        |

Dependencies

// Core dependencies
require (
    github.com/charmbracelet/bubbletea v1.3.10  // TUI framework
    github.com/go-redis/redis/v8 v8.11.5        // Redis client
    github.com/google/uuid v1.6.0               // UUID generation
    github.com/mattn/go-sqlite3 v1.14.32        // SQLite driver
    golang.org/x/crypto v0.45.0                 // Crypto utilities
    gopkg.in/yaml.v3 v3.0.1                     // YAML parsing
)

Development Architecture

Project Structure

fetch_ml/
├── cmd/                    # CLI applications
│   ├── worker/            # ML worker service
│   ├── tui/               # Terminal UI
│   ├── data_manager/      # Data management
│   └── user_manager/      # User management
├── internal/              # Internal packages
│   ├── auth/              # Authentication system
│   ├── config/            # Configuration management
│   ├── container/         # Container operations
│   ├── database/          # Database operations
│   ├── logging/           # Logging utilities
│   ├── metrics/           # Metrics collection
│   └── network/           # Network utilities
├── configs/               # Configuration files
├── scripts/               # Setup and utility scripts
├── tests/                 # Test suites
└── docs/                  # Documentation

Package Dependencies

graph TB
    subgraph "Application Layer"
        Worker[cmd/worker]
        TUI[cmd/tui]
        DataMgr[cmd/data_manager]
        UserMgr[cmd/user_manager]
    end
    
    subgraph "Service Layer"
        Auth[internal/auth]
        Config[internal/config]
        Container[internal/container]
        Database[internal/database]
    end
    
    subgraph "Utility Layer"
        Logging[internal/logging]
        Metrics[internal/metrics]
        Network[internal/network]
    end
    
    Worker --> Auth
    Worker --> Config
    Worker --> Container
    TUI --> Auth
    DataMgr --> Database
    UserMgr --> Auth
    
    Auth --> Logging
    Container --> Network
    Database --> Metrics

Monitoring & Observability

Metrics Collection

graph TB
    subgraph "Metrics Pipeline"
        App[Application] --> Metrics[Metrics Collector]
        Metrics --> Export[Prometheus Exporter]
        Export --> Prometheus[Prometheus Server]
        Prometheus --> Grafana[Grafana Dashboard]
        
        subgraph "Metric Types"
            Counter[Counters]
            Gauge[Gauges]
            Histogram[Histograms]
            Timer[Timers]
        end
        
        App --> Counter
        App --> Gauge
        App --> Histogram
        App --> Timer
    end

Logging Architecture

graph TB
    subgraph "Logging Pipeline"
        App[Application] --> Logger[Structured Logger]
        Logger --> File[File Output]
        Logger --> Console[Console Output]
        Logger --> Syslog[Syslog Forwarder]
        Syslog --> Aggregator[Log Aggregator]
        Aggregator --> Storage[Log Storage]
        Storage --> Viewer[Log Viewer]
    end

Deployment Architecture

Container Deployment

graph TB
    subgraph "Deployment Stack"
        Image[Container Image]
        Registry[Container Registry]
        Orchestrator[Docker Compose]
        Config[ConfigMaps/Secrets]
        Storage[Persistent Storage]
        
        Image --> Registry
        Registry --> Orchestrator
        Config --> Orchestrator
        Storage --> Orchestrator
    end

Service Discovery

graph TB
    subgraph "Service Mesh"
        Gateway[API Gateway]
        Discovery[Service Discovery]
        Worker[Worker Service]
        Data[Data Service]
        Redis[Redis Cluster]
        
        Gateway --> Discovery
        Discovery --> Worker
        Discovery --> Data
        Discovery --> Redis
    end

Future Architecture Considerations

Microservices Evolution

  • API Gateway: Centralized routing and authentication
  • Service Mesh: Inter-service communication
  • Event Streaming: Kafka for job events
  • Distributed Tracing: OpenTelemetry integration
  • Multi-tenant: Tenant isolation and quotas

Homelab Features

  • Docker Compose: Simple container orchestration
  • Local Development: Easy setup and testing
  • Security: Built-in authentication and encryption
  • Monitoring: Basic health checks and logging

Roadmap (Research-First, Workstation-First)

fetch_ml is a research-first ML experiment runner built with production-grade discipline.

Guiding principles

  • Reproducibility over speed: optimizations must never change experimental semantics.
  • Explicit over magic: every run should be explainable from manifests, configs, and logs.
  • Best-effort optimizations: prewarming/caching must be optional and must not be required for correctness.
  • Workstation-first: prioritize single-node reliability, observability, and fast iteration; avoid HPC-specific complexity.

Where we are now

  • Run provenance: run_manifest.json exists and is readable via ml info <path|id>.
  • Validation: ml validate <commit_id> and ml validate --task <task_id> exist; task validation includes run-manifest lifecycle/provenance checks.
  • Prewarming (Phase 1, best-effort):
    • Next-task prewarm loop stages snapshots under base/.prewarm/snapshots/<task_id>.
    • Best-effort dataset prefetch with a TTL cache.
    • Warmed container image infrastructure exists (images keyed by deps_manifest_sha256).
    • Prewarm status is surfaced in ml status --json under the prewarm field.

Phase 0: Trust and usability (highest priority)

1) Make ml status excellent (human output)

  • Show a compact summary of:
    • queued/running/completed/failed counts
    • a short list of most relevant tasks
    • prewarm state (worker id, target task id, phase, dataset count, age)
  • Preserve --json output as stable API for scripting.

2) Add a dry-run preview command (ml explain)

  • Print the resolved execution plan before running:

    • commit id, experiment manifest overall sha
    • dependency manifest name + sha
    • snapshot id + expected sha (when applicable)
    • dataset identities + checksums (when applicable)
    • requested resources (cpu/mem/gpu)
    • candidate runtime image (base vs warmed tag)
  • Enforce a strict preflight by default:

    • Queue-time blocking (do not enqueue tasks that fail reproducibility requirements).
    • The strict preflight should be shared by ml queue and ml explain.
    • Record the resolved plan into task metadata for traceability:
      • repro_policy: strict
      • trust_level: <L0..L4> (simple trust ladder)
      • plan_sha256: <sha256> (digest of the resolved execution plan)

3) Tighten run manifest completeness

  • For running: require started_at.
  • For completed/failed: require started_at, ended_at, and exit_code.
  • When snapshots/datasets are used: ensure manifest records the relevant identifiers and digests.

4) Dataset identity (minimal but research-grade)

  • Prefer structured dataset_specs (name + checksum) as the authoritative input.
  • Treat missing checksum as an error by default (strict-by-default).

Phase 1: Simple performance wins (only after Phase 0 feels solid)

  • Keep prewarming single-level (next task only).
  • Improve observability first (status output + metrics), then expand capabilities.

Phase 2+: Research workflows

  • ml compare <runA> <runB>: manifest-driven diff of provenance and key parameters.
  • ml reproduce <run-id>: submit a new task derived from the recorded manifest inputs.
  • ml export <run-id>: package provenance + artifacts for collaborators/reviewers.

Phase 3: Infrastructure (only if needed)

  • Multi-level prewarming, predictive scheduling, tmpfs caching, dashboards.
  • Optional scalable storage backend for team deployments:
    • Store run manifests + artifacts in S3-compatible object storage (e.g., MinIO) for durability and multi-worker/Kubernetes setups.
    • Keep workstation-first defaults (local filesystem) for simplicity.
  • Optional integrations via plugins/exporters (keep core strict and offline-capable):
    • Server-side exporters that mirror run metadata, metrics, and artifacts to external systems (e.g., MLflow Tracking, Weights & Biases).
    • Prefer lifecycle hooks that consume run_manifest.json / artifact manifests over plugins that influence execution semantics.
  • Optional Kubernetes deployment path (for teams on scalable infra):
    • Publish versioned container images for the backend (API server; optionally worker) and provide reference manifests (Helm/Kustomize).
    • Keep the CLI as the primary UX; Kubernetes is an execution/deployment backend, not a UI.
  • These are optional and should be driven by measured bottlenecks.

This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.