| title | url | weight |
|---|---|---|
| Homelab Architecture | /architecture/ | 1 |
# Homelab Architecture
Simple, secure architecture for ML experiments in your homelab.
## Components Overview

```mermaid
graph TB
    subgraph "Homelab Stack"
        CLI[Zig CLI]
        API["API Server (HTTPS + WebSocket)"]
        REDIS[Redis Cache]
        DB[(SQLite/PostgreSQL)]
        FS[Local Storage]
        WORKER[Worker Service]
        PODMAN[Podman/Docker]
    end
    CLI --> API
    API --> REDIS
    API --> DB
    API --> FS
    WORKER --> API
    WORKER --> REDIS
    WORKER --> FS
    WORKER --> PODMAN
```
## Core Services

### API Server

- Purpose: Secure HTTPS API for ML experiments
- Port: 9101 (HTTPS only)
- Auth: API key authentication
- Security: Rate limiting, IP whitelisting

### Redis

- Purpose: Caching and job queuing
- Port: 6379 (localhost only)
- Storage: Temporary data only
- Persistence: Local volume

### Zig CLI

- Purpose: High-performance experiment management
- Language: Zig for speed and efficiency
- Features:
  - Content-addressed storage with deduplication
  - SHA256-based commit ID generation
  - WebSocket communication for real-time updates
  - Rsync-based incremental file transfers
  - Multi-threaded operations
  - Secure API key authentication
  - Auto-sync monitoring with file system watching
  - Priority-based job queuing
  - Memory-efficient operations with arena allocators
## Security Architecture

```mermaid
graph LR
    USER[User] --> AUTH[API Key Auth]
    AUTH --> RATE[Rate Limiting]
    RATE --> WHITELIST[IP Whitelist]
    WHITELIST --> API[Secure API]
    API --> AUDIT[Audit Logging]
```
### Security Layers
- API Key Authentication - Hashed keys with roles
- Rate Limiting - 30 requests/minute
- IP Whitelisting - Local networks only
- Fail2Ban - Automatic IP blocking
- HTTPS/TLS - Encrypted communication
- Audit Logging - Complete action tracking
## Data Flow

```mermaid
sequenceDiagram
    participant CLI
    participant API
    participant Redis
    participant Storage
    CLI->>API: HTTPS + WebSocket request
    API->>API: Validate Auth
    API->>Redis: Cache/Queue
    API->>Storage: Experiment Data
    Storage->>API: Results
    API->>CLI: Response
```
## Deployment Options

### Docker Compose (Recommended)

```yaml
services:
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
    volumes: ["redis_data:/data"]
  api-server:
    build: .
    ports: ["9101:9101"]
    depends_on: [redis]
```
### Local Setup

```bash
docker-compose -f deployments/docker-compose.dev.yml up -d
```
## Network Architecture
- Private Network: Docker internal network
- Localhost Access: Redis only on localhost
- HTTPS API: Port 9101, TLS encrypted
- No External Dependencies: Everything runs locally
## Storage Architecture

```
data/
├── experiments/   # Experiment definitions, run manifests, and artifacts
├── tracking/      # Tracking tool state (e.g., MLflow/TensorBoard), when enabled
├── .prewarm/      # Best-effort prewarm staging (snapshots/env/datasets), when enabled
├── cache/         # Temporary caches (best-effort)
└── backups/       # Local backups

logs/
├── app.log        # Application logs
├── audit.log      # Security events
└── access.log     # API access logs
```
## Monitoring Architecture
Simple, lightweight monitoring:
- Health Checks: Service availability
- Log Files: Structured logging
- Prometheus Metrics: Worker and API metrics (including prewarm hit/miss/timing)
- Security Events: Failed auth, rate limits
## Homelab Benefits
- ✅ Simple Setup: One-command installation
- ✅ Local Only: No external dependencies
- ✅ Secure by Default: HTTPS, auth, rate limiting
- ✅ Low Resource: Minimal CPU/memory usage
- ✅ Easy Backup: Local file system
- ✅ Privacy: Everything stays on your network
## High-Level Architecture

```mermaid
graph TB
    subgraph "Client Layer"
        CLI[CLI Tools]
        TUI[Terminal UI]
        API[WebSocket API]
    end
    subgraph "Authentication Layer"
        Auth[Authentication Service]
        RBAC[Role-Based Access Control]
        Perm[Permission Manager]
    end
    subgraph "Core Services"
        Worker[ML Worker Service]
        DataMgr[Data Manager Service]
        Queue[Job Queue]
    end
    subgraph "Storage Layer"
        Redis[(Redis Cache)]
        DB[(SQLite/PostgreSQL)]
        Files[File Storage]
    end
    subgraph "Container Runtime"
        Podman[Podman/Docker]
        Containers[ML Containers]
    end
    CLI --> Auth
    TUI --> Auth
    API --> Auth
    Auth --> RBAC
    RBAC --> Perm
    Worker --> Queue
    Worker --> DataMgr
    Worker --> Podman
    DataMgr --> DB
    DataMgr --> Files
    Queue --> Redis
    Podman --> Containers
```
## Tracking & Plugin System

fetch_ml includes an optional tracking plugin system that can provision sidecar tools and/or pass through environment variables for common research tracking stacks.
### Tracking modes

Tracking tools support the following modes:

- `sidecar`: provision a local sidecar container per task (best-effort).
- `remote`: point to an externally managed instance (no local provisioning).
- `disabled`: disable the tool entirely.
### How it works
- The worker maintains a tracking registry and provisions tools during task startup.
- Provisioned plugins return environment variables that are injected into the task container.
- Some plugins also require host paths (e.g., TensorBoard log directory); these are mounted into the task container and sanitized to avoid leaking host paths.
### Built-in plugins

The worker ships with built-in plugins:

- `mlflow`: can run an MLflow server as a sidecar or use a remote `MLFLOW_TRACKING_URI`.
- `tensorboard`: runs a TensorBoard sidecar and mounts a per-job log directory.
- `wandb`: does not provision a sidecar; it forwards configuration via environment variables.
### Configuration

Plugins can be configured via worker configuration under `plugins`, including:

- `enabled`
- `image`
- `mode`
- per-plugin paths/settings (e.g., artifact base path, log base path)
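As an illustration, a worker configuration for these plugins might look like the following. All keys other than `plugins`, `enabled`, `image`, and `mode` are assumptions, not documented settings, and the image reference is hypothetical.

```yaml
plugins:
  mlflow:
    enabled: true
    mode: sidecar                 # sidecar | remote | disabled
    image: mlflow-server:latest   # hypothetical image reference
    artifact_base_path: /data/tracking/mlflow   # assumed per-plugin setting
  tensorboard:
    enabled: true
    mode: sidecar
    log_base_path: /data/tracking/tensorboard   # assumed per-plugin setting
  wandb:
    enabled: true
    mode: remote                  # no sidecar; configuration forwarded via env vars
```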
## Zig CLI Architecture

### Component Structure

```mermaid
graph TB
    subgraph "Zig CLI Components"
        Main[main.zig] --> Commands[commands/]
        Commands --> Config[config.zig]
        Commands --> Utils[utils/]
        Commands --> Net[net/]
        Commands --> Errors[errors.zig]
        subgraph "Commands"
            Init[init.zig]
            Sync[sync.zig]
            Queue[queue.zig]
            Watch[watch.zig]
            Status[status.zig]
            Monitor[monitor.zig]
            Cancel[cancel.zig]
            Prune[prune.zig]
        end
        subgraph "Utils"
            Crypto[crypto.zig]
            Storage[storage.zig]
            Rsync[rsync.zig]
        end
        subgraph "Network"
            WS[ws.zig]
        end
    end
```
### Performance Optimizations

#### Content-Addressed Storage

- Deduplication: Files stored by SHA256 hash
- Space Efficiency: Shared files across experiments
- Fast Lookup: Hash-based file retrieval
#### Memory Management
- Arena Allocators: Efficient bulk allocation
- Zero-Copy Operations: Minimized memory copying
- Automatic Cleanup: Resource deallocation
#### Network Communication
- WebSocket Protocol: Real-time bidirectional communication
- Connection Pooling: Reused connections
- Binary Messaging: Efficient data transfer
### Security Implementation

```mermaid
graph LR
    subgraph "CLI Security"
        Config[Config File] --> Hash[SHA256 Hashing]
        Hash --> Auth[API Authentication]
        Auth --> SSH[SSH Transfer]
        SSH --> WS[WebSocket Security]
    end
```
## Core Components

### 1. Authentication & Authorization
```mermaid
graph LR
    subgraph "Auth Flow"
        Client[Client] --> APIKey[API Key]
        APIKey --> Hash[Hash Validation]
        Hash --> Roles[Role Resolution]
        Roles --> Perms[Permission Check]
        Perms --> Access[Grant/Deny Access]
    end
    subgraph "Permission Sources"
        YAML[YAML Config]
        Inline[Inline Fallback]
    end
    Roles --> YAML
    Roles --> Inline
```
Features:
- API key-based authentication
- Role-based access control (RBAC)
- YAML-based permission configuration
- Fallback to inline permissions
- Admin wildcard permissions
### 2. Worker Service
```mermaid
graph TB
    subgraph "Worker Architecture"
        API[HTTP API] --> Router[Request Router]
        Router --> Auth[Auth Middleware]
        Auth --> Queue[Job Queue]
        Queue --> Processor[Job Processor]
        Processor --> Runtime[Container Runtime]
        Runtime --> Storage[Result Storage]
        subgraph "Job Lifecycle"
            Submit[Submit Job] --> Queue
            Queue --> Execute[Execute]
            Execute --> Monitor[Monitor]
            Monitor --> Complete[Complete]
            Complete --> Store[Store Results]
        end
    end
```
Responsibilities:
- HTTP API for job submission
- Job queue management
- Container orchestration
- Result collection and storage
- Metrics and monitoring
### 3. Data Manager Service
```mermaid
graph TB
    subgraph "Data Management"
        API[Data API] --> Storage[Storage Layer]
        Storage --> Metadata[Metadata DB]
        Storage --> Files[File System]
        Storage --> Cache[Redis Cache]
        subgraph "Data Operations"
            Upload[Upload Data] --> Validate[Validate]
            Validate --> Store[Store]
            Store --> Index[Index]
            Index --> Catalog[Catalog]
        end
    end
```
Features:
- Data upload and validation
- Metadata management
- File system abstraction
- Caching layer
- Data catalog
### 4. Terminal UI (TUI)
```mermaid
graph TB
    subgraph "TUI Architecture"
        UI[UI Components] --> Model[Data Model]
        Model --> Update[Update Loop]
        Update --> Render[Render]
        subgraph "UI Panels"
            Jobs[Job List]
            Details[Job Details]
            Logs[Log Viewer]
            Status[Status Bar]
        end
        UI --> Jobs
        UI --> Details
        UI --> Logs
        UI --> Status
    end
```
Components:
- Bubble Tea framework
- Component-based architecture
- Real-time updates
- Keyboard navigation
- Theme support
## Data Flow

### Job Execution Flow
```mermaid
sequenceDiagram
    participant Client
    participant Auth
    participant Worker
    participant Queue
    participant Container
    participant Storage
    Client->>Auth: Submit job with API key
    Auth->>Client: Validate and return job ID
    Client->>Worker: Execute job request
    Worker->>Queue: Queue job
    Queue->>Worker: Job ready
    Worker->>Container: Start ML container
    Container->>Worker: Execute experiment
    Worker->>Storage: Store results
    Worker->>Client: Return results
```
### Authentication Flow
```mermaid
sequenceDiagram
    participant Client
    participant Auth
    participant PermMgr
    participant Config
    Client->>Auth: Request with API key
    Auth->>Auth: Validate key hash
    Auth->>PermMgr: Get user permissions
    PermMgr->>Config: Load YAML permissions
    Config->>PermMgr: Return permissions
    PermMgr->>Auth: Return resolved permissions
    Auth->>Client: Grant/deny access
```
## Security Architecture

### Defense in Depth
```mermaid
graph TB
    subgraph "Security Layers"
        Network[Network Security]
        Auth[Authentication]
        AuthZ[Authorization]
        Container[Container Security]
        Data[Data Protection]
        Audit[Audit Logging]
    end
    Network --> Auth
    Auth --> AuthZ
    AuthZ --> Container
    Container --> Data
    Data --> Audit
```
Security Features:
- API key authentication
- Role-based permissions
- Container isolation
- File system sandboxing
- Comprehensive audit logs
- Input validation and sanitization
### Container Security
```mermaid
graph TB
    subgraph "Container Isolation"
        Host[Host System]
        Podman[Podman Runtime]
        Network[Network Isolation]
        FS[File System Isolation]
        User[User Namespaces]
        ML[ML Container]
        Host --> Podman
        Podman --> Network
        Podman --> FS
        Podman --> User
        User --> ML
    end
```
Isolation Features:
- Rootless containers
- Network isolation
- File system sandboxing
- User namespace mapping
- Resource limits
## Configuration Architecture

### Configuration Hierarchy
```mermaid
graph TB
    subgraph "Config Sources"
        Env[Environment Variables]
        File[Config Files]
        CLI[CLI Flags]
        Defaults[Default Values]
    end
    subgraph "Config Processing"
        Merge[Config Merger]
        Validate[Schema Validator]
        Apply[Config Applier]
    end
    Env --> Merge
    File --> Merge
    CLI --> Merge
    Defaults --> Merge
    Merge --> Validate
    Validate --> Apply
```
Configuration Priority:

1. CLI flags (highest)
2. Environment variables
3. Configuration files
4. Default values (lowest)
## Scalability Architecture

### Horizontal Scaling
```mermaid
graph TB
    subgraph "Scaled Architecture"
        LB[Load Balancer]
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker N]
        Redis[Redis Cluster]
        Storage[Shared Storage]
        LB --> W1
        LB --> W2
        LB --> W3
        W1 --> Redis
        W2 --> Redis
        W3 --> Redis
        W1 --> Storage
        W2 --> Storage
        W3 --> Storage
    end
```
Scaling Features:
- Stateless worker services
- Shared job queue (Redis)
- Distributed storage
- Load balancer ready
- Health checks and monitoring
## Technology Stack

### Backend Technologies
| Component | Technology | Purpose |
|---|---|---|
| Language | Go 1.25+ | Core application |
| Web Framework | Standard library | HTTP server |
| Authentication | Custom | API key + RBAC |
| Database | SQLite/PostgreSQL | Metadata storage |
| Cache | Redis | Job queue & caching |
| Containers | Podman/Docker | Job isolation |
| UI Framework | Bubble Tea | Terminal UI |
### Dependencies

```go
// Core dependencies
require (
	github.com/charmbracelet/bubbletea v1.3.10 // TUI framework
	github.com/go-redis/redis/v8 v8.11.5       // Redis client
	github.com/google/uuid v1.6.0              // UUID generation
	github.com/mattn/go-sqlite3 v1.14.32       // SQLite driver
	golang.org/x/crypto v0.45.0                // Crypto utilities
	gopkg.in/yaml.v3 v3.0.1                    // YAML parsing
)
```
## Development Architecture

### Project Structure

```
fetch_ml/
├── cmd/                  # CLI applications
│   ├── worker/           # ML worker service
│   ├── tui/              # Terminal UI
│   ├── data_manager/     # Data management
│   └── user_manager/     # User management
├── internal/             # Internal packages
│   ├── auth/             # Authentication system
│   ├── config/           # Configuration management
│   ├── container/        # Container operations
│   ├── database/         # Database operations
│   ├── logging/          # Logging utilities
│   ├── metrics/          # Metrics collection
│   └── network/          # Network utilities
├── configs/              # Configuration files
├── scripts/              # Setup and utility scripts
├── tests/                # Test suites
└── docs/                 # Documentation
```
### Package Dependencies
```mermaid
graph TB
    subgraph "Application Layer"
        Worker[cmd/worker]
        TUI[cmd/tui]
        DataMgr[cmd/data_manager]
        UserMgr[cmd/user_manager]
    end
    subgraph "Service Layer"
        Auth[internal/auth]
        Config[internal/config]
        Container[internal/container]
        Database[internal/database]
    end
    subgraph "Utility Layer"
        Logging[internal/logging]
        Metrics[internal/metrics]
        Network[internal/network]
    end
    Worker --> Auth
    Worker --> Config
    Worker --> Container
    TUI --> Auth
    DataMgr --> Database
    UserMgr --> Auth
    Auth --> Logging
    Container --> Network
    Database --> Metrics
```
## Monitoring & Observability

### Metrics Collection
```mermaid
graph TB
    subgraph "Metrics Pipeline"
        App[Application] --> Metrics[Metrics Collector]
        Metrics --> Export[Prometheus Exporter]
        Export --> Prometheus[Prometheus Server]
        Prometheus --> Grafana[Grafana Dashboard]
        subgraph "Metric Types"
            Counter[Counters]
            Gauge[Gauges]
            Histogram[Histograms]
            Timer[Timers]
        end
        App --> Counter
        App --> Gauge
        App --> Histogram
        App --> Timer
    end
```
### Logging Architecture
```mermaid
graph TB
    subgraph "Logging Pipeline"
        App[Application] --> Logger[Structured Logger]
        Logger --> File[File Output]
        Logger --> Console[Console Output]
        Logger --> Syslog[Syslog Forwarder]
        Syslog --> Aggregator[Log Aggregator]
        Aggregator --> Storage[Log Storage]
        Storage --> Viewer[Log Viewer]
    end
```
## Deployment Architecture

### Container Deployment
```mermaid
graph TB
    subgraph "Deployment Stack"
        Image[Container Image]
        Registry[Container Registry]
        Orchestrator[Docker Compose]
        Config[ConfigMaps/Secrets]
        Storage[Persistent Storage]
        Image --> Registry
        Registry --> Orchestrator
        Config --> Orchestrator
        Storage --> Orchestrator
    end
```
### Service Discovery
```mermaid
graph TB
    subgraph "Service Mesh"
        Gateway[API Gateway]
        Discovery[Service Discovery]
        Worker[Worker Service]
        Data[Data Service]
        Redis[Redis Cluster]
        Gateway --> Discovery
        Discovery --> Worker
        Discovery --> Data
        Discovery --> Redis
    end
```
## Future Architecture Considerations

### Microservices Evolution
- API Gateway: Centralized routing and authentication
- Service Mesh: Inter-service communication
- Event Streaming: Kafka for job events
- Distributed Tracing: OpenTelemetry integration
- Multi-tenant: Tenant isolation and quotas
### Homelab Features
- Docker Compose: Simple container orchestration
- Local Development: Easy setup and testing
- Security: Built-in authentication and encryption
- Monitoring: Basic health checks and logging
## Roadmap (Research-First, Workstation-First)

fetch_ml is a research-first ML experiment runner with production-grade discipline.
### Guiding principles
- Reproducibility over speed: optimizations must never change experimental semantics.
- Explicit over magic: every run should be explainable from manifests, configs, and logs.
- Best-effort optimizations: prewarming/caching must be optional and must not be required for correctness.
- Workstation-first: prioritize single-node reliability, observability, and fast iteration; avoid HPC-specific complexity.
### Where we are now

- Run provenance: `run_manifest.json` exists and is readable via `ml info <path|id>`.
- Validation: `ml validate <commit_id>` and `ml validate --task <task_id>` exist; task validation includes run-manifest lifecycle/provenance checks.
- Prewarming (Phase 1, best-effort):
  - Next-task prewarm loop stages snapshots under `base/.prewarm/snapshots/<task_id>`.
  - Best-effort dataset prefetch with a TTL cache.
  - Warmed container image infrastructure exists (images keyed by `deps_manifest_sha256`).
  - Prewarm status is surfaced in `ml status --json` under the `prewarm` field.
### Phase 0: Trust and usability (highest priority)

#### 1) Make `ml status` excellent (human output)

- Show a compact summary of:
  - queued/running/completed/failed counts
  - a short list of the most relevant tasks
  - prewarm state (worker id, target task id, phase, dataset count, age)
- Preserve `--json` output as a stable API for scripting.
#### 2) Add a dry-run preview command (`ml explain`)

- Print the resolved execution plan before running:
  - commit id, experiment manifest overall sha
  - dependency manifest name + sha
  - snapshot id + expected sha (when applicable)
  - dataset identities + checksums (when applicable)
  - requested resources (cpu/mem/gpu)
  - candidate runtime image (base vs warmed tag)
- Enforce a strict preflight by default:
  - Queue-time blocking (do not enqueue tasks that fail reproducibility requirements).
  - The strict preflight should be shared by `ml queue` and `ml explain`.
- Record the resolved plan into task metadata for traceability:
  - `repro_policy: strict`
  - `trust_level: <L0..L4>` (simple trust ladder)
  - `plan_sha256: <sha256>` (digest of the resolved execution plan)
#### 3) Tighten run manifest completeness

- For `running`: require `started_at`.
- For `completed`/`failed`: require `started_at`, `ended_at`, and `exit_code`.
- When snapshots/datasets are used: ensure the manifest records the relevant identifiers and digests.
#### 4) Dataset identity (minimal but research-grade)

- Prefer structured `dataset_specs` (name + checksum) as the authoritative input.
- Treat a missing checksum as an error by default (strict-by-default).
### Phase 1: Simple performance wins (only after Phase 0 feels solid)
- Keep prewarming single-level (next task only).
- Improve observability first (status output + metrics), then expand capabilities.
### Phase 2+: Research workflows

- `ml compare <runA> <runB>`: manifest-driven diff of provenance and key parameters.
- `ml reproduce <run-id>`: submit a new task derived from the recorded manifest inputs.
- `ml export <run-id>`: package provenance + artifacts for collaborators/reviewers.
### Phase 3: Infrastructure (only if needed)
- Multi-level prewarming, predictive scheduling, tmpfs caching, dashboards.
- Optional scalable storage backend for team deployments:
  - Store run manifests + artifacts in S3-compatible object storage (e.g., MinIO) for durability and multi-worker/Kubernetes setups.
  - Keep workstation-first defaults (local filesystem) for simplicity.
- Optional integrations via plugins/exporters (keep the core strict and offline-capable):
  - Server-side exporters that mirror run metadata, metrics, and artifacts to external systems (e.g., MLflow Tracking, Weights & Biases).
  - Prefer lifecycle hooks that consume `run_manifest.json` / artifact manifests over plugins that influence execution semantics.
- Optional Kubernetes deployment path (for teams on scalable infra):
  - Publish versioned container images for the backend (API server; optionally the worker) and provide reference manifests (Helm/Kustomize).
  - Keep the CLI as the primary UX; Kubernetes is an execution/deployment backend, not a UI.
- These are optional and should be driven by measured bottlenecks.
This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.