---
title: "Homelab Architecture"
url: "/architecture/"
weight: 1
---

# Homelab Architecture

Simple, secure architecture for ML experiments in your homelab.

## Components Overview

```mermaid
graph TB
    subgraph "Homelab Stack"
        CLI[Zig CLI]
        API["API Server (HTTPS + WebSocket)"]
        REDIS[Redis Cache]
        DB[(SQLite/PostgreSQL)]
        FS[Local Storage]
        WORKER[Worker Service]
        PODMAN[Podman/Docker]
    end

    CLI --> API
    API --> REDIS
    API --> DB
    API --> FS
    WORKER --> API
    WORKER --> REDIS
    WORKER --> FS
    WORKER --> PODMAN
```

## Core Services

### API Server

- **Purpose**: Secure HTTPS API for ML experiments
- **Port**: 9101 (HTTPS only)
- **Auth**: API key authentication
- **Security**: Rate limiting, IP whitelisting

### Redis

- **Purpose**: Caching and job queuing
- **Port**: 6379 (localhost only)
- **Storage**: Temporary data only
- **Persistence**: Local volume

### Zig CLI

- **Purpose**: High-performance experiment management
- **Language**: Zig for maximum speed and efficiency
- **Features**:
  - Content-addressed storage with deduplication
  - SHA256-based commit ID generation
  - WebSocket communication for real-time updates
  - Rsync-based incremental file transfers
  - Multi-threaded operations
  - Secure API key authentication
  - Auto-sync monitoring with file system watching
  - Priority-based job queuing
  - Memory-efficient operations with arena allocators

## Security Architecture

```mermaid
graph LR
    USER[User] --> AUTH[API Key Auth]
    AUTH --> RATE[Rate Limiting]
    RATE --> WHITELIST[IP Whitelist]
    WHITELIST --> API[Secure API]
    API --> AUDIT[Audit Logging]
```

### Security Layers

1. **API Key Authentication** - Hashed keys with roles
2. **Rate Limiting** - 30 requests/minute
3. **IP Whitelisting** - Local networks only
4. **Fail2Ban** - Automatic IP blocking
5. **HTTPS/TLS** - Encrypted communication
6. **Audit Logging** - Complete action tracking

## Data Flow

```mermaid
sequenceDiagram
    participant CLI
    participant API
    participant Redis
    participant Storage

    CLI->>API: HTTPS + WebSocket request
    API->>API: Validate Auth
    API->>Redis: Cache/Queue
    API->>Storage: Experiment Data
    Storage->>API: Results
    API->>CLI: Response
```

## Deployment Options

### Docker Compose (Recommended)

```yaml
services:
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
    volumes: [redis_data:/data]

  api-server:
    build: .
    ports: ["9101:9101"]
    depends_on: [redis]
```

### Local Setup

```bash
docker-compose -f deployments/docker-compose.dev.yml up -d
```

## Network Architecture

- **Private Network**: Docker internal network
- **Localhost Access**: Redis only on localhost
- **HTTPS API**: Port 9101, TLS encrypted
- **No External Dependencies**: Everything runs locally

## Storage Architecture

```
data/
├── experiments/   # Experiment definitions, run manifests, and artifacts
├── tracking/      # Tracking tool state (e.g., MLflow/TensorBoard), when enabled
├── .prewarm/      # Best-effort prewarm staging (snapshots/env/datasets), when enabled
├── cache/         # Temporary caches (best-effort)
└── backups/       # Local backups

logs/
├── app.log        # Application logs
├── audit.log      # Security events
└── access.log     # API access logs
```

## Monitoring Architecture

Simple, lightweight monitoring:

- **Health Checks**: Service availability
- **Log Files**: Structured logging
- **Prometheus Metrics**: Worker and API metrics (including prewarm hit/miss/timing)
- **Security Events**: Failed auth, rate limits

## Homelab Benefits

- ✅ **Simple Setup**: One-command installation
- ✅ **Local Only**: No external dependencies
- ✅ **Secure by Default**: HTTPS, auth, rate limiting
- ✅ **Low Resource**: Minimal CPU/memory usage
- ✅ **Easy Backup**: Local file system
- ✅ **Privacy**: Everything stays on your network

## High-Level Architecture

```mermaid
graph TB
    subgraph "Client Layer"
        CLI[CLI Tools]
        TUI[Terminal UI]
        API[WebSocket API]
    end

    subgraph "Authentication Layer"
        Auth[Authentication Service]
        RBAC[Role-Based Access Control]
        Perm[Permission Manager]
    end

    subgraph "Core Services"
        Worker[ML Worker Service]
        DataMgr[Data Manager Service]
        Queue[Job Queue]
    end

    subgraph "Storage Layer"
        Redis[(Redis Cache)]
        DB[(SQLite/PostgreSQL)]
        Files[File Storage]
    end

    subgraph "Container Runtime"
        Podman[Podman/Docker]
        Containers[ML Containers]
    end

    CLI --> Auth
    TUI --> Auth
    API --> Auth
    Auth --> RBAC
    RBAC --> Perm
    Worker --> Queue
    Worker --> DataMgr
    Worker --> Podman
    DataMgr --> DB
    DataMgr --> Files
    Queue --> Redis
    Podman --> Containers
```

## Tracking & Plugin System

fetch_ml includes an optional tracking plugin system that can provision sidecar tools and/or pass through environment variables for common research tracking stacks.

### Tracking modes

Tracking tools support the following modes:

- `sidecar`: provision a local sidecar container per task (best-effort).
- `remote`: point to an externally managed instance (no local provisioning).
- `disabled`: disable the tool entirely.

### How it works

- The worker maintains a tracking registry and provisions tools during task startup.
- Provisioned plugins return environment variables that are injected into the task container.
- Some plugins also require host paths (e.g., the TensorBoard log directory); these are mounted into the task container and sanitized to avoid leaking host paths.

### Built-in plugins

The worker ships with built-in plugins:

- `mlflow`: can run an MLflow server as a sidecar or use a remote `MLFLOW_TRACKING_URI`.
- `tensorboard`: runs a TensorBoard sidecar and mounts a per-job log directory.
- `wandb`: does not provision a sidecar; it forwards configuration via environment variables.
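To make the pass-through style concrete, here is a minimal sketch of a plugin that, like `wandb`, only forwards configuration as environment variables. The `TrackingPlugin` interface, type names, and `injectEnv` helper are illustrative assumptions for this example, not the actual fetch_ml worker API.

```go
package main

import "fmt"

// TrackingPlugin is a hypothetical, simplified tracking-plugin
// interface: given a job ID, a plugin returns environment variables
// to inject into the task container.
type TrackingPlugin interface {
	Name() string
	Provision(jobID string) (map[string]string, error)
}

// wandbPlugin forwards configuration via environment variables only;
// it never provisions a sidecar container.
type wandbPlugin struct {
	apiKey  string
	project string
}

func (p *wandbPlugin) Name() string { return "wandb" }

func (p *wandbPlugin) Provision(jobID string) (map[string]string, error) {
	return map[string]string{
		"WANDB_API_KEY": p.apiKey,
		"WANDB_PROJECT": p.project,
		"WANDB_RUN_ID":  jobID, // tie the tracked run to the job
	}, nil
}

// injectEnv merges plugin-provided vars into the container env
// without overwriting values the user set explicitly.
func injectEnv(base map[string]string, p TrackingPlugin, jobID string) (map[string]string, error) {
	extra, err := p.Provision(jobID)
	if err != nil {
		return nil, err
	}
	out := make(map[string]string, len(base)+len(extra))
	for k, v := range extra {
		out[k] = v
	}
	for k, v := range base { // user-provided env wins on conflict
		out[k] = v
	}
	return out, nil
}

func main() {
	p := &wandbPlugin{apiKey: "example-key", project: "demo"}
	env, _ := injectEnv(map[string]string{"WANDB_PROJECT": "override"}, p, "job-42")
	fmt.Println(env["WANDB_PROJECT"], env["WANDB_RUN_ID"]) // override job-42
}
```

Letting user-supplied environment variables win over plugin defaults keeps the plugin best-effort: it can never silently change a run that was configured explicitly.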
### Configuration

Plugins can be configured via worker configuration under `plugins`, including:

- `enabled`
- `image`
- `mode`
- per-plugin paths/settings (e.g., artifact base path, log base path)

## Plugin GPU Quota System

The scheduler includes a GPU quota management system for plugin-based services (Jupyter, vLLM, etc.) that controls resource allocation across users and plugins.

### Quota Enforcement

The quota system enforces limits at multiple levels:

1. **Global GPU Limit**: Total GPUs available across all plugins
2. **Per-User GPU Limit**: Maximum GPUs a single user can allocate
3. **Per-User Service Limit**: Maximum number of service instances per user
4. **Plugin-Specific Limits**: Separate limits for each plugin type
5. **User Overrides**: Custom limits for specific users, with allowed-plugin restrictions

### Architecture

```mermaid
graph TB
    subgraph "Plugin Quota System"
        Submit[Job Submission] --> CheckQuota{Check Quota}
        CheckQuota -->|Within Limits| Accept[Accept Job]
        CheckQuota -->|Exceeded| Reject[Reject with Error]
        Accept --> RecordUsage[Record Usage]
        RecordUsage --> Assign[Assign to Worker]
        Complete[Job Complete] --> ReleaseUsage[Release Usage]

        subgraph "Quota Manager"
            Global[Global GPU Counter]
            PerUser[Per-User Tracking]
            PerPlugin[Per-Plugin Tracking]
            Overrides[User Overrides]
        end

        CheckQuota --> Global
        CheckQuota --> PerUser
        CheckQuota --> PerPlugin
        CheckQuota --> Overrides
    end
```

### Components

- **PluginQuotaConfig**: Configuration for all quota limits and overrides
- **PluginQuotaManager**: Thread-safe manager for tracking and enforcing quotas
- **Integration Points**:
  - `SubmitJob()`: Validates quotas before accepting service jobs
  - `handleJobAccepted()`: Records usage when jobs are assigned
  - `handleJobResult()`: Releases usage when jobs complete

### Usage

Jobs must include `user_id` and `plugin_name` metadata for quota tracking:

```go
spec := scheduler.JobSpec{
    Type:     scheduler.JobTypeService,
    UserID:   "user123",
    GPUCount: 2,
    Metadata: map[string]string{
        "plugin_name": "jupyter",
    },
}
```

## Zig CLI Architecture

### Component Structure

```mermaid
graph TB
    subgraph "Zig CLI Components"
        Main[main.zig] --> Commands[commands/]
        Commands --> Config[config.zig]
        Commands --> Utils[utils/]
        Commands --> Net[net/]
        Commands --> Errors[errors.zig]

        subgraph "Commands"
            Init[init.zig]
            Sync[sync.zig]
            Queue[queue.zig]
            Watch[watch.zig]
            Status[status.zig]
            Monitor[monitor.zig]
            Cancel[cancel.zig]
            Prune[prune.zig]
        end

        subgraph "Utils"
            Crypto[crypto.zig]
            Storage[storage.zig]
            Rsync[rsync.zig]
        end

        subgraph "Network"
            WS[ws.zig]
        end
    end
```

### Performance Optimizations

#### Content-Addressed Storage

- **Deduplication**: Files stored by SHA256 hash
- **Space Efficiency**: Shared files across experiments
- **Fast Lookup**: Hash-based file retrieval

#### Memory Management

- **Arena Allocators**: Efficient bulk allocation
- **Zero-Copy Operations**: Minimized memory copying
- **Automatic Cleanup**: Resource deallocation

#### Network Communication

- **WebSocket Protocol**: Real-time bidirectional communication
- **Connection Pooling**: Reused connections
- **Binary Messaging**: Efficient data transfer

### Security Implementation

```mermaid
graph LR
    subgraph "CLI Security"
        Config[Config File] --> Hash[SHA256 Hashing]
        Hash --> Auth[API Authentication]
        Auth --> SSH[SSH Transfer]
        SSH --> WS[WebSocket Security]
    end
```

## Core Components

### 1. Authentication & Authorization

```mermaid
graph LR
    subgraph "Auth Flow"
        Client[Client] --> APIKey[API Key]
        APIKey --> Hash[Hash Validation]
        Hash --> Roles[Role Resolution]
        Roles --> Perms[Permission Check]
        Perms --> Access[Grant/Deny Access]
    end

    subgraph "Permission Sources"
        YAML[YAML Config]
        Inline[Inline Fallback]
        Roles --> YAML
        Roles --> Inline
    end
```

**Features:**
- API key-based authentication
- Role-based access control (RBAC)
- YAML-based permission configuration
- Fallback to inline permissions
- Admin wildcard permissions

### 2. Worker Service

```mermaid
graph TB
    subgraph "Worker Architecture"
        API[HTTP API] --> Router[Request Router]
        Router --> Auth[Auth Middleware]
        Auth --> Queue[Job Queue]
        Queue --> Processor[Job Processor]
        Processor --> Runtime[Container Runtime]
        Runtime --> Storage[Result Storage]

        subgraph "Job Lifecycle"
            Submit[Submit Job] --> Queue
            Queue --> Execute[Execute]
            Execute --> Monitor[Monitor]
            Monitor --> Complete[Complete]
            Complete --> Store[Store Results]
        end
    end
```

**Responsibilities:**
- HTTP API for job submission
- Job queue management
- Container orchestration
- Result collection and storage
- Metrics and monitoring

### 3. Data Manager Service

```mermaid
graph TB
    subgraph "Data Management"
        API[Data API] --> Storage[Storage Layer]
        Storage --> Metadata[Metadata DB]
        Storage --> Files[File System]
        Storage --> Cache[Redis Cache]

        subgraph "Data Operations"
            Upload[Upload Data] --> Validate[Validate]
            Validate --> Store[Store]
            Store --> Index[Index]
            Index --> Catalog[Catalog]
        end
    end
```

**Features:**
- Data upload and validation
- Metadata management
- File system abstraction
- Caching layer
- Data catalog

### 4. Terminal UI (TUI)

```mermaid
graph TB
    subgraph "TUI Architecture"
        UI[UI Components] --> Model[Data Model]
        Model --> Update[Update Loop]
        Update --> Render[Render]

        subgraph "UI Panels"
            Jobs[Job List]
            Details[Job Details]
            Logs[Log Viewer]
            Status[Status Bar]
        end

        UI --> Jobs
        UI --> Details
        UI --> Logs
        UI --> Status
    end
```

**Components:**
- Bubble Tea framework
- Component-based architecture
- Real-time updates
- Keyboard navigation
- Theme support

## Data Flow

### Job Execution Flow

```mermaid
sequenceDiagram
    participant Client
    participant Auth
    participant Worker
    participant Queue
    participant Container
    participant Storage

    Client->>Auth: Submit job with API key
    Auth->>Client: Validate and return job ID
    Client->>Worker: Execute job request
    Worker->>Queue: Queue job
    Queue->>Worker: Job ready
    Worker->>Container: Start ML container
    Container->>Worker: Execute experiment
    Worker->>Storage: Store results
    Worker->>Client: Return results
```

### Authentication Flow

```mermaid
sequenceDiagram
    participant Client
    participant Auth
    participant PermMgr
    participant Config

    Client->>Auth: Request with API key
    Auth->>Auth: Validate key hash
    Auth->>PermMgr: Get user permissions
    PermMgr->>Config: Load YAML permissions
    Config->>PermMgr: Return permissions
    PermMgr->>Auth: Return resolved permissions
    Auth->>Client: Grant/deny access
```

## Security Architecture

### Defense in Depth

```mermaid
graph TB
    subgraph "Security Layers"
        Network[Network Security]
        Auth[Authentication]
        AuthZ[Authorization]
        Container[Container Security]
        Data[Data Protection]
        Audit[Audit Logging]
    end

    Network --> Auth
    Auth --> AuthZ
    AuthZ --> Container
    Container --> Data
    Data --> Audit
```

**Security Features:**
- API key authentication
- Role-based permissions
- Container isolation
- File system sandboxing
- Comprehensive audit logs
- Input validation and sanitization

### Container Security

```mermaid
graph TB
    subgraph "Container Isolation"
        Host[Host System]
        Podman[Podman Runtime]
        Network[Network Isolation]
        FS[File System Isolation]
        User[User Namespaces]
        ML[ML Container]

        Host --> Podman
        Podman --> Network
        Podman --> FS
        Podman --> User
        User --> ML
    end
```

**Isolation Features:**
- Rootless containers
- Network isolation
- File system sandboxing
- User namespace mapping
- Resource limits

## Configuration Architecture

### Configuration Hierarchy

```mermaid
graph TB
    subgraph "Config Sources"
        Env[Environment Variables]
        File[Config Files]
        CLI[CLI Flags]
        Defaults[Default Values]
    end

    subgraph "Config Processing"
        Merge[Config Merger]
        Validate[Schema Validator]
        Apply[Config Applier]
    end

    Env --> Merge
    File --> Merge
    CLI --> Merge
    Defaults --> Merge
    Merge --> Validate
    Validate --> Apply
```

**Configuration Priority:**
1. CLI flags (highest)
2. Environment variables
3. Configuration files
4. Default values (lowest)

## Scalability Architecture

### Horizontal Scaling

```mermaid
graph TB
    subgraph "Scaled Architecture"
        LB[Load Balancer]
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker N]
        Redis[Redis Cluster]
        Storage[Shared Storage]

        LB --> W1
        LB --> W2
        LB --> W3
        W1 --> Redis
        W2 --> Redis
        W3 --> Redis
        W1 --> Storage
        W2 --> Storage
        W3 --> Storage
    end
```

**Scaling Features:**
- Stateless worker services
- Shared job queue (Redis)
- Distributed storage
- Load balancer ready
- Health checks and monitoring

## Technology Stack

### Backend Technologies

| Component | Technology | Purpose |
|-----------|------------|---------|
| **Language** | Go 1.25+ | Core application |
| **Web Framework** | Standard library | HTTP server |
| **Authentication** | Custom | API key + RBAC |
| **Database** | SQLite/PostgreSQL | Metadata storage |
| **Cache** | Redis | Job queue & caching |
| **Containers** | Podman/Docker | Job isolation |
| **UI Framework** | Bubble Tea | Terminal UI |

### Dependencies

```go
// Core dependencies
require (
    github.com/charmbracelet/bubbletea v1.3.10 // TUI framework
    github.com/go-redis/redis/v8 v8.11.5       // Redis client
    github.com/google/uuid v1.6.0              // UUID generation
    github.com/mattn/go-sqlite3 v1.14.32       // SQLite driver
    golang.org/x/crypto v0.45.0                // Crypto utilities
    gopkg.in/yaml.v3 v3.0.1                    // YAML parsing
)
```

## Development Architecture

### Project Structure

```
fetch_ml/
├── cmd/                  # CLI applications
│   ├── worker/           # ML worker service
│   ├── tui/              # Terminal UI
│   ├── data_manager/     # Data management
│   └── user_manager/     # User management
├── internal/             # Internal packages
│   ├── auth/             # Authentication system
│   ├── config/           # Configuration management
│   ├── container/        # Container operations
│   ├── database/         # Database operations
│   ├── logging/          # Logging utilities
│   ├── metrics/          # Metrics collection
│   └── network/          # Network utilities
├── configs/              # Configuration files
├── scripts/              # Setup and utility scripts
├── tests/                # Test suites
└── docs/                 # Documentation
```

### Package Dependencies

```mermaid
graph TB
    subgraph "Application Layer"
        Worker[cmd/worker]
        TUI[cmd/tui]
        DataMgr[cmd/data_manager]
        UserMgr[cmd/user_manager]
    end

    subgraph "Service Layer"
        Auth[internal/auth]
        Config[internal/config]
        Container[internal/container]
        Database[internal/database]
    end

    subgraph "Utility Layer"
        Logging[internal/logging]
        Metrics[internal/metrics]
        Network[internal/network]
    end

    Worker --> Auth
    Worker --> Config
    Worker --> Container
    TUI --> Auth
    DataMgr --> Database
    UserMgr --> Auth
    Auth --> Logging
    Container --> Network
    Database --> Metrics
```

## Monitoring & Observability

### Metrics Collection

```mermaid
graph TB
    subgraph "Metrics Pipeline"
        App[Application] --> Metrics[Metrics Collector]
        Metrics --> Export[Prometheus Exporter]
        Export --> Prometheus[Prometheus Server]
        Prometheus --> Grafana[Grafana Dashboard]

        subgraph "Metric Types"
            Counter[Counters]
            Gauge[Gauges]
            Histogram[Histograms]
            Timer[Timers]
        end

        App --> Counter
        App --> Gauge
        App --> Histogram
        App --> Timer
    end
```

### Logging Architecture

```mermaid
graph TB
    subgraph "Logging Pipeline"
        App[Application] --> Logger[Structured Logger]
        Logger --> File[File Output]
        Logger --> Console[Console Output]
        Logger --> Syslog[Syslog Forwarder]
        Syslog --> Aggregator[Log Aggregator]
        Aggregator --> Storage[Log Storage]
        Storage --> Viewer[Log Viewer]
    end
```

## Deployment Architecture

### Container Deployment

```mermaid
graph TB
    subgraph "Deployment Stack"
        Image[Container Image]
        Registry[Container Registry]
        Orchestrator[Docker Compose]
        Config[ConfigMaps/Secrets]
        Storage[Persistent Storage]

        Image --> Registry
        Registry --> Orchestrator
        Config --> Orchestrator
        Storage --> Orchestrator
    end
```

### Service Discovery

```mermaid
graph TB
    subgraph "Service Mesh"
        Gateway[API Gateway]
        Discovery[Service Discovery]
        Worker[Worker Service]
        Data[Data Service]
        Redis[Redis Cluster]

        Gateway --> Discovery
        Discovery --> Worker
        Discovery --> Data
        Discovery --> Redis
    end
```

## Future Architecture Considerations

### Microservices Evolution

- **API Gateway**: Centralized routing and authentication
- **Service Mesh**: Inter-service communication
- **Event Streaming**: Kafka for job events
- **Distributed Tracing**: OpenTelemetry integration
- **Multi-tenant**: Tenant isolation and quotas

### Homelab Features

- **Docker Compose**: Simple container orchestration
- **Local Development**: Easy setup and testing
- **Security**: Built-in authentication and encryption
- **Monitoring**: Basic health checks and logging

## Roadmap (Research-First, Workstation-First)

fetch_ml is a research-first ML experiment runner with production-grade discipline.

### Guiding principles

- **Reproducibility over speed**: optimizations must never change experimental semantics.
- **Explicit over magic**: every run should be explainable from manifests, configs, and logs.
- **Best-effort optimizations**: prewarming/caching must be optional and must not be required for correctness.
- **Workstation-first**: prioritize single-node reliability, observability, and fast iteration; avoid HPC-specific complexity.

### Where we are now

- **Run provenance**: `run_manifest.json` exists and is readable via `ml info`.
- **Validation**: `ml validate` and `ml validate --task` exist; task validation includes run-manifest lifecycle/provenance checks.
- **Prewarming (best-effort)**:
  - Next-task prewarm loop stages snapshots under `base/.prewarm/snapshots/`.
  - Best-effort dataset prefetch with a TTL cache.
  - Warmed container image infrastructure exists (images keyed by `deps_manifest_sha256`).
  - Prewarm status is surfaced in `ml status --json` under the `prewarm` field.

### Trust and usability (highest priority)

#### 1) Make `ml status` excellent (human output)

- Show a compact summary of:
  - queued/running/completed/failed counts
  - a short list of the most relevant tasks
  - **prewarm state** (worker id, target task id, phase, dataset count, age)
- Preserve `--json` output as a stable API for scripting.

#### 2) Add a dry-run preview command (`ml explain`)

- Print the resolved execution plan before running:
  - commit id, experiment manifest overall sha
  - dependency manifest name + sha
  - snapshot id + expected sha (when applicable)
  - dataset identities + checksums (when applicable)
  - requested resources (cpu/mem/gpu)
  - candidate runtime image (base vs warmed tag)
- Enforce a strict preflight by default:
  - Queue-time blocking (do not enqueue tasks that fail reproducibility requirements).
  - The strict preflight should be shared by `ml queue` and `ml explain`.
- Record the resolved plan into task metadata for traceability:
  - `repro_policy: strict`
  - `trust_level` (simple trust ladder)
  - `plan_sha256` (digest of the resolved execution plan)

#### 3) Tighten run manifest completeness

- For `running`: require `started_at`.
- For `completed/failed`: require `started_at`, `ended_at`, and `exit_code`.
- When snapshots/datasets are used: ensure the manifest records the relevant identifiers and digests.

#### 4) Dataset identity (minimal but research-grade)

- Prefer structured `dataset_specs` (name + checksum) as the authoritative input.
- Treat a missing checksum as an error by default (strict-by-default).
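The manifest-completeness rules above reduce to a small lifecycle check. The following sketch assumes simplified field and status names taken from the surrounding text, not the actual `run_manifest.json` schema:

```go
package main

import (
	"errors"
	"fmt"
)

// RunManifest carries only the lifecycle fields discussed above;
// the real run manifest records much more provenance than this.
type RunManifest struct {
	Status    string // "running", "completed", or "failed"
	StartedAt string // timestamp, empty if unset
	EndedAt   string // timestamp, empty if unset
	ExitCode  *int   // nil if unset
}

// validateLifecycle enforces the completeness rules: running requires
// started_at; completed/failed additionally require ended_at and exit_code.
func validateLifecycle(m RunManifest) error {
	switch m.Status {
	case "running":
		if m.StartedAt == "" {
			return errors.New("running task missing started_at")
		}
	case "completed", "failed":
		if m.StartedAt == "" || m.EndedAt == "" {
			return errors.New("terminal task missing started_at/ended_at")
		}
		if m.ExitCode == nil {
			return errors.New("terminal task missing exit_code")
		}
	default:
		return fmt.Errorf("unknown status %q", m.Status)
	}
	return nil
}

func main() {
	code := 0
	ok := RunManifest{
		Status:    "completed",
		StartedAt: "2025-01-01T10:00:00Z",
		EndedAt:   "2025-01-01T10:05:00Z",
		ExitCode:  &code,
	}
	fmt.Println(validateLifecycle(ok))                             // <nil>
	fmt.Println(validateLifecycle(RunManifest{Status: "running"})) // missing started_at
}
```

Keeping the check pure (manifest in, error out) lets the same function back both `ml validate --task` style checks and queue-time strict preflight without duplicating policy.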
### Simple performance wins (only after trust/usability feels solid)

- Keep prewarming single-level (next task only).
- Improve observability first (status output + metrics), then expand capabilities.

### Research workflows

- `ml compare`: manifest-driven diff of provenance and key parameters.
- `ml reproduce`: submit a new task derived from the recorded manifest inputs.
- `ml export`: package provenance + artifacts for collaborators/reviewers.

### Infrastructure (only if needed)

- Multi-level prewarming, predictive scheduling, tmpfs caching, dashboards.
- Optional scalable storage backend for team deployments:
  - Store run manifests + artifacts in S3-compatible object storage (e.g., MinIO) for durability and multi-worker/Kubernetes setups.
  - Keep workstation-first defaults (local filesystem) for simplicity.
- Optional integrations via plugins/exporters (keep core strict and offline-capable):
  - Server-side exporters that mirror run metadata, metrics, and artifacts to external systems (e.g., MLflow Tracking, Weights & Biases).
  - Prefer lifecycle hooks that consume `run_manifest.json` / artifact manifests over plugins that influence execution semantics.
- Optional Kubernetes deployment path (for teams on scalable infra):
  - Publish versioned container images for the backend (API server; optionally worker) and provide reference manifests (Helm/Kustomize).
  - Keep the CLI as the primary UX; Kubernetes is an execution/deployment backend, not a UI.
- These are optional and should be driven by measured bottlenecks.

---

This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.
## See Also

- **[Scheduler Architecture](scheduler-architecture.md)** - Detailed scheduler design and protocols
- **[Security Guide](security.md)** - Security architecture and best practices
- **[Configuration Reference](configuration-reference.md)** - Configuration options and environment variables
- **[Deployment Guide](deployment.md)** - Production deployment architecture
- **[Performance & Monitoring](performance-monitoring.md)** - Metrics and observability
- **[Research Runner Plan](research-runner-plan.md)** - Roadmap and implementation phases
- **[Native Libraries](native-libraries.md)** - C++ performance optimizations