---
title: "Homelab Architecture"
url: "/architecture/"
weight: 1
---
# Homelab Architecture
Simple, secure architecture for ML experiments in your homelab.
## Components Overview
```mermaid
graph TB
subgraph "Homelab Stack"
CLI[Zig CLI]
API["API Server (HTTPS + WebSocket)"]
REDIS[Redis Cache]
DB[(SQLite/PostgreSQL)]
FS[Local Storage]
WORKER[Worker Service]
PODMAN[Podman/Docker]
end
CLI --> API
API --> REDIS
API --> DB
API --> FS
WORKER --> API
WORKER --> REDIS
WORKER --> FS
WORKER --> PODMAN
```
## Core Services
### API Server
- **Purpose**: Secure HTTPS API for ML experiments
- **Port**: 9101 (HTTPS only)
- **Auth**: API key authentication
- **Security**: Rate limiting, IP whitelisting
### Redis
- **Purpose**: Caching and job queuing
- **Port**: 6379 (localhost only)
- **Storage**: Temporary data only
- **Persistence**: Local volume
### Zig CLI
- **Purpose**: High-performance experiment management
- **Language**: Zig for maximum speed and efficiency
- **Features**:
- Content-addressed storage with deduplication
- SHA256-based commit ID generation
- WebSocket communication for real-time updates
- Rsync-based incremental file transfers
- Multi-threaded operations
- Secure API key authentication
- Auto-sync monitoring with file system watching
- Priority-based job queuing
- Memory-efficient operations with arena allocators
## Security Architecture
```mermaid
graph LR
USER[User] --> AUTH[API Key Auth]
AUTH --> RATE[Rate Limiting]
RATE --> WHITELIST[IP Whitelist]
WHITELIST --> API[Secure API]
API --> AUDIT[Audit Logging]
```
### Security Layers
1. **API Key Authentication** - Hashed keys with roles
2. **Rate Limiting** - 30 requests/minute
3. **IP Whitelisting** - Local networks only
4. **Fail2Ban** - Automatic IP blocking
5. **HTTPS/TLS** - Encrypted communication
6. **Audit Logging** - Complete action tracking
## Data Flow
```mermaid
sequenceDiagram
participant CLI
participant API
participant Redis
participant Storage
CLI->>API: HTTPS + WebSocket request
API->>API: Validate Auth
API->>Redis: Cache/Queue
API->>Storage: Experiment Data
Storage->>API: Results
API->>CLI: Response
```
## Deployment Options
### Docker Compose (Recommended)
```yaml
services:
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
    volumes: [redis_data:/data]
  api-server:
    build: .
    ports: ["9101:9101"]
    depends_on: [redis]
volumes:
  redis_data:
```
### Local Setup
```bash
docker-compose -f deployments/docker-compose.dev.yml up -d
```
## Network Architecture
- **Private Network**: Docker internal network
- **Localhost Access**: Redis only on localhost
- **HTTPS API**: Port 9101, TLS encrypted
- **No External Dependencies**: Everything runs locally
## Storage Architecture
```
data/
├── experiments/ # Experiment definitions, run manifests, and artifacts
├── tracking/ # Tracking tool state (e.g., MLflow/TensorBoard), when enabled
├── .prewarm/ # Best-effort prewarm staging (snapshots/env/datasets), when enabled
├── cache/ # Temporary caches (best-effort)
└── backups/ # Local backups
logs/
├── app.log # Application logs
├── audit.log # Security events
└── access.log # API access logs
```
## Monitoring Architecture
Simple, lightweight monitoring:
- **Health Checks**: Service availability
- **Log Files**: Structured logging
- **Prometheus Metrics**: Worker and API metrics (including prewarm hit/miss/timing)
- **Security Events**: Failed auth, rate limits
## Homelab Benefits
- **Simple Setup**: One-command installation
- **Local Only**: No external dependencies
- **Secure by Default**: HTTPS, auth, rate limiting
- **Low Resource**: Minimal CPU/memory usage
- **Easy Backup**: Local file system
- **Privacy**: Everything stays on your network
## High-Level Architecture
```mermaid
graph TB
subgraph "Client Layer"
CLI[CLI Tools]
TUI[Terminal UI]
API[WebSocket API]
end
subgraph "Authentication Layer"
Auth[Authentication Service]
RBAC[Role-Based Access Control]
Perm[Permission Manager]
end
subgraph "Core Services"
Worker[ML Worker Service]
DataMgr[Data Manager Service]
Queue[Job Queue]
end
subgraph "Storage Layer"
Redis[(Redis Cache)]
DB[(SQLite/PostgreSQL)]
Files[File Storage]
end
subgraph "Container Runtime"
Podman[Podman/Docker]
Containers[ML Containers]
end
CLI --> Auth
TUI --> Auth
API --> Auth
Auth --> RBAC
RBAC --> Perm
Worker --> Queue
Worker --> DataMgr
Worker --> Podman
DataMgr --> DB
DataMgr --> Files
Queue --> Redis
Podman --> Containers
```
## Tracking & Plugin System
fetch_ml includes an optional tracking plugin system that can provision sidecar tools and/or pass through environment variables for common research tracking stacks.
### Tracking modes
Tracking tools support the following modes:
- `sidecar`: provision a local sidecar container per task (best-effort).
- `remote`: point to an externally managed instance (no local provisioning).
- `disabled`: disable the tool entirely.
### How it works
- The worker maintains a tracking registry and provisions tools during task startup.
- Provisioned plugins return environment variables that are injected into the task container.
- Some plugins also require host paths (e.g., TensorBoard log directory); these are mounted into the task container and sanitized to avoid leaking host paths.
### Built-in plugins
The worker ships with built-in plugins:
- `mlflow`: can run an MLflow server as a sidecar or use a remote `MLFLOW_TRACKING_URI`.
- `tensorboard`: runs a TensorBoard sidecar and mounts a per-job log directory.
- `wandb`: does not provision a sidecar; it forwards configuration via environment variables.
### Configuration
Plugins can be configured via worker configuration under `plugins`, including:
- `enabled`
- `image`
- `mode`
- per-plugin paths/settings (e.g., artifact base path, log base path)
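Putting those options together, a worker configuration fragment might look like the following. The `enabled`, `image`, and `mode` keys come from the list above; the exact nesting, path key names, and image references are illustrative assumptions, not the canonical schema.

```yaml
plugins:
  mlflow:
    enabled: true
    mode: sidecar                 # sidecar | remote | disabled
    image: ghcr.io/mlflow/mlflow:latest   # hypothetical image reference
    artifact_base_path: /data/tracking/mlflow
  tensorboard:
    enabled: true
    mode: sidecar
    log_base_path: /data/tracking/tensorboard
  wandb:
    enabled: false
    mode: remote                  # wandb never provisions a sidecar
```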
## Plugin GPU Quota System
The scheduler includes a GPU quota management system for plugin-based services (Jupyter, vLLM, etc.) that controls resource allocation across users and plugins.
### Quota Enforcement
The quota system enforces limits at multiple levels:
1. **Global GPU Limit**: Total GPUs available across all plugins
2. **Per-User GPU Limit**: Maximum GPUs a single user can allocate
3. **Per-User Service Limit**: Maximum number of service instances per user
4. **Plugin-Specific Limits**: Separate limits for each plugin type
5. **User Overrides**: Custom limits for specific users with allowed plugin restrictions
### Architecture
```mermaid
graph TB
subgraph "Plugin Quota System"
Submit[Job Submission] --> CheckQuota{Check Quota}
CheckQuota -->|Within Limits| Accept[Accept Job]
CheckQuota -->|Exceeded| Reject[Reject with Error]
Accept --> RecordUsage[Record Usage]
RecordUsage --> Assign[Assign to Worker]
Complete[Job Complete] --> ReleaseUsage[Release Usage]
subgraph "Quota Manager"
Global[Global GPU Counter]
PerUser[Per-User Tracking]
PerPlugin[Per-Plugin Tracking]
Overrides[User Overrides]
end
CheckQuota --> Global
CheckQuota --> PerUser
CheckQuota --> PerPlugin
CheckQuota --> Overrides
end
```
### Components
- **PluginQuotaConfig**: Configuration for all quota limits and overrides
- **PluginQuotaManager**: Thread-safe manager for tracking and enforcing quotas
- **Integration Points**:
- `SubmitJob()`: Validates quotas before accepting service jobs
- `handleJobAccepted()`: Records usage when jobs are assigned
- `handleJobResult()`: Releases usage when jobs complete
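The acquire/release cycle across those integration points can be sketched as a mutex-guarded manager that validates global and per-user limits atomically. Type and method names here are assumptions for illustration, not the actual `PluginQuotaManager` API.

```go
package main

import (
	"fmt"
	"sync"
)

// QuotaManager tracks GPU usage against a global limit and a per-user limit.
type QuotaManager struct {
	mu          sync.Mutex
	globalLimit int
	perUserMax  int
	usedGlobal  int
	usedByUser  map[string]int
}

func NewQuotaManager(global, perUser int) *QuotaManager {
	return &QuotaManager{globalLimit: global, perUserMax: perUser, usedByUser: make(map[string]int)}
}

// TryAcquire validates the request against both limits and records usage
// atomically if it fits (the SubmitJob/handleJobAccepted side).
func (q *QuotaManager) TryAcquire(user string, gpus int) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.usedGlobal+gpus > q.globalLimit {
		return fmt.Errorf("global GPU limit exceeded (%d/%d in use)", q.usedGlobal, q.globalLimit)
	}
	if q.usedByUser[user]+gpus > q.perUserMax {
		return fmt.Errorf("per-user GPU limit exceeded for %q", user)
	}
	q.usedGlobal += gpus
	q.usedByUser[user] += gpus
	return nil
}

// Release returns GPUs when a job completes (the handleJobResult side).
func (q *QuotaManager) Release(user string, gpus int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.usedGlobal -= gpus
	q.usedByUser[user] -= gpus
}

func main() {
	qm := NewQuotaManager(4, 2)
	fmt.Println(qm.TryAcquire("user123", 2)) // <nil>: within both limits
	fmt.Println(qm.TryAcquire("user123", 1)) // error: per-user limit exceeded
}
```

The real manager additionally tracks per-plugin limits, service-instance counts, and user overrides, but all checks share this acquire-under-lock pattern.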
### Usage
Jobs must include `user_id` and `plugin_name` metadata for quota tracking:
```go
spec := scheduler.JobSpec{
	Type:     scheduler.JobTypeService,
	UserID:   "user123",
	GPUCount: 2,
	Metadata: map[string]string{
		"plugin_name": "jupyter",
	},
}
```
## Zig CLI Architecture
### Component Structure
```mermaid
graph TB
subgraph "Zig CLI Components"
Main[main.zig] --> Commands[commands/]
Commands --> Config[config.zig]
Commands --> Utils[utils/]
Commands --> Net[net/]
Commands --> Errors[errors.zig]
subgraph "Commands"
Init[init.zig]
Sync[sync.zig]
Queue[queue.zig]
Watch[watch.zig]
Status[status.zig]
Monitor[monitor.zig]
Cancel[cancel.zig]
Prune[prune.zig]
end
subgraph "Utils"
Crypto[crypto.zig]
Storage[storage.zig]
Rsync[rsync.zig]
end
subgraph "Network"
WS[ws.zig]
end
end
```
### Performance Optimizations
#### Content-Addressed Storage
- **Deduplication**: Files stored by SHA256 hash
- **Space Efficiency**: Shared files across experiments
- **Fast Lookup**: Hash-based file retrieval
#### Memory Management
- **Arena Allocators**: Efficient bulk allocation
- **Zero-Copy Operations**: Minimized memory copying
- **Automatic Cleanup**: Resource deallocation
#### Network Communication
- **WebSocket Protocol**: Real-time bidirectional communication
- **Connection Pooling**: Reused connections
- **Binary Messaging**: Efficient data transfer
### Security Implementation
```mermaid
graph LR
subgraph "CLI Security"
Config[Config File] --> Hash[SHA256 Hashing]
Hash --> Auth[API Authentication]
Auth --> SSH[SSH Transfer]
SSH --> WS[WebSocket Security]
end
```
## Core Components
### 1. Authentication & Authorization
```mermaid
graph LR
subgraph "Auth Flow"
Client[Client] --> APIKey[API Key]
APIKey --> Hash[Hash Validation]
Hash --> Roles[Role Resolution]
Roles --> Perms[Permission Check]
Perms --> Access[Grant/Deny Access]
end
subgraph "Permission Sources"
YAML[YAML Config]
Inline[Inline Fallback]
Roles --> YAML
Roles --> Inline
end
```
**Features:**
- API key-based authentication
- Role-based access control (RBAC)
- YAML-based permission configuration
- Fallback to inline permissions
- Admin wildcard permissions
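The hash-validation step in the flow above amounts to: never store raw keys, hash the presented key, and compare digests in constant time to avoid timing side channels. A minimal sketch, assuming unsalted SHA256 for brevity; the key format and helper names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// HashKey returns the hex SHA256 digest of an API key. Only this digest
// is stored server-side.
func HashKey(key string) string {
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}

// Validate hashes the presented key and compares it to the stored digest
// in constant time, so comparison duration leaks nothing about the match.
func Validate(presented, storedHash string) bool {
	h := HashKey(presented)
	return subtle.ConstantTimeCompare([]byte(h), []byte(storedHash)) == 1
}

func main() {
	stored := HashKey("fm_live_example") // hypothetical key
	fmt.Println(Validate("fm_live_example", stored)) // true
	fmt.Println(Validate("wrong-key", stored))       // false
}
```

On a match, role resolution and the permission check proceed as shown in the diagram.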
### 2. Worker Service
```mermaid
graph TB
subgraph "Worker Architecture"
API[HTTP API] --> Router[Request Router]
Router --> Auth[Auth Middleware]
Auth --> Queue[Job Queue]
Queue --> Processor[Job Processor]
Processor --> Runtime[Container Runtime]
Runtime --> Storage[Result Storage]
subgraph "Job Lifecycle"
Submit[Submit Job] --> Queue
Queue --> Execute[Execute]
Execute --> Monitor[Monitor]
Monitor --> Complete[Complete]
Complete --> Store[Store Results]
end
end
```
**Responsibilities:**
- HTTP API for job submission
- Job queue management
- Container orchestration
- Result collection and storage
- Metrics and monitoring
### 3. Data Manager Service
```mermaid
graph TB
subgraph "Data Management"
API[Data API] --> Storage[Storage Layer]
Storage --> Metadata[Metadata DB]
Storage --> Files[File System]
Storage --> Cache[Redis Cache]
subgraph "Data Operations"
Upload[Upload Data] --> Validate[Validate]
Validate --> Store[Store]
Store --> Index[Index]
Index --> Catalog[Catalog]
end
end
```
**Features:**
- Data upload and validation
- Metadata management
- File system abstraction
- Caching layer
- Data catalog
### 4. Terminal UI (TUI)
```mermaid
graph TB
subgraph "TUI Architecture"
UI[UI Components] --> Model[Data Model]
Model --> Update[Update Loop]
Update --> Render[Render]
subgraph "UI Panels"
Jobs[Job List]
Details[Job Details]
Logs[Log Viewer]
Status[Status Bar]
end
UI --> Jobs
UI --> Details
UI --> Logs
UI --> Status
end
```
**Components:**
- Bubble Tea framework
- Component-based architecture
- Real-time updates
- Keyboard navigation
- Theme support
## Data Flow
### Job Execution Flow
```mermaid
sequenceDiagram
participant Client
participant Auth
participant Worker
participant Queue
participant Container
participant Storage
Client->>Auth: Submit job with API key
Auth->>Client: Validate and return job ID
Client->>Worker: Execute job request
Worker->>Queue: Queue job
Queue->>Worker: Job ready
Worker->>Container: Start ML container
Container->>Worker: Execute experiment
Worker->>Storage: Store results
Worker->>Client: Return results
```
### Authentication Flow
```mermaid
sequenceDiagram
participant Client
participant Auth
participant PermMgr
participant Config
Client->>Auth: Request with API key
Auth->>Auth: Validate key hash
Auth->>PermMgr: Get user permissions
PermMgr->>Config: Load YAML permissions
Config->>PermMgr: Return permissions
PermMgr->>Auth: Return resolved permissions
Auth->>Client: Grant/deny access
```
## Security Architecture
### Defense in Depth
```mermaid
graph TB
subgraph "Security Layers"
Network[Network Security]
Auth[Authentication]
AuthZ[Authorization]
Container[Container Security]
Data[Data Protection]
Audit[Audit Logging]
end
Network --> Auth
Auth --> AuthZ
AuthZ --> Container
Container --> Data
Data --> Audit
```
**Security Features:**
- API key authentication
- Role-based permissions
- Container isolation
- File system sandboxing
- Comprehensive audit logs
- Input validation and sanitization
### Container Security
```mermaid
graph TB
subgraph "Container Isolation"
Host[Host System]
Podman[Podman Runtime]
Network[Network Isolation]
FS[File System Isolation]
User[User Namespaces]
ML[ML Container]
Host --> Podman
Podman --> Network
Podman --> FS
Podman --> User
User --> ML
end
```
**Isolation Features:**
- Rootless containers
- Network isolation
- File system sandboxing
- User namespace mapping
- Resource limits
## Configuration Architecture
### Configuration Hierarchy
```mermaid
graph TB
subgraph "Config Sources"
Env[Environment Variables]
File[Config Files]
CLI[CLI Flags]
Defaults[Default Values]
end
subgraph "Config Processing"
Merge[Config Merger]
Validate[Schema Validator]
Apply[Config Applier]
end
Env --> Merge
File --> Merge
CLI --> Merge
Defaults --> Merge
Merge --> Validate
Validate --> Apply
```
**Configuration Priority:**
1. CLI flags (highest)
2. Environment variables
3. Configuration files
4. Default values (lowest)
## Scalability Architecture
### Horizontal Scaling
```mermaid
graph TB
subgraph "Scaled Architecture"
LB[Load Balancer]
W1[Worker 1]
W2[Worker 2]
W3[Worker N]
Redis[Redis Cluster]
Storage[Shared Storage]
LB --> W1
LB --> W2
LB --> W3
W1 --> Redis
W2 --> Redis
W3 --> Redis
W1 --> Storage
W2 --> Storage
W3 --> Storage
end
```
**Scaling Features:**
- Stateless worker services
- Shared job queue (Redis)
- Distributed storage
- Load balancer ready
- Health checks and monitoring
## Technology Stack
### Backend Technologies
| Component | Technology | Purpose |
|-----------|------------|---------|
| **Language** | Go 1.25+ | Core application |
| **Web Framework** | Standard library | HTTP server |
| **Authentication** | Custom | API key + RBAC |
| **Database** | SQLite/PostgreSQL | Metadata storage |
| **Cache** | Redis | Job queue & caching |
| **Containers** | Podman/Docker | Job isolation |
| **UI Framework** | Bubble Tea | Terminal UI |
### Dependencies
```go
// Core dependencies
require (
	github.com/charmbracelet/bubbletea v1.3.10 // TUI framework
	github.com/go-redis/redis/v8 v8.11.5       // Redis client
	github.com/google/uuid v1.6.0              // UUID generation
	github.com/mattn/go-sqlite3 v1.14.32       // SQLite driver
	golang.org/x/crypto v0.45.0                // Crypto utilities
	gopkg.in/yaml.v3 v3.0.1                    // YAML parsing
)
```
## Development Architecture
### Project Structure
```
fetch_ml/
├── cmd/ # CLI applications
│ ├── worker/ # ML worker service
│ ├── tui/ # Terminal UI
│ ├── data_manager/ # Data management
│ └── user_manager/ # User management
├── internal/ # Internal packages
│ ├── auth/ # Authentication system
│ ├── config/ # Configuration management
│ ├── container/ # Container operations
│ ├── database/ # Database operations
│ ├── logging/ # Logging utilities
│ ├── metrics/ # Metrics collection
│ └── network/ # Network utilities
├── configs/ # Configuration files
├── scripts/ # Setup and utility scripts
├── tests/ # Test suites
└── docs/ # Documentation
```
### Package Dependencies
```mermaid
graph TB
subgraph "Application Layer"
Worker[cmd/worker]
TUI[cmd/tui]
DataMgr[cmd/data_manager]
UserMgr[cmd/user_manager]
end
subgraph "Service Layer"
Auth[internal/auth]
Config[internal/config]
Container[internal/container]
Database[internal/database]
end
subgraph "Utility Layer"
Logging[internal/logging]
Metrics[internal/metrics]
Network[internal/network]
end
Worker --> Auth
Worker --> Config
Worker --> Container
TUI --> Auth
DataMgr --> Database
UserMgr --> Auth
Auth --> Logging
Container --> Network
Database --> Metrics
```
## Monitoring & Observability
### Metrics Collection
```mermaid
graph TB
subgraph "Metrics Pipeline"
App[Application] --> Metrics[Metrics Collector]
Metrics --> Export[Prometheus Exporter]
Export --> Prometheus[Prometheus Server]
Prometheus --> Grafana[Grafana Dashboard]
subgraph "Metric Types"
Counter[Counters]
Gauge[Gauges]
Histogram[Histograms]
Timer[Timers]
end
App --> Counter
App --> Gauge
App --> Histogram
App --> Timer
end
```
### Logging Architecture
```mermaid
graph TB
subgraph "Logging Pipeline"
App[Application] --> Logger[Structured Logger]
Logger --> File[File Output]
Logger --> Console[Console Output]
Logger --> Syslog[Syslog Forwarder]
Syslog --> Aggregator[Log Aggregator]
Aggregator --> Storage[Log Storage]
Storage --> Viewer[Log Viewer]
end
```
## Deployment Architecture
### Container Deployment
```mermaid
graph TB
subgraph "Deployment Stack"
Image[Container Image]
Registry[Container Registry]
Orchestrator[Docker Compose]
Config[ConfigMaps/Secrets]
Storage[Persistent Storage]
Image --> Registry
Registry --> Orchestrator
Config --> Orchestrator
Storage --> Orchestrator
end
```
### Service Discovery
```mermaid
graph TB
subgraph "Service Mesh"
Gateway[API Gateway]
Discovery[Service Discovery]
Worker[Worker Service]
Data[Data Service]
Redis[Redis Cluster]
Gateway --> Discovery
Discovery --> Worker
Discovery --> Data
Discovery --> Redis
end
```
## Future Architecture Considerations
### Microservices Evolution
- **API Gateway**: Centralized routing and authentication
- **Service Mesh**: Inter-service communication
- **Event Streaming**: Kafka for job events
- **Distributed Tracing**: OpenTelemetry integration
- **Multi-tenant**: Tenant isolation and quotas
### Homelab Features
- **Docker Compose**: Simple container orchestration
- **Local Development**: Easy setup and testing
- **Security**: Built-in authentication and encryption
- **Monitoring**: Basic health checks and logging
## Roadmap (Research-First, Workstation-First)
fetch_ml is a research-first ML experiment runner with production-grade discipline.
### Guiding principles
- **Reproducibility over speed**: optimizations must never change experimental semantics.
- **Explicit over magic**: every run should be explainable from manifests, configs, and logs.
- **Best-effort optimizations**: prewarming/caching must be optional and must not be required for correctness.
- **Workstation-first**: prioritize single-node reliability, observability, and fast iteration; avoid HPC-specific complexity.
### Where we are now
- **Run provenance**: `run_manifest.json` exists and is readable via `ml info <path|id>`.
- **Validation**: `ml validate <commit_id>` and `ml validate --task <task_id>` exist; task validation includes run-manifest lifecycle/provenance checks.
- **Prewarming (best-effort)**:
- Next-task prewarm loop stages snapshots under `base/.prewarm/snapshots/<task_id>`.
- Best-effort dataset prefetch with a TTL cache.
- Warmed container image infrastructure exists (images keyed by `deps_manifest_sha256`).
- Prewarm status is surfaced in `ml status --json` under the `prewarm` field.
### Trust and usability (highest priority)
#### 1) Make `ml status` excellent (human output)
- Show a compact summary of:
- queued/running/completed/failed counts
- a short list of most relevant tasks
- **prewarm state** (worker id, target task id, phase, dataset count, age)
- Preserve `--json` output as stable API for scripting.
#### 2) Add a dry-run preview command (`ml explain`)
- Print the resolved execution plan before running:
- commit id, experiment manifest overall sha
- dependency manifest name + sha
- snapshot id + expected sha (when applicable)
- dataset identities + checksums (when applicable)
- requested resources (cpu/mem/gpu)
- candidate runtime image (base vs warmed tag)
- Enforce a strict preflight by default:
- Queue-time blocking (do not enqueue tasks that fail reproducibility requirements).
- The strict preflight should be shared by `ml queue` and `ml explain`.
- Record the resolved plan into task metadata for traceability:
- `repro_policy: strict`
- `trust_level: <L0..L4>` (simple trust ladder)
- `plan_sha256: <sha256>` (digest of the resolved execution plan)
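One way to make `plan_sha256` reproducible is to digest a deterministic serialization of the resolved plan. A sketch under assumed field names (the real plan fields are whatever `ml explain` resolves); it relies on Go's `encoding/json` emitting struct fields in declaration order, so identical plans always yield identical digests.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Plan mirrors the inputs `ml explain` resolves; field names here are
// assumptions for illustration.
type Plan struct {
	CommitID   string   `json:"commit_id"`
	DepsSHA    string   `json:"deps_manifest_sha256"`
	SnapshotID string   `json:"snapshot_id,omitempty"`
	Datasets   []string `json:"datasets,omitempty"`
	Image      string   `json:"image"`
}

// PlanSHA256 digests a deterministic JSON encoding of the plan.
func PlanSHA256(p Plan) string {
	b, _ := json.Marshal(p) // struct fields marshal in declaration order
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	p := Plan{CommitID: "abc123", DepsSHA: "deadbeef", Image: "base:latest"}
	fmt.Println(PlanSHA256(p) == PlanSHA256(p)) // true: same plan, same digest
}
```

Any change to an input (a different dataset checksum, a different image tag) produces a different digest, which is what makes the recorded plan useful for traceability.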
#### 3) Tighten run manifest completeness
- For `running`: require `started_at`.
- For `completed/failed`: require `started_at`, `ended_at`, and `exit_code`.
- When snapshots/datasets are used: ensure manifest records the relevant identifiers and digests.
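The lifecycle rules above translate directly into a per-state completeness check. A simplified sketch over assumed field names, not the actual `run_manifest.json` schema or validator.

```go
package main

import (
	"errors"
	"fmt"
)

// Manifest is a simplified stand-in for the lifecycle fields of
// run_manifest.json.
type Manifest struct {
	State     string
	StartedAt string
	EndedAt   string
	ExitCode  *int // pointer so "absent" is distinguishable from 0
}

// ValidateLifecycle enforces the completeness rules per state.
func ValidateLifecycle(m Manifest) error {
	switch m.State {
	case "running":
		if m.StartedAt == "" {
			return errors.New("running: started_at required")
		}
	case "completed", "failed":
		if m.StartedAt == "" || m.EndedAt == "" || m.ExitCode == nil {
			return errors.New(m.State + ": started_at, ended_at, and exit_code required")
		}
	}
	return nil
}

func main() {
	code := 0
	ok := Manifest{State: "completed", StartedAt: "t0", EndedAt: "t1", ExitCode: &code}
	fmt.Println(ValidateLifecycle(ok)) // <nil>: terminal state is fully recorded
	fmt.Println(ValidateLifecycle(Manifest{State: "running"})) // error: missing started_at
}
```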
#### 4) Dataset identity (minimal but research-grade)
- Prefer structured `dataset_specs` (name + checksum) as the authoritative input.
- Treat missing checksum as an error by default (strict-by-default).
### Simple performance wins (only after trust/usability feels solid)
- Keep prewarming single-level (next task only).
- Improve observability first (status output + metrics), then expand capabilities.
### Research workflows
- `ml compare <runA> <runB>`: manifest-driven diff of provenance and key parameters.
- `ml reproduce <run-id>`: submit a new task derived from the recorded manifest inputs.
- `ml export <run-id>`: package provenance + artifacts for collaborators/reviewers.
### Infrastructure (only if needed)
- Multi-level prewarming, predictive scheduling, tmpfs caching, dashboards.
- Optional scalable storage backend for team deployments:
- Store run manifests + artifacts in S3-compatible object storage (e.g., MinIO) for durability and multi-worker/Kubernetes setups.
- Keep workstation-first defaults (local filesystem) for simplicity.
- Optional integrations via plugins/exporters (keep core strict and offline-capable):
- Server-side exporters that mirror run metadata, metrics, and artifacts to external systems (e.g., MLflow Tracking, Weights & Biases).
- Prefer lifecycle hooks that consume `run_manifest.json` / artifact manifests over plugins that influence execution semantics.
- Optional Kubernetes deployment path (for teams on scalable infra):
- Publish versioned container images for the backend (API server; optionally worker) and provide reference manifests (Helm/Kustomize).
- Keep the CLI as the primary UX; Kubernetes is an execution/deployment backend, not a UI.
- These are optional and should be driven by measured bottlenecks.
---
This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.
## See Also
- **[Scheduler Architecture](scheduler-architecture.md)** - Detailed scheduler design and protocols
- **[Security Guide](security.md)** - Security architecture and best practices
- **[Configuration Reference](configuration-reference.md)** - Configuration options and environment variables
- **[Deployment Guide](deployment.md)** - Production deployment architecture
- **[Performance & Monitoring](performance-monitoring.md)** - Metrics and observability
- **[Research Runner Plan](research-runner-plan.md)** - Roadmap and implementation phases
- **[Native Libraries](native-libraries.md)** - C++ performance optimizations