---
title: "Homelab Architecture"
url: "/architecture/"
weight: 1
---

# Homelab Architecture

Simple, secure architecture for ML experiments in your homelab.
## Components Overview

```mermaid
graph TB
    subgraph "Homelab Stack"
        CLI[Zig CLI]
        API["API Server (HTTPS + WebSocket)"]
        REDIS[Redis Cache]
        DB[(SQLite/PostgreSQL)]
        FS[Local Storage]
        WORKER[Worker Service]
        PODMAN[Podman/Docker]
    end

    CLI --> API
    API --> REDIS
    API --> DB
    API --> FS
    WORKER --> API
    WORKER --> REDIS
    WORKER --> FS
    WORKER --> PODMAN
```
## Core Services

### API Server
- **Purpose**: Secure HTTPS API for ML experiments
- **Port**: 9101 (HTTPS only)
- **Auth**: API key authentication
- **Security**: Rate limiting, IP whitelisting

### Redis
- **Purpose**: Caching and job queuing
- **Port**: 6379 (localhost only)
- **Storage**: Temporary data only
- **Persistence**: Local volume

### Zig CLI
- **Purpose**: High-performance experiment management
- **Language**: Zig for maximum speed and efficiency
- **Features**:
  - Content-addressed storage with deduplication
  - SHA256-based commit ID generation
  - WebSocket communication for real-time updates
  - Rsync-based incremental file transfers
  - Multi-threaded operations
  - Secure API key authentication
  - Auto-sync monitoring with file system watching
  - Priority-based job queuing
  - Memory-efficient operations with arena allocators

## Security Architecture

```mermaid
graph LR
    USER[User] --> AUTH[API Key Auth]
    AUTH --> RATE[Rate Limiting]
    RATE --> WHITELIST[IP Whitelist]
    WHITELIST --> API[Secure API]
    API --> AUDIT[Audit Logging]
```

### Security Layers
1. **API Key Authentication** - Hashed keys with roles
2. **Rate Limiting** - 30 requests/minute
3. **IP Whitelisting** - Local networks only
4. **Fail2Ban** - Automatic IP blocking
5. **HTTPS/TLS** - Encrypted communication
6. **Audit Logging** - Complete action tracking

## Data Flow

```mermaid
sequenceDiagram
    participant CLI
    participant API
    participant Redis
    participant Storage

    CLI->>API: HTTPS + WebSocket request
    API->>API: Validate Auth
    API->>Redis: Cache/Queue
    API->>Storage: Experiment Data
    Storage->>API: Results
    API->>CLI: Response
```
## Deployment Options

### Docker Compose (Recommended)
```yaml
services:
  redis:
    image: redis:7-alpine
    ports: ["127.0.0.1:6379:6379"]  # bind to localhost only
    volumes: [redis_data:/data]

  api-server:
    build: .
    ports: ["9101:9101"]
    depends_on: [redis]
```

### Local Setup
```bash
docker-compose -f deployments/docker-compose.dev.yml up -d
```
## Network Architecture

- **Private Network**: Docker internal network
- **Localhost Access**: Redis only on localhost
- **HTTPS API**: Port 9101, TLS encrypted
- **No External Dependencies**: Everything runs locally
## Storage Architecture

```
data/
├── experiments/   # Experiment definitions, run manifests, and artifacts
├── tracking/      # Tracking tool state (e.g., MLflow/TensorBoard), when enabled
├── .prewarm/      # Best-effort prewarm staging (snapshots/env/datasets), when enabled
├── cache/         # Temporary caches (best-effort)
└── backups/       # Local backups

logs/
├── app.log        # Application logs
├── audit.log      # Security events
└── access.log     # API access logs
```
## Monitoring Architecture

Simple, lightweight monitoring:
- **Health Checks**: Service availability
- **Log Files**: Structured logging
- **Prometheus Metrics**: Worker and API metrics (including prewarm hit/miss/timing)
- **Security Events**: Failed auth, rate limits
## Homelab Benefits

- ✅ **Simple Setup**: One-command installation
- ✅ **Local Only**: No external dependencies
- ✅ **Secure by Default**: HTTPS, auth, rate limiting
- ✅ **Low Resource**: Minimal CPU/memory usage
- ✅ **Easy Backup**: Local file system
- ✅ **Privacy**: Everything stays on your network
## High-Level Architecture

```mermaid
graph TB
    subgraph "Client Layer"
        CLI[CLI Tools]
        TUI[Terminal UI]
        API[WebSocket API]
    end

    subgraph "Authentication Layer"
        Auth[Authentication Service]
        RBAC[Role-Based Access Control]
        Perm[Permission Manager]
    end

    subgraph "Core Services"
        Worker[ML Worker Service]
        DataMgr[Data Manager Service]
        Queue[Job Queue]
    end

    subgraph "Storage Layer"
        Redis[(Redis Cache)]
        DB[(SQLite/PostgreSQL)]
        Files[File Storage]
    end

    subgraph "Container Runtime"
        Podman[Podman/Docker]
        Containers[ML Containers]
    end

    CLI --> Auth
    TUI --> Auth
    API --> Auth

    Auth --> RBAC
    RBAC --> Perm

    Worker --> Queue
    Worker --> DataMgr
    Worker --> Podman

    DataMgr --> DB
    DataMgr --> Files

    Queue --> Redis

    Podman --> Containers
```
## Tracking & Plugin System

fetch_ml includes an optional tracking plugin system that can provision sidecar tools and/or pass through environment variables for common research tracking stacks.

### Tracking modes

Tracking tools support the following modes:

- `sidecar`: provision a local sidecar container per task (best-effort).
- `remote`: point to an externally managed instance (no local provisioning).
- `disabled`: disable the tool entirely.

### How it works

- The worker maintains a tracking registry and provisions tools during task startup.
- Provisioned plugins return environment variables that are injected into the task container.
- Some plugins also require host paths (e.g., the TensorBoard log directory); these are mounted into the task container and sanitized to avoid leaking host paths.

### Built-in plugins

The worker ships with built-in plugins:

- `mlflow`: can run an MLflow server as a sidecar or use a remote `MLFLOW_TRACKING_URI`.
- `tensorboard`: runs a TensorBoard sidecar and mounts a per-job log directory.
- `wandb`: does not provision a sidecar; it forwards configuration via environment variables.

### Configuration

Plugins can be configured via worker configuration under `plugins`, including:

- `enabled`
- `image`
- `mode`
- per-plugin paths/settings (e.g., artifact base path, log base path)

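As a sketch, a worker configuration fragment for these plugins might look like the following. Only `enabled`, `image`, and `mode` come from the list above; the image references and any other keys are illustrative, not confirmed schema:

```yaml
plugins:
  mlflow:
    enabled: true
    mode: sidecar                         # sidecar | remote | disabled
    image: ghcr.io/mlflow/mlflow:latest   # illustrative image reference
  tensorboard:
    enabled: true
    mode: sidecar
  wandb:
    enabled: false                        # env passthrough only; no sidecar
```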
## Plugin GPU Quota System

The scheduler includes a GPU quota management system for plugin-based services (Jupyter, vLLM, etc.) that controls resource allocation across users and plugins.

### Quota Enforcement

The quota system enforces limits at multiple levels:

1. **Global GPU Limit**: Total GPUs available across all plugins
2. **Per-User GPU Limit**: Maximum GPUs a single user can allocate
3. **Per-User Service Limit**: Maximum number of service instances per user
4. **Plugin-Specific Limits**: Separate limits for each plugin type
5. **User Overrides**: Custom limits for specific users with allowed plugin restrictions

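The multi-level check can be sketched in Go. The type and field names here are illustrative, not the actual `PluginQuotaConfig`/`PluginQuotaManager` definitions: a reservation is admitted only if every level allows it, with all checks made under a single lock.

```go
package main

import (
	"fmt"
	"sync"
)

// QuotaConfig mirrors the limits described above (sketch only).
// Plugins absent from PerPluginGPU get zero quota in this sketch.
type QuotaConfig struct {
	GlobalGPUs   int
	PerUserGPUs  int
	PerUserSvcs  int
	PerPluginGPU map[string]int
}

type usage struct{ gpus, services int }

// QuotaManager tracks allocations; checks and updates share one mutex.
type QuotaManager struct {
	mu     sync.Mutex
	cfg    QuotaConfig
	global int
	byUser map[string]*usage
	byPlug map[string]int
}

func NewQuotaManager(cfg QuotaConfig) *QuotaManager {
	return &QuotaManager{cfg: cfg, byUser: map[string]*usage{}, byPlug: map[string]int{}}
}

// Reserve admits a service job only if every quota level allows it.
func (m *QuotaManager) Reserve(user, plugin string, gpus int) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	u := m.byUser[user]
	if u == nil {
		u = &usage{}
		m.byUser[user] = u
	}
	switch {
	case m.global+gpus > m.cfg.GlobalGPUs:
		return fmt.Errorf("global GPU limit exceeded")
	case u.gpus+gpus > m.cfg.PerUserGPUs:
		return fmt.Errorf("per-user GPU limit exceeded")
	case u.services+1 > m.cfg.PerUserSvcs:
		return fmt.Errorf("per-user service limit exceeded")
	case m.byPlug[plugin]+gpus > m.cfg.PerPluginGPU[plugin]:
		return fmt.Errorf("plugin %q GPU limit exceeded", plugin)
	}
	m.global += gpus
	u.gpus += gpus
	u.services++
	m.byPlug[plugin] += gpus
	return nil
}

func main() {
	m := NewQuotaManager(QuotaConfig{
		GlobalGPUs: 8, PerUserGPUs: 4, PerUserSvcs: 2,
		PerPluginGPU: map[string]int{"jupyter": 4, "vllm": 4},
	})
	fmt.Println(m.Reserve("alice", "jupyter", 2)) // <nil>: within all limits
	fmt.Println(m.Reserve("alice", "jupyter", 4)) // per-user GPU limit exceeded
}
```

A release path (job completion) would decrement the same counters under the same lock.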
### Architecture

```mermaid
graph TB
    subgraph "Plugin Quota System"
        Submit[Job Submission] --> CheckQuota{Check Quota}
        CheckQuota -->|Within Limits| Accept[Accept Job]
        CheckQuota -->|Exceeded| Reject[Reject with Error]

        Accept --> RecordUsage[Record Usage]
        RecordUsage --> Assign[Assign to Worker]

        Complete[Job Complete] --> ReleaseUsage[Release Usage]

        subgraph "Quota Manager"
            Global[Global GPU Counter]
            PerUser[Per-User Tracking]
            PerPlugin[Per-Plugin Tracking]
            Overrides[User Overrides]
        end

        CheckQuota --> Global
        CheckQuota --> PerUser
        CheckQuota --> PerPlugin
        CheckQuota --> Overrides
    end
```
### Components

- **PluginQuotaConfig**: Configuration for all quota limits and overrides
- **PluginQuotaManager**: Thread-safe manager for tracking and enforcing quotas
- **Integration Points**:
  - `SubmitJob()`: Validates quotas before accepting service jobs
  - `handleJobAccepted()`: Records usage when jobs are assigned
  - `handleJobResult()`: Releases usage when jobs complete

### Usage

Jobs must include `user_id` and `plugin_name` metadata for quota tracking:

```go
spec := scheduler.JobSpec{
    Type:     scheduler.JobTypeService,
    UserID:   "user123",
    GPUCount: 2,
    Metadata: map[string]string{
        "plugin_name": "jupyter",
    },
}
```
## Zig CLI Architecture

### Component Structure

```mermaid
graph TB
    subgraph "Zig CLI Components"
        Main[main.zig] --> Commands[commands/]
        Commands --> Config[config.zig]
        Commands --> Utils[utils/]
        Commands --> Net[net/]
        Commands --> Errors[errors.zig]

        subgraph "Commands"
            Init[init.zig]
            Sync[sync.zig]
            Queue[queue.zig]
            Watch[watch.zig]
            Status[status.zig]
            Monitor[monitor.zig]
            Cancel[cancel.zig]
            Prune[prune.zig]
        end

        subgraph "Utils"
            Crypto[crypto.zig]
            Storage[storage.zig]
            Rsync[rsync.zig]
        end

        subgraph "Network"
            WS[ws.zig]
        end
    end
```
### Performance Optimizations

#### Content-Addressed Storage
- **Deduplication**: Files stored by SHA256 hash
- **Space Efficiency**: Shared files across experiments
- **Fast Lookup**: Hash-based file retrieval

#### Memory Management
- **Arena Allocators**: Efficient bulk allocation
- **Zero-Copy Operations**: Minimized memory copying
- **Automatic Cleanup**: Resource deallocation

#### Network Communication
- **WebSocket Protocol**: Real-time bidirectional communication
- **Connection Pooling**: Reused connections
- **Binary Messaging**: Efficient data transfer
### Security Implementation

```mermaid
graph LR
    subgraph "CLI Security"
        Config[Config File] --> Hash[SHA256 Hashing]
        Hash --> Auth[API Authentication]
        Auth --> SSH[SSH Transfer]
        SSH --> WS[WebSocket Security]
    end
```
## Core Components

### 1. Authentication & Authorization

```mermaid
graph LR
    subgraph "Auth Flow"
        Client[Client] --> APIKey[API Key]
        APIKey --> Hash[Hash Validation]
        Hash --> Roles[Role Resolution]
        Roles --> Perms[Permission Check]
        Perms --> Access[Grant/Deny Access]
    end

    subgraph "Permission Sources"
        YAML[YAML Config]
        Inline[Inline Fallback]
        Roles --> YAML
        Roles --> Inline
    end
```

**Features:**
- API key-based authentication
- Role-based access control (RBAC)
- YAML-based permission configuration
- Fallback to inline permissions
- Admin wildcard permissions

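Hashed-key validation can be sketched in Go (a sketch only; the service's actual digest choice and key format may differ): the server stores only the digest, never the raw key, and compares digests in constant time.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// HashKey returns the hex SHA-256 digest of an API key. Only digests
// are stored server-side; a real deployment might add a salt or use a
// keyed scheme instead.
func HashKey(key string) string {
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}

// Verify compares in constant time to avoid leaking prefix matches.
func Verify(presented, storedHash string) bool {
	h := HashKey(presented)
	return subtle.ConstantTimeCompare([]byte(h), []byte(storedHash)) == 1
}

func main() {
	stored := HashKey("example-api-key") // key value is illustrative
	fmt.Println(Verify("example-api-key", stored)) // prints true
	fmt.Println(Verify("wrong-key", stored))       // prints false
}
```

After the digest matches, role resolution and permission checks proceed as in the diagram above.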
### 2. Worker Service

```mermaid
graph TB
    subgraph "Worker Architecture"
        API[HTTP API] --> Router[Request Router]
        Router --> Auth[Auth Middleware]
        Auth --> Queue[Job Queue]
        Queue --> Processor[Job Processor]
        Processor --> Runtime[Container Runtime]
        Runtime --> Storage[Result Storage]

        subgraph "Job Lifecycle"
            Submit[Submit Job] --> Queue
            Queue --> Execute[Execute]
            Execute --> Monitor[Monitor]
            Monitor --> Complete[Complete]
            Complete --> Store[Store Results]
        end
    end
```

**Responsibilities:**
- HTTP API for job submission
- Job queue management
- Container orchestration
- Result collection and storage
- Metrics and monitoring
### 3. Data Manager Service

```mermaid
graph TB
    subgraph "Data Management"
        API[Data API] --> Storage[Storage Layer]
        Storage --> Metadata[Metadata DB]
        Storage --> Files[File System]
        Storage --> Cache[Redis Cache]

        subgraph "Data Operations"
            Upload[Upload Data] --> Validate[Validate]
            Validate --> Store[Store]
            Store --> Index[Index]
            Index --> Catalog[Catalog]
        end
    end
```

**Features:**
- Data upload and validation
- Metadata management
- File system abstraction
- Caching layer
- Data catalog
### 4. Terminal UI (TUI)

```mermaid
graph TB
    subgraph "TUI Architecture"
        UI[UI Components] --> Model[Data Model]
        Model --> Update[Update Loop]
        Update --> Render[Render]

        subgraph "UI Panels"
            Jobs[Job List]
            Details[Job Details]
            Logs[Log Viewer]
            Status[Status Bar]
        end

        UI --> Jobs
        UI --> Details
        UI --> Logs
        UI --> Status
    end
```

**Components:**
- Bubble Tea framework
- Component-based architecture
- Real-time updates
- Keyboard navigation
- Theme support
## Data Flow

### Job Execution Flow

```mermaid
sequenceDiagram
    participant Client
    participant Auth
    participant Worker
    participant Queue
    participant Container
    participant Storage

    Client->>Auth: Submit job with API key
    Auth->>Client: Validate and return job ID

    Client->>Worker: Execute job request
    Worker->>Queue: Queue job
    Queue->>Worker: Job ready
    Worker->>Container: Start ML container
    Container->>Worker: Execute experiment
    Worker->>Storage: Store results
    Worker->>Client: Return results
```

### Authentication Flow

```mermaid
sequenceDiagram
    participant Client
    participant Auth
    participant PermMgr
    participant Config

    Client->>Auth: Request with API key
    Auth->>Auth: Validate key hash
    Auth->>PermMgr: Get user permissions
    PermMgr->>Config: Load YAML permissions
    Config->>PermMgr: Return permissions
    PermMgr->>Auth: Return resolved permissions
    Auth->>Client: Grant/deny access
```
## Security Architecture

### Defense in Depth

```mermaid
graph TB
    subgraph "Security Layers"
        Network[Network Security]
        Auth[Authentication]
        AuthZ[Authorization]
        Container[Container Security]
        Data[Data Protection]
        Audit[Audit Logging]
    end

    Network --> Auth
    Auth --> AuthZ
    AuthZ --> Container
    Container --> Data
    Data --> Audit
```

**Security Features:**
- API key authentication
- Role-based permissions
- Container isolation
- File system sandboxing
- Comprehensive audit logs
- Input validation and sanitization

### Container Security

```mermaid
graph TB
    subgraph "Container Isolation"
        Host[Host System]
        Podman[Podman Runtime]
        Network[Network Isolation]
        FS[File System Isolation]
        User[User Namespaces]
        ML[ML Container]

        Host --> Podman
        Podman --> Network
        Podman --> FS
        Podman --> User
        User --> ML
    end
```

**Isolation Features:**
- Rootless containers
- Network isolation
- File system sandboxing
- User namespace mapping
- Resource limits
## Configuration Architecture

### Configuration Hierarchy

```mermaid
graph TB
    subgraph "Config Sources"
        Env[Environment Variables]
        File[Config Files]
        CLI[CLI Flags]
        Defaults[Default Values]
    end

    subgraph "Config Processing"
        Merge[Config Merger]
        Validate[Schema Validator]
        Apply[Config Applier]
    end

    Env --> Merge
    File --> Merge
    CLI --> Merge
    Defaults --> Merge

    Merge --> Validate
    Validate --> Apply
```

**Configuration Priority:**
1. CLI flags (highest)
2. Environment variables
3. Configuration files
4. Default values (lowest)

## Scalability Architecture

### Horizontal Scaling

```mermaid
graph TB
    subgraph "Scaled Architecture"
        LB[Load Balancer]
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker N]
        Redis[Redis Cluster]
        Storage[Shared Storage]

        LB --> W1
        LB --> W2
        LB --> W3

        W1 --> Redis
        W2 --> Redis
        W3 --> Redis

        W1 --> Storage
        W2 --> Storage
        W3 --> Storage
    end
```

**Scaling Features:**
- Stateless worker services
- Shared job queue (Redis)
- Distributed storage
- Load balancer ready
- Health checks and monitoring
## Technology Stack

### Backend Technologies

| Component | Technology | Purpose |
|-----------|------------|---------|
| **Language** | Go 1.25+ | Core application |
| **Web Framework** | Standard library | HTTP server |
| **Authentication** | Custom | API key + RBAC |
| **Database** | SQLite/PostgreSQL | Metadata storage |
| **Cache** | Redis | Job queue & caching |
| **Containers** | Podman/Docker | Job isolation |
| **UI Framework** | Bubble Tea | Terminal UI |

### Dependencies

```go
// Core dependencies
require (
    github.com/charmbracelet/bubbletea v1.3.10 // TUI framework
    github.com/go-redis/redis/v8 v8.11.5       // Redis client
    github.com/google/uuid v1.6.0              // UUID generation
    github.com/mattn/go-sqlite3 v1.14.32       // SQLite driver
    golang.org/x/crypto v0.45.0                // Crypto utilities
    gopkg.in/yaml.v3 v3.0.1                    // YAML parsing
)
```
## Development Architecture

### Project Structure

```
fetch_ml/
├── cmd/                  # CLI applications
│   ├── worker/           # ML worker service
│   ├── tui/              # Terminal UI
│   ├── data_manager/     # Data management
│   └── user_manager/     # User management
├── internal/             # Internal packages
│   ├── auth/             # Authentication system
│   ├── config/           # Configuration management
│   ├── container/        # Container operations
│   ├── database/         # Database operations
│   ├── logging/          # Logging utilities
│   ├── metrics/          # Metrics collection
│   └── network/          # Network utilities
├── configs/              # Configuration files
├── scripts/              # Setup and utility scripts
├── tests/                # Test suites
└── docs/                 # Documentation
```

### Package Dependencies

```mermaid
graph TB
    subgraph "Application Layer"
        Worker[cmd/worker]
        TUI[cmd/tui]
        DataMgr[cmd/data_manager]
        UserMgr[cmd/user_manager]
    end

    subgraph "Service Layer"
        Auth[internal/auth]
        Config[internal/config]
        Container[internal/container]
        Database[internal/database]
    end

    subgraph "Utility Layer"
        Logging[internal/logging]
        Metrics[internal/metrics]
        Network[internal/network]
    end

    Worker --> Auth
    Worker --> Config
    Worker --> Container
    TUI --> Auth
    DataMgr --> Database
    UserMgr --> Auth

    Auth --> Logging
    Container --> Network
    Database --> Metrics
```
## Monitoring & Observability

### Metrics Collection

```mermaid
graph TB
    subgraph "Metrics Pipeline"
        App[Application] --> Metrics[Metrics Collector]
        Metrics --> Export[Prometheus Exporter]
        Export --> Prometheus[Prometheus Server]
        Prometheus --> Grafana[Grafana Dashboard]

        subgraph "Metric Types"
            Counter[Counters]
            Gauge[Gauges]
            Histogram[Histograms]
            Timer[Timers]
        end

        App --> Counter
        App --> Gauge
        App --> Histogram
        App --> Timer
    end
```

### Logging Architecture

```mermaid
graph TB
    subgraph "Logging Pipeline"
        App[Application] --> Logger[Structured Logger]
        Logger --> File[File Output]
        Logger --> Console[Console Output]
        Logger --> Syslog[Syslog Forwarder]
        Syslog --> Aggregator[Log Aggregator]
        Aggregator --> Storage[Log Storage]
        Storage --> Viewer[Log Viewer]
    end
```

## Deployment Architecture

### Container Deployment

```mermaid
graph TB
    subgraph "Deployment Stack"
        Image[Container Image]
        Registry[Container Registry]
        Orchestrator[Docker Compose]
        Config[ConfigMaps/Secrets]
        Storage[Persistent Storage]

        Image --> Registry
        Registry --> Orchestrator
        Config --> Orchestrator
        Storage --> Orchestrator
    end
```

### Service Discovery

```mermaid
graph TB
    subgraph "Service Mesh"
        Gateway[API Gateway]
        Discovery[Service Discovery]
        Worker[Worker Service]
        Data[Data Service]
        Redis[Redis Cluster]

        Gateway --> Discovery
        Discovery --> Worker
        Discovery --> Data
        Discovery --> Redis
    end
```
## Future Architecture Considerations

### Microservices Evolution

- **API Gateway**: Centralized routing and authentication
- **Service Mesh**: Inter-service communication
- **Event Streaming**: Kafka for job events
- **Distributed Tracing**: OpenTelemetry integration
- **Multi-tenant**: Tenant isolation and quotas

### Homelab Features

- **Docker Compose**: Simple container orchestration
- **Local Development**: Easy setup and testing
- **Security**: Built-in authentication and encryption
- **Monitoring**: Basic health checks and logging
## Roadmap (Research-First, Workstation-First)

fetch_ml is a research-first ML experiment runner with production-grade discipline.

### Guiding principles

- **Reproducibility over speed**: optimizations must never change experimental semantics.
- **Explicit over magic**: every run should be explainable from manifests, configs, and logs.
- **Best-effort optimizations**: prewarming/caching must be optional and must not be required for correctness.
- **Workstation-first**: prioritize single-node reliability, observability, and fast iteration; avoid HPC-specific complexity.

### Where we are now

- **Run provenance**: `run_manifest.json` exists and is readable via `ml info <path|id>`.
- **Validation**: `ml validate <commit_id>` and `ml validate --task <task_id>` exist; task validation includes run-manifest lifecycle/provenance checks.
- **Prewarming (best-effort)**:
  - Next-task prewarm loop stages snapshots under `base/.prewarm/snapshots/<task_id>`.
  - Best-effort dataset prefetch with a TTL cache.
  - Warmed container image infrastructure exists (images keyed by `deps_manifest_sha256`).
  - Prewarm status is surfaced in `ml status --json` under the `prewarm` field.
### Trust and usability (highest priority)

#### 1) Make `ml status` excellent (human output)

- Show a compact summary of:
  - queued/running/completed/failed counts
  - a short list of the most relevant tasks
  - **prewarm state** (worker id, target task id, phase, dataset count, age)
- Preserve `--json` output as a stable API for scripting.

#### 2) Add a dry-run preview command (`ml explain`)

- Print the resolved execution plan before running:
  - commit id, experiment manifest overall sha
  - dependency manifest name + sha
  - snapshot id + expected sha (when applicable)
  - dataset identities + checksums (when applicable)
  - requested resources (cpu/mem/gpu)
  - candidate runtime image (base vs warmed tag)
- Enforce a strict preflight by default:
  - Queue-time blocking (do not enqueue tasks that fail reproducibility requirements).
  - The strict preflight should be shared by `ml queue` and `ml explain`.
- Record the resolved plan into task metadata for traceability:
  - `repro_policy: strict`
  - `trust_level: <L0..L4>` (simple trust ladder)
  - `plan_sha256: <sha256>` (digest of the resolved execution plan)

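One way to compute a `plan_sha256` is to hash a canonical encoding of the resolved plan. This Go sketch (field names are illustrative, not the real plan schema) relies on Go's deterministic JSON key ordering for a fixed struct, so identical plans always yield identical digests; a real implementation would pin the canonical encoding explicitly.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Plan is an illustrative subset of the resolved execution plan.
type Plan struct {
	CommitID   string `json:"commit_id"`
	DepsSHA    string `json:"deps_manifest_sha256"`
	SnapshotID string `json:"snapshot_id,omitempty"`
	Image      string `json:"image"`
	GPUCount   int    `json:"gpu_count"`
}

// PlanSHA256 digests the JSON encoding of the plan. For a fixed struct
// definition, encoding/json emits keys in declaration order, so the
// digest is stable across processes.
func PlanSHA256(p Plan) string {
	b, _ := json.Marshal(p)
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	p := Plan{CommitID: "abc123", DepsSHA: "deadbeef", Image: "base:latest", GPUCount: 1}
	fmt.Println(PlanSHA256(p) == PlanSHA256(p)) // identical plans -> identical digest
}
```

Recording this digest in task metadata lets `ml validate` later confirm that what ran is what `ml explain` previewed.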
#### 3) Tighten run manifest completeness

- For `running`: require `started_at`.
- For `completed/failed`: require `started_at`, `ended_at`, and `exit_code`.
- When snapshots/datasets are used: ensure the manifest records the relevant identifiers and digests.

#### 4) Dataset identity (minimal but research-grade)

- Prefer structured `dataset_specs` (name + checksum) as the authoritative input.
- Treat a missing checksum as an error by default (strict-by-default).

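The completeness rules in #3 can be sketched as a validation function in Go (field names mirror the description above, not a confirmed `run_manifest.json` schema):

```go
package main

import "fmt"

// Manifest is a sketch of the run-manifest fields the lifecycle checks need.
type Manifest struct {
	Status    string `json:"status"`
	StartedAt string `json:"started_at,omitempty"`
	EndedAt   string `json:"ended_at,omitempty"`
	ExitCode  *int   `json:"exit_code,omitempty"` // pointer: 0 is a valid exit code
}

// CheckLifecycle enforces the per-status completeness rules.
func CheckLifecycle(m Manifest) error {
	switch m.Status {
	case "running":
		if m.StartedAt == "" {
			return fmt.Errorf("running task missing started_at")
		}
	case "completed", "failed":
		if m.StartedAt == "" || m.EndedAt == "" || m.ExitCode == nil {
			return fmt.Errorf("%s task missing started_at/ended_at/exit_code", m.Status)
		}
	}
	return nil
}

func main() {
	code := 0
	ok := Manifest{Status: "completed", StartedAt: "2025-01-01T00:00:00Z",
		EndedAt: "2025-01-01T00:10:00Z", ExitCode: &code}
	fmt.Println(CheckLifecycle(ok))                          // <nil>: complete
	fmt.Println(CheckLifecycle(Manifest{Status: "running"})) // error: missing started_at
}
```

Using a pointer for `exit_code` distinguishes "exited with 0" from "never recorded", which is exactly the gap these checks are meant to close.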
### Simple performance wins (only after trust/usability feels solid)

- Keep prewarming single-level (next task only).
- Improve observability first (status output + metrics), then expand capabilities.

### Research workflows

- `ml compare <runA> <runB>`: manifest-driven diff of provenance and key parameters.
- `ml reproduce <run-id>`: submit a new task derived from the recorded manifest inputs.
- `ml export <run-id>`: package provenance + artifacts for collaborators/reviewers.

### Infrastructure (only if needed)

- Multi-level prewarming, predictive scheduling, tmpfs caching, dashboards.
- Optional scalable storage backend for team deployments:
  - Store run manifests + artifacts in S3-compatible object storage (e.g., MinIO) for durability and multi-worker/Kubernetes setups.
  - Keep workstation-first defaults (local filesystem) for simplicity.
- Optional integrations via plugins/exporters (keep core strict and offline-capable):
  - Server-side exporters that mirror run metadata, metrics, and artifacts to external systems (e.g., MLflow Tracking, Weights & Biases).
  - Prefer lifecycle hooks that consume `run_manifest.json` / artifact manifests over plugins that influence execution semantics.
- Optional Kubernetes deployment path (for teams on scalable infra):
  - Publish versioned container images for the backend (API server; optionally worker) and provide reference manifests (Helm/Kustomize).
  - Keep the CLI as the primary UX; Kubernetes is an execution/deployment backend, not a UI.
- These are optional and should be driven by measured bottlenecks.
---

This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.

## See Also

- **[Scheduler Architecture](scheduler-architecture.md)** - Detailed scheduler design and protocols
- **[Security Guide](security.md)** - Security architecture and best practices
- **[Configuration Reference](configuration-reference.md)** - Configuration options and environment variables
- **[Deployment Guide](deployment.md)** - Production deployment architecture
- **[Performance & Monitoring](performance-monitoring.md)** - Metrics and observability
- **[Research Runner Plan](research-runner-plan.md)** - Roadmap and implementation phases
- **[Native Libraries](native-libraries.md)** - C++ performance optimizations