From f6e506a6328035299b926e96b44961a6cfdd057e Mon Sep 17 00:00:00 2001 From: Jeremie Fraeys Date: Mon, 5 Jan 2026 12:41:09 -0500 Subject: [PATCH] chore(docs): remove legacy Jekyll docs/_pages after Hugo migration --- docs/_pages/architecture.md | 738 ----------------------------------- docs/_pages/cicd.md | 165 -------- docs/_pages/cli-reference.md | 404 ------------------- docs/_pages/operations.md | 310 --------------- docs/_pages/queue.md | 322 --------------- docs/_pages/redis-ha.md | 95 ----- docs/_pages/zig-cli.md | 452 --------------------- 7 files changed, 2486 deletions(-) delete mode 100644 docs/_pages/architecture.md delete mode 100644 docs/_pages/cicd.md delete mode 100644 docs/_pages/cli-reference.md delete mode 100644 docs/_pages/operations.md delete mode 100644 docs/_pages/queue.md delete mode 100644 docs/_pages/redis-ha.md delete mode 100644 docs/_pages/zig-cli.md diff --git a/docs/_pages/architecture.md b/docs/_pages/architecture.md deleted file mode 100644 index 73d2360..0000000 --- a/docs/_pages/architecture.md +++ /dev/null @@ -1,738 +0,0 @@ ---- -layout: page -title: "Homelab Architecture" -permalink: /architecture/ -nav_order: 1 ---- - -# Homelab Architecture - -Simple, secure architecture for ML experiments in your homelab. 
- -## Components Overview - -```mermaid -graph TB - subgraph "Homelab Stack" - CLI[Zig CLI] - API[HTTPS API] - REDIS[Redis Cache] - FS[Local Storage] - end - - CLI --> API - API --> REDIS - API --> FS -``` - -## Core Services - -### API Server -- **Purpose**: Secure HTTPS API for ML experiments -- **Port**: 9101 (HTTPS only) -- **Auth**: API key authentication -- **Security**: Rate limiting, IP whitelisting - -### Redis -- **Purpose**: Caching and job queuing -- **Port**: 6379 (localhost only) -- **Storage**: Temporary data only -- **Persistence**: Local volume - -### Zig CLI -- **Purpose**: High-performance experiment management -- **Language**: Zig for maximum speed and efficiency -- **Features**: - - Content-addressed storage with deduplication - - SHA256-based commit ID generation - - WebSocket communication for real-time updates - - Rsync-based incremental file transfers - - Multi-threaded operations - - Secure API key authentication - - Auto-sync monitoring with file system watching - - Priority-based job queuing - - Memory-efficient operations with arena allocators - -## Security Architecture - -```mermaid -graph LR - USER[User] --> AUTH[API Key Auth] - AUTH --> RATE[Rate Limiting] - RATE --> WHITELIST[IP Whitelist] - WHITELIST --> API[Secure API] - API --> AUDIT[Audit Logging] -``` - -### Security Layers -1. **API Key Authentication** - Hashed keys with roles -2. **Rate Limiting** - 30 requests/minute -3. **IP Whitelisting** - Local networks only -4. **Fail2Ban** - Automatic IP blocking -5. **HTTPS/TLS** - Encrypted communication -6. 
**Audit Logging** - Complete action tracking - -## Data Flow - -```mermaid -sequenceDiagram - participant CLI - participant API - participant Redis - participant Storage - - CLI->>API: HTTPS Request - API->>API: Validate Auth - API->>Redis: Cache/Queue - API->>Storage: Experiment Data - Storage->>API: Results - API->>CLI: Response -``` - -## Deployment Options - -### Docker Compose (Recommended) -```yaml -services: - redis: - image: redis:7-alpine - ports: ["6379:6379"] - volumes: [redis_data:/data] - - api-server: - build: . - ports: ["9101:9101"] - depends_on: [redis] -``` - -### Local Setup -```bash -./setup.sh && ./manage.sh start -``` - -## Network Architecture - -- **Private Network**: Docker internal network -- **Localhost Access**: Redis only on localhost -- **HTTPS API**: Port 9101, TLS encrypted -- **No External Dependencies**: Everything runs locally - -## Storage Architecture - -``` -data/ -├── experiments/ # ML experiment results -├── cache/ # Temporary cache files -└── backups/ # Local backups - -logs/ -├── app.log # Application logs -├── audit.log # Security events -└── access.log # API access logs -``` - -## Monitoring Architecture - -Simple, lightweight monitoring: -- **Health Checks**: Service availability -- **Log Files**: Structured logging -- **Basic Metrics**: Request counts, error rates -- **Security Events**: Failed auth, rate limits - -## Homelab Benefits - -- ✅ **Simple Setup**: One-command installation -- ✅ **Local Only**: No external dependencies -- ✅ **Secure by Default**: HTTPS, auth, rate limiting -- ✅ **Low Resource**: Minimal CPU/memory usage -- ✅ **Easy Backup**: Local file system -- ✅ **Privacy**: Everything stays on your network - -## High-Level Architecture - -```mermaid -graph TB - subgraph "Client Layer" - CLI[CLI Tools] - TUI[Terminal UI] - API[REST API] - end - - subgraph "Authentication Layer" - Auth[Authentication Service] - RBAC[Role-Based Access Control] - Perm[Permission Manager] - end - - subgraph "Core Services" - 
Worker[ML Worker Service] - DataMgr[Data Manager Service] - Queue[Job Queue] - end - - subgraph "Storage Layer" - Redis[(Redis Cache)] - DB[(SQLite/PostgreSQL)] - Files[File Storage] - end - - subgraph "Container Runtime" - Podman[Podman/Docker] - Containers[ML Containers] - end - - CLI --> Auth - TUI --> Auth - API --> Auth - - Auth --> RBAC - RBAC --> Perm - - Worker --> Queue - Worker --> DataMgr - Worker --> Podman - - DataMgr --> DB - DataMgr --> Files - - Queue --> Redis - - Podman --> Containers -``` - -## Zig CLI Architecture - -### Component Structure - -```mermaid -graph TB - subgraph "Zig CLI Components" - Main[main.zig] --> Commands[commands/] - Commands --> Config[config.zig] - Commands --> Utils[utils/] - Commands --> Net[net/] - Commands --> Errors[errors.zig] - - subgraph "Commands" - Init[init.zig] - Sync[sync.zig] - Queue[queue.zig] - Watch[watch.zig] - Status[status.zig] - Monitor[monitor.zig] - Cancel[cancel.zig] - Prune[prune.zig] - end - - subgraph "Utils" - Crypto[crypto.zig] - Storage[storage.zig] - Rsync[rsync.zig] - end - - subgraph "Network" - WS[ws.zig] - end - end -``` - -### Performance Optimizations - -#### Content-Addressed Storage -- **Deduplication**: Files stored by SHA256 hash -- **Space Efficiency**: Shared files across experiments -- **Fast Lookup**: Hash-based file retrieval - -#### Memory Management -- **Arena Allocators**: Efficient bulk allocation -- **Zero-Copy Operations**: Minimized memory copying -- **Automatic Cleanup**: Resource deallocation - -#### Network Communication -- **WebSocket Protocol**: Real-time bidirectional communication -- **Connection Pooling**: Reused connections -- **Binary Messaging**: Efficient data transfer - -### Security Implementation - -```mermaid -graph LR - subgraph "CLI Security" - Config[Config File] --> Hash[SHA256 Hashing] - Hash --> Auth[API Authentication] - Auth --> SSH[SSH Transfer] - SSH --> WS[WebSocket Security] - end -``` - -## Core Components - -### 1. 
Authentication & Authorization - -```mermaid -graph LR - subgraph "Auth Flow" - Client[Client] --> APIKey[API Key] - APIKey --> Hash[Hash Validation] - Hash --> Roles[Role Resolution] - Roles --> Perms[Permission Check] - Perms --> Access[Grant/Deny Access] - end - - subgraph "Permission Sources" - YAML[YAML Config] - Inline[Inline Fallback] - Roles --> YAML - Roles --> Inline - end -``` - -**Features:** -- API key-based authentication -- Role-based access control (RBAC) -- YAML-based permission configuration -- Fallback to inline permissions -- Admin wildcard permissions - -### 2. Worker Service - -```mermaid -graph TB - subgraph "Worker Architecture" - API[HTTP API] --> Router[Request Router] - Router --> Auth[Auth Middleware] - Auth --> Queue[Job Queue] - Queue --> Processor[Job Processor] - Processor --> Runtime[Container Runtime] - Runtime --> Storage[Result Storage] - - subgraph "Job Lifecycle" - Submit[Submit Job] --> Queue - Queue --> Execute[Execute] - Execute --> Monitor[Monitor] - Monitor --> Complete[Complete] - Complete --> Store[Store Results] - end - end -``` - -**Responsibilities:** -- HTTP API for job submission -- Job queue management -- Container orchestration -- Result collection and storage -- Metrics and monitoring - -### 3. Data Manager Service - -```mermaid -graph TB - subgraph "Data Management" - API[Data API] --> Storage[Storage Layer] - Storage --> Metadata[Metadata DB] - Storage --> Files[File System] - Storage --> Cache[Redis Cache] - - subgraph "Data Operations" - Upload[Upload Data] --> Validate[Validate] - Validate --> Store[Store] - Store --> Index[Index] - Index --> Catalog[Catalog] - end - end -``` - -**Features:** -- Data upload and validation -- Metadata management -- File system abstraction -- Caching layer -- Data catalog - -### 4. 
Terminal UI (TUI) - -```mermaid -graph TB - subgraph "TUI Architecture" - UI[UI Components] --> Model[Data Model] - Model --> Update[Update Loop] - Update --> Render[Render] - - subgraph "UI Panels" - Jobs[Job List] - Details[Job Details] - Logs[Log Viewer] - Status[Status Bar] - end - - UI --> Jobs - UI --> Details - UI --> Logs - UI --> Status - end -``` - -**Components:** -- Bubble Tea framework -- Component-based architecture -- Real-time updates -- Keyboard navigation -- Theme support - -## Data Flow - -### Job Execution Flow - -```mermaid -sequenceDiagram - participant Client - participant Auth - participant Worker - participant Queue - participant Container - participant Storage - - Client->>Auth: Submit job with API key - Auth->>Client: Validate and return job ID - - Client->>Worker: Execute job request - Worker->>Queue: Queue job - Queue->>Worker: Job ready - Worker->>Container: Start ML container - Container->>Worker: Execute experiment - Worker->>Storage: Store results - Worker->>Client: Return results -``` - -### Authentication Flow - -```mermaid -sequenceDiagram - participant Client - participant Auth - participant PermMgr - participant Config - - Client->>Auth: Request with API key - Auth->>Auth: Validate key hash - Auth->>PermMgr: Get user permissions - PermMgr->>Config: Load YAML permissions - Config->>PermMgr: Return permissions - PermMgr->>Auth: Return resolved permissions - Auth->>Client: Grant/deny access -``` - -## Security Architecture - -### Defense in Depth - -```mermaid -graph TB - subgraph "Security Layers" - Network[Network Security] - Auth[Authentication] - AuthZ[Authorization] - Container[Container Security] - Data[Data Protection] - Audit[Audit Logging] - end - - Network --> Auth - Auth --> AuthZ - AuthZ --> Container - Container --> Data - Data --> Audit -``` - -**Security Features:** -- API key authentication -- Role-based permissions -- Container isolation -- File system sandboxing -- Comprehensive audit logs -- Input validation and 
sanitization - -### Container Security - -```mermaid -graph TB - subgraph "Container Isolation" - Host[Host System] - Podman[Podman Runtime] - Network[Network Isolation] - FS[File System Isolation] - User[User Namespaces] - ML[ML Container] - - Host --> Podman - Podman --> Network - Podman --> FS - Podman --> User - User --> ML - end -``` - -**Isolation Features:** -- Rootless containers -- Network isolation -- File system sandboxing -- User namespace mapping -- Resource limits - -## Configuration Architecture - -### Configuration Hierarchy - -```mermaid -graph TB - subgraph "Config Sources" - Env[Environment Variables] - File[Config Files] - CLI[CLI Flags] - Defaults[Default Values] - end - - subgraph "Config Processing" - Merge[Config Merger] - Validate[Schema Validator] - Apply[Config Applier] - end - - Env --> Merge - File --> Merge - CLI --> Merge - Defaults --> Merge - - Merge --> Validate - Validate --> Apply -``` - -**Configuration Priority:** -1. CLI flags (highest) -2. Environment variables -3. Configuration files -4. 
Default values (lowest) - -## Scalability Architecture - -### Horizontal Scaling - -```mermaid -graph TB - subgraph "Scaled Architecture" - LB[Load Balancer] - W1[Worker 1] - W2[Worker 2] - W3[Worker N] - Redis[Redis Cluster] - Storage[Shared Storage] - - LB --> W1 - LB --> W2 - LB --> W3 - - W1 --> Redis - W2 --> Redis - W3 --> Redis - - W1 --> Storage - W2 --> Storage - W3 --> Storage - end -``` - -**Scaling Features:** -- Stateless worker services -- Shared job queue (Redis) -- Distributed storage -- Load balancer ready -- Health checks and monitoring - -## Technology Stack - -### Backend Technologies - -| Component | Technology | Purpose | -|-----------|------------|---------| -| **Language** | Go 1.25+ | Core application | -| **Web Framework** | Standard library | HTTP server | -| **Authentication** | Custom | API key + RBAC | -| **Database** | SQLite/PostgreSQL | Metadata storage | -| **Cache** | Redis | Job queue & caching | -| **Containers** | Podman/Docker | Job isolation | -| **UI Framework** | Bubble Tea | Terminal UI | - -### Dependencies - -```go -// Core dependencies -require ( - github.com/charmbracelet/bubbletea v1.3.10 // TUI framework - github.com/go-redis/redis/v8 v8.11.5 // Redis client - github.com/google/uuid v1.6.0 // UUID generation - github.com/mattn/go-sqlite3 v1.14.32 // SQLite driver - golang.org/x/crypto v0.45.0 // Crypto utilities - gopkg.in/yaml.v3 v3.0.1 // YAML parsing -) -``` - -## Development Architecture - -### Project Structure - -``` -fetch_ml/ -├── cmd/ # CLI applications -│ ├── worker/ # ML worker service -│ ├── tui/ # Terminal UI -│ ├── data_manager/ # Data management -│ └── user_manager/ # User management -├── internal/ # Internal packages -│ ├── auth/ # Authentication system -│ ├── config/ # Configuration management -│ ├── container/ # Container operations -│ ├── database/ # Database operations -│ ├── logging/ # Logging utilities -│ ├── metrics/ # Metrics collection -│ └── network/ # Network utilities -├── configs/ # 
Configuration files -├── scripts/ # Setup and utility scripts -├── tests/ # Test suites -└── docs/ # Documentation -``` - -### Package Dependencies - -```mermaid -graph TB - subgraph "Application Layer" - Worker[cmd/worker] - TUI[cmd/tui] - DataMgr[cmd/data_manager] - UserMgr[cmd/user_manager] - end - - subgraph "Service Layer" - Auth[internal/auth] - Config[internal/config] - Container[internal/container] - Database[internal/database] - end - - subgraph "Utility Layer" - Logging[internal/logging] - Metrics[internal/metrics] - Network[internal/network] - end - - Worker --> Auth - Worker --> Config - Worker --> Container - TUI --> Auth - DataMgr --> Database - UserMgr --> Auth - - Auth --> Logging - Container --> Network - Database --> Metrics -``` - -## Monitoring & Observability - -### Metrics Collection - -```mermaid -graph TB - subgraph "Metrics Pipeline" - App[Application] --> Metrics[Metrics Collector] - Metrics --> Export[Prometheus Exporter] - Export --> Prometheus[Prometheus Server] - Prometheus --> Grafana[Grafana Dashboard] - - subgraph "Metric Types" - Counter[Counters] - Gauge[Gauges] - Histogram[Histograms] - Timer[Timers] - end - - App --> Counter - App --> Gauge - App --> Histogram - App --> Timer - end -``` - -### Logging Architecture - -```mermaid -graph TB - subgraph "Logging Pipeline" - App[Application] --> Logger[Structured Logger] - Logger --> File[File Output] - Logger --> Console[Console Output] - Logger --> Syslog[Syslog Forwarder] - Syslog --> Aggregator[Log Aggregator] - Aggregator --> Storage[Log Storage] - Storage --> Viewer[Log Viewer] - end -``` - -## Deployment Architecture - -### Container Deployment - -```mermaid -graph TB - subgraph "Deployment Stack" - Image[Container Image] - Registry[Container Registry] - Orchestrator[Docker Compose] - Config[ConfigMaps/Secrets] - Storage[Persistent Storage] - - Image --> Registry - Registry --> Orchestrator - Config --> Orchestrator - Storage --> Orchestrator - end -``` - -### Service Discovery 
- -```mermaid -graph TB - subgraph "Service Mesh" - Gateway[API Gateway] - Discovery[Service Discovery] - Worker[Worker Service] - Data[Data Service] - Redis[Redis Cluster] - - Gateway --> Discovery - Discovery --> Worker - Discovery --> Data - Discovery --> Redis - end -``` - -## Future Architecture Considerations - -### Microservices Evolution - -- **API Gateway**: Centralized routing and authentication -- **Service Mesh**: Inter-service communication -- **Event Streaming**: Kafka for job events -- **Distributed Tracing**: OpenTelemetry integration -- **Multi-tenant**: Tenant isolation and quotas - -### Homelab Features - -- **Docker Compose**: Simple container orchestration -- **Local Development**: Easy setup and testing -- **Security**: Built-in authentication and encryption -- **Monitoring**: Basic health checks and logging - ---- - -This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity. diff --git a/docs/_pages/cicd.md b/docs/_pages/cicd.md deleted file mode 100644 index 44baad0..0000000 --- a/docs/_pages/cicd.md +++ /dev/null @@ -1,165 +0,0 @@ ---- -layout: page -title: "CI/CD Pipeline" -permalink: /cicd/ -nav_order: 5 ---- - -# CI/CD Pipeline - -Automated testing, building, and releasing for fetch_ml. - -## Workflows - -### CI Workflow (`.github/workflows/ci.yml`) - -Runs on every push to `main`/`develop` and all pull requests. - -**Jobs:** -1. **test** - Go backend tests with Redis -2. **build** - Build all binaries (Go + Zig CLI) -3. **test-scripts** - Validate deployment scripts -4. **security-scan** - Trivy and Gosec security scans -5. **docker-build** - Build and push Docker images (main branch only) - -**Test Coverage:** -- Go unit tests with race detection -- `internal/queue` package tests -- Zig CLI tests -- Integration tests -- Security audits - -### Release Workflow (`.github/workflows/release.yml`) - -Runs on version tags (e.g., `v1.0.0`). 
- -**Jobs:** - -1. **build-cli** (matrix build) - - Linux x86_64 (static musl) - - macOS x86_64 - - macOS ARM64 - - Downloads platform-specific static rsync - - Embeds rsync for zero-dependency releases - -2. **build-go-backends** - - Cross-platform Go builds - - api-server, worker, tui, data_manager, user_manager - -3. **create-release** - - Collects all artifacts - - Generates SHA256 checksums - - Creates GitHub release with notes - -## Release Process - -### Creating a Release - -```bash -# 1. Update version -git tag v1.0.0 - -# 2. Push tag -git push origin v1.0.0 - -# 3. CI automatically builds and releases -``` - -### Release Artifacts - -**CLI Binaries (with embedded rsync):** -- `ml-linux-x86_64.tar.gz` (~450-650KB) -- `ml-macos-x86_64.tar.gz` (~450-650KB) -- `ml-macos-arm64.tar.gz` (~450-650KB) - -**Go Backends:** -- `fetch_ml_api-server.tar.gz` -- `fetch_ml_worker.tar.gz` -- `fetch_ml_tui.tar.gz` -- `fetch_ml_data_manager.tar.gz` -- `fetch_ml_user_manager.tar.gz` - -**Checksums:** -- `checksums.txt` - Combined SHA256 sums -- Individual `.sha256` files per binary - -## Development Workflow - -### Local Testing - -```bash -# Run all tests -make test - -# Run specific package tests -go test ./internal/queue/... - -# Build CLI -cd cli && zig build dev - -# Run formatters and linters -make lint - -# Security scans are handled automatically in CI by the `security-scan` job -``` - -#### Optional heavy end-to-end tests - -Some e2e tests exercise full Docker deployments and performance scenarios and are -**skipped by default** to keep local/CI runs fast. You can enable them explicitly -with environment variables: - -```bash -# Run Docker deployment e2e tests -FETCH_ML_E2E_DOCKER=1 go test ./tests/e2e/... - -# Run performance-oriented e2e tests -FETCH_ML_E2E_PERF=1 go test ./tests/e2e/... -``` - -Without these variables, `TestDockerDeploymentE2E` and `TestPerformanceE2E` will -`t.Skip`, while all lighter e2e tests still run. 
- -### Pull Request Checks - -All PRs must pass: -- ✅ Go tests (with Redis) -- ✅ CLI tests -- ✅ Security scans -- ✅ Code linting -- ✅ Build verification - -## Configuration - -### Environment Variables - -```yaml -GO_VERSION: '1.25.0' -ZIG_VERSION: '0.15.2' -``` - -### Secrets - -Required for releases: -- `GITHUB_TOKEN` - Automatic, provided by GitHub Actions - -## Monitoring - -### Build Status - -Check workflow runs at: -``` -https://github.com/jfraeys/fetch_ml/actions -``` - -### Artifacts - -Download build artifacts from: -- Successful workflow runs (30-day retention) -- GitHub Releases (permanent) - ---- - -For implementation details: -- [.github/workflows/ci.yml](https://github.com/jfraeys/fetch_ml/blob/main/.github/workflows/ci.yml) -- [.github/workflows/release.yml](https://github.com/jfraeys/fetch_ml/blob/main/.github/workflows/release.yml) diff --git a/docs/_pages/cli-reference.md b/docs/_pages/cli-reference.md deleted file mode 100644 index bd87520..0000000 --- a/docs/_pages/cli-reference.md +++ /dev/null @@ -1,404 +0,0 @@ ---- -layout: page -title: "CLI Reference" -permalink: /cli-reference/ -nav_order: 2 ---- - -# Fetch ML CLI Reference - -Comprehensive command-line tools for managing ML experiments in your homelab with Zig-based high-performance CLI. - -## Overview - -Fetch ML provides a comprehensive CLI toolkit built with performance and security in mind: - -- **Zig CLI** - High-performance experiment management written in Zig -- **Go Commands** - API server, TUI, and data management utilities -- **Management Scripts** - Service orchestration and deployment -- **Setup Scripts** - One-command installation and configuration - -## Zig CLI (`./cli/zig-out/bin/ml`) - -High-performance command-line interface for experiment management, written in Zig for speed and efficiency. 
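Content-addressed storage and SHA256 commit IDs come up repeatedly in this toolkit, so a short model of the idea may help. The sketch below is Go purely for illustration (the CLI itself is Zig), and `commitID` is a hypothetical name, not a function from the repository.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// commitID derives a deterministic hex identifier from content.
// Identical bytes always hash to the same ID, which is what lets a
// content-addressed store keep a single copy of duplicated files.
func commitID(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

func main() {
	a := commitID([]byte("train.py v1"))
	b := commitID([]byte("train.py v1")) // identical content
	c := commitID([]byte("train.py v2")) // changed content
	fmt.Println(a == b)                  // same ID: stored once, deduplicated
	fmt.Println(a != c)                  // new content: new object in the store
}
```

Because the ID is derived from content alone, re-syncing an unchanged project yields the same commit ID and no transfer is needed.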
- -### Available Commands - -| Command | Description | Example | -|---------|-------------|----------| -| `init` | Interactive configuration setup | `ml init` | -| `sync` | Sync project to worker with deduplication | `ml sync ./project --name myjob --queue` | -| `queue` | Queue job for execution | `ml queue myjob --commit abc123 --priority 8` | -| `status` | Get system and worker status | `ml status` | -| `monitor` | Launch TUI monitoring via SSH | `ml monitor` | -| `cancel` | Cancel running job | `ml cancel job123` | -| `prune` | Clean up old experiments | `ml prune --keep 10` | -| `watch` | Auto-sync directory on changes | `ml watch ./project --queue` | - -### Command Details - -#### `init` - Configuration Setup -```bash -ml init -``` -Creates a configuration template at `~/.ml/config.toml` with: -- Worker connection details -- API authentication -- Base paths and ports - -#### `sync` - Project Synchronization -```bash -# Basic sync -ml sync ./my-project - -# Sync with custom name and queue -ml sync ./my-project --name "experiment-1" --queue - -# Sync with priority -ml sync ./my-project --priority 9 -``` - -**Features:** -- Content-addressed storage for deduplication -- SHA256 commit ID generation -- Rsync-based file transfer -- Automatic queuing (with `--queue` flag) - -#### `queue` - Job Management -```bash -# Queue with commit ID -ml queue my-job --commit abc123def456 - -# Queue with priority (1-10, default 5) -ml queue my-job --commit abc123 --priority 8 -``` - -**Features:** -- WebSocket-based communication -- Priority queuing system -- API key authentication - -#### `watch` - Auto-Sync Monitoring -```bash -# Watch directory for changes -ml watch ./project - -# Watch and auto-queue on changes -ml watch ./project --name "dev-exp" --queue -``` - -**Features:** -- Real-time file system monitoring -- Automatic re-sync on changes -- Configurable polling interval (2 seconds) -- Commit ID comparison for efficiency - -#### `prune` - Cleanup Management -```bash -# 
Keep last N experiments -ml prune --keep 20 - -# Remove experiments older than N days -ml prune --older-than 30 -``` - -#### `monitor` - Remote Monitoring -```bash -ml monitor -``` -Launches TUI interface via SSH for real-time monitoring. - -#### `cancel` - Job Cancellation -```bash -ml cancel running-job-id -``` -Cancels currently running jobs by ID. - -### Configuration - -The Zig CLI reads configuration from `~/.ml/config.toml`: - -```toml -worker_host = "worker.local" -worker_user = "mluser" -worker_base = "/data/ml-experiments" -worker_port = 22 -api_key = "your-api-key" -``` - -### Performance Features - -- **Content-Addressed Storage**: Automatic deduplication of identical files -- **Incremental Sync**: Only transfers changed files -- **SHA256 Hashing**: Reliable commit ID generation -- **WebSocket Communication**: Efficient real-time messaging -- **Multi-threaded**: Concurrent operations where applicable - -## Go Commands - -### API Server (`./cmd/api-server/main.go`) -Main HTTPS API server for experiment management. - -```bash -# Build and run -go run ./cmd/api-server/main.go - -# With configuration -./bin/api-server --config configs/config-local.yaml -``` - -**Features:** -- HTTPS-only communication -- API key authentication -- Rate limiting and IP whitelisting -- WebSocket support for real-time updates -- Redis integration for caching - -### TUI (`./cmd/tui/main.go`) -Terminal User Interface for monitoring experiments. - -```bash -# Launch TUI -go run ./cmd/tui/main.go - -# With custom config -./tui --config configs/config-local.yaml -``` - -**Features:** -- Real-time experiment monitoring -- Interactive job management -- Status visualization -- Log viewing - -### Data Manager (`./cmd/data_manager/`) -Utilities for data synchronization and management. 
- -```bash -# Sync data -./data_manager --sync ./data - -# Clean old data -./data_manager --cleanup --older-than 30d -``` - -### Config Lint (`./cmd/configlint/main.go`) -Configuration validation and linting tool. - -```bash -# Validate configuration -./configlint configs/config-local.yaml - -# Check schema compliance -./configlint --schema configs/schema/config_schema.yaml -``` - -## Management Script (`./tools/manage.sh`) - -Simple service management for your homelab. - -### Commands -```bash -./tools/manage.sh start # Start all services -./tools/manage.sh stop # Stop all services -./tools/manage.sh status # Check service status -./tools/manage.sh logs # View logs -./tools/manage.sh monitor # Basic monitoring -./tools/manage.sh security # Security status -./tools/manage.sh cleanup # Clean project artifacts -``` - -## Setup Script (`./setup.sh`) - -One-command homelab setup. - -### Usage -```bash -# Full setup -./setup.sh - -# Setup includes: -# - SSL certificate generation -# - Configuration creation -# - Build all components -# - Start Redis -# - Setup Fail2Ban (if available) -``` - -## API Testing - -Test the API with curl: - -```bash -# Health check -curl -k -H 'X-API-Key: password' https://localhost:9101/health - -# List experiments -curl -k -H 'X-API-Key: password' https://localhost:9101/experiments - -# Submit experiment -curl -k -X POST -H 'X-API-Key: password' \ - -H 'Content-Type: application/json' \ - -d '{"name":"test","config":{"type":"basic"}}' \ - https://localhost:9101/experiments -``` - -## Zig CLI Architecture - -The Zig CLI is designed for performance and reliability: - -### Core Components -- **Commands** (`cli/src/commands/`): Individual command implementations -- **Config** (`cli/src/config.zig`): Configuration management -- **Network** (`cli/src/net/ws.zig`): WebSocket client implementation -- **Utils** (`cli/src/utils/`): Cryptography, storage, and rsync utilities -- **Errors** (`cli/src/errors.zig`): Centralized error handling - -### 
Performance Optimizations -- **Content-Addressed Storage**: Deduplicates identical files across experiments -- **SHA256 Hashing**: Fast, reliable commit ID generation -- **Rsync Integration**: Efficient incremental file transfers -- **WebSocket Protocol**: Low-latency communication with worker -- **Memory Management**: Efficient allocation with Zig's allocator system - -### Security Features -- **API Key Hashing**: Secure authentication token handling -- **SSH Integration**: Secure file transfers -- **Input Validation**: Comprehensive argument checking -- **Error Handling**: Secure error reporting without information leakage - -## Configuration - -Main configuration file: `configs/config-local.yaml` - -### Key Settings -```yaml -auth: - enabled: true - api_keys: - homelab_user: - hash: "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8" - admin: true - -server: - address: ":9101" - tls: - enabled: true - cert_file: "./ssl/cert.pem" - key_file: "./ssl/key.pem" - -security: - rate_limit: - enabled: true - requests_per_minute: 30 - ip_whitelist: - - "127.0.0.1" - - "::1" - - "192.168.0.0/16" - - "10.0.0.0/8" -``` - -## Docker Commands - -If using Docker Compose: - -```bash -# Start services -docker-compose up -d - -# View logs -docker-compose logs -f - -# Stop services -docker-compose down - -# Check status -docker-compose ps -``` - -## Troubleshooting - -### Common Issues - -**Zig CLI not found:** -```bash -# Build the CLI -cd cli && make build - -# Check binary exists -ls -la ./cli/zig-out/bin/ml -``` - -**Configuration not found:** -```bash -# Create configuration -./cli/zig-out/bin/ml init - -# Check config file -ls -la ~/.ml/config.toml -``` - -**Worker connection failed:** -```bash -# Test SSH connection -ssh -p 22 mluser@worker.local - -# Check configuration -cat ~/.ml/config.toml -``` - -**Sync not working:** -```bash -# Check rsync availability -rsync --version - -# Test manual sync -rsync -avz ./project/ mluser@worker.local:/tmp/test/ -``` - 
-**WebSocket connection failed:** -```bash -# Check worker WebSocket port -telnet worker.local 9100 - -# Verify API key -./cli/zig-out/bin/ml status -``` - -**API not responding:** -```bash -./tools/manage.sh status -./tools/manage.sh logs -``` - -**Authentication failed:** -```bash -# Check API key in config-local.yaml -grep -A 5 "api_keys:" configs/config-local.yaml -``` - -**Redis connection failed:** -```bash -# Check Redis status -redis-cli ping - -# Start Redis -redis-server -``` - -### Getting Help - -```bash -# CLI help -./cli/zig-out/bin/ml help - -# Management script help -./tools/manage.sh help - -# Check all available commands -make help -``` - ---- - -**That's it for the CLI reference!** For complete setup instructions, see the main [README](/). diff --git a/docs/_pages/operations.md b/docs/_pages/operations.md deleted file mode 100644 index 12168f2..0000000 --- a/docs/_pages/operations.md +++ /dev/null @@ -1,310 +0,0 @@ ---- -layout: page -title: "Operations Runbook" -permalink: /operations/ -nav_order: 6 ---- - -# Operations Runbook - -Operational guide for troubleshooting and maintaining the ML experiment system. - -## Task Queue Operations - -### Monitoring Queue Health - -```redis -# Check queue depth -ZCARD task:queue - -# List pending tasks -ZRANGE task:queue 0 -1 WITHSCORES - -# Check dead letter queue -KEYS task:dlq:* -``` - -### Handling Stuck Tasks - -**Symptom:** Tasks stuck in "running" status - -**Diagnosis:** -```bash -# Check for expired leases -redis-cli GET task:{task-id} -# Look for LeaseExpiry in past -``` - -**Remediation:** -Tasks with expired leases are automatically reclaimed every 1 minute.
To force immediate reclamation: -```bash -# Restart worker to trigger reclaim cycle -systemctl restart ml-worker -``` - -### Dead Letter Queue Management - -**View failed tasks:** -```redis -KEYS task:dlq:* -``` - -**Inspect failed task:** -```redis -GET task:dlq:{task-id} -``` - -**Retry from DLQ:** -```bash -# Manual retry (requires custom script) -# 1. Get task from DLQ -# 2. Reset retry count -# 3. Re-queue task -``` - -### Worker Crashes - -**Symptom:** Worker disappeared mid-task - -**What Happens:** -1. Lease expires after 30 minutes (default) -2. Background reclaim job detects expired lease -3. Task is retried (up to 3 attempts) -4. After max retries → Dead Letter Queue - -**Prevention:** -- Monitor worker heartbeats -- Set up alerts for worker down -- Use process manager (systemd, supervisor) - -## Worker Operations - -### Graceful Shutdown - -```bash -# Send SIGTERM for graceful shutdown -kill -TERM $(pgrep ml-worker) - -# Worker will: -# 1. Stop accepting new tasks -# 2. Finish active tasks (up to 5min timeout) -# 3. Release all leases -# 4. Exit cleanly -``` - -### Force Shutdown - -```bash -# Force kill (leases will be reclaimed automatically) -kill -9 $(pgrep ml-worker) -``` - -### Worker Heartbeat Monitoring - -```redis -# Check worker heartbeats -HGETALL worker:heartbeat - -# Example output: -# worker-abc123 1701234567 -# worker-def456 1701234580 -``` - -**Alert if:** Heartbeat timestamp > 5 minutes old - -## Redis Operations - -### Backup - -```bash -# Manual backup -redis-cli SAVE -cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb -``` - -### Restore - -```bash -# Stop Redis -systemctl stop redis - -# Restore snapshot -cp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb - -# Start Redis -systemctl start redis -``` - -### Memory Management - -```redis -# Check memory usage -INFO memory - -# Evict old data if needed -FLUSHDB # DANGER: Clears all data! 
-``` - -## Common Issues - -### Issue: Queue Growing Unbounded - -**Symptoms:** -- `ZCARD task:queue` keeps increasing -- No workers processing tasks - -**Diagnosis:** -```bash -# Check worker status -systemctl status ml-worker - -# Check logs -journalctl -u ml-worker -n 100 -``` - -**Resolution:** -1. Verify workers are running -2. Check Redis connectivity -3. Verify lease configuration - -### Issue: High Retry Rate - -**Symptoms:** -- Many tasks in DLQ -- `retry_count` field high on tasks - -**Diagnosis:** -```bash -# Check worker logs for errors -journalctl -u ml-worker | grep "retry" - -# Look for patterns (network issues, resource limits, etc) -``` - -**Resolution:** -- Fix underlying issue (network, resources, etc) -- Adjust retry limits if permanent failures -- Increase task timeout if jobs are slow - -### Issue: Leases Expiring Prematurely - -**Symptoms:** -- Tasks retried even though worker is healthy -- Logs show "lease expired" frequently - -**Diagnosis:** -```yaml -# Check worker config -cat configs/worker-config.yaml | grep -A3 "lease" - -task_lease_duration: 30m # Too short? -heartbeat_interval: 1m # Too infrequent? -``` - -**Resolution:** -```yaml -# Increase lease duration for long-running jobs -task_lease_duration: 60m -heartbeat_interval: 30s # More frequent heartbeats -``` - -## Performance Tuning - -### Worker Concurrency - -```yaml -# worker-config.yaml -max_workers: 4 # Number of parallel tasks - -# Adjust based on: -# - CPU cores available -# - Memory per task -# - GPU availability -``` - -### Redis Configuration - -```conf -# /etc/redis/redis.conf - -# Persistence -save 900 1 -save 300 10 - -# Memory -maxmemory 2gb -maxmemory-policy noeviction - -# Performance -tcp-keepalive 300 -timeout 0 -``` - -## Alerting Rules - -### Critical Alerts - -1. **Worker Down** (no heartbeat > 5min) -2. **Queue Depth** > 1000 tasks -3. **DLQ Growth** > 100 tasks/hour -4. **Redis Down** (connection failed) - -### Warning Alerts - -1. 
**High Retry Rate** > 10% of tasks
-2. **Slow Queue Drain** (depth increasing over 1 hour)
-3. **Worker Memory** > 80% usage
-
-## Health Checks
-
-```bash
-#!/bin/bash
-# health-check.sh
-
-# Check Redis
-redis-cli PING || echo "Redis DOWN"
-
-# Check worker heartbeat; guard against a missing entry so the
-# arithmetic below does not fail on an empty value
-WORKER_ID=$(cat /var/run/ml-worker.pid)
-LAST_HB=$(redis-cli HGET worker:heartbeat "$WORKER_ID")
-NOW=$(date +%s)
-if [ -z "$LAST_HB" ] || [ $((NOW - LAST_HB)) -gt 300 ]; then
-    echo "Worker heartbeat stale or missing"
-fi
-
-# Check queue depth
-DEPTH=$(redis-cli ZCARD task:queue)
-if [ "$DEPTH" -gt 1000 ]; then
-    echo "Queue depth critical: $DEPTH"
-fi
-```
-
-## Runbook Checklist
-
-### Daily Operations
-- [ ] Check queue depth
-- [ ] Verify worker heartbeats
-- [ ] Review DLQ for patterns
-- [ ] Check Redis memory usage
-
-### Weekly Operations
-- [ ] Review retry rates
-- [ ] Analyze failed task patterns
-- [ ] Backup Redis snapshot
-- [ ] Review worker logs
-
-### Monthly Operations
-- [ ] Performance tuning review
-- [ ] Capacity planning
-- [ ] Update documentation
-- [ ] Test disaster recovery
-
----
-
-**For homelab setups:**
-Most of these operations can be simplified. Focus on:
-- Basic monitoring (queue depth, worker status)
-- Periodic Redis backups
-- Graceful shutdowns for maintenance
diff --git a/docs/_pages/queue.md b/docs/_pages/queue.md
deleted file mode 100644
index 1697467..0000000
--- a/docs/_pages/queue.md
+++ /dev/null
@@ -1,322 +0,0 @@
----
-layout: page
-title: "Task Queue Architecture"
-permalink: /queue/
-nav_order: 3
----
-
-# Task Queue Architecture
-
-The task queue system enables reliable job processing between the API server and workers using Redis.
-
-## Overview
-
-```mermaid
-graph LR
-    CLI[CLI/Client] -->|WebSocket| API[API Server]
-    API -->|Enqueue| Redis[(Redis)]
-    Redis -->|Dequeue| Worker[Worker]
-    Worker -->|Update Status| Redis
-```
-
-## Components
-
-### TaskQueue (`internal/queue`)
-
-Shared package used by both API server and worker for job management.
-
-#### Task Structure
-
-```go
-type Task struct {
-    ID        string            // Unique task ID (UUID)
-    JobName   string            // User-defined job name
-    Args      string            // Job arguments
-    Status    string            // queued, running, completed, failed
-    Priority  int64             // Higher = executed first
-    CreatedAt time.Time
-    StartedAt *time.Time
-    EndedAt   *time.Time
-    WorkerID  string
-    Error     string
-    Datasets  []string
-    Metadata  map[string]string // commit_id, user, etc
-}
-```
-
-#### TaskQueue Interface
-
-```go
-// Initialize queue (named tq so it does not shadow the queue package)
-tq, err := queue.NewTaskQueue(queue.Config{
-    RedisAddr:     "localhost:6379",
-    RedisPassword: "",
-    RedisDB:       0,
-})
-
-// Add task (API server)
-task := &queue.Task{
-    ID:       uuid.New().String(),
-    JobName:  "train-model",
-    Status:   "queued",
-    Priority: 5,
-    Metadata: map[string]string{
-        "commit_id": commitID,
-        "user":      username,
-    },
-}
-err = tq.AddTask(task)
-
-// Get next task (Worker)
-task, err := tq.GetNextTask()
-
-// Update task status
-task.Status = "running"
-err = tq.UpdateTask(task)
-```
-
-## Data Flow
-
-### Job Submission Flow
-
-```mermaid
-sequenceDiagram
-    participant CLI
-    participant API
-    participant Redis
-    participant Worker
-
-    CLI->>API: Queue Job (WebSocket)
-    API->>API: Create Task (UUID)
-    API->>Redis: ZADD task:queue
-    API->>Redis: SET task:{id}
-    API->>CLI: Success Response
-
-    Worker->>Redis: ZPOPMAX task:queue
-    Redis->>Worker: Task ID
-    Worker->>Redis: GET task:{id}
-    Redis->>Worker: Task Data
-    Worker->>Worker: Execute Job
-    Worker->>Redis: Update Status
-```
-
-### Protocol
-
-**CLI → API** (Binary WebSocket):
-```
-[opcode:1][api_key_hash:64][commit_id:64][priority:1][job_name_len:1][job_name:var]
-```
-
-**API → Redis**:
-- Priority queue: `ZADD task:queue {priority} {task_id}`
-- Task data: `SET task:{id} {json}`
-- Status: `HSET task:status:{job_name} ...`
-
-**Worker ← Redis**:
-- Poll: `ZPOPMAX task:queue 1` (highest priority first)
-- Fetch: `GET task:{id}`
-
-## Redis Data Structures
-
-### Keys
-
-```
-task:queue               # ZSET:
priority queue -task:{uuid} # STRING: task JSON data -task:status:{job_name} # HASH: job status -worker:heartbeat # HASH: worker health -job:metrics:{job_name} # HASH: job metrics -``` - -### Priority Queue (ZSET) - -```redis -ZADD task:queue 10 "uuid-1" # Priority 10 -ZADD task:queue 5 "uuid-2" # Priority 5 -ZPOPMAX task:queue 1 # Returns uuid-1 (highest) -``` - -## API Server Integration - -### Initialization - -```go -// cmd/api-server/main.go -queueCfg := queue.Config{ - RedisAddr: cfg.Redis.Addr, - RedisPassword: cfg.Redis.Password, - RedisDB: cfg.Redis.DB, -} -taskQueue, err := queue.NewTaskQueue(queueCfg) -``` - -### WebSocket Handler - -```go -// internal/api/ws.go -func (h *WSHandler) handleQueueJob(conn *websocket.Conn, payload []byte) error { - // Parse request - apiKeyHash, commitID, priority, jobName := parsePayload(payload) - - // Create task with unique ID - taskID := uuid.New().String() - task := &queue.Task{ - ID: taskID, - JobName: jobName, - Status: "queued", - Priority: int64(priority), - Metadata: map[string]string{ - "commit_id": commitID, - "user": user, - }, - } - - // Enqueue - if err := h.queue.AddTask(task); err != nil { - return h.sendErrorPacket(conn, ErrorCodeDatabaseError, ...) 
- } - - return h.sendSuccessPacket(conn, "Job queued") -} -``` - -## Worker Integration - -### Task Polling - -```go -// cmd/worker/worker_server.go -func (w *Worker) Start() error { - for { - task, err := w.queue.WaitForNextTask(ctx, 5*time.Second) - if task != nil { - go w.executeTask(task) - } - } -} -``` - -### Task Execution - -```go -func (w *Worker) executeTask(task *queue.Task) { - // Update status - task.Status = "running" - task.StartedAt = &now - w.queue.UpdateTaskWithMetrics(task, "start") - - // Execute - err := w.runJob(task) - - // Finalize - task.Status = "completed" // or "failed" - task.EndedAt = &endTime - task.Error = err.Error() // if err != nil - w.queue.UpdateTaskWithMetrics(task, "final") -} -``` - -## Configuration - -### API Server (`configs/config.yaml`) - -```yaml -redis: - addr: "localhost:6379" - password: "" - db: 0 -``` - -### Worker (`configs/worker-config.yaml`) - -```yaml -redis: - addr: "localhost:6379" - password: "" - db: 0 - -metrics_flush_interval: 500ms -``` - -## Monitoring - -### Queue Depth - -```go -depth, err := queue.QueueDepth() -fmt.Printf("Pending tasks: %d\n", depth) -``` - -### Worker Heartbeat - -```go -// Worker sends heartbeat every 30s -err := queue.Heartbeat(workerID) -``` - -### Metrics - -```redis -HGETALL job:metrics:{job_name} -# Returns: timestamp, tasks_start, tasks_final, etc -``` - -## Error Handling - -### Task Failures - -```go -if err := w.runJob(task); err != nil { - task.Status = "failed" - task.Error = err.Error() - w.queue.UpdateTask(task) -} -``` - -### Redis Connection Loss - -```go -// TaskQueue automatically reconnects -// Workers should implement retry logic -for retries := 0; retries < 3; retries++ { - task, err := queue.GetNextTask() - if err == nil { - break - } - time.Sleep(backoff) -} -``` - -## Testing - -```go -// tests using miniredis -s, _ := miniredis.Run() -defer s.Close() - -tq, _ := queue.NewTaskQueue(queue.Config{ - RedisAddr: s.Addr(), -}) - -task := &queue.Task{ID: 
"test-1", JobName: "test"} -tq.AddTask(task) - -fetched, _ := tq.GetNextTask() -// assert fetched.ID == "test-1" -``` - -## Best Practices - -1. **Unique Task IDs**: Always use UUIDs to avoid conflicts -2. **Metadata**: Store commit_id and user in task metadata -3. **Priority**: Higher values execute first (0-255 range) -4. **Status Updates**: Update status at each lifecycle stage -5. **Error Logging**: Store detailed errors in task.Error -6. **Heartbeats**: Workers should send heartbeats regularly -7. **Metrics**: Use UpdateTaskWithMetrics for atomic updates - ---- - -For implementation details, see: -- [internal/queue/task.go](https://github.com/jfraeys/fetch_ml/blob/main/internal/queue/task.go) -- [internal/queue/queue.go](https://github.com/jfraeys/fetch_ml/blob/main/internal/queue/queue.go) diff --git a/docs/_pages/redis-ha.md b/docs/_pages/redis-ha.md deleted file mode 100644 index 4b7f17b..0000000 --- a/docs/_pages/redis-ha.md +++ /dev/null @@ -1,95 +0,0 @@ ---- -layout: page -title: "Redis High Availability (Optional)" -permalink: /redis-ha/ -nav_order: 7 ---- - -# Redis High Availability - -**Note:** This is optional for homelab setups. Single Redis instance is sufficient for most use cases. 
- -## When You Need HA - -Consider Redis HA if: -- Running production workloads -- Uptime > 99.9% required -- Can't afford to lose queued tasks -- Multiple workers across machines - -## Redis Sentinel (Recommended) - -### Setup - -```yaml -# docker-compose.yml -version: '3.8' -services: - redis-master: - image: redis:7-alpine - command: redis-server --maxmemory 2gb - - redis-replica: - image: redis:7-alpine - command: redis-server --slaveof redis-master 6379 - - redis-sentinel-1: - image: redis:7-alpine - command: redis-sentinel /etc/redis/sentinel.conf - volumes: - - ./sentinel.conf:/etc/redis/sentinel.conf -``` - -**sentinel.conf:** -```conf -sentinel monitor mymaster redis-master 6379 2 -sentinel down-after-milliseconds mymaster 5000 -sentinel parallel-syncs mymaster 1 -sentinel failover-timeout mymaster 10000 -``` - -### Application Configuration - -```yaml -# worker-config.yaml -redis_addr: "redis-sentinel-1:26379,redis-sentinel-2:26379" -redis_master_name: "mymaster" -``` - -## Redis Cluster (Advanced) - -For larger deployments with sharding needs. - -```yaml -# Minimum 3 masters + 3 replicas -services: - redis-1: - image: redis:7-alpine - command: redis-server --cluster-enabled yes - - redis-2: - # ... similar config -``` - -## Homelab Alternative: Persistence Only - -**For most homelabs, just enable persistence:** - -```yaml -# docker-compose.yml -services: - redis: - image: redis:7-alpine - command: redis-server --appendonly yes - volumes: - - redis_data:/data - -volumes: - redis_data: -``` - -This ensures tasks survive Redis restarts without full HA complexity. - ---- - -**Recommendation:** Start simple. Add HA only if you experience actual downtime issues. 
diff --git a/docs/_pages/zig-cli.md b/docs/_pages/zig-cli.md
deleted file mode 100644
index 439846e..0000000
--- a/docs/_pages/zig-cli.md
+++ /dev/null
@@ -1,452 +0,0 @@
----
-layout: page
-title: "Zig CLI Guide"
-permalink: /zig-cli/
-nav_order: 3
----
-
-# Zig CLI Guide
-
-High-performance command-line interface for ML experiment management, written in Zig for maximum speed and efficiency.
-
-## Overview
-
-The Zig CLI (`ml`) is the primary interface for managing ML experiments in your homelab. Built with Zig, it provides exceptional performance for file operations, network communication, and experiment management.
-
-## Installation
-
-### Pre-built Binaries (Recommended)
-
-Download from [GitHub Releases](https://github.com/jfraeys/fetch_ml/releases):
-
-```bash
-# Download for your platform
-curl -LO https://github.com/jfraeys/fetch_ml/releases/latest/download/ml-<platform>.tar.gz
-
-# Extract
-tar -xzf ml-<platform>.tar.gz
-
-# Install
-chmod +x ml-<platform>
-sudo mv ml-<platform> /usr/local/bin/ml
-
-# Verify
-ml --help
-```
-
-**Platforms:**
-- `ml-linux-x86_64.tar.gz` - Linux (fully static, zero dependencies)
-- `ml-macos-x86_64.tar.gz` - macOS Intel
-- `ml-macos-arm64.tar.gz` - macOS Apple Silicon
-
-All release binaries include **embedded static rsync** for complete independence.
-
-### Build from Source
-
-**Development Build** (uses system rsync):
-```bash
-cd cli
-zig build dev
-./zig-out/dev/ml-dev --help
-```
-
-**Production Build** (embedded rsync):
-```bash
-cd cli
-# For testing: uses rsync wrapper
-zig build prod
-
-# For release with static rsync:
-# 1. Place static rsync binary at src/assets/rsync_release.bin
-# 2. Build
-zig build prod
-strip zig-out/prod/ml  # Optional: reduce size
-
-# Verify
-./zig-out/prod/ml --help
-ls -lh zig-out/prod/ml
-```
-
-See [cli/src/assets/README.md](https://github.com/jfraeys/fetch_ml/blob/main/cli/src/assets/README.md) for details on obtaining static rsync binaries.
- -### Verify Installation -```bash -ml --help -ml --version # Shows build config -``` - -## Quick Start - -1. **Initialize Configuration** - ```bash - ./cli/zig-out/bin/ml init - ``` - -2. **Sync Your First Project** - ```bash - ./cli/zig-out/bin/ml sync ./my-project --queue - ``` - -3. **Monitor Progress** - ```bash - ./cli/zig-out/bin/ml status - ``` - -## Command Reference - -### `init` - Configuration Setup - -Initialize the CLI configuration file. - -```bash -ml init -``` - -**Creates:** `~/.ml/config.toml` - -**Configuration Template:** -```toml -worker_host = "worker.local" -worker_user = "mluser" -worker_base = "/data/ml-experiments" -worker_port = 22 -api_key = "your-api-key" -``` - -### `sync` - Project Synchronization - -Sync project files to the worker with intelligent deduplication. - -```bash -# Basic sync -ml sync ./project - -# Sync with custom name and auto-queue -ml sync ./project --name "experiment-1" --queue - -# Sync with priority -ml sync ./project --priority 8 -``` - -**Options:** -- `--name `: Custom experiment name -- `--queue`: Automatically queue after sync -- `--priority N`: Set priority (1-10, default 5) - -**Features:** -- **Content-Addressed Storage**: Automatic deduplication -- **SHA256 Commit IDs**: Reliable change detection -- **Incremental Transfer**: Only sync changed files -- **Rsync Backend**: Efficient file transfer - -### `queue` - Job Management - -Queue experiments for execution on the worker. - -```bash -# Queue with commit ID -ml queue my-job --commit abc123def456 - -# Queue with priority -ml queue my-job --commit abc123 --priority 8 -``` - -**Options:** -- `--commit `: Commit ID from sync output -- `--priority N`: Execution priority (1-10) - -**Features:** -- **WebSocket Communication**: Real-time job submission -- **Priority Queuing**: Higher priority jobs run first -- **API Authentication**: Secure job submission - -### `watch` - Auto-Sync Monitoring - -Monitor directories for changes and auto-sync. 
- -```bash -# Watch for changes -ml watch ./project - -# Watch and auto-queue on changes -ml watch ./project --name "dev-exp" --queue -``` - -**Options:** -- `--name `: Custom experiment name -- `--queue`: Auto-queue on changes -- `--priority N`: Set priority for queued jobs - -**Features:** -- **Real-time Monitoring**: 2-second polling interval -- **Change Detection**: File modification time tracking -- **Commit Comparison**: Only sync when content changes -- **Automatic Queuing**: Seamless development workflow - -### `status` - System Status - -Check system and worker status. - -```bash -ml status -``` - -**Displays:** -- Worker connectivity -- Queue status -- Running jobs -- System health - -### `monitor` - Remote Monitoring - -Launch TUI interface via SSH for real-time monitoring. - -```bash -ml monitor -``` - -**Features:** -- **Real-time Updates**: Live experiment status -- **Interactive Interface**: Browse and manage experiments -- **SSH Integration**: Secure remote access - -### `cancel` - Job Cancellation - -Cancel running or queued jobs. - -```bash -ml cancel job-id -``` - -**Options:** -- `job-id`: Job identifier from status output - -### `prune` - Cleanup Management - -Clean up old experiments to save space. 
- -```bash -# Keep last N experiments -ml prune --keep 20 - -# Remove experiments older than N days -ml prune --older-than 30 -``` - -**Options:** -- `--keep N`: Keep N most recent experiments -- `--older-than N`: Remove experiments older than N days - -## Architecture - -### Core Components - -``` -cli/src/ -├── commands/ # Command implementations -│ ├── init.zig # Configuration setup -│ ├── sync.zig # Project synchronization -│ ├── queue.zig # Job management -│ ├── watch.zig # Auto-sync monitoring -│ ├── status.zig # System status -│ ├── monitor.zig # Remote monitoring -│ ├── cancel.zig # Job cancellation -│ └── prune.zig # Cleanup operations -├── config.zig # Configuration management -├── errors.zig # Error handling -├── net/ # Network utilities -│ └── ws.zig # WebSocket client -└── utils/ # Utility functions - ├── crypto.zig # Hashing and encryption - ├── storage.zig # Content-addressed storage - └── rsync.zig # File synchronization -``` - -### Performance Features - -#### Content-Addressed Storage -- **Deduplication**: Identical files shared across experiments -- **Hash-based Storage**: Files stored by SHA256 hash -- **Space Efficiency**: Reduces storage by up to 90% - -#### SHA256 Commit IDs -- **Reliable Detection**: Cryptographic change detection -- **Collision Resistance**: Guaranteed unique identifiers -- **Fast Computation**: Optimized for large directories - -#### WebSocket Protocol -- **Low Latency**: Real-time communication -- **Binary Protocol**: Efficient message format -- **Connection Pooling**: Reused connections - -#### Memory Management -- **Arena Allocators**: Efficient memory allocation -- **Zero-copy Operations**: Minimized memory usage -- **Resource Cleanup**: Automatic resource management - -### Security Features - -#### Authentication -- **API Key Hashing**: Secure token storage -- **SHA256 Hashes**: Irreversible token protection -- **Config Validation**: Input sanitization - -#### Secure Communication -- **SSH Integration**: Encrypted 
file transfers -- **WebSocket Security**: TLS-protected communication -- **Input Validation**: Comprehensive argument checking - -#### Error Handling -- **Secure Reporting**: No sensitive information leakage -- **Graceful Degradation**: Safe error recovery -- **Audit Logging**: Operation tracking - -## Advanced Usage - -### Workflow Integration - -#### Development Workflow -```bash -# 1. Initialize project -ml sync ./project --name "dev" --queue - -# 2. Auto-sync during development -ml watch ./project --name "dev" --queue - -# 3. Monitor progress -ml status -``` - -#### Batch Processing -```bash -# Process multiple experiments -for dir in experiments/*/; do - ml sync "$dir" --queue -done -``` - -#### Priority Management -```bash -# High priority experiment -ml sync ./urgent --priority 10 --queue - -# Background processing -ml sync ./background --priority 1 --queue -``` - -### Configuration Management - -#### Multiple Workers -```toml -# ~/.ml/config.toml -worker_host = "worker.local" -worker_user = "mluser" -worker_base = "/data/ml-experiments" -worker_port = 22 -api_key = "your-api-key" -``` - -#### Security Settings -```bash -# Set restrictive permissions -chmod 600 ~/.ml/config.toml - -# Verify configuration -ml status -``` - -## Troubleshooting - -### Common Issues - -#### Build Problems -```bash -# Check Zig installation -zig version - -# Clean build -cd cli && make clean && make build -``` - -#### Connection Issues -```bash -# Test SSH connectivity -ssh -p $worker_port $worker_user@$worker_host - -# Verify configuration -cat ~/.ml/config.toml -``` - -#### Sync Failures -```bash -# Check rsync -rsync --version - -# Manual sync test -rsync -avz ./test/ $worker_user@$worker_host:/tmp/ -``` - -#### Performance Issues -```bash -# Monitor resource usage -top -p $(pgrep ml) - -# Check disk space -df -h $worker_base -``` - -### Debug Mode - -Enable verbose logging: -```bash -# Environment variable -export ML_DEBUG=1 -ml sync ./project - -# Or use debug build -cd cli 
&& make debug -``` - -## Performance Benchmarks - -### File Operations -- **Sync Speed**: 100MB/s+ (network limited) -- **Hash Computation**: 500MB/s+ (CPU limited) -- **Deduplication**: 90%+ space savings - -### Memory Usage -- **Base Memory**: ~10MB -- **Large Projects**: ~50MB (1GB+ projects) -- **Memory Efficiency**: Constant per-file overhead - -### Network Performance -- **WebSocket Latency**: <10ms (local network) -- **Connection Setup**: <100ms -- **Throughput**: Network limited - -## Contributing - -### Development Setup -```bash -cd cli -zig build-exe src/main.zig -``` - -### Testing -```bash -# Run tests -cd cli && zig test src/ - -# Integration tests -zig test tests/ -``` - -### Code Style -- Follow Zig style guidelines -- Use explicit error handling -- Document public APIs -- Add comprehensive tests - ---- - -**For more information, see the [CLI Reference](/cli-reference/) and [Architecture](/architecture/) pages.**