{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Fetch ML - Secure Machine Learning Platform","text":"<p>A secure, containerized platform for running machine learning experiments with role-based access control and comprehensive audit trails.</p>"},{"location":"#quick-start","title":"Quick Start","text":"<p>New to the project? Start here!</p> <pre><code># Clone the repository\ngit clone https://github.com/your-username/fetch_ml.git\ncd fetch_ml\n\n# Quick setup (builds everything, creates test user)\nmake quick-start\n\n# Create your API key\n./bin/user_manager --config configs/config_dev.yaml --cmd generate-key --username your_name --role data_scientist\n\n# Run your first experiment\n./bin/worker --config configs/config_dev.yaml --api-key YOUR_GENERATED_KEY\n</code></pre>"},{"location":"#quick-navigation","title":"Quick Navigation","text":""},{"location":"#getting-started","title":"\ud83d\ude80 Getting Started","text":"<ul> <li>Getting Started Guide - Complete setup instructions</li> <li>Simple Install - Quick installation guide</li> </ul>"},{"location":"#security-authentication","title":"\ud83d\udd12 Security &amp; Authentication","text":"<ul> <li>Security Overview - Security best practices</li> <li>API Key Process - Generate and manage API keys</li> <li>User Permissions - Role-based access control</li> </ul>"},{"location":"#configuration","title":"\u2699\ufe0f Configuration","text":"<ul> <li>Environment Variables - Configuration options</li> <li>Smart Defaults - Default configuration settings</li> </ul>"},{"location":"#development","title":"\ud83d\udee0\ufe0f Development","text":"<ul> <li>Architecture - System architecture and design</li> <li>CLI Reference - Command-line interface documentation</li> <li>Testing Guide - Testing procedures and guidelines</li> <li>Queue System - Job queue implementation</li> </ul>"},{"location":"#production-deployment","title":"\ud83c\udfed Production Deployment","text":"<ul> <li>Deployment Guide - Production deployment instructions</li> <li>Production Monitoring - Monitoring and observability</li> <li>Operations Guide - Production operations</li> </ul>"},{"location":"#features","title":"Features","text":"<ul> <li>\ud83d\udd10 Secure Authentication - RBAC with API keys, roles, and permissions</li> <li>\ud83d\udc33 Containerized - Podman-based secure execution environments</li> <li>\ud83d\uddc4\ufe0f Database Storage - SQLite backend for user management (optional)</li> <li>\ud83d\udccb Audit Trail - Complete logging of all actions</li> <li>\ud83d\ude80 Production Ready - Security audits, systemd services, log rotation</li> </ul>"},{"location":"#available-commands","title":"Available Commands","text":"<pre><code># Core commands\nmake help                    # See all available commands\nmake build                   # Build all binaries\nmake test-unit              # Run tests\n\n# User management\n./bin/user_manager --config configs/config_dev.yaml --cmd generate-key --username new_user --role data_scientist\n./bin/user_manager --config configs/config_dev.yaml --cmd list-users\n\n# Run services\n./bin/worker --config configs/config_dev.yaml --api-key YOUR_KEY\n./bin/tui --config configs/config_dev.yaml\n./bin/data_manager --config configs/config_dev.yaml\n</code></pre>"},{"location":"#need-help","title":"Need Help?","text":"<ul> <li>\ud83d\udcd6 Documentation: Use the navigation menu on the left</li> <li>\u26a1 Quick help: <code>make help</code></li> <li>\ud83e\uddea Tests: <code>make test-unit</code></li> </ul> <p>Happy ML experimenting!</p>"},{"location":"api-key-process/","title":"FetchML API Key Process","text":"<p>This document describes how API keys are issued and how team members should configure the <code>ml</code> CLI to use them.</p> <p>The goal is to keep access easy for your homelab while treating API keys as sensitive secrets.</p>"},{"location":"api-key-process/#overview","title":"Overview","text":"<ul> <li>Each user gets a personal API key (no shared admin keys for normal use).</li> <li>API keys are used by the <code>ml</code> CLI to authenticate to the FetchML API.</li> <li>API keys and their SHA256 hashes must both be treated as secrets.</li> </ul> <p>There are two supported ways to receive your key:</p> <ol> <li>Bitwarden (recommended) \u2013 for users who already use Bitwarden.</li> <li>Direct share (minimal tools) \u2013 for users who do not use Bitwarden.</li> </ol>"},{"location":"api-key-process/#1-bitwarden-based-process-recommended","title":"1. Bitwarden-based process (recommended)","text":""},{"location":"api-key-process/#for-the-admin","title":"For the admin","text":"<ul> <li>Use the helper script to create a Bitwarden item for each user:</li> </ul> <pre><code>./scripts/create_bitwarden_fetchml_item.sh &lt;username&gt; &lt;api_key&gt; &lt;api_key_hash&gt;\n</code></pre> <p>This script:</p> <ul> <li>Creates a Bitwarden item named <code>FetchML API \u2013 &lt;username&gt;</code>.</li> <li> <p>Stores:</p> <ul> <li>Username: <code>&lt;username&gt;</code></li> <li>Password: <code>&lt;api_key&gt;</code> (the actual API key)</li> <li>Custom field <code>api_key_hash</code>: <code>&lt;api_key_hash&gt;</code></li> </ul> </li> <li> <p>Share that item with the user in Bitwarden (for example, via a shared collection like <code>FetchML</code>).</p> </li> </ul>"},{"location":"api-key-process/#for-the-user","title":"For the user","text":"<ol> <li> <p>Open Bitwarden and locate the item:</p> </li> <li> <p>Name: <code>FetchML API \u2013 &lt;your-name&gt;</code></p> </li> <li> <p>Copy the password field (this is your FetchML API key).</p> </li> <li> <p>Configure the CLI, e.g. in <code>~/.ml/config.toml</code>:</p> </li> </ol> <pre><code>api_key     = \"&lt;paste-from-bitwarden&gt;\"\nworker_host = \"localhost\"\nworker_port = 9100\napi_url     = \"ws://localhost:9100/ws\"\n</code></pre> <ol> <li>Test your setup:</li> </ol> <pre><code>ml status\n</code></pre> <p>If the command works, your key and tunnel/config are correct.</p>"},{"location":"api-key-process/#2-direct-share-no-password-manager-required","title":"2. Direct share (no password manager required)","text":"<p>For users who do not use Bitwarden, a lightweight alternative is a direct one-to-one share.</p>"},{"location":"api-key-process/#for-the-admin_1","title":"For the admin","text":"<ol> <li>Generate a per-user API key and hash as usual.</li> <li>Store them securely on your side (for example, in your own Bitwarden vault or configuration files).</li> <li> <p>Share only the API key with the user via a direct channel you both trust, such as:</p> </li> <li> <p>Signal / WhatsApp direct message</p> </li> <li>SMS</li> <li> <p>Short call/meeting where you read it to them</p> </li> <li> <p>Ask the user to:</p> </li> <li> <p>Paste the key into their local config.</p> </li> <li>Avoid keeping the key in plain chat history if possible.</li> </ol>"},{"location":"api-key-process/#for-the-user_1","title":"For the user","text":"<ol> <li>When you receive your FetchML API key from the admin, create or edit <code>~/.ml/config.toml</code>:</li> </ol> <pre><code>api_key     = \"&lt;your-api-key&gt;\"\nworker_host = \"localhost\"\nworker_port = 9100\napi_url     = \"ws://localhost:9100/ws\"\n</code></pre> <ol> <li>Save the file and run:</li> </ol> <pre><code>ml status\n</code></pre> <ol> <li>If it works, you are ready to use the CLI:</li> </ol> <pre><code>ml queue my-training-job\nml cancel my-training-job\n</code></pre>"},{"location":"api-key-process/#3-security-notes","title":"3. Security notes","text":"<ul> <li>API key and hash are secrets</li> <li>The 64-character <code>api_key_hash</code> is as sensitive as the API key itself.</li> <li> <p>Do not commit keys or hashes to Git or share them in screenshots or tickets.</p> </li> <li> <p>Rotation</p> </li> <li>If you suspect a key has leaked, notify the admin.</li> <li> <p>The admin will revoke the old key, generate a new one, and update Bitwarden or share a new key.</p> </li> <li> <p>Transport security</p> </li> <li>The <code>api_url</code> is typically <code>ws://localhost:9100/ws</code> when used through an SSH tunnel to the homelab.</li> <li>The SSH tunnel and nginx/TLS provide encryption over the network.</li> </ul> <p>Following these steps keeps API access easy for the team while maintaining a reasonable security posture for a personal homelab deployment.</p>"},{"location":"architecture/","title":"Homelab Architecture","text":"<p>Simple, secure architecture for ML experiments in your homelab.</p>"},{"location":"architecture/#components-overview","title":"Components Overview","text":"<pre><code>graph TB\n    subgraph \"Homelab Stack\"\n        CLI[Zig CLI]\n        API[HTTPS API]\n        REDIS[Redis Cache]\n        FS[Local Storage]\n    end\n\n    CLI --&gt; API\n    API --&gt; REDIS\n    API --&gt; FS\n</code></pre>"},{"location":"architecture/#core-services","title":"Core Services","text":""},{"location":"architecture/#api-server","title":"API Server","text":"<ul> <li>Purpose: Secure HTTPS API for ML experiments</li> <li>Port: 9101 (HTTPS only)</li> <li>Auth: API key authentication</li> <li>Security: Rate limiting, IP whitelisting</li> </ul>"},{"location":"architecture/#redis","title":"Redis","text":"<ul> <li>Purpose: Caching and job queuing</li> <li>Port: 6379 (localhost only)</li> <li>Storage: Temporary data only</li> <li>Persistence: Local volume</li> </ul>"},{"location":"architecture/#zig-cli","title":"Zig CLI","text":"<ul> <li>Purpose: High-performance experiment management</li> <li>Language: Zig for maximum speed and efficiency</li> <li>Features:</li> <li>Content-addressed storage with deduplication</li> <li>SHA256-based commit ID generation</li> <li>WebSocket communication for real-time updates</li> <li>Rsync-based incremental file transfers</li> <li>Multi-threaded operations</li> <li>Secure API key authentication</li> <li>Auto-sync monitoring with file system watching</li> <li>Priority-based job queuing</li> <li>Memory-efficient operations with arena allocators</li> </ul>"},{"location":"architecture/#security-architecture","title":"Security Architecture","text":"<pre><code>graph LR\n    USER[User] --&gt; AUTH[API Key Auth]\n    AUTH --&gt; RATE[Rate Limiting]\n    RATE --&gt; WHITELIST[IP Whitelist]\n    WHITELIST --&gt; API[Secure API]\n    API --&gt; AUDIT[Audit Logging]\n</code></pre>"},{"location":"architecture/#security-layers","title":"Security Layers","text":"<ol> <li>API Key Authentication - Hashed keys with roles</li> <li>Rate Limiting - 30 requests/minute</li> <li>IP Whitelisting - Local networks only</li> <li>Fail2Ban - Automatic IP blocking</li> <li>HTTPS/TLS - Encrypted communication</li> <li>Audit Logging - Complete action tracking</li> </ol>"},{"location":"architecture/#data-flow","title":"Data Flow","text":"<pre><code>sequenceDiagram\n    participant CLI\n    participant API\n    participant Redis\n    participant Storage\n\n    CLI-&gt;&gt;API: HTTPS Request\n    API-&gt;&gt;API: Validate Auth\n    API-&gt;&gt;Redis: Cache/Queue\n    API-&gt;&gt;Storage: Experiment Data\n    Storage-&gt;&gt;API: Results\n    API-&gt;&gt;CLI: Response\n</code></pre>"},{"location":"architecture/#deployment-options","title":"Deployment Options","text":""},{"location":"architecture/#docker-compose-recommended","title":"Docker Compose (Recommended)","text":"<pre><code>services:\n  redis:\n    image: redis:7-alpine\n    ports: [\"6379:6379\"]\n    volumes: [redis_data:/data]\n\n  api-server:\n    build: .\n    ports: [\"9101:9101\"]\n    depends_on: [redis]\n</code></pre>"},{"location":"architecture/#local-setup","title":"Local Setup","text":"<pre><code>./setup.sh &amp;&amp; ./manage.sh start\n</code></pre>"},{"location":"architecture/#network-architecture","title":"Network Architecture","text":"<ul> <li>Private Network: Docker internal network</li> <li>Localhost Access: Redis only on localhost</li> <li>HTTPS API: Port 9101, TLS encrypted</li> <li>No External Dependencies: Everything runs locally</li> </ul>"},{"location":"architecture/#storage-architecture","title":"Storage Architecture","text":"<pre><code>data/\n\u251c\u2500\u2500 experiments/     # ML experiment results\n\u251c\u2500\u2500 cache/          # Temporary cache files\n\u2514\u2500\u2500 backups/        # Local backups\n\nlogs/\n\u251c\u2500\u2500 app.log         # Application logs\n\u251c\u2500\u2500 audit.log       # Security events\n\u2514\u2500\u2500 access.log      # API access logs\n</code></pre>"},{"location":"architecture/#monitoring-architecture","title":"Monitoring Architecture","text":"<p>Simple, lightweight monitoring: - Health Checks: Service availability - Log Files: Structured logging - Basic Metrics: Request counts, error rates - Security Events: Failed auth, rate limits</p>"},{"location":"architecture/#homelab-benefits","title":"Homelab Benefits","text":"<ul> <li>\u2705 Simple Setup: One-command installation</li> <li>\u2705 Local Only: No external dependencies</li> <li>\u2705 Secure by Default: HTTPS, auth, rate limiting</li> <li>\u2705 Low Resource: Minimal CPU/memory usage</li> <li>\u2705 Easy Backup: Local file system</li> <li>\u2705 Privacy: Everything stays on your network</li> </ul>"},{"location":"architecture/#high-level-architecture","title":"High-Level Architecture","text":"<pre><code>graph TB\n    subgraph \"Client Layer\"\n        CLI[CLI Tools]\n        TUI[Terminal UI]\n        API[REST API]\n    end\n\n    subgraph \"Authentication Layer\"\n        Auth[Authentication Service]\n        RBAC[Role-Based Access Control]\n        Perm[Permission Manager]\n    end\n\n    subgraph \"Core Services\"\n        Worker[ML Worker Service]\n        DataMgr[Data Manager Service]\n        Queue[Job Queue]\n    end\n\n    subgraph \"Storage Layer\"\n        Redis[(Redis Cache)]\n        DB[(SQLite/PostgreSQL)]\n        Files[File Storage]\n    end\n\n    subgraph \"Container Runtime\"\n        Podman[Podman/Docker]\n        Containers[ML Containers]\n    end\n\n    CLI --&gt; Auth\n    TUI --&gt; Auth\n    API --&gt; Auth\n\n    Auth --&gt; RBAC\n    RBAC --&gt; Perm\n\n    Worker --&gt; Queue\n    Worker --&gt; DataMgr\n    Worker --&gt; Podman\n\n    DataMgr --&gt; DB\n    DataMgr --&gt; Files\n\n    Queue --&gt; Redis\n\n    Podman --&gt; Containers\n</code></pre>"},{"location":"architecture/#zig-cli-architecture","title":"Zig CLI Architecture","text":""},{"location":"architecture/#component-structure","title":"Component Structure","text":"<pre><code>graph TB\n    subgraph \"Zig CLI Components\"\n        Main[main.zig] --&gt; Commands[commands/]\n        Commands --&gt; Config[config.zig]\n        Commands --&gt; Utils[utils/]\n        Commands --&gt; Net[net/]\n        Commands --&gt; Errors[errors.zig]\n\n        subgraph \"Commands\"\n            Init[init.zig]\n            Sync[sync.zig]\n            Queue[queue.zig]\n            Watch[watch.zig]\n            Status[status.zig]\n            Monitor[monitor.zig]\n            Cancel[cancel.zig]\n            Prune[prune.zig]\n        end\n\n        subgraph \"Utils\"\n            Crypto[crypto.zig]\n            Storage[storage.zig]\n            Rsync[rsync.zig]\n        end\n\n        subgraph \"Network\"\n            WS[ws.zig]\n        end\n    end\n</code></pre>"},{"location":"architecture/#performance-optimizations","title":"Performance Optimizations","text":""},{"location":"architecture/#content-addressed-storage","title":"Content-Addressed Storage","text":"<ul> <li>Deduplication: Files stored by SHA256 hash</li> <li>Space Efficiency: Shared files across experiments</li> <li>Fast Lookup: Hash-based file retrieval</li> </ul>"},{"location":"architecture/#memory-management","title":"Memory Management","text":"<ul> <li>Arena Allocators: Efficient bulk allocation</li> <li>Zero-Copy Operations: Minimized memory copying</li> <li>Automatic Cleanup: Resource deallocation</li> </ul>"},{"location":"architecture/#network-communication","title":"Network Communication","text":"<ul> <li>WebSocket Protocol: Real-time bidirectional communication</li> <li>Connection Pooling: Reused connections</li> <li>Binary Messaging: Efficient data transfer</li> </ul>"},{"location":"architecture/#security-implementation","title":"Security Implementation","text":"<pre><code>graph LR\n    subgraph \"CLI Security\"\n        Config[Config File] --&gt; Hash[SHA256 Hashing]\n        Hash --&gt; Auth[API Authentication]\n        Auth --&gt; SSH[SSH Transfer]\n        SSH --&gt; WS[WebSocket Security]\n    end\n</code></pre>"},{"location":"architecture/#core-components","title":"Core Components","text":""},{"location":"architecture/#1-authentication-authorization","title":"1. Authentication &amp; Authorization","text":"<pre><code>graph LR\n    subgraph \"Auth Flow\"\n        Client[Client] --&gt; APIKey[API Key]\n        APIKey --&gt; Hash[Hash Validation]\n        Hash --&gt; Roles[Role Resolution]\n        Roles --&gt; Perms[Permission Check]\n        Perms --&gt; Access[Grant/Deny Access]\n    end\n\n    subgraph \"Permission Sources\"\n        YAML[YAML Config]\n        Inline[Inline Fallback]\n        Roles --&gt; YAML\n        Roles --&gt; Inline\n    end\n</code></pre> <p>Features: - API key-based authentication - Role-based access control (RBAC) - YAML-based permission configuration - Fallback to inline permissions - Admin wildcard permissions</p>"},{"location":"architecture/#2-worker-service","title":"2. Worker Service","text":"<pre><code>graph TB\n    subgraph \"Worker Architecture\"\n        API[HTTP API] --&gt; Router[Request Router]\n        Router --&gt; Auth[Auth Middleware]\n        Auth --&gt; Queue[Job Queue]\n        Queue --&gt; Processor[Job Processor]\n        Processor --&gt; Runtime[Container Runtime]\n        Runtime --&gt; Storage[Result Storage]\n\n        subgraph \"Job Lifecycle\"\n            Submit[Submit Job] --&gt; Queue\n            Queue --&gt; Execute[Execute]\n            Execute --&gt; Monitor[Monitor]\n            Monitor --&gt; Complete[Complete]\n            Complete --&gt; Store[Store Results]\n        end\n    end\n</code></pre> <p>Responsibilities: - HTTP API for job submission - Job queue management - Container orchestration - Result collection and storage - Metrics and monitoring</p>"},{"location":"architecture/#3-data-manager-service","title":"3. Data Manager Service","text":"<pre><code>graph TB\n    subgraph \"Data Management\"\n        API[Data API] --&gt; Storage[Storage Layer]\n        Storage --&gt; Metadata[Metadata DB]\n        Storage --&gt; Files[File System]\n        Storage --&gt; Cache[Redis Cache]\n\n        subgraph \"Data Operations\"\n            Upload[Upload Data] --&gt; Validate[Validate]\n            Validate --&gt; Store[Store]\n            Store --&gt; Index[Index]\n            Index --&gt; Catalog[Catalog]\n        end\n    end\n</code></pre> <p>Features: - Data upload and validation - Metadata management - File system abstraction - Caching layer - Data catalog</p>"},{"location":"architecture/#4-terminal-ui-tui","title":"4. Terminal UI (TUI)","text":"<pre><code>graph TB\n    subgraph \"TUI Architecture\"\n        UI[UI Components] --&gt; Model[Data Model]\n        Model --&gt; Update[Update Loop]\n        Update --&gt; Render[Render]\n\n        subgraph \"UI Panels\"\n            Jobs[Job List]\n            Details[Job Details]\n            Logs[Log Viewer]\n            Status[Status Bar]\n        end\n\n        UI --&gt; Jobs\n        UI --&gt; Details\n        UI --&gt; Logs\n        UI --&gt; Status\n    end\n</code></pre> <p>Components: - Bubble Tea framework - Component-based architecture - Real-time updates - Keyboard navigation - Theme support</p>"},{"location":"architecture/#data-flow_1","title":"Data Flow","text":""},{"location":"architecture/#job-execution-flow","title":"Job Execution Flow","text":"<pre><code>sequenceDiagram\n    participant Client\n    participant Auth\n    participant Worker\n    participant Queue\n    participant Container\n    participant Storage\n\n    Client-&gt;&gt;Auth: Submit job with API key\n    Auth-&gt;&gt;Client: Validate and return job ID\n\n    Client-&gt;&gt;Worker: Execute job request\n    Worker-&gt;&gt;Queue: Queue job\n    Queue-&gt;&gt;Worker: Job ready\n    Worker-&gt;&gt;Container: Start ML container\n    Container-&gt;&gt;Worker: Execute experiment\n    Worker-&gt;&gt;Storage: Store results\n    Worker-&gt;&gt;Client: Return results\n</code></pre>"},{"location":"architecture/#authentication-flow","title":"Authentication Flow","text":"<pre><code>sequenceDiagram\n    participant Client\n    participant Auth\n    participant PermMgr\n    participant Config\n\n    Client-&gt;&gt;Auth: Request with API key\n    Auth-&gt;&gt;Auth: Validate key hash\n    Auth-&gt;&gt;PermMgr: Get user permissions\n    PermMgr-&gt;&gt;Config: Load YAML permissions\n    Config-&gt;&gt;PermMgr: Return permissions\n    PermMgr-&gt;&gt;Auth: Return resolved permissions\n    Auth-&gt;&gt;Client: Grant/deny access\n</code></pre>"},{"location":"architecture/#security-architecture_1","title":"Security Architecture","text":""},{"location":"architecture/#defense-in-depth","title":"Defense in Depth","text":"<pre><code>graph TB\n    subgraph \"Security Layers\"\n        Network[Network Security]\n        Auth[Authentication]\n        AuthZ[Authorization]\n        Container[Container Security]\n        Data[Data Protection]\n        Audit[Audit Logging]\n    end\n\n    Network --&gt; Auth\n    Auth --&gt; AuthZ\n    AuthZ --&gt; Container\n    Container --&gt; Data\n    Data --&gt; Audit\n</code></pre> <p>Security Features: - API key authentication - Role-based permissions - Container isolation - File system sandboxing - Comprehensive audit logs - Input validation and sanitization</p>"},{"location":"architecture/#container-security","title":"Container Security","text":"<pre><code>graph TB\n    subgraph \"Container Isolation\"\n        Host[Host System]\n        Podman[Podman Runtime]\n        Network[Network Isolation]\n        FS[File System Isolation]\n        User[User Namespaces]\n        ML[ML Container]\n\n        Host --&gt; Podman\n        Podman --&gt; Network\n        Podman --&gt; FS\n        Podman --&gt; User\n        User --&gt; ML\n    end\n</code></pre> <p>Isolation Features: - Rootless containers - Network isolation - File system sandboxing - User namespace mapping - Resource limits</p>"},{"location":"architecture/#configuration-architecture","title":"Configuration Architecture","text":""},{"location":"architecture/#configuration-hierarchy","title":"Configuration Hierarchy","text":"<pre><code>graph TB\n    subgraph \"Config Sources\"\n        Env[Environment Variables]\n        File[Config Files]\n        CLI[CLI Flags]\n        Defaults[Default Values]\n    end\n\n    subgraph \"Config Processing\"\n        Merge[Config Merger]\n        Validate[Schema Validator]\n        Apply[Config Applier]\n    end\n\n    Env --&gt; Merge\n    File --&gt; Merge\n    CLI --&gt; Merge\n    Defaults --&gt; Merge\n\n    Merge --&gt; Validate\n    Validate --&gt; Apply\n</code></pre> <p>Configuration Priority: 1. CLI flags (highest) 2. Environment variables 3. Configuration files 4. Default values (lowest)</p>"},{"location":"architecture/#scalability-architecture","title":"Scalability Architecture","text":""},{"location":"architecture/#horizontal-scaling","title":"Horizontal Scaling","text":"<pre><code>graph TB\n    subgraph \"Scaled Architecture\"\n        LB[Load Balancer]\n        W1[Worker 1]\n        W2[Worker 2]\n        W3[Worker N]\n        Redis[Redis Cluster]\n        Storage[Shared Storage]\n\n        LB --&gt; W1\n        LB --&gt; W2\n        LB --&gt; W3\n\n        W1 --&gt; Redis\n        W2 --&gt; Redis\n        W3 --&gt; Redis\n\n        W1 --&gt; Storage\n        W2 --&gt; Storage\n        W3 --&gt; Storage\n    end\n</code></pre> <p>Scaling Features: - Stateless worker services - Shared job queue (Redis) - Distributed storage - Load balancer ready - Health checks and monitoring</p>"},{"location":"architecture/#technology-stack","title":"Technology Stack","text":""},{"location":"architecture/#backend-technologies","title":"Backend Technologies","text":"Component Technology Purpose Language Go 1.25+ Core application Web Framework Standard library HTTP server Authentication Custom API key + RBAC Database SQLite/PostgreSQL Metadata storage Cache Redis Job queue &amp; caching Containers Podman/Docker Job isolation UI Framework Bubble Tea Terminal UI"},{"location":"architecture/#dependencies","title":"Dependencies","text":"<pre><code>// Core dependencies\nrequire (\n    github.com/charmbracelet/bubbletea v1.3.10  // TUI framework\n    github.com/go-redis/redis/v8 v8.11.5        // Redis client\n    github.com/google/uuid v1.6.0               // UUID generation\n    github.com/mattn/go-sqlite3 v1.14.32        // SQLite driver\n    golang.org/x/crypto v0.45.0                 // Crypto utilities\n    gopkg.in/yaml.v3 v3.0.1                     // YAML parsing\n)\n</code></pre>"},{"location":"architecture/#development-architecture","title":"Development Architecture","text":""},{"location":"architecture/#project-structure","title":"Project Structure","text":"<pre><code>fetch_ml/\n\u251c\u2500\u2500 cmd/                    # CLI applications\n\u2502   \u251c\u2500\u2500 worker/            # ML worker service\n\u2502   \u251c\u2500\u2500 tui/               # Terminal UI\n\u2502   \u251c\u2500\u2500 data_manager/      # Data management\n\u2502   \u2514\u2500\u2500 user_manager/      # User management\n\u251c\u2500\u2500 internal/              # Internal packages\n\u2502   \u251c\u2500\u2500 auth/              # Authentication system\n\u2502   \u251c\u2500\u2500 config/            # Configuration management\n\u2502   \u251c\u2500\u2500 container/         # Container operations\n\u2502   \u251c\u2500\u2500 database/          # Database operations\n\u2502   \u251c\u2500\u2500 logging/           # Logging utilities\n\u2502   \u251c\u2500\u2500 metrics/           # Metrics collection\n\u2502   \u2514\u2500\u2500 network/           # Network utilities\n\u251c\u2500\u2500 configs/               # Configuration files\n\u251c\u2500\u2500 scripts/               # Setup and utility scripts\n\u251c\u2500\u2500 tests/                 # Test suites\n\u2514\u2500\u2500 docs/                  # Documentation\n</code></pre>"},{"location":"architecture/#package-dependencies","title":"Package Dependencies","text":"<pre><code>graph TB\n    subgraph \"Application Layer\"\n        Worker[cmd/worker]\n        TUI[cmd/tui]\n        DataMgr[cmd/data_manager]\n        UserMgr[cmd/user_manager]\n    end\n\n    subgraph \"Service Layer\"\n        Auth[internal/auth]\n        Config[internal/config]\n        Container[internal/container]\n        Database[internal/database]\n    end\n\n    subgraph \"Utility Layer\"\n        Logging[internal/logging]\n        Metrics[internal/metrics]\n        Network[internal/network]\n    end\n\n    Worker --&gt; Auth\n    Worker --&gt; Config\n    Worker --&gt; Container\n    TUI --&gt; Auth\n    DataMgr --&gt; Database\n    UserMgr --&gt; Auth\n\n    Auth --&gt; Logging\n    Container --&gt; Network\n    Database --&gt; Metrics\n</code></pre>"},{"location":"architecture/#monitoring-observability","title":"Monitoring &amp; Observability","text":""},{"location":"architecture/#metrics-collection","title":"Metrics Collection","text":"<pre><code>graph TB\n    subgraph \"Metrics Pipeline\"\n        App[Application] --&gt; Metrics[Metrics Collector]\n        Metrics --&gt; Export[Prometheus Exporter]\n        Export --&gt; Prometheus[Prometheus Server]\n        Prometheus --&gt; Grafana[Grafana Dashboard]\n\n        subgraph \"Metric Types\"\n            Counter[Counters]\n            Gauge[Gauges]\n            Histogram[Histograms]\n            Timer[Timers]\n        end\n\n        App --&gt; Counter\n        App --&gt; Gauge\n        App --&gt; Histogram\n        App --&gt; Timer\n    end\n</code></pre>"},{"location":"architecture/#logging-architecture","title":"Logging Architecture","text":"<pre><code>graph TB\n    subgraph \"Logging Pipeline\"\n        App[Application] --&gt; Logger[Structured Logger]\n        Logger --&gt; File[File Output]\n        Logger --&gt; Console[Console Output]\n        Logger --&gt; Syslog[Syslog Forwarder]\n        Syslog --&gt; Aggregator[Log Aggregator]\n        Aggregator --&gt; Storage[Log Storage]\n        Storage --&gt; Viewer[Log Viewer]\n    end\n</code></pre>"},{"location":"architecture/#deployment-architecture","title":"Deployment Architecture","text":""},{"location":"architecture/#container-deployment","title":"Container Deployment","text":"<pre><code>graph TB\n    subgraph \"Deployment Stack\"\n        Image[Container Image]\n        Registry[Container Registry]\n        Orchestrator[Docker Compose]\n        Config[ConfigMaps/Secrets]\n        Storage[Persistent Storage]\n\n        Image --&gt; Registry\n        Registry --&gt; Orchestrator\n        Config --&gt; Orchestrator\n        Storage --&gt; Orchestrator\n    end\n</code></pre>"},{"location":"architecture/#service-discovery","title":"Service Discovery","text":"<pre><code>graph TB\n    subgraph \"Service Mesh\"\n        Gateway[API Gateway]\n        Discovery[Service Discovery]\n        Worker[Worker Service]\n        Data[Data Service]\n        Redis[Redis Cluster]\n\n        Gateway --&gt; Discovery\n        Discovery --&gt; Worker\n        Discovery --&gt; Data\n        Discovery --&gt; Redis\n    end\n</code></pre>"},{"location":"architecture/#future-architecture-considerations","title":"Future Architecture Considerations","text":""},{"location":"architecture/#microservices-evolution","title":"Microservices Evolution","text":"<ul> <li>API Gateway: Centralized routing and authentication</li> <li>Service Mesh: Inter-service communication</li> <li>Event Streaming: Kafka for job events</li> <li>Distributed Tracing: OpenTelemetry integration</li> <li>Multi-tenant: Tenant isolation and quotas</li> </ul>"},{"location":"architecture/#homelab-features","title":"Homelab Features","text":"<ul> <li>Docker Compose: Simple container orchestration</li> <li>Local Development: Easy setup and testing</li> <li>Security: Built-in authentication and encryption</li> <li>Monitoring: Basic health checks and logging</li> </ul> <p>This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.</p>"},{"location":"cicd/","title":"CI/CD Pipeline","text":"<p>Automated testing, building, and releasing for fetch_ml.</p>"},{"location":"cicd/#workflows","title":"Workflows","text":""},{"location":"cicd/#ci-workflow-githubworkflowsciyml","title":"CI Workflow (<code>.github/workflows/ci.yml</code>)","text":"<p>Runs on every push to <code>main</code>/<code>develop</code> and all pull requests.</p> <p>Jobs: 1. test - Go backend tests with Redis 2. build - Build all binaries (Go + Zig CLI) 3. test-scripts - Validate deployment scripts 4. security-scan - Trivy and Gosec security scans 5. docker-build - Build and push Docker images (main branch only)</p> <p>Test Coverage: - Go unit tests with race detection - <code>internal/queue</code> package tests - Zig CLI tests - Integration tests - Security audits</p>"},{"location":"cicd/#release-workflow-githubworkflowsreleaseyml","title":"Release Workflow (<code>.github/workflows/release.yml</code>)","text":"<p>Runs on version tags (e.g., <code>v1.0.0</code>).</p> <p>Jobs:</p> <ol> <li>build-cli (matrix build)</li> <li>Linux x86_64 (static musl)</li> <li>macOS x86_64</li> <li>macOS ARM64</li> <li>Downloads platform-specific static rsync</li> <li> <p>Embeds rsync for zero-dependency releases</p> </li> <li> <p>build-go-backends</p> </li> <li>Cross-platform Go builds</li> <li> <p>api-server, worker, tui, data_manager, user_manager</p> </li> <li> <p>create-release</p> </li> <li>Collects all artifacts</li> <li>Generates SHA256 checksums</li> <li>Creates GitHub release with notes</li> </ol>"},{"location":"cicd/#release-process","title":"Release Process","text":""},{"location":"cicd/#creating-a-release","title":"Creating a Release","text":"<pre><code># 1. Update version\ngit tag v1.0.0\n\n# 2. Push tag\ngit push origin v1.0.0\n\n# 3. CI automatically builds and releases\n</code></pre>"},{"location":"cicd/#release-artifacts","title":"Release Artifacts","text":"<p>CLI Binaries (with embedded rsync): - <code>ml-linux-x86_64.tar.gz</code> (~450-650KB) - <code>ml-macos-x86_64.tar.gz</code> (~450-650KB) - <code>ml-macos-arm64.tar.gz</code> (~450-650KB)</p> <p>Go Backends: - <code>fetch_ml_api-server.tar.gz</code> - <code>fetch_ml_worker.tar.gz</code> - <code>fetch_ml_tui.tar.gz</code> - <code>fetch_ml_data_manager.tar.gz</code> - <code>fetch_ml_user_manager.tar.gz</code></p> <p>Checksums: - <code>checksums.txt</code> - Combined SHA256 sums - Individual <code>.sha256</code> files per binary</p>"},{"location":"cicd/#development-workflow","title":"Development Workflow","text":""},{"location":"cicd/#local-testing","title":"Local Testing","text":"<pre><code># Run all tests\nmake test\n\n# Run specific package tests\ngo test ./internal/queue/...\n\n# Build CLI\ncd cli &amp;&amp; zig build dev\n\n# Run formatters and linters\nmake lint\n\n# Security scans are handled automatically in CI by the `security-scan` job\n</code></pre>"},{"location":"cicd/#optional-heavy-end-to-end-tests","title":"Optional heavy end-to-end tests","text":"<p>Some e2e tests exercise full Docker deployments and performance scenarios and are skipped by default to keep local/CI runs fast. You can enable them explicitly with environment variables:</p> <pre><code># Run Docker deployment e2e tests\nFETCH_ML_E2E_DOCKER=1 go test ./tests/e2e/...\n\n# Run performance-oriented e2e tests\nFETCH_ML_E2E_PERF=1 go test ./tests/e2e/...\n</code></pre> <p>Without these variables, <code>TestDockerDeploymentE2E</code> and <code>TestPerformanceE2E</code> will <code>t.Skip</code>, while all lighter e2e tests still run.</p>"},{"location":"cicd/#pull-request-checks","title":"Pull Request Checks","text":"<p>All PRs must pass: - \u2705 Go tests (with Redis) - \u2705 CLI tests - \u2705 Security scans - \u2705 Code linting - \u2705 Build verification</p>"},{"location":"cicd/#configuration","title":"Configuration","text":""},{"location":"cicd/#environment-variables","title":"Environment Variables","text":"<pre><code>GO_VERSION: '1.25.0'\nZIG_VERSION: '0.15.2'\n</code></pre>"},{"location":"cicd/#secrets","title":"Secrets","text":"<p>Required for releases: - <code>GITHUB_TOKEN</code> - Automatic, provided by GitHub Actions</p>"},{"location":"cicd/#monitoring","title":"Monitoring","text":""},{"location":"cicd/#build-status","title":"Build Status","text":"<p>Check workflow runs at: <pre><code>https://github.com/jfraeys/fetch_ml/actions\n</code></pre></p>"},{"location":"cicd/#artifacts","title":"Artifacts","text":"<p>Download build artifacts from: - Successful workflow runs (30-day retention) - GitHub Releases (permanent)</p> <p>For implementation details: - .github/workflows/ci.yml - .github/workflows/release.yml</p>"},{"location":"cli-reference/","title":"Fetch ML CLI Reference","text":"<p>Comprehensive command-line tools for managing ML experiments in your homelab with Zig-based high-performance CLI.</p>"},{"location":"cli-reference/#overview","title":"Overview","text":"<p>Fetch ML provides a comprehensive CLI toolkit built with performance and security in mind:</p> <ul> <li>Zig CLI - High-performance experiment management written in Zig</li> <li>Go Commands - API server, TUI, and data management utilities</li> <li>Management Scripts - Service orchestration and deployment</li> <li>Setup Scripts - One-command installation and configuration</li> </ul>"},{"location":"cli-reference/#zig-cli-clizig-outbinml","title":"Zig CLI (<code>./cli/zig-out/bin/ml</code>)","text":"<p>High-performance command-line interface for experiment management, written in Zig for speed and efficiency.</p>"},{"location":"cli-reference/#available-commands","title":"Available Commands","text":"Command Description Example <code>init</code> Interactive configuration setup <code>ml init</code> <code>sync</code> Sync project to worker with deduplication <code>ml sync ./project --name myjob --queue</code> <code>queue</code> Queue job for execution <code>ml queue myjob --commit abc123 --priority 8</code> <code>status</code> Get system and worker status <code>ml status</code> <code>monitor</code> Launch TUI monitoring via SSH <code>ml monitor</code> <code>cancel</code> Cancel running job <code>ml cancel job123</code> <code>prune</code> Clean up old experiments <code>ml prune --keep 10</code> <code>watch</code> Auto-sync directory on changes <code>ml watch ./project --queue</code>"},{"location":"cli-reference/#command-details","title":"Command Details","text":""},{"location":"cli-reference/#init-configuration-setup","title":"<code>init</code> - Configuration Setup","text":"<p><pre><code>ml init\n</code></pre> Creates a configuration template at <code>~/.ml/config.toml</code> with: - Worker connection details - API authentication - Base paths and ports</p>"},{"location":"cli-reference/#sync-project-synchronization","title":"<code>sync</code> - Project Synchronization","text":"<pre><code># Basic sync\nml sync ./my-project\n\n# Sync with custom name and queue\nml sync ./my-project --name \"experiment-1\" --queue\n\n# Sync with priority\nml sync ./my-project --priority 9\n</code></pre> <p>Features: - Content-addressed storage for deduplication - SHA256 commit ID generation - Rsync-based file transfer - Automatic queuing (with <code>--queue</code> flag)</p>"},{"location":"cli-reference/#queue-job-management","title":"<code>queue</code> - Job Management","text":"<pre><code># Queue with commit ID\nml queue my-job --commit abc123def456\n\n# Queue with priority (1-10, default 5)\nml queue my-job --commit abc123 --priority 8\n</code></pre> <p>Features: - WebSocket-based communication - Priority queuing system - API key authentication</p>"},{"location":"cli-reference/#watch-auto-sync-monitoring","title":"<code>watch</code> - Auto-Sync Monitoring","text":"<pre><code># Watch directory for changes\nml watch ./project\n\n# Watch and auto-queue on changes\nml watch ./project --name \"dev-exp\" --queue\n</code></pre> <p>Features: - Real-time file system monitoring - Automatic re-sync on changes - Configurable polling interval (2 seconds) - Commit ID comparison for efficiency</p>"},{"location":"cli-reference/#prune-cleanup-management","title":"<code>prune</code> - Cleanup Management","text":"<pre><code># Keep last N experiments\nml prune --keep 20\n\n# Remove experiments older than N days\nml prune --older-than 30\n</code></pre>"},{"location":"cli-reference/#monitor-remote-monitoring","title":"<code>monitor</code> - Remote Monitoring","text":"<p><pre><code>ml monitor\n</code></pre> Launches TUI interface via SSH for real-time monitoring.</p>"},{"location":"cli-reference/#cancel-job-cancellation","title":"<code>cancel</code> - Job Cancellation","text":"<p><pre><code>ml cancel running-job-id\n</code></pre> Cancels currently running jobs by ID.</p>"},{"location":"cli-reference/#configuration","title":"Configuration","text":"<p>The Zig CLI reads configuration from <code>~/.ml/config.toml</code>:</p> <pre><code>worker_host = \"worker.local\"\nworker_user = \"mluser\"\nworker_base = \"/data/ml-experiments\"\nworker_port = 22\napi_key = \"your-api-key\"\n</code></pre>"},{"location":"cli-reference/#performance-features","title":"Performance Features","text":"<ul> <li>Content-Addressed Storage: Automatic deduplication of identical files</li> <li>Incremental Sync: Only transfers changed files</li> <li>SHA256 Hashing: Reliable commit ID generation</li> <li>WebSocket Communication: Efficient real-time messaging</li> <li>Multi-threaded: Concurrent operations where applicable</li> </ul>"},{"location":"cli-reference/#go-commands","title":"Go Commands","text":""},{"location":"cli-reference/#api-server-cmdapi-servermaingo","title":"API Server (<code>./cmd/api-server/main.go</code>)","text":"<p>Main HTTPS API server for experiment management.</p> <pre><code># Build and run\ngo run ./cmd/api-server/main.go\n\n# With configuration\n./bin/api-server --config configs/config-local.yaml\n</code></pre> <p>Features: - HTTPS-only communication - API key authentication - Rate limiting and IP whitelisting - WebSocket support for real-time updates - Redis integration for caching</p>"},{"location":"cli-reference/#tui-cmdtuimaingo","title":"TUI (<code>./cmd/tui/main.go</code>)","text":"<p>Terminal User Interface for monitoring experiments.</p> <pre><code># Launch TUI\ngo run ./cmd/tui/main.go\n\n# With custom config\n./tui --config configs/config-local.yaml\n</code></pre> <p>Features: - Real-time experiment monitoring - Interactive job management - Status visualization - Log viewing</p>"},{"location":"cli-reference/#data-manager-cmddata_manager","title":"Data Manager (<code>./cmd/data_manager/</code>)","text":"<p>Utilities for data synchronization and management.</p> <pre><code># Sync data\n./data_manager --sync ./data\n\n# Clean old data\n./data_manager --cleanup --older-than 30d\n</code></pre>"},{"location":"cli-reference/#config-lint-cmdconfiglintmaingo","title":"Config Lint (<code>./cmd/configlint/main.go</code>)","text":"<p>Configuration validation and linting tool.</p> <pre><code># Validate configuration\n./configlint configs/config-local.yaml\n\n# Check schema compliance\n./configlint --schema configs/schema/config_schema.yaml\n</code></pre>"},{"location":"cli-reference/#management-script-toolsmanagesh","title":"Management Script (<code>./tools/manage.sh</code>)","text":"<p>Simple service management for your homelab.</p>"},{"location":"cli-reference/#commands","title":"Commands","text":"<pre><code>./tools/manage.sh start          # Start all services\n./tools/manage.sh stop           # Stop all services\n./tools/manage.sh status         # Check service status\n./tools/manage.sh logs           # View logs\n./tools/manage.sh monitor        # Basic monitoring\n./tools/manage.sh security       # Security status\n./tools/manage.sh cleanup        # Clean project artifacts\n</code></pre>"},{"location":"cli-reference/#setup-script-setupsh","title":"Setup Script (<code>./setup.sh</code>)","text":"<p>One-command homelab setup.</p>"},{"location":"cli-reference/#usage","title":"Usage","text":"<pre><code># Full setup\n./setup.sh\n\n# Setup includes:\n# - SSL certificate generation\n# - Configuration creation\n# - Build all components\n# - Start Redis\n# - Setup Fail2Ban (if available)\n</code></pre>"},{"location":"cli-reference/#api-testing","title":"API Testing","text":"<p>Test the API with curl:</p> <pre><code># Health check\ncurl -k -H 'X-API-Key: password' https://localhost:9101/health\n\n# List experiments\ncurl -k -H 'X-API-Key: password' https://localhost:9101/experiments\n\n# Submit experiment\ncurl -k -X POST -H 'X-API-Key: password' \\\n     -H 'Content-Type: application/json' \\\n     -d '{\"name\":\"test\",\"config\":{\"type\":\"basic\"}}' \\\n     https://localhost:9101/experiments\n</code></pre>"},{"location":"cli-reference/#zig-cli-architecture","title":"Zig CLI Architecture","text":"<p>The Zig CLI is designed for performance and reliability:</p>"},{"location":"cli-reference/#core-components","title":"Core Components","text":"<ul> <li>Commands (<code>cli/src/commands/</code>): Individual command implementations</li> <li>Config (<code>cli/src/config.zig</code>): Configuration management</li> <li>Network (<code>cli/src/net/ws.zig</code>): WebSocket client implementation</li> <li>Utils (<code>cli/src/utils/</code>): Cryptography, storage, and rsync utilities</li> <li>Errors (<code>cli/src/errors.zig</code>): Centralized error handling</li> </ul>"},{"location":"cli-reference/#performance-optimizations","title":"Performance Optimizations","text":"<ul> <li>Content-Addressed Storage: Deduplicates identical files across experiments</li> <li>SHA256 Hashing: Fast, reliable commit ID generation</li> <li>Rsync Integration: Efficient incremental file transfers</li> <li>WebSocket Protocol: Low-latency communication with worker</li> <li>Memory Management: Efficient allocation with Zig's allocator system</li> </ul>"},{"location":"cli-reference/#security-features","title":"Security Features","text":"<ul> <li>API Key Hashing: Secure authentication token handling</li> <li>SSH Integration: Secure file transfers</li> <li>Input Validation: Comprehensive argument checking</li> <li>Error Handling: Secure error reporting without information leakage</li> </ul>"},{"location":"cli-reference/#configuration_1","title":"Configuration","text":"<p>Main configuration file: <code>configs/config-local.yaml</code></p>"},{"location":"cli-reference/#key-settings","title":"Key Settings","text":"<pre><code>auth:\n  enabled: true\n  api_keys:\n    homelab_user:\n      hash: \"5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8\"\n      admin: true\n\nserver:\n  address: \":9101\"\n  tls:\n    enabled: true\n    cert_file: \"./ssl/cert.pem\"\n    key_file: \"./ssl/key.pem\"\n\nsecurity:\n  rate_limit:\n    enabled: true\n    requests_per_minute: 30\n  ip_whitelist:\n    - \"127.0.0.1\"\n    - \"::1\"\n    - \"192.168.0.0/16\"\n    - \"10.0.0.0/8\"\n</code></pre>"},{"location":"cli-reference/#docker-commands","title":"Docker Commands","text":"<p>If using Docker Compose:</p> <pre><code># Start services\ndocker-compose up -d (testing only)\n\n# View logs\ndocker-compose logs -f\n\n# Stop services\ndocker-compose down\n\n# Check status\ndocker-compose ps\n</code></pre>"},{"location":"cli-reference/#troubleshooting","title":"Troubleshooting","text":""},{"location":"cli-reference/#common-issues","title":"Common Issues","text":"<p>Zig CLI not found: <pre><code># Build the CLI\ncd cli &amp;&amp; make build\n\n# Check binary exists\nls -la ./cli/zig-out/bin/ml\n</code></pre></p> <p>Configuration not found: <pre><code># Create configuration\n./cli/zig-out/bin/ml init\n\n# Check config file\nls -la ~/.ml/config.toml\n</code></pre></p> <p>Worker connection failed: <pre><code># Test SSH connection\nssh -p 22 mluser@worker.local\n\n# Check configuration\ncat ~/.ml/config.toml\n</code></pre></p> <p>Sync not working: <pre><code># Check rsync availability\nrsync --version\n\n# Test manual sync\nrsync -avz ./project/ mluser@worker.local:/tmp/test/\n</code></pre></p> <p>WebSocket connection failed: <pre><code># Check worker WebSocket port\ntelnet worker.local 9100\n\n# Verify API key\n./cli/zig-out/bin/ml status\n</code></pre></p> <p>API not responding: <pre><code>./tools/manage.sh status\n./tools/manage.sh logs\n</code></pre></p> <p>Authentication failed: <pre><code># Check API key in config-local.yaml\ngrep -A 5 \"api_keys:\" configs/config-local.yaml\n</code></pre></p> <p>Redis connection failed: <pre><code># Check Redis status\nredis-cli ping\n\n# Start Redis\nredis-server\n</code></pre></p>"},{"location":"cli-reference/#getting-help","title":"Getting Help","text":"<pre><code># CLI help\n./cli/zig-out/bin/ml help\n\n# Management script help\n./tools/manage.sh help\n\n# Check all available commands\nmake help\n</code></pre> <p>That's it for the CLI reference! For complete setup instructions, see the main index.</p>"},{"location":"configuration-schema/","title":"Configuration Schema","text":"<p>Complete reference for Fetch ML configuration options.</p>"},{"location":"configuration-schema/#configuration-file-structure","title":"Configuration File Structure","text":"<p>Fetch ML uses YAML configuration files. The main configuration file is typically <code>config.yaml</code>.</p>"},{"location":"configuration-schema/#full-schema","title":"Full Schema","text":"<pre><code># Server Configuration\nserver:\n  address: \":9101\"\n  tls:\n    enabled: false\n    cert_file: \"\"\n    key_file: \"\"\n\n# Database Configuration\ndatabase:\n  type: \"sqlite\"  # sqlite, postgres, mysql\n  connection: \"fetch_ml.db\"\n  host: \"localhost\"\n  port: 5432\n  username: \"postgres\"\n  password: \"\"\n  database: \"fetch_ml\"\n\n# Redis Configuration\n\n\n## Quick Reference\n\n### Database Types\n- **SQLite**: `type: sqlite, connection: file.db`\n- **PostgreSQL**: `type: postgres, host: localhost, port: 5432`\n\n### Key Settings\n- `server.address: :9101`\n- `database.type: sqlite`\n- `redis.addr: localhost:6379`\n- `auth.enabled: true`\n- `logging.level: info`\n\n### Environment Override\n```bash\nexport FETCHML_SERVER_ADDRESS=:8080\nexport FETCHML_DATABASE_TYPE=postgres\n</code></pre>"},{"location":"configuration-schema/#validation","title":"Validation","text":"<pre><code>make configlint\n</code></pre>"},{"location":"deployment/","title":"ML Experiment Manager - Deployment Guide","text":""},{"location":"deployment/#overview","title":"Overview","text":"<p>The ML Experiment Manager supports multiple deployment methods from local development to homelab Docker setups.</p>"},{"location":"deployment/#quick-start","title":"Quick Start","text":""},{"location":"deployment/#docker-compose-recommended-for-development","title":"Docker Compose (Recommended for Development)","text":"<pre><code># Clone repository\ngit clone https://github.com/your-org/fetch_ml.git\ncd fetch_ml\n\n# Start all services\ndocker-compose up -d (testing only)\n\n# Check status\ndocker-compose ps\n\n# View logs\ndocker-compose logs -f api-server\n</code></pre> <p>Access the API at <code>http://localhost:9100</code></p>"},{"location":"deployment/#deployment-options","title":"Deployment Options","text":""},{"location":"deployment/#1-local-development","title":"1. Local Development","text":""},{"location":"deployment/#prerequisites","title":"Prerequisites","text":"<p>Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution - Go 1.25+ - Zig 0.15.2 - Redis 7+ - Docker &amp; Docker Compose (optional)</p>"},{"location":"deployment/#manual-setup","title":"Manual Setup","text":"<pre><code># Start Redis\nredis-server\n\n# Build and run Go server\ngo build -o bin/api-server ./cmd/api-server\n./bin/api-server -config configs/config-local.yaml\n\n# Build Zig CLI\ncd cli\nzig build prod\n./zig-out/bin/ml --help\n</code></pre>"},{"location":"deployment/#2-docker-deployment","title":"2. Docker Deployment","text":""},{"location":"deployment/#build-image","title":"Build Image","text":"<pre><code>docker build -t ml-experiment-manager:latest .\n</code></pre>"},{"location":"deployment/#run-container","title":"Run Container","text":"<pre><code>docker run -d \\\n  --name ml-api \\\n  -p 9100:9100 \\\n  -p 9101:9101 \\\n  -v $(pwd)/configs:/app/configs:ro \\\n  -v experiment-data:/data/ml-experiments \\\n  ml-experiment-manager:latest\n</code></pre>"},{"location":"deployment/#docker-compose","title":"Docker Compose","text":"<pre><code># Production mode\ndocker-compose -f docker-compose.yml up -d\n\n# Development mode with logs\ndocker-compose -f docker-compose.yml up\n</code></pre>"},{"location":"deployment/#3-homelab-setup","title":"3. Homelab Setup","text":"<pre><code># Use the simple setup script\n./setup.sh\n\n# Or manually with Docker Compose\ndocker-compose up -d (testing only)\n</code></pre>"},{"location":"deployment/#4-cloud-deployment","title":"4. Cloud Deployment","text":""},{"location":"deployment/#aws-ecs","title":"AWS ECS","text":"<pre><code># Build and push to ECR\naws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY\ndocker build -t $ECR_REGISTRY/ml-experiment-manager:latest .\ndocker push $ECR_REGISTRY/ml-experiment-manager:latest\n\n# Deploy with ECS CLI\necs-cli compose --project-name ml-experiment-manager up\n</code></pre>"},{"location":"deployment/#google-cloud-run","title":"Google Cloud Run","text":"<pre><code># Build and push\ngcloud builds submit --tag gcr.io/$PROJECT_ID/ml-experiment-manager\n\n# Deploy\ngcloud run deploy ml-experiment-manager \\\n  --image gcr.io/$PROJECT_ID/ml-experiment-manager \\\n  --platform managed \\\n  --region us-central1 \\\n  --allow-unauthenticated\n</code></pre>"},{"location":"deployment/#configuration","title":"Configuration","text":""},{"location":"deployment/#environment-variables","title":"Environment Variables","text":"<pre><code># configs/config-local.yaml\nbase_path: \"/data/ml-experiments\"\nauth:\n  enabled: true\n  api_keys:\n    - \"your-production-api-key\"\nserver:\n  address: \":9100\"\n  tls:\n    enabled: true\n    cert_file: \"/app/ssl/cert.pem\"\n    key_file: \"/app/ssl/key.pem\"\n</code></pre>"},{"location":"deployment/#docker-compose-environment","title":"Docker Compose Environment","text":"<pre><code># docker-compose.yml\nversion: '3.8'\nservices:\n  api-server:\n    environment:\n      - REDIS_URL=redis://redis:6379\n      - LOG_LEVEL=info\n    volumes:\n      - ./configs:/configs:ro\n      - ./data:/data/experiments\n</code></pre>"},{"location":"deployment/#monitoring-logging","title":"Monitoring &amp; Logging","text":""},{"location":"deployment/#health-checks","title":"Health Checks","text":"<ul> <li>HTTP: <code>GET /health</code></li> <li>WebSocket: Connection test</li> <li>Redis: Ping check</li> </ul>"},{"location":"deployment/#metrics","title":"Metrics","text":"<ul> <li>Prometheus metrics at <code>/metrics</code></li> <li>Custom application metrics</li> <li>Container resource usage</li> </ul>"},{"location":"deployment/#logging","title":"Logging","text":"<ul> <li>Structured JSON logging</li> <li>Log levels: DEBUG, INFO, WARN, ERROR</li> <li>Centralized logging via ELK stack</li> </ul>"},{"location":"deployment/#security","title":"Security","text":""},{"location":"deployment/#tls-configuration","title":"TLS Configuration","text":"<pre><code># Generate self-signed cert (development)\nopenssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes\n\n# Production - use Let's Encrypt\ncertbot certonly --standalone -d ml-experiments.example.com\n</code></pre>"},{"location":"deployment/#network-security","title":"Network Security","text":"<ul> <li>Firewall rules (ports 9100, 9101, 6379)</li> <li>VPN access for internal services</li> <li>API key authentication</li> <li>Rate limiting</li> </ul>"},{"location":"deployment/#performance-tuning","title":"Performance Tuning","text":""},{"location":"deployment/#resource-allocation","title":"Resource Allocation","text":"<pre><code>resources:\n  requests:\n    memory: \"256Mi\"\n    cpu: \"250m\"\n  limits:\n    memory: \"1Gi\"\n    cpu: \"1000m\"\n</code></pre>"},{"location":"deployment/#scaling-strategies","title":"Scaling Strategies","text":"<ul> <li>Horizontal pod autoscaling</li> <li>Redis clustering</li> <li>Load balancing</li> <li>CDN for static assets</li> </ul>"},{"location":"deployment/#backup-recovery","title":"Backup &amp; Recovery","text":""},{"location":"deployment/#data-backup","title":"Data Backup","text":"<pre><code># Backup experiment data\ndocker-compose exec redis redis-cli BGSAVE\ndocker cp $(docker-compose ps -q redis):/data/dump.rdb ./redis-backup.rdb\n\n# Backup data volume\ndocker run --rm -v ml-experiments_redis_data:/data -v $(pwd):/backup alpine tar czf /backup/redis-backup.tar.gz -C /data .\n</code></pre>"},{"location":"deployment/#disaster-recovery","title":"Disaster Recovery","text":"<ol> <li>Restore Redis data</li> <li>Restart services</li> <li>Verify experiment metadata</li> <li>Test API endpoints</li> </ol>"},{"location":"deployment/#troubleshooting","title":"Troubleshooting","text":""},{"location":"deployment/#common-issues","title":"Common Issues","text":""},{"location":"deployment/#api-server-not-starting","title":"API Server Not Starting","text":"<pre><code># Check logs\ndocker-compose logs api-server\n\n# Check configuration\ncat configs/config-local.yaml\n\n# Check Redis connection\ndocker-compose exec redis redis-cli ping\n</code></pre>"},{"location":"deployment/#websocket-connection-issues","title":"WebSocket Connection Issues","text":"<pre><code># Test WebSocket\nwscat -c ws://localhost:9100/ws\n\n# Check TLS\nopenssl s_client -connect localhost:9101 -servername localhost\n</code></pre>"},{"location":"deployment/#performance-issues","title":"Performance Issues","text":"<pre><code># Check resource usage\ndocker-compose exec api-server ps aux\n\n# Check Redis memory\ndocker-compose exec redis redis-cli info memory\n</code></pre>"},{"location":"deployment/#debug-mode","title":"Debug Mode","text":"<pre><code># Enable debug logging\nexport LOG_LEVEL=debug\n./bin/api-server -config configs/config-local.yaml\n</code></pre>"},{"location":"deployment/#cicd-integration","title":"CI/CD Integration","text":""},{"location":"deployment/#github-actions","title":"GitHub Actions","text":"<ul> <li>Automated testing on PR</li> <li>Multi-platform builds</li> <li>Security scanning</li> <li>Automatic releases</li> </ul>"},{"location":"deployment/#deployment-pipeline","title":"Deployment Pipeline","text":"<ol> <li>Code commit \u2192 GitHub</li> <li>CI/CD pipeline triggers</li> <li>Build and test</li> <li>Security scan</li> <li>Deploy to staging</li> <li>Run integration tests</li> <li>Deploy to production</li> <li>Post-deployment verification</li> </ol>"},{"location":"deployment/#support","title":"Support","text":"<p>For deployment issues: 1. Check this guide 2. Review logs 3. Check GitHub Issues 4. Contact maintainers</p>"},{"location":"development-setup/","title":"Development Setup","text":"<p>Set up your local development environment for Fetch ML.</p>"},{"location":"development-setup/#prerequisites","title":"Prerequisites","text":"<p>Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution</p> <ul> <li>Go 1.21+</li> <li>Zig 0.11+</li> <li>Docker Compose (testing only)</li> <li>Redis (or use Docker)</li> <li>Git</li> </ul>"},{"location":"development-setup/#quick-setup","title":"Quick Setup","text":"<pre><code># Clone repository\ngit clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\n\n# Start dependencies\nsee [Quick Start](quick-start.md) for Docker setup redis postgres\n\n# Build all components\nmake build\n\n# Run tests\nsee [Testing Guide](testing.md)\n</code></pre>"},{"location":"development-setup/#detailed-setup","title":"Detailed Setup","text":""},{"location":"development-setup/#quick-start","title":"Quick Start","text":"<pre><code>git clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\nsee [Quick Start](quick-start.md) for Docker setup\nmake build\nsee [Testing Guide](testing.md)\n</code></pre>"},{"location":"development-setup/#key-commands","title":"Key Commands","text":"<ul> <li><code>make build</code> - Build all components</li> <li><code>see [Testing Guide](testing.md)</code> - Run tests</li> <li><code>make dev</code> - Development build</li> <li><code>see [CLI Reference](cli-reference.md) and [Zig CLI](zig-cli.md)</code> - Build CLI</li> </ul>"},{"location":"development-setup/#common-issues","title":"Common Issues","text":"<ul> <li>Build fails: <code>go mod tidy</code></li> <li>Zig errors: <code>cd cli &amp;&amp; rm -rf zig-out zig-cache</code></li> <li>Port conflicts: <code>lsof -i :9101</code></li> </ul>"},{"location":"environment-variables/","title":"Environment Variables","text":"<p>Fetch ML supports environment variables for configuration, allowing you to override config file settings and deploy in different environments.</p>"},{"location":"environment-variables/#priority-order","title":"Priority Order","text":"<ol> <li>Environment variables (highest priority)</li> <li>Configuration file values</li> <li>Default values (lowest priority)</li> </ol>"},{"location":"environment-variables/#variable-prefixes","title":"Variable Prefixes","text":""},{"location":"environment-variables/#general-configuration","title":"General Configuration","text":"<ul> <li><code>FETCH_ML_*</code> - General server and application settings</li> </ul>"},{"location":"environment-variables/#cli-configuration","title":"CLI Configuration","text":"<ul> <li><code>FETCH_ML_CLI_*</code> - CLI-specific settings (overrides <code>~/.ml/config.toml</code>)</li> </ul>"},{"location":"environment-variables/#tui-configuration","title":"TUI Configuration","text":"<ul> <li><code>FETCH_ML_TUI_*</code> - TUI-specific settings (overrides TUI config file)</li> </ul>"},{"location":"environment-variables/#cli-environment-variables","title":"CLI Environment Variables","text":"Variable Config Field Example <code>FETCH_ML_CLI_HOST</code> <code>worker_host</code> <code>localhost</code> <code>FETCH_ML_CLI_USER</code> <code>worker_user</code> <code>mluser</code> <code>FETCH_ML_CLI_BASE</code> <code>worker_base</code> <code>/opt/ml</code> <code>FETCH_ML_CLI_PORT</code> <code>worker_port</code> <code>22</code> <code>FETCH_ML_CLI_API_KEY</code> <code>api_key</code> <code>your-api-key-here</code>"},{"location":"environment-variables/#tui-environment-variables","title":"TUI Environment Variables","text":"Variable Config Field Example <code>FETCH_ML_TUI_HOST</code> <code>host</code> <code>localhost</code> <code>FETCH_ML_TUI_USER</code> <code>user</code> <code>mluser</code> <code>FETCH_ML_TUI_SSH_KEY</code> <code>ssh_key</code> <code>~/.ssh/id_rsa</code> <code>FETCH_ML_TUI_PORT</code> <code>port</code> <code>22</code> <code>FETCH_ML_TUI_BASE_PATH</code> <code>base_path</code> <code>/opt/ml</code> <code>FETCH_ML_TUI_TRAIN_SCRIPT</code> <code>train_script</code> <code>train.py</code> <code>FETCH_ML_TUI_REDIS_ADDR</code> <code>redis_addr</code> <code>localhost:6379</code> <code>FETCH_ML_TUI_REDIS_PASSWORD</code> <code>redis_password</code> `` <code>FETCH_ML_TUI_REDIS_DB</code> <code>redis_db</code> <code>0</code> <code>FETCH_ML_TUI_KNOWN_HOSTS</code> <code>known_hosts</code> <code>~/.ssh/known_hosts</code>"},{"location":"environment-variables/#server-environment-variables-auth-debug","title":"Server Environment Variables (Auth &amp; Debug)","text":"<p>These variables control server-side authentication behavior and are intended only for local development and debugging.</p> Variable Purpose Allowed In Production? <code>FETCH_ML_ALLOW_INSECURE_AUTH</code> When set to <code>1</code> and <code>FETCH_ML_DEBUG=1</code>, allows the API server to run with <code>auth.enabled: false</code> by injecting a default admin user. No. Must never be set in production. <code>FETCH_ML_DEBUG</code> Enables additional debug behaviors. Required (set to <code>1</code>) to activate the insecure auth bypass above. No. Must never be set in production. <p>When both variables are set to <code>1</code> and <code>auth.enabled</code> is <code>false</code>, the server logs a clear warning and treats all requests as coming from a default admin user. This mode is convenient for local homelab experiments but is insecure by design and must not be used on any shared or internet-facing environment.</p>"},{"location":"environment-variables/#usage-examples","title":"Usage Examples","text":""},{"location":"environment-variables/#development-environment","title":"Development Environment","text":"<pre><code>export FETCH_ML_CLI_HOST=localhost\nexport FETCH_ML_CLI_USER=devuser\nexport FETCH_ML_CLI_API_KEY=dev-key-123456789012\n./ml status\n</code></pre>"},{"location":"environment-variables/#production-environment","title":"Production Environment","text":"<pre><code>export FETCH_ML_CLI_HOST=prod-server.example.com\nexport FETCH_ML_CLI_USER=mluser\nexport FETCH_ML_CLI_API_KEY=prod-key-abcdef1234567890\n./ml status\n</code></pre>"},{"location":"environment-variables/#dockerkubernetes","title":"Docker/Kubernetes","text":"<pre><code>env:\n  - name: FETCH_ML_CLI_HOST\n    value: \"ml-server.internal\"\n  - name: FETCH_ML_CLI_USER\n    value: \"mluser\"\n  - name: FETCH_ML_CLI_API_KEY\n    valueFrom:\n      secretKeyRef:\n        name: ml-secrets\n        key: api-key\n</code></pre>"},{"location":"environment-variables/#using-env-file","title":"Using .env file","text":"<pre><code># Copy the example file\ncp .env.example .env\n\n# Edit with your values\nvim .env\n\n# Load in your shell\nexport $(cat .env | xargs)\n</code></pre>"},{"location":"environment-variables/#backward-compatibility","title":"Backward Compatibility","text":"<p>The CLI also supports the legacy <code>ML_*</code> prefix for backward compatibility, but <code>FETCH_ML_CLI_*</code> takes priority if both are set.</p> Legacy Variable New Variable <code>ML_HOST</code> <code>FETCH_ML_CLI_HOST</code> <code>ML_USER</code> <code>FETCH_ML_CLI_USER</code> <code>ML_BASE</code> <code>FETCH_ML_CLI_BASE</code> <code>ML_PORT</code> <code>FETCH_ML_CLI_PORT</code> <code>ML_API_KEY</code> <code>FETCH_ML_CLI_API_KEY</code>"},{"location":"first-experiment/","title":"First Experiment","text":"<p>Run your first machine learning experiment with Fetch ML.</p>"},{"location":"first-experiment/#prerequisites","title":"Prerequisites","text":"<p>Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution</p> <ul> <li>Fetch ML installed and running</li> <li>API key (see Security and API Key Process)</li> <li>Basic ML knowledge</li> </ul>"},{"location":"first-experiment/#experiment-workflow","title":"Experiment Workflow","text":""},{"location":"first-experiment/#1-prepare-your-ml-code","title":"1. Prepare Your ML Code","text":"<p>Create a simple Python script:</p> <pre><code># experiment.py\nimport argparse\nimport json\nimport sys\nimport time\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--epochs', type=int, default=10)\n    parser.add_argument('--lr', type=float, default=0.001)\n    parser.add_argument('--output', default='results.json')\n\n    args = parser.parse_args()\n\n    # Simulate training\n    results = {\n        'epochs': args.epochs,\n        'learning_rate': args.lr,\n        'accuracy': 0.85 + (args.lr * 0.1),\n        'loss': 0.5 - (args.epochs * 0.01),\n        'training_time': args.epochs * 0.1\n    }\n\n    # Save results\n    with open(args.output, 'w') as f:\n        json.dump(results, f, indent=2)\n\n    print(f\"Training completed: {results}\")\n    return results\n\nif __name__ == '__main__':\n    main()\n</code></pre>"},{"location":"first-experiment/#2-submit-job-via-api","title":"2. Submit Job via API","text":"<pre><code># Submit experiment\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n  -H \"Content-Type: application/json\" \\\n  -H \"X-API-Key: your-api-key\" \\\n  -d '{\n    \"job_name\": \"first-experiment\",\n    \"args\": \"--epochs 20 --lr 0.01 --output experiment_results.json\",\n    \"priority\": 1,\n    \"metadata\": {\n      \"experiment_type\": \"training\",\n      \"dataset\": \"sample_data\"\n    }\n  }'\n</code></pre>"},{"location":"first-experiment/#3-monitor-progress","title":"3. Monitor Progress","text":"<pre><code># Check job status\ncurl -H \"X-API-Key: your-api-key\" \\\n  http://localhost:9101/api/v1/jobs/first-experiment\n\n# List all jobs\ncurl -H \"X-API-Key: your-api-key\" \\\n  http://localhost:9101/api/v1/jobs\n\n# Get job metrics\ncurl -H \"X-API-Key: your-api-key\" \\\n  http://localhost:9101/api/v1/jobs/first-experiment/metrics\n</code></pre>"},{"location":"first-experiment/#4-use-cli","title":"4. Use CLI","text":"<pre><code># Submit with CLI\ncd cli &amp;&amp; zig build dev\n./cli/zig-out/dev/ml submit \\\n  --name \"cli-experiment\" \\\n  --args \"--epochs 15 --lr 0.005\" \\\n  --server http://localhost:9101\n\n# Monitor with CLI\n./cli/zig-out/dev/ml list-jobs --server http://localhost:9101\n./cli/zig-out/dev/ml job-status cli-experiment --server http://localhost:9101\n</code></pre>"},{"location":"first-experiment/#advanced-experiment","title":"Advanced Experiment","text":""},{"location":"first-experiment/#hyperparameter-tuning","title":"Hyperparameter Tuning","text":"<pre><code># Submit multiple experiments\nfor lr in 0.001 0.01 0.1; do\n  curl -X POST http://localhost:9101/api/v1/jobs \\\n    -H \"Content-Type: application/json\" \\\n    -H \"X-API-Key: your-api-key\" \\\n    -d \"{\n      \\\"job_name\\\": \\\"tune-lr-$lr\\\",\n      \\\"args\\\": \\\"--epochs 10 --lr $lr\\\",\n      \\\"metadata\\\": {\\\"learning_rate\\\": $lr}\n    }\"\ndone\n</code></pre>"},{"location":"first-experiment/#batch-processing","title":"Batch Processing","text":"<pre><code># Submit batch job\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n  -H \"Content-Type: application/json\" \\\n  -H \"X-API-Key: your-api-key\" \\\n  -d '{\n    \"job_name\": \"batch-processing\",\n    \"args\": \"--input data/ --output results/ --batch-size 32\",\n    \"priority\": 2,\n    \"datasets\": [\"training_data\", \"validation_data\"]\n  }'\n</code></pre>"},{"location":"first-experiment/#results-and-output","title":"Results and Output","text":""},{"location":"first-experiment/#access-results","title":"Access Results","text":"<pre><code># Download results\ncurl -H \"X-API-Key: your-api-key\" \\\n  http://localhost:9101/api/v1/jobs/first-experiment/results\n\n# View job details\ncurl -H \"X-API-Key: your-api-key\" \\\n  http://localhost:9101/api/v1/jobs/first-experiment | jq .\n</code></pre>"},{"location":"first-experiment/#result-format","title":"Result Format","text":"<pre><code>{\n  \"job_id\": \"first-experiment\",\n  \"status\": \"completed\",\n  \"results\": {\n    \"epochs\": 20,\n    \"learning_rate\": 0.01,\n    \"accuracy\": 0.86,\n    \"loss\": 0.3,\n    \"training_time\": 2.0\n  },\n  \"metrics\": {\n    \"gpu_utilization\": \"85%\",\n    \"memory_usage\": \"2GB\",\n    \"execution_time\": \"120s\"\n  }\n}\n</code></pre>"},{"location":"first-experiment/#best-practices","title":"Best Practices","text":""},{"location":"first-experiment/#job-naming","title":"Job Naming","text":"<ul> <li>Use descriptive names: <code>model-training-v2</code>, <code>data-preprocessing</code></li> <li>Include version numbers: <code>experiment-v1</code>, <code>experiment-v2</code></li> <li>Add timestamps: <code>daily-batch-2024-01-15</code></li> </ul>"},{"location":"first-experiment/#metadata-usage","title":"Metadata Usage","text":"<pre><code>{\n  \"metadata\": {\n    \"experiment_type\": \"training\",\n    \"model_version\": \"v2.1\",\n    \"dataset\": \"imagenet-2024\",\n    \"environment\": \"gpu\",\n    \"team\": \"ml-team\"\n  }\n}\n</code></pre>"},{"location":"first-experiment/#error-handling","title":"Error Handling","text":"<pre><code># Check failed jobs\ncurl -H \"X-API-Key: your-api-key\" \\\n  \"http://localhost:9101/api/v1/jobs?status=failed\"\n\n# Retry failed job\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n  -H \"Content-Type: application/json\" \\\n  -H \"X-API-Key: your-api-key\" \\\n  -d '{\n    \"job_name\": \"retry-experiment\",\n    \"args\": \"--epochs 20 --lr 0.01\",\n    \"metadata\": {\"retry_of\": \"first-experiment\"}\n  }'\n</code></pre>"},{"location":"first-experiment/#related-documentation","title":"## Related Documentation","text":"<ul> <li>Development Setup (see [Development Setup](development-setup.md)) - Local development environment</li> <li>Testing Guide (see [Testing Guide](testing.md)) - Test your experiments</li> <li>Production Deployment (see [Deployment](deployment.md)) - Scale to production</li> <li>Monitoring - Track experiment performance</li> </ul>"},{"location":"first-experiment/#troubleshooting","title":"Troubleshooting","text":"<p>Job stuck in pending? - Check worker status: <code>curl /api/v1/workers</code> - Verify resources: <code>docker stats</code> - Check logs: <code>docker-compose logs api-server</code></p> <p>Job failed? - Check error message: <code>curl /api/v1/jobs/job-id</code> - Review job arguments - Verify input data</p> <p>No results? - Check job completion status - Verify output file paths - Check storage permissions</p>"},{"location":"installation/","title":"Simple Installation Guide","text":""},{"location":"installation/#quick-start-5-minutes","title":"Quick Start (5 minutes)","text":"<pre><code># 1. Install\ngit clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\nmake install\n\n# 2. Setup (auto-configures)\n./bin/ml setup\n\n# 3. Run experiments\n./bin/ml run my-experiment.py\n</code></pre> <p>That's it. Everything else is optional.</p>"},{"location":"installation/#what-if-i-want-more-control","title":"What If I Want More Control?","text":""},{"location":"installation/#manual-configuration-optional","title":"Manual Configuration (Optional)","text":"<pre><code># Edit settings if defaults don't work\nnano ~/.ml/config.toml\n</code></pre>"},{"location":"installation/#monitoring-dashboard-optional","title":"Monitoring Dashboard (Optional)","text":"<pre><code># Real-time monitoring\n./bin/tui\n</code></pre>"},{"location":"installation/#senior-developer-feedback","title":"Senior Developer Feedback","text":"<p>\"Keep it simple\" - Most data scientists want: 1. One installation command 2. Sensible defaults 3. Works without configuration 4. Advanced features available when needed</p> <p>Current plan is too complex because it asks users to decide between: - CLI vs TUI vs Both - Zig vs Go build tools - Manual vs auto config - Multiple environment variables</p> <p>Better approach: Start simple, add complexity gradually.</p>"},{"location":"installation/#recommended-simplified-workflow","title":"Recommended Simplified Workflow","text":"<ol> <li>Single Binary - Combine CLI + basic TUI functionality</li> <li>Auto-Discovery - Detect common ML environments automatically  </li> <li>Progressive Disclosure - Show advanced options only when needed</li> <li>Zero Config - Work out-of-the-box with localhost defaults</li> </ol> <p>The goal: \"It just works\" for 80% of use cases.</p>"},{"location":"operations/","title":"Operations Runbook","text":"<p>Operational guide for troubleshooting and maintaining the ML experiment system.</p>"},{"location":"operations/#task-queue-operations","title":"Task Queue Operations","text":""},{"location":"operations/#monitoring-queue-health","title":"Monitoring Queue Health","text":"<pre><code># Check queue depth\nZCARD task:queue\n\n# List pending tasks\nZRANGE task:queue 0 -1 WITHSCORES\n\n# Check dead letter queue\nKEYS task:dlq:*\n</code></pre>"},{"location":"operations/#handling-stuck-tasks","title":"Handling Stuck Tasks","text":"<p>Symptom: Tasks stuck in \"running\" status</p> <p>Diagnosis: <pre><code># Check for expired leases\nredis-cli GET task:{task-id}\n# Look for LeaseExpiry in past\n</code></pre></p> <p>**Rem</p> <p>ediation:** Tasks with expired leases are automatically reclaimed every 1 minute. To force immediate reclamation: <pre><code># Restart worker to trigger reclaim cycle\nsystemctl restart ml-worker\n</code></pre></p>"},{"location":"operations/#dead-letter-queue-management","title":"Dead Letter Queue Management","text":"<p>View failed tasks: <pre><code>KEYS task:dlq:*\n</code></pre></p> <p>Inspect failed task: <pre><code>GET task:dlq:{task-id}\n</code></pre></p> <p>Retry from DLQ: <pre><code># Manual retry (requires custom script)\n# 1. Get task from DLQ\n# 2. Reset retry count\n# 3. Re-queue task\n</code></pre></p>"},{"location":"operations/#worker-crashes","title":"Worker Crashes","text":"<p>Symptom: Worker disappeared mid-task</p> <p>What Happens: 1. Lease expires after 30 minutes (default) 2. Background reclaim job detects expired lease 3. Task is retried (up to 3 attempts) 4. After max retries \u2192 Dead Letter Queue</p> <p>Prevention: - Monitor worker heartbeats - Set up alerts for worker down - Use process manager (systemd, supervisor)</p>"},{"location":"operations/#worker-operations","title":"Worker Operations","text":""},{"location":"operations/#graceful-shutdown","title":"Graceful Shutdown","text":"<pre><code># Send SIGTERM for graceful shutdown\nkill -TERM $(pgrep ml-worker)\n\n# Worker will:\n# 1. Stop accepting new tasks\n# 2. Finish active tasks (up to 5min timeout)\n# 3. Release all leases\n# 4. Exit cleanly\n</code></pre>"},{"location":"operations/#force-shutdown","title":"Force Shutdown","text":"<pre><code># Force kill (leases will be reclaimed automatically)\nkill -9 $(pgrep ml-worker)\n</code></pre>"},{"location":"operations/#worker-heartbeat-monitoring","title":"Worker Heartbeat Monitoring","text":"<pre><code># Check worker heartbeats\nHGETALL worker:heartbeat\n\n# Example output:\n# worker-abc123 1701234567\n# worker-def456 1701234580\n</code></pre> <p>Alert if: Heartbeat timestamp &gt; 5 minutes old</p>"},{"location":"operations/#redis-operations","title":"Redis Operations","text":""},{"location":"operations/#backup","title":"Backup","text":"<pre><code># Manual backup\nredis-cli SAVE\ncp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb\n</code></pre>"},{"location":"operations/#restore","title":"Restore","text":"<pre><code># Stop Redis\nsystemctl stop redis\n\n#  Restore snapshot\ncp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb\n\n# Start Redis\nsystemctl start redis\n</code></pre>"},{"location":"operations/#memory-management","title":"Memory Management","text":"<pre><code># Check memory usage\nINFO memory\n\n# Evict old data if needed\nFLUSHDB  # DANGER: Clears all data!\n</code></pre>"},{"location":"operations/#common-issues","title":"Common Issues","text":""},{"location":"operations/#issue-queue-growing-unbounded","title":"Issue: Queue Growing Unbounded","text":"<p>Symptoms: - <code>ZCARD task:queue</code> keeps increasing - No workers processing tasks</p> <p>Diagnosis: <pre><code># Check worker status\nsystemctl status ml-worker\n\n# Check logs\njournalctl -u ml-worker -n 100\n</code></pre></p> <p>Resolution: 1. Verify workers are running 2. Check Redis connectivity 3. Verify lease configuration</p>"},{"location":"operations/#issue-high-retry-rate","title":"Issue: High Retry Rate","text":"<p>Symptoms: - Many tasks in DLQ - <code>retry_count</code> field high on tasks</p> <p>Diagnosis: <pre><code># Check worker logs for errors\njournalctl -u ml-worker | grep \"retry\"\n\n# Look for patterns (network issues, resource limits, etc)\n</code></pre></p> <p>Resolution: - Fix underlying issue (network, resources, etc) - Adjust retry limits if permanent failures - Increase task timeout if jobs are slow</p>"},{"location":"operations/#issue-leases-expiring-prematurely","title":"Issue: Leases Expiring Prematurely","text":"<p>Symptoms: - Tasks retried even though worker is healthy - Logs show \"lease expired\" frequently</p> <p>Diagnosis: <pre><code># Check worker config\ncat configs/worker-config.yaml | grep -A3 \"lease\"\n\ntask_lease_duration: 30m  # Too short?\nheartbeat_interval: 1m    # Too infrequent?\n</code></pre></p> <p>Resolution: <pre><code># Increase lease duration for long-running jobs\ntask_lease_duration: 60m\nheartbeat_interval: 30s  # More frequent heartbeats\n</code></pre></p>"},{"location":"operations/#performance-tuning","title":"Performance Tuning","text":""},{"location":"operations/#worker-concurrency","title":"Worker Concurrency","text":"<pre><code># worker-config.yaml\nmax_workers: 4  # Number of parallel tasks\n\n# Adjust based on:\n# - CPU cores available\n# - Memory per task\n# - GPU availability\n</code></pre>"},{"location":"operations/#redis-configuration","title":"Redis Configuration","text":"<pre><code># /etc/redis/redis.conf\n\n# Persistence\nsave 900 1\nsave 300 10\n\n# Memory\nmaxmemory 2gb\nmaxmemory-policy noeviction\n\n# Performance\ntcp-keepalive 300\ntimeout 0\n</code></pre>"},{"location":"operations/#alerting-rules","title":"Alerting Rules","text":""},{"location":"operations/#critical-alerts","title":"Critical Alerts","text":"<ol> <li>Worker Down (no heartbeat &gt; 5min)</li> <li>Queue Depth &gt; 1000 tasks</li> <li>DLQ Growth &gt; 100 tasks/hour</li> <li>Redis Down (connection failed)</li> </ol>"},{"location":"operations/#warning-alerts","title":"Warning Alerts","text":"<ol> <li>High Retry Rate &gt; 10% of tasks</li> <li>Slow Queue Drain (depth increasing over 1 hour)</li> <li>Worker Memory &gt; 80% usage</li> </ol>"},{"location":"operations/#health-checks","title":"Health Checks","text":"<pre><code>#!/bin/bash\n# health-check.sh\n\n# Check Redis\nredis-cli PING || echo \"Redis DOWN\"\n\n# Check worker heartbeat\nWORKER_ID=$(cat /var/run/ml-worker.pid)\nLAST_HB=$(redis-cli HGET worker:heartbeat \"$WORKER_ID\")\nNOW=$(date +%s)\nif [ $((NOW - LAST_HB)) -gt 300 ]; then\n  echo \"Worker heartbeat stale\"\nfi\n\n# Check queue depth\nDEPTH=$(redis-cli ZCARD task:queue)\nif [ \"$DEPTH\" -gt 1000 ]; then\n  echo \"Queue depth critical: $DEPTH\"\nfi\n</code></pre>"},{"location":"operations/#runbook-checklist","title":"Runbook Checklist","text":""},{"location":"operations/#daily-operations","title":"Daily Operations","text":"<ol> <li>Check queue depth</li> <li>Verify worker heartbeats</li> <li>Review DLQ for patterns</li> <li>Check Redis memory usage</li> </ol>"},{"location":"operations/#weekly-operations","title":"Weekly Operations","text":"<ol> <li>Review retry rates</li> <li>Analyze failed task patterns</li> <li>Backup Redis snapshot</li> <li>Review worker logs</li> </ol>"},{"location":"operations/#monthly-operations","title":"Monthly Operations","text":"<ol> <li>Performance tuning review</li> <li>Capacity planning</li> <li>Update documentation</li> <li>Test disaster recovery</li> </ol> <p>For homelab setups: Most of these operations can be simplified. Focus on: -  Basic monitoring (queue depth, worker status) - Periodic Redis backups - Graceful shutdowns for maintenance</p>"},{"location":"production-monitoring/","title":"Production Monitoring Deployment Guide (Linux)","text":"<p>This guide covers deploying the monitoring stack (Prometheus, Grafana, Loki, Promtail) on Linux production servers.</p>"},{"location":"production-monitoring/#architecture","title":"Architecture","text":"<p>Testing: Docker Compose (macOS/Linux) Production: Podman + systemd (Linux)</p> <p>Important: Docker is for testing only. Podman is used for running actual ML experiments in production.</p> <p>Dev (Testing): Docker Compose Prod (Experiments): Podman + systemd</p> <p>Each service runs as a separate Podman container managed by systemd for automatic restarts and proper lifecycle management.</p>"},{"location":"production-monitoring/#prerequisites","title":"Prerequisites","text":"<p>Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution</p> <ul> <li>Linux distribution with systemd (Rocky/RHEL/CentOS, Ubuntu/Debian, Arch, SUSE, etc.)</li> <li>Production app already deployed (see <code>scripts/setup-prod.sh</code>)</li> <li>Root or sudo access</li> <li>Ports 3000, 9090, 3100 available</li> </ul>"},{"location":"production-monitoring/#quick-setup","title":"Quick Setup","text":""},{"location":"production-monitoring/#1-run-setup-script","title":"1. Run Setup Script","text":"<pre><code>cd /path/to/fetch_ml\nsudo ./scripts/setup-monitoring-prod.sh /data/monitoring ml-user ml-group\n</code></pre> <p>This will: - Create directory structure at <code>/data/monitoring</code> - Copy configuration files to <code>/etc/fetch_ml/monitoring</code> - Create systemd services for each component - Set up firewall rules</p>"},{"location":"production-monitoring/#2-start-services","title":"2. Start Services","text":"<pre><code># Start all monitoring services\nsudo systemctl start prometheus\nsudo systemctl start loki\nsudo systemctl start promtail\nsudo systemctl start grafana\n\n# Enable on boot\nsudo systemctl enable prometheus loki promtail grafana\n</code></pre>"},{"location":"production-monitoring/#3-access-grafana","title":"3. Access Grafana","text":"<ul> <li>URL: <code>http://YOUR_SERVER_IP:3000</code></li> <li>Username: <code>admin</code></li> <li>Password: <code>admin</code> (change on first login)</li> </ul> <p>Dashboards will auto-load: - ML Task Queue Monitoring (metrics) - Application Logs (Loki logs)</p>"},{"location":"production-monitoring/#service-details","title":"Service Details","text":""},{"location":"production-monitoring/#prometheus","title":"Prometheus","text":"<ul> <li>Port: 9090</li> <li>Config: <code>/etc/fetch_ml/monitoring/prometheus.yml</code></li> <li>Data: <code>/data/monitoring/prometheus</code></li> <li>Purpose: Scrapes metrics from API server</li> </ul>"},{"location":"production-monitoring/#loki","title":"Loki","text":"<ul> <li>Port: 3100</li> <li>Config: <code>/etc/fetch_ml/monitoring/loki-config.yml</code></li> <li>Data: <code>/data/monitoring/loki</code></li> <li>Purpose: Log aggregation</li> </ul>"},{"location":"production-monitoring/#promtail","title":"Promtail","text":"<ul> <li>Config: <code>/etc/fetch_ml/monitoring/promtail-config.yml</code></li> <li>Log Source: <code>/var/log/fetch_ml/*.log</code></li> <li>Purpose: Ships logs to Loki</li> </ul>"},{"location":"production-monitoring/#grafana","title":"Grafana","text":"<ul> <li>Port: 3000</li> <li>Config: <code>/etc/fetch_ml/monitoring/grafana/provisioning</code></li> <li>Data: <code>/data/monitoring/grafana</code></li> <li>Dashboards: <code>/var/lib/grafana/dashboards</code></li> </ul>"},{"location":"production-monitoring/#management-commands","title":"Management Commands","text":"<pre><code># Check status\nsudo systemctl status prometheus grafana loki promtail\n\n# View logs\nsudo journalctl -u prometheus -f\nsudo journalctl -u grafana -f\nsudo journalctl -u loki -f\nsudo journalctl -u promtail -f\n\n# Restart services\nsudo systemctl restart prometheus\nsudo systemctl restart grafana\n\n# Stop all monitoring\nsudo systemctl stop prometheus grafana loki promtail\n</code></pre>"},{"location":"production-monitoring/#data-retention","title":"Data Retention","text":""},{"location":"production-monitoring/#prometheus_1","title":"Prometheus","text":"<p>Default: 15 days. Edit <code>/etc/fetch_ml/monitoring/prometheus.yml</code>: <pre><code>storage:\n  tsdb:\n    retention.time: 30d\n</code></pre></p>"},{"location":"production-monitoring/#loki_1","title":"Loki","text":"<p>Default: 30 days. Edit <code>/etc/fetch_ml/monitoring/loki-config.yml</code>: <pre><code>limits_config:\n  retention_period: 30d\n</code></pre></p>"},{"location":"production-monitoring/#security","title":"Security","text":""},{"location":"production-monitoring/#firewall","title":"Firewall","text":"<p>The setup script automatically configures firewall rules using the detected firewall manager (firewalld or ufw).</p> <p>For manual firewall configuration:</p> <p>RHEL/Rocky/Fedora (firewalld): <pre><code># Remove public access\nsudo firewall-cmd --permanent --remove-port=3000/tcp\nsudo firewall-cmd --permanent --remove-port=9090/tcp\n\n# Add specific source\nsudo firewall-cmd --permanent --add-rich-rule='rule family=\"ipv4\" source address=\"10.0.0.0/24\" port port=\"3000\" protocol=\"tcp\" accept'\nsudo firewall-cmd --reload\n</code></pre></p> <p>Ubuntu/Debian (ufw): <pre><code># Remove public access\nsudo ufw delete allow 3000/tcp\nsudo ufw delete allow 9090/tcp\n\n# Add specific source\nsudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp\n</code></pre></p>"},{"location":"production-monitoring/#authentication","title":"Authentication","text":"<p>Change Grafana admin password: 1. Login to Grafana 2. User menu \u2192 Profile \u2192 Change Password</p>"},{"location":"production-monitoring/#tls-optional","title":"TLS (Optional)","text":"<p>For HTTPS, configure reverse proxy (nginx/Apache) in front of Grafana.</p>"},{"location":"production-monitoring/#troubleshooting","title":"Troubleshooting","text":""},{"location":"production-monitoring/#grafana-shows-no-data","title":"Grafana shows no data","text":"<pre><code># Check if Prometheus is reachable\ncurl http://localhost:9090/-/healthy\n\n# Check datasource in Grafana\n# Settings \u2192 Data Sources \u2192 Prometheus \u2192 Save &amp; Test\n</code></pre>"},{"location":"production-monitoring/#loki-not-receiving-logs","title":"Loki not receiving logs","text":"<pre><code># Check Promtail is running\nsudo systemctl status promtail\n\n# Verify log file exists\nls -l /var/log/fetch_ml/\n\n# Check Promtail can reach Loki\ncurl http://localhost:3100/ready\n</code></pre>"},{"location":"production-monitoring/#podman-containers-not-starting","title":"Podman containers not starting","text":"<pre><code># Check pod status\nsudo -u ml-user podman pod ps\nsudo -u ml-user podman ps -a\n\n# Remove and recreate\nsudo -u ml-user podman pod stop monitoring\nsudo -u ml-user podman pod rm monitoring\nsudo systemctl restart prometheus\n</code></pre>"},{"location":"production-monitoring/#backup","title":"Backup","text":"<pre><code># Backup Grafana dashboards and data\nsudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana\n\n# Backup Prometheus data\nsudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus\n</code></pre>"},{"location":"production-monitoring/#updates","title":"Updates","text":"<pre><code># Pull latest images\nsudo -u ml-user podman pull docker.io/grafana/grafana:latest\nsudo -u ml-user podman pull docker.io/prom/prometheus:latest\nsudo -u ml-user podman pull docker.io/grafana/loki:latest\nsudo -u ml-user podman pull docker.io/grafana/promtail:latest\n\n# Restart services to use new images\nsudo systemctl restart grafana prometheus loki promtail\n</code></pre>"},{"location":"queue/","title":"Task Queue Architecture","text":"<p>The task queue system enables reliable job processing between the API server and workers using Redis.</p>"},{"location":"queue/#overview","title":"Overview","text":"<pre><code>graph LR\n    CLI[CLI/Client] --&gt;|WebSocket| API[API Server]\n    API --&gt;|Enqueue| Redis[(Redis)]\n    Redis --&gt;|Dequeue| Worker[Worker]\n    Worker --&gt;|Update Status| Redis\n</code></pre>"},{"location":"queue/#components","title":"Components","text":""},{"location":"queue/#taskqueue-internalqueue","title":"TaskQueue (<code>internal/queue</code>)","text":"<p>Shared package used by both API server and worker for job management.</p>"},{"location":"queue/#task-structure","title":"Task Structure","text":"<pre><code>type Task struct {\n    ID        string            // Unique task ID (UUID)\n    JobName   string            // User-defined job name  \n    Args      string            // Job arguments\n    Status    string            // queued, running, completed, failed\n    Priority  int64             // Higher = executed first\n    CreatedAt time.Time         \n    StartedAt *time.Time        \n    EndedAt   *time.Time        \n    WorkerID  string            \n    Error     string            \n    Datasets  []string          \n    Metadata  map[string]string // commit_id, user, etc\n}\n</code></pre>"},{"location":"queue/#taskqueue-interface","title":"TaskQueue Interface","text":"<pre><code>// Initialize queue\nqueue, err := queue.NewTaskQueue(queue.Config{\n    RedisAddr:     \"localhost:6379\",\n    RedisPassword: \"\",\n    RedisDB:       0,\n})\n\n// Add task (API server)\ntask := &amp;queue.Task{\n    ID:       uuid.New().String(),\n    JobName:  \"train-model\",\n    Status:   \"queued\",\n    Priority: 5,\n    Metadata: map[string]string{\n        \"commit_id\": commitID,\n        \"user\":      username,\n    },\n}\nerr = queue.AddTask(task)\n\n// Get next task (Worker)\ntask, err := queue.GetNextTask()\n\n// Update task status\ntask.Status = \"running\"\nerr = queue.UpdateTask(task)\n</code></pre>"},{"location":"queue/#data-flow","title":"Data Flow","text":""},{"location":"queue/#job-submission-flow","title":"Job Submission Flow","text":"<pre><code>sequenceDiagram\n    participant CLI\n    participant API\n    participant Redis\n    participant Worker\n\n    CLI-&gt;&gt;API: Queue Job (WebSocket)\n    API-&gt;&gt;API: Create Task (UUID)\n    API-&gt;&gt;Redis: ZADD task:queue\n    API-&gt;&gt;Redis: SET task:{id}\n    API-&gt;&gt;CLI: Success Response\n\n    Worker-&gt;&gt;Redis: ZPOPMAX task:queue\n    Redis-&gt;&gt;Worker: Task ID\n    Worker-&gt;&gt;Redis: GET task:{id}\n    Redis-&gt;&gt;Worker: Task Data\n    Worker-&gt;&gt;Worker: Execute Job\n    Worker-&gt;&gt;Redis: Update Status\n</code></pre>"},{"location":"queue/#protocol","title":"Protocol","text":"<p>CLI \u2192 API (Binary WebSocket): <pre><code>[opcode:1][api_key_hash:64][commit_id:64][priority:1][job_name_len:1][job_name:var]\n</code></pre></p> <p>API \u2192 Redis: - Priority queue: <code>ZADD task:queue {priority} {task_id}</code> - Task data: <code>SET task:{id} {json}</code> - Status: <code>HSET task:status:{job_name} ...</code></p> <p>Worker \u2190 Redis: - Poll: <code>ZPOPMAX task:queue 1</code> (highest priority first) - Fetch: <code>GET task:{id}</code></p>"},{"location":"queue/#redis-data-structures","title":"Redis Data Structures","text":""},{"location":"queue/#keys","title":"Keys","text":"<pre><code>task:queue                    # ZSET: priority queue\ntask:{uuid}                  # STRING: task JSON data\ntask:status:{job_name}       # HASH: job status\nworker:heartbeat             # HASH: worker health\njob:metrics:{job_name}       # HASH: job metrics\n</code></pre>"},{"location":"queue/#priority-queue-zset","title":"Priority Queue (ZSET)","text":"<pre><code>ZADD task:queue 10 \"uuid-1\"   # Priority 10\nZADD task:queue 5  \"uuid-2\"   # Priority 5\nZPOPMAX task:queue 1          # Returns uuid-1 (highest)\n</code></pre>"},{"location":"queue/#api-server-integration","title":"API Server Integration","text":""},{"location":"queue/#initialization","title":"Initialization","text":"<pre><code>// cmd/api-server/main.go\nqueueCfg := queue.Config{\n    RedisAddr:     cfg.Redis.Addr,\n    RedisPassword: cfg.Redis.Password,\n    RedisDB:       cfg.Redis.DB,\n}\ntaskQueue, err := queue.NewTaskQueue(queueCfg)\n</code></pre>"},{"location":"queue/#websocket-handler","title":"WebSocket Handler","text":"<pre><code>// internal/api/ws.go\nfunc (h *WSHandler) handleQueueJob(conn *websocket.Conn, payload []byte) error {\n    // Parse request\n    apiKeyHash, commitID, priority, jobName := parsePayload(payload)\n\n    // Create task with unique ID\n    taskID := uuid.New().String()\n    task := &amp;queue.Task{\n        ID:       taskID,\n        JobName:  jobName,\n        Status:   \"queued\",\n        Priority: int64(priority),\n        Metadata: map[string]string{\n            \"commit_id\": commitID,\n            \"user\":      user,\n        },\n    }\n\n    // Enqueue\n    if err := h.queue.AddTask(task); err != nil {\n        return h.sendErrorPacket(conn, ErrorCodeDatabaseError, ...)\n    }\n\n    return h.sendSuccessPacket(conn, \"Job queued\")\n}\n</code></pre>"},{"location":"queue/#worker-integration","title":"Worker Integration","text":""},{"location":"queue/#task-polling","title":"Task Polling","text":"<pre><code>// cmd/worker/worker_server.go\nfunc (w *Worker) Start() error {\n    for {\n        task, err := w.queue.WaitForNextTask(ctx, 5*time.Second)\n        if task != nil {\n            go w.executeTask(task)\n        }\n    }\n}\n</code></pre>"},{"location":"queue/#task-execution","title":"Task Execution","text":"<pre><code>func (w *Worker) executeTask(task *queue.Task) {\n    // Update status\n    task.Status = \"running\"\n    task.StartedAt = &amp;now\n    w.queue.UpdateTaskWithMetrics(task, \"start\")\n\n    // Execute\n    err := w.runJob(task)\n\n    // Finalize\n    task.Status = \"completed\" // or \"failed\"\n    task.EndedAt = &amp;endTime\n    task.Error = err.Error() // if err != nil\n    w.queue.UpdateTaskWithMetrics(task, \"final\")\n}\n</code></pre>"},{"location":"queue/#configuration","title":"Configuration","text":""},{"location":"queue/#api-server-configsconfigyaml","title":"API Server (<code>configs/config.yaml</code>)","text":"<pre><code>redis:\n  addr: \"localhost:6379\"\n  password: \"\"\n  db: 0\n</code></pre>"},{"location":"queue/#worker-configsworker-configyaml","title":"Worker (<code>configs/worker-config.yaml</code>)","text":"<pre><code>redis:\n  addr: \"localhost:6379\"\n  password: \"\"\n  db: 0\n\nmetrics_flush_interval: 500ms\n</code></pre>"},{"location":"queue/#monitoring","title":"Monitoring","text":""},{"location":"queue/#queue-depth","title":"Queue Depth","text":"<pre><code>depth, err := queue.QueueDepth()\nfmt.Printf(\"Pending tasks: %d\\n\", depth)\n</code></pre>"},{"location":"queue/#worker-heartbeat","title":"Worker Heartbeat","text":"<pre><code>// Worker sends heartbeat every 30s\nerr := queue.Heartbeat(workerID)\n</code></pre>"},{"location":"queue/#metrics","title":"Metrics","text":"<pre><code>HGETALL job:metrics:{job_name}\n# Returns: timestamp, tasks_start, tasks_final, etc\n</code></pre>"},{"location":"queue/#error-handling","title":"Error Handling","text":""},{"location":"queue/#task-failures","title":"Task Failures","text":"<pre><code>if err := w.runJob(task); err != nil {\n    task.Status = \"failed\"\n    task.Error = err.Error()\n    w.queue.UpdateTask(task)\n}\n</code></pre>"},{"location":"queue/#redis-connection-loss","title":"Redis Connection Loss","text":"<pre><code>// TaskQueue automatically reconnects\n// Workers should implement retry logic\nfor retries := 0; retries &lt; 3; retries++ {\n    task, err := queue.GetNextTask()\n    if err == nil {\n        break\n    }\n    time.Sleep(backoff)\n}\n</code></pre>"},{"location":"queue/#testing","title":"Testing","text":"<pre><code>// tests using miniredis\ns, _ := miniredis.Run()\ndefer s.Close()\n\ntq, _ := queue.NewTaskQueue(queue.Config{\n    RedisAddr: s.Addr(),\n})\n\ntask := &amp;queue.Task{ID: \"test-1\", JobName: \"test\"}\ntq.AddTask(task)\n\nfetched, _ := tq.GetNextTask()\n// assert fetched.ID == \"test-1\"\n</code></pre>"},{"location":"queue/#best-practices","title":"Best Practices","text":"<ol> <li>Unique Task IDs: Always use UUIDs to avoid conflicts</li> <li>Metadata: Store commit_id and user in task metadata</li> <li>Priority: Higher values execute first (0-255 range)</li> <li>Status Updates: Update status at each lifecycle stage</li> <li>Error Logging: Store detailed errors in task.Error</li> <li>Heartbeats: Workers should send heartbeats regularly</li> <li>Metrics: Use UpdateTaskWithMetrics for atomic updates</li> </ol> <p>For implementation details, see: - internal/queue/task.go - internal/queue/queue.go</p>"},{"location":"quick-start/","title":"Quick Start","text":"<p>Get Fetch ML running in minutes with Docker Compose.</p>"},{"location":"quick-start/#prerequisites","title":"Prerequisites","text":"<p>Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution</p> <ul> <li>Docker Compose (testing only)</li> <li>4GB+ RAM</li> <li>2GB+ disk space</li> </ul>"},{"location":"quick-start/#one-command-setup","title":"One-Command Setup","text":"<pre><code># Clone and start\ngit clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\ndocker-compose up -d (testing only)\n\n# Wait for services (30 seconds)\nsleep 30\n\n# Verify setup\ncurl http://localhost:9101/health\n</code></pre>"},{"location":"quick-start/#first-experiment","title":"First Experiment","text":"<pre><code># Submit a simple ML job (see [First Experiment](first-experiment.md) for details)\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n  -H \"Content-Type: application/json\" \\\n  -H \"X-API-Key: admin\" \\\n  -d '{\n    \"job_name\": \"hello-world\",\n    \"args\": \"--echo Hello World\",\n    \"priority\": 1\n  }'\n\n# Check job status\ncurl http://localhost:9101/api/v1/jobs \\\n  -H \"X-API-Key: admin\"\n</code></pre>"},{"location":"quick-start/#cli-access","title":"CLI Access","text":"<pre><code># Build CLI\ncd cli &amp;&amp; zig build dev\n\n# List jobs\n./cli/zig-out/dev/ml --server http://localhost:9101 list-jobs\n\n# Submit new job\n./cli/zig-out/dev/ml --server http://localhost:9101 submit \\\n  --name \"test-job\" --args \"--epochs 10\"\n</code></pre>"},{"location":"quick-start/#related-documentation","title":"Related Documentation","text":"<ul> <li>Installation Guide - Detailed setup options</li> <li>First Experiment - Complete ML workflow</li> <li>Development Setup - Local development</li> <li>Security - Authentication and permissions</li> </ul>"},{"location":"quick-start/#troubleshooting","title":"Troubleshooting","text":"<p>Services not starting? <pre><code># Check logs\ndocker-compose logs\n\n# Restart services\ndocker-compose down &amp;&amp; docker-compose up -d (testing only)\n</code></pre></p> <p>API not responding? <pre><code># Check health\ncurl http://localhost:9101/health\n\n# Verify ports\ndocker-compose ps\n</code></pre></p> <p>Permission denied? <pre><code># Check API key\ncurl -H \"X-API-Key: admin\" http://localhost:9101/api/v1/jobs\n</code></pre></p>"},{"location":"redis-ha/","title":"Redis High Availability","text":"<p>Note: This is optional for homelab setups. Single Redis instance is sufficient for most use cases.</p>"},{"location":"redis-ha/#when-you-need-ha","title":"When You Need HA","text":"<p>Consider Redis HA if: - Running production workloads - Uptime &gt; 99.9% required - Can't afford to lose queued tasks - Multiple workers across machines</p>"},{"location":"redis-ha/#redis-sentinel-recommended","title":"Redis Sentinel (Recommended)","text":""},{"location":"redis-ha/#setup","title":"Setup","text":"<pre><code># docker-compose.yml\nversion: '3.8'\nservices:\n  redis-master:\n    image: redis:7-alpine\n    command: redis-server --maxmemory 2gb\n\n  redis-replica:\n    image: redis:7-alpine\n    command: redis-server --slaveof redis-master 6379\n\n  redis-sentinel-1:\n    image: redis:7-alpine\n    command: redis-sentinel /etc/redis/sentinel.conf\n    volumes:\n      - ./sentinel.conf:/etc/redis/sentinel.conf\n</code></pre> <p>sentinel.conf: <pre><code>sentinel monitor mymaster redis-master 6379 2\nsentinel down-after-milliseconds mymaster 5000\nsentinel parallel-syncs mymaster 1\nsentinel failover-timeout mymaster 10000\n</code></pre></p>"},{"location":"redis-ha/#application-configuration","title":"Application Configuration","text":"<pre><code># worker-config.yaml\nredis_addr: \"redis-sentinel-1:26379,redis-sentinel-2:26379\"\nredis_master_name: \"mymaster\"\n</code></pre>"},{"location":"redis-ha/#redis-cluster-advanced","title":"Redis Cluster (Advanced)","text":"<p>For larger deployments with sharding needs.</p> <pre><code># Minimum 3 masters + 3 replicas\nservices:\n  redis-1:\n    image: redis:7-alpine\n    command: redis-server --cluster-enabled yes\n\n  redis-2:\n    # ... similar config\n</code></pre>"},{"location":"redis-ha/#homelab-alternative-persistence-only","title":"Homelab Alternative: Persistence Only","text":"<p>For most homelabs, just enable persistence:</p> <pre><code># docker-compose.yml\nservices:\n  redis:\n    image: redis:7-alpine\n    command: redis-server --appendonly yes\n    volumes:\n      - redis_data:/data\n\nvolumes:\n  redis_data:\n</code></pre> <p>This ensures tasks survive Redis restarts without full HA complexity.</p> <p>Recommendation: Start simple. Add HA only if you experience actual downtime issues.</p>"},{"location":"release-checklist/","title":"Release Checklist","text":"<p>This checklist captures the work required before cutting a release that includes the graceful worker shutdown feature.</p>"},{"location":"release-checklist/#1-code-hygiene-compilation","title":"1. Code Hygiene / Compilation","text":"<ol> <li>Merge the graceful-shutdown helpers into the canonical worker type to avoid      <code>Worker redeclared</code> errors (see <code>cmd/worker/worker_graceful_shutdown.go</code> and      <code>cmd/worker/worker_server.go</code>).</li> <li>Ensure the worker struct exposes the fields referenced by the new helpers      (<code>logger</code>, <code>queue</code>, <code>cfg</code>, <code>metrics</code>).</li> <li><code>go build ./cmd/worker</code> succeeds without undefined-field errors.</li> </ol>"},{"location":"release-checklist/#2-graceful-shutdown-logic","title":"2. Graceful Shutdown Logic","text":"<ol> <li>Initialize <code>shutdownCh</code>, <code>activeTasks</code>, and <code>gracefulWait</code> during worker start-up.</li> <li>Confirm the heartbeat/lease helpers compile and handle queue errors gracefully      (<code>heartbeatLoop</code>, <code>releaseAllLeases</code>).</li> <li>Add tests (unit or integration) that simulate SIGINT/SIGTERM and verify leases      are released or tasks complete.</li> </ol>"},{"location":"release-checklist/#3-task-execution-flow","title":"3. Task Execution Flow","text":"<ol> <li>Align <code>executeTaskWithLease</code> with the real <code>executeTask</code> signature so the      \"no value used as value\" compile error disappears.</li> <li>Double-check retry/metrics paths still match existing worker behavior after      the new wrapper is added.</li> </ol>"},{"location":"release-checklist/#4-server-wiring","title":"4. Server Wiring","text":"<ol> <li>Ensure worker construction in <code>cmd/worker/worker_server.go</code> wires up config,      queue, metrics, and logger instances used by the shutdown logic.</li> <li>Re-run worker unit tests plus any queue/lease e2e tests.</li> </ol>"},{"location":"release-checklist/#5-validation-before-tagging","title":"5. Validation Before Tagging","text":"<ol> <li><code>go test ./cmd/worker/...</code> and <code>make test</code> (or equivalent) pass locally.</li> <li>Manual smoke test: start worker, queue jobs, send SIGTERM, confirm tasks finish      or leases are released and the process exits cleanly.</li> <li>Update release notes describing the new shutdown capability and any config      changes required (e.g., graceful timeout settings).</li> </ol>"},{"location":"security/","title":"Security Guide","text":"<p>This document outlines security features, best practices, and hardening procedures for FetchML.</p>"},{"location":"security/#security-features","title":"Security Features","text":""},{"location":"security/#authentication-authorization","title":"Authentication &amp; Authorization","text":"<ul> <li>API Keys: SHA256-hashed with role-based access control (RBAC)</li> <li>Permissions: Granular read/write/delete permissions per user</li> <li>IP Whitelisting: Network-level access control</li> <li>Rate Limiting: Per-user request quotas</li> </ul>"},{"location":"security/#communication-security","title":"Communication Security","text":"<ul> <li>TLS/HTTPS: End-to-end encryption for API traffic</li> <li>WebSocket Auth: API key required before upgrade</li> <li>Redis Auth: Password-protected task queue</li> </ul>"},{"location":"security/#data-privacy","title":"Data Privacy","text":"<ul> <li>Log Sanitization: Automatically redacts API keys, passwords, tokens</li> <li>Experiment Isolation: User-specific experiment directories</li> <li>No Anonymous Access: All services require authentication</li> </ul>"},{"location":"security/#network-security","title":"Network Security","text":"<ul> <li>Internal Networks: Backend services (Redis, Loki) not exposed publicly</li> <li>Firewall Rules: Restrictive port access</li> <li>Container Isolation: Services run in separate containers/pods</li> </ul>"},{"location":"security/#security-checklist","title":"Security Checklist","text":""},{"location":"security/#initial-setup","title":"Initial Setup","text":"<ol> <li> <p>Generate Strong Passwords <pre><code># Grafana admin password\nopenssl rand -base64 32 &gt; .grafana-password\n\n# Redis password\nopenssl rand -base64 32\n</code></pre></p> </li> <li> <p>Configure Environment Variables <pre><code>cp .env.example .env\n# Edit .env and set:\n# - GRAFANA_ADMIN_PASSWORD\n</code></pre></p> </li> <li> <p>Enable TLS (Production only)   <pre><code># configs/config-prod.yaml\nserver:\n  tls:\n    enabled: true\n    cert_file: \"/secrets/cert.pem\"\n    key_file: \"/secrets/key.pem\"\n</code></pre></p> </li> <li> <p>Configure Firewall <pre><code># Allow only necessary ports\nsudo ufw allow 22/tcp    # SSH\nsudo ufw allow 443/tcp   # HTTPS\nsudo ufw allow 80/tcp    # HTTP (redirect to HTTPS)\nsudo ufw enable\n</code></pre></p> </li> </ol>"},{"location":"security/#production-hardening","title":"Production Hardening","text":"<ol> <li> <p>Restrict IP Access <pre><code># configs/config-prod.yaml\nauth:\n  ip_whitelist:\n    - \"10.0.0.0/8\"\n    - \"192.168.0.0/16\"\n    - \"127.0.0.1\"\n</code></pre></p> </li> <li> <p>Enable Audit Logging <pre><code>logging:\n  level: \"info\"\n  audit: true\n  file: \"/var/log/fetch_ml/audit.log\"\n</code></pre></p> </li> <li> <p>Harden Redis <pre><code># Redis security\nredis-cli CONFIG SET requirepass \"your-strong-password\"\nredis-cli CONFIG SET rename-command FLUSHDB \"\"\nredis-cli CONFIG SET rename-command FLUSHALL \"\"\n</code></pre></p> </li> <li> <p>Secure Grafana <pre><code># Change default admin password\ndocker-compose exec grafana grafana-cli admin reset-admin-password new-strong-password\n</code></pre></p> </li> <li> <p>Regular Updates <pre><code># Update system packages\nsudo apt update &amp;&amp; sudo apt upgrade -y\n\n# Update containers\ndocker-compose pull\ndocker-compose up -d (testing only)\n</code></pre></p> </li> </ol>"},{"location":"security/#password-management","title":"Password Management","text":""},{"location":"security/#generate-secure-passwords","title":"Generate Secure Passwords","text":"<pre><code># Method 1: OpenSSL\nopenssl rand -base64 32\n\n# Method 2: pwgen (if installed)\npwgen -s 32 1\n\n# Method 3: /dev/urandom\nhead -c 32 /dev/urandom | base64\n</code></pre>"},{"location":"security/#store-passwords-securely","title":"Store Passwords Securely","text":"<p>Development: Use <code>.env</code> file (gitignored) <pre><code>echo \"REDIS_PASSWORD=$(openssl rand -base64 32)\" &gt;&gt; .env\necho \"GRAFANA_ADMIN_PASSWORD=$(openssl rand -base64 32)\" &gt;&gt; .env\n</code></pre></p> <p>Production: Use systemd environment files <pre><code>sudo mkdir -p /etc/fetch_ml/secrets\nsudo chmod 700 /etc/fetch_ml/secrets\necho \"REDIS_PASSWORD=...\" | sudo tee /etc/fetch_ml/secrets/redis.env\nsudo chmod 600 /etc/fetch_ml/secrets/redis.env\n</code></pre></p>"},{"location":"security/#api-key-management","title":"API Key Management","text":""},{"location":"security/#generate-api-keys","title":"Generate API Keys","text":"<pre><code># Generate random API key\nopenssl rand -hex 32\n\n# Hash for storage\necho -n \"your-api-key\" | sha256sum\n</code></pre>"},{"location":"security/#rotate-api-keys","title":"Rotate API Keys","text":"<ol> <li>Generate new API key</li> <li>Update <code>config-local.yaml</code> with new hash</li> <li>Distribute new key to users</li> <li>Remove old key after grace period</li> </ol>"},{"location":"security/#revoke-api-keys","title":"Revoke API Keys","text":"<p>Remove user entry from <code>config-local.yaml</code>: <pre><code>auth:\n  apikeys:\n    # user_to_revoke:  # Comment out or delete\n</code></pre></p>"},{"location":"security/#network-security_1","title":"Network Security","text":""},{"location":"security/#production-network-topology","title":"Production Network Topology","text":"<pre><code>Internet\n    \u2193\n[Firewall] (ports 3000, 9102)\n    \u2193\n[Reverse Proxy] (nginx/Apache) - TLS termination\n    \u2193\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502   Application Pod   \u2502\n\u2502                     \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510   \u2502\n\u2502  \u2502 API Server   \u2502   \u2502  \u2190 Public (via reverse proxy)\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518   \u2502\n\u2502                     \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510   \u2502\n\u2502  \u2502   Redis      \u2502   \u2502  \u2190 Internal only\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518   \u2502\n\u2502                     \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510   \u2502\n\u2502  \u2502   Grafana    \u2502   \u2502  \u2190 Public (via reverse proxy)\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518   \u2502\n\u2502                     \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510   \u2502\n\u2502  \u2502 Prometheus   \u2502   \u2502  \u2190 Internal only\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518   \u2502\n\u2502                     \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510   \u2502\n\u2502  \u2502    Loki      \u2502   \u2502  \u2190 Internal only\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518   \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n</code></pre>"},{"location":"security/#recommended-firewall-rules","title":"Recommended Firewall Rules","text":"<pre><code># Allow only necessary inbound connections\nsudo firewall-cmd --permanent --zone=public --add-rich-rule='\n  rule family=\"ipv4\"\n  source address=\"YOUR_NETWORK\"\n  port port=\"3000\" protocol=\"tcp\" accept'\n\nsudo firewall-cmd --permanent --zone=public --add-rich-rule='\n  rule family=\"ipv4\"\n  source address=\"YOUR_NETWORK\"\n  port port=\"9102\" protocol=\"tcp\" accept'\n\n# Block all other traffic\nsudo firewall-cmd --permanent --set-default-zone=drop\nsudo firewall-cmd --reload\n</code></pre>"},{"location":"security/#incident-response","title":"Incident Response","text":""},{"location":"security/#suspected-breach","title":"Suspected Breach","text":"<ol> <li>Immediate Actions</li> <li>Investigation </li> <li>Recovery </li> <li>Rotate all API keys</li> <li>Stop affected services</li> <li> <p>Review audit logs</p> </li> <li> <p>Investigation <pre><code># Check recent logins\nsudo journalctl -u fetchml-api --since \"1 hour ago\"\n\n# Review failed auth attempts\ngrep \"authentication failed\" /var/log/fetch_ml/*.log\n\n# Check active connections\nss -tnp | grep :9102\n</code></pre></p> </li> <li> <p>Recovery</p> </li> <li>Rotate all passwords and API keys</li> <li>Update firewall rules</li> <li>Patch vulnerabilities</li> <li>Resume services</li> </ol>"},{"location":"security/#security-monitoring","title":"Security Monitoring","text":"<pre><code># Monitor failed authentication\ntail -f /var/log/fetch_ml/api.log | grep \"auth.*failed\"\n\n# Monitor unusual activity\njournalctl -u fetchml-api -f | grep -E \"(ERROR|WARN)\"\n\n# Check open ports\nnmap -p- localhost\n</code></pre>"},{"location":"security/#security-best-practices","title":"Security Best Practices","text":"<ol> <li>Principle of Least Privilege: Grant minimum necessary permissions</li> <li>Defense in Depth: Multiple security layers (firewall + auth + TLS)</li> <li>Regular Updates: Keep all components patched</li> <li>Audit Regularly: Review logs and access patterns</li> <li>Secure Secrets: Never commit passwords/keys to git</li> <li>Network Segmentation: Isolate services with internal networks</li> <li>Monitor Everything: Enable comprehensive logging and alerting</li> <li>Test Security: Regular penetration testing and vulnerability scans</li> </ol>"},{"location":"security/#compliance","title":"Compliance","text":""},{"location":"security/#data-privacy_1","title":"Data Privacy","text":"<ul> <li>Logs are sanitized (no passwords/API keys)</li> <li>Experiment data is user-isolated</li> <li>No telemetry or external data sharing</li> </ul>"},{"location":"security/#audit-trail","title":"Audit Trail","text":"<p>All API access is logged with: - Timestamp - User/API key - Action performed - Source IP - Result (success/failure)</p>"},{"location":"security/#getting-help","title":"Getting Help","text":"<ul> <li>Security Issues: Report privately via email</li> <li>Questions: See documentation or create issue</li> <li>Updates: Monitor releases for security patches</li> </ul>"},{"location":"smart-defaults/","title":"Smart Defaults","text":"<p>This document describes Fetch ML's smart defaults system, which automatically adapts configuration based on the runtime environment.</p>"},{"location":"smart-defaults/#overview","title":"Overview","text":"<p>Smart defaults eliminate the need for manual configuration tweaks when running in different environments:</p> <ul> <li>Local Development: Optimized for developer machines with sensible paths and localhost services</li> <li>Container Environments: Uses container-friendly hostnames and paths</li> <li>CI/CD: Optimized for automated testing with fast polling and minimal resource usage</li> <li>Production: Uses production-ready defaults with proper security and scaling</li> </ul>"},{"location":"smart-defaults/#environment-detection","title":"Environment Detection","text":"<p>The system automatically detects the environment based on:</p> <ol> <li>CI Detection: Checks for <code>CI</code>, <code>GITHUB_ACTIONS</code>, <code>GITLAB_CI</code> environment variables</li> <li>Container Detection: Looks for <code>/.dockerenv</code>, <code>KUBERNETES_SERVICE_HOST</code>, or <code>CONTAINER</code> variables</li> <li>Production Detection: Checks <code>FETCH_ML_ENV=production</code> or <code>ENV=production</code></li> <li>Default: Falls back to local development</li> </ol>"},{"location":"smart-defaults/#default-values-by-environment","title":"Default Values by Environment","text":""},{"location":"smart-defaults/#host-configuration","title":"Host Configuration","text":"<ul> <li>Local: <code>localhost</code></li> <li>Container/CI: <code>host.docker.internal</code> (Docker Desktop/Colima)</li> <li>Production: <code>0.0.0.0</code></li> </ul>"},{"location":"smart-defaults/#base-paths","title":"Base Paths","text":"<ul> <li>Local: <code>~/ml-experiments</code></li> <li>Container/CI: <code>/workspace/ml-experiments</code></li> <li>Production: <code>/var/lib/fetch_ml/experiments</code></li> </ul>"},{"location":"smart-defaults/#data-directory","title":"Data Directory","text":"<ul> <li>Local: <code>~/ml-data</code></li> <li>Container/CI: <code>/workspace/data</code></li> <li>Production: <code>/var/lib/fetch_ml/data</code></li> </ul>"},{"location":"smart-defaults/#redis-address","title":"Redis Address","text":"<ul> <li>Local: <code>localhost:6379</code></li> <li>Container/CI: <code>redis:6379</code> (service name)</li> <li>Production: <code>redis:6379</code></li> </ul>"},{"location":"smart-defaults/#ssh-configuration","title":"SSH Configuration","text":"<ul> <li>Local: <code>~/.ssh/id_rsa</code> and <code>~/.ssh/known_hosts</code></li> <li>Container/CI: <code>/workspace/.ssh/id_rsa</code> and <code>/workspace/.ssh/known_hosts</code></li> <li>Production: <code>/etc/fetch_ml/ssh/id_rsa</code> and <code>/etc/fetch_ml/ssh/known_hosts</code></li> </ul>"},{"location":"smart-defaults/#worker-configuration","title":"Worker Configuration","text":"<ul> <li>Local: 2 workers, 5-second poll interval</li> <li>CI: 1 worker, 1-second poll interval (fast testing)</li> <li>Production: CPU core count workers, 10-second poll interval</li> </ul>"},{"location":"smart-defaults/#log-levels","title":"Log Levels","text":"<ul> <li>Local: <code>info</code></li> <li>CI: <code>debug</code> (verbose for debugging)</li> <li>Production: <code>info</code></li> </ul>"},{"location":"smart-defaults/#usage","title":"Usage","text":""},{"location":"smart-defaults/#in-configuration-loaders","title":"In Configuration Loaders","text":"<pre><code>// Get smart defaults for current environment\nsmart := config.GetSmartDefaults()\n\n// Use smart defaults\nif cfg.Host == \"\" {\n    cfg.Host = smart.Host()\n}\nif cfg.BasePath == \"\" {\n    cfg.BasePath = smart.BasePath()\n}\n</code></pre>"},{"location":"smart-defaults/#environment-overrides","title":"Environment Overrides","text":"<p>Smart defaults can be overridden with environment variables:</p> <ul> <li><code>FETCH_ML_HOST</code> - Override host</li> <li><code>FETCH_ML_BASE_PATH</code> - Override base path</li> <li><code>FETCH_ML_REDIS_ADDR</code> - Override Redis address</li> <li><code>FETCH_ML_ENV</code> - Force environment profile</li> </ul>"},{"location":"smart-defaults/#manual-environment-selection","title":"Manual Environment Selection","text":"<p>You can force a specific environment:</p> <pre><code># Force production mode\nexport FETCH_ML_ENV=production\n\n# Force container mode\nexport CONTAINER=true\n</code></pre>"},{"location":"smart-defaults/#implementation-details","title":"Implementation Details","text":"<p>The smart defaults system is implemented in <code>internal/config/smart_defaults.go</code>:</p> <ul> <li><code>DetectEnvironment()</code> - Determines current environment profile</li> <li><code>SmartDefaults</code> struct - Provides environment-aware defaults</li> <li>Helper methods for each configuration value</li> </ul>"},{"location":"smart-defaults/#migration-guide","title":"Migration Guide","text":""},{"location":"smart-defaults/#for-users","title":"For Users","text":"<p>No changes required - existing configurations continue to work. Smart defaults only apply when values are not explicitly set.</p>"},{"location":"smart-defaults/#for-developers","title":"For Developers","text":"<p>When adding new configuration options:</p> <ol> <li>Add a method to <code>SmartDefaults</code> struct</li> <li>Use the smart default in config loaders</li> <li>Document the environment-specific values</li> </ol> <p>Example:</p> <pre><code>// Add to SmartDefaults struct\nfunc (s *SmartDefaults) NewFeature() string {\n    switch s.Profile {\n    case ProfileContainer, ProfileCI:\n        return \"/workspace/new-feature\"\n    case ProfileProduction:\n        return \"/var/lib/fetch_ml/new-feature\"\n    default:\n        return \"./new-feature\"\n    }\n}\n\n// Use in config loader\nif cfg.NewFeature == \"\" {\n    cfg.NewFeature = smart.NewFeature()\n}\n</code></pre>"},{"location":"smart-defaults/#testing","title":"Testing","text":"<p>To test different environments:</p> <pre><code># Test local defaults (default)\n./bin/worker\n\n# Test container defaults\nexport CONTAINER=true\n./bin/worker\n\n# Test CI defaults\nexport CI=true\n./bin/worker\n\n# Test production defaults\nexport FETCH_ML_ENV=production\n./bin/worker\n</code></pre>"},{"location":"smart-defaults/#troubleshooting","title":"Troubleshooting","text":""},{"location":"smart-defaults/#wrong-environment-detection","title":"Wrong Environment Detection","text":"<p>Check environment variables:</p> <pre><code>echo \"CI: $CI\"\necho \"CONTAINER: $CONTAINER\"\necho \"FETCH_ML_ENV: $FETCH_ML_ENV\"\n</code></pre>"},{"location":"smart-defaults/#path-issues","title":"Path Issues","text":"<p>Smart defaults expand <code>~</code> and environment variables automatically. If paths don't work as expected:</p> <ol> <li>Check the detected environment: <code>config.GetSmartDefaults().GetEnvironmentDescription()</code></li> <li>Verify the path exists in the target environment</li> <li>Override with environment variable if needed</li> </ol>"},{"location":"smart-defaults/#container-networking","title":"Container Networking","text":"<p>For container environments, ensure: - Redis service is named <code>redis</code> in docker-compose - Host networking is configured properly - <code>host.docker.internal</code> resolves (Docker Desktop/Colima)</p>"},{"location":"testing/","title":"Testing Guide","text":"<p>How to run and write tests for FetchML.</p>"},{"location":"testing/#running-tests","title":"Running Tests","text":""},{"location":"testing/#quick-test","title":"Quick Test","text":"<pre><code># All tests\nmake test\n\n# Unit tests only\nmake test-unit\n\n# Integration tests\nmake test-integration\n\n# With coverage\nmake test-coverage\n\n\n## Quick Test\n```bash\nmake test          # All tests\nmake test-unit     # Unit only\n.\nmake test.\nmake test$\nmake test; make test  # Coverage\n  # E2E tests\n</code></pre>"},{"location":"testing/#docker-testing","title":"Docker Testing","text":"<pre><code>docker-compose up -d (testing only)\nmake test\ndocker-compose down\n</code></pre>"},{"location":"testing/#cli-testing","title":"CLI Testing","text":"<pre><code>cd cli &amp;&amp; zig build dev\n./cli/zig-out/dev/ml --help\nzig build test\n</code></pre>"},{"location":"troubleshooting/","title":"Troubleshooting","text":"<p>Common issues and solutions for Fetch ML.</p>"},{"location":"troubleshooting/#quick-fixes","title":"Quick Fixes","text":""},{"location":"troubleshooting/#services-not-starting","title":"Services Not Starting","text":"<pre><code># Check Docker status\ndocker-compose ps\n\n# Restart services\ndocker-compose down &amp;&amp; docker-compose up -d (testing only)\n\n# Check logs\ndocker-compose logs -f\n</code></pre>"},{"location":"troubleshooting/#api-not-responding","title":"API Not Responding","text":"<pre><code># Check health endpoint\ncurl http://localhost:9101/health\n\n# Check if port is in use\nlsof -i :9101\n\n# Kill process on port\nkill -9 $(lsof -ti :9101)\n</code></pre>"},{"location":"troubleshooting/#database-issues","title":"Database Issues","text":"<pre><code># Check database connection\ndocker-compose exec postgres psql -U postgres -d fetch_ml\n\n# Reset database\ndocker-compose down postgres\ndocker-compose up -d (testing only) postgres\n\n# Check Redis\ndocker-compose exec redis redis-cli ping\n</code></pre>"},{"location":"troubleshooting/#common-errors","title":"Common Errors","text":""},{"location":"troubleshooting/#authentication-errors","title":"Authentication Errors","text":"<ul> <li>Invalid API key: Check config and regenerate hash</li> <li>JWT expired: Check <code>jwt_expiry</code> setting</li> </ul>"},{"location":"troubleshooting/#database-errors","title":"Database Errors","text":"<ul> <li>Connection failed: Verify database type and connection params</li> <li>No such table: Run migrations with <code>--migrate</code> (see Development Setup)</li> </ul>"},{"location":"troubleshooting/#container-errors","title":"Container Errors","text":"<ul> <li>Runtime not found: Set <code>runtime: docker (testing only)</code> in config</li> <li>Image pull failed: Check registry access</li> </ul>"},{"location":"troubleshooting/#performance-issues","title":"Performance Issues","text":"<ul> <li>High memory: Adjust <code>resources.memory_limit</code></li> <li>Slow jobs: Check worker count and queue size</li> </ul>"},{"location":"troubleshooting/#development-issues","title":"Development Issues","text":"<ul> <li>Build fails: <code>go mod tidy</code> and <code>cd cli &amp;&amp; rm -rf zig-out zig-cache</code></li> <li>Tests fail: Start test dependencies with <code>docker-compose -f docker-compose.test.yml up -d</code></li> </ul>"},{"location":"troubleshooting/#cli-issues","title":"CLI Issues","text":"<ul> <li>Not found: <code>cd cli &amp;&amp; zig build dev</code></li> <li>Connection errors: Check <code>--server</code> and <code>--api-key</code></li> </ul>"},{"location":"troubleshooting/#network-issues","title":"Network Issues","text":"<ul> <li>Port conflicts: <code>lsof -i :9101</code> and kill processes</li> <li>Firewall: Allow ports 9101, 6379, 5432</li> </ul>"},{"location":"troubleshooting/#configuration-issues","title":"Configuration Issues","text":"<ul> <li>Invalid YAML: <code>python3 -c \"import yaml; yaml.safe_load(open('config.yaml'))\"</code></li> <li>Missing fields: Run <code>see [Configuration Schema](configuration-schema.md)</code></li> </ul>"},{"location":"troubleshooting/#debug-information","title":"Debug Information","text":"<pre><code>./bin/api-server --version\ndocker-compose ps\ndocker-compose logs api-server | grep ERROR\n</code></pre>"},{"location":"troubleshooting/#emergency-reset","title":"Emergency Reset","text":"<pre><code>docker-compose down -v\nrm -rf data/ results/ *.db\ndocker-compose up -d (testing only)\n</code></pre>"},{"location":"user-permissions/","title":"User Permissions in Fetch ML","text":"<p>Fetch ML now supports user-based permissions to ensure data scientists can only view and manage their own experiments while administrators retain full control.</p>"},{"location":"user-permissions/#overview","title":"Overview","text":"<ul> <li>User Isolation: Each user can only see their own experiments</li> <li>Admin Override: Administrators can view and manage all experiments</li> <li>Permission-Based: Fine-grained permissions for create, read, update operations</li> <li>API Key Authentication: Secure authentication using API keys</li> </ul>"},{"location":"user-permissions/#permissions","title":"Permissions","text":""},{"location":"user-permissions/#job-permissions","title":"Job Permissions","text":"<ul> <li><code>jobs:create</code> - Create new experiments</li> <li><code>jobs:read</code> - View experiment status and results</li> <li><code>jobs:update</code> - Cancel or modify experiments</li> </ul>"},{"location":"user-permissions/#user-types","title":"User Types","text":"<ul> <li>Administrators: Full access to all experiments and system operations</li> <li>Data Scientists: Access to their own experiments only</li> <li>Viewers: Read-only access to their own experiments</li> </ul>"},{"location":"user-permissions/#cli-usage","title":"CLI Usage","text":""},{"location":"user-permissions/#view-your-jobs","title":"View Your Jobs","text":"<p><pre><code>ml status\n</code></pre> Shows only your experiments with user context displayed.</p>"},{"location":"user-permissions/#cancel-your-jobs","title":"Cancel Your Jobs","text":"<p><pre><code>ml cancel &lt;job-name&gt;\n</code></pre> Only allows canceling your own experiments (unless you're an admin).</p>"},{"location":"user-permissions/#authentication","title":"Authentication","text":"<p>The CLI automatically authenticates using your API key from <code>~/.ml/config.toml</code>.</p>"},{"location":"user-permissions/#configuration","title":"Configuration","text":""},{"location":"user-permissions/#api-key-setup","title":"API Key Setup","text":"<pre><code>[worker]\napi_key = \"your-api-key-here\"\n</code></pre>"},{"location":"user-permissions/#user-roles","title":"User Roles","text":"<p>User roles and permissions are configured on the server side by administrators.</p>"},{"location":"user-permissions/#security-features","title":"Security Features","text":"<ul> <li>API Key Hashing: Keys are hashed before transmission</li> <li>User Filtering: Server-side filtering prevents unauthorized access</li> <li>Permission Validation: All operations require appropriate permissions</li> <li>Audit Logging: All user actions are logged</li> </ul>"},{"location":"user-permissions/#examples","title":"Examples","text":""},{"location":"user-permissions/#data-scientist-workflow","title":"Data Scientist Workflow","text":"<pre><code># Submit your experiment\nml run my-experiment\n\n# Check your experiments (only shows yours)\nml status\n\n# Cancel your own experiment\nml cancel my-experiment\n</code></pre>"},{"location":"user-permissions/#administrator-workflow","title":"Administrator Workflow","text":"<pre><code># View all experiments (admin sees everything)\nml status\n\n# Cancel any user's experiment\nml cancel user-experiment\n</code></pre>"},{"location":"user-permissions/#error-messages","title":"Error Messages","text":"<ul> <li>\"Insufficient permissions\": You don't have the required permission</li> <li>\"You can only cancel your own jobs\": Ownership restriction</li> <li>\"Invalid API key\": Authentication failed</li> </ul>"},{"location":"user-permissions/#migration-notes","title":"Migration Notes","text":"<ul> <li>Existing configurations continue to work</li> <li>When auth is disabled, all users have admin-like access</li> <li>User ownership is automatically assigned to new experiments</li> </ul> <p>For more details, see the architecture documentation.</p>"},{"location":"zig-cli/","title":"Zig CLI Guide","text":"<p>High-performance command-line interface for ML experiment management, written in Zig for maximum speed and efficiency.</p>"},{"location":"zig-cli/#overview","title":"Overview","text":"<p>The Zig CLI (<code>ml</code>) is the primary interface for managing ML experiments in your homelab. Built with Zig, it provides exceptional performance for file operations, network communication, and experiment management.</p>"},{"location":"zig-cli/#installation","title":"Installation","text":""},{"location":"zig-cli/#pre-built-binaries-recommended","title":"Pre-built Binaries (Recommended)","text":"<p>Download from GitHub Releases:</p> <pre><code># Download for your platform\ncurl -LO https://github.com/jfraeys/fetch_ml/releases/latest/download/ml-&lt;platform&gt;.tar.gz\n\n# Extract\ntar -xzf ml-&lt;platform&gt;.tar.gz\n\n# Install\nchmod +x ml-&lt;platform&gt;\nsudo mv ml-&lt;platform&gt; /usr/local/bin/ml\n\n# Verify\nml --help\n</code></pre> <p>Platforms: - <code>ml-linux-x86_64.tar.gz</code> - Linux (fully static, zero dependencies) - <code>ml-macos-x86_64.tar.gz</code> - macOS Intel - <code>ml-macos-arm64.tar.gz</code> - macOS Apple Silicon</p> <p>All release binaries include embedded static rsync for complete independence.</p>"},{"location":"zig-cli/#build-from-source","title":"Build from Source","text":"<p>Development Build (uses system rsync): <pre><code>cd cli\nzig build dev\n./zig-out/dev/ml-dev --help\n</code></pre></p> <p>Production Build (embedded rsync): <pre><code>cd cli\n# For testing: uses rsync wrapper\nzig build prod\n\n# For release with static rsync:\n# 1. Place static rsync binary at src/assets/rsync_release.bin\n# 2. Build\nzig build prod\nstrip zig-out/prod/ml  # Optional: reduce size\n\n# Verify\n./zig-out/prod/ml --help\nls -lh zig-out/prod/ml\n</code></pre></p> <p>See cli/src/assets/README.md for details on obtaining static rsync binaries.</p>"},{"location":"zig-cli/#verify-installation","title":"Verify Installation","text":"<pre><code>ml --help\nml --version  # Shows build config\n</code></pre>"},{"location":"zig-cli/#quick-start","title":"Quick Start","text":"<ol> <li> <p>Initialize Configuration <pre><code>./cli/zig-out/bin/ml init\n</code></pre></p> </li> <li> <p>Sync Your First Project <pre><code>./cli/zig-out/bin/ml sync ./my-project --queue\n</code></pre></p> </li> <li> <p>Monitor Progress <pre><code>./cli/zig-out/bin/ml status\n</code></pre></p> </li> </ol>"},{"location":"zig-cli/#command-reference","title":"Command Reference","text":""},{"location":"zig-cli/#init-configuration-setup","title":"<code>init</code> - Configuration Setup","text":"<p>Initialize the CLI configuration file.</p> <pre><code>ml init\n</code></pre> <p>Creates: <code>~/.ml/config.toml</code></p> <p>Configuration Template: <pre><code>worker_host = \"worker.local\"\nworker_user = \"mluser\"\nworker_base = \"/data/ml-experiments\"\nworker_port = 22\napi_key = \"your-api-key\"\n</code></pre></p>"},{"location":"zig-cli/#sync-project-synchronization","title":"<code>sync</code> - Project Synchronization","text":"<p>Sync project files to the worker with intelligent deduplication.</p> <pre><code># Basic sync\nml sync ./project\n\n# Sync with custom name and auto-queue\nml sync ./project --name \"experiment-1\" --queue\n\n# Sync with priority\nml sync ./project --priority 8\n</code></pre> <p>Options: - <code>--name &lt;name&gt;</code>: Custom experiment name - <code>--queue</code>: Automatically queue after sync - <code>--priority N</code>: Set priority (1-10, default 5)</p> <p>Features: - Content-Addressed Storage: Automatic deduplication - SHA256 Commit IDs: Reliable change detection - Incremental Transfer: Only sync changed files - Rsync Backend: Efficient file transfer</p>"},{"location":"zig-cli/#queue-job-management","title":"<code>queue</code> - Job Management","text":"<p>Queue experiments for execution on the worker.</p> <pre><code># Queue with commit ID\nml queue my-job --commit abc123def456\n\n# Queue with priority\nml queue my-job --commit abc123 --priority 8\n</code></pre> <p>Options: - <code>--commit &lt;id&gt;</code>: Commit ID from sync output - <code>--priority N</code>: Execution priority (1-10)</p> <p>Features: - WebSocket Communication: Real-time job submission - Priority Queuing: Higher priority jobs run first - API Authentication: Secure job submission</p>"},{"location":"zig-cli/#watch-auto-sync-monitoring","title":"<code>watch</code> - Auto-Sync Monitoring","text":"<p>Monitor directories for changes and auto-sync.</p> <pre><code># Watch for changes\nml watch ./project\n\n# Watch and auto-queue on changes\nml watch ./project --name \"dev-exp\" --queue\n</code></pre> <p>Options: - <code>--name &lt;name&gt;</code>: Custom experiment name - <code>--queue</code>: Auto-queue on changes - <code>--priority N</code>: Set priority for queued jobs</p> <p>Features: - Real-time Monitoring: 2-second polling interval - Change Detection: File modification time tracking - Commit Comparison: Only sync when content changes - Automatic Queuing: Seamless development workflow</p>"},{"location":"zig-cli/#status-system-status","title":"<code>status</code> - System Status","text":"<p>Check system and worker status.</p> <pre><code>ml status\n</code></pre> <p>Displays: - Worker connectivity - Queue status - Running jobs - System health</p>"},{"location":"zig-cli/#monitor-remote-monitoring","title":"<code>monitor</code> - Remote Monitoring","text":"<p>Launch TUI interface via SSH for real-time monitoring.</p> <pre><code>ml monitor\n</code></pre> <p>Features: - Real-time Updates: Live experiment status - Interactive Interface: Browse and manage experiments - SSH Integration: Secure remote access</p>"},{"location":"zig-cli/#cancel-job-cancellation","title":"<code>cancel</code> - Job Cancellation","text":"<p>Cancel running or queued jobs.</p> <pre><code>ml cancel job-id\n</code></pre> <p>Options: - <code>job-id</code>: Job identifier from status output</p>"},{"location":"zig-cli/#prune-cleanup-management","title":"<code>prune</code> - Cleanup Management","text":"<p>Clean up old experiments to save space.</p> <pre><code># Keep last N experiments\nml prune --keep 20\n\n# Remove experiments older than N days\nml prune --older-than 30\n</code></pre> <p>Options: - <code>--keep N</code>: Keep N most recent experiments - <code>--older-than N</code>: Remove experiments older than N days</p>"},{"location":"zig-cli/#architecture","title":"Architecture","text":"<p>Testing: Docker Compose (macOS/Linux) Production: Podman + systemd (Linux)</p> <p>Important: Docker is for testing only. Podman is used for running actual ML experiments in production.</p>"},{"location":"zig-cli/#core-components","title":"Core Components","text":"<pre><code>cli/src/\n\u251c\u2500\u2500 commands/        # Command implementations\n\u2502   \u251c\u2500\u2500 init.zig     # Configuration setup\n\u2502   \u251c\u2500\u2500 sync.zig     # Project synchronization\n\u2502   \u251c\u2500\u2500 queue.zig    # Job management\n\u2502   \u251c\u2500\u2500 watch.zig    # Auto-sync monitoring\n\u2502   \u251c\u2500\u2500 status.zig   # System status\n\u2502   \u251c\u2500\u2500 monitor.zig  # Remote monitoring\n\u2502   \u251c\u2500\u2500 cancel.zig   # Job cancellation\n\u2502   \u2514\u2500\u2500 prune.zig    # Cleanup operations\n\u251c\u2500\u2500 config.zig       # Configuration management\n\u251c\u2500\u2500 errors.zig       # Error handling\n\u251c\u2500\u2500 net/            # Network utilities\n\u2502   \u2514\u2500\u2500 ws.zig       # WebSocket client\n\u2514\u2500\u2500 utils/          # Utility functions\n    \u251c\u2500\u2500 crypto.zig   # Hashing and encryption\n    \u251c\u2500\u2500 storage.zig  # Content-addressed storage\n    \u2514\u2500\u2500 rsync.zig    # File synchronization\n</code></pre>"},{"location":"zig-cli/#performance-features","title":"Performance Features","text":""},{"location":"zig-cli/#content-addressed-storage","title":"Content-Addressed Storage","text":"<ul> <li>Deduplication: Identical files shared across experiments</li> <li>Hash-based Storage: Files stored by SHA256 hash</li> <li>Space Efficiency: Reduces storage by up to 90%</li> </ul>"},{"location":"zig-cli/#sha256-commit-ids","title":"SHA256 Commit IDs","text":"<ul> <li>Reliable Detection: Cryptographic change detection</li> <li>Collision Resistance: Guaranteed unique identifiers</li> <li>Fast Computation: Optimized for large directories</li> </ul>"},{"location":"zig-cli/#websocket-protocol","title":"WebSocket Protocol","text":"<ul> <li>Low Latency: Real-time communication</li> <li>Binary Protocol: Efficient message format</li> <li>Connection Pooling: Reused connections</li> </ul>"},{"location":"zig-cli/#memory-management","title":"Memory Management","text":"<ul> <li>Arena Allocators: Efficient memory allocation</li> <li>Zero-copy Operations: Minimized memory usage</li> <li>Resource Cleanup: Automatic resource management</li> </ul>"},{"location":"zig-cli/#security-features","title":"Security Features","text":""},{"location":"zig-cli/#authentication","title":"Authentication","text":"<ul> <li>API Key Hashing: Secure token storage</li> <li>SHA256 Hashes: Irreversible token protection</li> <li>Config Validation: Input sanitization</li> </ul>"},{"location":"zig-cli/#secure-communication","title":"Secure Communication","text":"<ul> <li>SSH Integration: Encrypted file transfers</li> <li>WebSocket Security: TLS-protected communication</li> <li>Input Validation: Comprehensive argument checking</li> </ul>"},{"location":"zig-cli/#error-handling","title":"Error Handling","text":"<ul> <li>Secure Reporting: No sensitive information leakage</li> <li>Graceful Degradation: Safe error recovery</li> <li>Audit Logging: Operation tracking</li> </ul>"},{"location":"zig-cli/#advanced-usage","title":"Advanced Usage","text":""},{"location":"zig-cli/#workflow-integration","title":"Workflow Integration","text":""},{"location":"zig-cli/#development-workflow","title":"Development Workflow","text":"<pre><code># 1. Initialize project\nml sync ./project --name \"dev\" --queue\n\n# 2. Auto-sync during development\nml watch ./project --name \"dev\" --queue\n\n# 3. Monitor progress\nml status\n</code></pre>"},{"location":"zig-cli/#batch-processing","title":"Batch Processing","text":"<pre><code># Process multiple experiments\nfor dir in experiments/*/; do\n    ml sync \"$dir\" --queue\ndone\n</code></pre>"},{"location":"zig-cli/#priority-management","title":"Priority Management","text":"<pre><code># High priority experiment\nml sync ./urgent --priority 10 --queue\n\n# Background processing\nml sync ./background --priority 1 --queue\n</code></pre>"},{"location":"zig-cli/#configuration-management","title":"Configuration Management","text":""},{"location":"zig-cli/#multiple-workers","title":"Multiple Workers","text":"<pre><code># ~/.ml/config.toml\nworker_host = \"worker.local\"\nworker_user = \"mluser\"\nworker_base = \"/data/ml-experiments\"\nworker_port = 22\napi_key = \"your-api-key\"\n</code></pre>"},{"location":"zig-cli/#security-settings","title":"Security Settings","text":"<pre><code># Set restrictive permissions\nchmod 600 ~/.ml/config.toml\n\n# Verify configuration\nml status\n</code></pre>"},{"location":"zig-cli/#troubleshooting","title":"Troubleshooting","text":""},{"location":"zig-cli/#common-issues","title":"Common Issues","text":""},{"location":"zig-cli/#build-problems","title":"Build Problems","text":"<pre><code># Check Zig installation\nzig version\n\n# Clean build\ncd cli &amp;&amp; make clean &amp;&amp; make build\n</code></pre>"},{"location":"zig-cli/#connection-issues","title":"Connection Issues","text":"<pre><code># Test SSH connectivity\nssh -p $worker_port $worker_user@$worker_host\n\n# Verify configuration\ncat ~/.ml/config.toml\n</code></pre>"},{"location":"zig-cli/#sync-failures","title":"Sync Failures","text":"<pre><code># Check rsync\nrsync --version\n\n# Manual sync test\nrsync -avz ./test/ $worker_user@$worker_host:/tmp/\n</code></pre>"},{"location":"zig-cli/#performance-issues","title":"Performance Issues","text":"<pre><code># Monitor resource usage\ntop -p $(pgrep ml)\n\n# Check disk space\ndf -h $worker_base\n</code></pre>"},{"location":"zig-cli/#debug-mode","title":"Debug Mode","text":"<p>Enable verbose logging: <pre><code># Environment variable\nexport ML_DEBUG=1\nml sync ./project\n\n# Or use debug build\ncd cli &amp;&amp; make debug\n</code></pre></p>"},{"location":"zig-cli/#performance-benchmarks","title":"Performance Benchmarks","text":""},{"location":"zig-cli/#file-operations","title":"File Operations","text":"<ul> <li>Sync Speed: 100MB/s+ (network limited)</li> <li>Hash Computation: 500MB/s+ (CPU limited)</li> <li>Deduplication: 90%+ space savings</li> </ul>"},{"location":"zig-cli/#memory-usage","title":"Memory Usage","text":"<ul> <li>Base Memory: ~10MB</li> <li>Large Projects: ~50MB (1GB+ projects)</li> <li>Memory Efficiency: Constant per-file overhead</li> </ul>"},{"location":"zig-cli/#network-performance","title":"Network Performance","text":"<ul> <li>WebSocket Latency: &lt;10ms (local network)</li> <li>Connection Setup: &lt;100ms</li> <li>Throughput: Network limited</li> </ul>"},{"location":"zig-cli/#contributing","title":"Contributing","text":""},{"location":"zig-cli/#development-setup","title":"Development Setup","text":"<pre><code>cd cli\nzig build-exe src/main.zig\n</code></pre>"},{"location":"zig-cli/#testing","title":"Testing","text":"<pre><code># Run tests\ncd cli &amp;&amp; zig test src/\n\n# Integration tests\nzig test tests/\n</code></pre>"},{"location":"zig-cli/#code-style","title":"Code Style","text":"<ul> <li>Follow Zig style guidelines</li> <li>Use explicit error handling</li> <li>Document public APIs</li> <li>Add comprehensive tests</li> </ul> <p>For more information, see the CLI Reference and Architecture pages.</p>"}]}