- Fix YAML tags in auth config struct (json -> yaml)
- Update CLI configs to use pre-hashed API keys
- Remove double hashing in WebSocket client
- Fix port mapping (9102 -> 9103) in CLI commands
- Update permission keys to use jobs:read, jobs:create, etc.
- Clean up all debug logging from CLI and server
- All user roles now authenticate correctly:
  * Admin: Can queue jobs and see all jobs
  * Researcher: Can queue jobs and see own jobs
  * Analyst: Can see status (read-only access)

Multi-user authentication is now fully functional.
1 line
No EOL
169 KiB
JSON
{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Fetch ML - Secure Machine Learning Platform","text":"<p>A secure, containerized platform for running machine learning experiments with role-based access control and comprehensive audit trails.</p>"},{"location":"#quick-start","title":"Quick Start","text":"<p>New to the project? Start here!</p> <pre><code># Clone the repository\ngit clone https://github.com/your-username/fetch_ml.git\ncd fetch_ml\n\n# Quick setup (builds everything, creates test user)\nmake quick-start\n\n# Create your API key\n./bin/user_manager --config configs/config_dev.yaml --cmd generate-key --username your_name --role data_scientist\n\n# Run your first experiment\n./bin/worker --config configs/config_dev.yaml --api-key YOUR_GENERATED_KEY\n</code></pre>"},{"location":"#quick-navigation","title":"Quick Navigation","text":""},{"location":"#getting-started","title":"\ud83d\ude80 Getting Started","text":"<ul> <li>Getting Started Guide - Complete setup instructions</li> <li>Simple Install - Quick installation guide</li> </ul>"},{"location":"#security-authentication","title":"\ud83d\udd12 Security & Authentication","text":"<ul> <li>Security Overview - Security best practices</li> <li>API Key Process - Generate and manage API keys</li> <li>User Permissions - Role-based access control</li> </ul>"},{"location":"#configuration","title":"\u2699\ufe0f Configuration","text":"<ul> <li>Environment Variables - Configuration options</li> <li>Smart Defaults - Default configuration settings</li> </ul>"},{"location":"#development","title":"\ud83d\udee0\ufe0f Development","text":"<ul> <li>Architecture - System architecture and design</li> <li>CLI Reference - Command-line interface documentation</li> <li>Testing Guide - Testing procedures and guidelines</li> <li>Queue System - Job queue implementation</li> 
</ul>"},{"location":"#production-deployment","title":"\ud83c\udfed Production Deployment","text":"<ul> <li>Deployment Guide - Production deployment instructions</li> <li>Production Monitoring - Monitoring and observability</li> <li>Operations Guide - Production operations</li> </ul>"},{"location":"#features","title":"Features","text":"<ul> <li>\ud83d\udd10 Secure Authentication - RBAC with API keys, roles, and permissions</li> <li>\ud83d\udc33 Containerized - Podman-based secure execution environments</li> <li>\ud83d\uddc4\ufe0f Database Storage - SQLite backend for user management (optional)</li> <li>\ud83d\udccb Audit Trail - Complete logging of all actions</li> <li>\ud83d\ude80 Production Ready - Security audits, systemd services, log rotation</li> </ul>"},{"location":"#available-commands","title":"Available Commands","text":"<pre><code># Core commands\nmake help # See all available commands\nmake build # Build all binaries\nmake test-unit # Run tests\n\n# User management\n./bin/user_manager --config configs/config_dev.yaml --cmd generate-key --username new_user --role data_scientist\n./bin/user_manager --config configs/config_dev.yaml --cmd list-users\n\n# Run services\n./bin/worker --config configs/config_dev.yaml --api-key YOUR_KEY\n./bin/tui --config configs/config_dev.yaml\n./bin/data_manager --config configs/config_dev.yaml\n</code></pre>"},{"location":"#need-help","title":"Need Help?","text":"<ul> <li>\ud83d\udcd6 Documentation: Use the navigation menu on the left</li> <li>\u26a1 Quick help: <code>make help</code></li> <li>\ud83e\uddea Tests: <code>make test-unit</code></li> </ul> <p>Happy ML experimenting!</p>"},{"location":"api-key-process/","title":"FetchML API Key Process","text":"<p>This document describes how API keys are issued and how team members should configure the <code>ml</code> CLI to use them.</p> <p>The goal is to keep access easy for your homelab while treating API keys as sensitive 
secrets.</p>"},{"location":"api-key-process/#overview","title":"Overview","text":"<ul> <li>Each user gets a personal API key (no shared admin keys for normal use).</li> <li>API keys are used by the <code>ml</code> CLI to authenticate to the FetchML API.</li> <li>API keys and their SHA256 hashes must both be treated as secrets.</li> </ul> <p>There are two supported ways to receive your key:</p> <ol> <li>Bitwarden (recommended) \u2013 for users who already use Bitwarden.</li> <li>Direct share (minimal tools) \u2013 for users who do not use Bitwarden.</li> </ol>"},{"location":"api-key-process/#1-bitwarden-based-process-recommended","title":"1. Bitwarden-based process (recommended)","text":""},{"location":"api-key-process/#for-the-admin","title":"For the admin","text":"<ul> <li>Use the helper script to create a Bitwarden item for each user:</li> </ul> <pre><code>./scripts/create_bitwarden_fetchml_item.sh <username> <api_key> <api_key_hash>\n</code></pre> <p>This script:</p> <ul> <li>Creates a Bitwarden item named <code>FetchML API \u2013 <username></code>.</li> <li> <p>Stores:</p> <ul> <li>Username: <code><username></code></li> <li>Password: <code><api_key></code> (the actual API key)</li> <li>Custom field <code>api_key_hash</code>: <code><api_key_hash></code></li> </ul> </li> <li> <p>Share that item with the user in Bitwarden (for example, via a shared collection like <code>FetchML</code>).</p> </li> </ul>"},{"location":"api-key-process/#for-the-user","title":"For the user","text":"<ol> <li> <p>Open Bitwarden and locate the item:</p> </li> <li> <p>Name: <code>FetchML API \u2013 <your-name></code></p> </li> <li> <p>Copy the password field (this is your FetchML API key).</p> </li> <li> <p>Configure the CLI, e.g. 
in <code>~/.ml/config.toml</code>:</p> </li> </ol> <pre><code>api_key = \"<paste-from-bitwarden>\"\nworker_host = \"localhost\"\nworker_port = 9100\napi_url = \"ws://localhost:9100/ws\"\n</code></pre> <ol> <li>Test your setup:</li> </ol> <pre><code>ml status\n</code></pre> <p>If the command works, your key and tunnel/config are correct.</p>"},{"location":"api-key-process/#2-direct-share-no-password-manager-required","title":"2. Direct share (no password manager required)","text":"<p>For users who do not use Bitwarden, a lightweight alternative is a direct one-to-one share.</p>"},{"location":"api-key-process/#for-the-admin_1","title":"For the admin","text":"<ol> <li>Generate a per-user API key and hash as usual.</li> <li>Store them securely on your side (for example, in your own Bitwarden vault or configuration files).</li> <li> <p>Share only the API key with the user via a direct channel you both trust, such as:</p> </li> <li> <p>Signal / WhatsApp direct message</p> </li> <li>SMS</li> <li> <p>Short call/meeting where you read it to them</p> </li> <li> <p>Ask the user to:</p> </li> <li> <p>Paste the key into their local config.</p> </li> <li>Avoid keeping the key in plain chat history if possible.</li> </ol>"},{"location":"api-key-process/#for-the-user_1","title":"For the user","text":"<ol> <li>When you receive your FetchML API key from the admin, create or edit <code>~/.ml/config.toml</code>:</li> </ol> <pre><code>api_key = \"<your-api-key>\"\nworker_host = \"localhost\"\nworker_port = 9100\napi_url = \"ws://localhost:9100/ws\"\n</code></pre> <ol> <li>Save the file and run:</li> </ol> <pre><code>ml status\n</code></pre> <ol> <li>If it works, you are ready to use the CLI:</li> </ol> <pre><code>ml queue my-training-job\nml cancel my-training-job\n</code></pre>"},{"location":"api-key-process/#3-security-notes","title":"3. 
Security notes","text":"<ul> <li>API key and hash are secrets</li> <li>The 64-character <code>api_key_hash</code> is as sensitive as the API key itself.</li> <li> <p>Do not commit keys or hashes to Git or share them in screenshots or tickets.</p> </li> <li> <p>Rotation</p> </li> <li>If you suspect a key has leaked, notify the admin.</li> <li> <p>The admin will revoke the old key, generate a new one, and update Bitwarden or share a new key.</p> </li> <li> <p>Transport security</p> </li> <li>The <code>api_url</code> is typically <code>ws://localhost:9100/ws</code> when used through an SSH tunnel to the homelab.</li> <li>The SSH tunnel and nginx/TLS provide encryption over the network.</li> </ul> <p>Following these steps keeps API access easy for the team while maintaining a reasonable security posture for a personal homelab deployment.</p>"},{"location":"architecture/","title":"Homelab Architecture","text":"<p>Simple, secure architecture for ML experiments in your homelab.</p>"},{"location":"architecture/#components-overview","title":"Components Overview","text":"<pre><code>graph TB\n subgraph \"Homelab Stack\"\n CLI[Zig CLI]\n API[HTTPS API]\n REDIS[Redis Cache]\n FS[Local Storage]\n end\n\n CLI --> API\n API --> REDIS\n API --> FS\n</code></pre>"},{"location":"architecture/#core-services","title":"Core Services","text":""},{"location":"architecture/#api-server","title":"API Server","text":"<ul> <li>Purpose: Secure HTTPS API for ML experiments</li> <li>Port: 9101 (HTTPS only)</li> <li>Auth: API key authentication</li> <li>Security: Rate limiting, IP whitelisting</li> </ul>"},{"location":"architecture/#redis","title":"Redis","text":"<ul> <li>Purpose: Caching and job queuing</li> <li>Port: 6379 (localhost only)</li> <li>Storage: Temporary data only</li> <li>Persistence: Local volume</li> </ul>"},{"location":"architecture/#zig-cli","title":"Zig CLI","text":"<ul> <li>Purpose: High-performance experiment management</li> <li>Language: Zig for maximum speed and 
efficiency</li> <li>Features:</li> <li>Content-addressed storage with deduplication</li> <li>SHA256-based commit ID generation</li> <li>WebSocket communication for real-time updates</li> <li>Rsync-based incremental file transfers</li> <li>Multi-threaded operations</li> <li>Secure API key authentication</li> <li>Auto-sync monitoring with file system watching</li> <li>Priority-based job queuing</li> <li>Memory-efficient operations with arena allocators</li> </ul>"},{"location":"architecture/#security-architecture","title":"Security Architecture","text":"<pre><code>graph LR\n USER[User] --> AUTH[API Key Auth]\n AUTH --> RATE[Rate Limiting]\n RATE --> WHITELIST[IP Whitelist]\n WHITELIST --> API[Secure API]\n API --> AUDIT[Audit Logging]\n</code></pre>"},{"location":"architecture/#security-layers","title":"Security Layers","text":"<ol> <li>API Key Authentication - Hashed keys with roles</li> <li>Rate Limiting - 30 requests/minute</li> <li>IP Whitelisting - Local networks only</li> <li>Fail2Ban - Automatic IP blocking</li> <li>HTTPS/TLS - Encrypted communication</li> <li>Audit Logging - Complete action tracking</li> </ol>"},{"location":"architecture/#data-flow","title":"Data Flow","text":"<pre><code>sequenceDiagram\n participant CLI\n participant API\n participant Redis\n participant Storage\n\n CLI->>API: HTTPS Request\n API->>API: Validate Auth\n API->>Redis: Cache/Queue\n API->>Storage: Experiment Data\n Storage->>API: Results\n API->>CLI: Response\n</code></pre>"},{"location":"architecture/#deployment-options","title":"Deployment Options","text":""},{"location":"architecture/#docker-compose-recommended","title":"Docker Compose (Recommended)","text":"<pre><code>services:\n redis:\n image: redis:7-alpine\n ports: [\"6379:6379\"]\n volumes: [redis_data:/data]\n\n api-server:\n build: .\n ports: [\"9101:9101\"]\n depends_on: [redis]\n</code></pre>"},{"location":"architecture/#local-setup","title":"Local Setup","text":"<pre><code>./setup.sh && ./manage.sh 
start\n</code></pre>"},{"location":"architecture/#network-architecture","title":"Network Architecture","text":"<ul> <li>Private Network: Docker internal network</li> <li>Localhost Access: Redis only on localhost</li> <li>HTTPS API: Port 9101, TLS encrypted</li> <li>No External Dependencies: Everything runs locally</li> </ul>"},{"location":"architecture/#storage-architecture","title":"Storage Architecture","text":"<pre><code>data/\n\u251c\u2500\u2500 experiments/ # ML experiment results\n\u251c\u2500\u2500 cache/ # Temporary cache files\n\u2514\u2500\u2500 backups/ # Local backups\n\nlogs/\n\u251c\u2500\u2500 app.log # Application logs\n\u251c\u2500\u2500 audit.log # Security events\n\u2514\u2500\u2500 access.log # API access logs\n</code></pre>"},{"location":"architecture/#monitoring-architecture","title":"Monitoring Architecture","text":"<p>Simple, lightweight monitoring: - Health Checks: Service availability - Log Files: Structured logging - Basic Metrics: Request counts, error rates - Security Events: Failed auth, rate limits</p>"},{"location":"architecture/#homelab-benefits","title":"Homelab Benefits","text":"<ul> <li>\u2705 Simple Setup: One-command installation</li> <li>\u2705 Local Only: No external dependencies</li> <li>\u2705 Secure by Default: HTTPS, auth, rate limiting</li> <li>\u2705 Low Resource: Minimal CPU/memory usage</li> <li>\u2705 Easy Backup: Local file system</li> <li>\u2705 Privacy: Everything stays on your network</li> </ul>"},{"location":"architecture/#high-level-architecture","title":"High-Level Architecture","text":"<pre><code>graph TB\n subgraph \"Client Layer\"\n CLI[CLI Tools]\n TUI[Terminal UI]\n API[REST API]\n end\n\n subgraph \"Authentication Layer\"\n Auth[Authentication Service]\n RBAC[Role-Based Access Control]\n Perm[Permission Manager]\n end\n\n subgraph \"Core Services\"\n Worker[ML Worker Service]\n DataMgr[Data Manager Service]\n Queue[Job Queue]\n end\n\n subgraph \"Storage Layer\"\n Redis[(Redis Cache)]\n 
DB[(SQLite/PostgreSQL)]\n Files[File Storage]\n end\n\n subgraph \"Container Runtime\"\n Podman[Podman/Docker]\n Containers[ML Containers]\n end\n\n CLI --> Auth\n TUI --> Auth\n API --> Auth\n\n Auth --> RBAC\n RBAC --> Perm\n\n Worker --> Queue\n Worker --> DataMgr\n Worker --> Podman\n\n DataMgr --> DB\n DataMgr --> Files\n\n Queue --> Redis\n\n Podman --> Containers\n</code></pre>"},{"location":"architecture/#zig-cli-architecture","title":"Zig CLI Architecture","text":""},{"location":"architecture/#component-structure","title":"Component Structure","text":"<pre><code>graph TB\n subgraph \"Zig CLI Components\"\n Main[main.zig] --> Commands[commands/]\n Commands --> Config[config.zig]\n Commands --> Utils[utils/]\n Commands --> Net[net/]\n Commands --> Errors[errors.zig]\n\n subgraph \"Commands\"\n Init[init.zig]\n Sync[sync.zig]\n Queue[queue.zig]\n Watch[watch.zig]\n Status[status.zig]\n Monitor[monitor.zig]\n Cancel[cancel.zig]\n Prune[prune.zig]\n end\n\n subgraph \"Utils\"\n Crypto[crypto.zig]\n Storage[storage.zig]\n Rsync[rsync.zig]\n end\n\n subgraph \"Network\"\n WS[ws.zig]\n end\n end\n</code></pre>"},{"location":"architecture/#performance-optimizations","title":"Performance Optimizations","text":""},{"location":"architecture/#content-addressed-storage","title":"Content-Addressed Storage","text":"<ul> <li>Deduplication: Files stored by SHA256 hash</li> <li>Space Efficiency: Shared files across experiments</li> <li>Fast Lookup: Hash-based file retrieval</li> </ul>"},{"location":"architecture/#memory-management","title":"Memory Management","text":"<ul> <li>Arena Allocators: Efficient bulk allocation</li> <li>Zero-Copy Operations: Minimized memory copying</li> <li>Automatic Cleanup: Resource deallocation</li> </ul>"},{"location":"architecture/#network-communication","title":"Network Communication","text":"<ul> <li>WebSocket Protocol: Real-time bidirectional communication</li> <li>Connection Pooling: Reused connections</li> <li>Binary Messaging: Efficient 
data transfer</li> </ul>"},{"location":"architecture/#security-implementation","title":"Security Implementation","text":"<pre><code>graph LR\n subgraph \"CLI Security\"\n Config[Config File] --> Hash[SHA256 Hashing]\n Hash --> Auth[API Authentication]\n Auth --> SSH[SSH Transfer]\n SSH --> WS[WebSocket Security]\n end\n</code></pre>"},{"location":"architecture/#core-components","title":"Core Components","text":""},{"location":"architecture/#1-authentication-authorization","title":"1. Authentication & Authorization","text":"<pre><code>graph LR\n subgraph \"Auth Flow\"\n Client[Client] --> APIKey[API Key]\n APIKey --> Hash[Hash Validation]\n Hash --> Roles[Role Resolution]\n Roles --> Perms[Permission Check]\n Perms --> Access[Grant/Deny Access]\n end\n\n subgraph \"Permission Sources\"\n YAML[YAML Config]\n Inline[Inline Fallback]\n Roles --> YAML\n Roles --> Inline\n end\n</code></pre> <p>Features: - API key-based authentication - Role-based access control (RBAC) - YAML-based permission configuration - Fallback to inline permissions - Admin wildcard permissions</p>"},{"location":"architecture/#2-worker-service","title":"2. Worker Service","text":"<pre><code>graph TB\n subgraph \"Worker Architecture\"\n API[HTTP API] --> Router[Request Router]\n Router --> Auth[Auth Middleware]\n Auth --> Queue[Job Queue]\n Queue --> Processor[Job Processor]\n Processor --> Runtime[Container Runtime]\n Runtime --> Storage[Result Storage]\n\n subgraph \"Job Lifecycle\"\n Submit[Submit Job] --> Queue\n Queue --> Execute[Execute]\n Execute --> Monitor[Monitor]\n Monitor --> Complete[Complete]\n Complete --> Store[Store Results]\n end\n end\n</code></pre> <p>Responsibilities: - HTTP API for job submission - Job queue management - Container orchestration - Result collection and storage - Metrics and monitoring</p>"},{"location":"architecture/#3-data-manager-service","title":"3. 
Data Manager Service","text":"<pre><code>graph TB\n subgraph \"Data Management\"\n API[Data API] --> Storage[Storage Layer]\n Storage --> Metadata[Metadata DB]\n Storage --> Files[File System]\n Storage --> Cache[Redis Cache]\n\n subgraph \"Data Operations\"\n Upload[Upload Data] --> Validate[Validate]\n Validate --> Store[Store]\n Store --> Index[Index]\n Index --> Catalog[Catalog]\n end\n end\n</code></pre> <p>Features: - Data upload and validation - Metadata management - File system abstraction - Caching layer - Data catalog</p>"},{"location":"architecture/#4-terminal-ui-tui","title":"4. Terminal UI (TUI)","text":"<pre><code>graph TB\n subgraph \"TUI Architecture\"\n UI[UI Components] --> Model[Data Model]\n Model --> Update[Update Loop]\n Update --> Render[Render]\n\n subgraph \"UI Panels\"\n Jobs[Job List]\n Details[Job Details]\n Logs[Log Viewer]\n Status[Status Bar]\n end\n\n UI --> Jobs\n UI --> Details\n UI --> Logs\n UI --> Status\n end\n</code></pre> <p>Components: - Bubble Tea framework - Component-based architecture - Real-time updates - Keyboard navigation - Theme support</p>"},{"location":"architecture/#data-flow_1","title":"Data Flow","text":""},{"location":"architecture/#job-execution-flow","title":"Job Execution Flow","text":"<pre><code>sequenceDiagram\n participant Client\n participant Auth\n participant Worker\n participant Queue\n participant Container\n participant Storage\n\n Client->>Auth: Submit job with API key\n Auth->>Client: Validate and return job ID\n\n Client->>Worker: Execute job request\n Worker->>Queue: Queue job\n Queue->>Worker: Job ready\n Worker->>Container: Start ML container\n Container->>Worker: Execute experiment\n Worker->>Storage: Store results\n Worker->>Client: Return results\n</code></pre>"},{"location":"architecture/#authentication-flow","title":"Authentication Flow","text":"<pre><code>sequenceDiagram\n participant Client\n participant Auth\n participant PermMgr\n participant Config\n\n Client->>Auth: Request with 
API key\n Auth->>Auth: Validate key hash\n Auth->>PermMgr: Get user permissions\n PermMgr->>Config: Load YAML permissions\n Config->>PermMgr: Return permissions\n PermMgr->>Auth: Return resolved permissions\n Auth->>Client: Grant/deny access\n</code></pre>"},{"location":"architecture/#security-architecture_1","title":"Security Architecture","text":""},{"location":"architecture/#defense-in-depth","title":"Defense in Depth","text":"<pre><code>graph TB\n subgraph \"Security Layers\"\n Network[Network Security]\n Auth[Authentication]\n AuthZ[Authorization]\n Container[Container Security]\n Data[Data Protection]\n Audit[Audit Logging]\n end\n\n Network --> Auth\n Auth --> AuthZ\n AuthZ --> Container\n Container --> Data\n Data --> Audit\n</code></pre> <p>Security Features: - API key authentication - Role-based permissions - Container isolation - File system sandboxing - Comprehensive audit logs - Input validation and sanitization</p>"},{"location":"architecture/#container-security","title":"Container Security","text":"<pre><code>graph TB\n subgraph \"Container Isolation\"\n Host[Host System]\n Podman[Podman Runtime]\n Network[Network Isolation]\n FS[File System Isolation]\n User[User Namespaces]\n ML[ML Container]\n\n Host --> Podman\n Podman --> Network\n Podman --> FS\n Podman --> User\n User --> ML\n end\n</code></pre> <p>Isolation Features: - Rootless containers - Network isolation - File system sandboxing - User namespace mapping - Resource limits</p>"},{"location":"architecture/#configuration-architecture","title":"Configuration Architecture","text":""},{"location":"architecture/#configuration-hierarchy","title":"Configuration Hierarchy","text":"<pre><code>graph TB\n subgraph \"Config Sources\"\n Env[Environment Variables]\n File[Config Files]\n CLI[CLI Flags]\n Defaults[Default Values]\n end\n\n subgraph \"Config Processing\"\n Merge[Config Merger]\n Validate[Schema Validator]\n Apply[Config Applier]\n end\n\n Env --> Merge\n File --> Merge\n CLI --> Merge\n 
Defaults --> Merge\n\n Merge --> Validate\n Validate --> Apply\n</code></pre> <p>Configuration Priority: 1. CLI flags (highest) 2. Environment variables 3. Configuration files 4. Default values (lowest)</p>"},{"location":"architecture/#scalability-architecture","title":"Scalability Architecture","text":""},{"location":"architecture/#horizontal-scaling","title":"Horizontal Scaling","text":"<pre><code>graph TB\n subgraph \"Scaled Architecture\"\n LB[Load Balancer]\n W1[Worker 1]\n W2[Worker 2]\n W3[Worker N]\n Redis[Redis Cluster]\n Storage[Shared Storage]\n\n LB --> W1\n LB --> W2\n LB --> W3\n\n W1 --> Redis\n W2 --> Redis\n W3 --> Redis\n\n W1 --> Storage\n W2 --> Storage\n W3 --> Storage\n end\n</code></pre> <p>Scaling Features: - Stateless worker services - Shared job queue (Redis) - Distributed storage - Load balancer ready - Health checks and monitoring</p>"},{"location":"architecture/#technology-stack","title":"Technology Stack","text":""},{"location":"architecture/#backend-technologies","title":"Backend Technologies","text":"Component Technology Purpose Language Go 1.25+ Core application Web Framework Standard library HTTP server Authentication Custom API key + RBAC Database SQLite/PostgreSQL Metadata storage Cache Redis Job queue & caching Containers Podman/Docker Job isolation UI Framework Bubble Tea Terminal UI"},{"location":"architecture/#dependencies","title":"Dependencies","text":"<pre><code>// Core dependencies\nrequire (\n github.com/charmbracelet/bubbletea v1.3.10 // TUI framework\n github.com/go-redis/redis/v8 v8.11.5 // Redis client\n github.com/google/uuid v1.6.0 // UUID generation\n github.com/mattn/go-sqlite3 v1.14.32 // SQLite driver\n golang.org/x/crypto v0.45.0 // Crypto utilities\n gopkg.in/yaml.v3 v3.0.1 // YAML parsing\n)\n</code></pre>"},{"location":"architecture/#development-architecture","title":"Development Architecture","text":""},{"location":"architecture/#project-structure","title":"Project 
Structure","text":"<pre><code>fetch_ml/\n\u251c\u2500\u2500 cmd/ # CLI applications\n\u2502 \u251c\u2500\u2500 worker/ # ML worker service\n\u2502 \u251c\u2500\u2500 tui/ # Terminal UI\n\u2502 \u251c\u2500\u2500 data_manager/ # Data management\n\u2502 \u2514\u2500\u2500 user_manager/ # User management\n\u251c\u2500\u2500 internal/ # Internal packages\n\u2502 \u251c\u2500\u2500 auth/ # Authentication system\n\u2502 \u251c\u2500\u2500 config/ # Configuration management\n\u2502 \u251c\u2500\u2500 container/ # Container operations\n\u2502 \u251c\u2500\u2500 database/ # Database operations\n\u2502 \u251c\u2500\u2500 logging/ # Logging utilities\n\u2502 \u251c\u2500\u2500 metrics/ # Metrics collection\n\u2502 \u2514\u2500\u2500 network/ # Network utilities\n\u251c\u2500\u2500 configs/ # Configuration files\n\u251c\u2500\u2500 scripts/ # Setup and utility scripts\n\u251c\u2500\u2500 tests/ # Test suites\n\u2514\u2500\u2500 docs/ # Documentation\n</code></pre>"},{"location":"architecture/#package-dependencies","title":"Package Dependencies","text":"<pre><code>graph TB\n subgraph \"Application Layer\"\n Worker[cmd/worker]\n TUI[cmd/tui]\n DataMgr[cmd/data_manager]\n UserMgr[cmd/user_manager]\n end\n\n subgraph \"Service Layer\"\n Auth[internal/auth]\n Config[internal/config]\n Container[internal/container]\n Database[internal/database]\n end\n\n subgraph \"Utility Layer\"\n Logging[internal/logging]\n Metrics[internal/metrics]\n Network[internal/network]\n end\n\n Worker --> Auth\n Worker --> Config\n Worker --> Container\n TUI --> Auth\n DataMgr --> Database\n UserMgr --> Auth\n\n Auth --> Logging\n Container --> Network\n Database --> Metrics\n</code></pre>"},{"location":"architecture/#monitoring-observability","title":"Monitoring & Observability","text":""},{"location":"architecture/#metrics-collection","title":"Metrics Collection","text":"<pre><code>graph TB\n subgraph \"Metrics Pipeline\"\n App[Application] --> Metrics[Metrics Collector]\n Metrics --> Export[Prometheus 
Exporter]\n Export --> Prometheus[Prometheus Server]\n Prometheus --> Grafana[Grafana Dashboard]\n\n subgraph \"Metric Types\"\n Counter[Counters]\n Gauge[Gauges]\n Histogram[Histograms]\n Timer[Timers]\n end\n\n App --> Counter\n App --> Gauge\n App --> Histogram\n App --> Timer\n end\n</code></pre>"},{"location":"architecture/#logging-architecture","title":"Logging Architecture","text":"<pre><code>graph TB\n subgraph \"Logging Pipeline\"\n App[Application] --> Logger[Structured Logger]\n Logger --> File[File Output]\n Logger --> Console[Console Output]\n Logger --> Syslog[Syslog Forwarder]\n Syslog --> Aggregator[Log Aggregator]\n Aggregator --> Storage[Log Storage]\n Storage --> Viewer[Log Viewer]\n end\n</code></pre>"},{"location":"architecture/#deployment-architecture","title":"Deployment Architecture","text":""},{"location":"architecture/#container-deployment","title":"Container Deployment","text":"<pre><code>graph TB\n subgraph \"Deployment Stack\"\n Image[Container Image]\n Registry[Container Registry]\n Orchestrator[Docker Compose]\n Config[ConfigMaps/Secrets]\n Storage[Persistent Storage]\n\n Image --> Registry\n Registry --> Orchestrator\n Config --> Orchestrator\n Storage --> Orchestrator\n end\n</code></pre>"},{"location":"architecture/#service-discovery","title":"Service Discovery","text":"<pre><code>graph TB\n subgraph \"Service Mesh\"\n Gateway[API Gateway]\n Discovery[Service Discovery]\n Worker[Worker Service]\n Data[Data Service]\n Redis[Redis Cluster]\n\n Gateway --> Discovery\n Discovery --> Worker\n Discovery --> Data\n Discovery --> Redis\n end\n</code></pre>"},{"location":"architecture/#future-architecture-considerations","title":"Future Architecture Considerations","text":""},{"location":"architecture/#microservices-evolution","title":"Microservices Evolution","text":"<ul> <li>API Gateway: Centralized routing and authentication</li> <li>Service Mesh: Inter-service communication</li> <li>Event Streaming: Kafka for job events</li> 
<li>Distributed Tracing: OpenTelemetry integration</li> <li>Multi-tenant: Tenant isolation and quotas</li> </ul>"},{"location":"architecture/#homelab-features","title":"Homelab Features","text":"<ul> <li>Docker Compose: Simple container orchestration</li> <li>Local Development: Easy setup and testing</li> <li>Security: Built-in authentication and encryption</li> <li>Monitoring: Basic health checks and logging</li> </ul> <p>This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.</p>"},{"location":"cicd/","title":"CI/CD Pipeline","text":"<p>Automated testing, building, and releasing for fetch_ml.</p>"},{"location":"cicd/#workflows","title":"Workflows","text":""},{"location":"cicd/#ci-workflow-githubworkflowsciyml","title":"CI Workflow (<code>.github/workflows/ci.yml</code>)","text":"<p>Runs on every push to <code>main</code>/<code>develop</code> and all pull requests.</p> <p>Jobs: 1. test - Go backend tests with Redis 2. build - Build all binaries (Go + Zig CLI) 3. test-scripts - Validate deployment scripts 4. security-scan - Trivy and Gosec security scans 5. 
docker-build - Build and push Docker images (main branch only)</p> <p>Test Coverage: - Go unit tests with race detection - <code>internal/queue</code> package tests - Zig CLI tests - Integration tests - Security audits</p>"},{"location":"cicd/#release-workflow-githubworkflowsreleaseyml","title":"Release Workflow (<code>.github/workflows/release.yml</code>)","text":"<p>Runs on version tags (e.g., <code>v1.0.0</code>).</p> <p>Jobs:</p> <ol> <li>build-cli (matrix build)</li> <li>Linux x86_64 (static musl)</li> <li>macOS x86_64</li> <li>macOS ARM64</li> <li>Downloads platform-specific static rsync</li> <li> <p>Embeds rsync for zero-dependency releases</p> </li> <li> <p>build-go-backends</p> </li> <li>Cross-platform Go builds</li> <li> <p>api-server, worker, tui, data_manager, user_manager</p> </li> <li> <p>create-release</p> </li> <li>Collects all artifacts</li> <li>Generates SHA256 checksums</li> <li>Creates GitHub release with notes</li> </ol>"},{"location":"cicd/#release-process","title":"Release Process","text":""},{"location":"cicd/#creating-a-release","title":"Creating a Release","text":"<pre><code># 1. Update version\ngit tag v1.0.0\n\n# 2. Push tag\ngit push origin v1.0.0\n\n# 3. 
CI automatically builds and releases\n</code></pre>"},{"location":"cicd/#release-artifacts","title":"Release Artifacts","text":"<p>CLI Binaries (with embedded rsync): - <code>ml-linux-x86_64.tar.gz</code> (~450-650KB) - <code>ml-macos-x86_64.tar.gz</code> (~450-650KB) - <code>ml-macos-arm64.tar.gz</code> (~450-650KB)</p> <p>Go Backends: - <code>fetch_ml_api-server.tar.gz</code> - <code>fetch_ml_worker.tar.gz</code> - <code>fetch_ml_tui.tar.gz</code> - <code>fetch_ml_data_manager.tar.gz</code> - <code>fetch_ml_user_manager.tar.gz</code></p> <p>Checksums: - <code>checksums.txt</code> - Combined SHA256 sums - Individual <code>.sha256</code> files per binary</p>"},{"location":"cicd/#development-workflow","title":"Development Workflow","text":""},{"location":"cicd/#local-testing","title":"Local Testing","text":"<pre><code># Run all tests\nmake test\n\n# Run specific package tests\ngo test ./internal/queue/...\n\n# Build CLI\ncd cli && zig build dev\n\n# Run formatters and linters\nmake lint\n\n# Security scans are handled automatically in CI by the `security-scan` job\n</code></pre>"},{"location":"cicd/#optional-heavy-end-to-end-tests","title":"Optional heavy end-to-end tests","text":"<p>Some e2e tests exercise full Docker deployments and performance scenarios and are skipped by default to keep local/CI runs fast. 
You can enable them explicitly with environment variables:</p> <pre><code># Run Docker deployment e2e tests\nFETCH_ML_E2E_DOCKER=1 go test ./tests/e2e/...\n\n# Run performance-oriented e2e tests\nFETCH_ML_E2E_PERF=1 go test ./tests/e2e/...\n</code></pre> <p>Without these variables, <code>TestDockerDeploymentE2E</code> and <code>TestPerformanceE2E</code> will <code>t.Skip</code>, while all lighter e2e tests still run.</p>"},{"location":"cicd/#pull-request-checks","title":"Pull Request Checks","text":"<p>All PRs must pass: - \u2705 Go tests (with Redis) - \u2705 CLI tests - \u2705 Security scans - \u2705 Code linting - \u2705 Build verification</p>"},{"location":"cicd/#configuration","title":"Configuration","text":""},{"location":"cicd/#environment-variables","title":"Environment Variables","text":"<pre><code>GO_VERSION: '1.25.0'\nZIG_VERSION: '0.15.2'\n</code></pre>"},{"location":"cicd/#secrets","title":"Secrets","text":"<p>Required for releases: - <code>GITHUB_TOKEN</code> - Automatic, provided by GitHub Actions</p>"},{"location":"cicd/#monitoring","title":"Monitoring","text":""},{"location":"cicd/#build-status","title":"Build Status","text":"<p>Check workflow runs at: <pre><code>https://github.com/jfraeys/fetch_ml/actions\n</code></pre></p>"},{"location":"cicd/#artifacts","title":"Artifacts","text":"<p>Download build artifacts from: - Successful workflow runs (30-day retention) - GitHub Releases (permanent)</p> <p>For implementation details: - .github/workflows/ci.yml - .github/workflows/release.yml</p>"},{"location":"cli-reference/","title":"Fetch ML CLI Reference","text":"<p>Comprehensive command-line tools for managing ML experiments in your homelab with Zig-based high-performance CLI.</p>"},{"location":"cli-reference/#overview","title":"Overview","text":"<p>Fetch ML provides a comprehensive CLI toolkit built with performance and security in mind:</p> <ul> <li>Zig CLI - High-performance experiment management written in Zig</li> <li>Go Commands - API server, 
TUI, and data management utilities</li> <li>Management Scripts - Service orchestration and deployment</li> <li>Setup Scripts - One-command installation and configuration</li> </ul>"},{"location":"cli-reference/#zig-cli-clizig-outbinml","title":"Zig CLI (<code>./cli/zig-out/bin/ml</code>)","text":"<p>High-performance command-line interface for experiment management, written in Zig for speed and efficiency.</p>"},{"location":"cli-reference/#available-commands","title":"Available Commands","text":"Command Description Example <code>init</code> Interactive configuration setup <code>ml init</code> <code>sync</code> Sync project to worker with deduplication <code>ml sync ./project --name myjob --queue</code> <code>queue</code> Queue job for execution <code>ml queue myjob --commit abc123 --priority 8</code> <code>status</code> Get system and worker status <code>ml status</code> <code>monitor</code> Launch TUI monitoring via SSH <code>ml monitor</code> <code>cancel</code> Cancel running job <code>ml cancel job123</code> <code>prune</code> Clean up old experiments <code>ml prune --keep 10</code> <code>watch</code> Auto-sync directory on changes <code>ml watch ./project --queue</code>"},{"location":"cli-reference/#command-details","title":"Command Details","text":""},{"location":"cli-reference/#init-configuration-setup","title":"<code>init</code> - Configuration Setup","text":"<p><pre><code>ml init\n</code></pre> Creates a configuration template at <code>~/.ml/config.toml</code> with: - Worker connection details - API authentication - Base paths and ports</p>"},{"location":"cli-reference/#sync-project-synchronization","title":"<code>sync</code> - Project Synchronization","text":"<pre><code># Basic sync\nml sync ./my-project\n\n# Sync with custom name and queue\nml sync ./my-project --name \"experiment-1\" --queue\n\n# Sync with priority\nml sync ./my-project --priority 9\n</code></pre> <p>Features: - Content-addressed storage for deduplication - SHA256 commit ID generation - 
Rsync-based file transfer - Automatic queuing (with <code>--queue</code> flag)</p>"},{"location":"cli-reference/#queue-job-management","title":"<code>queue</code> - Job Management","text":"<pre><code># Queue with commit ID\nml queue my-job --commit abc123def456\n\n# Queue with priority (1-10, default 5)\nml queue my-job --commit abc123 --priority 8\n</code></pre> <p>Features: - WebSocket-based communication - Priority queuing system - API key authentication</p>"},{"location":"cli-reference/#watch-auto-sync-monitoring","title":"<code>watch</code> - Auto-Sync Monitoring","text":"<pre><code># Watch directory for changes\nml watch ./project\n\n# Watch and auto-queue on changes\nml watch ./project --name \"dev-exp\" --queue\n</code></pre> <p>Features: - Real-time file system monitoring - Automatic re-sync on changes - Configurable polling interval (2 seconds) - Commit ID comparison for efficiency</p>"},{"location":"cli-reference/#prune-cleanup-management","title":"<code>prune</code> - Cleanup Management","text":"<pre><code># Keep last N experiments\nml prune --keep 20\n\n# Remove experiments older than N days\nml prune --older-than 30\n</code></pre>"},{"location":"cli-reference/#monitor-remote-monitoring","title":"<code>monitor</code> - Remote Monitoring","text":"<p><pre><code>ml monitor\n</code></pre> Launches TUI interface via SSH for real-time monitoring.</p>"},{"location":"cli-reference/#cancel-job-cancellation","title":"<code>cancel</code> - Job Cancellation","text":"<p><pre><code>ml cancel running-job-id\n</code></pre> Cancels currently running jobs by ID.</p>"},{"location":"cli-reference/#configuration","title":"Configuration","text":"<p>The Zig CLI reads configuration from <code>~/.ml/config.toml</code>:</p> <pre><code>worker_host = \"worker.local\"\nworker_user = \"mluser\"\nworker_base = \"/data/ml-experiments\"\nworker_port = 22\napi_key = \"your-api-key\"\n</code></pre>"},{"location":"cli-reference/#performance-features","title":"Performance 
Features","text":"<ul> <li>Content-Addressed Storage: Automatic deduplication of identical files</li> <li>Incremental Sync: Only transfers changed files</li> <li>SHA256 Hashing: Reliable commit ID generation</li> <li>WebSocket Communication: Efficient real-time messaging</li> <li>Multi-threaded: Concurrent operations where applicable</li> </ul>"},{"location":"cli-reference/#go-commands","title":"Go Commands","text":""},{"location":"cli-reference/#api-server-cmdapi-servermaingo","title":"API Server (<code>./cmd/api-server/main.go</code>)","text":"<p>Main HTTPS API server for experiment management.</p> <pre><code># Build and run\ngo run ./cmd/api-server/main.go\n\n# With configuration\n./bin/api-server --config configs/config-local.yaml\n</code></pre> <p>Features: - HTTPS-only communication - API key authentication - Rate limiting and IP whitelisting - WebSocket support for real-time updates - Redis integration for caching</p>"},{"location":"cli-reference/#tui-cmdtuimaingo","title":"TUI (<code>./cmd/tui/main.go</code>)","text":"<p>Terminal User Interface for monitoring experiments.</p> <pre><code># Launch TUI\ngo run ./cmd/tui/main.go\n\n# With custom config\n./tui --config configs/config-local.yaml\n</code></pre> <p>Features: - Real-time experiment monitoring - Interactive job management - Status visualization - Log viewing</p>"},{"location":"cli-reference/#data-manager-cmddata_manager","title":"Data Manager (<code>./cmd/data_manager/</code>)","text":"<p>Utilities for data synchronization and management.</p> <pre><code># Sync data\n./data_manager --sync ./data\n\n# Clean old data\n./data_manager --cleanup --older-than 30d\n</code></pre>"},{"location":"cli-reference/#config-lint-cmdconfiglintmaingo","title":"Config Lint (<code>./cmd/configlint/main.go</code>)","text":"<p>Configuration validation and linting tool.</p> <pre><code># Validate configuration\n./configlint configs/config-local.yaml\n\n# Check schema compliance\n./configlint --schema 
configs/schema/config_schema.yaml\n</code></pre>"},{"location":"cli-reference/#management-script-toolsmanagesh","title":"Management Script (<code>./tools/manage.sh</code>)","text":"<p>Simple service management for your homelab.</p>"},{"location":"cli-reference/#commands","title":"Commands","text":"<pre><code>./tools/manage.sh start # Start all services\n./tools/manage.sh stop # Stop all services\n./tools/manage.sh status # Check service status\n./tools/manage.sh logs # View logs\n./tools/manage.sh monitor # Basic monitoring\n./tools/manage.sh security # Security status\n./tools/manage.sh cleanup # Clean project artifacts\n</code></pre>"},{"location":"cli-reference/#setup-script-setupsh","title":"Setup Script (<code>./setup.sh</code>)","text":"<p>One-command homelab setup.</p>"},{"location":"cli-reference/#usage","title":"Usage","text":"<pre><code># Full setup\n./setup.sh\n\n# Setup includes:\n# - SSL certificate generation\n# - Configuration creation\n# - Build all components\n# - Start Redis\n# - Setup Fail2Ban (if available)\n</code></pre>"},{"location":"cli-reference/#api-testing","title":"API Testing","text":"<p>Test the API with curl:</p> <pre><code># Health check\ncurl -k -H 'X-API-Key: password' https://localhost:9101/health\n\n# List experiments\ncurl -k -H 'X-API-Key: password' https://localhost:9101/experiments\n\n# Submit experiment\ncurl -k -X POST -H 'X-API-Key: password' \\\n -H 'Content-Type: application/json' \\\n -d '{\"name\":\"test\",\"config\":{\"type\":\"basic\"}}' \\\n https://localhost:9101/experiments\n</code></pre>"},{"location":"cli-reference/#zig-cli-architecture","title":"Zig CLI Architecture","text":"<p>The Zig CLI is designed for performance and reliability:</p>"},{"location":"cli-reference/#core-components","title":"Core Components","text":"<ul> <li>Commands (<code>cli/src/commands/</code>): Individual command implementations</li> <li>Config (<code>cli/src/config.zig</code>): Configuration management</li> <li>Network 
(<code>cli/src/net/ws.zig</code>): WebSocket client implementation</li> <li>Utils (<code>cli/src/utils/</code>): Cryptography, storage, and rsync utilities</li> <li>Errors (<code>cli/src/errors.zig</code>): Centralized error handling</li> </ul>"},{"location":"cli-reference/#performance-optimizations","title":"Performance Optimizations","text":"<ul> <li>Content-Addressed Storage: Deduplicates identical files across experiments</li> <li>SHA256 Hashing: Fast, reliable commit ID generation</li> <li>Rsync Integration: Efficient incremental file transfers</li> <li>WebSocket Protocol: Low-latency communication with worker</li> <li>Memory Management: Efficient allocation with Zig's allocator system</li> </ul>"},{"location":"cli-reference/#security-features","title":"Security Features","text":"<ul> <li>API Key Hashing: Secure authentication token handling</li> <li>SSH Integration: Secure file transfers</li> <li>Input Validation: Comprehensive argument checking</li> <li>Error Handling: Secure error reporting without information leakage</li> </ul>"},{"location":"cli-reference/#configuration_1","title":"Configuration","text":"<p>Main configuration file: <code>configs/config-local.yaml</code></p>"},{"location":"cli-reference/#key-settings","title":"Key Settings","text":"<pre><code>auth:\n enabled: true\n api_keys:\n homelab_user:\n hash: \"5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8\"\n admin: true\n\nserver:\n address: \":9101\"\n tls:\n enabled: true\n cert_file: \"./ssl/cert.pem\"\n key_file: \"./ssl/key.pem\"\n\nsecurity:\n rate_limit:\n enabled: true\n requests_per_minute: 30\n ip_whitelist:\n - \"127.0.0.1\"\n - \"::1\"\n - \"192.168.0.0/16\"\n - \"10.0.0.0/8\"\n</code></pre>"},{"location":"cli-reference/#docker-commands","title":"Docker Commands","text":"<p>If using Docker Compose:</p> <pre><code># Start services (testing only)\ndocker-compose up -d\n\n# View logs\ndocker-compose logs -f\n\n# Stop services\ndocker-compose down\n\n# Check 
status\ndocker-compose ps\n</code></pre>"},{"location":"cli-reference/#troubleshooting","title":"Troubleshooting","text":""},{"location":"cli-reference/#common-issues","title":"Common Issues","text":"<p>Zig CLI not found: <pre><code># Build the CLI\ncd cli && make build\n\n# Check binary exists\nls -la ./cli/zig-out/bin/ml\n</code></pre></p> <p>Configuration not found: <pre><code># Create configuration\n./cli/zig-out/bin/ml init\n\n# Check config file\nls -la ~/.ml/config.toml\n</code></pre></p> <p>Worker connection failed: <pre><code># Test SSH connection\nssh -p 22 mluser@worker.local\n\n# Check configuration\ncat ~/.ml/config.toml\n</code></pre></p> <p>Sync not working: <pre><code># Check rsync availability\nrsync --version\n\n# Test manual sync\nrsync -avz ./project/ mluser@worker.local:/tmp/test/\n</code></pre></p> <p>WebSocket connection failed: <pre><code># Check worker WebSocket port\ntelnet worker.local 9100\n\n# Verify API key\n./cli/zig-out/bin/ml status\n</code></pre></p> <p>API not responding: <pre><code>./tools/manage.sh status\n./tools/manage.sh logs\n</code></pre></p> <p>Authentication failed: <pre><code># Check API key in config-local.yaml\ngrep -A 5 \"api_keys:\" configs/config-local.yaml\n</code></pre></p> <p>Redis connection failed: <pre><code># Check Redis status\nredis-cli ping\n\n# Start Redis\nredis-server\n</code></pre></p>"},{"location":"cli-reference/#getting-help","title":"Getting Help","text":"<pre><code># CLI help\n./cli/zig-out/bin/ml help\n\n# Management script help\n./tools/manage.sh help\n\n# Check all available commands\nmake help\n</code></pre> <p>That's it for the CLI reference! For complete setup instructions, see the main index.</p>"},{"location":"configuration-schema/","title":"Configuration Schema","text":"<p>Complete reference for Fetch ML configuration options.</p>"},{"location":"configuration-schema/#configuration-file-structure","title":"Configuration File Structure","text":"<p>Fetch ML uses YAML configuration files. 
The main configuration file is typically <code>config.yaml</code>.</p>"},{"location":"configuration-schema/#full-schema","title":"Full Schema","text":"<pre><code># Server Configuration\nserver:\n address: \":9101\"\n tls:\n enabled: false\n cert_file: \"\"\n key_file: \"\"\n\n# Database Configuration\ndatabase:\n type: \"sqlite\" # sqlite, postgres, mysql\n connection: \"fetch_ml.db\"\n host: \"localhost\"\n port: 5432\n username: \"postgres\"\n password: \"\"\n database: \"fetch_ml\"\n\n# Redis Configuration\n</code></pre> <p>Quick Reference</p> <p>Database Types:</p> <ul> <li>SQLite: <code>type: sqlite, connection: file.db</code></li> <li>PostgreSQL: <code>type: postgres, host: localhost, port: 5432</code></li> </ul> <p>Key Settings:</p> <ul> <li><code>server.address: :9101</code></li> <li><code>database.type: sqlite</code></li> <li><code>redis.addr: localhost:6379</code></li> <li><code>auth.enabled: true</code></li> <li><code>logging.level: info</code></li> </ul> <p>Environment Override:</p> <pre><code>export FETCHML_SERVER_ADDRESS=:8080\nexport FETCHML_DATABASE_TYPE=postgres\n</code></pre>"},{"location":"configuration-schema/#validation","title":"Validation","text":"<pre><code>make configlint\n</code></pre>"},{"location":"deployment/","title":"ML Experiment Manager - Deployment Guide","text":""},{"location":"deployment/#overview","title":"Overview","text":"<p>The ML Experiment Manager supports multiple deployment methods, from local development to homelab Docker setups.</p>"},{"location":"deployment/#quick-start","title":"Quick Start","text":""},{"location":"deployment/#docker-compose-recommended-for-development","title":"Docker Compose (Recommended for Development)","text":"<pre><code># Clone repository\ngit clone https://github.com/your-org/fetch_ml.git\ncd fetch_ml\n\n# Start all services (testing only)\ndocker-compose up -d\n\n# Check status\ndocker-compose ps\n\n# View logs\ndocker-compose logs -f api-server\n</code></pre> <p>Access the API at <code>http://localhost:9100</code></p>"},{"location":"deployment/#deployment-options","title":"Deployment 
Options","text":""},{"location":"deployment/#1-local-development","title":"1. Local Development","text":""},{"location":"deployment/#prerequisites","title":"Prerequisites","text":"<p>Container Runtimes:</p> <ul> <li>Docker Compose: For testing and development only</li> <li>Podman: For production experiment execution</li> </ul> <ul> <li>Go 1.25+</li> <li>Zig 0.15.2</li> <li>Redis 7+</li> <li>Docker & Docker Compose (optional)</li> </ul>"},{"location":"deployment/#manual-setup","title":"Manual Setup","text":"<pre><code># Start Redis\nredis-server\n\n# Build and run Go server\ngo build -o bin/api-server ./cmd/api-server\n./bin/api-server -config configs/config-local.yaml\n\n# Build Zig CLI\ncd cli\nzig build prod\n./zig-out/bin/ml --help\n</code></pre>"},{"location":"deployment/#2-docker-deployment","title":"2. Docker Deployment","text":""},{"location":"deployment/#build-image","title":"Build Image","text":"<pre><code>docker build -t ml-experiment-manager:latest .\n</code></pre>"},{"location":"deployment/#run-container","title":"Run Container","text":"<pre><code>docker run -d \\\n --name ml-api \\\n -p 9100:9100 \\\n -p 9101:9101 \\\n -v $(pwd)/configs:/app/configs:ro \\\n -v experiment-data:/data/ml-experiments \\\n ml-experiment-manager:latest\n</code></pre>"},{"location":"deployment/#docker-compose","title":"Docker Compose","text":"<pre><code># Production mode\ndocker-compose -f docker-compose.yml up -d\n\n# Development mode with logs\ndocker-compose -f docker-compose.yml up\n</code></pre>"},{"location":"deployment/#3-homelab-setup","title":"3. Homelab Setup","text":"<pre><code># Use the simple setup script\n./setup.sh\n\n# Or manually with Docker Compose (testing only)\ndocker-compose up -d\n</code></pre>"},{"location":"deployment/#4-cloud-deployment","title":"4. 
Cloud Deployment","text":""},{"location":"deployment/#aws-ecs","title":"AWS ECS","text":"<pre><code># Build and push to ECR\naws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY\ndocker build -t $ECR_REGISTRY/ml-experiment-manager:latest .\ndocker push $ECR_REGISTRY/ml-experiment-manager:latest\n\n# Deploy with ECS CLI\necs-cli compose --project-name ml-experiment-manager up\n</code></pre>"},{"location":"deployment/#google-cloud-run","title":"Google Cloud Run","text":"<pre><code># Build and push\ngcloud builds submit --tag gcr.io/$PROJECT_ID/ml-experiment-manager\n\n# Deploy\ngcloud run deploy ml-experiment-manager \\\n --image gcr.io/$PROJECT_ID/ml-experiment-manager \\\n --platform managed \\\n --region us-central1 \\\n --allow-unauthenticated\n</code></pre>"},{"location":"deployment/#configuration","title":"Configuration","text":""},{"location":"deployment/#environment-variables","title":"Environment Variables","text":"<pre><code># configs/config-local.yaml\nbase_path: \"/data/ml-experiments\"\nauth:\n enabled: true\n api_keys:\n - \"your-production-api-key\"\nserver:\n address: \":9100\"\n tls:\n enabled: true\n cert_file: \"/app/ssl/cert.pem\"\n key_file: \"/app/ssl/key.pem\"\n</code></pre>"},{"location":"deployment/#docker-compose-environment","title":"Docker Compose Environment","text":"<pre><code># docker-compose.yml\nversion: '3.8'\nservices:\n api-server:\n environment:\n - REDIS_URL=redis://redis:6379\n - LOG_LEVEL=info\n volumes:\n - ./configs:/configs:ro\n - ./data:/data/experiments\n</code></pre>"},{"location":"deployment/#monitoring-logging","title":"Monitoring & Logging","text":""},{"location":"deployment/#health-checks","title":"Health Checks","text":"<ul> <li>HTTP: <code>GET /health</code></li> <li>WebSocket: Connection test</li> <li>Redis: Ping check</li> </ul>"},{"location":"deployment/#metrics","title":"Metrics","text":"<ul> <li>Prometheus metrics at <code>/metrics</code></li> <li>Custom application 
metrics</li> <li>Container resource usage</li> </ul>"},{"location":"deployment/#logging","title":"Logging","text":"<ul> <li>Structured JSON logging</li> <li>Log levels: DEBUG, INFO, WARN, ERROR</li> <li>Centralized logging via ELK stack</li> </ul>"},{"location":"deployment/#security","title":"Security","text":""},{"location":"deployment/#tls-configuration","title":"TLS Configuration","text":"<pre><code># Generate self-signed cert (development)\nopenssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes\n\n# Production - use Let's Encrypt\ncertbot certonly --standalone -d ml-experiments.example.com\n</code></pre>"},{"location":"deployment/#network-security","title":"Network Security","text":"<ul> <li>Firewall rules (ports 9100, 9101, 6379)</li> <li>VPN access for internal services</li> <li>API key authentication</li> <li>Rate limiting</li> </ul>"},{"location":"deployment/#performance-tuning","title":"Performance Tuning","text":""},{"location":"deployment/#resource-allocation","title":"Resource Allocation","text":"<pre><code>resources:\n requests:\n memory: \"256Mi\"\n cpu: \"250m\"\n limits:\n memory: \"1Gi\"\n cpu: \"1000m\"\n</code></pre>"},{"location":"deployment/#scaling-strategies","title":"Scaling Strategies","text":"<ul> <li>Horizontal pod autoscaling</li> <li>Redis clustering</li> <li>Load balancing</li> <li>CDN for static assets</li> </ul>"},{"location":"deployment/#backup-recovery","title":"Backup & Recovery","text":""},{"location":"deployment/#data-backup","title":"Data Backup","text":"<pre><code># Backup experiment data\ndocker-compose exec redis redis-cli BGSAVE\ndocker cp $(docker-compose ps -q redis):/data/dump.rdb ./redis-backup.rdb\n\n# Backup data volume\ndocker run --rm -v ml-experiments_redis_data:/data -v $(pwd):/backup alpine tar czf /backup/redis-backup.tar.gz -C /data .\n</code></pre>"},{"location":"deployment/#disaster-recovery","title":"Disaster Recovery","text":"<ol> <li>Restore Redis data</li> <li>Restart 
services</li> <li>Verify experiment metadata</li> <li>Test API endpoints</li> </ol>"},{"location":"deployment/#troubleshooting","title":"Troubleshooting","text":""},{"location":"deployment/#common-issues","title":"Common Issues","text":""},{"location":"deployment/#api-server-not-starting","title":"API Server Not Starting","text":"<pre><code># Check logs\ndocker-compose logs api-server\n\n# Check configuration\ncat configs/config-local.yaml\n\n# Check Redis connection\ndocker-compose exec redis redis-cli ping\n</code></pre>"},{"location":"deployment/#websocket-connection-issues","title":"WebSocket Connection Issues","text":"<pre><code># Test WebSocket\nwscat -c ws://localhost:9100/ws\n\n# Check TLS\nopenssl s_client -connect localhost:9101 -servername localhost\n</code></pre>"},{"location":"deployment/#performance-issues","title":"Performance Issues","text":"<pre><code># Check resource usage\ndocker-compose exec api-server ps aux\n\n# Check Redis memory\ndocker-compose exec redis redis-cli info memory\n</code></pre>"},{"location":"deployment/#debug-mode","title":"Debug Mode","text":"<pre><code># Enable debug logging\nexport LOG_LEVEL=debug\n./bin/api-server -config configs/config-local.yaml\n</code></pre>"},{"location":"deployment/#cicd-integration","title":"CI/CD Integration","text":""},{"location":"deployment/#github-actions","title":"GitHub Actions","text":"<ul> <li>Automated testing on PR</li> <li>Multi-platform builds</li> <li>Security scanning</li> <li>Automatic releases</li> </ul>"},{"location":"deployment/#deployment-pipeline","title":"Deployment Pipeline","text":"<ol> <li>Code commit \u2192 GitHub</li> <li>CI/CD pipeline triggers</li> <li>Build and test</li> <li>Security scan</li> <li>Deploy to staging</li> <li>Run integration tests</li> <li>Deploy to production</li> <li>Post-deployment verification</li> </ol>"},{"location":"deployment/#support","title":"Support","text":"<p>For deployment issues: 1. Check this guide 2. Review logs 3. Check GitHub Issues 4. 
Contact maintainers</p>"},{"location":"development-setup/","title":"Development Setup","text":"<p>Set up your local development environment for Fetch ML.</p>"},{"location":"development-setup/#prerequisites","title":"Prerequisites","text":"<p>Container Runtimes:</p> <ul> <li>Docker Compose: For testing and development only</li> <li>Podman: For production experiment execution</li> </ul> <ul> <li>Go 1.21+</li> <li>Zig 0.11+</li> <li>Docker Compose (testing only)</li> <li>Redis (or use Docker)</li> <li>Git</li> </ul>"},{"location":"development-setup/#quick-setup","title":"Quick Setup","text":"<pre><code># Clone repository\ngit clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\n\n# Start dependencies (Redis, Postgres): see Quick Start (quick-start.md) for Docker setup\n\n# Build all components\nmake build\n\n# Run tests: see the Testing Guide (testing.md)\n</code></pre>"},{"location":"development-setup/#detailed-setup","title":"Detailed Setup","text":""},{"location":"development-setup/#quick-start","title":"Quick Start","text":"<pre><code>git clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\n# Start dependencies: see Quick Start (quick-start.md) for Docker setup\nmake build\n# Run tests: see the Testing Guide (testing.md)\n</code></pre>"},{"location":"development-setup/#key-commands","title":"Key Commands","text":"<ul> <li><code>make build</code> - Build all components</li> <li>Run tests - see the Testing Guide (testing.md)</li> <li><code>make dev</code> - Development build</li> <li>Build CLI - see the CLI Reference (cli-reference.md) and Zig CLI (zig-cli.md)</li> </ul>"},{"location":"development-setup/#common-issues","title":"Common Issues","text":"<ul> <li>Build fails: <code>go mod tidy</code></li> <li>Zig errors: <code>cd cli && rm -rf zig-out zig-cache</code></li> <li>Port conflicts: <code>lsof -i :9101</code></li> </ul>"},{"location":"environment-variables/","title":"Environment Variables","text":"<p>Fetch ML supports environment variables for configuration, allowing you to override config 
file settings and deploy in different environments.</p>"},{"location":"environment-variables/#priority-order","title":"Priority Order","text":"<ol> <li>Environment variables (highest priority)</li> <li>Configuration file values</li> <li>Default values (lowest priority)</li> </ol>"},{"location":"environment-variables/#variable-prefixes","title":"Variable Prefixes","text":""},{"location":"environment-variables/#general-configuration","title":"General Configuration","text":"<ul> <li><code>FETCH_ML_*</code> - General server and application settings</li> </ul>"},{"location":"environment-variables/#cli-configuration","title":"CLI Configuration","text":"<ul> <li><code>FETCH_ML_CLI_*</code> - CLI-specific settings (overrides <code>~/.ml/config.toml</code>)</li> </ul>"},{"location":"environment-variables/#tui-configuration","title":"TUI Configuration","text":"<ul> <li><code>FETCH_ML_TUI_*</code> - TUI-specific settings (overrides TUI config file)</li> </ul>"},{"location":"environment-variables/#cli-environment-variables","title":"CLI Environment Variables","text":"Variable Config Field Example <code>FETCH_ML_CLI_HOST</code> <code>worker_host</code> <code>localhost</code> <code>FETCH_ML_CLI_USER</code> <code>worker_user</code> <code>mluser</code> <code>FETCH_ML_CLI_BASE</code> <code>worker_base</code> <code>/opt/ml</code> <code>FETCH_ML_CLI_PORT</code> <code>worker_port</code> <code>22</code> <code>FETCH_ML_CLI_API_KEY</code> <code>api_key</code> <code>your-api-key-here</code>"},{"location":"environment-variables/#tui-environment-variables","title":"TUI Environment Variables","text":"Variable Config Field Example <code>FETCH_ML_TUI_HOST</code> <code>host</code> <code>localhost</code> <code>FETCH_ML_TUI_USER</code> <code>user</code> <code>mluser</code> <code>FETCH_ML_TUI_SSH_KEY</code> <code>ssh_key</code> <code>~/.ssh/id_rsa</code> <code>FETCH_ML_TUI_PORT</code> <code>port</code> <code>22</code> <code>FETCH_ML_TUI_BASE_PATH</code> <code>base_path</code> <code>/opt/ml</code> 
<code>FETCH_ML_TUI_TRAIN_SCRIPT</code> <code>train_script</code> <code>train.py</code> <code>FETCH_ML_TUI_REDIS_ADDR</code> <code>redis_addr</code> <code>localhost:6379</code> <code>FETCH_ML_TUI_REDIS_PASSWORD</code> <code>redis_password</code> `` <code>FETCH_ML_TUI_REDIS_DB</code> <code>redis_db</code> <code>0</code> <code>FETCH_ML_TUI_KNOWN_HOSTS</code> <code>known_hosts</code> <code>~/.ssh/known_hosts</code>"},{"location":"environment-variables/#server-environment-variables-auth-debug","title":"Server Environment Variables (Auth & Debug)","text":"<p>These variables control server-side authentication behavior and are intended only for local development and debugging.</p> Variable Purpose Allowed In Production? <code>FETCH_ML_ALLOW_INSECURE_AUTH</code> When set to <code>1</code> and <code>FETCH_ML_DEBUG=1</code>, allows the API server to run with <code>auth.enabled: false</code> by injecting a default admin user. No. Must never be set in production. <code>FETCH_ML_DEBUG</code> Enables additional debug behaviors. Required (set to <code>1</code>) to activate the insecure auth bypass above. No. Must never be set in production. <p>When both variables are set to <code>1</code> and <code>auth.enabled</code> is <code>false</code>, the server logs a clear warning and treats all requests as coming from a default admin user. 
This mode is convenient for local homelab experiments but is insecure by design and must not be used on any shared or internet-facing environment.</p>"},{"location":"environment-variables/#usage-examples","title":"Usage Examples","text":""},{"location":"environment-variables/#development-environment","title":"Development Environment","text":"<pre><code>export FETCH_ML_CLI_HOST=localhost\nexport FETCH_ML_CLI_USER=devuser\nexport FETCH_ML_CLI_API_KEY=dev-key-123456789012\n./ml status\n</code></pre>"},{"location":"environment-variables/#production-environment","title":"Production Environment","text":"<pre><code>export FETCH_ML_CLI_HOST=prod-server.example.com\nexport FETCH_ML_CLI_USER=mluser\nexport FETCH_ML_CLI_API_KEY=prod-key-abcdef1234567890\n./ml status\n</code></pre>"},{"location":"environment-variables/#dockerkubernetes","title":"Docker/Kubernetes","text":"<pre><code>env:\n - name: FETCH_ML_CLI_HOST\n value: \"ml-server.internal\"\n - name: FETCH_ML_CLI_USER\n value: \"mluser\"\n - name: FETCH_ML_CLI_API_KEY\n valueFrom:\n secretKeyRef:\n name: ml-secrets\n key: api-key\n</code></pre>"},{"location":"environment-variables/#using-env-file","title":"Using .env file","text":"<pre><code># Copy the example file\ncp .env.example .env\n\n# Edit with your values\nvim .env\n\n# Load in your shell\nexport $(cat .env | xargs)\n</code></pre>"},{"location":"environment-variables/#backward-compatibility","title":"Backward Compatibility","text":"<p>The CLI also supports the legacy <code>ML_*</code> prefix for backward compatibility, but <code>FETCH_ML_CLI_*</code> takes priority if both are set.</p> Legacy Variable New Variable <code>ML_HOST</code> <code>FETCH_ML_CLI_HOST</code> <code>ML_USER</code> <code>FETCH_ML_CLI_USER</code> <code>ML_BASE</code> <code>FETCH_ML_CLI_BASE</code> <code>ML_PORT</code> <code>FETCH_ML_CLI_PORT</code> <code>ML_API_KEY</code> <code>FETCH_ML_CLI_API_KEY</code>"},{"location":"first-experiment/","title":"First Experiment","text":"<p>Run your first 
machine learning experiment with Fetch ML.</p>"},{"location":"first-experiment/#prerequisites","title":"Prerequisites","text":"<p>Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution</p> <ul> <li>Fetch ML installed and running</li> <li>API key (see Security and API Key Process)</li> <li>Basic ML knowledge</li> </ul>"},{"location":"first-experiment/#experiment-workflow","title":"Experiment Workflow","text":""},{"location":"first-experiment/#1-prepare-your-ml-code","title":"1. Prepare Your ML Code","text":"<p>Create a simple Python script:</p> <pre><code># experiment.py\nimport argparse\nimport json\nimport sys\nimport time\n\ndef main():\n parser = argparse.ArgumentParser()\n parser.add_argument('--epochs', type=int, default=10)\n parser.add_argument('--lr', type=float, default=0.001)\n parser.add_argument('--output', default='results.json')\n\n args = parser.parse_args()\n\n # Simulate training\n results = {\n 'epochs': args.epochs,\n 'learning_rate': args.lr,\n 'accuracy': 0.85 + (args.lr * 0.1),\n 'loss': 0.5 - (args.epochs * 0.01),\n 'training_time': args.epochs * 0.1\n }\n\n # Save results\n with open(args.output, 'w') as f:\n json.dump(results, f, indent=2)\n\n print(f\"Training completed: {results}\")\n return results\n\nif __name__ == '__main__':\n main()\n</code></pre>"},{"location":"first-experiment/#2-submit-job-via-api","title":"2. Submit Job via API","text":"<pre><code># Submit experiment\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n -H \"Content-Type: application/json\" \\\n -H \"X-API-Key: your-api-key\" \\\n -d '{\n \"job_name\": \"first-experiment\",\n \"args\": \"--epochs 20 --lr 0.01 --output experiment_results.json\",\n \"priority\": 1,\n \"metadata\": {\n \"experiment_type\": \"training\",\n \"dataset\": \"sample_data\"\n }\n }'\n</code></pre>"},{"location":"first-experiment/#3-monitor-progress","title":"3. 
Monitor Progress","text":"<pre><code># Check job status\ncurl -H \"X-API-Key: your-api-key\" \\\n http://localhost:9101/api/v1/jobs/first-experiment\n\n# List all jobs\ncurl -H \"X-API-Key: your-api-key\" \\\n http://localhost:9101/api/v1/jobs\n\n# Get job metrics\ncurl -H \"X-API-Key: your-api-key\" \\\n http://localhost:9101/api/v1/jobs/first-experiment/metrics\n</code></pre>"},{"location":"first-experiment/#4-use-cli","title":"4. Use CLI","text":"<pre><code># Submit with CLI\ncd cli && zig build dev\n./cli/zig-out/dev/ml submit \\\n --name \"cli-experiment\" \\\n --args \"--epochs 15 --lr 0.005\" \\\n --server http://localhost:9101\n\n# Monitor with CLI\n./cli/zig-out/dev/ml list-jobs --server http://localhost:9101\n./cli/zig-out/dev/ml job-status cli-experiment --server http://localhost:9101\n</code></pre>"},{"location":"first-experiment/#advanced-experiment","title":"Advanced Experiment","text":""},{"location":"first-experiment/#hyperparameter-tuning","title":"Hyperparameter Tuning","text":"<pre><code># Submit multiple experiments\nfor lr in 0.001 0.01 0.1; do\n curl -X POST http://localhost:9101/api/v1/jobs \\\n -H \"Content-Type: application/json\" \\\n -H \"X-API-Key: your-api-key\" \\\n -d \"{\n \\\"job_name\\\": \\\"tune-lr-$lr\\\",\n \\\"args\\\": \\\"--epochs 10 --lr $lr\\\",\n \\\"metadata\\\": {\\\"learning_rate\\\": $lr}\n }\"\ndone\n</code></pre>"},{"location":"first-experiment/#batch-processing","title":"Batch Processing","text":"<pre><code># Submit batch job\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n -H \"Content-Type: application/json\" \\\n -H \"X-API-Key: your-api-key\" \\\n -d '{\n \"job_name\": \"batch-processing\",\n \"args\": \"--input data/ --output results/ --batch-size 32\",\n \"priority\": 2,\n \"datasets\": [\"training_data\", \"validation_data\"]\n }'\n</code></pre>"},{"location":"first-experiment/#results-and-output","title":"Results and Output","text":""},{"location":"first-experiment/#access-results","title":"Access 
Results","text":"<pre><code># Download results\ncurl -H \"X-API-Key: your-api-key\" \\\n http://localhost:9101/api/v1/jobs/first-experiment/results\n\n# View job details\ncurl -H \"X-API-Key: your-api-key\" \\\n http://localhost:9101/api/v1/jobs/first-experiment | jq .\n</code></pre>"},{"location":"first-experiment/#result-format","title":"Result Format","text":"<pre><code>{\n \"job_id\": \"first-experiment\",\n \"status\": \"completed\",\n \"results\": {\n \"epochs\": 20,\n \"learning_rate\": 0.01,\n \"accuracy\": 0.851,\n \"loss\": 0.3,\n \"training_time\": 2.0\n },\n \"metrics\": {\n \"gpu_utilization\": \"85%\",\n \"memory_usage\": \"2GB\",\n \"execution_time\": \"120s\"\n }\n}\n</code></pre>"},{"location":"first-experiment/#best-practices","title":"Best Practices","text":""},{"location":"first-experiment/#job-naming","title":"Job Naming","text":"<ul> <li>Use descriptive names: <code>model-training-v2</code>, <code>data-preprocessing</code></li> <li>Include version numbers: <code>experiment-v1</code>, <code>experiment-v2</code></li> <li>Add timestamps: <code>daily-batch-2024-01-15</code></li> </ul>"},{"location":"first-experiment/#metadata-usage","title":"Metadata Usage","text":"<pre><code>{\n \"metadata\": {\n \"experiment_type\": \"training\",\n \"model_version\": \"v2.1\",\n \"dataset\": \"imagenet-2024\",\n \"environment\": \"gpu\",\n \"team\": \"ml-team\"\n }\n}\n</code></pre>"},{"location":"first-experiment/#error-handling","title":"Error Handling","text":"<pre><code># Check failed jobs\ncurl -H \"X-API-Key: your-api-key\" \\\n \"http://localhost:9101/api/v1/jobs?status=failed\"\n\n# Retry failed job\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n -H \"Content-Type: application/json\" \\\n -H \"X-API-Key: your-api-key\" \\\n -d '{\n \"job_name\": \"retry-experiment\",\n \"args\": \"--epochs 20 --lr 0.01\",\n \"metadata\": {\"retry_of\": \"first-experiment\"}\n }'\n</code></pre>"},{"location":"first-experiment/#related-documentation","title":"
Related Documentation","text":"<ul> <li>Development Setup (see development-setup.md) - Local development environment</li> <li>Testing Guide (see testing.md) - Test your experiments</li> <li>Production Deployment (see deployment.md) - Scale to production</li> <li>Monitoring - Track experiment performance</li> </ul>"},{"location":"first-experiment/#troubleshooting","title":"Troubleshooting","text":"<p>Job stuck in pending? - Check worker status: <code>curl /api/v1/workers</code> - Verify resources: <code>docker stats</code> - Check logs: <code>docker-compose logs api-server</code></p> <p>Job failed? - Check error message: <code>curl /api/v1/jobs/job-id</code> - Review job arguments - Verify input data</p> <p>No results? - Check job completion status - Verify output file paths - Check storage permissions</p>"},{"location":"installation/","title":"Simple Installation Guide","text":""},{"location":"installation/#quick-start-5-minutes","title":"Quick Start (5 minutes)","text":"<pre><code># 1. Install\ngit clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\nmake install\n\n# 2. Setup (auto-configures)\n./bin/ml setup\n\n# 3. Run experiments\n./bin/ml run my-experiment.py\n</code></pre> <p>That's it. Everything else is optional.</p>"},{"location":"installation/#what-if-i-want-more-control","title":"What If I Want More Control?","text":""},{"location":"installation/#manual-configuration-optional","title":"Manual Configuration (Optional)","text":"<pre><code># Edit settings if defaults don't work\nnano ~/.ml/config.toml\n</code></pre>"},{"location":"installation/#monitoring-dashboard-optional","title":"Monitoring Dashboard (Optional)","text":"<pre><code># Real-time monitoring\n./bin/tui\n</code></pre>"},{"location":"installation/#senior-developer-feedback","title":"Senior Developer Feedback","text":"<p>\"Keep it simple\" - Most data scientists want: 1. One installation command 2. Sensible defaults 3. 
Works without configuration 4. Advanced features available when needed</p> <p>Current plan is too complex because it asks users to decide between: - CLI vs TUI vs Both - Zig vs Go build tools - Manual vs auto config - Multiple environment variables</p> <p>Better approach: Start simple, add complexity gradually.</p>"},{"location":"installation/#recommended-simplified-workflow","title":"Recommended Simplified Workflow","text":"<ol> <li>Single Binary - Combine CLI + basic TUI functionality</li> <li>Auto-Discovery - Detect common ML environments automatically</li> <li>Progressive Disclosure - Show advanced options only when needed</li> <li>Zero Config - Work out-of-the-box with localhost defaults</li> </ol> <p>The goal: \"It just works\" for 80% of use cases.</p>"},{"location":"operations/","title":"Operations Runbook","text":"<p>Operational guide for troubleshooting and maintaining the ML experiment system.</p>"},{"location":"operations/#task-queue-operations","title":"Task Queue Operations","text":""},{"location":"operations/#monitoring-queue-health","title":"Monitoring Queue Health","text":"<pre><code># Check queue depth\nZCARD task:queue\n\n# List pending tasks\nZRANGE task:queue 0 -1 WITHSCORES\n\n# Check dead letter queue\nKEYS task:dlq:*\n</code></pre>"},{"location":"operations/#handling-stuck-tasks","title":"Handling Stuck Tasks","text":"<p>Symptom: Tasks stuck in \"running\" status</p> <p>Diagnosis: <pre><code># Check for expired leases\nredis-cli GET task:{task-id}\n# Look for LeaseExpiry in past\n</code></pre></p> <p>Remediation: Tasks with expired leases are automatically reclaimed every 1 minute. 
To force immediate reclamation: <pre><code># Restart worker to trigger reclaim cycle\nsystemctl restart ml-worker\n</code></pre></p>"},{"location":"operations/#dead-letter-queue-management","title":"Dead Letter Queue Management","text":"<p>View failed tasks: <pre><code>KEYS task:dlq:*\n</code></pre></p> <p>Inspect failed task: <pre><code>GET task:dlq:{task-id}\n</code></pre></p> <p>Retry from DLQ: <pre><code># Manual retry (requires custom script)\n# 1. Get task from DLQ\n# 2. Reset retry count\n# 3. Re-queue task\n</code></pre></p>"},{"location":"operations/#worker-crashes","title":"Worker Crashes","text":"<p>Symptom: Worker disappeared mid-task</p> <p>What Happens: 1. Lease expires after 30 minutes (default) 2. Background reclaim job detects expired lease 3. Task is retried (up to 3 attempts) 4. After max retries \u2192 Dead Letter Queue</p> <p>Prevention: - Monitor worker heartbeats - Set up alerts for worker down - Use process manager (systemd, supervisor)</p>"},{"location":"operations/#worker-operations","title":"Worker Operations","text":""},{"location":"operations/#graceful-shutdown","title":"Graceful Shutdown","text":"<pre><code># Send SIGTERM for graceful shutdown\nkill -TERM $(pgrep ml-worker)\n\n# Worker will:\n# 1. Stop accepting new tasks\n# 2. Finish active tasks (up to 5min timeout)\n# 3. Release all leases\n# 4. 
Exit cleanly\n</code></pre>"},{"location":"operations/#force-shutdown","title":"Force Shutdown","text":"<pre><code># Force kill (leases will be reclaimed automatically)\nkill -9 $(pgrep ml-worker)\n</code></pre>"},{"location":"operations/#worker-heartbeat-monitoring","title":"Worker Heartbeat Monitoring","text":"<pre><code># Check worker heartbeats\nHGETALL worker:heartbeat\n\n# Example output:\n# worker-abc123 1701234567\n# worker-def456 1701234580\n</code></pre> <p>Alert if: Heartbeat timestamp > 5 minutes old</p>"},{"location":"operations/#redis-operations","title":"Redis Operations","text":""},{"location":"operations/#backup","title":"Backup","text":"<pre><code># Manual backup\nredis-cli SAVE\ncp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb\n</code></pre>"},{"location":"operations/#restore","title":"Restore","text":"<pre><code># Stop Redis\nsystemctl stop redis\n\n# Restore snapshot\ncp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb\n\n# Start Redis\nsystemctl start redis\n</code></pre>"},{"location":"operations/#memory-management","title":"Memory Management","text":"<pre><code># Check memory usage\nINFO memory\n\n# Evict old data if needed\nFLUSHDB # DANGER: Clears all data!\n</code></pre>"},{"location":"operations/#common-issues","title":"Common Issues","text":""},{"location":"operations/#issue-queue-growing-unbounded","title":"Issue: Queue Growing Unbounded","text":"<p>Symptoms: - <code>ZCARD task:queue</code> keeps increasing - No workers processing tasks</p> <p>Diagnosis: <pre><code># Check worker status\nsystemctl status ml-worker\n\n# Check logs\njournalctl -u ml-worker -n 100\n</code></pre></p> <p>Resolution: 1. Verify workers are running 2. Check Redis connectivity 3. 
Verify lease configuration</p>"},{"location":"operations/#issue-high-retry-rate","title":"Issue: High Retry Rate","text":"<p>Symptoms: - Many tasks in DLQ - <code>retry_count</code> field high on tasks</p> <p>Diagnosis: <pre><code># Check worker logs for errors\njournalctl -u ml-worker | grep \"retry\"\n\n# Look for patterns (network issues, resource limits, etc)\n</code></pre></p> <p>Resolution: - Fix underlying issue (network, resources, etc) - Adjust retry limits if permanent failures - Increase task timeout if jobs are slow</p>"},{"location":"operations/#issue-leases-expiring-prematurely","title":"Issue: Leases Expiring Prematurely","text":"<p>Symptoms: - Tasks retried even though worker is healthy - Logs show \"lease expired\" frequently</p> <p>Diagnosis: <pre><code># Check worker config\ncat configs/worker-config.yaml | grep -A3 \"lease\"\n\ntask_lease_duration: 30m # Too short?\nheartbeat_interval: 1m # Too infrequent?\n</code></pre></p> <p>Resolution: <pre><code># Increase lease duration for long-running jobs\ntask_lease_duration: 60m\nheartbeat_interval: 30s # More frequent heartbeats\n</code></pre></p>"},{"location":"operations/#performance-tuning","title":"Performance Tuning","text":""},{"location":"operations/#worker-concurrency","title":"Worker Concurrency","text":"<pre><code># worker-config.yaml\nmax_workers: 4 # Number of parallel tasks\n\n# Adjust based on:\n# - CPU cores available\n# - Memory per task\n# - GPU availability\n</code></pre>"},{"location":"operations/#redis-configuration","title":"Redis Configuration","text":"<pre><code># /etc/redis/redis.conf\n\n# Persistence\nsave 900 1\nsave 300 10\n\n# Memory\nmaxmemory 2gb\nmaxmemory-policy noeviction\n\n# Performance\ntcp-keepalive 300\ntimeout 0\n</code></pre>"},{"location":"operations/#alerting-rules","title":"Alerting Rules","text":""},{"location":"operations/#critical-alerts","title":"Critical Alerts","text":"<ol> <li>Worker Down (no heartbeat > 5min)</li> <li>Queue Depth > 1000 tasks</li> 
<li>DLQ Growth > 100 tasks/hour</li> <li>Redis Down (connection failed)</li> </ol>"},{"location":"operations/#warning-alerts","title":"Warning Alerts","text":"<ol> <li>High Retry Rate > 10% of tasks</li> <li>Slow Queue Drain (depth increasing over 1 hour)</li> <li>Worker Memory > 80% usage</li> </ol>"},{"location":"operations/#health-checks","title":"Health Checks","text":"<pre><code>#!/bin/bash\n# health-check.sh\n\n# Check Redis\nredis-cli PING || echo \"Redis DOWN\"\n\n# Check worker heartbeat\nWORKER_ID=$(cat /var/run/ml-worker.pid)\nLAST_HB=$(redis-cli HGET worker:heartbeat \"$WORKER_ID\")\nNOW=$(date +%s)\n# A missing heartbeat defaults to 0, so it is reported as stale\nif [ $((NOW - ${LAST_HB:-0})) -gt 300 ]; then\n echo \"Worker heartbeat stale\"\nfi\n\n# Check queue depth\nDEPTH=$(redis-cli ZCARD task:queue)\nif [ \"$DEPTH\" -gt 1000 ]; then\n echo \"Queue depth critical: $DEPTH\"\nfi\n</code></pre>"},{"location":"operations/#runbook-checklist","title":"Runbook Checklist","text":""},{"location":"operations/#daily-operations","title":"Daily Operations","text":"<ol> <li>Check queue depth</li> <li>Verify worker heartbeats</li> <li>Review DLQ for patterns</li> <li>Check Redis memory usage</li> </ol>"},{"location":"operations/#weekly-operations","title":"Weekly Operations","text":"<ol> <li>Review retry rates</li> <li>Analyze failed task patterns</li> <li>Backup Redis snapshot</li> <li>Review worker logs</li> </ol>"},{"location":"operations/#monthly-operations","title":"Monthly Operations","text":"<ol> <li>Performance tuning review</li> <li>Capacity planning</li> <li>Update documentation</li> <li>Test disaster recovery</li> </ol> <p>For homelab setups: Most of these operations can be simplified. 
Focus on: - Basic monitoring (queue depth, worker status) - Periodic Redis backups - Graceful shutdowns for maintenance</p>"},{"location":"performance-monitoring/","title":"Performance Monitoring","text":"<p>This document describes the performance monitoring system for Fetch ML, which automatically tracks benchmark metrics through CI/CD integration with Prometheus and Grafana.</p>"},{"location":"performance-monitoring/#overview","title":"Overview","text":"<p>The performance monitoring system provides:</p> <ul> <li>Automatic benchmark execution on every CI/CD run</li> <li>Real-time metrics collection via Prometheus Pushgateway</li> <li>Historical trend visualization in Grafana dashboards</li> <li>Performance regression detection</li> <li>Cross-commit comparisons</li> </ul>"},{"location":"performance-monitoring/#architecture","title":"Architecture","text":"<pre><code>GitHub Actions \u2192 Benchmark Tests \u2192 Prometheus Pushgateway \u2192 Prometheus \u2192 Grafana Dashboard\n</code></pre>"},{"location":"performance-monitoring/#components","title":"Components","text":""},{"location":"performance-monitoring/#1-github-actions-workflow","title":"1. GitHub Actions Workflow","text":"<ul> <li>File: <code>.github/workflows/benchmark-metrics.yml</code></li> <li>Triggers: Push to main/develop, PRs, daily schedule, manual</li> <li>Function: Runs benchmarks and pushes metrics to Prometheus</li> </ul>"},{"location":"performance-monitoring/#2-prometheus-pushgateway","title":"2. Prometheus Pushgateway","text":"<ul> <li>Port: 9091</li> <li>Purpose: Receives benchmark metrics from CI/CD runs</li> <li>URL: <code>http://localhost:9091</code></li> </ul>"},{"location":"performance-monitoring/#3-prometheus-server","title":"3. 
Prometheus Server","text":"<ul> <li>Configuration: <code>monitoring/prometheus.yml</code></li> <li>Scrapes: Pushgateway for benchmark metrics</li> <li>Retention: Configurable retention period</li> </ul>"},{"location":"performance-monitoring/#4-grafana-dashboard","title":"4. Grafana Dashboard","text":"<ul> <li>Location: <code>monitoring/dashboards/performance-dashboard.json</code></li> <li>Visualizations: Performance trends, regressions, comparisons</li> <li>Access: http://localhost:3001</li> </ul>"},{"location":"performance-monitoring/#setup","title":"Setup","text":""},{"location":"performance-monitoring/#1-start-monitoring-stack","title":"1. Start Monitoring Stack","text":"<pre><code>make monitoring-performance\n</code></pre> <p>This starts: - Grafana: http://localhost:3001 (admin/admin) - Loki: http://localhost:3100 - Pushgateway: http://localhost:9091</p>"},{"location":"performance-monitoring/#2-configure-github-secrets","title":"2. Configure GitHub Secrets","text":"<p>Add this secret to your GitHub repository:</p> <pre><code>PROMETHEUS_PUSHGATEWAY_URL=http://your-pushgateway:9091\n</code></pre>"},{"location":"performance-monitoring/#3-verify-integration","title":"3. 
Verify Integration","text":"<ol> <li>Push code to trigger the workflow</li> <li>Check Pushgateway: http://localhost:9091</li> <li>View metrics in Grafana dashboard</li> </ol>"},{"location":"performance-monitoring/#available-metrics","title":"Available Metrics","text":""},{"location":"performance-monitoring/#benchmark-metrics","title":"Benchmark Metrics","text":"<ul> <li><code>benchmark_time_per_op</code> - Time per operation in nanoseconds</li> <li><code>benchmark_memory_per_op</code> - Memory per operation in bytes</li> <li><code>benchmark_allocs_per_op</code> - Allocations per operation</li> </ul> <p>Labels: - <code>benchmark</code> - Benchmark name (sanitized) - <code>job</code> - Always \"benchmark\" - <code>instance</code> - GitHub Actions run ID</p>"},{"location":"performance-monitoring/#example-metrics-output","title":"Example Metrics Output","text":"<pre><code>benchmark_time_per_op{benchmark=\"BenchmarkAPIServerCreateJobSimple\"} 42653\nbenchmark_memory_per_op{benchmark=\"BenchmarkAPIServerCreateJobSimple\"} 13518\nbenchmark_allocs_per_op{benchmark=\"BenchmarkAPIServerCreateJobSimple\"} 98\n</code></pre>"},{"location":"performance-monitoring/#usage","title":"Usage","text":""},{"location":"performance-monitoring/#manual-benchmark-execution","title":"Manual Benchmark Execution","text":"<pre><code># Run benchmarks locally\nmake benchmark\n\n# View results in console\ngo test -bench=. 
-benchmem ./tests/benchmarks/...\n</code></pre>"},{"location":"performance-monitoring/#automated-monitoring","title":"Automated Monitoring","text":"<p>The system automatically runs benchmarks on:</p> <ul> <li>Every push to main/develop branches</li> <li>Pull requests to main branch</li> <li>Daily schedule at 6:00 AM UTC</li> <li>Manual trigger via GitHub Actions UI</li> </ul>"},{"location":"performance-monitoring/#viewing-results","title":"Viewing Results","text":"<ol> <li>Grafana Dashboard: http://localhost:3001</li> <li>Pushgateway: http://localhost:9091/metrics</li> <li>Prometheus: http://localhost:9090/targets</li> </ol>"},{"location":"performance-monitoring/#configuration","title":"Configuration","text":""},{"location":"performance-monitoring/#prometheus-configuration","title":"Prometheus Configuration","text":"<p>Edit <code>monitoring/prometheus.yml</code> to adjust:</p> <pre><code>scrape_configs:\n - job_name: 'benchmark'\n static_configs:\n - targets: ['pushgateway:9091']\n metrics_path: /metrics\n honor_labels: true\n scrape_interval: 15s\n</code></pre>"},{"location":"performance-monitoring/#grafana-dashboard","title":"Grafana Dashboard","text":"<p>Customize the dashboard in <code>monitoring/dashboards/performance-dashboard.json</code>:</p> <ul> <li>Add new panels</li> <li>Modify queries</li> <li>Adjust visualization types</li> <li>Set up alerts</li> </ul>"},{"location":"performance-monitoring/#troubleshooting","title":"Troubleshooting","text":""},{"location":"performance-monitoring/#common-issues","title":"Common Issues","text":"<ol> <li>Metrics not appearing in Grafana</li> <li>Check Pushgateway: http://localhost:9091</li> <li>Verify Prometheus targets: http://localhost:9090/targets</li> <li> <p>Check GitHub Actions logs</p> </li> <li> <p>GitHub Actions workflow failing</p> </li> <li>Verify <code>PROMETHEUS_PUSHGATEWAY_URL</code> secret</li> <li>Check workflow syntax</li> <li> <p>Review benchmark execution logs</p> </li> <li> <p>Pushgateway not receiving 
metrics</p> </li> <li>Verify URL accessibility from CI/CD</li> <li>Check network connectivity</li> <li>Review curl command in workflow</li> </ol>"},{"location":"performance-monitoring/#debug-commands","title":"Debug Commands","text":"<pre><code># Check running services\ndocker ps --filter \"name=monitoring\"\n\n# View Pushgateway metrics\ncurl http://localhost:9091/metrics\n\n# Check Prometheus targets\ncurl http://localhost:9090/api/v1/targets\n\n# Test manual metric push\necho \"test_metric 123\" | curl --data-binary @- http://localhost:9091/metrics/job/test\n</code></pre>"},{"location":"performance-monitoring/#best-practices","title":"Best Practices","text":""},{"location":"performance-monitoring/#benchmark-naming","title":"Benchmark Naming","text":"<p>Use consistent naming conventions: - <code>BenchmarkAPIServerCreateJob</code> - <code>BenchmarkMLExperimentTraining</code> - <code>BenchmarkDatasetOperations</code></p>"},{"location":"performance-monitoring/#alerting","title":"Alerting","text":"<p>Set up Grafana alerts for: - Performance regressions (>10% degradation) - Missing benchmark data - High memory allocation rates</p>"},{"location":"performance-monitoring/#retention","title":"Retention","text":"<p>Configure appropriate retention periods: - Raw metrics: 30 days - Aggregated data: 1 year - Dashboard snapshots: Permanent</p>"},{"location":"performance-monitoring/#integration-with-existing-workflows","title":"Integration with Existing Workflows","text":"<p>The benchmark monitoring integrates seamlessly with:</p> <ul> <li>CI/CD pipelines: Automatic execution</li> <li>Code reviews: Performance impact visible</li> <li>Release management: Performance trends over time</li> <li>Development: Local testing with same metrics</li> </ul>"},{"location":"performance-monitoring/#future-enhancements","title":"Future Enhancements","text":"<p>Potential improvements:</p> <ol> <li>Automated performance regression alerts</li> <li>Performance budgets and gates</li> 
<li>Comparative analysis across branches</li> <li>Integration with load testing results</li> <li>Performance impact scoring</li> </ol>"},{"location":"performance-monitoring/#support","title":"Support","text":"<p>For issues:</p> <ol> <li>Check this documentation</li> <li>Review GitHub Actions logs</li> <li>Verify monitoring stack status</li> <li>Consult Grafana/Prometheus docs</li> </ol> <p>Last updated: December 2024</p>"},{"location":"performance-quick-start/","title":"Performance Monitoring Quick Start","text":"<p>Get started with performance monitoring in 5 minutes.</p>"},{"location":"performance-quick-start/#prerequisites","title":"Prerequisites","text":"<ul> <li>Docker and Docker Compose</li> <li>Go 1.21 or later</li> <li>GitHub repository (for CI/CD integration)</li> </ul>"},{"location":"performance-quick-start/#1-start-monitoring-stack","title":"1. Start Monitoring Stack","text":"<pre><code>make monitoring-performance\n</code></pre> <p>This starts: - Grafana: http://localhost:3001 (admin/admin) - Pushgateway: http://localhost:9091 - Loki: http://localhost:3100</p>"},{"location":"performance-quick-start/#2-run-benchmarks","title":"2. Run Benchmarks","text":"<pre><code># Run benchmarks locally\nmake benchmark\n\n# Or run with detailed output\ngo test -bench=. -benchmem ./tests/benchmarks/...\n</code></pre>"},{"location":"performance-quick-start/#3-cpu-profiling","title":"3. 
CPU Profiling","text":""},{"location":"performance-quick-start/#http-load-test-profiling","title":"HTTP Load Test Profiling","text":"<pre><code># CPU profile MediumLoad HTTP test (with rate limiting)\nmake profile-load\n\n# CPU profile MediumLoad HTTP test (no rate limiting - recommended for profiling)\nmake profile-load-norate\n</code></pre> <p>This generates <code>cpu_load.out</code> which you can analyze with:</p> <pre><code># View interactive profile\ngo tool pprof cpu_load.out\n\n# Generate flame graph\ngo tool pprof -raw cpu_load.out | go-flamegraph.pl > cpu_flame.svg\n\n# View top functions\ngo tool pprof -top cpu_load.out\n</code></pre>"},{"location":"performance-quick-start/#websocket-queue-profiling","title":"WebSocket Queue Profiling","text":"<pre><code># CPU profile WebSocket \u2192 Redis queue \u2192 worker path\nmake profile-ws-queue\n</code></pre> <p>Generates <code>cpu_ws.out</code> for WebSocket performance analysis.</p>"},{"location":"performance-quick-start/#profiling-tips","title":"Profiling Tips","text":"<ul> <li>Use <code>profile-load-norate</code> for cleaner CPU profiles (no rate limiting delays)</li> <li>Profiles run for 60 seconds by default</li> <li>Requires Redis running on localhost:6379</li> <li>Results show throughput, latency, and error rate metrics</li> </ul>"},{"location":"performance-quick-start/#4-view-results","title":"4. View Results","text":"<p>Open Grafana dashboard: http://localhost:3001</p> <p>Navigate to the Performance Dashboard to see: - Real-time benchmark results - Historical trends - Performance comparisons</p>"},{"location":"performance-quick-start/#5-enable-cicd-integration","title":"5. Enable CI/CD Integration","text":"<p>Add GitHub secret: <pre><code>PROMETHEUS_PUSHGATEWAY_URL=http://your-pushgateway:9091\n</code></pre></p> <p>Now benchmarks run automatically on: - Every push to main/develop - Pull requests - Daily schedule</p>"},{"location":"performance-quick-start/#6-verify-integration","title":"6. 
Verify Integration","text":"<ol> <li>Push code to trigger workflow</li> <li>Check Pushgateway: http://localhost:9091/metrics</li> <li>View metrics in Grafana</li> </ol>"},{"location":"performance-quick-start/#7-key-metrics","title":"7. Key Metrics","text":"<ul> <li><code>benchmark_time_per_op</code> - Execution time</li> <li><code>benchmark_memory_per_op</code> - Memory usage</li> <li><code>benchmark_allocs_per_op</code> - Allocation count</li> </ul>"},{"location":"performance-quick-start/#8-troubleshooting","title":"8. Troubleshooting","text":"<p>No metrics in Grafana? <pre><code># Check services\ndocker ps --filter \"name=monitoring\"\n\n# Check Pushgateway\ncurl http://localhost:9091/metrics\n</code></pre></p> <p>Workflow failing? - Verify GitHub secret configuration - Check workflow logs in GitHub Actions</p> <p>Profiling issues? <pre><code># Flag error like \"flag provided but not defined: -test.paniconexit0\"\n# This should be fixed now, but if it persists:\ngo test ./tests/load -run TestLoadProfile_Medium -count=1 -cpuprofile cpu_load.out -v -args -profile-norate\n\n# Redis not available?\n# Start Redis for profiling tests:\ndocker run -d -p 6379:6379 redis:alpine\n\n# Check profile file generated\nls -la cpu_load.out\n</code></pre></p>"},{"location":"performance-quick-start/#9-next-steps","title":"9. Next Steps","text":"<ul> <li>Full Documentation</li> <li>Dashboard Customization</li> <li>Alert Configuration</li> </ul> <p>Ready in 5 minutes!</p>"},{"location":"production-monitoring/","title":"Production Monitoring Deployment Guide (Linux)","text":"<p>This guide covers deploying the monitoring stack (Prometheus, Grafana, Loki, Promtail) on Linux production servers.</p>"},{"location":"production-monitoring/#architecture","title":"Architecture","text":"<p>Testing: Docker Compose (macOS/Linux) Production: Podman + systemd (Linux)</p> <p>Important: Docker is for testing only. 
Podman is used for running actual ML experiments in production.</p> <p>Dev (Testing): Docker Compose Prod (Experiments): Podman + systemd</p> <p>Each service runs as a separate Podman container managed by systemd for automatic restarts and proper lifecycle management.</p>"},{"location":"production-monitoring/#prerequisites","title":"Prerequisites","text":"<p>Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution</p> <ul> <li>Linux distribution with systemd (Rocky/RHEL/CentOS, Ubuntu/Debian, Arch, SUSE, etc.)</li> <li>Production app already deployed (see <code>scripts/setup-prod.sh</code>)</li> <li>Root or sudo access</li> <li>Ports 3000, 9090, 3100 available</li> </ul>"},{"location":"production-monitoring/#quick-setup","title":"Quick Setup","text":""},{"location":"production-monitoring/#1-run-setup-script","title":"1. Run Setup Script","text":"<pre><code>cd /path/to/fetch_ml\nsudo ./scripts/setup-monitoring-prod.sh /data/monitoring ml-user ml-group\n</code></pre> <p>This will: - Create directory structure at <code>/data/monitoring</code> - Copy configuration files to <code>/etc/fetch_ml/monitoring</code> - Create systemd services for each component - Set up firewall rules</p>"},{"location":"production-monitoring/#2-start-services","title":"2. Start Services","text":"<pre><code># Start all monitoring services\nsudo systemctl start prometheus\nsudo systemctl start loki\nsudo systemctl start promtail\nsudo systemctl start grafana\n\n# Enable on boot\nsudo systemctl enable prometheus loki promtail grafana\n</code></pre>"},{"location":"production-monitoring/#3-access-grafana","title":"3. 
Access Grafana","text":"<ul> <li>URL: <code>http://YOUR_SERVER_IP:3000</code></li> <li>Username: <code>admin</code></li> <li>Password: <code>admin</code> (change on first login)</li> </ul> <p>Dashboards will auto-load: - ML Task Queue Monitoring (metrics) - Application Logs (Loki logs)</p>"},{"location":"production-monitoring/#service-details","title":"Service Details","text":""},{"location":"production-monitoring/#prometheus","title":"Prometheus","text":"<ul> <li>Port: 9090</li> <li>Config: <code>/etc/fetch_ml/monitoring/prometheus.yml</code></li> <li>Data: <code>/data/monitoring/prometheus</code></li> <li>Purpose: Scrapes metrics from API server</li> </ul>"},{"location":"production-monitoring/#loki","title":"Loki","text":"<ul> <li>Port: 3100</li> <li>Config: <code>/etc/fetch_ml/monitoring/loki-config.yml</code></li> <li>Data: <code>/data/monitoring/loki</code></li> <li>Purpose: Log aggregation</li> </ul>"},{"location":"production-monitoring/#promtail","title":"Promtail","text":"<ul> <li>Config: <code>/etc/fetch_ml/monitoring/promtail-config.yml</code></li> <li>Log Source: <code>/var/log/fetch_ml/*.log</code></li> <li>Purpose: Ships logs to Loki</li> </ul>"},{"location":"production-monitoring/#grafana","title":"Grafana","text":"<ul> <li>Port: 3000</li> <li>Config: <code>/etc/fetch_ml/monitoring/grafana/provisioning</code></li> <li>Data: <code>/data/monitoring/grafana</code></li> <li>Dashboards: <code>/var/lib/grafana/dashboards</code></li> </ul>"},{"location":"production-monitoring/#management-commands","title":"Management Commands","text":"<pre><code># Check status\nsudo systemctl status prometheus grafana loki promtail\n\n# View logs\nsudo journalctl -u prometheus -f\nsudo journalctl -u grafana -f\nsudo journalctl -u loki -f\nsudo journalctl -u promtail -f\n\n# Restart services\nsudo systemctl restart prometheus\nsudo systemctl restart grafana\n\n# Stop all monitoring\nsudo systemctl stop prometheus grafana loki 
promtail\n</code></pre>"},{"location":"production-monitoring/#data-retention","title":"Data Retention","text":""},{"location":"production-monitoring/#prometheus_1","title":"Prometheus","text":"<p>Default: 15 days. Edit <code>/etc/fetch_ml/monitoring/prometheus.yml</code>: <pre><code>storage:\n tsdb:\n retention.time: 30d\n</code></pre></p>"},{"location":"production-monitoring/#loki_1","title":"Loki","text":"<p>Default: 30 days. Edit <code>/etc/fetch_ml/monitoring/loki-config.yml</code>: <pre><code>limits_config:\n retention_period: 30d\n</code></pre></p>"},{"location":"production-monitoring/#security","title":"Security","text":""},{"location":"production-monitoring/#firewall","title":"Firewall","text":"<p>The setup script automatically configures firewall rules using the detected firewall manager (firewalld or ufw).</p> <p>For manual firewall configuration:</p> <p>RHEL/Rocky/Fedora (firewalld): <pre><code># Remove public access\nsudo firewall-cmd --permanent --remove-port=3000/tcp\nsudo firewall-cmd --permanent --remove-port=9090/tcp\n\n# Add specific source\nsudo firewall-cmd --permanent --add-rich-rule='rule family=\"ipv4\" source address=\"10.0.0.0/24\" port port=\"3000\" protocol=\"tcp\" accept'\nsudo firewall-cmd --reload\n</code></pre></p> <p>Ubuntu/Debian (ufw): <pre><code># Remove public access\nsudo ufw delete allow 3000/tcp\nsudo ufw delete allow 9090/tcp\n\n# Add specific source\nsudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp\n</code></pre></p>"},{"location":"production-monitoring/#authentication","title":"Authentication","text":"<p>Change Grafana admin password: 1. Login to Grafana 2. 
User menu \u2192 Profile \u2192 Change Password</p>"},{"location":"production-monitoring/#tls-optional","title":"TLS (Optional)","text":"<p>For HTTPS, configure reverse proxy (nginx/Apache) in front of Grafana.</p>"},{"location":"production-monitoring/#troubleshooting","title":"Troubleshooting","text":""},{"location":"production-monitoring/#grafana-shows-no-data","title":"Grafana shows no data","text":"<pre><code># Check if Prometheus is reachable\ncurl http://localhost:9090/-/healthy\n\n# Check datasource in Grafana\n# Settings \u2192 Data Sources \u2192 Prometheus \u2192 Save & Test\n</code></pre>"},{"location":"production-monitoring/#loki-not-receiving-logs","title":"Loki not receiving logs","text":"<pre><code># Check Promtail is running\nsudo systemctl status promtail\n\n# Verify log file exists\nls -l /var/log/fetch_ml/\n\n# Check Promtail can reach Loki\ncurl http://localhost:3100/ready\n</code></pre>"},{"location":"production-monitoring/#podman-containers-not-starting","title":"Podman containers not starting","text":"<pre><code># Check pod status\nsudo -u ml-user podman pod ps\nsudo -u ml-user podman ps -a\n\n# Remove and recreate\nsudo -u ml-user podman pod stop monitoring\nsudo -u ml-user podman pod rm monitoring\nsudo systemctl restart prometheus\n</code></pre>"},{"location":"production-monitoring/#backup","title":"Backup","text":"<pre><code># Backup Grafana dashboards and data\nsudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana\n\n# Backup Prometheus data\nsudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus\n</code></pre>"},{"location":"production-monitoring/#updates","title":"Updates","text":"<pre><code># Pull latest images\nsudo -u ml-user podman pull docker.io/grafana/grafana:latest\nsudo -u ml-user podman pull docker.io/prom/prometheus:latest\nsudo -u ml-user podman pull docker.io/grafana/loki:latest\nsudo -u ml-user podman pull docker.io/grafana/promtail:latest\n\n# Restart services to use new images\nsudo systemctl restart 
grafana prometheus loki promtail\n</code></pre>"},{"location":"queue/","title":"Task Queue Architecture","text":"<p>The task queue system enables reliable job processing between the API server and workers using Redis.</p>"},{"location":"queue/#overview","title":"Overview","text":"<pre><code>graph LR\n CLI[CLI/Client] -->|WebSocket| API[API Server]\n API -->|Enqueue| Redis[(Redis)]\n Redis -->|Dequeue| Worker[Worker]\n Worker -->|Update Status| Redis\n</code></pre>"},{"location":"queue/#components","title":"Components","text":""},{"location":"queue/#taskqueue-internalqueue","title":"TaskQueue (<code>internal/queue</code>)","text":"<p>Shared package used by both API server and worker for job management.</p>"},{"location":"queue/#task-structure","title":"Task Structure","text":"<pre><code>type Task struct {\n ID string // Unique task ID (UUID)\n JobName string // User-defined job name \n Args string // Job arguments\n Status string // queued, running, completed, failed\n Priority int64 // Higher = executed first\n CreatedAt time.Time \n StartedAt *time.Time \n EndedAt *time.Time \n WorkerID string \n Error string \n Datasets []string \n Metadata map[string]string // commit_id, user, etc\n}\n</code></pre>"},{"location":"queue/#taskqueue-interface","title":"TaskQueue Interface","text":"<pre><code>// Initialize queue (named tq so it does not shadow the queue package)\ntq, err := queue.NewTaskQueue(queue.Config{\n RedisAddr: \"localhost:6379\",\n RedisPassword: \"\",\n RedisDB: 0,\n})\n\n// Add task (API server)\ntask := &queue.Task{\n ID: uuid.New().String(),\n JobName: \"train-model\",\n Status: \"queued\",\n Priority: 5,\n Metadata: map[string]string{\n \"commit_id\": commitID,\n \"user\": username,\n },\n}\nerr = tq.AddTask(task)\n\n// Get next task (Worker)\nnext, err := tq.GetNextTask()\n\n// Update task status\nnext.Status = \"running\"\nerr = tq.UpdateTask(next)\n</code></pre>"},{"location":"queue/#data-flow","title":"Data Flow","text":""},{"location":"queue/#job-submission-flow","title":"Job 
Submission Flow","text":"<pre><code>sequenceDiagram\n participant CLI\n participant API\n participant Redis\n participant Worker\n\n CLI->>API: Queue Job (WebSocket)\n API->>API: Create Task (UUID)\n API->>Redis: ZADD task:queue\n API->>Redis: SET task:{id}\n API->>CLI: Success Response\n\n Worker->>Redis: ZPOPMAX task:queue\n Redis->>Worker: Task ID\n Worker->>Redis: GET task:{id}\n Redis->>Worker: Task Data\n Worker->>Worker: Execute Job\n Worker->>Redis: Update Status\n</code></pre>"},{"location":"queue/#protocol","title":"Protocol","text":"<p>CLI \u2192 API (Binary WebSocket): <pre><code>[opcode:1][api_key_hash:64][commit_id:64][priority:1][job_name_len:1][job_name:var]\n</code></pre></p> <p>API \u2192 Redis: - Priority queue: <code>ZADD task:queue {priority} {task_id}</code> - Task data: <code>SET task:{id} {json}</code> - Status: <code>HSET task:status:{job_name} ...</code></p> <p>Worker \u2190 Redis: - Poll: <code>ZPOPMAX task:queue 1</code> (highest priority first) - Fetch: <code>GET task:{id}</code></p>"},{"location":"queue/#redis-data-structures","title":"Redis Data Structures","text":""},{"location":"queue/#keys","title":"Keys","text":"<pre><code>task:queue # ZSET: priority queue\ntask:{uuid} # STRING: task JSON data\ntask:status:{job_name} # HASH: job status\nworker:heartbeat # HASH: worker health\njob:metrics:{job_name} # HASH: job metrics\n</code></pre>"},{"location":"queue/#priority-queue-zset","title":"Priority Queue (ZSET)","text":"<pre><code>ZADD task:queue 10 \"uuid-1\" # Priority 10\nZADD task:queue 5 \"uuid-2\" # Priority 5\nZPOPMAX task:queue 1 # Returns uuid-1 (highest)\n</code></pre>"},{"location":"queue/#api-server-integration","title":"API Server Integration","text":""},{"location":"queue/#initialization","title":"Initialization","text":"<pre><code>// cmd/api-server/main.go\nqueueCfg := queue.Config{\n RedisAddr: cfg.Redis.Addr,\n RedisPassword: cfg.Redis.Password,\n RedisDB: cfg.Redis.DB,\n}\ntaskQueue, err := 
queue.NewTaskQueue(queueCfg)\n</code></pre>"},{"location":"queue/#websocket-handler","title":"WebSocket Handler","text":"<pre><code>// internal/api/ws.go\nfunc (h *WSHandler) handleQueueJob(conn *websocket.Conn, payload []byte) error {\n // Parse request\n apiKeyHash, commitID, priority, jobName := parsePayload(payload)\n\n // Create task with unique ID\n taskID := uuid.New().String()\n task := &queue.Task{\n ID: taskID,\n JobName: jobName,\n Status: \"queued\",\n Priority: int64(priority),\n Metadata: map[string]string{\n \"commit_id\": commitID,\n \"user\": user,\n },\n }\n\n // Enqueue\n if err := h.queue.AddTask(task); err != nil {\n return h.sendErrorPacket(conn, ErrorCodeDatabaseError, ...)\n }\n\n return h.sendSuccessPacket(conn, \"Job queued\")\n}\n</code></pre>"},{"location":"queue/#worker-integration","title":"Worker Integration","text":""},{"location":"queue/#task-polling","title":"Task Polling","text":"<pre><code>// cmd/worker/worker_server.go\nfunc (w *Worker) Start() error {\n ctx := context.Background()\n for {\n task, err := w.queue.WaitForNextTask(ctx, 5*time.Second)\n if err != nil {\n // Back off briefly before polling again\n time.Sleep(time.Second)\n continue\n }\n if task != nil {\n go w.executeTask(task)\n }\n }\n}\n</code></pre>"},{"location":"queue/#task-execution","title":"Task Execution","text":"<pre><code>func (w *Worker) executeTask(task *queue.Task) {\n // Update status\n now := time.Now()\n task.Status = \"running\"\n task.StartedAt = &now\n w.queue.UpdateTaskWithMetrics(task, \"start\")\n\n // Execute\n err := w.runJob(task)\n\n // Finalize: record the end time and the outcome\n endTime := time.Now()\n task.EndedAt = &endTime\n if err != nil {\n task.Status = \"failed\"\n task.Error = err.Error()\n } else {\n task.Status = \"completed\"\n }\n w.queue.UpdateTaskWithMetrics(task, \"final\")\n}\n</code></pre>"},{"location":"queue/#configuration","title":"Configuration","text":""},{"location":"queue/#api-server-configsconfigyaml","title":"API Server (<code>configs/config.yaml</code>)","text":"<pre><code>redis:\n addr: \"localhost:6379\"\n password: \"\"\n db: 0\n</code></pre>"},{"location":"queue/#worker-configsworker-configyaml","title":"Worker 
(<code>configs/worker-config.yaml</code>)","text":"<pre><code>redis:\n addr: \"localhost:6379\"\n password: \"\"\n db: 0\n\nmetrics_flush_interval: 500ms\n</code></pre>"},{"location":"queue/#monitoring","title":"Monitoring","text":""},{"location":"queue/#queue-depth","title":"Queue Depth","text":"<pre><code>depth, err := queue.QueueDepth()\nfmt.Printf(\"Pending tasks: %d\\n\", depth)\n</code></pre>"},{"location":"queue/#worker-heartbeat","title":"Worker Heartbeat","text":"<pre><code>// Worker sends heartbeat every 30s\nerr := queue.Heartbeat(workerID)\n</code></pre>"},{"location":"queue/#metrics","title":"Metrics","text":"<pre><code>HGETALL job:metrics:{job_name}\n# Returns: timestamp, tasks_start, tasks_final, etc\n</code></pre>"},{"location":"queue/#error-handling","title":"Error Handling","text":""},{"location":"queue/#task-failures","title":"Task Failures","text":"<pre><code>if err := w.runJob(task); err != nil {\n task.Status = \"failed\"\n task.Error = err.Error()\n w.queue.UpdateTask(task)\n}\n</code></pre>"},{"location":"queue/#redis-connection-loss","title":"Redis Connection Loss","text":"<pre><code>// TaskQueue automatically reconnects\n// Workers should implement retry logic\nfor retries := 0; retries < 3; retries++ {\n task, err := queue.GetNextTask()\n if err == nil {\n break\n }\n time.Sleep(backoff)\n}\n</code></pre>"},{"location":"queue/#testing","title":"Testing","text":"<pre><code>// tests using miniredis\ns, _ := miniredis.Run()\ndefer s.Close()\n\ntq, _ := queue.NewTaskQueue(queue.Config{\n RedisAddr: s.Addr(),\n})\n\ntask := &queue.Task{ID: \"test-1\", JobName: \"test\"}\ntq.AddTask(task)\n\nfetched, _ := tq.GetNextTask()\n// assert fetched.ID == \"test-1\"\n</code></pre>"},{"location":"queue/#best-practices","title":"Best Practices","text":"<ol> <li>Unique Task IDs: Always use UUIDs to avoid conflicts</li> <li>Metadata: Store commit_id and user in task metadata</li> <li>Priority: Higher values execute first (0-255 range)</li> <li>Status 
Updates: Update status at each lifecycle stage</li> <li>Error Logging: Store detailed errors in task.Error</li> <li>Heartbeats: Workers should send heartbeats regularly</li> <li>Metrics: Use UpdateTaskWithMetrics for atomic updates</li> </ol> <p>For implementation details, see: - internal/queue/task.go - internal/queue/queue.go</p>"},{"location":"quick-start/","title":"Quick Start","text":"<p>Get Fetch ML running in minutes with Docker Compose.</p>"},{"location":"quick-start/#prerequisites","title":"Prerequisites","text":"<p>Container Runtimes:</p> <ul> <li>Docker Compose: For testing and development only</li> <li>Podman: For production experiment execution</li> </ul> <ul> <li>4GB+ RAM</li> <li>2GB+ disk space</li> </ul>"},{"location":"quick-start/#one-command-setup","title":"One-Command Setup","text":"<pre><code># Clone and start\ngit clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\ndocker-compose up -d # testing only\n\n# Wait for services (30 seconds)\nsleep 30\n\n# Verify setup\ncurl http://localhost:9101/health\n</code></pre>"},{"location":"quick-start/#first-experiment","title":"First Experiment","text":"<pre><code># Submit a simple ML job (see [First Experiment](first-experiment.md) for details)\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n -H \"Content-Type: application/json\" \\\n -H \"X-API-Key: admin\" \\\n -d '{\n \"job_name\": \"hello-world\",\n \"args\": \"--echo Hello World\",\n \"priority\": 1\n }'\n\n# Check job status\ncurl http://localhost:9101/api/v1/jobs \\\n -H \"X-API-Key: admin\"\n</code></pre>"},{"location":"quick-start/#cli-access","title":"CLI Access","text":"<pre><code># Build CLI (subshell keeps the working directory at the repository root)\n(cd cli && zig build dev)\n\n# List jobs\n./cli/zig-out/dev/ml --server http://localhost:9101 list-jobs\n\n# Submit new job\n./cli/zig-out/dev/ml --server http://localhost:9101 submit \\\n --name \"test-job\" --args \"--epochs 10\"\n</code></pre>"},{"location":"quick-start/#related-documentation","title":"Related 
Documentation","text":"<ul> <li>Installation Guide - Detailed setup options</li> <li>First Experiment - Complete ML workflow</li> <li>Development Setup - Local development</li> <li>Security - Authentication and permissions</li> </ul>"},{"location":"quick-start/#troubleshooting","title":"Troubleshooting","text":"<p>Services not starting? <pre><code># Check logs\ndocker-compose logs\n\n# Restart services (testing only)\ndocker-compose down && docker-compose up -d\n</code></pre></p> <p>API not responding? <pre><code># Check health\ncurl http://localhost:9101/health\n\n# Verify ports\ndocker-compose ps\n</code></pre></p> <p>Permission denied? <pre><code># Check API key\ncurl -H \"X-API-Key: admin\" http://localhost:9101/api/v1/jobs\n</code></pre></p>"},{"location":"redis-ha/","title":"Redis High Availability","text":"<p>Note: This is optional for homelab setups. A single Redis instance is sufficient for most use cases.</p>"},{"location":"redis-ha/#when-you-need-ha","title":"When You Need HA","text":"<p>Consider Redis HA if: - Running production workloads - Uptime > 99.9% required - Can't afford to lose queued tasks - Multiple workers across machines</p>"},{"location":"redis-ha/#redis-sentinel-recommended","title":"Redis Sentinel (Recommended)","text":""},{"location":"redis-ha/#setup","title":"Setup","text":"<pre><code># docker-compose.yml\nversion: '3.8'\nservices:\n redis-master:\n image: redis:7-alpine\n command: redis-server --maxmemory 2gb\n\n redis-replica:\n image: redis:7-alpine\n command: redis-server --replicaof redis-master 6379\n\n redis-sentinel-1:\n image: redis:7-alpine\n command: redis-sentinel /etc/redis/sentinel.conf\n volumes:\n - ./sentinel.conf:/etc/redis/sentinel.conf\n</code></pre> <p>sentinel.conf: <pre><code>sentinel monitor mymaster redis-master 6379 2\nsentinel down-after-milliseconds mymaster 5000\nsentinel parallel-syncs mymaster 1\nsentinel failover-timeout mymaster 
10000\n</code></pre></p>"},{"location":"redis-ha/#application-configuration","title":"Application Configuration","text":"<pre><code># worker-config.yaml\nredis_addr: \"redis-sentinel-1:26379,redis-sentinel-2:26379\"\nredis_master_name: \"mymaster\"\n</code></pre>"},{"location":"redis-ha/#redis-cluster-advanced","title":"Redis Cluster (Advanced)","text":"<p>For larger deployments with sharding needs.</p> <pre><code># Minimum 3 masters + 3 replicas\nservices:\n redis-1:\n image: redis:7-alpine\n command: redis-server --cluster-enabled yes\n\n redis-2:\n # ... similar config\n</code></pre>"},{"location":"redis-ha/#homelab-alternative-persistence-only","title":"Homelab Alternative: Persistence Only","text":"<p>For most homelabs, just enable persistence:</p> <pre><code># docker-compose.yml\nservices:\n redis:\n image: redis:7-alpine\n command: redis-server --appendonly yes\n volumes:\n - redis_data:/data\n\nvolumes:\n redis_data:\n</code></pre> <p>This ensures tasks survive Redis restarts without full HA complexity.</p> <p>Recommendation: Start simple. Add HA only if you experience actual downtime issues.</p>"},{"location":"release-checklist/","title":"Release Checklist","text":"<p>This checklist captures the work required before cutting a release that includes the graceful worker shutdown feature.</p>"},{"location":"release-checklist/#1-code-hygiene-compilation","title":"1. Code Hygiene / Compilation","text":"<ol> <li>Merge the graceful-shutdown helpers into the canonical worker type to avoid <code>Worker redeclared</code> errors (see <code>cmd/worker/worker_graceful_shutdown.go</code> and <code>cmd/worker/worker_server.go</code>).</li> <li>Ensure the worker struct exposes the fields referenced by the new helpers (<code>logger</code>, <code>queue</code>, <code>cfg</code>, <code>metrics</code>).</li> <li><code>go build ./cmd/worker</code> succeeds without undefined-field errors.</li> </ol>"},{"location":"release-checklist/#2-graceful-shutdown-logic","title":"2. 
Graceful Shutdown Logic","text":"<ol> <li>Initialize <code>shutdownCh</code>, <code>activeTasks</code>, and <code>gracefulWait</code> during worker start-up.</li> <li>Confirm the heartbeat/lease helpers compile and handle queue errors gracefully (<code>heartbeatLoop</code>, <code>releaseAllLeases</code>).</li> <li>Add tests (unit or integration) that simulate SIGINT/SIGTERM and verify leases are released or tasks complete.</li> </ol>"},{"location":"release-checklist/#3-task-execution-flow","title":"3. Task Execution Flow","text":"<ol> <li>Align <code>executeTaskWithLease</code> with the real <code>executeTask</code> signature so the \"no value used as value\" compile error disappears.</li> <li>Double-check retry/metrics paths still match existing worker behavior after the new wrapper is added.</li> </ol>"},{"location":"release-checklist/#4-server-wiring","title":"4. Server Wiring","text":"<ol> <li>Ensure worker construction in <code>cmd/worker/worker_server.go</code> wires up config, queue, metrics, and logger instances used by the shutdown logic.</li> <li>Re-run worker unit tests plus any queue/lease e2e tests.</li> </ol>"},{"location":"release-checklist/#5-validation-before-tagging","title":"5. 
Validation Before Tagging","text":"<ol> <li><code>go test ./cmd/worker/...</code> and <code>make test</code> (or equivalent) pass locally.</li> <li>Manual smoke test: start worker, queue jobs, send SIGTERM, confirm tasks finish or leases are released and the process exits cleanly.</li> <li>Update release notes describing the new shutdown capability and any config changes required (e.g., graceful timeout settings).</li> </ol>"},{"location":"security/","title":"Security Guide","text":"<p>This document outlines security features, best practices, and hardening procedures for FetchML.</p>"},{"location":"security/#security-features","title":"Security Features","text":""},{"location":"security/#authentication-authorization","title":"Authentication & Authorization","text":"<ul> <li>API Keys: SHA256-hashed with role-based access control (RBAC)</li> <li>Permissions: Granular read/write/delete permissions per user</li> <li>IP Whitelisting: Network-level access control</li> <li>Rate Limiting: Per-user request quotas</li> </ul>"},{"location":"security/#communication-security","title":"Communication Security","text":"<ul> <li>TLS/HTTPS: End-to-end encryption for API traffic</li> <li>WebSocket Auth: API key required before upgrade</li> <li>Redis Auth: Password-protected task queue</li> </ul>"},{"location":"security/#data-privacy","title":"Data Privacy","text":"<ul> <li>Log Sanitization: Automatically redacts API keys, passwords, tokens</li> <li>Experiment Isolation: User-specific experiment directories</li> <li>No Anonymous Access: All services require authentication</li> </ul>"},{"location":"security/#network-security","title":"Network Security","text":"<ul> <li>Internal Networks: Backend services (Redis, Loki) not exposed publicly</li> <li>Firewall Rules: Restrictive port access</li> <li>Container Isolation: Services run in separate containers/pods</li> </ul>"},{"location":"security/#security-checklist","title":"Security 
Checklist","text":""},{"location":"security/#initial-setup","title":"Initial Setup","text":"<ol> <li> <p>Generate Strong Passwords <pre><code># Grafana admin password\nopenssl rand -base64 32 > .grafana-password\n\n# Redis password\nopenssl rand -base64 32\n</code></pre></p> </li> <li> <p>Configure Environment Variables <pre><code>cp .env.example .env\n# Edit .env and set:\n# - GRAFANA_ADMIN_PASSWORD\n</code></pre></p> </li> <li> <p>Enable TLS (Production only) <pre><code># configs/config-prod.yaml\nserver:\n tls:\n enabled: true\n cert_file: \"/secrets/cert.pem\"\n key_file: \"/secrets/key.pem\"\n</code></pre></p> </li> <li> <p>Configure Firewall <pre><code># Allow only necessary ports\nsudo ufw allow 22/tcp # SSH\nsudo ufw allow 443/tcp # HTTPS\nsudo ufw allow 80/tcp # HTTP (redirect to HTTPS)\nsudo ufw enable\n</code></pre></p> </li> </ol>"},{"location":"security/#production-hardening","title":"Production Hardening","text":"<ol> <li> <p>Restrict IP Access <pre><code># configs/config-prod.yaml\nauth:\n ip_whitelist:\n - \"10.0.0.0/8\"\n - \"192.168.0.0/16\"\n - \"127.0.0.1\"\n</code></pre></p> </li> <li> <p>Enable Audit Logging <pre><code>logging:\n level: \"info\"\n audit: true\n file: \"/var/log/fetch_ml/audit.log\"\n</code></pre></p> </li> <li> <p>Harden Redis <pre><code># Redis security\nredis-cli CONFIG SET requirepass \"your-strong-password\"\n\n# rename-command cannot be changed with CONFIG SET;\n# add these lines to redis.conf and restart Redis:\n# rename-command FLUSHDB \"\"\n# rename-command FLUSHALL \"\"\n</code></pre></p> </li> <li> <p>Secure Grafana <pre><code># Change default admin password\ndocker-compose exec grafana grafana-cli admin reset-admin-password new-strong-password\n</code></pre></p> </li> <li> <p>Regular Updates <pre><code># Update system packages\nsudo apt update && sudo apt upgrade -y\n\n# Update containers (testing only)\ndocker-compose pull\ndocker-compose up -d\n</code></pre></p> </li> </ol>"},{"location":"security/#password-management","title":"Password 
Management","text":""},{"location":"security/#generate-secure-passwords","title":"Generate Secure Passwords","text":"<pre><code># Method 1: OpenSSL\nopenssl rand -base64 32\n\n# Method 2: pwgen (if installed)\npwgen -s 32 1\n\n# Method 3: /dev/urandom\nhead -c 32 /dev/urandom | base64\n</code></pre>"},{"location":"security/#store-passwords-securely","title":"Store Passwords Securely","text":"<p>Development: Use <code>.env</code> file (gitignored) <pre><code>echo \"REDIS_PASSWORD=$(openssl rand -base64 32)\" >> .env\necho \"GRAFANA_ADMIN_PASSWORD=$(openssl rand -base64 32)\" >> .env\n</code></pre></p> <p>Production: Use systemd environment files <pre><code>sudo mkdir -p /etc/fetch_ml/secrets\nsudo chmod 700 /etc/fetch_ml/secrets\necho \"REDIS_PASSWORD=...\" | sudo tee /etc/fetch_ml/secrets/redis.env\nsudo chmod 600 /etc/fetch_ml/secrets/redis.env\n</code></pre></p>"},{"location":"security/#api-key-management","title":"API Key Management","text":""},{"location":"security/#generate-api-keys","title":"Generate API Keys","text":"<pre><code># Generate random API key\nopenssl rand -hex 32\n\n# Hash for storage\necho -n \"your-api-key\" | sha256sum\n</code></pre>"},{"location":"security/#rotate-api-keys","title":"Rotate API Keys","text":"<ol> <li>Generate new API key</li> <li>Update <code>config-local.yaml</code> with new hash</li> <li>Distribute new key to users</li> <li>Remove old key after grace period</li> </ol>"},{"location":"security/#revoke-api-keys","title":"Revoke API Keys","text":"<p>Remove user entry from <code>config-local.yaml</code>: <pre><code>auth:\n apikeys:\n # user_to_revoke: # Comment out or delete\n</code></pre></p>"},{"location":"security/#network-security_1","title":"Network Security","text":""},{"location":"security/#production-network-topology","title":"Production Network Topology","text":"<pre><code>Internet\n \u2193\n[Firewall] (ports 3000, 9102)\n \u2193\n[Reverse Proxy] (nginx/Apache) - TLS termination\n 
\u2193\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Application Pod \u2502\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 API Server \u2502 \u2502 \u2190 Public (via reverse proxy)\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 Redis \u2502 \u2502 \u2190 Internal only\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 Grafana \u2502 \u2502 \u2190 Public (via reverse proxy)\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 Prometheus \u2502 \u2502 \u2190 Internal only\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 Loki \u2502 \u2502 \u2190 Internal only\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n</code></pre>"},{"location":"security/#recommended-firewall-rules","title":"Recommended Firewall Rules","text":"<pre><code># Allow only necessary inbound 
connections\nsudo firewall-cmd --permanent --zone=public --add-rich-rule='\n rule family=\"ipv4\"\n source address=\"YOUR_NETWORK\"\n port port=\"3000\" protocol=\"tcp\" accept'\n\nsudo firewall-cmd --permanent --zone=public --add-rich-rule='\n rule family=\"ipv4\"\n source address=\"YOUR_NETWORK\"\n port port=\"9102\" protocol=\"tcp\" accept'\n\n# Block all other traffic\nsudo firewall-cmd --permanent --set-default-zone=drop\nsudo firewall-cmd --reload\n</code></pre>"},{"location":"security/#incident-response","title":"Incident Response","text":""},{"location":"security/#suspected-breach","title":"Suspected Breach","text":"<ol> <li> <p>Immediate Actions</p> <ul> <li>Rotate all API keys</li> <li>Stop affected services</li> <li>Review audit logs</li> </ul> </li> <li> <p>Investigation <pre><code># Check recent logins\nsudo journalctl -u fetchml-api --since \"1 hour ago\"\n\n# Review failed auth attempts\ngrep \"authentication failed\" /var/log/fetch_ml/*.log\n\n# Check active connections\nss -tnp | grep :9102\n</code></pre></p> </li> <li> <p>Recovery</p> <ul> <li>Rotate all passwords and API keys</li> <li>Update firewall rules</li> <li>Patch vulnerabilities</li> <li>Resume services</li> </ul> </li> </ol>"},{"location":"security/#security-monitoring","title":"Security Monitoring","text":"<pre><code># Monitor failed authentication\ntail -f /var/log/fetch_ml/api.log | grep \"auth.*failed\"\n\n# Monitor unusual activity\njournalctl -u fetchml-api -f | grep -E \"(ERROR|WARN)\"\n\n# Check open ports\nnmap -p- localhost\n</code></pre>"},{"location":"security/#security-best-practices","title":"Security Best Practices","text":"<ol> <li>Principle of Least Privilege: Grant minimum necessary permissions</li> <li>Defense in Depth: Multiple security layers (firewall + auth + TLS)</li> <li>Regular Updates: Keep all components patched</li> <li>Audit Regularly: Review logs and access patterns</li> <li>Secure Secrets: Never commit passwords/keys to git</li> 
<li>Network Segmentation: Isolate services with internal networks</li> <li>Monitor Everything: Enable comprehensive logging and alerting</li> <li>Test Security: Regular penetration testing and vulnerability scans</li> </ol>"},{"location":"security/#compliance","title":"Compliance","text":""},{"location":"security/#data-privacy_1","title":"Data Privacy","text":"<ul> <li>Logs are sanitized (no passwords/API keys)</li> <li>Experiment data is user-isolated</li> <li>No telemetry or external data sharing</li> </ul>"},{"location":"security/#audit-trail","title":"Audit Trail","text":"<p>All API access is logged with: - Timestamp - User/API key - Action performed - Source IP - Result (success/failure)</p>"},{"location":"security/#getting-help","title":"Getting Help","text":"<ul> <li>Security Issues: Report privately via email</li> <li>Questions: See documentation or create issue</li> <li>Updates: Monitor releases for security patches</li> </ul>"},{"location":"smart-defaults/","title":"Smart Defaults","text":"<p>This document describes Fetch ML's smart defaults system, which automatically adapts configuration based on the runtime environment.</p>"},{"location":"smart-defaults/#overview","title":"Overview","text":"<p>Smart defaults eliminate the need for manual configuration tweaks when running in different environments:</p> <ul> <li>Local Development: Optimized for developer machines with sensible paths and localhost services</li> <li>Container Environments: Uses container-friendly hostnames and paths</li> <li>CI/CD: Optimized for automated testing with fast polling and minimal resource usage</li> <li>Production: Uses production-ready defaults with proper security and scaling</li> </ul>"},{"location":"smart-defaults/#environment-detection","title":"Environment Detection","text":"<p>The system automatically detects the environment based on:</p> <ol> <li>CI Detection: Checks for <code>CI</code>, <code>GITHUB_ACTIONS</code>, <code>GITLAB_CI</code> environment variables</li> 
<li>Container Detection: Looks for <code>/.dockerenv</code>, <code>KUBERNETES_SERVICE_HOST</code>, or <code>CONTAINER</code> variables</li> <li>Production Detection: Checks <code>FETCH_ML_ENV=production</code> or <code>ENV=production</code></li> <li>Default: Falls back to local development</li> </ol>"},{"location":"smart-defaults/#default-values-by-environment","title":"Default Values by Environment","text":""},{"location":"smart-defaults/#host-configuration","title":"Host Configuration","text":"<ul> <li>Local: <code>localhost</code></li> <li>Container/CI: <code>host.docker.internal</code> (Docker Desktop/Colima)</li> <li>Production: <code>0.0.0.0</code></li> </ul>"},{"location":"smart-defaults/#base-paths","title":"Base Paths","text":"<ul> <li>Local: <code>~/ml-experiments</code></li> <li>Container/CI: <code>/workspace/ml-experiments</code></li> <li>Production: <code>/var/lib/fetch_ml/experiments</code></li> </ul>"},{"location":"smart-defaults/#data-directory","title":"Data Directory","text":"<ul> <li>Local: <code>~/ml-data</code></li> <li>Container/CI: <code>/workspace/data</code></li> <li>Production: <code>/var/lib/fetch_ml/data</code></li> </ul>"},{"location":"smart-defaults/#redis-address","title":"Redis Address","text":"<ul> <li>Local: <code>localhost:6379</code></li> <li>Container/CI: <code>redis:6379</code> (service name)</li> <li>Production: <code>redis:6379</code></li> </ul>"},{"location":"smart-defaults/#ssh-configuration","title":"SSH Configuration","text":"<ul> <li>Local: <code>~/.ssh/id_rsa</code> and <code>~/.ssh/known_hosts</code></li> <li>Container/CI: <code>/workspace/.ssh/id_rsa</code> and <code>/workspace/.ssh/known_hosts</code></li> <li>Production: <code>/etc/fetch_ml/ssh/id_rsa</code> and <code>/etc/fetch_ml/ssh/known_hosts</code></li> </ul>"},{"location":"smart-defaults/#worker-configuration","title":"Worker Configuration","text":"<ul> <li>Local: 2 workers, 5-second poll interval</li> <li>CI: 1 worker, 1-second poll interval (fast 
testing)</li> <li>Production: CPU core count workers, 10-second poll interval</li> </ul>"},{"location":"smart-defaults/#log-levels","title":"Log Levels","text":"<ul> <li>Local: <code>info</code></li> <li>CI: <code>debug</code> (verbose for debugging)</li> <li>Production: <code>info</code></li> </ul>"},{"location":"smart-defaults/#usage","title":"Usage","text":""},{"location":"smart-defaults/#in-configuration-loaders","title":"In Configuration Loaders","text":"<pre><code>// Get smart defaults for current environment\nsmart := config.GetSmartDefaults()\n\n// Use smart defaults\nif cfg.Host == \"\" {\n cfg.Host = smart.Host()\n}\nif cfg.BasePath == \"\" {\n cfg.BasePath = smart.BasePath()\n}\n</code></pre>"},{"location":"smart-defaults/#environment-overrides","title":"Environment Overrides","text":"<p>Smart defaults can be overridden with environment variables:</p> <ul> <li><code>FETCH_ML_HOST</code> - Override host</li> <li><code>FETCH_ML_BASE_PATH</code> - Override base path</li> <li><code>FETCH_ML_REDIS_ADDR</code> - Override Redis address</li> <li><code>FETCH_ML_ENV</code> - Force environment profile</li> </ul>"},{"location":"smart-defaults/#manual-environment-selection","title":"Manual Environment Selection","text":"<p>You can force a specific environment:</p> <pre><code># Force production mode\nexport FETCH_ML_ENV=production\n\n# Force container mode\nexport CONTAINER=true\n</code></pre>"},{"location":"smart-defaults/#implementation-details","title":"Implementation Details","text":"<p>The smart defaults system is implemented in <code>internal/config/smart_defaults.go</code>:</p> <ul> <li><code>DetectEnvironment()</code> - Determines current environment profile</li> <li><code>SmartDefaults</code> struct - Provides environment-aware defaults</li> <li>Helper methods for each configuration value</li> </ul>"},{"location":"smart-defaults/#migration-guide","title":"Migration Guide","text":""},{"location":"smart-defaults/#for-users","title":"For Users","text":"<p>No 
changes required - existing configurations continue to work. Smart defaults only apply when values are not explicitly set.</p>"},{"location":"smart-defaults/#for-developers","title":"For Developers","text":"<p>When adding new configuration options:</p> <ol> <li>Add a method to <code>SmartDefaults</code> struct</li> <li>Use the smart default in config loaders</li> <li>Document the environment-specific values</li> </ol> <p>Example:</p> <pre><code>// Add to SmartDefaults struct\nfunc (s *SmartDefaults) NewFeature() string {\n switch s.Profile {\n case ProfileContainer, ProfileCI:\n return \"/workspace/new-feature\"\n case ProfileProduction:\n return \"/var/lib/fetch_ml/new-feature\"\n default:\n return \"./new-feature\"\n }\n}\n\n// Use in config loader\nif cfg.NewFeature == \"\" {\n cfg.NewFeature = smart.NewFeature()\n}\n</code></pre>"},{"location":"smart-defaults/#testing","title":"Testing","text":"<p>To test different environments:</p> <pre><code># Test local defaults (default)\n./bin/worker\n\n# Test container defaults\nexport CONTAINER=true\n./bin/worker\n\n# Test CI defaults\nexport CI=true\n./bin/worker\n\n# Test production defaults\nexport FETCH_ML_ENV=production\n./bin/worker\n</code></pre>"},{"location":"smart-defaults/#troubleshooting","title":"Troubleshooting","text":""},{"location":"smart-defaults/#wrong-environment-detection","title":"Wrong Environment Detection","text":"<p>Check environment variables:</p> <pre><code>echo \"CI: $CI\"\necho \"CONTAINER: $CONTAINER\"\necho \"FETCH_ML_ENV: $FETCH_ML_ENV\"\n</code></pre>"},{"location":"smart-defaults/#path-issues","title":"Path Issues","text":"<p>Smart defaults expand <code>~</code> and environment variables automatically. 
If paths don't work as expected:</p> <ol> <li>Check the detected environment: <code>config.GetSmartDefaults().GetEnvironmentDescription()</code></li> <li>Verify the path exists in the target environment</li> <li>Override with environment variable if needed</li> </ol>"},{"location":"smart-defaults/#container-networking","title":"Container Networking","text":"<p>For container environments, ensure: - Redis service is named <code>redis</code> in docker-compose - Host networking is configured properly - <code>host.docker.internal</code> resolves (Docker Desktop/Colima)</p>"},{"location":"testing/","title":"Testing Guide","text":"<p>How to run and write tests for Fetch ML.</p>"},{"location":"testing/#running-tests","title":"Running Tests","text":""},{"location":"testing/#quick-test","title":"Quick Test","text":"<pre><code># All tests\nmake test\n\n# Unit tests only\nmake test-unit\n\n# Integration tests\nmake test-integration\n\n# With coverage\nmake test-coverage\n</code></pre>"},{"location":"testing/#docker-testing","title":"Docker Testing","text":"<pre><code>docker-compose up -d  # testing only\nmake test\ndocker-compose down\n</code></pre>"},{"location":"testing/#cli-testing","title":"CLI Testing","text":"<pre><code>cd cli && zig build dev\n./cli/zig-out/dev/ml --help\nzig build test\n</code></pre>"},{"location":"troubleshooting/","title":"Troubleshooting","text":"<p>Common issues and solutions for Fetch ML.</p>"},{"location":"troubleshooting/#quick-fixes","title":"Quick Fixes","text":""},{"location":"troubleshooting/#services-not-starting","title":"Services Not Starting","text":"<pre><code># Check Docker status\ndocker-compose ps\n\n# Restart services\ndocker-compose down && docker-compose up -d  # testing only\n\n# Check logs\ndocker-compose logs 
-f\n</code></pre>"},{"location":"troubleshooting/#api-not-responding","title":"API Not Responding","text":"<pre><code># Check health endpoint\ncurl http://localhost:9101/health\n\n# Check if port is in use\nlsof -i :9101\n\n# Kill process on port\nkill -9 $(lsof -ti :9101)\n</code></pre>"},{"location":"troubleshooting/#database-issues","title":"Database Issues","text":"<pre><code># Check database connection\ndocker-compose exec postgres psql -U postgres -d fetch_ml\n\n# Reset database\ndocker-compose down postgres\ndocker-compose up -d postgres  # testing only\n\n# Check Redis\ndocker-compose exec redis redis-cli ping\n</code></pre>"},{"location":"troubleshooting/#common-errors","title":"Common Errors","text":""},{"location":"troubleshooting/#authentication-errors","title":"Authentication Errors","text":"<ul> <li>Invalid API key: Check config and regenerate hash</li> <li>JWT expired: Check <code>jwt_expiry</code> setting</li> </ul>"},{"location":"troubleshooting/#database-errors","title":"Database Errors","text":"<ul> <li>Connection failed: Verify database type and connection params</li> <li>No such table: Run migrations with <code>--migrate</code> (see Development Setup)</li> </ul>"},{"location":"troubleshooting/#container-errors","title":"Container Errors","text":"<ul> <li>Runtime not found: Set <code>runtime: docker</code> in config (testing only)</li> <li>Image pull failed: Check registry access</li> </ul>"},{"location":"troubleshooting/#performance-issues","title":"Performance Issues","text":"<ul> <li>High memory: Adjust <code>resources.memory_limit</code></li> <li>Slow jobs: Check worker count and queue size</li> </ul>"},{"location":"troubleshooting/#development-issues","title":"Development Issues","text":"<ul> <li>Build fails: Run <code>go mod tidy</code> and <code>cd cli && rm -rf zig-out zig-cache</code></li> <li>Tests fail: Start test dependencies with <code>docker-compose -f docker-compose.test.yml up -d</code></li> 
</ul>"},{"location":"troubleshooting/#cli-issues","title":"CLI Issues","text":"<ul> <li>Not found: <code>cd cli && zig build dev</code></li> <li>Connection errors: Check <code>--server</code> and <code>--api-key</code></li> </ul>"},{"location":"troubleshooting/#network-issues","title":"Network Issues","text":"<ul> <li>Port conflicts: <code>lsof -i :9101</code> and kill processes</li> <li>Firewall: Allow ports 9101, 6379, 5432</li> </ul>"},{"location":"troubleshooting/#configuration-issues","title":"Configuration Issues","text":"<ul> <li>Invalid YAML: <code>python3 -c \"import yaml; yaml.safe_load(open('config.yaml'))\"</code></li> <li>Missing fields: See the Configuration Schema (<code>configuration-schema.md</code>)</li> </ul>"},{"location":"troubleshooting/#debug-information","title":"Debug Information","text":"<pre><code>./bin/api-server --version\ndocker-compose ps\ndocker-compose logs api-server | grep ERROR\n</code></pre>"},{"location":"troubleshooting/#emergency-reset","title":"Emergency Reset","text":"<pre><code>docker-compose down -v\nrm -rf data/ results/ *.db\ndocker-compose up -d  # testing only\n</code></pre>"},{"location":"user-permissions/","title":"User Permissions in Fetch ML","text":"<p>Fetch ML now supports user-based permissions to ensure data scientists can only view and manage their own experiments while administrators retain full control.</p>"},{"location":"user-permissions/#overview","title":"Overview","text":"<ul> <li>User Isolation: Each user can only see their own experiments</li> <li>Admin Override: Administrators can view and manage all experiments</li> <li>Permission-Based: Fine-grained permissions for create, read, update operations</li> <li>API Key Authentication: Secure authentication using API keys</li> </ul>"},{"location":"user-permissions/#permissions","title":"Permissions","text":""},{"location":"user-permissions/#job-permissions","title":"Job Permissions","text":"<ul> <li><code>jobs:create</code> - Create new experiments</li> 
<li><code>jobs:read</code> - View experiment status and results</li> <li><code>jobs:update</code> - Cancel or modify experiments</li> </ul>"},{"location":"user-permissions/#user-types","title":"User Types","text":"<ul> <li>Administrators: Full access to all experiments and system operations</li> <li>Data Scientists: Access to their own experiments only</li> <li>Viewers: Read-only access to their own experiments</li> </ul>"},{"location":"user-permissions/#cli-usage","title":"CLI Usage","text":""},{"location":"user-permissions/#view-your-jobs","title":"View Your Jobs","text":"<p><pre><code>ml status\n</code></pre> Shows only your experiments with user context displayed.</p>"},{"location":"user-permissions/#cancel-your-jobs","title":"Cancel Your Jobs","text":"<p><pre><code>ml cancel <job-name>\n</code></pre> Only allows canceling your own experiments (unless you're an admin).</p>"},{"location":"user-permissions/#authentication","title":"Authentication","text":"<p>The CLI automatically authenticates using your API key from <code>~/.ml/config.toml</code>.</p>"},{"location":"user-permissions/#configuration","title":"Configuration","text":""},{"location":"user-permissions/#api-key-setup","title":"API Key Setup","text":"<pre><code>[worker]\napi_key = \"your-api-key-here\"\n</code></pre>"},{"location":"user-permissions/#user-roles","title":"User Roles","text":"<p>User roles and permissions are configured on the server side by administrators.</p>"},{"location":"user-permissions/#security-features","title":"Security Features","text":"<ul> <li>API Key Hashing: Keys are hashed before transmission</li> <li>User Filtering: Server-side filtering prevents unauthorized access</li> <li>Permission Validation: All operations require appropriate permissions</li> <li>Audit Logging: All user actions are logged</li> </ul>"},{"location":"user-permissions/#examples","title":"Examples","text":""},{"location":"user-permissions/#data-scientist-workflow","title":"Data Scientist 
Workflow","text":"<pre><code># Submit your experiment\nml run my-experiment\n\n# Check your experiments (only shows yours)\nml status\n\n# Cancel your own experiment\nml cancel my-experiment\n</code></pre>"},{"location":"user-permissions/#administrator-workflow","title":"Administrator Workflow","text":"<pre><code># View all experiments (admin sees everything)\nml status\n\n# Cancel any user's experiment\nml cancel user-experiment\n</code></pre>"},{"location":"user-permissions/#error-messages","title":"Error Messages","text":"<ul> <li>\"Insufficient permissions\": You don't have the required permission</li> <li>\"You can only cancel your own jobs\": Ownership restriction</li> <li>\"Invalid API key\": Authentication failed</li> </ul>"},{"location":"user-permissions/#migration-notes","title":"Migration Notes","text":"<ul> <li>Existing configurations continue to work</li> <li>When auth is disabled, all users have admin-like access</li> <li>User ownership is automatically assigned to new experiments</li> </ul> <p>For more details, see the architecture documentation.</p>"},{"location":"zig-cli/","title":"Zig CLI Guide","text":"<p>High-performance command-line interface for ML experiment management, written in Zig for maximum speed and efficiency.</p>"},{"location":"zig-cli/#overview","title":"Overview","text":"<p>The Zig CLI (<code>ml</code>) is the primary interface for managing ML experiments in your homelab. 
Built with Zig, it provides exceptional performance for file operations, network communication, and experiment management.</p>"},{"location":"zig-cli/#installation","title":"Installation","text":""},{"location":"zig-cli/#pre-built-binaries-recommended","title":"Pre-built Binaries (Recommended)","text":"<p>Download from GitHub Releases:</p> <pre><code># Download for your platform\ncurl -LO https://github.com/jfraeys/fetch_ml/releases/latest/download/ml-<platform>.tar.gz\n\n# Extract\ntar -xzf ml-<platform>.tar.gz\n\n# Install\nchmod +x ml-<platform>\nsudo mv ml-<platform> /usr/local/bin/ml\n\n# Verify\nml --help\n</code></pre> <p>Platforms: - <code>ml-linux-x86_64.tar.gz</code> - Linux (fully static, zero dependencies) - <code>ml-macos-x86_64.tar.gz</code> - macOS Intel - <code>ml-macos-arm64.tar.gz</code> - macOS Apple Silicon</p> <p>All release binaries include embedded static rsync for complete independence.</p>"},{"location":"zig-cli/#build-from-source","title":"Build from Source","text":"<p>Development Build (uses system rsync): <pre><code>cd cli\nzig build dev\n./zig-out/dev/ml-dev --help\n</code></pre></p> <p>Production Build (embedded rsync): <pre><code>cd cli\n# For testing: uses rsync wrapper\nzig build prod\n\n# For release with static rsync:\n# 1. Place static rsync binary at src/assets/rsync_release.bin\n# 2. 
Build\nzig build prod\nstrip zig-out/prod/ml # Optional: reduce size\n\n# Verify\n./zig-out/prod/ml --help\nls -lh zig-out/prod/ml\n</code></pre></p> <p>See cli/src/assets/README.md for details on obtaining static rsync binaries.</p>"},{"location":"zig-cli/#verify-installation","title":"Verify Installation","text":"<pre><code>ml --help\nml --version # Shows build config\n</code></pre>"},{"location":"zig-cli/#quick-start","title":"Quick Start","text":"<ol> <li> <p>Initialize Configuration <pre><code>./cli/zig-out/bin/ml init\n</code></pre></p> </li> <li> <p>Sync Your First Project <pre><code>./cli/zig-out/bin/ml sync ./my-project --queue\n</code></pre></p> </li> <li> <p>Monitor Progress <pre><code>./cli/zig-out/bin/ml status\n</code></pre></p> </li> </ol>"},{"location":"zig-cli/#command-reference","title":"Command Reference","text":""},{"location":"zig-cli/#init-configuration-setup","title":"<code>init</code> - Configuration Setup","text":"<p>Initialize the CLI configuration file.</p> <pre><code>ml init\n</code></pre> <p>Creates: <code>~/.ml/config.toml</code></p> <p>Configuration Template: <pre><code>worker_host = \"worker.local\"\nworker_user = \"mluser\"\nworker_base = \"/data/ml-experiments\"\nworker_port = 22\napi_key = \"your-api-key\"\n</code></pre></p>"},{"location":"zig-cli/#sync-project-synchronization","title":"<code>sync</code> - Project Synchronization","text":"<p>Sync project files to the worker with intelligent deduplication.</p> <pre><code># Basic sync\nml sync ./project\n\n# Sync with custom name and auto-queue\nml sync ./project --name \"experiment-1\" --queue\n\n# Sync with priority\nml sync ./project --priority 8\n</code></pre> <p>Options: - <code>--name <name></code>: Custom experiment name - <code>--queue</code>: Automatically queue after sync - <code>--priority N</code>: Set priority (1-10, default 5)</p> <p>Features: - Content-Addressed Storage: Automatic deduplication - SHA256 Commit IDs: Reliable change detection - Incremental Transfer: 
Only sync changed files - Rsync Backend: Efficient file transfer</p>"},{"location":"zig-cli/#queue-job-management","title":"<code>queue</code> - Job Management","text":"<p>Queue experiments for execution on the worker.</p> <pre><code># Queue with commit ID\nml queue my-job --commit abc123def456\n\n# Queue with priority\nml queue my-job --commit abc123 --priority 8\n</code></pre> <p>Options: - <code>--commit <id></code>: Commit ID from sync output - <code>--priority N</code>: Execution priority (1-10)</p> <p>Features: - WebSocket Communication: Real-time job submission - Priority Queuing: Higher priority jobs run first - API Authentication: Secure job submission</p>"},{"location":"zig-cli/#watch-auto-sync-monitoring","title":"<code>watch</code> - Auto-Sync Monitoring","text":"<p>Monitor directories for changes and auto-sync.</p> <pre><code># Watch for changes\nml watch ./project\n\n# Watch and auto-queue on changes\nml watch ./project --name \"dev-exp\" --queue\n</code></pre> <p>Options: - <code>--name <name></code>: Custom experiment name - <code>--queue</code>: Auto-queue on changes - <code>--priority N</code>: Set priority for queued jobs</p> <p>Features: - Real-time Monitoring: 2-second polling interval - Change Detection: File modification time tracking - Commit Comparison: Only sync when content changes - Automatic Queuing: Seamless development workflow</p>"},{"location":"zig-cli/#status-system-status","title":"<code>status</code> - System Status","text":"<p>Check system and worker status.</p> <pre><code>ml status\n</code></pre> <p>Displays: - Worker connectivity - Queue status - Running jobs - System health</p>"},{"location":"zig-cli/#monitor-remote-monitoring","title":"<code>monitor</code> - Remote Monitoring","text":"<p>Launch TUI interface via SSH for real-time monitoring.</p> <pre><code>ml monitor\n</code></pre> <p>Features: - Real-time Updates: Live experiment status - Interactive Interface: Browse and manage experiments - SSH Integration: Secure remote 
access</p>"},{"location":"zig-cli/#cancel-job-cancellation","title":"<code>cancel</code> - Job Cancellation","text":"<p>Cancel running or queued jobs.</p> <pre><code>ml cancel job-id\n</code></pre> <p>Options: - <code>job-id</code>: Job identifier from status output</p>"},{"location":"zig-cli/#prune-cleanup-management","title":"<code>prune</code> - Cleanup Management","text":"<p>Clean up old experiments to save space.</p> <pre><code># Keep last N experiments\nml prune --keep 20\n\n# Remove experiments older than N days\nml prune --older-than 30\n</code></pre> <p>Options: - <code>--keep N</code>: Keep N most recent experiments - <code>--older-than N</code>: Remove experiments older than N days</p>"},{"location":"zig-cli/#architecture","title":"Architecture","text":"<p>Testing: Docker Compose (macOS/Linux) Production: Podman + systemd (Linux)</p> <p>Important: Docker is for testing only. Podman is used for running actual ML experiments in production.</p>"},{"location":"zig-cli/#core-components","title":"Core Components","text":"<pre><code>cli/src/\n\u251c\u2500\u2500 commands/ # Command implementations\n\u2502 \u251c\u2500\u2500 init.zig # Configuration setup\n\u2502 \u251c\u2500\u2500 sync.zig # Project synchronization\n\u2502 \u251c\u2500\u2500 queue.zig # Job management\n\u2502 \u251c\u2500\u2500 watch.zig # Auto-sync monitoring\n\u2502 \u251c\u2500\u2500 status.zig # System status\n\u2502 \u251c\u2500\u2500 monitor.zig # Remote monitoring\n\u2502 \u251c\u2500\u2500 cancel.zig # Job cancellation\n\u2502 \u2514\u2500\u2500 prune.zig # Cleanup operations\n\u251c\u2500\u2500 config.zig # Configuration management\n\u251c\u2500\u2500 errors.zig # Error handling\n\u251c\u2500\u2500 net/ # Network utilities\n\u2502 \u2514\u2500\u2500 ws.zig # WebSocket client\n\u2514\u2500\u2500 utils/ # Utility functions\n \u251c\u2500\u2500 crypto.zig # Hashing and encryption\n \u251c\u2500\u2500 storage.zig # Content-addressed storage\n \u2514\u2500\u2500 rsync.zig # File 
synchronization\n</code></pre>"},{"location":"zig-cli/#performance-features","title":"Performance Features","text":""},{"location":"zig-cli/#content-addressed-storage","title":"Content-Addressed Storage","text":"<ul> <li>Deduplication: Identical files shared across experiments</li> <li>Hash-based Storage: Files stored by SHA256 hash</li> <li>Space Efficiency: Reduces storage by up to 90%</li> </ul>"},{"location":"zig-cli/#sha256-commit-ids","title":"SHA256 Commit IDs","text":"<ul> <li>Reliable Detection: Cryptographic change detection</li> <li>Collision Resistance: Identifiers are unique for all practical purposes</li> <li>Fast Computation: Optimized for large directories</li> </ul>"},{"location":"zig-cli/#websocket-protocol","title":"WebSocket Protocol","text":"<ul> <li>Low Latency: Real-time communication</li> <li>Binary Protocol: Efficient message format</li> <li>Connection Pooling: Reused connections</li> </ul>"},{"location":"zig-cli/#memory-management","title":"Memory Management","text":"<ul> <li>Arena Allocators: Efficient memory allocation</li> <li>Zero-copy Operations: Minimized memory usage</li> <li>Resource Cleanup: Automatic resource management</li> </ul>"},{"location":"zig-cli/#security-features","title":"Security Features","text":""},{"location":"zig-cli/#authentication","title":"Authentication","text":"<ul> <li>API Key Hashing: Secure token storage</li> <li>SHA256 Hashes: Irreversible token protection</li> <li>Config Validation: Input sanitization</li> </ul>"},{"location":"zig-cli/#secure-communication","title":"Secure Communication","text":"<ul> <li>SSH Integration: Encrypted file transfers</li> <li>WebSocket Security: TLS-protected communication</li> <li>Input Validation: Comprehensive argument checking</li> </ul>"},{"location":"zig-cli/#error-handling","title":"Error Handling","text":"<ul> <li>Secure Reporting: No sensitive information leakage</li> <li>Graceful Degradation: Safe error recovery</li> <li>Audit Logging: Operation tracking</li> 
</ul>"},{"location":"zig-cli/#advanced-usage","title":"Advanced Usage","text":""},{"location":"zig-cli/#workflow-integration","title":"Workflow Integration","text":""},{"location":"zig-cli/#development-workflow","title":"Development Workflow","text":"<pre><code># 1. Initialize project\nml sync ./project --name \"dev\" --queue\n\n# 2. Auto-sync during development\nml watch ./project --name \"dev\" --queue\n\n# 3. Monitor progress\nml status\n</code></pre>"},{"location":"zig-cli/#batch-processing","title":"Batch Processing","text":"<pre><code># Process multiple experiments\nfor dir in experiments/*/; do\n ml sync \"$dir\" --queue\ndone\n</code></pre>"},{"location":"zig-cli/#priority-management","title":"Priority Management","text":"<pre><code># High priority experiment\nml sync ./urgent --priority 10 --queue\n\n# Background processing\nml sync ./background --priority 1 --queue\n</code></pre>"},{"location":"zig-cli/#configuration-management","title":"Configuration Management","text":""},{"location":"zig-cli/#multiple-workers","title":"Multiple Workers","text":"<pre><code># ~/.ml/config.toml\nworker_host = \"worker.local\"\nworker_user = \"mluser\"\nworker_base = \"/data/ml-experiments\"\nworker_port = 22\napi_key = \"your-api-key\"\n</code></pre>"},{"location":"zig-cli/#security-settings","title":"Security Settings","text":"<pre><code># Set restrictive permissions\nchmod 600 ~/.ml/config.toml\n\n# Verify configuration\nml status\n</code></pre>"},{"location":"zig-cli/#troubleshooting","title":"Troubleshooting","text":""},{"location":"zig-cli/#common-issues","title":"Common Issues","text":""},{"location":"zig-cli/#build-problems","title":"Build Problems","text":"<pre><code># Check Zig installation\nzig version\n\n# Clean build\ncd cli && make clean && make build\n</code></pre>"},{"location":"zig-cli/#connection-issues","title":"Connection Issues","text":"<pre><code># Test SSH connectivity\nssh -p $worker_port $worker_user@$worker_host\n\n# Verify configuration\ncat 
~/.ml/config.toml\n</code></pre>"},{"location":"zig-cli/#sync-failures","title":"Sync Failures","text":"<pre><code># Check rsync\nrsync --version\n\n# Manual sync test\nrsync -avz ./test/ $worker_user@$worker_host:/tmp/\n</code></pre>"},{"location":"zig-cli/#performance-issues","title":"Performance Issues","text":"<pre><code># Monitor resource usage\ntop -p $(pgrep ml)\n\n# Check disk space\ndf -h $worker_base\n</code></pre>"},{"location":"zig-cli/#debug-mode","title":"Debug Mode","text":"<p>Enable verbose logging: <pre><code># Environment variable\nexport ML_DEBUG=1\nml sync ./project\n\n# Or use debug build\ncd cli && make debug\n</code></pre></p>"},{"location":"zig-cli/#performance-benchmarks","title":"Performance Benchmarks","text":""},{"location":"zig-cli/#file-operations","title":"File Operations","text":"<ul> <li>Sync Speed: 100MB/s+ (network limited)</li> <li>Hash Computation: 500MB/s+ (CPU limited)</li> <li>Deduplication: 90%+ space savings</li> </ul>"},{"location":"zig-cli/#memory-usage","title":"Memory Usage","text":"<ul> <li>Base Memory: ~10MB</li> <li>Large Projects: ~50MB (1GB+ projects)</li> <li>Memory Efficiency: Constant per-file overhead</li> </ul>"},{"location":"zig-cli/#network-performance","title":"Network Performance","text":"<ul> <li>WebSocket Latency: <10ms (local network)</li> <li>Connection Setup: <100ms</li> <li>Throughput: Network limited</li> </ul>"},{"location":"zig-cli/#contributing","title":"Contributing","text":""},{"location":"zig-cli/#development-setup","title":"Development Setup","text":"<pre><code>cd cli\nzig build-exe src/main.zig\n</code></pre>"},{"location":"zig-cli/#testing","title":"Testing","text":"<pre><code># Run tests\ncd cli && zig test src/\n\n# Integration tests\nzig test tests/\n</code></pre>"},{"location":"zig-cli/#code-style","title":"Code Style","text":"<ul> <li>Follow Zig style guidelines</li> <li>Use explicit error handling</li> <li>Document public APIs</li> <li>Add comprehensive tests</li> </ul> <p>For more 
information, see the CLI Reference and Architecture pages.</p>"},{"location":"adr/","title":"Architecture Decision Records (ADRs)","text":"<p>This directory contains Architecture Decision Records (ADRs) for the Fetch ML project.</p>"},{"location":"adr/#what-are-adrs","title":"What are ADRs?","text":"<p>Architecture Decision Records are short text files that document a single architectural decision. They capture the context, options considered, decision made, and consequences of that decision.</p>"},{"location":"adr/#adr-template","title":"ADR Template","text":"<p>Each ADR follows this structure:</p> <pre><code># ADR-XXX: [Title]\n\n## Status\n[Proposed | Accepted | Deprecated | Superseded]\n\n## Context\n[What is the issue that we're facing that needs a decision?]\n\n## Decision\n[What is the change that we're proposing and/or doing?]\n\n## Consequences\n[What becomes easier or more difficult to do because of this change?]\n\n## Options Considered\n[What other approaches did we consider and why did we reject them?]\n</code></pre>"},{"location":"adr/#adr-index","title":"ADR Index","text":"ADR Title Status ADR-001 Use Go for API Server Accepted ADR-002 Use SQLite for Local Development Accepted ADR-003 Use Redis for Job Queue Accepted"},{"location":"adr/#how-to-add-a-new-adr","title":"How to Add a New ADR","text":"<ol> <li>Create a new file named <code>ADR-XXX-title.md</code> where XXX is the next sequential number</li> <li>Use the template above</li> <li>Update this README with the new ADR in the index</li> <li>Submit a pull request for review</li> </ol>"},{"location":"adr/#adr-lifecycle","title":"ADR Lifecycle","text":"<ul> <li>Proposed: Initial draft, under discussion</li> <li>Accepted: Decision made and implemented</li> <li>Deprecated: Decision no longer recommended but still in use</li> <li>Superseded: Replaced by a newer ADR</li> </ul>"},{"location":"adr/ADR-001-use-go-for-api-server/","title":"ADR-001: Use Go for API 
Server","text":""},{"location":"adr/ADR-001-use-go-for-api-server/#status","title":"Status","text":"<p>Accepted</p>"},{"location":"adr/ADR-001-use-go-for-api-server/#context","title":"Context","text":"<p>We needed to choose a programming language for the Fetch ML API server that would provide: - High performance for ML experiment management - Strong concurrency support for handling multiple experiments - Good ecosystem for HTTP APIs and WebSocket connections - Easy deployment and containerization - Strong type safety and reliability</p>"},{"location":"adr/ADR-001-use-go-for-api-server/#decision","title":"Decision","text":"<p>We chose Go as the primary language for the API server implementation.</p>"},{"location":"adr/ADR-001-use-go-for-api-server/#consequences","title":"Consequences","text":""},{"location":"adr/ADR-001-use-go-for-api-server/#positive","title":"Positive","text":"<ul> <li>Excellent performance with low memory footprint</li> <li>Built-in concurrency primitives (goroutines, channels) perfect for parallel ML experiment execution</li> <li>Rich ecosystem for HTTP servers, WebSocket, and database drivers</li> <li>Static compilation creates single binary deployments</li> <li>Strong typing catches many errors at compile time</li> <li>Good tooling for testing, benchmarking, and profiling</li> </ul>"},{"location":"adr/ADR-001-use-go-for-api-server/#negative","title":"Negative","text":"<ul> <li>Steeper learning curve for team members unfamiliar with Go</li> <li>Less expressive than dynamic languages for rapid prototyping</li> <li>Smaller ecosystem for ML-specific libraries compared to Python</li> </ul>"},{"location":"adr/ADR-001-use-go-for-api-server/#options-considered","title":"Options Considered","text":""},{"location":"adr/ADR-001-use-go-for-api-server/#python-with-fastapi","title":"Python with FastAPI","text":"<p>Pros: - Rich ML ecosystem (TensorFlow, PyTorch, scikit-learn) - Easy to learn and write - Great for data science teams - FastAPI provides good 
performance</p> <p>Cons: - Global Interpreter Lock limits true parallelism - Higher memory usage - Slower performance for high-throughput scenarios - More complex deployment (multiple files, dependencies)</p>"},{"location":"adr/ADR-001-use-go-for-api-server/#nodejs-with-express","title":"Node.js with Express","text":"<p>Pros: - Excellent WebSocket support - Large ecosystem - Fast development cycle</p> <p>Cons: - Single-threaded event loop can be limiting - Not ideal for CPU-intensive ML operations - Dynamic typing can lead to runtime errors</p>"},{"location":"adr/ADR-001-use-go-for-api-server/#rust","title":"Rust","text":"<p>Pros: - Maximum performance and memory safety - Strong type system - Growing ecosystem</p> <p>Cons: - Very steep learning curve - Longer development time - Smaller ecosystem for web frameworks</p>"},{"location":"adr/ADR-001-use-go-for-api-server/#java-with-spring-boot","title":"Java with Spring Boot","text":"<p>Pros: - Mature ecosystem - Good performance - Strong typing</p> <p>Cons: - Higher memory usage - More verbose syntax - Slower startup time - Heavier deployment footprint</p>"},{"location":"adr/ADR-001-use-go-for-api-server/#rationale","title":"Rationale","text":"<p>Go provides the best balance of performance, concurrency support, and deployment simplicity for our API server needs. The ability to handle many concurrent ML experiments efficiently with goroutines is a key advantage. 
The single binary deployment model also simplifies our containerization and distribution strategy.</p>"},{"location":"adr/ADR-002-use-sqlite-for-local-development/","title":"ADR-002: Use SQLite for Local Development","text":""},{"location":"adr/ADR-002-use-sqlite-for-local-development/#status","title":"Status","text":"<p>Accepted</p>"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#context","title":"Context","text":"<p>For local development and testing, we needed a database solution that: - Requires minimal setup and configuration - Works well with Go's database drivers - Supports the same SQL features as production databases - Allows easy reset and recreation of test data - Doesn't require external services running locally</p>"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#decision","title":"Decision","text":"<p>We chose SQLite as the default database for local development and testing environments.</p>"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#consequences","title":"Consequences","text":""},{"location":"adr/ADR-002-use-sqlite-for-local-development/#positive","title":"Positive","text":"<ul> <li>Zero configuration - database is just a file</li> <li>Fast performance for local development workloads</li> <li>Easy to reset by deleting the database file</li> <li>Excellent Go driver support (mattn/go-sqlite3)</li> <li>Supports most SQL features we need</li> <li>Portable across different development machines</li> <li>No external dependencies or services to manage</li> </ul>"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#negative","title":"Negative","text":"<ul> <li>Limited to a single writer at a time (file locking)</li> <li>Not suitable for production multi-user scenarios</li> <li>Some advanced SQL features may not be available</li> <li>Different behavior compared to PostgreSQL in production</li> </ul>"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#options-considered","title":"Options 
Considered","text":""},{"location":"adr/ADR-002-use-sqlite-for-local-development/#postgresql","title":"PostgreSQL","text":"<p>Pros: - Production-grade database - Excellent feature support - Good Go driver support - Consistent with production environment</p> <p>Cons: - Requires external service installation and configuration - Higher resource usage - More complex setup for new developers - Overkill for simple local development</p>"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#mysql","title":"MySQL","text":"<p>Pros: - Popular and well-supported - Good Go drivers available</p> <p>Cons: - Requires external service - More complex setup - Different SQL dialect than PostgreSQL</p>"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#in-memory-databases-redis-etc","title":"In-memory databases (Redis, etc.)","text":"<p>Pros: - Very fast - No persistence needed for some tests</p> <p>Cons: - Limited query capabilities - Not suitable for complex relational data - Different data model than production</p>"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#no-database-file-based-storage","title":"No database (file-based storage)","text":"<p>Pros: - Simple implementation - No dependencies</p> <p>Cons: - Limited query capabilities - No transaction support - Hard to scale to complex data needs</p>"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#rationale","title":"Rationale","text":"<p>SQLite provides the perfect balance of simplicity and functionality for local development. It requires zero setup - developers can just run the application and it works. The file-based nature makes it easy to reset test data by deleting the database file. While it differs from our production PostgreSQL database, it supports the same core SQL features needed for development and testing.</p> <p>The main limitation is single-writer access, but this is acceptable for local development where typically only one developer is working with the database at a time. 
For integration tests that need concurrent access, we can use PostgreSQL or Redis.</p>"},{"location":"adr/ADR-003-use-redis-for-job-queue/","title":"ADR-003: Use Redis for Job Queue","text":""},{"location":"adr/ADR-003-use-redis-for-job-queue/#status","title":"Status","text":"<p>Accepted</p>"},{"location":"adr/ADR-003-use-redis-for-job-queue/#context","title":"Context","text":"<p>For the ML experiment job queue system, we needed a solution that: - Provides reliable job queuing and distribution - Supports multiple workers consuming jobs concurrently - Offers persistence and durability - Handles job priorities and retries - Integrates well with our Go-based API server - Can scale horizontally with multiple workers</p>"},{"location":"adr/ADR-003-use-redis-for-job-queue/#decision","title":"Decision","text":"<p>We chose Redis as the job queue backend using its list data structures and pub/sub capabilities.</p>"},{"location":"adr/ADR-003-use-redis-for-job-queue/#consequences","title":"Consequences","text":""},{"location":"adr/ADR-003-use-redis-for-job-queue/#positive","title":"Positive","text":"<ul> <li>Excellent performance with sub-millisecond latency</li> <li>Built-in persistence options (AOF, RDB)</li> <li>Simple and reliable queue operations (LPUSH/RPOP)</li> <li>Good Go client library support</li> <li>Supports job priorities through multiple lists</li> <li>Easy to monitor and debug</li> <li>Can handle high throughput workloads</li> <li>Low memory overhead for queue operations</li> </ul>"},{"location":"adr/ADR-003-use-redis-for-job-queue/#negative","title":"Negative","text":"<ul> <li>Additional infrastructure component to manage</li> <li>Memory-based (requires sufficient RAM)</li> <li>Limited built-in job scheduling features</li> <li>No complex job dependency management</li> <li>Requires careful handling of connection failures</li> </ul>"},{"location":"adr/ADR-003-use-redis-for-job-queue/#options-considered","title":"Options 
Considered","text":""},{"location":"adr/ADR-003-use-redis-for-job-queue/#database-based-queuing-postgresql","title":"Database-based Queuing (PostgreSQL)","text":"<p>Pros: - No additional infrastructure - ACID transactions - Complex queries and joins possible - Integrated with primary database</p> <p>Cons: - Higher latency for queue operations - Database contention under high load - More complex implementation for reliable polling - Limited scalability for high-frequency operations</p>"},{"location":"adr/ADR-003-use-redis-for-job-queue/#rabbitmq","title":"RabbitMQ","text":"<p>Pros: - Purpose-built message broker - Advanced routing and filtering - Built-in acknowledgments and retries - Good clustering support</p> <p>Cons: - More complex setup and configuration - Higher resource requirements - Steeper learning curve - Overkill for simple queue needs</p>"},{"location":"adr/ADR-003-use-redis-for-job-queue/#apache-kafka","title":"Apache Kafka","text":"<p>Pros: - Extremely high throughput - Built-in partitioning and replication - Good for event streaming</p> <p>Cons: - Complex setup and operations - Designed for streaming, not job queuing - Higher latency for individual job processing - More resource intensive</p>"},{"location":"adr/ADR-003-use-redis-for-job-queue/#in-memory-queuing-go-channels","title":"In-memory Queuing (Go channels)","text":"<p>Pros: - Zero external dependencies - Very fast - Simple implementation</p> <p>Cons: - No persistence (jobs lost on restart) - Limited to single process - No monitoring or observability - Not suitable for distributed systems</p>"},{"location":"adr/ADR-003-use-redis-for-job-queue/#rationale","title":"Rationale","text":"<p>Redis provides the optimal balance of simplicity, performance, and reliability for our job queue needs. The list-based queue implementation (LPUSH/RPOP) is straightforward and highly performant. 
Redis's persistence options ensure jobs aren't lost during restarts, and the pub/sub capabilities enable real-time notifications for workers.</p> <p>The Go client library is excellent and provides connection pooling, automatic reconnection, and good error handling. Redis's low memory footprint and fast operations make it ideal for high-frequency job queuing scenarios common in ML workloads.</p> <p>While RabbitMQ offers more advanced features, Redis is sufficient for our current needs and much simpler to operate. The simple queue model also makes it easier to understand and debug when issues arise.</p>"}]}