{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Fetch ML - Secure Machine Learning Platform","text":"

A secure, containerized platform for running machine learning experiments with role-based access control and comprehensive audit trails.

"},{"location":"#quick-start","title":"Quick Start","text":"

New to the project? Start here!

# Clone the repository\ngit clone https://github.com/your-username/fetch_ml.git\ncd fetch_ml\n\n# Quick setup (builds everything, creates test user)\nmake quick-start\n\n# Create your API key\n./bin/user_manager --config configs/config_dev.yaml --cmd generate-key --username your_name --role data_scientist\n\n# Run your first experiment\n./bin/worker --config configs/config_dev.yaml --api-key YOUR_GENERATED_KEY\n
"},{"location":"#quick-navigation","title":"Quick Navigation","text":""},{"location":"#getting-started","title":"\ud83d\ude80 Getting Started","text":""},{"location":"#security-authentication","title":"\ud83d\udd12 Security & Authentication","text":""},{"location":"#configuration","title":"\u2699\ufe0f Configuration","text":""},{"location":"#development","title":"\ud83d\udee0\ufe0f Development","text":""},{"location":"#production-deployment","title":"\ud83c\udfed Production Deployment","text":""},{"location":"#features","title":"Features","text":""},{"location":"#available-commands","title":"Available Commands","text":"
# Core commands\nmake help                    # See all available commands\nmake build                   # Build all binaries\nmake test-unit              # Run tests\n\n# User management\n./bin/user_manager --config configs/config_dev.yaml --cmd generate-key --username new_user --role data_scientist\n./bin/user_manager --config configs/config_dev.yaml --cmd list-users\n\n# Run services\n./bin/worker --config configs/config_dev.yaml --api-key YOUR_KEY\n./bin/tui --config configs/config_dev.yaml\n./bin/data_manager --config configs/config_dev.yaml\n
"},{"location":"#need-help","title":"Need Help?","text":"

Happy ML experimenting!

"},{"location":"api-key-process/","title":"FetchML API Key Process","text":"

This document describes how API keys are issued and how team members should configure the ml CLI to use them.

The goal is to keep access easy for your homelab while treating API keys as sensitive secrets.

"},{"location":"api-key-process/#overview","title":"Overview","text":"

There are two supported ways to receive your key:

  1. Bitwarden (recommended) \u2013 for users who already use Bitwarden.
  2. Direct share (minimal tools) \u2013 for users who do not use Bitwarden.
"},{"location":"api-key-process/#1-bitwarden-based-process-recommended","title":"1. Bitwarden-based process (recommended)","text":""},{"location":"api-key-process/#for-the-admin","title":"For the admin","text":"
./scripts/create_bitwarden_fetchml_item.sh <username> <api_key> <api_key_hash>\n

This script:

"},{"location":"api-key-process/#for-the-user","title":"For the user","text":"
  1. Open Bitwarden and locate the item named FetchML API \u2013 <your-name>.

  2. Copy the password field (this is your FetchML API key).

  3. Configure the CLI, e.g. in ~/.ml/config.toml:

api_key     = \"<paste-from-bitwarden>\"\nworker_host = \"localhost\"\nworker_port = 9100\napi_url     = \"ws://localhost:9100/ws\"\n
  4. Test your setup:
ml status\n

If the command works, your key and tunnel/config are correct.

"},{"location":"api-key-process/#2-direct-share-no-password-manager-required","title":"2. Direct share (no password manager required)","text":"

For users who do not use Bitwarden, a lightweight alternative is a direct one-to-one share.

"},{"location":"api-key-process/#for-the-admin_1","title":"For the admin","text":"
  1. Generate a per-user API key and hash as usual.
  2. Store them securely on your side (for example, in your own Bitwarden vault or configuration files).
  3. Share only the API key with the user via a direct channel you both trust, such as a Signal / WhatsApp direct message, an SMS, or a short call/meeting where you read it to them.
  4. Ask the user to paste the key into their local config and, if possible, avoid keeping the key in plain chat history.
"},{"location":"api-key-process/#for-the-user_1","title":"For the user","text":"
  1. When you receive your FetchML API key from the admin, create or edit ~/.ml/config.toml:
api_key     = \"<your-api-key>\"\nworker_host = \"localhost\"\nworker_port = 9100\napi_url     = \"ws://localhost:9100/ws\"\n
  2. Save the file and run:
ml status\n
  3. If it works, you are ready to use the CLI:
ml queue my-training-job\nml cancel my-training-job\n
"},{"location":"api-key-process/#3-security-notes","title":"3. Security notes","text":"

Following these steps keeps API access easy for the team while maintaining a reasonable security posture for a personal homelab deployment.

"},{"location":"architecture/","title":"Homelab Architecture","text":"

Simple, secure architecture for ML experiments in your homelab.

"},{"location":"architecture/#components-overview","title":"Components Overview","text":"
graph TB\n    subgraph \"Homelab Stack\"\n        CLI[Zig CLI]\n        API[HTTPS API]\n        REDIS[Redis Cache]\n        FS[Local Storage]\n    end\n\n    CLI --> API\n    API --> REDIS\n    API --> FS\n
"},{"location":"architecture/#core-services","title":"Core Services","text":""},{"location":"architecture/#api-server","title":"API Server","text":""},{"location":"architecture/#redis","title":"Redis","text":""},{"location":"architecture/#zig-cli","title":"Zig CLI","text":""},{"location":"architecture/#security-architecture","title":"Security Architecture","text":"
graph LR\n    USER[User] --> AUTH[API Key Auth]\n    AUTH --> RATE[Rate Limiting]\n    RATE --> WHITELIST[IP Whitelist]\n    WHITELIST --> API[Secure API]\n    API --> AUDIT[Audit Logging]\n
"},{"location":"architecture/#security-layers","title":"Security Layers","text":"
  1. API Key Authentication - Hashed keys with roles
  2. Rate Limiting - 30 requests/minute
  3. IP Whitelisting - Local networks only
  4. Fail2Ban - Automatic IP blocking
  5. HTTPS/TLS - Encrypted communication
  6. Audit Logging - Complete action tracking
"},{"location":"architecture/#data-flow","title":"Data Flow","text":"
sequenceDiagram\n    participant CLI\n    participant API\n    participant Redis\n    participant Storage\n\n    CLI->>API: HTTPS Request\n    API->>API: Validate Auth\n    API->>Redis: Cache/Queue\n    API->>Storage: Experiment Data\n    Storage->>API: Results\n    API->>CLI: Response\n
"},{"location":"architecture/#deployment-options","title":"Deployment Options","text":""},{"location":"architecture/#docker-compose-recommended","title":"Docker Compose (Recommended)","text":"
services:\n  redis:\n    image: redis:7-alpine\n    ports: [\"6379:6379\"]\n    volumes: [redis_data:/data]\n\n  api-server:\n    build: .\n    ports: [\"9101:9101\"]\n    depends_on: [redis]\n
"},{"location":"architecture/#local-setup","title":"Local Setup","text":"
./setup.sh && ./manage.sh start\n
"},{"location":"architecture/#network-architecture","title":"Network Architecture","text":""},{"location":"architecture/#storage-architecture","title":"Storage Architecture","text":"
data/\n\u251c\u2500\u2500 experiments/     # ML experiment results\n\u251c\u2500\u2500 cache/          # Temporary cache files\n\u2514\u2500\u2500 backups/        # Local backups\n\nlogs/\n\u251c\u2500\u2500 app.log         # Application logs\n\u251c\u2500\u2500 audit.log       # Security events\n\u2514\u2500\u2500 access.log      # API access logs\n
"},{"location":"architecture/#monitoring-architecture","title":"Monitoring Architecture","text":"

Simple, lightweight monitoring: - Health Checks: Service availability - Log Files: Structured logging - Basic Metrics: Request counts, error rates - Security Events: Failed auth, rate limits

"},{"location":"architecture/#homelab-benefits","title":"Homelab Benefits","text":""},{"location":"architecture/#high-level-architecture","title":"High-Level Architecture","text":"
graph TB\n    subgraph \"Client Layer\"\n        CLI[CLI Tools]\n        TUI[Terminal UI]\n        API[REST API]\n    end\n\n    subgraph \"Authentication Layer\"\n        Auth[Authentication Service]\n        RBAC[Role-Based Access Control]\n        Perm[Permission Manager]\n    end\n\n    subgraph \"Core Services\"\n        Worker[ML Worker Service]\n        DataMgr[Data Manager Service]\n        Queue[Job Queue]\n    end\n\n    subgraph \"Storage Layer\"\n        Redis[(Redis Cache)]\n        DB[(SQLite/PostgreSQL)]\n        Files[File Storage]\n    end\n\n    subgraph \"Container Runtime\"\n        Podman[Podman/Docker]\n        Containers[ML Containers]\n    end\n\n    CLI --> Auth\n    TUI --> Auth\n    API --> Auth\n\n    Auth --> RBAC\n    RBAC --> Perm\n\n    Worker --> Queue\n    Worker --> DataMgr\n    Worker --> Podman\n\n    DataMgr --> DB\n    DataMgr --> Files\n\n    Queue --> Redis\n\n    Podman --> Containers\n
"},{"location":"architecture/#zig-cli-architecture","title":"Zig CLI Architecture","text":""},{"location":"architecture/#component-structure","title":"Component Structure","text":"
graph TB\n    subgraph \"Zig CLI Components\"\n        Main[main.zig] --> Commands[commands/]\n        Commands --> Config[config.zig]\n        Commands --> Utils[utils/]\n        Commands --> Net[net/]\n        Commands --> Errors[errors.zig]\n\n        subgraph \"Commands\"\n            Init[init.zig]\n            Sync[sync.zig]\n            Queue[queue.zig]\n            Watch[watch.zig]\n            Status[status.zig]\n            Monitor[monitor.zig]\n            Cancel[cancel.zig]\n            Prune[prune.zig]\n        end\n\n        subgraph \"Utils\"\n            Crypto[crypto.zig]\n            Storage[storage.zig]\n            Rsync[rsync.zig]\n        end\n\n        subgraph \"Network\"\n            WS[ws.zig]\n        end\n    end\n
"},{"location":"architecture/#performance-optimizations","title":"Performance Optimizations","text":""},{"location":"architecture/#content-addressed-storage","title":"Content-Addressed Storage","text":""},{"location":"architecture/#memory-management","title":"Memory Management","text":""},{"location":"architecture/#network-communication","title":"Network Communication","text":""},{"location":"architecture/#security-implementation","title":"Security Implementation","text":"
graph LR\n    subgraph \"CLI Security\"\n        Config[Config File] --> Hash[SHA256 Hashing]\n        Hash --> Auth[API Authentication]\n        Auth --> SSH[SSH Transfer]\n        SSH --> WS[WebSocket Security]\n    end\n
"},{"location":"architecture/#core-components","title":"Core Components","text":""},{"location":"architecture/#1-authentication-authorization","title":"1. Authentication & Authorization","text":"
graph LR\n    subgraph \"Auth Flow\"\n        Client[Client] --> APIKey[API Key]\n        APIKey --> Hash[Hash Validation]\n        Hash --> Roles[Role Resolution]\n        Roles --> Perms[Permission Check]\n        Perms --> Access[Grant/Deny Access]\n    end\n\n    subgraph \"Permission Sources\"\n        YAML[YAML Config]\n        Inline[Inline Fallback]\n        Roles --> YAML\n        Roles --> Inline\n    end\n

Features: - API key-based authentication - Role-based access control (RBAC) - YAML-based permission configuration - Fallback to inline permissions - Admin wildcard permissions
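The key-hash, role-resolution, and permission-check steps can be sketched in Go. This is an illustrative sketch under assumed names and made-up data, not the repository's actual auth package:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Stored state: key hashes map to users, users to roles, roles to permissions.
// The admin role uses a wildcard, matching the "admin wildcard permissions" feature.
var keyHashToUser = map[string]string{}
var userRole = map[string]string{"alice": "data_scientist"}
var rolePerms = map[string][]string{
	"data_scientist": {"jobs:submit", "jobs:read"},
	"admin":          {"*"}, // wildcard: admin may do anything
}

// hashKey hashes an API key so only the hash, never the key, is stored.
func hashKey(key string) string {
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}

// authorize resolves an API key to a user, then checks one permission
// against the user's role.
func authorize(apiKey, perm string) bool {
	user, ok := keyHashToUser[hashKey(apiKey)]
	if !ok {
		return false
	}
	for _, p := range rolePerms[userRole[user]] {
		if p == "*" || p == perm {
			return true
		}
	}
	return false
}

func main() {
	keyHashToUser[hashKey("alice-secret-key")] = "alice" // normally done at key generation
	fmt.Println(authorize("alice-secret-key", "jobs:submit")) // true
	fmt.Println(authorize("wrong-key", "jobs:submit"))        // false
}
```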

"},{"location":"architecture/#2-worker-service","title":"2. Worker Service","text":"
graph TB\n    subgraph \"Worker Architecture\"\n        API[HTTP API] --> Router[Request Router]\n        Router --> Auth[Auth Middleware]\n        Auth --> Queue[Job Queue]\n        Queue --> Processor[Job Processor]\n        Processor --> Runtime[Container Runtime]\n        Runtime --> Storage[Result Storage]\n\n        subgraph \"Job Lifecycle\"\n            Submit[Submit Job] --> Queue\n            Queue --> Execute[Execute]\n            Execute --> Monitor[Monitor]\n            Monitor --> Complete[Complete]\n            Complete --> Store[Store Results]\n        end\n    end\n

Responsibilities: - HTTP API for job submission - Job queue management - Container orchestration - Result collection and storage - Metrics and monitoring

"},{"location":"architecture/#3-data-manager-service","title":"3. Data Manager Service","text":"
graph TB\n    subgraph \"Data Management\"\n        API[Data API] --> Storage[Storage Layer]\n        Storage --> Metadata[Metadata DB]\n        Storage --> Files[File System]\n        Storage --> Cache[Redis Cache]\n\n        subgraph \"Data Operations\"\n            Upload[Upload Data] --> Validate[Validate]\n            Validate --> Store[Store]\n            Store --> Index[Index]\n            Index --> Catalog[Catalog]\n        end\n    end\n

Features: - Data upload and validation - Metadata management - File system abstraction - Caching layer - Data catalog

"},{"location":"architecture/#4-terminal-ui-tui","title":"4. Terminal UI (TUI)","text":"
graph TB\n    subgraph \"TUI Architecture\"\n        UI[UI Components] --> Model[Data Model]\n        Model --> Update[Update Loop]\n        Update --> Render[Render]\n\n        subgraph \"UI Panels\"\n            Jobs[Job List]\n            Details[Job Details]\n            Logs[Log Viewer]\n            Status[Status Bar]\n        end\n\n        UI --> Jobs\n        UI --> Details\n        UI --> Logs\n        UI --> Status\n    end\n

Components: - Bubble Tea framework - Component-based architecture - Real-time updates - Keyboard navigation - Theme support

"},{"location":"architecture/#data-flow_1","title":"Data Flow","text":""},{"location":"architecture/#job-execution-flow","title":"Job Execution Flow","text":"
sequenceDiagram\n    participant Client\n    participant Auth\n    participant Worker\n    participant Queue\n    participant Container\n    participant Storage\n\n    Client->>Auth: Submit job with API key\n    Auth->>Client: Validate and return job ID\n\n    Client->>Worker: Execute job request\n    Worker->>Queue: Queue job\n    Queue->>Worker: Job ready\n    Worker->>Container: Start ML container\n    Container->>Worker: Execute experiment\n    Worker->>Storage: Store results\n    Worker->>Client: Return results\n
"},{"location":"architecture/#authentication-flow","title":"Authentication Flow","text":"
sequenceDiagram\n    participant Client\n    participant Auth\n    participant PermMgr\n    participant Config\n\n    Client->>Auth: Request with API key\n    Auth->>Auth: Validate key hash\n    Auth->>PermMgr: Get user permissions\n    PermMgr->>Config: Load YAML permissions\n    Config->>PermMgr: Return permissions\n    PermMgr->>Auth: Return resolved permissions\n    Auth->>Client: Grant/deny access\n
"},{"location":"architecture/#security-architecture_1","title":"Security Architecture","text":""},{"location":"architecture/#defense-in-depth","title":"Defense in Depth","text":"
graph TB\n    subgraph \"Security Layers\"\n        Network[Network Security]\n        Auth[Authentication]\n        AuthZ[Authorization]\n        Container[Container Security]\n        Data[Data Protection]\n        Audit[Audit Logging]\n    end\n\n    Network --> Auth\n    Auth --> AuthZ\n    AuthZ --> Container\n    Container --> Data\n    Data --> Audit\n

Security Features: - API key authentication - Role-based permissions - Container isolation - File system sandboxing - Comprehensive audit logs - Input validation and sanitization

"},{"location":"architecture/#container-security","title":"Container Security","text":"
graph TB\n    subgraph \"Container Isolation\"\n        Host[Host System]\n        Podman[Podman Runtime]\n        Network[Network Isolation]\n        FS[File System Isolation]\n        User[User Namespaces]\n        ML[ML Container]\n\n        Host --> Podman\n        Podman --> Network\n        Podman --> FS\n        Podman --> User\n        User --> ML\n    end\n

Isolation Features: - Rootless containers - Network isolation - File system sandboxing - User namespace mapping - Resource limits

"},{"location":"architecture/#configuration-architecture","title":"Configuration Architecture","text":""},{"location":"architecture/#configuration-hierarchy","title":"Configuration Hierarchy","text":"
graph TB\n    subgraph \"Config Sources\"\n        Env[Environment Variables]\n        File[Config Files]\n        CLI[CLI Flags]\n        Defaults[Default Values]\n    end\n\n    subgraph \"Config Processing\"\n        Merge[Config Merger]\n        Validate[Schema Validator]\n        Apply[Config Applier]\n    end\n\n    Env --> Merge\n    File --> Merge\n    CLI --> Merge\n    Defaults --> Merge\n\n    Merge --> Validate\n    Validate --> Apply\n

Configuration Priority: 1. CLI flags (highest) 2. Environment variables 3. Configuration files 4. Default values (lowest)
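The precedence above amounts to a last-writer-wins merge applied from lowest to highest priority. A minimal Go sketch (illustrative names and keys, not the config package's real API):

```go
package main

import "fmt"

// mergeConfig applies the documented precedence: layers are passed
// lowest-priority first (defaults, file, env, flags), so later layers win.
func mergeConfig(layers ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, layer := range layers {
		for k, v := range layer {
			out[k] = v
		}
	}
	return out
}

func main() {
	defaults := map[string]string{"port": "9100", "log_level": "info"}
	file := map[string]string{"port": "9101"}
	env := map[string]string{"log_level": "debug"}
	flags := map[string]string{"port": "9200"}

	cfg := mergeConfig(defaults, file, env, flags)
	fmt.Println(cfg["port"], cfg["log_level"]) // 9200 debug: flags beat file, env beats defaults
}
```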

"},{"location":"architecture/#scalability-architecture","title":"Scalability Architecture","text":""},{"location":"architecture/#horizontal-scaling","title":"Horizontal Scaling","text":"
graph TB\n    subgraph \"Scaled Architecture\"\n        LB[Load Balancer]\n        W1[Worker 1]\n        W2[Worker 2]\n        W3[Worker N]\n        Redis[Redis Cluster]\n        Storage[Shared Storage]\n\n        LB --> W1\n        LB --> W2\n        LB --> W3\n\n        W1 --> Redis\n        W2 --> Redis\n        W3 --> Redis\n\n        W1 --> Storage\n        W2 --> Storage\n        W3 --> Storage\n    end\n

Scaling Features: - Stateless worker services - Shared job queue (Redis) - Distributed storage - Load balancer ready - Health checks and monitoring

"},{"location":"architecture/#technology-stack","title":"Technology Stack","text":""},{"location":"architecture/#backend-technologies","title":"Backend Technologies","text":"Component Technology Purpose Language Go 1.25+ Core application Web Framework Standard library HTTP server Authentication Custom API key + RBAC Database SQLite/PostgreSQL Metadata storage Cache Redis Job queue & caching Containers Podman/Docker Job isolation UI Framework Bubble Tea Terminal UI"},{"location":"architecture/#dependencies","title":"Dependencies","text":"
// Core dependencies\nrequire (\n    github.com/charmbracelet/bubbletea v1.3.10  // TUI framework\n    github.com/go-redis/redis/v8 v8.11.5        // Redis client\n    github.com/google/uuid v1.6.0               // UUID generation\n    github.com/mattn/go-sqlite3 v1.14.32        // SQLite driver\n    golang.org/x/crypto v0.45.0                 // Crypto utilities\n    gopkg.in/yaml.v3 v3.0.1                     // YAML parsing\n)\n
"},{"location":"architecture/#development-architecture","title":"Development Architecture","text":""},{"location":"architecture/#project-structure","title":"Project Structure","text":"
fetch_ml/\n\u251c\u2500\u2500 cmd/                    # CLI applications\n\u2502   \u251c\u2500\u2500 worker/            # ML worker service\n\u2502   \u251c\u2500\u2500 tui/               # Terminal UI\n\u2502   \u251c\u2500\u2500 data_manager/      # Data management\n\u2502   \u2514\u2500\u2500 user_manager/      # User management\n\u251c\u2500\u2500 internal/              # Internal packages\n\u2502   \u251c\u2500\u2500 auth/              # Authentication system\n\u2502   \u251c\u2500\u2500 config/            # Configuration management\n\u2502   \u251c\u2500\u2500 container/         # Container operations\n\u2502   \u251c\u2500\u2500 database/          # Database operations\n\u2502   \u251c\u2500\u2500 logging/           # Logging utilities\n\u2502   \u251c\u2500\u2500 metrics/           # Metrics collection\n\u2502   \u2514\u2500\u2500 network/           # Network utilities\n\u251c\u2500\u2500 configs/               # Configuration files\n\u251c\u2500\u2500 scripts/               # Setup and utility scripts\n\u251c\u2500\u2500 tests/                 # Test suites\n\u2514\u2500\u2500 docs/                  # Documentation\n
"},{"location":"architecture/#package-dependencies","title":"Package Dependencies","text":"
graph TB\n    subgraph \"Application Layer\"\n        Worker[cmd/worker]\n        TUI[cmd/tui]\n        DataMgr[cmd/data_manager]\n        UserMgr[cmd/user_manager]\n    end\n\n    subgraph \"Service Layer\"\n        Auth[internal/auth]\n        Config[internal/config]\n        Container[internal/container]\n        Database[internal/database]\n    end\n\n    subgraph \"Utility Layer\"\n        Logging[internal/logging]\n        Metrics[internal/metrics]\n        Network[internal/network]\n    end\n\n    Worker --> Auth\n    Worker --> Config\n    Worker --> Container\n    TUI --> Auth\n    DataMgr --> Database\n    UserMgr --> Auth\n\n    Auth --> Logging\n    Container --> Network\n    Database --> Metrics\n
"},{"location":"architecture/#monitoring-observability","title":"Monitoring & Observability","text":""},{"location":"architecture/#metrics-collection","title":"Metrics Collection","text":"
graph TB\n    subgraph \"Metrics Pipeline\"\n        App[Application] --> Metrics[Metrics Collector]\n        Metrics --> Export[Prometheus Exporter]\n        Export --> Prometheus[Prometheus Server]\n        Prometheus --> Grafana[Grafana Dashboard]\n\n        subgraph \"Metric Types\"\n            Counter[Counters]\n            Gauge[Gauges]\n            Histogram[Histograms]\n            Timer[Timers]\n        end\n\n        App --> Counter\n        App --> Gauge\n        App --> Histogram\n        App --> Timer\n    end\n
"},{"location":"architecture/#logging-architecture","title":"Logging Architecture","text":"
graph TB\n    subgraph \"Logging Pipeline\"\n        App[Application] --> Logger[Structured Logger]\n        Logger --> File[File Output]\n        Logger --> Console[Console Output]\n        Logger --> Syslog[Syslog Forwarder]\n        Syslog --> Aggregator[Log Aggregator]\n        Aggregator --> Storage[Log Storage]\n        Storage --> Viewer[Log Viewer]\n    end\n
"},{"location":"architecture/#deployment-architecture","title":"Deployment Architecture","text":""},{"location":"architecture/#container-deployment","title":"Container Deployment","text":"
graph TB\n    subgraph \"Deployment Stack\"\n        Image[Container Image]\n        Registry[Container Registry]\n        Orchestrator[Docker Compose]\n        Config[ConfigMaps/Secrets]\n        Storage[Persistent Storage]\n\n        Image --> Registry\n        Registry --> Orchestrator\n        Config --> Orchestrator\n        Storage --> Orchestrator\n    end\n
"},{"location":"architecture/#service-discovery","title":"Service Discovery","text":"
graph TB\n    subgraph \"Service Mesh\"\n        Gateway[API Gateway]\n        Discovery[Service Discovery]\n        Worker[Worker Service]\n        Data[Data Service]\n        Redis[Redis Cluster]\n\n        Gateway --> Discovery\n        Discovery --> Worker\n        Discovery --> Data\n        Discovery --> Redis\n    end\n
"},{"location":"architecture/#future-architecture-considerations","title":"Future Architecture Considerations","text":""},{"location":"architecture/#microservices-evolution","title":"Microservices Evolution","text":""},{"location":"architecture/#homelab-features","title":"Homelab Features","text":"

This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.

"},{"location":"cicd/","title":"CI/CD Pipeline","text":"

Automated testing, building, and releasing for fetch_ml.

"},{"location":"cicd/#workflows","title":"Workflows","text":""},{"location":"cicd/#ci-workflow-githubworkflowsciyml","title":"CI Workflow (.github/workflows/ci.yml)","text":"

Runs on every push to main/develop and all pull requests.

Jobs: 1. test - Go backend tests with Redis 2. build - Build all binaries (Go + Zig CLI) 3. test-scripts - Validate deployment scripts 4. security-scan - Trivy and Gosec security scans 5. docker-build - Build and push Docker images (main branch only)

Test Coverage: - Go unit tests with race detection - internal/queue package tests - Zig CLI tests - Integration tests - Security audits

"},{"location":"cicd/#release-workflow-githubworkflowsreleaseyml","title":"Release Workflow (.github/workflows/release.yml)","text":"

Runs on version tags (e.g., v1.0.0).

Jobs:

  1. build-cli (matrix build)
     - Linux x86_64 (static musl)
     - macOS x86_64
     - macOS ARM64
     - Downloads platform-specific static rsync
     - Embeds rsync for zero-dependency releases

  2. build-go-backends
     - Cross-platform Go builds
     - api-server, worker, tui, data_manager, user_manager

  3. create-release
     - Collects all artifacts
     - Generates SHA256 checksums
     - Creates GitHub release with notes
"},{"location":"cicd/#release-process","title":"Release Process","text":""},{"location":"cicd/#creating-a-release","title":"Creating a Release","text":"
# 1. Update version\ngit tag v1.0.0\n\n# 2. Push tag\ngit push origin v1.0.0\n\n# 3. CI automatically builds and releases\n
"},{"location":"cicd/#release-artifacts","title":"Release Artifacts","text":"

CLI Binaries (with embedded rsync): - ml-linux-x86_64.tar.gz (~450-650KB) - ml-macos-x86_64.tar.gz (~450-650KB) - ml-macos-arm64.tar.gz (~450-650KB)

Go Backends: - fetch_ml_api-server.tar.gz - fetch_ml_worker.tar.gz - fetch_ml_tui.tar.gz - fetch_ml_data_manager.tar.gz - fetch_ml_user_manager.tar.gz

Checksums: - checksums.txt - Combined SHA256 sums - Individual .sha256 files per binary

"},{"location":"cicd/#development-workflow","title":"Development Workflow","text":""},{"location":"cicd/#local-testing","title":"Local Testing","text":"
# Run all tests\nmake test\n\n# Run specific package tests\ngo test ./internal/queue/...\n\n# Build CLI\ncd cli && zig build dev\n\n# Run formatters and linters\nmake lint\n\n# Security scans are handled automatically in CI by the `security-scan` job\n
"},{"location":"cicd/#optional-heavy-end-to-end-tests","title":"Optional heavy end-to-end tests","text":"

Some e2e tests exercise full Docker deployments and performance scenarios and are skipped by default to keep local/CI runs fast. You can enable them explicitly with environment variables:

# Run Docker deployment e2e tests\nFETCH_ML_E2E_DOCKER=1 go test ./tests/e2e/...\n\n# Run performance-oriented e2e tests\nFETCH_ML_E2E_PERF=1 go test ./tests/e2e/...\n

Without these variables, TestDockerDeploymentE2E and TestPerformanceE2E skip themselves via t.Skip, while all lighter e2e tests still run.

"},{"location":"cicd/#pull-request-checks","title":"Pull Request Checks","text":"

All PRs must pass: - \u2705 Go tests (with Redis) - \u2705 CLI tests - \u2705 Security scans - \u2705 Code linting - \u2705 Build verification

"},{"location":"cicd/#configuration","title":"Configuration","text":""},{"location":"cicd/#environment-variables","title":"Environment Variables","text":"
GO_VERSION: '1.25.0'\nZIG_VERSION: '0.15.2'\n
"},{"location":"cicd/#secrets","title":"Secrets","text":"

Required for releases: - GITHUB_TOKEN - Automatic, provided by GitHub Actions

"},{"location":"cicd/#monitoring","title":"Monitoring","text":""},{"location":"cicd/#build-status","title":"Build Status","text":"

Check workflow runs at:

https://github.com/jfraeys/fetch_ml/actions\n

"},{"location":"cicd/#artifacts","title":"Artifacts","text":"

Download build artifacts from: - Successful workflow runs (30-day retention) - GitHub Releases (permanent)

For implementation details: - .github/workflows/ci.yml - .github/workflows/release.yml

"},{"location":"cli-reference/","title":"Fetch ML CLI Reference","text":"

Comprehensive command-line tools for managing ML experiments in your homelab, centered on a high-performance Zig-based CLI.

"},{"location":"cli-reference/#overview","title":"Overview","text":"

Fetch ML provides a comprehensive CLI toolkit built with performance and security in mind:

"},{"location":"cli-reference/#zig-cli-clizig-outbinml","title":"Zig CLI (./cli/zig-out/bin/ml)","text":"

High-performance command-line interface for experiment management, written in Zig for speed and efficiency.

"},{"location":"cli-reference/#available-commands","title":"Available Commands","text":"Command Description Example init Interactive configuration setup ml init sync Sync project to worker with deduplication ml sync ./project --name myjob --queue queue Queue job for execution ml queue myjob --commit abc123 --priority 8 status Get system and worker status ml status monitor Launch TUI monitoring via SSH ml monitor cancel Cancel running job ml cancel job123 prune Clean up old experiments ml prune --keep 10 watch Auto-sync directory on changes ml watch ./project --queue"},{"location":"cli-reference/#command-details","title":"Command Details","text":""},{"location":"cli-reference/#init-configuration-setup","title":"init - Configuration Setup","text":"

ml init\n
Creates a configuration template at ~/.ml/config.toml with: - Worker connection details - API authentication - Base paths and ports

"},{"location":"cli-reference/#sync-project-synchronization","title":"sync - Project Synchronization","text":"
# Basic sync\nml sync ./my-project\n\n# Sync with custom name and queue\nml sync ./my-project --name \"experiment-1\" --queue\n\n# Sync with priority\nml sync ./my-project --priority 9\n

Features: - Content-addressed storage for deduplication - SHA256 commit ID generation - Rsync-based file transfer - Automatic queuing (with --queue flag)
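Content-addressed deduplication boils down to hashing file contents, deriving a deterministic project-level commit ID, and skipping re-uploads when the ID is unchanged. A hedged Go sketch of the idea (the Zig CLI's exact encoding may differ; names are illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// commitID hashes each file's bytes, then hashes the sorted
// (path, file-hash) pairs to get a deterministic project ID.
// Identical content always yields the same ID, which is what
// makes deduplication possible.
func commitID(files map[string][]byte) string {
	paths := make([]string, 0, len(files))
	for p := range files {
		paths = append(paths, p)
	}
	sort.Strings(paths) // deterministic order regardless of map iteration

	h := sha256.New()
	for _, p := range paths {
		fileSum := sha256.Sum256(files[p])
		fmt.Fprintf(h, "%s %s\n", p, hex.EncodeToString(fileSum[:]))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := commitID(map[string][]byte{"train.py": []byte("print('hi')")})
	b := commitID(map[string][]byte{"train.py": []byte("print('hi')")})
	fmt.Println(a == b) // true: same content, same commit ID, nothing to re-sync
}
```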

"},{"location":"cli-reference/#queue-job-management","title":"queue - Job Management","text":"
# Queue with commit ID\nml queue my-job --commit abc123def456\n\n# Queue with priority (1-10, default 5)\nml queue my-job --commit abc123 --priority 8\n

Features: - WebSocket-based communication - Priority queuing system - API key authentication

"},{"location":"cli-reference/#watch-auto-sync-monitoring","title":"watch - Auto-Sync Monitoring","text":"
# Watch directory for changes\nml watch ./project\n\n# Watch and auto-queue on changes\nml watch ./project --name \"dev-exp\" --queue\n

Features: - Real-time file system monitoring - Automatic re-sync on changes - Configurable polling interval (2 seconds) - Commit ID comparison for efficiency
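The watch behavior reduces to a poll-compare-resync loop: recompute the commit ID on each tick and re-sync only when it differs from the last one. An illustrative Go sketch (snapshot() stands in for hashing the watched directory; both names are assumptions):

```go
package main

import (
	"fmt"
	"time"
)

// watchLoop polls `ticks` times at `interval`, calling snapshot() to get
// the current commit ID and counting how many re-syncs would be triggered.
func watchLoop(snapshot func() string, ticks int, interval time.Duration) int {
	syncs := 0
	last := ""
	for i := 0; i < ticks; i++ {
		if id := snapshot(); id != last { // commit ID changed: re-sync
			syncs++
			last = id
		}
		time.Sleep(interval)
	}
	return syncs
}

func main() {
	// Simulate five polls over a directory whose content changes twice.
	ids := []string{"aaa", "aaa", "bbb", "bbb", "ccc"}
	i := 0
	snap := func() string { s := ids[i]; i++; return s }
	fmt.Println(watchLoop(snap, len(ids), 0)) // 3: initial sync plus two changes
}
```

Comparing commit IDs before transferring anything is what keeps the 2-second polling cheap: unchanged content never triggers a re-sync.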

"},{"location":"cli-reference/#prune-cleanup-management","title":"prune - Cleanup Management","text":"
# Keep last N experiments\nml prune --keep 20\n\n# Remove experiments older than N days\nml prune --older-than 30\n
"},{"location":"cli-reference/#monitor-remote-monitoring","title":"monitor - Remote Monitoring","text":"

ml monitor\n
Launches the TUI interface over SSH for real-time monitoring.

"},{"location":"cli-reference/#cancel-job-cancellation","title":"cancel - Job Cancellation","text":"

ml cancel running-job-id\n
Cancels currently running jobs by ID.

"},{"location":"cli-reference/#configuration","title":"Configuration","text":"

The Zig CLI reads configuration from ~/.ml/config.toml:

worker_host = \"worker.local\"\nworker_user = \"mluser\"\nworker_base = \"/data/ml-experiments\"\nworker_port = 22\napi_key = \"your-api-key\"\n
"},{"location":"cli-reference/#performance-features","title":"Performance Features","text":""},{"location":"cli-reference/#go-commands","title":"Go Commands","text":""},{"location":"cli-reference/#api-server-cmdapi-servermaingo","title":"API Server (./cmd/api-server/main.go)","text":"

Main HTTPS API server for experiment management.

# Build and run\ngo run ./cmd/api-server/main.go\n\n# With configuration\n./bin/api-server --config configs/config-local.yaml\n

Features: - HTTPS-only communication - API key authentication - Rate limiting and IP whitelisting - WebSocket support for real-time updates - Redis integration for caching

"},{"location":"cli-reference/#tui-cmdtuimaingo","title":"TUI (./cmd/tui/main.go)","text":"

Terminal User Interface for monitoring experiments.

# Launch TUI\ngo run ./cmd/tui/main.go\n\n# With custom config\n./tui --config configs/config-local.yaml\n

Features: - Real-time experiment monitoring - Interactive job management - Status visualization - Log viewing

"},{"location":"cli-reference/#data-manager-cmddata_manager","title":"Data Manager (./cmd/data_manager/)","text":"

Utilities for data synchronization and management.

# Sync data\n./data_manager --sync ./data\n\n# Clean old data\n./data_manager --cleanup --older-than 30d\n
"},{"location":"cli-reference/#config-lint-cmdconfiglintmaingo","title":"Config Lint (./cmd/configlint/main.go)","text":"

Configuration validation and linting tool.

# Validate configuration\n./configlint configs/config-local.yaml\n\n# Check schema compliance\n./configlint --schema configs/schema/config_schema.yaml\n
"},{"location":"cli-reference/#management-script-toolsmanagesh","title":"Management Script (./tools/manage.sh)","text":"

Simple service management for your homelab.

"},{"location":"cli-reference/#commands","title":"Commands","text":"
./tools/manage.sh start          # Start all services\n./tools/manage.sh stop           # Stop all services\n./tools/manage.sh status         # Check service status\n./tools/manage.sh logs           # View logs\n./tools/manage.sh monitor        # Basic monitoring\n./tools/manage.sh security       # Security status\n./tools/manage.sh cleanup        # Clean project artifacts\n
"},{"location":"cli-reference/#setup-script-setupsh","title":"Setup Script (./setup.sh)","text":"

One-command homelab setup.

"},{"location":"cli-reference/#usage","title":"Usage","text":"
# Full setup\n./setup.sh\n\n# Setup includes:\n# - SSL certificate generation\n# - Configuration creation\n# - Build all components\n# - Start Redis\n# - Setup Fail2Ban (if available)\n
"},{"location":"cli-reference/#api-testing","title":"API Testing","text":"

Test the API with curl:

# Health check\ncurl -k -H 'X-API-Key: password' https://localhost:9101/health\n\n# List experiments\ncurl -k -H 'X-API-Key: password' https://localhost:9101/experiments\n\n# Submit experiment\ncurl -k -X POST -H 'X-API-Key: password' \\\n     -H 'Content-Type: application/json' \\\n     -d '{\"name\":\"test\",\"config\":{\"type\":\"basic\"}}' \\\n     https://localhost:9101/experiments\n
"},{"location":"cli-reference/#zig-cli-architecture","title":"Zig CLI Architecture","text":"

The Zig CLI is designed for performance and reliability:

"},{"location":"cli-reference/#core-components","title":"Core Components","text":""},{"location":"cli-reference/#performance-optimizations","title":"Performance Optimizations","text":""},{"location":"cli-reference/#security-features","title":"Security Features","text":""},{"location":"cli-reference/#configuration_1","title":"Configuration","text":"

Main configuration file: configs/config-local.yaml

"},{"location":"cli-reference/#key-settings","title":"Key Settings","text":"
auth:\n  enabled: true\n  api_keys:\n    homelab_user:\n      hash: \"5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8\"\n      admin: true\n\nserver:\n  address: \":9101\"\n  tls:\n    enabled: true\n    cert_file: \"./ssl/cert.pem\"\n    key_file: \"./ssl/key.pem\"\n\nsecurity:\n  rate_limit:\n    enabled: true\n    requests_per_minute: 30\n  ip_whitelist:\n    - \"127.0.0.1\"\n    - \"::1\"\n    - \"192.168.0.0/16\"\n    - \"10.0.0.0/8\"\n
"},{"location":"cli-reference/#docker-commands","title":"Docker Commands","text":"

If using Docker Compose:

# Start services (Docker Compose is for testing only)\ndocker-compose up -d\n\n# View logs\ndocker-compose logs -f\n\n# Stop services\ndocker-compose down\n\n# Check status\ndocker-compose ps\n
"},{"location":"cli-reference/#troubleshooting","title":"Troubleshooting","text":""},{"location":"cli-reference/#common-issues","title":"Common Issues","text":"

Zig CLI not found:

# Build the CLI\ncd cli && make build\n\n# Check binary exists\nls -la ./cli/zig-out/bin/ml\n

Configuration not found:

# Create configuration\n./cli/zig-out/bin/ml init\n\n# Check config file\nls -la ~/.ml/config.toml\n

Worker connection failed:

# Test SSH connection\nssh -p 22 mluser@worker.local\n\n# Check configuration\ncat ~/.ml/config.toml\n

Sync not working:

# Check rsync availability\nrsync --version\n\n# Test manual sync\nrsync -avz ./project/ mluser@worker.local:/tmp/test/\n

WebSocket connection failed:

# Check worker WebSocket port\ntelnet worker.local 9100\n\n# Verify API key\n./cli/zig-out/bin/ml status\n

API not responding:

./tools/manage.sh status\n./tools/manage.sh logs\n

Authentication failed:

# Check API key in config-local.yaml\ngrep -A 5 \"api_keys:\" configs/config-local.yaml\n

Redis connection failed:

# Check Redis status\nredis-cli ping\n\n# Start Redis\nredis-server\n

"},{"location":"cli-reference/#getting-help","title":"Getting Help","text":"
# CLI help\n./cli/zig-out/bin/ml help\n\n# Management script help\n./tools/manage.sh help\n\n# Check all available commands\nmake help\n

That's it for the CLI reference! For complete setup instructions, see the main index.

"},{"location":"configuration-schema/","title":"Configuration Schema","text":"

Complete reference for Fetch ML configuration options.

"},{"location":"configuration-schema/#configuration-file-structure","title":"Configuration File Structure","text":"

Fetch ML uses YAML configuration files. The main configuration file is typically config.yaml.

"},{"location":"configuration-schema/#full-schema","title":"Full Schema","text":"
# Server Configuration\nserver:\n  address: \":9101\"\n  tls:\n    enabled: false\n    cert_file: \"\"\n    key_file: \"\"\n\n# Database Configuration\ndatabase:\n  type: \"sqlite\"  # sqlite, postgres, mysql\n  connection: \"fetch_ml.db\"\n  host: \"localhost\"\n  port: 5432\n  username: \"postgres\"\n  password: \"\"\n  database: \"fetch_ml\"\n\n# Redis Configuration\n\n## Quick Reference\n\n### Database Types\n- **SQLite**: `type: sqlite, connection: file.db`\n- **PostgreSQL**: `type: postgres, host: localhost, port: 5432`\n\n### Key Settings\n- `server.address: :9101`\n- `database.type: sqlite`\n- `redis.addr: localhost:6379`\n- `auth.enabled: true`\n- `logging.level: info`\n\n### Environment Override\nexport FETCHML_SERVER_ADDRESS=:8080\nexport FETCHML_DATABASE_TYPE=postgres\n
"},{"location":"configuration-schema/#validation","title":"Validation","text":"
make configlint\n
"},{"location":"deployment/","title":"ML Experiment Manager - Deployment Guide","text":""},{"location":"deployment/#overview","title":"Overview","text":"

The ML Experiment Manager supports multiple deployment methods from local development to homelab Docker setups.

"},{"location":"deployment/#quick-start","title":"Quick Start","text":""},{"location":"deployment/#docker-compose-recommended-for-development","title":"Docker Compose (Recommended for Development)","text":"
# Clone repository\ngit clone https://github.com/your-org/fetch_ml.git\ncd fetch_ml\n\n# Start all services (testing only)\ndocker-compose up -d\n\n# Check status\ndocker-compose ps\n\n# View logs\ndocker-compose logs -f api-server\n

Access the API at http://localhost:9100

"},{"location":"deployment/#deployment-options","title":"Deployment Options","text":""},{"location":"deployment/#1-local-development","title":"1. Local Development","text":""},{"location":"deployment/#prerequisites","title":"Prerequisites","text":"

Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution. Toolchain: - Go 1.25+ - Zig 0.15.2 - Redis 7+ - Docker & Docker Compose (optional)

"},{"location":"deployment/#manual-setup","title":"Manual Setup","text":"
# Start Redis\nredis-server\n\n# Build and run Go server\ngo build -o bin/api-server ./cmd/api-server\n./bin/api-server -config configs/config-local.yaml\n\n# Build Zig CLI\ncd cli\nzig build prod\n./zig-out/bin/ml --help\n
"},{"location":"deployment/#2-docker-deployment","title":"2. Docker Deployment","text":""},{"location":"deployment/#build-image","title":"Build Image","text":"
docker build -t ml-experiment-manager:latest .\n
"},{"location":"deployment/#run-container","title":"Run Container","text":"
docker run -d \\\n  --name ml-api \\\n  -p 9100:9100 \\\n  -p 9101:9101 \\\n  -v $(pwd)/configs:/app/configs:ro \\\n  -v experiment-data:/data/ml-experiments \\\n  ml-experiment-manager:latest\n
"},{"location":"deployment/#docker-compose","title":"Docker Compose","text":"
# Detached mode\ndocker-compose -f docker-compose.yml up -d\n\n# Foreground mode with logs\ndocker-compose -f docker-compose.yml up\n
"},{"location":"deployment/#3-homelab-setup","title":"3. Homelab Setup","text":"
# Use the simple setup script\n./setup.sh\n\n# Or manually with Docker Compose (testing only)\ndocker-compose up -d\n
"},{"location":"deployment/#4-cloud-deployment","title":"4. Cloud Deployment","text":""},{"location":"deployment/#aws-ecs","title":"AWS ECS","text":"
# Build and push to ECR\naws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY\ndocker build -t $ECR_REGISTRY/ml-experiment-manager:latest .\ndocker push $ECR_REGISTRY/ml-experiment-manager:latest\n\n# Deploy with ECS CLI\necs-cli compose --project-name ml-experiment-manager up\n
"},{"location":"deployment/#google-cloud-run","title":"Google Cloud Run","text":"
# Build and push\ngcloud builds submit --tag gcr.io/$PROJECT_ID/ml-experiment-manager\n\n# Deploy\ngcloud run deploy ml-experiment-manager \\\n  --image gcr.io/$PROJECT_ID/ml-experiment-manager \\\n  --platform managed \\\n  --region us-central1 \\\n  --allow-unauthenticated\n
"},{"location":"deployment/#configuration","title":"Configuration","text":""},{"location":"deployment/#environment-variables","title":"Environment Variables","text":"
# configs/config-local.yaml\nbase_path: \"/data/ml-experiments\"\nauth:\n  enabled: true\n  api_keys:\n    - \"your-production-api-key\"\nserver:\n  address: \":9100\"\n  tls:\n    enabled: true\n    cert_file: \"/app/ssl/cert.pem\"\n    key_file: \"/app/ssl/key.pem\"\n
"},{"location":"deployment/#docker-compose-environment","title":"Docker Compose Environment","text":"
# docker-compose.yml\nversion: '3.8'\nservices:\n  api-server:\n    environment:\n      - REDIS_URL=redis://redis:6379\n      - LOG_LEVEL=info\n    volumes:\n      - ./configs:/configs:ro\n      - ./data:/data/experiments\n
"},{"location":"deployment/#monitoring-logging","title":"Monitoring & Logging","text":""},{"location":"deployment/#health-checks","title":"Health Checks","text":""},{"location":"deployment/#metrics","title":"Metrics","text":""},{"location":"deployment/#logging","title":"Logging","text":""},{"location":"deployment/#security","title":"Security","text":""},{"location":"deployment/#tls-configuration","title":"TLS Configuration","text":"
# Generate self-signed cert (development)\nopenssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes\n\n# Production - use Let's Encrypt\ncertbot certonly --standalone -d ml-experiments.example.com\n
"},{"location":"deployment/#network-security","title":"Network Security","text":""},{"location":"deployment/#performance-tuning","title":"Performance Tuning","text":""},{"location":"deployment/#resource-allocation","title":"Resource Allocation","text":"
resources:\n  requests:\n    memory: \"256Mi\"\n    cpu: \"250m\"\n  limits:\n    memory: \"1Gi\"\n    cpu: \"1000m\"\n
"},{"location":"deployment/#scaling-strategies","title":"Scaling Strategies","text":""},{"location":"deployment/#backup-recovery","title":"Backup & Recovery","text":""},{"location":"deployment/#data-backup","title":"Data Backup","text":"
# Backup experiment data\ndocker-compose exec redis redis-cli BGSAVE\ndocker cp $(docker-compose ps -q redis):/data/dump.rdb ./redis-backup.rdb\n\n# Backup data volume\ndocker run --rm -v ml-experiments_redis_data:/data -v $(pwd):/backup alpine tar czf /backup/redis-backup.tar.gz -C /data .\n
"},{"location":"deployment/#disaster-recovery","title":"Disaster Recovery","text":"
  1. Restore Redis data
  2. Restart services
  3. Verify experiment metadata
  4. Test API endpoints
"},{"location":"deployment/#troubleshooting","title":"Troubleshooting","text":""},{"location":"deployment/#common-issues","title":"Common Issues","text":""},{"location":"deployment/#api-server-not-starting","title":"API Server Not Starting","text":"
# Check logs\ndocker-compose logs api-server\n\n# Check configuration\ncat configs/config-local.yaml\n\n# Check Redis connection\ndocker-compose exec redis redis-cli ping\n
"},{"location":"deployment/#websocket-connection-issues","title":"WebSocket Connection Issues","text":"
# Test WebSocket\nwscat -c ws://localhost:9100/ws\n\n# Check TLS\nopenssl s_client -connect localhost:9101 -servername localhost\n
"},{"location":"deployment/#performance-issues","title":"Performance Issues","text":"
# Check resource usage\ndocker-compose exec api-server ps aux\n\n# Check Redis memory\ndocker-compose exec redis redis-cli info memory\n
"},{"location":"deployment/#debug-mode","title":"Debug Mode","text":"
# Enable debug logging\nexport LOG_LEVEL=debug\n./bin/api-server -config configs/config-local.yaml\n
"},{"location":"deployment/#cicd-integration","title":"CI/CD Integration","text":""},{"location":"deployment/#github-actions","title":"GitHub Actions","text":""},{"location":"deployment/#deployment-pipeline","title":"Deployment Pipeline","text":"
  1. Code commit \u2192 GitHub
  2. CI/CD pipeline triggers
  3. Build and test
  4. Security scan
  5. Deploy to staging
  6. Run integration tests
  7. Deploy to production
  8. Post-deployment verification
"},{"location":"deployment/#support","title":"Support","text":"

For deployment issues: 1. Check this guide 2. Review logs 3. Check GitHub Issues 4. Contact maintainers

"},{"location":"development-setup/","title":"Development Setup","text":"

Set up your local development environment for Fetch ML.

"},{"location":"development-setup/#prerequisites","title":"Prerequisites","text":"

Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution

"},{"location":"development-setup/#quick-setup","title":"Quick Setup","text":"
# Clone repository\ngit clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\n\n# Start dependencies (see Quick Start (quick-start.md) for Docker setup)\ndocker-compose up -d redis postgres\n\n# Build all components\nmake build\n\n# Run tests (see Testing Guide (testing.md))\nmake test-unit\n
"},{"location":"development-setup/#detailed-setup","title":"Detailed Setup","text":""},{"location":"development-setup/#quick-start","title":"Quick Start","text":"
git clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\ndocker-compose up -d redis postgres  # see Quick Start (quick-start.md) for Docker setup\nmake build\nmake test-unit  # see Testing Guide (testing.md)\n
"},{"location":"development-setup/#key-commands","title":"Key Commands","text":""},{"location":"development-setup/#common-issues","title":"Common Issues","text":""},{"location":"environment-variables/","title":"Environment Variables","text":"

Fetch ML supports environment variables for configuration, allowing you to override config file settings and deploy in different environments.

"},{"location":"environment-variables/#priority-order","title":"Priority Order","text":"
  1. Environment variables (highest priority)
  2. Configuration file values
  3. Default values (lowest priority)
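This resolution order can be sketched as follows (a hypothetical helper, not the actual configuration loader):

```python
import os

def resolve_setting(env_var, config_value, default):
    """Resolve a setting: environment variable wins, then config file, then default."""
    env_value = os.environ.get(env_var)
    if env_value is not None:
        return env_value
    if config_value is not None:
        return config_value
    return default

# Example: FETCH_ML_CLI_PORT overrides the config file's worker_port
os.environ["FETCH_ML_CLI_PORT"] = "2222"
print(resolve_setting("FETCH_ML_CLI_PORT", "22", "22"))  # -> 2222
```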
"},{"location":"environment-variables/#variable-prefixes","title":"Variable Prefixes","text":""},{"location":"environment-variables/#general-configuration","title":"General Configuration","text":""},{"location":"environment-variables/#cli-configuration","title":"CLI Configuration","text":""},{"location":"environment-variables/#tui-configuration","title":"TUI Configuration","text":""},{"location":"environment-variables/#cli-environment-variables","title":"CLI Environment Variables","text":"Variable Config Field Example FETCH_ML_CLI_HOST worker_host localhost FETCH_ML_CLI_USER worker_user mluser FETCH_ML_CLI_BASE worker_base /opt/ml FETCH_ML_CLI_PORT worker_port 22 FETCH_ML_CLI_API_KEY api_key your-api-key-here"},{"location":"environment-variables/#tui-environment-variables","title":"TUI Environment Variables","text":"Variable Config Field Example FETCH_ML_TUI_HOST host localhost FETCH_ML_TUI_USER user mluser FETCH_ML_TUI_SSH_KEY ssh_key ~/.ssh/id_rsa FETCH_ML_TUI_PORT port 22 FETCH_ML_TUI_BASE_PATH base_path /opt/ml FETCH_ML_TUI_TRAIN_SCRIPT train_script train.py FETCH_ML_TUI_REDIS_ADDR redis_addr localhost:6379 FETCH_ML_TUI_REDIS_PASSWORD redis_password `` FETCH_ML_TUI_REDIS_DB redis_db 0 FETCH_ML_TUI_KNOWN_HOSTS known_hosts ~/.ssh/known_hosts"},{"location":"environment-variables/#server-environment-variables-auth-debug","title":"Server Environment Variables (Auth & Debug)","text":"

These variables control server-side authentication behavior and are intended only for local development and debugging.

Variable Purpose Allowed In Production? FETCH_ML_ALLOW_INSECURE_AUTH When set to 1 and FETCH_ML_DEBUG=1, allows the API server to run with auth.enabled: false by injecting a default admin user. No. Must never be set in production. FETCH_ML_DEBUG Enables additional debug behaviors. Required (set to 1) to activate the insecure auth bypass above. No. Must never be set in production.

When both variables are set to 1 and auth.enabled is false, the server logs a clear warning and treats all requests as coming from a default admin user. This mode is convenient for local homelab experiments but is insecure by design and must not be used on any shared or internet-facing environment.
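The gate described above amounts to a small predicate; this sketch mirrors the documented conditions (the function name is illustrative, not the server's actual code):

```python
import os

def insecure_auth_allowed(auth_enabled: bool) -> bool:
    """The bypass activates only when auth is disabled AND both debug
    variables are explicitly set to "1"; otherwise normal auth applies."""
    if auth_enabled:
        return False  # normal auth path; no bypass possible
    return (os.environ.get("FETCH_ML_DEBUG") == "1"
            and os.environ.get("FETCH_ML_ALLOW_INSECURE_AUTH") == "1")

os.environ["FETCH_ML_DEBUG"] = "1"
os.environ["FETCH_ML_ALLOW_INSECURE_AUTH"] = "1"
print(insecure_auth_allowed(auth_enabled=False))  # -> True (requests treated as default admin)
```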

"},{"location":"environment-variables/#usage-examples","title":"Usage Examples","text":""},{"location":"environment-variables/#development-environment","title":"Development Environment","text":"
export FETCH_ML_CLI_HOST=localhost\nexport FETCH_ML_CLI_USER=devuser\nexport FETCH_ML_CLI_API_KEY=dev-key-123456789012\n./ml status\n
"},{"location":"environment-variables/#production-environment","title":"Production Environment","text":"
export FETCH_ML_CLI_HOST=prod-server.example.com\nexport FETCH_ML_CLI_USER=mluser\nexport FETCH_ML_CLI_API_KEY=prod-key-abcdef1234567890\n./ml status\n
"},{"location":"environment-variables/#dockerkubernetes","title":"Docker/Kubernetes","text":"
env:\n  - name: FETCH_ML_CLI_HOST\n    value: \"ml-server.internal\"\n  - name: FETCH_ML_CLI_USER\n    value: \"mluser\"\n  - name: FETCH_ML_CLI_API_KEY\n    valueFrom:\n      secretKeyRef:\n        name: ml-secrets\n        key: api-key\n
"},{"location":"environment-variables/#using-env-file","title":"Using .env file","text":"
# Copy the example file\ncp .env.example .env\n\n# Edit with your values\nvim .env\n\n# Load in your shell (handles comments and quoted values)\nset -a; source .env; set +a\n
"},{"location":"environment-variables/#backward-compatibility","title":"Backward Compatibility","text":"

The CLI also supports the legacy ML_* prefix for backward compatibility, but FETCH_ML_CLI_* takes priority if both are set.

Legacy Variable New Variable ML_HOST FETCH_ML_CLI_HOST ML_USER FETCH_ML_CLI_USER ML_BASE FETCH_ML_CLI_BASE ML_PORT FETCH_ML_CLI_PORT ML_API_KEY FETCH_ML_CLI_API_KEY"},{"location":"first-experiment/","title":"First Experiment","text":"

Run your first machine learning experiment with Fetch ML.

"},{"location":"first-experiment/#prerequisites","title":"Prerequisites","text":"

Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution

"},{"location":"first-experiment/#experiment-workflow","title":"Experiment Workflow","text":""},{"location":"first-experiment/#1-prepare-your-ml-code","title":"1. Prepare Your ML Code","text":"

Create a simple Python script:

# experiment.py\nimport argparse\nimport json\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument('--epochs', type=int, default=10)\n    parser.add_argument('--lr', type=float, default=0.001)\n    parser.add_argument('--output', default='results.json')\n\n    args = parser.parse_args()\n\n    # Simulate training\n    results = {\n        'epochs': args.epochs,\n        'learning_rate': args.lr,\n        'accuracy': 0.85 + (args.lr * 0.1),\n        'loss': 0.5 - (args.epochs * 0.01),\n        'training_time': args.epochs * 0.1\n    }\n\n    # Save results\n    with open(args.output, 'w') as f:\n        json.dump(results, f, indent=2)\n\n    print(f\"Training completed: {results}\")\n    return results\n\nif __name__ == '__main__':\n    main()\n
"},{"location":"first-experiment/#2-submit-job-via-api","title":"2. Submit Job via API","text":"
# Submit experiment\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n  -H \"Content-Type: application/json\" \\\n  -H \"X-API-Key: your-api-key\" \\\n  -d '{\n    \"job_name\": \"first-experiment\",\n    \"args\": \"--epochs 20 --lr 0.01 --output experiment_results.json\",\n    \"priority\": 1,\n    \"metadata\": {\n      \"experiment_type\": \"training\",\n      \"dataset\": \"sample_data\"\n    }\n  }'\n
"},{"location":"first-experiment/#3-monitor-progress","title":"3. Monitor Progress","text":"
# Check job status\ncurl -H \"X-API-Key: your-api-key\" \\\n  http://localhost:9101/api/v1/jobs/first-experiment\n\n# List all jobs\ncurl -H \"X-API-Key: your-api-key\" \\\n  http://localhost:9101/api/v1/jobs\n\n# Get job metrics\ncurl -H \"X-API-Key: your-api-key\" \\\n  http://localhost:9101/api/v1/jobs/first-experiment/metrics\n
"},{"location":"first-experiment/#4-use-cli","title":"4. Use CLI","text":"
# Submit with CLI\ncd cli && zig build dev\n./cli/zig-out/dev/ml submit \\\n  --name \"cli-experiment\" \\\n  --args \"--epochs 15 --lr 0.005\" \\\n  --server http://localhost:9101\n\n# Monitor with CLI\n./cli/zig-out/dev/ml list-jobs --server http://localhost:9101\n./cli/zig-out/dev/ml job-status cli-experiment --server http://localhost:9101\n
"},{"location":"first-experiment/#advanced-experiment","title":"Advanced Experiment","text":""},{"location":"first-experiment/#hyperparameter-tuning","title":"Hyperparameter Tuning","text":"
# Submit multiple experiments\nfor lr in 0.001 0.01 0.1; do\n  curl -X POST http://localhost:9101/api/v1/jobs \\\n    -H \"Content-Type: application/json\" \\\n    -H \"X-API-Key: your-api-key\" \\\n    -d \"{\n      \\\"job_name\\\": \\\"tune-lr-$lr\\\",\n      \\\"args\\\": \\\"--epochs 10 --lr $lr\\\",\n      \\\"metadata\\\": {\\\"learning_rate\\\": $lr}\n    }\"\ndone\n
"},{"location":"first-experiment/#batch-processing","title":"Batch Processing","text":"
# Submit batch job\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n  -H \"Content-Type: application/json\" \\\n  -H \"X-API-Key: your-api-key\" \\\n  -d '{\n    \"job_name\": \"batch-processing\",\n    \"args\": \"--input data/ --output results/ --batch-size 32\",\n    \"priority\": 2,\n    \"datasets\": [\"training_data\", \"validation_data\"]\n  }'\n
"},{"location":"first-experiment/#results-and-output","title":"Results and Output","text":""},{"location":"first-experiment/#access-results","title":"Access Results","text":"
# Download results\ncurl -H \"X-API-Key: your-api-key\" \\\n  http://localhost:9101/api/v1/jobs/first-experiment/results\n\n# View job details\ncurl -H \"X-API-Key: your-api-key\" \\\n  http://localhost:9101/api/v1/jobs/first-experiment | jq .\n
"},{"location":"first-experiment/#result-format","title":"Result Format","text":"
{\n  \"job_id\": \"first-experiment\",\n  \"status\": \"completed\",\n  \"results\": {\n    \"epochs\": 20,\n    \"learning_rate\": 0.01,\n    \"accuracy\": 0.86,\n    \"loss\": 0.3,\n    \"training_time\": 2.0\n  },\n  \"metrics\": {\n    \"gpu_utilization\": \"85%\",\n    \"memory_usage\": \"2GB\",\n    \"execution_time\": \"120s\"\n  }\n}\n
"},{"location":"first-experiment/#best-practices","title":"Best Practices","text":""},{"location":"first-experiment/#job-naming","title":"Job Naming","text":""},{"location":"first-experiment/#metadata-usage","title":"Metadata Usage","text":"
{\n  \"metadata\": {\n    \"experiment_type\": \"training\",\n    \"model_version\": \"v2.1\",\n    \"dataset\": \"imagenet-2024\",\n    \"environment\": \"gpu\",\n    \"team\": \"ml-team\"\n  }\n}\n
"},{"location":"first-experiment/#error-handling","title":"Error Handling","text":"
# Check failed jobs\ncurl -H \"X-API-Key: your-api-key\" \\\n  \"http://localhost:9101/api/v1/jobs?status=failed\"\n\n# Retry failed job\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n  -H \"Content-Type: application/json\" \\\n  -H \"X-API-Key: your-api-key\" \\\n  -d '{\n    \"job_name\": \"retry-experiment\",\n    \"args\": \"--epochs 20 --lr 0.01\",\n    \"metadata\": {\"retry_of\": \"first-experiment\"}\n  }'\n
"},{"location":"first-experiment/#related-documentation","title":"Related Documentation","text":""},{"location":"first-experiment/#troubleshooting","title":"Troubleshooting","text":"

Job stuck in pending? - Check worker status: curl /api/v1/workers - Verify resources: docker stats - Check logs: docker-compose logs api-server

Job failed? - Check error message: curl /api/v1/jobs/job-id - Review job arguments - Verify input data

No results? - Check job completion status - Verify output file paths - Check storage permissions

"},{"location":"installation/","title":"Simple Installation Guide","text":""},{"location":"installation/#quick-start-5-minutes","title":"Quick Start (5 minutes)","text":"
# 1. Install\ngit clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\nmake install\n\n# 2. Setup (auto-configures)\n./bin/ml setup\n\n# 3. Run experiments\n./bin/ml run my-experiment.py\n

That's it. Everything else is optional.

"},{"location":"installation/#what-if-i-want-more-control","title":"What If I Want More Control?","text":""},{"location":"installation/#manual-configuration-optional","title":"Manual Configuration (Optional)","text":"
# Edit settings if defaults don't work\nnano ~/.ml/config.toml\n
"},{"location":"installation/#monitoring-dashboard-optional","title":"Monitoring Dashboard (Optional)","text":"
# Real-time monitoring\n./bin/tui\n
"},{"location":"installation/#senior-developer-feedback","title":"Senior Developer Feedback","text":"

\"Keep it simple\" - Most data scientists want: 1. One installation command 2. Sensible defaults 3. Works without configuration 4. Advanced features available when needed

Current plan is too complex because it asks users to decide between: - CLI vs TUI vs Both - Zig vs Go build tools - Manual vs auto config - Multiple environment variables

Better approach: Start simple, add complexity gradually.

"},{"location":"installation/#recommended-simplified-workflow","title":"Recommended Simplified Workflow","text":"
  1. Single Binary - Combine CLI + basic TUI functionality
  2. Auto-Discovery - Detect common ML environments automatically
  3. Progressive Disclosure - Show advanced options only when needed
  4. Zero Config - Work out-of-the-box with localhost defaults

The goal: \"It just works\" for 80% of use cases.

"},{"location":"operations/","title":"Operations Runbook","text":"

Operational guide for troubleshooting and maintaining the ML experiment system.

"},{"location":"operations/#task-queue-operations","title":"Task Queue Operations","text":""},{"location":"operations/#monitoring-queue-health","title":"Monitoring Queue Health","text":"
# Check queue depth\nZCARD task:queue\n\n# List pending tasks\nZRANGE task:queue 0 -1 WITHSCORES\n\n# Check dead letter queue\nKEYS task:dlq:*\n
"},{"location":"operations/#handling-stuck-tasks","title":"Handling Stuck Tasks","text":"

Symptom: Tasks stuck in \"running\" status

Diagnosis:

# Check for expired leases\nredis-cli GET task:{task-id}\n# Look for LeaseExpiry in past\n

Remediation:

Tasks with expired leases are automatically reclaimed every minute. To force immediate reclamation:

# Restart worker to trigger reclaim cycle\nsystemctl restart ml-worker\n

"},{"location":"operations/#dead-letter-queue-management","title":"Dead Letter Queue Management","text":"

View failed tasks:

KEYS task:dlq:*\n

Inspect failed task:

GET task:dlq:{task-id}\n

Retry from DLQ:

# Manual retry (requires custom script)\n# 1. Get task from DLQ\n# 2. Reset retry count\n# 3. Re-queue task\n
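A hedged sketch of such a retry script, using plain dicts in place of the Redis string keys and sorted set (key names follow the `task:dlq:{task-id}` / `task:queue` convention above; the `retry_count` field is assumed from the retry-rate section below):

```python
import json
import time

def requeue_from_dlq(store: dict, queue: dict, task_id: str) -> None:
    """Re-queue a dead-lettered task: fetch it, reset its retry count,
    and add it back to the score-ordered queue.
    `store` stands in for Redis string keys, `queue` for the sorted set."""
    raw = store.pop(f"task:dlq:{task_id}")  # 1. Get task from DLQ
    task = json.loads(raw)
    task["retry_count"] = 0                 # 2. Reset retry count
    store[f"task:{task_id}"] = json.dumps(task)
    queue[task_id] = time.time()            # 3. Re-queue (ZADD with timestamp score)

store = {"task:dlq:job-42": json.dumps({"id": "job-42", "retry_count": 3})}
queue = {}
requeue_from_dlq(store, queue, "job-42")
```

Against a real deployment the same steps would be `GET`/`DEL` on the DLQ key, `SET` on the task key, and `ZADD` on `task:queue`.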

"},{"location":"operations/#worker-crashes","title":"Worker Crashes","text":"

Symptom: Worker disappeared mid-task

What Happens: 1. Lease expires after 30 minutes (default) 2. Background reclaim job detects expired lease 3. Task is retried (up to 3 attempts) 4. After max retries \u2192 Dead Letter Queue
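The reclaim decision described above can be sketched as follows (field names and the `reclaim` helper are illustrative, not the actual worker code):

```python
import time

MAX_RETRIES = 3  # per the docs: up to 3 attempts before dead-lettering

def reclaim(task: dict, now: float) -> str:
    """Decide what the background reclaim job does with a task whose
    lease may have expired."""
    if task["lease_expiry"] > now:
        return "leave"        # lease still valid; worker presumed alive
    if task["retry_count"] < MAX_RETRIES:
        task["retry_count"] += 1
        return "retry"        # re-queue the task for another attempt
    return "dead_letter"      # retries exhausted -> Dead Letter Queue

task = {"lease_expiry": time.time() - 60, "retry_count": 2}
print(reclaim(task, time.time()))  # -> retry (retry_count is now 3)
```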

Prevention: - Monitor worker heartbeats - Set up alerts for worker down - Use process manager (systemd, supervisor)

"},{"location":"operations/#worker-operations","title":"Worker Operations","text":""},{"location":"operations/#graceful-shutdown","title":"Graceful Shutdown","text":"
# Send SIGTERM for graceful shutdown\nkill -TERM $(pgrep ml-worker)\n\n# Worker will:\n# 1. Stop accepting new tasks\n# 2. Finish active tasks (up to 5min timeout)\n# 3. Release all leases\n# 4. Exit cleanly\n
"},{"location":"operations/#force-shutdown","title":"Force Shutdown","text":"
# Force kill (leases will be reclaimed automatically)\nkill -9 $(pgrep ml-worker)\n
"},{"location":"operations/#worker-heartbeat-monitoring","title":"Worker Heartbeat Monitoring","text":"
# Check worker heartbeats\nHGETALL worker:heartbeat\n\n# Example output:\n# worker-abc123 1701234567\n# worker-def456 1701234580\n

Alert if: Heartbeat timestamp > 5 minutes old
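A minimal sketch of that alert rule, assuming the `worker:heartbeat` hash maps worker IDs to Unix timestamps as shown above:

```python
STALE_AFTER_SECONDS = 300  # 5 minutes, per the alert rule

def stale_workers(heartbeats: dict, now: int) -> list:
    """Return worker IDs whose last heartbeat is older than the threshold.
    `heartbeats` mirrors the worker:heartbeat hash: id -> unix timestamp."""
    return [wid for wid, ts in heartbeats.items()
            if now - int(ts) > STALE_AFTER_SECONDS]

heartbeats = {"worker-abc123": 1701234567, "worker-def456": 1701234580}
print(stale_workers(heartbeats, now=1701234880))  # -> ['worker-abc123']
```

In practice the hash would come from `HGETALL worker:heartbeat` via a Redis client.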

"},{"location":"operations/#redis-operations","title":"Redis Operations","text":""},{"location":"operations/#backup","title":"Backup","text":"
# Manual backup\nredis-cli SAVE\ncp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb\n
"},{"location":"operations/#restore","title":"Restore","text":"
# Stop Redis\nsystemctl stop redis\n\n# Restore snapshot\ncp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb\n\n# Start Redis\nsystemctl start redis\n
"},{"location":"operations/#memory-management","title":"Memory Management","text":"
# Check memory usage\nINFO memory\n\n# Last resort: wipe the entire database\nFLUSHDB  # DANGER: Clears all data!\n
"},{"location":"operations/#common-issues","title":"Common Issues","text":""},{"location":"operations/#issue-queue-growing-unbounded","title":"Issue: Queue Growing Unbounded","text":"

Symptoms: - ZCARD task:queue keeps increasing - No workers processing tasks

Diagnosis:

# Check worker status\nsystemctl status ml-worker\n\n# Check logs\njournalctl -u ml-worker -n 100\n

Resolution: 1. Verify workers are running 2. Check Redis connectivity 3. Verify lease configuration

"},{"location":"operations/#issue-high-retry-rate","title":"Issue: High Retry Rate","text":"

Symptoms: - Many tasks in DLQ - retry_count field high on tasks

Diagnosis:

# Check worker logs for errors\njournalctl -u ml-worker | grep \"retry\"\n\n# Look for patterns (network issues, resource limits, etc)\n

Resolution: - Fix underlying issue (network, resources, etc) - Adjust retry limits if permanent failures - Increase task timeout if jobs are slow

"},{"location":"operations/#issue-leases-expiring-prematurely","title":"Issue: Leases Expiring Prematurely","text":"

Symptoms: - Tasks retried even though worker is healthy - Logs show \"lease expired\" frequently

Diagnosis:

# Check worker config\ncat configs/worker-config.yaml | grep -A3 \"lease\"\n\ntask_lease_duration: 30m  # Too short?\nheartbeat_interval: 1m    # Too infrequent?\n

Resolution:

# Increase lease duration for long-running jobs\ntask_lease_duration: 60m\nheartbeat_interval: 30s  # More frequent heartbeats\n

"},{"location":"operations/#performance-tuning","title":"Performance Tuning","text":""},{"location":"operations/#worker-concurrency","title":"Worker Concurrency","text":"
# worker-config.yaml\nmax_workers: 4  # Number of parallel tasks\n\n# Adjust based on:\n# - CPU cores available\n# - Memory per task\n# - GPU availability\n
"},{"location":"operations/#redis-configuration","title":"Redis Configuration","text":"
# /etc/redis/redis.conf\n\n# Persistence\nsave 900 1\nsave 300 10\n\n# Memory\nmaxmemory 2gb\nmaxmemory-policy noeviction\n\n# Performance\ntcp-keepalive 300\ntimeout 0\n
"},{"location":"operations/#alerting-rules","title":"Alerting Rules","text":""},{"location":"operations/#critical-alerts","title":"Critical Alerts","text":"
  1. Worker Down (no heartbeat > 5min)
  2. Queue Depth > 1000 tasks
  3. DLQ Growth > 100 tasks/hour
  4. Redis Down (connection failed)
"},{"location":"operations/#warning-alerts","title":"Warning Alerts","text":"
  1. High Retry Rate > 10% of tasks
  2. Slow Queue Drain (depth increasing over 1 hour)
  3. Worker Memory > 80% usage
"},{"location":"operations/#health-checks","title":"Health Checks","text":"
#!/bin/bash\n# health-check.sh\n\n# Check Redis\nredis-cli PING || echo \"Redis DOWN\"\n\n# Check worker heartbeat\nWORKER_ID=$(cat /var/run/ml-worker.pid)\nLAST_HB=$(redis-cli HGET worker:heartbeat \"$WORKER_ID\")\nNOW=$(date +%s)\nif [ -z \"$LAST_HB\" ] || [ $((NOW - LAST_HB)) -gt 300 ]; then\n  echo \"Worker heartbeat stale or missing\"\nfi\n\n# Check queue depth\nDEPTH=$(redis-cli ZCARD task:queue)\nif [ \"$DEPTH\" -gt 1000 ]; then\n  echo \"Queue depth critical: $DEPTH\"\nfi\n
"},{"location":"operations/#runbook-checklist","title":"Runbook Checklist","text":""},{"location":"operations/#daily-operations","title":"Daily Operations","text":"
  1. Check queue depth
  2. Verify worker heartbeats
  3. Review DLQ for patterns
  4. Check Redis memory usage
"},{"location":"operations/#weekly-operations","title":"Weekly Operations","text":"
  1. Review retry rates
  2. Analyze failed task patterns
  3. Backup Redis snapshot
  4. Review worker logs
"},{"location":"operations/#monthly-operations","title":"Monthly Operations","text":"
  1. Performance tuning review
  2. Capacity planning
  3. Update documentation
  4. Test disaster recovery

For homelab setups: Most of these operations can be simplified. Focus on: - Basic monitoring (queue depth, worker status) - Periodic Redis backups - Graceful shutdowns for maintenance

"},{"location":"production-monitoring/","title":"Production Monitoring Deployment Guide (Linux)","text":"

This guide covers deploying the monitoring stack (Prometheus, Grafana, Loki, Promtail) on Linux production servers.

"},{"location":"production-monitoring/#architecture","title":"Architecture","text":"

Testing: Docker Compose (macOS/Linux) Production: Podman + systemd (Linux)

Important: Docker is for testing only. Podman is used for running actual ML experiments in production.

Each service runs as a separate Podman container managed by systemd for automatic restarts and proper lifecycle management.

"},{"location":"production-monitoring/#prerequisites","title":"Prerequisites","text":"

Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution

"},{"location":"production-monitoring/#quick-setup","title":"Quick Setup","text":""},{"location":"production-monitoring/#1-run-setup-script","title":"1. Run Setup Script","text":"
cd /path/to/fetch_ml\nsudo ./scripts/setup-monitoring-prod.sh /data/monitoring ml-user ml-group\n

This will: - Create directory structure at /data/monitoring - Copy configuration files to /etc/fetch_ml/monitoring - Create systemd services for each component - Set up firewall rules

"},{"location":"production-monitoring/#2-start-services","title":"2. Start Services","text":"
# Start all monitoring services\nsudo systemctl start prometheus\nsudo systemctl start loki\nsudo systemctl start promtail\nsudo systemctl start grafana\n\n# Enable on boot\nsudo systemctl enable prometheus loki promtail grafana\n
"},{"location":"production-monitoring/#3-access-grafana","title":"3. Access Grafana","text":"

Dashboards will auto-load: - ML Task Queue Monitoring (metrics) - Application Logs (Loki logs)

"},{"location":"production-monitoring/#service-details","title":"Service Details","text":""},{"location":"production-monitoring/#prometheus","title":"Prometheus","text":""},{"location":"production-monitoring/#loki","title":"Loki","text":""},{"location":"production-monitoring/#promtail","title":"Promtail","text":""},{"location":"production-monitoring/#grafana","title":"Grafana","text":""},{"location":"production-monitoring/#management-commands","title":"Management Commands","text":"
# Check status\nsudo systemctl status prometheus grafana loki promtail\n\n# View logs\nsudo journalctl -u prometheus -f\nsudo journalctl -u grafana -f\nsudo journalctl -u loki -f\nsudo journalctl -u promtail -f\n\n# Restart services\nsudo systemctl restart prometheus\nsudo systemctl restart grafana\n\n# Stop all monitoring\nsudo systemctl stop prometheus grafana loki promtail\n
"},{"location":"production-monitoring/#data-retention","title":"Data Retention","text":""},{"location":"production-monitoring/#prometheus_1","title":"Prometheus","text":"

Default: 15 days. Retention is set with a command-line flag rather than in prometheus.yml; add it to the Prometheus start command (e.g., in its systemd unit or container args):

prometheus --storage.tsdb.retention.time=30d\n

"},{"location":"production-monitoring/#loki_1","title":"Loki","text":"

Default: 30 days. To change it, edit /etc/fetch_ml/monitoring/loki-config.yml:

limits_config:\n  retention_period: 60d  # default is 30d\n

"},{"location":"production-monitoring/#security","title":"Security","text":""},{"location":"production-monitoring/#firewall","title":"Firewall","text":"

The setup script automatically configures firewall rules using the detected firewall manager (firewalld or ufw).

For manual firewall configuration:

RHEL/Rocky/Fedora (firewalld):

# Remove public access\nsudo firewall-cmd --permanent --remove-port=3000/tcp\nsudo firewall-cmd --permanent --remove-port=9090/tcp\n\n# Add specific source\nsudo firewall-cmd --permanent --add-rich-rule='rule family=\"ipv4\" source address=\"10.0.0.0/24\" port port=\"3000\" protocol=\"tcp\" accept'\nsudo firewall-cmd --reload\n

Ubuntu/Debian (ufw):

# Remove public access\nsudo ufw delete allow 3000/tcp\nsudo ufw delete allow 9090/tcp\n\n# Add specific source\nsudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp\n

"},{"location":"production-monitoring/#authentication","title":"Authentication","text":"

Change Grafana admin password: 1. Login to Grafana 2. User menu \u2192 Profile \u2192 Change Password

"},{"location":"production-monitoring/#tls-optional","title":"TLS (Optional)","text":"

For HTTPS, configure reverse proxy (nginx/Apache) in front of Grafana.

"},{"location":"production-monitoring/#troubleshooting","title":"Troubleshooting","text":""},{"location":"production-monitoring/#grafana-shows-no-data","title":"Grafana shows no data","text":"
# Check if Prometheus is reachable\ncurl http://localhost:9090/-/healthy\n\n# Check datasource in Grafana\n# Settings \u2192 Data Sources \u2192 Prometheus \u2192 Save & Test\n
"},{"location":"production-monitoring/#loki-not-receiving-logs","title":"Loki not receiving logs","text":"
# Check Promtail is running\nsudo systemctl status promtail\n\n# Verify log file exists\nls -l /var/log/fetch_ml/\n\n# Check Promtail can reach Loki\ncurl http://localhost:3100/ready\n
"},{"location":"production-monitoring/#podman-containers-not-starting","title":"Podman containers not starting","text":"
# Check pod status\nsudo -u ml-user podman pod ps\nsudo -u ml-user podman ps -a\n\n# Remove and recreate\nsudo -u ml-user podman pod stop monitoring\nsudo -u ml-user podman pod rm monitoring\nsudo systemctl restart prometheus\n
"},{"location":"production-monitoring/#backup","title":"Backup","text":"
# Backup Grafana dashboards and data\nsudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana\n\n# Backup Prometheus data\nsudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus\n
"},{"location":"production-monitoring/#updates","title":"Updates","text":"
# Pull latest images\nsudo -u ml-user podman pull docker.io/grafana/grafana:latest\nsudo -u ml-user podman pull docker.io/prom/prometheus:latest\nsudo -u ml-user podman pull docker.io/grafana/loki:latest\nsudo -u ml-user podman pull docker.io/grafana/promtail:latest\n\n# Restart services to use new images\nsudo systemctl restart grafana prometheus loki promtail\n
"},{"location":"queue/","title":"Task Queue Architecture","text":"

The task queue system enables reliable job processing between the API server and workers using Redis.

"},{"location":"queue/#overview","title":"Overview","text":"
graph LR\n    CLI[CLI/Client] -->|WebSocket| API[API Server]\n    API -->|Enqueue| Redis[(Redis)]\n    Redis -->|Dequeue| Worker[Worker]\n    Worker -->|Update Status| Redis\n
"},{"location":"queue/#components","title":"Components","text":""},{"location":"queue/#taskqueue-internalqueue","title":"TaskQueue (internal/queue)","text":"

Shared package used by both API server and worker for job management.

"},{"location":"queue/#task-structure","title":"Task Structure","text":"
type Task struct {\n    ID        string            // Unique task ID (UUID)\n    JobName   string            // User-defined job name  \n    Args      string            // Job arguments\n    Status    string            // queued, running, completed, failed\n    Priority  int64             // Higher = executed first\n    CreatedAt time.Time         \n    StartedAt *time.Time        \n    EndedAt   *time.Time        \n    WorkerID  string            \n    Error     string            \n    Datasets  []string          \n    Metadata  map[string]string // commit_id, user, etc\n}\n
"},{"location":"queue/#taskqueue-interface","title":"TaskQueue Interface","text":"
// Initialize queue (use a name like tq so the queue package is not shadowed)\ntq, err := queue.NewTaskQueue(queue.Config{\n    RedisAddr:     \"localhost:6379\",\n    RedisPassword: \"\",\n    RedisDB:       0,\n})\n\n// Add task (API server)\ntask := &queue.Task{\n    ID:       uuid.New().String(),\n    JobName:  \"train-model\",\n    Status:   \"queued\",\n    Priority: 5,\n    Metadata: map[string]string{\n        \"commit_id\": commitID,\n        \"user\":      username,\n    },\n}\nerr = tq.AddTask(task)\n\n// Get next task (Worker)\ntask, err := tq.GetNextTask()\n\n// Update task status\ntask.Status = \"running\"\nerr = tq.UpdateTask(task)\n
"},{"location":"queue/#data-flow","title":"Data Flow","text":""},{"location":"queue/#job-submission-flow","title":"Job Submission Flow","text":"
sequenceDiagram\n    participant CLI\n    participant API\n    participant Redis\n    participant Worker\n\n    CLI->>API: Queue Job (WebSocket)\n    API->>API: Create Task (UUID)\n    API->>Redis: ZADD task:queue\n    API->>Redis: SET task:{id}\n    API->>CLI: Success Response\n\n    Worker->>Redis: ZPOPMAX task:queue\n    Redis->>Worker: Task ID\n    Worker->>Redis: GET task:{id}\n    Redis->>Worker: Task Data\n    Worker->>Worker: Execute Job\n    Worker->>Redis: Update Status\n
"},{"location":"queue/#protocol","title":"Protocol","text":"

CLI \u2192 API (Binary WebSocket):

[opcode:1][api_key_hash:64][commit_id:64][priority:1][job_name_len:1][job_name:var]\n
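The frame above can be decoded with plain offset arithmetic. The sketch below is illustrative (names like parseQueueJob are hypothetical; the real parser lives in internal/api/ws.go) and assumes the hash and commit ID are fixed 64-byte ASCII fields:

```go
package main

import (
	"errors"
	"fmt"
)

// queueJobRequest mirrors the wire layout:
// [opcode:1][api_key_hash:64][commit_id:64][priority:1][job_name_len:1][job_name:var]
type queueJobRequest struct {
	Opcode     byte
	APIKeyHash string // hex-encoded, 64 bytes on the wire
	CommitID   string // 64 bytes on the wire
	Priority   uint8
	JobName    string
}

func parseQueueJob(p []byte) (*queueJobRequest, error) {
	const header = 1 + 64 + 64 + 1 + 1 // 131 bytes before the variable-length name
	if len(p) < header {
		return nil, errors.New("payload too short")
	}
	nameLen := int(p[130])
	if len(p) < header+nameLen {
		return nil, errors.New("truncated job name")
	}
	return &queueJobRequest{
		Opcode:     p[0],
		APIKeyHash: string(p[1:65]),
		CommitID:   string(p[65:129]),
		Priority:   p[129],
		JobName:    string(p[131 : 131+nameLen]),
	}, nil
}

func main() {
	payload := append([]byte{1}, make([]byte, 128)...) // opcode + zeroed hash/commit
	payload = append(payload, 5, 5)                    // priority=5, name_len=5
	payload = append(payload, []byte("train")...)
	req, _ := parseQueueJob(payload)
	fmt.Println(req.JobName, req.Priority) // train 5
}
```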

API \u2192 Redis: - Priority queue: ZADD task:queue {priority} {task_id} - Task data: SET task:{id} {json} - Status: HSET task:status:{job_name} ...

Worker \u2190 Redis: - Poll: ZPOPMAX task:queue 1 (highest priority first) - Fetch: GET task:{id}

"},{"location":"queue/#redis-data-structures","title":"Redis Data Structures","text":""},{"location":"queue/#keys","title":"Keys","text":"
task:queue                    # ZSET: priority queue\ntask:{uuid}                  # STRING: task JSON data\ntask:status:{job_name}       # HASH: job status\nworker:heartbeat             # HASH: worker health\njob:metrics:{job_name}       # HASH: job metrics\n
"},{"location":"queue/#priority-queue-zset","title":"Priority Queue (ZSET)","text":"
ZADD task:queue 10 \"uuid-1\"   # Priority 10\nZADD task:queue 5  \"uuid-2\"   # Priority 5\nZPOPMAX task:queue 1          # Returns uuid-1 (highest)\n
"},{"location":"queue/#api-server-integration","title":"API Server Integration","text":""},{"location":"queue/#initialization","title":"Initialization","text":"
// cmd/api-server/main.go\nqueueCfg := queue.Config{\n    RedisAddr:     cfg.Redis.Addr,\n    RedisPassword: cfg.Redis.Password,\n    RedisDB:       cfg.Redis.DB,\n}\ntaskQueue, err := queue.NewTaskQueue(queueCfg)\n
"},{"location":"queue/#websocket-handler","title":"WebSocket Handler","text":"
// internal/api/ws.go\nfunc (h *WSHandler) handleQueueJob(conn *websocket.Conn, payload []byte) error {\n    // Parse request\n    apiKeyHash, commitID, priority, jobName := parsePayload(payload)\n\n    // Create task with unique ID\n    taskID := uuid.New().String()\n    task := &queue.Task{\n        ID:       taskID,\n        JobName:  jobName,\n        Status:   \"queued\",\n        Priority: int64(priority),\n        Metadata: map[string]string{\n            \"commit_id\": commitID,\n            \"user\":      user,\n        },\n    }\n\n    // Enqueue\n    if err := h.queue.AddTask(task); err != nil {\n        return h.sendErrorPacket(conn, ErrorCodeDatabaseError, ...)\n    }\n\n    return h.sendSuccessPacket(conn, \"Job queued\")\n}\n
"},{"location":"queue/#worker-integration","title":"Worker Integration","text":""},{"location":"queue/#task-polling","title":"Task Polling","text":"
// cmd/worker/worker_server.go\nfunc (w *Worker) Start() error {\n    for {\n        task, err := w.queue.WaitForNextTask(ctx, 5*time.Second)\n        if task != nil {\n            go w.executeTask(task)\n        }\n    }\n}\n
"},{"location":"queue/#task-execution","title":"Task Execution","text":"
func (w *Worker) executeTask(task *queue.Task) {\n    // Mark running\n    now := time.Now()\n    task.Status = \"running\"\n    task.StartedAt = &now\n    w.queue.UpdateTaskWithMetrics(task, \"start\")\n\n    // Execute\n    err := w.runJob(task)\n\n    // Finalize\n    endTime := time.Now()\n    task.Status = \"completed\"\n    if err != nil {\n        task.Status = \"failed\"\n        task.Error = err.Error()\n    }\n    task.EndedAt = &endTime\n    w.queue.UpdateTaskWithMetrics(task, \"final\")\n}\n
"},{"location":"queue/#configuration","title":"Configuration","text":""},{"location":"queue/#api-server-configsconfigyaml","title":"API Server (configs/config.yaml)","text":"
redis:\n  addr: \"localhost:6379\"\n  password: \"\"\n  db: 0\n
"},{"location":"queue/#worker-configsworker-configyaml","title":"Worker (configs/worker-config.yaml)","text":"
redis:\n  addr: \"localhost:6379\"\n  password: \"\"\n  db: 0\n\nmetrics_flush_interval: 500ms\n
"},{"location":"queue/#monitoring","title":"Monitoring","text":""},{"location":"queue/#queue-depth","title":"Queue Depth","text":"
depth, err := queue.QueueDepth()\nfmt.Printf(\"Pending tasks: %d\\n\", depth)\n
"},{"location":"queue/#worker-heartbeat","title":"Worker Heartbeat","text":"
// Worker sends heartbeat every 30s\nerr := queue.Heartbeat(workerID)\n
"},{"location":"queue/#metrics","title":"Metrics","text":"
HGETALL job:metrics:{job_name}\n# Returns: timestamp, tasks_start, tasks_final, etc\n
"},{"location":"queue/#error-handling","title":"Error Handling","text":""},{"location":"queue/#task-failures","title":"Task Failures","text":"
if err := w.runJob(task); err != nil {\n    task.Status = \"failed\"\n    task.Error = err.Error()\n    w.queue.UpdateTask(task)\n}\n
"},{"location":"queue/#redis-connection-loss","title":"Redis Connection Loss","text":"
// TaskQueue automatically reconnects\n// Workers should implement retry logic with backoff\nvar task *queue.Task\nvar err error\nfor retries := 0; retries < 3; retries++ {\n    task, err = queue.GetNextTask()\n    if err == nil {\n        break\n    }\n    time.Sleep(time.Duration(retries+1) * time.Second) // linear backoff\n}\n
"},{"location":"queue/#testing","title":"Testing","text":"
// tests using miniredis\ns, _ := miniredis.Run()\ndefer s.Close()\n\ntq, _ := queue.NewTaskQueue(queue.Config{\n    RedisAddr: s.Addr(),\n})\n\ntask := &queue.Task{ID: \"test-1\", JobName: \"test\"}\ntq.AddTask(task)\n\nfetched, _ := tq.GetNextTask()\n// assert fetched.ID == \"test-1\"\n
"},{"location":"queue/#best-practices","title":"Best Practices","text":"
  1. Unique Task IDs: Always use UUIDs to avoid conflicts
  2. Metadata: Store commit_id and user in task metadata
  3. Priority: Higher values execute first (0-255 range)
  4. Status Updates: Update status at each lifecycle stage
  5. Error Logging: Store detailed errors in task.Error
  6. Heartbeats: Workers should send heartbeats regularly
  7. Metrics: Use UpdateTaskWithMetrics for atomic updates

For implementation details, see: - internal/queue/task.go - internal/queue/queue.go

"},{"location":"quick-start/","title":"Quick Start","text":"

Get Fetch ML running in minutes with Docker Compose.

"},{"location":"quick-start/#prerequisites","title":"Prerequisites","text":"

Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution

"},{"location":"quick-start/#one-command-setup","title":"One-Command Setup","text":"
# Clone and start\ngit clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\ndocker-compose up -d  # testing only\n\n# Wait for services (30 seconds)\nsleep 30\n\n# Verify setup\ncurl http://localhost:9101/health\n
"},{"location":"quick-start/#first-experiment","title":"First Experiment","text":"
# Submit a simple ML job (see [First Experiment](first-experiment.md) for details)\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n  -H \"Content-Type: application/json\" \\\n  -H \"X-API-Key: admin\" \\\n  -d '{\n    \"job_name\": \"hello-world\",\n    \"args\": \"--echo Hello World\",\n    \"priority\": 1\n  }'\n\n# Check job status\ncurl http://localhost:9101/api/v1/jobs \\\n  -H \"X-API-Key: admin\"\n
"},{"location":"quick-start/#cli-access","title":"CLI Access","text":"
# Build CLI\ncd cli && zig build dev\n\n# List jobs\n./cli/zig-out/dev/ml --server http://localhost:9101 list-jobs\n\n# Submit new job\n./cli/zig-out/dev/ml --server http://localhost:9101 submit \\\n  --name \"test-job\" --args \"--epochs 10\"\n
"},{"location":"quick-start/#related-documentation","title":"Related Documentation","text":""},{"location":"quick-start/#troubleshooting","title":"Troubleshooting","text":"

Services not starting?

# Check logs\ndocker-compose logs\n\n# Restart services\ndocker-compose down && docker-compose up -d  # testing only\n

API not responding?

# Check health\ncurl http://localhost:9101/health\n\n# Verify ports\ndocker-compose ps\n

Permission denied?

# Check API key\ncurl -H \"X-API-Key: admin\" http://localhost:9101/api/v1/jobs\n

"},{"location":"redis-ha/","title":"Redis High Availability","text":"

Note: This is optional for homelab setups. Single Redis instance is sufficient for most use cases.

"},{"location":"redis-ha/#when-you-need-ha","title":"When You Need HA","text":"

Consider Redis HA if: - Running production workloads - Uptime > 99.9% required - Can't afford to lose queued tasks - Multiple workers across machines

"},{"location":"redis-ha/#redis-sentinel-recommended","title":"Redis Sentinel (Recommended)","text":""},{"location":"redis-ha/#setup","title":"Setup","text":"
# docker-compose.yml\nversion: '3.8'\nservices:\n  redis-master:\n    image: redis:7-alpine\n    command: redis-server --maxmemory 2gb\n\n  redis-replica:\n    image: redis:7-alpine\n    command: redis-server --replicaof redis-master 6379\n\n  redis-sentinel-1:\n    image: redis:7-alpine\n    command: redis-sentinel /etc/redis/sentinel.conf\n    volumes:\n      - ./sentinel.conf:/etc/redis/sentinel.conf\n

sentinel.conf:

sentinel monitor mymaster redis-master 6379 2\nsentinel down-after-milliseconds mymaster 5000\nsentinel parallel-syncs mymaster 1\nsentinel failover-timeout mymaster 10000\n

"},{"location":"redis-ha/#application-configuration","title":"Application Configuration","text":"
# worker-config.yaml\nredis_addr: \"redis-sentinel-1:26379,redis-sentinel-2:26379\"\nredis_master_name: \"mymaster\"\n
"},{"location":"redis-ha/#redis-cluster-advanced","title":"Redis Cluster (Advanced)","text":"

For larger deployments with sharding needs.

# Minimum 3 masters + 3 replicas\nservices:\n  redis-1:\n    image: redis:7-alpine\n    command: redis-server --cluster-enabled yes\n\n  redis-2:\n    # ... similar config\n
"},{"location":"redis-ha/#homelab-alternative-persistence-only","title":"Homelab Alternative: Persistence Only","text":"

For most homelabs, just enable persistence:

# docker-compose.yml\nservices:\n  redis:\n    image: redis:7-alpine\n    command: redis-server --appendonly yes\n    volumes:\n      - redis_data:/data\n\nvolumes:\n  redis_data:\n

This ensures tasks survive Redis restarts without full HA complexity.

Recommendation: Start simple. Add HA only if you experience actual downtime issues.

"},{"location":"release-checklist/","title":"Release Checklist","text":"

This checklist captures the work required before cutting a release that includes the graceful worker shutdown feature.

"},{"location":"release-checklist/#1-code-hygiene-compilation","title":"1. Code Hygiene / Compilation","text":"
  1. Merge the graceful-shutdown helpers into the canonical worker type to avoid Worker redeclared errors (see cmd/worker/worker_graceful_shutdown.go and cmd/worker/worker_server.go).
  2. Ensure the worker struct exposes the fields referenced by the new helpers (logger, queue, cfg, metrics).
  3. go build ./cmd/worker succeeds without undefined-field errors.
"},{"location":"release-checklist/#2-graceful-shutdown-logic","title":"2. Graceful Shutdown Logic","text":"
  1. Initialize shutdownCh, activeTasks, and gracefulWait during worker start-up.
  2. Confirm the heartbeat/lease helpers compile and handle queue errors gracefully (heartbeatLoop, releaseAllLeases).
  3. Add tests (unit or integration) that simulate SIGINT/SIGTERM and verify leases are released or tasks complete.
"},{"location":"release-checklist/#3-task-execution-flow","title":"3. Task Execution Flow","text":"
  1. Align executeTaskWithLease with the real executeTask signature so the \"no value used as value\" compile error disappears.
  2. Double-check retry/metrics paths still match existing worker behavior after the new wrapper is added.
"},{"location":"release-checklist/#4-server-wiring","title":"4. Server Wiring","text":"
  1. Ensure worker construction in cmd/worker/worker_server.go wires up config, queue, metrics, and logger instances used by the shutdown logic.
  2. Re-run worker unit tests plus any queue/lease e2e tests.
"},{"location":"release-checklist/#5-validation-before-tagging","title":"5. Validation Before Tagging","text":"
  1. go test ./cmd/worker/... and make test (or equivalent) pass locally.
  2. Manual smoke test: start worker, queue jobs, send SIGTERM, confirm tasks finish or leases are released and the process exits cleanly.
  3. Update release notes describing the new shutdown capability and any config changes required (e.g., graceful timeout settings).
"},{"location":"security/","title":"Security Guide","text":"

This document outlines security features, best practices, and hardening procedures for FetchML.

"},{"location":"security/#security-features","title":"Security Features","text":""},{"location":"security/#authentication-authorization","title":"Authentication & Authorization","text":""},{"location":"security/#communication-security","title":"Communication Security","text":""},{"location":"security/#data-privacy","title":"Data Privacy","text":""},{"location":"security/#network-security","title":"Network Security","text":""},{"location":"security/#security-checklist","title":"Security Checklist","text":""},{"location":"security/#initial-setup","title":"Initial Setup","text":"
  1. Generate Strong Passwords

    # Grafana admin password\nopenssl rand -base64 32 > .grafana-password\n\n# Redis password\nopenssl rand -base64 32\n

  2. Configure Environment Variables

    cp .env.example .env\n# Edit .env and set:\n# - GRAFANA_ADMIN_PASSWORD\n

  3. Enable TLS (Production only)

    # configs/config-prod.yaml\nserver:\n  tls:\n    enabled: true\n    cert_file: \"/secrets/cert.pem\"\n    key_file: \"/secrets/key.pem\"\n

  4. Configure Firewall

    # Allow only necessary ports\nsudo ufw allow 22/tcp    # SSH\nsudo ufw allow 443/tcp   # HTTPS\nsudo ufw allow 80/tcp    # HTTP (redirect to HTTPS)\nsudo ufw enable\n

"},{"location":"security/#production-hardening","title":"Production Hardening","text":"
  1. Restrict IP Access

    # configs/config-prod.yaml\nauth:\n  ip_whitelist:\n    - \"10.0.0.0/8\"\n    - \"192.168.0.0/16\"\n    - \"127.0.0.1\"\n

  2. Enable Audit Logging

    logging:\n  level: \"info\"\n  audit: true\n  file: \"/var/log/fetch_ml/audit.log\"\n

  3. Harden Redis

    # Redis security\nredis-cli CONFIG SET requirepass \"your-strong-password\"\nredis-cli CONFIG SET rename-command FLUSHDB \"\"\nredis-cli CONFIG SET rename-command FLUSHALL \"\"\n

  4. Secure Grafana

    # Change default admin password\ndocker-compose exec grafana grafana-cli admin reset-admin-password new-strong-password\n

  5. Regular Updates

# Update system packages\nsudo apt update && sudo apt upgrade -y\n\n# Update containers\ndocker-compose pull\ndocker-compose up -d  # testing only\n

"},{"location":"security/#password-management","title":"Password Management","text":""},{"location":"security/#generate-secure-passwords","title":"Generate Secure Passwords","text":"
# Method 1: OpenSSL\nopenssl rand -base64 32\n\n# Method 2: pwgen (if installed)\npwgen -s 32 1\n\n# Method 3: /dev/urandom\nhead -c 32 /dev/urandom | base64\n
"},{"location":"security/#store-passwords-securely","title":"Store Passwords Securely","text":"

Development: Use .env file (gitignored)

echo \"REDIS_PASSWORD=$(openssl rand -base64 32)\" >> .env\necho \"GRAFANA_ADMIN_PASSWORD=$(openssl rand -base64 32)\" >> .env\n

Production: Use systemd environment files

sudo mkdir -p /etc/fetch_ml/secrets\nsudo chmod 700 /etc/fetch_ml/secrets\necho \"REDIS_PASSWORD=...\" | sudo tee /etc/fetch_ml/secrets/redis.env\nsudo chmod 600 /etc/fetch_ml/secrets/redis.env\n

"},{"location":"security/#api-key-management","title":"API Key Management","text":""},{"location":"security/#generate-api-keys","title":"Generate API Keys","text":"
# Generate random API key\nopenssl rand -hex 32\n\n# Hash for storage\necho -n \"your-api-key\" | sha256sum\n
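The same digest as the sha256sum pipeline above can be produced in Go with the standard library; a minimal sketch (hashAPIKey is a hypothetical name, and the server's actual storage format may differ):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashAPIKey produces the same hex digest as `echo -n key | sha256sum`,
// suitable for storing in config instead of the raw key.
func hashAPIKey(key string) string {
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(hashAPIKey("your-api-key")) // 64 hex characters
}
```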
"},{"location":"security/#rotate-api-keys","title":"Rotate API Keys","text":"
  1. Generate new API key
  2. Update config-local.yaml with new hash
  3. Distribute new key to users
  4. Remove old key after grace period
"},{"location":"security/#revoke-api-keys","title":"Revoke API Keys","text":"

Remove user entry from config-local.yaml:

auth:\n  apikeys:\n    # user_to_revoke:  # Comment out or delete\n

"},{"location":"security/#network-security_1","title":"Network Security","text":""},{"location":"security/#production-network-topology","title":"Production Network Topology","text":"
Internet\n    \u2193\n[Firewall] (ports 3000, 9102)\n    \u2193\n[Reverse Proxy] (nginx/Apache) - TLS termination\n    \u2193\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502   Application Pod   \u2502\n\u2502                     \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510   \u2502\n\u2502  \u2502 API Server   \u2502   \u2502  \u2190 Public (via reverse proxy)\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518   \u2502\n\u2502                     \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510   \u2502\n\u2502  \u2502   Redis      \u2502   \u2502  \u2190 Internal only\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518   \u2502\n\u2502                     \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510   \u2502\n\u2502  \u2502   Grafana    \u2502   \u2502  \u2190 Public (via reverse proxy)\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518   \u2502\n\u2502                     \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510   \u2502\n\u2502  \u2502 Prometheus   \u2502   \u2502  \u2190 Internal only\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518   \u2502\n\u2502                     \u2502\n\u2502  \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510   \u2502\n\u2502  \u2502    Loki      \u2502   \u2502  \u2190 Internal only\n\u2502  \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518   
\u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n
"},{"location":"security/#recommended-firewall-rules","title":"Recommended Firewall Rules","text":"
# Allow only necessary inbound connections\nsudo firewall-cmd --permanent --zone=public --add-rich-rule='\n  rule family=\"ipv4\"\n  source address=\"YOUR_NETWORK\"\n  port port=\"3000\" protocol=\"tcp\" accept'\n\nsudo firewall-cmd --permanent --zone=public --add-rich-rule='\n  rule family=\"ipv4\"\n  source address=\"YOUR_NETWORK\"\n  port port=\"9102\" protocol=\"tcp\" accept'\n\n# Block all other traffic\nsudo firewall-cmd --permanent --set-default-zone=drop\nsudo firewall-cmd --reload\n
"},{"location":"security/#incident-response","title":"Incident Response","text":""},{"location":"security/#suspected-breach","title":"Suspected Breach","text":"
  1. Immediate Actions: rotate all API keys, stop affected services, review audit logs

  2. Investigation

    # Check recent logins\nsudo journalctl -u fetchml-api --since \"1 hour ago\"\n\n# Review failed auth attempts\ngrep \"authentication failed\" /var/log/fetch_ml/*.log\n\n# Check active connections\nss -tnp | grep :9102\n

  3. Recovery: rotate all passwords and API keys, update firewall rules, patch vulnerabilities, resume services
"},{"location":"security/#security-monitoring","title":"Security Monitoring","text":"
# Monitor failed authentication\ntail -f /var/log/fetch_ml/api.log | grep \"auth.*failed\"\n\n# Monitor unusual activity\njournalctl -u fetchml-api -f | grep -E \"(ERROR|WARN)\"\n\n# Check open ports\nnmap -p- localhost\n
"},{"location":"security/#security-best-practices","title":"Security Best Practices","text":"
  1. Principle of Least Privilege: Grant minimum necessary permissions
  2. Defense in Depth: Multiple security layers (firewall + auth + TLS)
  3. Regular Updates: Keep all components patched
  4. Audit Regularly: Review logs and access patterns
  5. Secure Secrets: Never commit passwords/keys to git
  6. Network Segmentation: Isolate services with internal networks
  7. Monitor Everything: Enable comprehensive logging and alerting
  8. Test Security: Regular penetration testing and vulnerability scans
"},{"location":"security/#compliance","title":"Compliance","text":""},{"location":"security/#data-privacy_1","title":"Data Privacy","text":""},{"location":"security/#audit-trail","title":"Audit Trail","text":"

All API access is logged with: - Timestamp - User/API key - Action performed - Source IP - Result (success/failure)

"},{"location":"security/#getting-help","title":"Getting Help","text":""},{"location":"smart-defaults/","title":"Smart Defaults","text":"

This document describes Fetch ML's smart defaults system, which automatically adapts configuration based on the runtime environment.

"},{"location":"smart-defaults/#overview","title":"Overview","text":"

Smart defaults eliminate the need for manual configuration tweaks when running in different environments:

"},{"location":"smart-defaults/#environment-detection","title":"Environment Detection","text":"

The system automatically detects the environment based on:

  1. CI Detection: Checks for CI, GITHUB_ACTIONS, GITLAB_CI environment variables
  2. Container Detection: Looks for /.dockerenv, KUBERNETES_SERVICE_HOST, or CONTAINER variables
  3. Production Detection: Checks FETCH_ML_ENV=production or ENV=production
  4. Default: Falls back to local development
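The detection order above can be sketched as follows (a minimal Go illustration; the real implementation lives in internal/config/smart_defaults.go and its names may differ):

```go
package main

import (
	"fmt"
	"os"
)

type Profile string

const (
	ProfileCI         Profile = "ci"
	ProfileContainer  Profile = "container"
	ProfileProduction Profile = "production"
	ProfileLocal      Profile = "local"
)

// detectProfile applies the documented order: CI variables first,
// then container markers, then production flags, falling back to
// local development.
func detectProfile(getenv func(string) string, dockerEnvExists bool) Profile {
	for _, v := range []string{"CI", "GITHUB_ACTIONS", "GITLAB_CI"} {
		if getenv(v) != "" {
			return ProfileCI
		}
	}
	if dockerEnvExists || getenv("KUBERNETES_SERVICE_HOST") != "" || getenv("CONTAINER") != "" {
		return ProfileContainer
	}
	if getenv("FETCH_ML_ENV") == "production" || getenv("ENV") == "production" {
		return ProfileProduction
	}
	return ProfileLocal
}

func main() {
	_, err := os.Stat("/.dockerenv") // container marker file
	fmt.Println(detectProfile(os.Getenv, err == nil))
}
```

Note that the order matters: in a CI job that runs inside a container, the CI profile wins because it is checked first.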
"},{"location":"smart-defaults/#default-values-by-environment","title":"Default Values by Environment","text":""},{"location":"smart-defaults/#host-configuration","title":"Host Configuration","text":""},{"location":"smart-defaults/#base-paths","title":"Base Paths","text":""},{"location":"smart-defaults/#data-directory","title":"Data Directory","text":""},{"location":"smart-defaults/#redis-address","title":"Redis Address","text":""},{"location":"smart-defaults/#ssh-configuration","title":"SSH Configuration","text":""},{"location":"smart-defaults/#worker-configuration","title":"Worker Configuration","text":""},{"location":"smart-defaults/#log-levels","title":"Log Levels","text":""},{"location":"smart-defaults/#usage","title":"Usage","text":""},{"location":"smart-defaults/#in-configuration-loaders","title":"In Configuration Loaders","text":"
// Get smart defaults for current environment\nsmart := config.GetSmartDefaults()\n\n// Use smart defaults\nif cfg.Host == \"\" {\n    cfg.Host = smart.Host()\n}\nif cfg.BasePath == \"\" {\n    cfg.BasePath = smart.BasePath()\n}\n
"},{"location":"smart-defaults/#environment-overrides","title":"Environment Overrides","text":"

Smart defaults can be overridden with environment variables:
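A minimal sketch of that precedence (illustrative names only, not the actual loader API): an explicitly configured value wins, then an environment override, and only then the smart default:

```go
package main

import "fmt"

// resolve sketches the precedence implied above. Names are
// illustrative; the real loaders live in internal/config.
func resolve(configValue, envValue, smartDefault string) string {
	switch {
	case configValue != "":
		return configValue // explicitly set in config file
	case envValue != "":
		return envValue // environment-variable override
	default:
		return smartDefault // environment-aware fallback
	}
}

func main() {
	fmt.Println(resolve("", "", "./data"))                   // smart default applies
	fmt.Println(resolve("", "/mnt/data", "./data"))          // env override wins
	fmt.Println(resolve("/cfg/data", "/mnt/data", "./data")) // config wins over both
}
```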

"},{"location":"smart-defaults/#manual-environment-selection","title":"Manual Environment Selection","text":"

You can force a specific environment:

# Force production mode\nexport FETCH_ML_ENV=production\n\n# Force container mode\nexport CONTAINER=true\n
"},{"location":"smart-defaults/#implementation-details","title":"Implementation Details","text":"

The smart defaults system is implemented in internal/config/smart_defaults.go:

"},{"location":"smart-defaults/#migration-guide","title":"Migration Guide","text":""},{"location":"smart-defaults/#for-users","title":"For Users","text":"

No changes required - existing configurations continue to work. Smart defaults only apply when values are not explicitly set.

"},{"location":"smart-defaults/#for-developers","title":"For Developers","text":"

When adding new configuration options:

  1. Add a method to SmartDefaults struct
  2. Use the smart default in config loaders
  3. Document the environment-specific values

Example:

// Add to SmartDefaults struct\nfunc (s *SmartDefaults) NewFeature() string {\n    switch s.Profile {\n    case ProfileContainer, ProfileCI:\n        return \"/workspace/new-feature\"\n    case ProfileProduction:\n        return \"/var/lib/fetch_ml/new-feature\"\n    default:\n        return \"./new-feature\"\n    }\n}\n\n// Use in config loader\nif cfg.NewFeature == \"\" {\n    cfg.NewFeature = smart.NewFeature()\n}\n
"},{"location":"smart-defaults/#testing","title":"Testing","text":"

To test different environments:

# Test local defaults (default)\n./bin/worker\n\n# Test container defaults (per-command env avoids leaking into later tests)\nCONTAINER=true ./bin/worker\n\n# Test CI defaults\nCI=true ./bin/worker\n\n# Test production defaults\nFETCH_ML_ENV=production ./bin/worker\n
"},{"location":"smart-defaults/#troubleshooting","title":"Troubleshooting","text":""},{"location":"smart-defaults/#wrong-environment-detection","title":"Wrong Environment Detection","text":"

Check environment variables:

echo \"CI: $CI\"\necho \"CONTAINER: $CONTAINER\"\necho \"FETCH_ML_ENV: $FETCH_ML_ENV\"\n
"},{"location":"smart-defaults/#path-issues","title":"Path Issues","text":"

Smart defaults expand ~ and environment variables automatically. If paths don't work as expected:

  1. Check the detected environment: config.GetSmartDefaults().GetEnvironmentDescription()
  2. Verify the path exists in the target environment
  3. Override with environment variable if needed
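For reference, the expansion behavior can be approximated like this (a hedged sketch, not the actual implementation):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// expandPath sketches the expansion described above: a leading "~"
// becomes the home directory, and $VAR references are substituted
// from the environment.
func expandPath(p string, home string, getenv func(string) string) string {
	if p == "~" || strings.HasPrefix(p, "~/") {
		p = filepath.Join(home, strings.TrimPrefix(p, "~"))
	}
	return os.Expand(p, getenv)
}

func main() {
	get := func(k string) string { return map[string]string{"WORKSPACE": "/workspace"}[k] }
	fmt.Println(expandPath("~/data", "/home/mluser", get))        // home-relative
	fmt.Println(expandPath("$WORKSPACE/data", "/home/mluser", get)) // env-relative
}
```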
"},{"location":"smart-defaults/#container-networking","title":"Container Networking","text":"

For container environments, ensure: - Redis service is named redis in docker-compose - Host networking is configured properly - host.docker.internal resolves (Docker Desktop/Colima)

"},{"location":"testing/","title":"Testing Guide","text":"

How to run and write tests for FetchML.

"},{"location":"testing/#running-tests","title":"Running Tests","text":""},{"location":"testing/#quick-test","title":"Quick Test","text":"
# All tests\nmake test\n\n# Unit tests only\nmake test-unit\n\n# Integration tests\nmake test-integration\n\n# With coverage\nmake test-coverage\n
"},{"location":"testing/#docker-testing","title":"Docker Testing","text":"
docker-compose up -d  # testing only\nmake test\ndocker-compose down\n
"},{"location":"testing/#cli-testing","title":"CLI Testing","text":"
cd cli && zig build dev\n./zig-out/dev/ml-dev --help\nzig build test\n
"},{"location":"troubleshooting/","title":"Troubleshooting","text":"

Common issues and solutions for Fetch ML.

"},{"location":"troubleshooting/#quick-fixes","title":"Quick Fixes","text":""},{"location":"troubleshooting/#services-not-starting","title":"Services Not Starting","text":"
# Check Docker status\ndocker-compose ps\n\n# Restart services (testing stack only)\ndocker-compose down && docker-compose up -d\n\n# Check logs\ndocker-compose logs -f\n
"},{"location":"troubleshooting/#api-not-responding","title":"API Not Responding","text":"
# Check health endpoint\ncurl http://localhost:9101/health\n\n# Check if port is in use\nlsof -i :9101\n\n# Kill process on port\nkill -9 $(lsof -ti :9101)\n
"},{"location":"troubleshooting/#database-issues","title":"Database Issues","text":"
# Check database connection\ndocker-compose exec postgres psql -U postgres -d fetch_ml\n\n# Reset database (testing stack only)\ndocker-compose stop postgres\ndocker-compose rm -f postgres\ndocker-compose up -d postgres\n\n# Check Redis\ndocker-compose exec redis redis-cli ping\n
"},{"location":"troubleshooting/#common-errors","title":"Common Errors","text":""},{"location":"troubleshooting/#authentication-errors","title":"Authentication Errors","text":""},{"location":"troubleshooting/#database-errors","title":"Database Errors","text":""},{"location":"troubleshooting/#container-errors","title":"Container Errors","text":""},{"location":"troubleshooting/#performance-issues","title":"Performance Issues","text":""},{"location":"troubleshooting/#development-issues","title":"Development Issues","text":""},{"location":"troubleshooting/#cli-issues","title":"CLI Issues","text":""},{"location":"troubleshooting/#network-issues","title":"Network Issues","text":""},{"location":"troubleshooting/#configuration-issues","title":"Configuration Issues","text":""},{"location":"troubleshooting/#debug-information","title":"Debug Information","text":"
./bin/api-server --version\ndocker-compose ps\ndocker-compose logs api-server | grep ERROR\n
"},{"location":"troubleshooting/#emergency-reset","title":"Emergency Reset","text":"
docker-compose down -v\nrm -rf data/ results/ *.db\ndocker-compose up -d  # testing only\n
"},{"location":"user-permissions/","title":"User Permissions in Fetch ML","text":"

Fetch ML supports user-based permissions: data scientists can view and manage only their own experiments, while administrators retain full control.

"},{"location":"user-permissions/#overview","title":"Overview","text":""},{"location":"user-permissions/#permissions","title":"Permissions","text":""},{"location":"user-permissions/#job-permissions","title":"Job Permissions","text":""},{"location":"user-permissions/#user-types","title":"User Types","text":""},{"location":"user-permissions/#cli-usage","title":"CLI Usage","text":""},{"location":"user-permissions/#view-your-jobs","title":"View Your Jobs","text":"

ml status\n
Shows only your experiments with user context displayed.

"},{"location":"user-permissions/#cancel-your-jobs","title":"Cancel Your Jobs","text":"

ml cancel <job-name>\n
Only allows canceling your own experiments (unless you're an admin).

"},{"location":"user-permissions/#authentication","title":"Authentication","text":"

The CLI automatically authenticates using your API key from ~/.ml/config.toml.

"},{"location":"user-permissions/#configuration","title":"Configuration","text":""},{"location":"user-permissions/#api-key-setup","title":"API Key Setup","text":"
[worker]\napi_key = \"your-api-key-here\"\n
"},{"location":"user-permissions/#user-roles","title":"User Roles","text":"

User roles and permissions are configured on the server side by administrators.

"},{"location":"user-permissions/#security-features","title":"Security Features","text":""},{"location":"user-permissions/#examples","title":"Examples","text":""},{"location":"user-permissions/#data-scientist-workflow","title":"Data Scientist Workflow","text":"
# Submit your experiment\nml run my-experiment\n\n# Check your experiments (only shows yours)\nml status\n\n# Cancel your own experiment\nml cancel my-experiment\n
"},{"location":"user-permissions/#administrator-workflow","title":"Administrator Workflow","text":"
# View all experiments (admin sees everything)\nml status\n\n# Cancel any user's experiment\nml cancel user-experiment\n
"},{"location":"user-permissions/#error-messages","title":"Error Messages","text":""},{"location":"user-permissions/#migration-notes","title":"Migration Notes","text":"

For more details, see the architecture documentation.

"},{"location":"zig-cli/","title":"Zig CLI Guide","text":"

High-performance command-line interface for ML experiment management, written in Zig for maximum speed and efficiency.

"},{"location":"zig-cli/#overview","title":"Overview","text":"

The Zig CLI (ml) is the primary interface for managing ML experiments in your homelab. Built with Zig, it provides exceptional performance for file operations, network communication, and experiment management.

"},{"location":"zig-cli/#installation","title":"Installation","text":""},{"location":"zig-cli/#pre-built-binaries-recommended","title":"Pre-built Binaries (Recommended)","text":"

Download from GitHub Releases:

# Download for your platform\ncurl -LO https://github.com/jfraeys/fetch_ml/releases/latest/download/ml-<platform>.tar.gz\n\n# Extract\ntar -xzf ml-<platform>.tar.gz\n\n# Install\nchmod +x ml-<platform>\nsudo mv ml-<platform> /usr/local/bin/ml\n\n# Verify\nml --help\n

Platforms: - ml-linux-x86_64.tar.gz - Linux (fully static, zero dependencies) - ml-macos-x86_64.tar.gz - macOS Intel - ml-macos-arm64.tar.gz - macOS Apple Silicon

All release binaries include embedded static rsync for complete independence.

"},{"location":"zig-cli/#build-from-source","title":"Build from Source","text":"

Development Build (uses system rsync):

cd cli\nzig build dev\n./zig-out/dev/ml-dev --help\n

Production Build (embedded rsync):

cd cli\n# For testing: uses rsync wrapper\nzig build prod\n\n# For release with static rsync:\n# 1. Place static rsync binary at src/assets/rsync_release.bin\n# 2. Build\nzig build prod\nstrip zig-out/prod/ml  # Optional: reduce size\n\n# Verify\n./zig-out/prod/ml --help\nls -lh zig-out/prod/ml\n

See cli/src/assets/README.md for details on obtaining static rsync binaries.

"},{"location":"zig-cli/#verify-installation","title":"Verify Installation","text":"
ml --help\nml --version  # Shows build config\n
"},{"location":"zig-cli/#quick-start","title":"Quick Start","text":"
  1. Initialize Configuration

    ml init\n

  2. Sync Your First Project

    ml sync ./my-project --queue\n

  3. Monitor Progress

    ml status\n

"},{"location":"zig-cli/#command-reference","title":"Command Reference","text":""},{"location":"zig-cli/#init-configuration-setup","title":"init - Configuration Setup","text":"

Initialize the CLI configuration file.

ml init\n

Creates: ~/.ml/config.toml

Configuration Template:

worker_host = \"worker.local\"\nworker_user = \"mluser\"\nworker_base = \"/data/ml-experiments\"\nworker_port = 22\napi_key = \"your-api-key\"\n

"},{"location":"zig-cli/#sync-project-synchronization","title":"sync - Project Synchronization","text":"

Sync project files to the worker with intelligent deduplication.

# Basic sync\nml sync ./project\n\n# Sync with custom name and auto-queue\nml sync ./project --name \"experiment-1\" --queue\n\n# Sync with priority\nml sync ./project --priority 8\n

Options: - --name <name>: Custom experiment name - --queue: Automatically queue after sync - --priority N: Set priority (1-10, default 5)

Features: - Content-Addressed Storage: Automatic deduplication - SHA256 Commit IDs: Reliable change detection - Incremental Transfer: Only sync changed files - Rsync Backend: Efficient file transfer
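The commit-ID idea can be illustrated as follows. The CLI itself is written in Zig; this Go sketch only shows how hashing paths and contents in a stable order yields a deterministic, change-sensitive ID:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// commitID hashes each file's path and contents in sorted order, so
// identical trees always produce the same ID and any change to any
// file produces a different one. Illustrative only.
func commitID(files map[string][]byte) string {
	paths := make([]string, 0, len(files))
	for p := range files {
		paths = append(paths, p)
	}
	sort.Strings(paths) // stable order => deterministic hash
	h := sha256.New()
	for _, p := range paths {
		fmt.Fprintf(h, "%s\x00", p) // NUL-separate path from content
		h.Write(files[p])
		h.Write([]byte{0})
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	tree := map[string][]byte{
		"train.py":    []byte("print('hi')"),
		"config.yaml": []byte("lr: 0.1"),
	}
	fmt.Println(commitID(tree)[:12]) // short ID, like the sync output
}
```

Determinism is what makes deduplication and incremental transfer possible: an unchanged tree hashes to the same ID, so nothing needs to be re-sent.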

"},{"location":"zig-cli/#queue-job-management","title":"queue - Job Management","text":"

Queue experiments for execution on the worker.

# Queue with commit ID\nml queue my-job --commit abc123def456\n\n# Queue with priority\nml queue my-job --commit abc123 --priority 8\n

Options: - --commit <id>: Commit ID from sync output - --priority N: Execution priority (1-10)

Features: - WebSocket Communication: Real-time job submission - Priority Queuing: Higher priority jobs run first - API Authentication: Secure job submission

"},{"location":"zig-cli/#watch-auto-sync-monitoring","title":"watch - Auto-Sync Monitoring","text":"

Monitor directories for changes and auto-sync.

# Watch for changes\nml watch ./project\n\n# Watch and auto-queue on changes\nml watch ./project --name \"dev-exp\" --queue\n

Options: - --name <name>: Custom experiment name - --queue: Auto-queue on changes - --priority N: Set priority for queued jobs

Features: - Real-time Monitoring: 2-second polling interval - Change Detection: File modification time tracking - Commit Comparison: Only sync when content changes - Automatic Queuing: Seamless development workflow

"},{"location":"zig-cli/#status-system-status","title":"status - System Status","text":"

Check system and worker status.

ml status\n

Displays: - Worker connectivity - Queue status - Running jobs - System health

"},{"location":"zig-cli/#monitor-remote-monitoring","title":"monitor - Remote Monitoring","text":"

Launch TUI interface via SSH for real-time monitoring.

ml monitor\n

Features: - Real-time Updates: Live experiment status - Interactive Interface: Browse and manage experiments - SSH Integration: Secure remote access

"},{"location":"zig-cli/#cancel-job-cancellation","title":"cancel - Job Cancellation","text":"

Cancel running or queued jobs.

ml cancel <job-id>\n

Options: - job-id: Job identifier from status output

"},{"location":"zig-cli/#prune-cleanup-management","title":"prune - Cleanup Management","text":"

Clean up old experiments to save space.

# Keep last N experiments\nml prune --keep 20\n\n# Remove experiments older than N days\nml prune --older-than 30\n

Options: - --keep N: Keep N most recent experiments - --older-than N: Remove experiments older than N days

"},{"location":"zig-cli/#architecture","title":"Architecture","text":"

Testing: Docker Compose (macOS/Linux) Production: Podman + systemd (Linux)

Important: Docker is for testing only. Podman is used for running actual ML experiments in production.

"},{"location":"zig-cli/#core-components","title":"Core Components","text":"
cli/src/\n\u251c\u2500\u2500 commands/        # Command implementations\n\u2502   \u251c\u2500\u2500 init.zig     # Configuration setup\n\u2502   \u251c\u2500\u2500 sync.zig     # Project synchronization\n\u2502   \u251c\u2500\u2500 queue.zig    # Job management\n\u2502   \u251c\u2500\u2500 watch.zig    # Auto-sync monitoring\n\u2502   \u251c\u2500\u2500 status.zig   # System status\n\u2502   \u251c\u2500\u2500 monitor.zig  # Remote monitoring\n\u2502   \u251c\u2500\u2500 cancel.zig   # Job cancellation\n\u2502   \u2514\u2500\u2500 prune.zig    # Cleanup operations\n\u251c\u2500\u2500 config.zig       # Configuration management\n\u251c\u2500\u2500 errors.zig       # Error handling\n\u251c\u2500\u2500 net/            # Network utilities\n\u2502   \u2514\u2500\u2500 ws.zig       # WebSocket client\n\u2514\u2500\u2500 utils/          # Utility functions\n    \u251c\u2500\u2500 crypto.zig   # Hashing and encryption\n    \u251c\u2500\u2500 storage.zig  # Content-addressed storage\n    \u2514\u2500\u2500 rsync.zig    # File synchronization\n
"},{"location":"zig-cli/#performance-features","title":"Performance Features","text":""},{"location":"zig-cli/#content-addressed-storage","title":"Content-Addressed Storage","text":""},{"location":"zig-cli/#sha256-commit-ids","title":"SHA256 Commit IDs","text":""},{"location":"zig-cli/#websocket-protocol","title":"WebSocket Protocol","text":""},{"location":"zig-cli/#memory-management","title":"Memory Management","text":""},{"location":"zig-cli/#security-features","title":"Security Features","text":""},{"location":"zig-cli/#authentication","title":"Authentication","text":""},{"location":"zig-cli/#secure-communication","title":"Secure Communication","text":""},{"location":"zig-cli/#error-handling","title":"Error Handling","text":""},{"location":"zig-cli/#advanced-usage","title":"Advanced Usage","text":""},{"location":"zig-cli/#workflow-integration","title":"Workflow Integration","text":""},{"location":"zig-cli/#development-workflow","title":"Development Workflow","text":"
# 1. Initialize project\nml sync ./project --name \"dev\" --queue\n\n# 2. Auto-sync during development\nml watch ./project --name \"dev\" --queue\n\n# 3. Monitor progress\nml status\n
"},{"location":"zig-cli/#batch-processing","title":"Batch Processing","text":"
# Process multiple experiments\nfor dir in experiments/*/; do\n    ml sync \"$dir\" --queue\ndone\n
"},{"location":"zig-cli/#priority-management","title":"Priority Management","text":"
# High priority experiment\nml sync ./urgent --priority 10 --queue\n\n# Background processing\nml sync ./background --priority 1 --queue\n
"},{"location":"zig-cli/#configuration-management","title":"Configuration Management","text":""},{"location":"zig-cli/#multiple-workers","title":"Multiple Workers","text":"
# ~/.ml/config.toml\nworker_host = \"worker.local\"\nworker_user = \"mluser\"\nworker_base = \"/data/ml-experiments\"\nworker_port = 22\napi_key = \"your-api-key\"\n
"},{"location":"zig-cli/#security-settings","title":"Security Settings","text":"
# Set restrictive permissions\nchmod 600 ~/.ml/config.toml\n\n# Verify configuration\nml status\n
"},{"location":"zig-cli/#troubleshooting","title":"Troubleshooting","text":""},{"location":"zig-cli/#common-issues","title":"Common Issues","text":""},{"location":"zig-cli/#build-problems","title":"Build Problems","text":"
# Check Zig installation\nzig version\n\n# Clean build\ncd cli && make clean && make build\n
"},{"location":"zig-cli/#connection-issues","title":"Connection Issues","text":"
# Test SSH connectivity (substitute values from ~/.ml/config.toml)\nssh -p <worker_port> <worker_user>@<worker_host>\n\n# Verify configuration\ncat ~/.ml/config.toml\n
"},{"location":"zig-cli/#sync-failures","title":"Sync Failures","text":"
# Check rsync\nrsync --version\n\n# Manual sync test (substitute values from ~/.ml/config.toml)\nrsync -avz ./test/ <worker_user>@<worker_host>:/tmp/\n
"},{"location":"zig-cli/#performance-issues","title":"Performance Issues","text":"
# Monitor resource usage\ntop -p $(pgrep ml)\n\n# Check disk space on the worker (worker_base from ~/.ml/config.toml)\ndf -h <worker_base>\n
"},{"location":"zig-cli/#debug-mode","title":"Debug Mode","text":"

Enable verbose logging:

# Environment variable\nexport ML_DEBUG=1\nml sync ./project\n\n# Or use debug build\ncd cli && make debug\n

"},{"location":"zig-cli/#performance-benchmarks","title":"Performance Benchmarks","text":""},{"location":"zig-cli/#file-operations","title":"File Operations","text":""},{"location":"zig-cli/#memory-usage","title":"Memory Usage","text":""},{"location":"zig-cli/#network-performance","title":"Network Performance","text":""},{"location":"zig-cli/#contributing","title":"Contributing","text":""},{"location":"zig-cli/#development-setup","title":"Development Setup","text":"
cd cli\nzig build dev\n
"},{"location":"zig-cli/#testing","title":"Testing","text":"
# Run unit and integration tests\ncd cli && zig build test\n
"},{"location":"zig-cli/#code-style","title":"Code Style","text":"

For more information, see the CLI Reference and Architecture pages.

"}]}