{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Fetch ML - Secure Machine Learning Platform","text":"
A secure, containerized platform for running machine learning experiments with role-based access control and comprehensive audit trails.
"},{"location":"#quick-start","title":"Quick Start","text":"New to the project? Start here!
# Clone the repository\ngit clone https://github.com/your-username/fetch_ml.git\ncd fetch_ml\n\n# Quick setup (builds everything, creates test user)\nmake quick-start\n\n# Create your API key\n./bin/user_manager --config configs/config_dev.yaml --cmd generate-key --username your_name --role data_scientist\n\n# Run your first experiment\n./bin/worker --config configs/config_dev.yaml --api-key YOUR_GENERATED_KEY\n"},{"location":"#quick-navigation","title":"Quick Navigation","text":""},{"location":"#getting-started","title":"\ud83d\ude80 Getting Started","text":"# Core commands\nmake help # See all available commands\nmake build # Build all binaries\nmake test-unit # Run tests\n\n# User management\n./bin/user_manager --config configs/config_dev.yaml --cmd generate-key --username new_user --role data_scientist\n./bin/user_manager --config configs/config_dev.yaml --cmd list-users\n\n# Run services\n./bin/worker --config configs/config_dev.yaml --api-key YOUR_KEY\n./bin/tui --config configs/config_dev.yaml\n./bin/data_manager --config configs/config_dev.yaml\n"},{"location":"#need-help","title":"Need Help?","text":"Run make help to see every command and make test-unit to verify your setup. Happy ML experimenting!
"},{"location":"api-key-process/","title":"FetchML API Key Process","text":"This document describes how API keys are issued and how team members should configure the ml CLI to use them.
The goal is to keep access easy for your homelab while treating API keys as sensitive secrets.
"},{"location":"api-key-process/#overview","title":"Overview","text":"ml CLI to authenticate to the FetchML API.There are two supported ways to receive your key:
./scripts/create_bitwarden_fetchml_item.sh <username> <api_key> <api_key_hash>\n This script:
Creates an item named FetchML API \u2013 <username>. It stores:
Username: <username>; password: <api_key> (the actual API key); and api_key_hash: <api_key_hash>. Share that item with the user in Bitwarden (for example, via a shared collection like FetchML).
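For reference, the api_key_hash stored in the item can be recomputed from the key itself. A minimal Go sketch, assuming the hash is a plain hex-encoded SHA256 digest (the hashAPIKey helper name is illustrative, not the project's actual function):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashAPIKey returns the hex-encoded SHA256 digest of an API key.
// Illustrative only: the real service may salt or encode differently.
func hashAPIKey(key string) string {
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(hashAPIKey("example-api-key")) // compare against the stored api_key_hash
}
```

Because only the hash needs to live server-side, the admin can verify a stored hash without ever re-reading the key.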
Open Bitwarden and locate the item:
Name: FetchML API \u2013 <your-name>
Copy the password field (this is your FetchML API key).
Configure the CLI, e.g. in ~/.ml/config.toml:
api_key = \"<paste-from-bitwarden>\"\nworker_host = \"localhost\"\nworker_port = 9100\napi_url = \"ws://localhost:9100/ws\"\n ml status\n If the command works, your key and tunnel/config are correct.
"},{"location":"api-key-process/#2-direct-share-no-password-manager-required","title":"2. Direct share (no password manager required)","text":"For users who do not use Bitwarden, a lightweight alternative is a direct one-to-one share.
"},{"location":"api-key-process/#for-the-admin_1","title":"For the admin","text":"Share only the API key with the user via a direct channel you both trust, such as:
Signal / WhatsApp direct message
Short call/meeting where you read it to them
Ask the user to:
Paste the key into their local config.
Add the key to ~/.ml/config.toml:api_key = \"<your-api-key>\"\nworker_host = \"localhost\"\nworker_port = 9100\napi_url = \"ws://localhost:9100/ws\"\n ml status\n ml queue my-training-job\nml cancel my-training-job\n"},{"location":"api-key-process/#3-security-notes","title":"3. Security notes","text":"The api_key_hash is as sensitive as the API key itself. Do not commit keys or hashes to Git or share them in screenshots or tickets.
Rotation
The admin will revoke the old key, generate a new one, and update Bitwarden or share a new key.
Transport security
The api_url is typically ws://localhost:9100/ws when used through an SSH tunnel to the homelab. Following these steps keeps API access easy for the team while maintaining a reasonable security posture for a personal homelab deployment.
"},{"location":"architecture/","title":"Homelab Architecture","text":"Simple, secure architecture for ML experiments in your homelab.
"},{"location":"architecture/#components-overview","title":"Components Overview","text":"graph TB\n subgraph \"Homelab Stack\"\n CLI[Zig CLI]\n API[HTTPS API]\n REDIS[Redis Cache]\n FS[Local Storage]\n end\n\n CLI --> API\n API --> REDIS\n API --> FS\n"},{"location":"architecture/#core-services","title":"Core Services","text":""},{"location":"architecture/#api-server","title":"API Server","text":"graph LR\n USER[User] --> AUTH[API Key Auth]\n AUTH --> RATE[Rate Limiting]\n RATE --> WHITELIST[IP Whitelist]\n WHITELIST --> API[Secure API]\n API --> AUDIT[Audit Logging]\n"},{"location":"architecture/#security-layers","title":"Security Layers","text":"sequenceDiagram\n participant CLI\n participant API\n participant Redis\n participant Storage\n\n CLI->>API: HTTPS Request\n API->>API: Validate Auth\n API->>Redis: Cache/Queue\n API->>Storage: Experiment Data\n Storage->>API: Results\n API->>CLI: Response\n"},{"location":"architecture/#deployment-options","title":"Deployment Options","text":""},{"location":"architecture/#docker-compose-recommended","title":"Docker Compose (Recommended)","text":"services:\n redis:\n image: redis:7-alpine\n ports: [\"6379:6379\"]\n volumes: [redis_data:/data]\n\n api-server:\n build: .\n ports: [\"9101:9101\"]\n depends_on: [redis]\n"},{"location":"architecture/#local-setup","title":"Local Setup","text":"./setup.sh && ./manage.sh start\n"},{"location":"architecture/#network-architecture","title":"Network Architecture","text":"data/\n\u251c\u2500\u2500 experiments/ # ML experiment results\n\u251c\u2500\u2500 cache/ # Temporary cache files\n\u2514\u2500\u2500 backups/ # Local backups\n\nlogs/\n\u251c\u2500\u2500 app.log # Application logs\n\u251c\u2500\u2500 audit.log # Security events\n\u2514\u2500\u2500 access.log # API access logs\n"},{"location":"architecture/#monitoring-architecture","title":"Monitoring Architecture","text":"Simple, lightweight monitoring: - Health Checks: Service availability - Log Files: Structured logging - Basic 
Metrics: Request counts, error rates - Security Events: Failed auth, rate limits
"},{"location":"architecture/#homelab-benefits","title":"Homelab Benefits","text":"graph TB\n subgraph \"Client Layer\"\n CLI[CLI Tools]\n TUI[Terminal UI]\n API[REST API]\n end\n\n subgraph \"Authentication Layer\"\n Auth[Authentication Service]\n RBAC[Role-Based Access Control]\n Perm[Permission Manager]\n end\n\n subgraph \"Core Services\"\n Worker[ML Worker Service]\n DataMgr[Data Manager Service]\n Queue[Job Queue]\n end\n\n subgraph \"Storage Layer\"\n Redis[(Redis Cache)]\n DB[(SQLite/PostgreSQL)]\n Files[File Storage]\n end\n\n subgraph \"Container Runtime\"\n Podman[Podman/Docker]\n Containers[ML Containers]\n end\n\n CLI --> Auth\n TUI --> Auth\n API --> Auth\n\n Auth --> RBAC\n RBAC --> Perm\n\n Worker --> Queue\n Worker --> DataMgr\n Worker --> Podman\n\n DataMgr --> DB\n DataMgr --> Files\n\n Queue --> Redis\n\n Podman --> Containers\n"},{"location":"architecture/#zig-cli-architecture","title":"Zig CLI Architecture","text":""},{"location":"architecture/#component-structure","title":"Component Structure","text":"graph TB\n subgraph \"Zig CLI Components\"\n Main[main.zig] --> Commands[commands/]\n Commands --> Config[config.zig]\n Commands --> Utils[utils/]\n Commands --> Net[net/]\n Commands --> Errors[errors.zig]\n\n subgraph \"Commands\"\n Init[init.zig]\n Sync[sync.zig]\n Queue[queue.zig]\n Watch[watch.zig]\n Status[status.zig]\n Monitor[monitor.zig]\n Cancel[cancel.zig]\n Prune[prune.zig]\n end\n\n subgraph \"Utils\"\n Crypto[crypto.zig]\n Storage[storage.zig]\n Rsync[rsync.zig]\n end\n\n subgraph \"Network\"\n WS[ws.zig]\n end\n end\n"},{"location":"architecture/#performance-optimizations","title":"Performance Optimizations","text":""},{"location":"architecture/#content-addressed-storage","title":"Content-Addressed Storage","text":"graph LR\n subgraph \"CLI Security\"\n Config[Config File] --> Hash[SHA256 Hashing]\n Hash --> Auth[API Authentication]\n Auth --> SSH[SSH Transfer]\n SSH --> WS[WebSocket Security]\n 
end\n"},{"location":"architecture/#core-components","title":"Core Components","text":""},{"location":"architecture/#1-authentication-authorization","title":"1. Authentication & Authorization","text":"graph LR\n subgraph \"Auth Flow\"\n Client[Client] --> APIKey[API Key]\n APIKey --> Hash[Hash Validation]\n Hash --> Roles[Role Resolution]\n Roles --> Perms[Permission Check]\n Perms --> Access[Grant/Deny Access]\n end\n\n subgraph \"Permission Sources\"\n YAML[YAML Config]\n Inline[Inline Fallback]\n Roles --> YAML\n Roles --> Inline\n end\n Features: - API key-based authentication - Role-based access control (RBAC) - YAML-based permission configuration - Fallback to inline permissions - Admin wildcard permissions
"},{"location":"architecture/#2-worker-service","title":"2. Worker Service","text":"graph TB\n subgraph \"Worker Architecture\"\n API[HTTP API] --> Router[Request Router]\n Router --> Auth[Auth Middleware]\n Auth --> Queue[Job Queue]\n Queue --> Processor[Job Processor]\n Processor --> Runtime[Container Runtime]\n Runtime --> Storage[Result Storage]\n\n subgraph \"Job Lifecycle\"\n Submit[Submit Job] --> Queue\n Queue --> Execute[Execute]\n Execute --> Monitor[Monitor]\n Monitor --> Complete[Complete]\n Complete --> Store[Store Results]\n end\n end\n Responsibilities: - HTTP API for job submission - Job queue management - Container orchestration - Result collection and storage - Metrics and monitoring
"},{"location":"architecture/#3-data-manager-service","title":"3. Data Manager Service","text":"graph TB\n subgraph \"Data Management\"\n API[Data API] --> Storage[Storage Layer]\n Storage --> Metadata[Metadata DB]\n Storage --> Files[File System]\n Storage --> Cache[Redis Cache]\n\n subgraph \"Data Operations\"\n Upload[Upload Data] --> Validate[Validate]\n Validate --> Store[Store]\n Store --> Index[Index]\n Index --> Catalog[Catalog]\n end\n end\n Features: - Data upload and validation - Metadata management - File system abstraction - Caching layer - Data catalog
"},{"location":"architecture/#4-terminal-ui-tui","title":"4. Terminal UI (TUI)","text":"graph TB\n subgraph \"TUI Architecture\"\n UI[UI Components] --> Model[Data Model]\n Model --> Update[Update Loop]\n Update --> Render[Render]\n\n subgraph \"UI Panels\"\n Jobs[Job List]\n Details[Job Details]\n Logs[Log Viewer]\n Status[Status Bar]\n end\n\n UI --> Jobs\n UI --> Details\n UI --> Logs\n UI --> Status\n end\n Components: - Bubble Tea framework - Component-based architecture - Real-time updates - Keyboard navigation - Theme support
"},{"location":"architecture/#data-flow_1","title":"Data Flow","text":""},{"location":"architecture/#job-execution-flow","title":"Job Execution Flow","text":"sequenceDiagram\n participant Client\n participant Auth\n participant Worker\n participant Queue\n participant Container\n participant Storage\n\n Client->>Auth: Submit job with API key\n Auth->>Client: Validate and return job ID\n\n Client->>Worker: Execute job request\n Worker->>Queue: Queue job\n Queue->>Worker: Job ready\n Worker->>Container: Start ML container\n Container->>Worker: Execute experiment\n Worker->>Storage: Store results\n Worker->>Client: Return results\n"},{"location":"architecture/#authentication-flow","title":"Authentication Flow","text":"sequenceDiagram\n participant Client\n participant Auth\n participant PermMgr\n participant Config\n\n Client->>Auth: Request with API key\n Auth->>Auth: Validate key hash\n Auth->>PermMgr: Get user permissions\n PermMgr->>Config: Load YAML permissions\n Config->>PermMgr: Return permissions\n PermMgr->>Auth: Return resolved permissions\n Auth->>Client: Grant/deny access\n"},{"location":"architecture/#security-architecture_1","title":"Security Architecture","text":""},{"location":"architecture/#defense-in-depth","title":"Defense in Depth","text":"graph TB\n subgraph \"Security Layers\"\n Network[Network Security]\n Auth[Authentication]\n AuthZ[Authorization]\n Container[Container Security]\n Data[Data Protection]\n Audit[Audit Logging]\n end\n\n Network --> Auth\n Auth --> AuthZ\n AuthZ --> Container\n Container --> Data\n Data --> Audit\n Security Features: - API key authentication - Role-based permissions - Container isolation - File system sandboxing - Comprehensive audit logs - Input validation and sanitization
"},{"location":"architecture/#container-security","title":"Container Security","text":"graph TB\n subgraph \"Container Isolation\"\n Host[Host System]\n Podman[Podman Runtime]\n Network[Network Isolation]\n FS[File System Isolation]\n User[User Namespaces]\n ML[ML Container]\n\n Host --> Podman\n Podman --> Network\n Podman --> FS\n Podman --> User\n User --> ML\n end\n Isolation Features: - Rootless containers - Network isolation - File system sandboxing - User namespace mapping - Resource limits
"},{"location":"architecture/#configuration-architecture","title":"Configuration Architecture","text":""},{"location":"architecture/#configuration-hierarchy","title":"Configuration Hierarchy","text":"graph TB\n subgraph \"Config Sources\"\n Env[Environment Variables]\n File[Config Files]\n CLI[CLI Flags]\n Defaults[Default Values]\n end\n\n subgraph \"Config Processing\"\n Merge[Config Merger]\n Validate[Schema Validator]\n Apply[Config Applier]\n end\n\n Env --> Merge\n File --> Merge\n CLI --> Merge\n Defaults --> Merge\n\n Merge --> Validate\n Validate --> Apply\n Configuration Priority: 1. CLI flags (highest) 2. Environment variables 3. Configuration files 4. Default values (lowest)
"},{"location":"architecture/#scalability-architecture","title":"Scalability Architecture","text":""},{"location":"architecture/#horizontal-scaling","title":"Horizontal Scaling","text":"graph TB\n subgraph \"Scaled Architecture\"\n LB[Load Balancer]\n W1[Worker 1]\n W2[Worker 2]\n W3[Worker N]\n Redis[Redis Cluster]\n Storage[Shared Storage]\n\n LB --> W1\n LB --> W2\n LB --> W3\n\n W1 --> Redis\n W2 --> Redis\n W3 --> Redis\n\n W1 --> Storage\n W2 --> Storage\n W3 --> Storage\n end\n Scaling Features: - Stateless worker services - Shared job queue (Redis) - Distributed storage - Load balancer ready - Health checks and monitoring
"},{"location":"architecture/#technology-stack","title":"Technology Stack","text":""},{"location":"architecture/#backend-technologies","title":"Backend Technologies","text":"Component Technology Purpose Language Go 1.25+ Core application Web Framework Standard library HTTP server Authentication Custom API key + RBAC Database SQLite/PostgreSQL Metadata storage Cache Redis Job queue & caching Containers Podman/Docker Job isolation UI Framework Bubble Tea Terminal UI"},{"location":"architecture/#dependencies","title":"Dependencies","text":"// Core dependencies\nrequire (\n github.com/charmbracelet/bubbletea v1.3.10 // TUI framework\n github.com/go-redis/redis/v8 v8.11.5 // Redis client\n github.com/google/uuid v1.6.0 // UUID generation\n github.com/mattn/go-sqlite3 v1.14.32 // SQLite driver\n golang.org/x/crypto v0.45.0 // Crypto utilities\n gopkg.in/yaml.v3 v3.0.1 // YAML parsing\n)\n"},{"location":"architecture/#development-architecture","title":"Development Architecture","text":""},{"location":"architecture/#project-structure","title":"Project Structure","text":"fetch_ml/\n\u251c\u2500\u2500 cmd/ # CLI applications\n\u2502 \u251c\u2500\u2500 worker/ # ML worker service\n\u2502 \u251c\u2500\u2500 tui/ # Terminal UI\n\u2502 \u251c\u2500\u2500 data_manager/ # Data management\n\u2502 \u2514\u2500\u2500 user_manager/ # User management\n\u251c\u2500\u2500 internal/ # Internal packages\n\u2502 \u251c\u2500\u2500 auth/ # Authentication system\n\u2502 \u251c\u2500\u2500 config/ # Configuration management\n\u2502 \u251c\u2500\u2500 container/ # Container operations\n\u2502 \u251c\u2500\u2500 database/ # Database operations\n\u2502 \u251c\u2500\u2500 logging/ # Logging utilities\n\u2502 \u251c\u2500\u2500 metrics/ # Metrics collection\n\u2502 \u2514\u2500\u2500 network/ # Network utilities\n\u251c\u2500\u2500 configs/ # Configuration files\n\u251c\u2500\u2500 scripts/ # Setup and utility scripts\n\u251c\u2500\u2500 tests/ # Test suites\n\u2514\u2500\u2500 docs/ # 
Documentation\n"},{"location":"architecture/#package-dependencies","title":"Package Dependencies","text":"graph TB\n subgraph \"Application Layer\"\n Worker[cmd/worker]\n TUI[cmd/tui]\n DataMgr[cmd/data_manager]\n UserMgr[cmd/user_manager]\n end\n\n subgraph \"Service Layer\"\n Auth[internal/auth]\n Config[internal/config]\n Container[internal/container]\n Database[internal/database]\n end\n\n subgraph \"Utility Layer\"\n Logging[internal/logging]\n Metrics[internal/metrics]\n Network[internal/network]\n end\n\n Worker --> Auth\n Worker --> Config\n Worker --> Container\n TUI --> Auth\n DataMgr --> Database\n UserMgr --> Auth\n\n Auth --> Logging\n Container --> Network\n Database --> Metrics\n"},{"location":"architecture/#monitoring-observability","title":"Monitoring & Observability","text":""},{"location":"architecture/#metrics-collection","title":"Metrics Collection","text":"graph TB\n subgraph \"Metrics Pipeline\"\n App[Application] --> Metrics[Metrics Collector]\n Metrics --> Export[Prometheus Exporter]\n Export --> Prometheus[Prometheus Server]\n Prometheus --> Grafana[Grafana Dashboard]\n\n subgraph \"Metric Types\"\n Counter[Counters]\n Gauge[Gauges]\n Histogram[Histograms]\n Timer[Timers]\n end\n\n App --> Counter\n App --> Gauge\n App --> Histogram\n App --> Timer\n end\n"},{"location":"architecture/#logging-architecture","title":"Logging Architecture","text":"graph TB\n subgraph \"Logging Pipeline\"\n App[Application] --> Logger[Structured Logger]\n Logger --> File[File Output]\n Logger --> Console[Console Output]\n Logger --> Syslog[Syslog Forwarder]\n Syslog --> Aggregator[Log Aggregator]\n Aggregator --> Storage[Log Storage]\n Storage --> Viewer[Log Viewer]\n end\n"},{"location":"architecture/#deployment-architecture","title":"Deployment Architecture","text":""},{"location":"architecture/#container-deployment","title":"Container Deployment","text":"graph TB\n subgraph \"Deployment Stack\"\n Image[Container Image]\n Registry[Container Registry]\n 
Orchestrator[Docker Compose]\n Config[ConfigMaps/Secrets]\n Storage[Persistent Storage]\n\n Image --> Registry\n Registry --> Orchestrator\n Config --> Orchestrator\n Storage --> Orchestrator\n end\n"},{"location":"architecture/#service-discovery","title":"Service Discovery","text":"graph TB\n subgraph \"Service Mesh\"\n Gateway[API Gateway]\n Discovery[Service Discovery]\n Worker[Worker Service]\n Data[Data Service]\n Redis[Redis Cluster]\n\n Gateway --> Discovery\n Discovery --> Worker\n Discovery --> Data\n Discovery --> Redis\n end\n"},{"location":"architecture/#future-architecture-considerations","title":"Future Architecture Considerations","text":""},{"location":"architecture/#microservices-evolution","title":"Microservices Evolution","text":"This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.
"},{"location":"cicd/","title":"CI/CD Pipeline","text":"Automated testing, building, and releasing for fetch_ml.
"},{"location":"cicd/#workflows","title":"Workflows","text":""},{"location":"cicd/#ci-workflow-githubworkflowsciyml","title":"CI Workflow (.github/workflows/ci.yml)","text":"Runs on every push to main/develop and all pull requests.
Jobs: 1. test - Go backend tests with Redis 2. build - Build all binaries (Go + Zig CLI) 3. test-scripts - Validate deployment scripts 4. security-scan - Trivy and Gosec security scans 5. docker-build - Build and push Docker images (main branch only)
Test Coverage: - Go unit tests with race detection - internal/queue package tests - Zig CLI tests - Integration tests - Security audits
"},{"location":"cicd/#release-workflow-githubworkflowsreleaseyml","title":"Release Workflow (.github/workflows/release.yml)","text":"Runs on version tags (e.g., v1.0.0).
Jobs: - CLI build (embeds rsync for zero-dependency releases) - build-go-backends (api-server, worker, tui, data_manager, user_manager) - create-release
# 1. Update version\ngit tag v1.0.0\n\n# 2. Push tag\ngit push origin v1.0.0\n\n# 3. CI automatically builds and releases\n"},{"location":"cicd/#release-artifacts","title":"Release Artifacts","text":"CLI Binaries (with embedded rsync): - ml-linux-x86_64.tar.gz (~450-650KB) - ml-macos-x86_64.tar.gz (~450-650KB) - ml-macos-arm64.tar.gz (~450-650KB)
Go Backends: - fetch_ml_api-server.tar.gz - fetch_ml_worker.tar.gz - fetch_ml_tui.tar.gz - fetch_ml_data_manager.tar.gz - fetch_ml_user_manager.tar.gz
Checksums: - checksums.txt - Combined SHA256 sums - Individual .sha256 files per binary
# Run all tests\nmake test\n\n# Run specific package tests\ngo test ./internal/queue/...\n\n# Build CLI\ncd cli && zig build dev\n\n# Run formatters and linters\nmake lint\n\n# Security scans are handled automatically in CI by the `security-scan` job\n"},{"location":"cicd/#optional-heavy-end-to-end-tests","title":"Optional heavy end-to-end tests","text":"Some e2e tests exercise full Docker deployments and performance scenarios and are skipped by default to keep local/CI runs fast. You can enable them explicitly with environment variables:
# Run Docker deployment e2e tests\nFETCH_ML_E2E_DOCKER=1 go test ./tests/e2e/...\n\n# Run performance-oriented e2e tests\nFETCH_ML_E2E_PERF=1 go test ./tests/e2e/...\n Without these variables, TestDockerDeploymentE2E and TestPerformanceE2E will t.Skip, while all lighter e2e tests still run.
All PRs must pass: - \u2705 Go tests (with Redis) - \u2705 CLI tests - \u2705 Security scans - \u2705 Code linting - \u2705 Build verification
"},{"location":"cicd/#configuration","title":"Configuration","text":""},{"location":"cicd/#environment-variables","title":"Environment Variables","text":"GO_VERSION: '1.25.0'\nZIG_VERSION: '0.15.2'\n"},{"location":"cicd/#secrets","title":"Secrets","text":"Required for releases: - GITHUB_TOKEN - Automatic, provided by GitHub Actions
Check workflow runs at:
https://github.com/jfraeys/fetch_ml/actions\n"},{"location":"cicd/#artifacts","title":"Artifacts","text":"Download build artifacts from: - Successful workflow runs (30-day retention) - GitHub Releases (permanent)
For implementation details: - .github/workflows/ci.yml - .github/workflows/release.yml
"},{"location":"cli-reference/","title":"Fetch ML CLI Reference","text":"Comprehensive command-line tools for managing ML experiments in your homelab with Zig-based high-performance CLI.
"},{"location":"cli-reference/#overview","title":"Overview","text":"Fetch ML provides a comprehensive CLI toolkit built with performance and security in mind:
"},{"location":"cli-reference/#zig-cli-clizig-outbinml","title":"Zig CLI (./cli/zig-out/bin/ml)","text":"High-performance command-line interface for experiment management, written in Zig for speed and efficiency.
"},{"location":"cli-reference/#available-commands","title":"Available Commands","text":"Command Description Exampleinit Interactive configuration setup ml init sync Sync project to worker with deduplication ml sync ./project --name myjob --queue queue Queue job for execution ml queue myjob --commit abc123 --priority 8 status Get system and worker status ml status monitor Launch TUI monitoring via SSH ml monitor cancel Cancel running job ml cancel job123 prune Clean up old experiments ml prune --keep 10 watch Auto-sync directory on changes ml watch ./project --queue"},{"location":"cli-reference/#command-details","title":"Command Details","text":""},{"location":"cli-reference/#init-configuration-setup","title":"init - Configuration Setup","text":"ml init\n Creates a configuration template at ~/.ml/config.toml with: - Worker connection details - API authentication - Base paths and ports"},{"location":"cli-reference/#sync-project-synchronization","title":"sync - Project Synchronization","text":"# Basic sync\nml sync ./my-project\n\n# Sync with custom name and queue\nml sync ./my-project --name \"experiment-1\" --queue\n\n# Sync with priority\nml sync ./my-project --priority 9\n Features: - Content-addressed storage for deduplication - SHA256 commit ID generation - Rsync-based file transfer - Automatic queuing (with --queue flag)
"},{"location":"cli-reference/#queue-job-management","title":"queue - Job Management","text":"# Queue with commit ID\nml queue my-job --commit abc123def456\n\n# Queue with priority (1-10, default 5)\nml queue my-job --commit abc123 --priority 8\n Features: - WebSocket-based communication - Priority queuing system - API key authentication
"},{"location":"cli-reference/#watch-auto-sync-monitoring","title":"watch - Auto-Sync Monitoring","text":"# Watch directory for changes\nml watch ./project\n\n# Watch and auto-queue on changes\nml watch ./project --name \"dev-exp\" --queue\n Features: - Real-time file system monitoring - Automatic re-sync on changes - Configurable polling interval (2 seconds) - Commit ID comparison for efficiency
"},{"location":"cli-reference/#prune-cleanup-management","title":"prune - Cleanup Management","text":"# Keep last N experiments\nml prune --keep 20\n\n# Remove experiments older than N days\nml prune --older-than 30\n"},{"location":"cli-reference/#monitor-remote-monitoring","title":"monitor - Remote Monitoring","text":"ml monitor\n Launches TUI interface via SSH for real-time monitoring."},{"location":"cli-reference/#cancel-job-cancellation","title":"cancel - Job Cancellation","text":"ml cancel running-job-id\n Cancels currently running jobs by ID."},{"location":"cli-reference/#configuration","title":"Configuration","text":"The Zig CLI reads configuration from ~/.ml/config.toml:
worker_host = \"worker.local\"\nworker_user = \"mluser\"\nworker_base = \"/data/ml-experiments\"\nworker_port = 22\napi_key = \"your-api-key\"\n"},{"location":"cli-reference/#performance-features","title":"Performance Features","text":"./cmd/api-server/main.go)","text":"Main HTTPS API server for experiment management.
# Build and run\ngo run ./cmd/api-server/main.go\n\n# With configuration\n./bin/api-server --config configs/config-local.yaml\n Features: - HTTPS-only communication - API key authentication - Rate limiting and IP whitelisting - WebSocket support for real-time updates - Redis integration for caching
"},{"location":"cli-reference/#tui-cmdtuimaingo","title":"TUI (./cmd/tui/main.go)","text":"Terminal User Interface for monitoring experiments.
# Launch TUI\ngo run ./cmd/tui/main.go\n\n# With custom config\n./tui --config configs/config-local.yaml\n Features: - Real-time experiment monitoring - Interactive job management - Status visualization - Log viewing
"},{"location":"cli-reference/#data-manager-cmddata_manager","title":"Data Manager (./cmd/data_manager/)","text":"Utilities for data synchronization and management.
# Sync data\n./data_manager --sync ./data\n\n# Clean old data\n./data_manager --cleanup --older-than 30d\n"},{"location":"cli-reference/#config-lint-cmdconfiglintmaingo","title":"Config Lint (./cmd/configlint/main.go)","text":"Configuration validation and linting tool.
# Validate configuration\n./configlint configs/config-local.yaml\n\n# Check schema compliance\n./configlint --schema configs/schema/config_schema.yaml\n"},{"location":"cli-reference/#management-script-toolsmanagesh","title":"Management Script (./tools/manage.sh)","text":"Simple service management for your homelab.
"},{"location":"cli-reference/#commands","title":"Commands","text":"./tools/manage.sh start # Start all services\n./tools/manage.sh stop # Stop all services\n./tools/manage.sh status # Check service status\n./tools/manage.sh logs # View logs\n./tools/manage.sh monitor # Basic monitoring\n./tools/manage.sh security # Security status\n./tools/manage.sh cleanup # Clean project artifacts\n"},{"location":"cli-reference/#setup-script-setupsh","title":"Setup Script (./setup.sh)","text":"One-command homelab setup.
"},{"location":"cli-reference/#usage","title":"Usage","text":"# Full setup\n./setup.sh\n\n# Setup includes:\n# - SSL certificate generation\n# - Configuration creation\n# - Build all components\n# - Start Redis\n# - Setup Fail2Ban (if available)\n"},{"location":"cli-reference/#api-testing","title":"API Testing","text":"Test the API with curl:
# Health check\ncurl -k -H 'X-API-Key: password' https://localhost:9101/health\n\n# List experiments\ncurl -k -H 'X-API-Key: password' https://localhost:9101/experiments\n\n# Submit experiment\ncurl -k -X POST -H 'X-API-Key: password' \\\n -H 'Content-Type: application/json' \\\n -d '{\"name\":\"test\",\"config\":{\"type\":\"basic\"}}' \\\n https://localhost:9101/experiments\n"},{"location":"cli-reference/#zig-cli-architecture","title":"Zig CLI Architecture","text":"The Zig CLI is designed for performance and reliability:
"},{"location":"cli-reference/#core-components","title":"Core Components","text":"cli/src/commands/): Individual command implementationscli/src/config.zig): Configuration managementcli/src/net/ws.zig): WebSocket client implementationcli/src/utils/): Cryptography, storage, and rsync utilitiescli/src/errors.zig): Centralized error handlingMain configuration file: configs/config-local.yaml
auth:\n enabled: true\n api_keys:\n homelab_user:\n hash: \"5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8\"\n admin: true\n\nserver:\n address: \":9101\"\n tls:\n enabled: true\n cert_file: \"./ssl/cert.pem\"\n key_file: \"./ssl/key.pem\"\n\nsecurity:\n rate_limit:\n enabled: true\n requests_per_minute: 30\n ip_whitelist:\n - \"127.0.0.1\"\n - \"::1\"\n - \"192.168.0.0/16\"\n - \"10.0.0.0/8\"\n"},{"location":"cli-reference/#docker-commands","title":"Docker Commands","text":"If using Docker Compose:
# Start services\ndocker-compose up -d # testing only\n\n# View logs\ndocker-compose logs -f\n\n# Stop services\ndocker-compose down\n\n# Check status\ndocker-compose ps\n"},{"location":"cli-reference/#troubleshooting","title":"Troubleshooting","text":""},{"location":"cli-reference/#common-issues","title":"Common Issues","text":"Zig CLI not found:
# Build the CLI\ncd cli && make build\n\n# Check binary exists\nls -la ./cli/zig-out/bin/ml\n Configuration not found:
# Create configuration\n./cli/zig-out/bin/ml init\n\n# Check config file\nls -la ~/.ml/config.toml\n Worker connection failed:
# Test SSH connection\nssh -p 22 mluser@worker.local\n\n# Check configuration\ncat ~/.ml/config.toml\n Sync not working:
# Check rsync availability\nrsync --version\n\n# Test manual sync\nrsync -avz ./project/ mluser@worker.local:/tmp/test/\n WebSocket connection failed:
# Check worker WebSocket port\ntelnet worker.local 9100\n\n# Verify API key\n./cli/zig-out/bin/ml status\n API not responding:
./tools/manage.sh status\n./tools/manage.sh logs\n Authentication failed:
# Check API key in config-local.yaml\ngrep -A 5 \"api_keys:\" configs/config-local.yaml\n Redis connection failed:
# Check Redis status\nredis-cli ping\n\n# Start Redis\nredis-server\n"},{"location":"cli-reference/#getting-help","title":"Getting Help","text":"# CLI help\n./cli/zig-out/bin/ml help\n\n# Management script help\n./tools/manage.sh help\n\n# Check all available commands\nmake help\n That's it for the CLI reference! For complete setup instructions, see the main index.
"},{"location":"configuration-schema/","title":"Configuration Schema","text":"Complete reference for Fetch ML configuration options.
"},{"location":"configuration-schema/#configuration-file-structure","title":"Configuration File Structure","text":"Fetch ML uses YAML configuration files. The main configuration file is typically config.yaml.
# Server Configuration\nserver:\n address: \":9101\"\n tls:\n enabled: false\n cert_file: \"\"\n key_file: \"\"\n\n# Database Configuration\ndatabase:\n type: \"sqlite\" # sqlite, postgres, mysql\n connection: \"fetch_ml.db\"\n host: \"localhost\"\n port: 5432\n username: \"postgres\"\n password: \"\"\n database: \"fetch_ml\"\n\n# Redis Configuration\n\n\n## Quick Reference\n\n### Database Types\n- **SQLite**: `type: sqlite, connection: file.db`\n- **PostgreSQL**: `type: postgres, host: localhost, port: 5432`\n\n### Key Settings\n- `server.address: :9101`\n- `database.type: sqlite`\n- `redis.addr: localhost:6379`\n- `auth.enabled: true`\n- `logging.level: info`\n\n### Environment Override\n```bash\nexport FETCHML_SERVER_ADDRESS=:8080\nexport FETCHML_DATABASE_TYPE=postgres\n"},{"location":"configuration-schema/#validation","title":"Validation","text":"make configlint\n"},{"location":"deployment/","title":"ML Experiment Manager - Deployment Guide","text":""},{"location":"deployment/#overview","title":"Overview","text":"The ML Experiment Manager supports multiple deployment methods from local development to homelab Docker setups.
"},{"location":"deployment/#quick-start","title":"Quick Start","text":""},{"location":"deployment/#docker-compose-recommended-for-development","title":"Docker Compose (Recommended for Development)","text":"# Clone repository\ngit clone https://github.com/your-org/fetch_ml.git\ncd fetch_ml\n\n# Start all services (testing only)\ndocker-compose up -d\n\n# Check status\ndocker-compose ps\n\n# View logs\ndocker-compose logs -f api-server\n Access the API at http://localhost:9100
Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution Toolchain: - Go 1.25+ - Zig 0.15.2 - Redis 7+ - Docker & Docker Compose (optional)
"},{"location":"deployment/#manual-setup","title":"Manual Setup","text":"# Start Redis\nredis-server\n\n# Build and run Go server\ngo build -o bin/api-server ./cmd/api-server\n./bin/api-server -config configs/config-local.yaml\n\n# Build Zig CLI\ncd cli\nzig build prod\n./zig-out/bin/ml --help\n"},{"location":"deployment/#2-docker-deployment","title":"2. Docker Deployment","text":""},{"location":"deployment/#build-image","title":"Build Image","text":"docker build -t ml-experiment-manager:latest .\n"},{"location":"deployment/#run-container","title":"Run Container","text":"docker run -d \\\n --name ml-api \\\n -p 9100:9100 \\\n -p 9101:9101 \\\n -v $(pwd)/configs:/app/configs:ro \\\n -v experiment-data:/data/ml-experiments \\\n ml-experiment-manager:latest\n"},{"location":"deployment/#docker-compose","title":"Docker Compose","text":"# Detached mode\ndocker-compose -f docker-compose.yml up -d\n\n# Development mode with logs\ndocker-compose -f docker-compose.yml up\n"},{"location":"deployment/#3-homelab-setup","title":"3. Homelab Setup","text":"# Use the simple setup script\n./setup.sh\n\n# Or manually with Docker Compose (testing only)\ndocker-compose up -d\n"},{"location":"deployment/#4-cloud-deployment","title":"4. 
Cloud Deployment","text":""},{"location":"deployment/#aws-ecs","title":"AWS ECS","text":"# Build and push to ECR\naws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY\ndocker build -t $ECR_REGISTRY/ml-experiment-manager:latest .\ndocker push $ECR_REGISTRY/ml-experiment-manager:latest\n\n# Deploy with ECS CLI\necs-cli compose --project-name ml-experiment-manager up\n"},{"location":"deployment/#google-cloud-run","title":"Google Cloud Run","text":"# Build and push\ngcloud builds submit --tag gcr.io/$PROJECT_ID/ml-experiment-manager\n\n# Deploy\ngcloud run deploy ml-experiment-manager \\\n --image gcr.io/$PROJECT_ID/ml-experiment-manager \\\n --platform managed \\\n --region us-central1 \\\n --allow-unauthenticated\n"},{"location":"deployment/#configuration","title":"Configuration","text":""},{"location":"deployment/#environment-variables","title":"Environment Variables","text":"# configs/config-local.yaml\nbase_path: \"/data/ml-experiments\"\nauth:\n enabled: true\n api_keys:\n - \"your-production-api-key\"\nserver:\n address: \":9100\"\n tls:\n enabled: true\n cert_file: \"/app/ssl/cert.pem\"\n key_file: \"/app/ssl/key.pem\"\n"},{"location":"deployment/#docker-compose-environment","title":"Docker Compose Environment","text":"# docker-compose.yml\nversion: '3.8'\nservices:\n api-server:\n environment:\n - REDIS_URL=redis://redis:6379\n - LOG_LEVEL=info\n volumes:\n - ./configs:/configs:ro\n - ./data:/data/experiments\n"},{"location":"deployment/#monitoring-logging","title":"Monitoring & Logging","text":""},{"location":"deployment/#health-checks","title":"Health Checks","text":"GET /health/metrics# Generate self-signed cert (development)\nopenssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes\n\n# Production - use Let's Encrypt\ncertbot certonly --standalone -d ml-experiments.example.com\n"},{"location":"deployment/#network-security","title":"Network Security","text":"resources:\n requests:\n memory: 
\"256Mi\"\n cpu: \"250m\"\n limits:\n memory: \"1Gi\"\n cpu: \"1000m\"\n"},{"location":"deployment/#scaling-strategies","title":"Scaling Strategies","text":"# Backup experiment data\ndocker-compose exec redis redis-cli BGSAVE\ndocker cp $(docker-compose ps -q redis):/data/dump.rdb ./redis-backup.rdb\n\n# Backup data volume\ndocker run --rm -v ml-experiments_redis_data:/data -v $(pwd):/backup alpine tar czf /backup/redis-backup.tar.gz -C /data .\n"},{"location":"deployment/#disaster-recovery","title":"Disaster Recovery","text":"# Check logs\ndocker-compose logs api-server\n\n# Check configuration\ncat configs/config-local.yaml\n\n# Check Redis connection\ndocker-compose exec redis redis-cli ping\n"},{"location":"deployment/#websocket-connection-issues","title":"WebSocket Connection Issues","text":"# Test WebSocket\nwscat -c ws://localhost:9100/ws\n\n# Check TLS\nopenssl s_client -connect localhost:9101 -servername localhost\n"},{"location":"deployment/#performance-issues","title":"Performance Issues","text":"# Check resource usage\ndocker-compose exec api-server ps aux\n\n# Check Redis memory\ndocker-compose exec redis redis-cli info memory\n"},{"location":"deployment/#debug-mode","title":"Debug Mode","text":"# Enable debug logging\nexport LOG_LEVEL=debug\n./bin/api-server -config configs/config-local.yaml\n"},{"location":"deployment/#cicd-integration","title":"CI/CD Integration","text":""},{"location":"deployment/#github-actions","title":"GitHub Actions","text":"For deployment issues: 1. Check this guide 2. Review logs 3. Check GitHub Issues 4. Contact maintainers
"},{"location":"development-setup/","title":"Development Setup","text":"Set up your local development environment for Fetch ML.
"},{"location":"development-setup/#prerequisites","title":"Prerequisites","text":"Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution
# Clone repository\ngit clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\n\n# Start dependencies (see Quick Start for Docker setup)\ndocker-compose up -d redis postgres\n\n# Build all components\nmake build\n\n# Run tests (see Testing Guide)\nmake test-unit\n"},{"location":"development-setup/#detailed-setup","title":"Detailed Setup","text":""},{"location":"development-setup/#quick-start","title":"Quick Start","text":"git clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\ndocker-compose up -d  # see Quick Start (quick-start.md)\nmake build\nmake test-unit  # see Testing Guide (testing.md)\n"},{"location":"development-setup/#key-commands","title":"Key Commands","text":"make build - Build all componentsmake test-unit - Run tests (see [Testing Guide](testing.md))make dev - Development buildcd cli && zig build - Build CLI (see [CLI Reference](cli-reference.md) and [Zig CLI](zig-cli.md))go mod tidycd cli && rm -rf zig-out zig-cachelsof -i :9101Fetch ML supports environment variables for configuration, allowing you to override config file settings and deploy in different environments.
"},{"location":"environment-variables/#priority-order","title":"Priority Order","text":"FETCH_ML_* - General server and application settingsFETCH_ML_CLI_* - CLI-specific settings (overrides ~/.ml/config.toml)FETCH_ML_TUI_* - TUI-specific settings (overrides TUI config file)FETCH_ML_CLI_HOST worker_host localhost FETCH_ML_CLI_USER worker_user mluser FETCH_ML_CLI_BASE worker_base /opt/ml FETCH_ML_CLI_PORT worker_port 22 FETCH_ML_CLI_API_KEY api_key your-api-key-here"},{"location":"environment-variables/#tui-environment-variables","title":"TUI Environment Variables","text":"Variable Config Field Example FETCH_ML_TUI_HOST host localhost FETCH_ML_TUI_USER user mluser FETCH_ML_TUI_SSH_KEY ssh_key ~/.ssh/id_rsa FETCH_ML_TUI_PORT port 22 FETCH_ML_TUI_BASE_PATH base_path /opt/ml FETCH_ML_TUI_TRAIN_SCRIPT train_script train.py FETCH_ML_TUI_REDIS_ADDR redis_addr localhost:6379 FETCH_ML_TUI_REDIS_PASSWORD redis_password `` FETCH_ML_TUI_REDIS_DB redis_db 0 FETCH_ML_TUI_KNOWN_HOSTS known_hosts ~/.ssh/known_hosts"},{"location":"environment-variables/#server-environment-variables-auth-debug","title":"Server Environment Variables (Auth & Debug)","text":"These variables control server-side authentication behavior and are intended only for local development and debugging.
Variable Purpose Allowed In Production?FETCH_ML_ALLOW_INSECURE_AUTH When set to 1 and FETCH_ML_DEBUG=1, allows the API server to run with auth.enabled: false by injecting a default admin user. No. Must never be set in production. FETCH_ML_DEBUG Enables additional debug behaviors. Required (set to 1) to activate the insecure auth bypass above. No. Must never be set in production. When both variables are set to 1 and auth.enabled is false, the server logs a clear warning and treats all requests as coming from a default admin user. This mode is convenient for local homelab experiments but is insecure by design and must not be used on any shared or internet-facing environment.
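A minimal sketch of how such a guard might be enforced (the helper name and config parameter are hypothetical; the real server's implementation may differ):

```python
import os


def insecure_auth_allowed(auth_enabled: bool) -> bool:
    """Return True only when the insecure local-dev bypass may activate.

    Requires auth to be disabled in config AND both opt-in env vars set to "1",
    matching the documented FETCH_ML_DEBUG / FETCH_ML_ALLOW_INSECURE_AUTH behavior.
    """
    if auth_enabled:
        return False  # normal auth path; the bypass is irrelevant
    debug = os.environ.get("FETCH_ML_DEBUG") == "1"
    allow = os.environ.get("FETCH_ML_ALLOW_INSECURE_AUTH") == "1"
    return debug and allow
```

When this returns True, the server would log its warning and inject the default admin user; in every other case a server with `auth.enabled: false` should refuse to serve requests.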
export FETCH_ML_CLI_HOST=localhost\nexport FETCH_ML_CLI_USER=devuser\nexport FETCH_ML_CLI_API_KEY=dev-key-123456789012\n./ml status\n"},{"location":"environment-variables/#production-environment","title":"Production Environment","text":"export FETCH_ML_CLI_HOST=prod-server.example.com\nexport FETCH_ML_CLI_USER=mluser\nexport FETCH_ML_CLI_API_KEY=prod-key-abcdef1234567890\n./ml status\n"},{"location":"environment-variables/#dockerkubernetes","title":"Docker/Kubernetes","text":"env:\n - name: FETCH_ML_CLI_HOST\n value: \"ml-server.internal\"\n - name: FETCH_ML_CLI_USER\n value: \"mluser\"\n - name: FETCH_ML_CLI_API_KEY\n valueFrom:\n secretKeyRef:\n name: ml-secrets\n key: api-key\n"},{"location":"environment-variables/#using-env-file","title":"Using .env file","text":"# Copy the example file\ncp .env.example .env\n\n# Edit with your values\nvim .env\n\n# Load in your shell\nexport $(cat .env | xargs)\n"},{"location":"environment-variables/#backward-compatibility","title":"Backward Compatibility","text":"The CLI also supports the legacy ML_* prefix for backward compatibility, but FETCH_ML_CLI_* takes priority if both are set.
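The documented lookup order (`FETCH_ML_CLI_*` env var, then legacy `ML_*` env var, then config file, then default) can be modeled with a small helper. This is an illustrative sketch of the precedence rules, not the CLI's actual code:

```python
import os
from typing import Optional


def resolve_setting(cli_var: str, legacy_var: str,
                    config_value: Optional[str], default: str) -> str:
    """Resolve one setting using the documented precedence:
    FETCH_ML_CLI_* > legacy ML_* > config file > built-in default."""
    for var in (cli_var, legacy_var):
        value = os.environ.get(var)
        if value:
            return value
    return config_value if config_value is not None else default


# Example: resolve_setting("FETCH_ML_CLI_HOST", "ML_HOST", cfg_host, "localhost")
```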
ML_HOST FETCH_ML_CLI_HOST ML_USER FETCH_ML_CLI_USER ML_BASE FETCH_ML_CLI_BASE ML_PORT FETCH_ML_CLI_PORT ML_API_KEY FETCH_ML_CLI_API_KEY"},{"location":"first-experiment/","title":"First Experiment","text":"Run your first machine learning experiment with Fetch ML.
"},{"location":"first-experiment/#prerequisites","title":"Prerequisites","text":"Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution
Create a simple Python script:
# experiment.py\nimport argparse\nimport json\nimport sys\nimport time\n\ndef main():\n parser = argparse.ArgumentParser()\n parser.add_argument('--epochs', type=int, default=10)\n parser.add_argument('--lr', type=float, default=0.001)\n parser.add_argument('--output', default='results.json')\n\n args = parser.parse_args()\n\n # Simulate training\n results = {\n 'epochs': args.epochs,\n 'learning_rate': args.lr,\n 'accuracy': 0.85 + (args.lr * 0.1),\n 'loss': 0.5 - (args.epochs * 0.01),\n 'training_time': args.epochs * 0.1\n }\n\n # Save results\n with open(args.output, 'w') as f:\n json.dump(results, f, indent=2)\n\n print(f\"Training completed: {results}\")\n return results\n\nif __name__ == '__main__':\n main()\n"},{"location":"first-experiment/#2-submit-job-via-api","title":"2. Submit Job via API","text":"# Submit experiment\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n -H \"Content-Type: application/json\" \\\n -H \"X-API-Key: your-api-key\" \\\n -d '{\n \"job_name\": \"first-experiment\",\n \"args\": \"--epochs 20 --lr 0.01 --output experiment_results.json\",\n \"priority\": 1,\n \"metadata\": {\n \"experiment_type\": \"training\",\n \"dataset\": \"sample_data\"\n }\n }'\n"},{"location":"first-experiment/#3-monitor-progress","title":"3. Monitor Progress","text":"# Check job status\ncurl -H \"X-API-Key: your-api-key\" \\\n http://localhost:9101/api/v1/jobs/first-experiment\n\n# List all jobs\ncurl -H \"X-API-Key: your-api-key\" \\\n http://localhost:9101/api/v1/jobs\n\n# Get job metrics\ncurl -H \"X-API-Key: your-api-key\" \\\n http://localhost:9101/api/v1/jobs/first-experiment/metrics\n"},{"location":"first-experiment/#4-use-cli","title":"4. 
Use CLI","text":"# Submit with CLI\ncd cli && zig build dev\n./cli/zig-out/dev/ml submit \\\n --name \"cli-experiment\" \\\n --args \"--epochs 15 --lr 0.005\" \\\n --server http://localhost:9101\n\n# Monitor with CLI\n./cli/zig-out/dev/ml list-jobs --server http://localhost:9101\n./cli/zig-out/dev/ml job-status cli-experiment --server http://localhost:9101\n"},{"location":"first-experiment/#advanced-experiment","title":"Advanced Experiment","text":""},{"location":"first-experiment/#hyperparameter-tuning","title":"Hyperparameter Tuning","text":"# Submit multiple experiments\nfor lr in 0.001 0.01 0.1; do\n curl -X POST http://localhost:9101/api/v1/jobs \\\n -H \"Content-Type: application/json\" \\\n -H \"X-API-Key: your-api-key\" \\\n -d \"{\n \\\"job_name\\\": \\\"tune-lr-$lr\\\",\n \\\"args\\\": \\\"--epochs 10 --lr $lr\\\",\n \\\"metadata\\\": {\\\"learning_rate\\\": $lr}\n }\"\ndone\n"},{"location":"first-experiment/#batch-processing","title":"Batch Processing","text":"# Submit batch job\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n -H \"Content-Type: application/json\" \\\n -H \"X-API-Key: your-api-key\" \\\n -d '{\n \"job_name\": \"batch-processing\",\n \"args\": \"--input data/ --output results/ --batch-size 32\",\n \"priority\": 2,\n \"datasets\": [\"training_data\", \"validation_data\"]\n }'\n"},{"location":"first-experiment/#results-and-output","title":"Results and Output","text":""},{"location":"first-experiment/#access-results","title":"Access Results","text":"# Download results\ncurl -H \"X-API-Key: your-api-key\" \\\n http://localhost:9101/api/v1/jobs/first-experiment/results\n\n# View job details\ncurl -H \"X-API-Key: your-api-key\" \\\n http://localhost:9101/api/v1/jobs/first-experiment | jq .\n"},{"location":"first-experiment/#result-format","title":"Result Format","text":"{\n \"job_id\": \"first-experiment\",\n \"status\": \"completed\",\n \"results\": {\n \"epochs\": 20,\n \"learning_rate\": 0.01,\n \"accuracy\": 0.86,\n \"loss\": 0.3,\n 
\"training_time\": 2.0\n },\n \"metrics\": {\n \"gpu_utilization\": \"85%\",\n \"memory_usage\": \"2GB\",\n \"execution_time\": \"120s\"\n }\n}\n"},{"location":"first-experiment/#best-practices","title":"Best Practices","text":""},{"location":"first-experiment/#job-naming","title":"Job Naming","text":"model-training-v2, data-preprocessingexperiment-v1, experiment-v2daily-batch-2024-01-15{\n \"metadata\": {\n \"experiment_type\": \"training\",\n \"model_version\": \"v2.1\",\n \"dataset\": \"imagenet-2024\",\n \"environment\": \"gpu\",\n \"team\": \"ml-team\"\n }\n}\n"},{"location":"first-experiment/#error-handling","title":"Error Handling","text":"# Check failed jobs\ncurl -H \"X-API-Key: your-api-key\" \\\n \"http://localhost:9101/api/v1/jobs?status=failed\"\n\n# Retry failed job\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n -H \"Content-Type: application/json\" \\\n -H \"X-API-Key: your-api-key\" \\\n -d '{\n \"job_name\": \"retry-experiment\",\n \"args\": \"--epochs 20 --lr 0.01\",\n \"metadata\": {\"retry_of\": \"first-experiment\"}\n }'\n"},{"location":"first-experiment/#related-documentation","title":"## Related Documentation","text":"Job stuck in pending? - Check worker status: curl /api/v1/workers - Verify resources: docker stats - Check logs: docker-compose logs api-server
Job failed? - Check error message: curl /api/v1/jobs/job-id - Review job arguments - Verify input data
No results? - Check job completion status - Verify output file paths - Check storage permissions
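The status checks above can be scripted instead of run by hand. A minimal polling sketch against the endpoints shown earlier, assuming the response carries the `status` field from the Result Format section (`completed`/`failed` treated as terminal):

```python
import json
import time
import urllib.request

API = "http://localhost:9101/api/v1"


def job_url(name: str) -> str:
    return f"{API}/jobs/{name}"


def is_terminal(job: dict) -> bool:
    # "completed" and "failed" appear in the examples above; treat both as final
    return job.get("status") in ("completed", "failed")


def wait_for_job(name: str, api_key: str,
                 interval: float = 5.0, timeout: float = 600.0) -> dict:
    """Poll the job endpoint until the job reaches a terminal status."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        req = urllib.request.Request(job_url(name), headers={"X-API-Key": api_key})
        with urllib.request.urlopen(req) as resp:
            job = json.load(resp)
        if is_terminal(job):
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {name!r} did not finish within {timeout}s")
```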
"},{"location":"installation/","title":"Simple Installation Guide","text":""},{"location":"installation/#quick-start-5-minutes","title":"Quick Start (5 minutes)","text":"# 1. Install\ngit clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\nmake install\n\n# 2. Setup (auto-configures)\n./bin/ml setup\n\n# 3. Run experiments\n./bin/ml run my-experiment.py\n That's it. Everything else is optional.
"},{"location":"installation/#what-if-i-want-more-control","title":"What If I Want More Control?","text":""},{"location":"installation/#manual-configuration-optional","title":"Manual Configuration (Optional)","text":"# Edit settings if defaults don't work\nnano ~/.ml/config.toml\n"},{"location":"installation/#monitoring-dashboard-optional","title":"Monitoring Dashboard (Optional)","text":"# Real-time monitoring\n./bin/tui\n"},{"location":"installation/#senior-developer-feedback","title":"Senior Developer Feedback","text":"\"Keep it simple\" - Most data scientists want: 1. One installation command 2. Sensible defaults 3. Works without configuration 4. Advanced features available when needed
Current plan is too complex because it asks users to decide between: - CLI vs TUI vs Both - Zig vs Go build tools - Manual vs auto config - Multiple environment variables
Better approach: Start simple, add complexity gradually.
"},{"location":"installation/#recommended-simplified-workflow","title":"Recommended Simplified Workflow","text":"The goal: \"It just works\" for 80% of use cases.
"},{"location":"operations/","title":"Operations Runbook","text":"Operational guide for troubleshooting and maintaining the ML experiment system.
"},{"location":"operations/#task-queue-operations","title":"Task Queue Operations","text":""},{"location":"operations/#monitoring-queue-health","title":"Monitoring Queue Health","text":"# Check queue depth\nZCARD task:queue\n\n# List pending tasks\nZRANGE task:queue 0 -1 WITHSCORES\n\n# Check dead letter queue\nKEYS task:dlq:*\n"},{"location":"operations/#handling-stuck-tasks","title":"Handling Stuck Tasks","text":"Symptom: Tasks stuck in \"running\" status
Diagnosis:
# Check for expired leases\nredis-cli GET task:{task-id}\n# Look for LeaseExpiry in past\n **Remediation:** Tasks with expired leases are automatically reclaimed every 1 minute. To force immediate reclamation:
# Restart worker to trigger reclaim cycle\nsystemctl restart ml-worker\n"},{"location":"operations/#dead-letter-queue-management","title":"Dead Letter Queue Management","text":"View failed tasks:
KEYS task:dlq:*\n Inspect failed task:
GET task:dlq:{task-id}\n Retry from DLQ:
# Manual retry (requires custom script)\n# 1. Get task from DLQ\n# 2. Reset retry count\n# 3. Re-queue task\n"},{"location":"operations/#worker-crashes","title":"Worker Crashes","text":"Symptom: Worker disappeared mid-task
What Happens: 1. Lease expires after 30 minutes (default) 2. Background reclaim job detects expired lease 3. Task is retried (up to 3 attempts) 4. After max retries \u2192 Dead Letter Queue
Prevention: - Monitor worker heartbeats - Set up alerts for worker down - Use process manager (systemd, supervisor)
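The manual DLQ retry outlined in the Dead Letter Queue section (fetch from DLQ, reset the retry count, re-queue) can be sketched as a pure transformation on the stored task record. The Redis plumbing (`GET task:dlq:{task-id}`, re-adding to `task:queue`) is omitted, and the `status`/`lease_expiry` field names are assumptions:

```python
def requeue_from_dlq(task: dict) -> dict:
    """Prepare a DLQ task record for re-queueing: reset retries and lease state."""
    fresh = dict(task)  # don't mutate the stored record
    fresh["retry_count"] = 0         # field name from the High Retry Rate section
    fresh["status"] = "pending"      # assumed status value
    fresh.pop("lease_expiry", None)  # assumed lease field; cleared so a worker can claim it
    return fresh
```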
"},{"location":"operations/#worker-operations","title":"Worker Operations","text":""},{"location":"operations/#graceful-shutdown","title":"Graceful Shutdown","text":"# Send SIGTERM for graceful shutdown\nkill -TERM $(pgrep ml-worker)\n\n# Worker will:\n# 1. Stop accepting new tasks\n# 2. Finish active tasks (up to 5min timeout)\n# 3. Release all leases\n# 4. Exit cleanly\n"},{"location":"operations/#force-shutdown","title":"Force Shutdown","text":"# Force kill (leases will be reclaimed automatically)\nkill -9 $(pgrep ml-worker)\n"},{"location":"operations/#worker-heartbeat-monitoring","title":"Worker Heartbeat Monitoring","text":"# Check worker heartbeats\nHGETALL worker:heartbeat\n\n# Example output:\n# worker-abc123 1701234567\n# worker-def456 1701234580\n Alert if: Heartbeat timestamp > 5 minutes old
"},{"location":"operations/#redis-operations","title":"Redis Operations","text":""},{"location":"operations/#backup","title":"Backup","text":"# Manual backup\nredis-cli SAVE\ncp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb\n"},{"location":"operations/#restore","title":"Restore","text":"# Stop Redis\nsystemctl stop redis\n\n# Restore snapshot\ncp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb\n\n# Start Redis\nsystemctl start redis\n"},{"location":"operations/#memory-management","title":"Memory Management","text":"# Check memory usage\nINFO memory\n\n# Evict old data if needed\nFLUSHDB # DANGER: Clears all data!\n"},{"location":"operations/#common-issues","title":"Common Issues","text":""},{"location":"operations/#issue-queue-growing-unbounded","title":"Issue: Queue Growing Unbounded","text":"Symptoms: - ZCARD task:queue keeps increasing - No workers processing tasks
Diagnosis:
# Check worker status\nsystemctl status ml-worker\n\n# Check logs\njournalctl -u ml-worker -n 100\n Resolution: 1. Verify workers are running 2. Check Redis connectivity 3. Verify lease configuration
"},{"location":"operations/#issue-high-retry-rate","title":"Issue: High Retry Rate","text":"Symptoms: - Many tasks in DLQ - retry_count field high on tasks
Diagnosis:
# Check worker logs for errors\njournalctl -u ml-worker | grep \"retry\"\n\n# Look for patterns (network issues, resource limits, etc)\n Resolution: - Fix underlying issue (network, resources, etc) - Adjust retry limits if permanent failures - Increase task timeout if jobs are slow
"},{"location":"operations/#issue-leases-expiring-prematurely","title":"Issue: Leases Expiring Prematurely","text":"Symptoms: - Tasks retried even though worker is healthy - Logs show \"lease expired\" frequently
Diagnosis:
# Check worker config\ncat configs/worker-config.yaml | grep -A3 \"lease\"\n\ntask_lease_duration: 30m # Too short?\nheartbeat_interval: 1m # Too infrequent?\n Resolution:
# Increase lease duration for long-running jobs\ntask_lease_duration: 60m\nheartbeat_interval: 30s # More frequent heartbeats\n"},{"location":"operations/#performance-tuning","title":"Performance Tuning","text":""},{"location":"operations/#worker-concurrency","title":"Worker Concurrency","text":"# worker-config.yaml\nmax_workers: 4 # Number of parallel tasks\n\n# Adjust based on:\n# - CPU cores available\n# - Memory per task\n# - GPU availability\n"},{"location":"operations/#redis-configuration","title":"Redis Configuration","text":"# /etc/redis/redis.conf\n\n# Persistence\nsave 900 1\nsave 300 10\n\n# Memory\nmaxmemory 2gb\nmaxmemory-policy noeviction\n\n# Performance\ntcp-keepalive 300\ntimeout 0\n"},{"location":"operations/#alerting-rules","title":"Alerting Rules","text":""},{"location":"operations/#critical-alerts","title":"Critical Alerts","text":"#!/bin/bash\n# health-check.sh\n\n# Check Redis\nredis-cli PING || echo \"Redis DOWN\"\n\n# Check worker heartbeat\nWORKER_ID=$(cat /var/run/ml-worker.pid)\nLAST_HB=$(redis-cli HGET worker:heartbeat \"$WORKER_ID\")\nNOW=$(date +%s)\nif [ $((NOW - LAST_HB)) -gt 300 ]; then\n echo \"Worker heartbeat stale\"\nfi\n\n# Check queue depth\nDEPTH=$(redis-cli ZCARD task:queue)\nif [ \"$DEPTH\" -gt 1000 ]; then\n echo \"Queue depth critical: $DEPTH\"\nfi\n"},{"location":"operations/#runbook-checklist","title":"Runbook Checklist","text":""},{"location":"operations/#daily-operations","title":"Daily Operations","text":"For homelab setups: Most of these operations can be simplified. Focus on: - Basic monitoring (queue depth, worker status) - Periodic Redis backups - Graceful shutdowns for maintenance
"},{"location":"performance-monitoring/","title":"Performance Monitoring","text":"This document describes the performance monitoring system for Fetch ML, which automatically tracks benchmark metrics through CI/CD integration with Prometheus and Grafana.
"},{"location":"performance-monitoring/#overview","title":"Overview","text":"The performance monitoring system provides:
GitHub Actions \u2192 Benchmark Tests \u2192 Prometheus Pushgateway \u2192 Prometheus \u2192 Grafana Dashboard\n"},{"location":"performance-monitoring/#components","title":"Components","text":""},{"location":"performance-monitoring/#1-github-actions-workflow","title":"1. GitHub Actions Workflow","text":".github/workflows/benchmark-metrics.ymlhttp://localhost:9091monitoring/prometheus.ymlmonitoring/dashboards/performance-dashboard.jsonmake monitoring-performance\n This starts: - Grafana: http://localhost:3001 (admin/admin) - Loki: http://localhost:3100 - Pushgateway: http://localhost:9091
"},{"location":"performance-monitoring/#2-configure-github-secrets","title":"2. Configure GitHub Secrets","text":"Add this secret to your GitHub repository:
PROMETHEUS_PUSHGATEWAY_URL=http://your-pushgateway:9091\n"},{"location":"performance-monitoring/#3-verify-integration","title":"3. Verify Integration","text":"benchmark_time_per_op - Time per operation in nanosecondsbenchmark_memory_per_op - Memory per operation in bytesbenchmark_allocs_per_op - Allocations per operationLabels: - benchmark - Benchmark name (sanitized) - job - Always \"benchmark\" - instance - GitHub Actions run ID
benchmark_time_per_op{benchmark=\"BenchmarkAPIServerCreateJobSimple\"} 42653\nbenchmark_memory_per_op{benchmark=\"BenchmarkAPIServerCreateJobSimple\"} 13518\nbenchmark_allocs_per_op{benchmark=\"BenchmarkAPIServerCreateJobSimple\"} 98\n"},{"location":"performance-monitoring/#usage","title":"Usage","text":""},{"location":"performance-monitoring/#manual-benchmark-execution","title":"Manual Benchmark Execution","text":"# Run benchmarks locally\nmake benchmark\n\n# View results in console\ngo test -bench=. -benchmem ./tests/benchmarks/...\n"},{"location":"performance-monitoring/#automated-monitoring","title":"Automated Monitoring","text":"The system automatically runs benchmarks on:
Edit monitoring/prometheus.yml to adjust:
scrape_configs:\n - job_name: 'benchmark'\n static_configs:\n - targets: ['pushgateway:9091']\n metrics_path: /metrics\n honor_labels: true\n scrape_interval: 15s\n"},{"location":"performance-monitoring/#grafana-dashboard","title":"Grafana Dashboard","text":"Customize the dashboard in monitoring/dashboards/performance-dashboard.json:
Check GitHub Actions logs
GitHub Actions workflow failing
PROMETHEUS_PUSHGATEWAY_URL secretReview benchmark execution logs
Pushgateway not receiving metrics
# Check running services\ndocker ps --filter \"name=monitoring\"\n\n# View Pushgateway metrics\ncurl http://localhost:9091/metrics\n\n# Check Prometheus targets\ncurl http://localhost:9090/api/v1/targets\n\n# Test manual metric push\necho \"test_metric 123\" | curl --data-binary @- http://localhost:9091/metrics/job/test\n"},{"location":"performance-monitoring/#best-practices","title":"Best Practices","text":""},{"location":"performance-monitoring/#benchmark-naming","title":"Benchmark Naming","text":"Use consistent naming conventions: - BenchmarkAPIServerCreateJob - BenchmarkMLExperimentTraining - BenchmarkDatasetOperations
Set up Grafana alerts for: - Performance regressions (>10% degradation) - Missing benchmark data - High memory allocation rates
"},{"location":"performance-monitoring/#retention","title":"Retention","text":"Configure appropriate retention periods: - Raw metrics: 30 days - Aggregated data: 1 year - Dashboard snapshots: Permanent
"},{"location":"performance-monitoring/#integration-with-existing-workflows","title":"Integration with Existing Workflows","text":"The benchmark monitoring integrates seamlessly with:
Potential improvements:
For issues:
Last updated: December 2024
"},{"location":"performance-quick-start/","title":"Performance Monitoring Quick Start","text":"Get started with performance monitoring in 5 minutes.
"},{"location":"performance-quick-start/#prerequisites","title":"Prerequisites","text":"make monitoring-performance\n This starts: - Grafana: http://localhost:3001 (admin/admin) - Pushgateway: http://localhost:9091 - Loki: http://localhost:3100
"},{"location":"performance-quick-start/#2-run-benchmarks","title":"2. Run Benchmarks","text":"# Run benchmarks locally\nmake benchmark\n\n# Or run with detailed output\ngo test -bench=. -benchmem ./tests/benchmarks/...\n"},{"location":"performance-quick-start/#3-cpu-profiling","title":"3. CPU Profiling","text":""},{"location":"performance-quick-start/#http-load-test-profiling","title":"HTTP Load Test Profiling","text":"# CPU profile MediumLoad HTTP test (with rate limiting)\nmake profile-load\n\n# CPU profile MediumLoad HTTP test (no rate limiting - recommended for profiling)\nmake profile-load-norate\n This generates cpu_load.out which you can analyze with:
# View interactive profile\ngo tool pprof cpu_load.out\n\n# View flame graph in pprof's web UI\ngo tool pprof -http=:8080 cpu_load.out\n\n# View top functions\ngo tool pprof -top cpu_load.out\n"},{"location":"performance-quick-start/#websocket-queue-profiling","title":"WebSocket Queue Profiling","text":"# CPU profile WebSocket \u2192 Redis queue \u2192 worker path\nmake profile-ws-queue\n Generates cpu_ws.out for WebSocket performance analysis.
profile-load-norate for cleaner CPU profiles (no rate limiting delays)Open Grafana dashboard: http://localhost:3001
Navigate to the Performance Dashboard to see: - Real-time benchmark results - Historical trends - Performance comparisons
"},{"location":"performance-quick-start/#5-enable-cicd-integration","title":"5. Enable CI/CD Integration","text":"Add GitHub secret:
PROMETHEUS_PUSHGATEWAY_URL=http://your-pushgateway:9091\n Now benchmarks run automatically on: - Every push to main/develop - Pull requests - Daily schedule
"},{"location":"performance-quick-start/#6-verify-integration","title":"6. Verify Integration","text":"benchmark_time_per_op - Execution timebenchmark_memory_per_op - Memory usagebenchmark_allocs_per_op - Allocation countNo metrics in Grafana?
# Check services\ndocker ps --filter \"name=monitoring\"\n\n# Check Pushgateway\ncurl http://localhost:9091/metrics\n Workflow failing? - Verify GitHub secret configuration - Check workflow logs in GitHub Actions
Profiling issues?
# Flag error like \"flag provided but not defined: -test.paniconexit0\"\n# This should be fixed now, but if it persists:\ngo test ./tests/load -run TestLoadProfile_Medium -count=1 -cpuprofile cpu_load.out -v -args -profile-norate\n\n# Redis not available?\n# Start Redis for profiling tests:\ndocker run -d -p 6379:6379 redis:alpine\n\n# Check profile file generated\nls -la cpu_load.out\n"},{"location":"performance-quick-start/#9-next-steps","title":"9. Next Steps","text":"Ready in 5 minutes!
"},{"location":"production-monitoring/","title":"Production Monitoring Deployment Guide (Linux)","text":"This guide covers deploying the monitoring stack (Prometheus, Grafana, Loki, Promtail) on Linux production servers.
"},{"location":"production-monitoring/#architecture","title":"Architecture","text":"Testing: Docker Compose (macOS/Linux) Production: Podman + systemd (Linux)
Important: Docker is for testing only. Podman is used for running actual ML experiments in production.
Dev (Testing): Docker Compose Prod (Experiments): Podman + systemd
Each service runs as a separate Podman container managed by systemd for automatic restarts and proper lifecycle management.
"},{"location":"production-monitoring/#prerequisites","title":"Prerequisites","text":"Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution
scripts/setup-prod.sh)cd /path/to/fetch_ml\nsudo ./scripts/setup-monitoring-prod.sh /data/monitoring ml-user ml-group\n This will: - Create directory structure at /data/monitoring - Copy configuration files to /etc/fetch_ml/monitoring - Create systemd services for each component - Set up firewall rules
# Start all monitoring services\nsudo systemctl start prometheus\nsudo systemctl start loki\nsudo systemctl start promtail\nsudo systemctl start grafana\n\n# Enable on boot\nsudo systemctl enable prometheus loki promtail grafana\n"},{"location":"production-monitoring/#3-access-grafana","title":"3. Access Grafana","text":"http://YOUR_SERVER_IP:3000adminadmin (change on first login)Dashboards will auto-load: - ML Task Queue Monitoring (metrics) - Application Logs (Loki logs)
"},{"location":"production-monitoring/#service-details","title":"Service Details","text":""},{"location":"production-monitoring/#prometheus","title":"Prometheus","text":"/etc/fetch_ml/monitoring/prometheus.yml/data/monitoring/prometheus/etc/fetch_ml/monitoring/loki-config.yml/data/monitoring/loki/etc/fetch_ml/monitoring/promtail-config.yml/var/log/fetch_ml/*.log/etc/fetch_ml/monitoring/grafana/provisioning/data/monitoring/grafana/var/lib/grafana/dashboards# Check status\nsudo systemctl status prometheus grafana loki promtail\n\n# View logs\nsudo journalctl -u prometheus -f\nsudo journalctl -u grafana -f\nsudo journalctl -u loki -f\nsudo journalctl -u promtail -f\n\n# Restart services\nsudo systemctl restart prometheus\nsudo systemctl restart grafana\n\n# Stop all monitoring\nsudo systemctl stop prometheus grafana loki promtail\n"},{"location":"production-monitoring/#data-retention","title":"Data Retention","text":""},{"location":"production-monitoring/#prometheus_1","title":"Prometheus","text":"Default: 15 days. Edit /etc/fetch_ml/monitoring/prometheus.yml:
# Note: Prometheus retention is set via a startup flag, not in prometheus.yml;\n# add this to the Prometheus service's command-line arguments:\n--storage.tsdb.retention.time=30d\n"},{"location":"production-monitoring/#loki_1","title":"Loki","text":"Default: 30 days. Edit /etc/fetch_ml/monitoring/loki-config.yml:
limits_config:\n retention_period: 30d\n"},{"location":"production-monitoring/#security","title":"Security","text":""},{"location":"production-monitoring/#firewall","title":"Firewall","text":"The setup script automatically configures firewall rules using the detected firewall manager (firewalld or ufw).
For manual firewall configuration:
RHEL/Rocky/Fedora (firewalld):
# Remove public access\nsudo firewall-cmd --permanent --remove-port=3000/tcp\nsudo firewall-cmd --permanent --remove-port=9090/tcp\n\n# Add specific source\nsudo firewall-cmd --permanent --add-rich-rule='rule family=\"ipv4\" source address=\"10.0.0.0/24\" port port=\"3000\" protocol=\"tcp\" accept'\nsudo firewall-cmd --reload\n Ubuntu/Debian (ufw):
# Remove public access\nsudo ufw delete allow 3000/tcp\nsudo ufw delete allow 9090/tcp\n\n# Add specific source\nsudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp\n"},{"location":"production-monitoring/#authentication","title":"Authentication","text":"Change Grafana admin password: 1. Login to Grafana 2. User menu \u2192 Profile \u2192 Change Password
"},{"location":"production-monitoring/#tls-optional","title":"TLS (Optional)","text":"For HTTPS, configure reverse proxy (nginx/Apache) in front of Grafana.
"},{"location":"production-monitoring/#troubleshooting","title":"Troubleshooting","text":""},{"location":"production-monitoring/#grafana-shows-no-data","title":"Grafana shows no data","text":"# Check if Prometheus is reachable\ncurl http://localhost:9090/-/healthy\n\n# Check datasource in Grafana\n# Settings \u2192 Data Sources \u2192 Prometheus \u2192 Save & Test\n"},{"location":"production-monitoring/#loki-not-receiving-logs","title":"Loki not receiving logs","text":"# Check Promtail is running\nsudo systemctl status promtail\n\n# Verify log file exists\nls -l /var/log/fetch_ml/\n\n# Check Promtail can reach Loki\ncurl http://localhost:3100/ready\n"},{"location":"production-monitoring/#podman-containers-not-starting","title":"Podman containers not starting","text":"# Check pod status\nsudo -u ml-user podman pod ps\nsudo -u ml-user podman ps -a\n\n# Remove and recreate\nsudo -u ml-user podman pod stop monitoring\nsudo -u ml-user podman pod rm monitoring\nsudo systemctl restart prometheus\n"},{"location":"production-monitoring/#backup","title":"Backup","text":"# Backup Grafana dashboards and data\nsudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana\n\n# Backup Prometheus data\nsudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus\n"},{"location":"production-monitoring/#updates","title":"Updates","text":"# Pull latest images\nsudo -u ml-user podman pull docker.io/grafana/grafana:latest\nsudo -u ml-user podman pull docker.io/prom/prometheus:latest\nsudo -u ml-user podman pull docker.io/grafana/loki:latest\nsudo -u ml-user podman pull docker.io/grafana/promtail:latest\n\n# Restart services to use new images\nsudo systemctl restart grafana prometheus loki promtail\n"},{"location":"queue/","title":"Task Queue Architecture","text":"The task queue system enables reliable job processing between the API server and workers using Redis.
"},{"location":"queue/#overview","title":"Overview","text":"graph LR\n CLI[CLI/Client] -->|WebSocket| API[API Server]\n API -->|Enqueue| Redis[(Redis)]\n Redis -->|Dequeue| Worker[Worker]\n Worker -->|Update Status| Redis\n"},{"location":"queue/#components","title":"Components","text":""},{"location":"queue/#taskqueue-internalqueue","title":"TaskQueue (internal/queue)","text":"Shared package used by both API server and worker for job management.
"},{"location":"queue/#task-structure","title":"Task Structure","text":"type Task struct {\n ID string // Unique task ID (UUID)\n JobName string // User-defined job name \n Args string // Job arguments\n Status string // queued, running, completed, failed\n Priority int64 // Higher = executed first\n CreatedAt time.Time \n StartedAt *time.Time \n EndedAt *time.Time \n WorkerID string \n Error string \n Datasets []string \n Metadata map[string]string // commit_id, user, etc\n}\n"},{"location":"queue/#taskqueue-interface","title":"TaskQueue Interface","text":"// Initialize queue\nqueue, err := queue.NewTaskQueue(queue.Config{\n RedisAddr: \"localhost:6379\",\n RedisPassword: \"\",\n RedisDB: 0,\n})\n\n// Add task (API server)\ntask := &queue.Task{\n ID: uuid.New().String(),\n JobName: \"train-model\",\n Status: \"queued\",\n Priority: 5,\n Metadata: map[string]string{\n \"commit_id\": commitID,\n \"user\": username,\n },\n}\nerr = queue.AddTask(task)\n\n// Get next task (Worker)\ntask, err := queue.GetNextTask()\n\n// Update task status\ntask.Status = \"running\"\nerr = queue.UpdateTask(task)\n"},{"location":"queue/#data-flow","title":"Data Flow","text":""},{"location":"queue/#job-submission-flow","title":"Job Submission Flow","text":"sequenceDiagram\n participant CLI\n participant API\n participant Redis\n participant Worker\n\n CLI->>API: Queue Job (WebSocket)\n API->>API: Create Task (UUID)\n API->>Redis: ZADD task:queue\n API->>Redis: SET task:{id}\n API->>CLI: Success Response\n\n Worker->>Redis: ZPOPMAX task:queue\n Redis->>Worker: Task ID\n Worker->>Redis: GET task:{id}\n Redis->>Worker: Task Data\n Worker->>Worker: Execute Job\n Worker->>Redis: Update Status\n"},{"location":"queue/#protocol","title":"Protocol","text":"CLI \u2192 API (Binary WebSocket):
[opcode:1][api_key_hash:64][commit_id:64][priority:1][job_name_len:1][job_name:var]\n API \u2192 Redis: - Priority queue: ZADD task:queue {priority} {task_id} - Task data: SET task:{id} {json} - Status: HSET task:status:{job_name} ...
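To make the frame layout concrete, here is a rough Go sketch of a decoder for the CLI → API payload. The field widths follow the diagram above (hex-encoded SHA-256 digests are 64 bytes); `parseQueueJobFrame` and the struct name are illustrative, not taken from the codebase.

```go
package main

import (
	"errors"
	"fmt"
)

// QueueJobFrame mirrors the documented layout:
// [opcode:1][api_key_hash:64][commit_id:64][priority:1][job_name_len:1][job_name:var]
type QueueJobFrame struct {
	Opcode     byte
	APIKeyHash string // 64 hex chars (SHA-256)
	CommitID   string // 64 hex chars
	Priority   uint8
	JobName    string
}

const headerLen = 1 + 64 + 64 + 1 + 1 // fixed-width prefix before job_name

func parseQueueJobFrame(p []byte) (*QueueJobFrame, error) {
	if len(p) < headerLen {
		return nil, errors.New("frame too short")
	}
	nameLen := int(p[130]) // job_name_len sits right after priority
	if len(p) < headerLen+nameLen {
		return nil, errors.New("truncated job name")
	}
	return &QueueJobFrame{
		Opcode:     p[0],
		APIKeyHash: string(p[1:65]),
		CommitID:   string(p[65:129]),
		Priority:   p[129],
		JobName:    string(p[131 : 131+nameLen]),
	}, nil
}

func main() {
	// Build a sample frame: opcode 1, zeroed hashes, priority 5.
	frame := append([]byte{0x01}, make([]byte, 128)...)
	frame = append(frame, 5, byte(len("train-model")))
	frame = append(frame, []byte("train-model")...)
	f, err := parseQueueJobFrame(frame)
	if err != nil {
		panic(err)
	}
	fmt.Println(f.JobName, f.Priority) // prints "train-model 5"
}
```

Because every field before `job_name` is fixed-width, the offsets can be computed once rather than read field by field.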
Worker \u2190 Redis: - Poll: ZPOPMAX task:queue 1 (highest priority first) - Fetch: GET task:{id}
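The "highest priority first" behavior of ZPOPMAX can be mimicked in a few lines of plain Go, which is handy for reasoning about the ordering without a Redis instance. `miniZSet` is a teaching sketch only; note that Redis breaks score ties lexicographically by member, while this sketch uses insertion order.

```go
package main

import (
	"fmt"
	"sort"
)

type member struct {
	score int64
	id    string
}

// miniZSet imitates the two sorted-set operations the queue relies on:
// ZADD (insert with score) and ZPOPMAX (remove the highest-scored member).
type miniZSet struct{ members []member }

func (z *miniZSet) ZAdd(score int64, id string) {
	z.members = append(z.members, member{score, id})
}

func (z *miniZSet) ZPopMax() (string, bool) {
	if len(z.members) == 0 {
		return "", false
	}
	// Stable sort ascending by score, then pop the last (highest) entry.
	sort.SliceStable(z.members, func(i, j int) bool { return z.members[i].score < z.members[j].score })
	top := z.members[len(z.members)-1]
	z.members = z.members[:len(z.members)-1]
	return top.id, true
}

func main() {
	q := &miniZSet{}
	q.ZAdd(5, "uuid-2")
	q.ZAdd(10, "uuid-1")
	id, _ := q.ZPopMax()
	fmt.Println(id) // prints "uuid-1": highest priority first
}
```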
task:queue # ZSET: priority queue\ntask:{uuid} # STRING: task JSON data\ntask:status:{job_name} # HASH: job status\nworker:heartbeat # HASH: worker health\njob:metrics:{job_name} # HASH: job metrics\n"},{"location":"queue/#priority-queue-zset","title":"Priority Queue (ZSET)","text":"ZADD task:queue 10 \"uuid-1\" # Priority 10\nZADD task:queue 5 \"uuid-2\" # Priority 5\nZPOPMAX task:queue 1 # Returns uuid-1 (highest)\n"},{"location":"queue/#api-server-integration","title":"API Server Integration","text":""},{"location":"queue/#initialization","title":"Initialization","text":"// cmd/api-server/main.go\nqueueCfg := queue.Config{\n RedisAddr: cfg.Redis.Addr,\n RedisPassword: cfg.Redis.Password,\n RedisDB: cfg.Redis.DB,\n}\ntaskQueue, err := queue.NewTaskQueue(queueCfg)\n"},{"location":"queue/#websocket-handler","title":"WebSocket Handler","text":"// internal/api/ws.go\nfunc (h *WSHandler) handleQueueJob(conn *websocket.Conn, payload []byte) error {\n // Parse request\n apiKeyHash, commitID, priority, jobName := parsePayload(payload)\n\n // Create task with unique ID\n taskID := uuid.New().String()\n task := &queue.Task{\n ID: taskID,\n JobName: jobName,\n Status: \"queued\",\n Priority: int64(priority),\n Metadata: map[string]string{\n \"commit_id\": commitID,\n \"user\": user,\n },\n }\n\n // Enqueue\n if err := h.queue.AddTask(task); err != nil {\n return h.sendErrorPacket(conn, ErrorCodeDatabaseError, ...)\n }\n\n return h.sendSuccessPacket(conn, \"Job queued\")\n}\n"},{"location":"queue/#worker-integration","title":"Worker Integration","text":""},{"location":"queue/#task-polling","title":"Task Polling","text":"// cmd/worker/worker_server.go\nfunc (w *Worker) Start() error {\n for {\n task, err := w.queue.WaitForNextTask(ctx, 5*time.Second)\n if task != nil {\n go w.executeTask(task)\n }\n }\n}\n"},{"location":"queue/#task-execution","title":"Task Execution","text":"func (w *Worker) executeTask(task *queue.Task) {\n // Update status\n task.Status = \"running\"\n 
now := time.Now()\n 
task.StartedAt = &now\n w.queue.UpdateTaskWithMetrics(task, \"start\")\n\n // Execute\n err := w.runJob(task)\n\n // Finalize\n task.Status = \"completed\" // or \"failed\"\n task.EndedAt = &endTime\n task.Error = err.Error() // if err != nil\n w.queue.UpdateTaskWithMetrics(task, \"final\")\n}\n"},{"location":"queue/#configuration","title":"Configuration","text":""},{"location":"queue/#api-server-configsconfigyaml","title":"API Server (configs/config.yaml)","text":"redis:\n addr: \"localhost:6379\"\n password: \"\"\n db: 0\n"},{"location":"queue/#worker-configsworker-configyaml","title":"Worker (configs/worker-config.yaml)","text":"redis:\n addr: \"localhost:6379\"\n password: \"\"\n db: 0\n\nmetrics_flush_interval: 500ms\n"},{"location":"queue/#monitoring","title":"Monitoring","text":""},{"location":"queue/#queue-depth","title":"Queue Depth","text":"depth, err := queue.QueueDepth()\nfmt.Printf(\"Pending tasks: %d\\n\", depth)\n"},{"location":"queue/#worker-heartbeat","title":"Worker Heartbeat","text":"// Worker sends heartbeat every 30s\nerr := queue.Heartbeat(workerID)\n"},{"location":"queue/#metrics","title":"Metrics","text":"HGETALL job:metrics:{job_name}\n# Returns: timestamp, tasks_start, tasks_final, etc\n"},{"location":"queue/#error-handling","title":"Error Handling","text":""},{"location":"queue/#task-failures","title":"Task Failures","text":"if err := w.runJob(task); err != nil {\n task.Status = \"failed\"\n task.Error = err.Error()\n w.queue.UpdateTask(task)\n}\n"},{"location":"queue/#redis-connection-loss","title":"Redis Connection Loss","text":"// TaskQueue automatically reconnects\n// Workers should implement retry logic\nfor retries := 0; retries < 3; retries++ {\n task, err := queue.GetNextTask()\n if err == nil {\n break\n }\n time.Sleep(backoff)\n}\n"},{"location":"queue/#testing","title":"Testing","text":"// tests using miniredis\ns, _ := miniredis.Run()\ndefer s.Close()\n\ntq, _ := queue.NewTaskQueue(queue.Config{\n RedisAddr: 
s.Addr(),\n})\n\ntask := &queue.Task{ID: \"test-1\", JobName: \"test\"}\ntq.AddTask(task)\n\nfetched, _ := tq.GetNextTask()\n// assert fetched.ID == \"test-1\"\n"},{"location":"queue/#best-practices","title":"Best Practices","text":"For implementation details, see: - internal/queue/task.go - internal/queue/queue.go
"},{"location":"quick-start/","title":"Quick Start","text":"Get Fetch ML running in minutes with Docker Compose.
"},{"location":"quick-start/#prerequisites","title":"Prerequisites","text":"Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution
# Clone and start\ngit clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\ndocker-compose up -d  # testing only\n\n# Wait for services (30 seconds)\nsleep 30\n\n# Verify setup\ncurl http://localhost:9101/health\n"},{"location":"quick-start/#first-experiment","title":"First Experiment","text":"# Submit a simple ML job (see [First Experiment](first-experiment.md) for details)\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n -H \"Content-Type: application/json\" \\\n -H \"X-API-Key: admin\" \\\n -d '{\n \"job_name\": \"hello-world\",\n \"args\": \"--echo Hello World\",\n \"priority\": 1\n }'\n\n# Check job status\ncurl http://localhost:9101/api/v1/jobs \\\n -H \"X-API-Key: admin\"\n"},{"location":"quick-start/#cli-access","title":"CLI Access","text":"# Build CLI\ncd cli && zig build dev\n\n# List jobs\n./cli/zig-out/dev/ml --server http://localhost:9101 list-jobs\n\n# Submit new job\n./cli/zig-out/dev/ml --server http://localhost:9101 submit \\\n --name \"test-job\" --args \"--epochs 10\"\n"},{"location":"quick-start/#related-documentation","title":"Related Documentation","text":"Services not starting?
# Check logs\ndocker-compose logs\n\n# Restart services\ndocker-compose down && docker-compose up -d  # testing only\n API not responding?
# Check health\ncurl http://localhost:9101/health\n\n# Verify ports\ndocker-compose ps\n Permission denied?
# Check API key\ncurl -H \"X-API-Key: admin\" http://localhost:9101/api/v1/jobs\n"},{"location":"redis-ha/","title":"Redis High Availability","text":"Note: This is optional for homelab setups. Single Redis instance is sufficient for most use cases.
"},{"location":"redis-ha/#when-you-need-ha","title":"When You Need HA","text":"Consider Redis HA if: - Running production workloads - Uptime > 99.9% required - Can't afford to lose queued tasks - Multiple workers across machines
"},{"location":"redis-ha/#redis-sentinel-recommended","title":"Redis Sentinel (Recommended)","text":""},{"location":"redis-ha/#setup","title":"Setup","text":"# docker-compose.yml\nversion: '3.8'\nservices:\n redis-master:\n image: redis:7-alpine\n command: redis-server --maxmemory 2gb\n\n redis-replica:\n image: redis:7-alpine\n command: redis-server --slaveof redis-master 6379\n\n redis-sentinel-1:\n image: redis:7-alpine\n command: redis-sentinel /etc/redis/sentinel.conf\n volumes:\n - ./sentinel.conf:/etc/redis/sentinel.conf\n sentinel.conf:
sentinel monitor mymaster redis-master 6379 2\nsentinel down-after-milliseconds mymaster 5000\nsentinel parallel-syncs mymaster 1\nsentinel failover-timeout mymaster 10000\n"},{"location":"redis-ha/#application-configuration","title":"Application Configuration","text":"# worker-config.yaml\nredis_addr: \"redis-sentinel-1:26379,redis-sentinel-2:26379\"\nredis_master_name: \"mymaster\"\n"},{"location":"redis-ha/#redis-cluster-advanced","title":"Redis Cluster (Advanced)","text":"For larger deployments with sharding needs.
# Minimum 3 masters + 3 replicas\nservices:\n redis-1:\n image: redis:7-alpine\n command: redis-server --cluster-enabled yes\n\n redis-2:\n # ... similar config\n"},{"location":"redis-ha/#homelab-alternative-persistence-only","title":"Homelab Alternative: Persistence Only","text":"For most homelabs, just enable persistence:
# docker-compose.yml\nservices:\n redis:\n image: redis:7-alpine\n command: redis-server --appendonly yes\n volumes:\n - redis_data:/data\n\nvolumes:\n redis_data:\n This ensures tasks survive Redis restarts without full HA complexity.
Recommendation: Start simple. Add HA only if you experience actual downtime issues.
"},{"location":"release-checklist/","title":"Release Checklist","text":"This checklist captures the work required before cutting a release that includes the graceful worker shutdown feature.
"},{"location":"release-checklist/#1-code-hygiene-compilation","title":"1. Code Hygiene / Compilation","text":"Worker redeclared errors (see cmd/worker/worker_graceful_shutdown.go and cmd/worker/worker_server.go).logger, queue, cfg, metrics).go build ./cmd/worker succeeds without undefined-field errors.shutdownCh, activeTasks, and gracefulWait during worker start-up.heartbeatLoop, releaseAllLeases).executeTaskWithLease with the real executeTask signature so the \"no value used as value\" compile error disappears.cmd/worker/worker_server.go wires up config, queue, metrics, and logger instances used by the shutdown logic.go test ./cmd/worker/... and make test (or equivalent) pass locally.This document outlines security features, best practices, and hardening procedures for FetchML.
"},{"location":"security/#security-features","title":"Security Features","text":""},{"location":"security/#authentication-authorization","title":"Authentication & Authorization","text":"Generate Strong Passwords
# Grafana admin password\nopenssl rand -base64 32 > .grafana-password\n\n# Redis password\nopenssl rand -base64 32\n Configure Environment Variables
cp .env.example .env\n# Edit .env and set:\n# - GRAFANA_ADMIN_PASSWORD\n Enable TLS (Production only)
# configs/config-prod.yaml\nserver:\n tls:\n enabled: true\n cert_file: \"/secrets/cert.pem\"\n key_file: \"/secrets/key.pem\"\n Configure Firewall
# Allow only necessary ports\nsudo ufw allow 22/tcp # SSH\nsudo ufw allow 443/tcp # HTTPS\nsudo ufw allow 80/tcp # HTTP (redirect to HTTPS)\nsudo ufw enable\n Restrict IP Access
# configs/config-prod.yaml\nauth:\n ip_whitelist:\n - \"10.0.0.0/8\"\n - \"192.168.0.0/16\"\n - \"127.0.0.1\"\n Enable Audit Logging
logging:\n level: \"info\"\n audit: true\n file: \"/var/log/fetch_ml/audit.log\"\n Harden Redis
# Redis security\nredis-cli CONFIG SET requirepass \"your-strong-password\"\nredis-cli CONFIG SET rename-command FLUSHDB \"\"\nredis-cli CONFIG SET rename-command FLUSHALL \"\"\n Secure Grafana
# Change default admin password\ndocker-compose exec grafana grafana-cli admin reset-admin-password new-strong-password\n Regular Updates
# Update system packages\nsudo apt update && sudo apt upgrade -y\n\n# Update containers\ndocker-compose pull\ndocker-compose up -d  # testing only\n # Method 1: OpenSSL\nopenssl rand -base64 32\n\n# Method 2: pwgen (if installed)\npwgen -s 32 1\n\n# Method 3: /dev/urandom\nhead -c 32 /dev/urandom | base64\n"},{"location":"security/#store-passwords-securely","title":"Store Passwords Securely","text":"Development: Use .env file (gitignored)
echo \"REDIS_PASSWORD=$(openssl rand -base64 32)\" >> .env\necho \"GRAFANA_ADMIN_PASSWORD=$(openssl rand -base64 32)\" >> .env\n Production: Use systemd environment files
sudo mkdir -p /etc/fetch_ml/secrets\nsudo chmod 700 /etc/fetch_ml/secrets\necho \"REDIS_PASSWORD=...\" | sudo tee /etc/fetch_ml/secrets/redis.env\nsudo chmod 600 /etc/fetch_ml/secrets/redis.env\n"},{"location":"security/#api-key-management","title":"API Key Management","text":""},{"location":"security/#generate-api-keys","title":"Generate API Keys","text":"# Generate random API key\nopenssl rand -hex 32\n\n# Hash for storage\necho -n \"your-api-key\" | sha256sum\n"},{"location":"security/#rotate-api-keys","title":"Rotate API Keys","text":"config-local.yaml with new hashRemove user entry from config-local.yaml:
auth:\n apikeys:\n # user_to_revoke: # Comment out or delete\n"},{"location":"security/#network-security_1","title":"Network Security","text":""},{"location":"security/#production-network-topology","title":"Production Network Topology","text":"Internet\n \u2193\n[Firewall] (ports 3000, 9102)\n \u2193\n[Reverse Proxy] (nginx/Apache) - TLS termination\n \u2193\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Application Pod \u2502\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 API Server \u2502 \u2502 \u2190 Public (via reverse proxy)\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 Redis \u2502 \u2502 \u2190 Internal only\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 Grafana \u2502 \u2502 \u2190 Public (via reverse proxy)\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 Prometheus \u2502 \u2502 \u2190 Internal only\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 Loki \u2502 \u2502 \u2190 Internal only\n\u2502 
\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n"},{"location":"security/#recommended-firewall-rules","title":"Recommended Firewall Rules","text":"# Allow only necessary inbound connections\nsudo firewall-cmd --permanent --zone=public --add-rich-rule='\n rule family=\"ipv4\"\n source address=\"YOUR_NETWORK\"\n port port=\"3000\" protocol=\"tcp\" accept'\n\nsudo firewall-cmd --permanent --zone=public --add-rich-rule='\n rule family=\"ipv4\"\n source address=\"YOUR_NETWORK\"\n port port=\"9102\" protocol=\"tcp\" accept'\n\n# Block all other traffic\nsudo firewall-cmd --permanent --set-default-zone=drop\nsudo firewall-cmd --reload\n"},{"location":"security/#incident-response","title":"Incident Response","text":""},{"location":"security/#suspected-breach","title":"Suspected Breach","text":"Review audit logs
Investigation
# Check recent logins\nsudo journalctl -u fetchml-api --since \"1 hour ago\"\n\n# Review failed auth attempts\ngrep \"authentication failed\" /var/log/fetch_ml/*.log\n\n# Check active connections\nss -tnp | grep :9102\n Recovery
# Monitor failed authentication\ntail -f /var/log/fetch_ml/api.log | grep \"auth.*failed\"\n\n# Monitor unusual activity\njournalctl -u fetchml-api -f | grep -E \"(ERROR|WARN)\"\n\n# Check open ports\nnmap -p- localhost\n"},{"location":"security/#security-best-practices","title":"Security Best Practices","text":"All API access is logged with: - Timestamp - User/API key - Action performed - Source IP - Result (success/failure)
"},{"location":"security/#getting-help","title":"Getting Help","text":"This document describes Fetch ML's smart defaults system, which automatically adapts configuration based on the runtime environment.
"},{"location":"smart-defaults/#overview","title":"Overview","text":"Smart defaults eliminate the need for manual configuration tweaks when running in different environments:
The system automatically detects the environment based on:
CI, GITHUB_ACTIONS, GITLAB_CI environment variables/.dockerenv, KUBERNETES_SERVICE_HOST, or CONTAINER variablesFETCH_ML_ENV=production or ENV=productionlocalhosthost.docker.internal (Docker Desktop/Colima)0.0.0.0~/ml-experiments/workspace/ml-experiments/var/lib/fetch_ml/experiments~/ml-data/workspace/data/var/lib/fetch_ml/datalocalhost:6379redis:6379 (service name)redis:6379~/.ssh/id_rsa and ~/.ssh/known_hosts/workspace/.ssh/id_rsa and /workspace/.ssh/known_hosts/etc/fetch_ml/ssh/id_rsa and /etc/fetch_ml/ssh/known_hostsinfodebug (verbose for debugging)info// Get smart defaults for current environment\nsmart := config.GetSmartDefaults()\n\n// Use smart defaults\nif cfg.Host == \"\" {\n cfg.Host = smart.Host()\n}\nif cfg.BasePath == \"\" {\n cfg.BasePath = smart.BasePath()\n}\n"},{"location":"smart-defaults/#environment-overrides","title":"Environment Overrides","text":"Smart defaults can be overridden with environment variables:
FETCH_ML_HOST - Override hostFETCH_ML_BASE_PATH - Override base pathFETCH_ML_REDIS_ADDR - Override Redis addressFETCH_ML_ENV - Force environment profileYou can force a specific environment:
# Force production mode\nexport FETCH_ML_ENV=production\n\n# Force container mode\nexport CONTAINER=true\n"},{"location":"smart-defaults/#implementation-details","title":"Implementation Details","text":"The smart defaults system is implemented in internal/config/smart_defaults.go:
DetectEnvironment() - Determines current environment profileSmartDefaults struct - Provides environment-aware defaultsNo changes required - existing configurations continue to work. Smart defaults only apply when values are not explicitly set.
"},{"location":"smart-defaults/#for-developers","title":"For Developers","text":"When adding new configuration options:
SmartDefaults structExample:
// Add to SmartDefaults struct\nfunc (s *SmartDefaults) NewFeature() string {\n switch s.Profile {\n case ProfileContainer, ProfileCI:\n return \"/workspace/new-feature\"\n case ProfileProduction:\n return \"/var/lib/fetch_ml/new-feature\"\n default:\n return \"./new-feature\"\n }\n}\n\n// Use in config loader\nif cfg.NewFeature == \"\" {\n cfg.NewFeature = smart.NewFeature()\n}\n"},{"location":"smart-defaults/#testing","title":"Testing","text":"To test different environments:
# Test local defaults (default)\n./bin/worker\n\n# Test container defaults\nexport CONTAINER=true\n./bin/worker\n\n# Test CI defaults\nexport CI=true\n./bin/worker\n\n# Test production defaults\nexport FETCH_ML_ENV=production\n./bin/worker\n"},{"location":"smart-defaults/#troubleshooting","title":"Troubleshooting","text":""},{"location":"smart-defaults/#wrong-environment-detection","title":"Wrong Environment Detection","text":"Check environment variables:
echo \"CI: $CI\"\necho \"CONTAINER: $CONTAINER\"\necho \"FETCH_ML_ENV: $FETCH_ML_ENV\"\n"},{"location":"smart-defaults/#path-issues","title":"Path Issues","text":"Smart defaults expand ~ and environment variables automatically. If paths don't work as expected:
config.GetSmartDefaults().GetEnvironmentDescription()For container environments, ensure: - Redis service is named redis in docker-compose - Host networking is configured properly - host.docker.internal resolves (Docker Desktop/Colima)
How to run and write tests for FetchML.
"},{"location":"testing/#running-tests","title":"Running Tests","text":""},{"location":"testing/#quick-test","title":"Quick Test","text":"# All tests\nmake test\n\n# Unit tests only\nmake test-unit\n\n# Integration tests\nmake test-integration\n\n# With coverage\nmake test-coverage\n\n\n## Quick Test\n```bash\nmake test # All tests\nmake test-unit # Unit only\n.\nmake test.\nmake test$\nmake test; make test # Coverage\n # E2E tests\n"},{"location":"testing/#docker-testing","title":"Docker Testing","text":"docker-compose up -d (testing only)\nmake test\ndocker-compose down\n"},{"location":"testing/#cli-testing","title":"CLI Testing","text":"cd cli && zig build dev\n./cli/zig-out/dev/ml --help\nzig build test\n"},{"location":"troubleshooting/","title":"Troubleshooting","text":"Common issues and solutions for Fetch ML.
"},{"location":"troubleshooting/#quick-fixes","title":"Quick Fixes","text":""},{"location":"troubleshooting/#services-not-starting","title":"Services Not Starting","text":"# Check Docker status\ndocker-compose ps\n\n# Restart services\ndocker-compose down && docker-compose up -d  # testing only\n\n# Check logs\ndocker-compose logs -f\n"},{"location":"troubleshooting/#api-not-responding","title":"API Not Responding","text":"# Check health endpoint\ncurl http://localhost:9101/health\n\n# Check if port is in use\nlsof -i :9101\n\n# Kill process on port\nkill -9 $(lsof -ti :9101)\n"},{"location":"troubleshooting/#database-issues","title":"Database Issues","text":"# Check database connection\ndocker-compose exec postgres psql -U postgres -d fetch_ml\n\n# Reset database\ndocker-compose down postgres\ndocker-compose up -d postgres  # testing only\n\n# Check Redis\ndocker-compose exec redis redis-cli ping\n"},{"location":"troubleshooting/#common-errors","title":"Common Errors","text":""},{"location":"troubleshooting/#authentication-errors","title":"Authentication Errors","text":"jwt_expiry setting--migrate (see Development Setup)runtime: docker (testing only) in configresources.memory_limitgo mod tidy and cd cli && rm -rf zig-out zig-cachedocker-compose -f docker-compose.test.yml up -dcd cli && zig build dev--server and --api-keylsof -i :9101 and kill processespython3 -c \"import yaml; yaml.safe_load(open('config.yaml'))\"see [Configuration Schema](configuration-schema.md)./bin/api-server --version\ndocker-compose ps\ndocker-compose logs api-server | grep ERROR\n"},{"location":"troubleshooting/#emergency-reset","title":"Emergency Reset","text":"docker-compose down -v\nrm -rf data/ results/ *.db\ndocker-compose up -d  # testing only\n"},{"location":"user-permissions/","title":"User Permissions in Fetch ML","text":"Fetch ML now supports user-based permissions to ensure data scientists can only view and manage their own experiments while administrators retain full control.
"},{"location":"user-permissions/#overview","title":"Overview","text":"jobs:create - Create new experimentsjobs:read - View experiment status and resultsjobs:update - Cancel or modify experimentsml status\n Shows only your experiments with user context displayed."},{"location":"user-permissions/#cancel-your-jobs","title":"Cancel Your Jobs","text":"ml cancel <job-name>\n Only allows canceling your own experiments (unless you're an admin)."},{"location":"user-permissions/#authentication","title":"Authentication","text":"The CLI automatically authenticates using your API key from ~/.ml/config.toml.
[worker]\napi_key = \"your-api-key-here\"\n"},{"location":"user-permissions/#user-roles","title":"User Roles","text":"User roles and permissions are configured on the server side by administrators.
"},{"location":"user-permissions/#security-features","title":"Security Features","text":"# Submit your experiment\nml run my-experiment\n\n# Check your experiments (only shows yours)\nml status\n\n# Cancel your own experiment\nml cancel my-experiment\n"},{"location":"user-permissions/#administrator-workflow","title":"Administrator Workflow","text":"# View all experiments (admin sees everything)\nml status\n\n# Cancel any user's experiment\nml cancel user-experiment\n"},{"location":"user-permissions/#error-messages","title":"Error Messages","text":"For more details, see the architecture documentation.
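The ownership rule behind both workflows — data scientists may only cancel their own experiments, administrators may cancel any — boils down to a check like the following. This is a hypothetical sketch for illustration; the real server-side types and role names may differ.

```go
package main

import "fmt"

// User is an illustrative stand-in for the server's authenticated identity.
type User struct {
	Name  string
	Admin bool
}

// canCancel enforces the rule described above: admins can cancel any
// experiment, everyone else only their own.
func canCancel(u User, experimentOwner string) bool {
	return u.Admin || u.Name == experimentOwner
}

func main() {
	alice := User{Name: "alice"}
	admin := User{Name: "root", Admin: true}
	fmt.Println(canCancel(alice, "alice")) // own experiment: true
	fmt.Println(canCancel(alice, "bob"))   // someone else's: false
	fmt.Println(canCancel(admin, "bob"))   // admin override: true
}
```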
"},{"location":"zig-cli/","title":"Zig CLI Guide","text":"High-performance command-line interface for ML experiment management, written in Zig for maximum speed and efficiency.
"},{"location":"zig-cli/#overview","title":"Overview","text":"The Zig CLI (ml) is the primary interface for managing ML experiments in your homelab. Built with Zig, it provides exceptional performance for file operations, network communication, and experiment management.
Download from GitHub Releases:
# Download for your platform\ncurl -LO https://github.com/jfraeys/fetch_ml/releases/latest/download/ml-<platform>.tar.gz\n\n# Extract\ntar -xzf ml-<platform>.tar.gz\n\n# Install\nchmod +x ml-<platform>\nsudo mv ml-<platform> /usr/local/bin/ml\n\n# Verify\nml --help\n Platforms: - ml-linux-x86_64.tar.gz - Linux (fully static, zero dependencies) - ml-macos-x86_64.tar.gz - macOS Intel - ml-macos-arm64.tar.gz - macOS Apple Silicon
All release binaries include embedded static rsync for complete independence.
"},{"location":"zig-cli/#build-from-source","title":"Build from Source","text":"Development Build (uses system rsync):
cd cli\nzig build dev\n./zig-out/dev/ml-dev --help\n Production Build (embedded rsync):
cd cli\n# For testing: uses rsync wrapper\nzig build prod\n\n# For release with static rsync:\n# 1. Place static rsync binary at src/assets/rsync_release.bin\n# 2. Build\nzig build prod\nstrip zig-out/prod/ml # Optional: reduce size\n\n# Verify\n./zig-out/prod/ml --help\nls -lh zig-out/prod/ml\n See cli/src/assets/README.md for details on obtaining static rsync binaries.
"},{"location":"zig-cli/#verify-installation","title":"Verify Installation","text":"ml --help\nml --version # Shows build config\n"},{"location":"zig-cli/#quick-start","title":"Quick Start","text":"Initialize Configuration
./cli/zig-out/bin/ml init\n Sync Your First Project
./cli/zig-out/bin/ml sync ./my-project --queue\n Monitor Progress
./cli/zig-out/bin/ml status\n init - Configuration Setup","text":"Initialize the CLI configuration file.
ml init\n Creates: ~/.ml/config.toml
Configuration Template:
worker_host = \"worker.local\"\nworker_user = \"mluser\"\nworker_base = \"/data/ml-experiments\"\nworker_port = 22\napi_key = \"your-api-key\"\n"},{"location":"zig-cli/#sync-project-synchronization","title":"sync - Project Synchronization","text":"Sync project files to the worker with intelligent deduplication.
# Basic sync\nml sync ./project\n\n# Sync with custom name and auto-queue\nml sync ./project --name \"experiment-1\" --queue\n\n# Sync with priority\nml sync ./project --priority 8\n Options: - --name <name>: Custom experiment name - --queue: Automatically queue after sync - --priority N: Set priority (1-10, default 5)
Features: - Content-Addressed Storage: Automatic deduplication - SHA256 Commit IDs: Reliable change detection - Incremental Transfer: Only sync changed files - Rsync Backend: Efficient file transfer
"},{"location":"zig-cli/#queue-job-management","title":"queue - Job Management","text":"Queue experiments for execution on the worker.
# Queue with commit ID\nml queue my-job --commit abc123def456\n\n# Queue with priority\nml queue my-job --commit abc123 --priority 8\n Options: - --commit <id>: Commit ID from sync output - --priority N: Execution priority (1-10)
Features: - WebSocket Communication: Real-time job submission - Priority Queuing: Higher priority jobs run first - API Authentication: Secure job submission
"},{"location":"zig-cli/#watch-auto-sync-monitoring","title":"watch - Auto-Sync Monitoring","text":"Monitor directories for changes and auto-sync.
# Watch for changes\nml watch ./project\n\n# Watch and auto-queue on changes\nml watch ./project --name \"dev-exp\" --queue\n Options: - --name <name>: Custom experiment name - --queue: Auto-queue on changes - --priority N: Set priority for queued jobs
Features: - Real-time Monitoring: 2-second polling interval - Change Detection: File modification time tracking - Commit Comparison: Only sync when content changes - Automatic Queuing: Seamless development workflow
"},{"location":"zig-cli/#status-system-status","title":"status - System Status","text":"Check system and worker status.
ml status\n Displays: - Worker connectivity - Queue status - Running jobs - System health
"},{"location":"zig-cli/#monitor-remote-monitoring","title":"monitor - Remote Monitoring","text":"Launch TUI interface via SSH for real-time monitoring.
ml monitor\n Features: - Real-time Updates: Live experiment status - Interactive Interface: Browse and manage experiments - SSH Integration: Secure remote access
"},{"location":"zig-cli/#cancel-job-cancellation","title":"cancel - Job Cancellation","text":"Cancel running or queued jobs.
ml cancel job-id\n Options: - job-id: Job identifier from status output
"},{"location":"zig-cli/#prune-cleanup-management","title":"prune - Cleanup Management","text":"Clean up old experiments to save space.
# Keep last N experiments\nml prune --keep 20\n\n# Remove experiments older than N days\nml prune --older-than 30\n Options: - --keep N: Keep N most recent experiments - --older-than N: Remove experiments older than N days
Testing: Docker Compose (macOS/Linux) Production: Podman + systemd (Linux)
Important: Docker is for testing only. Podman is used for running actual ML experiments in production.
"},{"location":"zig-cli/#core-components","title":"Core Components","text":"cli/src/\n\u251c\u2500\u2500 commands/ # Command implementations\n\u2502 \u251c\u2500\u2500 init.zig # Configuration setup\n\u2502 \u251c\u2500\u2500 sync.zig # Project synchronization\n\u2502 \u251c\u2500\u2500 queue.zig # Job management\n\u2502 \u251c\u2500\u2500 watch.zig # Auto-sync monitoring\n\u2502 \u251c\u2500\u2500 status.zig # System status\n\u2502 \u251c\u2500\u2500 monitor.zig # Remote monitoring\n\u2502 \u251c\u2500\u2500 cancel.zig # Job cancellation\n\u2502 \u2514\u2500\u2500 prune.zig # Cleanup operations\n\u251c\u2500\u2500 config.zig # Configuration management\n\u251c\u2500\u2500 errors.zig # Error handling\n\u251c\u2500\u2500 net/ # Network utilities\n\u2502 \u2514\u2500\u2500 ws.zig # WebSocket client\n\u2514\u2500\u2500 utils/ # Utility functions\n \u251c\u2500\u2500 crypto.zig # Hashing and encryption\n \u251c\u2500\u2500 storage.zig # Content-addressed storage\n \u2514\u2500\u2500 rsync.zig # File synchronization\n"},{"location":"zig-cli/#performance-features","title":"Performance Features","text":""},{"location":"zig-cli/#content-addressed-storage","title":"Content-Addressed Storage","text":"# 1. Initialize project\nml sync ./project --name \"dev\" --queue\n\n# 2. Auto-sync during development\nml watch ./project --name \"dev\" --queue\n\n# 3. 
Monitor progress\nml status\n"},{"location":"zig-cli/#batch-processing","title":"Batch Processing","text":"# Process multiple experiments\nfor dir in experiments/*/; do\n ml sync \"$dir\" --queue\ndone\n"},{"location":"zig-cli/#priority-management","title":"Priority Management","text":"# High priority experiment\nml sync ./urgent --priority 10 --queue\n\n# Background processing\nml sync ./background --priority 1 --queue\n"},{"location":"zig-cli/#configuration-management","title":"Configuration Management","text":""},{"location":"zig-cli/#multiple-workers","title":"Multiple Workers","text":"# ~/.ml/config.toml\nworker_host = \"worker.local\"\nworker_user = \"mluser\"\nworker_base = \"/data/ml-experiments\"\nworker_port = 22\napi_key = \"your-api-key\"\n"},{"location":"zig-cli/#security-settings","title":"Security Settings","text":"# Set restrictive permissions\nchmod 600 ~/.ml/config.toml\n\n# Verify configuration\nml status\n"},{"location":"zig-cli/#troubleshooting","title":"Troubleshooting","text":""},{"location":"zig-cli/#common-issues","title":"Common Issues","text":""},{"location":"zig-cli/#build-problems","title":"Build Problems","text":"# Check Zig installation\nzig version\n\n# Clean build\ncd cli && make clean && make build\n"},{"location":"zig-cli/#connection-issues","title":"Connection Issues","text":"# Test SSH connectivity\nssh -p $worker_port $worker_user@$worker_host\n\n# Verify configuration\ncat ~/.ml/config.toml\n"},{"location":"zig-cli/#sync-failures","title":"Sync Failures","text":"# Check rsync\nrsync --version\n\n# Manual sync test\nrsync -avz ./test/ $worker_user@$worker_host:/tmp/\n"},{"location":"zig-cli/#performance-issues","title":"Performance Issues","text":"# Monitor resource usage\ntop -p $(pgrep ml)\n\n# Check disk space\ndf -h $worker_base\n"},{"location":"zig-cli/#debug-mode","title":"Debug Mode","text":"Enable verbose logging:
# Environment variable\nexport ML_DEBUG=1\nml sync ./project\n\n# Or use debug build\ncd cli && make debug\n"},{"location":"zig-cli/#performance-benchmarks","title":"Performance Benchmarks","text":""},{"location":"zig-cli/#file-operations","title":"File Operations","text":"cd cli\nzig build-exe src/main.zig\n"},{"location":"zig-cli/#testing","title":"Testing","text":"# Run tests\ncd cli && zig test src/\n\n# Integration tests\nzig test tests/\n"},{"location":"zig-cli/#code-style","title":"Code Style","text":"For more information, see the CLI Reference and Architecture pages.
"},{"location":"adr/","title":"Architecture Decision Records (ADRs)","text":"This directory contains Architecture Decision Records (ADRs) for the Fetch ML project.
"},{"location":"adr/#what-are-adrs","title":"What are ADRs?","text":"Architecture Decision Records are short text files, each documenting a single architectural decision. They capture the context, the options considered, the decision made, and the consequences of that decision.
"},{"location":"adr/#adr-template","title":"ADR Template","text":"Each ADR follows this structure:
# ADR-XXX: [Title]\n\n## Status\n[Proposed | Accepted | Deprecated | Superseded]\n\n## Context\n[What is the issue that we're facing that needs a decision?]\n\n## Decision\n[What is the change that we're proposing and/or doing?]\n\n## Consequences\n[What becomes easier or more difficult to do because of this change?]\n\n## Options Considered\n[What other approaches did we consider and why did we reject them?]\n"},{"location":"adr/#adr-index","title":"ADR Index","text":"ADR Title Status ADR-001 Use Go for API Server Accepted ADR-002 Use SQLite for Local Development Accepted ADR-003 Use Redis for Job Queue Accepted"},{"location":"adr/#how-to-add-a-new-adr","title":"How to Add a New ADR","text":"ADR-XXX-title.md where XXX is the next sequential numberAccepted
"},{"location":"adr/ADR-001-use-go-for-api-server/#context","title":"Context","text":"We needed to choose a programming language for the Fetch ML API server that would provide: - High performance for ML experiment management - Strong concurrency support for handling multiple experiments - Good ecosystem for HTTP APIs and WebSocket connections - Easy deployment and containerization - Strong type safety and reliability
"},{"location":"adr/ADR-001-use-go-for-api-server/#decision","title":"Decision","text":"We chose Go as the primary language for the API server implementation.
"},{"location":"adr/ADR-001-use-go-for-api-server/#consequences","title":"Consequences","text":""},{"location":"adr/ADR-001-use-go-for-api-server/#python-with-fastapi","title":"Python with FastAPI","text":"Pros: - Rich ML ecosystem (TensorFlow, PyTorch, scikit-learn) - Easy to learn and write - Great for data science teams - FastAPI provides good performance
Cons: - Global Interpreter Lock limits true parallelism - Higher memory usage - Slower performance for high-throughput scenarios - More complex deployment (multiple files, dependencies)
"},{"location":"adr/ADR-001-use-go-for-api-server/#nodejs-with-express","title":"Node.js with Express","text":"Pros: - Excellent WebSocket support - Large ecosystem - Fast development cycle
Cons: - Single-threaded event loop can be limiting - Not ideal for CPU-intensive ML operations - Dynamic typing can lead to runtime errors
"},{"location":"adr/ADR-001-use-go-for-api-server/#rust","title":"Rust","text":"Pros: - Maximum performance and memory safety - Strong type system - Growing ecosystem
Cons: - Very steep learning curve - Longer development time - Smaller ecosystem for web frameworks
"},{"location":"adr/ADR-001-use-go-for-api-server/#java-with-spring-boot","title":"Java with Spring Boot","text":"Pros: - Mature ecosystem - Good performance - Strong typing
Cons: - Higher memory usage - More verbose syntax - Slower startup time - Heavier deployment footprint
"},{"location":"adr/ADR-001-use-go-for-api-server/#rationale","title":"Rationale","text":"Go provides the best balance of performance, concurrency support, and deployment simplicity for our API server needs. The ability to handle many concurrent ML experiments efficiently with goroutines is a key advantage. The single binary deployment model also simplifies our containerization and distribution strategy.
"},{"location":"adr/ADR-002-use-sqlite-for-local-development/","title":"ADR-002: Use SQLite for Local Development","text":""},{"location":"adr/ADR-002-use-sqlite-for-local-development/#status","title":"Status","text":"Accepted
"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#context","title":"Context","text":"For local development and testing, we needed a database solution that: - Requires minimal setup and configuration - Works well with Go's database drivers - Supports the same SQL features as production databases - Allows easy reset and recreation of test data - Doesn't require external services running locally
"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#decision","title":"Decision","text":"We chose SQLite as the default database for local development and testing environments.
"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#consequences","title":"Consequences","text":""},{"location":"adr/ADR-002-use-sqlite-for-local-development/#postgresql","title":"PostgreSQL","text":"Pros: - Production-grade database - Excellent feature support - Good Go driver support - Consistent with production environment
Cons: - Requires external service installation and configuration - Higher resource usage - More complex setup for new developers - Overkill for simple local development
"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#mysql","title":"MySQL","text":"Pros: - Popular and well-supported - Good Go drivers available
Cons: - Requires external service - More complex setup - Different SQL dialect than PostgreSQL
"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#in-memory-databases-redis-etc","title":"In-memory databases (Redis, etc.)","text":"Pros: - Very fast - No persistence needed for some tests
Cons: - Limited query capabilities - Not suitable for complex relational data - Different data model than production
"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#no-database-file-based-storage","title":"No database (file-based storage)","text":"Pros: - Simple implementation - No dependencies
Cons: - Limited query capabilities - No transaction support - Hard to scale to complex data needs
"},{"location":"adr/ADR-002-use-sqlite-for-local-development/#rationale","title":"Rationale","text":"SQLite provides the perfect balance of simplicity and functionality for local development. It requires zero setup - developers can just run the application and it works. The file-based nature makes it easy to reset test data by deleting the database file. While it differs from our production PostgreSQL database, it supports the same core SQL features needed for development and testing.
The main limitation is single-writer access, but this is acceptable for local development where typically only one developer is working with the database at a time. For integration tests that need concurrent access, we can use PostgreSQL or Redis.
"},{"location":"adr/ADR-003-use-redis-for-job-queue/","title":"ADR-003: Use Redis for Job Queue","text":""},{"location":"adr/ADR-003-use-redis-for-job-queue/#status","title":"Status","text":"Accepted
"},{"location":"adr/ADR-003-use-redis-for-job-queue/#context","title":"Context","text":"For the ML experiment job queue system, we needed a solution that: - Provides reliable job queuing and distribution - Supports multiple workers consuming jobs concurrently - Offers persistence and durability - Handles job priorities and retries - Integrates well with our Go-based API server - Can scale horizontally with multiple workers
"},{"location":"adr/ADR-003-use-redis-for-job-queue/#decision","title":"Decision","text":"We chose Redis as the job queue backend using its list data structures and pub/sub capabilities.
"},{"location":"adr/ADR-003-use-redis-for-job-queue/#consequences","title":"Consequences","text":""},{"location":"adr/ADR-003-use-redis-for-job-queue/#database-backed-queue-postgresql","title":"Database-backed Queue (PostgreSQL)","text":"Pros: - No additional infrastructure - ACID transactions - Complex queries and joins possible - Integrated with primary database
Cons: - Higher latency for queue operations - Database contention under high load - More complex implementation for reliable polling - Limited scalability for high-frequency operations
"},{"location":"adr/ADR-003-use-redis-for-job-queue/#rabbitmq","title":"RabbitMQ","text":"Pros: - Purpose-built message broker - Advanced routing and filtering - Built-in acknowledgments and retries - Good clustering support
Cons: - More complex setup and configuration - Higher resource requirements - Steeper learning curve - Overkill for simple queue needs
"},{"location":"adr/ADR-003-use-redis-for-job-queue/#apache-kafka","title":"Apache Kafka","text":"Pros: - Extremely high throughput - Built-in partitioning and replication - Good for event streaming
Cons: - Complex setup and operations - Designed for streaming, not job queuing - Higher latency for individual job processing - More resource intensive
"},{"location":"adr/ADR-003-use-redis-for-job-queue/#in-memory-queuing-go-channels","title":"In-memory Queuing (Go channels)","text":"Pros: - Zero external dependencies - Very fast - Simple implementation
Cons: - No persistence (jobs lost on restart) - Limited to single process - No monitoring or observability - Not suitable for distributed systems
"},{"location":"adr/ADR-003-use-redis-for-job-queue/#rationale","title":"Rationale","text":"Redis provides the optimal balance of simplicity, performance, and reliability for our job queue needs. The list-based queue implementation (LPUSH/RPOP) is straightforward and highly performant. Redis's persistence options ensure jobs aren't lost during restarts, and the pub/sub capabilities enable real-time notifications for workers.
The Go client library is excellent and provides connection pooling, automatic reconnection, and good error handling. Redis's low memory footprint and fast operations make it ideal for high-frequency job queuing scenarios common in ML workloads.
While RabbitMQ offers more advanced features, Redis is sufficient for our current needs and much simpler to operate. The simple queue model also makes it easier to understand and debug when issues arise.
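To illustrate why the LPUSH/RPOP pattern described in the rationale is "easier to understand and debug", here is a minimal in-memory Go sketch of the same semantics: producers push onto the head of a list, workers pop from the tail, so jobs come out in FIFO order. This models only the list semantics; the real worker would talk to a live Redis server through a client library.

```go
package main

import "fmt"

// jobQueue mimics the Redis list pattern the ADR describes:
// producers LPUSH onto the head, workers RPOP from the tail,
// giving FIFO order across the list.
type jobQueue struct{ items []string }

// LPush inserts a job at the head of the list.
func (q *jobQueue) LPush(job string) {
	q.items = append([]string{job}, q.items...)
}

// RPop removes and returns the job at the tail of the list;
// ok is false when the queue is empty.
func (q *jobQueue) RPop() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	last := q.items[len(q.items)-1]
	q.items = q.items[:len(q.items)-1]
	return last, true
}

func main() {
	q := &jobQueue{}
	q.LPush("job-1")
	q.LPush("job-2")
	first, _ := q.RPop()
	second, _ := q.RPop()
	fmt.Println(first, second) // prints "job-1 job-2": first queued runs first
}
```

In production the pop side would use a blocking variant (Redis `BRPOP`) so idle workers wait on the list instead of polling, and persistence (RDB/AOF) keeps queued jobs across restarts as the rationale notes.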
"}]}