{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Fetch ML - Secure Machine Learning Platform","text":"
A secure, containerized platform for running machine learning experiments with role-based access control and comprehensive audit trails.
"},{"location":"#quick-start","title":"Quick Start","text":"New to the project? Start here!
# Clone the repository\ngit clone https://github.com/your-username/fetch_ml.git\ncd fetch_ml\n\n# Quick setup (builds everything, creates test user)\nmake quick-start\n\n# Create your API key\n./bin/user_manager --config configs/config_dev.yaml --cmd generate-key --username your_name --role data_scientist\n\n# Run your first experiment\n./bin/worker --config configs/config_dev.yaml --api-key YOUR_GENERATED_KEY\n"},{"location":"#quick-navigation","title":"Quick Navigation","text":""},{"location":"#getting-started","title":"\ud83d\ude80 Getting Started","text":"# Core commands\nmake help # See all available commands\nmake build # Build all binaries\nmake test-unit # Run tests\n\n# User management\n./bin/user_manager --config configs/config_dev.yaml --cmd generate-key --username new_user --role data_scientist\n./bin/user_manager --config configs/config_dev.yaml --cmd list-users\n\n# Run services\n./bin/worker --config configs/config_dev.yaml --api-key YOUR_KEY\n./bin/tui --config configs/config_dev.yaml\n./bin/data_manager --config configs/config_dev.yaml\n"},{"location":"#need-help","title":"Need Help?","text":"Run make help to list all commands and make test-unit to verify your setup. Happy ML experimenting!
"},{"location":"api-key-process/","title":"FetchML API Key Process","text":"This document describes how API keys are issued and how team members should configure the ml CLI to use them.
The goal is to keep access easy for your homelab while treating API keys as sensitive secrets.
"},{"location":"api-key-process/#overview","title":"Overview","text":"ml CLI to authenticate to the FetchML API.There are two supported ways to receive your key:
./scripts/create_bitwarden_fetchml_item.sh <username> <api_key> <api_key_hash>\n This script:
FetchML API \u2013 <username>. Stores:
Username: <username>, password: <api_key> (the actual API key), and a note containing api_key_hash: <api_key_hash>. Share that item with the user in Bitwarden (for example, via a shared collection like FetchML).
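To make concrete how the api_key_hash in the note relates to the key in the password field, here is a sketch. It assumes the server stores SHA-256 hashes of keys (suggested, but not confirmed, by the SHA256 hashing mentioned in the architecture docs), and it uses the placeholder key that appears elsewhere in these docs:

```shell
# Illustrative only: derive a SHA-256 api_key_hash from an API key.
# Assumption: the server stores SHA-256 key hashes.
api_key='password'
api_key_hash=$(printf '%s' $api_key | sha256sum | cut -d ' ' -f 1)
echo $api_key_hash
# 5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8
```

One possible use of the recorded hash: an admin can later confirm which key a Bitwarden item corresponds to without pasting the key itself anywhere.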
Open Bitwarden and locate the item:
Name: FetchML API \u2013 <your-name>
Copy the password field (this is your FetchML API key).
Configure the CLI, e.g. in ~/.ml/config.toml:
api_key = \"<paste-from-bitwarden>\"\nworker_host = \"localhost\"\nworker_port = 9100\napi_url = \"ws://localhost:9100/ws\"\n ml status\n If the command works, your key and tunnel/config are correct.
"},{"location":"api-key-process/#2-direct-share-no-password-manager-required","title":"2. Direct share (no password manager required)","text":"For users who do not use Bitwarden, a lightweight alternative is a direct one-to-one share.
"},{"location":"api-key-process/#for-the-admin_1","title":"For the admin","text":"Share only the API key with the user via a direct channel you both trust, such as:
Signal / WhatsApp direct message
Short call/meeting where you read it to them
Ask the user to:
Paste the key into their local config.
~/.ml/config.toml:\napi_key = \"<your-api-key>\"\nworker_host = \"localhost\"\nworker_port = 9100\napi_url = \"ws://localhost:9100/ws\"\n ml status\n ml queue my-training-job\nml cancel my-training-job\n"},{"location":"api-key-process/#3-security-notes","title":"3. Security notes","text":"The api_key_hash is as sensitive as the API key itself. Do not commit keys or hashes to Git or share them in screenshots or tickets.
Rotation
The admin will revoke the old key, generate a new one, and update Bitwarden or share a new key.
Transport security
The api_url is typically ws://localhost:9100/ws when used through an SSH tunnel to the homelab. Following these steps keeps API access easy for the team while maintaining a reasonable security posture for a personal homelab deployment.
"},{"location":"architecture/","title":"Homelab Architecture","text":"Simple, secure architecture for ML experiments in your homelab.
"},{"location":"architecture/#components-overview","title":"Components Overview","text":"graph TB\n subgraph \"Homelab Stack\"\n CLI[Zig CLI]\n API[HTTPS API]\n REDIS[Redis Cache]\n FS[Local Storage]\n end\n\n CLI --> API\n API --> REDIS\n API --> FS\n"},{"location":"architecture/#core-services","title":"Core Services","text":""},{"location":"architecture/#api-server","title":"API Server","text":"graph LR\n USER[User] --> AUTH[API Key Auth]\n AUTH --> RATE[Rate Limiting]\n RATE --> WHITELIST[IP Whitelist]\n WHITELIST --> API[Secure API]\n API --> AUDIT[Audit Logging]\n"},{"location":"architecture/#security-layers","title":"Security Layers","text":"sequenceDiagram\n participant CLI\n participant API\n participant Redis\n participant Storage\n\n CLI->>API: HTTPS Request\n API->>API: Validate Auth\n API->>Redis: Cache/Queue\n API->>Storage: Experiment Data\n Storage->>API: Results\n API->>CLI: Response\n"},{"location":"architecture/#deployment-options","title":"Deployment Options","text":""},{"location":"architecture/#docker-compose-recommended","title":"Docker Compose (Recommended)","text":"services:\n redis:\n image: redis:7-alpine\n ports: [\"6379:6379\"]\n volumes: [redis_data:/data]\n\n api-server:\n build: .\n ports: [\"9101:9101\"]\n depends_on: [redis]\n"},{"location":"architecture/#local-setup","title":"Local Setup","text":"./setup.sh && ./manage.sh start\n"},{"location":"architecture/#network-architecture","title":"Network Architecture","text":"data/\n\u251c\u2500\u2500 experiments/ # ML experiment results\n\u251c\u2500\u2500 cache/ # Temporary cache files\n\u2514\u2500\u2500 backups/ # Local backups\n\nlogs/\n\u251c\u2500\u2500 app.log # Application logs\n\u251c\u2500\u2500 audit.log # Security events\n\u2514\u2500\u2500 access.log # API access logs\n"},{"location":"architecture/#monitoring-architecture","title":"Monitoring Architecture","text":"Simple, lightweight monitoring: - Health Checks: Service availability - Log Files: Structured logging - Basic 
Metrics: Request counts, error rates - Security Events: Failed auth, rate limits
"},{"location":"architecture/#homelab-benefits","title":"Homelab Benefits","text":"graph TB\n subgraph \"Client Layer\"\n CLI[CLI Tools]\n TUI[Terminal UI]\n API[REST API]\n end\n\n subgraph \"Authentication Layer\"\n Auth[Authentication Service]\n RBAC[Role-Based Access Control]\n Perm[Permission Manager]\n end\n\n subgraph \"Core Services\"\n Worker[ML Worker Service]\n DataMgr[Data Manager Service]\n Queue[Job Queue]\n end\n\n subgraph \"Storage Layer\"\n Redis[(Redis Cache)]\n DB[(SQLite/PostgreSQL)]\n Files[File Storage]\n end\n\n subgraph \"Container Runtime\"\n Podman[Podman/Docker]\n Containers[ML Containers]\n end\n\n CLI --> Auth\n TUI --> Auth\n API --> Auth\n\n Auth --> RBAC\n RBAC --> Perm\n\n Worker --> Queue\n Worker --> DataMgr\n Worker --> Podman\n\n DataMgr --> DB\n DataMgr --> Files\n\n Queue --> Redis\n\n Podman --> Containers\n"},{"location":"architecture/#zig-cli-architecture","title":"Zig CLI Architecture","text":""},{"location":"architecture/#component-structure","title":"Component Structure","text":"graph TB\n subgraph \"Zig CLI Components\"\n Main[main.zig] --> Commands[commands/]\n Commands --> Config[config.zig]\n Commands --> Utils[utils/]\n Commands --> Net[net/]\n Commands --> Errors[errors.zig]\n\n subgraph \"Commands\"\n Init[init.zig]\n Sync[sync.zig]\n Queue[queue.zig]\n Watch[watch.zig]\n Status[status.zig]\n Monitor[monitor.zig]\n Cancel[cancel.zig]\n Prune[prune.zig]\n end\n\n subgraph \"Utils\"\n Crypto[crypto.zig]\n Storage[storage.zig]\n Rsync[rsync.zig]\n end\n\n subgraph \"Network\"\n WS[ws.zig]\n end\n end\n"},{"location":"architecture/#performance-optimizations","title":"Performance Optimizations","text":""},{"location":"architecture/#content-addressed-storage","title":"Content-Addressed Storage","text":"graph LR\n subgraph \"CLI Security\"\n Config[Config File] --> Hash[SHA256 Hashing]\n Hash --> Auth[API Authentication]\n Auth --> SSH[SSH Transfer]\n SSH --> WS[WebSocket Security]\n 
end\n"},{"location":"architecture/#core-components","title":"Core Components","text":""},{"location":"architecture/#1-authentication-authorization","title":"1. Authentication & Authorization","text":"graph LR\n subgraph \"Auth Flow\"\n Client[Client] --> APIKey[API Key]\n APIKey --> Hash[Hash Validation]\n Hash --> Roles[Role Resolution]\n Roles --> Perms[Permission Check]\n Perms --> Access[Grant/Deny Access]\n end\n\n subgraph \"Permission Sources\"\n YAML[YAML Config]\n Inline[Inline Fallback]\n Roles --> YAML\n Roles --> Inline\n end\n Features: - API key-based authentication - Role-based access control (RBAC) - YAML-based permission configuration - Fallback to inline permissions - Admin wildcard permissions
"},{"location":"architecture/#2-worker-service","title":"2. Worker Service","text":"graph TB\n subgraph \"Worker Architecture\"\n API[HTTP API] --> Router[Request Router]\n Router --> Auth[Auth Middleware]\n Auth --> Queue[Job Queue]\n Queue --> Processor[Job Processor]\n Processor --> Runtime[Container Runtime]\n Runtime --> Storage[Result Storage]\n\n subgraph \"Job Lifecycle\"\n Submit[Submit Job] --> Queue\n Queue --> Execute[Execute]\n Execute --> Monitor[Monitor]\n Monitor --> Complete[Complete]\n Complete --> Store[Store Results]\n end\n end\n Responsibilities: - HTTP API for job submission - Job queue management - Container orchestration - Result collection and storage - Metrics and monitoring
"},{"location":"architecture/#3-data-manager-service","title":"3. Data Manager Service","text":"graph TB\n subgraph \"Data Management\"\n API[Data API] --> Storage[Storage Layer]\n Storage --> Metadata[Metadata DB]\n Storage --> Files[File System]\n Storage --> Cache[Redis Cache]\n\n subgraph \"Data Operations\"\n Upload[Upload Data] --> Validate[Validate]\n Validate --> Store[Store]\n Store --> Index[Index]\n Index --> Catalog[Catalog]\n end\n end\n Features: - Data upload and validation - Metadata management - File system abstraction - Caching layer - Data catalog
"},{"location":"architecture/#4-terminal-ui-tui","title":"4. Terminal UI (TUI)","text":"graph TB\n subgraph \"TUI Architecture\"\n UI[UI Components] --> Model[Data Model]\n Model --> Update[Update Loop]\n Update --> Render[Render]\n\n subgraph \"UI Panels\"\n Jobs[Job List]\n Details[Job Details]\n Logs[Log Viewer]\n Status[Status Bar]\n end\n\n UI --> Jobs\n UI --> Details\n UI --> Logs\n UI --> Status\n end\n Components: - Bubble Tea framework - Component-based architecture - Real-time updates - Keyboard navigation - Theme support
"},{"location":"architecture/#data-flow_1","title":"Data Flow","text":""},{"location":"architecture/#job-execution-flow","title":"Job Execution Flow","text":"sequenceDiagram\n participant Client\n participant Auth\n participant Worker\n participant Queue\n participant Container\n participant Storage\n\n Client->>Auth: Submit job with API key\n Auth->>Client: Validate and return job ID\n\n Client->>Worker: Execute job request\n Worker->>Queue: Queue job\n Queue->>Worker: Job ready\n Worker->>Container: Start ML container\n Container->>Worker: Execute experiment\n Worker->>Storage: Store results\n Worker->>Client: Return results\n"},{"location":"architecture/#authentication-flow","title":"Authentication Flow","text":"sequenceDiagram\n participant Client\n participant Auth\n participant PermMgr\n participant Config\n\n Client->>Auth: Request with API key\n Auth->>Auth: Validate key hash\n Auth->>PermMgr: Get user permissions\n PermMgr->>Config: Load YAML permissions\n Config->>PermMgr: Return permissions\n PermMgr->>Auth: Return resolved permissions\n Auth->>Client: Grant/deny access\n"},{"location":"architecture/#security-architecture_1","title":"Security Architecture","text":""},{"location":"architecture/#defense-in-depth","title":"Defense in Depth","text":"graph TB\n subgraph \"Security Layers\"\n Network[Network Security]\n Auth[Authentication]\n AuthZ[Authorization]\n Container[Container Security]\n Data[Data Protection]\n Audit[Audit Logging]\n end\n\n Network --> Auth\n Auth --> AuthZ\n AuthZ --> Container\n Container --> Data\n Data --> Audit\n Security Features: - API key authentication - Role-based permissions - Container isolation - File system sandboxing - Comprehensive audit logs - Input validation and sanitization
"},{"location":"architecture/#container-security","title":"Container Security","text":"graph TB\n subgraph \"Container Isolation\"\n Host[Host System]\n Podman[Podman Runtime]\n Network[Network Isolation]\n FS[File System Isolation]\n User[User Namespaces]\n ML[ML Container]\n\n Host --> Podman\n Podman --> Network\n Podman --> FS\n Podman --> User\n User --> ML\n end\n Isolation Features: - Rootless containers - Network isolation - File system sandboxing - User namespace mapping - Resource limits
"},{"location":"architecture/#configuration-architecture","title":"Configuration Architecture","text":""},{"location":"architecture/#configuration-hierarchy","title":"Configuration Hierarchy","text":"graph TB\n subgraph \"Config Sources\"\n Env[Environment Variables]\n File[Config Files]\n CLI[CLI Flags]\n Defaults[Default Values]\n end\n\n subgraph \"Config Processing\"\n Merge[Config Merger]\n Validate[Schema Validator]\n Apply[Config Applier]\n end\n\n Env --> Merge\n File --> Merge\n CLI --> Merge\n Defaults --> Merge\n\n Merge --> Validate\n Validate --> Apply\n Configuration Priority: 1. CLI flags (highest) 2. Environment variables 3. Configuration files 4. Default values (lowest)
"},{"location":"architecture/#scalability-architecture","title":"Scalability Architecture","text":""},{"location":"architecture/#horizontal-scaling","title":"Horizontal Scaling","text":"graph TB\n subgraph \"Scaled Architecture\"\n LB[Load Balancer]\n W1[Worker 1]\n W2[Worker 2]\n W3[Worker N]\n Redis[Redis Cluster]\n Storage[Shared Storage]\n\n LB --> W1\n LB --> W2\n LB --> W3\n\n W1 --> Redis\n W2 --> Redis\n W3 --> Redis\n\n W1 --> Storage\n W2 --> Storage\n W3 --> Storage\n end\n Scaling Features: - Stateless worker services - Shared job queue (Redis) - Distributed storage - Load balancer ready - Health checks and monitoring
"},{"location":"architecture/#technology-stack","title":"Technology Stack","text":""},{"location":"architecture/#backend-technologies","title":"Backend Technologies","text":"Component Technology Purpose Language Go 1.25+ Core application Web Framework Standard library HTTP server Authentication Custom API key + RBAC Database SQLite/PostgreSQL Metadata storage Cache Redis Job queue & caching Containers Podman/Docker Job isolation UI Framework Bubble Tea Terminal UI"},{"location":"architecture/#dependencies","title":"Dependencies","text":"// Core dependencies\nrequire (\n github.com/charmbracelet/bubbletea v1.3.10 // TUI framework\n github.com/go-redis/redis/v8 v8.11.5 // Redis client\n github.com/google/uuid v1.6.0 // UUID generation\n github.com/mattn/go-sqlite3 v1.14.32 // SQLite driver\n golang.org/x/crypto v0.45.0 // Crypto utilities\n gopkg.in/yaml.v3 v3.0.1 // YAML parsing\n)\n"},{"location":"architecture/#development-architecture","title":"Development Architecture","text":""},{"location":"architecture/#project-structure","title":"Project Structure","text":"fetch_ml/\n\u251c\u2500\u2500 cmd/ # CLI applications\n\u2502 \u251c\u2500\u2500 worker/ # ML worker service\n\u2502 \u251c\u2500\u2500 tui/ # Terminal UI\n\u2502 \u251c\u2500\u2500 data_manager/ # Data management\n\u2502 \u2514\u2500\u2500 user_manager/ # User management\n\u251c\u2500\u2500 internal/ # Internal packages\n\u2502 \u251c\u2500\u2500 auth/ # Authentication system\n\u2502 \u251c\u2500\u2500 config/ # Configuration management\n\u2502 \u251c\u2500\u2500 container/ # Container operations\n\u2502 \u251c\u2500\u2500 database/ # Database operations\n\u2502 \u251c\u2500\u2500 logging/ # Logging utilities\n\u2502 \u251c\u2500\u2500 metrics/ # Metrics collection\n\u2502 \u2514\u2500\u2500 network/ # Network utilities\n\u251c\u2500\u2500 configs/ # Configuration files\n\u251c\u2500\u2500 scripts/ # Setup and utility scripts\n\u251c\u2500\u2500 tests/ # Test suites\n\u2514\u2500\u2500 docs/ # 
Documentation\n"},{"location":"architecture/#package-dependencies","title":"Package Dependencies","text":"graph TB\n subgraph \"Application Layer\"\n Worker[cmd/worker]\n TUI[cmd/tui]\n DataMgr[cmd/data_manager]\n UserMgr[cmd/user_manager]\n end\n\n subgraph \"Service Layer\"\n Auth[internal/auth]\n Config[internal/config]\n Container[internal/container]\n Database[internal/database]\n end\n\n subgraph \"Utility Layer\"\n Logging[internal/logging]\n Metrics[internal/metrics]\n Network[internal/network]\n end\n\n Worker --> Auth\n Worker --> Config\n Worker --> Container\n TUI --> Auth\n DataMgr --> Database\n UserMgr --> Auth\n\n Auth --> Logging\n Container --> Network\n Database --> Metrics\n"},{"location":"architecture/#monitoring-observability","title":"Monitoring & Observability","text":""},{"location":"architecture/#metrics-collection","title":"Metrics Collection","text":"graph TB\n subgraph \"Metrics Pipeline\"\n App[Application] --> Metrics[Metrics Collector]\n Metrics --> Export[Prometheus Exporter]\n Export --> Prometheus[Prometheus Server]\n Prometheus --> Grafana[Grafana Dashboard]\n\n subgraph \"Metric Types\"\n Counter[Counters]\n Gauge[Gauges]\n Histogram[Histograms]\n Timer[Timers]\n end\n\n App --> Counter\n App --> Gauge\n App --> Histogram\n App --> Timer\n end\n"},{"location":"architecture/#logging-architecture","title":"Logging Architecture","text":"graph TB\n subgraph \"Logging Pipeline\"\n App[Application] --> Logger[Structured Logger]\n Logger --> File[File Output]\n Logger --> Console[Console Output]\n Logger --> Syslog[Syslog Forwarder]\n Syslog --> Aggregator[Log Aggregator]\n Aggregator --> Storage[Log Storage]\n Storage --> Viewer[Log Viewer]\n end\n"},{"location":"architecture/#deployment-architecture","title":"Deployment Architecture","text":""},{"location":"architecture/#container-deployment","title":"Container Deployment","text":"graph TB\n subgraph \"Deployment Stack\"\n Image[Container Image]\n Registry[Container Registry]\n 
Orchestrator[Docker Compose]\n Config[ConfigMaps/Secrets]\n Storage[Persistent Storage]\n\n Image --> Registry\n Registry --> Orchestrator\n Config --> Orchestrator\n Storage --> Orchestrator\n end\n"},{"location":"architecture/#service-discovery","title":"Service Discovery","text":"graph TB\n subgraph \"Service Mesh\"\n Gateway[API Gateway]\n Discovery[Service Discovery]\n Worker[Worker Service]\n Data[Data Service]\n Redis[Redis Cluster]\n\n Gateway --> Discovery\n Discovery --> Worker\n Discovery --> Data\n Discovery --> Redis\n end\n"},{"location":"architecture/#future-architecture-considerations","title":"Future Architecture Considerations","text":""},{"location":"architecture/#microservices-evolution","title":"Microservices Evolution","text":"This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.
"},{"location":"cicd/","title":"CI/CD Pipeline","text":"Automated testing, building, and releasing for fetch_ml.
"},{"location":"cicd/#workflows","title":"Workflows","text":""},{"location":"cicd/#ci-workflow-githubworkflowsciyml","title":"CI Workflow (.github/workflows/ci.yml)","text":"Runs on every push to main/develop and all pull requests.
Jobs: 1. test - Go backend tests with Redis 2. build - Build all binaries (Go + Zig CLI) 3. test-scripts - Validate deployment scripts 4. security-scan - Trivy and Gosec security scans 5. docker-build - Build and push Docker images (main branch only)
Test Coverage: - Go unit tests with race detection - internal/queue package tests - Zig CLI tests - Integration tests - Security audits
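For orientation, a heavily trimmed sketch of how the test job with its Redis service dependency might be declared (step names and action versions here are illustrative; the real definition lives in .github/workflows/ci.yml):

```yaml
name: CI
on:
  push:
    branches: [main, develop]
  pull_request: {}
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      redis:
        image: redis:7-alpine
        ports:
          - '6379:6379'
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.25'
      - run: go test -race ./...   # race detection, as noted above
```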
"},{"location":"cicd/#release-workflow-githubworkflowsreleaseyml","title":"Release Workflow (.github/workflows/release.yml)","text":"Runs on version tags (e.g., v1.0.0).
Jobs: 1. Build CLI binaries - embeds rsync for zero-dependency releases 2. build-go-backends - builds api-server, worker, tui, data_manager, user_manager 3. create-release - publishes the release artifacts
# 1. Update version\ngit tag v1.0.0\n\n# 2. Push tag\ngit push origin v1.0.0\n\n# 3. CI automatically builds and releases\n"},{"location":"cicd/#release-artifacts","title":"Release Artifacts","text":"CLI Binaries (with embedded rsync): - ml-linux-x86_64.tar.gz (~450-650KB) - ml-macos-x86_64.tar.gz (~450-650KB) - ml-macos-arm64.tar.gz (~450-650KB)
Go Backends: - fetch_ml_api-server.tar.gz - fetch_ml_worker.tar.gz - fetch_ml_tui.tar.gz - fetch_ml_data_manager.tar.gz - fetch_ml_user_manager.tar.gz
Checksums: - checksums.txt - Combined SHA256 sums - Individual .sha256 files per binary
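Downloaded artifacts can be verified against these checksum files. A minimal self-contained demonstration of the sha256sum -c round trip (a scratch file stands in for a real release tarball):

```shell
# Work in a scratch directory with a stand-in artifact.
cd $(mktemp -d)
printf 'fake artifact' > ml-linux-x86_64.tar.gz

# The .sha256 file records the expected digest of the binary.
sha256sum ml-linux-x86_64.tar.gz > ml-linux-x86_64.tar.gz.sha256

# -c re-hashes the file and compares it to the recorded digest.
sha256sum -c ml-linux-x86_64.tar.gz.sha256
# ml-linux-x86_64.tar.gz: OK
```

For a real release, download both the tarball and its .sha256 file into the same directory before running the check.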
# Run all tests\nmake test\n\n# Run specific package tests\ngo test ./internal/queue/...\n\n# Build CLI\ncd cli && zig build dev\n\n# Run formatters and linters\nmake lint\n\n# Security scans are handled automatically in CI by the `security-scan` job\n"},{"location":"cicd/#optional-heavy-end-to-end-tests","title":"Optional heavy end-to-end tests","text":"Some e2e tests exercise full Docker deployments and performance scenarios and are skipped by default to keep local/CI runs fast. You can enable them explicitly with environment variables:
# Run Docker deployment e2e tests\nFETCH_ML_E2E_DOCKER=1 go test ./tests/e2e/...\n\n# Run performance-oriented e2e tests\nFETCH_ML_E2E_PERF=1 go test ./tests/e2e/...\n Without these variables, TestDockerDeploymentE2E and TestPerformanceE2E will t.Skip, while all lighter e2e tests still run.
All PRs must pass: - \u2705 Go tests (with Redis) - \u2705 CLI tests - \u2705 Security scans - \u2705 Code linting - \u2705 Build verification
"},{"location":"cicd/#configuration","title":"Configuration","text":""},{"location":"cicd/#environment-variables","title":"Environment Variables","text":"GO_VERSION: '1.25.0'\nZIG_VERSION: '0.15.2'\n"},{"location":"cicd/#secrets","title":"Secrets","text":"Required for releases: - GITHUB_TOKEN - Automatic, provided by GitHub Actions
Check workflow runs at:
https://github.com/jfraeys/fetch_ml/actions\n"},{"location":"cicd/#artifacts","title":"Artifacts","text":"Download build artifacts from: - Successful workflow runs (30-day retention) - GitHub Releases (permanent)
For implementation details: - .github/workflows/ci.yml - .github/workflows/release.yml
"},{"location":"cli-reference/","title":"Fetch ML CLI Reference","text":"Comprehensive command-line tools for managing ML experiments in your homelab with Zig-based high-performance CLI.
"},{"location":"cli-reference/#overview","title":"Overview","text":"Fetch ML provides a comprehensive CLI toolkit built with performance and security in mind:
"},{"location":"cli-reference/#zig-cli-clizig-outbinml","title":"Zig CLI (./cli/zig-out/bin/ml)","text":"High-performance command-line interface for experiment management, written in Zig for speed and efficiency.
"},{"location":"cli-reference/#available-commands","title":"Available Commands","text":"Command Description Exampleinit Interactive configuration setup ml init sync Sync project to worker with deduplication ml sync ./project --name myjob --queue queue Queue job for execution ml queue myjob --commit abc123 --priority 8 status Get system and worker status ml status monitor Launch TUI monitoring via SSH ml monitor cancel Cancel running job ml cancel job123 prune Clean up old experiments ml prune --keep 10 watch Auto-sync directory on changes ml watch ./project --queue"},{"location":"cli-reference/#command-details","title":"Command Details","text":""},{"location":"cli-reference/#init-configuration-setup","title":"init - Configuration Setup","text":"ml init\n Creates a configuration template at ~/.ml/config.toml with: - Worker connection details - API authentication - Base paths and ports"},{"location":"cli-reference/#sync-project-synchronization","title":"sync - Project Synchronization","text":"# Basic sync\nml sync ./my-project\n\n# Sync with custom name and queue\nml sync ./my-project --name \"experiment-1\" --queue\n\n# Sync with priority\nml sync ./my-project --priority 9\n Features: - Content-addressed storage for deduplication - SHA256 commit ID generation - Rsync-based file transfer - Automatic queuing (with --queue flag)
"},{"location":"cli-reference/#queue-job-management","title":"queue - Job Management","text":"# Queue with commit ID\nml queue my-job --commit abc123def456\n\n# Queue with priority (1-10, default 5)\nml queue my-job --commit abc123 --priority 8\n Features: - WebSocket-based communication - Priority queuing system - API key authentication
"},{"location":"cli-reference/#watch-auto-sync-monitoring","title":"watch - Auto-Sync Monitoring","text":"# Watch directory for changes\nml watch ./project\n\n# Watch and auto-queue on changes\nml watch ./project --name \"dev-exp\" --queue\n Features: - Real-time file system monitoring - Automatic re-sync on changes - Configurable polling interval (2 seconds) - Commit ID comparison for efficiency
"},{"location":"cli-reference/#prune-cleanup-management","title":"prune - Cleanup Management","text":"# Keep last N experiments\nml prune --keep 20\n\n# Remove experiments older than N days\nml prune --older-than 30\n"},{"location":"cli-reference/#monitor-remote-monitoring","title":"monitor - Remote Monitoring","text":"ml monitor\n Launches TUI interface via SSH for real-time monitoring."},{"location":"cli-reference/#cancel-job-cancellation","title":"cancel - Job Cancellation","text":"ml cancel running-job-id\n Cancels currently running jobs by ID."},{"location":"cli-reference/#configuration","title":"Configuration","text":"The Zig CLI reads configuration from ~/.ml/config.toml:
worker_host = \"worker.local\"\nworker_user = \"mluser\"\nworker_base = \"/data/ml-experiments\"\nworker_port = 22\napi_key = \"your-api-key\"\n"},{"location":"cli-reference/#performance-features","title":"Performance Features","text":"./cmd/api-server/main.go)","text":"Main HTTPS API server for experiment management.
# Build and run\ngo run ./cmd/api-server/main.go\n\n# With configuration\n./bin/api-server --config configs/config-local.yaml\n Features: - HTTPS-only communication - API key authentication - Rate limiting and IP whitelisting - WebSocket support for real-time updates - Redis integration for caching
"},{"location":"cli-reference/#tui-cmdtuimaingo","title":"TUI (./cmd/tui/main.go)","text":"Terminal User Interface for monitoring experiments.
# Launch TUI\ngo run ./cmd/tui/main.go\n\n# With custom config\n./tui --config configs/config-local.yaml\n Features: - Real-time experiment monitoring - Interactive job management - Status visualization - Log viewing
"},{"location":"cli-reference/#data-manager-cmddata_manager","title":"Data Manager (./cmd/data_manager/)","text":"Utilities for data synchronization and management.
# Sync data\n./data_manager --sync ./data\n\n# Clean old data\n./data_manager --cleanup --older-than 30d\n"},{"location":"cli-reference/#config-lint-cmdconfiglintmaingo","title":"Config Lint (./cmd/configlint/main.go)","text":"Configuration validation and linting tool.
# Validate configuration\n./configlint configs/config-local.yaml\n\n# Check schema compliance\n./configlint --schema configs/schema/config_schema.yaml\n"},{"location":"cli-reference/#management-script-toolsmanagesh","title":"Management Script (./tools/manage.sh)","text":"Simple service management for your homelab.
"},{"location":"cli-reference/#commands","title":"Commands","text":"./tools/manage.sh start # Start all services\n./tools/manage.sh stop # Stop all services\n./tools/manage.sh status # Check service status\n./tools/manage.sh logs # View logs\n./tools/manage.sh monitor # Basic monitoring\n./tools/manage.sh security # Security status\n./tools/manage.sh cleanup # Clean project artifacts\n"},{"location":"cli-reference/#setup-script-setupsh","title":"Setup Script (./setup.sh)","text":"One-command homelab setup.
"},{"location":"cli-reference/#usage","title":"Usage","text":"# Full setup\n./setup.sh\n\n# Setup includes:\n# - SSL certificate generation\n# - Configuration creation\n# - Build all components\n# - Start Redis\n# - Setup Fail2Ban (if available)\n"},{"location":"cli-reference/#api-testing","title":"API Testing","text":"Test the API with curl:
# Health check\ncurl -k -H 'X-API-Key: password' https://localhost:9101/health\n\n# List experiments\ncurl -k -H 'X-API-Key: password' https://localhost:9101/experiments\n\n# Submit experiment\ncurl -k -X POST -H 'X-API-Key: password' \\\n -H 'Content-Type: application/json' \\\n -d '{\"name\":\"test\",\"config\":{\"type\":\"basic\"}}' \\\n https://localhost:9101/experiments\n"},{"location":"cli-reference/#zig-cli-architecture","title":"Zig CLI Architecture","text":"The Zig CLI is designed for performance and reliability:
"},{"location":"cli-reference/#core-components","title":"Core Components","text":"cli/src/commands/): Individual command implementationscli/src/config.zig): Configuration managementcli/src/net/ws.zig): WebSocket client implementationcli/src/utils/): Cryptography, storage, and rsync utilitiescli/src/errors.zig): Centralized error handlingMain configuration file: configs/config-local.yaml
auth:\n enabled: true\n api_keys:\n homelab_user:\n hash: \"5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8\"\n admin: true\n\nserver:\n address: \":9101\"\n tls:\n enabled: true\n cert_file: \"./ssl/cert.pem\"\n key_file: \"./ssl/key.pem\"\n\nsecurity:\n rate_limit:\n enabled: true\n requests_per_minute: 30\n ip_whitelist:\n - \"127.0.0.1\"\n - \"::1\"\n - \"192.168.0.0/16\"\n - \"10.0.0.0/8\"\n"},{"location":"cli-reference/#docker-commands","title":"Docker Commands","text":"If using Docker Compose:
# Start services\ndocker-compose up -d # testing/development only\n\n# View logs\ndocker-compose logs -f\n\n# Stop services\ndocker-compose down\n\n# Check status\ndocker-compose ps\n"},{"location":"cli-reference/#troubleshooting","title":"Troubleshooting","text":""},{"location":"cli-reference/#common-issues","title":"Common Issues","text":"Zig CLI not found:
# Build the CLI\ncd cli && make build\n\n# Check binary exists\nls -la ./cli/zig-out/bin/ml\n Configuration not found:
# Create configuration\n./cli/zig-out/bin/ml init\n\n# Check config file\nls -la ~/.ml/config.toml\n Worker connection failed:
# Test SSH connection\nssh -p 22 mluser@worker.local\n\n# Check configuration\ncat ~/.ml/config.toml\n Sync not working:
# Check rsync availability\nrsync --version\n\n# Test manual sync\nrsync -avz ./project/ mluser@worker.local:/tmp/test/\n WebSocket connection failed:
# Check worker WebSocket port\ntelnet worker.local 9100\n\n# Verify API key\n./cli/zig-out/bin/ml status\n API not responding:
./tools/manage.sh status\n./tools/manage.sh logs\n Authentication failed:
# Check API key in config-local.yaml\ngrep -A 5 \"api_keys:\" configs/config-local.yaml\n Redis connection failed:
# Check Redis status\nredis-cli ping\n\n# Start Redis\nredis-server\n"},{"location":"cli-reference/#getting-help","title":"Getting Help","text":"# CLI help\n./cli/zig-out/bin/ml help\n\n# Management script help\n./tools/manage.sh help\n\n# Check all available commands\nmake help\n That's it for the CLI reference! For complete setup instructions, see the main index.
"},{"location":"configuration-schema/","title":"Configuration Schema","text":"Complete reference for Fetch ML configuration options.
"},{"location":"configuration-schema/#configuration-file-structure","title":"Configuration File Structure","text":"Fetch ML uses YAML configuration files. The main configuration file is typically config.yaml.
# Server Configuration\nserver:\n address: \":9101\"\n tls:\n enabled: false\n cert_file: \"\"\n key_file: \"\"\n\n# Database Configuration\ndatabase:\n type: \"sqlite\" # sqlite, postgres, mysql\n connection: \"fetch_ml.db\"\n host: \"localhost\"\n port: 5432\n username: \"postgres\"\n password: \"\"\n database: \"fetch_ml\"\n\n# Redis Configuration\n\nQuick Reference\n\nDatabase Types\n- SQLite: type: sqlite, connection: file.db\n- PostgreSQL: type: postgres, host: localhost, port: 5432\n\nKey Settings\n- server.address: :9101\n- database.type: sqlite\n- redis.addr: localhost:6379\n- auth.enabled: true\n- logging.level: info\n\nEnvironment Override\nexport FETCHML_SERVER_ADDRESS=:8080\nexport FETCHML_DATABASE_TYPE=postgres\n"},{"location":"configuration-schema/#validation","title":"Validation","text":"make configlint\n"},{"location":"deployment/","title":"ML Experiment Manager - Deployment Guide","text":""},{"location":"deployment/#overview","title":"Overview","text":"The ML Experiment Manager supports multiple deployment methods from local development to homelab Docker setups.
"},{"location":"deployment/#quick-start","title":"Quick Start","text":""},{"location":"deployment/#docker-compose-recommended-for-development","title":"Docker Compose (Recommended for Development)","text":"# Clone repository\ngit clone https://github.com/your-org/fetch_ml.git\ncd fetch_ml\n\n# Start all services (testing/development only)\ndocker-compose up -d\n\n# Check status\ndocker-compose ps\n\n# View logs\ndocker-compose logs -f api-server\n Access the API at http://localhost:9100
Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution Toolchain: - Go 1.25+ - Zig 0.15.2 - Redis 7+ - Docker & Docker Compose (optional)
"},{"location":"deployment/#manual-setup","title":"Manual Setup","text":"# Start Redis\nredis-server\n\n# Build and run Go server\ngo build -o bin/api-server ./cmd/api-server\n./bin/api-server -config configs/config-local.yaml\n\n# Build Zig CLI\ncd cli\nzig build prod\n./zig-out/bin/ml --help\n"},{"location":"deployment/#2-docker-deployment","title":"2. Docker Deployment","text":""},{"location":"deployment/#build-image","title":"Build Image","text":"docker build -t ml-experiment-manager:latest .\n"},{"location":"deployment/#run-container","title":"Run Container","text":"docker run -d \\\n --name ml-api \\\n -p 9100:9100 \\\n -p 9101:9101 \\\n -v $(pwd)/configs:/app/configs:ro \\\n -v experiment-data:/data/ml-experiments \\\n ml-experiment-manager:latest\n"},{"location":"deployment/#docker-compose","title":"Docker Compose","text":"# Detached mode\ndocker-compose -f docker-compose.yml up -d\n\n# Foreground mode with logs\ndocker-compose -f docker-compose.yml up\n"},{"location":"deployment/#3-homelab-setup","title":"3. Homelab Setup","text":"# Use the simple setup script\n./setup.sh\n\n# Or manually with Docker Compose (testing only)\ndocker-compose up -d\n"},{"location":"deployment/#4-cloud-deployment","title":"4. 
Cloud Deployment","text":""},{"location":"deployment/#aws-ecs","title":"AWS ECS","text":"# Build and push to ECR\naws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY\ndocker build -t $ECR_REGISTRY/ml-experiment-manager:latest .\ndocker push $ECR_REGISTRY/ml-experiment-manager:latest\n\n# Deploy with ECS CLI\necs-cli compose --project-name ml-experiment-manager up\n"},{"location":"deployment/#google-cloud-run","title":"Google Cloud Run","text":"# Build and push\ngcloud builds submit --tag gcr.io/$PROJECT_ID/ml-experiment-manager\n\n# Deploy\ngcloud run deploy ml-experiment-manager \\\n --image gcr.io/$PROJECT_ID/ml-experiment-manager \\\n --platform managed \\\n --region us-central1 \\\n --allow-unauthenticated\n"},{"location":"deployment/#configuration","title":"Configuration","text":""},{"location":"deployment/#environment-variables","title":"Environment Variables","text":"# configs/config-local.yaml\nbase_path: \"/data/ml-experiments\"\nauth:\n enabled: true\n api_keys:\n - \"your-production-api-key\"\nserver:\n address: \":9100\"\n tls:\n enabled: true\n cert_file: \"/app/ssl/cert.pem\"\n key_file: \"/app/ssl/key.pem\"\n"},{"location":"deployment/#docker-compose-environment","title":"Docker Compose Environment","text":"# docker-compose.yml\nversion: '3.8'\nservices:\n api-server:\n environment:\n - REDIS_URL=redis://redis:6379\n - LOG_LEVEL=info\n volumes:\n - ./configs:/configs:ro\n - ./data:/data/experiments\n"},{"location":"deployment/#monitoring-logging","title":"Monitoring & Logging","text":""},{"location":"deployment/#health-checks","title":"Health Checks","text":"GET /health/metrics# Generate self-signed cert (development)\nopenssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes\n\n# Production - use Let's Encrypt\ncertbot certonly --standalone -d ml-experiments.example.com\n"},{"location":"deployment/#network-security","title":"Network Security","text":"resources:\n requests:\n memory: 
\"256Mi\"\n cpu: \"250m\"\n limits:\n memory: \"1Gi\"\n cpu: \"1000m\"\n"},{"location":"deployment/#scaling-strategies","title":"Scaling Strategies","text":"# Backup experiment data\ndocker-compose exec redis redis-cli BGSAVE\ndocker cp $(docker-compose ps -q redis):/data/dump.rdb ./redis-backup.rdb\n\n# Backup data volume\ndocker run --rm -v ml-experiments_redis_data:/data -v $(pwd):/backup alpine tar czf /backup/redis-backup.tar.gz -C /data .\n"},{"location":"deployment/#disaster-recovery","title":"Disaster Recovery","text":"# Check logs\ndocker-compose logs api-server\n\n# Check configuration\ncat configs/config-local.yaml\n\n# Check Redis connection\ndocker-compose exec redis redis-cli ping\n"},{"location":"deployment/#websocket-connection-issues","title":"WebSocket Connection Issues","text":"# Test WebSocket\nwscat -c ws://localhost:9100/ws\n\n# Check TLS\nopenssl s_client -connect localhost:9101 -servername localhost\n"},{"location":"deployment/#performance-issues","title":"Performance Issues","text":"# Check resource usage\ndocker-compose exec api-server ps aux\n\n# Check Redis memory\ndocker-compose exec redis redis-cli info memory\n"},{"location":"deployment/#debug-mode","title":"Debug Mode","text":"# Enable debug logging\nexport LOG_LEVEL=debug\n./bin/api-server -config configs/config-local.yaml\n"},{"location":"deployment/#cicd-integration","title":"CI/CD Integration","text":""},{"location":"deployment/#github-actions","title":"GitHub Actions","text":"For deployment issues: 1. Check this guide 2. Review logs 3. Check GitHub Issues 4. Contact maintainers
"},{"location":"development-setup/","title":"Development Setup","text":"Set up your local development environment for Fetch ML.
"},{"location":"development-setup/#prerequisites","title":"Prerequisites","text":"Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution
# Clone repository\ngit clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\n\n# Start dependencies (see Quick Start for Docker setup)\ndocker-compose up -d redis postgres\n\n# Build all components\nmake build\n\n# Run tests (see Testing Guide)\nmake test-unit\n"},{"location":"development-setup/#detailed-setup","title":"Detailed Setup","text":""},{"location":"development-setup/#quick-start","title":"Quick Start","text":"git clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\ndocker-compose up -d redis postgres\nmake build\nmake test-unit\n"},{"location":"development-setup/#key-commands","title":"Key Commands","text":"make build - Build all componentsmake test-unit - Run tests (see Testing Guide)make dev - Development buildcd cli && zig build - Build CLI (see CLI Reference and Zig CLI)go mod tidycd cli && rm -rf zig-out zig-cachelsof -i :9101Fetch ML supports environment variables for configuration, allowing you to override config file settings and deploy in different environments.
"},{"location":"environment-variables/#priority-order","title":"Priority Order","text":"FETCH_ML_* - General server and application settingsFETCH_ML_CLI_* - CLI-specific settings (overrides ~/.ml/config.toml)FETCH_ML_TUI_* - TUI-specific settings (overrides TUI config file)FETCH_ML_CLI_HOST worker_host localhost FETCH_ML_CLI_USER worker_user mluser FETCH_ML_CLI_BASE worker_base /opt/ml FETCH_ML_CLI_PORT worker_port 22 FETCH_ML_CLI_API_KEY api_key your-api-key-here"},{"location":"environment-variables/#tui-environment-variables","title":"TUI Environment Variables","text":"Variable Config Field Example FETCH_ML_TUI_HOST host localhost FETCH_ML_TUI_USER user mluser FETCH_ML_TUI_SSH_KEY ssh_key ~/.ssh/id_rsa FETCH_ML_TUI_PORT port 22 FETCH_ML_TUI_BASE_PATH base_path /opt/ml FETCH_ML_TUI_TRAIN_SCRIPT train_script train.py FETCH_ML_TUI_REDIS_ADDR redis_addr localhost:6379 FETCH_ML_TUI_REDIS_PASSWORD redis_password `` FETCH_ML_TUI_REDIS_DB redis_db 0 FETCH_ML_TUI_KNOWN_HOSTS known_hosts ~/.ssh/known_hosts"},{"location":"environment-variables/#server-environment-variables-auth-debug","title":"Server Environment Variables (Auth & Debug)","text":"These variables control server-side authentication behavior and are intended only for local development and debugging.
Variable Purpose Allowed In Production?FETCH_ML_ALLOW_INSECURE_AUTH When set to 1 and FETCH_ML_DEBUG=1, allows the API server to run with auth.enabled: false by injecting a default admin user. No. Must never be set in production. FETCH_ML_DEBUG Enables additional debug behaviors. Required (set to 1) to activate the insecure auth bypass above. No. Must never be set in production. When both variables are set to 1 and auth.enabled is false, the server logs a clear warning and treats all requests as coming from a default admin user. This mode is convenient for local homelab experiments but is insecure by design and must not be used on any shared or internet-facing environment.
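The gate described above can be sketched roughly as follows (illustrative Python, not the server's actual Go code; the function name is hypothetical):

```python
def insecure_auth_allowed(env, auth_enabled):
    """Return True only when the documented debug bypass applies:
    auth is disabled AND both FETCH_ML_DEBUG and
    FETCH_ML_ALLOW_INSECURE_AUTH are set to "1"."""
    return (
        not auth_enabled
        and env.get("FETCH_ML_DEBUG") == "1"
        and env.get("FETCH_ML_ALLOW_INSECURE_AUTH") == "1"
    )
```

Note that either variable alone is not enough: the bypass requires both flags and `auth.enabled: false`, which is why it is safe to leave one unset in any shared environment.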
export FETCH_ML_CLI_HOST=localhost\nexport FETCH_ML_CLI_USER=devuser\nexport FETCH_ML_CLI_API_KEY=dev-key-123456789012\n./ml status\n"},{"location":"environment-variables/#production-environment","title":"Production Environment","text":"export FETCH_ML_CLI_HOST=prod-server.example.com\nexport FETCH_ML_CLI_USER=mluser\nexport FETCH_ML_CLI_API_KEY=prod-key-abcdef1234567890\n./ml status\n"},{"location":"environment-variables/#dockerkubernetes","title":"Docker/Kubernetes","text":"env:\n - name: FETCH_ML_CLI_HOST\n value: \"ml-server.internal\"\n - name: FETCH_ML_CLI_USER\n value: \"mluser\"\n - name: FETCH_ML_CLI_API_KEY\n valueFrom:\n secretKeyRef:\n name: ml-secrets\n key: api-key\n"},{"location":"environment-variables/#using-env-file","title":"Using .env file","text":"# Copy the example file\ncp .env.example .env\n\n# Edit with your values\nvim .env\n\n# Load in your shell\nexport $(cat .env | xargs)\n"},{"location":"environment-variables/#backward-compatibility","title":"Backward Compatibility","text":"The CLI also supports the legacy ML_* prefix for backward compatibility, but FETCH_ML_CLI_* takes priority if both are set.
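A minimal sketch of that precedence, in Python for illustration (the real CLI is written in Zig; `cli_setting` is a hypothetical name):

```python
import os

def cli_setting(name, env=None):
    """Resolve a CLI setting, preferring FETCH_ML_CLI_<NAME>
    over the legacy ML_<NAME> variable."""
    env = os.environ if env is None else env
    value = env.get(f"FETCH_ML_CLI_{name}")
    return value if value is not None else env.get(f"ML_{name}")
```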
ML_HOST FETCH_ML_CLI_HOST ML_USER FETCH_ML_CLI_USER ML_BASE FETCH_ML_CLI_BASE ML_PORT FETCH_ML_CLI_PORT ML_API_KEY FETCH_ML_CLI_API_KEY"},{"location":"first-experiment/","title":"First Experiment","text":"Run your first machine learning experiment with Fetch ML.
"},{"location":"first-experiment/#prerequisites","title":"Prerequisites","text":"Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution
Create a simple Python script:
# experiment.py\nimport argparse\nimport json\nimport sys\nimport time\n\ndef main():\n parser = argparse.ArgumentParser()\n parser.add_argument('--epochs', type=int, default=10)\n parser.add_argument('--lr', type=float, default=0.001)\n parser.add_argument('--output', default='results.json')\n\n args = parser.parse_args()\n\n # Simulate training\n results = {\n 'epochs': args.epochs,\n 'learning_rate': args.lr,\n 'accuracy': 0.85 + (args.lr * 0.1),\n 'loss': 0.5 - (args.epochs * 0.01),\n 'training_time': args.epochs * 0.1\n }\n\n # Save results\n with open(args.output, 'w') as f:\n json.dump(results, f, indent=2)\n\n print(f\"Training completed: {results}\")\n return results\n\nif __name__ == '__main__':\n main()\n"},{"location":"first-experiment/#2-submit-job-via-api","title":"2. Submit Job via API","text":"# Submit experiment\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n -H \"Content-Type: application/json\" \\\n -H \"X-API-Key: your-api-key\" \\\n -d '{\n \"job_name\": \"first-experiment\",\n \"args\": \"--epochs 20 --lr 0.01 --output experiment_results.json\",\n \"priority\": 1,\n \"metadata\": {\n \"experiment_type\": \"training\",\n \"dataset\": \"sample_data\"\n }\n }'\n"},{"location":"first-experiment/#3-monitor-progress","title":"3. Monitor Progress","text":"# Check job status\ncurl -H \"X-API-Key: your-api-key\" \\\n http://localhost:9101/api/v1/jobs/first-experiment\n\n# List all jobs\ncurl -H \"X-API-Key: your-api-key\" \\\n http://localhost:9101/api/v1/jobs\n\n# Get job metrics\ncurl -H \"X-API-Key: your-api-key\" \\\n http://localhost:9101/api/v1/jobs/first-experiment/metrics\n"},{"location":"first-experiment/#4-use-cli","title":"4. 
Use CLI","text":"# Submit with CLI\ncd cli && zig build dev\n./cli/zig-out/dev/ml submit \\\n --name \"cli-experiment\" \\\n --args \"--epochs 15 --lr 0.005\" \\\n --server http://localhost:9101\n\n# Monitor with CLI\n./cli/zig-out/dev/ml list-jobs --server http://localhost:9101\n./cli/zig-out/dev/ml job-status cli-experiment --server http://localhost:9101\n"},{"location":"first-experiment/#advanced-experiment","title":"Advanced Experiment","text":""},{"location":"first-experiment/#hyperparameter-tuning","title":"Hyperparameter Tuning","text":"# Submit multiple experiments\nfor lr in 0.001 0.01 0.1; do\n curl -X POST http://localhost:9101/api/v1/jobs \\\n -H \"Content-Type: application/json\" \\\n -H \"X-API-Key: your-api-key\" \\\n -d \"{\n \\\"job_name\\\": \\\"tune-lr-$lr\\\",\n \\\"args\\\": \\\"--epochs 10 --lr $lr\\\",\n \\\"metadata\\\": {\\\"learning_rate\\\": $lr}\n }\"\ndone\n"},{"location":"first-experiment/#batch-processing","title":"Batch Processing","text":"# Submit batch job\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n -H \"Content-Type: application/json\" \\\n -H \"X-API-Key: your-api-key\" \\\n -d '{\n \"job_name\": \"batch-processing\",\n \"args\": \"--input data/ --output results/ --batch-size 32\",\n \"priority\": 2,\n \"datasets\": [\"training_data\", \"validation_data\"]\n }'\n"},{"location":"first-experiment/#results-and-output","title":"Results and Output","text":""},{"location":"first-experiment/#access-results","title":"Access Results","text":"# Download results\ncurl -H \"X-API-Key: your-api-key\" \\\n http://localhost:9101/api/v1/jobs/first-experiment/results\n\n# View job details\ncurl -H \"X-API-Key: your-api-key\" \\\n http://localhost:9101/api/v1/jobs/first-experiment | jq .\n"},{"location":"first-experiment/#result-format","title":"Result Format","text":"{\n \"job_id\": \"first-experiment\",\n \"status\": \"completed\",\n \"results\": {\n \"epochs\": 20,\n \"learning_rate\": 0.01,\n \"accuracy\": 0.86,\n \"loss\": 0.3,\n 
\"training_time\": 2.0\n },\n \"metrics\": {\n \"gpu_utilization\": \"85%\",\n \"memory_usage\": \"2GB\",\n \"execution_time\": \"120s\"\n }\n}\n"},{"location":"first-experiment/#best-practices","title":"Best Practices","text":""},{"location":"first-experiment/#job-naming","title":"Job Naming","text":"model-training-v2, data-preprocessingexperiment-v1, experiment-v2daily-batch-2024-01-15{\n \"metadata\": {\n \"experiment_type\": \"training\",\n \"model_version\": \"v2.1\",\n \"dataset\": \"imagenet-2024\",\n \"environment\": \"gpu\",\n \"team\": \"ml-team\"\n }\n}\n"},{"location":"first-experiment/#error-handling","title":"Error Handling","text":"# Check failed jobs\ncurl -H \"X-API-Key: your-api-key\" \\\n \"http://localhost:9101/api/v1/jobs?status=failed\"\n\n# Retry failed job\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n -H \"Content-Type: application/json\" \\\n -H \"X-API-Key: your-api-key\" \\\n -d '{\n \"job_name\": \"retry-experiment\",\n \"args\": \"--epochs 20 --lr 0.01\",\n \"metadata\": {\"retry_of\": \"first-experiment\"}\n }'\n"},{"location":"first-experiment/#related-documentation","title":"## Related Documentation","text":"Job stuck in pending? - Check worker status: curl /api/v1/workers - Verify resources: docker stats - Check logs: docker-compose logs api-server
Job failed? - Check error message: curl /api/v1/jobs/job-id - Review job arguments - Verify input data
No results? - Check job completion status - Verify output file paths - Check storage permissions
"},{"location":"installation/","title":"Simple Installation Guide","text":""},{"location":"installation/#quick-start-5-minutes","title":"Quick Start (5 minutes)","text":"# 1. Install\ngit clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\nmake install\n\n# 2. Setup (auto-configures)\n./bin/ml setup\n\n# 3. Run experiments\n./bin/ml run my-experiment.py\n That's it. Everything else is optional.
"},{"location":"installation/#what-if-i-want-more-control","title":"What If I Want More Control?","text":""},{"location":"installation/#manual-configuration-optional","title":"Manual Configuration (Optional)","text":"# Edit settings if defaults don't work\nnano ~/.ml/config.toml\n"},{"location":"installation/#monitoring-dashboard-optional","title":"Monitoring Dashboard (Optional)","text":"# Real-time monitoring\n./bin/tui\n"},{"location":"installation/#senior-developer-feedback","title":"Senior Developer Feedback","text":"\"Keep it simple\" - Most data scientists want: 1. One installation command 2. Sensible defaults 3. Works without configuration 4. Advanced features available when needed
Current plan is too complex because it asks users to decide between: - CLI vs TUI vs Both - Zig vs Go build tools - Manual vs auto config - Multiple environment variables
Better approach: Start simple, add complexity gradually.
"},{"location":"installation/#recommended-simplified-workflow","title":"Recommended Simplified Workflow","text":"The goal: \"It just works\" for 80% of use cases.
"},{"location":"operations/","title":"Operations Runbook","text":"Operational guide for troubleshooting and maintaining the ML experiment system.
"},{"location":"operations/#task-queue-operations","title":"Task Queue Operations","text":""},{"location":"operations/#monitoring-queue-health","title":"Monitoring Queue Health","text":"# Check queue depth\nZCARD task:queue\n\n# List pending tasks\nZRANGE task:queue 0 -1 WITHSCORES\n\n# Check dead letter queue\nKEYS task:dlq:*\n"},{"location":"operations/#handling-stuck-tasks","title":"Handling Stuck Tasks","text":"Symptom: Tasks stuck in \"running\" status
Diagnosis:
# Check for expired leases\nredis-cli GET task:{task-id}\n# Look for LeaseExpiry in past\n Remediation: Tasks with expired leases are automatically reclaimed every minute. To force immediate reclamation:
# Restart worker to trigger reclaim cycle\nsystemctl restart ml-worker\n"},{"location":"operations/#dead-letter-queue-management","title":"Dead Letter Queue Management","text":"View failed tasks:
KEYS task:dlq:*\n Inspect failed task:
GET task:dlq:{task-id}\n Retry from DLQ:
# Manual retry (requires custom script)\n# 1. Get task from DLQ\n# 2. Reset retry count\n# 3. Re-queue task\n"},{"location":"operations/#worker-crashes","title":"Worker Crashes","text":"Symptom: Worker disappeared mid-task
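A rough sketch of such a retry script, using plain dicts in place of Redis for illustration (a real script would issue GET/SET/ZADD/DEL through a Redis client against the `task:dlq:*`, `task:{uuid}`, and `task:queue` keys described in this runbook; the `retry_count`, `status`, and `priority` fields follow the task schema used elsewhere in these docs):

```python
import json

def retry_from_dlq(store, queue, task_id):
    """Move a failed task from the DLQ back onto the priority queue.
    `store` stands in for Redis string keys (key -> JSON), and `queue`
    stands in for the task:queue ZSET (member -> score)."""
    raw = store.pop(f"task:dlq:{task_id}", None)  # step 1: get task from DLQ
    if raw is None:
        return False
    task = json.loads(raw)
    task["retry_count"] = 0                       # step 2: reset retry count
    task["status"] = "queued"
    store[f"task:{task_id}"] = json.dumps(task)   # step 3: re-queue task
    queue[task_id] = task.get("priority", 0)
    return True
```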
What Happens: 1. Lease expires after 30 minutes (default) 2. Background reclaim job detects expired lease 3. Task is retried (up to 3 attempts) 4. After max retries \u2192 Dead Letter Queue
Prevention: - Monitor worker heartbeats - Set up alerts for worker down - Use process manager (systemd, supervisor)
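The retry-or-dead-letter decision above can be sketched as follows (illustrative Python; the `"dead_letter"` status string is an assumption, not necessarily the worker's actual constant):

```python
def handle_expired_lease(task, max_retries=3):
    """Decide what happens to a task whose lease expired:
    re-queue until max_retries attempts are exhausted, then dead-letter."""
    task = dict(task)  # do not mutate the caller's copy
    task["retry_count"] = task.get("retry_count", 0) + 1
    task["status"] = "dead_letter" if task["retry_count"] > max_retries else "queued"
    return task
```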
"},{"location":"operations/#worker-operations","title":"Worker Operations","text":""},{"location":"operations/#graceful-shutdown","title":"Graceful Shutdown","text":"# Send SIGTERM for graceful shutdown\nkill -TERM $(pgrep ml-worker)\n\n# Worker will:\n# 1. Stop accepting new tasks\n# 2. Finish active tasks (up to 5min timeout)\n# 3. Release all leases\n# 4. Exit cleanly\n"},{"location":"operations/#force-shutdown","title":"Force Shutdown","text":"# Force kill (leases will be reclaimed automatically)\nkill -9 $(pgrep ml-worker)\n"},{"location":"operations/#worker-heartbeat-monitoring","title":"Worker Heartbeat Monitoring","text":"# Check worker heartbeats\nHGETALL worker:heartbeat\n\n# Example output:\n# worker-abc123 1701234567\n# worker-def456 1701234580\n Alert if: Heartbeat timestamp > 5 minutes old
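A minimal staleness check matching that alert rule (illustrative Python; `heartbeats` mirrors the `worker:heartbeat` hash of worker ID to Unix timestamp):

```python
def stale_workers(heartbeats, now, max_age=300):
    """Return worker IDs whose last heartbeat is older than max_age
    seconds (default 300s = the 5-minute alert threshold above)."""
    return sorted(
        worker_id
        for worker_id, ts in heartbeats.items()
        if now - int(ts) > max_age
    )
```

Run this against `HGETALL worker:heartbeat` output on a schedule and page when the returned list is non-empty.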
"},{"location":"operations/#redis-operations","title":"Redis Operations","text":""},{"location":"operations/#backup","title":"Backup","text":"# Manual backup\nredis-cli SAVE\ncp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb\n"},{"location":"operations/#restore","title":"Restore","text":"# Stop Redis\nsystemctl stop redis\n\n# Restore snapshot\ncp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb\n\n# Start Redis\nsystemctl start redis\n"},{"location":"operations/#memory-management","title":"Memory Management","text":"# Check memory usage\nINFO memory\n\n# Evict old data if needed\nFLUSHDB # DANGER: Clears all data!\n"},{"location":"operations/#common-issues","title":"Common Issues","text":""},{"location":"operations/#issue-queue-growing-unbounded","title":"Issue: Queue Growing Unbounded","text":"Symptoms: - ZCARD task:queue keeps increasing - No workers processing tasks
Diagnosis:
# Check worker status\nsystemctl status ml-worker\n\n# Check logs\njournalctl -u ml-worker -n 100\n Resolution: 1. Verify workers are running 2. Check Redis connectivity 3. Verify lease configuration
"},{"location":"operations/#issue-high-retry-rate","title":"Issue: High Retry Rate","text":"Symptoms: - Many tasks in DLQ - retry_count field high on tasks
Diagnosis:
# Check worker logs for errors\njournalctl -u ml-worker | grep \"retry\"\n\n# Look for patterns (network issues, resource limits, etc)\n Resolution: - Fix underlying issue (network, resources, etc) - Adjust retry limits if permanent failures - Increase task timeout if jobs are slow
"},{"location":"operations/#issue-leases-expiring-prematurely","title":"Issue: Leases Expiring Prematurely","text":"Symptoms: - Tasks retried even though worker is healthy - Logs show \"lease expired\" frequently
Diagnosis:
# Check worker config\ngrep -A3 \"lease\" configs/worker-config.yaml\n\ntask_lease_duration: 30m # Too short?\nheartbeat_interval: 1m # Too infrequent?\n Resolution:
# Increase lease duration for long-running jobs\ntask_lease_duration: 60m\nheartbeat_interval: 30s # More frequent heartbeats\n"},{"location":"operations/#performance-tuning","title":"Performance Tuning","text":""},{"location":"operations/#worker-concurrency","title":"Worker Concurrency","text":"# worker-config.yaml\nmax_workers: 4 # Number of parallel tasks\n\n# Adjust based on:\n# - CPU cores available\n# - Memory per task\n# - GPU availability\n"},{"location":"operations/#redis-configuration","title":"Redis Configuration","text":"# /etc/redis/redis.conf\n\n# Persistence\nsave 900 1\nsave 300 10\n\n# Memory\nmaxmemory 2gb\nmaxmemory-policy noeviction\n\n# Performance\ntcp-keepalive 300\ntimeout 0\n"},{"location":"operations/#alerting-rules","title":"Alerting Rules","text":""},{"location":"operations/#critical-alerts","title":"Critical Alerts","text":"#!/bin/bash\n# health-check.sh\n\n# Check Redis\nredis-cli PING || echo \"Redis DOWN\"\n\n# Check worker heartbeat\nWORKER_ID=$(cat /var/run/ml-worker.pid)\nLAST_HB=$(redis-cli HGET worker:heartbeat \"$WORKER_ID\")\nNOW=$(date +%s)\nif [ $((NOW - LAST_HB)) -gt 300 ]; then\n echo \"Worker heartbeat stale\"\nfi\n\n# Check queue depth\nDEPTH=$(redis-cli ZCARD task:queue)\nif [ \"$DEPTH\" -gt 1000 ]; then\n echo \"Queue depth critical: $DEPTH\"\nfi\n"},{"location":"operations/#runbook-checklist","title":"Runbook Checklist","text":""},{"location":"operations/#daily-operations","title":"Daily Operations","text":"For homelab setups: Most of these operations can be simplified. Focus on: - Basic monitoring (queue depth, worker status) - Periodic Redis backups - Graceful shutdowns for maintenance
"},{"location":"production-monitoring/","title":"Production Monitoring Deployment Guide (Linux)","text":"This guide covers deploying the monitoring stack (Prometheus, Grafana, Loki, Promtail) on Linux production servers.
"},{"location":"production-monitoring/#architecture","title":"Architecture","text":"Testing: Docker Compose (macOS/Linux) Production: Podman + systemd (Linux)
Important: Docker is for testing only. Podman is used for running actual ML experiments in production.
Dev (Testing): Docker Compose Prod (Experiments): Podman + systemd
Each service runs as a separate Podman container managed by systemd for automatic restarts and proper lifecycle management.
"},{"location":"production-monitoring/#prerequisites","title":"Prerequisites","text":"Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution
scripts/setup-monitoring-prod.sh)cd /path/to/fetch_ml\nsudo ./scripts/setup-monitoring-prod.sh /data/monitoring ml-user ml-group\n This will: - Create directory structure at /data/monitoring - Copy configuration files to /etc/fetch_ml/monitoring - Create systemd services for each component - Set up firewall rules
# Start all monitoring services\nsudo systemctl start prometheus\nsudo systemctl start loki\nsudo systemctl start promtail\nsudo systemctl start grafana\n\n# Enable on boot\nsudo systemctl enable prometheus loki promtail grafana\n"},{"location":"production-monitoring/#3-access-grafana","title":"3. Access Grafana","text":"http://YOUR_SERVER_IP:3000adminadmin (change on first login)Dashboards will auto-load: - ML Task Queue Monitoring (metrics) - Application Logs (Loki logs)
"},{"location":"production-monitoring/#service-details","title":"Service Details","text":""},{"location":"production-monitoring/#prometheus","title":"Prometheus","text":"/etc/fetch_ml/monitoring/prometheus.yml/data/monitoring/prometheus/etc/fetch_ml/monitoring/loki-config.yml/data/monitoring/loki/etc/fetch_ml/monitoring/promtail-config.yml/var/log/fetch_ml/*.log/etc/fetch_ml/monitoring/grafana/provisioning/data/monitoring/grafana/var/lib/grafana/dashboards# Check status\nsudo systemctl status prometheus grafana loki promtail\n\n# View logs\nsudo journalctl -u prometheus -f\nsudo journalctl -u grafana -f\nsudo journalctl -u loki -f\nsudo journalctl -u promtail -f\n\n# Restart services\nsudo systemctl restart prometheus\nsudo systemctl restart grafana\n\n# Stop all monitoring\nsudo systemctl stop prometheus grafana loki promtail\n"},{"location":"production-monitoring/#data-retention","title":"Data Retention","text":""},{"location":"production-monitoring/#prometheus_1","title":"Prometheus","text":"Default: 15 days. Edit /etc/fetch_ml/monitoring/prometheus.yml:
storage:\n tsdb:\n retention.time: 30d\n"},{"location":"production-monitoring/#loki_1","title":"Loki","text":"Default: 30 days. Edit /etc/fetch_ml/monitoring/loki-config.yml:
limits_config:\n retention_period: 30d\n"},{"location":"production-monitoring/#security","title":"Security","text":""},{"location":"production-monitoring/#firewall","title":"Firewall","text":"The setup script automatically configures firewall rules using the detected firewall manager (firewalld or ufw).
For manual firewall configuration:
RHEL/Rocky/Fedora (firewalld):
# Remove public access\nsudo firewall-cmd --permanent --remove-port=3000/tcp\nsudo firewall-cmd --permanent --remove-port=9090/tcp\n\n# Add specific source\nsudo firewall-cmd --permanent --add-rich-rule='rule family=\"ipv4\" source address=\"10.0.0.0/24\" port port=\"3000\" protocol=\"tcp\" accept'\nsudo firewall-cmd --reload\n Ubuntu/Debian (ufw):
# Remove public access\nsudo ufw delete allow 3000/tcp\nsudo ufw delete allow 9090/tcp\n\n# Add specific source\nsudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp\n"},{"location":"production-monitoring/#authentication","title":"Authentication","text":"Change Grafana admin password: 1. Login to Grafana 2. User menu \u2192 Profile \u2192 Change Password
"},{"location":"production-monitoring/#tls-optional","title":"TLS (Optional)","text":"For HTTPS, configure reverse proxy (nginx/Apache) in front of Grafana.
"},{"location":"production-monitoring/#troubleshooting","title":"Troubleshooting","text":""},{"location":"production-monitoring/#grafana-shows-no-data","title":"Grafana shows no data","text":"# Check if Prometheus is reachable\ncurl http://localhost:9090/-/healthy\n\n# Check datasource in Grafana\n# Settings \u2192 Data Sources \u2192 Prometheus \u2192 Save & Test\n"},{"location":"production-monitoring/#loki-not-receiving-logs","title":"Loki not receiving logs","text":"# Check Promtail is running\nsudo systemctl status promtail\n\n# Verify log file exists\nls -l /var/log/fetch_ml/\n\n# Check Promtail can reach Loki\ncurl http://localhost:3100/ready\n"},{"location":"production-monitoring/#podman-containers-not-starting","title":"Podman containers not starting","text":"# Check pod status\nsudo -u ml-user podman pod ps\nsudo -u ml-user podman ps -a\n\n# Remove and recreate\nsudo -u ml-user podman pod stop monitoring\nsudo -u ml-user podman pod rm monitoring\nsudo systemctl restart prometheus\n"},{"location":"production-monitoring/#backup","title":"Backup","text":"# Backup Grafana dashboards and data\nsudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana\n\n# Backup Prometheus data\nsudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus\n"},{"location":"production-monitoring/#updates","title":"Updates","text":"# Pull latest images\nsudo -u ml-user podman pull docker.io/grafana/grafana:latest\nsudo -u ml-user podman pull docker.io/prom/prometheus:latest\nsudo -u ml-user podman pull docker.io/grafana/loki:latest\nsudo -u ml-user podman pull docker.io/grafana/promtail:latest\n\n# Restart services to use new images\nsudo systemctl restart grafana prometheus loki promtail\n"},{"location":"queue/","title":"Task Queue Architecture","text":"The task queue system enables reliable job processing between the API server and workers using Redis.
"},{"location":"queue/#overview","title":"Overview","text":"graph LR\n CLI[CLI/Client] -->|WebSocket| API[API Server]\n API -->|Enqueue| Redis[(Redis)]\n Redis -->|Dequeue| Worker[Worker]\n Worker -->|Update Status| Redis\n"},{"location":"queue/#components","title":"Components","text":""},{"location":"queue/#taskqueue-internalqueue","title":"TaskQueue (internal/queue)","text":"Shared package used by both API server and worker for job management.
"},{"location":"queue/#task-structure","title":"Task Structure","text":"type Task struct {\n ID string // Unique task ID (UUID)\n JobName string // User-defined job name \n Args string // Job arguments\n Status string // queued, running, completed, failed\n Priority int64 // Higher = executed first\n CreatedAt time.Time \n StartedAt *time.Time \n EndedAt *time.Time \n WorkerID string \n Error string \n Datasets []string \n Metadata map[string]string // commit_id, user, etc\n}\n"},{"location":"queue/#taskqueue-interface","title":"TaskQueue Interface","text":"// Initialize queue\nqueue, err := queue.NewTaskQueue(queue.Config{\n RedisAddr: \"localhost:6379\",\n RedisPassword: \"\",\n RedisDB: 0,\n})\n\n// Add task (API server)\ntask := &queue.Task{\n ID: uuid.New().String(),\n JobName: \"train-model\",\n Status: \"queued\",\n Priority: 5,\n Metadata: map[string]string{\n \"commit_id\": commitID,\n \"user\": username,\n },\n}\nerr = queue.AddTask(task)\n\n// Get next task (Worker)\ntask, err := queue.GetNextTask()\n\n// Update task status\ntask.Status = \"running\"\nerr = queue.UpdateTask(task)\n"},{"location":"queue/#data-flow","title":"Data Flow","text":""},{"location":"queue/#job-submission-flow","title":"Job Submission Flow","text":"sequenceDiagram\n participant CLI\n participant API\n participant Redis\n participant Worker\n\n CLI->>API: Queue Job (WebSocket)\n API->>API: Create Task (UUID)\n API->>Redis: ZADD task:queue\n API->>Redis: SET task:{id}\n API->>CLI: Success Response\n\n Worker->>Redis: ZPOPMAX task:queue\n Redis->>Worker: Task ID\n Worker->>Redis: GET task:{id}\n Redis->>Worker: Task Data\n Worker->>Worker: Execute Job\n Worker->>Redis: Update Status\n"},{"location":"queue/#protocol","title":"Protocol","text":"CLI \u2192 API (Binary WebSocket):
[opcode:1][api_key_hash:64][commit_id:64][priority:1][job_name_len:1][job_name:var]\n API \u2192 Redis: - Priority queue: ZADD task:queue {priority} {task_id} - Task data: SET task:{id} {json} - Status: HSET task:status:{job_name} ...
Worker \u2190 Redis: - Poll: ZPOPMAX task:queue 1 (highest priority first) - Fetch: GET task:{id}
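Assuming the field widths in the layout above are byte counts, the frame could be packed and unpacked like this (an illustrative Python codec, not the actual Zig CLI or Go server implementation):

```python
import struct

# opcode:1, api_key_hash:64, commit_id:64, priority:1, job_name_len:1
HEADER = ">B64s64sBB"
HEADER_LEN = struct.calcsize(HEADER)  # 131 bytes before the variable-length name

def encode_queue_job(opcode, api_key_hash, commit_id, priority, job_name):
    """Pack a CLI -> API queue-job frame per the layout above."""
    name = job_name.encode()
    if len(api_key_hash) != 64 or len(commit_id) != 64 or len(name) > 255:
        raise ValueError("bad field length")
    header = struct.pack(HEADER, opcode, api_key_hash, commit_id, priority, len(name))
    return header + name

def decode_queue_job(frame):
    """Unpack a frame back into its fields."""
    opcode, key_hash, commit, priority, name_len = struct.unpack_from(HEADER, frame)
    name = frame[HEADER_LEN:HEADER_LEN + name_len].decode()
    return opcode, key_hash, commit, priority, name
```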
task:queue # ZSET: priority queue\ntask:{uuid} # STRING: task JSON data\ntask:status:{job_name} # HASH: job status\nworker:heartbeat # HASH: worker health\njob:metrics:{job_name} # HASH: job metrics\n"},{"location":"queue/#priority-queue-zset","title":"Priority Queue (ZSET)","text":"ZADD task:queue 10 \"uuid-1\" # Priority 10\nZADD task:queue 5 \"uuid-2\" # Priority 5\nZPOPMAX task:queue 1 # Returns uuid-1 (highest)\n"},{"location":"queue/#api-server-integration","title":"API Server Integration","text":""},{"location":"queue/#initialization","title":"Initialization","text":"// cmd/api-server/main.go\nqueueCfg := queue.Config{\n RedisAddr: cfg.Redis.Addr,\n RedisPassword: cfg.Redis.Password,\n RedisDB: cfg.Redis.DB,\n}\ntaskQueue, err := queue.NewTaskQueue(queueCfg)\n"},{"location":"queue/#websocket-handler","title":"WebSocket Handler","text":"// internal/api/ws.go\nfunc (h *WSHandler) handleQueueJob(conn *websocket.Conn, payload []byte) error {\n // Parse request\n apiKeyHash, commitID, priority, jobName := parsePayload(payload)\n\n // Create task with unique ID\n taskID := uuid.New().String()\n task := &queue.Task{\n ID: taskID,\n JobName: jobName,\n Status: \"queued\",\n Priority: int64(priority),\n Metadata: map[string]string{\n \"commit_id\": commitID,\n \"user\": user,\n },\n }\n\n // Enqueue\n if err := h.queue.AddTask(task); err != nil {\n return h.sendErrorPacket(conn, ErrorCodeDatabaseError, ...)\n }\n\n return h.sendSuccessPacket(conn, \"Job queued\")\n}\n"},{"location":"queue/#worker-integration","title":"Worker Integration","text":""},{"location":"queue/#task-polling","title":"Task Polling","text":"// cmd/worker/worker_server.go\nfunc (w *Worker) Start() error {\n for {\n task, err := w.queue.WaitForNextTask(ctx, 5*time.Second)\n if task != nil {\n go w.executeTask(task)\n }\n }\n}\n"},{"location":"queue/#task-execution","title":"Task Execution","text":"func (w *Worker) executeTask(task *queue.Task) {\n // Update status\n task.Status = \"running\"\n 
task.StartedAt = &now\n w.queue.UpdateTaskWithMetrics(task, \"start\")\n\n // Execute\n err := w.runJob(task)\n\n // Finalize\n task.Status = \"completed\" // or \"failed\"\n task.EndedAt = &endTime\n task.Error = err.Error() // if err != nil\n w.queue.UpdateTaskWithMetrics(task, \"final\")\n}\n"},{"location":"queue/#configuration","title":"Configuration","text":""},{"location":"queue/#api-server-configsconfigyaml","title":"API Server (configs/config.yaml)","text":"redis:\n addr: \"localhost:6379\"\n password: \"\"\n db: 0\n"},{"location":"queue/#worker-configsworker-configyaml","title":"Worker (configs/worker-config.yaml)","text":"redis:\n addr: \"localhost:6379\"\n password: \"\"\n db: 0\n\nmetrics_flush_interval: 500ms\n"},{"location":"queue/#monitoring","title":"Monitoring","text":""},{"location":"queue/#queue-depth","title":"Queue Depth","text":"depth, err := queue.QueueDepth()\nfmt.Printf(\"Pending tasks: %d\\n\", depth)\n"},{"location":"queue/#worker-heartbeat","title":"Worker Heartbeat","text":"// Worker sends heartbeat every 30s\nerr := queue.Heartbeat(workerID)\n"},{"location":"queue/#metrics","title":"Metrics","text":"HGETALL job:metrics:{job_name}\n# Returns: timestamp, tasks_start, tasks_final, etc\n"},{"location":"queue/#error-handling","title":"Error Handling","text":""},{"location":"queue/#task-failures","title":"Task Failures","text":"if err := w.runJob(task); err != nil {\n task.Status = \"failed\"\n task.Error = err.Error()\n w.queue.UpdateTask(task)\n}\n"},{"location":"queue/#redis-connection-loss","title":"Redis Connection Loss","text":"// TaskQueue automatically reconnects\n// Workers should implement retry logic\nfor retries := 0; retries < 3; retries++ {\n task, err := queue.GetNextTask()\n if err == nil {\n break\n }\n time.Sleep(backoff)\n}\n"},{"location":"queue/#testing","title":"Testing","text":"// tests using miniredis\ns, _ := miniredis.Run()\ndefer s.Close()\n\ntq, _ := queue.NewTaskQueue(queue.Config{\n RedisAddr: 
s.Addr(),\n})\n\ntask := &queue.Task{ID: \"test-1\", JobName: \"test\"}\ntq.AddTask(task)\n\nfetched, _ := tq.GetNextTask()\n// assert fetched.ID == \"test-1\"\n"},{"location":"queue/#best-practices","title":"Best Practices","text":"For implementation details, see: - internal/queue/task.go - internal/queue/queue.go
"},{"location":"quick-start/","title":"Quick Start","text":"Get Fetch ML running in minutes with Docker Compose.
"},{"location":"quick-start/#prerequisites","title":"Prerequisites","text":"Container Runtimes: - Docker Compose: For testing and development only - Podman: For production experiment execution
# Clone and start\ngit clone https://github.com/jfraeys/fetch_ml.git\ncd fetch_ml\ndocker-compose up -d  # testing only\n\n# Wait for services (30 seconds)\nsleep 30\n\n# Verify setup\ncurl http://localhost:9101/health\n"},{"location":"quick-start/#first-experiment","title":"First Experiment","text":"# Submit a simple ML job (see [First Experiment](first-experiment.md) for details)\ncurl -X POST http://localhost:9101/api/v1/jobs \\\n -H \"Content-Type: application/json\" \\\n -H \"X-API-Key: admin\" \\\n -d '{\n \"job_name\": \"hello-world\",\n \"args\": \"--echo Hello World\",\n \"priority\": 1\n }'\n\n# Check job status\ncurl http://localhost:9101/api/v1/jobs \\\n -H \"X-API-Key: admin\"\n"},{"location":"quick-start/#cli-access","title":"CLI Access","text":"# Build CLI\ncd cli && zig build dev\n\n# List jobs\n./cli/zig-out/dev/ml --server http://localhost:9101 list-jobs\n\n# Submit new job\n./cli/zig-out/dev/ml --server http://localhost:9101 submit \\\n --name \"test-job\" --args \"--epochs 10\"\n"},{"location":"quick-start/#related-documentation","title":"Related Documentation","text":"Services not starting?
# Check logs\ndocker-compose logs\n\n# Restart services\ndocker-compose down && docker-compose up -d  # testing only\n API not responding?
# Check health\ncurl http://localhost:9101/health\n\n# Verify ports\ndocker-compose ps\n Permission denied?
# Check API key\ncurl -H \"X-API-Key: admin\" http://localhost:9101/api/v1/jobs\n"},{"location":"redis-ha/","title":"Redis High Availability","text":"Note: This is optional for homelab setups. A single Redis instance is sufficient for most use cases.
"},{"location":"redis-ha/#when-you-need-ha","title":"When You Need HA","text":"Consider Redis HA if: - Running production workloads - Uptime > 99.9% required - Can't afford to lose queued tasks - Multiple workers across machines
"},{"location":"redis-ha/#redis-sentinel-recommended","title":"Redis Sentinel (Recommended)","text":""},{"location":"redis-ha/#setup","title":"Setup","text":"# docker-compose.yml\nversion: '3.8'\nservices:\n redis-master:\n image: redis:7-alpine\n command: redis-server --maxmemory 2gb\n\n redis-replica:\n image: redis:7-alpine\n command: redis-server --slaveof redis-master 6379\n\n redis-sentinel-1:\n image: redis:7-alpine\n command: redis-sentinel /etc/redis/sentinel.conf\n volumes:\n - ./sentinel.conf:/etc/redis/sentinel.conf\n sentinel.conf:
sentinel monitor mymaster redis-master 6379 2\nsentinel down-after-milliseconds mymaster 5000\nsentinel parallel-syncs mymaster 1\nsentinel failover-timeout mymaster 10000\n"},{"location":"redis-ha/#application-configuration","title":"Application Configuration","text":"# worker-config.yaml\nredis_addr: \"redis-sentinel-1:26379,redis-sentinel-2:26379\"\nredis_master_name: \"mymaster\"\n"},{"location":"redis-ha/#redis-cluster-advanced","title":"Redis Cluster (Advanced)","text":"For larger deployments with sharding needs.
# Minimum 3 masters + 3 replicas\nservices:\n redis-1:\n image: redis:7-alpine\n command: redis-server --cluster-enabled yes\n\n redis-2:\n # ... similar config\n"},{"location":"redis-ha/#homelab-alternative-persistence-only","title":"Homelab Alternative: Persistence Only","text":"For most homelabs, just enable persistence:
# docker-compose.yml\nservices:\n redis:\n image: redis:7-alpine\n command: redis-server --appendonly yes\n volumes:\n - redis_data:/data\n\nvolumes:\n redis_data:\n This ensures tasks survive Redis restarts without full HA complexity.
Recommendation: Start simple. Add HA only if you experience actual downtime issues.
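If you do adopt Sentinel, the worker config above packs several sentinel addresses into one comma-separated redis_addr string. A small helper to split that value into a slice (splitSentinelAddrs is a hypothetical name, not part of the codebase):

```go
package main

import (
	"fmt"
	"strings"
)

// splitSentinelAddrs turns a comma-separated redis_addr value such as
// "redis-sentinel-1:26379,redis-sentinel-2:26379" into a slice of
// host:port strings, trimming whitespace and skipping empty entries.
func splitSentinelAddrs(s string) []string {
	var addrs []string
	for _, a := range strings.Split(s, ",") {
		if a = strings.TrimSpace(a); a != "" {
			addrs = append(addrs, a)
		}
	}
	return addrs
}

func main() {
	fmt.Println(splitSentinelAddrs("redis-sentinel-1:26379, redis-sentinel-2:26379"))
}
```

The resulting slice is the shape a sentinel-aware Redis client typically expects for its list of sentinel endpoints, alongside the configured redis_master_name.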
"},{"location":"release-checklist/","title":"Release Checklist","text":"This checklist captures the work required before cutting a release that includes the graceful worker shutdown feature.
"},{"location":"release-checklist/#1-code-hygiene-compilation","title":"1. Code Hygiene / Compilation","text":"Worker redeclared errors (see cmd/worker/worker_graceful_shutdown.go and cmd/worker/worker_server.go). - logger, queue, cfg, metrics). - go build ./cmd/worker succeeds without undefined-field errors. - shutdownCh, activeTasks, and gracefulWait during worker start-up. - heartbeatLoop, releaseAllLeases). - executeTaskWithLease with the real executeTask signature so the \"no value used as value\" compile error disappears. - cmd/worker/worker_server.go wires up config, queue, metrics, and logger instances used by the shutdown logic. - go test ./cmd/worker/... and make test (or equivalent) pass locally. This document outlines security features, best practices, and hardening procedures for FetchML.
"},{"location":"security/#security-features","title":"Security Features","text":""},{"location":"security/#authentication-authorization","title":"Authentication & Authorization","text":"Generate Strong Passwords
# Grafana admin password\nopenssl rand -base64 32 > .grafana-password\n\n# Redis password\nopenssl rand -base64 32\n Configure Environment Variables
cp .env.example .env\n# Edit .env and set:\n# - GRAFANA_ADMIN_PASSWORD\n Enable TLS (Production only)
# configs/config-prod.yaml\nserver:\n tls:\n enabled: true\n cert_file: \"/secrets/cert.pem\"\n key_file: \"/secrets/key.pem\"\n Configure Firewall
# Allow only necessary ports\nsudo ufw allow 22/tcp # SSH\nsudo ufw allow 443/tcp # HTTPS\nsudo ufw allow 80/tcp # HTTP (redirect to HTTPS)\nsudo ufw enable\n Restrict IP Access
# configs/config-prod.yaml\nauth:\n ip_whitelist:\n - \"10.0.0.0/8\"\n - \"192.168.0.0/16\"\n - \"127.0.0.1\"\n Enable Audit Logging
logging:\n level: \"info\"\n audit: true\n file: \"/var/log/fetch_ml/audit.log\"\n Harden Redis
# Redis security\nredis-cli CONFIG SET requirepass \"your-strong-password\"\nredis-cli CONFIG SET rename-command FLUSHDB \"\"\nredis-cli CONFIG SET rename-command FLUSHALL \"\"\n Secure Grafana
# Change default admin password\ndocker-compose exec grafana grafana-cli admin reset-admin-password new-strong-password\n Regular Updates
# Update system packages\nsudo apt update && sudo apt upgrade -y\n\n# Update containers\ndocker-compose pull\ndocker-compose up -d  # testing only\n # Method 1: OpenSSL\nopenssl rand -base64 32\n\n# Method 2: pwgen (if installed)\npwgen -s 32 1\n\n# Method 3: /dev/urandom\nhead -c 32 /dev/urandom | base64\n"},{"location":"security/#store-passwords-securely","title":"Store Passwords Securely","text":"Development: Use .env file (gitignored)
echo \"REDIS_PASSWORD=$(openssl rand -base64 32)\" >> .env\necho \"GRAFANA_ADMIN_PASSWORD=$(openssl rand -base64 32)\" >> .env\n Production: Use systemd environment files
sudo mkdir -p /etc/fetch_ml/secrets\nsudo chmod 700 /etc/fetch_ml/secrets\necho \"REDIS_PASSWORD=...\" | sudo tee /etc/fetch_ml/secrets/redis.env\nsudo chmod 600 /etc/fetch_ml/secrets/redis.env\n"},{"location":"security/#api-key-management","title":"API Key Management","text":""},{"location":"security/#generate-api-keys","title":"Generate API Keys","text":"# Generate random API key\nopenssl rand -hex 32\n\n# Hash for storage\necho -n \"your-api-key\" | sha256sum\n"},{"location":"security/#rotate-api-keys","title":"Rotate API Keys","text":"config-local.yaml with new hashRemove user entry from config-local.yaml:
auth:\n apikeys:\n # user_to_revoke: # Comment out or delete\n"},{"location":"security/#network-security_1","title":"Network Security","text":""},{"location":"security/#production-network-topology","title":"Production Network Topology","text":"Internet\n \u2193\n[Firewall] (ports 3000, 9102)\n \u2193\n[Reverse Proxy] (nginx/Apache) - TLS termination\n \u2193\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Application Pod \u2502\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 API Server \u2502 \u2502 \u2190 Public (via reverse proxy)\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 Redis \u2502 \u2502 \u2190 Internal only\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 Grafana \u2502 \u2502 \u2190 Public (via reverse proxy)\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 Prometheus \u2502 \u2502 \u2190 Internal only\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 Loki \u2502 \u2502 \u2190 Internal only\n\u2502 
\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n"},{"location":"security/#recommended-firewall-rules","title":"Recommended Firewall Rules","text":"# Allow only necessary inbound connections\nsudo firewall-cmd --permanent --zone=public --add-rich-rule='\n rule family=\"ipv4\"\n source address=\"YOUR_NETWORK\"\n port port=\"3000\" protocol=\"tcp\" accept'\n\nsudo firewall-cmd --permanent --zone=public --add-rich-rule='\n rule family=\"ipv4\"\n source address=\"YOUR_NETWORK\"\n port port=\"9102\" protocol=\"tcp\" accept'\n\n# Block all other traffic\nsudo firewall-cmd --permanent --set-default-zone=drop\nsudo firewall-cmd --reload\n"},{"location":"security/#incident-response","title":"Incident Response","text":""},{"location":"security/#suspected-breach","title":"Suspected Breach","text":"Review audit logs
Investigation
# Check recent logins\nsudo journalctl -u fetchml-api --since \"1 hour ago\"\n\n# Review failed auth attempts\ngrep \"authentication failed\" /var/log/fetch_ml/*.log\n\n# Check active connections\nss -tnp | grep :9102\n Recovery
# Monitor failed authentication\ntail -f /var/log/fetch_ml/api.log | grep \"auth.*failed\"\n\n# Monitor unusual activity\njournalctl -u fetchml-api -f | grep -E \"(ERROR|WARN)\"\n\n# Check open ports\nnmap -p- localhost\n"},{"location":"security/#security-best-practices","title":"Security Best Practices","text":"All API access is logged with: - Timestamp - User/API key - Action performed - Source IP - Result (success/failure)
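The audit fields listed above can be modeled as a single log-entry type. A hypothetical sketch (field and JSON key names are illustrative, not the actual log schema):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// AuditEntry is a hypothetical shape for one audit-log record,
// covering the fields listed above; it is not the actual log schema.
type AuditEntry struct {
	Timestamp time.Time `json:"timestamp"`
	User      string    `json:"user"`   // user or API-key name
	Action    string    `json:"action"` // e.g. jobs:create
	SourceIP  string    `json:"source_ip"`
	Success   bool      `json:"success"`
}

func main() {
	e := AuditEntry{
		Timestamp: time.Now().UTC(),
		User:      "alice",
		Action:    "jobs:create",
		SourceIP:  "10.0.0.5",
		Success:   true,
	}
	line, _ := json.Marshal(e)
	fmt.Println(string(line)) // one JSON object per audit line
}
```

Emitting one JSON object per line keeps the audit file greppable and easy to ship to Loki.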
"},{"location":"security/#getting-help","title":"Getting Help","text":"This document describes Fetch ML's smart defaults system, which automatically adapts configuration based on the runtime environment.
"},{"location":"smart-defaults/#overview","title":"Overview","text":"Smart defaults eliminate the need for manual configuration tweaks when running in different environments:
The system automatically detects the environment based on:
Detection signals: CI - CI, GITHUB_ACTIONS, GITLAB_CI environment variables; Container - /.dockerenv, KUBERNETES_SERVICE_HOST, or CONTAINER variables; Production - FETCH_ML_ENV=production or ENV=production. Defaults by profile (local / container / production): Host - localhost / host.docker.internal (Docker Desktop/Colima) / 0.0.0.0; Base path - ~/ml-experiments / /workspace/ml-experiments / /var/lib/fetch_ml/experiments; Data path - ~/ml-data / /workspace/data / /var/lib/fetch_ml/data; Redis - localhost:6379 / redis:6379 (service name) / redis:6379; SSH keys - ~/.ssh/id_rsa and ~/.ssh/known_hosts / /workspace/.ssh/id_rsa and /workspace/.ssh/known_hosts / /etc/fetch_ml/ssh/id_rsa and /etc/fetch_ml/ssh/known_hosts; Log level - info / debug (verbose for debugging) / info. // Get smart defaults for current environment\nsmart := config.GetSmartDefaults()\n\n// Use smart defaults\nif cfg.Host == \"\" {\n cfg.Host = smart.Host()\n}\nif cfg.BasePath == \"\" {\n cfg.BasePath = smart.BasePath()\n}\n"},{"location":"smart-defaults/#environment-overrides","title":"Environment Overrides","text":"Smart defaults can be overridden with environment variables:
FETCH_ML_HOST - Override host; FETCH_ML_BASE_PATH - Override base path; FETCH_ML_REDIS_ADDR - Override Redis address; FETCH_ML_ENV - Force environment profile. You can force a specific environment:
# Force production mode\nexport FETCH_ML_ENV=production\n\n# Force container mode\nexport CONTAINER=true\n"},{"location":"smart-defaults/#implementation-details","title":"Implementation Details","text":"The smart defaults system is implemented in internal/config/smart_defaults.go:
DetectEnvironment() - Determines current environment profile; SmartDefaults struct - Provides environment-aware defaults. No changes required - existing configurations continue to work. Smart defaults only apply when values are not explicitly set.
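A simplified sketch of the detection precedence described above: CI variables first, then container markers, then an explicit production override. The real logic lives in internal/config/smart_defaults.go and also checks for /.dockerenv on disk; this version takes a getenv function so it can be exercised without touching the process environment:

```go
package main

import (
	"fmt"
	"os"
)

// Profile mirrors the environment profiles described above.
type Profile string

const (
	ProfileCI         Profile = "ci"
	ProfileContainer  Profile = "container"
	ProfileProduction Profile = "production"
	ProfileLocal      Profile = "local"
)

// detectEnvironment applies the precedence sketched above: CI
// variables, then container markers, then an explicit production
// override, falling back to local.
func detectEnvironment(getenv func(string) string) Profile {
	for _, v := range []string{"CI", "GITHUB_ACTIONS", "GITLAB_CI"} {
		if getenv(v) != "" {
			return ProfileCI
		}
	}
	if getenv("KUBERNETES_SERVICE_HOST") != "" || getenv("CONTAINER") != "" {
		return ProfileContainer
	}
	if getenv("FETCH_ML_ENV") == "production" || getenv("ENV") == "production" {
		return ProfileProduction
	}
	return ProfileLocal
}

func main() {
	fmt.Println(detectEnvironment(os.Getenv))
}
```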
"},{"location":"smart-defaults/#for-developers","title":"For Developers","text":"When adding new configuration options:
SmartDefaults struct. Example:
// Add to SmartDefaults struct\nfunc (s *SmartDefaults) NewFeature() string {\n switch s.Profile {\n case ProfileContainer, ProfileCI:\n return \"/workspace/new-feature\"\n case ProfileProduction:\n return \"/var/lib/fetch_ml/new-feature\"\n default:\n return \"./new-feature\"\n }\n}\n\n// Use in config loader\nif cfg.NewFeature == \"\" {\n cfg.NewFeature = smart.NewFeature()\n}\n"},{"location":"smart-defaults/#testing","title":"Testing","text":"To test different environments:
# Test local defaults (default)\n./bin/worker\n\n# Test container defaults\nexport CONTAINER=true\n./bin/worker\n\n# Test CI defaults\nexport CI=true\n./bin/worker\n\n# Test production defaults\nexport FETCH_ML_ENV=production\n./bin/worker\n"},{"location":"smart-defaults/#troubleshooting","title":"Troubleshooting","text":""},{"location":"smart-defaults/#wrong-environment-detection","title":"Wrong Environment Detection","text":"Check environment variables:
echo \"CI: $CI\"\necho \"CONTAINER: $CONTAINER\"\necho \"FETCH_ML_ENV: $FETCH_ML_ENV\"\n"},{"location":"smart-defaults/#path-issues","title":"Path Issues","text":"Smart defaults expand ~ and environment variables automatically. If paths don't work as expected:
config.GetSmartDefaults().GetEnvironmentDescription(). For container environments, ensure: - Redis service is named redis in docker-compose - Host networking is configured properly - host.docker.internal resolves (Docker Desktop/Colima)
How to run and write tests for FetchML.
"},{"location":"testing/#running-tests","title":"Running Tests","text":""},{"location":"testing/#quick-test","title":"Quick Test","text":"# All tests\nmake test\n\n# Unit tests only\nmake test-unit\n\n# Integration tests\nmake test-integration\n\n# With coverage\nmake test-coverage\n"},{"location":"testing/#docker-testing","title":"Docker Testing","text":"docker-compose up -d  # testing only\nmake test\ndocker-compose down\n"},{"location":"testing/#cli-testing","title":"CLI Testing","text":"cd cli && zig build dev\n./cli/zig-out/dev/ml --help\nzig build test\n"},{"location":"troubleshooting/","title":"Troubleshooting","text":"Common issues and solutions for Fetch ML.
"},{"location":"troubleshooting/#quick-fixes","title":"Quick Fixes","text":""},{"location":"troubleshooting/#services-not-starting","title":"Services Not Starting","text":"# Check Docker status\ndocker-compose ps\n\n# Restart services\ndocker-compose down && docker-compose up -d  # testing only\n\n# Check logs\ndocker-compose logs -f\n"},{"location":"troubleshooting/#api-not-responding","title":"API Not Responding","text":"# Check health endpoint\ncurl http://localhost:9101/health\n\n# Check if port is in use\nlsof -i :9101\n\n# Kill process on port\nkill -9 $(lsof -ti :9101)\n"},{"location":"troubleshooting/#database-issues","title":"Database Issues","text":"# Check database connection\ndocker-compose exec postgres psql -U postgres -d fetch_ml\n\n# Reset database\ndocker-compose down postgres\ndocker-compose up -d postgres  # testing only\n\n# Check Redis\ndocker-compose exec redis redis-cli ping\n"},{"location":"troubleshooting/#common-errors","title":"Common Errors","text":""},{"location":"troubleshooting/#authentication-errors","title":"Authentication Errors","text":"jwt_expiry setting - --migrate (see Development Setup) - runtime: docker (testing only) in config - resources.memory_limit - go mod tidy and cd cli && rm -rf zig-out zig-cache - docker-compose -f docker-compose.test.yml up -d - cd cli && zig build dev - --server and --api-key - lsof -i :9101 and kill processes - python3 -c \"import yaml; yaml.safe_load(open('config.yaml'))\" - see [Configuration Schema](configuration-schema.md) - ./bin/api-server --version\ndocker-compose ps\ndocker-compose logs api-server | grep ERROR\n"},{"location":"troubleshooting/#emergency-reset","title":"Emergency Reset","text":"docker-compose down -v\nrm -rf data/ results/ *.db\ndocker-compose up -d  # testing only\n"},{"location":"user-permissions/","title":"User Permissions in Fetch ML","text":"Fetch ML now supports user-based permissions to ensure data scientists can only view and manage their own experiments while administrators retain full control.
"},{"location":"user-permissions/#overview","title":"Overview","text":"jobs:create - Create new experiments; jobs:read - View experiment status and results; jobs:update - Cancel or modify experiments. ml status\n Shows only your experiments with user context displayed."},{"location":"user-permissions/#cancel-your-jobs","title":"Cancel Your Jobs","text":"ml cancel <job-name>\n Only allows canceling your own experiments (unless you're an admin)."},{"location":"user-permissions/#authentication","title":"Authentication","text":"The CLI automatically authenticates using your API key from ~/.ml/config.toml.
[worker]\napi_key = \"your-api-key-here\"\n"},{"location":"user-permissions/#user-roles","title":"User Roles","text":"User roles and permissions are configured on the server side by administrators.
"},{"location":"user-permissions/#security-features","title":"Security Features","text":"# Submit your experiment\nml run my-experiment\n\n# Check your experiments (only shows yours)\nml status\n\n# Cancel your own experiment\nml cancel my-experiment\n"},{"location":"user-permissions/#administrator-workflow","title":"Administrator Workflow","text":"# View all experiments (admin sees everything)\nml status\n\n# Cancel any user's experiment\nml cancel user-experiment\n"},{"location":"user-permissions/#error-messages","title":"Error Messages","text":"For more details, see the architecture documentation.
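The ownership rule described above (users may only cancel their own experiments, admins may cancel any) reduces to a one-line check. A hypothetical server-side sketch, not the actual implementation:

```go
package main

import "fmt"

// canCancel sketches the ownership rule described above:
// admins may cancel any job, other users only their own.
// Hypothetical helper, not the server's actual implementation.
func canCancel(requestUser, jobOwner string, isAdmin bool) bool {
	return isAdmin || requestUser == jobOwner
}

func main() {
	fmt.Println(canCancel("alice", "alice", false)) // true
	fmt.Println(canCancel("alice", "bob", false))   // false
	fmt.Println(canCancel("admin", "bob", true))    // true
}
```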
"},{"location":"zig-cli/","title":"Zig CLI Guide","text":"High-performance command-line interface for ML experiment management, written in Zig for maximum speed and efficiency.
"},{"location":"zig-cli/#overview","title":"Overview","text":"The Zig CLI (ml) is the primary interface for managing ML experiments in your homelab. Built with Zig, it provides exceptional performance for file operations, network communication, and experiment management.
Download from GitHub Releases:
# Download for your platform\ncurl -LO https://github.com/jfraeys/fetch_ml/releases/latest/download/ml-<platform>.tar.gz\n\n# Extract\ntar -xzf ml-<platform>.tar.gz\n\n# Install\nchmod +x ml-<platform>\nsudo mv ml-<platform> /usr/local/bin/ml\n\n# Verify\nml --help\n Platforms: - ml-linux-x86_64.tar.gz - Linux (fully static, zero dependencies) - ml-macos-x86_64.tar.gz - macOS Intel - ml-macos-arm64.tar.gz - macOS Apple Silicon
All release binaries include embedded static rsync for complete independence.
"},{"location":"zig-cli/#build-from-source","title":"Build from Source","text":"Development Build (uses system rsync):
cd cli\nzig build dev\n./zig-out/dev/ml-dev --help\n Production Build (embedded rsync):
cd cli\n# For testing: uses rsync wrapper\nzig build prod\n\n# For release with static rsync:\n# 1. Place static rsync binary at src/assets/rsync_release.bin\n# 2. Build\nzig build prod\nstrip zig-out/prod/ml # Optional: reduce size\n\n# Verify\n./zig-out/prod/ml --help\nls -lh zig-out/prod/ml\n See cli/src/assets/README.md for details on obtaining static rsync binaries.
"},{"location":"zig-cli/#verify-installation","title":"Verify Installation","text":"ml --help\nml --version # Shows build config\n"},{"location":"zig-cli/#quick-start","title":"Quick Start","text":"Initialize Configuration
./cli/zig-out/bin/ml init\n Sync Your First Project
./cli/zig-out/bin/ml sync ./my-project --queue\n Monitor Progress
./cli/zig-out/bin/ml status\n init - Configuration Setup","text":"Initialize the CLI configuration file.
ml init\n Creates: ~/.ml/config.toml
Configuration Template:
worker_host = \"worker.local\"\nworker_user = \"mluser\"\nworker_base = \"/data/ml-experiments\"\nworker_port = 22\napi_key = \"your-api-key\"\n"},{"location":"zig-cli/#sync-project-synchronization","title":"sync - Project Synchronization","text":"Sync project files to the worker with intelligent deduplication.
# Basic sync\nml sync ./project\n\n# Sync with custom name and auto-queue\nml sync ./project --name \"experiment-1\" --queue\n\n# Sync with priority\nml sync ./project --priority 8\n Options: - --name <name>: Custom experiment name - --queue: Automatically queue after sync - --priority N: Set priority (1-10, default 5)
Features: - Content-Addressed Storage: Automatic deduplication - SHA256 Commit IDs: Reliable change detection - Incremental Transfer: Only sync changed files - Rsync Backend: Efficient file transfer
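Content-addressed sync derives a SHA256 commit ID from file contents, so the ID changes exactly when the content does. A simplified sketch of the idea (not the CLI's actual hashing scheme): hash paths and contents in sorted order so the result is deterministic.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// commitID sketches a content-addressed ID: hash file paths and
// contents in sorted order, with separators, so the result is
// deterministic and changes whenever any file changes.
func commitID(files map[string][]byte) string {
	paths := make([]string, 0, len(files))
	for p := range files {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	h := sha256.New()
	for _, p := range paths {
		h.Write([]byte(p))
		h.Write([]byte{0})
		h.Write(files[p])
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := commitID(map[string][]byte{"train.py": []byte("print('hi')")})
	b := commitID(map[string][]byte{"train.py": []byte("print('bye')")})
	fmt.Println(a != b, len(a) == 64) // true true
}
```

The zero-byte separators prevent distinct (path, content) pairs from colliding after concatenation; the 64-character hex digest is what a SHA256 commit ID looks like in sync output.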
"},{"location":"zig-cli/#queue-job-management","title":"queue - Job Management","text":"Queue experiments for execution on the worker.
# Queue with commit ID\nml queue my-job --commit abc123def456\n\n# Queue with priority\nml queue my-job --commit abc123 --priority 8\n Options: - --commit <id>: Commit ID from sync output - --priority N: Execution priority (1-10)
Features: - WebSocket Communication: Real-time job submission - Priority Queuing: Higher priority jobs run first - API Authentication: Secure job submission
"},{"location":"zig-cli/#watch-auto-sync-monitoring","title":"watch - Auto-Sync Monitoring","text":"Monitor directories for changes and auto-sync.
# Watch for changes\nml watch ./project\n\n# Watch and auto-queue on changes\nml watch ./project --name \"dev-exp\" --queue\n Options: - --name <name>: Custom experiment name - --queue: Auto-queue on changes - --priority N: Set priority for queued jobs
Features: - Real-time Monitoring: 2-second polling interval - Change Detection: File modification time tracking - Commit Comparison: Only sync when content changes - Automatic Queuing: Seamless development workflow
"},{"location":"zig-cli/#status-system-status","title":"status - System Status","text":"Check system and worker status.
ml status\n Displays: - Worker connectivity - Queue status - Running jobs - System health
"},{"location":"zig-cli/#monitor-remote-monitoring","title":"monitor - Remote Monitoring","text":"Launch TUI interface via SSH for real-time monitoring.
ml monitor\n Features: - Real-time Updates: Live experiment status - Interactive Interface: Browse and manage experiments - SSH Integration: Secure remote access
"},{"location":"zig-cli/#cancel-job-cancellation","title":"cancel - Job Cancellation","text":"Cancel running or queued jobs.
ml cancel job-id\n Options: - job-id: Job identifier from status output
prune - Cleanup Management","text":"Clean up old experiments to save space.
# Keep last N experiments\nml prune --keep 20\n\n# Remove experiments older than N days\nml prune --older-than 30\n Options: - --keep N: Keep N most recent experiments - --older-than N: Remove experiments older than N days
Testing: Docker Compose (macOS/Linux) Production: Podman + systemd (Linux)
Important: Docker is for testing only. Podman is used for running actual ML experiments in production.
"},{"location":"zig-cli/#core-components","title":"Core Components","text":"cli/src/\n\u251c\u2500\u2500 commands/ # Command implementations\n\u2502 \u251c\u2500\u2500 init.zig # Configuration setup\n\u2502 \u251c\u2500\u2500 sync.zig # Project synchronization\n\u2502 \u251c\u2500\u2500 queue.zig # Job management\n\u2502 \u251c\u2500\u2500 watch.zig # Auto-sync monitoring\n\u2502 \u251c\u2500\u2500 status.zig # System status\n\u2502 \u251c\u2500\u2500 monitor.zig # Remote monitoring\n\u2502 \u251c\u2500\u2500 cancel.zig # Job cancellation\n\u2502 \u2514\u2500\u2500 prune.zig # Cleanup operations\n\u251c\u2500\u2500 config.zig # Configuration management\n\u251c\u2500\u2500 errors.zig # Error handling\n\u251c\u2500\u2500 net/ # Network utilities\n\u2502 \u2514\u2500\u2500 ws.zig # WebSocket client\n\u2514\u2500\u2500 utils/ # Utility functions\n \u251c\u2500\u2500 crypto.zig # Hashing and encryption\n \u251c\u2500\u2500 storage.zig # Content-addressed storage\n \u2514\u2500\u2500 rsync.zig # File synchronization\n"},{"location":"zig-cli/#performance-features","title":"Performance Features","text":""},{"location":"zig-cli/#content-addressed-storage","title":"Content-Addressed Storage","text":"# 1. Initialize project\nml sync ./project --name \"dev\" --queue\n\n# 2. Auto-sync during development\nml watch ./project --name \"dev\" --queue\n\n# 3. 
Monitor progress\nml status\n"},{"location":"zig-cli/#batch-processing","title":"Batch Processing","text":"# Process multiple experiments\nfor dir in experiments/*/; do\n ml sync \"$dir\" --queue\ndone\n"},{"location":"zig-cli/#priority-management","title":"Priority Management","text":"# High priority experiment\nml sync ./urgent --priority 10 --queue\n\n# Background processing\nml sync ./background --priority 1 --queue\n"},{"location":"zig-cli/#configuration-management","title":"Configuration Management","text":""},{"location":"zig-cli/#multiple-workers","title":"Multiple Workers","text":"# ~/.ml/config.toml\nworker_host = \"worker.local\"\nworker_user = \"mluser\"\nworker_base = \"/data/ml-experiments\"\nworker_port = 22\napi_key = \"your-api-key\"\n"},{"location":"zig-cli/#security-settings","title":"Security Settings","text":"# Set restrictive permissions\nchmod 600 ~/.ml/config.toml\n\n# Verify configuration\nml status\n"},{"location":"zig-cli/#troubleshooting","title":"Troubleshooting","text":""},{"location":"zig-cli/#common-issues","title":"Common Issues","text":""},{"location":"zig-cli/#build-problems","title":"Build Problems","text":"# Check Zig installation\nzig version\n\n# Clean build\ncd cli && make clean && make build\n"},{"location":"zig-cli/#connection-issues","title":"Connection Issues","text":"# Test SSH connectivity\nssh -p $worker_port $worker_user@$worker_host\n\n# Verify configuration\ncat ~/.ml/config.toml\n"},{"location":"zig-cli/#sync-failures","title":"Sync Failures","text":"# Check rsync\nrsync --version\n\n# Manual sync test\nrsync -avz ./test/ $worker_user@$worker_host:/tmp/\n"},{"location":"zig-cli/#performance-issues","title":"Performance Issues","text":"# Monitor resource usage\ntop -p $(pgrep ml)\n\n# Check disk space\ndf -h $worker_base\n"},{"location":"zig-cli/#debug-mode","title":"Debug Mode","text":"Enable verbose logging:
# Environment variable\nexport ML_DEBUG=1\nml sync ./project\n\n# Or use debug build\ncd cli && make debug\n"},{"location":"zig-cli/#performance-benchmarks","title":"Performance Benchmarks","text":""},{"location":"zig-cli/#file-operations","title":"File Operations","text":"cd cli\nzig build-exe src/main.zig\n"},{"location":"zig-cli/#testing","title":"Testing","text":"# Run tests\ncd cli && zig test src/\n\n# Integration tests\nzig test tests/\n"},{"location":"zig-cli/#code-style","title":"Code Style","text":"For more information, see the CLI Reference and Architecture pages.
"}]}