
title: CLI Reference
url: /cli-reference/
weight: 2

Fetch ML CLI Reference

Command-line tools for managing ML experiments in your homelab, centered on a high-performance CLI written in Zig.

Overview

Fetch ML provides a CLI toolkit built with performance and security in mind:

  • Zig CLI - High-performance experiment management written in Zig
  • Go Commands - API server, TUI, and data management utilities
  • Management Scripts - Service orchestration and deployment
  • Setup Scripts - One-command installation and configuration

Zig CLI (./cli/zig-out/bin/ml)

High-performance command-line interface for experiment management, written in Zig for speed and efficiency.

Available Commands

| Command | Description | Example |
|---------|-------------|---------|
| init | Interactive configuration setup | ml init |
| sync | Sync project to worker with deduplication | ml sync ./project --name myjob --queue |
| queue | Queue job for execution | ml queue myjob --commit abc123 --priority 8 |
| status | Get system and worker status | ml status |
| monitor | Launch TUI monitoring via SSH | ml monitor |
| cancel | Cancel running job | ml cancel job123 |
| prune | Clean up old experiments | ml prune --keep 10 |
| watch | Auto-sync directory on changes | ml watch ./project --queue |
| jupyter | Manage Jupyter notebook services | ml jupyter start --name my-nb |
| validate | Validate provenance/integrity for a commit or task | ml validate <commit_id> --verbose |
| info | Show run info from run_manifest.json | ml info <run_dir> |
| requeue | Re-submit an existing run/commit with new args/resources | ml requeue <commit_id> -- --epochs 20 |
| logs | Fetch and follow job logs | ml logs job123 -n 100 |

Command Details

init - Configuration Setup

ml init

Creates a configuration template at ~/.ml/config.toml with:

  • Worker connection details
  • API authentication
  • Base paths and ports

sync - Project Synchronization

# Basic sync
ml sync ./my-project

# Sync with custom name and queue
ml sync ./my-project --name "experiment-1" --queue

# Sync with priority
ml sync ./my-project --priority 9

Features:

  • Content-addressed storage for deduplication
  • SHA256 commit ID generation
  • Rsync-based file transfer
  • Automatic queuing (with --queue flag)
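The content-addressing idea behind sync can be sketched in shell: hash every file, sort the listing, then hash that listing, so identical trees always map to the same commit ID. The Zig CLI's exact scheme is internal; this is illustrative only.

```shell
# Sketch: derive a content-addressed commit ID for a project directory.
# Hypothetical scheme: per-file SHA256 hashes in a sorted listing,
# hashed once more to produce the ID.
commit_id() {
  ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 sha256sum ) \
    | sha256sum | cut -d ' ' -f1
}

# Demo: two identical trees produce the same commit ID.
A_DIR=$(mktemp -d); B_DIR=$(mktemp -d)
printf 'train' > "$A_DIR/main.py"
printf 'train' > "$B_DIR/main.py"

A=$(commit_id "$A_DIR")
B=$(commit_id "$B_DIR")
echo "$A"
```

Because the ID depends only on content, re-syncing an unchanged tree is a no-op and identical files are stored once.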

queue - Job Management

# Queue with commit ID
ml queue my-job --commit abc123def456

# Queue with commit ID prefix (>=7 hex chars; must be unique)
ml queue my-job --commit abc123 --priority 8

# Queue with extra runner args (stored as task.Args)
ml queue my-job --commit abc123 -- --epochs 5 --lr 1e-3

Features:

  • WebSocket-based communication
  • Priority queuing system
  • API key authentication

Notes:

  • --priority is passed to the server as a single byte (0-255).
  • Args are sent via a dedicated queue opcode and become task.Args on the worker.
  • --commit may be a full 40-hex commit id or a unique prefix (>=7 hex chars) resolvable under worker_base.
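The prefix rule above can be illustrated with a small resolver: a prefix of at least 7 hex characters resolves only if exactly one stored commit matches. The on-disk layout here (one directory per commit under a store) is a hypothetical stand-in for whatever lives under worker_base.

```shell
# Sketch: resolve a commit-ID prefix to a unique full ID.
resolve_commit() {
  store="$1"; prefix="$2"
  [ ${#prefix} -ge 7 ] || { echo "prefix too short" >&2; return 1; }
  matches=$(ls "$store" | grep "^$prefix")
  [ "$(printf '%s\n' "$matches" | grep -c .)" -eq 1 ] || {
    echo "ambiguous or unknown prefix" >&2; return 1
  }
  printf '%s\n' "$matches"
}

STORE=$(mktemp -d)
mkdir "$STORE/abc123def456" "$STORE/abc999aaa111"
R=$(resolve_commit "$STORE" abc123d)
echo "$R"
```

A too-short or ambiguous prefix fails loudly rather than guessing, which is why the CLI insists on uniqueness.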

requeue - Re-submit a Previous Run

# Requeue directly by commit_id
ml requeue <commit_id> -- --epochs 20

# Requeue by commit_id prefix (>=7 hex chars; must be unique)
ml requeue <commit_prefix> -- --epochs 20

# Requeue by run_id/task_id (CLI scans run_manifest.json under worker_base)
ml requeue <run_id> -- --epochs 20

# Requeue by a run directory or run_manifest.json path
ml requeue /data/ml-experiments/finished/<run_id> -- --epochs 20

# Override priority/resources on requeue
ml requeue <task_id> --priority 10 --gpu 1 -- --epochs 20

What it does:

  • Locates run_manifest.json
  • Extracts commit_id
  • Submits a new queue request using that commit_id with optional overridden args/resources

Notes:

  • Tasks support optional snapshot_id and dataset_specs fields server-side (for provenance and dataset resolution).
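The lookup step can be sketched as follows: recover the commit_id from a run directory's run_manifest.json. The manifest fields follow this doc; the sed-based extraction is illustrative, not the CLI's real JSON parser.

```shell
# Sketch: how a requeue-style lookup can recover the commit_id
# from a run_manifest.json (field names assumed from this doc).
RUN_DIR=$(mktemp -d)
cat > "$RUN_DIR/run_manifest.json" <<'EOF'
{"run_id": "run-42", "commit_id": "abc123def456", "priority": 8}
EOF

commit_from_manifest() {
  sed -n 's/.*"commit_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

C=$(commit_from_manifest "$RUN_DIR/run_manifest.json")
echo "$C"
```

Once the commit_id is in hand, requeue submits it exactly like a fresh ml queue call, with any overridden args or resources applied.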

watch - Auto-Sync Monitoring

# Watch directory for changes
ml watch ./project

# Watch and auto-queue on changes
ml watch ./project --name "dev-exp" --queue

Features:

  • Real-time file system monitoring
  • Automatic re-sync on changes
  • Configurable polling interval (2 seconds)
  • Commit ID comparison for efficiency
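The change detection behind watch can be sketched like this: recompute a tree hash each poll and re-sync only when it differs. The real loop runs on a 2-second interval (roughly `while sleep 2; do ...; done`); the hashing scheme here is illustrative.

```shell
# Sketch: watch-style change detection via tree hashing.
tree_hash() {
  ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 sha256sum ) \
    | sha256sum | cut -d ' ' -f1
}

DIR=$(mktemp -d)
printf 'v1' > "$DIR/model.py"
LAST=$(tree_hash "$DIR")

printf 'v2' > "$DIR/model.py"    # simulate an edit between polls
NOW=$(tree_hash "$DIR")
CHANGED=no
[ "$NOW" != "$LAST" ] && CHANGED=yes   # the real CLI would rsync and optionally queue here
echo "$CHANGED"
```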

prune - Cleanup Management

# Keep last N experiments
ml prune --keep 20

# Remove experiments older than N days
ml prune --older-than 30
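The --keep retention policy amounts to sorting experiment directories newest-first by mtime and deleting everything past the first N. The directory layout and naming below are hypothetical; the real prune also honors --older-than.

```shell
# Sketch: keep the newest N experiment directories, delete the rest.
EXPS=$(mktemp -d)
for i in 1 2 3 4 5; do
  mkdir "$EXPS/exp-$i"
  touch -t "25010${i}0000" "$EXPS/exp-$i"   # staggered mtimes (Jan 1-5, 2025)
done

KEEP=2
ls -1t "$EXPS" | tail -n +$((KEEP + 1)) | while read -r old; do
  rm -rf "$EXPS/${old:?}"
done

REMAINING=$(ls "$EXPS" | wc -l | tr -d ' ')
echo "$REMAINING"
```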

monitor - Remote Monitoring

ml monitor

Launches TUI interface via SSH for real-time monitoring.

status - System Status

ml status --json returns a JSON object including an optional prewarm field when worker prewarming is active:

{
  "prewarm": [
    {
      "worker_id": "worker-1",
      "task_id": "<task-id>",
      "started_at": "2025-01-01T00:00:00Z",
      "updated_at": "2025-01-01T00:00:05Z",
      "phase": "datasets",
      "dataset_count": 2
    }
  ]
}
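For scripting against this output, the prewarm phase can be pulled out of the JSON; the shape below follows the example above, and sed stands in for a proper JSON parser such as jq.

```shell
# Sketch: extract the prewarm phase from `ml status --json` output.
STATUS='{"prewarm":[{"worker_id":"worker-1","phase":"datasets","dataset_count":2}]}'
PHASE=$(printf '%s' "$STATUS" \
  | sed -n 's/.*"phase"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
echo "$PHASE"
```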

cancel - Job Cancellation

ml cancel running-job-id

Cancels currently running jobs by ID.

logs - Fetch and Follow Job Logs

Retrieve logs from running or completed ML experiments.

# Show full logs for a job
ml logs job123

# Show last 100 lines (tail)
ml logs job123 -n 100
ml logs job123 --tail 100

# Follow logs in real-time (like tail -f)
ml logs job123 -f
ml logs job123 --follow

# Combine tail and follow
ml logs job123 -n 50 -f

Features:

  • WebSocket-based log streaming for real-time updates
  • Works with both running and completed jobs
  • Automatic reconnection on network issues
  • Scrollable output with pagination support

Common Use Cases:

# Check why a job failed
ml logs failed-job-abc123

# Monitor a running training job
ml logs training-job-xyz789 -f

# Get recent errors only
ml logs job123 -n 20 | grep -i error

jupyter - Jupyter Notebook Management

Manage Jupyter notebook services via WebSocket protocol.

# Start a Jupyter service
ml jupyter start --name my-notebook --workspace /path/to/workspace

# Start with password protection
ml jupyter start --name my-notebook --workspace /path/to/workspace --password mypass

# List running services
ml jupyter list

# Stop a service
ml jupyter stop service-id-12345

# Check service status
ml jupyter status

Features:

  • WebSocket-based binary protocol for low latency
  • Secure API key authentication (SHA256 hashed)
  • Real-time service management
  • Workspace isolation

Common Use Cases:

# Development workflow
ml jupyter start --name dev-notebook --workspace ./notebooks
# ... do development work ...
ml jupyter stop dev-service-123

# Team collaboration
ml jupyter start --name team-analysis --workspace /shared/analysis --password teampass

# Multiple services
ml jupyter list  # View all running services

Security:

  • API keys are hashed before transmission
  • Password protection for notebooks
  • Workspace path validation
  • Service ID-based authorization
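The first point above can be sketched client-side: the key is SHA256-hashed before it goes over the wire, so the plaintext never leaves the machine. The exact wire format is internal to the protocol; the key value below is a placeholder.

```shell
# Sketch: hash an API key before transmission, as described above.
API_KEY="your-api-key"
HASHED=$(printf '%s' "$API_KEY" | sha256sum | cut -d ' ' -f1)
echo "$HASHED"
```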

Configuration

The CLI reads its settings from ~/.ml/config.toml (created by ml init):

worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"

Performance Features

  • Content-Addressed Storage: Automatic deduplication of identical files
  • Incremental Sync: Only transfers changed files
  • SHA256 Hashing: Reliable commit ID generation
  • WebSocket Communication: Efficient real-time messaging
  • Multi-threaded: Concurrent operations where applicable

Go Commands

API Server (./cmd/api-server/main.go)

Main HTTPS API server for experiment management.

# Build and run
go run ./cmd/api-server/main.go

# With configuration
./bin/api-server --config configs/api/dev.yaml

Features:

  • HTTPS-only communication
  • API key authentication
  • Rate limiting and IP whitelisting
  • WebSocket support for real-time updates
  • Redis integration for caching

TUI (./cmd/tui/main.go)

Terminal User Interface for monitoring experiments.

# Launch TUI
go run ./cmd/tui/main.go

Features:

  • Real-time experiment monitoring
  • Interactive job management
  • Status visualization
  • Log viewing

Data Manager (./cmd/data_manager/)

Utilities for data synchronization and management.

# Sync data
./data_manager --sync ./data

# Clean old data
./data_manager --cleanup --older-than 30d

Config Lint (./cmd/configlint/main.go)

Configuration validation and linting tool.

# Validate configuration
./configlint configs/api/dev.yaml

# Check schema compliance
./configlint --schema configs/schema/api_server_config.yaml

Management Script (./tools/manage.sh)

Simple service management for your homelab.

Commands

./tools/manage.sh start          # Start all services
./tools/manage.sh stop           # Stop all services
./tools/manage.sh status         # Check service status
./tools/manage.sh logs           # View logs
./tools/manage.sh monitor        # Basic monitoring
./tools/manage.sh security       # Security status
./tools/manage.sh cleanup        # Clean project artifacts

API Testing

Test the API with curl:

# Health check
curl -f http://localhost:8080/health

# List experiments
curl -H 'X-API-Key: password' http://localhost:8080/experiments

# Submit experiment
curl -X POST -H 'X-API-Key: password' \
     -H 'Content-Type: application/json' \
     -d '{"name":"test","config":{"type":"basic"}}' \
     http://localhost:8080/experiments

Zig CLI Architecture

The Zig CLI is designed for performance and reliability:

Core Components

  • Commands (cli/src/commands/): Individual command implementations
  • Config (cli/src/config.zig): Configuration management
  • Network (cli/src/net/ws.zig): WebSocket client implementation
  • Utils (cli/src/utils/): Cryptography, storage, and rsync utilities
  • Errors (cli/src/errors.zig): Centralized error handling

Performance Optimizations

  • Content-Addressed Storage: Deduplicates identical files across experiments
  • SHA256 Hashing: Fast, reliable commit ID generation
  • Rsync Integration: Efficient incremental file transfers
  • WebSocket Protocol: Low-latency communication with worker
  • Memory Management: Efficient allocation with Zig's allocator system

Security Features

  • API Key Hashing: Secure authentication token handling
  • SSH Integration: Secure file transfers
  • Input Validation: Comprehensive argument checking
  • Error Handling: Secure error reporting without information leakage

Configuration

Main configuration file: configs/api/dev.yaml

Key Settings

auth:
  enabled: true
  api_keys:
    dev_user:
      hash: "CHANGE_ME_SHA256_DEV_USER_KEY"
      admin: true
      roles:
        - admin
      permissions:
        '*': true
    researcher_user:
      hash: "CHANGE_ME_SHA256_RESEARCHER_USER_KEY"
      admin: false
      roles:
        - researcher
      permissions:
        'experiments': true
        'datasets': true

server:
  address: ":9101"
  tls:
    enabled: false  # Set to true for production
    cert_file: "./ssl/cert.pem"
    key_file: "./ssl/key.pem"

security:
  rate_limit:
    enabled: true
    requests_per_minute: 30
  ip_whitelist:
    - "127.0.0.1"
    - "::1"
    - "localhost"
    - "10.0.0.0/8"

Docker Commands

If using Docker Compose:

# Start services
docker-compose up -d   # testing only

# View logs
docker-compose logs -f

# Stop services
docker-compose down

# Check status
docker-compose ps

Troubleshooting

Common Issues

Zig CLI not found:

# Build the CLI
cd cli && make build

# Check binary exists
ls -la ./cli/zig-out/bin/ml

Configuration not found:

# Create configuration
./cli/zig-out/bin/ml init

# Check config file
ls -la ~/.ml/config.toml

Worker connection failed:

# Test SSH connection
ssh -p 22 mluser@worker.local

# Check configuration
cat ~/.ml/config.toml

Sync not working:

# Check rsync availability
rsync --version

# Test manual sync
rsync -avz ./project/ mluser@worker.local:/tmp/test/

WebSocket connection failed:

# Check worker WebSocket port
telnet worker.local 9100

# Verify API key
./cli/zig-out/bin/ml status

API not responding:

./tools/manage.sh status
./tools/manage.sh logs

Authentication failed:

# Check API key in config
grep -A 5 "api_keys:" configs/api/dev.yaml

Redis connection failed:

# Check Redis status
redis-cli ping

# Start Redis
redis-server

Getting Help

# CLI help
./cli/zig-out/bin/ml help

# Management script help
./tools/manage.sh help

# Check all available commands
make help

That's it for the CLI reference! For complete setup instructions, see the main index.