
title: CLI Reference
url: /cli-reference/
weight: 2

Fetch ML CLI Reference

Command-line tools for managing ML experiments in your homelab, centered on a high-performance CLI written in Zig.

Overview

Fetch ML provides a CLI toolkit built with performance and security in mind:

  • Zig CLI - High-performance experiment management written in Zig
  • Go Commands - API server, TUI, and data management utilities
  • Management Scripts - Service orchestration and deployment
  • Setup Scripts - One-command installation and configuration

Zig CLI (./cli/zig-out/bin/ml)

High-performance command-line interface for experiment management, written in Zig for speed and efficiency.

Available Commands

| Command | Description | Example |
|---------|-------------|---------|
| init | Interactive configuration setup | ml init |
| sync | Sync project to worker with deduplication | ml sync ./project --name myjob --queue |
| queue | Queue job for execution | ml queue myjob --commit abc123 --priority 8 |
| status | Get system and worker status | ml status |
| monitor | Launch TUI monitoring via SSH | ml monitor |
| cancel | Cancel running job | ml cancel job123 |
| prune | Clean up old experiments | ml prune --keep 10 |
| watch | Auto-sync directory on changes | ml watch ./project --queue |
| jupyter | Manage Jupyter notebook services | ml jupyter start --name my-nb |
| validate | Validate provenance/integrity for a commit or task | ml validate <commit_id> --verbose |
| info | Show run info from run_manifest.json | ml info <run_dir> |
| requeue | Re-submit an existing run/commit with new args/resources | ml requeue <commit_id> -- --epochs 20 |
| logs | Fetch and follow job logs | ml logs job123 -n 100 |

Command Details

init - Configuration Setup

ml init

Creates a configuration template at ~/.ml/config.toml with:

  • Worker connection details
  • API authentication
  • Base paths and ports

sync - Project Synchronization

# Basic sync
ml sync ./my-project

# Sync with custom name and queue
ml sync ./my-project --name "experiment-1" --queue

# Sync with priority
ml sync ./my-project --priority 9

Features:

  • Content-addressed storage for deduplication
  • SHA256 commit ID generation
  • Rsync-based file transfer
  • Automatic queuing (with --queue flag)
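The content-addressing idea behind sync can be sketched in shell: hash every file, sort the listing, then hash that listing, so identical trees always map to the same commit ID. The Zig CLI's exact scheme is internal; this is illustrative only.

```shell
# Sketch: derive a content-addressed commit ID for a project directory.
# Hypothetical scheme: per-file SHA256 hashes in a sorted listing,
# hashed once more to produce the ID.
commit_id() {
  ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 sha256sum ) \
    | sha256sum | cut -d ' ' -f1
}

# Demo: two identical trees produce the same commit ID.
A_DIR=$(mktemp -d); B_DIR=$(mktemp -d)
printf 'train' > "$A_DIR/main.py"
printf 'train' > "$B_DIR/main.py"

A=$(commit_id "$A_DIR")
B=$(commit_id "$B_DIR")
echo "$A"
```

Because the ID depends only on content, re-syncing an unchanged tree is a no-op and identical files are stored once.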

queue - Job Management

# Queue with commit ID
ml queue my-job --commit abc123def456

# Queue with commit ID prefix (>=7 hex chars; must be unique)
ml queue my-job --commit abc123 --priority 8

# Queue with extra runner args (stored as task.Args)
ml queue my-job --commit abc123 -- --epochs 5 --lr 1e-3

Features:

  • WebSocket-based communication
  • Priority queuing system
  • API key authentication

Notes:

  • --priority is passed to the server as a single byte (0-255).
  • Args are sent via a dedicated queue opcode and become task.Args on the worker.
  • --commit may be a full 40-hex commit id or a unique prefix (>=7 hex chars) resolvable under worker_base.
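The prefix rule above can be illustrated with a small resolver: a prefix of at least 7 hex characters resolves only if exactly one stored commit matches. The on-disk layout here (one directory per commit under a store) is a hypothetical stand-in for whatever lives under worker_base.

```shell
# Sketch: resolve a commit-ID prefix to a unique full ID.
resolve_commit() {
  store="$1"; prefix="$2"
  [ ${#prefix} -ge 7 ] || { echo "prefix too short" >&2; return 1; }
  matches=$(ls "$store" | grep "^$prefix")
  [ "$(printf '%s\n' "$matches" | grep -c .)" -eq 1 ] || {
    echo "ambiguous or unknown prefix" >&2; return 1
  }
  printf '%s\n' "$matches"
}

STORE=$(mktemp -d)
mkdir "$STORE/abc123def456" "$STORE/abc999aaa111"
R=$(resolve_commit "$STORE" abc123d)
echo "$R"
```

A too-short or ambiguous prefix fails loudly rather than guessing, which is why the CLI insists on uniqueness.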

requeue - Re-submit a Previous Run

# Requeue directly by commit_id
ml requeue <commit_id> -- --epochs 20

# Requeue by commit_id prefix (>=7 hex chars; must be unique)
ml requeue <commit_prefix> -- --epochs 20

# Requeue by run_id/task_id (CLI scans run_manifest.json under worker_base)
ml requeue <run_id> -- --epochs 20

# Requeue by a run directory or run_manifest.json path
ml requeue /data/ml-experiments/finished/<run_id> -- --epochs 20

# Override priority/resources on requeue
ml requeue <task_id> --priority 10 --gpu 1 -- --epochs 20

What it does:

  • Locates run_manifest.json
  • Extracts commit_id
  • Submits a new queue request using that commit_id with optional overridden args/resources

Notes:

  • Tasks support optional snapshot_id and dataset_specs fields server-side (for provenance and dataset resolution).
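The lookup step can be sketched as follows: recover the commit_id from a run directory's run_manifest.json. The manifest fields follow this doc; the sed-based extraction is illustrative, not the CLI's real JSON parser.

```shell
# Sketch: how a requeue-style lookup can recover the commit_id
# from a run_manifest.json (field names assumed from this doc).
RUN_DIR=$(mktemp -d)
cat > "$RUN_DIR/run_manifest.json" <<'EOF'
{"run_id": "run-42", "commit_id": "abc123def456", "priority": 8}
EOF

commit_from_manifest() {
  sed -n 's/.*"commit_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

C=$(commit_from_manifest "$RUN_DIR/run_manifest.json")
echo "$C"
```

Once the commit_id is in hand, requeue submits it exactly like a fresh ml queue call, with any overridden args or resources applied.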

watch - Auto-Sync Monitoring

# Watch directory for changes
ml watch ./project

# Watch and auto-queue on changes
ml watch ./project --name "dev-exp" --queue

Features:

  • Real-time file system monitoring
  • Automatic re-sync on changes
  • Configurable polling interval (2 seconds)
  • Commit ID comparison for efficiency
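The change detection behind watch can be sketched like this: recompute a tree hash each poll and re-sync only when it differs. The real loop runs on a 2-second interval (roughly `while sleep 2; do ...; done`); the hashing scheme here is illustrative.

```shell
# Sketch: watch-style change detection via tree hashing.
tree_hash() {
  ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 sha256sum ) \
    | sha256sum | cut -d ' ' -f1
}

DIR=$(mktemp -d)
printf 'v1' > "$DIR/model.py"
LAST=$(tree_hash "$DIR")

printf 'v2' > "$DIR/model.py"    # simulate an edit between polls
NOW=$(tree_hash "$DIR")
CHANGED=no
[ "$NOW" != "$LAST" ] && CHANGED=yes   # the real CLI would rsync and optionally queue here
echo "$CHANGED"
```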

prune - Cleanup Management

# Keep last N experiments
ml prune --keep 20

# Remove experiments older than N days
ml prune --older-than 30
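The --keep retention policy amounts to sorting experiment directories newest-first by mtime and deleting everything past the first N. The directory layout and naming below are hypothetical; the real prune also honors --older-than.

```shell
# Sketch: keep the newest N experiment directories, delete the rest.
EXPS=$(mktemp -d)
for i in 1 2 3 4 5; do
  mkdir "$EXPS/exp-$i"
  touch -t "25010${i}0000" "$EXPS/exp-$i"   # staggered mtimes (Jan 1-5, 2025)
done

KEEP=2
ls -1t "$EXPS" | tail -n +$((KEEP + 1)) | while read -r old; do
  rm -rf "$EXPS/${old:?}"
done

REMAINING=$(ls "$EXPS" | wc -l | tr -d ' ')
echo "$REMAINING"
```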

monitor - Remote Monitoring

ml monitor

Launches TUI interface via SSH for real-time monitoring.

status - System Status

ml status --json returns a JSON object including an optional prewarm field when worker prewarming is active:

{
  "prewarm": [
    {
      "worker_id": "worker-1",
      "task_id": "<task-id>",
      "started_at": "2025-01-01T00:00:00Z",
      "updated_at": "2025-01-01T00:00:05Z",
      "phase": "datasets",
      "dataset_count": 2
    }
  ]
}
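For scripting against this output, the prewarm phase can be pulled out of the JSON; the shape below follows the example above, and sed stands in for a proper JSON parser such as jq.

```shell
# Sketch: extract the prewarm phase from `ml status --json` output.
STATUS='{"prewarm":[{"worker_id":"worker-1","phase":"datasets","dataset_count":2}]}'
PHASE=$(printf '%s' "$STATUS" \
  | sed -n 's/.*"phase"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
echo "$PHASE"
```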

cancel - Job Cancellation

ml cancel running-job-id

Cancels currently running jobs by ID.

logs - Fetch and Follow Job Logs

Retrieve logs from running or completed ML experiments.

# Show full logs for a job
ml logs job123

# Show last 100 lines (tail)
ml logs job123 -n 100
ml logs job123 --tail 100

# Follow logs in real-time (like tail -f)
ml logs job123 -f
ml logs job123 --follow

# Combine tail and follow
ml logs job123 -n 50 -f

Features:

  • WebSocket-based log streaming for real-time updates
  • Works with both running and completed jobs
  • Automatic reconnection on network issues
  • Scrollable output with pagination support

Common Use Cases:

# Check why a job failed
ml logs failed-job-abc123

# Monitor a running training job
ml logs training-job-xyz789 -f

# Get recent errors only
ml logs job123 -n 20 | grep -i error

jupyter - Jupyter Notebook Management

Manage Jupyter notebook services via WebSocket protocol.

# Start a Jupyter service
ml jupyter start --name my-notebook --workspace /path/to/workspace

# Start with password protection
ml jupyter start --name my-notebook --workspace /path/to/workspace --password mypass

# List running services
ml jupyter list

# Stop a service
ml jupyter stop service-id-12345

# Check service status
ml jupyter status

Features:

  • WebSocket-based binary protocol for low latency
  • Secure API key authentication (SHA256 hashed)
  • Real-time service management
  • Workspace isolation

Common Use Cases:

# Development workflow
ml jupyter start --name dev-notebook --workspace ./notebooks
# ... do development work ...
ml jupyter stop dev-service-123

# Team collaboration
ml jupyter start --name team-analysis --workspace /shared/analysis --password teampass

# Multiple services
ml jupyter list  # View all running services

Security:

  • API keys are hashed before transmission
  • Password protection for notebooks
  • Workspace path validation
  • Service ID-based authorization
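The first point above can be sketched client-side: the key is SHA256-hashed before it goes over the wire, so the plaintext never leaves the machine. The exact wire format is internal to the protocol; the key value below is a placeholder.

```shell
# Sketch: hash an API key before transmission, as described above.
API_KEY="your-api-key"
HASHED=$(printf '%s' "$API_KEY" | sha256sum | cut -d ' ' -f1)
echo "$HASHED"
```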

Configuration

The CLI reads its settings from ~/.ml/config.toml (created by ml init):

worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"

Performance Features

  • Content-Addressed Storage: Automatic deduplication of identical files
  • Incremental Sync: Only transfers changed files
  • SHA256 Hashing: Reliable commit ID generation
  • WebSocket Communication: Efficient real-time messaging
  • Multi-threaded: Concurrent operations where applicable

Go Commands

API Server (./cmd/api-server/main.go)

Main HTTPS API server for experiment management.

# Build and run
go run ./cmd/api-server/main.go

# With configuration
./bin/api-server --config configs/api/dev.yaml

Features:

  • HTTPS-only communication
  • API key authentication
  • Rate limiting and IP whitelisting
  • WebSocket support for real-time updates
  • Redis integration for caching

TUI (./cmd/tui/main.go)

Terminal User Interface for monitoring experiments.

# Launch TUI
go run ./cmd/tui/main.go

Features:

  • Real-time experiment monitoring
  • Interactive job management
  • Status visualization
  • Log viewing

Data Manager (./cmd/data_manager/)

Utilities for data synchronization and management.

# Sync data
./data_manager --sync ./data

# Clean old data
./data_manager --cleanup --older-than 30d

Config Lint (./cmd/configlint/main.go)

Configuration validation and linting tool.

# Validate configuration
./configlint configs/api/dev.yaml

# Check schema compliance
./configlint --schema configs/schema/api_server_config.yaml

Management Script (./tools/manage.sh)

Simple service management for your homelab.

Commands

./tools/manage.sh start          # Start all services
./tools/manage.sh stop           # Stop all services
./tools/manage.sh status         # Check service status
./tools/manage.sh logs           # View logs
./tools/manage.sh monitor        # Basic monitoring
./tools/manage.sh security       # Security status
./tools/manage.sh cleanup        # Clean project artifacts

API Testing

Test the API with curl:

# Health check
curl -f http://localhost:8080/health

# List experiments
curl -H 'X-API-Key: password' http://localhost:8080/experiments

# Submit experiment
curl -X POST -H 'X-API-Key: password' \
     -H 'Content-Type: application/json' \
     -d '{"name":"test","config":{"type":"basic"}}' \
     http://localhost:8080/experiments

Zig CLI Architecture

The Zig CLI is designed for performance and reliability:

Core Components

  • Commands (cli/src/commands/): Individual command implementations
  • Config (cli/src/config.zig): Configuration management
  • Network (cli/src/net/ws.zig): WebSocket client implementation
  • Utils (cli/src/utils/): Cryptography, storage, and rsync utilities
  • Errors (cli/src/errors.zig): Centralized error handling

Performance Optimizations

  • Content-Addressed Storage: Deduplicates identical files across experiments
  • SHA256 Hashing: Fast, reliable commit ID generation
  • Rsync Integration: Efficient incremental file transfers
  • WebSocket Protocol: Low-latency communication with worker
  • Memory Management: Efficient allocation with Zig's allocator system

Security Features

  • API Key Hashing: Secure authentication token handling
  • SSH Integration: Secure file transfers
  • Input Validation: Comprehensive argument checking
  • Error Handling: Secure error reporting without information leakage

Configuration

Main configuration file: configs/api/dev.yaml

Key Settings

auth:
  enabled: true
  api_keys:
    dev_user:
      hash: "CHANGE_ME_SHA256_DEV_USER_KEY"
      admin: true
      roles:
        - admin
      permissions:
        '*': true
    researcher_user:
      hash: "CHANGE_ME_SHA256_RESEARCHER_USER_KEY"
      admin: false
      roles:
        - researcher
      permissions:
        'experiments': true
        'datasets': true

server:
  address: ":9101"
  tls:
    enabled: false  # Set to true for production
    cert_file: "./ssl/cert.pem"
    key_file: "./ssl/key.pem"

security:
  rate_limit:
    enabled: true
    requests_per_minute: 30
  ip_whitelist:
    - "127.0.0.1"
    - "::1"
    - "localhost"
    - "10.0.0.0/8"

Docker Commands

If using Docker Compose:

# Start services
docker-compose up -d   # testing only

# View logs
docker-compose logs -f

# Stop services
docker-compose down

# Check status
docker-compose ps

Troubleshooting

Common Issues

Zig CLI not found:

# Build the CLI
cd cli && make build

# Check binary exists
ls -la ./cli/zig-out/bin/ml

Configuration not found:

# Create configuration
./cli/zig-out/bin/ml init

# Check config file
ls -la ~/.ml/config.toml

Worker connection failed:

# Test SSH connection
ssh -p 22 mluser@worker.local

# Check configuration
cat ~/.ml/config.toml

Sync not working:

# Check rsync availability
rsync --version

# Test manual sync
rsync -avz ./project/ mluser@worker.local:/tmp/test/

WebSocket connection failed:

# Check worker WebSocket port
telnet worker.local 9100

# Verify API key
./cli/zig-out/bin/ml status

API not responding:

./tools/manage.sh status
./tools/manage.sh logs

Authentication failed:

# Check API key in config
grep -A 5 "api_keys:" configs/api/dev.yaml

Redis connection failed:

# Check Redis status
redis-cli ping

# Start Redis
redis-server

Getting Help

# CLI help
./cli/zig-out/bin/ml help

# Management script help
./tools/manage.sh help

# Check all available commands
make help

That's it for the CLI reference! For complete setup instructions, see the main index.