| title | url | weight |
|---|---|---|
| CLI Reference | /cli-reference/ | 2 |
Fetch ML CLI Reference
Command-line tools for managing ML experiments in your homelab, built around a high-performance Zig CLI.
Overview
Fetch ML provides a comprehensive CLI toolkit built with performance and security in mind:
- Zig CLI - High-performance experiment management written in Zig
- Go Commands - API server, TUI, and data management utilities
- Management Scripts - Service orchestration and deployment
- Setup Scripts - One-command installation and configuration
Zig CLI (./cli/zig-out/bin/ml)
High-performance command-line interface for experiment management, written in Zig for speed and efficiency.
Available Commands
| Command | Description | Example |
|---|---|---|
| init | Interactive configuration setup | ml init |
| sync | Sync project to worker with deduplication | ml sync ./project --name myjob --queue |
| queue | Queue job for execution | ml queue myjob --commit abc123 --priority 8 |
| status | Get system and worker status | ml status |
| monitor | Launch TUI monitoring via SSH | ml monitor |
| cancel | Cancel running job | ml cancel job123 |
| prune | Clean up old experiments | ml prune --keep 10 |
| watch | Auto-sync directory on changes | ml watch ./project --queue |
| jupyter | Manage Jupyter notebook services | ml jupyter start --name my-nb |
| validate | Validate provenance/integrity for a commit or task | ml validate <commit_id> --verbose |
| info | Show run info from run_manifest.json | ml info <run_dir> |
| requeue | Re-submit an existing run/commit with new args/resources | ml requeue <commit_id> -- --epochs 20 |
| logs | Fetch and follow job logs | ml logs job123 -n 100 |
Command Details
init - Configuration Setup
ml init
Creates a configuration template at ~/.ml/config.toml with:
- Worker connection details
- API authentication
- Base paths and ports
sync - Project Synchronization
# Basic sync
ml sync ./my-project
# Sync with custom name and queue
ml sync ./my-project --name "experiment-1" --queue
# Sync with priority
ml sync ./my-project --priority 9
Features:
- Content-addressed storage for deduplication
- SHA256 commit ID generation
- Rsync-based file transfer
- Automatic queuing (with the --queue flag)
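The deduplication scheme above can be sketched as a deterministic hash over the project tree. This is an illustrative Python sketch, not the actual Zig implementation: it assumes the commit ID is a SHA256 digest over every file's relative path and contents in sorted order, so identical trees always map to the same ID.

```python
import hashlib
import os

def commit_id(project_dir: str) -> str:
    """Derive a deterministic SHA256 commit ID from a project tree.

    Hypothetical sketch: hash each file's relative path and contents in
    sorted order, so identical trees yield identical IDs (deduplication).
    """
    h = hashlib.sha256()
    for root, _dirs, files in sorted(os.walk(project_dir)):
        for name in sorted(files):
            path = os.path.join(root, name)
            h.update(os.path.relpath(path, project_dir).encode())
            with open(path, "rb") as f:
                h.update(f.read())
    return h.hexdigest()
```

Two directories with byte-identical contents produce the same ID even at different paths, which is what makes content-addressed storage possible.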
queue - Job Management
# Queue with commit ID
ml queue my-job --commit abc123def456
# Queue with commit ID prefix (>=7 hex chars; must be unique)
ml queue my-job --commit abc123 --priority 8
# Queue with extra runner args (stored as task.Args)
ml queue my-job --commit abc123 -- --epochs 5 --lr 1e-3
Features:
- WebSocket-based communication
- Priority queuing system
- API key authentication
Notes:
- --priority is passed to the server as a single byte (0-255).
- Args are sent via a dedicated queue opcode and become task.Args on the worker.
- --commit may be a full 40-hex commit id or a unique prefix (>=7 hex chars) resolvable under worker_base.
requeue - Re-submit a Previous Run
# Requeue directly by commit_id
ml requeue <commit_id> -- --epochs 20
# Requeue by commit_id prefix (>=7 hex chars; must be unique)
ml requeue <commit_prefix> -- --epochs 20
# Requeue by run_id/task_id (CLI scans run_manifest.json under worker_base)
ml requeue <run_id> -- --epochs 20
# Requeue by a run directory or run_manifest.json path
ml requeue /data/ml-experiments/finished/<run_id> -- --epochs 20
# Override priority/resources on requeue
ml requeue <task_id> --priority 10 --gpu 1 -- --epochs 20
What it does:
- Locates run_manifest.json
- Extracts commit_id
- Submits a new queue request using that commit_id with optional overridden args/resources
Notes:
- Tasks support optional snapshot_id and dataset_specs fields server-side (for provenance and dataset resolution).
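The lookup order above can be sketched as follows. This is a minimal Python illustration, assuming a manifest shaped like {"run_id": ..., "commit_id": ...}; the real CLI's resolution logic and field names may differ.

```python
import json
import os

def resolve_commit_id(target: str, worker_base: str = "/data/ml-experiments") -> str:
    """Resolve a requeue target to a full commit_id (illustrative sketch).

    Mirrors the lookup order described above: a run_manifest.json path,
    a run directory containing one, or a run_id scanned for under
    worker_base.
    """
    if os.path.isfile(target):                  # direct manifest path
        manifest = target
    elif os.path.isdir(target):                 # run directory
        manifest = os.path.join(target, "run_manifest.json")
    else:                                       # run_id: scan worker_base
        manifest = None
        for root, _dirs, files in os.walk(worker_base):
            if "run_manifest.json" in files:
                path = os.path.join(root, "run_manifest.json")
                with open(path) as f:
                    if json.load(f).get("run_id") == target:
                        manifest = path
                        break
        if manifest is None:
            raise FileNotFoundError(f"no run_manifest.json for {target!r}")
    with open(manifest) as f:
        return json.load(f)["commit_id"]
```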
watch - Auto-Sync Monitoring
# Watch directory for changes
ml watch ./project
# Watch and auto-queue on changes
ml watch ./project --name "dev-exp" --queue
Features:
- Real-time file system monitoring
- Automatic re-sync on changes
- Polling interval of 2 seconds
- Commit ID comparison for efficiency
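The watch loop boils down to polling the tree hash and re-syncing only when it changes. A minimal sketch, where compute_id stands in for the SHA256 tree hash and on_change for the sync-and-queue step (both hypothetical callables, not real CLI APIs):

```python
import time

def watch(project_dir, compute_id, on_change, interval=2.0, iterations=None):
    """Poll a directory and trigger a re-sync when its commit ID changes.

    Illustrative sketch of the 2-second polling loop described above.
    `iterations` bounds the loop so the sketch can terminate; the real
    command runs until interrupted.
    """
    last = compute_id(project_dir)
    count = 0
    while iterations is None or count < iterations:
        time.sleep(interval)
        current = compute_id(project_dir)
        if current != last:          # only re-sync when content changed
            on_change(current)
            last = current
        count += 1
```

Comparing commit IDs rather than timestamps means a touched-but-unchanged file never triggers a redundant sync.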
prune - Cleanup Management
# Keep last N experiments
ml prune --keep 20
# Remove experiments older than N days
ml prune --older-than 30
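The two retention policies can be sketched as a pure selection function. This is an assumption about the semantics (keep the newest N, or drop anything older than the cutoff), not the CLI's actual code:

```python
import time

def prune(experiments, keep=None, older_than_days=None, now=None):
    """Select experiments to delete (sketch of the two prune policies).

    `experiments` is a list of (name, mtime) pairs. Returns names to
    remove; the actual deletion step is left out of the sketch.
    """
    now = now if now is not None else time.time()
    by_age = sorted(experiments, key=lambda e: e[1], reverse=True)  # newest first
    doomed = []
    for i, (name, mtime) in enumerate(by_age):
        too_many = keep is not None and i >= keep
        too_old = (older_than_days is not None
                   and (now - mtime) > older_than_days * 86400)
        if too_many or too_old:
            doomed.append(name)
    return doomed
```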
monitor - Remote Monitoring
ml monitor
Launches TUI interface via SSH for real-time monitoring.
status - System Status
ml status --json returns a JSON object including an optional prewarm field when worker prewarming is active:
{
"prewarm": [
{
"worker_id": "worker-1",
"task_id": "<task-id>",
"started_at": "2025-01-01T00:00:00Z",
"updated_at": "2025-01-01T00:00:05Z",
"phase": "datasets",
"dataset_count": 2
}
]
}
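Consuming that output from a script is straightforward. A small sketch that summarizes active prewarm entries, using the field names from the example above (and tolerating an absent prewarm key):

```python
import json

def summarize_prewarm(status_json: str) -> list:
    """Summarize prewarm entries from `ml status --json` output.

    Field names follow the example above; the `prewarm` key may be
    absent entirely when no prewarming is active.
    """
    status = json.loads(status_json)
    lines = []
    for entry in status.get("prewarm", []):
        lines.append(f"{entry['worker_id']}: phase={entry['phase']} "
                     f"datasets={entry.get('dataset_count', 0)}")
    return lines
```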
cancel - Job Cancellation
ml cancel running-job-id
Cancels currently running jobs by ID.
logs - Fetch and Follow Job Logs
Retrieve logs from running or completed ML experiments.
# Show full logs for a job
ml logs job123
# Show last 100 lines (tail)
ml logs job123 -n 100
ml logs job123 --tail 100
# Follow logs in real-time (like tail -f)
ml logs job123 -f
ml logs job123 --follow
# Combine tail and follow
ml logs job123 -n 50 -f
Features:
- WebSocket-based log streaming for real-time updates
- Works with both running and completed jobs
- Automatic reconnection on network issues
- Scrollable output with pagination support
Common Use Cases:
# Check why a job failed
ml logs failed-job-abc123
# Monitor a running training job
ml logs training-job-xyz789 -f
# Get recent errors only
ml logs job123 -n 20 | grep -i error
jupyter - Jupyter Notebook Management
Manage Jupyter notebook services via WebSocket protocol.
# Start a Jupyter service
ml jupyter start --name my-notebook --workspace /path/to/workspace
# Start with password protection
ml jupyter start --name my-notebook --workspace /path/to/workspace --password mypass
# List running services
ml jupyter list
# Stop a service
ml jupyter stop service-id-12345
# Check service status
ml jupyter status
Features:
- WebSocket-based binary protocol for low latency
- Secure API key authentication (SHA256 hashed)
- Real-time service management
- Workspace isolation
Common Use Cases:
# Development workflow
ml jupyter start --name dev-notebook --workspace ./notebooks
# ... do development work ...
ml jupyter stop dev-service-123
# Team collaboration
ml jupyter start --name team-analysis --workspace /shared/analysis --password teampass
# Multiple services
ml jupyter list # View all running services
Security:
- API keys are hashed before transmission
- Password protection for notebooks
- Workspace path validation
- Service ID-based authorization
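The "API keys are hashed before transmission" point can be illustrated in a few lines. This is a minimal sketch of plain SHA256 hashing; any salt or key-derivation detail in the real protocol is omitted here.

```python
import hashlib

def hash_api_key(api_key: str) -> str:
    """Hash an API key with SHA256 before it leaves the client.

    Sketch only: the server stores and compares the hex digest, so the
    raw key never crosses the wire in cleartext.
    """
    return hashlib.sha256(api_key.encode()).hexdigest()
```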
Configuration
Example ~/.ml/config.toml:
worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"
Performance Features
- Content-Addressed Storage: Automatic deduplication of identical files
- Incremental Sync: Only transfers changed files
- SHA256 Hashing: Reliable commit ID generation
- WebSocket Communication: Efficient real-time messaging
- Multi-threaded: Concurrent operations where applicable
Go Commands
API Server (./cmd/api-server/main.go)
Main HTTPS API server for experiment management.
# Build and run
go run ./cmd/api-server/main.go
# With configuration
./bin/api-server --config configs/api/dev.yaml
Features:
- HTTPS-only communication
- API key authentication
- Rate limiting and IP whitelisting
- WebSocket support for real-time updates
- Redis integration for caching
TUI (./cmd/tui/main.go)
Terminal User Interface for monitoring experiments.
# Launch TUI
go run ./cmd/tui/main.go
Features:
- Real-time experiment monitoring
- Interactive job management
- Status visualization
- Log viewing
Data Manager (./cmd/data_manager/)
Utilities for data synchronization and management.
# Sync data
./data_manager --sync ./data
# Clean old data
./data_manager --cleanup --older-than 30d
Config Lint (./cmd/configlint/main.go)
Configuration validation and linting tool.
# Validate configuration
./configlint configs/api/dev.yaml
# Check schema compliance
./configlint --schema configs/schema/api_server_config.yaml
Management Script (./tools/manage.sh)
Simple service management for your homelab.
Commands
./tools/manage.sh start # Start all services
./tools/manage.sh stop # Stop all services
./tools/manage.sh status # Check service status
./tools/manage.sh logs # View logs
./tools/manage.sh monitor # Basic monitoring
./tools/manage.sh security # Security status
./tools/manage.sh cleanup # Clean project artifacts
API Testing
Test the API with curl:
# Health check
curl -f http://localhost:8080/health
# List experiments
curl -H 'X-API-Key: password' http://localhost:8080/experiments
# Submit experiment
curl -X POST -H 'X-API-Key: password' \
-H 'Content-Type: application/json' \
-d '{"name":"test","config":{"type":"basic"}}' \
http://localhost:8080/experiments
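The same submission can be made from Python. The endpoint path, header name, and payload shape follow the curl examples above; everything else is a minimal sketch.

```python
import json
import urllib.request

def build_submit_request(name, config, api_key, base_url="http://localhost:8080"):
    """Build the POST request for submitting an experiment.

    Python equivalent of the curl call above.
    """
    body = json.dumps({"name": name, "config": config}).encode()
    return urllib.request.Request(
        f"{base_url}/experiments",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

def submit_experiment(name, config, api_key, base_url="http://localhost:8080"):
    """Send the request and decode the JSON response (needs a running server)."""
    req = build_submit_request(name, config, api_key, base_url)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())
```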
Zig CLI Architecture
The Zig CLI is designed for performance and reliability:
Core Components
- Commands (cli/src/commands/): Individual command implementations
- Config (cli/src/config.zig): Configuration management
- Network (cli/src/net/ws.zig): WebSocket client implementation
- Utils (cli/src/utils/): Cryptography, storage, and rsync utilities
- Errors (cli/src/errors.zig): Centralized error handling
Performance Optimizations
- Content-Addressed Storage: Deduplicates identical files across experiments
- SHA256 Hashing: Fast, reliable commit ID generation
- Rsync Integration: Efficient incremental file transfers
- WebSocket Protocol: Low-latency communication with worker
- Memory Management: Efficient allocation with Zig's allocator system
Security Features
- API Key Hashing: Secure authentication token handling
- SSH Integration: Secure file transfers
- Input Validation: Comprehensive argument checking
- Error Handling: Secure error reporting without information leakage
Configuration
Main configuration file: configs/api/dev.yaml
Key Settings
auth:
enabled: true
api_keys:
dev_user:
hash: "CHANGE_ME_SHA256_DEV_USER_KEY"
admin: true
roles:
- admin
permissions:
'*': true
researcher_user:
hash: "CHANGE_ME_SHA256_RESEARCHER_USER_KEY"
admin: false
roles:
- researcher
permissions:
'experiments': true
'datasets': true
server:
address: ":9101"
tls:
enabled: false # Set to true for production
cert_file: "./ssl/cert.pem"
key_file: "./ssl/key.pem"
security:
rate_limit:
enabled: true
requests_per_minute: 30
ip_whitelist:
- "127.0.0.1"
- "::1"
- "localhost"
- "10.0.0.0/8"
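A whitelist check against entries like these has to handle single addresses, CIDR ranges, and literal names. An illustrative sketch of that logic (not the server's actual implementation):

```python
import ipaddress

def ip_allowed(client_ip: str, whitelist) -> bool:
    """Check a client IP against a whitelist like the one above.

    Entries may be single addresses ("127.0.0.1", "::1"), CIDR ranges
    ("10.0.0.0/8"), or literal names ("localhost") matched exactly.
    """
    for entry in whitelist:
        if entry == client_ip:              # literal match (e.g. "localhost")
            return True
        try:
            if ipaddress.ip_address(client_ip) in ipaddress.ip_network(entry, strict=False):
                return True
        except (ValueError, TypeError):     # non-IP entry or IPv4/IPv6 mismatch
            continue
    return False
```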
Docker Commands
If using Docker Compose:
# Start services
docker-compose up -d  # testing only
# View logs
docker-compose logs -f
# Stop services
docker-compose down
# Check status
docker-compose ps
Troubleshooting
Common Issues
Zig CLI not found:
# Build the CLI
cd cli && make build
# Check binary exists
ls -la ./cli/zig-out/bin/ml
Configuration not found:
# Create configuration
./cli/zig-out/bin/ml init
# Check config file
ls -la ~/.ml/config.toml
Worker connection failed:
# Test SSH connection
ssh -p 22 mluser@worker.local
# Check configuration
cat ~/.ml/config.toml
Sync not working:
# Check rsync availability
rsync --version
# Test manual sync
rsync -avz ./project/ mluser@worker.local:/tmp/test/
WebSocket connection failed:
# Check worker WebSocket port
telnet worker.local 9100
# Verify API key
./cli/zig-out/bin/ml status
API not responding:
./tools/manage.sh status
./tools/manage.sh logs
Authentication failed:
# Check API key in config
grep -A 5 "api_keys:" configs/api/dev.yaml
Redis connection failed:
# Check Redis status
redis-cli ping
# Start Redis
redis-server
Getting Help
# CLI help
./cli/zig-out/bin/ml help
# Management script help
./tools/manage.sh help
# Check all available commands
make help
That's it for the CLI reference! For complete setup instructions, see the main index.