
FetchML CLI/TUI UX Contract v1

This document defines the user experience contract for FetchML v1, focusing on clean, predictable CLI/TUI interactions without mode flags.

Core Principles

  1. Thin CLI: Local CLI does minimal validation; authoritative checks happen server-side
  2. No Mode Flags: Commands do what they say; no --mode or similar flags
  3. Predictable Defaults: Sensible defaults that work for most use cases
  4. Graceful Degradation: JSON output for automation, human-friendly output for interactive use
  5. Explicit Operations: --dry-run, --validate, --explain are explicit, not implied

Commands v1

Core Workflow Commands

ml queue <job-name> [options]

Submit a job for execution.

Basic Usage:

ml queue my-experiment

Options:

  • --commit <sha>: Specify commit ID (default: current git HEAD)
  • --priority <1-10>: Job priority (default: 5)
  • --cpu <cores>: CPU cores requested (default: 2)
  • --memory <gb>: Memory in GB (default: 8)
  • --gpu <count>: GPU count (default: 0)
  • --gpu-memory <gb>: GPU memory budget (default: auto)

Dry Run:

ml queue my-experiment --dry-run
# Output: JSON with what would be submitted, validation results

Validate Only:

ml queue my-experiment --validate
# Output: Validation results without submitting

Explain:

ml queue my-experiment --explain
# Output: Human-readable explanation of what will happen

JSON Output: When using --json, the response may include a prewarm field describing best-effort worker prewarming activity (e.g. dataset prefetch for the next queued task).

ml queue my-experiment --json
# Output: Structured JSON response
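For automation, --dry-run composes naturally with --json: a script can gate the real submission on the dry run's exit code. A minimal sketch — the `ml` binary is stubbed out here so the example is self-contained, and the stub's JSON is illustrative, not the real response schema:

```shell
# Stub standing in for the real `ml` binary; returns 0 like a passing dry run.
ml() { echo '{"job_name":"my-experiment","status":"queued"}'; return 0; }

# Submit only if the dry run succeeds (exit code 0 = success).
if ml queue my-experiment --dry-run --json > /dev/null; then
  result=$(ml queue my-experiment --json)
  echo "submitted: $result"
fi
```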

ml status [job-name]

Show job status.

Basic Usage:

ml status              # All jobs summary
ml status my-experiment # Specific job details

Options:

  • --json: JSON output
  • --watch: Watch mode (refresh every 2s)
  • --limit <n>: Limit number of jobs shown (default: 20)

ml cancel <job-name>

Cancel a running or queued job.

Basic Usage:

ml cancel my-experiment

Options:

  • --force: Force cancel even if running
  • --json: JSON output

Experiment Management

ml experiment init <name>

Initialize a new experiment directory.

Basic Usage:

ml experiment init my-project

Options:

  • --template <name>: Use experiment template
  • --dry-run: Show what would be created

ml experiment list

List available experiments.

Options:

  • --json: JSON output
  • --limit <n>: Limit results

ml experiment show <commit-id>

Show experiment details.

Options:

  • --json: JSON output
  • --manifest: Show content integrity manifest

Dataset Management

ml dataset list

List available datasets.

Options:

  • --json: JSON output
  • --synced-only: Show only synced datasets

ml dataset sync <dataset-name>

Sync a dataset from NAS to ML server.

Options:

  • --dry-run: Show what would be synced
  • --validate: Validate dataset integrity after sync

Monitoring & TUI

ml monitor

Launch TUI for real-time monitoring (runs over SSH).

Basic Usage:

ml monitor

TUI Controls:

  • Ctrl+C: Exit TUI
  • q: Quit
  • r: Refresh
  • j/k: Navigate jobs
  • Enter: Job details
  • c: Cancel selected job

ml watch <job-name>

Watch a specific job's output.

Options:

  • --follow: Follow log output (default)
  • --tail <n>: Show last n lines

Global Options

These options work with any command:

  • --json: Output structured JSON instead of human-readable format
  • --config <path>: Use custom config file (default: ~/.ml/config.toml)
  • --verbose: Verbose output
  • --quiet: Minimal output
  • --help: Show help for command

Defaults Configuration

Default Job Resources

[defaults]
cpu = 2              # CPU cores
memory = 8           # GB
gpu = 0              # GPU count
gpu_memory = "auto"  # Auto-detect or specify GB
priority = 5         # Job priority (1-10)

Default Behavior

  • Commit ID: Current git HEAD (requires a clean working directory)
  • Working Directory: Current directory for experiment files
  • Output: Human-readable format unless --json specified
  • Validation: Server-side authoritative validation

Error Handling

Exit Codes

  • 0: Success
  • 1: General error
  • 2: Invalid arguments
  • 3: Validation failed
  • 4: Network/connection error
  • 5: Server error
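Scripts can branch on these codes. A small helper mapping each documented code to its meaning (a sketch only; the table above is the authoritative source):

```shell
# Map FetchML exit codes (see the table above) to human-readable messages.
describe_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "general error" ;;
    2) echo "invalid arguments" ;;
    3) echo "validation failed" ;;
    4) echo "network/connection error" ;;
    5) echo "server error" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# Usage: ml queue my-experiment; describe_exit $?
```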

Error Output Format

Human-readable:

Error: Experiment validation failed
  - Missing dependency manifest (environment.yml, poetry.lock, pyproject.toml, or requirements.txt)
  - Train script not found: train.py

JSON:

{
  "error": "validation_failed",
  "message": "Experiment validation failed",
  "details": [
    {"field": "dependency_manifest", "error": "missing", "supported": ["environment.yml", "poetry.lock", "pyproject.toml", "requirements.txt"]},
    {"field": "train_script", "error": "not_found", "expected": "train.py"}
  ]
}

Ctrl+C Semantics

Command Cancellation

  • Ctrl+C during ml queue --dry-run: Immediate exit, no side effects
  • Ctrl+C during ml queue: Attempt to cancel submission, show status
  • Ctrl+C during ml status --watch: Exit watch mode
  • Ctrl+C during ml monitor: Gracefully exit TUI
  • Ctrl+C during ml watch: Stop following logs, show final status

Graceful Shutdown

  1. Signal interrupt to server (if applicable)
  2. Clean up local resources
  3. Display current status
  4. Exit with appropriate code
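The four steps above can be sketched as a SIGINT trap. The cancellation call is a placeholder (the real CLI would signal the server), and exit code 130 follows the shell convention of 128 + signal number (SIGINT = 2):

```shell
graceful_shutdown() {
  # 1. Signal interrupt to server (placeholder; the real CLI calls the API)
  echo "cancelling pending operations..."
  # 2. Clean up local resources (temp files, connections, ...)
  # 3. Display current status
  echo "status: interrupted"
  # 4. Exit code 130 = 128 + SIGINT, by shell convention
  return 130
}

trap 'graceful_shutdown; exit $?' INT
```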

JSON Output Schema

Job Submission Response

{
  "job_id": "uuid-string",
  "job_name": "my-experiment",
  "status": "queued",
  "commit_id": "abc123...",
  "submitted_at": "2025-01-01T12:00:00Z",
  "estimated_start": "2025-01-01T12:05:00Z",
  "resources": {
    "cpu": 2,
    "memory_gb": 8,
    "gpu": 1,
    "gpu_memory_gb": 16
  }
}

Status Response

{
  "jobs": [
    {
      "job_id": "uuid-string",
      "job_name": "my-experiment",
      "status": "running",
      "progress": 0.75,
      "started_at": "2025-01-01T12:05:00Z",
      "estimated_completion": "2025-01-01T12:30:00Z",
      "node": "worker-01"
    }
  ],
  "total": 1,
  "showing": 1
}
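Consumers should parse these responses with a real JSON tool such as jq, as the automation example later in this document does. For illustration only, a deliberately naive pure-bash extraction of the progress field from a sample response:

```shell
# Sample status response (abbreviated from the schema above).
status_json='{"jobs":[{"job_id":"uuid-string","status":"running","progress":0.75}],"total":1,"showing":1}'

# Naive field extraction via parameter expansion; this breaks on nested or
# reordered JSON, so prefer jq in real scripts.
progress=${status_json#*\"progress\":}
progress=${progress%%,*}
progress=${progress%%\}*}
echo "progress: $progress"
```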

Examples

Typical Workflow

# 1. Initialize experiment
ml experiment init my-project
cd my-project

# 2. Validate experiment (nothing is submitted; authoritative checks run server-side)
ml queue . --validate --dry-run

# 3. Submit job
ml queue . --priority 8 --gpu 1

# 4. Monitor progress
ml status .
ml watch .

# 5. Check results
ml status . --json

Automation Script

#!/bin/bash
# Submit job and wait for completion
JOB_ID=$(ml queue my-experiment --json | jq -r '.job_id')

echo "Submitted job: $JOB_ID"

# Poll until the job reaches a terminal state
while true; do
  STATUS=$(ml status "$JOB_ID" --json | jq -r '.jobs[0].status')
  echo "Status: $STATUS"

  if [[ "$STATUS" == "completed" || "$STATUS" == "failed" ]]; then
    break
  fi

  sleep 10
done

ml status "$JOB_ID"

Implementation Notes

Server-side Validation

  • CLI performs minimal local checks (git status, file existence)
  • All authoritative validation happens on worker
  • Validation failures are propagated back to CLI with clear error messages

Trust Contract Integration

  • Every job submission includes commit ID and content integrity manifest
  • Worker validates both before execution
  • Any mismatch causes hard-fail with detailed error reporting
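The manifest format itself is outside this contract, but the idea can be sketched with standard tools: one sha256 digest per tracked file, emitted in a stable order so client and worker can compare manifests byte-for-byte. `build_manifest` below is a hypothetical helper, not part of the CLI:

```shell
# Hypothetical helper: sha256 digest of every file under a directory,
# sorted so the output is deterministic across machines.
build_manifest() {
  local dir="$1"
  ( cd "$dir" && find . -type f -print0 | sort -z | xargs -0 sha256sum )
}

# Two identical manifests imply identical content; any drift hard-fails.
```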

Resource Management

  • Resource requests are validated against available capacity
  • Jobs are queued based on priority and resource availability
  • Resource usage is tracked and reported in status

Future Extensions

The v1 contract is intentionally minimal but designed for extension:

Phase 0: Trust and usability (highest priority)

  1. Make ml status excellent - Compact summary with queue counts, relevant tasks, prewarm state
  2. Add ml explain - Dry-run preview command showing resolved execution plan
  3. Tighten run manifest completeness - Require timestamps, exit codes, dataset identities
  4. Dataset identity - Structured dataset_specs with checksums (strict-by-default)

Phase 1: Simple performance wins

  • Keep prewarming single-level (next task only)
  • Improve observability first (status output + metrics)

Phase 2+: Research workflows

  • ml compare <runA> <runB>: Manifest-driven diff of provenance
  • ml reproduce <run-id>: Submit task from recorded manifest
  • ml export <run-id>: Package provenance + artifacts

Phase 3: Infrastructure (only if needed)

  • Multi-level prewarming, predictive scheduling
  • Optional S3-compatible storage (MinIO)
  • Optional integrations (MLflow, W&B)
  • Optional Kubernetes deployment

All extensions will maintain backward compatibility with the v1 contract.