
FetchML CLI/TUI UX Contract v1

This document defines the user experience contract for FetchML v1, focusing on clean, predictable CLI/TUI interactions without mode flags.

Core Principles

  1. Thin CLI: Local CLI does minimal validation; authoritative checks happen server-side
  2. No Mode Flags: Commands do what they say; no --mode or similar flags
  3. Predictable Defaults: Sensible defaults that work for most use cases
  4. Graceful Degradation: JSON output for automation, human-friendly output for interactive use
  5. Explicit Operations: --dry-run, --validate, --explain are explicit, not implied

Commands v1

Core Workflow Commands

ml queue <job-name> [options]

Submit a job for execution.

Basic Usage:

ml queue my-experiment

Options:

  • --commit <sha>: Specify commit ID (default: current git HEAD)
  • --priority <1-10>: Job priority (default: 5)
  • --cpu <cores>: CPU cores requested (default: 2)
  • --memory <gb>: Memory in GB (default: 8)
  • --gpu <count>: GPU count (default: 0)
  • --gpu-memory <gb>: GPU memory budget (default: auto)

Dry Run:

ml queue my-experiment --dry-run
# Output: JSON with what would be submitted, validation results

Validate Only:

ml queue my-experiment --validate
# Output: Validation results without submitting

Explain:

ml queue my-experiment --explain
# Output: Human-readable explanation of what will happen

JSON Output: When using --json, the response may include a prewarm field describing best-effort worker prewarming activity (e.g. dataset prefetch for the next queued task).

ml queue my-experiment --json
# Output: Structured JSON response
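For automation, --dry-run composes naturally with --json: a script can gate the real submission on the dry run's exit code. A minimal sketch — the `ml` binary is stubbed out here so the example is self-contained, and the stub's JSON is illustrative, not the real response schema:

```shell
# Stub standing in for the real `ml` binary; returns 0 like a passing dry run.
ml() { echo '{"job_name":"my-experiment","status":"queued"}'; return 0; }

# Submit only if the dry run succeeds (exit code 0 = success).
if ml queue my-experiment --dry-run --json > /dev/null; then
  result=$(ml queue my-experiment --json)
  echo "submitted: $result"
fi
```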

ml status [job-name]

Show job status.

Basic Usage:

ml status              # All jobs summary
ml status my-experiment # Specific job details

Options:

  • --json: JSON output
  • --watch: Watch mode (refresh every 2s)
  • --limit <n>: Limit number of jobs shown (default: 20)

ml cancel <job-name>

Cancel a running or queued job.

Basic Usage:

ml cancel my-experiment

Options:

  • --force: Force cancel even if running
  • --json: JSON output

Experiment Management

ml experiment init <name>

Initialize a new experiment directory.

Basic Usage:

ml experiment init my-project

Options:

  • --template <name>: Use experiment template
  • --dry-run: Show what would be created

ml experiment list

List available experiments.

Options:

  • --json: JSON output
  • --limit <n>: Limit results

ml experiment show <commit-id>

Show experiment details.

Options:

  • --json: JSON output
  • --manifest: Show content integrity manifest

Dataset Management

ml dataset list

List available datasets.

Options:

  • --json: JSON output
  • --synced-only: Show only synced datasets

ml dataset sync <dataset-name>

Sync a dataset from NAS to ML server.

Options:

  • --dry-run: Show what would be synced
  • --validate: Validate dataset integrity after sync

Monitoring & TUI

ml monitor

Launch TUI for real-time monitoring (runs over SSH).

Basic Usage:

ml monitor

TUI Controls:

  • Ctrl+C: Exit TUI
  • q: Quit
  • r: Refresh
  • j/k: Navigate jobs
  • Enter: Job details
  • c: Cancel selected job

ml watch <job-name>

Watch a specific job's output.

Options:

  • --follow: Follow log output (default)
  • --tail <n>: Show last n lines

Global Options

These options work with any command:

  • --json: Output structured JSON instead of human-readable format
  • --config <path>: Use custom config file (default: ~/.ml/config.toml)
  • --verbose: Verbose output
  • --quiet: Minimal output
  • --help: Show help for command

Defaults Configuration

Default Job Resources

[defaults]
cpu = 2              # CPU cores
memory = 8           # GB
gpu = 0              # GPU count
gpu_memory = "auto"  # Auto-detect or specify GB
priority = 5         # Job priority (1-10)

Default Behavior

  • Commit ID: Current git HEAD (requires a clean working directory)
  • Working Directory: Current directory for experiment files
  • Output: Human-readable format unless --json specified
  • Validation: Server-side authoritative validation

Error Handling

Exit Codes

  • 0: Success
  • 1: General error
  • 2: Invalid arguments
  • 3: Validation failed
  • 4: Network/connection error
  • 5: Server error
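Scripts can branch on these codes. A small helper mapping each documented code to its meaning (a sketch only; the table above is the authoritative source):

```shell
# Map FetchML exit codes (see the table above) to human-readable messages.
describe_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "general error" ;;
    2) echo "invalid arguments" ;;
    3) echo "validation failed" ;;
    4) echo "network/connection error" ;;
    5) echo "server error" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# Usage: ml queue my-experiment; describe_exit $?
```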

Error Output Format

Human-readable:

Error: Experiment validation failed
  - Missing dependency manifest (environment.yml, poetry.lock, pyproject.toml, or requirements.txt)
  - Train script not found: train.py

JSON:

{
  "error": "validation_failed",
  "message": "Experiment validation failed",
  "details": [
    {"field": "dependency_manifest", "error": "missing", "supported": ["environment.yml", "poetry.lock", "pyproject.toml", "requirements.txt"]},
    {"field": "train_script", "error": "not_found", "expected": "train.py"}
  ]
}

Ctrl+C Semantics

Command Cancellation

  • Ctrl+C during ml queue --dry-run: Immediate exit, no side effects
  • Ctrl+C during ml queue: Attempt to cancel submission, show status
  • Ctrl+C during ml status --watch: Exit watch mode
  • Ctrl+C during ml monitor: Gracefully exit TUI
  • Ctrl+C during ml watch: Stop following logs, show final status

Graceful Shutdown

  1. Signal interrupt to server (if applicable)
  2. Clean up local resources
  3. Display current status
  4. Exit with appropriate code
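The four steps above can be sketched as a SIGINT trap. The cancellation call is a placeholder (the real CLI would signal the server), and exit code 130 follows the shell convention of 128 + signal number (SIGINT = 2):

```shell
graceful_shutdown() {
  # 1. Signal interrupt to server (placeholder; the real CLI calls the API)
  echo "cancelling pending operations..."
  # 2. Clean up local resources (temp files, connections, ...)
  # 3. Display current status
  echo "status: interrupted"
  # 4. Exit code 130 = 128 + SIGINT, by shell convention
  return 130
}

trap 'graceful_shutdown; exit $?' INT
```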

JSON Output Schema

Job Submission Response

{
  "job_id": "uuid-string",
  "job_name": "my-experiment",
  "status": "queued",
  "commit_id": "abc123...",
  "submitted_at": "2025-01-01T12:00:00Z",
  "estimated_start": "2025-01-01T12:05:00Z",
  "resources": {
    "cpu": 2,
    "memory_gb": 8,
    "gpu": 1,
    "gpu_memory_gb": 16
  }
}

Status Response

{
  "jobs": [
    {
      "job_id": "uuid-string",
      "job_name": "my-experiment",
      "status": "running",
      "progress": 0.75,
      "started_at": "2025-01-01T12:05:00Z",
      "estimated_completion": "2025-01-01T12:30:00Z",
      "node": "worker-01"
    }
  ],
  "total": 1,
  "showing": 1
}
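Consumers should parse these responses with a real JSON tool such as jq, as the automation example later in this document does. For illustration only, a deliberately naive pure-bash extraction of the progress field from a sample response:

```shell
# Sample status response (abbreviated from the schema above).
status_json='{"jobs":[{"job_id":"uuid-string","status":"running","progress":0.75}],"total":1,"showing":1}'

# Naive field extraction via parameter expansion; this breaks on nested or
# reordered JSON, so prefer jq in real scripts.
progress=${status_json#*\"progress\":}
progress=${progress%%,*}
progress=${progress%%\}*}
echo "progress: $progress"
```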

Examples

Typical Workflow

# 1. Initialize experiment
ml experiment init my-project
cd my-project

# 2. Validate experiment (nothing is submitted; authoritative checks run server-side)
ml queue . --validate --dry-run

# 3. Submit job
ml queue . --priority 8 --gpu 1

# 4. Monitor progress
ml status .
ml watch .

# 5. Check results
ml status . --json

Automation Script

#!/bin/bash
# Submit job and wait for completion
JOB_ID=$(ml queue my-experiment --json | jq -r '.job_id')

echo "Submitted job: $JOB_ID"

# Poll until the job reaches a terminal state
while true; do
  STATUS=$(ml status "$JOB_ID" --json | jq -r '.jobs[0].status')
  echo "Status: $STATUS"

  if [[ "$STATUS" == "completed" || "$STATUS" == "failed" ]]; then
    break
  fi

  sleep 10
done

ml status "$JOB_ID"

Implementation Notes

Server-side Validation

  • CLI performs minimal local checks (git status, file existence)
  • All authoritative validation happens on worker
  • Validation failures are propagated back to CLI with clear error messages

Trust Contract Integration

  • Every job submission includes commit ID and content integrity manifest
  • Worker validates both before execution
  • Any mismatch causes hard-fail with detailed error reporting
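The manifest format itself is outside this contract, but the idea can be sketched with standard tools: one sha256 digest per tracked file, emitted in a stable order so client and worker can compare manifests byte-for-byte. `build_manifest` below is a hypothetical helper, not part of the CLI:

```shell
# Hypothetical helper: sha256 digest of every file under a directory,
# sorted so the output is deterministic across machines.
build_manifest() {
  local dir="$1"
  ( cd "$dir" && find . -type f -print0 | sort -z | xargs -0 sha256sum )
}

# Two identical manifests imply identical content; any drift hard-fails.
```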

Resource Management

  • Resource requests are validated against available capacity
  • Jobs are queued based on priority and resource availability
  • Resource usage is tracked and reported in status

Future Extensions

The v1 contract is intentionally minimal but designed for extension:

Phase 0: Trust and usability (highest priority)

  1. Make ml status excellent - Compact summary with queue counts, relevant tasks, prewarm state
  2. Add ml explain - Dry-run preview command showing resolved execution plan
  3. Tighten run manifest completeness - Require timestamps, exit codes, dataset identities
  4. Dataset identity - Structured dataset_specs with checksums (strict-by-default)

Phase 1: Simple performance wins

  • Keep prewarming single-level (next task only)
  • Improve observability first (status output + metrics)

Phase 2+: Research workflows

  • ml compare <runA> <runB>: Manifest-driven diff of provenance
  • ml reproduce <run-id>: Submit task from recorded manifest
  • ml export <run-id>: Package provenance + artifacts

Phase 3: Infrastructure (only if needed)

  • Multi-level prewarming, predictive scheduling
  • Optional S3-compatible storage (MinIO)
  • Optional integrations (MLflow, W&B)
  • Optional Kubernetes deployment

All extensions will maintain backward compatibility with the v1 contract.