# FetchML CLI/TUI UX Contract v1
This document defines the user experience contract for FetchML v1, focusing on clean, predictable CLI/TUI interactions without mode flags.
## Core Principles
- Thin CLI: Local CLI does minimal validation; authoritative checks happen server-side
- No Mode Flags: Commands do what they say; no `--mode` or similar flags
- Predictable Defaults: Sensible defaults that work for most use cases
- Graceful Degradation: JSON output for automation, human-friendly output for interactive use
- Explicit Operations: `--dry-run`, `--validate`, and `--explain` are explicit, not implied
## Commands v1

### Core Workflow Commands

#### `ml queue <job-name> [options]`
Submit a job for execution.
Basic Usage:

```bash
ml queue my-experiment
```
Options:

- `--commit <sha>`: Specify commit ID (default: current git HEAD)
- `--priority <1-10>`: Job priority (default: 5)
- `--cpu <cores>`: CPU cores requested (default: 2)
- `--memory <gb>`: Memory in GB (default: 8)
- `--gpu <count>`: GPU count (default: 0)
- `--gpu-memory <gb>`: GPU memory budget (default: auto)
Dry Run:

```bash
ml queue my-experiment --dry-run
# Output: JSON with what would be submitted, validation results
```

Validate Only:

```bash
ml queue my-experiment --validate
# Output: Validation results without submitting
```

Explain:

```bash
ml queue my-experiment --explain
# Output: Human-readable explanation of what will happen
```
JSON Output:

When using `--json`, the response may include a `prewarm` field describing best-effort worker prewarming activity (e.g. dataset prefetch for the next queued task).

```bash
ml queue my-experiment --json
# Output: Structured JSON response
```
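As a sketch of what such a response could look like (the exact shape of the `prewarm` field is not pinned down by this contract; the field names inside it below are illustrative):

```json
{
  "job_id": "uuid-string",
  "job_name": "my-experiment",
  "status": "queued",
  "prewarm": {
    "attempted": true,
    "dataset": "example-dataset",
    "state": "prefetching"
  }
}
```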
#### `ml status [job-name]`
Show job status.
Basic Usage:

```bash
ml status                 # All jobs summary
ml status my-experiment   # Specific job details
```
Options:

- `--json`: JSON output
- `--watch`: Watch mode (refresh every 2s)
- `--limit <n>`: Limit number of jobs shown (default: 20)
#### `ml cancel <job-name>`
Cancel a running or queued job.
Basic Usage:

```bash
ml cancel my-experiment
```
Options:

- `--force`: Force cancel even if running
- `--json`: JSON output
### Experiment Management

#### `ml experiment init <name>`
Initialize a new experiment directory.
Basic Usage:

```bash
ml experiment init my-project
```
Options:

- `--template <name>`: Use an experiment template
- `--dry-run`: Show what would be created
#### `ml experiment list`
List available experiments.
Options:

- `--json`: JSON output
- `--limit <n>`: Limit results
#### `ml experiment show <commit-id>`
Show experiment details.
Options:

- `--json`: JSON output
- `--manifest`: Show content integrity manifest
### Dataset Management

#### `ml dataset list`
List available datasets.
Options:

- `--json`: JSON output
- `--synced-only`: Show only synced datasets
#### `ml dataset sync <dataset-name>`
Sync a dataset from NAS to ML server.
Options:

- `--dry-run`: Show what would be synced
- `--validate`: Validate dataset integrity after sync
### Monitoring & TUI

#### `ml monitor`
Launch TUI for real-time monitoring (runs over SSH).
Basic Usage:

```bash
ml monitor
```
TUI Controls:

- `Ctrl+C`: Exit TUI
- `q`: Quit
- `r`: Refresh
- `j`/`k`: Navigate jobs
- `Enter`: Job details
- `c`: Cancel selected job
#### `ml watch <job-name>`
Watch a specific job's output.
Options:

- `--follow`: Follow log output (default)
- `--tail <n>`: Show last n lines
## Global Options
These options work with any command:
- `--json`: Output structured JSON instead of human-readable format
- `--config <path>`: Use custom config file (default: `~/.ml/config.toml`)
- `--verbose`: Verbose output
- `--quiet`: Minimal output
- `--help`: Show help for command
## Defaults Configuration

### Default Job Resources
```toml
[defaults]
cpu = 2             # CPU cores
memory = 8          # GB
gpu = 0             # GPU count
gpu_memory = "auto" # Auto-detect or specify GB
priority = 5        # Job priority (1-10)
```
### Default Behavior
- Commit ID: Current git HEAD (must be clean working directory)
- Working Directory: Current directory for experiment files
- Output: Human-readable format unless `--json` is specified
- Validation: Server-side authoritative validation
## Error Handling

### Exit Codes
- `0`: Success
- `1`: General error
- `2`: Invalid arguments
- `3`: Validation failed
- `4`: Network/connection error
- `5`: Server error
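As a sketch, an automation script can branch on these codes to decide whether a retry makes sense. The `submit` function below is a stub standing in for an `ml queue` invocation (here it returns exit code 4 to simulate a network error):

```bash
#!/bin/bash
# Stub standing in for `ml queue ... --json`; returns 4 (network error)
submit() { return 4; }

submit
case $? in
  0) echo "submitted" ;;
  2) echo "invalid arguments: fix the command line" ;;
  3) echo "validation failed: fix the experiment before retrying" ;;
  4) echo "network error: safe to retry" ;;
  5) echo "server error: check server logs" ;;
  *) echo "general error" ;;
esac
```

Network errors (`4`) are the only class a script should retry blindly; validation failures (`3`) need a human.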
### Error Output Format
Human-readable:

```
Error: Experiment validation failed
  - Missing dependency manifest (environment.yml, poetry.lock, pyproject.toml, or requirements.txt)
  - Train script not found: train.py
```
JSON:

```json
{
  "error": "validation_failed",
  "message": "Experiment validation failed",
  "details": [
    {"field": "dependency_manifest", "error": "missing", "supported": ["environment.yml", "poetry.lock", "pyproject.toml", "requirements.txt"]},
    {"field": "train_script", "error": "not_found", "expected": "train.py"}
  ]
}
```
## Ctrl+C Semantics

### Command Cancellation
- Ctrl+C during `ml queue --dry-run`: Immediate exit, no side effects
- Ctrl+C during `ml queue`: Attempt to cancel submission, show status
- Ctrl+C during `ml status --watch`: Exit watch mode
- Ctrl+C during `ml monitor`: Gracefully exit TUI
- Ctrl+C during `ml watch`: Stop following logs, show final status
### Graceful Shutdown
- Signal interrupt to server (if applicable)
- Clean up local resources
- Display current status
- Exit with appropriate code
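The shutdown sequence above can be sketched with a standard bash `trap` handler. This is a generic signal-handling pattern, not FetchML code; the cleanup body is illustrative:

```bash
#!/bin/bash
# Generic SIGINT handling: run cleanup, report status, then exit deliberately.
cleanup() {
  echo "interrupt received: cleaning up local resources"
  # a real CLI would signal the server here, then exit 130 (128 + SIGINT)
}
trap cleanup INT

kill -INT $$   # simulate the user pressing Ctrl+C
echo "final status displayed"
```

The handler runs on SIGINT and control returns to the script, which lets it print a final status line before exiting with the appropriate code.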
## JSON Output Schema

### Job Submission Response
```json
{
  "job_id": "uuid-string",
  "job_name": "my-experiment",
  "status": "queued",
  "commit_id": "abc123...",
  "submitted_at": "2025-01-01T12:00:00Z",
  "estimated_start": "2025-01-01T12:05:00Z",
  "resources": {
    "cpu": 2,
    "memory_gb": 8,
    "gpu": 1,
    "gpu_memory_gb": 16
  }
}
```
### Status Response
```json
{
  "jobs": [
    {
      "job_id": "uuid-string",
      "job_name": "my-experiment",
      "status": "running",
      "progress": 0.75,
      "started_at": "2025-01-01T12:05:00Z",
      "estimated_completion": "2025-01-01T12:30:00Z",
      "node": "worker-01"
    }
  ],
  "total": 1,
  "showing": 1
}
```
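Scripts can pull individual fields out of this response with `jq`, as the automation example later in this document does. The sample JSON below is a hard-coded, abbreviated stand-in for real `ml status --json` output:

```bash
# Abbreviated sample status response; in practice this comes from `ml status --json`
STATUS_JSON='{"jobs":[{"job_id":"uuid-string","status":"running","progress":0.75}],"total":1,"showing":1}'

# Extract the first job's status and progress (assumes jq is installed)
echo "$STATUS_JSON" | jq -r '.jobs[0].status'     # prints: running
echo "$STATUS_JSON" | jq -r '.jobs[0].progress'   # prints: 0.75
```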
## Examples

### Typical Workflow
```bash
# 1. Initialize experiment
ml experiment init my-project
cd my-project

# 2. Validate experiment locally
ml queue . --validate --dry-run

# 3. Submit job
ml queue . --priority 8 --gpu 1

# 4. Monitor progress
ml status .
ml watch .

# 5. Check results
ml status . --json
```
### Automation Script
```bash
#!/bin/bash
# Submit job and wait for completion
JOB_ID=$(ml queue my-experiment --json | jq -r '.job_id')
echo "Submitted job: $JOB_ID"

# Wait for completion
while true; do
  STATUS=$(ml status "$JOB_ID" --json | jq -r '.jobs[0].status')
  echo "Status: $STATUS"
  if [[ "$STATUS" == "completed" || "$STATUS" == "failed" ]]; then
    break
  fi
  sleep 10
done

ml status "$JOB_ID"
```
## Implementation Notes

### Server-side Validation
- CLI performs minimal local checks (git status, file existence)
- All authoritative validation happens on worker
- Validation failures are propagated back to CLI with clear error messages
### Trust Contract Integration
- Every job submission includes commit ID and content integrity manifest
- Worker validates both before execution
- Any mismatch causes hard-fail with detailed error reporting
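The content-integrity check can be illustrated with plain `sha256sum`. This is a generic sketch of manifest verification under the assumption that the manifest records per-file hashes; FetchML's actual manifest format is not specified here, and the file names are made up:

```bash
#!/bin/bash
# Build a toy one-file manifest, then verify it, roughly as a worker might.
workdir=$(mktemp -d)
cd "$workdir"
echo "print('training')" > train.py

sha256sum train.py > manifest.sha256   # record the content hash
sha256sum -c manifest.sha256           # verify: prints "train.py: OK"
```

If the file changes after the manifest is recorded, `sha256sum -c` reports a mismatch and exits non-zero, which is the hard-fail behavior the contract describes.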
### Resource Management
- Resource requests are validated against available capacity
- Jobs are queued based on priority and resource availability
- Resource usage is tracked and reported in status
## Future Extensions
The v1 contract is intentionally minimal but designed for extension:
### Phase 0: Trust and usability (highest priority)
- Make `ml status` excellent: compact summary with queue counts, relevant tasks, prewarm state
- Add `ml explain`: dry-run preview command showing the resolved execution plan
- Tighten run manifest completeness: require timestamps, exit codes, dataset identities
- Dataset identity: structured `dataset_specs` with checksums (strict-by-default)
### Phase 1: Simple performance wins
- Keep prewarming single-level (next task only)
- Improve observability first (status output + metrics)
### Phase 2+: Research workflows

- `ml compare <runA> <runB>`: Manifest-driven diff of provenance
- `ml reproduce <run-id>`: Submit task from recorded manifest
- `ml export <run-id>`: Package provenance + artifacts
### Phase 3: Infrastructure (only if needed)
- Multi-level prewarming, predictive scheduling
- Optional S3-compatible storage (MinIO)
- Optional integrations (MLflow, W&B)
- Optional Kubernetes deployment
All extensions will maintain backward compatibility with the v1 contract.