# FetchML CLI/TUI UX Contract v1

This document defines the user experience contract for FetchML v1, focusing on clean, predictable CLI/TUI interactions without mode flags.

## Core Principles

1. **Thin CLI**: The local CLI performs minimal validation; authoritative checks happen server-side
2. **No Mode Flags**: Commands do what they say; there is no `--mode` or similar flag
3. **Predictable Defaults**: Sensible defaults that work for most use cases
4. **Graceful Degradation**: JSON output for automation, human-friendly output for interactive use
5. **Explicit Operations**: `--dry-run`, `--validate`, and `--explain` are explicit, never implied

## Commands v1

### Core Workflow Commands

#### `ml queue [options]`

Submit a job for execution.

**Basic Usage:**
```bash
ml queue my-experiment
```

**Options:**
- `--commit <id>`: Specify commit ID (default: current git HEAD)
- `--priority <1-10>`: Job priority (default: 5)
- `--cpu <cores>`: CPU cores requested (default: 2)
- `--memory <gb>`: Memory in GB (default: 8)
- `--gpu <count>`: GPU count (default: 0)
- `--gpu-memory <gb>`: GPU memory budget (default: auto)

**Dry Run:**
```bash
ml queue my-experiment --dry-run
# Output: JSON describing what would be submitted, plus validation results
```

**Validate Only:**
```bash
ml queue my-experiment --validate
# Output: Validation results without submitting
```

**Explain:**
```bash
ml queue my-experiment --explain
# Output: Human-readable explanation of what will happen
```

**JSON Output:**

When using `--json`, the response may include a `prewarm` field describing best-effort worker prewarming activity (e.g. dataset prefetch for the next queued task).

```bash
ml queue my-experiment --json
# Output: Structured JSON response
```

#### `ml status [job-name]`

Show job status.
**Basic Usage:**
```bash
ml status                # All jobs summary
ml status my-experiment  # Specific job details
```

**Options:**
- `--json`: JSON output
- `--watch`: Watch mode (refresh every 2s)
- `--limit <n>`: Limit number of jobs shown (default: 20)

#### `ml cancel <job-name>`

Cancel a running or queued job.

**Basic Usage:**
```bash
ml cancel my-experiment
```

**Options:**
- `--force`: Force cancel even if running
- `--json`: JSON output

### Experiment Management

#### `ml experiment init <name>`

Initialize a new experiment directory.

**Basic Usage:**
```bash
ml experiment init my-project
```

**Options:**
- `--template <name>`: Use an experiment template
- `--dry-run`: Show what would be created

#### `ml experiment list`

List available experiments.

**Options:**
- `--json`: JSON output
- `--limit <n>`: Limit results

#### `ml experiment show <name>`

Show experiment details.

**Options:**
- `--json`: JSON output
- `--manifest`: Show content integrity manifest

### Dataset Management

#### `ml dataset list`

List available datasets.

**Options:**
- `--json`: JSON output
- `--synced-only`: Show only synced datasets

#### `ml dataset sync <name>`

Sync a dataset from NAS to the ML server.

**Options:**
- `--dry-run`: Show what would be synced
- `--validate`: Validate dataset integrity after sync

### Monitoring & TUI

#### `ml monitor`

Launch the TUI for real-time monitoring (runs over SSH).

**Basic Usage:**
```bash
ml monitor
```

**TUI Controls:**
- `Ctrl+C`: Exit TUI
- `q`: Quit
- `r`: Refresh
- `j/k`: Navigate jobs
- `Enter`: Job details
- `c`: Cancel selected job

#### `ml watch <job-name>`

Watch a specific job's output.
**Options:**
- `--follow`: Follow log output (default)
- `--tail <n>`: Show the last n lines

## Global Options

These options work with any command:

- `--json`: Output structured JSON instead of human-readable format
- `--config <path>`: Use a custom config file (default: `~/.ml/config.toml`)
- `--verbose`: Verbose output
- `--quiet`: Minimal output
- `--help`: Show help for the command

## Defaults Configuration

### Default Job Resources

```toml
[defaults]
cpu = 2             # CPU cores
memory = 8          # GB
gpu = 0             # GPU count
gpu_memory = "auto" # Auto-detect or specify GB
priority = 5        # Job priority (1-10)
```

### Default Behavior

- **Commit ID**: Current git HEAD (requires a clean working directory)
- **Working Directory**: Current directory for experiment files
- **Output**: Human-readable format unless `--json` is specified
- **Validation**: Server-side authoritative validation

## Error Handling

### Exit Codes

- `0`: Success
- `1`: General error
- `2`: Invalid arguments
- `3`: Validation failed
- `4`: Network/connection error
- `5`: Server error

### Error Output Format

**Human-readable:**
```
Error: Experiment validation failed
- Missing dependency manifest (environment.yml, poetry.lock, pyproject.toml, or requirements.txt)
- Train script not found: train.py
```

**JSON:**
```json
{
  "error": "validation_failed",
  "message": "Experiment validation failed",
  "details": [
    {"field": "dependency_manifest", "error": "missing", "supported": ["environment.yml", "poetry.lock", "pyproject.toml", "requirements.txt"]},
    {"field": "train_script", "error": "not_found", "expected": "train.py"}
  ]
}
```

## Ctrl+C Semantics

### Command Cancellation

- **Ctrl+C during `ml queue --dry-run`**: Immediate exit, no side effects
- **Ctrl+C during `ml queue`**: Attempt to cancel the submission, show status
- **Ctrl+C during `ml status --watch`**: Exit watch mode
- **Ctrl+C during `ml monitor`**: Gracefully exit the TUI
- **Ctrl+C during `ml watch`**: Stop following logs, show final status

### Graceful Shutdown

1. Signal the interrupt to the server (if applicable)
2. Clean up local resources
3. Display current status
4. Exit with the appropriate code

## JSON Output Schema

### Job Submission Response

```json
{
  "job_id": "uuid-string",
  "job_name": "my-experiment",
  "status": "queued",
  "commit_id": "abc123...",
  "submitted_at": "2025-01-01T12:00:00Z",
  "estimated_start": "2025-01-01T12:05:00Z",
  "resources": {
    "cpu": 2,
    "memory_gb": 8,
    "gpu": 1,
    "gpu_memory_gb": 16
  }
}
```

### Status Response

```json
{
  "jobs": [
    {
      "job_id": "uuid-string",
      "job_name": "my-experiment",
      "status": "running",
      "progress": 0.75,
      "started_at": "2025-01-01T12:05:00Z",
      "estimated_completion": "2025-01-01T12:30:00Z",
      "node": "worker-01"
    }
  ],
  "total": 1,
  "showing": 1
}
```

## Examples

### Typical Workflow

```bash
# 1. Initialize experiment
ml experiment init my-project
cd my-project

# 2. Validate experiment locally
ml queue . --validate --dry-run

# 3. Submit job
ml queue . --priority 8 --gpu 1

# 4. Monitor progress
ml status .
ml watch .

# 5. Check results
ml status . --json
```

### Automation Script

```bash
#!/bin/bash
# Submit a job and wait for completion
JOB_ID=$(ml queue my-experiment --json | jq -r '.job_id')
echo "Submitted job: $JOB_ID"

# Poll until the job reaches a terminal state
while true; do
  STATUS=$(ml status "$JOB_ID" --json | jq -r '.jobs[0].status')
  echo "Status: $STATUS"
  if [[ "$STATUS" == "completed" || "$STATUS" == "failed" ]]; then
    break
  fi
  sleep 10
done

ml status "$JOB_ID"
```

## Implementation Notes

### Server-side Validation

- The CLI performs minimal local checks (git status, file existence)
- All authoritative validation happens on the worker
- Validation failures are propagated back to the CLI with clear error messages

### Trust Contract Integration

- Every job submission includes a commit ID and a content integrity manifest
- The worker validates both before execution
- Any mismatch causes a hard fail with detailed error reporting

### Resource Management

- Resource requests are validated against available capacity
- Jobs are queued based on priority and resource availability
- Resource usage is tracked and reported in status output

## Future Extensions

The v1 contract is intentionally minimal but designed for extension:

### Phase 0: Trust and usability (highest priority)

1. **Make `ml status` excellent** - Compact summary with queue counts, relevant tasks, prewarm state
2. **Add `ml explain`** - Dry-run preview command showing the resolved execution plan
3. **Tighten run manifest completeness** - Require timestamps, exit codes, dataset identities
4. **Dataset identity** - Structured `dataset_specs` with checksums (strict-by-default)

### Phase 1: Simple performance wins

- Keep prewarming single-level (next task only)
- Improve observability first (status output + metrics)

### Phase 2+: Research workflows

- `ml compare <a> <b>`: Manifest-driven diff of provenance
- `ml reproduce <manifest>`: Submit a task from a recorded manifest
- `ml export <job>`: Package provenance + artifacts

### Phase 3: Infrastructure (only if needed)

- Multi-level prewarming, predictive scheduling
- Optional S3-compatible storage (MinIO)
- Optional integrations (MLflow, W&B)
- Optional Kubernetes deployment

All extensions will maintain backward compatibility with the v1 contract.
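
As an illustration of the dual output formats above: the structured JSON error payload carries enough information to reconstruct the human-readable form, which is what makes `--json` safe for automation. A minimal Python sketch of that mapping (the `render_error` helper is hypothetical and not part of the CLI; the real formatting happens inside FetchML):

```python
import json

def render_error(payload: str) -> str:
    """Render the contract's JSON error payload in the human-readable
    form shown under "Error Output Format". Hypothetical helper; shown
    only to illustrate that both formats encode the same information."""
    data = json.loads(payload)
    lines = [f"Error: {data['message']}"]
    for detail in data.get("details", []):
        field = detail["field"].replace("_", " ")
        if detail["error"] == "missing":
            supported = ", ".join(detail.get("supported", []))
            lines.append(f"- Missing {field} ({supported})")
        elif detail["error"] == "not_found":
            lines.append(f"- {field.capitalize()} not found: {detail['expected']}")
        else:
            lines.append(f"- {field}: {detail['error']}")
    return "\n".join(lines)

# Example payload taken from the Error Handling section of this contract
payload = """{
  "error": "validation_failed",
  "message": "Experiment validation failed",
  "details": [
    {"field": "train_script", "error": "not_found", "expected": "train.py"}
  ]
}"""
print(render_error(payload))
# Error: Experiment validation failed
# - Train script not found: train.py
```

An automation script would typically skip the rendering entirely and branch on the machine-readable `error` and `details[].field` values instead.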