# FetchML CLI/TUI UX Contract v1

This document defines the user experience contract for FetchML v1, focusing on clean, predictable CLI/TUI interactions without mode flags.

## Core Principles

1. **Thin CLI**: The local CLI performs minimal validation; authoritative checks happen server-side
2. **No Mode Flags**: Commands do what they say; there is no `--mode` or similar flag
3. **Predictable Defaults**: Sensible defaults that work for most use cases
4. **Graceful Degradation**: JSON output for automation, human-friendly output for interactive use
5. **Explicit Operations**: `--dry-run`, `--validate`, and `--explain` are explicit, never implied

## Commands v1

### Core Workflow Commands

#### `ml queue [options]`

Submit a job for execution.

**Basic Usage:**
```bash
ml queue my-experiment
```

**Options:**
- `--commit <id>`: Specify commit ID (default: current git HEAD)
- `--priority <1-10>`: Job priority (default: 5)
- `--cpu <cores>`: CPU cores requested (default: 2)
- `--memory <gb>`: Memory in GB (default: 8)
- `--gpu <count>`: GPU count (default: 0)
- `--gpu-memory <gb>`: GPU memory budget (default: auto)

**Dry Run:**
```bash
ml queue my-experiment --dry-run
# Output: JSON describing what would be submitted, plus validation results
```

**Validate Only:**
```bash
ml queue my-experiment --validate
# Output: Validation results without submitting
```

**Explain:**
```bash
ml queue my-experiment --explain
# Output: Human-readable explanation of what will happen
```

**JSON Output:**

When using `--json`, the response may include a `prewarm` field describing best-effort worker prewarming activity (e.g. dataset prefetch for the next queued task).

```bash
ml queue my-experiment --json
# Output: Structured JSON response
```

#### `ml status [job-name]`

Show job status.
**Basic Usage:**
```bash
ml status                # All jobs summary
ml status my-experiment  # Specific job details
```

**Options:**
- `--json`: JSON output
- `--watch`: Watch mode (refresh every 2s)
- `--limit <n>`: Limit number of jobs shown (default: 20)

#### `ml cancel <job-name>`

Cancel a running or queued job.

**Basic Usage:**
```bash
ml cancel my-experiment
```

**Options:**
- `--force`: Force cancel even if running
- `--json`: JSON output

### Experiment Management

#### `ml experiment init <name>`

Initialize a new experiment directory.

**Basic Usage:**
```bash
ml experiment init my-project
```

**Options:**
- `--template <name>`: Use an experiment template
- `--dry-run`: Show what would be created

#### `ml experiment list`

List available experiments.

**Options:**
- `--json`: JSON output
- `--limit <n>`: Limit results

#### `ml experiment show <name>`

Show experiment details.

**Options:**
- `--json`: JSON output
- `--manifest`: Show content integrity manifest

### Dataset Management

#### `ml dataset list`

List available datasets.

**Options:**
- `--json`: JSON output
- `--synced-only`: Show only synced datasets

#### `ml dataset sync <name>`

Sync a dataset from NAS to the ML server.

**Options:**
- `--dry-run`: Show what would be synced
- `--validate`: Validate dataset integrity after sync

### Monitoring & TUI

#### `ml monitor`

Launch the TUI for real-time monitoring (runs over SSH).

**Basic Usage:**
```bash
ml monitor
```

**TUI Controls:**
- `Ctrl+C`: Exit TUI
- `q`: Quit
- `r`: Refresh
- `j/k`: Navigate jobs
- `Enter`: Job details
- `c`: Cancel selected job

#### `ml watch <job-name>`

Watch a specific job's output.
**Options:**
- `--follow`: Follow log output (default)
- `--tail <n>`: Show the last n lines

## Global Options

These options work with any command:

- `--json`: Output structured JSON instead of human-readable format
- `--config <path>`: Use a custom config file (default: `~/.ml/config.toml`)
- `--verbose`: Verbose output
- `--quiet`: Minimal output
- `--help`: Show help for the command

## Defaults Configuration

### Default Job Resources

```toml
[defaults]
cpu = 2             # CPU cores
memory = 8          # GB
gpu = 0             # GPU count
gpu_memory = "auto" # Auto-detect or specify GB
priority = 5        # Job priority (1-10)
```

### Default Behavior

- **Commit ID**: Current git HEAD (requires a clean working directory)
- **Working Directory**: Current directory for experiment files
- **Output**: Human-readable format unless `--json` is specified
- **Validation**: Server-side authoritative validation

## Error Handling

### Exit Codes

- `0`: Success
- `1`: General error
- `2`: Invalid arguments
- `3`: Validation failed
- `4`: Network/connection error
- `5`: Server error

### Error Output Format

**Human-readable:**
```
Error: Experiment validation failed
- Missing dependency manifest (environment.yml, poetry.lock, pyproject.toml, or requirements.txt)
- Train script not found: train.py
```

**JSON:**
```json
{
  "error": "validation_failed",
  "message": "Experiment validation failed",
  "details": [
    {"field": "dependency_manifest", "error": "missing", "supported": ["environment.yml", "poetry.lock", "pyproject.toml", "requirements.txt"]},
    {"field": "train_script", "error": "not_found", "expected": "train.py"}
  ]
}
```

## Ctrl+C Semantics

### Command Cancellation

- **Ctrl+C during `ml queue --dry-run`**: Immediate exit, no side effects
- **Ctrl+C during `ml queue`**: Attempt to cancel the submission, show status
- **Ctrl+C during `ml status --watch`**: Exit watch mode
- **Ctrl+C during `ml monitor`**: Gracefully exit the TUI
- **Ctrl+C during `ml watch`**: Stop following logs, show final status

### Graceful Shutdown

1. Signal the interrupt to the server (if applicable)
2. Clean up local resources
3. Display current status
4. Exit with the appropriate code

## JSON Output Schema

### Job Submission Response

```json
{
  "job_id": "uuid-string",
  "job_name": "my-experiment",
  "status": "queued",
  "commit_id": "abc123...",
  "submitted_at": "2025-01-01T12:00:00Z",
  "estimated_start": "2025-01-01T12:05:00Z",
  "resources": {
    "cpu": 2,
    "memory_gb": 8,
    "gpu": 1,
    "gpu_memory_gb": 16
  }
}
```

### Status Response

```json
{
  "jobs": [
    {
      "job_id": "uuid-string",
      "job_name": "my-experiment",
      "status": "running",
      "progress": 0.75,
      "started_at": "2025-01-01T12:05:00Z",
      "estimated_completion": "2025-01-01T12:30:00Z",
      "node": "worker-01"
    }
  ],
  "total": 1,
  "showing": 1
}
```

## Examples

### Typical Workflow

```bash
# 1. Initialize experiment
ml experiment init my-project
cd my-project

# 2. Validate experiment locally
ml queue . --validate --dry-run

# 3. Submit job
ml queue . --priority 8 --gpu 1

# 4. Monitor progress
ml status .
ml watch .

# 5. Check results
ml status . --json
```

### Automation Script

```bash
#!/bin/bash
# Submit a job and wait for completion
JOB_ID=$(ml queue my-experiment --json | jq -r '.job_id')
echo "Submitted job: $JOB_ID"

# Poll until the job reaches a terminal state
while true; do
  STATUS=$(ml status "$JOB_ID" --json | jq -r '.jobs[0].status')
  echo "Status: $STATUS"
  if [[ "$STATUS" == "completed" || "$STATUS" == "failed" ]]; then
    break
  fi
  sleep 10
done

ml status "$JOB_ID"
```

## Implementation Notes

### Server-side Validation

- The CLI performs minimal local checks (git status, file existence)
- All authoritative validation happens on the worker
- Validation failures are propagated back to the CLI with clear error messages

### Trust Contract Integration

- Every job submission includes a commit ID and a content integrity manifest
- The worker validates both before execution
- Any mismatch causes a hard fail with detailed error reporting

### Resource Management

- Resource requests are validated against available capacity
- Jobs are queued based on priority and resource availability
- Resource usage is tracked and reported in status output

## Future Extensions

The v1 contract is intentionally minimal but designed for extension:

### Phase 0: Trust and usability (highest priority)

1. **Make `ml status` excellent** - Compact summary with queue counts, relevant tasks, prewarm state
2. **Add `ml explain`** - Dry-run preview command showing the resolved execution plan
3. **Tighten run manifest completeness** - Require timestamps, exit codes, dataset identities
4. **Dataset identity** - Structured `dataset_specs` with checksums (strict-by-default)

### Phase 1: Simple performance wins

- Keep prewarming single-level (next task only)
- Improve observability first (status output + metrics)

### Phase 2+: Research workflows

- `ml compare <a> <b>`: Manifest-driven diff of provenance
- `ml reproduce <manifest>`: Submit a task from a recorded manifest
- `ml export <job>`: Package provenance + artifacts

### Phase 3: Infrastructure (only if needed)

- Multi-level prewarming, predictive scheduling
- Optional S3-compatible storage (MinIO)
- Optional integrations (MLflow, W&B)
- Optional Kubernetes deployment

All extensions will maintain backward compatibility with the v1 contract.
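
As an illustration of the dual output formats above: the structured JSON error payload carries enough information to reconstruct the human-readable form, which is what makes `--json` safe for automation. A minimal Python sketch of that mapping (the `render_error` helper is hypothetical and not part of the CLI; the real formatting happens inside FetchML):

```python
import json

def render_error(payload: str) -> str:
    """Render the contract's JSON error payload in the human-readable
    form shown under "Error Output Format". Hypothetical helper; shown
    only to illustrate that both formats encode the same information."""
    data = json.loads(payload)
    lines = [f"Error: {data['message']}"]
    for detail in data.get("details", []):
        field = detail["field"].replace("_", " ")
        if detail["error"] == "missing":
            supported = ", ".join(detail.get("supported", []))
            lines.append(f"- Missing {field} ({supported})")
        elif detail["error"] == "not_found":
            lines.append(f"- {field.capitalize()} not found: {detail['expected']}")
        else:
            lines.append(f"- {field}: {detail['error']}")
    return "\n".join(lines)

# Example payload taken from the Error Handling section of this contract
payload = """{
  "error": "validation_failed",
  "message": "Experiment validation failed",
  "details": [
    {"field": "train_script", "error": "not_found", "expected": "train.py"}
  ]
}"""
print(render_error(payload))
# Error: Experiment validation failed
# - Train script not found: train.py
```

An automation script would typically skip the rendering entirely and branch on the machine-readable `error` and `details[].field` values instead.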