# Research-First Runner: Missing Themes Plan

This file captures additional themes that are commonly missing in existing ML runners/experiment tools, translated into actionable design targets for a lightweight, research-first runner.

## Quick Overview

**What makes this different:**
- **Your server, not their cloud**: Everything runs on your homelab/workstation/uni server
- **Dual interfaces**: Zig CLI for scripting + SSH-accessible TUI for interactive work
- **Fair queueing**: `ml queue` (not `run`) makes resource sharing explicit
- **Research narrative**: Capture why you ran experiments, not just what ran
- **Zero SaaS**: No accounts, web dashboards, or external services
- **Plain text everything**: Human-readable manifests, long-term reproducibility

**Perfect for:** Researchers in uni labs, homelab enthusiasts, and small research groups who want control over their infrastructure without cloud vendor lock-in.

## Architecture Context

**Server-Centric Model for Homelab/Workstation/Uni Lab:**
- **Two client interfaces**:
  - **Zig CLI**: Thin WebSocket client for scripting, automation, remote access
  - **SSH-accessible TUI**: Interactive Bubble Tea UI for monitoring when SSH'd into server
- Go API server with embedded rsync (reduces dependencies)
- Worker pulls from flexible queue backend (Redis/SQLite/filesystem)
- Priority-based scheduling with prewarm mechanism
- NAS integration for data prefetching
- Target: single server, workstation, or small uni lab cluster (not cloud/SaaS)

**Client Access Patterns:**
```bash
# CLI (from anywhere via WebSocket)
ml queue train.py --epochs 100
ml status --watch
ml info <path|id>

# TUI (when SSH'd into server or jump box)
ssh mluser@worker.local
ml-tui  # Interactive terminal UI
# Navigate with keyboard, see live updates
```

**Configuration:**
```toml
# ~/.ml/config.toml (shared by both CLI and TUI)
worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"
```

## Plan (Missing Themes)

## Implemented Today (in this repo)

- Runs are queued via `ml queue` and processed by workers.
- Run provenance is written to `run_manifest.json`.
- You can attach queue-time notes with `ml queue --note "..."` (persisted under `run_manifest.json` → `metadata.note`).
- Queue backends support Redis / SQLite / filesystem (and optional filesystem fallback).
- CLI + SSH-launched TUI are both available (`ml monitor` launches the TUI).

## Future Ideas (this document)

### 1. Own-infrastructure-first, research-centric by default

### 2. Minimal server dependencies (simple operations)

### 3. Text-first tracking (logs > dashboards)

- **Research narrative completion**: post-run outcome/learnings/next steps captured in the manifest
- **Auto-captured context**:
  - Command + args (as sent from CLI)
  - Timestamps (queue time, start time, end time)
  - Git commit hash (and optionally diff)
  - Environment snapshot (pip freeze, conda export, container image digest)
  - Hardware context (GPU model, driver version, CUDA version)
- **Plain text manifests**: JSON or YAML, never binary blobs
- **Stable formats**: Can read experiments from 5 years ago without the runner

**Implementation note**: Server writes `run_manifest.json` to experiment directory. CLI can display it via `ml info`.

### 4. CLI and TUI as complementary interfaces

- **Consistent CLI scripting UX**: Future idea (uniform `--json`, quiet modes, and stable exit codes across commands)
- **TUI feature parity**: Future idea (surface the same key details in TUI + CLI: queue position/ETA, narrative, validation results)

### 5. Failure-tolerant, messy-research friendly

- **Failure is first-class**: Failed runs stay visible and queryable
- **Partial artifacts preserved**: Keep artifacts/logs up to failure point (including checkpoints, if the script produces them)
- **No punishment for refactors**: Script renames don't break history
- **Grouping/tagging**: Label attempts (baseline/ablation/debug/exploration)

**Server implementation**: Worker should catch exceptions, record failure reason, preserve state. Queue should track failure modes (OOM, timeout, code error, data error).
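
One way the worker might map a finished process onto the failure modes listed above is sketched here. The function name, the mode strings, and the stderr heuristics are assumptions for illustration, not the runner's actual behavior; in particular, a SIGKILL is only a hint of an OOM kill, not proof.

```python
from typing import Optional

SIGKILL = 9  # signal number the Linux OOM killer uses

def classify_failure(returncode: int, stderr_tail: str = "", timed_out: bool = False) -> Optional[str]:
    """Return a coarse failure mode, or None if the run succeeded (hypothetical helper)."""
    if timed_out:
        return "timeout"
    if returncode == 0:
        return None
    # Python's subprocess reports death-by-signal as -N; shells report 128 + N (e.g. 137).
    killed = returncode in (-SIGKILL, 128 + SIGKILL)
    text = stderr_tail.lower()
    if killed or "out of memory" in text or "memoryerror" in text:
        return "oom"
    if "filenotfounderror" in text or "no such file or directory" in text:
        return "data_error"
    return "code_error"
```

Keeping the classifier as a pure function of exit code and log tail means it can be re-run over old run folders if the heuristics change.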

### 6. Minimal abstraction over Python (transparent execution)

- **Run scripts as-is**: No decorators, no framework rewrites
- **Preserve debuggability**: Clean stack traces, pdb works
- **Optional instrumentation**: Explicit metric logging via simple API
  ```python
  # Optional, not required
  from ml_runner import log_metric
  log_metric("loss", 0.5, step=100)
  ```
- **Standard I/O works**: `print()` goes to logs, arguments via `sys.argv`

**Server implementation**: Worker spawns process, captures stdout/stderr, parses optional structured logs. No magic wrappers that hide what's happening.
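
A minimal sketch of that idea follows: spawn the script unchanged, keep its output verbatim, and pick out optional structured lines. The `ML_METRIC ` line prefix and the function name are assumptions made for this example; the source does not specify how `log_metric` emits its data.

```python
import json
import subprocess

METRIC_PREFIX = "ML_METRIC "  # hypothetical marker for structured metric lines

def run_and_capture(argv):
    """Run a command as-is and return (returncode, stdout, stderr, metrics)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    metrics = []
    for line in proc.stdout.splitlines():
        if line.startswith(METRIC_PREFIX):
            try:
                metrics.append(json.loads(line[len(METRIC_PREFIX):]))
            except json.JSONDecodeError:
                pass  # malformed metric lines stay in the log but are not parsed
    return proc.returncode, proc.stdout, proc.stderr, metrics
```

Because stdout is stored untouched, a script that never logs metrics behaves exactly as it would when run directly; the metric list is simply empty.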

### 7. Reproducibility that survives time

- **Immutable run folders**: Server never modifies completed runs
- **Environment capture** (best-effort, pluggable):
  - Container image digest (primary method)
  - `pip freeze` / `uv pip freeze` / `poetry.lock`
  - `conda env export`
  - `nix flake.lock` (if available)
- **Hardware fingerprint**: GPU model, driver, CUDA, CPU, RAM
- **Data provenance**: Dataset checksums, NAS paths, version identifiers
- **Commit everything**: Store full environment, even if verbose

**Server implementation**: Pre-run hook captures environment. Store in `run_manifest.json`. Validate on `ml validate <run-id>`.
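
The "best-effort, pluggable" capture could look roughly like the sketch below: each probe is a command line, and probes whose tool is missing or fails are skipped rather than aborting the run. The probe table and key names are illustrative assumptions, not the runner's real manifest schema.

```python
import shutil
import subprocess
import sys

# Hypothetical probe table: manifest key -> command line to run.
PROBES = {
    "python_version": [sys.executable, "--version"],
    "pip_freeze": [sys.executable, "-m", "pip", "freeze"],
    "conda_env": ["conda", "env", "export"],
    "git_commit": ["git", "rev-parse", "HEAD"],
}

def capture_environment(probes=PROBES, timeout=30):
    """Run each probe; record stdout for those that succeed, silently skip the rest."""
    env = {}
    for key, argv in probes.items():
        if shutil.which(argv[0]) is None:
            continue  # tool not installed on this machine
        try:
            result = subprocess.run(argv, capture_output=True, text=True,
                                    timeout=timeout, check=True)
        except (subprocess.SubprocessError, OSError):
            continue  # non-zero exit, timeout, or launch failure
        env[key] = result.stdout.strip()
    return env
```

Adding a new capture method (for example a lock-file reader) then means adding one entry to the table rather than changing the worker.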

### 8. Small compute and shared machine friendliness

- **Energy awareness**: Respect that homelabs pay electricity bills
- **Laptop-friendly**: Support thermal/power throttling
- **Single-GPU to 4-GPU range**: Optimize for typical research setups
- **No cluster assumptions**: Don't require Kubernetes/SLURM/etc.

### 9. Server-side storage with client-side visibility

**Why this matters**: Researchers want to `ls` experiment directories but don't want to manually sync. Server handles storage, CLI provides views.

### 11. Research narrative (lab notebook, not job IDs)

- **Queue-time narrative capture**: Future idea (add `--hypothesis`, `--context`, `--intent`, etc. to `ml queue`)
- **Post-run learning capture**: Future idea (explicit `outcome`, `learnings[]`, `next_steps[]`, and validation status)
- **Narrative UX**: Future idea (view/edit narrative from TUI/CLI without hand-editing JSON)

**CLI commands**:
```bash
ml queue train.py --note "Testing warmup hypothesis from paper X"
```

- **Live updates**:
  - CLI: WebSocket streaming for `--watch` and `--follow`
  - TUI: Live refresh (500ms tick), immediate queue updates
- **No magic**: Minimize implicit behavior
  - Explicit is better than clever
  - Defaults should be obvious and documented
  - Side effects should be visible (both in CLI and TUI)
  - Configuration hierarchy clear: CLI flags > env > config file > defaults
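
The precedence order above (CLI flags > env > config file > defaults) can be sketched with a `ChainMap`. The `ML_` environment prefix, the default values, and the function name are assumptions for illustration only.

```python
from collections import ChainMap

DEFAULTS = {"worker_host": "localhost", "worker_port": 22}  # illustrative defaults

def resolve_config(cli_flags, env, config_file, defaults=DEFAULTS, env_prefix="ML_"):
    """Merge settings so that CLI flags beat env vars, which beat the file, which beats defaults."""
    # ML_WORKER_PORT -> worker_port (the prefix is a hypothetical convention)
    env_values = {k[len(env_prefix):].lower(): v
                  for k, v in env.items() if k.startswith(env_prefix)}
    # Flags the user did not pass arrive as None and must not shadow lower layers.
    cli_values = {k: v for k, v in cli_flags.items() if v is not None}
    return dict(ChainMap(cli_values, env_values, config_file, defaults))
```

Note that values from the environment are strings, so a real implementation would still need to coerce types per key.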

**TUI advantages for observability:**
- See everything at once: jobs, queue, GPUs, containers, logs
- Keyboard shortcuts for common operations
- Instant feedback on actions (queue, cancel, delete)
- Prewarm state visible in GPU panel
- No need to run multiple `ml status` commands

### 13. Support clear thinking during experimentation

- **Optimize for cognitive throughput**:
  - Make it easy to remember what you were thinking
  - Surface patterns across experiments
  - Warn about near-duplicates before running
- **Built-in comparison**:
  ```bash
  # Future ideas:
  # ml diff <run-a> <run-b>
  # ml similar <run-id>
  ```
- **Learning from history**:
  ```bash
  # Future ideas:
  # ml lessons --tag ablation
  # ml dead-ends
  ```
- **Hypothesis tracking**:
  - Link hypothesis → experiment → outcome → next hypothesis
  - Mark outcomes: validates/refutes/inconclusive
- **Reduce cognitive load**:
  - Natural queries: Future idea (search over manifests/notes)
  - Show relevant history when queueing
  - Don't make researchers remember IDs

**Server implementation**: Maintain index (rebuildable from filesystem). Support semantic queries over manifests, notes, tags.
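
"Rebuildable from filesystem" could mean something like the sketch below: the index is derived purely by scanning for `run_manifest.json` files, so deleting it loses nothing. The index fields and function names are assumptions; only `run_manifest.json`, `metadata.note`, and `narrative.tags` come from this document.

```python
import json
from pathlib import Path

def build_index(experiments_root):
    """Scan for run_manifest.json files and return a list of small index entries."""
    index = []
    for manifest_path in sorted(Path(experiments_root).rglob("run_manifest.json")):
        try:
            manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable manifests are skipped, not fatal
        if not isinstance(manifest, dict):
            continue
        narrative = manifest.get("narrative") or {}
        metadata = manifest.get("metadata") or {}
        index.append({
            "run_id": manifest.get("run_id", manifest_path.parent.name),
            "status": manifest.get("status"),
            "tags": narrative.get("tags") or [],
            "note": metadata.get("note") or "",
            "path": str(manifest_path.parent),
        })
    return index

def find_by_note(index, needle):
    """Case-insensitive substring search over queue-time notes."""
    needle = needle.lower()
    return [entry for entry in index if needle in entry["note"].lower()]
```

This covers plain substring search only; the semantic queries mentioned above would need something more than this sketch.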

### 14. Fast iteration velocity

- **Easy modification**:
  ```bash
  # Future ideas:
  # ml clone <run-id>
  # ml fork <run-id>
  ```
- **Batch operations**:
  ```bash
  # Future idea: ml sweep
  ```

**Why prewarm matters**: Prefetching data from your NAS during prewarm lets a job start training as soon as a GPU frees up instead of first waiting on data transfer, which shortens the edit-queue-observe loop.

### 15. Full research lifecycle support

- **Exploration phase**: Minimal metadata, quick runs
- **Development phase**: Group attempts, compare variations
- **Validation phase**: Strict reproducibility, complete capture
- **Publication phase**: Export bundles, generate reproduction instructions
- **Maintenance phase**: Long-term readable, re-executable years later

**Reproducibility levels** (your strict/best-effort model):
```bash
# Future idea: --repro-level
ml validate <commit_id>                     # Future idea: expand validation coverage + outputs
```

### 16. Collaboration without platforms

- **Async collaboration** (no shared server required):
  ```bash
  # Future ideas:
  # ml export <run-id> --bundle run_42.tar.gz
  # ml import run_42.tar.gz
  ```
- **Selective sharing**:
  ```bash
  # Future ideas:
  # ml export <run-id> --metadata-only
  # ml export <run-id> --include-artifacts
  ```
- **Review-friendly**:
  - Self-contained bundles
  - All provenance included
  - Reproducibility instructions
  - No "install our platform" friction

**Server implementation**: Export packages `run_manifest.json` + artifacts into tarball. Import validates and unpacks into experiments directory.
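
A rough sketch of the export side, using only the standard library, is shown below. The function names and the `include_artifacts` parameter mirror the proposed flags but are assumptions; none of this is an existing command.

```python
import tarfile
from pathlib import Path

def export_run(run_dir, bundle_path, include_artifacts=False):
    """Pack a run into a .tar.gz; metadata-only by default (hypothetical helper)."""
    run_dir = Path(run_dir)
    with tarfile.open(bundle_path, "w:gz") as tar:
        if include_artifacts:
            # Whole run folder; the bundle itself should live outside run_dir.
            tar.add(run_dir, arcname=run_dir.name)
        else:
            tar.add(run_dir / "run_manifest.json",
                    arcname=f"{run_dir.name}/run_manifest.json")
    return Path(bundle_path)

def list_bundle(bundle_path):
    """Return the sorted file names inside a bundle, e.g. for a pre-import check."""
    with tarfile.open(bundle_path, "r:gz") as tar:
        return sorted(member.name for member in tar.getmembers() if member.isfile())
```

An import step would additionally need to validate member paths before unpacking, since tarballs from other people are untrusted input.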

### 17. Graceful degradation

- **Core works with minimal setup**:
  - Filesystem-only queue (no Redis required)
  - SQLite for metadata (no Postgres)
  - Local execution (no remote targets needed)
- **Optional enhancements**:
  - Redis for better multi-worker queueing
  - Git integration (works without git)
  - NAS prewarm (falls back to on-demand fetch)
  - WebSocket updates (falls back to polling)
- **Progressive disclosure**:
  - Simple commands for simple cases
  - Advanced flags for power users
  - Features activate when available

**Implementation note**:

### 18. Concrete features (derived from above)

#### Findability
```bash
# Future ideas:
# ml find "failed runs on GPU2 last week"
# ml find --note "warmup"
```
Server maintains rebuildable index over manifests, logs, tags.

#### Dataset provenance
```json
{
  "datasets": [
    {
      "name": "imagenet-train",
      "nas_path": "/nas/datasets/imagenet/train",
      "checksum": "sha256:abc123...",
      "fetched_at": "2024-01-15T10:30:00Z",
      "fetch_method": "prewarm"
    }
  ]
}
```
Server validates checksums, warns on drift.
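
A drift check along these lines might look like the following sketch. It assumes the recorded checksum refers to a single file (for example an archive or a file listing); hashing a whole directory tree would need an agreed ordering, which the source does not define. Function names are hypothetical.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return it in the manifest's 'sha256:<hex>' form."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()

def check_drift(dataset_entry, path):
    """Compare the recorded checksum with the file's current one; warn instead of failing."""
    actual = sha256_of_file(path)
    expected = dataset_entry.get("checksum")
    return {
        "name": dataset_entry.get("name"),
        "expected": expected,
        "actual": actual,
        "drifted": expected != actual,
    }
```

Returning a report rather than raising keeps the check in line with "warns on drift": the caller decides whether a mismatch blocks the run.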

#### Prewarm observability
```bash
ml status
# Shows:
#   Next in queue: run_xyz (priority 5)
#   Prewarming: dataset imagenet-train (2/5 complete)
#   GPU 0: running run_abc (50% complete, ETA 2h)
#   GPU 1: idle
```

#### CLI queue/requeue workflows

**Core principle**: the runner does not introduce checkpoint conventions. The script should run identically when executed directly vs via `ml`.

**Passive artifact tracking** (future idea): worker records what files exist in the run directory after completion (or via configured glob patterns). Checkpoints are just artifacts.
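
As an illustration of passive tracking, the sketch below walks a run directory after completion and records what it finds, without requiring anything from the training script. The function name, the default glob, and the choice to skip `run_manifest.json` are assumptions; the output shape loosely follows the `artifacts` block in the example manifest.

```python
from datetime import datetime, timezone
from pathlib import Path

def _iso(timestamp):
    """Format a POSIX timestamp as UTC ISO 8601 with a trailing Z."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def discover_artifacts(run_dir, patterns=("**/*",), exclude=("run_manifest.json",)):
    """List files under run_dir matching any glob pattern, with size and mtime."""
    run_dir = Path(run_dir)
    found = {}
    for pattern in patterns:
        for path in run_dir.glob(pattern):
            rel = path.relative_to(run_dir).as_posix()
            if path.is_file() and rel not in exclude:
                stat = path.stat()
                found[rel] = {"path": rel, "size_bytes": stat.st_size,
                              "modified": _iso(stat.st_mtime)}
    files = [found[key] for key in sorted(found)]
    return {
        "discovery_time": _iso(datetime.now(timezone.utc).timestamp()),
        "files": files,
        "total_size_bytes": sum(f["size_bytes"] for f in files),
    }
```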

**Requeue = replay command with modifications** (future idea):
```bash
# Original run
ml queue train.py --epochs 100 --save-dir ./checkpoints

# Requeue (continue)
ml requeue run_abc -- --resume ./checkpoints/best.pt --epochs 200
```

**Arg merge strategies** (future idea):
```bash
# Append new args (default)
ml requeue run_abc --append -- --resume ./checkpoints/best.pt

# Replace (rerun with only new args)
ml requeue run_abc --replace -- --epochs 200 --lr 3e-4

# Merge (override matching flags, keep the rest)
ml requeue run_abc --merge -- --epochs 200
```
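
The three strategies could be implemented roughly as below. This is a sketch under simplifying assumptions: flags start with `--`, a flag takes at most one value, and `--flag=value` forms are not handled. A boolean flag followed by a positional argument would be misread as a flag/value pair, so a real implementation would need to know each flag's arity.

```python
def _split(args):
    """Group tokens into (flag, values) pairs; positional tokens get flag None."""
    groups = []
    for token in args:
        if token.startswith("--"):
            groups.append((token, []))
        elif groups and groups[-1][0] is not None and not groups[-1][1]:
            groups[-1][1].append(token)  # first token after a flag is its value
        else:
            groups.append((None, [token]))
    return groups

def merge_args(original, new, strategy="append"):
    """Combine the original run's args with requeue args (hypothetical helper)."""
    if strategy == "append":
        return list(original) + list(new)
    if strategy == "replace":
        return list(new)
    if strategy != "merge":
        raise ValueError(f"unknown strategy: {strategy}")
    overrides = {flag: values for flag, values in _split(new) if flag is not None}
    merged, emitted = [], set()
    for flag, values in _split(original):
        if flag in overrides:
            if flag not in emitted:  # override in place, keep original position
                merged += [flag, *overrides[flag]]
                emitted.add(flag)
        else:
            merged += ([flag] if flag else []) + values
    for flag, values in _split(new):
        if flag is None:
            merged += values
        elif flag not in emitted:  # flags that did not exist in the original
            merged += [flag, *overrides[flag]]
            emitted.add(flag)
    return merged
```

Keeping the merged command as a plain argument list means the manifest can record exactly what was executed, with no hidden rewriting.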

**Optional staging** (future idea): copy an artifact from the source run into the new run directory, then reference it with a placeholder.
```bash
ml requeue run_abc --stage checkpoints/best.pt -- \
  --resume {staged}/best.pt --epochs 200
```

#### Hardware/resource management
```json
{
  "resources": {
    "gpus": 2,
    "gpu_memory_gb": 40,
    "cpu_cores": 16,
    "ram_gb": 64,
    "disk_gb": 100,
    "max_runtime_hours": 24
  }
}
```
Worker validates resources before pulling from queue. Server tracks utilization.
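
A pre-pull check of this kind might be as simple as the sketch below. The function name and the split between capacity keys and limits such as `max_runtime_hours` are assumptions made for the example.

```python
# Keys treated as capacities the worker must have available (assumed split);
# max_runtime_hours is a limit on the job, not a capacity, so it is not checked here.
CAPACITY_KEYS = ("gpus", "gpu_memory_gb", "cpu_cores", "ram_gb", "disk_gb")

def missing_resources(requested, available):
    """Return {key: (requested, available)} for each capacity the worker cannot satisfy."""
    shortfalls = {}
    for key in CAPACITY_KEYS:
        need = requested.get(key, 0)
        have = available.get(key, 0)
        if need > have:
            shortfalls[key] = (need, have)
    return shortfalls
```

An empty result means the worker may pull the job; a non-empty one gives a human-readable reason that could be shown next to the queued run.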

---

## Design Philosophy Summary (Server-Centric)

The goal is to build a **research assistant that runs on YOUR server**, not a platform that runs on someone else's cloud.

### Every feature should answer:

1. Does this help researchers **understand** what happened on the server?
2. Does this make the server **transparent** instead of a black box?
3. Does this work on a **single workstation** or small lab server?
4. Does this respect that researchers **SSH into the server**?
5. Does this make **local data** (NAS, scratch drives) first-class?

### Architecture principles:

- **Server is the control plane**: All logic, storage, scheduling on server
- **CLI is a thin client**: Just communicates via WebSocket, no local state
- **Filesystem is still king**: Server writes plain text, CLI reads via API
- **Queue-first for fairness**: `ml queue` not `ml run` - explicit resource requests
- **Priority without hogging**: Higher priority = earlier in queue, not exclusive access
- **Prewarm is a performance optimization**: Best-effort, never required for correctness
- **NAS integration is native**: Server understands mounted storage
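
"Higher priority = earlier in queue, not exclusive access" can be illustrated with a small heap-based queue. The class name is hypothetical, and the sketch assumes larger numbers mean higher priority with first-come-first-served ordering among equal priorities; the source does not state the tie-breaking rule.

```python
import heapq
import itertools

class PriorityFifoQueue:
    """Priority queue where ties are served in arrival order (illustrative sketch)."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # monotonically increasing tie-breaker

    def push(self, run_id, priority=0):
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), run_id))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def position(self, run_id):
        """1-based queue position, or None if the run is not queued."""
        for rank, (_, _, queued_id) in enumerate(sorted(self._heap), start=1):
            if queued_id == run_id:
                return rank
        return None
```

Priority only reorders the waiting line; it never preempts a running job in this sketch, which matches the "without hogging" principle.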

### When in doubt:

- **Server-side is better** than client-side (for logic)
- **WebSocket is better** than REST (for interactivity)
- **Embedded is better** than external deps (rsync in server)
- **Flexible backend is better** than required service (Redis OR SQLite OR filesystem)
- **Plain text is better** than binary
- **Your hardware is better** than their cloud

The runner should feel like **SSH into your well-organized research server with powerful tools**, not like operating a cloud platform. Whether you're using the CLI for automation or the TUI for interactive work, the experience should be transparent, fair, and research-focused.

---

## Typical Research Workflows (CLI + TUI)

### Morning Routine: Check What Happened Overnight
```bash
# From your laptop (via WebSocket)
ml status
# Shows: 2 finished, 1 running, 3 in queue

ml info run_abc --show-metrics
# Quick check: did the overnight run validate the hypothesis?

# If you need deep investigation, SSH in
ssh mluser@worker.local
ml-tui
# Visual inspection of logs, GPU usage, etc.
```

### Starting a New Experiment Series
```bash
# Script a parameter sweep (CLI automation)
for lr in 1e-3 3e-4 1e-4; do
  # Future idea: --hypothesis / --experiment-group
  ml queue train.py --lr "$lr" \
    --priority 5
done

# Monitor in TUI (interactive)
ssh mluser@worker.local
ml-tui
# Watch queue, see ETA, check prewarm status
```

### Debugging a Failed Run
```bash
# Notice failure via CLI
ml status
# run_xyz: failed (exit code 137) - OOM?

# Jump into TUI for investigation
ssh mluser@worker.local
ml-tui
# Navigate to run_xyz, press 'l' for logs
# See OOM error at batch 128
# Future idea: narrative/annotation UX in the TUI
```

### End-of-Day Review
```bash
# TUI for visual summary
ssh mluser@worker.local
ml-tui
# Scroll through today's runs
# Future ideas: compare views, export bundles
```

### Paper Writing Time (6 months later)
```bash
# Today: use the filesystem + run manifests
ml info <path|id>

# Future ideas: searching/filtering + comparison reports

# TUI for visual exploration
ssh mluser@worker.local
ml-tui
# Navigate through old experiments
# Press 'n' to read narratives
# Reconstruct your thought process
```

### Collaborative Debugging with Advisor
```bash
# Both SSH into server simultaneously
ssh mluser@worker.local

# You run TUI to show current state
ml-tui
# Navigate to problem run, show logs live

# Advisor suggests fix
# You queue new run with their suggestion
# Future idea: --parent-run
ml queue train.py --lr 1e-4 \
  --note "Per advisor: try smaller LR with warmup" \
  --priority 7

# Watch it start in TUI immediately
# Queue position visible, prewarm status shown
```

This dual-interface approach gives researchers the best of both worlds: **scriptability when they need it, visibility when they want it**.

---

## How This Maps to Your Current Architecture

✅ **Already correct**:
- Server-centric with dual client interfaces (CLI + TUI)
- WebSocket communication (CLI)
- SSH-based TUI with Bubble Tea (interactive monitoring)
- Embedded rsync in server
- Flexible queue backend (Redis/SQLite/filesystem)
- Priority scheduling
- Prewarm mechanism for NAS prefetch
- **Fair queueing philosophy** - `queue` not `run`
- TUI shows live updates: jobs, queue, GPU status, logs

🎯 **Natural extensions**:
- Queue-time narrative flags for `ml queue` (hypothesis/context/intent/etc.)
- CLI commands for diffing and finding (and higher-level comparison workflows)
- TUI panels for hypothesis/learnings (in job details)
- Reproducibility validation improvements (extend `ml validate`)
- Export/import for collaboration
- Graceful degradation (filesystem-only mode)
- Visible queue position and fairness metrics

📝 **Design considerations**:
- Show prewarm state/progress in `ml status`
- Show queue position and ETA in both CLI and TUI
- Add research context fields to manifests
- Build comparison workflows (diff, similar, why-different)
- Support hypothesis tracking in both interfaces
- Create export bundles for sharing
- Expose fairness metrics (wait time distribution, resource utilization)
- TUI could show narrative snippets in job list (hypothesis as subtitle?)

**TUI Research Narrative Integration Ideas:**
```
┌─ ML Jobs & Queue ─────────────────────────────────────┐
│ > imagenet_baseline                                   │
│   ✓ finished | Priority: 5                            │
│   "Testing baseline performance before ablations"     │
│                                                        │
│   batch_size_64                                       │
│   ▶ running (epoch 45/100) | Priority: 5             │
│   "Validating linear LR scaling hypothesis"           │
│                                                        │
│   warmup_test                                         │
│   ⏳ queued (position 2) | Priority: 3               │
│   "Following up on advisor suggestion about warmup"   │
└───────────────────────────────────────────────────────┘

Press 'n' to view narrative, 'a' to annotate
```

**Implementation status (today):**
- **Annotations are implemented** and stored at the **root** of `run_manifest.json` as `annotations[]`.
- **Narrative fields are implemented** and stored under `run_manifest.json` as `narrative` (set/update via CLI).
- Use `ml annotate <path|run_id|task_id> --note "..." [--author "..."]` to append an entry.
- Remaining gaps are around **queue-time capture**, **post-run learnings/outcomes**, and **TUI-first narrative UX**.
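
For readers curious what appending to `annotations[]` involves, here is a minimal sketch. It is not the actual `ml annotate` implementation; the function name is hypothetical, and the entry fields follow the example manifest in this document.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def append_annotation(manifest_path, note, author=None, now=None):
    """Append one entry to the root-level annotations[] list of a manifest."""
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    moment = now or datetime.now(timezone.utc)
    entry = {"timestamp": moment.strftime("%Y-%m-%dT%H:%M:%SZ")}
    if author:
        entry["author"] = author
    entry["note"] = note
    manifest.setdefault("annotations", []).append(entry)
    # Write to a temp file in the same directory, then rename over the original,
    # so a crash never leaves a half-written manifest behind.
    fd, tmp_name = tempfile.mkstemp(dir=manifest_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2)
    os.replace(tmp_name, manifest_path)
    return entry
```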

Example `run_manifest.json` (the `//` comments are for illustration only; the file on disk is plain JSON):
```json
{
  // === Standard Execution Metadata ===
  "run_id": "2024-01-15_abc123",
  "status": "completed",
  "command": "train.py --lr 0.001 --epochs 100 --batch-size 64",
  "queued_at": "2024-01-15T10:25:00Z",
  "started_at": "2024-01-15T10:30:00Z",
  "ended_at": "2024-01-15T14:45:00Z",
  "exit_code": 0,
  "priority": 5,
  
  // === Research Narrative (The Important Part) ===
  "narrative": {
    // WHY did you run this?
    "hypothesis": "Larger batch size with linear LR scaling should improve convergence speed without hurting final accuracy",
    
    // WHAT were you thinking at the time?
    "context": "Previous run (run_789) with batch=32 took 8 hours and plateaued at 0.85. Paper XYZ suggests linear scaling rule should work.",
    
    // WHAT were you trying to accomplish?
    "intent": "Test if doubling batch size (32→64) with 2x learning rate maintains accuracy while reducing training time",
    
    // WHAT did you expect to happen?
    "expected_outcome": "Similar final accuracy (~0.85) but ~4 hour training time instead of 8",
    
    // HOW is this related to other experiments?
    "parent_run": "2024-01-14_run789",
    "experiment_group": "batch-size-scaling-ablation",
    "tags": ["ablation", "batch-size", "convergence-speed", "paper-xyz-reproduction"],
    
    // WHAT did you learn? (filled in post-run or during)
    "outcome": "Success: accuracy=0.87 (+0.02), time=3.5h (-56%). Linear scaling rule validated.",
    "learnings": [
      "Linear LR scaling worked as expected from paper XYZ",
      "GPU memory utilization went from 60% to 95% - near limit",
      "Convergence was actually smoother (fewer spikes in loss curve)",
      "Could probably push to batch=96 before OOM"
    ],
    "next_steps": [
      "Try batch=96 to maximize GPU utilization",
      "Test if this scales to batch=128 with gradient accumulation",
      "Validate on other datasets (currently only tested on ImageNet)"
    ],
    "validation_status": "validates"  // or "refutes", "inconclusive", "partial"
  },
  
  // Human annotations added later
  "annotations": [
    {
      "timestamp": "2024-01-15T15:00:00Z",
      "author": "user@lab.edu",
      "note": "This result is strong enough for the paper. Use these hyperparams for final training."
    },
    {
      "timestamp": "2024-01-16T09:00:00Z",
      "author": "advisor@lab.edu", 
      "note": "Good work. Also compare with warmup schedule before finalizing."
    }
  ],
  
  // === Reproducibility Metadata ===
  "environment": {
    "git_commit": "a1b2c3d4",
    "git_dirty": false,
    "git_branch": "experiment/batch-scaling",
    "container_image": "pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime",
    "container_digest": "sha256:abc123...",
    "pip_freeze": "torch==2.0.1\ntorchvision==0.15.2\n...",
    "cuda_version": "11.8",
    "gpu_driver": "525.105.17",
    "python_version": "3.10.12"
  },
  
  // === Data Provenance ===
  "datasets": [
    {
      "name": "imagenet-train",
      "nas_path": "/nas/datasets/imagenet/ILSVRC2012/train",
      "checksum": "sha256:def456...",
      "size_gb": 144.2,
      "num_samples": 1281167,
      "version": "ILSVRC2012",
      "fetched_via": "prewarm",
      "fetch_time_seconds": 180
    }
  ],
  
  // === Resource Usage ===
  "resources": {
    "requested": {
      "gpus": 1,
      "gpu_memory_gb": 24,
      "cpu_cores": 8,
      "ram_gb": 32
    },
    "actual": {
      "gpu_utilization_avg": 95,
      "gpu_memory_peak_gb": 22.8,
      "cpu_utilization_avg": 45,
      "ram_peak_gb": 28.5,
      "disk_read_gb": 145,
      "disk_write_gb": 12
    },
    "gpu_model": "NVIDIA RTX 3090",
    "host": "ml-server-01"
  },
  
  // === Results ===
  "metrics": {
    "final_train_accuracy": 0.891,
    "final_val_accuracy": 0.873,
    "final_train_loss": 0.234,
    "final_val_loss": 0.287,
    "best_val_accuracy": 0.876,
    "best_epoch": 87,
    "total_epochs": 100,
    "training_time_hours": 3.52
  },
  
  // === Artifacts ===
  "artifacts": {
    "discovery_time": "2024-01-15T14:45:00Z",
    "files": [
      {
        "path": "checkpoints/epoch_010.pth",
        "size_bytes": 450000000,
        "modified": "2024-01-15T11:30:00Z"
      },
      {
        "path": "checkpoints/best.pth",
        "size_bytes": 450000000,
        "modified": "2024-01-15T13:45:00Z"
      }
    ],
    "total_size_bytes": 900000000
  }
}