
Research-First Runner: Missing Themes Plan

This file captures additional themes that are commonly missing in existing ML runners/experiment tools, translated into actionable design targets for a lightweight, research-first runner.

Quick Overview

What makes this different:

  • Your server, not their cloud: Everything runs on your homelab/workstation/uni server
  • Dual interfaces: Zig CLI for scripting + SSH-accessible TUI for interactive work
  • Fair queueing: ml queue (not run) makes resource sharing explicit
  • Research narrative: Capture why you ran experiments, not just what ran
  • Zero SaaS: No accounts, web dashboards, or external services
  • Plain text everything: Human-readable manifests, long-term reproducibility

Perfect for: Researchers in uni labs, homelab enthusiasts, and small research groups who want control over their infrastructure without cloud vendor lock-in.

Architecture Context

Server-Centric Model for Homelab/Workstation/Uni Lab:

  • Two client interfaces:
    • Zig CLI: Thin WebSocket client for scripting, automation, remote access
    • SSH-accessible TUI: Interactive Bubble Tea UI for monitoring when SSH'd into server
  • Go API server with embedded rsync (reduces dependencies)
  • Worker pulls from flexible queue backend (Redis/SQLite/filesystem)
  • Priority-based scheduling with prewarm mechanism
  • NAS integration for data prefetching
  • Target: single server, workstation, or small uni lab cluster (not cloud/SaaS)

Client Access Patterns:

# CLI (from anywhere via WebSocket)
ml queue train.py --epochs 100
ml status --watch
ml info <path|id>

# TUI (when SSH'd into server or jump box)
ssh mluser@worker.local
ml-tui  # Interactive terminal UI
# Navigate with keyboard, see live updates

Configuration:

# ~/.ml/config.toml (shared by both CLI and TUI)
worker_host = "worker.local"
worker_user = "mluser" 
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"

Plan (Missing Themes)

Implemented Today (in this repo)

  • Runs are queued via ml queue and processed by workers.
  • Run provenance is written to run_manifest.json.
  • You can attach queue-time notes with ml queue --note "..." (persisted as metadata.note in run_manifest.json).
  • Queue backends support Redis, SQLite, and the filesystem (the filesystem backend also serves as an optional fallback).
  • CLI + SSH-launched TUI are both available (ml monitor launches the TUI).

Future Ideas (this document)

1. Own-infrastructure-first, research-centric by default

2. Minimal server dependencies (simple operations)

3. Text-first tracking (logs > dashboards)

  • Research narrative completion: post-run outcome/learnings/next steps captured in the manifest
  • Auto-captured context:
    • Command + args (as sent from CLI)
    • Timestamps (queue time, start time, end time)
    • Git commit hash (and optionally diff)
    • Environment snapshot (pip freeze, conda export, container image digest)
    • Hardware context (GPU model, driver version, CUDA version)
  • Plain text manifests: JSON or YAML, never binary blobs
  • Stable formats: Can read experiments from 5 years ago without the runner

Implementation note: Server writes run_manifest.json to experiment directory. CLI can display it via ml info.
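As a sketch of that implementation note, the server-side write path could look like this (the capture_context and write_manifest helpers are illustrative names, not the runner's actual API):

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def capture_context(command: list[str]) -> dict:
    """Auto-capture queue-time context; every field is best-effort."""
    ctx = {
        "command": " ".join(command),
        "queued_at": datetime.now(timezone.utc).isoformat(),
    }
    try:
        ctx["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True, stderr=subprocess.DEVNULL,
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        ctx["git_commit"] = None  # plain-text manifests tolerate gaps
    return ctx

def write_manifest(run_dir: Path, context: dict) -> Path:
    """Write a human-readable JSON manifest into the experiment directory."""
    path = run_dir / "run_manifest.json"
    path.write_text(json.dumps(context, indent=2, sort_keys=True))
    return path
```

Because the manifest is plain JSON on disk, ml info only needs to render the same file that ls and cat already expose.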

4. CLI and TUI as complementary interfaces

  • Consistent CLI scripting UX: Future idea (uniform --json, quiet modes, and stable exit codes across commands)
  • TUI feature parity: Future idea (surface the same key details in TUI + CLI: queue position/ETA, narrative, validation results)

5. Failure-tolerant, messy-research friendly

  • Failure is first-class: Failed runs stay visible and queryable
  • Partial artifacts preserved: Keep artifacts/logs up to failure point (including checkpoints, if the script produces them)
  • No punishment for refactors: Script renames don't break history
  • Grouping/tagging: Label attempts (baseline/ablation/debug/exploration)

Server implementation: Worker should catch exceptions, record failure reason, preserve state. Queue should track failure modes (OOM, timeout, code error, data error).
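One plausible shape for that failure taxonomy; the exit-code mapping, category names, and log patterns here are assumptions for illustration, not the runner's actual schema:

```python
import signal

def classify_failure(exit_code: int, log_tail: str = "") -> str:
    """Map a finished process to a queryable failure mode."""
    if exit_code == 0:
        return "success"
    if exit_code == 128 + signal.SIGKILL or "CUDA out of memory" in log_tail:
        return "oom"  # 137: SIGKILL, commonly the kernel OOM killer
    if exit_code == 128 + signal.SIGTERM:
        return "timeout"  # 143: cancelled or exceeded max runtime
    if "FileNotFoundError" in log_tail:
        return "data_error"  # crude heuristic; a real worker would do better
    return "code_error"
```

The classification lands in the manifest, so failed runs stay visible and queryable instead of vanishing.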

6. Minimal abstraction over Python (transparent execution)

  • Run scripts as-is: No decorators, no framework rewrites
  • Preserve debuggability: Clean stack traces, pdb works
  • Optional instrumentation: Explicit metric logging via simple API
    # Optional, not required
    from ml_runner import log_metric
    log_metric("loss", 0.5, step=100)
    
  • Standard I/O works: print() goes to logs, arguments via sys.argv

Server implementation: Worker spawns process, captures stdout/stderr, parses optional structured logs. No magic wrappers that hide what's happening.
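A minimal sketch of that transparent-execution contract: spawn the script unmodified, tee its output to a log file, and treat structured metric lines as an opt-in convention (the "@metric " prefix is an assumed wire format, not the real one):

```python
import json
import subprocess
from pathlib import Path

METRIC_PREFIX = "@metric "  # assumed sentinel for opt-in structured lines

def run_script(argv: list[str], run_dir: Path) -> tuple[int, list[dict]]:
    """Spawn the script exactly as given, tee stdout/stderr to a log file,
    and collect optional structured metric lines along the way."""
    metrics: list[dict] = []
    log_path = run_dir / "stdout.log"
    with log_path.open("w") as log, subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        for line in proc.stdout:
            log.write(line)  # plain print() output is preserved verbatim
            if line.startswith(METRIC_PREFIX):
                metrics.append(json.loads(line[len(METRIC_PREFIX):]))
    return proc.returncode, metrics
```

Nothing wraps the script itself, so running it directly with python train.py behaves identically, and pdb and stack traces stay clean.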

7. Reproducibility that survives time

  • Immutable run folders: Server never modifies completed runs
  • Environment capture (best-effort, pluggable):
    • Container image digest (primary method)
    • pip freeze / uv pip freeze / poetry.lock
    • conda env export
    • nix flake.lock (if available)
  • Hardware fingerprint: GPU model, driver, CUDA, CPU, RAM
  • Data provenance: Dataset checksums, NAS paths, version identifiers
  • Commit everything: Store full environment, even if verbose

Server implementation: Pre-run hook captures environment. Store in run_manifest.json. Validate on ml validate <run-id>.

8. Small compute and shared machine friendliness

  • Energy awareness: Respect that homelabs pay electricity bills
  • Laptop-friendly: Support thermal/power throttling
  • Single-GPU to 4-GPU range: Optimize for typical research setups
  • No cluster assumptions: Don't require Kubernetes/SLURM/etc.

9. Server-side storage with client-side visibility

Why this matters: Researchers want to ls experiment directories but don't want to manually sync. Server handles storage, CLI provides views.

11. Research narrative (lab notebook, not job IDs)

  • Queue-time narrative capture: Future idea (add --hypothesis, --context, --intent, etc. to ml queue)
  • Post-run learning capture: Future idea (explicit outcome, learnings[], next_steps[], and validation status)
  • Narrative UX: Future idea (view/edit narrative from TUI/CLI without hand-editing JSON)

CLI commands:

ml queue train.py --note "Testing warmup hypothesis from paper X"

12. Live observability without magic
  • CLI: WebSocket streaming for --watch and --follow
  • TUI: Live refresh (500ms tick), immediate queue updates
  • No magic: Minimize implicit behavior
    • Explicit is better than clever
    • Defaults should be obvious and documented
    • Side effects should be visible (both in CLI and TUI)
    • Configuration hierarchy clear: CLI flags > env > config file > defaults

TUI advantages for observability:

  • See everything at once: jobs, queue, GPUs, containers, logs
  • Keyboard shortcuts for common operations
  • Instant feedback on actions (queue, cancel, delete)
  • Prewarm state visible in GPU panel
  • No need to run multiple ml status commands

13. Support clear thinking during experimentation

  • Optimize for cognitive throughput:
    • Make it easy to remember what you were thinking
    • Surface patterns across experiments
    • Warn about near-duplicates before running
  • Built-in comparison:
    # Future ideas:
    # ml diff <run-a> <run-b>
    # ml similar <run-id>
    
  • Learning from history:
    # Future ideas:
    # ml lessons --tag ablation
    # ml dead-ends
    
  • Hypothesis tracking:
    • Link hypothesis → experiment → outcome → next hypothesis
    • Mark outcomes: validates/refutes/inconclusive
  • Reduce cognitive load:
    • Natural queries: Future idea (search over manifests/notes)
    • Show relevant history when queueing
    • Don't make researchers remember IDs

Server implementation: Maintain index (rebuildable from filesystem). Support semantic queries over manifests, notes, tags.
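A sketch of the rebuildable index: scan run_manifest.json files under the experiments root and derive a throwaway lookup structure (field names follow the example manifest later in this document; find_by_tag stands in for richer semantic queries):

```python
import json
from pathlib import Path

def rebuild_index(experiments_root: Path) -> dict:
    """Rebuild the search index purely from the filesystem; the plain-text
    manifests stay the source of truth and the index is a disposable cache."""
    index = {}
    for manifest in sorted(experiments_root.glob("*/run_manifest.json")):
        data = json.loads(manifest.read_text())
        run_id = data.get("run_id", manifest.parent.name)
        index[run_id] = {
            "status": data.get("status"),
            "tags": data.get("narrative", {}).get("tags", []),
            "note": data.get("metadata", {}).get("note"),
        }
    return index

def find_by_tag(index: dict, tag: str) -> list:
    """Toy query: all runs carrying a given tag."""
    return [run_id for run_id, entry in index.items() if tag in entry["tags"]]
```

Deleting the index costs nothing; a rescan of the directory tree recreates it.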

14. Fast iteration velocity

  • Easy modification:
    # Future ideas:
    # ml clone <run-id>
    # ml fork <run-id>
    
  • Batch operations:
    # Future idea: ml sweep
    

Why prewarm matters: Your NAS prefetch in prewarm means jobs start training immediately instead of waiting for data. This dramatically improves iteration velocity.

15. Full research lifecycle support

  • Exploration phase: Minimal metadata, quick runs
  • Development phase: Group attempts, compare variations
  • Validation phase: Strict reproducibility, complete capture
  • Publication phase: Export bundles, generate reproduction instructions
  • Maintenance phase: Long-term readable, re-executable years later

Reproducibility levels (your strict/best-effort model):

# Future idea: --repro-level
ml validate <run-id>                        # Future idea: expand validation coverage + outputs

16. Collaboration without platforms

  • Async collaboration (no shared server required):
    # Future ideas:
    # ml export <run-id> --bundle run_42.tar.gz
    # ml import run_42.tar.gz
    
  • Selective sharing:
    # Future ideas:
    # ml export <run-id> --metadata-only
    # ml export <run-id> --include-artifacts
    
  • Review-friendly:
    • Self-contained bundles
    • All provenance included
    • Reproducibility instructions
    • No "install our platform" friction

Server implementation: Export packages run_manifest.json + artifacts into tarball. Import validates and unpacks into experiments directory.
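A sketch of the export/import pair using plain tarballs (the function names and the artifacts/ layout are assumptions; only run_manifest.json matches the rest of this document):

```python
import json
import tarfile
from pathlib import Path

def export_run(run_dir: Path, bundle_path: Path, metadata_only: bool = False):
    """Package a run as a self-contained tarball; metadata_only maps to the
    proposed --metadata-only flag (manifest without artifacts)."""
    with tarfile.open(bundle_path, "w:gz") as tar:
        tar.add(run_dir / "run_manifest.json", arcname="run_manifest.json")
        if not metadata_only:
            artifacts = run_dir / "artifacts"
            if artifacts.exists():
                tar.add(artifacts, arcname="artifacts")

def import_run(bundle_path: Path, experiments_root: Path) -> Path:
    """Validate that the bundle carries a manifest, then unpack it."""
    with tarfile.open(bundle_path, "r:gz") as tar:
        if "run_manifest.json" not in tar.getnames():
            raise ValueError("bundle missing run_manifest.json")
        manifest = json.load(tar.extractfile("run_manifest.json"))
        dest = experiments_root / manifest["run_id"]
        dest.mkdir(parents=True, exist_ok=True)
        tar.extractall(dest)
    return dest
```

Because the bundle is a tarball of plain text plus artifacts, a collaborator can inspect it with tar -tzf without installing anything.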

17. Graceful degradation

  • Core works with minimal setup:
    • Filesystem-only queue (no Redis required)
    • SQLite for metadata (no Postgres)
    • Local execution (no remote targets needed)
  • Optional enhancements:
    • Redis for better multi-worker queueing
    • Git integration (works without git)
    • NAS prewarm (falls back to on-demand fetch)
    • WebSocket updates (falls back to polling)
  • Progressive disclosure:
    • Simple commands for simple cases
    • Advanced flags for power users
    • Features activate when available


18. Concrete features (derived from above)

Findability

# Future ideas:
# ml find "failed runs on GPU2 last week"
# ml find --note "warmup"

Server maintains rebuildable index over manifests, logs, tags.

Dataset provenance

{
  "datasets": [
    {
      "name": "imagenet-train",
      "nas_path": "/nas/datasets/imagenet/train",
      "checksum": "sha256:abc123...",
      "fetched_at": "2024-01-15T10:30:00Z",
      "fetch_method": "prewarm"
    }
  ]
}

Server validates checksums, warns on drift.
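Checksum validation here is ordinary SHA-256 over the dataset file, recorded in the manifest's sha256:<hex> notation; a minimal sketch:

```python
import hashlib
from pathlib import Path

def dataset_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large datasets fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()

def check_drift(path: Path, recorded: str) -> bool:
    """True if the data on disk no longer matches the manifest entry."""
    return dataset_checksum(path) != recorded
```

For a directory-shaped dataset the server would checksum each shard; the warning, not the mechanics, is the point.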

Prewarm observability

ml status
# Shows:
#   Next in queue: run_xyz (priority 5)
#   Prewarming: dataset imagenet-train (2/5 complete)
#   GPU 0: running run_abc (50% complete, ETA 2h)
#   GPU 1: idle

CLI queue/requeue workflows

Core principle: the runner does not introduce checkpoint conventions. The script should run identically when executed directly vs via ml.

Passive artifact tracking (future idea): worker records what files exist in the run directory after completion (or via configured glob patterns). Checkpoints are just artifacts.

Requeue = replay command with modifications (future idea):

# Original run
ml queue train.py --epochs 100 --save-dir ./checkpoints

# Requeue (continue)
ml requeue run_abc -- --resume ./checkpoints/best.pt --epochs 200

Arg merge strategies (future idea):

# Append new args (default)
ml requeue run_abc --append -- --resume ./checkpoints/best.pt

# Replace (rerun with only new args)
ml requeue run_abc --replace -- --epochs 200 --lr 3e-4

# Merge (override matching flags, keep the rest)
ml requeue run_abc --merge -- --epochs 200
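The three strategies could share one merge function; the flag parsing here is deliberately naive (each --flag takes at most one value, no positional handling) and purely illustrative of the proposed semantics:

```python
def merge_args(original: list[str], new: list[str],
               strategy: str = "append") -> list[str]:
    """Replay a run's args with modifications (sketch of requeue semantics)."""
    if strategy == "replace":
        return list(new)
    if strategy == "append":
        return original + new
    if strategy == "merge":
        def to_pairs(args: list[str]) -> dict:
            # Naive "--flag [value]" parser; one value per flag at most.
            pairs, i = {}, 0
            while i < len(args):
                if i + 1 < len(args) and not args[i + 1].startswith("--"):
                    pairs[args[i]] = args[i + 1]
                    i += 2
                else:
                    pairs[args[i]] = None
                    i += 1
            return pairs
        merged = to_pairs(original)
        merged.update(to_pairs(new))  # override matching flags, keep the rest
        out = []
        for flag, value in merged.items():  # dicts preserve insertion order
            out.append(flag)
            if value is not None:
                out.append(value)
        return out
    raise ValueError(f"unknown merge strategy: {strategy}")
```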

Optional staging (future idea): copy an artifact from the source run into the new run directory, then reference it with a placeholder.

ml requeue run_abc --stage checkpoints/best.pt -- \
  --resume {staged}/best.pt --epochs 200

Hardware/resource management

{
  "resources": {
    "gpus": 2,
    "gpu_memory_gb": 40,
    "cpu_cores": 16,
    "ram_gb": 64,
    "disk_gb": 100,
    "max_runtime_hours": 24
  }
}

Worker validates resources before pulling from queue. Server tracks utilization.
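The admission check itself can be a few lines over the resource keys shown above (a sketch; the real worker would also account for resources already committed to running jobs):

```python
def fits(request: dict, available: dict) -> bool:
    """Worker-side admission check: only pull a job whose resource request
    fits what the machine actually has. Missing request keys count as zero."""
    keys = ("gpus", "gpu_memory_gb", "cpu_cores", "ram_gb", "disk_gb")
    return all(request.get(k, 0) <= available.get(k, 0) for k in keys)
```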


Design Philosophy Summary (Server-Centric)

The goal is to build a research assistant that runs on YOUR server, not a platform that runs on someone else's cloud.

Every feature should answer:

  1. Does this help researchers understand what happened on the server?
  2. Does this make the server transparent instead of a black box?
  3. Does this work on a single workstation or small lab server?
  4. Does this respect that researchers SSH into the server?
  5. Does this make local data (NAS, scratch drives) first-class?

Architecture principles:

  • Server is the control plane: All logic, storage, scheduling on server
  • CLI is a thin client: Just communicates via WebSocket, no local state
  • Filesystem is still king: Server writes plain text, CLI reads via API
  • Queue-first for fairness: ml queue not ml run - explicit resource requests
  • Priority without hogging: Higher priority = earlier in queue, not exclusive access
  • Prewarm is a performance optimization: Best-effort, never required for correctness
  • NAS integration is native: Server understands mounted storage

When in doubt:

  • Server-side is better than client-side (for logic)
  • WebSocket is better than REST (for interactivity)
  • Embedded is better than external deps (rsync in server)
  • Flexible backend is better than required service (Redis OR SQLite OR filesystem)
  • Plain text is better than binary
  • Your hardware is better than their cloud

The runner should feel like SSH into your well-organized research server with powerful tools, not like operating a cloud platform. Whether you're using the CLI for automation or the TUI for interactive work, the experience should be transparent, fair, and research-focused.


Typical Research Workflows (CLI + TUI)

Morning Routine: Check What Happened Overnight

# From your laptop (via WebSocket)
ml status
# Shows: 2 finished, 1 running, 3 in queue

ml info run_abc --show-metrics
# Quick check: did the overnight run validate the hypothesis?

# If you need deep investigation, SSH in
ssh mluser@worker.local
ml-tui
# Visual inspection of logs, GPU usage, etc.

Starting a New Experiment Series

# Script a parameter sweep (CLI automation)
# Future idea: --hypothesis / --experiment-group flags
for lr in 1e-3 3e-4 1e-4; do
  ml queue train.py --lr $lr --priority 5
done

# Monitor in TUI (interactive)
ssh mluser@worker.local
ml-tui
# Watch queue, see ETA, check prewarm status

Debugging a Failed Run

# Notice failure via CLI
ml status
# run_xyz: failed (exit code 137) - OOM?

# Jump into TUI for investigation
ssh mluser@worker.local
ml-tui
# Navigate to run_xyz, press 'l' for logs
# See OOM error at batch 128
# Future idea: narrative/annotation UX in the TUI

End-of-Day Review

# TUI for visual summary
ssh mluser@worker.local
ml-tui
# Scroll through today's runs
# Future ideas: compare views, export bundles

Paper Writing Time (6 months later)

# Today: use the filesystem + run manifests
ml info <path|id>

# Future ideas: searching/filtering + comparison reports

# TUI for visual exploration
ssh mluser@worker.local
ml-tui
# Navigate through old experiments
# Press 'n' to read narratives
# Reconstruct your thought process

Collaborative Debugging with Advisor

# Both SSH into server simultaneously
ssh mluser@worker.local

# You run TUI to show current state
ml-tui
# Navigate to problem run, show logs live

# Advisor suggests fix
# You queue new run with their suggestion
# Future idea: --parent-run to link it to the failing run
ml queue train.py --lr 1e-4 \
  --note "Per advisor: try smaller LR with warmup" \
  --priority 7

# Watch it start in TUI immediately
# Queue position visible, prewarm status shown

This dual-interface approach gives researchers the best of both worlds: scriptability when they need it, visibility when they want it.


How This Maps to Your Current Architecture

✅ Already correct:

  • Server-centric with dual client interfaces (CLI + TUI)
  • WebSocket communication (CLI)
  • SSH-based TUI with Bubble Tea (interactive monitoring)
  • Embedded rsync in server
  • Flexible queue backend (Redis/SQLite/filesystem)
  • Priority scheduling
  • Prewarm mechanism for NAS prefetch
  • Fair queueing philosophy - queue not run
  • TUI shows live updates: jobs, queue, GPU status, logs

🎯 Natural extensions:

  • Queue-time narrative flags for ml queue (hypothesis/context/intent/etc.)
  • CLI commands for diffing and finding (and higher-level comparison workflows)
  • TUI panels for hypothesis/learnings (in job details)
  • Reproducibility validation improvements (extend ml validate)
  • Export/import for collaboration
  • Graceful degradation (filesystem-only mode)
  • Visible queue position and fairness metrics

📝 Design considerations:

  • Show prewarm state/progress in ml status
  • Show queue position and ETA in both CLI and TUI
  • Add research context fields to manifests
  • Build comparison workflows (diff, similar, why-different)
  • Support hypothesis tracking in both interfaces
  • Create export bundles for sharing
  • Expose fairness metrics (wait time distribution, resource utilization)
  • TUI could show narrative snippets in job list (hypothesis as subtitle?)

TUI Research Narrative Integration Ideas:

┌─ ML Jobs & Queue ─────────────────────────────────────┐
│ > imagenet_baseline                                   │
│   ✓ finished | Priority: 5                            │
│   "Testing baseline performance before ablations"     │
│                                                        │
│   batch_size_64                                       │
│   ▶ running (epoch 45/100) | Priority: 5             │
│   "Validating linear LR scaling hypothesis"           │
│                                                        │
│   warmup_test                                         │
│   ⏳ queued (position 2) | Priority: 3               │
│   "Following up on advisor suggestion about warmup"   │
└───────────────────────────────────────────────────────┘

Press 'n' to view narrative, 'a' to annotate

Implementation status (today):

  • Annotations are implemented and stored at the root of run_manifest.json as annotations[].
  • Narrative fields are implemented and stored in run_manifest.json under narrative (set/update via CLI).
  • Use ml annotate <path|run_id|task_id> --note "..." [--author "..."] to append an entry.
  • Remaining gaps are around queue-time capture, post-run learnings/outcomes, and TUI-first narrative UX.

Example run_manifest.json (the // comments are for illustration; the file on disk is plain JSON)

{
  // === Standard Execution Metadata ===
  "run_id": "2024-01-15_abc123",
  "status": "completed",
  "command": "train.py --lr 0.001 --epochs 100 --batch-size 64",
  "queued_at": "2024-01-15T10:25:00Z",
  "started_at": "2024-01-15T10:30:00Z",
  "ended_at": "2024-01-15T14:45:00Z",
  "exit_code": 0,
  "priority": 5,
  
  // === Research Narrative (The Important Part) ===
  "narrative": {
    // WHY did you run this?
    "hypothesis": "Larger batch size with linear LR scaling should improve convergence speed without hurting final accuracy",
    
    // WHAT were you thinking at the time?
    "context": "Previous run (run_789) with batch=32 took 8 hours and plateaued at 0.85. Paper XYZ suggests linear scaling rule should work.",
    
    // WHAT were you trying to accomplish?
    "intent": "Test if doubling batch size (32→64) with 2x learning rate maintains accuracy while reducing training time",
    
    // WHAT did you expect to happen?
    "expected_outcome": "Similar final accuracy (~0.85) but ~4 hour training time instead of 8",
    
    // HOW is this related to other experiments?
    "parent_run": "2024-01-14_run789",
    "experiment_group": "batch-size-scaling-ablation",
    "tags": ["ablation", "batch-size", "convergence-speed", "paper-xyz-reproduction"],
    
    // WHAT did you learn? (filled in post-run or during)
    "outcome": "Success: accuracy=0.87 (+0.02), time=3.5h (-56%). Linear scaling rule validated.",
    "learnings": [
      "Linear LR scaling worked as expected from paper XYZ",
      "GPU memory utilization went from 60% to 95% - near limit",
      "Convergence was actually smoother (fewer spikes in loss curve)",
      "Could probably push to batch=96 before OOM"
    ],
    "next_steps": [
      "Try batch=96 to maximize GPU utilization",
      "Test if this scales to batch=128 with gradient accumulation",
      "Validate on other datasets (currently only tested on ImageNet)"
    ],
    "validation_status": "validates",  // or "refutes", "inconclusive", "partial"
  },
  
  // Human annotations added later
  "annotations": [
    {
      "timestamp": "2024-01-15T15:00:00Z",
      "author": "user@lab.edu",
      "note": "This result is strong enough for the paper. Use these hyperparams for final training."
    },
    {
      "timestamp": "2024-01-16T09:00:00Z",
      "author": "advisor@lab.edu", 
      "note": "Good work. Also compare with warmup schedule before finalizing."
    }
  ],
  
  // === Reproducibility Metadata ===
  "environment": {
    "git_commit": "a1b2c3d4",
    "git_dirty": false,
    "git_branch": "experiment/batch-scaling",
    "container_image": "pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime",
    "container_digest": "sha256:abc123...",
    "pip_freeze": "torch==2.0.1\ntorchvision==0.15.2\n...",
    "cuda_version": "11.8",
    "gpu_driver": "525.105.17",
    "python_version": "3.10.12"
  },
  
  // === Data Provenance ===
  "datasets": [
    {
      "name": "imagenet-train",
      "nas_path": "/nas/datasets/imagenet/ILSVRC2012/train",
      "checksum": "sha256:def456...",
      "size_gb": 144.2,
      "num_samples": 1281167,
      "version": "ILSVRC2012",
      "fetched_via": "prewarm",
      "fetch_time_seconds": 180
    }
  ],
  
  // === Resource Usage ===
  "resources": {
    "requested": {
      "gpus": 1,
      "gpu_memory_gb": 24,
      "cpu_cores": 8,
      "ram_gb": 32
    },
    "actual": {
      "gpu_utilization_avg": 95,
      "gpu_memory_peak_gb": 22.8,
      "cpu_utilization_avg": 45,
      "ram_peak_gb": 28.5,
      "disk_read_gb": 145,
      "disk_write_gb": 12
    },
    "gpu_model": "NVIDIA RTX 3090",
    "host": "ml-server-01"
  },
  
  // === Results ===
  "metrics": {
    "final_train_accuracy": 0.891,
    "final_val_accuracy": 0.873,
    "final_train_loss": 0.234,
    "final_val_loss": 0.287,
    "best_val_accuracy": 0.876,
    "best_epoch": 87,
    "total_epochs": 100,
    "training_time_hours": 3.52
  },
  
  // === Artifacts ===
  "artifacts": {
    "discovery_time": "2024-01-15T14:45:00Z",
    "files": [
      {
        "path": "checkpoints/epoch_010.pth",
        "size_bytes": 450000000,
        "modified": "2024-01-15T11:30:00Z"
      },
      {
        "path": "checkpoints/best.pth",
        "size_bytes": 450000000,
        "modified": "2024-01-15T13:45:00Z"
      }
    ],
    "total_size_bytes": 900000000
  }
}