# Research-First Runner: Missing Themes Plan

This file captures additional themes that are commonly missing in existing ML runners/experiment tools, translated into actionable design targets for a lightweight, research-first runner.

## Quick Overview

**What makes this different:**

- **Your server, not their cloud**: Everything runs on your homelab/workstation/uni server
- **Dual interfaces**: Zig CLI for scripting + SSH-accessible TUI for interactive work
- **Fair queueing**: `ml queue` (not `run`) makes resource sharing explicit
- **Research narrative**: Capture why you ran experiments, not just what ran
- **Zero SaaS**: No accounts, web dashboards, or external services
- **Plain text everything**: Human-readable manifests, long-term reproducibility

**Perfect for:** Researchers in uni labs, homelab enthusiasts, small research groups who want control over their infrastructure without cloud vendor lock-in.

## Architecture Context

**Server-Centric Model for Homelab/Workstation/Uni Lab:**

- **Two client interfaces**:
  - **Zig CLI**: Thin WebSocket client for scripting, automation, remote access
  - **SSH-accessible TUI**: Interactive Bubble Tea UI for monitoring when SSH'd into server
- Go API server with embedded rsync (reduces dependencies)
- Worker pulls from flexible queue backend (Redis/SQLite/filesystem)
- Priority-based scheduling with prewarm mechanism
- NAS integration for data prefetching
- Target: single server, workstation, or small uni lab cluster (not cloud/SaaS)

**Client Access Patterns:**

```bash
# CLI (from anywhere via WebSocket)
ml queue train.py --epochs 100
ml status --watch
ml info

# TUI (when SSH'd into server or jump box)
ssh mluser@worker.local
ml-tui  # Interactive terminal UI
# Navigate with keyboard, see live updates
```

**Configuration:**

```toml
# ~/.ml/config.toml (shared by both CLI and TUI)
worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"
```

## Plan (Missing Themes)

## Implemented Today (in this repo)

- Runs are queued via `ml queue` and processed by workers.
- Run provenance is written to `run_manifest.json`.
- You can attach queue-time notes with `ml queue --note "..."` (persisted under `run_manifest.json` → `metadata.note`).
- Queue backends support Redis / SQLite / filesystem (and optional filesystem fallback).
- CLI + SSH-launched TUI are both available (`ml monitor` launches the TUI).

## Future Ideas (this document)

### 1. Own-infrastructure-first, research-centric by default

### 2. Minimal server dependencies (simple operations)

### 3. Text-first tracking (logs > dashboards)

- **Research narrative completion**: post-run outcome/learnings/next steps captured in the manifest
- **Auto-captured context**:
  - Command + args (as sent from CLI)
  - Timestamps (queue time, start time, end time)
  - Git commit hash (and optionally diff)
  - Environment snapshot (pip freeze, conda export, container image digest)
  - Hardware context (GPU model, driver version, CUDA version)
- **Plain text manifests**: JSON or YAML, never binary blobs
- **Stable formats**: Can read experiments from 5 years ago without the runner

**Implementation note**: Server writes `run_manifest.json` to experiment directory. CLI can display it via `ml info`.

### 4. CLI and TUI as complementary interfaces

- **Consistent CLI scripting UX**: Future idea (uniform `--json`, quiet modes, and stable exit codes across commands)
- **TUI feature parity**: Future idea (surface the same key details in TUI + CLI: queue position/ETA, narrative, validation results)

### 5. Failure-tolerant, messy-research friendly

- **Failure is first-class**: Failed runs stay visible and queryable
- **Partial artifacts preserved**: Keep artifacts/logs up to failure point (including checkpoints, if the script produces them)
- **No punishment for refactors**: Script renames don't break history
- **Grouping/tagging**: Label attempts (baseline/ablation/debug/exploration)

**Server implementation**: Worker should catch exceptions, record failure reason, preserve state. Queue should track failure modes (OOM, timeout, code error, data error).

### 6. Minimal abstraction over Python (transparent execution)

- **Run scripts as-is**: No decorators, no framework rewrites
- **Preserve debuggability**: Clean stack traces, pdb works
- **Optional instrumentation**: Explicit metric logging via simple API

  ```python
  # Optional, not required
  from ml_runner import log_metric
  log_metric("loss", 0.5, step=100)
  ```

- **Standard I/O works**: `print()` goes to logs, arguments via `sys.argv`

**Server implementation**: Worker spawns process, captures stdout/stderr, parses optional structured logs. No magic wrappers that hide what's happening.

### 7. Reproducibility that survives time

- **Immutable run folders**: Server never modifies completed runs
- **Environment capture** (best-effort, pluggable):
  - Container image digest (primary method)
  - `pip freeze` / `uv pip freeze` / `poetry.lock`
  - `conda env export`
  - `nix flake.lock` (if available)
- **Hardware fingerprint**: GPU model, driver, CUDA, CPU, RAM
- **Data provenance**: Dataset checksums, NAS paths, version identifiers
- **Commit everything**: Store full environment, even if verbose

**Server implementation**: Pre-run hook captures environment. Store in `run_manifest.json`. Validate on `ml validate`.

### 8. Small compute and shared machine friendliness

### 9. Server-side storage with client-side visibility

- **Energy awareness**: Respect that homelabs pay electricity bills
- **Laptop-friendly**: Support thermal/power throttling
- **Single-GPU to 4-GPU range**: Optimize for typical research setups
- **No cluster assumptions**: Don't require Kubernetes/SLURM/etc.

**Why this matters**: Researchers want to `ls` experiment directories but don't want to manually sync. Server handles storage, CLI provides views.

### 11. Research narrative (lab notebook, not job IDs)

- **Queue-time narrative capture**: Future idea (add `--hypothesis`, `--context`, `--intent`, etc. to `ml queue`)
- **Post-run learning capture**: Future idea (explicit `outcome`, `learnings[]`, `next_steps[]`, and validation status)
- **Narrative UX**: Future idea (view/edit narrative from TUI/CLI without hand-editing JSON)

**CLI commands**:

```bash
ml queue train.py --note "Testing warmup hypothesis from paper X"
```

- CLI: WebSocket streaming for `--watch` and `--follow`
- TUI: Live refresh (500ms tick), immediate queue updates
- **No magic**: Minimize implicit behavior
  - Explicit is better than clever
  - Defaults should be obvious and documented
  - Side effects should be visible (both in CLI and TUI)
  - Configuration hierarchy clear: CLI flags > env > config file > defaults

**TUI advantages for observability:**

- See everything at once: jobs, queue, GPUs, containers, logs
- Keyboard shortcuts for common operations
- Instant feedback on actions (queue, cancel, delete)
- Prewarm state visible in GPU panel
- No need to run multiple `ml status` commands

### 13. Support clear thinking during experimentation

- **Optimize for cognitive throughput**:
  - Make it easy to remember what you were thinking
  - Surface patterns across experiments
  - Warn about near-duplicates before running
- **Built-in comparison**:

  ```bash
  # Future ideas:
  # ml diff
  # ml similar
  ```

- **Learning from history**:

  ```bash
  # Future ideas:
  # ml lessons --tag ablation
  # ml dead-ends
  ```

- **Hypothesis tracking**:
  - Link hypothesis → experiment → outcome → next hypothesis
  - Mark outcomes: validates/refutes/inconclusive
- **Reduce cognitive load**:
  - Natural queries: Future idea (search over manifests/notes)
  - Show relevant history when queueing
  - Don't make researchers remember IDs

**Server implementation**: Maintain index (rebuildable from filesystem). Support semantic queries over manifests, notes, tags.

### 14. Fast iteration velocity

- **Easy modification**:

  ```bash
  # Future ideas:
  # ml clone
  # ml fork
  ```

- **Batch operations**:

  ```bash
  # Future idea: ml sweep
  ```

**Why prewarm matters**: Your NAS prefetch in prewarm means jobs start training immediately instead of waiting for data. This dramatically improves iteration velocity.

### 15. Full research lifecycle support

- **Exploration phase**: Minimal metadata, quick runs
- **Development phase**: Group attempts, compare variations
- **Validation phase**: Strict reproducibility, complete capture
- **Publication phase**: Export bundles, generate reproduction instructions
- **Maintenance phase**: Long-term readable, re-executable years later

**Reproducibility levels** (your strict/best-effort model):

```bash
# Future idea: --repro-level
ml validate  # Future idea: expand validation coverage + outputs
```

### 16. Collaboration without platforms

- **Async collaboration** (no shared server required):

  ```bash
  # Future ideas:
  # ml export --bundle run_42.tar.gz
  # ml import run_42.tar.gz
  ```

- **Selective sharing**:

  ```bash
  # Future ideas:
  # ml export --metadata-only
  # ml export --include-artifacts
  ```

- **Review-friendly**:
  - Self-contained bundles
  - All provenance included
  - Reproducibility instructions
  - No "install our platform" friction

**Server implementation**: Export packages `run_manifest.json` + artifacts into tarball. Import validates and unpacks into experiments directory.

### 17. Graceful degradation

- **Core works with minimal setup**:
  - Filesystem-only queue (no Redis required)
  - SQLite for metadata (no Postgres)
  - Local execution (no remote targets needed)
- **Optional enhancements**:
  - Redis for better multi-worker queueing
  - Git integration (works without git)
  - NAS prewarm (falls back to on-demand fetch)
  - WebSocket updates (falls back to polling)
- **Progressive disclosure**:
  - Simple commands for simple cases
  - Advanced flags for power users
  - Features activate when available

**Implementation note**:

### 18. Concrete features (derived from above)

#### Findability

```bash
# Future ideas:
# ml find "failed runs on GPU2 last week"
# ml find --note "warmup"
```

Server maintains rebuildable index over manifests, logs, tags.

#### Dataset provenance

```json
{
  "datasets": [
    {
      "name": "imagenet-train",
      "nas_path": "/nas/datasets/imagenet/train",
      "checksum": "sha256:abc123...",
      "fetched_at": "2024-01-15T10:30:00Z",
      "fetch_method": "prewarm"
    }
  ]
}
```

Server validates checksums, warns on drift.

#### Prewarm observability

```bash
ml status
# Shows:
# Next in queue: run_xyz (priority 5)
# Prewarming: dataset imagenet-train (2/5 complete)
# GPU 0: running run_abc (50% complete, ETA 2h)
# GPU 1: idle
```

#### CLI queue/requeue workflows

**Core principle**: the runner does not introduce checkpoint conventions. The script should run identically when executed directly vs via `ml`.

**Passive artifact tracking** (future idea): worker records what files exist in the run directory after completion (or via configured glob patterns). Checkpoints are just artifacts.

**Requeue = replay command with modifications** (future idea):

```bash
# Original run
ml queue train.py --epochs 100 --save-dir ./checkpoints

# Requeue (continue)
ml requeue run_abc -- --resume ./checkpoints/best.pt --epochs 200
```

**Arg merge strategies** (future idea):

```bash
# Append new args (default)
ml requeue run_abc --append -- --resume ./checkpoints/best.pt

# Replace (rerun with only new args)
ml requeue run_abc --replace -- --epochs 200 --lr 3e-4

# Merge (override matching flags, keep the rest)
ml requeue run_abc --merge -- --epochs 200
```

**Optional staging** (future idea): copy an artifact from the source run into the new run directory, then reference it with a placeholder.

```bash
ml requeue run_abc --stage checkpoints/best.pt -- \
  --resume {staged}/best.pt --epochs 200
```

#### Hardware/resource management

```json
{
  "resources": {
    "gpus": 2,
    "gpu_memory_gb": 40,
    "cpu_cores": 16,
    "ram_gb": 64,
    "disk_gb": 100,
    "max_runtime_hours": 24
  }
}
```

Worker validates resources before pulling from queue. Server tracks utilization.

---

## Design Philosophy Summary (Server-Centric)

The goal is to build a **research assistant that runs on YOUR server**, not a platform that runs on someone else's cloud.

### Every feature should answer:

1. Does this help researchers **understand** what happened on the server?
2. Does this make the server **transparent** instead of a black box?
3. Does this work on a **single workstation** or small lab server?
4. Does this respect that researchers **SSH into the server**?
5. Does this make **local data** (NAS, scratch drives) first-class?

### Architecture principles:

- **Server is the control plane**: All logic, storage, scheduling on server
- **CLI is a thin client**: Just communicates via WebSocket, no local state
- **Filesystem is still king**: Server writes plain text, CLI reads via API
- **Queue-first for fairness**: `ml queue` not `ml run` - explicit resource requests
- **Priority without hogging**: Higher priority = earlier in queue, not exclusive access
- **Prewarm is a performance optimization**: Best-effort, never required for correctness
- **NAS integration is native**: Server understands mounted storage

### When in doubt:

- **Server-side is better** than client-side (for logic)
- **WebSocket is better** than REST (for interactivity)
- **Embedded is better** than external deps (rsync in server)
- **Flexible backend is better** than required service (Redis OR SQLite OR filesystem)
- **Plain text is better** than binary
- **Your hardware is better** than their cloud

The runner should feel like **SSH into your well-organized research server with powerful tools**, not like operating a cloud platform. Whether you're using the CLI for automation or the TUI for interactive work, the experience should be transparent, fair, and research-focused.

---

## Typical Research Workflows (CLI + TUI)

### Morning Routine: Check What Happened Overnight

```bash
# From your laptop (via WebSocket)
ml status
# Shows: 2 finished, 1 running, 3 in queue

ml info run_abc --show-metrics
# Quick check: did the overnight run validate the hypothesis?

# If you need deep investigation, SSH in
ssh mluser@worker.local
ml-tui
# Visual inspection of logs, GPU usage, etc.
```

### Starting a New Experiment Series

```bash
# Script a parameter sweep (CLI automation)
# Future idea: --hypothesis / --experiment-group
# (kept above the command: a comment after a trailing "\" would cut the command short)
for lr in 1e-3 3e-4 1e-4; do
  ml queue train.py --lr "$lr" \
    --priority 5
done

# Monitor in TUI (interactive)
ssh mluser@worker.local
ml-tui
# Watch queue, see ETA, check prewarm status
```

### Debugging a Failed Run

```bash
# Notice failure via CLI
ml status
# run_xyz: failed (exit code 137) - OOM?

# Jump into TUI for investigation
ssh mluser@worker.local
ml-tui
# Navigate to run_xyz, press 'l' for logs
# See OOM error at batch 128
# Future idea: narrative/annotation UX in the TUI
```

### End-of-Day Review

```bash
# TUI for visual summary
ssh mluser@worker.local
ml-tui
# Scroll through today's runs
# Future ideas: compare views, export bundles
```

### Paper Writing Time (6 months later)

```bash
# Today: use the filesystem + run manifests
ml info
# Future ideas: searching/filtering + comparison reports

# TUI for visual exploration
ssh mluser@worker.local
ml-tui
# Navigate through old experiments
# Press 'n' to read narratives
# Reconstruct your thought process
```

### Collaborative Debugging with Advisor

```bash
# Both SSH into server simultaneously
ssh mluser@worker.local

# You run TUI to show current state
ml-tui
# Navigate to problem run, show logs live

# Advisor suggests fix
# You queue new run with their suggestion
# Future idea: --parent-run
ml queue train.py --lr 1e-4 \
  --note "Per advisor: try smaller LR with warmup" \
  --priority 7

# Watch it start in TUI immediately
# Queue position visible, prewarm status shown
```

This dual-interface approach gives researchers the best of both worlds: **scriptability when they need it, visibility when they want it**.

---

## How This Maps to Your Current Architecture

✅ **Already correct**:

- Server-centric with dual client interfaces (CLI + TUI)
- WebSocket communication (CLI)
- SSH-based TUI with Bubble Tea (interactive monitoring)
- Embedded rsync in server
- Flexible queue backend (Redis/SQLite/filesystem)
- Priority scheduling
- Prewarm mechanism for NAS prefetch
- **Fair queueing philosophy** - `queue` not `run`
- TUI shows live updates: jobs, queue, GPU status, logs

🎯 **Natural extensions**:

- Queue-time narrative flags for `ml queue` (hypothesis/context/intent/etc.)
- CLI commands for diffing and finding (and higher-level comparison workflows)
- TUI panels for hypothesis/learnings (in job details)
- Reproducibility validation improvements (extend `ml validate`)
- Export/import for collaboration
- Graceful degradation (filesystem-only mode)
- Visible queue position and fairness metrics

📝 **Design considerations**:

- Show prewarm state/progress in `ml status`
- Show queue position and ETA in both CLI and TUI
- Add research context fields to manifests
- Build comparison workflows (diff, similar, why-different)
- Support hypothesis tracking in both interfaces
- Create export bundles for sharing
- Expose fairness metrics (wait time distribution, resource utilization)
- TUI could show narrative snippets in job list (hypothesis as subtitle?)
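
The "wait time distribution" fairness metric above could be derived from manifests alone. The following is a minimal sketch, assuming each manifest records `queued_at` and `started_at` as ISO 8601 UTC timestamps (as the example manifest in this document does). The helper names `parse_ts`, `wait_seconds`, and `wait_summary` are hypothetical, and the sample runs are illustrative data, not real results.

```python
from datetime import datetime
from statistics import median


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp that ends in 'Z' (UTC)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def wait_seconds(manifest: dict) -> float:
    """Seconds a run spent in the queue before it started."""
    delta = parse_ts(manifest["started_at"]) - parse_ts(manifest["queued_at"])
    return delta.total_seconds()


def wait_summary(manifests: list[dict]) -> dict:
    """Summarize queue wait across runs that have already started."""
    waits = sorted(wait_seconds(m) for m in manifests if m.get("started_at"))
    if not waits:
        return {"count": 0}
    return {"count": len(waits), "median_s": median(waits), "max_s": waits[-1]}


# Illustrative data only (hypothetical runs).
runs = [
    {"queued_at": "2024-01-15T10:25:00Z", "started_at": "2024-01-15T10:30:00Z"},
    {"queued_at": "2024-01-15T10:26:00Z", "started_at": "2024-01-15T11:26:00Z"},
    {"queued_at": "2024-01-15T10:27:00Z"},  # still queued, so it is skipped
]
print(wait_summary(runs))
```

Because the summary is computed from plain-text manifests, the same numbers could be shown by either the CLI or the TUI without extra server state.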

**TUI Research Narrative Integration Ideas:**

```
┌─ ML Jobs & Queue ─────────────────────────────────────┐
│ > imagenet_baseline                                   │
│   ✓ finished | Priority: 5                            │
│   "Testing baseline performance before ablations"     │
│                                                       │
│   batch_size_64                                       │
│   ▶ running (epoch 45/100) | Priority: 5              │
│   "Validating linear LR scaling hypothesis"           │
│                                                       │
│   warmup_test                                         │
│   ⏳ queued (position 2) | Priority: 3                 │
│   "Following up on advisor suggestion about warmup"   │
└───────────────────────────────────────────────────────┘
Press 'n' to view narrative, 'a' to annotate
```

**Implementation status (today):**

- **Annotations are implemented** and stored at the **root** of `run_manifest.json` as `annotations[]`.
- **Narrative fields are implemented** and stored under `run_manifest.json` as `narrative` (set/update via CLI).
- Use `ml annotate --note "..." [--author "..."]` to append an entry.
- Remaining gaps are around **queue-time capture**, **post-run learnings/outcomes**, and **TUI-first narrative UX**.

Example `run_manifest.json` (the `//` comments are explanatory only and would not appear in a real manifest):

```json
{
  // === Standard Execution Metadata ===
  "run_id": "2024-01-15_abc123",
  "status": "completed",
  "command": "train.py --lr 0.001 --epochs 100 --batch-size 64",
  "queued_at": "2024-01-15T10:25:00Z",
  "started_at": "2024-01-15T10:30:00Z",
  "ended_at": "2024-01-15T14:45:00Z",
  "exit_code": 0,
  "priority": 5,

  // === Research Narrative (The Important Part) ===
  "narrative": {
    // WHY did you run this?
    "hypothesis": "Larger batch size with linear LR scaling should improve convergence speed without hurting final accuracy",

    // WHAT were you thinking at the time?
    "context": "Previous run (run_789) with batch=32 took 8 hours and plateaued at 0.85. Paper XYZ suggests linear scaling rule should work.",

    // WHAT were you trying to accomplish?
    "intent": "Test if doubling batch size (32→64) with 2x learning rate maintains accuracy while reducing training time",

    // WHAT did you expect to happen?
    "expected_outcome": "Similar final accuracy (~0.85) but ~4 hour training time instead of 8",

    // HOW is this related to other experiments?
    "parent_run": "2024-01-14_run789",
    "experiment_group": "batch-size-scaling-ablation",
    "tags": ["ablation", "batch-size", "convergence-speed", "paper-xyz-reproduction"],

    // WHAT did you learn? (filled in post-run or during)
    "outcome": "Success: accuracy=0.87 (+0.02), time=3.5h (-56%). Linear scaling rule validated.",
    "learnings": [
      "Linear LR scaling worked as expected from paper XYZ",
      "GPU memory utilization went from 60% to 95% - near limit",
      "Convergence was actually smoother (fewer spikes in loss curve)",
      "Could probably push to batch=96 before OOM"
    ],
    "next_steps": [
      "Try batch=96 to maximize GPU utilization",
      "Test if this scales to batch=128 with gradient accumulation",
      "Validate on other datasets (currently only tested on ImageNet)"
    ],
    "validation_status": "validates" // or "refutes", "inconclusive", "partial"
  },

  // Human annotations added later
  "annotations": [
    {
      "timestamp": "2024-01-15T15:00:00Z",
      "author": "user@lab.edu",
      "note": "This result is strong enough for the paper. Use these hyperparams for final training."
    },
    {
      "timestamp": "2024-01-16T09:00:00Z",
      "author": "advisor@lab.edu",
      "note": "Good work. Also compare with warmup schedule before finalizing."
    }
  ],

  // === Reproducibility Metadata ===
  "environment": {
    "git_commit": "a1b2c3d4",
    "git_dirty": false,
    "git_branch": "experiment/batch-scaling",
    "container_image": "pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime",
    "container_digest": "sha256:abc123...",
    "pip_freeze": "torch==2.0.1\ntorchvision==0.15.2\n...",
    "cuda_version": "11.8",
    "gpu_driver": "525.105.17",
    "python_version": "3.10.12"
  },

  // === Data Provenance ===
  "datasets": [
    {
      "name": "imagenet-train",
      "nas_path": "/nas/datasets/imagenet/ILSVRC2012/train",
      "checksum": "sha256:def456...",
      "size_gb": 144.2,
      "num_samples": 1281167,
      "version": "ILSVRC2012",
      "fetched_via": "prewarm",
      "fetch_time_seconds": 180
    }
  ],

  // === Resource Usage ===
  "resources": {
    "requested": {
      "gpus": 1,
      "gpu_memory_gb": 24,
      "cpu_cores": 8,
      "ram_gb": 32
    },
    "actual": {
      "gpu_utilization_avg": 95,
      "gpu_memory_peak_gb": 22.8,
      "cpu_utilization_avg": 45,
      "ram_peak_gb": 28.5,
      "disk_read_gb": 145,
      "disk_write_gb": 12
    },
    "gpu_model": "NVIDIA RTX 3090",
    "host": "ml-server-01"
  },

  // === Results ===
  "metrics": {
    "final_train_accuracy": 0.891,
    "final_val_accuracy": 0.873,
    "final_train_loss": 0.234,
    "final_val_loss": 0.287,
    "best_val_accuracy": 0.876,
    "best_epoch": 87,
    "total_epochs": 100,
    "training_time_hours": 3.52
  },

  // === Artifacts ===
  "artifacts": {
    "discovery_time": "2024-01-15T14:45:00Z",
    "files": [
      {
        "path": "checkpoints/epoch_010.pth",
        "size_bytes": 450000000,
        "modified": "2024-01-15T11:30:00Z"
      },
      {
        "path": "checkpoints/best.pth",
        "size_bytes": 450000000,
        "modified": "2024-01-15T13:45:00Z"
      }
    ],
    "total_size_bytes": 900000000
  }
}
```
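
The "rebuildable index" idea from the Findability notes can be illustrated against manifests shaped like the one above. This is only a sketch under stated assumptions: one directory per run directly under the experiments directory, each holding a strict-JSON `run_manifest.json` (no `//` comments), with tags under `narrative.tags`. The function name `build_tag_index` and the sample run names are hypothetical and not part of the runner.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def build_tag_index(experiments_dir: Path) -> dict[str, list[str]]:
    """Map each narrative tag to the run IDs that carry it.

    The index is derived only from files on disk, so it can be
    discarded and rebuilt at any time.
    """
    index: dict[str, list[str]] = defaultdict(list)
    for path in sorted(experiments_dir.glob("*/run_manifest.json")):
        try:
            manifest = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable manifest: skip it rather than fail the rebuild
        run_id = manifest.get("run_id", path.parent.name)
        for tag in manifest.get("narrative", {}).get("tags", []):
            index[tag].append(run_id)
    return dict(index)


# Demo against a throwaway directory holding two made-up runs.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    samples = {"run_a": ["ablation", "batch-size"], "run_b": ["ablation"]}
    for run_id, tags in samples.items():
        run_dir = base / run_id
        run_dir.mkdir()
        manifest = {"run_id": run_id, "narrative": {"tags": tags}}
        (run_dir / "run_manifest.json").write_text(
            json.dumps(manifest), encoding="utf-8"
        )
    index = build_tag_index(base)

print(index)
```

Keeping the filesystem as the source of truth means a lost or stale index costs only a rescan, which fits the plain-text, long-term-readable goal of the manifests.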