Research-First Runner: Missing Themes Plan
This file captures additional themes that are commonly missing in existing ML runners/experiment tools, translated into actionable design targets for a lightweight, research-first runner.
Quick Overview
What makes this different:
- Your server, not their cloud: Everything runs on your homelab/workstation/uni server
- Dual interfaces: Zig CLI for scripting + SSH-accessible TUI for interactive work
- Fair queueing: `ml queue` (not `ml run`) makes resource sharing explicit
- Research narrative: Capture why you ran experiments, not just what ran
- Zero SaaS: No accounts, web dashboards, or external services
- Plain text everything: Human-readable manifests, long-term reproducibility
Perfect for: Researchers in uni labs, homelab enthusiasts, small research groups who want control over their infrastructure without cloud vendor lock-in.
Architecture Context
Server-Centric Model for Homelab/Workstation/Uni Lab:
- Two client interfaces:
- Zig CLI: Thin WebSocket client for scripting, automation, remote access
- SSH-accessible TUI: Interactive Bubble Tea UI for monitoring when SSH'd into server
- Go API server with embedded rsync (reduces dependencies)
- Worker pulls from flexible queue backend (Redis/SQLite/filesystem)
- Priority-based scheduling with prewarm mechanism
- NAS integration for data prefetching
- Target: single server, workstation, or small uni lab cluster (not cloud/SaaS)
Client Access Patterns:
# CLI (from anywhere via WebSocket)
ml queue train.py --epochs 100
ml status --watch
ml info <path|id>
# TUI (when SSH'd into server or jump box)
ssh mluser@worker.local
ml-tui # Interactive terminal UI
# Navigate with keyboard, see live updates
Configuration:
# ~/.ml/config.toml (shared by both CLI and TUI)
worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"
Plan (Missing Themes)
Implemented Today (in this repo)
- Runs are queued via `ml queue` and processed by workers.
- Run provenance is written to `run_manifest.json`.
- You can attach queue-time notes with `ml queue --note "..."` (persisted under `run_manifest.json` → `metadata.note`).
- Queue backends support Redis / SQLite / filesystem (and optional filesystem fallback).
- CLI + SSH-launched TUI are both available (`ml monitor` launches the TUI).
Future Ideas (this document)
1. Own-infrastructure-first, research-centric by default
2. Minimal server dependencies (simple operations)
3. Text-first tracking (logs > dashboards)
- Research narrative completion: post-run outcome/learnings/next steps captured in the manifest
- Auto-captured context:
- Command + args (as sent from CLI)
- Timestamps (queue time, start time, end time)
- Git commit hash (and optionally diff)
- Environment snapshot (pip freeze, conda export, container image digest)
- Hardware context (GPU model, driver version, CUDA version)
- Plain text manifests: JSON or YAML, never binary blobs
- Stable formats: Can read experiments from 5 years ago without the runner
Implementation note: Server writes run_manifest.json to experiment directory. CLI can display it via ml info.
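The capture step above can be sketched in a few lines of Python. This is illustrative only: `write_manifest` and the exact field layout are assumptions modeled on the example manifest later in this document, not the runner's actual implementation.

```python
import json
import time
from pathlib import Path


def write_manifest(run_dir, command, args, extra=None):
    """Persist queue-time provenance as plain JSON in the run directory.

    Hypothetical sketch: field names mirror the example manifest in this
    document but are not a fixed schema.
    """
    manifest = {
        "command": command,
        "args": list(args),
        # ISO-8601 UTC timestamp, matching the "queued_at" style shown later
        "queued_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "metadata": extra or {},
    }
    path = Path(run_dir) / "run_manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return manifest
```

Because the output is plain JSON, `ml info` (or plain `cat`) can render it without any database.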
4. CLI and TUI as complementary interfaces
- Consistent CLI scripting UX: Future idea (uniform `--json`, quiet modes, and stable exit codes across commands)
- TUI feature parity: Future idea (surface the same key details in TUI + CLI: queue position/ETA, narrative, validation results)
5. Failure-tolerant, messy-research friendly
- Failure is first-class: Failed runs stay visible and queryable
- Partial artifacts preserved: Keep artifacts/logs up to failure point (including checkpoints, if the script produces them)
- No punishment for refactors: Script renames don't break history
- Grouping/tagging: Label attempts (baseline/ablation/debug/exploration)
Server implementation: Worker should catch exceptions, record failure reason, preserve state. Queue should track failure modes (OOM, timeout, code error, data error).
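A minimal sketch of the failure-classification idea, assuming the worker has the child process's exit code and a tail of stderr. The heuristics are conventions, not guarantees: 137 is SIGKILL (often the kernel OOM killer), 124 is the common timeout exit code; everything else falls back to log inspection.

```python
def classify_failure(exit_code, stderr_tail=""):
    """Bucket a finished run into the failure modes named above.

    Heuristic sketch only; a real worker would also consult cgroup/OOM
    events and its own timeout bookkeeping.
    """
    if exit_code == 0:
        return "success"
    if exit_code == 137:
        return "oom"          # 128 + SIGKILL(9), commonly the OOM killer
    if exit_code == 124:
        return "timeout"      # GNU timeout(1) convention
    if "FileNotFoundError" in stderr_tail or "No such file" in stderr_tail:
        return "data_error"
    return "code_error"
```

The label would be recorded in the manifest so failed runs stay queryable rather than disappearing.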
6. Minimal abstraction over Python (transparent execution)
- Run scripts as-is: No decorators, no framework rewrites
- Preserve debuggability: Clean stack traces, pdb works
- Optional instrumentation: Explicit metric logging via simple API
# Optional, not required
from ml_runner import log_metric
log_metric("loss", 0.5, step=100)
- Standard I/O works: `print()` goes to logs, arguments via `sys.argv`
Server implementation: Worker spawns process, captures stdout/stderr, parses optional structured logs. No magic wrappers that hide what's happening.
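One possible wire format for those optional structured logs: a line prefix the worker strips and parses, with everything else passing through as plain log output. The `@metric` prefix here is purely hypothetical, chosen only to make the parsing idea concrete.

```python
import json

# Hypothetical wire format for optional structured logs; not part of the
# real runner. Ordinary print() output never matches this prefix.
METRIC_PREFIX = "@metric "


def parse_metrics(stdout_lines):
    """Split captured stdout into structured metrics and plain log lines."""
    metrics, plain = [], []
    for line in stdout_lines:
        if line.startswith(METRIC_PREFIX):
            try:
                metrics.append(json.loads(line[len(METRIC_PREFIX):]))
                continue
            except json.JSONDecodeError:
                pass  # malformed metric lines stay in the plain log
        plain.append(line)
    return metrics, plain
```

Scripts that never emit the prefix see zero behavior change, which keeps the "no magic wrappers" promise.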
7. Reproducibility that survives time
- Immutable run folders: Server never modifies completed runs
- Environment capture (best-effort, pluggable):
- Container image digest (primary method)
- `pip freeze` / `uv pip freeze` / `poetry.lock`
- `conda env export`
- `nix flake.lock` (if available)
- Hardware fingerprint: GPU model, driver, CUDA, CPU, RAM
- Data provenance: Dataset checksums, NAS paths, version identifiers
- Commit everything: Store full environment, even if verbose
Server implementation: Pre-run hook captures environment. Store in run_manifest.json. Validate on ml validate <run-id>.
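A best-effort pre-run hook might look like the sketch below. Each probe is optional and failures are swallowed, so a missing tool never blocks a run; the function name and probe set are illustrative assumptions, not the runner's API.

```python
import platform
import subprocess
import sys


def capture_environment():
    """Best-effort, pluggable environment snapshot (sketch).

    Each external probe is independent: if the tool is missing or fails,
    its key is simply absent from the result.
    """
    env = {"python_version": platform.python_version()}
    probes = {
        "git_commit": ["git", "rev-parse", "HEAD"],
        "pip_freeze": [sys.executable, "-m", "pip", "freeze"],
    }
    for key, cmd in probes.items():
        try:
            out = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            if out.returncode == 0:
                env[key] = out.stdout.strip()
        except (OSError, subprocess.SubprocessError):
            pass  # tool not installed or hung: skip, don't fail the run
    return env
```

The resulting dict would be embedded in `run_manifest.json`, matching the "commit everything, even if verbose" principle.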
8. Small compute and shared machine friendliness
- Energy awareness: Respect that homelabs pay electricity bills
- Laptop-friendly: Support thermal/power throttling
- Single-GPU to 4-GPU range: Optimize for typical research setups
- No cluster assumptions: Don't require Kubernetes/SLURM/etc.
9. Server-side storage with client-side visibility
Why this matters: Researchers want to `ls` experiment directories but don't want to manually sync. Server handles storage, CLI provides views.
11. Research narrative (lab notebook, not job IDs)
- Queue-time narrative capture: Future idea (add `--hypothesis`, `--context`, `--intent`, etc. to `ml queue`)
- Post-run learning capture: Future idea (explicit `outcome`, `learnings[]`, `next_steps[]`, and validation status)
- Narrative UX: Future idea (view/edit narrative from TUI/CLI without hand-editing JSON)
CLI commands:
ml queue train.py --note "Testing warmup hypothesis from paper X"
- CLI: WebSocket streaming for `--watch` and `--follow`
- TUI: Live refresh (500ms tick), immediate queue updates
- No magic: Minimize implicit behavior
- Explicit is better than clever
- Defaults should be obvious and documented
- Side effects should be visible (both in CLI and TUI)
- Configuration hierarchy clear: CLI flags > env > config file > defaults
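The precedence rule can be made concrete with a small resolver. The `ML_`-prefixed environment-variable scheme is an assumption for illustration; the point is only that each layer is checked in a fixed, documented order.

```python
import os


def resolve_setting(name, cli_value=None, config=None, default=None, environ=None):
    """Resolve one setting by the stated precedence:
    CLI flags > environment > config file > defaults.

    `environ` is injectable for testing; the ML_ prefix is an assumed
    naming convention, not the real one.
    """
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    env_key = "ML_" + name.upper()
    if env_key in environ:
        return environ[env_key]
    if config and name in config:
        return config[name]
    return default
```

Making the resolver a single function also makes the hierarchy trivial to document and test, which is the "defaults should be obvious" goal above.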
TUI advantages for observability:
- See everything at once: jobs, queue, GPUs, containers, logs
- Keyboard shortcuts for common operations
- Instant feedback on actions (queue, cancel, delete)
- Prewarm state visible in GPU panel
- No need to run multiple `ml status` commands
13. Support clear thinking during experimentation
- Optimize for cognitive throughput:
- Make it easy to remember what you were thinking
- Surface patterns across experiments
- Warn about near-duplicates before running
- Built-in comparison:
# Future ideas:
# ml diff <run-a> <run-b>
# ml similar <run-id>
- Learning from history:
# Future ideas:
# ml lessons --tag ablation
# ml dead-ends
- Hypothesis tracking:
- Link hypothesis → experiment → outcome → next hypothesis
- Mark outcomes: validates/refutes/inconclusive
- Reduce cognitive load:
- Natural queries: Future idea (search over manifests/notes)
- Show relevant history when queueing
- Don't make researchers remember IDs
Server implementation: Maintain index (rebuildable from filesystem). Support semantic queries over manifests, notes, tags.
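Because manifests are plain files, the index can be pure derived state. A sketch of a rebuild pass follows, assuming one run per directory, each containing a `run_manifest.json` (layout and function name are illustrative):

```python
import json
from pathlib import Path


def rebuild_index(experiments_root):
    """Walk the experiment tree and derive an in-memory index.

    The filesystem stays the source of truth; the index is disposable
    and can always be rebuilt from the manifests.
    """
    index = {}
    for manifest_path in Path(experiments_root).glob("*/run_manifest.json"):
        try:
            manifest = json.loads(manifest_path.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # unreadable runs are skipped, not fatal
        run_id = manifest.get("run_id", manifest_path.parent.name)
        index[run_id] = {
            "status": manifest.get("status"),
            "tags": manifest.get("narrative", {}).get("tags", []),
            "path": str(manifest_path.parent),
        }
    return index
```

Queries like "failed runs tagged ablation" then become dictionary filters over this structure rather than database lookups.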
14. Fast iteration velocity
- Easy modification:
# Future ideas:
# ml clone <run-id>
# ml fork <run-id>
- Batch operations:
# Future idea: ml sweep
Why prewarm matters: Your NAS prefetch in prewarm means jobs start training immediately instead of waiting for data. This dramatically improves iteration velocity.
15. Full research lifecycle support
- Exploration phase: Minimal metadata, quick runs
- Development phase: Group attempts, compare variations
- Validation phase: Strict reproducibility, complete capture
- Publication phase: Export bundles, generate reproduction instructions
- Maintenance phase: Long-term readable, re-executable years later
Reproducibility levels (your strict/best-effort model):
# Future idea: --repro-level
ml validate <commit_id> # Future idea: expand validation coverage + outputs
16. Collaboration without platforms
- Async collaboration (no shared server required):
# Future ideas:
# ml export <run-id> --bundle run_42.tar.gz
# ml import run_42.tar.gz
- Selective sharing:
# Future ideas:
# ml export <run-id> --metadata-only
# ml export <run-id> --include-artifacts
- Review-friendly:
- Self-contained bundles
- All provenance included
- Reproducibility instructions
- No "install our platform" friction
Server implementation: Export packages run_manifest.json + artifacts into tarball. Import validates and unpacks into experiments directory.
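A sketch of the export side, assuming the future `--metadata-only` behavior described above. `export_bundle` is an illustrative name, not an existing API.

```python
import tarfile
from pathlib import Path


def export_bundle(run_dir, bundle_path, metadata_only=False):
    """Pack a run into a self-contained tarball (sketch).

    The manifest always ships; artifacts are included unless the caller
    asks for metadata only, mirroring the proposed CLI flags.
    """
    run_dir = Path(run_dir)
    with tarfile.open(bundle_path, "w:gz") as tar:
        tar.add(run_dir / "run_manifest.json", arcname="run_manifest.json")
        if not metadata_only:
            for f in run_dir.rglob("*"):
                if f.is_file() and f.name != "run_manifest.json":
                    tar.add(f, arcname=str(f.relative_to(run_dir)))
    return bundle_path
```

Import would be the inverse: validate the manifest, then unpack into a fresh directory under the experiments root.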
17. Graceful degradation
- Core works with minimal setup:
- Filesystem-only queue (no Redis required)
- SQLite for metadata (no Postgres)
- Local execution (no remote targets needed)
- Optional enhancements:
- Redis for better multi-worker queueing
- Git integration (works without git)
- NAS prewarm (falls back to on-demand fetch)
- WebSocket updates (falls back to polling)
- Progressive disclosure:
- Simple commands for simple cases
- Advanced flags for power users
- Features activate when available
Implementation note:
18. Concrete features (derived from above)
Findability
# Future ideas:
# ml find "failed runs on GPU2 last week"
# ml find --note "warmup"
Server maintains rebuildable index over manifests, logs, tags.
Dataset provenance
{
"datasets": [
{
"name": "imagenet-train",
"nas_path": "/nas/datasets/imagenet/train",
"checksum": "sha256:abc123...",
"fetched_at": "2024-01-15T10:30:00Z",
"fetch_method": "prewarm"
}
]
}
Server validates checksums, warns on drift.
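Checksum validation reduces to streaming a digest and comparing it against the recorded `sha256:...` string; a sketch:

```python
import hashlib


def verify_checksum(path, recorded):
    """Recompute a file's digest and compare with the manifest value.

    `recorded` uses the "algo:hexdigest" form shown in the dataset
    provenance example; a mismatch signals dataset drift.
    """
    algo, _, expected = recorded.partition(":")
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        # Stream in 1 MiB chunks so large datasets don't load into RAM
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected
```

For multi-terabyte datasets a real implementation would likely checksum a manifest of per-file hashes rather than one monolithic stream.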
Prewarm observability
ml status
# Shows:
# Next in queue: run_xyz (priority 5)
# Prewarming: dataset imagenet-train (2/5 complete)
# GPU 0: running run_abc (50% complete, ETA 2h)
# GPU 1: idle
CLI queue/requeue workflows
Core principle: the runner does not introduce checkpoint conventions. The script should run identically whether executed directly or via `ml`.
Passive artifact tracking (future idea): worker records what files exist in the run directory after completion (or via configured glob patterns). Checkpoints are just artifacts.
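Passive discovery could be a single directory walk after the process exits. This sketch records relative path, size, and mtime, mirroring the artifacts block in the example manifest; the function name is illustrative.

```python
from pathlib import Path


def discover_artifacts(run_dir):
    """Record whatever files exist in the run directory after completion.

    No checkpoint convention is imposed: checkpoints, logs, and plots are
    all just files the script happened to write.
    """
    run_dir = Path(run_dir)
    files = []
    for f in sorted(run_dir.rglob("*")):
        if f.is_file():
            stat = f.stat()
            files.append({
                "path": str(f.relative_to(run_dir)),
                "size_bytes": stat.st_size,
                "modified": stat.st_mtime,
            })
    return {
        "files": files,
        "total_size_bytes": sum(x["size_bytes"] for x in files),
    }
```

A configured glob pattern (e.g. excluding scratch files) would be a thin filter on top of this walk.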
Requeue = replay command with modifications (future idea):
# Original run
ml queue train.py --epochs 100 --save-dir ./checkpoints
# Requeue (continue)
ml requeue run_abc -- --resume ./checkpoints/best.pt --epochs 200
Arg merge strategies (future idea):
# Append new args (default)
ml requeue run_abc --append -- --resume ./checkpoints/best.pt
# Replace (rerun with only new args)
ml requeue run_abc --replace -- --epochs 200 --lr 3e-4
# Merge (override matching flags, keep the rest)
ml requeue run_abc --merge -- --epochs 200
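The three strategies could be implemented as a pure function over argument lists. This sketch assumes simple `--flag value` pairs; a real implementation would need the script's argument spec to merge reliably.

```python
def merge_args(original, new, strategy="append"):
    """Combine a prior run's args with requeue args (sketch).

    append  -> keep original args, add the new ones after
    replace -> discard original args entirely
    merge   -> override matching --flags, keep the rest
    """
    if strategy == "replace":
        return list(new)
    if strategy == "append":
        return list(original) + list(new)
    if strategy == "merge":
        def to_pairs(args):
            # Naive pairing of "--flag value"; bare flags get value None
            pairs, i = [], 0
            while i < len(args):
                if (args[i].startswith("--") and i + 1 < len(args)
                        and not args[i + 1].startswith("--")):
                    pairs.append((args[i], args[i + 1]))
                    i += 2
                else:
                    pairs.append((args[i], None))
                    i += 1
            return pairs
        merged = dict(to_pairs(original))
        merged.update(to_pairs(new))  # new flags win, order preserved
        return [tok for k, v in merged.items()
                for tok in ((k,) if v is None else (k, v))]
    raise ValueError(f"unknown strategy: {strategy}")
```

Negative-number values and `--flag=value` syntax are exactly the edge cases that make "merge" the hardest of the three.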
Optional staging (future idea): copy an artifact from the source run into the new run directory, then reference it with a placeholder.
ml requeue run_abc --stage checkpoints/best.pt -- \
--resume {staged}/best.pt --epochs 200
Hardware/resource management
{
"resources": {
"gpus": 2,
"gpu_memory_gb": 40,
"cpu_cores": 16,
"ram_gb": 64,
"disk_gb": 100,
"max_runtime_hours": 24
}
}
Worker validates resources before pulling from queue. Server tracks utilization.
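Admission control is then a per-key comparison of the requested block against what the worker actually has. In this sketch, keys absent from either side are treated as unconstrained; key names follow the resources block above.

```python
def can_schedule(requested, available):
    """Worker-side admission check before pulling a job (sketch).

    A job is schedulable only if every requested resource fits within
    the worker's advertised capacity.
    """
    for key, want in requested.items():
        if key in available and want > available[key]:
            return False
    return True
```

A worker that cannot satisfy a job simply leaves it on the queue for a better-equipped peer, which keeps scheduling logic out of the client entirely.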
Design Philosophy Summary (Server-Centric)
The goal is to build a research assistant that runs on YOUR server, not a platform that runs on someone else's cloud.
Every feature should answer:
- Does this help researchers understand what happened on the server?
- Does this make the server transparent instead of a black box?
- Does this work on a single workstation or small lab server?
- Does this respect that researchers SSH into the server?
- Does this make local data (NAS, scratch drives) first-class?
Architecture principles:
- Server is the control plane: All logic, storage, scheduling on server
- CLI is a thin client: Just communicates via WebSocket, no local state
- Filesystem is still king: Server writes plain text, CLI reads via API
- Queue-first for fairness: `ml queue` not `ml run` (explicit resource requests)
- Priority without hogging: Higher priority = earlier in queue, not exclusive access
- Prewarm is a performance optimization: Best-effort, never required for correctness
- NAS integration is native: Server understands mounted storage
When in doubt:
- Server-side is better than client-side (for logic)
- WebSocket is better than REST (for interactivity)
- Embedded is better than external deps (rsync in server)
- Flexible backend is better than required service (Redis OR SQLite OR filesystem)
- Plain text is better than binary
- Your hardware is better than their cloud
The runner should feel like SSH-ing into your well-organized research server with powerful tools, not like operating a cloud platform. Whether you're using the CLI for automation or the TUI for interactive work, the experience should be transparent, fair, and research-focused.
Typical Research Workflows (CLI + TUI)
Morning Routine: Check What Happened Overnight
# From your laptop (via WebSocket)
ml status
# Shows: 2 finished, 1 running, 3 in queue
ml info run_abc --show-metrics
# Quick check: did the overnight run validate the hypothesis?
# If you need deep investigation, SSH in
ssh mluser@worker.local
ml-tui
# Visual inspection of logs, GPU usage, etc.
Starting a New Experiment Series
# Script a parameter sweep (CLI automation)
for lr in 1e-3 3e-4 1e-4; do
ml queue train.py --lr $lr \
# Future idea: --hypothesis / --experiment-group
--priority 5
done
# Monitor in TUI (interactive)
ssh mluser@worker.local
ml-tui
# Watch queue, see ETA, check prewarm status
Debugging a Failed Run
# Notice failure via CLI
ml status
# run_xyz: failed (exit code 137) - OOM?
# Jump into TUI for investigation
ssh mluser@worker.local
ml-tui
# Navigate to run_xyz, press 'l' for logs
# See OOM error at batch 128
# Future idea: narrative/annotation UX in the TUI
End-of-Day Review
# TUI for visual summary
ssh mluser@worker.local
ml-tui
# Scroll through today's runs
# Future ideas: compare views, export bundles
Paper Writing Time (6 months later)
# Today: use the filesystem + run manifests
ml info <path|id>
# Future ideas: searching/filtering + comparison reports
# TUI for visual exploration
ssh mluser@worker.local
ml-tui
# Navigate through old experiments
# Press 'n' to read narratives
# Reconstruct your thought process
Collaborative Debugging with Advisor
# Both SSH into server simultaneously
ssh mluser@worker.local
# You run TUI to show current state
ml-tui
# Navigate to problem run, show logs live
# Advisor suggests fix
# You queue new run with their suggestion
ml queue train.py --lr 1e-4 \
--note "Per advisor: try smaller LR with warmup" \
# Future idea: --parent-run
--priority 7
# Watch it start in TUI immediately
# Queue position visible, prewarm status shown
This dual-interface approach gives researchers the best of both worlds: scriptability when they need it, visibility when they want it.
How This Maps to Your Current Architecture
✅ Already correct:
- Server-centric with dual client interfaces (CLI + TUI)
- WebSocket communication (CLI)
- SSH-based TUI with Bubble Tea (interactive monitoring)
- Embedded rsync in server
- Flexible queue backend (Redis/SQLite/filesystem)
- Priority scheduling
- Prewarm mechanism for NAS prefetch
- Fair queueing philosophy: `queue`, not `run`
- TUI shows live updates: jobs, queue, GPU status, logs
🎯 Natural extensions:
- Queue-time narrative flags for `ml queue` (hypothesis/context/intent/etc.)
- CLI commands for diffing and finding (and higher-level comparison workflows)
- TUI panels for hypothesis/learnings (in job details)
- Reproducibility validation improvements (extend `ml validate`)
- Export/import for collaboration
- Graceful degradation (filesystem-only mode)
- Visible queue position and fairness metrics
📝 Design considerations:
- Show prewarm state/progress in `ml status`
- Show queue position and ETA in both CLI and TUI
- Add research context fields to manifests
- Build comparison workflows (diff, similar, why-different)
- Support hypothesis tracking in both interfaces
- Create export bundles for sharing
- Expose fairness metrics (wait time distribution, resource utilization)
- TUI could show narrative snippets in job list (hypothesis as subtitle?)
TUI Research Narrative Integration Ideas:
┌─ ML Jobs & Queue ─────────────────────────────────────┐
│ > imagenet_baseline │
│ ✓ finished | Priority: 5 │
│ "Testing baseline performance before ablations" │
│ │
│ batch_size_64 │
│ ▶ running (epoch 45/100) | Priority: 5 │
│ "Validating linear LR scaling hypothesis" │
│ │
│ warmup_test │
│ ⏳ queued (position 2) | Priority: 3 │
│ "Following up on advisor suggestion about warmup" │
└───────────────────────────────────────────────────────┘
Press 'n' to view narrative, 'a' to annotate
Implementation status (today):
- Annotations are implemented and stored at the root of `run_manifest.json` as `annotations[]`.
- Narrative fields are implemented and stored under `run_manifest.json` as `narrative` (set/update via CLI).
- Use `ml annotate <path|run_id|task_id> --note "..." [--author "..."]` to append an entry.
- Remaining gaps are around queue-time capture, post-run learnings/outcomes, and TUI-first narrative UX.
Example run_manifest.json
{
// === Standard Execution Metadata ===
"run_id": "2024-01-15_abc123",
"status": "completed",
"command": "train.py --lr 0.001 --epochs 100 --batch-size 64",
"queued_at": "2024-01-15T10:25:00Z",
"started_at": "2024-01-15T10:30:00Z",
"ended_at": "2024-01-15T14:45:00Z",
"exit_code": 0,
"priority": 5,
// === Research Narrative (The Important Part) ===
"narrative": {
// WHY did you run this?
"hypothesis": "Larger batch size with linear LR scaling should improve convergence speed without hurting final accuracy",
// WHAT were you thinking at the time?
"context": "Previous run (run_789) with batch=32 took 8 hours and plateaued at 0.85. Paper XYZ suggests linear scaling rule should work.",
// WHAT were you trying to accomplish?
"intent": "Test if doubling batch size (32→64) with 2x learning rate maintains accuracy while reducing training time",
// WHAT did you expect to happen?
"expected_outcome": "Similar final accuracy (~0.85) but ~4 hour training time instead of 8",
// HOW is this related to other experiments?
"parent_run": "2024-01-14_run789",
"experiment_group": "batch-size-scaling-ablation",
"tags": ["ablation", "batch-size", "convergence-speed", "paper-xyz-reproduction"],
// WHAT did you learn? (filled in post-run or during)
"outcome": "Success: accuracy=0.87 (+0.02), time=3.5h (-56%). Linear scaling rule validated.",
"learnings": [
"Linear LR scaling worked as expected from paper XYZ",
"GPU memory utilization went from 60% to 95% - near limit",
"Convergence was actually smoother (fewer spikes in loss curve)",
"Could probably push to batch=96 before OOM"
],
"next_steps": [
"Try batch=96 to maximize GPU utilization",
"Test if this scales to batch=128 with gradient accumulation",
"Validate on other datasets (currently only tested on ImageNet)"
],
"validation_status": "validates", // or "refutes", "inconclusive", "partial"
},
// Human annotations added later
"annotations": [
{
"timestamp": "2024-01-15T15:00:00Z",
"author": "user@lab.edu",
"note": "This result is strong enough for the paper. Use these hyperparams for final training."
},
{
"timestamp": "2024-01-16T09:00:00Z",
"author": "advisor@lab.edu",
"note": "Good work. Also compare with warmup schedule before finalizing."
}
],
// === Reproducibility Metadata ===
"environment": {
"git_commit": "a1b2c3d4",
"git_dirty": false,
"git_branch": "experiment/batch-scaling",
"container_image": "pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime",
"container_digest": "sha256:abc123...",
"pip_freeze": "torch==2.0.1\ntorchvision==0.15.2\n...",
"cuda_version": "11.8",
"gpu_driver": "525.105.17",
"python_version": "3.10.12"
},
// === Data Provenance ===
"datasets": [
{
"name": "imagenet-train",
"nas_path": "/nas/datasets/imagenet/ILSVRC2012/train",
"checksum": "sha256:def456...",
"size_gb": 144.2,
"num_samples": 1281167,
"version": "ILSVRC2012",
"fetched_via": "prewarm",
"fetch_time_seconds": 180
}
],
// === Resource Usage ===
"resources": {
"requested": {
"gpus": 1,
"gpu_memory_gb": 24,
"cpu_cores": 8,
"ram_gb": 32
},
"actual": {
"gpu_utilization_avg": 95,
"gpu_memory_peak_gb": 22.8,
"cpu_utilization_avg": 45,
"ram_peak_gb": 28.5,
"disk_read_gb": 145,
"disk_write_gb": 12
},
"gpu_model": "NVIDIA RTX 3090",
"host": "ml-server-01"
},
// === Results ===
"metrics": {
"final_train_accuracy": 0.891,
"final_val_accuracy": 0.873,
"final_train_loss": 0.234,
"final_val_loss": 0.287,
"best_val_accuracy": 0.876,
"best_epoch": 87,
"total_epochs": 100,
"training_time_hours": 3.52
},
// === Artifacts ===
"artifacts": {
"discovery_time": "2024-01-15T14:45:00Z",
"files": [
{
"path": "checkpoints/epoch_010.pth",
"size_bytes": 450000000,
"modified": "2024-01-15T11:30:00Z"
},
{
"path": "checkpoints/best.pth",
"size_bytes": 450000000,
"modified": "2024-01-15T13:45:00Z"
}
],
"total_size_bytes": 900000000
}
}