
Research-First Runner: Missing Themes Plan

This file captures additional themes that are commonly missing in existing ML runners/experiment tools, translated into actionable design targets for a lightweight, research-first runner.

Quick Overview

What makes this different:

  • Your server, not their cloud: Everything runs on your homelab/workstation/uni server
  • Dual interfaces: Zig CLI for scripting + SSH-accessible TUI for interactive work
  • Fair queueing: ml queue (not run) makes resource sharing explicit
  • Research narrative: Capture why you ran experiments, not just what ran
  • Zero SaaS: No accounts, web dashboards, or external services
  • Plain text everything: Human-readable manifests, long-term reproducibility

Perfect for: Researchers in uni labs, homelab enthusiasts, and small research groups who want control over their infrastructure without cloud vendor lock-in.

Architecture Context

Server-Centric Model for Homelab/Workstation/Uni Lab:

  • Two client interfaces:
    • Zig CLI: Thin WebSocket client for scripting, automation, remote access
    • SSH-accessible TUI: Interactive Bubble Tea UI for monitoring when SSH'd into server
  • Go API server with embedded rsync (reduces dependencies)
  • Worker pulls from flexible queue backend (Redis/SQLite/filesystem)
  • Priority-based scheduling with prewarm mechanism
  • NAS integration for data prefetching
  • Target: single server, workstation, or small uni lab cluster (not cloud/SaaS)

Client Access Patterns:

# CLI (from anywhere via WebSocket)
ml queue train.py --epochs 100
ml status --watch
ml info <path|id>

# TUI (when SSH'd into server or jump box)
ssh mluser@worker.local
ml-tui  # Interactive terminal UI
# Navigate with keyboard, see live updates

Configuration:

# ~/.ml/config.toml (shared by both CLI and TUI)
worker_host = "worker.local"
worker_user = "mluser" 
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"

Plan (Missing Themes)

Implemented Today (in this repo)

  • Runs are queued via ml queue and processed by workers.
  • Run provenance is written to run_manifest.json.
  • You can attach queue-time notes with ml queue --note "..." (persisted as metadata.note in run_manifest.json).
  • Queue backends support Redis, SQLite, and the filesystem (the filesystem backend also serves as an optional fallback).
  • CLI + SSH-launched TUI are both available (ml monitor launches the TUI).

Future Ideas (this document)

1. Own-infrastructure-first, research-centric by default

2. Minimal server dependencies (simple operations)

3. Text-first tracking (logs > dashboards)

  • Research narrative completion: post-run outcome/learnings/next steps captured in the manifest
  • Auto-captured context:
    • Command + args (as sent from CLI)
    • Timestamps (queue time, start time, end time)
    • Git commit hash (and optionally diff)
    • Environment snapshot (pip freeze, conda export, container image digest)
    • Hardware context (GPU model, driver version, CUDA version)
  • Plain text manifests: JSON or YAML, never binary blobs
  • Stable formats: Can read experiments from 5 years ago without the runner

Implementation note: Server writes run_manifest.json to experiment directory. CLI can display it via ml info.
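As a sketch of that implementation note, the server-side write path could look like this (the capture_context and write_manifest helpers are illustrative names, not the runner's actual API):

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def capture_context(command: list[str]) -> dict:
    """Auto-capture queue-time context; every field is best-effort."""
    ctx = {
        "command": " ".join(command),
        "queued_at": datetime.now(timezone.utc).isoformat(),
    }
    try:
        ctx["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True, stderr=subprocess.DEVNULL,
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        ctx["git_commit"] = None  # plain-text manifests tolerate gaps
    return ctx

def write_manifest(run_dir: Path, context: dict) -> Path:
    """Write a human-readable JSON manifest into the experiment directory."""
    path = run_dir / "run_manifest.json"
    path.write_text(json.dumps(context, indent=2, sort_keys=True))
    return path
```

Because the manifest is plain JSON on disk, ml info only needs to render the same file that ls and cat already expose.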

4. CLI and TUI as complementary interfaces

  • Consistent CLI scripting UX: Future idea (uniform --json, quiet modes, and stable exit codes across commands)
  • TUI feature parity: Future idea (surface the same key details in TUI + CLI: queue position/ETA, narrative, validation results)

5. Failure-tolerant, messy-research friendly

  • Failure is first-class: Failed runs stay visible and queryable
  • Partial artifacts preserved: Keep artifacts/logs up to failure point (including checkpoints, if the script produces them)
  • No punishment for refactors: Script renames don't break history
  • Grouping/tagging: Label attempts (baseline/ablation/debug/exploration)

Server implementation: Worker should catch exceptions, record failure reason, preserve state. Queue should track failure modes (OOM, timeout, code error, data error).
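One plausible shape for that failure taxonomy; the exit-code mapping, category names, and log patterns here are assumptions for illustration, not the runner's actual schema:

```python
import signal

def classify_failure(exit_code: int, log_tail: str = "") -> str:
    """Map a finished process to a queryable failure mode."""
    if exit_code == 0:
        return "success"
    if exit_code == 128 + signal.SIGKILL or "CUDA out of memory" in log_tail:
        return "oom"  # 137: SIGKILL, commonly the kernel OOM killer
    if exit_code == 128 + signal.SIGTERM:
        return "timeout"  # 143: cancelled or exceeded max runtime
    if "FileNotFoundError" in log_tail:
        return "data_error"  # crude heuristic; a real worker would do better
    return "code_error"
```

The classification lands in the manifest, so failed runs stay visible and queryable instead of vanishing.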

6. Minimal abstraction over Python (transparent execution)

  • Run scripts as-is: No decorators, no framework rewrites
  • Preserve debuggability: Clean stack traces, pdb works
  • Optional instrumentation: Explicit metric logging via simple API
    # Optional, not required
    from ml_runner import log_metric
    log_metric("loss", 0.5, step=100)
    
  • Standard I/O works: print() goes to logs, arguments via sys.argv

Server implementation: Worker spawns process, captures stdout/stderr, parses optional structured logs. No magic wrappers that hide what's happening.
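A minimal sketch of that transparent-execution contract: spawn the script unmodified, tee its output to a log file, and treat structured metric lines as an opt-in convention (the "@metric " prefix is an assumed wire format, not the real one):

```python
import json
import subprocess
from pathlib import Path

METRIC_PREFIX = "@metric "  # assumed sentinel for opt-in structured lines

def run_script(argv: list[str], run_dir: Path) -> tuple[int, list[dict]]:
    """Spawn the script exactly as given, tee stdout/stderr to a log file,
    and collect optional structured metric lines along the way."""
    metrics: list[dict] = []
    log_path = run_dir / "stdout.log"
    with log_path.open("w") as log, subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        for line in proc.stdout:
            log.write(line)  # plain print() output is preserved verbatim
            if line.startswith(METRIC_PREFIX):
                metrics.append(json.loads(line[len(METRIC_PREFIX):]))
    return proc.returncode, metrics
```

Nothing wraps the script itself, so running it directly with python train.py behaves identically, and pdb and stack traces stay clean.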

7. Reproducibility that survives time

  • Immutable run folders: Server never modifies completed runs
  • Environment capture (best-effort, pluggable):
    • Container image digest (primary method)
    • pip freeze / uv pip freeze / poetry.lock
    • conda env export
    • nix flake.lock (if available)
  • Hardware fingerprint: GPU model, driver, CUDA, CPU, RAM
  • Data provenance: Dataset checksums, NAS paths, version identifiers
  • Commit everything: Store full environment, even if verbose

Server implementation: Pre-run hook captures environment. Store in run_manifest.json. Validate on ml validate <run-id>.

8. Small compute and shared machine friendliness

  • Energy awareness: Respect that homelabs pay electricity bills
  • Laptop-friendly: Support thermal/power throttling
  • Single-GPU to 4-GPU range: Optimize for typical research setups
  • No cluster assumptions: Don't require Kubernetes/SLURM/etc.

9. Server-side storage with client-side visibility

Why this matters: Researchers want to ls experiment directories but don't want to manually sync. Server handles storage, CLI provides views.

11. Research narrative (lab notebook, not job IDs)

  • Queue-time narrative capture: Future idea (add --hypothesis, --context, --intent, etc. to ml queue)
  • Post-run learning capture: Future idea (explicit outcome, learnings[], next_steps[], and validation status)
  • Narrative UX: Future idea (view/edit narrative from TUI/CLI without hand-editing JSON)

CLI commands:

ml queue train.py --note "Testing warmup hypothesis from paper X"

12. Live observability without magic
  • CLI: WebSocket streaming for --watch and --follow
  • TUI: Live refresh (500ms tick), immediate queue updates
  • No magic: Minimize implicit behavior
    • Explicit is better than clever
    • Defaults should be obvious and documented
    • Side effects should be visible (both in CLI and TUI)
    • Configuration hierarchy clear: CLI flags > env > config file > defaults

TUI advantages for observability:

  • See everything at once: jobs, queue, GPUs, containers, logs
  • Keyboard shortcuts for common operations
  • Instant feedback on actions (queue, cancel, delete)
  • Prewarm state visible in GPU panel
  • No need to run multiple ml status commands

13. Support clear thinking during experimentation

  • Optimize for cognitive throughput:
    • Make it easy to remember what you were thinking
    • Surface patterns across experiments
    • Warn about near-duplicates before running
  • Built-in comparison:
    # Future ideas:
    # ml diff <run-a> <run-b>
    # ml similar <run-id>
    
  • Learning from history:
    # Future ideas:
    # ml lessons --tag ablation
    # ml dead-ends
    
  • Hypothesis tracking:
    • Link hypothesis → experiment → outcome → next hypothesis
    • Mark outcomes: validates/refutes/inconclusive
  • Reduce cognitive load:
    • Natural queries: Future idea (search over manifests/notes)
    • Show relevant history when queueing
    • Don't make researchers remember IDs

Server implementation: Maintain index (rebuildable from filesystem). Support semantic queries over manifests, notes, tags.
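A sketch of the rebuildable index: scan run_manifest.json files under the experiments root and derive a throwaway lookup structure (field names follow the example manifest later in this document; find_by_tag stands in for richer semantic queries):

```python
import json
from pathlib import Path

def rebuild_index(experiments_root: Path) -> dict:
    """Rebuild the search index purely from the filesystem; the plain-text
    manifests stay the source of truth and the index is a disposable cache."""
    index = {}
    for manifest in sorted(experiments_root.glob("*/run_manifest.json")):
        data = json.loads(manifest.read_text())
        run_id = data.get("run_id", manifest.parent.name)
        index[run_id] = {
            "status": data.get("status"),
            "tags": data.get("narrative", {}).get("tags", []),
            "note": data.get("metadata", {}).get("note"),
        }
    return index

def find_by_tag(index: dict, tag: str) -> list:
    """Toy query: all runs carrying a given tag."""
    return [run_id for run_id, entry in index.items() if tag in entry["tags"]]
```

Deleting the index costs nothing; a rescan of the directory tree recreates it.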

14. Fast iteration velocity

  • Easy modification:
    # Future ideas:
    # ml clone <run-id>
    # ml fork <run-id>
    
  • Batch operations:
    # Future idea: ml sweep
    

Why prewarm matters: Your NAS prefetch in prewarm means jobs start training immediately instead of waiting for data. This dramatically improves iteration velocity.

15. Full research lifecycle support

  • Exploration phase: Minimal metadata, quick runs
  • Development phase: Group attempts, compare variations
  • Validation phase: Strict reproducibility, complete capture
  • Publication phase: Export bundles, generate reproduction instructions
  • Maintenance phase: Long-term readable, re-executable years later

Reproducibility levels (your strict/best-effort model):

# Future idea: --repro-level
ml validate <run-id>                        # Future idea: expand validation coverage + outputs

16. Collaboration without platforms

  • Async collaboration (no shared server required):
    # Future ideas:
    # ml export <run-id> --bundle run_42.tar.gz
    # ml import run_42.tar.gz
    
  • Selective sharing:
    # Future ideas:
    # ml export <run-id> --metadata-only
    # ml export <run-id> --include-artifacts
    
  • Review-friendly:
    • Self-contained bundles
    • All provenance included
    • Reproducibility instructions
    • No "install our platform" friction

Server implementation: Export packages run_manifest.json + artifacts into tarball. Import validates and unpacks into experiments directory.
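A sketch of the export/import pair using plain tarballs (the function names and the artifacts/ layout are assumptions; only run_manifest.json matches the rest of this document):

```python
import json
import tarfile
from pathlib import Path

def export_run(run_dir: Path, bundle_path: Path, metadata_only: bool = False):
    """Package a run as a self-contained tarball; metadata_only maps to the
    proposed --metadata-only flag (manifest without artifacts)."""
    with tarfile.open(bundle_path, "w:gz") as tar:
        tar.add(run_dir / "run_manifest.json", arcname="run_manifest.json")
        if not metadata_only:
            artifacts = run_dir / "artifacts"
            if artifacts.exists():
                tar.add(artifacts, arcname="artifacts")

def import_run(bundle_path: Path, experiments_root: Path) -> Path:
    """Validate that the bundle carries a manifest, then unpack it."""
    with tarfile.open(bundle_path, "r:gz") as tar:
        if "run_manifest.json" not in tar.getnames():
            raise ValueError("bundle missing run_manifest.json")
        manifest = json.load(tar.extractfile("run_manifest.json"))
        dest = experiments_root / manifest["run_id"]
        dest.mkdir(parents=True, exist_ok=True)
        tar.extractall(dest)
    return dest
```

Because the bundle is a tarball of plain text plus artifacts, a collaborator can inspect it with tar -tzf without installing anything.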

17. Graceful degradation

  • Core works with minimal setup:
    • Filesystem-only queue (no Redis required)
    • SQLite for metadata (no Postgres)
    • Local execution (no remote targets needed)
  • Optional enhancements:
    • Redis for better multi-worker queueing
    • Git integration (works without git)
    • NAS prewarm (falls back to on-demand fetch)
    • WebSocket updates (falls back to polling)
  • Progressive disclosure:
    • Simple commands for simple cases
    • Advanced flags for power users
    • Features activate when available


18. Concrete features (derived from above)

Findability

# Future ideas:
# ml find "failed runs on GPU2 last week"
# ml find --note "warmup"

Server maintains rebuildable index over manifests, logs, tags.

Dataset provenance

{
  "datasets": [
    {
      "name": "imagenet-train",
      "nas_path": "/nas/datasets/imagenet/train",
      "checksum": "sha256:abc123...",
      "fetched_at": "2024-01-15T10:30:00Z",
      "fetch_method": "prewarm"
    }
  ]
}

Server validates checksums, warns on drift.
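Checksum validation here is ordinary SHA-256 over the dataset file, recorded in the manifest's sha256:<hex> notation; a minimal sketch:

```python
import hashlib
from pathlib import Path

def dataset_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large datasets fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()

def check_drift(path: Path, recorded: str) -> bool:
    """True if the data on disk no longer matches the manifest entry."""
    return dataset_checksum(path) != recorded
```

For a directory-shaped dataset the server would checksum each shard; the warning, not the mechanics, is the point.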

Prewarm observability

ml status
# Shows:
#   Next in queue: run_xyz (priority 5)
#   Prewarming: dataset imagenet-train (2/5 complete)
#   GPU 0: running run_abc (50% complete, ETA 2h)
#   GPU 1: idle

CLI queue/requeue workflows

Core principle: the runner does not introduce checkpoint conventions. The script should run identically when executed directly vs via ml.

Passive artifact tracking (future idea): worker records what files exist in the run directory after completion (or via configured glob patterns). Checkpoints are just artifacts.

Requeue = replay command with modifications (future idea):

# Original run
ml queue train.py --epochs 100 --save-dir ./checkpoints

# Requeue (continue)
ml requeue run_abc -- --resume ./checkpoints/best.pt --epochs 200

Arg merge strategies (future idea):

# Append new args (default)
ml requeue run_abc --append -- --resume ./checkpoints/best.pt

# Replace (rerun with only new args)
ml requeue run_abc --replace -- --epochs 200 --lr 3e-4

# Merge (override matching flags, keep the rest)
ml requeue run_abc --merge -- --epochs 200
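The three strategies could share one merge function; the flag parsing here is deliberately naive (each --flag takes at most one value, no positional handling) and purely illustrative of the proposed semantics:

```python
def merge_args(original: list[str], new: list[str],
               strategy: str = "append") -> list[str]:
    """Replay a run's args with modifications (sketch of requeue semantics)."""
    if strategy == "replace":
        return list(new)
    if strategy == "append":
        return original + new
    if strategy == "merge":
        def to_pairs(args: list[str]) -> dict:
            # Naive "--flag [value]" parser; one value per flag at most.
            pairs, i = {}, 0
            while i < len(args):
                if i + 1 < len(args) and not args[i + 1].startswith("--"):
                    pairs[args[i]] = args[i + 1]
                    i += 2
                else:
                    pairs[args[i]] = None
                    i += 1
            return pairs
        merged = to_pairs(original)
        merged.update(to_pairs(new))  # override matching flags, keep the rest
        out = []
        for flag, value in merged.items():  # dicts preserve insertion order
            out.append(flag)
            if value is not None:
                out.append(value)
        return out
    raise ValueError(f"unknown merge strategy: {strategy}")
```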

Optional staging (future idea): copy an artifact from the source run into the new run directory, then reference it with a placeholder.

ml requeue run_abc --stage checkpoints/best.pt -- \
  --resume {staged}/best.pt --epochs 200

Hardware/resource management

{
  "resources": {
    "gpus": 2,
    "gpu_memory_gb": 40,
    "cpu_cores": 16,
    "ram_gb": 64,
    "disk_gb": 100,
    "max_runtime_hours": 24
  }
}

Worker validates resources before pulling from queue. Server tracks utilization.
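The admission check itself can be a few lines over the resource keys shown above (a sketch; the real worker would also account for resources already committed to running jobs):

```python
def fits(request: dict, available: dict) -> bool:
    """Worker-side admission check: only pull a job whose resource request
    fits what the machine actually has. Missing request keys count as zero."""
    keys = ("gpus", "gpu_memory_gb", "cpu_cores", "ram_gb", "disk_gb")
    return all(request.get(k, 0) <= available.get(k, 0) for k in keys)
```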


Design Philosophy Summary (Server-Centric)

The goal is to build a research assistant that runs on YOUR server, not a platform that runs on someone else's cloud.

Every feature should answer:

  1. Does this help researchers understand what happened on the server?
  2. Does this make the server transparent instead of a black box?
  3. Does this work on a single workstation or small lab server?
  4. Does this respect that researchers SSH into the server?
  5. Does this make local data (NAS, scratch drives) first-class?

Architecture principles:

  • Server is the control plane: All logic, storage, scheduling on server
  • CLI is a thin client: Just communicates via WebSocket, no local state
  • Filesystem is still king: Server writes plain text, CLI reads via API
  • Queue-first for fairness: ml queue not ml run - explicit resource requests
  • Priority without hogging: Higher priority = earlier in queue, not exclusive access
  • Prewarm is a performance optimization: Best-effort, never required for correctness
  • NAS integration is native: Server understands mounted storage

When in doubt:

  • Server-side is better than client-side (for logic)
  • WebSocket is better than REST (for interactivity)
  • Embedded is better than external deps (rsync in server)
  • Flexible backend is better than required service (Redis OR SQLite OR filesystem)
  • Plain text is better than binary
  • Your hardware is better than their cloud

The runner should feel like SSH into your well-organized research server with powerful tools, not like operating a cloud platform. Whether you're using the CLI for automation or the TUI for interactive work, the experience should be transparent, fair, and research-focused.


Typical Research Workflows (CLI + TUI)

Morning Routine: Check What Happened Overnight

# From your laptop (via WebSocket)
ml status
# Shows: 2 finished, 1 running, 3 in queue

ml info run_abc --show-metrics
# Quick check: did the overnight run validate the hypothesis?

# If you need deep investigation, SSH in
ssh mluser@worker.local
ml-tui
# Visual inspection of logs, GPU usage, etc.

Starting a New Experiment Series

# Script a parameter sweep (CLI automation)
# Future idea: --hypothesis / --experiment-group flags
for lr in 1e-3 3e-4 1e-4; do
  ml queue train.py --lr $lr --priority 5
done

# Monitor in TUI (interactive)
ssh mluser@worker.local
ml-tui
# Watch queue, see ETA, check prewarm status

Debugging a Failed Run

# Notice failure via CLI
ml status
# run_xyz: failed (exit code 137) - OOM?

# Jump into TUI for investigation
ssh mluser@worker.local
ml-tui
# Navigate to run_xyz, press 'l' for logs
# See OOM error at batch 128
# Future idea: narrative/annotation UX in the TUI

End-of-Day Review

# TUI for visual summary
ssh mluser@worker.local
ml-tui
# Scroll through today's runs
# Future ideas: compare views, export bundles

Paper Writing Time (6 months later)

# Today: use the filesystem + run manifests
ml info <path|id>

# Future ideas: searching/filtering + comparison reports

# TUI for visual exploration
ssh mluser@worker.local
ml-tui
# Navigate through old experiments
# Press 'n' to read narratives
# Reconstruct your thought process

Collaborative Debugging with Advisor

# Both SSH into server simultaneously
ssh mluser@worker.local

# You run TUI to show current state
ml-tui
# Navigate to problem run, show logs live

# Advisor suggests fix
# You queue new run with their suggestion
# Future idea: --parent-run to link it to the failing run
ml queue train.py --lr 1e-4 \
  --note "Per advisor: try smaller LR with warmup" \
  --priority 7

# Watch it start in TUI immediately
# Queue position visible, prewarm status shown

This dual-interface approach gives researchers the best of both worlds: scriptability when they need it, visibility when they want it.


How This Maps to Your Current Architecture

✅ Already correct:

  • Server-centric with dual client interfaces (CLI + TUI)
  • WebSocket communication (CLI)
  • SSH-based TUI with Bubble Tea (interactive monitoring)
  • Embedded rsync in server
  • Flexible queue backend (Redis/SQLite/filesystem)
  • Priority scheduling
  • Prewarm mechanism for NAS prefetch
  • Fair queueing philosophy - queue not run
  • TUI shows live updates: jobs, queue, GPU status, logs

🎯 Natural extensions:

  • Queue-time narrative flags for ml queue (hypothesis/context/intent/etc.)
  • CLI commands for diffing and finding (and higher-level comparison workflows)
  • TUI panels for hypothesis/learnings (in job details)
  • Reproducibility validation improvements (extend ml validate)
  • Export/import for collaboration
  • Graceful degradation (filesystem-only mode)
  • Visible queue position and fairness metrics

📝 Design considerations:

  • Show prewarm state/progress in ml status
  • Show queue position and ETA in both CLI and TUI
  • Add research context fields to manifests
  • Build comparison workflows (diff, similar, why-different)
  • Support hypothesis tracking in both interfaces
  • Create export bundles for sharing
  • Expose fairness metrics (wait time distribution, resource utilization)
  • TUI could show narrative snippets in job list (hypothesis as subtitle?)

TUI Research Narrative Integration Ideas:

┌─ ML Jobs & Queue ─────────────────────────────────────┐
│ > imagenet_baseline                                   │
│   ✓ finished | Priority: 5                            │
│   "Testing baseline performance before ablations"     │
│                                                        │
│   batch_size_64                                       │
│   ▶ running (epoch 45/100) | Priority: 5             │
│   "Validating linear LR scaling hypothesis"           │
│                                                        │
│   warmup_test                                         │
│   ⏳ queued (position 2) | Priority: 3               │
│   "Following up on advisor suggestion about warmup"   │
└───────────────────────────────────────────────────────┘

Press 'n' to view narrative, 'a' to annotate

Implementation status (today):

  • Annotations are implemented and stored at the root of run_manifest.json as annotations[].
  • Narrative fields are implemented and stored in run_manifest.json under narrative (set/update via CLI).
  • Use ml annotate <path|run_id|task_id> --note "..." [--author "..."] to append an entry.
  • Remaining gaps are around queue-time capture, post-run learnings/outcomes, and TUI-first narrative UX.

Example run_manifest.json (the // comments are for illustration; the file on disk is plain JSON)

{
  // === Standard Execution Metadata ===
  "run_id": "2024-01-15_abc123",
  "status": "completed",
  "command": "train.py --lr 0.001 --epochs 100 --batch-size 64",
  "queued_at": "2024-01-15T10:25:00Z",
  "started_at": "2024-01-15T10:30:00Z",
  "ended_at": "2024-01-15T14:45:00Z",
  "exit_code": 0,
  "priority": 5,
  
  // === Research Narrative (The Important Part) ===
  "narrative": {
    // WHY did you run this?
    "hypothesis": "Larger batch size with linear LR scaling should improve convergence speed without hurting final accuracy",
    
    // WHAT were you thinking at the time?
    "context": "Previous run (run_789) with batch=32 took 8 hours and plateaued at 0.85. Paper XYZ suggests linear scaling rule should work.",
    
    // WHAT were you trying to accomplish?
    "intent": "Test if doubling batch size (32→64) with 2x learning rate maintains accuracy while reducing training time",
    
    // WHAT did you expect to happen?
    "expected_outcome": "Similar final accuracy (~0.85) but ~4 hour training time instead of 8",
    
    // HOW is this related to other experiments?
    "parent_run": "2024-01-14_run789",
    "experiment_group": "batch-size-scaling-ablation",
    "tags": ["ablation", "batch-size", "convergence-speed", "paper-xyz-reproduction"],
    
    // WHAT did you learn? (filled in post-run or during)
    "outcome": "Success: accuracy=0.87 (+0.02), time=3.5h (-56%). Linear scaling rule validated.",
    "learnings": [
      "Linear LR scaling worked as expected from paper XYZ",
      "GPU memory utilization went from 60% to 95% - near limit",
      "Convergence was actually smoother (fewer spikes in loss curve)",
      "Could probably push to batch=96 before OOM"
    ],
    "next_steps": [
      "Try batch=96 to maximize GPU utilization",
      "Test if this scales to batch=128 with gradient accumulation",
      "Validate on other datasets (currently only tested on ImageNet)"
    ],
    "validation_status": "validates",  // or "refutes", "inconclusive", "partial"
  },
  
  // Human annotations added later
  "annotations": [
    {
      "timestamp": "2024-01-15T15:00:00Z",
      "author": "user@lab.edu",
      "note": "This result is strong enough for the paper. Use these hyperparams for final training."
    },
    {
      "timestamp": "2024-01-16T09:00:00Z",
      "author": "advisor@lab.edu", 
      "note": "Good work. Also compare with warmup schedule before finalizing."
    }
  ],
  
  // === Reproducibility Metadata ===
  "environment": {
    "git_commit": "a1b2c3d4",
    "git_dirty": false,
    "git_branch": "experiment/batch-scaling",
    "container_image": "pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime",
    "container_digest": "sha256:abc123...",
    "pip_freeze": "torch==2.0.1\ntorchvision==0.15.2\n...",
    "cuda_version": "11.8",
    "gpu_driver": "525.105.17",
    "python_version": "3.10.12"
  },
  
  // === Data Provenance ===
  "datasets": [
    {
      "name": "imagenet-train",
      "nas_path": "/nas/datasets/imagenet/ILSVRC2012/train",
      "checksum": "sha256:def456...",
      "size_gb": 144.2,
      "num_samples": 1281167,
      "version": "ILSVRC2012",
      "fetched_via": "prewarm",
      "fetch_time_seconds": 180
    }
  ],
  
  // === Resource Usage ===
  "resources": {
    "requested": {
      "gpus": 1,
      "gpu_memory_gb": 24,
      "cpu_cores": 8,
      "ram_gb": 32
    },
    "actual": {
      "gpu_utilization_avg": 95,
      "gpu_memory_peak_gb": 22.8,
      "cpu_utilization_avg": 45,
      "ram_peak_gb": 28.5,
      "disk_read_gb": 145,
      "disk_write_gb": 12
    },
    "gpu_model": "NVIDIA RTX 3090",
    "host": "ml-server-01"
  },
  
  // === Results ===
  "metrics": {
    "final_train_accuracy": 0.891,
    "final_val_accuracy": 0.873,
    "final_train_loss": 0.234,
    "final_val_loss": 0.287,
    "best_val_accuracy": 0.876,
    "best_epoch": 87,
    "total_epochs": 100,
    "training_time_hours": 3.52
  },
  
  // === Artifacts ===
  "artifacts": {
    "discovery_time": "2024-01-15T14:45:00Z",
    "files": [
      {
        "path": "checkpoints/epoch_010.pth",
        "size_bytes": 450000000,
        "modified": "2024-01-15T11:30:00Z"
      },
      {
        "path": "checkpoints/best.pth",
        "size_bytes": 450000000,
        "modified": "2024-01-15T13:45:00Z"
      }
    ],
    "total_size_bytes": 900000000
  }
}