docs: comprehensive documentation updates

- Add architecture, CI/CD, CLI reference documentation
- Update installation, operations, and quick-start guides
- Add Jupyter workflow and queue documentation
- New landing page and research runner plan
Jeremie Fraeys 2026-02-12 12:05:27 -05:00
parent 2e701340e5
commit 5144d291cb
23 changed files with 790 additions and 110 deletions


@ -7,6 +7,7 @@
- Worker: stage verified `snapshot_id` into each task workspace and expose it to training code via `FETCH_ML_SNAPSHOT_DIR`.
- Worker: provenance enforcement is trustworthiness-by-default (fail-closed) with `provenance_best_effort` opt-in.
- CLI/API: add `ml validate` to fetch a validation report (commit/task) for provenance + integrity checks.
- Worker: persist discovered artifacts into `run_manifest.json` (`artifacts.discovery_time`, `artifacts.files[]`, `artifacts.total_size_bytes`) at task completion.
- Worker: best-effort environment prewarm can build a warmed Podman image keyed by `deps_manifest_sha256` and reuse it for subsequent tasks.
- Worker: export env prewarm hit/miss/built counters and total build time via the worker Prometheus metrics endpoint.
- API/Worker: `ml prune` also triggers best-effort garbage collection of warmed env images.
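The artifact-persistence bullet above can be sketched as a walk over a finished run directory. This is a minimal illustration, assuming a field layout matching the names in the changelog (`artifacts.discovery_time`, `artifacts.files[]`, `artifacts.total_size_bytes`); the worker's actual schema may differ.

```python
import json
import os
import tempfile
import time

def discover_artifacts(run_dir):
    """Summarise the files in a finished run directory.

    Field names mirror the changelog bullet; the exact layout is an
    illustrative assumption, not the worker's real schema.
    """
    files, total = [], 0
    for root, _, names in os.walk(run_dir):
        for name in sorted(names):
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            total += size
            files.append({"path": os.path.relpath(path, run_dir),
                          "size_bytes": size})
    return {
        "discovery_time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "files": files,
        "total_size_bytes": total,
    }

# Demo on a throwaway run directory with one 1 KiB "checkpoint".
run_dir = tempfile.mkdtemp()
with open(os.path.join(run_dir, "model.pt"), "wb") as f:
    f.write(b"\x00" * 1024)
artifacts = discover_artifacts(run_dir)
print(json.dumps(artifacts, indent=2))
```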


@ -9,11 +9,8 @@ This guide helps developers set up their environment and contribute effectively
git clone <your-repo>
cd fetch_ml
# Install dependencies
make setup-dev
# Start development environment
-make dev-start
+make dev-up
# Run tests
make test
@ -24,11 +21,10 @@ make test
### Prerequisites
- Go 1.25+
-- Zig 0.11+
+- Zig 0.15+
- Python 3.11+
- Docker & Docker Compose
- Redis
- Node.js (for some tools)
### Local Development Setup
@ -40,15 +36,15 @@ make test
2. **Install Zig tools**
```bash
-# Install Zig language server
-zig build --install zls
+# Zig is required for building the CLI and running CLI tests
+zig version
```
3. **Setup Python environment**
```bash
-python -m venv venv
-source venv/bin/activate  # or venv\Scripts\activate on Windows
-pip install -r requirements-dev.txt
+# Python is optional (used for a few helper scripts)
```
4. **Optional: Install pre-commit hooks**
@ -69,11 +65,8 @@ make test
2. Make your changes with live feedback:
```bash
-# Go development with hot reload
-make dev-go
-# Zig development with build on save
-make dev-zig
+# Build Go services + Zig CLI
+make dev
# Run specific tests
make test-unit
@ -84,10 +77,9 @@ make test
```bash
# Lint and format (if you have tools configured)
make lint
make format
# Full test suite
-make test-all
+make test-full
# Optional: Pre-commit checks
pre-commit run --all-files
@ -105,13 +97,14 @@ make test
make test-unit # Unit tests only
make test-integration # Integration tests only
make test-e2e # End-to-end tests only
make test-performance # Performance tests only
make benchmark # Benchmarks
make load-test # Load tests
# Run with coverage
make test-coverage
-# Watch mode for development
-make test-watch
+# (no watch mode target; run specific package tests with go test -run)
```
## Code Quality
@ -145,49 +138,13 @@ test: add or update tests
chore: maintenance tasks
```
## Debugging
### Go Debugging
```bash
# Debug with delve
dlv debug cmd/api-server/main.go
# Debug tests
dlv test ./internal/...
# Profile with pprof
go tool pprof http://localhost:6060/debug/pprof/profile
```
### Zig Debugging
```bash
# Debug build
zig build-exe -O Debug -fstrip=false your_file.zig
# Test with debugging
zig test --gdb your_file.zig
```
### Container Debugging
```bash
# Debug containers
docker-compose exec api-server bash
docker-compose logs -f api-server
# Inspect running processes
docker-compose exec api-server ps aux
```
## Performance Monitoring
### Local Monitoring
```bash
# Start monitoring stack
-make monitoring-start
+make dev-up
# View metrics
open http://localhost:3000 # Grafana
@ -198,7 +155,7 @@ open http://localhost:9090 # Prometheus
```bash
# Load test API
-make load-test-api
+make load-test
# Performance benchmarks
make benchmark


@ -27,7 +27,7 @@ Verify the signature (keyless Sigstore) using cosign:
cosign verify-blob \
--certificate checksums.txt.cert \
--signature checksums.txt.sig \
---certificate-identity-regexp "^https://github.com/<org>/<repo>/.github/workflows/release.yml@refs/tags/v.*$" \
+--certificate-identity-regexp "^https://github.com/jfraeysd/fetch_ml/.forgejo/workflows/release-mirror.yml@refs/tags/v.*$" \
--certificate-oidc-issuer https://token.actions.githubusercontent.com \
checksums.txt
```
@ -40,16 +40,16 @@ Example (CLI on Linux x86_64):
```bash
# Download
-curl -fsSLO https://github.com/<org>/<repo>/releases/download/<tag>/ml-linux-x86_64.tar.gz
-curl -fsSLO https://github.com/<org>/<repo>/releases/download/<tag>/checksums.txt
-curl -fsSLO https://github.com/<org>/<repo>/releases/download/<tag>/checksums.txt.sig
-curl -fsSLO https://github.com/<org>/<repo>/releases/download/<tag>/checksums.txt.cert
+curl -fsSLO https://github.com/jfraeysd/fetch_ml/releases/download/<tag>/ml-linux-x86_64.tar.gz
+curl -fsSLO https://github.com/jfraeysd/fetch_ml/releases/download/<tag>/checksums.txt
+curl -fsSLO https://github.com/jfraeysd/fetch_ml/releases/download/<tag>/checksums.txt.sig
+curl -fsSLO https://github.com/jfraeysd/fetch_ml/releases/download/<tag>/checksums.txt.cert
# Verify
cosign verify-blob \
--certificate checksums.txt.cert \
--signature checksums.txt.sig \
---certificate-identity-regexp "^https://github.com/<org>/<repo>/.github/workflows/release.yml@refs/tags/v.*$" \
+--certificate-identity-regexp "^https://github.com/jfraeysd/fetch_ml/.forgejo/workflows/release-mirror.yml@refs/tags/v.*$" \
--certificate-oidc-issuer https://token.actions.githubusercontent.com \
checksums.txt
sha256sum -c --ignore-missing checksums.txt

docs/.hugo_build.lock Normal file

@ -1,3 +1,5 @@
module github.com/jfraeys/fetch_ml/docs
go 1.21
require github.com/alex-shpak/hugo-book v0.0.0-20251118074854-b7f9c8cb0f51 // indirect

docs/go.sum Normal file

@ -0,0 +1,2 @@
github.com/alex-shpak/hugo-book v0.0.0-20251118074854-b7f9c8cb0f51 h1:HHxBwO6r6h3AUflUc/X/Gf5UrfTY5rZEbD7QoGzbVvU=
github.com/alex-shpak/hugo-book v0.0.0-20251118074854-b7f9c8cb0f51/go.mod h1:L4NMyzbn15fpLIpmmtDg9ZFFyTZzw87/lk7M2bMQ7ds=


@ -9,7 +9,7 @@ publishDir = "_site"
enableGitInfo = true
-disableKinds = ["taxonomy", "taxonomyTerm"]
+disableKinds = ["taxonomy"]
[module]
[[module.imports]]

File diff suppressed because one or more lines are too long


@ -0,0 +1 @@
{"Target":"book.min.6970156cec683193d93c9c4edaf0d56574e4361df2e0c1be4f697ae81c3ba55f.css","MediaType":"text/css","Data":{"Integrity":"sha256-aXAVbOxoMZPZPJxO2vDVZXTkNh3y4MG+T2l66Bw7pV8="}}


@ -1,8 +1,7 @@
---
-layout: page
title: "Homelab Architecture"
-permalink: /architecture/
-nav_order: 1
+url: "/architecture/"
+weight: 1
---
# Homelab Architecture


@ -1,8 +1,7 @@
---
-layout: page
title: "CI/CD Pipeline"
-permalink: /cicd/
-nav_order: 5
+url: "/cicd/"
+weight: 5
---
# CI/CD Pipeline
@ -11,7 +10,7 @@ Automated testing, building, and releasing for fetch_ml.
## Workflows
-### CI Workflow (`.github/workflows/ci.yml`)
+### CI Workflow (`.forgejo/workflows/ci.yml`)
Runs on every push to `main`/`develop` and all pull requests.
@ -29,7 +28,7 @@ Runs on every push to `main`/`develop` and all pull requests.
- Integration tests
- Security audits
-### Release Workflow (`.github/workflows/release.yml`)
+### Release Workflow (`.forgejo/workflows/release-mirror.yml`)
Runs on version tags (e.g., `v1.0.0`).
@ -49,7 +48,7 @@ Runs on version tags (e.g., `v1.0.0`).
3. **create-release**
- Collects all artifacts
- Generates SHA256 checksums
-- Creates GitHub release with notes
+- Mirrors release artifacts to GitHub Releases
## Release Process
@ -141,7 +140,8 @@ ZIG_VERSION: '0.15.2'
### Secrets
Required for releases:
-- `GITHUB_TOKEN` - Automatic, provided by GitHub Actions
+- `GH_MIRROR_TOKEN` - GitHub token for publishing mirrored releases
+- `GH_MIRROR_REPO` (variable) - GitHub repo slug, e.g. `jfraeysd/fetch_ml`
## Monitoring
@ -149,7 +149,7 @@ Required for releases:
Check workflow runs at:
```
-https://github.com/jfraeys/fetch_ml/actions
+https://git.jfraeys.com/jfraeysd/fetch_ml/actions
```
### Artifacts
@ -161,5 +161,5 @@ Download build artifacts from:
---
For implementation details:
-- [.github/workflows/ci.yml](https://github.com/jfraeys/fetch_ml/blob/main/.github/workflows/ci.yml)
-- [.github/workflows/release.yml](https://github.com/jfraeys/fetch_ml/blob/main/.github/workflows/release.yml)
+- `.forgejo/workflows/ci.yml`
+- `.forgejo/workflows/release-mirror.yml`


@ -1,8 +1,7 @@
---
-layout: page
title: "CLI Reference"
-permalink: /cli-reference/
-nav_order: 2
+url: "/cli-reference/"
+weight: 2
---
# Fetch ML CLI Reference
@ -37,6 +36,7 @@ High-performance command-line interface for experiment management, written in Zi
| `jupyter` | Manage Jupyter notebook services | `ml jupyter start --name my-nb` |
| `validate` | Validate provenance/integrity for a commit or task | `ml validate <commit_id> --verbose` |
| `info` | Show run info from `run_manifest.json` | `ml info <run_dir>` |
| `requeue` | Re-submit an existing run/commit with new args/resources | `ml requeue <commit_id|run_id|task_id|path> -- --epochs 20` |
### Command Details
@ -72,8 +72,11 @@ ml sync ./my-project --priority 9
# Queue with commit ID
ml queue my-job --commit abc123def456
-# Queue with priority (1-10, default 5)
+# Queue with commit ID prefix (>=7 hex chars; must be unique)
ml queue my-job --commit abc123 --priority 8
# Queue with extra runner args (stored as task.Args)
ml queue my-job --commit abc123 -- --epochs 5 --lr 1e-3
```
**Features:**
@ -81,6 +84,34 @@ ml queue my-job --commit abc123 --priority 8
- Priority queuing system
- API key authentication
**Notes:**
- `--priority` is passed to the server as a single byte (0-255).
- Args are sent via a dedicated queue opcode and become `task.Args` on the worker.
- `--commit` may be a full 40-hex commit id or a unique prefix (>=7 hex chars) resolvable under `worker_base`.
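The prefix rule in the notes above (at least 7 hex chars, exactly one match) can be sketched as a small resolver. This is a hypothetical helper for illustration; the real CLI resolves prefixes against commits found under `worker_base`.

```python
import re

def resolve_commit(prefix, known_commits):
    """Resolve a commit id prefix per the documented `--commit` rule:
    >= 7 hex chars, and the prefix must match exactly one known commit.
    Illustrative sketch, not the CLI's implementation."""
    if not re.fullmatch(r"[0-9a-f]{7,40}", prefix):
        raise ValueError("commit prefix must be 7-40 lowercase hex chars")
    matches = [c for c in known_commits if c.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(
            f"prefix {prefix!r} matched {len(matches)} commits, need exactly 1")
    return matches[0]

# Two fabricated 40-hex commit ids sharing the "abc" prefix.
commits = ["abc123def4567890" + "0" * 24, "abc999" + "0" * 34]
print(resolve_commit("abc123d", commits))
```

A 3-character prefix like `abc` is rejected before matching is even attempted, which keeps ambiguous short prefixes from silently queueing the wrong commit.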
#### `requeue` - Re-submit a Previous Run
```bash
# Requeue directly by commit_id
ml requeue <commit_id> -- --epochs 20
# Requeue by commit_id prefix (>=7 hex chars; must be unique)
ml requeue <commit_prefix> -- --epochs 20
# Requeue by run_id/task_id (CLI scans run_manifest.json under worker_base)
ml requeue <run_id> -- --epochs 20
# Requeue by a run directory or run_manifest.json path
ml requeue /data/ml-experiments/finished/<run_id> -- --epochs 20
# Override priority/resources on requeue
ml requeue <task_id> --priority 10 --gpu 1 -- --epochs 20
```
**What it does:**
- Locates `run_manifest.json`
- Extracts `commit_id`
- Submits a new queue request using that `commit_id` with optional overridden args/resources
**Notes:**
- Tasks support optional `snapshot_id` and `dataset_specs` fields server-side (for provenance and dataset resolution).
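The "CLI scans `run_manifest.json` under `worker_base`" step from the requeue examples can be sketched as a filesystem walk. The directory layout and manifest field names below are assumptions for illustration only.

```python
import json
import os
import tempfile

def find_commit_for_run(worker_base, ref):
    """Walk worker_base, read each run_manifest.json, and return the
    commit_id whose run_id or task_id matches `ref`. Layout and field
    names are illustrative assumptions."""
    for root, _, files in os.walk(worker_base):
        if "run_manifest.json" not in files:
            continue
        with open(os.path.join(root, "run_manifest.json")) as f:
            manifest = json.load(f)
        if ref in (manifest.get("run_id"), manifest.get("task_id")):
            return manifest["commit_id"]
    raise LookupError(f"no run_manifest.json found for {ref!r}")

# Build a tiny fake worker_base with one finished run.
base = tempfile.mkdtemp()
run_dir = os.path.join(base, "finished", "run_abc")
os.makedirs(run_dir)
with open(os.path.join(run_dir, "run_manifest.json"), "w") as f:
    json.dump({"run_id": "run_abc", "commit_id": "2e701340e5"}, f)
print(find_commit_for_run(base, "run_abc"))
```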


@ -12,7 +12,7 @@ make install
./bin/ml setup
# 3. Run experiments
-./bin/ml run my-experiment.py
+./cli/zig-out/bin/ml queue my-job
```
That's it. Everything else is optional.


@ -75,9 +75,32 @@ environment:
security:
trusted_channels: ["conda-forge", "defaults", "pytorch"]
-blocked_packages: ["requests", "urllib3"]
+blocked_packages: ["aiohttp", "telnetlib"]
```
You can also override the blocked package list at runtime using an environment variable on the worker:
```bash
export FETCHML_JUPYTER_BLOCKED_PACKAGES="aiohttp,telnetlib"
```
Some base images (including the default `quay.io/jupyter/base-notebook`) ship with common HTTP client libraries
like `requests`, `urllib3`, and `httpx` preinstalled.
If you want to **block installing** packages like `requests`, `urllib3`, and `httpx` for security reasons but still
use a base image that already includes them, you can disable the **startup image scan** separately:
```bash
# Block installs (user requests)
export FETCHML_JUPYTER_BLOCKED_PACKAGES="requests,urllib3,httpx"
# Allow base images that already contain these packages to start
export FETCHML_JUPYTER_STARTUP_BLOCKED_PACKAGES="off"
```
If you want startup scanning enabled, set `FETCHML_JUPYTER_STARTUP_BLOCKED_PACKAGES` to a comma-separated list.
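A minimal sketch of how a worker could interpret these variables, assuming the semantics described above: unset means no override, `off` disables the list, anything else is a comma-separated blocklist. The parsing details are a guess at the documented behaviour, not the worker's actual code.

```python
def blocked_packages(env, var="FETCHML_JUPYTER_STARTUP_BLOCKED_PACKAGES"):
    """Interpret the env var per the docs: unset -> no override (fall back
    to config defaults), "off" -> scanning disabled, otherwise a
    comma-separated blocklist. Illustrative sketch only."""
    raw = env.get(var)
    if raw is None:
        return None  # defer to the config-file blocklist
    if raw.strip().lower() == "off":
        return []  # startup scan disabled
    return [p.strip() for p in raw.split(",") if p.strip()]

env = {"FETCHML_JUPYTER_STARTUP_BLOCKED_PACKAGES": "requests, urllib3,httpx"}
print(blocked_packages(env))
```

Whitespace around commas is tolerated, so `"requests, urllib3,httpx"` and `"requests,urllib3,httpx"` parse identically.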
### Access Control
```bash


@ -1,6 +1,5 @@
---
-layout: default
-title: Fetch ML Documentation
+title: "Fetch ML Documentation"
bookHidden: true
---


@ -1,8 +1,7 @@
---
-layout: page
title: "Operations Runbook"
-permalink: /operations/
-nav_order: 6
+url: "/operations/"
+weight: 6
---
# Operations Runbook


@ -1,8 +1,7 @@
---
-layout: page
title: "Task Queue Architecture"
-permalink: /queue/
-nav_order: 3
+url: "/queue/"
+weight: 3
---
# Task Queue Architecture


@ -9,8 +9,8 @@ Get Fetch ML running in minutes with Docker Compose and integrated monitoring.
- **Podman**: For production experiment execution
**Requirements:**
-- Go 1.21+
-- Zig 0.11+
+- Go 1.25+
+- Zig 0.15+
- Docker Compose (testing only)
- 4GB+ RAM
- 2GB+ disk space
@ -137,8 +137,7 @@ cd cli && zig build --release=fast
# Common operations
./cli/zig-out/bin/ml status # Check system status
./cli/zig-out/bin/ml queue job-name # Queue job
./cli/zig-out/bin/ml list # List jobs
-./cli/zig-out/bin/ml help # Show help
+./cli/zig-out/bin/ml --help # Show help
```
### Monitoring Commands


@ -1,8 +1,7 @@
---
-layout: page
title: "Redis High Availability (Optional)"
-permalink: /redis-ha/
-nav_order: 7
+url: "/redis-ha/"
+weight: 7
---
# Redis High Availability


@ -0,0 +1,667 @@
# Research-First Runner: Missing Themes Plan
This file captures additional themes that are commonly missing in existing ML runners/experiment tools, translated into actionable design targets for a lightweight, research-first runner.
## Quick Overview
**What makes this different:**
- **Your server, not their cloud**: Everything runs on your homelab/workstation/uni server
- **Dual interfaces**: Zig CLI for scripting + SSH-accessible TUI for interactive work
- **Fair queueing**: `ml queue` (not `run`) makes resource sharing explicit
- **Research narrative**: Capture why you ran experiments, not just what ran
- **Zero SaaS**: No accounts, web dashboards, or external services
- **Plain text everything**: Human-readable manifests, long-term reproducibility
**Perfect for:** Researchers in uni labs, homelab enthusiasts, small research groups who want control over their infrastructure without cloud vendor lock-in.
## Architecture Context
**Server-Centric Model for Homelab/Workstation/Uni Lab:**
- **Two client interfaces**:
- **Zig CLI**: Thin WebSocket client for scripting, automation, remote access
- **SSH-accessible TUI**: Interactive Bubble Tea UI for monitoring when SSH'd into server
- Go API server with embedded rsync (reduces dependencies)
- Worker pulls from flexible queue backend (Redis/SQLite/filesystem)
- Priority-based scheduling with prewarm mechanism
- NAS integration for data prefetching
- Target: single server, workstation, or small uni lab cluster (not cloud/SaaS)
**Client Access Patterns:**
```bash
# CLI (from anywhere via WebSocket)
ml queue train.py --epochs 100
ml status --watch
ml info <path|id>
# TUI (when SSH'd into server or jump box)
ssh mluser@worker.local
ml-tui # Interactive terminal UI
# Navigate with keyboard, see live updates
```
**Configuration:**
```toml
# ~/.ml/config.toml (shared by both CLI and TUI)
worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"
```
## Plan (Missing Themes)
## Implemented Today (in this repo)
- Runs are queued via `ml queue` and processed by workers.
- Run provenance is written to `run_manifest.json`.
- You can attach queue-time notes with `ml queue --note "..."` (persisted in `run_manifest.json` under `metadata.note`).
- Queue backends support Redis / SQLite / filesystem (and optional filesystem fallback).
- CLI + SSH-launched TUI are both available (`ml monitor` launches the TUI).
## Future Ideas (this document)
### 1. Own-infrastructure-first, research-centric by default
### 2. Minimal server dependencies (simple operations)
### 3. Text-first tracking (logs > dashboards)
- **Research narrative completion**: post-run outcome/learnings/next steps captured in the manifest
- **Auto-captured context**:
- Command + args (as sent from CLI)
- Timestamps (queue time, start time, end time)
- Git commit hash (and optionally diff)
- Environment snapshot (pip freeze, conda export, container image digest)
- Hardware context (GPU model, driver version, CUDA version)
- **Plain text manifests**: JSON or YAML, never binary blobs
- **Stable formats**: Can read experiments from 5 years ago without the runner
**Implementation note**: Server writes `run_manifest.json` to experiment directory. CLI can display it via `ml info`.
### 4. CLI and TUI as complementary interfaces
- **Consistent CLI scripting UX**: Future idea (uniform `--json`, quiet modes, and stable exit codes across commands)
- **TUI feature parity**: Future idea (surface the same key details in TUI + CLI: queue position/ETA, narrative, validation results)
### 5. Failure-tolerant, messy-research friendly
- **Failure is first-class**: Failed runs stay visible and queryable
- **Partial artifacts preserved**: Keep artifacts/logs up to failure point (including checkpoints, if the script produces them)
- **No punishment for refactors**: Script renames don't break history
- **Grouping/tagging**: Label attempts (baseline/ablation/debug/exploration)
**Server implementation**: Worker should catch exceptions, record failure reason, preserve state. Queue should track failure modes (OOM, timeout, code error, data error).
### 6. Minimal abstraction over Python (transparent execution)
- **Run scripts as-is**: No decorators, no framework rewrites
- **Preserve debuggability**: Clean stack traces, pdb works
- **Optional instrumentation**: Explicit metric logging via simple API
```python
# Optional, not required
from ml_runner import log_metric
log_metric("loss", 0.5, step=100)
```
- **Standard I/O works**: `print()` goes to logs, arguments via `sys.argv`
**Server implementation**: Worker spawns process, captures stdout/stderr, parses optional structured logs. No magic wrappers that hide what's happening.
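The spawn-and-capture model above can be sketched with the standard library: run the script as-is, keep stdout as the log, and opportunistically parse any lines that carry structured metrics. The `METRIC ` line prefix here is invented for illustration; the source only says parsing of structured logs is optional.

```python
import json
import subprocess
import sys

# A stand-in "training script": plain prints plus one structured metric line.
script = (
    "print('epoch 1 done')\n"
    "print('METRIC ' + '{\"name\": \"loss\", \"value\": 0.5, \"step\": 100}')\n"
)

# Spawn the process transparently and capture stdout/stderr as the run log.
proc = subprocess.run([sys.executable, "-c", script],
                      capture_output=True, text=True)

# Opportunistic parse: lines with the (hypothetical) METRIC prefix become
# structured metrics; everything else stays an ordinary log line.
metrics = []
for line in proc.stdout.splitlines():
    if line.startswith("METRIC "):
        metrics.append(json.loads(line[len("METRIC "):]))
print(metrics)
```

Because the child is a plain interpreter process, `print()` and `sys.argv` behave exactly as they would when the script is run by hand, which is the point of the no-magic-wrappers rule.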
### 7. Reproducibility that survives time
- **Immutable run folders**: Server never modifies completed runs
- **Environment capture** (best-effort, pluggable):
- Container image digest (primary method)
- `pip freeze` / `uv pip freeze` / `poetry.lock`
- `conda env export`
- `nix flake.lock` (if available)
- **Hardware fingerprint**: GPU model, driver, CUDA, CPU, RAM
- **Data provenance**: Dataset checksums, NAS paths, version identifiers
- **Commit everything**: Store full environment, even if verbose
**Server implementation**: Pre-run hook captures environment. Store in `run_manifest.json`. Validate on `ml validate <run-id>`.
### 8. Small compute and shared machine friendliness
### 9. Server-side storage with client-side visibility
- **Energy awareness**: Respect that homelabs pay electricity bills
- **Laptop-friendly**: Support thermal/power throttling
- **Single-GPU to 4-GPU range**: Optimize for typical research setups
- **No cluster assumptions**: Don't require Kubernetes/SLURM/etc.
**Why this matters**: Researchers want to `ls` experiment directories but don't want to manually sync. Server handles storage, CLI provides views.
### 11. Research narrative (lab notebook, not job IDs)
- **Queue-time narrative capture**: Future idea (add `--hypothesis`, `--context`, `--intent`, etc. to `ml queue`)
- **Post-run learning capture**: Future idea (explicit `outcome`, `learnings[]`, `next_steps[]`, and validation status)
- **Narrative UX**: Future idea (view/edit narrative from TUI/CLI without hand-editing JSON)
**CLI commands**:
```bash
ml queue train.py --note "Testing warmup hypothesis from paper X"
```
- CLI: WebSocket streaming for `--watch` and `--follow`
- TUI: Live refresh (500ms tick), immediate queue updates
- **No magic**: Minimize implicit behavior
- Explicit is better than clever
- Defaults should be obvious and documented
- Side effects should be visible (both in CLI and TUI)
- Configuration hierarchy clear: CLI flags > env > config file > defaults
**TUI advantages for observability:**
- See everything at once: jobs, queue, GPUs, containers, logs
- Keyboard shortcuts for common operations
- Instant feedback on actions (queue, cancel, delete)
- Prewarm state visible in GPU panel
- No need to run multiple `ml status` commands
### 13. Support clear thinking during experimentation
- **Optimize for cognitive throughput**:
- Make it easy to remember what you were thinking
- Surface patterns across experiments
- Warn about near-duplicates before running
- **Built-in comparison**:
```bash
# Future ideas:
# ml diff <run-a> <run-b>
# ml similar <run-id>
```
- **Learning from history**:
```bash
# Future ideas:
# ml lessons --tag ablation
# ml dead-ends
```
- **Hypothesis tracking**:
- Link hypothesis → experiment → outcome → next hypothesis
- Mark outcomes: validates/refutes/inconclusive
- **Reduce cognitive load**:
- Natural queries: Future idea (search over manifests/notes)
- Show relevant history when queueing
- Don't make researchers remember IDs
**Server implementation**: Maintain index (rebuildable from filesystem). Support semantic queries over manifests, notes, tags.
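The rebuildable-index idea can be sketched as a pure function of the manifests on disk: scan, extract a few searchable fields, query. The index schema (`run_id`, `tags`, `note`) is an illustrative assumption.

```python
import json
import os
import tempfile

def build_index(base):
    """Rebuild a search index purely from run_manifest.json files under
    `base`, per the 'rebuildable from filesystem' principle. The record
    shape is an illustrative assumption."""
    index = []
    for root, _, files in os.walk(base):
        if "run_manifest.json" in files:
            with open(os.path.join(root, "run_manifest.json")) as f:
                m = json.load(f)
            index.append({"run_id": m.get("run_id"),
                          "tags": m.get("narrative", {}).get("tags", []),
                          "note": m.get("metadata", {}).get("note", "")})
    return index

def find(index, tag=None, note_contains=None):
    """Filter index records by tag and/or note substring."""
    hits = index
    if tag:
        hits = [r for r in hits if tag in r["tags"]]
    if note_contains:
        hits = [r for r in hits if note_contains in r["note"]]
    return [r["run_id"] for r in hits]

# Fixture: two runs with tags and queue-time notes.
base = tempfile.mkdtemp()
for run_id, tags, note in [("run_1", ["ablation"], "warmup test"),
                           ("run_2", ["baseline"], "first pass")]:
    d = os.path.join(base, run_id)
    os.makedirs(d)
    with open(os.path.join(d, "run_manifest.json"), "w") as f:
        json.dump({"run_id": run_id, "narrative": {"tags": tags},
                   "metadata": {"note": note}}, f)

index = build_index(base)
print(find(index, tag="ablation"), find(index, note_contains="warmup"))
```

Because the index is derived entirely from the filesystem, deleting it is always safe: `build_index` reconstructs it from the manifests.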
### 14. Fast iteration velocity
- **Easy modification**:
```bash
# Future ideas:
# ml clone <run-id>
# ml fork <run-id>
```
- **Batch operations**:
```bash
# Future idea: ml sweep
```
**Why prewarm matters**: Your NAS prefetch in prewarm means jobs start training immediately instead of waiting for data. This dramatically improves iteration velocity.
### 15. Full research lifecycle support
- **Exploration phase**: Minimal metadata, quick runs
- **Development phase**: Group attempts, compare variations
- **Validation phase**: Strict reproducibility, complete capture
- **Publication phase**: Export bundles, generate reproduction instructions
- **Maintenance phase**: Long-term readable, re-executable years later
**Reproducibility levels** (your strict/best-effort model):
```bash
# Future idea: --repro-level
ml validate <commit_id> # Future idea: expand validation coverage + outputs
```
### 16. Collaboration without platforms
- **Async collaboration** (no shared server required):
```bash
# Future ideas:
# ml export <run-id> --bundle run_42.tar.gz
# ml import run_42.tar.gz
```
- **Selective sharing**:
```bash
# Future ideas:
# ml export <run-id> --metadata-only
# ml export <run-id> --include-artifacts
```
- **Review-friendly**:
- Self-contained bundles
- All provenance included
- Reproducibility instructions
- No "install our platform" friction
**Server implementation**: Export packages `run_manifest.json` + artifacts into tarball. Import validates and unpacks into experiments directory.
### 17. Graceful degradation
- **Core works with minimal setup**:
- Filesystem-only queue (no Redis required)
- SQLite for metadata (no Postgres)
- Local execution (no remote targets needed)
- **Optional enhancements**:
- Redis for better multi-worker queueing
- Git integration (works without git)
- NAS prewarm (falls back to on-demand fetch)
- WebSocket updates (falls back to polling)
- **Progressive disclosure**:
- Simple commands for simple cases
- Advanced flags for power users
- Features activate when available
**Implementation note**:
### 18. Concrete features (derived from above)
#### Findability
```bash
# Future ideas:
# ml find "failed runs on GPU2 last week"
# ml find --note "warmup"
```
Server maintains rebuildable index over manifests, logs, tags.
#### Dataset provenance
```json
{
"datasets": [
{
"name": "imagenet-train",
"nas_path": "/nas/datasets/imagenet/train",
"checksum": "sha256:abc123...",
"fetched_at": "2024-01-15T10:30:00Z",
"fetch_method": "prewarm"
}
]
}
```
Server validates checksums, warns on drift.
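Checksum validation with drift warning can be sketched as recompute-and-compare against the manifest entry above. The helper names are hypothetical; only the `sha256:`-prefixed checksum format comes from the example.

```python
import hashlib
import os
import tempfile

def dataset_checksum(path):
    """Stream the file through sha256 so large datasets never load fully
    into memory; format matches the manifest example ("sha256:<hex>")."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()

def drifted(entry, path):
    """True when the data on disk no longer matches the manifest entry."""
    return dataset_checksum(path) != entry["checksum"]

# Fixture: record a checksum, then modify the file to simulate drift.
path = os.path.join(tempfile.mkdtemp(), "train.bin")
with open(path, "wb") as f:
    f.write(b"imagenet-train sample")
entry = {"name": "imagenet-train", "checksum": dataset_checksum(path)}
clean = drifted(entry, path)   # data unchanged since manifest was written
with open(path, "ab") as f:
    f.write(b"!")              # simulate silent modification on the NAS
drift = drifted(entry, path)   # checksum no longer matches
print(clean, drift)
```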
#### Prewarm observability
```bash
ml status
# Shows:
# Next in queue: run_xyz (priority 5)
# Prewarming: dataset imagenet-train (2/5 complete)
# GPU 0: running run_abc (50% complete, ETA 2h)
# GPU 1: idle
```
#### CLI queue/requeue workflows
**Core principle**: the runner does not introduce checkpoint conventions. The script should run identically when executed directly vs via `ml`.
**Passive artifact tracking** (future idea): worker records what files exist in the run directory after completion (or via configured glob patterns). Checkpoints are just artifacts.
**Requeue = replay command with modifications** (future idea):
```bash
# Original run
ml queue train.py --epochs 100 --save-dir ./checkpoints
# Requeue (continue)
ml requeue run_abc -- --resume ./checkpoints/best.pt --epochs 200
```
**Arg merge strategies** (future idea):
```bash
# Append new args (default)
ml requeue run_abc --append -- --resume ./checkpoints/best.pt
# Replace (rerun with only new args)
ml requeue run_abc --replace -- --epochs 200 --lr 3e-4
# Merge (override matching flags, keep the rest)
ml requeue run_abc --merge -- --epochs 200
```
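The three strategies above can be sketched over simple argument lists. Real flag semantics would need the runner's own parser, so this sketch assumes flags are `--name value` pairs (or bare flags) for illustration.

```python
def merge_args(old, new, strategy="append"):
    """Sketch of the requeue arg strategies: append (default), replace,
    and merge (new flags override matching old ones, rest kept in order).
    Flag parsing here is a simplifying assumption."""
    def to_pairs(args):
        # Group "--flag value" pairs; a flag with no value maps to None.
        pairs, i = [], 0
        while i < len(args):
            if (args[i].startswith("--") and i + 1 < len(args)
                    and not args[i + 1].startswith("--")):
                pairs.append((args[i], args[i + 1]))
                i += 2
            else:
                pairs.append((args[i], None))
                i += 1
        return pairs

    if strategy == "replace":
        return list(new)
    if strategy == "append":
        return list(old) + list(new)
    # merge
    overrides = dict(to_pairs(new))
    merged = []
    for flag, val in to_pairs(old):
        if flag in overrides:
            val = overrides.pop(flag)
        merged.extend([flag] if val is None else [flag, val])
    for flag, val in overrides.items():  # genuinely new flags go last
        merged.extend([flag] if val is None else [flag, val])
    return merged

old = ["--epochs", "100", "--save-dir", "./checkpoints"]
print(merge_args(old, ["--epochs", "200"], "merge"))
```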
**Optional staging** (future idea): copy an artifact from the source run into the new run directory, then reference it with a placeholder.
```bash
ml requeue run_abc --stage checkpoints/best.pt -- \
--resume {staged}/best.pt --epochs 200
```
#### Hardware/resource management
```json
{
"resources": {
"gpus": 2,
"gpu_memory_gb": 40,
"cpu_cores": 16,
"ram_gb": 64,
"disk_gb": 100,
"max_runtime_hours": 24
}
}
```
Worker validates resources before pulling from queue. Server tracks utilization.
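The pre-pull admission check can be sketched as a per-field comparison against worker capacity, using the field names from the resources example. The comparison itself is an illustrative assumption about how the worker validates.

```python
def fits(requested, available):
    """A task is pulled only when every requested quantity fits the
    worker's capacity. Field names follow the example resources block;
    unknown fields default to 0 on both sides."""
    return all(requested.get(k, 0) <= available.get(k, 0) for k in requested)

# Hypothetical worker capacity for a 2-GPU box.
available = {"gpus": 2, "gpu_memory_gb": 48, "cpu_cores": 32, "ram_gb": 128}

ok = fits({"gpus": 2, "gpu_memory_gb": 40, "cpu_cores": 16, "ram_gb": 64},
          available)
too_big = fits({"gpus": 4}, available)  # exceeds capacity, stays queued
print(ok, too_big)
```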
---
## Design Philosophy Summary (Server-Centric)
The goal is to build a **research assistant that runs on YOUR server**, not a platform that runs on someone else's cloud.
### Every feature should answer:
1. Does this help researchers **understand** what happened on the server?
2. Does this make the server **transparent** instead of a black box?
3. Does this work on a **single workstation** or small lab server?
4. Does this respect that researchers **SSH into the server**?
5. Does this make **local data** (NAS, scratch drives) first-class?
### Architecture principles:
- **Server is the control plane**: All logic, storage, scheduling on server
- **CLI is a thin client**: Just communicates via WebSocket, no local state
- **Filesystem is still king**: Server writes plain text, CLI reads via API
- **Queue-first for fairness**: `ml queue` not `ml run` - explicit resource requests
- **Priority without hogging**: Higher priority = earlier in queue, not exclusive access
- **Prewarm is a performance optimization**: Best-effort, never required for correctness
- **NAS integration is native**: Server understands mounted storage
### When in doubt:
- **Server-side is better** than client-side (for logic)
- **WebSocket is better** than REST (for interactivity)
- **Embedded is better** than external deps (rsync in server)
- **Flexible backend is better** than required service (Redis OR SQLite OR filesystem)
- **Plain text is better** than binary
- **Your hardware is better** than their cloud
The runner should feel like **SSH into your well-organized research server with powerful tools**, not like operating a cloud platform. Whether you're using the CLI for automation or the TUI for interactive work, the experience should be transparent, fair, and research-focused.
---
## Typical Research Workflows (CLI + TUI)
### Morning Routine: Check What Happened Overnight
```bash
# From your laptop (via WebSocket)
ml status
# Shows: 2 finished, 1 running, 3 in queue
ml info run_abc --show-metrics
# Quick check: did the overnight run validate the hypothesis?
# If you need deep investigation, SSH in
ssh mluser@worker.local
ml-tui
# Visual inspection of logs, GPU usage, etc.
```
### Starting a New Experiment Series
```bash
# Script a parameter sweep (CLI automation)
for lr in 1e-3 3e-4 1e-4; do
ml queue train.py --lr $lr \
# Future idea: --hypothesis / --experiment-group
--priority 5
done
# Monitor in TUI (interactive)
ssh mluser@worker.local
ml-tui
# Watch queue, see ETA, check prewarm status
```
### Debugging a Failed Run
```bash
# Notice failure via CLI
ml status
# run_xyz: failed (exit code 137) - OOM?
# Jump into TUI for investigation
ssh mluser@worker.local
ml-tui
# Navigate to run_xyz, press 'l' for logs
# See OOM error at batch 128
# Future idea: narrative/annotation UX in the TUI
```
### End-of-Day Review
```bash
# TUI for visual summary
ssh mluser@worker.local
ml-tui
# Scroll through today's runs
# Future ideas: compare views, export bundles
```
### Paper Writing Time (6 months later)
```bash
# Today: use the filesystem + run manifests
ml info <path|id>
# Future ideas: searching/filtering + comparison reports
# TUI for visual exploration
ssh mluser@worker.local
ml-tui
# Navigate through old experiments
# Press 'n' to read narratives
# Reconstruct your thought process
```
### Collaborative Debugging with Advisor
```bash
# Both SSH into server simultaneously
ssh mluser@worker.local
# You run TUI to show current state
ml-tui
# Navigate to problem run, show logs live
# Advisor suggests fix
# You queue new run with their suggestion
ml queue train.py --lr 1e-4 \
--note "Per advisor: try smaller LR with warmup" \
# Future idea: --parent-run
--priority 7
# Watch it start in TUI immediately
# Queue position visible, prewarm status shown
```
This dual-interface approach gives researchers the best of both worlds: **scriptability when they need it, visibility when they want it**.
---
## How This Maps to Your Current Architecture
**Already correct**:
- Server-centric with dual client interfaces (CLI + TUI)
- WebSocket communication (CLI)
- SSH-based TUI with Bubble Tea (interactive monitoring)
- Embedded rsync in server
- Flexible queue backend (Redis/SQLite/filesystem)
- Priority scheduling
- Prewarm mechanism for NAS prefetch
- **Fair queueing philosophy** - `queue` not `run`
- TUI shows live updates: jobs, queue, GPU status, logs
🎯 **Natural extensions**:
- Queue-time narrative flags for `ml queue` (hypothesis/context/intent/etc.)
- CLI commands for diffing and finding (and higher-level comparison workflows)
- TUI panels for hypothesis/learnings (in job details)
- Reproducibility validation improvements (extend `ml validate`)
- Export/import for collaboration
- Graceful degradation (filesystem-only mode)
- Visible queue position and fairness metrics
📝 **Design considerations**:
- Show prewarm state/progress in `ml status`
- Show queue position and ETA in both CLI and TUI
- Add research context fields to manifests
- Build comparison workflows (diff, similar, why-different)
- Support hypothesis tracking in both interfaces
- Create export bundles for sharing
- Expose fairness metrics (wait time distribution, resource utilization)
- TUI could show narrative snippets in job list (hypothesis as subtitle?)
**TUI Research Narrative Integration Ideas:**
```
┌─ ML Jobs & Queue ─────────────────────────────────────┐
│ > imagenet_baseline │
│ ✓ finished | Priority: 5 │
│ "Testing baseline performance before ablations" │
│ │
│ batch_size_64 │
│ ▶ running (epoch 45/100) | Priority: 5 │
│ "Validating linear LR scaling hypothesis" │
│ │
│ warmup_test │
│ ⏳ queued (position 2) | Priority: 3 │
│ "Following up on advisor suggestion about warmup" │
└───────────────────────────────────────────────────────┘
Press 'n' to view narrative, 'a' to annotate
```
**Implementation status (today):**
- **Annotations are implemented** and stored at the **root** of `run_manifest.json` as `annotations[]`.
- **Narrative fields are implemented** and stored under `run_manifest.json` as `narrative` (set/update via CLI).
- Use `ml annotate <path|run_id|task_id> --note "..." [--author "..."]` to append an entry.
- Remaining gaps are around **queue-time capture**, **post-run learnings/outcomes**, and **TUI-first narrative UX**.
Example `run_manifest.json`:
```json
{
// === Standard Execution Metadata ===
"run_id": "2024-01-15_abc123",
"status": "completed",
"command": "train.py --lr 0.001 --epochs 100 --batch-size 64",
"queued_at": "2024-01-15T10:25:00Z",
"started_at": "2024-01-15T10:30:00Z",
"ended_at": "2024-01-15T14:45:00Z",
"exit_code": 0,
"priority": 5,
// === Research Narrative (The Important Part) ===
"narrative": {
// WHY did you run this?
"hypothesis": "Larger batch size with linear LR scaling should improve convergence speed without hurting final accuracy",
// WHAT were you thinking at the time?
"context": "Previous run (run_789) with batch=32 took 8 hours and plateaued at 0.85. Paper XYZ suggests linear scaling rule should work.",
// WHAT were you trying to accomplish?
"intent": "Test if doubling batch size (32→64) with 2x learning rate maintains accuracy while reducing training time",
// WHAT did you expect to happen?
"expected_outcome": "Similar final accuracy (~0.85) but ~4 hour training time instead of 8",
// HOW is this related to other experiments?
"parent_run": "2024-01-14_run789",
"experiment_group": "batch-size-scaling-ablation",
"tags": ["ablation", "batch-size", "convergence-speed", "paper-xyz-reproduction"],
// WHAT did you learn? (filled in post-run or during)
"outcome": "Success: accuracy=0.87 (+0.02), time=3.5h (-56%). Linear scaling rule validated.",
"learnings": [
"Linear LR scaling worked as expected from paper XYZ",
"GPU memory utilization went from 60% to 95% - near limit",
"Convergence was actually smoother (fewer spikes in loss curve)",
"Could probably push to batch=96 before OOM"
],
"next_steps": [
"Try batch=96 to maximize GPU utilization",
"Test if this scales to batch=128 with gradient accumulation",
"Validate on other datasets (currently only tested on ImageNet)"
],
    "validation_status": "validates"  // or "refutes", "inconclusive", "partial"
  },
// Human annotations added later
"annotations": [
{
"timestamp": "2024-01-15T15:00:00Z",
"author": "user@lab.edu",
"note": "This result is strong enough for the paper. Use these hyperparams for final training."
},
{
"timestamp": "2024-01-16T09:00:00Z",
"author": "advisor@lab.edu",
"note": "Good work. Also compare with warmup schedule before finalizing."
}
],
// === Reproducibility Metadata ===
"environment": {
"git_commit": "a1b2c3d4",
"git_dirty": false,
"git_branch": "experiment/batch-scaling",
"container_image": "pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime",
"container_digest": "sha256:abc123...",
"pip_freeze": "torch==2.0.1\ntorchvision==0.15.2\n...",
"cuda_version": "11.8",
"gpu_driver": "525.105.17",
"python_version": "3.10.12"
},
// === Data Provenance ===
"datasets": [
{
"name": "imagenet-train",
"nas_path": "/nas/datasets/imagenet/ILSVRC2012/train",
"checksum": "sha256:def456...",
"size_gb": 144.2,
"num_samples": 1281167,
"version": "ILSVRC2012",
"fetched_via": "prewarm",
"fetch_time_seconds": 180
}
],
// === Resource Usage ===
"resources": {
"requested": {
"gpus": 1,
"gpu_memory_gb": 24,
"cpu_cores": 8,
"ram_gb": 32
},
"actual": {
"gpu_utilization_avg": 95,
"gpu_memory_peak_gb": 22.8,
"cpu_utilization_avg": 45,
"ram_peak_gb": 28.5,
"disk_read_gb": 145,
"disk_write_gb": 12
},
"gpu_model": "NVIDIA RTX 3090",
"host": "ml-server-01"
},
// === Results ===
"metrics": {
"final_train_accuracy": 0.891,
"final_val_accuracy": 0.873,
"final_train_loss": 0.234,
"final_val_loss": 0.287,
"best_val_accuracy": 0.876,
"best_epoch": 87,
"total_epochs": 100,
"training_time_hours": 3.52
},
// === Artifacts ===
"artifacts": {
"discovery_time": "2024-01-15T14:45:00Z",
"files": [
{
"path": "checkpoints/epoch_010.pth",
"size_bytes": 450000000,
"modified": "2024-01-15T11:30:00Z"
},
{
"path": "checkpoints/best.pth",
"size_bytes": 450000000,
"modified": "2024-01-15T13:45:00Z"
}
],
"total_size_bytes": 900000000
}
}
```


@ -61,13 +61,16 @@ User roles and permissions are configured on the server side by administrators.
### Data Scientist Workflow
```bash
# Submit your experiment
-ml run my-experiment
+ml queue my-experiment
# Check your experiments (only shows yours)
ml status
# Cancel your own experiment
ml cancel my-experiment
+# Requeue a previous run with different args
+ml requeue <run_id|task_id|path> -- --epochs 20
```
### Administrator Workflow


@ -1,7 +1,6 @@
---
layout: page
title: "Validation (ml validate)"
-permalink: /validate/
+url: "/validate/"
---
# Validation (`ml validate`)


@ -1,8 +1,7 @@
---
layout: page
title: "Zig CLI Guide"
-permalink: /zig-cli/
-nav_order: 3
+url: "/zig-cli/"
+weight: 3
---
# Zig CLI Guide
@ -28,7 +27,7 @@ The CLI reads `~/.ml/config.toml` and respects `FETCH_ML_CLI_*` env vars:
worker_host = "127.0.0.1"
worker_user = "dev_user"
worker_base = "/tmp/ml-experiments"
-worker_port = 9101
+worker_port = 22
api_key = "your-api-key"
```
@ -59,4 +58,4 @@ All use `zig build-exe` with `-OReleaseSmall -fstrip` and are compatible with Li
## CI/CD
-The release workflow builds cross-platform binaries and packages them with checksums. See `.github/workflows/release.yml`.
+The release workflow builds cross-platform binaries and packages them with checksums. See `.forgejo/workflows/release-mirror.yml`.