docs: comprehensive documentation updates
- Add architecture, CI/CD, CLI reference documentation
- Update installation, operations, and quick-start guides
- Add Jupyter workflow and queue documentation
- New landing page and research runner plan
parent 2e701340e5
commit 5144d291cb
23 changed files with 790 additions and 110 deletions
@@ -7,6 +7,7 @@
- Worker: stage verified `snapshot_id` into each task workspace and expose it to training code via `FETCH_ML_SNAPSHOT_DIR`.
- Worker: provenance enforcement is trustworthiness-by-default (fail-closed) with `provenance_best_effort` opt-in.
- CLI/API: add `ml validate` to fetch a validation report (commit/task) for provenance + integrity checks.
- Worker: persist discovered artifacts into `run_manifest.json` (`artifacts.discovery_time`, `artifacts.files[]`, `artifacts.total_size_bytes`) at task completion.
- Worker: best-effort environment prewarm can build a warmed Podman image keyed by `deps_manifest_sha256` and reuse it for subsequent tasks.
- Worker: export env prewarm hit/miss/built counters and total build time via the worker Prometheus metrics endpoint.
- API/Worker: `ml prune` also triggers best-effort garbage collection of warmed env images.
@@ -9,11 +9,8 @@ This guide helps developers set up their environment and contribute effectively
git clone <your-repo>
cd fetch_ml

# Install dependencies
make setup-dev

# Start development environment
make dev-start
make dev-up

# Run tests
make test
@@ -24,11 +21,10 @@ make test
### Prerequisites

- Go 1.25+
- Zig 0.11+
- Zig 0.15+
- Python 3.11+
- Docker & Docker Compose
- Redis
- Node.js (for some tools)

### Local Development Setup
@@ -40,15 +36,15 @@ make test

2. **Install Zig tools**
```bash
# Install Zig language server
zig build --install zls
# Zig is required for building the CLI and running CLI tests
zig version
```

3. **Setup Python environment**
```bash
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install -r requirements-dev.txt
# Python is optional (used for a few helper scripts)
```

4. **Optional: Install pre-commit hooks**
@@ -69,11 +65,8 @@ make test

2. Make your changes with live feedback:
```bash
# Go development with hot reload
make dev-go

# Zig development with build on save
make dev-zig
# Build Go services + Zig CLI
make dev

# Run specific tests
make test-unit
@@ -84,11 +77,10 @@ make test
```bash
# Lint and format (if you have tools configured)
make lint
make format

# Full test suite
make test-all

make test-full

# Optional: Pre-commit checks
pre-commit run --all-files
```
@@ -105,13 +97,14 @@ make test
make test-unit # Unit tests only
make test-integration # Integration tests only
make test-e2e # End-to-end tests only
make test-performance # Performance tests only

make benchmark # Benchmarks
make load-test # Load tests

# Run with coverage
make test-coverage

# Watch mode for development
make test-watch
# (no watch mode target; run specific package tests with go test -run)
```

## Code Quality
@@ -145,50 +138,14 @@ test: add or update tests
chore: maintenance tasks
```

## Debugging

### Go Debugging

```bash
# Debug with delve
dlv debug cmd/api-server/main.go

# Debug tests
dlv test ./internal/...

# Profile with pprof
go tool pprof http://localhost:6060/debug/pprof/profile
```

### Zig Debugging

```bash
# Debug build
zig build-exe -O Debug -fstrip=false your_file.zig

# Test with debugging
zig test --gdb your_file.zig
```

### Container Debugging

```bash
# Debug containers
docker-compose exec api-server bash
docker-compose logs -f api-server

# Inspect running processes
docker-compose exec api-server ps aux
```

## Performance Monitoring

### Local Monitoring

```bash
# Start monitoring stack
make monitoring-start

make dev-up

# View metrics
open http://localhost:3000 # Grafana
open http://localhost:9090 # Prometheus
@@ -198,8 +155,8 @@ open http://localhost:9090 # Prometheus

```bash
# Load test API
make load-test-api

make load-test

# Performance benchmarks
make benchmark
README.md (12 changes)
@@ -27,7 +27,7 @@ Verify the signature (keyless Sigstore) using cosign:
cosign verify-blob \
  --certificate checksums.txt.cert \
  --signature checksums.txt.sig \
  --certificate-identity-regexp "^https://github.com/<org>/<repo>/.github/workflows/release.yml@refs/tags/v.*$" \
  --certificate-identity-regexp "^https://github.com/jfraeysd/fetch_ml/.forgejo/workflows/release-mirror.yml@refs/tags/v.*$" \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  checksums.txt
```
@@ -40,16 +40,16 @@ Example (CLI on Linux x86_64):

```bash
# Download
curl -fsSLO https://github.com/<org>/<repo>/releases/download/<tag>/ml-linux-x86_64.tar.gz
curl -fsSLO https://github.com/<org>/<repo>/releases/download/<tag>/checksums.txt
curl -fsSLO https://github.com/<org>/<repo>/releases/download/<tag>/checksums.txt.sig
curl -fsSLO https://github.com/<org>/<repo>/releases/download/<tag>/checksums.txt.cert
curl -fsSLO https://github.com/jfraeysd/fetch_ml/releases/download/<tag>/ml-linux-x86_64.tar.gz
curl -fsSLO https://github.com/jfraeysd/fetch_ml/releases/download/<tag>/checksums.txt
curl -fsSLO https://github.com/jfraeysd/fetch_ml/releases/download/<tag>/checksums.txt.sig
curl -fsSLO https://github.com/jfraeysd/fetch_ml/releases/download/<tag>/checksums.txt.cert

# Verify
cosign verify-blob \
  --certificate checksums.txt.cert \
  --signature checksums.txt.sig \
  --certificate-identity-regexp "^https://github.com/<org>/<repo>/.github/workflows/release.yml@refs/tags/v.*$" \
  --certificate-identity-regexp "^https://github.com/jfraeysd/fetch_ml/.forgejo/workflows/release-mirror.yml@refs/tags/v.*$" \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  checksums.txt
sha256sum -c --ignore-missing checksums.txt
docs/.hugo_build.lock (new file, 0 changes)
@@ -1,3 +1,5 @@
module github.com/jfraeys/fetch_ml/docs

go 1.21

require github.com/alex-shpak/hugo-book v0.0.0-20251118074854-b7f9c8cb0f51 // indirect
docs/go.sum (new file, 2 changes)
@@ -0,0 +1,2 @@
github.com/alex-shpak/hugo-book v0.0.0-20251118074854-b7f9c8cb0f51 h1:HHxBwO6r6h3AUflUc/X/Gf5UrfTY5rZEbD7QoGzbVvU=
github.com/alex-shpak/hugo-book v0.0.0-20251118074854-b7f9c8cb0f51/go.mod h1:L4NMyzbn15fpLIpmmtDg9ZFFyTZzw87/lk7M2bMQ7ds=
@@ -9,7 +9,7 @@ publishDir = "_site"

enableGitInfo = true

disableKinds = ["taxonomy", "taxonomyTerm"]
disableKinds = ["taxonomy"]

[module]
[[module.imports]]
File diff suppressed because one or more lines are too long

@@ -0,0 +1 @@
{"Target":"book.min.6970156cec683193d93c9c4edaf0d56574e4361df2e0c1be4f697ae81c3ba55f.css","MediaType":"text/css","Data":{"Integrity":"sha256-aXAVbOxoMZPZPJxO2vDVZXTkNh3y4MG+T2l66Bw7pV8="}}
@@ -1,8 +1,7 @@
---
layout: page
title: "Homelab Architecture"
permalink: /architecture/
nav_order: 1
url: "/architecture/"
weight: 1
---

# Homelab Architecture
@@ -1,8 +1,7 @@
---
layout: page
title: "CI/CD Pipeline"
permalink: /cicd/
nav_order: 5
url: "/cicd/"
weight: 5
---

# CI/CD Pipeline
@@ -11,7 +10,7 @@ Automated testing, building, and releasing for fetch_ml.

## Workflows

### CI Workflow (`.github/workflows/ci.yml`)
### CI Workflow (`.forgejo/workflows/ci.yml`)

Runs on every push to `main`/`develop` and all pull requests.
@@ -29,7 +28,7 @@ Runs on every push to `main`/`develop` and all pull requests.
- Integration tests
- Security audits

### Release Workflow (`.github/workflows/release.yml`)
### Release Workflow (`.forgejo/workflows/release-mirror.yml`)

Runs on version tags (e.g., `v1.0.0`).
@@ -49,7 +48,7 @@ Runs on version tags (e.g., `v1.0.0`).
3. **create-release**
   - Collects all artifacts
   - Generates SHA256 checksums
   - Creates GitHub release with notes
   - Mirrors release artifacts to GitHub Releases

## Release Process
@@ -141,7 +140,8 @@ ZIG_VERSION: '0.15.2'
### Secrets

Required for releases:
- `GITHUB_TOKEN` - Automatic, provided by GitHub Actions
- `GH_MIRROR_TOKEN` - GitHub token for publishing mirrored releases
- `GH_MIRROR_REPO` (variable) - GitHub repo slug, e.g. `jfraeysd/fetch_ml`

## Monitoring
@@ -149,7 +149,7 @@ Required for releases:

Check workflow runs at:
```
https://github.com/jfraeys/fetch_ml/actions
https://git.jfraeys.com/jfraeysd/fetch_ml/actions
```

### Artifacts
@@ -161,5 +161,5 @@ Download build artifacts from:
---

For implementation details:
- [.github/workflows/ci.yml](https://github.com/jfraeys/fetch_ml/blob/main/.github/workflows/ci.yml)
- [.github/workflows/release.yml](https://github.com/jfraeys/fetch_ml/blob/main/.github/workflows/release.yml)
- `.forgejo/workflows/ci.yml`
- `.forgejo/workflows/release-mirror.yml`
@@ -1,8 +1,7 @@
---
layout: page
title: "CLI Reference"
permalink: /cli-reference/
nav_order: 2
url: "/cli-reference/"
weight: 2
---

# Fetch ML CLI Reference
@@ -37,6 +36,7 @@ High-performance command-line interface for experiment management, written in Zig
| `jupyter` | Manage Jupyter notebook services | `ml jupyter start --name my-nb` |
| `validate` | Validate provenance/integrity for a commit or task | `ml validate <commit_id> --verbose` |
| `info` | Show run info from `run_manifest.json` | `ml info <run_dir>` |
| `requeue` | Re-submit an existing run/commit with new args/resources | `ml requeue <commit_id|run_id|task_id|path> -- --epochs 20` |

### Command Details
@@ -72,8 +72,11 @@ ml sync ./my-project --priority 9
# Queue with commit ID
ml queue my-job --commit abc123def456

# Queue with priority (1-10, default 5)
# Queue with commit ID prefix (>=7 hex chars; must be unique)
ml queue my-job --commit abc123 --priority 8

# Queue with extra runner args (stored as task.Args)
ml queue my-job --commit abc123 -- --epochs 5 --lr 1e-3
```

**Features:**
@@ -81,6 +84,34 @@ ml queue my-job --commit abc123 --priority 8
- Priority queuing system
- API key authentication

**Notes:**
- `--priority` is passed to the server as a single byte (0-255).
- Args are sent via a dedicated queue opcode and become `task.Args` on the worker.
- `--commit` may be a full 40-hex commit id or a unique prefix (>=7 hex chars) resolvable under `worker_base`.

#### `requeue` - Re-submit a Previous Run
```bash
# Requeue directly by commit_id
ml requeue <commit_id> -- --epochs 20

# Requeue by commit_id prefix (>=7 hex chars; must be unique)
ml requeue <commit_prefix> -- --epochs 20

# Requeue by run_id/task_id (CLI scans run_manifest.json under worker_base)
ml requeue <run_id> -- --epochs 20

# Requeue by a run directory or run_manifest.json path
ml requeue /data/ml-experiments/finished/<run_id> -- --epochs 20

# Override priority/resources on requeue
ml requeue <task_id> --priority 10 --gpu 1 -- --epochs 20
```

**What it does:**
- Locates `run_manifest.json`
- Extracts `commit_id`
- Submits a new queue request using that `commit_id` with optional overridden args/resources

**Notes:**
- Tasks support optional `snapshot_id` and `dataset_specs` fields server-side (for provenance and dataset resolution).
|||
|
|
@ -12,7 +12,7 @@ make install
|
|||
./bin/ml setup
|
||||
|
||||
# 3. Run experiments
|
||||
./bin/ml run my-experiment.py
|
||||
./cli/zig-out/bin/ml queue my-job
|
||||
```
|
||||
|
||||
That's it. Everything else is optional.
|
||||
|
|
|
|||
|
|
@@ -75,9 +75,32 @@ environment:

security:
  trusted_channels: ["conda-forge", "defaults", "pytorch"]
  blocked_packages: ["requests", "urllib3"]
  blocked_packages: ["aiohttp", "telnetlib"]
```

You can also override the blocked package list at runtime using an environment variable on the worker:

```bash
export FETCHML_JUPYTER_BLOCKED_PACKAGES="aiohttp,telnetlib"
```

Some base images (including the default `quay.io/jupyter/base-notebook`) ship with common HTTP client libraries like `requests`, `urllib3`, and `httpx` preinstalled.

If you want to **block installing** packages like `requests`, `urllib3`, and `httpx` for security reasons but still use a base image that already includes them, you can disable the **startup image scan** separately:

```bash
# Block installs (user requests)
export FETCHML_JUPYTER_BLOCKED_PACKAGES="requests,urllib3,httpx"

# Allow base images that already contain these packages to start
export FETCHML_JUPYTER_STARTUP_BLOCKED_PACKAGES="off"
```

If you want startup scanning enabled, set `FETCHML_JUPYTER_STARTUP_BLOCKED_PACKAGES` to a comma-separated list.

### Access Control

```bash
@@ -1,6 +1,5 @@
---
layout: default
title: Fetch ML Documentation
title: "Fetch ML Documentation"
bookHidden: true
---
@@ -1,8 +1,7 @@
---
layout: page
title: "Operations Runbook"
permalink: /operations/
nav_order: 6
url: "/operations/"
weight: 6
---

# Operations Runbook
@@ -1,8 +1,7 @@
---
layout: page
title: "Task Queue Architecture"
permalink: /queue/
nav_order: 3
url: "/queue/"
weight: 3
---

# Task Queue Architecture
@@ -9,8 +9,8 @@ Get Fetch ML running in minutes with Docker Compose and integrated monitoring.
- **Podman**: For production experiment execution

**Requirements:**
- Go 1.21+
- Zig 0.11+
- Go 1.25+
- Zig 0.15+
- Docker Compose (testing only)
- 4GB+ RAM
- 2GB+ disk space
@@ -137,8 +137,7 @@ cd cli && zig build --release=fast
# Common operations
./cli/zig-out/bin/ml status # Check system status
./cli/zig-out/bin/ml queue job-name # Queue job
./cli/zig-out/bin/ml list # List jobs
./cli/zig-out/bin/ml help # Show help
./cli/zig-out/bin/ml --help # Show help
```

### Monitoring Commands
@@ -1,8 +1,7 @@
---
layout: page
title: "Redis High Availability (Optional)"
permalink: /redis-ha/
nav_order: 7
url: "/redis-ha/"
weight: 7
---

# Redis High Availability
docs/src/research-runner-plan.md (new file, 667 lines)
@@ -0,0 +1,667 @@
# Research-First Runner: Missing Themes Plan

This file captures additional themes that are commonly missing in existing ML runners/experiment tools, translated into actionable design targets for a lightweight, research-first runner.

## Quick Overview

**What makes this different:**
- **Your server, not their cloud**: Everything runs on your homelab/workstation/uni server
- **Dual interfaces**: Zig CLI for scripting + SSH-accessible TUI for interactive work
- **Fair queueing**: `ml queue` (not `run`) makes resource sharing explicit
- **Research narrative**: Capture why you ran experiments, not just what ran
- **Zero SaaS**: No accounts, web dashboards, or external services
- **Plain text everything**: Human-readable manifests, long-term reproducibility

**Perfect for:** Researchers in uni labs, homelab enthusiasts, and small research groups who want control over their infrastructure without cloud vendor lock-in.

## Architecture Context

**Server-Centric Model for Homelab/Workstation/Uni Lab:**
- **Two client interfaces**:
  - **Zig CLI**: Thin WebSocket client for scripting, automation, remote access
  - **SSH-accessible TUI**: Interactive Bubble Tea UI for monitoring when SSH'd into the server
- Go API server with embedded rsync (reduces dependencies)
- Worker pulls from flexible queue backend (Redis/SQLite/filesystem)
- Priority-based scheduling with prewarm mechanism
- NAS integration for data prefetching
- Target: single server, workstation, or small uni lab cluster (not cloud/SaaS)

**Client Access Patterns:**
```bash
# CLI (from anywhere via WebSocket)
ml queue train.py --epochs 100
ml status --watch
ml info <path|id>

# TUI (when SSH'd into server or jump box)
ssh mluser@worker.local
ml-tui # Interactive terminal UI
# Navigate with keyboard, see live updates
```

**Configuration:**
```toml
# ~/.ml/config.toml (shared by both CLI and TUI)
worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"
```
## Plan (Missing Themes)

## Implemented Today (in this repo)

- Runs are queued via `ml queue` and processed by workers.
- Run provenance is written to `run_manifest.json`.
- You can attach queue-time notes with `ml queue --note "..."` (persisted under `run_manifest.json` → `metadata.note`).
- Queue backends support Redis / SQLite / filesystem (and optional filesystem fallback).
- CLI + SSH-launched TUI are both available (`ml monitor` launches the TUI).

## Future Ideas (this document)

### 1. Own-infrastructure-first, research-centric by default

### 2. Minimal server dependencies (simple operations)

### 3. Text-first tracking (logs > dashboards)

- **Research narrative completion**: post-run outcome/learnings/next steps captured in the manifest
- **Auto-captured context**:
  - Command + args (as sent from CLI)
  - Timestamps (queue time, start time, end time)
  - Git commit hash (and optionally diff)
  - Environment snapshot (pip freeze, conda export, container image digest)
  - Hardware context (GPU model, driver version, CUDA version)
- **Plain text manifests**: JSON or YAML, never binary blobs
- **Stable formats**: Can read experiments from 5 years ago without the runner

**Implementation note**: Server writes `run_manifest.json` to the experiment directory. CLI can display it via `ml info`.
### 4. CLI and TUI as complementary interfaces

- **Consistent CLI scripting UX**: Future idea (uniform `--json`, quiet modes, and stable exit codes across commands)
- **TUI feature parity**: Future idea (surface the same key details in TUI + CLI: queue position/ETA, narrative, validation results)

### 5. Failure-tolerant, messy-research friendly

- **Failure is first-class**: Failed runs stay visible and queryable
- **Partial artifacts preserved**: Keep artifacts/logs up to the failure point (including checkpoints, if the script produces them)
- **No punishment for refactors**: Script renames don't break history
- **Grouping/tagging**: Label attempts (baseline/ablation/debug/exploration)

**Server implementation**: Worker should catch exceptions, record the failure reason, and preserve state. Queue should track failure modes (OOM, timeout, code error, data error).

### 6. Minimal abstraction over Python (transparent execution)

- **Run scripts as-is**: No decorators, no framework rewrites
- **Preserve debuggability**: Clean stack traces, pdb works
- **Optional instrumentation**: Explicit metric logging via a simple API
  ```python
  # Optional, not required
  from ml_runner import log_metric
  log_metric("loss", 0.5, step=100)
  ```
- **Standard I/O works**: `print()` goes to logs, arguments via `sys.argv`

**Server implementation**: Worker spawns the process, captures stdout/stderr, and parses optional structured logs. No magic wrappers that hide what's happening.
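That spawn-and-capture loop might look roughly like this sketch. The `METRIC name value step` stdout convention is invented here purely for illustration; the runner's real structured-log format, if any, may differ.

```python
import subprocess


def run_and_capture(cmd: list[str], log_path: str) -> tuple[int, list[dict]]:
    """Spawn the user's script unchanged, tee stdout to a log, parse optional metric lines."""
    metrics = []
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        for line in proc.stdout:
            log.write(line)  # everything the script prints lands in the log
            parts = line.split()
            # Illustrative convention: a line like "METRIC loss 0.5 100"
            if len(parts) == 4 and parts[0] == "METRIC":
                metrics.append(
                    {"name": parts[1], "value": float(parts[2]), "step": int(parts[3])}
                )
        proc.wait()
    return proc.returncode, metrics
```

Note that the script runs as a plain subprocess: stack traces, `print()`, and `pdb` behave exactly as they would outside the runner.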
### 7. Reproducibility that survives time

- **Immutable run folders**: Server never modifies completed runs
- **Environment capture** (best-effort, pluggable):
  - Container image digest (primary method)
  - `pip freeze` / `uv pip freeze` / `poetry.lock`
  - `conda env export`
  - `nix flake.lock` (if available)
- **Hardware fingerprint**: GPU model, driver, CUDA, CPU, RAM
- **Data provenance**: Dataset checksums, NAS paths, version identifiers
- **Commit everything**: Store the full environment, even if verbose

**Server implementation**: A pre-run hook captures the environment and stores it in `run_manifest.json`. Validate with `ml validate <run-id>`.
### 8. Small compute and shared machine friendliness

- **Energy awareness**: Respect that homelabs pay electricity bills
- **Laptop-friendly**: Support thermal/power throttling
- **Single-GPU to 4-GPU range**: Optimize for typical research setups
- **No cluster assumptions**: Don't require Kubernetes/SLURM/etc.

### 9. Server-side storage with client-side visibility

**Why this matters**: Researchers want to `ls` experiment directories but don't want to manually sync. Server handles storage, CLI provides views.
### 11. Research narrative (lab notebook, not job IDs)

- **Queue-time narrative capture**: Future idea (add `--hypothesis`, `--context`, `--intent`, etc. to `ml queue`)
- **Post-run learning capture**: Future idea (explicit `outcome`, `learnings[]`, `next_steps[]`, and validation status)
- **Narrative UX**: Future idea (view/edit the narrative from TUI/CLI without hand-editing JSON)

**CLI commands**:
```bash
ml queue train.py --note "Testing warmup hypothesis from paper X"
```
- CLI: WebSocket streaming for `--watch` and `--follow`
- TUI: Live refresh (500ms tick), immediate queue updates
- **No magic**: Minimize implicit behavior
  - Explicit is better than clever
  - Defaults should be obvious and documented
  - Side effects should be visible (both in CLI and TUI)
  - Configuration hierarchy is clear: CLI flags > env > config file > defaults

**TUI advantages for observability:**
- See everything at once: jobs, queue, GPUs, containers, logs
- Keyboard shortcuts for common operations
- Instant feedback on actions (queue, cancel, delete)
- Prewarm state visible in GPU panel
- No need to run multiple `ml status` commands
### 13. Support clear thinking during experimentation

- **Optimize for cognitive throughput**:
  - Make it easy to remember what you were thinking
  - Surface patterns across experiments
  - Warn about near-duplicates before running
- **Built-in comparison**:
  ```bash
  # Future ideas:
  # ml diff <run-a> <run-b>
  # ml similar <run-id>
  ```
- **Learning from history**:
  ```bash
  # Future ideas:
  # ml lessons --tag ablation
  # ml dead-ends
  ```
- **Hypothesis tracking**:
  - Link hypothesis → experiment → outcome → next hypothesis
  - Mark outcomes: validates/refutes/inconclusive
- **Reduce cognitive load**:
  - Natural queries: Future idea (search over manifests/notes)
  - Show relevant history when queueing
  - Don't make researchers remember IDs

**Server implementation**: Maintain an index (rebuildable from the filesystem). Support semantic queries over manifests, notes, and tags.
### 14. Fast iteration velocity

- **Easy modification**:
  ```bash
  # Future ideas:
  # ml clone <run-id>
  # ml fork <run-id>
  ```
- **Batch operations**:
  ```bash
  # Future idea: ml sweep
  ```

**Why prewarm matters**: Your NAS prefetch in prewarm means jobs start training immediately instead of waiting for data. This dramatically improves iteration velocity.

### 15. Full research lifecycle support

- **Exploration phase**: Minimal metadata, quick runs
- **Development phase**: Group attempts, compare variations
- **Validation phase**: Strict reproducibility, complete capture
- **Publication phase**: Export bundles, generate reproduction instructions
- **Maintenance phase**: Long-term readable, re-executable years later

**Reproducibility levels** (your strict/best-effort model):
```bash
# Future idea: --repro-level
ml validate <commit_id> # Future idea: expand validation coverage + outputs
```
### 16. Collaboration without platforms

- **Async collaboration** (no shared server required):
  ```bash
  # Future ideas:
  # ml export <run-id> --bundle run_42.tar.gz
  # ml import run_42.tar.gz
  ```
- **Selective sharing**:
  ```bash
  # Future ideas:
  # ml export <run-id> --metadata-only
  # ml export <run-id> --include-artifacts
  ```
- **Review-friendly**:
  - Self-contained bundles
  - All provenance included
  - Reproducibility instructions
  - No "install our platform" friction

**Server implementation**: Export packages `run_manifest.json` + artifacts into a tarball. Import validates and unpacks into the experiments directory.
### 17. Graceful degradation

- **Core works with minimal setup**:
  - Filesystem-only queue (no Redis required)
  - SQLite for metadata (no Postgres)
  - Local execution (no remote targets needed)
- **Optional enhancements**:
  - Redis for better multi-worker queueing
  - Git integration (works without git)
  - NAS prewarm (falls back to on-demand fetch)
  - WebSocket updates (falls back to polling)
- **Progressive disclosure**:
  - Simple commands for simple cases
  - Advanced flags for power users
  - Features activate when available

**Implementation note**:

### 18. Concrete features (derived from above)

#### Findability
```bash
# Future ideas:
# ml find "failed runs on GPU2 last week"
# ml find --note "warmup"
```
Server maintains a rebuildable index over manifests, logs, and tags.
#### Dataset provenance
```json
{
  "datasets": [
    {
      "name": "imagenet-train",
      "nas_path": "/nas/datasets/imagenet/train",
      "checksum": "sha256:abc123...",
      "fetched_at": "2024-01-15T10:30:00Z",
      "fetch_method": "prewarm"
    }
  ]
}
```
Server validates checksums and warns on drift.
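A drift check could hash the dataset tree and compare it against the manifest entry. This is a sketch: the field names follow the JSON above, but the particular hashing scheme (files in sorted order, path plus bytes) is an assumption, not the runner's actual algorithm.

```python
import hashlib
from pathlib import Path


def dataset_checksum(root: str) -> str:
    """Hash all files under a dataset directory in a stable (sorted) order."""
    h = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Include the relative path so renames also count as drift.
            h.update(path.relative_to(root).as_posix().encode())
            h.update(path.read_bytes())
    return "sha256:" + h.hexdigest()


def check_drift(entry: dict) -> bool:
    """Return True if the on-disk dataset still matches the recorded checksum."""
    return dataset_checksum(entry["nas_path"]) == entry["checksum"]
```

Reading every byte is expensive for large datasets, so a real implementation would likely cache per-file hashes or sample files rather than rehash on every run.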
#### Prewarm observability
```bash
ml status
# Shows:
# Next in queue: run_xyz (priority 5)
# Prewarming: dataset imagenet-train (2/5 complete)
# GPU 0: running run_abc (50% complete, ETA 2h)
# GPU 1: idle
```

#### CLI queue/requeue workflows

**Core principle**: the runner does not introduce checkpoint conventions. The script should run identically whether executed directly or via `ml`.

**Passive artifact tracking** (future idea): the worker records which files exist in the run directory after completion (or via configured glob patterns). Checkpoints are just artifacts.
**Requeue = replay command with modifications** (future idea):
```bash
# Original run
ml queue train.py --epochs 100 --save-dir ./checkpoints

# Requeue (continue)
ml requeue run_abc -- --resume ./checkpoints/best.pt --epochs 200
```

**Arg merge strategies** (future idea):
```bash
# Append new args (default)
ml requeue run_abc --append -- --resume ./checkpoints/best.pt

# Replace (rerun with only new args)
ml requeue run_abc --replace -- --epochs 200 --lr 3e-4

# Merge (override matching flags, keep the rest)
ml requeue run_abc --merge -- --epochs 200
```
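For illustration, the three strategies could behave like this sketch. It assumes simple `--flag value` pairs only (no `--flag=value` or repeated flags), and `append`/`replace`/`merge` are the future-idea names from the block above, not an implemented API.

```python
def merge_args(old: list[str], new: list[str], mode: str = "append") -> list[str]:
    """Sketch of the three requeue merge strategies for '--flag value' style args."""
    if mode == "append":
        return old + new
    if mode == "replace":
        return list(new)

    # mode == "merge": override flags that reappear, keep the rest in order.
    def to_pairs(args: list[str]) -> dict:
        pairs, i = {}, 0
        while i < len(args):
            if args[i].startswith("--") and i + 1 < len(args) and not args[i + 1].startswith("--"):
                pairs[args[i]] = args[i + 1]
                i += 2
            else:
                pairs[args[i]] = None  # bare flag with no value
                i += 1
        return pairs

    merged = to_pairs(old)
    merged.update(to_pairs(new))
    return [tok for flag, val in merged.items() for tok in ([flag] if val is None else [flag, val])]
```

A real implementation would need to handle the ambiguous cases this sketch ignores (negative numbers as values, `--flag=value`, repeated flags).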
**Optional staging** (future idea): copy an artifact from the source run into the new run directory, then reference it with a placeholder.
```bash
ml requeue run_abc --stage checkpoints/best.pt -- \
  --resume {staged}/best.pt --epochs 200
```

#### Hardware/resource management
```json
{
  "resources": {
    "gpus": 2,
    "gpu_memory_gb": 40,
    "cpu_cores": 16,
    "ram_gb": 64,
    "disk_gb": 100,
    "max_runtime_hours": 24
  }
}
```
Worker validates resources before pulling from the queue. Server tracks utilization.
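The fit check itself can be a simple per-key comparison, sketched here with the resource keys mirroring the JSON above (the `fits` helper is illustrative, not the worker's actual code):

```python
def fits(requested: dict, available: dict) -> bool:
    """Worker-side check: only pull a task whose resource request fits this machine."""
    # Any key missing from the node's inventory counts as zero capacity.
    return all(available.get(key, 0) >= need for key, need in requested.items())


task = {"gpus": 2, "gpu_memory_gb": 40, "ram_gb": 64}
node = {"gpus": 4, "gpu_memory_gb": 80, "ram_gb": 128, "cpu_cores": 32}
assert fits(task, node)              # enough of everything
assert not fits({"gpus": 8}, node)   # more GPUs than the node has
```

Checking before pulling (rather than after) keeps oversized tasks in the queue for a worker that can actually run them.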
|
||||
|
||||
---

## Design Philosophy Summary (Server-Centric)

The goal is to build a **research assistant that runs on YOUR server**, not a platform that runs on someone else's cloud.

### Every feature should answer:

1. Does this help researchers **understand** what happened on the server?
2. Does this make the server **transparent** instead of a black box?
3. Does this work on a **single workstation** or small lab server?
4. Does this respect that researchers **SSH into the server**?
5. Does this make **local data** (NAS, scratch drives) first-class?

### Architecture principles:

- **Server is the control plane**: All logic, storage, scheduling on server
- **CLI is a thin client**: Just communicates via WebSocket, no local state
- **Filesystem is still king**: Server writes plain text, CLI reads via API
- **Queue-first for fairness**: `ml queue` not `ml run` - explicit resource requests
- **Priority without hogging**: Higher priority = earlier in queue, not exclusive access
- **Prewarm is a performance optimization**: Best-effort, never required for correctness
- **NAS integration is native**: Server understands mounted storage

### When in doubt:

- **Server-side is better** than client-side (for logic)
- **WebSocket is better** than REST (for interactivity)
- **Embedded is better** than external deps (rsync in server)
- **Flexible backend is better** than required service (Redis OR SQLite OR filesystem)
- **Plain text is better** than binary
- **Your hardware is better** than their cloud

The runner should feel like **SSH into your well-organized research server with powerful tools**, not like operating a cloud platform. Whether you're using the CLI for automation or the TUI for interactive work, the experience should be transparent, fair, and research-focused.

---

## Typical Research Workflows (CLI + TUI)

### Morning Routine: Check What Happened Overnight
```bash
# From your laptop (via WebSocket)
ml status
# Shows: 2 finished, 1 running, 3 in queue

ml info run_abc --show-metrics
# Quick check: did the overnight run validate the hypothesis?

# If you need deep investigation, SSH in
ssh mluser@worker.local
ml-tui
# Visual inspection of logs, GPU usage, etc.
```

### Starting a New Experiment Series
```bash
# Script a parameter sweep (CLI automation)
# Future idea: --hypothesis / --experiment-group flags
for lr in 1e-3 3e-4 1e-4; do
  ml queue train.py --lr $lr --priority 5
done

# Monitor in TUI (interactive)
ssh mluser@worker.local
ml-tui
# Watch queue, see ETA, check prewarm status
```

### Debugging a Failed Run
```bash
# Notice failure via CLI
ml status
# run_xyz: failed (exit code 137) - OOM?

# Jump into TUI for investigation
ssh mluser@worker.local
ml-tui
# Navigate to run_xyz, press 'l' for logs
# See OOM error at batch 128
# Future idea: narrative/annotation UX in the TUI
```

### End-of-Day Review
```bash
# TUI for visual summary
ssh mluser@worker.local
ml-tui
# Scroll through today's runs
# Future ideas: compare views, export bundles
```

### Paper Writing Time (6 months later)
```bash
# Today: use the filesystem + run manifests
ml info <path|id>

# Future ideas: searching/filtering + comparison reports

# TUI for visual exploration
ssh mluser@worker.local
ml-tui
# Navigate through old experiments
# Press 'n' to read narratives
# Reconstruct your thought process
```

### Collaborative Debugging with Advisor
```bash
# Both SSH into server simultaneously
ssh mluser@worker.local

# You run TUI to show current state
ml-tui
# Navigate to problem run, show logs live

# Advisor suggests fix
# You queue a new run with their suggestion
# Future idea: --parent-run
ml queue train.py --lr 1e-4 \
  --note "Per advisor: try smaller LR with warmup" \
  --priority 7

# Watch it start in TUI immediately
# Queue position visible, prewarm status shown
```

This dual-interface approach gives researchers the best of both worlds: **scriptability when they need it, visibility when they want it**.

---

## How This Maps to Your Current Architecture

✅ **Already correct**:
- Server-centric with dual client interfaces (CLI + TUI)
- WebSocket communication (CLI)
- SSH-based TUI with Bubble Tea (interactive monitoring)
- Embedded rsync in server
- Flexible queue backend (Redis/SQLite/filesystem)
- Priority scheduling
- Prewarm mechanism for NAS prefetch
- **Fair queueing philosophy** - `queue` not `run`
- TUI shows live updates: jobs, queue, GPU status, logs

🎯 **Natural extensions**:
- Queue-time narrative flags for `ml queue` (hypothesis/context/intent/etc.)
- CLI commands for diffing and finding (and higher-level comparison workflows)
- TUI panels for hypothesis/learnings (in job details)
- Reproducibility validation improvements (extend `ml validate`)
- Export/import for collaboration
- Graceful degradation (filesystem-only mode)
- Visible queue position and fairness metrics

📝 **Design considerations**:
- Show prewarm state/progress in `ml status`
- Show queue position and ETA in both CLI and TUI
- Add research context fields to manifests
- Build comparison workflows (diff, similar, why-different)
- Support hypothesis tracking in both interfaces
- Create export bundles for sharing
- Expose fairness metrics (wait time distribution, resource utilization)
- TUI could show narrative snippets in job list (hypothesis as subtitle?)

**TUI Research Narrative Integration Ideas:**
```
┌─ ML Jobs & Queue ─────────────────────────────────────┐
│ > imagenet_baseline                                   │
│   ✓ finished | Priority: 5                            │
│   "Testing baseline performance before ablations"     │
│                                                       │
│   batch_size_64                                       │
│   ▶ running (epoch 45/100) | Priority: 5              │
│   "Validating linear LR scaling hypothesis"           │
│                                                       │
│   warmup_test                                         │
│   ⏳ queued (position 2) | Priority: 3                │
│   "Following up on advisor suggestion about warmup"   │
└───────────────────────────────────────────────────────┘

Press 'n' to view narrative, 'a' to annotate
```

**Implementation status (today):**
- **Annotations are implemented** and stored at the **root** of `run_manifest.json` as `annotations[]`.
- **Narrative fields are implemented** and stored in `run_manifest.json` under `narrative` (set/update via CLI).
- Use `ml annotate <path|run_id|task_id> --note "..." [--author "..."]` to append an entry.
- Remaining gaps are around **queue-time capture**, **post-run learnings/outcomes**, and **TUI-first narrative UX**.
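Conceptually, appending an annotation is a read-modify-write on the manifest. A sketch of what `ml annotate` amounts to (the real CLI goes through the server; this hypothetical helper writes the file directly, which skips locking and auth):

```python
import json
from datetime import datetime, timezone


def annotate(manifest_path: str, note: str, author: str = "") -> dict:
    """Append an entry to the root-level annotations[] of run_manifest.json."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "author": author,
        "note": note,
    }
    manifest.setdefault("annotations", []).append(entry)
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return entry
```

Keeping annotations at the manifest root (rather than inside `narrative`) matches the current implementation: narrative is set at queue time, annotations accumulate afterwards.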
Example `run_manifest.json` (annotated with `//` comments for explanation; the file on disk is plain JSON):
```json
{
  // === Standard Execution Metadata ===
  "run_id": "2024-01-15_abc123",
  "status": "completed",
  "command": "train.py --lr 0.001 --epochs 100 --batch-size 64",
  "queued_at": "2024-01-15T10:25:00Z",
  "started_at": "2024-01-15T10:30:00Z",
  "ended_at": "2024-01-15T14:45:00Z",
  "exit_code": 0,
  "priority": 5,

  // === Research Narrative (The Important Part) ===
  "narrative": {
    // WHY did you run this?
    "hypothesis": "Larger batch size with linear LR scaling should improve convergence speed without hurting final accuracy",

    // WHAT were you thinking at the time?
    "context": "Previous run (run_789) with batch=32 took 8 hours and plateaued at 0.85. Paper XYZ suggests linear scaling rule should work.",

    // WHAT were you trying to accomplish?
    "intent": "Test if doubling batch size (32→64) with 2x learning rate maintains accuracy while reducing training time",

    // WHAT did you expect to happen?
    "expected_outcome": "Similar final accuracy (~0.85) but ~4 hour training time instead of 8",

    // HOW is this related to other experiments?
    "parent_run": "2024-01-14_run789",
    "experiment_group": "batch-size-scaling-ablation",
    "tags": ["ablation", "batch-size", "convergence-speed", "paper-xyz-reproduction"],

    // WHAT did you learn? (filled in post-run or during)
    "outcome": "Success: accuracy=0.87 (+0.02), time=3.5h (-56%). Linear scaling rule validated.",
    "learnings": [
      "Linear LR scaling worked as expected from paper XYZ",
      "GPU memory utilization went from 60% to 95% - near limit",
      "Convergence was actually smoother (fewer spikes in loss curve)",
      "Could probably push to batch=96 before OOM"
    ],
    "next_steps": [
      "Try batch=96 to maximize GPU utilization",
      "Test if this scales to batch=128 with gradient accumulation",
      "Validate on other datasets (currently only tested on ImageNet)"
    ],
    "validation_status": "validates" // or "refutes", "inconclusive", "partial"
  },

  // Human annotations added later
  "annotations": [
    {
      "timestamp": "2024-01-15T15:00:00Z",
      "author": "user@lab.edu",
      "note": "This result is strong enough for the paper. Use these hyperparams for final training."
    },
    {
      "timestamp": "2024-01-16T09:00:00Z",
      "author": "advisor@lab.edu",
      "note": "Good work. Also compare with warmup schedule before finalizing."
    }
  ],

  // === Reproducibility Metadata ===
  "environment": {
    "git_commit": "a1b2c3d4",
    "git_dirty": false,
    "git_branch": "experiment/batch-scaling",
    "container_image": "pytorch/pytorch:2.0.1-cuda11.8-cudnn8-runtime",
    "container_digest": "sha256:abc123...",
    "pip_freeze": "torch==2.0.1\ntorchvision==0.15.2\n...",
    "cuda_version": "11.8",
    "gpu_driver": "525.105.17",
    "python_version": "3.10.12"
  },

  // === Data Provenance ===
  "datasets": [
    {
      "name": "imagenet-train",
      "nas_path": "/nas/datasets/imagenet/ILSVRC2012/train",
      "checksum": "sha256:def456...",
      "size_gb": 144.2,
      "num_samples": 1281167,
      "version": "ILSVRC2012",
      "fetched_via": "prewarm",
      "fetch_time_seconds": 180
    }
  ],

  // === Resource Usage ===
  "resources": {
    "requested": {
      "gpus": 1,
      "gpu_memory_gb": 24,
      "cpu_cores": 8,
      "ram_gb": 32
    },
    "actual": {
      "gpu_utilization_avg": 95,
      "gpu_memory_peak_gb": 22.8,
      "cpu_utilization_avg": 45,
      "ram_peak_gb": 28.5,
      "disk_read_gb": 145,
      "disk_write_gb": 12
    },
    "gpu_model": "NVIDIA RTX 3090",
    "host": "ml-server-01"
  },

  // === Results ===
  "metrics": {
    "final_train_accuracy": 0.891,
    "final_val_accuracy": 0.873,
    "final_train_loss": 0.234,
    "final_val_loss": 0.287,
    "best_val_accuracy": 0.876,
    "best_epoch": 87,
    "total_epochs": 100,
    "training_time_hours": 3.52
  },

  // === Artifacts ===
  "artifacts": {
    "discovery_time": "2024-01-15T14:45:00Z",
    "files": [
      {
        "path": "checkpoints/epoch_010.pth",
        "size_bytes": 450000000,
        "modified": "2024-01-15T11:30:00Z"
      },
      {
        "path": "checkpoints/best.pth",
        "size_bytes": 450000000,
        "modified": "2024-01-15T13:45:00Z"
      }
    ],
    "total_size_bytes": 900000000
  }
}
```
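Everything needed for the "paper writing time" review above is already in that manifest; a hypothetical helper that reduces one to a single line (the field names come from the example, the function itself is illustrative):

```python
def summarize(manifest: dict) -> str:
    """One-line digest of a run: id, status, headline metric, and outcome."""
    narrative = manifest.get("narrative", {})
    metrics = manifest.get("metrics", {})
    best = metrics.get("best_val_accuracy")
    parts = [
        manifest.get("run_id", "?"),
        manifest.get("status", "?"),
        f"best_val_acc={best}" if best is not None else "no metrics",
        narrative.get("validation_status", "no narrative"),
    ]
    return " | ".join(str(p) for p in parts)
```

Because the manifest is plain JSON on the filesystem, this kind of tooling works even with the server offline.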
@ -61,13 +61,16 @@ User roles and permissions are configured on the server side by administrators.
### Data Scientist Workflow
```bash
# Submit your experiment
ml run my-experiment
ml queue my-experiment

# Check your experiments (only shows yours)
ml status

# Cancel your own experiment
ml cancel my-experiment

# Requeue a previous run with different args
ml requeue <run_id|task_id|path> -- --epochs 20
```

### Administrator Workflow

@ -1,7 +1,6 @@
---
layout: page
title: "Validation (ml validate)"
permalink: /validate/
url: "/validate/"
---

# Validation (`ml validate`)

@ -1,8 +1,7 @@
---
layout: page
title: "Zig CLI Guide"
permalink: /zig-cli/
nav_order: 3
url: "/zig-cli/"
weight: 3
---

# Zig CLI Guide

@ -28,7 +27,7 @@ The CLI reads `~/.ml/config.toml` and respects `FETCH_ML_CLI_*` env vars:
worker_host = "127.0.0.1"
worker_user = "dev_user"
worker_base = "/tmp/ml-experiments"
worker_port = 9101
worker_port = 22
api_key = "your-api-key"
```

@ -59,4 +58,4 @@ All use `zig build-exe` with `-OReleaseSmall -fstrip` and are compatible with Li

## CI/CD

The release workflow builds cross‑platform binaries and packages them with checksums. See `.github/workflows/release.yml`.
The release workflow builds cross‑platform binaries and packages them with checksums. See `.forgejo/workflows/release-mirror.yml`.