From 3fa4f6ae51ecff10778a578b5502b9bad23f5031 Mon Sep 17 00:00:00 2001
From: Jeremie Fraeys
Date: Mon, 5 Jan 2026 12:41:22 -0500
Subject: [PATCH] docs: update architecture/queue pages and remove legacy development setup

---
 .windsurf/rules/test-new-features.md |  90 -----------------
 docs/src/architecture.md             | 146 +++++++++++++++++++++++++--
 docs/src/development-setup.md        |  55 ----------
 docs/src/queue.md                    |  15 ++-
 4 files changed, 145 insertions(+), 161 deletions(-)
 delete mode 100644 .windsurf/rules/test-new-features.md
 delete mode 100644 docs/src/development-setup.md

diff --git a/.windsurf/rules/test-new-features.md b/.windsurf/rules/test-new-features.md
deleted file mode 100644
index cac25e9..0000000
--- a/.windsurf/rules/test-new-features.md
+++ /dev/null
@@ -1,90 +0,0 @@
----
-trigger: model_decision
-description: When a new feature is added, this prompt needs to be run
----
-
-# Development Guidelines
-
-## Code Quality Standards
-
-### Testing Requirements
-- MANDATORY: Every new feature MUST include corresponding tests
-- Write tests BEFORE implementing complex features (TDD approach)
-- Test coverage for new code should be >80%
-- Include both unit tests and integration tests where applicable
-- Test edge cases, error paths, and boundary conditions
-
-### Documentation Standards
-- Update relevant documentation IN THE SAME COMMIT as code changes
-- Documentation locations:
-  - README.md: User-facing features, installation, quick start
-  - CHANGELOG.md: All changes, following Keep a Changelog format
-  - Code comments: Complex logic, non-obvious decisions, API contracts
-  - Function/struct docs: Public APIs must have doc comments
-- Use concrete examples in documentation
-- Keep docs concise but complete
-
-### Code Organization
-- CRITICAL: Clean up as you go - no orphaned files or dead code
-- Remove commented-out code blocks (use git history instead)
-- Delete unused imports, functions, and variables immediately
-- Consolidate duplicate code into reusable functions
-- Move TODO items from loose files into:
-  - Code comments with `// TODO(context):` for implementation tasks
-  - GitHub Issues for larger features
-  - NEVER create standalone .md files for tracking
-
-### When Making Changes
-For EVERY significant change, complete ALL of these:
-
-1. Write/update tests
-2. Update documentation (README, CHANGELOG, code comments)
-3. Update build scripts if dependencies/build process changed
-4. Remove any temporary/debug code added during development
-5. Delete unused files created during exploration
-6. Verify no dead code remains (unused functions, imports, variables)
-
-### Cleanup Checklist (Run BEFORE committing)
-- [ ] Removed all debug print statements
-- [ ] Deleted temporary test files
-- [ ] Removed commented-out code
-- [ ] Cleaned up unused imports
-- [ ] Deleted exploratory/spike code
-- [ ] Consolidated duplicate logic
-- [ ] Removed obsolete scripts/configs
-
-### Communication Style
-- Report what you've done: "Added feature X with tests in test/x_test.go"
-- Highlight what needs attention: "WARNING: Manual testing needed for edge case Y"
-- Ask questions directly: "Should we support Z? Trade-offs are..."
-- NEVER say "I'll track this in a markdown file" - use code comments or tell me directly
-
-### Script/Build System Updates
-- Update Makefile/build.zig when adding new targets or commands
-- Modify CI/CD configs (.github/workflows) if build/test process changes
-- Update package.json/Cargo.toml/go.mod when dependencies change
-- Document new scripts in README under "Development" section
-
-## Anti-Patterns to AVOID
-- Creating notes.md, todo.md, tasks.md, ideas.md files
-- Leaving commented-out code "for reference"
-- Keeping old implementation files with .old or .backup suffixes
-- Adding features without tests
-- Updating code without updating docs
-- Leaving TODO comments without context or assignee
-
-## Preferred Patterns
-- Inline TODO comments: `// TODO(user): Add caching layer for better performance`
-- Self-documenting code with clear names
-- Tests that serve as usage examples
-- Incremental, complete commits (code + tests + docs)
-- Direct communication about tasks and priorities
-
-## Definition of Done
-A task is complete ONLY when:
-1. Code is written and working
-2. Tests are written and passing
-3. Documentation is updated
-4. All temporary/dead code is removed
-5. Build scripts are updated if needed
-6. Changes are committed with clear message
\ No newline at end of file
diff --git a/docs/src/architecture.md b/docs/src/architecture.md
index 89b2437..88edd90 100644
--- a/docs/src/architecture.md
+++ b/docs/src/architecture.md
@@ -15,14 +15,22 @@ Simple, secure architecture for ML experiments in your homelab.
 graph TB
     subgraph "Homelab Stack"
        CLI[Zig CLI]
-       API[HTTPS API]
+       API["API Server (HTTPS + WebSocket)"]
        REDIS[Redis Cache]
+       DB[(SQLite/PostgreSQL)]
        FS[Local Storage]
+       WORKER[Worker Service]
+       PODMAN[Podman/Docker]
     end

     CLI --> API
     API --> REDIS
+    API --> DB
     API --> FS
+    WORKER --> API
+    WORKER --> REDIS
+    WORKER --> FS
+    WORKER --> PODMAN
 ```

 ## Core Services
@@ -81,7 +89,7 @@ sequenceDiagram
     participant Redis
     participant Storage

-    CLI->>API: HTTPS Request
+    CLI->>API: HTTPS + WebSocket request
     API->>API: Validate Auth
     API->>Redis: Cache/Queue
     API->>Storage: Experiment Data
@@ -107,7 +115,7 @@ services:
 ### Local Setup

 ```bash
-./setup.sh && ./manage.sh start
+docker-compose -f deployments/docker-compose.dev.yml up -d
 ```

 ## Network Architecture
@@ -121,9 +129,11 @@ services:
 ```
 data/
-├── experiments/   # ML experiment results
-├── cache/         # Temporary cache files
-└── backups/       # Local backups
+├── experiments/   # Experiment definitions, run manifests, and artifacts
+├── tracking/      # Tracking tool state (e.g., MLflow/TensorBoard), when enabled
+├── .prewarm/      # Best-effort prewarm staging (snapshots/env/datasets), when enabled
+├── cache/         # Temporary caches (best-effort)
+└── backups/       # Local backups

 logs/
 ├── app.log        # Application logs
@@ -136,7 +146,7 @@ logs/
 Simple, lightweight monitoring:
 - **Health Checks**: Service availability
 - **Log Files**: Structured logging
-- **Basic Metrics**: Request counts, error rates
+- **Prometheus Metrics**: Worker and API metrics (including prewarm hit/miss/timing)
 - **Security Events**: Failed auth, rate limits

 ## Homelab Benefits
@@ -155,7 +165,7 @@ graph TB
     subgraph "Client Layer"
        CLI[CLI Tools]
        TUI[Terminal UI]
-       API[REST API]
+       API[WebSocket API]
     end

     subgraph "Authentication Layer"
@@ -200,6 +210,41 @@ graph TB
     Podman --> Containers
 ```

+## Tracking & Plugin System
+
+fetch_ml includes an optional tracking plugin system that can provision sidecar tools and/or pass through environment variables for common research tracking stacks.
+
+### Tracking modes
+
+Tracking tools support the following modes:
+
+- `sidecar`: provision a local sidecar container per task (best-effort).
+- `remote`: point to an externally managed instance (no local provisioning).
+- `disabled`: disable the tool entirely.
+
+### How it works
+
+- The worker maintains a tracking registry and provisions tools during task startup.
+- Provisioned plugins return environment variables that are injected into the task container.
+- Some plugins also require host paths (e.g., the TensorBoard log directory); these are mounted into the task container and sanitized to avoid leaking host paths.
+
+### Built-in plugins
+
+The worker ships with built-in plugins:
+
+- `mlflow`: can run an MLflow server as a sidecar or use a remote `MLFLOW_TRACKING_URI`.
+- `tensorboard`: runs a TensorBoard sidecar and mounts a per-job log directory.
+- `wandb`: does not provision a sidecar; it forwards configuration via environment variables.
+
+### Configuration
+
+Plugins can be configured via worker configuration under `plugins`, including:
+
+- `enabled`
+- `image`
+- `mode`
+- per-plugin paths/settings (e.g., artifact base path, log base path)
+
 ## Zig CLI Architecture

 ### Component Structure
@@ -733,6 +778,91 @@ graph TB
 - **Security**: Built-in authentication and encryption
 - **Monitoring**: Basic health checks and logging

+## Roadmap (Research-First, Workstation-First)
+
+fetch_ml is a research-first ML experiment runner with production-grade discipline.
+
+### Guiding principles
+
+- **Reproducibility over speed**: optimizations must never change experimental semantics.
+- **Explicit over magic**: every run should be explainable from manifests, configs, and logs.
+- **Best-effort optimizations**: prewarming/caching must be optional and must not be required for correctness.
+- **Workstation-first**: prioritize single-node reliability, observability, and fast iteration; avoid HPC-specific complexity.
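
As a concrete illustration of the plugin configuration described in the Tracking & Plugin System section above, a worker config fragment might look roughly like this. This is a sketch: only `plugins`, `enabled`, `image`, and `mode` are named in the docs; all other keys and values here are assumptions, not the actual schema.

```yaml
plugins:
  mlflow:
    enabled: true
    mode: sidecar                           # sidecar | remote | disabled
    image: ghcr.io/mlflow/mlflow:latest     # hypothetical image reference
    artifact_base_path: /data/tracking/mlflow   # assumed per-plugin setting
  tensorboard:
    enabled: true
    mode: sidecar
    log_base_path: /data/tracking/tensorboard   # assumed per-plugin setting
  wandb:
    enabled: false
    mode: remote                            # env-var passthrough only; no sidecar
```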
+
+### Where we are now
+
+- **Run provenance**: `run_manifest.json` exists and is readable via `ml info <task-id>`.
+- **Validation**: `ml validate <experiment>` and `ml validate --task <task-id>` exist; task validation includes run-manifest lifecycle/provenance checks.
+- **Prewarming (Phase 1, best-effort)**:
+  - Next-task prewarm loop stages snapshots under `base/.prewarm/snapshots/`.
+  - Best-effort dataset prefetch with a TTL cache.
+  - Warmed container image infrastructure exists (images keyed by `deps_manifest_sha256`).
+  - Prewarm status is surfaced in `ml status --json` under the `prewarm` field.
+
+### Phase 0: Trust and usability (highest priority)
+
+#### 1) Make `ml status` excellent (human output)
+
+- Show a compact summary of:
+  - queued/running/completed/failed counts
+  - a short list of the most relevant tasks
+  - **prewarm state** (worker id, target task id, phase, dataset count, age)
+- Preserve `--json` output as a stable API for scripting.
+
+#### 2) Add a dry-run preview command (`ml explain`)
+
+- Print the resolved execution plan before running:
+  - commit id, experiment manifest overall sha
+  - dependency manifest name + sha
+  - snapshot id + expected sha (when applicable)
+  - dataset identities + checksums (when applicable)
+  - requested resources (cpu/mem/gpu)
+  - candidate runtime image (base vs warmed tag)
+- Enforce a strict preflight by default:
+  - Queue-time blocking (do not enqueue tasks that fail reproducibility requirements).
+  - The strict preflight should be shared by `ml queue` and `ml explain`.
+- Record the resolved plan into task metadata for traceability:
+  - `repro_policy: strict`
+  - `trust_level: <level>` (simple trust ladder)
+  - `plan_sha256: <digest>` (digest of the resolved execution plan)
+
+#### 3) Tighten run manifest completeness
+
+- For `running`: require `started_at`.
+- For `completed/failed`: require `started_at`, `ended_at`, and `exit_code`.
+- When snapshots/datasets are used: ensure the manifest records the relevant identifiers and digests.
+
+#### 4) Dataset identity (minimal but research-grade)
+
+- Prefer structured `dataset_specs` (name + checksum) as the authoritative input.
+- Treat a missing checksum as an error by default (strict-by-default).
+
+### Phase 1: Simple performance wins (only after Phase 0 feels solid)
+
+- Keep prewarming single-level (next task only).
+- Improve observability first (status output + metrics), then expand capabilities.
+
+### Phase 2+: Research workflows
+
+- `ml compare <run-a> <run-b>`: manifest-driven diff of provenance and key parameters.
+- `ml reproduce <run-id>`: submit a new task derived from the recorded manifest inputs.
+- `ml export <run-id>`: package provenance + artifacts for collaborators/reviewers.
+
+### Phase 3: Infrastructure (only if needed)
+
+- Multi-level prewarming, predictive scheduling, tmpfs caching, dashboards.
+- Optional scalable storage backend for team deployments:
+  - Store run manifests + artifacts in S3-compatible object storage (e.g., MinIO) for durability and multi-worker/Kubernetes setups.
+  - Keep workstation-first defaults (local filesystem) for simplicity.
+- Optional integrations via plugins/exporters (keep core strict and offline-capable):
+  - Server-side exporters that mirror run metadata, metrics, and artifacts to external systems (e.g., MLflow Tracking, Weights & Biases).
+  - Prefer lifecycle hooks that consume `run_manifest.json` / artifact manifests over plugins that influence execution semantics.
+- Optional Kubernetes deployment path (for teams on scalable infra):
+  - Publish versioned container images for the backend (API server; optionally the worker) and provide reference manifests (Helm/Kustomize).
+  - Keep the CLI as the primary UX; Kubernetes is an execution/deployment backend, not a UI.
+- These are optional and should be driven by measured bottlenecks.
+ --- This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity. diff --git a/docs/src/development-setup.md b/docs/src/development-setup.md deleted file mode 100644 index 55bf47f..0000000 --- a/docs/src/development-setup.md +++ /dev/null @@ -1,55 +0,0 @@ -# Development Setup - -Set up your local development environment for Fetch ML. - -## Prerequisites - -**Container Runtimes:** -- **Docker Compose**: For testing and development only -- **Podman**: For production experiment execution - -- Go 1.21+ -- Zig 0.11+ -- Docker Compose (testing only) -- Redis (or use Docker) -- Git - -## Quick Setup - -```bash -# Clone repository -git clone https://github.com/jfraeys/fetch_ml.git -cd fetch_ml - -# Start dependencies -see [Quick Start](quick-start.md) for Docker setup redis postgres - -# Build all components -make build - -# Run tests -see [Testing Guide](testing.md) -``` - -## Detailed Setup - - -## Quick Start -```bash -git clone https://github.com/jfraeys/fetch_ml.git -cd fetch_ml -see [Quick Start](quick-start.md) for Docker setup -make build -see [Testing Guide](testing.md) -``` - -## Key Commands -- `make build` - Build all components -- `see [Testing Guide](testing.md)` - Run tests -- `make dev` - Development build -- `see [CLI Reference](cli-reference.md) and [Zig CLI](zig-cli.md)` - Build CLI - -## Common Issues -- Build fails: `go mod tidy` -- Zig errors: `cd cli && rm -rf zig-out zig-cache` -- Port conflicts: `lsof -i :9101` diff --git a/docs/src/queue.md b/docs/src/queue.md index 1697467..08e9f37 100644 --- a/docs/src/queue.md +++ b/docs/src/queue.md @@ -218,7 +218,7 @@ func (w *Worker) executeTask(task *queue.Task) { ## Configuration -### API Server (`configs/config.yaml`) +### API Server (`configs/api/dev.yaml`) ```yaml redis: @@ -227,15 +227,14 @@ redis: db: 0 ``` -### Worker (`configs/worker-config.yaml`) +### Worker (`configs/workers/docker.yaml`) ```yaml -redis: - 
addr: "localhost:6379" - password: "" - db: 0 - -metrics_flush_interval: 500ms +redis_addr: "localhost:6379" +redis_password: "" +redis_db: 0 + +metrics_flush_interval: "500ms" ``` ## Monitoring