docs: update architecture/queue pages and remove legacy development setup

This commit is contained in:
Jeremie Fraeys 2026-01-05 12:41:22 -05:00
parent f6e506a632
commit 3fa4f6ae51
4 changed files with 145 additions and 161 deletions


@@ -1,90 +0,0 @@
---
trigger: model_decision
description: When a new feature is added, this prompt needs to be run
---
# Development Guidelines
## Code Quality Standards
### Testing Requirements
- MANDATORY: Every new feature MUST include corresponding tests
- Write tests BEFORE implementing complex features (TDD approach)
- Test coverage for new code should be >80%
- Include both unit tests and integration tests where applicable
- Test edge cases, error paths, and boundary conditions
### Documentation Standards
- Update relevant documentation IN THE SAME COMMIT as code changes
- Documentation locations:
- README.md: User-facing features, installation, quick start
- CHANGELOG.md: All changes, following Keep a Changelog format
- Code comments: Complex logic, non-obvious decisions, API contracts
- Function/struct docs: Public APIs must have doc comments
- Use concrete examples in documentation
- Keep docs concise but complete
### Code Organization
- CRITICAL: Clean up as you go - no orphaned files or dead code
- Remove commented-out code blocks (use git history instead)
- Delete unused imports, functions, and variables immediately
- Consolidate duplicate code into reusable functions
- Move TODO items from loose files into:
- Code comments with `// TODO(context):` for implementation tasks
- GitHub Issues for larger features
- NEVER create standalone .md files for tracking
### When Making Changes
For EVERY significant change, complete ALL of these:
1. Write/update tests
2. Update documentation (README, CHANGELOG, code comments)
3. Update build scripts if dependencies/build process changed
4. Remove any temporary/debug code added during development
5. Delete unused files created during exploration
6. Verify no dead code remains (unused functions, imports, variables)
### Cleanup Checklist (Run BEFORE committing)
- [ ] Removed all debug print statements
- [ ] Deleted temporary test files
- [ ] Removed commented-out code
- [ ] Cleaned up unused imports
- [ ] Deleted exploratory/spike code
- [ ] Consolidated duplicate logic
- [ ] Removed obsolete scripts/configs
### Communication Style
- Report what you've done: "Added feature X with tests in test/x_test.go"
- Highlight what needs attention: "WARNING: Manual testing needed for edge case Y"
- Ask questions directly: "Should we support Z? Trade-offs are..."
- NEVER say "I'll track this in a markdown file" - use code comments or tell me directly
### Script/Build System Updates
- Update Makefile/build.zig when adding new targets or commands
- Modify CI/CD configs (.github/workflows) if build/test process changes
- Update package.json/Cargo.toml/go.mod when dependencies change
- Document new scripts in README under "Development" section
## Anti-Patterns to AVOID
- Creating notes.md, todo.md, tasks.md, ideas.md files
- Leaving commented-out code "for reference"
- Keeping old implementation files with .old or .backup suffixes
- Adding features without tests
- Updating code without updating docs
- Leaving TODO comments without context or assignee
## Preferred Patterns
- Inline TODO comments: `// TODO(user): Add caching layer for better performance`
- Self-documenting code with clear names
- Tests that serve as usage examples
- Incremental, complete commits (code + tests + docs)
- Direct communication about tasks and priorities
## Definition of Done
A task is complete ONLY when:
1. Code is written and working
2. Tests are written and passing
3. Documentation is updated
4. All temporary/dead code is removed
5. Build scripts are updated if needed
6. Changes are committed with clear message


@@ -15,14 +15,22 @@ Simple, secure architecture for ML experiments in your homelab.
graph TB
subgraph "Homelab Stack"
CLI[Zig CLI]
-API[HTTPS API]
+API["API Server (HTTPS + WebSocket)"]
REDIS[Redis Cache]
DB[(SQLite/PostgreSQL)]
FS[Local Storage]
WORKER[Worker Service]
PODMAN[Podman/Docker]
end
CLI --> API
API --> REDIS
API --> DB
API --> FS
WORKER --> API
WORKER --> REDIS
WORKER --> FS
WORKER --> PODMAN
```
## Core Services
@@ -81,7 +89,7 @@ sequenceDiagram
participant Redis
participant Storage
-CLI->>API: HTTPS Request
+CLI->>API: HTTPS + WebSocket request
API->>API: Validate Auth
API->>Redis: Cache/Queue
API->>Storage: Experiment Data
@@ -107,7 +115,7 @@ services:
### Local Setup
```bash
-./setup.sh && ./manage.sh start
+docker-compose -f deployments/docker-compose.dev.yml up -d
```
## Network Architecture
@@ -121,9 +129,11 @@ services:
```
data/
-├── experiments/ # ML experiment results
-├── cache/ # Temporary cache files
-└── backups/ # Local backups
+├── experiments/ # Experiment definitions, run manifests, and artifacts
+├── tracking/ # Tracking tool state (e.g., MLflow/TensorBoard), when enabled
+├── .prewarm/ # Best-effort prewarm staging (snapshots/env/datasets), when enabled
+├── cache/ # Temporary caches (best-effort)
+└── backups/ # Local backups
logs/
├── app.log # Application logs
@@ -136,7 +146,7 @@ logs/
Simple, lightweight monitoring:
- **Health Checks**: Service availability
- **Log Files**: Structured logging
-- **Basic Metrics**: Request counts, error rates
+- **Prometheus Metrics**: Worker and API metrics (including prewarm hit/miss/timing)
- **Security Events**: Failed auth, rate limits
## Homelab Benefits
@@ -155,7 +165,7 @@ graph TB
subgraph "Client Layer"
CLI[CLI Tools]
TUI[Terminal UI]
-API[REST API]
+API[WebSocket API]
end
subgraph "Authentication Layer"
@@ -200,6 +210,41 @@ graph TB
Podman --> Containers
```
## Tracking & Plugin System
fetch_ml includes an optional tracking plugin system that can provision sidecar tools and/or pass environment variables through to task containers for common research tracking stacks.
### Tracking modes
Tracking tools support the following modes:
- `sidecar`: provision a local sidecar container per task (best-effort).
- `remote`: point to an externally managed instance (no local provisioning).
- `disabled`: disable the tool entirely.
### How it works
- The worker maintains a tracking registry and provisions tools during task startup.
- Provisioned plugins return environment variables that are injected into the task container.
- Some plugins also require host paths (e.g., TensorBoard log directory); these are mounted into the task container and sanitized to avoid leaking host paths.
### Built-in plugins
The worker ships with built-in plugins:
- `mlflow`: can run an MLflow server as a sidecar or use a remote `MLFLOW_TRACKING_URI`.
- `tensorboard`: runs a TensorBoard sidecar and mounts a per-job log directory.
- `wandb`: does not provision a sidecar; it forwards configuration via environment variables.
### Configuration
Plugins can be configured via worker configuration under `plugins`, including:
- `enabled`
- `image`
- `mode`
- per-plugin paths/settings (e.g., artifact base path, log base path)
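As an illustrative sketch, a worker `plugins` section might look like the following. Only `enabled`, `image`, and `mode` are documented above; every other key and value here is an assumption, not the authoritative schema.

```yaml
# Hypothetical worker configuration sketch; key names beyond
# `enabled`, `image`, and `mode` are illustrative assumptions.
plugins:
  mlflow:
    enabled: true
    mode: sidecar              # sidecar | remote | disabled
    image: "ghcr.io/mlflow/mlflow:latest"
    artifact_base_path: "data/tracking/mlflow"
  tensorboard:
    enabled: true
    mode: sidecar
    log_base_path: "data/tracking/tensorboard"
  wandb:
    enabled: true
    mode: remote               # env-var passthrough only; no sidecar
```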
## Zig CLI Architecture
### Component Structure
@@ -733,6 +778,91 @@ graph TB
- **Security**: Built-in authentication and encryption
- **Monitoring**: Basic health checks and logging
## Roadmap (Research-First, Workstation-First)
fetch_ml is a research-first ML experiment runner with production-grade discipline.
### Guiding principles
- **Reproducibility over speed**: optimizations must never change experimental semantics.
- **Explicit over magic**: every run should be explainable from manifests, configs, and logs.
- **Best-effort optimizations**: prewarming/caching must be optional and must not be required for correctness.
- **Workstation-first**: prioritize single-node reliability, observability, and fast iteration; avoid HPC-specific complexity.
### Where we are now
- **Run provenance**: `run_manifest.json` exists and is readable via `ml info <path|id>`.
- **Validation**: `ml validate <commit_id>` and `ml validate --task <task_id>` exist; task validation includes run-manifest lifecycle/provenance checks.
- **Prewarming (Phase 1, best-effort)**:
- Next-task prewarm loop stages snapshots under `base/.prewarm/snapshots/<task_id>`.
- Best-effort dataset prefetch with a TTL cache.
- Warmed container image infrastructure exists (images keyed by `deps_manifest_sha256`).
- Prewarm status is surfaced in `ml status --json` under the `prewarm` field.
### Phase 0: Trust and usability (highest priority)
#### 1) Make `ml status` excellent (human output)
- Show a compact summary of:
- queued/running/completed/failed counts
- a short list of most relevant tasks
- **prewarm state** (worker id, target task id, phase, dataset count, age)
- Preserve `--json` output as stable API for scripting.
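As a purely hypothetical sketch of the `prewarm` field in `ml status --json` (the exact key names are assumptions; the fields mirror the prewarm state list above):

```json
{
  "prewarm": {
    "worker_id": "worker-1",
    "target_task_id": "task-abc123",
    "phase": "staging",
    "dataset_count": 2,
    "age_seconds": 14
  }
}
```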
#### 2) Add a dry-run preview command (`ml explain`)
- Print the resolved execution plan before running:
- commit id, overall experiment manifest sha
- dependency manifest name + sha
- snapshot id + expected sha (when applicable)
- dataset identities + checksums (when applicable)
- requested resources (cpu/mem/gpu)
- candidate runtime image (base vs warmed tag)
- Enforce a strict preflight by default:
- Queue-time blocking (do not enqueue tasks that fail reproducibility requirements).
- The strict preflight should be shared by `ml queue` and `ml explain`.
- Record the resolved plan into task metadata for traceability:
- `repro_policy: strict`
- `trust_level: <L0..L4>` (simple trust ladder)
- `plan_sha256: <sha256>` (digest of the resolved execution plan)
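A minimal sketch of how `plan_sha256` could be computed: marshal the resolved plan into a stable byte encoding and hash it, so identical plans always yield identical digests. The struct and field set here are illustrative assumptions, not the project's actual plan schema.

```go
// Hypothetical sketch: compute a deterministic digest of a resolved
// execution plan. Struct fields are illustrative, not authoritative.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

type ExecutionPlan struct {
	CommitID   string `json:"commit_id"`
	DepsSHA    string `json:"deps_manifest_sha256"`
	SnapshotID string `json:"snapshot_id,omitempty"`
	Image      string `json:"image"`
}

func planSHA256(p ExecutionPlan) (string, error) {
	// encoding/json marshals struct fields in declaration order,
	// giving a stable byte stream to hash.
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	p := ExecutionPlan{CommitID: "abc123", DepsSHA: "deadbeef", Image: "base:latest"}
	d, _ := planSHA256(p)
	fmt.Println(d) // 64 hex characters; deterministic for identical plans
}
```

The key design point is canonical encoding: any nondeterminism in serialization (e.g., map key order) would break digest stability, so the plan should be encoded from ordered struct fields.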
#### 3) Tighten run manifest completeness
- For `running`: require `started_at`.
- For `completed/failed`: require `started_at`, `ended_at`, and `exit_code`.
- When snapshots/datasets are used: ensure manifest records the relevant identifiers and digests.
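The completeness rules above can be sketched as a small validator. Field and state names follow the text; the struct itself is an assumption about how the run manifest is represented.

```go
// Hypothetical sketch of run-manifest completeness checks:
// "running" requires started_at; "completed"/"failed" also
// require ended_at and exit_code.
package main

import (
	"errors"
	"fmt"
)

type RunManifest struct {
	State     string
	StartedAt string
	EndedAt   string
	ExitCode  *int // pointer so exit code 0 is distinguishable from "missing"
}

func validateManifest(m RunManifest) error {
	switch m.State {
	case "running":
		if m.StartedAt == "" {
			return errors.New("running: started_at is required")
		}
	case "completed", "failed":
		if m.StartedAt == "" || m.EndedAt == "" || m.ExitCode == nil {
			return errors.New(m.State + ": started_at, ended_at, and exit_code are required")
		}
	}
	return nil
}

func main() {
	code := 0
	ok := RunManifest{State: "completed", StartedAt: "t0", EndedAt: "t1", ExitCode: &code}
	fmt.Println(validateManifest(ok))
	fmt.Println(validateManifest(RunManifest{State: "running"}))
}
```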
#### 4) Dataset identity (minimal but research-grade)
- Prefer structured `dataset_specs` (name + checksum) as the authoritative input.
- Treat missing checksum as an error by default (strict-by-default).
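A hypothetical `dataset_specs` fragment illustrating the strict-by-default policy (dataset names and the exact key layout are assumptions):

```yaml
# Hypothetical experiment-config fragment; names are illustrative.
dataset_specs:
  - name: cifar10-train
    checksum: "sha256:<digest>"   # omitting the checksum is an error under strict mode
  - name: cifar10-val
    checksum: "sha256:<digest>"
```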
### Phase 1: Simple performance wins (only after Phase 0 feels solid)
- Keep prewarming single-level (next task only).
- Improve observability first (status output + metrics), then expand capabilities.
### Phase 2+: Research workflows
- `ml compare <runA> <runB>`: manifest-driven diff of provenance and key parameters.
- `ml reproduce <run-id>`: submit a new task derived from the recorded manifest inputs.
- `ml export <run-id>`: package provenance + artifacts for collaborators/reviewers.
### Phase 3: Infrastructure (only if needed)
- Multi-level prewarming, predictive scheduling, tmpfs caching, dashboards.
- Optional scalable storage backend for team deployments:
- Store run manifests + artifacts in S3-compatible object storage (e.g., MinIO) for durability and multi-worker/Kubernetes setups.
- Keep workstation-first defaults (local filesystem) for simplicity.
- Optional integrations via plugins/exporters (keep core strict and offline-capable):
- Server-side exporters that mirror run metadata, metrics, and artifacts to external systems (e.g., MLflow Tracking, Weights & Biases).
- Prefer lifecycle hooks that consume `run_manifest.json` / artifact manifests over plugins that influence execution semantics.
- Optional Kubernetes deployment path (for teams on scalable infra):
- Publish versioned container images for the backend (API server; optionally worker) and provide reference manifests (Helm/Kustomize).
- Keep the CLI as the primary UX; Kubernetes is an execution/deployment backend, not a UI.
- These are optional and should be driven by measured bottlenecks.
---
This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.


@@ -1,55 +0,0 @@
# Development Setup
Set up your local development environment for Fetch ML.
## Prerequisites
**Container Runtimes:**
- **Docker Compose**: For testing and development only
- **Podman**: For production experiment execution
- Go 1.21+
- Zig 0.11+
- Docker Compose (testing only)
- Redis (or use Docker)
- Git
## Quick Setup
```bash
# Clone repository
git clone https://github.com/jfraeys/fetch_ml.git
cd fetch_ml
# Start dependencies
# See [Quick Start](quick-start.md) to start Redis and Postgres via Docker
# Build all components
make build
# Run tests
# See [Testing Guide](testing.md)
```
## Key Commands
- `make build` - Build all components
- `make dev` - Development build
- Run tests: see the [Testing Guide](testing.md)
- Build the CLI: see the [CLI Reference](cli-reference.md) and [Zig CLI](zig-cli.md)
## Common Issues
- Build fails: `go mod tidy`
- Zig errors: `cd cli && rm -rf zig-out zig-cache`
- Port conflicts: `lsof -i :9101`


@@ -218,7 +218,7 @@ func (w *Worker) executeTask(task *queue.Task) {
## Configuration
-### API Server (`configs/config.yaml`)
+### API Server (`configs/api/dev.yaml`)
```yaml
redis:
@@ -227,15 +227,14 @@ redis:
db: 0
```
-### Worker (`configs/worker-config.yaml`)
+### Worker (`configs/workers/docker.yaml`)
```yaml
-redis:
-  addr: "localhost:6379"
-  password: ""
-  db: 0
-metrics_flush_interval: 500ms
+redis_addr: "localhost:6379"
+redis_password: ""
+redis_db: 0
+metrics_flush_interval: "500ms"
```
## Monitoring