From 3fa4f6ae51ecff10778a578b5502b9bad23f5031 Mon Sep 17 00:00:00 2001
From: Jeremie Fraeys
Date: Mon, 5 Jan 2026 12:41:22 -0500
Subject: [PATCH] docs: update architecture/queue pages and remove legacy development setup

---
 .windsurf/rules/test-new-features.md |  90 -----------------
 docs/src/architecture.md             | 146 +++++++++++++++++++++++++--
 docs/src/development-setup.md        |  55 ----------
 docs/src/queue.md                    |  15 ++-
 4 files changed, 145 insertions(+), 161 deletions(-)
 delete mode 100644 .windsurf/rules/test-new-features.md
 delete mode 100644 docs/src/development-setup.md

diff --git a/.windsurf/rules/test-new-features.md b/.windsurf/rules/test-new-features.md
deleted file mode 100644
index cac25e9..0000000
--- a/.windsurf/rules/test-new-features.md
+++ /dev/null
@@ -1,90 +0,0 @@
----
-trigger: model_decision
-description: When a new feature is added, this prompt needs to be run
----
-
-# Development Guidelines
-
-## Code Quality Standards
-
-### Testing Requirements
-- MANDATORY: Every new feature MUST include corresponding tests
-- Write tests BEFORE implementing complex features (TDD approach)
-- Test coverage for new code should be >80%
-- Include both unit tests and integration tests where applicable
-- Test edge cases, error paths, and boundary conditions
-
-### Documentation Standards
-- Update relevant documentation IN THE SAME COMMIT as code changes
-- Documentation locations:
-  - README.md: User-facing features, installation, quick start
-  - CHANGELOG.md: All changes, following Keep a Changelog format
-  - Code comments: Complex logic, non-obvious decisions, API contracts
-  - Function/struct docs: Public APIs must have doc comments
-- Use concrete examples in documentation
-- Keep docs concise but complete
-
-### Code Organization
-- CRITICAL: Clean up as you go - no orphaned files or dead code
-- Remove commented-out code blocks (use git history instead)
-- Delete unused imports, functions, and variables immediately
-- Consolidate duplicate code into reusable functions
-- Move TODO items from loose files into:
-  - Code comments with `// TODO(context):` for implementation tasks
-  - GitHub Issues for larger features
-  - NEVER create standalone .md files for tracking
-
-### When Making Changes
-For EVERY significant change, complete ALL of these:
-
-1. Write/update tests
-2. Update documentation (README, CHANGELOG, code comments)
-3. Update build scripts if dependencies/build process changed
-4. Remove any temporary/debug code added during development
-5. Delete unused files created during exploration
-6. Verify no dead code remains (unused functions, imports, variables)
-
-### Cleanup Checklist (Run BEFORE committing)
-- [ ] Removed all debug print statements
-- [ ] Deleted temporary test files
-- [ ] Removed commented-out code
-- [ ] Cleaned up unused imports
-- [ ] Deleted exploratory/spike code
-- [ ] Consolidated duplicate logic
-- [ ] Removed obsolete scripts/configs
-
-### Communication Style
-- Report what you've done: "Added feature X with tests in test/x_test.go"
-- Highlight what needs attention: "WARNING: Manual testing needed for edge case Y"
-- Ask questions directly: "Should we support Z? Trade-offs are..."
-- NEVER say "I'll track this in a markdown file" - use code comments or tell me directly
-
-### Script/Build System Updates
-- Update Makefile/build.zig when adding new targets or commands
-- Modify CI/CD configs (.github/workflows) if build/test process changes
-- Update package.json/Cargo.toml/go.mod when dependencies change
-- Document new scripts in README under "Development" section
-
-## Anti-Patterns to AVOID
-- Creating notes.md, todo.md, tasks.md, ideas.md files
-- Leaving commented-out code "for reference"
-- Keeping old implementation files with .old or .backup suffixes
-- Adding features without tests
-- Updating code without updating docs
-- Leaving TODO comments without context or assignee
-
-## Preferred Patterns
-- Inline TODO comments: `// TODO(user): Add caching layer for better performance`
-- Self-documenting code with clear names
-- Tests that serve as usage examples
-- Incremental, complete commits (code + tests + docs)
-- Direct communication about tasks and priorities
-
-## Definition of Done
-A task is complete ONLY when:
-1. Code is written and working
-2. Tests are written and passing
-3. Documentation is updated
-4. All temporary/dead code is removed
-5. Build scripts are updated if needed
-6. Changes are committed with clear message
\ No newline at end of file
diff --git a/docs/src/architecture.md b/docs/src/architecture.md
index 89b2437..88edd90 100644
--- a/docs/src/architecture.md
+++ b/docs/src/architecture.md
@@ -15,14 +15,22 @@ Simple, secure architecture for ML experiments in your homelab.
 graph TB
     subgraph "Homelab Stack"
        CLI[Zig CLI]
-       API[HTTPS API]
+       API["API Server (HTTPS + WebSocket)"]
        REDIS[Redis Cache]
+       DB[(SQLite/PostgreSQL)]
        FS[Local Storage]
+       WORKER[Worker Service]
+       PODMAN[Podman/Docker]
     end

     CLI --> API
     API --> REDIS
+    API --> DB
     API --> FS
+    WORKER --> API
+    WORKER --> REDIS
+    WORKER --> FS
+    WORKER --> PODMAN
 ```

 ## Core Services
@@ -81,7 +89,7 @@ sequenceDiagram
     participant Redis
     participant Storage

-    CLI->>API: HTTPS Request
+    CLI->>API: HTTPS + WebSocket request
     API->>API: Validate Auth
     API->>Redis: Cache/Queue
     API->>Storage: Experiment Data
@@ -107,7 +115,7 @@ services:
 ### Local Setup

 ```bash
-./setup.sh && ./manage.sh start
+docker-compose -f deployments/docker-compose.dev.yml up -d
 ```

 ## Network Architecture
@@ -121,9 +129,11 @@ services:
 ```
 data/
-├── experiments/   # ML experiment results
-├── cache/         # Temporary cache files
-└── backups/       # Local backups
+├── experiments/   # Experiment definitions, run manifests, and artifacts
+├── tracking/      # Tracking tool state (e.g., MLflow/TensorBoard), when enabled
+├── .prewarm/      # Best-effort prewarm staging (snapshots/env/datasets), when enabled
+├── cache/         # Temporary caches (best-effort)
+└── backups/       # Local backups

 logs/
 ├── app.log        # Application logs
@@ -136,7 +146,7 @@ logs/
 Simple, lightweight monitoring:
 - **Health Checks**: Service availability
 - **Log Files**: Structured logging
-- **Basic Metrics**: Request counts, error rates
+- **Prometheus Metrics**: Worker and API metrics (including prewarm hit/miss/timing)
 - **Security Events**: Failed auth, rate limits

 ## Homelab Benefits
@@ -155,7 +165,7 @@ graph TB
     subgraph "Client Layer"
        CLI[CLI Tools]
        TUI[Terminal UI]
-       API[REST API]
+       API[WebSocket API]
     end

     subgraph "Authentication Layer"
@@ -200,6 +210,41 @@ graph TB
     Podman --> Containers
 ```

+## Tracking & Plugin System
+
+fetch_ml includes an optional tracking plugin system that can provision sidecar tools and/or pass through environment variables for common research tracking stacks.
+
+### Tracking modes
+
+Tracking tools support the following modes:
+
+- `sidecar`: provision a local sidecar container per task (best-effort).
+- `remote`: point to an externally managed instance (no local provisioning).
+- `disabled`: disable the tool entirely.
+
+### How it works
+
+- The worker maintains a tracking registry and provisions tools during task startup.
+- Provisioned plugins return environment variables that are injected into the task container.
+- Some plugins also require host paths (e.g., the TensorBoard log directory); these are mounted into the task container and sanitized to avoid leaking host paths.
+
+### Built-in plugins
+
+The worker ships with built-in plugins:
+
+- `mlflow`: can run an MLflow server as a sidecar or use a remote `MLFLOW_TRACKING_URI`.
+- `tensorboard`: runs a TensorBoard sidecar and mounts a per-job log directory.
+- `wandb`: does not provision a sidecar; it forwards configuration via environment variables.
+
+### Configuration
+
+Plugins can be configured via worker configuration under `plugins`, including:
+
+- `enabled`
+- `image`
+- `mode`
+- per-plugin paths/settings (e.g., artifact base path, log base path)
+
 ## Zig CLI Architecture

 ### Component Structure
@@ -733,6 +778,91 @@ graph TB
 - **Security**: Built-in authentication and encryption
 - **Monitoring**: Basic health checks and logging

+## Roadmap (Research-First, Workstation-First)
+
+fetch_ml is a research-first ML experiment runner with production-grade discipline.
+
+### Guiding principles
+
+- **Reproducibility over speed**: optimizations must never change experimental semantics.
+- **Explicit over magic**: every run should be explainable from manifests, configs, and logs.
+- **Best-effort optimizations**: prewarming/caching must be optional and must not be required for correctness.
+- **Workstation-first**: prioritize single-node reliability, observability, and fast iteration; avoid HPC-specific complexity.
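
As a concrete illustration of the plugin configuration described in the Tracking & Plugin System section above, a worker config fragment might look roughly like this. This is a sketch: only `plugins`, `enabled`, `image`, and `mode` are named in the docs; all other keys and values here are assumptions, not the actual schema.

```yaml
plugins:
  mlflow:
    enabled: true
    mode: sidecar                           # sidecar | remote | disabled
    image: ghcr.io/mlflow/mlflow:latest     # hypothetical image reference
    artifact_base_path: /data/tracking/mlflow   # assumed per-plugin setting
  tensorboard:
    enabled: true
    mode: sidecar
    log_base_path: /data/tracking/tensorboard   # assumed per-plugin setting
  wandb:
    enabled: false
    mode: remote                            # env-var passthrough only; no sidecar
```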
+
+### Where we are now
+
+- **Run provenance**: `run_manifest.json` exists and is readable via `ml info <task-id>`.
+- **Validation**: `ml validate <experiment>` and `ml validate --task <task-id>` exist; task validation includes run-manifest lifecycle/provenance checks.
+- **Prewarming (Phase 1, best-effort)**:
+  - Next-task prewarm loop stages snapshots under `base/.prewarm/snapshots/`.
+  - Best-effort dataset prefetch with a TTL cache.
+  - Warmed container image infrastructure exists (images keyed by `deps_manifest_sha256`).
+  - Prewarm status is surfaced in `ml status --json` under the `prewarm` field.
+
+### Phase 0: Trust and usability (highest priority)
+
+#### 1) Make `ml status` excellent (human output)
+
+- Show a compact summary of:
+  - queued/running/completed/failed counts
+  - a short list of the most relevant tasks
+  - **prewarm state** (worker id, target task id, phase, dataset count, age)
+- Preserve `--json` output as a stable API for scripting.
+
+#### 2) Add a dry-run preview command (`ml explain`)
+
+- Print the resolved execution plan before running:
+  - commit id, experiment manifest overall sha
+  - dependency manifest name + sha
+  - snapshot id + expected sha (when applicable)
+  - dataset identities + checksums (when applicable)
+  - requested resources (cpu/mem/gpu)
+  - candidate runtime image (base vs warmed tag)
+- Enforce a strict preflight by default:
+  - Queue-time blocking (do not enqueue tasks that fail reproducibility requirements).
+  - The strict preflight should be shared by `ml queue` and `ml explain`.
+- Record the resolved plan into task metadata for traceability:
+  - `repro_policy: strict`
+  - `trust_level: <level>` (simple trust ladder)
+  - `plan_sha256: <digest>` (digest of the resolved execution plan)
+
+#### 3) Tighten run manifest completeness
+
+- For `running`: require `started_at`.
+- For `completed/failed`: require `started_at`, `ended_at`, and `exit_code`.
+- When snapshots/datasets are used: ensure the manifest records the relevant identifiers and digests.
+
+#### 4) Dataset identity (minimal but research-grade)
+
+- Prefer structured `dataset_specs` (name + checksum) as the authoritative input.
+- Treat a missing checksum as an error by default (strict-by-default).
+
+### Phase 1: Simple performance wins (only after Phase 0 feels solid)
+
+- Keep prewarming single-level (next task only).
+- Improve observability first (status output + metrics), then expand capabilities.
+
+### Phase 2+: Research workflows
+
+- `ml compare <run-a> <run-b>`: manifest-driven diff of provenance and key parameters.
+- `ml reproduce <run-id>`: submit a new task derived from the recorded manifest inputs.
+- `ml export <run-id>`: package provenance + artifacts for collaborators/reviewers.
+
+### Phase 3: Infrastructure (only if needed)
+
+- Multi-level prewarming, predictive scheduling, tmpfs caching, dashboards.
+- Optional scalable storage backend for team deployments:
+  - Store run manifests + artifacts in S3-compatible object storage (e.g., MinIO) for durability and multi-worker/Kubernetes setups.
+  - Keep workstation-first defaults (local filesystem) for simplicity.
+- Optional integrations via plugins/exporters (keep core strict and offline-capable):
+  - Server-side exporters that mirror run metadata, metrics, and artifacts to external systems (e.g., MLflow Tracking, Weights & Biases).
+  - Prefer lifecycle hooks that consume `run_manifest.json` / artifact manifests over plugins that influence execution semantics.
+- Optional Kubernetes deployment path (for teams on scalable infra):
+  - Publish versioned container images for the backend (API server; optionally the worker) and provide reference manifests (Helm/Kustomize).
+  - Keep the CLI as the primary UX; Kubernetes is an execution/deployment backend, not a UI.
+- These are optional and should be driven by measured bottlenecks.
+ --- This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity. diff --git a/docs/src/development-setup.md b/docs/src/development-setup.md deleted file mode 100644 index 55bf47f..0000000 --- a/docs/src/development-setup.md +++ /dev/null @@ -1,55 +0,0 @@ -# Development Setup - -Set up your local development environment for Fetch ML. - -## Prerequisites - -**Container Runtimes:** -- **Docker Compose**: For testing and development only -- **Podman**: For production experiment execution - -- Go 1.21+ -- Zig 0.11+ -- Docker Compose (testing only) -- Redis (or use Docker) -- Git - -## Quick Setup - -```bash -# Clone repository -git clone https://github.com/jfraeys/fetch_ml.git -cd fetch_ml - -# Start dependencies -see [Quick Start](quick-start.md) for Docker setup redis postgres - -# Build all components -make build - -# Run tests -see [Testing Guide](testing.md) -``` - -## Detailed Setup - - -## Quick Start -```bash -git clone https://github.com/jfraeys/fetch_ml.git -cd fetch_ml -see [Quick Start](quick-start.md) for Docker setup -make build -see [Testing Guide](testing.md) -``` - -## Key Commands -- `make build` - Build all components -- `see [Testing Guide](testing.md)` - Run tests -- `make dev` - Development build -- `see [CLI Reference](cli-reference.md) and [Zig CLI](zig-cli.md)` - Build CLI - -## Common Issues -- Build fails: `go mod tidy` -- Zig errors: `cd cli && rm -rf zig-out zig-cache` -- Port conflicts: `lsof -i :9101` diff --git a/docs/src/queue.md b/docs/src/queue.md index 1697467..08e9f37 100644 --- a/docs/src/queue.md +++ b/docs/src/queue.md @@ -218,7 +218,7 @@ func (w *Worker) executeTask(task *queue.Task) { ## Configuration -### API Server (`configs/config.yaml`) +### API Server (`configs/api/dev.yaml`) ```yaml redis: @@ -227,15 +227,14 @@ redis: db: 0 ``` -### Worker (`configs/worker-config.yaml`) +### Worker (`configs/workers/docker.yaml`) ```yaml -redis: - 
addr: "localhost:6379" - password: "" - db: 0 - -metrics_flush_interval: 500ms +redis_addr: "localhost:6379" +redis_password: "" +redis_db: 0 + +metrics_flush_interval: "500ms" ``` ## Monitoring