docs: update architecture/queue pages and remove legacy development setup

This commit is contained in:
Jeremie Fraeys 2026-01-05 12:41:22 -05:00
parent f6e506a632
commit 3fa4f6ae51
4 changed files with 145 additions and 161 deletions


@@ -1,90 +0,0 @@
---
trigger: model_decision
description: When a new feature is added, this prompt needs to be run
---
# Development Guidelines
## Code Quality Standards
### Testing Requirements
- MANDATORY: Every new feature MUST include corresponding tests
- Write tests BEFORE implementing complex features (TDD approach)
- Test coverage for new code should be >80%
- Include both unit tests and integration tests where applicable
- Test edge cases, error paths, and boundary conditions
### Documentation Standards
- Update relevant documentation IN THE SAME COMMIT as code changes
- Documentation locations:
- README.md: User-facing features, installation, quick start
- CHANGELOG.md: All changes, following Keep a Changelog format
- Code comments: Complex logic, non-obvious decisions, API contracts
- Function/struct docs: Public APIs must have doc comments
- Use concrete examples in documentation
- Keep docs concise but complete
### Code Organization
- CRITICAL: Clean up as you go - no orphaned files or dead code
- Remove commented-out code blocks (use git history instead)
- Delete unused imports, functions, and variables immediately
- Consolidate duplicate code into reusable functions
- Move TODO items from loose files into:
- Code comments with `// TODO(context):` for implementation tasks
- GitHub Issues for larger features
- NEVER create standalone .md files for tracking
### When Making Changes
For EVERY significant change, complete ALL of these:
1. Write/update tests
2. Update documentation (README, CHANGELOG, code comments)
3. Update build scripts if dependencies/build process changed
4. Remove any temporary/debug code added during development
5. Delete unused files created during exploration
6. Verify no dead code remains (unused functions, imports, variables)
### Cleanup Checklist (Run BEFORE committing)
- [ ] Removed all debug print statements
- [ ] Deleted temporary test files
- [ ] Removed commented-out code
- [ ] Cleaned up unused imports
- [ ] Deleted exploratory/spike code
- [ ] Consolidated duplicate logic
- [ ] Removed obsolete scripts/configs
### Communication Style
- Report what you've done: "Added feature X with tests in test/x_test.go"
- Highlight what needs attention: "WARNING: Manual testing needed for edge case Y"
- Ask questions directly: "Should we support Z? Trade-offs are..."
- NEVER say "I'll track this in a markdown file" - use code comments or tell me directly
### Script/Build System Updates
- Update Makefile/build.zig when adding new targets or commands
- Modify CI/CD configs (.github/workflows) if build/test process changes
- Update package.json/Cargo.toml/go.mod when dependencies change
- Document new scripts in README under "Development" section
## Anti-Patterns to AVOID
- Creating notes.md, todo.md, tasks.md, ideas.md files
- Leaving commented-out code "for reference"
- Keeping old implementation files with .old or .backup suffixes
- Adding features without tests
- Updating code without updating docs
- Leaving TODO comments without context or assignee
## Preferred Patterns
- Inline TODO comments: `// TODO(user): Add caching layer for better performance`
- Self-documenting code with clear names
- Tests that serve as usage examples
- Incremental, complete commits (code + tests + docs)
- Direct communication about tasks and priorities
## Definition of Done
A task is complete ONLY when:
1. Code is written and working
2. Tests are written and passing
3. Documentation is updated
4. All temporary/dead code is removed
5. Build scripts are updated if needed
6. Changes are committed with clear message


@@ -15,14 +15,22 @@ Simple, secure architecture for ML experiments in your homelab.
graph TB
subgraph "Homelab Stack"
CLI[Zig CLI]
-API[HTTPS API]
+API["API Server (HTTPS + WebSocket)"]
REDIS[Redis Cache]
DB[(SQLite/PostgreSQL)]
FS[Local Storage]
WORKER[Worker Service]
PODMAN[Podman/Docker]
end
CLI --> API
API --> REDIS
API --> DB
API --> FS
WORKER --> API
WORKER --> REDIS
WORKER --> FS
WORKER --> PODMAN
```
## Core Services
@@ -81,7 +89,7 @@ sequenceDiagram
participant Redis
participant Storage
-CLI->>API: HTTPS Request
+CLI->>API: HTTPS + WebSocket request
API->>API: Validate Auth
API->>Redis: Cache/Queue
API->>Storage: Experiment Data
@@ -107,7 +115,7 @@ services:
### Local Setup
```bash
-./setup.sh && ./manage.sh start
+docker-compose -f deployments/docker-compose.dev.yml up -d
```
## Network Architecture
@@ -121,9 +129,11 @@ services:
```
data/
-├── experiments/ # ML experiment results
-├── cache/ # Temporary cache files
-└── backups/ # Local backups
+├── experiments/ # Experiment definitions, run manifests, and artifacts
+├── tracking/ # Tracking tool state (e.g., MLflow/TensorBoard), when enabled
+├── .prewarm/ # Best-effort prewarm staging (snapshots/env/datasets), when enabled
+├── cache/ # Temporary caches (best-effort)
+└── backups/ # Local backups
logs/
├── app.log # Application logs
@@ -136,7 +146,7 @@ logs/
Simple, lightweight monitoring:
- **Health Checks**: Service availability
- **Log Files**: Structured logging
-- **Basic Metrics**: Request counts, error rates
+- **Prometheus Metrics**: Worker and API metrics (including prewarm hit/miss/timing)
- **Security Events**: Failed auth, rate limits
## Homelab Benefits
@@ -155,7 +165,7 @@ graph TB
subgraph "Client Layer"
CLI[CLI Tools]
TUI[Terminal UI]
-API[REST API]
+API[WebSocket API]
end
subgraph "Authentication Layer"
@@ -200,6 +210,41 @@ graph TB
Podman --> Containers
```
## Tracking & Plugin System
fetch_ml includes an optional tracking plugin system that can provision sidecar tools and/or pass environment variables through to task containers for common research tracking stacks.
### Tracking modes
Tracking tools support the following modes:
- `sidecar`: provision a local sidecar container per task (best-effort).
- `remote`: point to an externally managed instance (no local provisioning).
- `disabled`: disable the tool entirely.
### How it works
- The worker maintains a tracking registry and provisions tools during task startup.
- Provisioned plugins return environment variables that are injected into the task container.
- Some plugins also require host paths (e.g., TensorBoard log directory); these are mounted into the task container and sanitized to avoid leaking host paths.
### Built-in plugins
The worker ships with built-in plugins:
- `mlflow`: can run an MLflow server as a sidecar or use a remote `MLFLOW_TRACKING_URI`.
- `tensorboard`: runs a TensorBoard sidecar and mounts a per-job log directory.
- `wandb`: does not provision a sidecar; it forwards configuration via environment variables.
### Configuration
Plugins can be configured via worker configuration under `plugins`, including:
- `enabled`
- `image`
- `mode`
- per-plugin paths/settings (e.g., artifact base path, log base path)
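As an illustrative sketch, a worker `plugins` section might look like the following. Only `enabled`, `image`, and `mode` are documented above; every other key and value here is an assumption, not the authoritative schema.

```yaml
# Hypothetical worker configuration sketch; key names beyond
# `enabled`, `image`, and `mode` are illustrative assumptions.
plugins:
  mlflow:
    enabled: true
    mode: sidecar              # sidecar | remote | disabled
    image: "ghcr.io/mlflow/mlflow:latest"
    artifact_base_path: "data/tracking/mlflow"
  tensorboard:
    enabled: true
    mode: sidecar
    log_base_path: "data/tracking/tensorboard"
  wandb:
    enabled: true
    mode: remote               # env-var passthrough only; no sidecar
```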
## Zig CLI Architecture
### Component Structure
@@ -733,6 +778,91 @@ graph TB
- **Security**: Built-in authentication and encryption
- **Monitoring**: Basic health checks and logging
## Roadmap (Research-First, Workstation-First)
fetch_ml is a research-first ML experiment runner with production-grade discipline.
### Guiding principles
- **Reproducibility over speed**: optimizations must never change experimental semantics.
- **Explicit over magic**: every run should be explainable from manifests, configs, and logs.
- **Best-effort optimizations**: prewarming/caching must be optional and must not be required for correctness.
- **Workstation-first**: prioritize single-node reliability, observability, and fast iteration; avoid HPC-specific complexity.
### Where we are now
- **Run provenance**: `run_manifest.json` exists and is readable via `ml info <path|id>`.
- **Validation**: `ml validate <commit_id>` and `ml validate --task <task_id>` exist; task validation includes run-manifest lifecycle/provenance checks.
- **Prewarming (Phase 1, best-effort)**:
- Next-task prewarm loop stages snapshots under `base/.prewarm/snapshots/<task_id>`.
- Best-effort dataset prefetch with a TTL cache.
- Warmed container image infrastructure exists (images keyed by `deps_manifest_sha256`).
- Prewarm status is surfaced in `ml status --json` under the `prewarm` field.
### Phase 0: Trust and usability (highest priority)
#### 1) Make `ml status` excellent (human output)
- Show a compact summary of:
- queued/running/completed/failed counts
- a short list of most relevant tasks
- **prewarm state** (worker id, target task id, phase, dataset count, age)
- Preserve `--json` output as stable API for scripting.
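As a purely hypothetical sketch of the `prewarm` field in `ml status --json` (the exact key names are assumptions; the fields mirror the prewarm state list above):

```json
{
  "prewarm": {
    "worker_id": "worker-1",
    "target_task_id": "task-abc123",
    "phase": "staging",
    "dataset_count": 2,
    "age_seconds": 14
  }
}
```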
#### 2) Add a dry-run preview command (`ml explain`)
- Print the resolved execution plan before running:
- commit id, overall experiment manifest sha
- dependency manifest name + sha
- snapshot id + expected sha (when applicable)
- dataset identities + checksums (when applicable)
- requested resources (cpu/mem/gpu)
- candidate runtime image (base vs warmed tag)
- Enforce a strict preflight by default:
- Queue-time blocking (do not enqueue tasks that fail reproducibility requirements).
- The strict preflight should be shared by `ml queue` and `ml explain`.
- Record the resolved plan into task metadata for traceability:
- `repro_policy: strict`
- `trust_level: <L0..L4>` (simple trust ladder)
- `plan_sha256: <sha256>` (digest of the resolved execution plan)
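A minimal sketch of how `plan_sha256` could be computed: marshal the resolved plan into a stable byte encoding and hash it, so identical plans always yield identical digests. The struct and field set here are illustrative assumptions, not the project's actual plan schema.

```go
// Hypothetical sketch: compute a deterministic digest of a resolved
// execution plan. Struct fields are illustrative, not authoritative.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

type ExecutionPlan struct {
	CommitID   string `json:"commit_id"`
	DepsSHA    string `json:"deps_manifest_sha256"`
	SnapshotID string `json:"snapshot_id,omitempty"`
	Image      string `json:"image"`
}

func planSHA256(p ExecutionPlan) (string, error) {
	// encoding/json marshals struct fields in declaration order,
	// giving a stable byte stream to hash.
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	p := ExecutionPlan{CommitID: "abc123", DepsSHA: "deadbeef", Image: "base:latest"}
	d, _ := planSHA256(p)
	fmt.Println(d) // 64 hex characters; deterministic for identical plans
}
```

The key design point is canonical encoding: any nondeterminism in serialization (e.g., map key order) would break digest stability, so the plan should be encoded from ordered struct fields.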
#### 3) Tighten run manifest completeness
- For `running`: require `started_at`.
- For `completed/failed`: require `started_at`, `ended_at`, and `exit_code`.
- When snapshots/datasets are used: ensure manifest records the relevant identifiers and digests.
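The completeness rules above can be sketched as a small validator. Field and state names follow the text; the struct itself is an assumption about how the run manifest is represented.

```go
// Hypothetical sketch of run-manifest completeness checks:
// "running" requires started_at; "completed"/"failed" also
// require ended_at and exit_code.
package main

import (
	"errors"
	"fmt"
)

type RunManifest struct {
	State     string
	StartedAt string
	EndedAt   string
	ExitCode  *int // pointer so exit code 0 is distinguishable from "missing"
}

func validateManifest(m RunManifest) error {
	switch m.State {
	case "running":
		if m.StartedAt == "" {
			return errors.New("running: started_at is required")
		}
	case "completed", "failed":
		if m.StartedAt == "" || m.EndedAt == "" || m.ExitCode == nil {
			return errors.New(m.State + ": started_at, ended_at, and exit_code are required")
		}
	}
	return nil
}

func main() {
	code := 0
	ok := RunManifest{State: "completed", StartedAt: "t0", EndedAt: "t1", ExitCode: &code}
	fmt.Println(validateManifest(ok))
	fmt.Println(validateManifest(RunManifest{State: "running"}))
}
```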
#### 4) Dataset identity (minimal but research-grade)
- Prefer structured `dataset_specs` (name + checksum) as the authoritative input.
- Treat missing checksum as an error by default (strict-by-default).
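A hypothetical `dataset_specs` fragment illustrating the strict-by-default policy (dataset names and the exact key layout are assumptions):

```yaml
# Hypothetical experiment-config fragment; names are illustrative.
dataset_specs:
  - name: cifar10-train
    checksum: "sha256:<digest>"   # omitting the checksum is an error under strict mode
  - name: cifar10-val
    checksum: "sha256:<digest>"
```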
### Phase 1: Simple performance wins (only after Phase 0 feels solid)
- Keep prewarming single-level (next task only).
- Improve observability first (status output + metrics), then expand capabilities.
### Phase 2+: Research workflows
- `ml compare <runA> <runB>`: manifest-driven diff of provenance and key parameters.
- `ml reproduce <run-id>`: submit a new task derived from the recorded manifest inputs.
- `ml export <run-id>`: package provenance + artifacts for collaborators/reviewers.
### Phase 3: Infrastructure (only if needed)
- Multi-level prewarming, predictive scheduling, tmpfs caching, dashboards.
- Optional scalable storage backend for team deployments:
- Store run manifests + artifacts in S3-compatible object storage (e.g., MinIO) for durability and multi-worker/Kubernetes setups.
- Keep workstation-first defaults (local filesystem) for simplicity.
- Optional integrations via plugins/exporters (keep core strict and offline-capable):
- Server-side exporters that mirror run metadata, metrics, and artifacts to external systems (e.g., MLflow Tracking, Weights & Biases).
- Prefer lifecycle hooks that consume `run_manifest.json` / artifact manifests over plugins that influence execution semantics.
- Optional Kubernetes deployment path (for teams on scalable infra):
- Publish versioned container images for the backend (API server; optionally worker) and provide reference manifests (Helm/Kustomize).
- Keep the CLI as the primary UX; Kubernetes is an execution/deployment backend, not a UI.
- These are optional and should be driven by measured bottlenecks.
---
This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.


@@ -1,55 +0,0 @@
# Development Setup
Set up your local development environment for Fetch ML.
## Prerequisites
**Container Runtimes:**
- **Docker Compose**: For testing and development only
- **Podman**: For production experiment execution
- Go 1.21+
- Zig 0.11+
- Docker Compose (testing only)
- Redis (or use Docker)
- Git
## Quick Setup
```bash
# Clone repository
git clone https://github.com/jfraeys/fetch_ml.git
cd fetch_ml
# Start dependencies
# See [Quick Start](quick-start.md) to start Redis and Postgres via Docker
# Build all components
make build
# Run tests
# See [Testing Guide](testing.md)
```
## Key Commands
- `make build` - Build all components
- `make dev` - Development build
- Run tests: see the [Testing Guide](testing.md)
- Build the CLI: see the [CLI Reference](cli-reference.md) and [Zig CLI](zig-cli.md)
## Common Issues
- Build fails: `go mod tidy`
- Zig errors: `cd cli && rm -rf zig-out zig-cache`
- Port conflicts: `lsof -i :9101`


@@ -218,7 +218,7 @@ func (w *Worker) executeTask(task *queue.Task) {
## Configuration
-### API Server (`configs/config.yaml`)
+### API Server (`configs/api/dev.yaml`)
```yaml
redis:
@@ -227,15 +227,14 @@ redis:
db: 0
```
-### Worker (`configs/worker-config.yaml`)
+### Worker (`configs/workers/docker.yaml`)
```yaml
-redis:
-  addr: "localhost:6379"
-  password: ""
-  db: 0
-metrics_flush_interval: 500ms
+redis_addr: "localhost:6379"
+redis_password: ""
+redis_db: 0
+metrics_flush_interval: "500ms"
```
## Monitoring