docs: update architecture/queue pages and remove legacy development setup
parent f6e506a632
commit 3fa4f6ae51
4 changed files with 145 additions and 161 deletions
@@ -1,90 +0,0 @@
---
trigger: model_decision
description: When a new feature is added, this prompt needs to be run
---

# Development Guidelines

## Code Quality Standards

### Testing Requirements
- MANDATORY: Every new feature MUST include corresponding tests
- Write tests BEFORE implementing complex features (TDD approach)
- Test coverage for new code should be >80%
- Include both unit tests and integration tests where applicable
- Test edge cases, error paths, and boundary conditions

### Documentation Standards
- Update relevant documentation IN THE SAME COMMIT as code changes
- Documentation locations:
  - README.md: User-facing features, installation, quick start
  - CHANGELOG.md: All changes, following Keep a Changelog format
  - Code comments: Complex logic, non-obvious decisions, API contracts
  - Function/struct docs: Public APIs must have doc comments
- Use concrete examples in documentation
- Keep docs concise but complete

### Code Organization
- CRITICAL: Clean up as you go - no orphaned files or dead code
- Remove commented-out code blocks (use git history instead)
- Delete unused imports, functions, and variables immediately
- Consolidate duplicate code into reusable functions
- Move TODO items from loose files into:
  - Code comments with `// TODO(context):` for implementation tasks
  - GitHub Issues for larger features
- NEVER create standalone .md files for tracking

### When Making Changes
For EVERY significant change, complete ALL of these:

1. Write/update tests
2. Update documentation (README, CHANGELOG, code comments)
3. Update build scripts if dependencies/build process changed
4. Remove any temporary/debug code added during development
5. Delete unused files created during exploration
6. Verify no dead code remains (unused functions, imports, variables)

### Cleanup Checklist (Run BEFORE committing)
- [ ] Removed all debug print statements
- [ ] Deleted temporary test files
- [ ] Removed commented-out code
- [ ] Cleaned up unused imports
- [ ] Deleted exploratory/spike code
- [ ] Consolidated duplicate logic
- [ ] Removed obsolete scripts/configs

### Communication Style
- Report what you've done: "Added feature X with tests in test/x_test.go"
- Highlight what needs attention: "WARNING: Manual testing needed for edge case Y"
- Ask questions directly: "Should we support Z? Trade-offs are..."
- NEVER say "I'll track this in a markdown file" - use code comments or tell me directly

### Script/Build System Updates
- Update Makefile/build.zig when adding new targets or commands
- Modify CI/CD configs (.github/workflows) if build/test process changes
- Update package.json/Cargo.toml/go.mod when dependencies change
- Document new scripts in README under "Development" section

## Anti-Patterns to AVOID
- Creating notes.md, todo.md, tasks.md, ideas.md files
- Leaving commented-out code "for reference"
- Keeping old implementation files with .old or .backup suffixes
- Adding features without tests
- Updating code without updating docs
- Leaving TODO comments without context or assignee

## Preferred Patterns
- Inline TODO comments: `// TODO(user): Add caching layer for better performance`
- Self-documenting code with clear names
- Tests that serve as usage examples
- Incremental, complete commits (code + tests + docs)
- Direct communication about tasks and priorities

## Definition of Done
A task is complete ONLY when:
1. Code is written and working
2. Tests are written and passing
3. Documentation is updated
4. All temporary/dead code is removed
5. Build scripts are updated if needed
6. Changes are committed with clear message
@@ -15,14 +15,22 @@ Simple, secure architecture for ML experiments in your homelab.
graph TB
    subgraph "Homelab Stack"
        CLI[Zig CLI]
        API[HTTPS API]
        API["API Server (HTTPS + WebSocket)"]
        REDIS[Redis Cache]
        DB[(SQLite/PostgreSQL)]
        FS[Local Storage]
        WORKER[Worker Service]
        PODMAN[Podman/Docker]
    end

    CLI --> API
    API --> REDIS
    API --> DB
    API --> FS
    WORKER --> API
    WORKER --> REDIS
    WORKER --> FS
    WORKER --> PODMAN
```

## Core Services
@@ -81,7 +89,7 @@ sequenceDiagram
    participant Redis
    participant Storage

    CLI->>API: HTTPS Request
    CLI->>API: HTTPS + WebSocket request
    API->>API: Validate Auth
    API->>Redis: Cache/Queue
    API->>Storage: Experiment Data
@@ -107,7 +115,7 @@ services:

### Local Setup
```bash
./setup.sh && ./manage.sh start
docker-compose -f deployments/docker-compose.dev.yml up -d
```

## Network Architecture
@@ -121,9 +129,11 @@ services:

```
data/
├── experiments/ # ML experiment results
├── cache/       # Temporary cache files
└── backups/     # Local backups
├── experiments/ # Experiment definitions, run manifests, and artifacts
├── tracking/    # Tracking tool state (e.g., MLflow/TensorBoard), when enabled
├── .prewarm/    # Best-effort prewarm staging (snapshots/env/datasets), when enabled
├── cache/       # Temporary caches (best-effort)
└── backups/     # Local backups

logs/
├── app.log      # Application logs
@@ -136,7 +146,7 @@ logs/
Simple, lightweight monitoring:
- **Health Checks**: Service availability
- **Log Files**: Structured logging
- **Basic Metrics**: Request counts, error rates
- **Prometheus Metrics**: Worker and API metrics (including prewarm hit/miss/timing)
- **Security Events**: Failed auth, rate limits

## Homelab Benefits
@@ -155,7 +165,7 @@ graph TB
    subgraph "Client Layer"
        CLI[CLI Tools]
        TUI[Terminal UI]
        API[REST API]
        API[WebSocket API]
    end

    subgraph "Authentication Layer"
@@ -200,6 +210,41 @@ graph TB
    Podman --> Containers
```

## Tracking & Plugin System

fetch_ml includes an optional tracking plugin system that can provision sidecar tools and/or pass through environment variables for common research tracking stacks.

### Tracking modes

Tracking tools support the following modes:

- `sidecar`: provision a local sidecar container per task (best-effort).
- `remote`: point to an externally managed instance (no local provisioning).
- `disabled`: disable the tool entirely.

### How it works

- The worker maintains a tracking registry and provisions tools during task startup.
- Provisioned plugins return environment variables that are injected into the task container.
- Some plugins also require host paths (e.g., TensorBoard log directory); these are mounted into the task container and sanitized to avoid leaking host paths.

### Built-in plugins

The worker ships with built-in plugins:

- `mlflow`: can run an MLflow server as a sidecar or use a remote `MLFLOW_TRACKING_URI`.
- `tensorboard`: runs a TensorBoard sidecar and mounts a per-job log directory.
- `wandb`: does not provision a sidecar; it forwards configuration via environment variables.

### Configuration

Plugins can be configured via worker configuration under `plugins`, including:

- `enabled`
- `image`
- `mode`
- per-plugin paths/settings (e.g., artifact base path, log base path)
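As a rough sketch, a worker-config fragment for these plugins might look like the following; only `plugins`, `enabled`, `image`, and `mode` come from the list above, and the image references and per-plugin key names are illustrative assumptions, not the project's actual schema:

```yaml
plugins:
  mlflow:
    enabled: true
    mode: sidecar                              # sidecar | remote | disabled
    image: "mlflow-server:latest"              # hypothetical image reference
    artifact_base_path: data/tracking/mlflow   # hypothetical per-plugin setting
  tensorboard:
    enabled: true
    mode: sidecar
    log_base_path: data/tracking/tensorboard   # hypothetical per-plugin setting
  wandb:
    enabled: false
    mode: remote                               # wandb never provisions a sidecar
```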
## Zig CLI Architecture

### Component Structure
@@ -733,6 +778,91 @@ graph TB
- **Security**: Built-in authentication and encryption
- **Monitoring**: Basic health checks and logging

## Roadmap (Research-First, Workstation-First)

fetch_ml is a research-first ML experiment runner with production-grade discipline.

### Guiding principles

- **Reproducibility over speed**: optimizations must never change experimental semantics.
- **Explicit over magic**: every run should be explainable from manifests, configs, and logs.
- **Best-effort optimizations**: prewarming/caching must be optional and must not be required for correctness.
- **Workstation-first**: prioritize single-node reliability, observability, and fast iteration; avoid HPC-specific complexity.

### Where we are now

- **Run provenance**: `run_manifest.json` exists and is readable via `ml info <path|id>`.
- **Validation**: `ml validate <commit_id>` and `ml validate --task <task_id>` exist; task validation includes run-manifest lifecycle/provenance checks.
- **Prewarming (Phase 1, best-effort)**:
  - Next-task prewarm loop stages snapshots under `base/.prewarm/snapshots/<task_id>`.
  - Best-effort dataset prefetch with a TTL cache.
  - Warmed container image infrastructure exists (images keyed by `deps_manifest_sha256`).
  - Prewarm status is surfaced in `ml status --json` under the `prewarm` field.
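For illustration, the `prewarm` field in `ml status --json` output might carry the state listed under Phase 0 below (worker id, target task id, phase, dataset count, age); the exact key names here are assumptions:

```json
{
  "prewarm": {
    "worker_id": "worker-1",
    "target_task_id": "task-42",
    "phase": "staging",
    "dataset_count": 2,
    "age_seconds": 17
  }
}
```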
### Phase 0: Trust and usability (highest priority)

#### 1) Make `ml status` excellent (human output)

- Show a compact summary of:
  - queued/running/completed/failed counts
  - a short list of the most relevant tasks
  - **prewarm state** (worker id, target task id, phase, dataset count, age)
- Preserve `--json` output as a stable API for scripting.

#### 2) Add a dry-run preview command (`ml explain`)

- Print the resolved execution plan before running:
  - commit id, experiment manifest overall sha
  - dependency manifest name + sha
  - snapshot id + expected sha (when applicable)
  - dataset identities + checksums (when applicable)
  - requested resources (cpu/mem/gpu)
  - candidate runtime image (base vs warmed tag)
- Enforce a strict preflight by default:
  - Queue-time blocking (do not enqueue tasks that fail reproducibility requirements).
  - The strict preflight should be shared by `ml queue` and `ml explain`.
- Record the resolved plan into task metadata for traceability:
  - `repro_policy: strict`
  - `trust_level: <L0..L4>` (simple trust ladder)
  - `plan_sha256: <sha256>` (digest of the resolved execution plan)
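The `plan_sha256` digest could be computed by hashing a canonical encoding of the resolved plan. A minimal Go sketch, assuming a simplified plan struct (the field names are hypothetical, not the project's actual schema):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// executionPlan is a hypothetical shape for the resolved plan recorded at
// queue time; field names are illustrative only.
type executionPlan struct {
	CommitID   string   `json:"commit_id"`
	DepsSHA    string   `json:"deps_manifest_sha256"`
	SnapshotID string   `json:"snapshot_id,omitempty"`
	Datasets   []string `json:"datasets,omitempty"`
	Image      string   `json:"image"`
}

// planSHA256 digests the canonical JSON encoding of the plan. encoding/json
// emits struct fields in declaration order, so identical plans always
// marshal to identical bytes and therefore identical digests.
func planSHA256(p executionPlan) (string, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	p := executionPlan{CommitID: "abc123", DepsSHA: "d0d0", Image: "base:latest"}
	d, _ := planSHA256(p)
	fmt.Println(len(d)) // hex-encoded SHA-256 is always 64 characters
}
```

Because the digest covers every resolved input, any change to commit, dependencies, snapshot, datasets, or image yields a different `plan_sha256`, which is what makes it useful for traceability.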
#### 3) Tighten run manifest completeness

- For `running`: require `started_at`.
- For `completed/failed`: require `started_at`, `ended_at`, and `exit_code`.
- When snapshots/datasets are used: ensure the manifest records the relevant identifiers and digests.
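A complete manifest for a finished run might then look like the following; only `started_at`, `ended_at`, and `exit_code` are required by the checklist above, and the snapshot field names are illustrative assumptions:

```json
{
  "status": "completed",
  "started_at": "2024-05-01T12:00:00Z",
  "ended_at": "2024-05-01T12:34:56Z",
  "exit_code": 0,
  "snapshot_id": "snap-abc",
  "snapshot_sha256": "…"
}
```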
#### 4) Dataset identity (minimal but research-grade)

- Prefer structured `dataset_specs` (name + checksum) as the authoritative input.
- Treat a missing checksum as an error by default (strict-by-default).
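A structured spec carrying both fields might be written like this; the surrounding layout is a sketch and the checksum values are placeholders:

```yaml
dataset_specs:
  - name: cifar10
    checksum: "sha256:…"   # required under strict-by-default; omitting it is an error
  - name: imagenet-subset
    checksum: "sha256:…"
```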
### Phase 1: Simple performance wins (only after Phase 0 feels solid)

- Keep prewarming single-level (next task only).
- Improve observability first (status output + metrics), then expand capabilities.

### Phase 2+: Research workflows

- `ml compare <runA> <runB>`: manifest-driven diff of provenance and key parameters.
- `ml reproduce <run-id>`: submit a new task derived from the recorded manifest inputs.
- `ml export <run-id>`: package provenance + artifacts for collaborators/reviewers.

### Phase 3: Infrastructure (only if needed)

- Multi-level prewarming, predictive scheduling, tmpfs caching, dashboards.
- Optional scalable storage backend for team deployments:
  - Store run manifests + artifacts in S3-compatible object storage (e.g., MinIO) for durability and multi-worker/Kubernetes setups.
  - Keep workstation-first defaults (local filesystem) for simplicity.
- Optional integrations via plugins/exporters (keep core strict and offline-capable):
  - Server-side exporters that mirror run metadata, metrics, and artifacts to external systems (e.g., MLflow Tracking, Weights & Biases).
  - Prefer lifecycle hooks that consume `run_manifest.json` / artifact manifests over plugins that influence execution semantics.
- Optional Kubernetes deployment path (for teams on scalable infra):
  - Publish versioned container images for the backend (API server; optionally worker) and provide reference manifests (Helm/Kustomize).
  - Keep the CLI as the primary UX; Kubernetes is an execution/deployment backend, not a UI.
- These are optional and should be driven by measured bottlenecks.

---

This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.
@@ -1,55 +0,0 @@
# Development Setup

Set up your local development environment for Fetch ML.

## Prerequisites

**Container Runtimes:**
- **Docker Compose**: For testing and development only
- **Podman**: For production experiment execution

- Go 1.21+
- Zig 0.11+
- Docker Compose (testing only)
- Redis (or use Docker)
- Git

## Quick Setup

```bash
# Clone repository
git clone https://github.com/jfraeys/fetch_ml.git
cd fetch_ml

# Start dependencies
see [Quick Start](quick-start.md) for Docker setup redis postgres

# Build all components
make build

# Run tests
see [Testing Guide](testing.md)
```

## Detailed Setup

## Quick Start
```bash
git clone https://github.com/jfraeys/fetch_ml.git
cd fetch_ml
see [Quick Start](quick-start.md) for Docker setup
make build
see [Testing Guide](testing.md)
```

## Key Commands
- `make build` - Build all components
- `see [Testing Guide](testing.md)` - Run tests
- `make dev` - Development build
- `see [CLI Reference](cli-reference.md) and [Zig CLI](zig-cli.md)` - Build CLI

## Common Issues
- Build fails: `go mod tidy`
- Zig errors: `cd cli && rm -rf zig-out zig-cache`
- Port conflicts: `lsof -i :9101`
@@ -218,7 +218,7 @@ func (w *Worker) executeTask(task *queue.Task) {

## Configuration

### API Server (`configs/config.yaml`)
### API Server (`configs/api/dev.yaml`)

```yaml
redis:
@@ -227,15 +227,14 @@ redis:
  db: 0
```

### Worker (`configs/worker-config.yaml`)
### Worker (`configs/workers/docker.yaml`)

```yaml
redis:
  addr: "localhost:6379"
  password: ""
  db: 0

metrics_flush_interval: 500ms
redis_addr: "localhost:6379"
redis_password: ""
redis_db: 0

metrics_flush_interval: "500ms"
```

## Monitoring