# AGENTS.md - FetchML

## Architecture

```
┌─────────┐     ┌─────────┐     ┌──────────┐     ┌─────────┐     ┌──────────┐
│   CLI   │────▶│   API   │────▶│ Scheduler│────▶│ Worker  │────▶│ Storage  │
│  (Zig)  │◄────│(Go/HTTP)│◄────│   (Go)   │◄────│  (Go)   │◄────│ (MinIO)  │
└─────────┘     └─────────┘     └──────────┘     └─────────┘     └──────────┘
                                      │
                                      ▼
                                ┌──────────┐
                                │  Redis   │
                                │ (Queue)  │
                                └──────────┘
```

**CLI ↔ Server**: HTTP (default) or Unix socket (local). `execution_mode` config: `direct` (bypass scheduler) or `queue` (full flow). Auth via API key in header.

---

## Container Architecture

**Docker** - Used for:

- CI/CD testing pipelines (`.forgejo/workflows/docker-tests.yml`)
- Application deployments (staging/production)
- Build environments

**Podman** - Used for:

- ML experiment isolation only
- Running untrusted/3rd party ML workloads
- Rootless container execution for security

**Rule**: Never use Podman for CI testing or deployments. Never use Docker for experiment isolation.

---

## Critical Invariants

### Audit Log: never break these

- **Append-only**: entries are never modified or deleted
- **Hash chain**: every entry includes SHA256 of the previous entry
- **All mutations** to tasks/groups/tokens must produce an audit entry
- Write the audit entry before the storage write; partial failures must be audited

### Auth

- `TokenFromContext(ctx)` is the only authorised way to extract auth in handlers
- Group visibility enforced at DB query level; never filter in application code
- API keys hashed with bcrypt before storage; never log raw keys

### Storage

- All DB access through repository types in `internal/db/repository/`
- Transactions via `WithTx(ctx, db, func(tx *sql.Tx) error)`; never manage tx manually
- Migrations: additive only. New columns must be nullable or have defaults; never drop columns (mark deprecated, remove later)

### CGO / Native Libs

Use `-tags native_libs` when building with C++ extensions. This has broken twice, so always check build tags when touching GPU detection or native code.

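
The append-only and hash-chain invariants listed under Audit Log can be illustrated with a minimal sketch. The `Entry` type and the `hashEntry`, `Append`, and `Verify` helpers are hypothetical names chosen for illustration, and the hash input layout (previous hash, a NUL separator, then the action) is an assumption; FetchML's real audit schema and hashing scheme may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Entry is a hypothetical audit record; the real schema may differ.
type Entry struct {
	Action   string
	PrevHash string // hex SHA-256 of the previous entry; "" for the first entry
	Hash     string // hex SHA-256 over PrevHash and Action
}

// hashEntry derives an entry's hash from its predecessor's hash and its payload.
// The NUL separator is an assumption made for this sketch.
func hashEntry(prevHash, action string) string {
	sum := sha256.Sum256([]byte(prevHash + "\x00" + action))
	return hex.EncodeToString(sum[:])
}

// Append returns the log extended by one entry; existing entries are left untouched.
func Append(log []Entry, action string) []Entry {
	prev := ""
	if n := len(log); n > 0 {
		prev = log[n-1].Hash
	}
	return append(log, Entry{Action: action, PrevHash: prev, Hash: hashEntry(prev, action)})
}

// Verify walks the chain and returns the index of the first broken entry, or -1.
func Verify(log []Entry) int {
	prev := ""
	for i, e := range log {
		if e.PrevHash != prev || e.Hash != hashEntry(prev, e.Action) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var log []Entry
	log = Append(log, "task.create")
	log = Append(log, "group.update")
	log = Append(log, "token.revoke")
	fmt.Println(Verify(log)) // prints -1: chain is intact

	log[1].Action = "group.delete" // simulate tampering with a stored entry
	fmt.Println(Verify(log))       // prints 1: first entry whose hash no longer matches
}
```

Because each hash covers the previous one, editing or deleting any stored entry invalidates every later link, which is what makes the append-only rule checkable after the fact.
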

---

## Build Commands

```bash
make build              # all components
make dev                # fast, no LTO
make prod               # production-optimized
make prod-with-native   # production + C++ libs
make cross-platform     # Linux/macOS/Windows

cd cli && make dev      # Zig: fast compile + format
cd cli && make prod     # Zig: release=fast, LTO
cd cli && make debug    # Zig: no optimizations
cd cli && zig build test
```

## Test Commands

```bash
make test               # all tests (Docker)
make test-unit
make test-integration
make test-e2e
make test-coverage

go test -v ./path/to/package -run TestName
go test -race ./path/to/package/...
LOG_LEVEL=debug go test -v ./path/to/package
FETCH_ML_E2E_PODMAN=1 go test ./tests/e2e/...
```

## Lint / Security

```bash
make lint
make security-scan
make configlint
make openapi-validate
go vet ./...
cd cli && zig fmt .
```

---

## Legacy Go: modernize only when touching existing code

| Legacy                     | Modern                                         |
| -------------------------- | ---------------------------------------------- |
| `interface{}`              | `any`                                          |
| `for i := 0; i < n; i++`   | `for i := range n`                             |
| `[]byte(fmt.Sprintf(...))` | `fmt.Appendf(nil, ...)`                        |
| `sort.Slice` with closure  | `slices.Sort(x)` or `slices.SortFunc(x, cmp)`  |
| Manual contains loop       | `slices.Contains`                              |

---

## Dependencies

- Go 1.25+, Zig 0.15+, Python 3.11+
- Redis (integration tests), Docker/Podman (container tests)

---

## Known Limitations

See `docs/known-limitations.md` for full details.

**Key items**:

- **AMD GPU**: Not implemented. Use NVIDIA, Apple Silicon, or CPU. Mock available for testing.
- **100+ node gang allocation**: Stress testing not yet implemented.
- **Podman-in-Docker CI**: Requires privileged mode, not yet automated.


**Error Handling**:

```go
// For unimplemented features:
return apierrors.NewNotImplemented("feature name")

// Validation:
if err := detectionResult.Validate(); err != nil {
    return err // Clear error message for user
}
```

**Container Rule Reminder**:

- Docker = testing & deployments
- Podman = experiment isolation only

---

## Error Codes

| Code                    | HTTP Status | Use Case                                 |
| ----------------------- | ----------- | ---------------------------------------- |
| `NOT_IMPLEMENTED`       | 501         | Feature planned but not available        |
| `NOT_FOUND`             | 404         | Resource doesn't exist                   |
| `INVALID_CONFIGURATION` | 400         | Bad config (e.g., AMD GPU in production) |
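
The code-to-status mapping in the Error Codes table could be expressed in Go roughly as follows. This is a sketch, not the project's actual `apierrors` package: `statusByCode` and `StatusFor` are hypothetical names, and the fallback to 500 for unknown codes is an assumption that the table does not specify.

```go
package main

import (
	"fmt"
	"net/http"
)

// statusByCode mirrors the Error Codes table. The map name is hypothetical.
var statusByCode = map[string]int{
	"NOT_IMPLEMENTED":       http.StatusNotImplemented, // 501
	"NOT_FOUND":             http.StatusNotFound,       // 404
	"INVALID_CONFIGURATION": http.StatusBadRequest,     // 400
}

// StatusFor returns the HTTP status for an error code.
// Falling back to 500 for unknown codes is an assumption of this sketch.
func StatusFor(code string) int {
	if status, ok := statusByCode[code]; ok {
		return status
	}
	return http.StatusInternalServerError
}

func main() {
	for _, code := range []string{"NOT_IMPLEMENTED", "NOT_FOUND", "INVALID_CONFIGURATION"} {
		fmt.Println(code, StatusFor(code))
	}
}
```

Keeping the mapping in one table-like structure makes it easy to compare against this document when a new code is added.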