fetch_ml/AGENTS.md
Jeremie Fraeys b00fa236db
docs: add Known Limitations section and testing structure updates
Add Known Limitations section to AGENTS.md documenting:
- AMD GPU not implemented (use NVIDIA, Apple Silicon, or CPU)
- 100+ node gang allocation stress testing not yet implemented
- Podman-in-Docker CI requires privileged mode, not yet automated
- Error handling patterns for unimplemented features
- Container usage rules (Docker for testing/deployments, Podman for experiments)
- Error codes table (NOT_IMPLEMENTED, NOT_FOUND, INVALID_CONFIGURATION)

Update testing documentation to reflect new test locations:
- Unit tests moved from tests/unit/ to internal/ (Go convention)
- Update all test file path references in security testing docs
2026-03-12 16:33:19 -04:00


# AGENTS.md - FetchML
## Architecture
```
┌─────────┐     ┌─────────┐     ┌──────────┐     ┌─────────┐     ┌──────────┐
│   CLI   │────▶│   API   │────▶│ Scheduler│────▶│ Worker  │────▶│ Storage  │
│  (Zig)  │◄────│(Go/HTTP)│◄────│   (Go)   │◄────│  (Go)   │◄────│ (MinIO)  │
└─────────┘     └─────────┘     └──────────┘     └─────────┘     └──────────┘
                                ┌──────────┐
                                │  Redis   │
                                │ (Queue)  │
                                └──────────┘
```
**CLI ↔ Server**: HTTP (default) or Unix socket (local). The `execution_mode` config key
selects `direct` (bypass the scheduler) or `queue` (full flow). Auth is via an API key sent in a request header.
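
A minimal sketch of how the two `execution_mode` values could be validated when config is loaded. The `ExecutionMode` type and `ParseExecutionMode` helper are hypothetical names used for illustration; the project's actual config code may look different.

```go
package main

import "fmt"

// ExecutionMode mirrors the two documented values of the execution_mode key.
// The type name is an assumption, not taken from the FetchML codebase.
type ExecutionMode string

const (
	ModeDirect ExecutionMode = "direct" // bypass the scheduler
	ModeQueue  ExecutionMode = "queue"  // full flow
)

// ParseExecutionMode is a hypothetical helper that rejects unknown values
// instead of silently falling back to a default.
func ParseExecutionMode(s string) (ExecutionMode, error) {
	switch m := ExecutionMode(s); m {
	case ModeDirect, ModeQueue:
		return m, nil
	default:
		return "", fmt.Errorf("invalid execution_mode %q: want %q or %q", s, ModeDirect, ModeQueue)
	}
}

func main() {
	for _, raw := range []string{"direct", "queue", "batch"} {
		mode, err := ParseExecutionMode(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("mode:", mode)
	}
}
```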
---
## Container Architecture
**Docker** - Used for:
- CI/CD testing pipelines (`.forgejo/workflows/docker-tests.yml`)
- Application deployments (staging/production)
- Build environments
**Podman** - Used for:
- ML experiment isolation only
- Running untrusted/3rd party ML workloads
- Rootless container execution for security
**Rule**: Never use Podman for CI testing or deployments. Never use Docker for experiment isolation.
---
## Critical Invariants
### Audit Log — never break these
- **Append-only** — entries are never modified or deleted
- **Hash chain** — every entry includes SHA256 of the previous entry
- **All mutations** to tasks/groups/tokens must produce an audit entry
- Write the audit entry before the storage write — partial failures must be audited
### Auth
- `TokenFromContext(ctx)` is the only authorised way to extract auth in handlers
- Group visibility enforced at DB query level — never filter in application code
- API keys hashed with bcrypt before storage — never log raw keys
### Storage
- All DB access through repository types in `internal/db/repository/`
- Transactions via `WithTx(ctx, db, func(tx *sql.Tx) error)` — never manage tx manually
- Migrations: additive only — new columns must be nullable or have defaults,
never drop columns (mark deprecated, remove later)
### CGO / Native Libs
Use `-tags native_libs` when building with C++ extensions. This has broken twice —
always check build tags when touching GPU detection or native code.
---
## Build Commands
```bash
make build # all components
make dev # fast, no LTO
make prod # production-optimized
make prod-with-native # production + C++ libs
make cross-platform # Linux/macOS/Windows
cd cli && make dev # Zig: fast compile + format
cd cli && make prod # Zig: release=fast, LTO
cd cli && make debug # Zig: no optimizations
cd cli && zig build test
```
## Test Commands
```bash
make test # all tests (Docker)
make test-unit
make test-integration
make test-e2e
make test-coverage
go test -v ./path/to/package -run TestName
go test -race ./path/to/package/...
LOG_LEVEL=debug go test -v ./path/to/package
FETCH_ML_E2E_PODMAN=1 go test ./tests/e2e/...
```
## Lint / Security
```bash
make lint
make security-scan
make configlint
make openapi-validate
go vet ./...
cd cli && zig fmt .
```
---
## Legacy Go — modernize when touching existing code only
| Legacy                     | Modern                                                                |
| -------------------------- | --------------------------------------------------------------------- |
| `interface{}`              | `any`                                                                 |
| `for i := 0; i < n; i++`   | `for i := range n`                                                    |
| `[]byte(fmt.Sprintf(...))` | `fmt.Appendf(nil, ...)`                                               |
| `sort.Slice` with closure  | `slices.SortFunc(x, cmp)` (or `slices.Sort(x)` for ordered elements)  |
| Manual contains loop       | `slices.Contains`                                                     |
---
## Dependencies
- Go 1.25+, Zig 0.15+, Python 3.11+
- Redis (integration tests), Docker/Podman (container tests)
---
## Known Limitations
See `docs/known-limitations.md` for full details.
**Key items**:
- **AMD GPU**: Not implemented. Use NVIDIA, Apple Silicon, or CPU. Mock available for testing.
- **100+ node gang allocation**: Stress testing not yet implemented.
- **Podman-in-Docker CI**: Requires privileged mode, not yet automated.
**Error Handling**:
```go
// For unimplemented features:
return apierrors.NewNotImplemented("feature name")

// Validation:
if err := detectionResult.Validate(); err != nil {
	return err // Clear error message for user
}
```
**Container Rule Reminder**:
- Docker = testing & deployments
- Podman = experiment isolation only
---
## Error Codes
| Code | HTTP Status | Use Case |
|------|-------------|----------|
| `NOT_IMPLEMENTED` | 501 | Feature planned but not available |
| `NOT_FOUND` | 404 | Resource doesn't exist |
| `INVALID_CONFIGURATION` | 400 | Bad config (e.g., AMD GPU in production) |
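
One way the table above could map onto Go types is sketched below. The `Code` and `APIError` types and this `NewNotImplemented` body are assumptions for illustration; the real `apierrors` package may be structured differently. Falling back to 500 for unknown codes is also a choice made here, not something the table specifies.

```go
package main

import (
	"fmt"
	"net/http"
)

// Code is a machine-readable error code as listed in the table.
type Code string

const (
	CodeNotImplemented       Code = "NOT_IMPLEMENTED"
	CodeNotFound             Code = "NOT_FOUND"
	CodeInvalidConfiguration Code = "INVALID_CONFIGURATION"
)

// APIError is a hypothetical error type carrying a code and a message.
type APIError struct {
	Code    Code
	Message string
}

func (e *APIError) Error() string { return fmt.Sprintf("%s: %s", e.Code, e.Message) }

// HTTPStatus maps each documented code to its HTTP status.
func (e *APIError) HTTPStatus() int {
	switch e.Code {
	case CodeNotImplemented:
		return http.StatusNotImplemented // 501
	case CodeNotFound:
		return http.StatusNotFound // 404
	case CodeInvalidConfiguration:
		return http.StatusBadRequest // 400
	default:
		return http.StatusInternalServerError // assumption for unknown codes
	}
}

// NewNotImplemented mirrors the constructor name used in this document.
func NewNotImplemented(feature string) *APIError {
	return &APIError{Code: CodeNotImplemented, Message: feature + " is not implemented"}
}

func main() {
	err := NewNotImplemented("AMD GPU")
	fmt.Println(err.HTTPStatus(), err)
}
```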