# AGENTS.md - FetchML

## Architecture

```
┌─────────┐     ┌─────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│   CLI   │────▶│   API   │────▶│ Scheduler│────▶│  Worker  │────▶│ Storage  │
│  (Zig)  │◄────│(Go/HTTP)│◄────│  (Go)    │◄────│  (Go)    │◄────│ (MinIO)  │
└─────────┘     └─────────┘     └──────────┘     └──────────┘     └──────────┘
                                     │
                                     ▼
                              ┌──────────┐
                              │   Redis  │
                              │  (Queue) │
                              └──────────┘
```

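**CLI ↔ Server**: The CLI talks to the server over HTTP (default) or a Unix socket (local use).
The `execution_mode` config option selects `direct` (bypass the scheduler) or `queue`
(full flow through the scheduler). Requests authenticate with an API key sent in a header.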

---

## Container Architecture

**Docker** - Used for:
- CI/CD testing pipelines (`.forgejo/workflows/docker-tests.yml`)
- Application deployments (staging/production)
- Build environments

**Podman** - Used for:
- ML experiment isolation only
- Running untrusted/third-party ML workloads
- Rootless container execution for security

**Rule**: Never use Podman for CI testing or deployments. Never use Docker for experiment isolation.

---

## Critical Invariants

### Audit Log — never break these

- **Append-only** — entries are never modified or deleted
- **Hash chain** — every entry includes the SHA-256 hash of the previous entry
- **All mutations** to tasks/groups/tokens must produce an audit entry
- Write the audit entry before the storage write — partial failures must be audited
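
The append-only and hash-chain rules above can be illustrated with a minimal in-memory sketch. The `Entry` and `Log` types, their fields, and the exact bytes fed to the hash are assumptions made for illustration; the project's real audit schema is not shown in this document.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Entry is a hypothetical audit record, not the project's real schema.
type Entry struct {
	Action   string // what happened, e.g. "task.create"
	PrevHash string // hex SHA-256 of the previous entry ("" for the first)
	Hash     string // hex SHA-256 over PrevHash and Action
}

// Log is an in-memory chain that only supports appending.
type Log struct {
	entries []Entry
}

// hashEntry derives an entry's hash from its predecessor's hash and its payload.
func hashEntry(prevHash, action string) string {
	sum := sha256.Sum256([]byte(prevHash + "\n" + action))
	return hex.EncodeToString(sum[:])
}

// Append links a new entry to the current tail of the chain.
func (l *Log) Append(action string) Entry {
	prev := ""
	if n := len(l.entries); n > 0 {
		prev = l.entries[n-1].Hash
	}
	e := Entry{Action: action, PrevHash: prev, Hash: hashEntry(prev, action)}
	l.entries = append(l.entries, e)
	return e
}

// Verify returns the index of the first broken entry, or -1 if the chain is intact.
func (l *Log) Verify() int {
	prev := ""
	for i, e := range l.entries {
		if e.PrevHash != prev || e.Hash != hashEntry(prev, e.Action) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var audit Log
	audit.Append("task.create")
	audit.Append("task.update")
	fmt.Println("intact:", audit.Verify() == -1)

	audit.entries[0].Action = "tampered" // simulate an illegal in-place edit
	fmt.Println("first broken entry:", audit.Verify())
}
```

Because each hash covers the previous one, editing or deleting any entry invalidates every later link, which is why in-place changes are forbidden.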

### Auth

- `TokenFromContext(ctx)` is the only authorised way to extract auth in handlers
- Group visibility enforced at DB query level — never filter in application code
- API keys hashed with bcrypt before storage — never log raw keys

### Storage

- All DB access through repository types in `internal/db/repository/`
- Transactions via `WithTx(ctx, db, func(tx *sql.Tx) error)` — never manage tx manually
- Migrations: additive only — new columns must be nullable or have defaults,
  never drop columns (mark deprecated, remove later)

### CGO / Native Libs

Use `-tags native_libs` when building with C++ extensions. This has broken twice, so
always check build tags when touching GPU detection or native code.

---

## Build Commands

```bash
make build              # all components
make dev                # fast, no LTO
make prod               # production-optimized
make prod-with-native   # production + C++ libs
make cross-platform     # Linux/macOS/Windows

cd cli && make dev      # Zig: fast compile + format
cd cli && make prod     # Zig: release=fast, LTO
cd cli && make debug    # Zig: no optimizations
cd cli && zig build test
```

## Test Commands

```bash
make test               # all tests (Docker)
make test-unit
make test-integration
make test-e2e
make test-coverage

go test -v ./path/to/package -run TestName
go test -race ./path/to/package/...
LOG_LEVEL=debug go test -v ./path/to/package
FETCH_ML_E2E_PODMAN=1 go test ./tests/e2e/...
```

## Lint / Security

```bash
make lint
make security-scan
make configlint
make openapi-validate
go vet ./...
cd cli && zig fmt .
```

---

## Legacy Go — modernize only when already touching the code

| Legacy                     | Modern                                                   |
| -------------------------- | -------------------------------------------------------- |
| `interface{}`              | `any`                                                    |
| `for i := 0; i < n; i++`   | `for i := range n`                                       |
| `[]byte(fmt.Sprintf(...))` | `fmt.Appendf(nil, ...)`                                  |
| `sort.Slice` with closure  | `slices.SortFunc`, or `slices.Sort(x)` for ordered types |
| Manual contains loop       | `slices.Contains`                                        |

---

## Dependencies

- Go 1.25+, Zig 0.15+, Python 3.11+
- Redis (integration tests), Docker/Podman (container tests)

---

## Known Limitations

See `docs/known-limitations.md` for full details.

**Key items**:
- **AMD GPU**: Not implemented. Use NVIDIA, Apple Silicon, or CPU. Mock available for testing.
- **100+ node gang allocation**: Stress testing not yet implemented.
- **Podman-in-Docker CI**: Requires privileged mode, not yet automated.

**Error Handling**:
```go
// For unimplemented features:
return apierrors.NewNotImplemented("feature name")

// Validation:
if err := detectionResult.Validate(); err != nil {
    return err // Clear error message for user
}
```

**Container Rule Reminder**:
- Docker = testing & deployments
- Podman = experiment isolation only

---

## Error Codes

| Code | HTTP Status | Use Case |
|------|-------------|----------|
| `NOT_IMPLEMENTED` | 501 | Feature planned but not available |
| `NOT_FOUND` | 404 | Resource doesn't exist |
| `INVALID_CONFIGURATION` | 400 | Bad config (e.g., AMD GPU in production) |
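
The table can be read as a lookup from error code to HTTP status. The sketch below shows one possible wiring; `APIError`, `statusByCode`, `HTTPStatus`, and the local `NewNotImplemented` are hypothetical stand-ins rather than the real `apierrors` package, and the fallback to 500 for unknown errors is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// APIError is a hypothetical error shape carrying one of the codes above.
type APIError struct {
	Code    string
	Message string
}

func (e *APIError) Error() string { return e.Code + ": " + e.Message }

// statusByCode mirrors the error code table.
var statusByCode = map[string]int{
	"NOT_IMPLEMENTED":       http.StatusNotImplemented,
	"NOT_FOUND":             http.StatusNotFound,
	"INVALID_CONFIGURATION": http.StatusBadRequest,
}

// NewNotImplemented is a local stand-in for the apierrors constructor.
func NewNotImplemented(feature string) error {
	return &APIError{Code: "NOT_IMPLEMENTED", Message: feature + " is not implemented"}
}

// HTTPStatus resolves an error, wrapped or not, to a status code.
// Unknown errors fall back to 500, which is an assumption of this sketch.
func HTTPStatus(err error) int {
	var apiErr *APIError
	if errors.As(err, &apiErr) {
		if status, ok := statusByCode[apiErr.Code]; ok {
			return status
		}
	}
	return http.StatusInternalServerError
}

func main() {
	err := fmt.Errorf("detect GPU: %w", NewNotImplemented("AMD GPU detection"))
	fmt.Println(HTTPStatus(err), err)
}
```

Using `errors.As` lets lower layers wrap the error with context while the handler still resolves the original code.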