fetch_ml/AGENTS.md
Jeremie Fraeys b00fa236db
docs: add Known Limitations section and testing structure updates
Add Known Limitations section to AGENTS.md documenting:
- AMD GPU not implemented (use NVIDIA, Apple Silicon, or CPU)
- 100+ node gang allocation stress testing not yet implemented
- Podman-in-Docker CI requires privileged mode, not yet automated
- Error handling patterns for unimplemented features
- Container usage rules (Docker for testing/deployments, Podman for experiments)
- Error codes table (NOT_IMPLEMENTED, NOT_FOUND, INVALID_CONFIGURATION)

Update testing documentation to reflect new test locations:
- Unit tests moved from tests/unit/ to internal/ (Go convention)
- Update all test file path references in security testing docs
2026-03-12 16:33:19 -04:00

AGENTS.md - FetchML

Architecture

┌─────────┐     ┌─────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│   CLI   │────▶│   API   │────▶│ Scheduler│────▶│  Worker  │────▶│ Storage  │
│  (Zig)  │◄────│(Go/HTTP)│◄────│  (Go)    │◄────│  (Go)    │◄────│ (MinIO)  │
└─────────┘     └─────────┘     └──────────┘     └──────────┘     └──────────┘
                                     │
                                     ▼
                              ┌──────────┐
                              │   Redis  │
                              │  (Queue) │
                              └──────────┘

CLI ↔ Server: HTTP (default) or Unix socket (local). The execution_mode config option is either direct (bypasses the scheduler) or queue (full flow through the scheduler). Auth is an API key sent in a request header.


Container Architecture

Docker - Used for:

  • CI/CD testing pipelines (.forgejo/workflows/docker-tests.yml)
  • Application deployments (staging/production)
  • Build environments

Podman - Used for:

  • ML experiment isolation only
  • Running untrusted/3rd party ML workloads
  • Rootless container execution for security

Rule: Never use Podman for CI testing or deployments. Never use Docker for experiment isolation.


Critical Invariants

Audit Log — never break these

  • Append-only — entries are never modified or deleted
  • Hash chain — every entry includes SHA256 of the previous entry
  • All mutations to tasks/groups/tokens must produce an audit entry
  • Write the audit entry before the storage write — partial failures must be audited
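The hash-chain invariant above can be sketched as follows. This is an illustrative in-memory model, not the project's audit code: the Entry fields, the "|"-separated encoding, and the Log and Verify names are all assumptions. A real implementation would need a canonical, unambiguous serialization.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Entry is a hypothetical audit record. PrevHash links it to its predecessor.
type Entry struct {
	Seq      int
	Action   string
	PrevHash string
	Hash     string
}

// hashEntry computes SHA-256 over the entry's content and the previous hash.
// The "|" separator is a simplification for this sketch.
func hashEntry(seq int, action, prevHash string) string {
	sum := sha256.Sum256(fmt.Appendf(nil, "%d|%s|%s", seq, action, prevHash))
	return hex.EncodeToString(sum[:])
}

// Log is append-only: it exposes no method to modify or delete entries.
type Log struct {
	entries []Entry
}

// Append adds a new entry chained to the current tail.
func (l *Log) Append(action string) Entry {
	prev := ""
	if n := len(l.entries); n > 0 {
		prev = l.entries[n-1].Hash
	}
	e := Entry{Seq: len(l.entries), Action: action, PrevHash: prev}
	e.Hash = hashEntry(e.Seq, e.Action, e.PrevHash)
	l.entries = append(l.entries, e)
	return e
}

// Verify returns the index of the first entry that breaks the chain,
// or -1 if the whole chain is intact.
func Verify(entries []Entry) int {
	prev := ""
	for i, e := range entries {
		if e.PrevHash != prev || e.Hash != hashEntry(e.Seq, e.Action, e.PrevHash) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var l Log
	l.Append("task.create")
	l.Append("task.cancel")
	l.Append("token.revoke")
	fmt.Println("intact chain:", Verify(l.entries))

	// Tampering with a middle entry is detected at that entry.
	l.entries[1].Action = "task.delete"
	fmt.Println("after tampering:", Verify(l.entries))
}
```

Because each hash covers the previous hash, altering any entry invalidates that entry and, once rehashed, every entry after it.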

Auth

  • TokenFromContext(ctx) is the only authorised way to extract auth in handlers
  • Group visibility enforced at DB query level — never filter in application code
  • API keys hashed with bcrypt before storage — never log raw keys

Storage

  • All DB access through repository types in internal/db/repository/
  • Transactions via WithTx(ctx, db, func(tx *sql.Tx) error) — never manage tx manually
  • Migrations: additive only — new columns must be nullable or have defaults, never drop columns (mark deprecated, remove later)

CGO / Native Libs

Use -tags native_libs when building with C++ extensions. This has broken twice — always check build tags when touching GPU detection or native code.


Build Commands

make build              # all components
make dev                # fast, no LTO
make prod               # production-optimized
make prod-with-native   # production + C++ libs
make cross-platform     # Linux/macOS/Windows

cd cli && make dev      # Zig: fast compile + format
cd cli && make prod     # Zig: release=fast, LTO
cd cli && make debug    # Zig: no optimizations
cd cli && zig build test

Test Commands

make test               # all tests (Docker)
make test-unit
make test-integration
make test-e2e
make test-coverage

go test -v ./path/to/package -run TestName
go test -race ./path/to/package/...
LOG_LEVEL=debug go test -v ./path/to/package
FETCH_ML_E2E_PODMAN=1 go test ./tests/e2e/...

Lint / Security

make lint
make security-scan
make configlint
make openapi-validate
go vet ./...
cd cli && zig fmt .

Legacy Go — modernize when touching existing code only

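| Legacy | Modern |
| --- | --- |
| interface{} | any |
| for i := 0; i < n; i++ | for i := range n (Go 1.22+) |
| []byte(fmt.Sprintf(...)) | fmt.Appendf(nil, ...) |
| sort.Slice with closure | slices.Sort(x), or slices.SortFunc for a custom comparison |
| Manual contains loop | slices.Contains |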

Dependencies

  • Go 1.25+, Zig 0.15+, Python 3.11+
  • Redis (integration tests), Docker/Podman (container tests)

Known Limitations

See docs/known-limitations.md for full details.

Key items:

  • AMD GPU: Not implemented. Use NVIDIA, Apple Silicon, or CPU. Mock available for testing.
  • 100+ node gang allocation: Stress testing not yet implemented.
  • Podman-in-Docker CI: Requires privileged mode, not yet automated.

Error Handling:

// For unimplemented features:
return apierrors.NewNotImplemented("feature name")

// Validation:
if err := detectionResult.Validate(); err != nil {
    return err // Clear error message for user
}

Container Rule Reminder:

  • Docker = testing & deployments
  • Podman = experiment isolation only

Error Codes

| Code | HTTP Status | Use Case |
| --- | --- | --- |
| NOT_IMPLEMENTED | 501 | Feature planned but not available |
| NOT_FOUND | 404 | Resource doesn't exist |
| INVALID_CONFIGURATION | 400 | Bad config (e.g., AMD GPU in production) |