docs: add Known Limitations section and testing structure updates

Add Known Limitations section to AGENTS.md documenting:
- AMD GPU not implemented (use NVIDIA, Apple Silicon, or CPU)
- 100+ node gang allocation stress testing not yet implemented
- Podman-in-Docker CI requires privileged mode, not yet automated
- Error handling patterns for unimplemented features
- Container usage rules (Docker for testing/deployments, Podman for experiments)
- Error codes table (NOT_IMPLEMENTED, NOT_FOUND, INVALID_CONFIGURATION)

Update testing documentation to reflect new test locations:
- Unit tests moved from tests/unit/ to internal/ (Go convention)
- Update all test file path references in security testing docs

2026-03-12 16:33:19 -04:00

5 KiB


AGENTS.md - FetchML

Architecture

┌─────────┐     ┌─────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│   CLI   │────▶│   API   │────▶│ Scheduler│────▶│  Worker  │────▶│ Storage  │
│  (Zig)  │◄────│(Go/HTTP)│◄────│  (Go)    │◄────│  (Go)    │◄────│ (MinIO)  │
└─────────┘     └─────────┘     └──────────┘     └──────────┘     └──────────┘
                                     │
                                     ▼
                              ┌──────────┐
                              │   Redis  │
                              │  (Queue) │
                              └──────────┘

CLI ↔ Server: HTTP (default) or Unix socket (local). execution_mode config: direct (bypass scheduler) or queue (full flow). Auth via API key in header.
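As a rough illustration of the two execution_mode values, the following sketch validates the setting and reports whether a request would pass through the scheduler. The type and function names (ExecutionMode, ParseExecutionMode, UsesScheduler) are hypothetical and not taken from the FetchML codebase. No default is assumed, so an empty or unknown value is rejected.

```go
package main

import "fmt"

// ExecutionMode mirrors the two execution_mode values named above.
// The type itself is hypothetical, not FetchML's actual config type.
type ExecutionMode string

const (
	ModeDirect ExecutionMode = "direct" // bypass the scheduler
	ModeQueue  ExecutionMode = "queue"  // full flow through the scheduler
)

// ParseExecutionMode accepts only the two documented values.
func ParseExecutionMode(s string) (ExecutionMode, error) {
	switch m := ExecutionMode(s); m {
	case ModeDirect, ModeQueue:
		return m, nil
	default:
		return "", fmt.Errorf("invalid execution_mode %q: want %q or %q", s, ModeDirect, ModeQueue)
	}
}

// UsesScheduler reports whether requests in this mode go through the scheduler.
func (m ExecutionMode) UsesScheduler() bool { return m == ModeQueue }
```

Rejecting unknown values at load time keeps a typo in the config from silently selecting the wrong flow.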

Container Architecture

Docker - Used for:

CI/CD testing pipelines (.forgejo/workflows/docker-tests.yml)
Application deployments (staging/production)
Build environments

Podman - Used for:

ML experiment isolation only
Running untrusted or third-party ML workloads
Rootless container execution for security

Rule: Never use Podman for CI testing or deployments. Never use Docker for experiment isolation.

Critical Invariants

Audit Log — never break these

Append-only — entries are never modified or deleted
Hash chain — every entry includes SHA256 of the previous entry
All mutations to tasks/groups/tokens must produce an audit entry
Write the audit entry before the storage write — partial failures must be audited
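The audit invariants above can be sketched with a small in-memory log. This is a minimal illustration, not the project's implementation: the Entry fields, the choice to hash PrevHash plus Action, and the names Log, Append, Verify, and RecordThenWrite are all assumptions made for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
)

// ErrChainBroken is returned when an entry's link or hash does not match.
var ErrChainBroken = errors.New("audit chain broken")

// Entry is a hypothetical audit record; the real schema is not shown here.
type Entry struct {
	Action   string // e.g. "task.create"
	PrevHash string // hex SHA-256 of the previous entry, "" for the first
	Hash     string // hex SHA-256 over PrevHash and Action
}

// hashEntry computes the chained hash for one entry.
func hashEntry(prevHash, action string) string {
	sum := sha256.Sum256([]byte(prevHash + "\n" + action))
	return hex.EncodeToString(sum[:])
}

// Log is append-only: it exposes no update or delete method.
type Log struct {
	entries []Entry
}

// Append links a new entry to the current tail of the chain.
func (l *Log) Append(action string) Entry {
	prev := ""
	if n := len(l.entries); n > 0 {
		prev = l.entries[n-1].Hash
	}
	e := Entry{Action: action, PrevHash: prev, Hash: hashEntry(prev, action)}
	l.entries = append(l.entries, e)
	return e
}

// Verify walks the chain and fails on the first inconsistent entry.
func (l *Log) Verify() error {
	prev := ""
	for _, e := range l.entries {
		if e.PrevHash != prev || e.Hash != hashEntry(prev, e.Action) {
			return ErrChainBroken
		}
		prev = e.Hash
	}
	return nil
}

// RecordThenWrite appends the audit entry first and only then runs the
// storage write, so a failed write still leaves an audit trail.
func RecordThenWrite(l *Log, action string, write func() error) error {
	l.Append(action)
	return write()
}
```

Because each hash covers the previous one, editing any earlier entry invalidates every later link, which is what makes tampering detectable by Verify.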

Auth

TokenFromContext(ctx) is the only authorised way to extract auth in handlers
Group visibility enforced at DB query level — never filter in application code
API keys hashed with bcrypt before storage — never log raw keys

Storage

All DB access through repository types in internal/db/repository/
Transactions via WithTx(ctx, db, func(tx *sql.Tx) error) — never manage tx manually
Migrations: additive only — new columns must be nullable or have defaults, never drop columns (mark deprecated, remove later)

CGO / Native Libs

Use -tags native_libs when building with C++ extensions. This has broken twice — always check build tags when touching GPU detection or native code.

Build Commands

make build              # all components
make dev                # fast, no LTO
make prod               # production-optimized
make prod-with-native   # production + C++ libs
make cross-platform     # Linux/macOS/Windows

cd cli && make dev      # Zig: fast compile + format
cd cli && make prod     # Zig: release=fast, LTO
cd cli && make debug    # Zig: no optimizations
cd cli && zig build test

Test Commands

make test               # all tests (Docker)
make test-unit
make test-integration
make test-e2e
make test-coverage

go test -v ./path/to/package -run TestName
go test -race ./path/to/package/...
LOG_LEVEL=debug go test -v ./path/to/package
FETCH_ML_E2E_PODMAN=1 go test ./tests/e2e/...

Lint / Security

make lint
make security-scan
make configlint
make openapi-validate
go vet ./...
cd cli && zig fmt .

Legacy Go: modernize only when touching existing code

Legacy	Modern
`interface{}`	`any`
`for i := 0; i < n; i++`	`for i := range n`
`[]byte(fmt.Sprintf(...))`	`fmt.Appendf(nil, ...)`
`sort.Slice` with closure	`slices.SortFunc(x, cmp)` (or `slices.Sort(x)` for ordered element types)
Manual contains loop	`slices.Contains`

Dependencies

Go 1.25+, Zig 0.15+, Python 3.11+
Redis (integration tests), Docker/Podman (container tests)

Known Limitations

See docs/known-limitations.md for full details.

Key items:

AMD GPU: Not implemented. Use NVIDIA, Apple Silicon, or CPU. Mock available for testing.
100+ node gang allocation: Stress testing not yet implemented.
Podman-in-Docker CI: Requires privileged mode, not yet automated.

Error Handling:

// For unimplemented features:
return apierrors.NewNotImplemented("feature name")

// Validation:
if err := detectionResult.Validate(); err != nil {
    return err // Clear error message for user
}

Container Rule Reminder:

Docker = testing & deployments
Podman = experiment isolation only

Error Codes

Code	HTTP Status	Use Case
`NOT_IMPLEMENTED`	501	Feature planned but not available
`NOT_FOUND`	404	Resource doesn't exist
`INVALID_CONFIGURATION`	400	Bad config (e.g., AMD GPU in production)
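The table above could map onto Go error values roughly as follows. This is a sketch under assumptions: only the name NewNotImplemented appears in this document, while the APIError type, the Code constants, the message format, and the HTTPStatus helper are hypothetical and may differ from the real apierrors package.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Code is one of the error codes from the table above.
type Code string

const (
	CodeNotImplemented       Code = "NOT_IMPLEMENTED"
	CodeNotFound             Code = "NOT_FOUND"
	CodeInvalidConfiguration Code = "INVALID_CONFIGURATION"
)

// APIError is a hypothetical shape for what apierrors might return.
type APIError struct {
	Code    Code
	Message string
}

func (e *APIError) Error() string { return fmt.Sprintf("%s: %s", e.Code, e.Message) }

// NewNotImplemented reuses the constructor name from this document;
// its body and message format are assumed.
func NewNotImplemented(feature string) error {
	return &APIError{Code: CodeNotImplemented, Message: feature + " is not implemented"}
}

// HTTPStatus maps an error to the status listed in the table and falls
// back to 500 for anything that is not a recognised APIError.
func HTTPStatus(err error) int {
	var apiErr *APIError
	if !errors.As(err, &apiErr) {
		return http.StatusInternalServerError
	}
	switch apiErr.Code {
	case CodeNotImplemented:
		return http.StatusNotImplemented
	case CodeNotFound:
		return http.StatusNotFound
	case CodeInvalidConfiguration:
		return http.StatusBadRequest
	default:
		return http.StatusInternalServerError
	}
}
```

Using errors.As means the mapping still works when a handler wraps the error with additional context before returning it.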
