fetch_ml/docs/known-limitations.md

Known Limitations

This document tracks features that are planned but not yet implemented, along with workarounds where available.

GPU Support

AMD GPU (ROCm)

Status: Not Implemented (Deferred)
Priority: Low (adoption growing but not mainstream for ML/AI)

AMD GPU detection and ROCm integration are not yet implemented. While AMD GPU adoption for ML/AI workloads is growing, it remains less mainstream than NVIDIA. Requesting an AMD GPU returns the error shown below.

Rationale for Deferral:

  • NVIDIA dominates ML/AI training and inference (90%+ market share)
  • AMD ROCm ecosystem still maturing for deep learning frameworks
  • Limited user demand compared to NVIDIA/Apple Silicon
  • Can be added later when user demand increases

Error Message:

AMD GPU support is not yet implemented. 
Use NVIDIA GPUs, Apple Silicon, or CPU-only mode. 
For development/testing, use FETCH_ML_MOCK_GPU_TYPE=AMD

Workaround:

  • Use NVIDIA GPUs with FETCH_ML_GPU_TYPE=nvidia
  • Use Apple Silicon with Metal (FETCH_ML_GPU_TYPE=apple)
  • Use CPU-only mode with FETCH_ML_GPU_TYPE=none
  • For testing/development, use mock AMD:
    FETCH_ML_MOCK_GPU_TYPE=AMD
    FETCH_ML_MOCK_GPU_COUNT=4
    

Implementation Requirements (for future consideration):

  • ROCm SMI Go bindings or CGO wrapper
  • AMD GPU hardware for testing
  • ROCm runtime in container images
  • Driver compatibility matrix
  • User demand validation (file an issue if you need this)

Platform Support

Windows Process Isolation

Status: Not Implemented

Process isolation limits (max open files, max processes) are not enforced on Windows. The Windows implementation uses stub functions that return errors when limits are requested.

Error Message:

process isolation limits not implemented on Windows (max_open_files=1000, max_processes=100)

Workaround: Use Linux or macOS for production deployments requiring process isolation.
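
The stub behavior described above can be sketched as follows. This is not the project's code: the Limits type and applyLimitsStub function are hypothetical names, and treating a request with no limits as a success is an assumption. In a real codebase such a file would normally carry a `//go:build windows` constraint; it is omitted here so the sketch runs on any platform.

```go
package main

import "fmt"

// Limits is a hypothetical container for the two limits named in this section.
type Limits struct {
	MaxOpenFiles int
	MaxProcesses int
}

// applyLimitsStub mimics a Windows-style stub: it succeeds only when no
// limits are requested (an assumption) and otherwise reports that
// enforcement is missing, using the message format shown above.
func applyLimitsStub(l Limits) error {
	if l.MaxOpenFiles == 0 && l.MaxProcesses == 0 {
		return nil
	}
	return fmt.Errorf(
		"process isolation limits not implemented on Windows (max_open_files=%d, max_processes=%d)",
		l.MaxOpenFiles, l.MaxProcesses)
}

func main() {
	fmt.Println(applyLimitsStub(Limits{MaxOpenFiles: 1000, MaxProcesses: 100}))
}
```

Returning an error instead of silently ignoring the limits makes it obvious to callers that the requested isolation is not in effect.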

Implementation Requirements:

  • Windows Job Objects integration
  • VirtualLock API for memory locking
  • Platform-specific testing

API Features

REST API Task Operations

Status: Not Implemented

Task creation, cancellation, and detail retrieval via the REST API are not implemented. These operations must use the WebSocket protocol.

Error Message:

{
  "error": "Not implemented",
  "code": "NOT_IMPLEMENTED",
  "message": "Task creation via REST API not yet implemented - use WebSocket"
}

Workaround: Use the WebSocket protocol for task operations:

  • Connect to the /ws endpoint
  • Use the binary protocol for job submission
  • See the WebSocket API documentation

Experiments API

Status: Not Implemented

Experiment listing and creation endpoints return empty stub responses.

Workaround: Use direct database access or experiment manager interfaces.


Plugin Version Query

Status: Not Implemented

Plugin version queries return a hardcoded "1.0.0" instead of the actual plugin binary or container version. Backend support exists, but it is exposed over HTTP REST while the CLI communicates over WebSocket, so the version is not reachable from the CLI.

Workaround: Query plugin binaries directly for version information.


Scheduler Features

Gang Allocation Stress Testing

Status: Partial (100+ node jobs not tested)

While gang allocation works for typical multi-node jobs, stress testing with 100+ nodes is not yet implemented.

Workaround: Test with smaller node counts (8-16 nodes) for validation.


Test Infrastructure

Podman-in-Docker CI Tests

Status: Not Implemented

Running Podman containers inside Docker CI runners requires privileged mode and cgroup configuration that is not yet automated.

Workaround: Tests run with direct Docker container execution.


Reporting

Test Coverage Dashboard

Status: Not Implemented

Automated coverage dashboard with trend tracking is planned but not yet available.

Workaround: Run go test with the -coverprofile flag (for example, go test -coverprofile=coverage.out ./...) and upload the resulting profile as an artifact manually.


Native Libraries (C++)

AMD GPU Support in Native Libs

Status: Not Implemented

The native C++ libraries (dataset_hash, queue_index) do not yet have AMD GPU acceleration.

Workaround: Use the CPU implementations, which are still significantly faster than pure Go.


How to Handle Not Implemented Errors

For Users

When you encounter a "not implemented" error:

  1. Check this document for workarounds
  2. Use mock mode for development/testing (see gpu_detector_mock.go)
  3. File an issue to request the feature with your use case
  4. Consider contributing - see CONTRIBUTING.md

For Developers

When implementing new features:

  1. Use apierrors.NewNotImplemented(featureName) for clear error messages
  2. Add the limitation to this document
  3. Provide a workaround if possible
  4. Reference any GitHub tracking issues

Example:

// Reject the unsupported backend with an explicit not-implemented error.
if requestedFeature == "rocm" {
    return apierrors.NewNotImplemented("AMD ROCm support")
}

Feature Request Process

To request an unimplemented feature:

  1. Open a GitHub issue with label feature-request
  2. Describe your use case and hardware/environment
  3. Mention if you're willing to test or contribute
  4. Reference any related limitations in this document

Last updated: March 2026