2026-03-14 16:59:16 -04:00

Known Limitations

This document tracks features that are planned but not yet implemented, along with workarounds where available.

GPU Support

AMD GPU (ROCm)

Status: ⏳ Not Implemented (Deferred)
Priority: Low (adoption growing but not mainstream for ML/AI)

AMD GPU detection and ROCm integration are not yet implemented. While AMD GPU adoption for ML/AI workloads is growing, it remains less mainstream than NVIDIA. The system returns a clear error if an AMD GPU is requested.

Rationale for Deferral:

NVIDIA dominates ML/AI training and inference (90%+ market share)
AMD ROCm ecosystem still maturing for deep learning frameworks
Limited user demand compared to NVIDIA/Apple Silicon
Can be added later when user demand increases

Error Message:

AMD GPU support is not yet implemented. 
Use NVIDIA GPUs, Apple Silicon, or CPU-only mode. 
For development/testing, use FETCH_ML_MOCK_GPU_TYPE=AMD

Workaround:

Use NVIDIA GPUs with FETCH_ML_GPU_TYPE=nvidia
Use Apple Silicon with Metal (FETCH_ML_GPU_TYPE=apple)
Use CPU-only mode with FETCH_ML_GPU_TYPE=none

For testing/development, use mock AMD:

FETCH_ML_MOCK_GPU_TYPE=AMD
FETCH_ML_MOCK_GPU_COUNT=4

Implementation Requirements (for future consideration):

ROCm SMI Go bindings or CGO wrapper
AMD GPU hardware for testing
ROCm runtime in container images
Driver compatibility matrix
User demand validation (file an issue if you need this)

Platform Support

Windows Process Isolation

Status: ⏳ Not Implemented

Process isolation limits (max open files, max processes) are not enforced on Windows. The Windows implementation uses stub functions that return errors when limits are requested.

Error Message:

process isolation limits not implemented on Windows (max_open_files=1000, max_processes=100)

Workaround: Use Linux or macOS for production deployments requiring process isolation.

Implementation Requirements:

Windows Job Objects integration
VirtualLock API for memory locking
Platform-specific testing

API Features

REST API Task Operations

Status: ⏳ Not Implemented

Task creation, cancellation, and details via REST API are not implemented. These operations must use the WebSocket protocol.

Error Message:

{
  "error": "Not implemented",
  "code": "NOT_IMPLEMENTED",
  "message": "Task creation via REST API not yet implemented - use WebSocket"
}

Workaround: Use the WebSocket protocol for task operations:

Connect to /ws endpoint
Use binary protocol for job submission
See WebSocket API documentation

Experiments API

Status: ⏳ Not Implemented

Experiment listing and creation endpoints return empty stub responses.

Workaround: Use direct database access or experiment manager interfaces.

Plugin Version Query

Status: ⏳ Not Implemented

The plugin version query returns a hardcoded "1.0.0" instead of querying the actual plugin binary or container version. Backend support exists, but there is no CLI access: the backend endpoint uses HTTP REST, while the CLI uses WebSocket.

Workaround: Query plugin binaries directly for version information.

Scheduler Features

Gang Allocation Stress Testing

Status: ⏳ Partial (100+ node jobs not tested)

While gang allocation works for typical multi-node jobs, stress testing with 100+ nodes is not yet implemented.

Workaround: Test with smaller node counts (8-16 nodes) for validation.

Test Infrastructure

Podman-in-Docker CI Tests

Status: ⏳ Not Implemented

Running Podman containers inside Docker CI runners requires privileged mode and cgroup configuration that is not yet automated.

Workaround: Tests run with direct Docker container execution.

Reporting

Test Coverage Dashboard

Status: ⏳ Not Implemented

Automated coverage dashboard with trend tracking is planned but not yet available.

Workaround: Use go test -coverprofile and upload artifacts manually.
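
Until a dashboard exists, a small helper can turn a coverage profile into a single number to record alongside the uploaded artifact. The sketch below parses the text format written by `go test -coverprofile` (a `mode:` header followed by lines of the form `file.go:start,end numStmts count`). The function name `totalCoverage` is hypothetical, and the sketch assumes each block appears once in the profile; `go tool cover -func` remains the authoritative way to compute the total.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// totalCoverage returns statement coverage as a percentage (0-100)
// computed from the text of a Go coverage profile.
func totalCoverage(profile string) (float64, error) {
	var total, covered int
	sc := bufio.NewScanner(strings.NewReader(profile))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "mode:") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 3 {
			return 0, fmt.Errorf("malformed profile line %q", line)
		}
		// The last two fields are the statement count and the hit count.
		stmts, err := strconv.Atoi(fields[len(fields)-2])
		if err != nil {
			return 0, fmt.Errorf("bad statement count in %q: %w", line, err)
		}
		hits, err := strconv.Atoi(fields[len(fields)-1])
		if err != nil {
			return 0, fmt.Errorf("bad hit count in %q: %w", line, err)
		}
		total += stmts
		if hits > 0 {
			covered += stmts
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	if total == 0 {
		return 0, nil
	}
	return 100 * float64(covered) / float64(total), nil
}

func main() {
	profile := "mode: set\na.go:1.1,3.2 2 1\na.go:5.1,9.2 6 0\n"
	pct, err := totalCoverage(profile)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%.1f%%\n", pct)
}
```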

Native Libraries (C++)

AMD GPU Support in Native Libs

Status: ⏳ Not Implemented

The native C++ libraries (dataset_hash, queue_index) do not yet have AMD GPU acceleration.

Workaround: Use the CPU implementations, which are still significantly faster than pure Go.

How to Handle Not Implemented Errors

For Users

When you encounter a "not implemented" error:

Check this document for workarounds
Use mock mode for development/testing (see gpu_detector_mock.go)
File an issue to request the feature with your use case
Consider contributing - see CONTRIBUTING.md

For Developers

When implementing new features:

Use errors.NewNotImplemented(featureName) for clear error messages
Add the limitation to this document
Provide a workaround if possible
Reference any GitHub tracking issues

Example:

if requestedFeature == "rocm" {
    return apierrors.NewNotImplemented("AMD ROCm support")
}

Feature Request Process

To request an unimplemented feature:

Open a GitHub issue with label feature-request
Describe your use case and hardware/environment
Mention if you're willing to test or contribute
Reference any related limitations in this document

Last updated: March 2026
