# Known Limitations
This document tracks features that are planned but not yet implemented, along with workarounds where available.
## GPU Support

### AMD GPU (ROCm)
Status: ⏳ Not Implemented (Deferred)
Priority: Low (adoption growing, but not yet mainstream for ML/AI)
AMD GPU detection and ROCm integration are not yet implemented. While AMD GPU adoption for ML/AI workloads is growing, it remains less mainstream than NVIDIA. The system will return a clear error if AMD is requested.
Rationale for Deferral:
- NVIDIA dominates ML/AI training and inference (90%+ market share)
- AMD ROCm ecosystem still maturing for deep learning frameworks
- Limited user demand compared to NVIDIA/Apple Silicon
- Can be added later when user demand increases
Error Message:
```
AMD GPU support is not yet implemented.
Use NVIDIA GPUs, Apple Silicon, or CPU-only mode.
For development/testing, use FETCH_ML_MOCK_GPU_TYPE=AMD
```
Workaround:
- Use NVIDIA GPUs with `FETCH_ML_GPU_TYPE=nvidia`
- Use Apple Silicon with Metal (`FETCH_ML_GPU_TYPE=apple`)
- Use CPU-only mode with `FETCH_ML_GPU_TYPE=none`
- For testing/development, use mock AMD: `FETCH_ML_MOCK_GPU_TYPE=AMD FETCH_ML_MOCK_GPU_COUNT=4`
Implementation Requirements (for future consideration):
- ROCm SMI Go bindings or CGO wrapper
- AMD GPU hardware for testing
- ROCm runtime in container images
- Driver compatibility matrix
- User demand validation (file an issue if you need this)
## Platform Support

### Windows Process Isolation
Status: ⏳ Not Implemented
Process isolation limits (max open files, max processes) are not enforced on Windows. The Windows implementation uses stub functions that return errors when limits are requested.
Error Message:
```
process isolation limits not implemented on Windows (max_open_files=1000, max_processes=100)
```
Workaround: Use Linux or macOS for production deployments requiring process isolation.
Implementation Requirements:
- Windows Job Objects integration
- VirtualLock API for memory locking
- Platform-specific testing
## API Features

### REST API Task Operations
Status: ⏳ Not Implemented
Task creation, cancellation, and detail queries via the REST API are not implemented. These operations must use the WebSocket protocol instead.
Error Message:
```json
{
  "error": "Not implemented",
  "code": "NOT_IMPLEMENTED",
  "message": "Task creation via REST API not yet implemented - use WebSocket"
}
```
Workaround: Use the WebSocket protocol for task operations:
- Connect to the `/ws` endpoint
- Use the binary protocol for job submission
- See the WebSocket API documentation
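A client can detect this error payload and fall back to WebSocket. The sketch below decodes the JSON shown above; the `APIError` struct mirrors that payload, but the helper itself is an assumption, not part of the project's client library:

```go
// Minimal sketch of detecting the NOT_IMPLEMENTED REST response so a
// client can fall back to WebSocket. The struct fields mirror the JSON
// error payload documented above; the helper name is hypothetical.
package main

import (
	"encoding/json"
	"fmt"
)

type APIError struct {
	Error   string `json:"error"`
	Code    string `json:"code"`
	Message string `json:"message"`
}

// shouldFallBackToWebSocket reports whether a REST response body is the
// documented not-implemented error.
func shouldFallBackToWebSocket(body []byte) bool {
	var e APIError
	if err := json.Unmarshal(body, &e); err != nil {
		return false
	}
	return e.Code == "NOT_IMPLEMENTED"
}

func main() {
	body := []byte(`{"error":"Not implemented","code":"NOT_IMPLEMENTED","message":"Task creation via REST API not yet implemented - use WebSocket"}`)
	fmt.Println(shouldFallBackToWebSocket(body)) // true
}
```

Matching on the stable `code` field rather than the human-readable `message` keeps the check robust if the wording changes.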
### Experiments API
Status: ⏳ Not Implemented
Experiment listing and creation endpoints return empty stub responses.
Workaround: Use direct database access or experiment manager interfaces.
### Plugin Version Query
Status: ⏳ Not Implemented
Plugin version information returns a hardcoded "1.0.0" instead of querying the actual plugin binary/container versions. Backend support exists over HTTP REST, but the CLI (which uses WebSocket) has no access to it.
Workaround: Query plugin binaries directly for version information.
## Scheduler Features

### Gang Allocation Stress Testing
Status: ⏳ Partial (100+ node jobs not tested)
While gang allocation works for typical multi-node jobs, stress testing with 100+ nodes has not yet been performed.
Workaround: Test with smaller node counts (8-16 nodes) for validation.
## Test Infrastructure

### Podman-in-Docker CI Tests
Status: ⏳ Not Implemented
Running Podman containers inside Docker CI runners requires privileged mode and cgroup configuration that is not yet automated.
Workaround: Tests run with direct Docker container execution.
## Reporting

### Test Coverage Dashboard
Status: ⏳ Not Implemented
Automated coverage dashboard with trend tracking is planned but not yet available.
Workaround: Use `go test -coverprofile` and upload the coverage artifacts manually.
## Native Libraries (C++)

### AMD GPU Support in Native Libs
Status: ⏳ Not Implemented
The native C++ libraries (`dataset_hash`, `queue_index`) do not yet have AMD GPU acceleration.
Workaround: Use the CPU implementations, which are still significantly faster than pure Go.
## How to Handle Not Implemented Errors

### For Users
When you encounter a "not implemented" error:
- Check this document for workarounds
- Use mock mode for development/testing (see `gpu_detector_mock.go`)
- File an issue to request the feature with your use case
- Consider contributing - see `CONTRIBUTING.md`
### For Developers
When implementing new features:
- Use `errors.NewNotImplemented(featureName)` for clear error messages
- Add the limitation to this document
- Provide a workaround if possible
- Reference any GitHub tracking issues
Example:
```go
if requestedFeature == "rocm" {
    return apierrors.NewNotImplemented("AMD ROCm support")
}
```
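For context, a constructor like this could be implemented as shown below. This is a hedged sketch only — the actual `apierrors` package may use a different type shape, message wording, or error code fields:

```go
// Hedged sketch of what a NewNotImplemented constructor could look like;
// the project's actual apierrors package may differ.
package main

import "fmt"

// NotImplementedError carries the feature name so handlers can map it to
// the JSON error payload shown earlier in this document.
type NotImplementedError struct {
	Feature string
}

func (e *NotImplementedError) Error() string {
	return fmt.Sprintf("%s is not yet implemented", e.Feature)
}

// NewNotImplemented builds the error for a named feature.
func NewNotImplemented(feature string) error {
	return &NotImplementedError{Feature: feature}
}

func main() {
	err := NewNotImplemented("AMD ROCm support")
	fmt.Println(err) // AMD ROCm support is not yet implemented
}
```

Using a dedicated error type (rather than a bare string) lets callers match it with `errors.As` and translate it into the `NOT_IMPLEMENTED` API response.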
## Feature Request Process
To request an unimplemented feature:
- Open a GitHub issue with the label `feature-request`
- Describe your use case and hardware/environment
- Mention if you're willing to test or contribute
- Reference any related limitations in this document
Last updated: March 2026