# Known Limitations
This document tracks features that are planned but not yet implemented, along with workarounds where available.
## GPU Support
### AMD GPU (ROCm)
**Status**: ⏳ Not Implemented (Deferred)
**Priority**: Low (adoption growing but not mainstream for ML/AI)
AMD GPU detection and ROCm integration are not yet implemented. While AMD GPU adoption for ML/AI workloads is growing, it remains less mainstream than NVIDIA. The system will return a clear error if AMD is requested.
**Rationale for Deferral**:
- NVIDIA dominates ML/AI training and inference (90%+ market share)
- AMD ROCm ecosystem still maturing for deep learning frameworks
- Limited user demand compared to NVIDIA/Apple Silicon
- Can be added later when user demand increases
**Error Message**:
```
AMD GPU support is not yet implemented.
Use NVIDIA GPUs, Apple Silicon, or CPU-only mode.
For development/testing, use FETCH_ML_MOCK_GPU_TYPE=AMD
```
**Workaround**:
- Use NVIDIA GPUs with `FETCH_ML_GPU_TYPE=nvidia`
- Use Apple Silicon with Metal (`FETCH_ML_GPU_TYPE=apple`)
- Use CPU-only mode with `FETCH_ML_GPU_TYPE=none`
- For testing/development, use mock AMD:
```bash
export FETCH_ML_MOCK_GPU_TYPE=AMD
export FETCH_ML_MOCK_GPU_COUNT=4
```
**Implementation Requirements** (for future consideration):
- [ ] ROCm SMI Go bindings or CGO wrapper
- [ ] AMD GPU hardware for testing
- [ ] ROCm runtime in container images
- [ ] Driver compatibility matrix
- [ ] User demand validation (file an issue if you need this)
---
## Platform Support
### Windows Process Isolation
**Status**: ⏳ Not Implemented
Process isolation limits (max open files, max processes) are not enforced on Windows. The Windows implementation uses stub functions that return errors when limits are requested.
**Error Message**:
```
process isolation limits not implemented on Windows (max_open_files=1000, max_processes=100)
```
**Workaround**: Use Linux or macOS for production deployments requiring process isolation.
**Implementation Requirements**:
- [ ] Windows Job Objects integration
- [ ] VirtualLock API for memory locking
- [ ] Platform-specific testing
---
## API Features
### REST API Task Operations
**Status**: ⏳ Not Implemented
Task creation, cancellation, and detail queries via the REST API are not implemented. These operations must use the WebSocket protocol.
**Error Message**:
```json
{
  "error": "Not implemented",
  "code": "NOT_IMPLEMENTED",
  "message": "Task creation via REST API not yet implemented - use WebSocket"
}
```
**Workaround**: Use the WebSocket protocol for task operations:
- Connect to `/ws` endpoint
- Use binary protocol for job submission
- See WebSocket API documentation
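Clients hitting these endpoints can detect the stub response and fall back to WebSocket. A small Go sketch that parses the error payload shown above (the `apiError` struct is illustrative, not a type exported by the project):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiError mirrors the JSON error payload returned by unimplemented
// REST endpoints.
type apiError struct {
	Error   string `json:"error"`
	Code    string `json:"code"`
	Message string `json:"message"`
}

// isNotImplemented reports whether a response body is a
// NOT_IMPLEMENTED stub, along with its message.
func isNotImplemented(body []byte) (bool, string) {
	var e apiError
	if err := json.Unmarshal(body, &e); err != nil {
		return false, ""
	}
	return e.Code == "NOT_IMPLEMENTED", e.Message
}

func main() {
	body := []byte(`{"error":"Not implemented","code":"NOT_IMPLEMENTED",` +
		`"message":"Task creation via REST API not yet implemented - use WebSocket"}`)
	if ok, msg := isNotImplemented(body); ok {
		fmt.Println("falling back to WebSocket:", msg)
	}
}
```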
---
### Experiments API
**Status**: ⏳ Not Implemented
Experiment listing and creation endpoints return empty stub responses.
**Workaround**: Use direct database access or experiment manager interfaces.
---
### Plugin Version Query
**Status**: ⏳ Not Implemented
Plugin version information returns a hardcoded "1.0.0" instead of querying the actual plugin binary or container version. Backend support exists over HTTP REST, but there is no CLI access because the CLI uses WebSocket.
**Workaround**: Query plugin binaries directly for version information.
---
## Scheduler Features
### Gang Allocation Stress Testing
**Status**: ⏳ Partial (100+ node jobs not tested)
While gang allocation works for typical multi-node jobs, stress testing with 100+ nodes has not yet been performed.
**Workaround**: Test with smaller node counts (8-16 nodes) for validation.
---
## Test Infrastructure
### Podman-in-Docker CI Tests
**Status**: ⏳ Not Implemented
Running Podman containers inside Docker CI runners requires privileged mode and cgroup configuration that is not yet automated.
**Workaround**: Tests run with direct Docker container execution.
---
## Reporting
### Test Coverage Dashboard
**Status**: ⏳ Not Implemented
Automated coverage dashboard with trend tracking is planned but not yet available.
**Workaround**: Use `go test -coverprofile` and upload artifacts manually.
---
## Native Libraries (C++)
### AMD GPU Support in Native Libs
**Status**: ⏳ Not Implemented
The native C++ libraries (dataset_hash, queue_index) do not yet have AMD GPU acceleration.
**Workaround**: Use the CPU implementations, which are still significantly faster than pure Go.
---
## How to Handle Not Implemented Errors
### For Users
When you encounter a "not implemented" error:
1. **Check this document** for workarounds
2. **Use mock mode** for development/testing (see `gpu_detector_mock.go`)
3. **File an issue** to request the feature with your use case
4. **Consider contributing** - see `CONTRIBUTING.md`
### For Developers
When implementing new features:
1. Use `errors.NewNotImplemented(featureName)` for clear error messages
2. Add the limitation to this document
3. Provide a workaround if possible
4. Reference any GitHub tracking issues
Example:
```go
if requestedFeature == "rocm" {
	return apierrors.NewNotImplemented("AMD ROCm support")
}
```
---
## Feature Request Process
To request an unimplemented feature:
1. Open a GitHub issue with label `feature-request`
2. Describe your use case and hardware/environment
3. Mention if you're willing to test or contribute
4. Reference any related limitations in this document
---
*Last updated: March 2026*