# Known Limitations

This document tracks features that are planned but not yet implemented, along with workarounds where available.

## GPU Support

### AMD GPU (ROCm)

**Status**: ⏳ Not Implemented (Deferred)

**Priority**: Low (adoption growing but not mainstream for ML/AI)

AMD GPU detection and ROCm integration are not yet implemented. While AMD GPU adoption for ML/AI workloads is growing, it remains far less common than NVIDIA. The system returns a clear error if an AMD GPU is requested.

**Rationale for Deferral**:

- NVIDIA dominates ML/AI training and inference (90%+ market share)
- The AMD ROCm ecosystem is still maturing for deep learning frameworks
- Limited user demand compared to NVIDIA/Apple Silicon
- Can be added later if user demand increases

**Error Message**:

```
AMD GPU support is not yet implemented.
Use NVIDIA GPUs, Apple Silicon, or CPU-only mode.
For development/testing, use FETCH_ML_MOCK_GPU_TYPE=AMD
```

**Workaround**:

- Use NVIDIA GPUs with `FETCH_ML_GPU_TYPE=nvidia`
- Use Apple Silicon with Metal (`FETCH_ML_GPU_TYPE=apple`)
- Use CPU-only mode with `FETCH_ML_GPU_TYPE=none`
- For testing/development, use the mock AMD GPU:

```bash
FETCH_ML_MOCK_GPU_TYPE=AMD
FETCH_ML_MOCK_GPU_COUNT=4
```

**Implementation Requirements** (for future consideration):

- [ ] ROCm SMI Go bindings or a CGO wrapper
- [ ] AMD GPU hardware for testing
- [ ] ROCm runtime in container images
- [ ] Driver compatibility matrix
- [ ] User demand validation (file an issue if you need this)

---

## Platform Support

### Windows Process Isolation

**Status**: ⏳ Not Implemented

Process isolation limits (max open files, max processes) are not enforced on Windows. The Windows implementation uses stub functions that return errors when limits are requested.

**Error Message**:

```
process isolation limits not implemented on Windows (max_open_files=1000, max_processes=100)
```

**Workaround**: Use Linux or macOS for production deployments that require process isolation.

**Implementation Requirements**:

- [ ] Windows Job Objects integration
- [ ] VirtualLock API for memory locking
- [ ] Platform-specific testing

---

## API Features

### REST API Task Operations

**Status**: ⏳ Not Implemented

Task creation, cancellation, and detail queries via the REST API are not implemented; these operations must use the WebSocket protocol.

**Error Message**:

```json
{
  "error": "Not implemented",
  "code": "NOT_IMPLEMENTED",
  "message": "Task creation via REST API not yet implemented - use WebSocket"
}
```

**Workaround**: Use the WebSocket protocol for task operations:

- Connect to the `/ws` endpoint
- Use the binary protocol for job submission
- See the WebSocket API documentation
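
A client that wants to fall back to WebSocket can recognise the stub response by its `code` field. The sketch below assumes only the JSON shape shown above; `apiError` and `isNotImplemented` are illustrative names:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiError mirrors the error payload shown above.
type apiError struct {
	Error   string `json:"error"`
	Code    string `json:"code"`
	Message string `json:"message"`
}

// isNotImplemented reports whether a REST response body is the
// NOT_IMPLEMENTED stub, signalling the caller to fall back to WebSocket.
func isNotImplemented(body []byte) bool {
	var e apiError
	if err := json.Unmarshal(body, &e); err != nil {
		return false
	}
	return e.Code == "NOT_IMPLEMENTED"
}

func main() {
	body := []byte(`{"error":"Not implemented","code":"NOT_IMPLEMENTED","message":"Task creation via REST API not yet implemented - use WebSocket"}`)
	fmt.Println(isNotImplemented(body)) // true
}
```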

---

### Experiments API

**Status**: ⏳ Not Implemented

Experiment listing and creation endpoints return empty stub responses.

**Workaround**: Use direct database access or the experiment manager interfaces.

---

### Plugin Version Query

**Status**: ⏳ Not Implemented

Plugin version information returns a hardcoded "1.0.0" instead of querying the actual plugin binary/container versions. Backend support exists via HTTP REST, but there is no CLI access because the CLI communicates over WebSocket.

**Workaround**: Query the plugin binaries directly for version information.

---

## Scheduler Features

### Gang Allocation Stress Testing

**Status**: ⏳ Partial (jobs with 100+ nodes not tested)

While gang allocation works for typical multi-node jobs, stress testing with 100+ nodes has not yet been performed.

**Workaround**: Validate with smaller node counts (8-16 nodes).

---

## Test Infrastructure

### Podman-in-Docker CI Tests

**Status**: ⏳ Not Implemented

Running Podman containers inside Docker CI runners requires privileged mode and cgroup configuration that is not yet automated.

**Workaround**: Tests run with direct Docker container execution instead.
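
A rough sketch of what the manual invocation looks like, for anyone automating this. `--privileged` and `--cgroupns` are real Docker flags and `quay.io/podman/stable` is the upstream Podman image, but the exact mounts and cgroup settings a given CI runner needs will vary:

```shell
# Run Podman inside a Docker container: privileged mode plus host
# cgroups are the usual minimum; additional volume mounts may be needed.
docker run --rm --privileged --cgroupns=host \
  quay.io/podman/stable \
  podman run --rm docker.io/library/alpine echo "podman-in-docker ok"
```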

---

## Reporting

### Test Coverage Dashboard

**Status**: ⏳ Not Implemented

An automated coverage dashboard with trend tracking is planned but not yet available.

**Workaround**: Use `go test -coverprofile` and upload the artifacts manually.
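
A minimal manual workflow using only the standard Go toolchain (no project-specific assumptions):

```shell
# Generate a coverage profile for all packages, then summarize it.
go test ./... -coverprofile=coverage.out
go tool cover -func=coverage.out                  # per-function summary with a total
go tool cover -html=coverage.out -o coverage.html # browsable HTML report
```

The `coverage.out` and `coverage.html` files are the artifacts to upload until a dashboard exists.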

---

## Native Libraries (C++)

### AMD GPU Support in Native Libs

**Status**: ⏳ Not Implemented

The native C++ libraries (dataset_hash, queue_index) do not yet have AMD GPU acceleration.

**Workaround**: Use the CPU implementations, which are still significantly faster than pure Go.

---

## How to Handle Not Implemented Errors

### For Users

When you encounter a "not implemented" error:

1. **Check this document** for workarounds
2. **Use mock mode** for development/testing (see `gpu_detector_mock.go`)
3. **File an issue** to request the feature with your use case
4. **Consider contributing** - see `CONTRIBUTING.md`

### For Developers

When implementing new features:

1. Use `apierrors.NewNotImplemented(featureName)` for clear error messages
2. Add the limitation to this document
3. Provide a workaround if possible
4. Reference any GitHub tracking issues

Example:

```go
// apierrors is the project's API errors package, aliased here to
// avoid clashing with the standard library's errors package.
if requestedFeature == "rocm" {
	return apierrors.NewNotImplemented("AMD ROCm support")
}
```

---

## Feature Request Process

To request an unimplemented feature:

1. Open a GitHub issue with the label `feature-request`
2. Describe your use case and hardware/environment
3. Mention whether you're willing to test or contribute
4. Reference any related limitations in this document

---

*Last updated: March 2026*