# Known Limitations

This document tracks features that are planned but not yet implemented, along with workarounds where available.

## GPU Support

### AMD GPU (ROCm)

**Status**: ⏳ Not Implemented (Deferred)
**Priority**: Low (adoption growing but not mainstream for ML/AI)

AMD GPU detection and ROCm integration are not yet implemented. While AMD GPU adoption for ML/AI workloads is growing, it remains less mainstream than NVIDIA. The system will return a clear error if AMD is requested.

**Rationale for Deferral**:

- NVIDIA dominates ML/AI training and inference (90%+ market share)
- The AMD ROCm ecosystem is still maturing for deep learning frameworks
- Limited user demand compared to NVIDIA/Apple Silicon
- Can be added later when user demand increases

**Error Message**:

```
AMD GPU support is not yet implemented. Use NVIDIA GPUs, Apple Silicon, or CPU-only mode.
For development/testing, use FETCH_ML_MOCK_GPU_TYPE=AMD
```

**Workaround**:

- Use NVIDIA GPUs with `FETCH_ML_GPU_TYPE=nvidia`
- Use Apple Silicon with Metal (`FETCH_ML_GPU_TYPE=apple`)
- Use CPU-only mode with `FETCH_ML_GPU_TYPE=none`
- For testing/development, use a mock AMD GPU:

  ```bash
  FETCH_ML_MOCK_GPU_TYPE=AMD FETCH_ML_MOCK_GPU_COUNT=4
  ```

**Implementation Requirements** (for future consideration):

- [ ] ROCm SMI Go bindings or CGO wrapper
- [ ] AMD GPU hardware for testing
- [ ] ROCm runtime in container images
- [ ] Driver compatibility matrix
- [ ] User demand validation (file an issue if you need this)

---

## Platform Support

### Windows Process Isolation

**Status**: ⏳ Not Implemented

Process isolation limits (max open files, max processes) are not enforced on Windows. The Windows implementation uses stub functions that return errors when limits are requested.

**Error Message**:

```
process isolation limits not implemented on Windows (max_open_files=1000, max_processes=100)
```

**Workaround**: Use Linux or macOS for production deployments requiring process isolation.
**Implementation Requirements**:

- [ ] Windows Job Objects integration
- [ ] VirtualLock API for memory locking
- [ ] Platform-specific testing

---

## API Features

### REST API Task Operations

**Status**: ⏳ Not Implemented

Task creation, cancellation, and detail retrieval via the REST API are not implemented. These operations must use the WebSocket protocol.

**Error Message**:

```json
{
  "error": "Not implemented",
  "code": "NOT_IMPLEMENTED",
  "message": "Task creation via REST API not yet implemented - use WebSocket"
}
```

**Workaround**: Use the WebSocket protocol for task operations:

- Connect to the `/ws` endpoint
- Use the binary protocol for job submission
- See the WebSocket API documentation

---

### Experiments API

**Status**: ⏳ Not Implemented

Experiment listing and creation endpoints return empty stub responses.

**Workaround**: Use direct database access or the experiment manager interfaces.

---

### Plugin Version Query

**Status**: ⏳ Not Implemented

Plugin version queries return a hardcoded "1.0.0" instead of the actual plugin binary/container version. Backend support exists over HTTP REST, but the CLI cannot reach it because the CLI communicates over WebSocket.

**Workaround**: Query plugin binaries directly for version information.

---

## Scheduler Features

### Gang Allocation Stress Testing

**Status**: ⏳ Partial (100+ node jobs not tested)

Gang allocation works for typical multi-node jobs, but stress testing with 100+ nodes has not yet been performed.

**Workaround**: Validate with smaller node counts (8-16 nodes).

---

## Test Infrastructure

### Podman-in-Docker CI Tests

**Status**: ⏳ Not Implemented

Running Podman containers inside Docker CI runners requires privileged mode and cgroup configuration that is not yet automated.

**Workaround**: Tests run with direct Docker container execution.

---

## Reporting

### Test Coverage Dashboard

**Status**: ⏳ Not Implemented

An automated coverage dashboard with trend tracking is planned but not yet available.
**Workaround**: Use `go test -coverprofile` and upload the artifacts manually.

---

## Native Libraries (C++)

### AMD GPU Support in Native Libs

**Status**: ⏳ Not Implemented

The native C++ libraries (dataset_hash, queue_index) do not yet have AMD GPU acceleration.

**Workaround**: Use the CPU implementations, which are still significantly faster than pure Go.

---

## How to Handle Not Implemented Errors

### For Users

When you encounter a "not implemented" error:

1. **Check this document** for workarounds
2. **Use mock mode** for development/testing (see `gpu_detector_mock.go`)
3. **File an issue** to request the feature with your use case
4. **Consider contributing** - see `CONTRIBUTING.md`

### For Developers

When implementing new features:

1. Use `errors.NewNotImplemented(featureName)` for clear error messages
2. Add the limitation to this document
3. Provide a workaround if possible
4. Reference any GitHub tracking issues

Example:

```go
if requestedFeature == "rocm" {
    return apierrors.NewNotImplemented("AMD ROCm support")
}
```

---

## Feature Request Process

To request an unimplemented feature:

1. Open a GitHub issue with the label `feature-request`
2. Describe your use case and hardware/environment
3. Mention if you're willing to test or contribute
4. Reference any related limitations in this document

---

*Last updated: March 2026*