# Known Limitations
This document tracks features that are planned but not yet implemented, along with workarounds where available.
## GPU Support
### AMD GPU (ROCm)
**Status**: ⏳ Not Implemented (Deferred)
**Priority**: Low (adoption growing but not mainstream for ML/AI)
AMD GPU detection and ROCm integration are not yet implemented. While AMD GPU adoption for ML/AI workloads is growing, it remains less mainstream than NVIDIA. The system will return a clear error if AMD is requested.
**Rationale for Deferral**:
- NVIDIA dominates ML/AI training and inference (90%+ market share)
- AMD ROCm ecosystem still maturing for deep learning frameworks
- Limited user demand compared to NVIDIA/Apple Silicon
- Can be added later when user demand increases
**Error Message**:
```
AMD GPU support is not yet implemented.
Use NVIDIA GPUs, Apple Silicon, or CPU-only mode.
For development/testing, use FETCH_ML_MOCK_GPU_TYPE=AMD
```
**Workaround**:
- Use NVIDIA GPUs with `FETCH_ML_GPU_TYPE=nvidia`
- Use Apple Silicon with Metal (`FETCH_ML_GPU_TYPE=apple`)
- Use CPU-only mode with `FETCH_ML_GPU_TYPE=none`
- For testing/development, use mock AMD:
```bash
export FETCH_ML_MOCK_GPU_TYPE=AMD
export FETCH_ML_MOCK_GPU_COUNT=4
```
**Implementation Requirements** (for future consideration):
- [ ] ROCm SMI Go bindings or CGO wrapper
- [ ] AMD GPU hardware for testing
- [ ] ROCm runtime in container images
- [ ] Driver compatibility matrix
- [ ] User demand validation (file an issue if you need this)
---
## Platform Support
### Windows Process Isolation
**Status**: ⏳ Not Implemented
Process isolation limits (max open files, max processes) are not enforced on Windows. The Windows implementation uses stub functions that return errors when limits are requested.
**Error Message**:
```
process isolation limits not implemented on Windows (max_open_files=1000, max_processes=100)
```
**Workaround**: Use Linux or macOS for production deployments requiring process isolation.
**Implementation Requirements**:
- [ ] Windows Job Objects integration
- [ ] VirtualLock API for memory locking
- [ ] Platform-specific testing
---
## API Features
### REST API Task Operations
**Status**: ⏳ Not Implemented
Task creation, cancellation, and detail queries via the REST API are not implemented. These operations must use the WebSocket protocol.
**Error Message**:
```json
{
  "error": "Not implemented",
  "code": "NOT_IMPLEMENTED",
  "message": "Task creation via REST API not yet implemented - use WebSocket"
}
```
**Workaround**: Use the WebSocket protocol for task operations:
- Connect to `/ws` endpoint
- Use binary protocol for job submission
- See WebSocket API documentation
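Clients hitting these endpoints can detect the stub response and fall back to WebSocket. A small Go sketch that parses the error payload shown above (the `apiError` struct is illustrative, not a type exported by the project):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiError mirrors the JSON error payload returned by unimplemented
// REST endpoints.
type apiError struct {
	Error   string `json:"error"`
	Code    string `json:"code"`
	Message string `json:"message"`
}

// isNotImplemented reports whether a response body is a
// NOT_IMPLEMENTED stub, along with its message.
func isNotImplemented(body []byte) (bool, string) {
	var e apiError
	if err := json.Unmarshal(body, &e); err != nil {
		return false, ""
	}
	return e.Code == "NOT_IMPLEMENTED", e.Message
}

func main() {
	body := []byte(`{"error":"Not implemented","code":"NOT_IMPLEMENTED",` +
		`"message":"Task creation via REST API not yet implemented - use WebSocket"}`)
	if ok, msg := isNotImplemented(body); ok {
		fmt.Println("falling back to WebSocket:", msg)
	}
}
```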
---
### Experiments API
**Status**: ⏳ Not Implemented
Experiment listing and creation endpoints return empty stub responses.
**Workaround**: Use direct database access or experiment manager interfaces.
---
### Plugin Version Query
**Status**: ⏳ Not Implemented
Plugin version information returns a hardcoded "1.0.0" instead of querying the actual plugin binary or container version. Backend support exists over HTTP REST, but there is no CLI access because the CLI uses WebSocket.
**Workaround**: Query plugin binaries directly for version information.
---
## Scheduler Features
### Gang Allocation Stress Testing
**Status**: ⏳ Partial (100+ node jobs not tested)
While gang allocation works for typical multi-node jobs, stress testing with 100+ nodes has not yet been performed.
**Workaround**: Test with smaller node counts (8-16 nodes) for validation.
---
## Test Infrastructure
### Podman-in-Docker CI Tests
**Status**: ⏳ Not Implemented
Running Podman containers inside Docker CI runners requires privileged mode and cgroup configuration that is not yet automated.
**Workaround**: Tests run with direct Docker container execution.
---
## Reporting
### Test Coverage Dashboard
**Status**: ⏳ Not Implemented
Automated coverage dashboard with trend tracking is planned but not yet available.
**Workaround**: Use `go test -coverprofile` and upload artifacts manually.
---
## Native Libraries (C++)
### AMD GPU Support in Native Libs
**Status**: ⏳ Not Implemented
The native C++ libraries (dataset_hash, queue_index) do not yet have AMD GPU acceleration.
**Workaround**: Use the CPU implementations, which are still significantly faster than pure Go.
---
## How to Handle Not Implemented Errors
### For Users
When you encounter a "not implemented" error:
1. **Check this document** for workarounds
2. **Use mock mode** for development/testing (see `gpu_detector_mock.go`)
3. **File an issue** to request the feature with your use case
4. **Consider contributing** - see `CONTRIBUTING.md`
### For Developers
When implementing new features:
1. Use `errors.NewNotImplemented(featureName)` for clear error messages
2. Add the limitation to this document
3. Provide a workaround if possible
4. Reference any GitHub tracking issues
Example:
```go
if requestedFeature == "rocm" {
	return apierrors.NewNotImplemented("AMD ROCm support")
}
```
---
## Feature Request Process
To request an unimplemented feature:
1. Open a GitHub issue with label `feature-request`
2. Describe your use case and hardware/environment
3. Mention if you're willing to test or contribute
4. Reference any related limitations in this document
---
*Last updated: March 2026*