fetch_ml/internal
Jeremie Fraeys 05b7af6991
feat: implement NVML-based GPU monitoring
- Add native/nvml_gpu/ C++ library wrapping NVIDIA Management Library
- Add Go bindings in internal/worker/gpu_nvml_native.go and gpu_nvml_stub.go
- Update gpu_detector.go to use NVML for accurate GPU count detection
- Update native/CMakeLists.txt to build nvml_gpu library
- Provide real-time GPU utilization, memory usage, temperature, clock speeds, and power draw
- Fall back to an environment variable when NVML is unavailable
2026-02-21 15:16:09 -05:00
api refactor(api): internal refactoring for TUI and worker modules 2026-02-20 15:51:23 -05:00
audit feat: implement tamper-evident audit logging 2026-02-19 15:34:28 -05:00
auth refactor(api): internal refactoring for TUI and worker modules 2026-02-20 15:51:23 -05:00
config refactor(api): internal refactoring for TUI and worker modules 2026-02-20 15:51:23 -05:00
container test: Reorganize and add unit tests 2026-02-18 21:28:13 -05:00
controller Fix multi-user authentication and clean up debug code 2025-12-06 12:35:32 -05:00
crypto feat: implement Argon2id hashing and Ed25519 manifest signing 2026-02-19 15:34:20 -05:00
domain refactor(api): internal refactoring for TUI and worker modules 2026-02-20 15:51:23 -05:00
envpool feat(worker): add integrity checks, snapshot staging, and prewarm support 2026-01-05 12:31:13 -05:00
errtypes feat: implement research-grade maintainability phases 1,3,4,7 2026-02-18 15:27:50 -05:00
experiment refactor: adopt PathRegistry in experiment manager 2026-02-18 16:53:41 -05:00
fileutil Fix multi-user authentication and clean up debug code 2025-12-06 12:35:32 -05:00
jupyter test: Reorganize and add unit tests 2026-02-18 21:28:13 -05:00
logging refactor(api): internal refactoring for TUI and worker modules 2026-02-20 15:51:23 -05:00
manifest feat: add manifest signing and native hashing support 2026-02-19 15:34:39 -05:00
metrics refactor: Phase 6 - Complete migration, remove legacy files 2026-02-17 14:39:48 -05:00
middleware fix: resolve TODOs and standardize tests 2026-02-19 15:34:59 -05:00
network refactor(dependency-hygiene): Move path functions from config to storage 2026-02-17 21:15:23 -05:00
privacy feat: Privacy and PII detection 2026-02-18 21:27:23 -05:00
prommetrics feat(api): refactor websocket handlers; add health and prometheus middleware 2026-01-05 12:31:07 -05:00
queue feat: integrate native queue backend into worker and API 2026-02-21 14:11:10 -05:00
resources feat(worker): add integrity checks, snapshot staging, and prewarm support 2026-01-05 12:31:13 -05:00
security feat: add security monitoring and validation framework 2026-02-19 15:34:25 -05:00
storage refactor(api): internal refactoring for TUI and worker modules 2026-02-20 15:51:23 -05:00
telemetry Fix multi-user authentication and clean up debug code 2025-12-06 12:35:32 -05:00
tracking feat(tracking): add pluggable tracking backends and audit support 2026-01-05 12:33:57 -05:00
validation feat: add security monitoring and validation framework 2026-02-19 15:34:25 -05:00
worker feat: implement NVML-based GPU monitoring 2026-02-21 15:16:09 -05:00