fetch_ml

Author	SHA1	Message	Date
Jeremie Fraeys	43e6446587	feat(scheduler): implement multi-tenant job scheduler with gang scheduling Add new scheduler component for distributed ML workload orchestration: - Hub-based coordination for multi-worker clusters - Pacing controller for rate limiting job submissions - Priority queue with preemption support - Port allocator for dynamic service discovery - Protocol handlers for worker-scheduler communication - Service manager with OS-specific implementations - Connection management and state persistence - Template system for service deployment Includes comprehensive test suite: - Unit tests for all core components - Integration tests for distributed scenarios - Benchmark tests for performance validation - Mock fixtures for isolated testing Refs: scheduler-architecture.md	2026-02-26 12:03:23 -05:00
Jeremie Fraeys	9434f4c8e6	feat(security): Artifact ingestion caps enforcement Add MaxArtifactFiles and MaxArtifactTotalBytes to SandboxConfig: - Default MaxArtifactFiles: 10,000 (configurable via SecurityDefaults) - Default MaxArtifactTotalBytes: 100GB (configurable via SecurityDefaults) - ApplySecurityDefaults() sets defaults if not specified Enforce caps in scanArtifacts() during directory walk: - Returns error immediately when MaxArtifactFiles exceeded - Returns error immediately when MaxArtifactTotalBytes exceeded - Prevents resource exhaustion attacks from malicious artifact trees Update all call sites to pass SandboxConfig for cap enforcement: - Native bridge libs updated to pass caps argument - Benchmark tests updated with nil caps (unlimited for benchmarks) - Unit tests updated with nil caps Closes: artifact ingestion caps items from security plan	2026-02-23 19:43:28 -05:00
Jeremie Fraeys	be67cb77d3	test(benchmarks): update benchmark tests with job cleanup and improvements Payload Performance Test: - Add job cleanup after each iteration using DeleteJob() - Ensure isolated memory measurements between test runs All Benchmark Tests: - General improvements and maintenance updates	2026-02-23 18:03:54 -05:00
Jeremie Fraeys	3b194ff2e8	feat: GPU detection transparency and artifact scanner improvements - Surface GPUDetectionInfo from parseGPUCountFromConfig for detection metadata - Document FETCH_ML_TOTAL_CPU and FETCH_ML_GPU_SLOTS_PER_GPU env vars - Add debug logging for all env var overrides to stderr - Track config-layer auto-detection in GPUDetectionInfo.ConfigLayerAutoDetected - Add --include-all flag to artifact scanner (includeAll parameter) - Add AMD production mode enforcement (error in non-local mode) - Add GPU detector unit tests for env overrides and AMD aliasing	2026-02-23 12:29:34 -05:00
Jeremie Fraeys	e557313e08	fix: context reuse benchmark uses temp directory - Replace hardcoded testdata path with b.TempDir() - Add createSmallDataset helper for self-contained benchmarks - Fixes FAIL: BenchmarkContextReuse / BenchmarkSequentialHashes	2026-02-21 14:38:00 -05:00
Jeremie Fraeys	5f8e7c59a5	fix: resolve undefined DirOverallSHA256HexParallel in benchmark files - Replace worker.DirOverallSHA256HexParallel with worker.DirOverallSHA256Hex - Fixes in dataset_hash_bench_test.go and hash_bench_test.go - All benchmarks pass with native_libs build tag	2026-02-21 14:30:22 -05:00
Jeremie Fraeys	fa383ebc6f	fix: benchmark function name and verify native context reuse	2026-02-21 14:28:04 -05:00
Jeremie Fraeys	158c525bef	fix: resolve benchmark and build tag conflicts - Remove duplicate hash_selector.go (build tags handle switching) - Fix benchmark to use worker.DirOverallSHA256Hex - Fix snapshot_store.go to use integrity.DirOverallSHA256Hex directly - Native tests pass, benchmarks now correctly test native vs Go	2026-02-21 14:26:48 -05:00
Jeremie Fraeys	90d702823b	fix: correct C type cast and add context reuse benchmark - Fix C.uint32_t cast for runtime.NumCPU() in native_bridge_libs.go - Add context_reuse_bench_test.go to verify performance gains - All native tests pass (8/8) - Benchmarks functional	2026-02-21 14:20:40 -05:00
Jeremie Fraeys	23e5f3d1dc	refactor(api): internal refactoring for TUI and worker modules - Refactor internal/worker and internal/queue packages - Update cmd/tui for monitoring interface - Update test configurations	2026-02-20 15:51:23 -05:00
Jeremie Fraeys	37aad7ae87	feat: add manifest signing and native hashing support - Integrate RunManifest.Validate with existing Validator - Add manifest Sign() and Verify() methods - Add native C++ hashing libraries (dataset_hash, queue_index) - Add native bridge for Go/C++ integration - Add deduplication support in queue	2026-02-19 15:34:39 -05:00
Jeremie Fraeys	6c83bda608	test(benchmarks): add tolerance to response packet regression test Add 5% tolerance for timing noise to prevent flaky failures from nanosecond-level benchmark variations	2026-02-18 12:45:40 -05:00
Jeremie Fraeys	38c09c92bb	test(benchmarks): fix native lib benchmarks when disabled - Add skip checks to native queue benchmarks when FETCHML_NATIVE_LIBS=0 - Skip TestGoNativeArtifactScanLeak cleanly instead of 100 warnings - Add build tags (!native_libs/native_libs) for Go vs Native comparison - Add benchmark-native and benchmark-compare Makefile targets	2026-02-18 12:45:30 -05:00
Jeremie Fraeys	7305e2bc21	test: add comprehensive test coverage and command improvements - Add logs and debug end-to-end tests - Add test helper utilities - Improve test fixtures and templates - Update API server and config lint commands - Add multi-user database initialization	2026-02-16 20:38:15 -05:00
Jeremie Fraeys	d408a60eb1	ci: push all workflow updates	2026-02-12 13:28:15 -05:00
Jeremie Fraeys	2854d3df95	chore(cleanup): remove legacy artifacts and add tooling configs - Remove .github/ directory (migrated to .forgejo/) - Remove .local-artifacts/ benchmark results - Add AGENTS.md for coding assistants - Add .windsurf/rules/ for development guidelines - Update .gitignore	2026-02-12 12:06:09 -05:00
Jeremie Fraeys	72b4b29ecd	perf: add profiling benchmarks and parallel Go baseline for C++ optimization Add comprehensive benchmarking suite for C++ optimization targets: - tests/benchmarks/dataset_hash_bench_test.go - dirOverallSHA256Hex profiling - tests/benchmarks/queue_bench_test.go - filesystem queue profiling - tests/benchmarks/artifact_and_snapshot_bench_test.go - scanArtifacts/extractTarGz profiling - tests/unit/worker/artifacts_test.go - moved from internal/ for clean separation Add parallel Go implementation as baseline for C++ comparison: - internal/worker/data_integrity.go: dirOverallSHA256HexParallel() with worker pool - Benchmarks show 2.1x speedup (3.97ms -> 1.90ms) vs sequential Exported wrappers for testing: - ScanArtifacts() - artifact scanning - ExtractTarGz() - tar.gz extraction - DirOverallSHA256HexParallel() - parallel hashing Profiling results (Apple M2 Ultra): - dirOverallSHA256Hex: 78% syscall overhead (target for mmap C++) - rebuildIndex: 96% syscall overhead (target for binary index C++) - scanArtifacts: 87% syscall overhead (target for fast traversal C++) - extractTarGz: 95% syscall overhead (target for parallel gzip C++) Related: C++ optimization strategy in memory 5d5f0bb6	2026-02-12 12:04:02 -05:00
Jeremie Fraeys	a8287f3087	test: expand unit/integration/e2e coverage for new worker/api behavior	2026-01-05 12:31:36 -05:00
Jeremie Fraeys	ea15af1833	Fix multi-user authentication and clean up debug code - Fix YAML tags in auth config struct (json -> yaml) - Update CLI configs to use pre-hashed API keys - Remove double hashing in WebSocket client - Fix port mapping (9102 -> 9103) in CLI commands - Update permission keys to use jobs:read, jobs:create, etc. - Clean up all debug logging from CLI and server - All user roles now authenticate correctly: * Admin: Can queue jobs and see all jobs * Researcher: Can queue jobs and see own jobs * Analyst: Can see status (read-only access) Multi-user authentication is now fully functional.	2025-12-06 12:35:32 -05:00

19 commits