fetch_ml

jfraeysd/fetch_ml

Commit graph

Author	SHA1	Message	Date

Jeremie Fraeys

38fa017b8e

refactor: Phase 6 - Complete migration, remove legacy files

BREAKING CHANGE: Legacy worker files removed, Worker struct simplified

Changes:
1. worker.go - Simplified to 8 fields using composed dependencies:
   - runLoop, runner, metrics, health (from new packages)
   - Removed: server, queue, running, datasetCache, ctx, cancel, etc.

2. factory.go - Updated NewWorker to use new structure
   - Uses lifecycle.NewRunLoop
   - Integrates jupyter.Manager properly

3. Removed legacy files:
   - execution.go (1,016 lines)
   - data_integrity.go (929 lines)
   - runloop.go (555 lines)
   - jupyter_task.go (144 lines)
   - simplified.go (demonstration no longer needed)

4. Fixed references to use new packages:
   - hash_selector.go -> integrity.DirOverallSHA256Hex
   - snapshot_store.go -> integrity.NormalizeSHA256ChecksumHex
   - metrics.go - Removed resource-dependent metrics temporarily

5. Added RecordQueueLatency to metrics.Metrics for lifecycle.MetricsRecorder

Worker struct: 27 fields -> 8 fields (70% reduction)

Build status: Compiles successfully

2026-02-17 14:39:48 -05:00

Jeremie Fraeys

72b4b29ecd

perf: add profiling benchmarks and parallel Go baseline for C++ optimization

Add comprehensive benchmarking suite for C++ optimization targets:
- tests/benchmarks/dataset_hash_bench_test.go - dirOverallSHA256Hex profiling
- tests/benchmarks/queue_bench_test.go - filesystem queue profiling
- tests/benchmarks/artifact_and_snapshot_bench_test.go - scanArtifacts/extractTarGz profiling
- tests/unit/worker/artifacts_test.go - moved from internal/ for clean separation

Add parallel Go implementation as baseline for C++ comparison:
- internal/worker/data_integrity.go: dirOverallSHA256HexParallel() with worker pool
- Benchmarks show 2.1x speedup (3.97ms -> 1.90ms) vs sequential

Exported wrappers for testing:
- ScanArtifacts() - artifact scanning
- ExtractTarGz() - tar.gz extraction
- DirOverallSHA256HexParallel() - parallel hashing

Profiling results (Apple M2 Ultra):
- dirOverallSHA256Hex: 78% syscall overhead (target for mmap C++)
- rebuildIndex: 96% syscall overhead (target for binary index C++)
- scanArtifacts: 87% syscall overhead (target for fast traversal C++)
- extractTarGz: 95% syscall overhead (target for parallel gzip C++)

Related: C++ optimization strategy in memory 5d5f0bb6

2026-02-12 12:04:02 -05:00

Jeremie Fraeys

82034c68f3

feat(worker): add integrity checks, snapshot staging, and prewarm support

2026-01-05 12:31:13 -05:00

3 commits