Commit graph

21 commits

Author SHA1 Message Date
Jeremie Fraeys
417133afce
refactor: Export test compatibility functions from worker package
- Added DirOverallSHA256Hex re-export from integrity package
- Added NormalizeSHA256ChecksumHex re-export from integrity package
- Added integrity import to worker.go
- Fixed import errors for test compilation

Build status: Compiles successfully
2026-02-17 16:49:26 -05:00
Jeremie Fraeys
4c8c9dfe4b
refactor: Export SelectDependencyManifest for API helpers
- Renamed selectDependencyManifest to SelectDependencyManifest (exported)
- Added re-export in worker package for backward compatibility
- Updated internal call in container.go to use exported function
- API helpers can now access via worker.SelectDependencyManifest

Build status: Compiles successfully
2026-02-17 16:45:59 -05:00
Jeremie Fraeys
085c23f66a
refactor(phase7): Initialize JobRunner in factory.go
- Create jobRunner using NewJobRunner with local and container executors
- Assign jobRunner to Worker.runner field
- JobRunner available for future task execution orchestration

Build status: Compiles successfully
2026-02-17 16:40:03 -05:00
Jeremie Fraeys
51698d60de
refactor(phase7): Restore resource metrics in metrics.go
- Re-enabled all resource metrics (CPU, GPU, acquisition stats)
- Metrics are conditionally registered only when w.resources != nil
- Added nil check to prevent panics if resource manager not initialized

Build status: Compiles successfully
2026-02-17 16:38:47 -05:00
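The nil guard described in this commit can be sketched as follows. This is a minimal illustration, not the project's metrics.go: `resourceManager`, `registry`, and the gauge names are hypothetical stand-ins, and only the idea of registering resource metrics when `w.resources != nil` comes from the commit message.

```go
package main

import "fmt"

// resourceManager is a hypothetical stand-in for the worker's resource manager.
type resourceManager struct{ cpuFree int }

// worker mirrors only the part of Worker that matters here: an optional resources field.
type worker struct{ resources *resourceManager }

// registry is a hypothetical minimal metrics registry: gauge name -> sampling function.
type registry map[string]func() float64

// registerMetrics always registers the base gauge and adds resource gauges
// only when the resource manager has been initialized.
func (w *worker) registerMetrics(r registry) {
	r["worker_up"] = func() float64 { return 1 }
	if w.resources == nil {
		return // no resource manager: skip resource gauges instead of panicking at scrape time
	}
	r["cpu_free"] = func() float64 { return float64(w.resources.cpuFree) }
}

func main() {
	bare, full := registry{}, registry{}
	(&worker{}).registerMetrics(bare)
	(&worker{resources: &resourceManager{cpuFree: 4}}).registerMetrics(full)
	fmt.Println(len(bare), len(full), full["cpu_free"]())
}
```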
Jeremie Fraeys
1ba67e419d
refactor(phase7): Integrate resource manager into Worker
- Added resources field to Worker struct
- Updated factory.go to pass resource manager to Worker
- Removed placeholder discard of resource manager
- Build compiles successfully
2026-02-17 16:37:33 -05:00
Jeremie Fraeys
a7360869f8
refactor: Implement TaskExecutorAdapter and Worker.runningCount()
- Created executor/TaskExecutorAdapter implementing lifecycle.TaskExecutor
- Properly wires LocalExecutor and ContainerExecutor through adapter
- Worker.runningCount() now delegates to runLoop.RunningCount()
- Added lifecycle.RunLoop.RunningCount() public method
- factory.go creates proper executor chain instead of placeholder

Build status: Compiles successfully
2026-02-17 16:15:41 -05:00
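A rough sketch of the adapter idea from this commit, under stated assumptions: the method sets of `TaskExecutor` and `JobExecutor`, the `Task` fields, and the routing rule are all hypothetical, since the log names the types but not their signatures.

```go
package main

import (
	"errors"
	"fmt"
)

// Task is a hypothetical task record; the real type is not shown in the log.
type Task struct {
	ID           string
	UseContainer bool
}

// TaskExecutor is an assumed shape for the lifecycle.TaskExecutor contract.
type TaskExecutor interface {
	Execute(t Task) error
}

// JobExecutor is an assumed shape for what the local and container executors expose.
type JobExecutor interface {
	Run(jobID string) error
}

// TaskExecutorAdapter lets a run loop drive either executor through one interface.
type TaskExecutorAdapter struct {
	Local     JobExecutor
	Container JobExecutor
}

// Compile-time check that the adapter satisfies the interface.
var _ TaskExecutor = (*TaskExecutorAdapter)(nil)

// Execute routes the task to the matching executor (routing rule is an assumption).
func (a *TaskExecutorAdapter) Execute(t Task) error {
	exec := a.Local
	if t.UseContainer {
		exec = a.Container
	}
	if exec == nil {
		return errors.New("no executor configured for task " + t.ID)
	}
	return exec.Run(t.ID)
}

// recordingExecutor is a test double that remembers the job IDs it was asked to run.
type recordingExecutor struct{ ran []string }

func (r *recordingExecutor) Run(jobID string) error {
	r.ran = append(r.ran, jobID)
	return nil
}

func main() {
	local, container := &recordingExecutor{}, &recordingExecutor{}
	var exec TaskExecutor = &TaskExecutorAdapter{Local: local, Container: container}
	_ = exec.Execute(Task{ID: "a"})
	_ = exec.Execute(Task{ID: "b", UseContainer: true})
	fmt.Println(local.ran, container.ran)
}
```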
Jeremie Fraeys
38fa017b8e
refactor: Phase 6 - Complete migration, remove legacy files
BREAKING CHANGE: Legacy worker files removed, Worker struct simplified

Changes:
1. worker.go - Simplified to 8 fields using composed dependencies:
   - runLoop, runner, metrics, health (from new packages)
   - Removed: server, queue, running, datasetCache, ctx, cancel, etc.

2. factory.go - Updated NewWorker to use new structure
   - Uses lifecycle.NewRunLoop
   - Integrates jupyter.Manager properly

3. Removed legacy files:
   - execution.go (1,016 lines)
   - data_integrity.go (929 lines)
   - runloop.go (555 lines)
   - jupyter_task.go (144 lines)
   - simplified.go (demonstration no longer needed)

4. Fixed references to use new packages:
   - hash_selector.go -> integrity.DirOverallSHA256Hex
   - snapshot_store.go -> integrity.NormalizeSHA256ChecksumHex
   - metrics.go - Removed resource-dependent metrics temporarily

5. Added RecordQueueLatency to metrics.Metrics for lifecycle.MetricsRecorder

Worker struct: 27 fields -> 8 fields (70% reduction)

Build status: Compiles successfully
2026-02-17 14:39:48 -05:00
Jeremie Fraeys
94bb52d09c
refactor: Phase 5 - Simplified Worker demonstration
Created simplified.go demonstrating target architecture:

internal/worker/simplified.go (109 lines)
- SimplifiedWorker struct with 6 fields vs original 27 fields
- Uses composed dependencies from previous phases:
  - lifecycle.RunLoop for task lifecycle management
  - executor.JobRunner for job execution
  - lifecycle.HealthMonitor for health tracking
  - lifecycle.MetricsRecorder for metrics

Key improvements demonstrated:
- Dependency injection via SimplifiedWorkerConfig
- Clear separation of concerns
- No direct resource access (queue, metrics, etc.)
- Each component implements a defined interface
- Easy to test with mock implementations

Note: This is a demonstration of the target architecture.
The original Worker struct remains for backward compatibility.
Migration would happen incrementally in future PRs.

Build status: Compiles successfully
2026-02-17 14:24:40 -05:00
Jeremie Fraeys
062b78cbe0
refactor: Phase 4 - Extract lifecycle types and interfaces
Created lifecycle package with foundational types for future extraction:

1. internal/worker/lifecycle/runloop.go (117 lines)
   - TaskExecutor interface for task execution contract
   - RunLoopConfig for run loop configuration
   - RunLoop type with core orchestration logic
   - MetricsRecorder and Logger interfaces for dependencies
   - Start(), Stop() methods for loop control
   - executeTask() method for task lifecycle management

2. internal/worker/lifecycle/health.go (52 lines)
   - HealthMonitor type for health tracking
   - RecordHeartbeat(), IsHealthy(), MarkUnhealthy() methods
   - Heartbeater interface for heartbeat operations
   - HeartbeatLoop() function for background heartbeats

Note: These are interface/type foundations for Phase 5.
The actual Worker struct methods remain in runloop.go until
Phase 5 when they'll migrate to use these abstractions.

Build status: Compiles successfully
2026-02-17 14:22:58 -05:00
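One way the HealthMonitor methods named above could fit together is sketched below. The constructor `NewHealthMonitor`, the timeout field, and the rule that a heartbeat clears the unhealthy flag are assumptions for illustration; the commit only lists the method names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// HealthMonitor tracks the most recent heartbeat. The field layout and the
// staleness rule are assumptions; the commit only names the methods.
type HealthMonitor struct {
	mu        sync.Mutex
	timeout   time.Duration
	lastBeat  time.Time
	unhealthy bool
}

// NewHealthMonitor is a hypothetical constructor; creation counts as a first heartbeat.
func NewHealthMonitor(timeout time.Duration) *HealthMonitor {
	return &HealthMonitor{timeout: timeout, lastBeat: time.Now()}
}

// RecordHeartbeat refreshes the timestamp and clears any explicit unhealthy mark.
func (h *HealthMonitor) RecordHeartbeat() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastBeat = time.Now()
	h.unhealthy = false
}

// MarkUnhealthy forces the monitor into the unhealthy state until the next heartbeat.
func (h *HealthMonitor) MarkUnhealthy() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.unhealthy = true
}

// IsHealthy reports true when not marked unhealthy and the last heartbeat is recent enough.
func (h *HealthMonitor) IsHealthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return !h.unhealthy && time.Since(h.lastBeat) <= h.timeout
}

func main() {
	h := NewHealthMonitor(time.Minute)
	fmt.Println(h.IsHealthy())
	h.MarkUnhealthy()
	fmt.Println(h.IsHealthy())
	h.RecordHeartbeat()
	fmt.Println(h.IsHealthy())
}
```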
Jeremie Fraeys
3248279c01
refactor: Phase 3 - Extract data integrity layer
Created integrity package with extracted data utilities:

1. internal/worker/integrity/hash.go (113 lines)
   - FileSHA256Hex() - SHA256 hash of single file
   - NormalizeSHA256ChecksumHex() - Checksum normalization
   - DirOverallSHA256Hex() - Directory hash (sequential)
   - DirOverallSHA256HexParallel() - Directory hash (parallel workers)

2. internal/worker/integrity/validate.go (76 lines)
   - DatasetVerifier type for dataset validation
   - VerifyDatasetSpecs() method for checksum validation
   - ProvenanceCalculator type for provenance computation
   - ComputeProvenance() method for task provenance

Note: Used 'integrity' instead of 'data' due to .gitignore conflict
(data/ directory is ignored for experiment artifacts)

Functions extracted from data_integrity.go:
- fileSHA256Hex → FileSHA256Hex
- normalizeSHA256ChecksumHex → NormalizeSHA256ChecksumHex
- dirOverallSHA256HexGo → DirOverallSHA256Hex
- dirOverallSHA256HexParallel → DirOverallSHA256HexParallel
- verifyDatasetSpecs logic → DatasetVerifier
- computeTaskProvenance logic → ProvenanceCalculator

Build status: Compiles successfully
2026-02-17 14:20:41 -05:00
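For orientation, a minimal sketch of what the two simplest helpers in this commit might look like. The function names come from the commit message; the signatures and the normalization rules (trim whitespace, lowercase, strip an optional `sha256:` prefix, require 64 hex characters) are assumptions, not the contents of hash.go.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

// FileSHA256Hex streams a file through SHA-256 and returns the lowercase hex digest.
func FileSHA256Hex(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// NormalizeSHA256ChecksumHex canonicalizes a user-supplied checksum.
// The accepted input forms are an assumption made for this sketch.
func NormalizeSHA256ChecksumHex(s string) (string, error) {
	s = strings.ToLower(strings.TrimSpace(s))
	s = strings.TrimPrefix(s, "sha256:")
	if len(s) != sha256.Size*2 {
		return "", fmt.Errorf("expected %d hex characters, got %d", sha256.Size*2, len(s))
	}
	if _, err := hex.DecodeString(s); err != nil {
		return "", fmt.Errorf("invalid hex: %w", err)
	}
	return s, nil
}

func main() {
	f, err := os.CreateTemp("", "sketch-*.txt")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	f.WriteString("hello")
	f.Close()

	sum, err := FileSHA256Hex(f.Name())
	if err != nil {
		panic(err)
	}
	norm, err := NormalizeSHA256ChecksumHex("SHA256:" + strings.ToUpper(sum))
	fmt.Println(sum, norm == sum, err)
}
```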
Jeremie Fraeys
22f3d66f1d
refactor: Phase 2 - Extract executor implementations
Created executor package with extracted job execution logic:

1. internal/worker/executor/local.go (104 lines)
   - LocalExecutor implements JobExecutor interface
   - Execute() method for local bash script execution
   - generateScript() helper for creating experiment scripts

2. internal/worker/executor/container.go (229 lines)
   - ContainerExecutor implements JobExecutor interface
   - Execute() method for podman container execution
   - EnvironmentPool interface for image caching
   - Tracking tool provisioning (MLflow, TensorBoard, Wandb)
   - Volume and cache setup
   - selectDependencyManifest() helper

3. internal/worker/executor/runner.go (131 lines)
   - JobRunner orchestrates execution
   - ExecutionMode enum (Auto, Local, Container)
   - Run() method with directory setup and executor selection
   - finalize() for success/failure handling

Key design decisions:
- Executors depend on interfaces (ManifestWriter, not Worker)
- JobRunner composes both executors
- No direct Worker dependencies in executor package
- SetupJobDirectories reused from execution package

Build status: Compiles successfully
2026-02-17 14:14:04 -05:00
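The executor selection in JobRunner might look roughly like the sketch below. The mode names follow the commit message (Auto, Local, Container); the `Job` type, the interface method, and the Auto rule (use the container executor when the job names an image) are assumptions made for illustration.

```go
package main

import "fmt"

// ExecutionMode mirrors the enum named in the commit message.
type ExecutionMode int

const (
	ModeAuto ExecutionMode = iota
	ModeLocal
	ModeContainer
)

// Job is a hypothetical job description.
type Job struct {
	ID    string
	Image string // empty means no container image was requested
}

// JobExecutor is an assumed shape for the executor contract.
type JobExecutor interface {
	Execute(j Job) error
}

// JobRunner composes both executors and picks one per job.
type JobRunner struct {
	local, container JobExecutor
}

func NewJobRunner(local, container JobExecutor) *JobRunner {
	return &JobRunner{local: local, container: container}
}

// Run selects an executor from the mode; in Auto mode the presence of an
// image decides (an assumption, not the project's documented rule).
func (r *JobRunner) Run(j Job, mode ExecutionMode) error {
	exec := r.local
	if mode == ModeContainer || (mode == ModeAuto && j.Image != "") {
		exec = r.container
	}
	return exec.Execute(j)
}

// namedExecutor is a test double that appends "<name>:<job id>" to a shared log.
type namedExecutor struct {
	name string
	log  *[]string
}

func (e namedExecutor) Execute(j Job) error {
	*e.log = append(*e.log, e.name+":"+j.ID)
	return nil
}

func main() {
	var log []string
	r := NewJobRunner(namedExecutor{"local", &log}, namedExecutor{"container", &log})
	_ = r.Run(Job{ID: "1"}, ModeAuto)
	_ = r.Run(Job{ID: "2", Image: "python:3.12"}, ModeAuto)
	_ = r.Run(Job{ID: "3", Image: "python:3.12"}, ModeLocal)
	fmt.Println(log)
}
```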
Jeremie Fraeys
ae0a370fb4
refactor: Phase 1 - Extract worker interfaces
Created interfaces package to break tight coupling:

1. internal/worker/interfaces/executor.go (30 lines)
   - JobExecutor interface for job execution
   - ExecutionEnv struct for execution context
   - ExecutionResult struct for results

2. internal/worker/interfaces/tracker.go (20 lines)
   - ProgressTracker interface for execution stages
   - StageStart, StageComplete, StageFailed methods
   - JobComplete for final status

3. internal/worker/interfaces/manifest.go (18 lines)
   - ManifestWriter interface for manifest operations
   - Upsert method for update/create
   - BuildInitial method for creating new manifests

These interfaces will enable:
- Dependency injection in future phases
- Mocking for unit tests
- Clean separation between orchestration and execution

Build status: Compiles successfully
2026-02-17 14:10:03 -05:00
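To illustrate the "mocking for unit tests" point, here is a small sketch of a recording test double for ProgressTracker. The method names are taken from the commit message; their parameter lists, the `runStages` driver, and the event strings are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ProgressTracker uses the method names from the commit; the signatures are assumed.
type ProgressTracker interface {
	StageStart(stage string)
	StageComplete(stage string)
	StageFailed(stage string, err error)
	JobComplete(success bool)
}

// recordingTracker is a mock that records every call as a short string.
type recordingTracker struct{ events []string }

func (r *recordingTracker) StageStart(s string)    { r.events = append(r.events, "start:"+s) }
func (r *recordingTracker) StageComplete(s string) { r.events = append(r.events, "done:"+s) }
func (r *recordingTracker) StageFailed(s string, err error) {
	r.events = append(r.events, "failed:"+s)
}
func (r *recordingTracker) JobComplete(ok bool) {
	r.events = append(r.events, fmt.Sprintf("job:%t", ok))
}

// runStages is a hypothetical orchestrator that depends only on the interface,
// so a test can pass the mock instead of a real tracker.
func runStages(t ProgressTracker, stages []string, failAt string) bool {
	for _, s := range stages {
		t.StageStart(s)
		if s == failAt {
			t.StageFailed(s, errors.New("stage failed"))
			t.JobComplete(false)
			return false
		}
		t.StageComplete(s)
	}
	t.JobComplete(true)
	return true
}

func main() {
	rec := &recordingTracker{}
	runStages(rec, []string{"setup", "run"}, "run")
	fmt.Println(rec.events)
}
```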
Jeremie Fraeys
c46be7f815
refactor: Phase 4 deferred - Extract GPU utilities and execution helpers
Extracted from execution.go to focused packages:

1. internal/worker/gpu.go (60 lines)
   - gpuVisibleDevicesString() - GPU device string formatting
   - filterExistingDevicePaths() - Device path filtering
   - gpuVisibleEnvVarName() - GPU env var selection
   - Reuses GPUType constants from gpu_detector.go

2. internal/worker/execution/setup.go (108 lines)
   - SetupJobDirectories() - Job directory creation
   - CopyDir() - Directory tree copying
   - copyFile() - Single file copy helper

3. internal/worker/execution/snapshot.go (52 lines)
   - StageSnapshot() - Snapshot staging for jobs
   - StageSnapshotFromPath() - Snapshot staging from path

Updated execution.go:
- Removed 64 lines of GPU utilities (now in gpu.go)
- Reduced from 1,082 to ~1,018 lines
- Still contains main execution flow (runJob, executeJob, etc.)

Build status: Compiles successfully
2026-02-17 14:03:11 -05:00
Jeremie Fraeys
a5c1a9fc0b
refactor: Phase 4 - split worker package into focused files
Split 551-line worker/core.go into single-concern files:

- worker/config.go (+44 lines)
  - Added config parsing: envInt(), parseCPUFromConfig(), parseGPUCountFromConfig(),
    parseGPUSlotsPerGPUFromConfig()
  - Now has all config logic in one place (440 lines total)

- worker/metrics.go (new file, 172 lines)
  - Extracted setupMetricsExporter() with ~30 Prometheus metric registrations
  - Isolated metrics logic for easy modification

- worker/factory.go (new file, 183 lines)
  - Extracted NewWorker() factory function
  - Moved prePullImages(), pullImage() from core.go
  - Centralized worker instantiation

- worker/worker.go (renamed from core.go, ~100 lines)
  - Now just defines Worker struct, MLServer, JupyterManager
  - Clean, focused file without mixed concerns

Lines redistributed: ~350 lines moved from monolithic core.go
Build status: Compiles successfully
2026-02-17 12:57:02 -05:00
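A helper like the envInt() moved into config.go could be sketched as below. The signature, the fallback behavior, and the `parseIntOr` helper are assumptions; the commit names the function but does not show it.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseIntOr returns the integer in raw, or fallback when raw is empty or malformed.
// This helper is hypothetical and exists to keep the parsing testable without
// touching the process environment.
func parseIntOr(raw string, fallback int) int {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return fallback
	}
	return n
}

// envInt reads an integer from the environment, falling back when the
// variable is unset or not a valid integer (assumed behavior).
func envInt(name string, fallback int) int {
	raw, ok := os.LookupEnv(name)
	if !ok {
		return fallback
	}
	return parseIntOr(raw, fallback)
}

func main() {
	// SKETCH_GPU_COUNT and SKETCH_UNSET are made-up variable names.
	os.Setenv("SKETCH_GPU_COUNT", " 2 ")
	fmt.Println(envInt("SKETCH_GPU_COUNT", 0), envInt("SKETCH_UNSET", 4))
}
```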
Jeremie Fraeys
d1bef0a450
refactor: Phase 3 - fix config/storage boundaries
Move schema ownership to infrastructure layer:

- Redis keys: config/constants.go -> queue/keys.go (TaskQueueKey, TaskPrefix, etc.)

- Filesystem paths: config/paths.go -> storage/paths.go (JobPaths)

- Create config/shared.go with RedisConfig, SSHConfig

- Update all imports: worker/, api/helpers, api/ws_jobs, api/ws_validate

- Clean up: remove duplicates from queue/task.go, queue/queue.go, config/paths.go

Build status: Compiles successfully
2026-02-17 12:49:53 -05:00
Jeremie Fraeys
a93b6715fd
feat: add native library bridge and queue integration
- Add native_queue.go with CGO bindings for queue operations
- Add native_queue_stub.go for non-CGO builds
- Add hash_selector to choose between Go and native implementations
- Add native_bridge_libs.go for CGO builds with native_libs tag
- Add native_bridge_nocgo.go stub for non-CGO builds
- Update queue errors and task handling for native integration
- Update worker config and runloop for native library support
2026-02-16 20:38:30 -05:00
Jeremie Fraeys
43d241c28d
feat: implement C++ native libraries for performance-critical operations
- Add arena allocator for zero-allocation hot paths
- Add thread pool for parallel operations
- Add mmap utilities for memory-mapped I/O
- Implement queue_index with heap-based priority queue
- Implement dataset_hash with SIMD support (SHA-NI, ARMv8)
- Add runtime SIMD detection for cross-platform correctness
- Add comprehensive tests and benchmarks
2026-02-16 20:38:04 -05:00
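The commit implements queue_index in C++; purely to illustrate the heap-based priority queue idea, here is a Go sketch using the standard container/heap package. The `entry` fields, the `QueueIndex` wrapper, and the ordering rule (higher priority first, insertion order as tie-break) are assumptions and not a port of the native code.

```go
package main

import (
	"container/heap"
	"fmt"
)

// entry is a hypothetical queue item.
type entry struct {
	id       string
	priority int
	seq      int // insertion order, used to break ties
}

// entryHeap implements heap.Interface.
type entryHeap []entry

func (h entryHeap) Len() int { return len(h) }
func (h entryHeap) Less(i, j int) bool {
	if h[i].priority != h[j].priority {
		return h[i].priority > h[j].priority // higher priority pops first
	}
	return h[i].seq < h[j].seq // older entries first among equals
}
func (h entryHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *entryHeap) Push(x any)   { *h = append(*h, x.(entry)) }
func (h *entryHeap) Pop() any {
	old := *h
	n := len(old)
	e := old[n-1]
	*h = old[:n-1]
	return e
}

// QueueIndex wraps the heap so callers never touch container/heap directly.
type QueueIndex struct {
	h    entryHeap
	next int
}

// Add inserts an item in O(log n).
func (q *QueueIndex) Add(id string, priority int) {
	heap.Push(&q.h, entry{id: id, priority: priority, seq: q.next})
	q.next++
}

// Next removes and returns the highest-priority item, or false when empty.
func (q *QueueIndex) Next() (string, bool) {
	if q.h.Len() == 0 {
		return "", false
	}
	return heap.Pop(&q.h).(entry).id, true
}

func main() {
	q := &QueueIndex{}
	q.Add("low", 1)
	q.Add("high-a", 5)
	q.Add("high-b", 5)
	for id, ok := q.Next(); ok; id, ok = q.Next() {
		fmt.Println(id)
	}
}
```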
Jeremie Fraeys
d408a60eb1
ci: push all workflow updates
2026-02-12 13:28:15 -05:00
Jeremie Fraeys
2e701340e5
feat(core): API, worker, queue, and manifest improvements
- Add protocol buffer optimizations (internal/api/protocol.go)
- Add filesystem queue backend (internal/queue/filesystem_queue.go)
- Add run manifest support (internal/manifest/run_manifest.go)
- Worker and jupyter task refinements
- Exported test wrappers for benchmarking
2026-02-12 12:05:17 -05:00
Jeremie Fraeys
72b4b29ecd
perf: add profiling benchmarks and parallel Go baseline for C++ optimization
Add comprehensive benchmarking suite for C++ optimization targets:
- tests/benchmarks/dataset_hash_bench_test.go - dirOverallSHA256Hex profiling
- tests/benchmarks/queue_bench_test.go - filesystem queue profiling
- tests/benchmarks/artifact_and_snapshot_bench_test.go - scanArtifacts/extractTarGz profiling
- tests/unit/worker/artifacts_test.go - moved from internal/ for clean separation

Add parallel Go implementation as baseline for C++ comparison:
- internal/worker/data_integrity.go: dirOverallSHA256HexParallel() with worker pool
- Benchmarks show 2.1x speedup (3.97ms -> 1.90ms) vs sequential

Exported wrappers for testing:
- ScanArtifacts() - artifact scanning
- ExtractTarGz() - tar.gz extraction
- DirOverallSHA256HexParallel() - parallel hashing

Profiling results (Apple M2 Ultra):
- dirOverallSHA256Hex: 78% syscall overhead (target for mmap C++)
- rebuildIndex: 96% syscall overhead (target for binary index C++)
- scanArtifacts: 87% syscall overhead (target for fast traversal C++)
- extractTarGz: 95% syscall overhead (target for parallel gzip C++)

Related: C++ optimization strategy in memory 5d5f0bb6
2026-02-12 12:04:02 -05:00
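The worker-pool approach to directory hashing mentioned in this commit can be sketched as follows. This is not the project's dirOverallSHA256HexParallel(): the way per-file digests are folded into one overall digest (relative path, a zero byte, then the file digest, in lexical path order) and the helper names are assumptions for illustration, and no performance claim is implied.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"sync"
)

// hashFile returns the raw SHA-256 digest of one file inside fsys.
func hashFile(fsys fs.FS, name string) ([]byte, error) {
	f, err := fsys.Open(name)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return nil, err
	}
	return h.Sum(nil), nil
}

// dirHashParallel hashes files with a pool of workers, then folds the
// per-file digests together in path order so the result is deterministic.
func dirHashParallel(root string, workers int) (string, error) {
	fsys := os.DirFS(root)
	var files []string // relative, slash-separated; fs.WalkDir visits in lexical order
	err := fs.WalkDir(fsys, ".", func(p string, d fs.DirEntry, err error) error {
		if err == nil && d.Type().IsRegular() {
			files = append(files, p)
		}
		return err
	})
	if err != nil {
		return "", err
	}
	if workers < 1 {
		workers = 1
	}
	sums := make([][]byte, len(files))
	errs := make([]error, len(files))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs { // each index is written by exactly one worker
				sums[i], errs[i] = hashFile(fsys, files[i])
			}
		}()
	}
	for i := range files {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	overall := sha256.New()
	for i, name := range files {
		if errs[i] != nil {
			return "", errs[i]
		}
		overall.Write(append([]byte(name), 0)) // path plus a separator byte
		overall.Write(sums[i])
	}
	return hex.EncodeToString(overall.Sum(nil)), nil
}

// demoTree writes a small throwaway tree for the demo; the caller removes it.
func demoTree() (string, error) {
	root, err := os.MkdirTemp("", "dirhash-sketch-")
	if err != nil {
		return "", err
	}
	if err := os.Mkdir(filepath.Join(root, "sub"), 0o755); err != nil {
		return "", err
	}
	for name, body := range map[string]string{"a.txt": "alpha", "sub/b.txt": "beta"} {
		if err := os.WriteFile(filepath.Join(root, name), []byte(body), 0o644); err != nil {
			return "", err
		}
	}
	return root, nil
}

func main() {
	root, err := demoTree()
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)
	fmt.Println(dirHashParallel(root, 4))
}
```

Because workers only compute per-file digests and the final fold happens sequentially in path order, the result does not depend on the worker count or on goroutine scheduling.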
Jeremie Fraeys
82034c68f3
feat(worker): add integrity checks, snapshot staging, and prewarm support
2026-01-05 12:31:13 -05:00