Commit graph

9 commits

Author SHA1 Message Date
Jeremie Fraeys
3fb6902fa1
feat(worker): integrate scheduler endpoints and security hardening
Update worker system for scheduler integration:
- Worker server with scheduler registration
- Configuration with scheduler endpoint support
- Artifact handling with integrity verification
- Container executor with supply chain validation
- Local executor enhancements
- GPU detection improvements (cross-platform)
- Error handling with execution context
- Factory pattern for executor instantiation
- Hash integrity with native library support
2026-02-26 12:06:16 -05:00
Jeremie Fraeys
fc2459977c
refactor(worker): update worker tests and native bridge
**Worker Refactoring:**
- Update internal/worker/factory.go, worker.go, snapshot_store.go
- Update native_bridge.go and native_bridge_nocgo.go for native library integration

**Test Updates:**
- Update all worker unit tests for new interfaces
- Update chaos tests
- Update container/podman_test.go
- Add internal/workertest/worker.go for shared test utilities

**Documentation:**
- Update native/README.md
2026-02-23 18:04:22 -05:00
Jeremie Fraeys
3b194ff2e8
feat: GPU detection transparency and artifact scanner improvements
Some checks failed
Build CLI with Embedded SQLite / build (arm64, aarch64-linux) (push) Waiting to run
Build CLI with Embedded SQLite / build (x86_64, x86_64-linux) (push) Waiting to run
Build CLI with Embedded SQLite / build-macos (arm64) (push) Waiting to run
Build CLI with Embedded SQLite / build-macos (x86_64) (push) Waiting to run
Security Scan / Security Analysis (push) Waiting to run
Security Scan / Native Library Security (push) Waiting to run
Checkout test / test (push) Successful in 6s
CI/CD Pipeline / Test (push) Failing after 1s
CI/CD Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI/CD Pipeline / Build (push) Has been skipped
CI/CD Pipeline / Test Scripts (push) Has been skipped
CI/CD Pipeline / Test Native Libraries (push) Has been skipped
CI/CD Pipeline / GPU Golden Test Matrix (push) Has been skipped
Documentation / build-and-publish (push) Failing after 39s
CI/CD Pipeline / Docker Build (push) Has been skipped
- Surface GPUDetectionInfo from parseGPUCountFromConfig for detection metadata
- Document FETCH_ML_TOTAL_CPU and FETCH_ML_GPU_SLOTS_PER_GPU env vars
- Add debug logging for all env var overrides to stderr
- Track config-layer auto-detection in GPUDetectionInfo.ConfigLayerAutoDetected
- Add --include-all flag to artifact scanner (includeAll parameter)
- Add AMD production mode enforcement (error in non-local mode)
- Add GPU detector unit tests for env overrides and AMD aliasing
2026-02-23 12:29:34 -05:00
Jeremie Fraeys
23e5f3d1dc
refactor(api): internal refactoring for TUI and worker modules
- Refactor internal/worker and internal/queue packages
- Update cmd/tui for monitoring interface
- Update test configurations
2026-02-20 15:51:23 -05:00
Jeremie Fraeys
085c23f66a
refactor(phase7): Initialize JobRunner in factory.go
- Create jobRunner using NewJobRunner with local and container executors
- Assign jobRunner to Worker.runner field
- JobRunner available for future task execution orchestration

Build status: Compiles successfully
2026-02-17 16:40:03 -05:00
Jeremie Fraeys
1ba67e419d
refactor(phase7): Integrate resource manager into Worker
- Added resources field to Worker struct
- Updated factory.go to pass resource manager to Worker
- Removed placeholder discard of resource manager
- Build compiles successfully
2026-02-17 16:37:33 -05:00
Jeremie Fraeys
a7360869f8
refactor: Implement TaskExecutorAdapter and Worker.runningCount()
- Created executor/TaskExecutorAdapter implementing lifecycle.TaskExecutor
- Properly wires LocalExecutor and ContainerExecutor through adapter
- Worker.runningCount() now delegates to runLoop.RunningCount()
- Added lifecycle.RunLoop.RunningCount() public method
- factory.go creates proper executor chain instead of placeholder

Build status: Compiles successfully
2026-02-17 16:15:41 -05:00
Jeremie Fraeys
38fa017b8e
refactor: Phase 6 - Complete migration, remove legacy files
BREAKING CHANGE: Legacy worker files removed, Worker struct simplified

Changes:
1. worker.go - Simplified to 8 fields using composed dependencies:
   - runLoop, runner, metrics, health (from new packages)
   - Removed: server, queue, running, datasetCache, ctx, cancel, etc.

2. factory.go - Updated NewWorker to use new structure
   - Uses lifecycle.NewRunLoop
   - Integrates jupyter.Manager properly

3. Removed legacy files:
   - execution.go (1,016 lines)
   - data_integrity.go (929 lines)
   - runloop.go (555 lines)
   - jupyter_task.go (144 lines)
   - simplified.go (demonstration no longer needed)

4. Fixed references to use new packages:
   - hash_selector.go -> integrity.DirOverallSHA256Hex
   - snapshot_store.go -> integrity.NormalizeSHA256ChecksumHex
   - metrics.go - Removed resource-dependent metrics temporarily

5. Added RecordQueueLatency to metrics.Metrics for lifecycle.MetricsRecorder

Worker struct: 27 fields -> 8 fields (70% reduction)

Build status: Compiles successfully
2026-02-17 14:39:48 -05:00
Jeremie Fraeys
a5c1a9fc0b
refactor: Phase 4 - split worker package into focused files
Split 551-line worker/core.go into single-concern files:

- worker/config.go (+44 lines)
  - Added config parsing: envInt(), parseCPUFromConfig(), parseGPUCountFromConfig()
  - parseGPUSlotsPerGPUFromConfig()
  - Now has all config logic in one place (440 lines total)

- worker/metrics.go (new file, 172 lines)
  - Extracted setupMetricsExporter() with ~30 Prometheus metric registrations
  - Isolated metrics logic for easy modification

- worker/factory.go (new file, 183 lines)
  - Extracted NewWorker() factory function
  - Moved prePullImages(), pullImage() from core.go
  - Centralized worker instantiation

- worker/worker.go (renamed from core.go, ~100 lines)
  - Now just defines Worker struct, MLServer, JupyterManager
  - Clean, focused file without mixed concerns

Lines redistributed: ~350 lines moved from monolithic core.go
Build status: Compiles successfully
2026-02-17 12:57:02 -05:00