Jeremie Fraeys
fc2459977c
refactor(worker): update worker tests and native bridge
**Worker Refactoring:**
- Update internal/worker/factory.go, worker.go, snapshot_store.go
- Update native_bridge.go and native_bridge_nocgo.go for native library integration
**Test Updates:**
- Update all worker unit tests for new interfaces
- Update chaos tests
- Update container/podman_test.go
- Add internal/workertest/worker.go for shared test utilities
**Documentation:**
- Update native/README.md
2026-02-23 18:04:22 -05:00
Jeremie Fraeys
3b194ff2e8
feat: GPU detection transparency and artifact scanner improvements
- Surface GPUDetectionInfo from parseGPUCountFromConfig for detection metadata
- Document FETCH_ML_TOTAL_CPU and FETCH_ML_GPU_SLOTS_PER_GPU env vars
- Add debug logging for all env var overrides to stderr
- Track config-layer auto-detection in GPUDetectionInfo.ConfigLayerAutoDetected
- Add --include-all flag to artifact scanner (includeAll parameter)
- Add AMD production mode enforcement (error in non-local mode)
- Add GPU detector unit tests for env overrides and AMD aliasing
2026-02-23 12:29:34 -05:00
Jeremie Fraeys
a4543750cd
chore: remove unused runningCount method from worker
The runningCount method was orphaned after removing metrics.go.
All tests pass.
2026-02-17 20:43:16 -05:00
Jeremie Fraeys
bd2b99b09c
chore: remove orphaned unused functions from worker package
Deleted files:
- internal/worker/gpu.go (76 lines) - all 3 functions unused: _gpuVisibleDevicesString, _filterExistingDevicePaths, _gpuVisibleEnvVarName
- internal/worker/metrics.go (229 lines) - _setupMetricsExporter method unused
Modified:
- internal/worker/worker.go - removed _isValidName and _getGPUDetector functions
All tests pass and the build compiles successfully.
2026-02-17 20:41:58 -05:00
Jeremie Fraeys
3187ff26ea
refactor: complete maintainability phases 1-9 and fix all tests
Test fixes (all 41 test packages now pass):
- Fix ComputeTaskProvenance - add dataset_specs JSON output
- Fix EnforceTaskProvenance - populate all metadata fields in best-effort mode
- Fix PrewarmNextOnce - preserve prewarm state when queue empty
- Fix RunManifest directory creation in SetupJobDirectories
- Add ManifestWriter to test worker (simpleManifestWriter)
- Fix worker ID mismatch (use cfg.WorkerID)
- Fix WebSocket binary protocol responses
- Implement all WebSocket handlers: QueueJob, QueueJobWithSnapshot, StatusRequest,
  CancelJob, Prune, ValidateRequest (with run manifest validation), LogMetric,
  GetExperiment, DatasetList/Register/Info/Search
Maintainability phases completed:
- Phases 1-6: Domain types, error system, config boundaries, worker/API/queue splits
- Phase 7: TUI cleanup - reorganize model package (jobs.go, messages.go, styles.go, keys.go)
- Phase 8: MLServer unification - consolidate worker + TUI into internal/network/mlserver.go
- Phase 9: CI enforcement - add scripts/ci-checks.sh with 5 checks:
  * No internal/ -> cmd/ imports
  * domain/ has zero internal imports
  * File size limit (500 lines, rigid)
  * No circular imports
  * Package naming conventions
Documentation:
- Add docs/src/file-naming-conventions.md
- Add make ci-checks target
Lines changed: +756/-36 (WebSocket fixes), +518/-320 (TUI), +263/-20 (Phase 8-9)
2026-02-17 20:32:14 -05:00
Jeremie Fraeys
fb2bbbaae5
refactor: Phase 7 - TUI cleanup - reorganize model package
Phase 7 of the monorepo maintainability plan:
New files created:
- model/jobs.go - Job type, JobStatus constants, list.Item interface
- model/messages.go - tea.Msg types (JobsLoadedMsg, StatusMsg, TickMsg, etc.)
- model/styles.go - NewJobListDelegate(), JobListTitleStyle(), SpinnerStyle()
- model/keys.go - KeyMap struct, DefaultKeys() function
Modified files:
- model/state.go - reduced from 226 to ~130 lines
  - Removed: Job, JobStatus, KeyMap, Keys, inline styles
  - Kept: State struct, domain re-exports, ViewMode, DatasetInfo, InitialState()
- controller/commands.go - use model. prefix for message types
- controller/controller.go - use model. prefix for message types
- controller/settings.go - use model.SettingsContentMsg
Deleted files:
- controller/keys.go (moved to model/keys.go since State references KeyMap)
Result:
- No file >150 lines in model/ package
- Single concern per file: state, jobs, messages, styles, keys
- All 41 test packages pass
2026-02-17 20:22:04 -05:00
Jeremie Fraeys
a1ce267b86
feat: Implement all worker stub methods with real functionality
- VerifySnapshot: SHA256 verification using integrity package
- EnforceTaskProvenance: Strict and best-effort provenance validation
- RunJupyterTask: Full Jupyter service lifecycle (start/stop/remove/restore/list_packages)
- RunJob: Job execution using executor.JobRunner
- PrewarmNextOnce: Prewarming with queue integration
All methods now use new architecture components instead of placeholders
2026-02-17 17:37:56 -05:00
Jeremie Fraeys
713dba896c
refactor: Add test compatibility methods to worker package
- Added ComputeTaskProvenance function (delegates to integrity.ProvenanceCalculator)
- Added Worker.VerifyDatasetSpecs method
- Added Worker.EnforceTaskProvenance method (placeholder)
- Added Worker.VerifySnapshot method (placeholder)
- All methods added for backward compatibility with existing tests
Build status: Compiles successfully
2026-02-17 16:55:22 -05:00
Jeremie Fraeys
417133afce
refactor: Export test compatibility functions from worker package
- Added DirOverallSHA256Hex re-export from integrity package
- Added NormalizeSHA256ChecksumHex re-export from integrity package
- Added integrity import to worker.go
- Fixed import errors for test compilation
Build status: Compiles successfully
2026-02-17 16:49:26 -05:00
Jeremie Fraeys
4c8c9dfe4b
refactor: Export SelectDependencyManifest for API helpers
- Renamed selectDependencyManifest to SelectDependencyManifest (exported)
- Added re-export in worker package for backward compatibility
- Updated internal call in container.go to use exported function
- API helpers can now access via worker.SelectDependencyManifest
Build status: Compiles successfully
2026-02-17 16:45:59 -05:00
Jeremie Fraeys
1ba67e419d
refactor(phase7): Integrate resource manager into Worker
- Added resources field to Worker struct
- Updated factory.go to pass resource manager to Worker
- Removed placeholder discard of resource manager
- Build compiles successfully
2026-02-17 16:37:33 -05:00
Jeremie Fraeys
a7360869f8
refactor: Implement TaskExecutorAdapter and Worker.runningCount()
- Created executor/TaskExecutorAdapter implementing lifecycle.TaskExecutor
- Properly wires LocalExecutor and ContainerExecutor through adapter
- Worker.runningCount() now delegates to runLoop.RunningCount()
- Added lifecycle.RunLoop.RunningCount() public method
- factory.go creates proper executor chain instead of placeholder
Build status: Compiles successfully
2026-02-17 16:15:41 -05:00
Jeremie Fraeys
38fa017b8e
refactor: Phase 6 - Complete migration, remove legacy files
BREAKING CHANGE: Legacy worker files removed, Worker struct simplified
Changes:
1. worker.go - Simplified to 8 fields using composed dependencies:
   - runLoop, runner, metrics, health (from new packages)
   - Removed: server, queue, running, datasetCache, ctx, cancel, etc.
2. factory.go - Updated NewWorker to use new structure
   - Uses lifecycle.NewRunLoop
   - Integrates jupyter.Manager properly
3. Removed legacy files:
   - execution.go (1,016 lines)
   - data_integrity.go (929 lines)
   - runloop.go (555 lines)
   - jupyter_task.go (144 lines)
   - simplified.go (demonstration no longer needed)
4. Fixed references to use new packages:
   - hash_selector.go -> integrity.DirOverallSHA256Hex
   - snapshot_store.go -> integrity.NormalizeSHA256ChecksumHex
   - metrics.go - Removed resource-dependent metrics temporarily
5. Added RecordQueueLatency to metrics.Metrics for lifecycle.MetricsRecorder
Worker struct: 27 fields -> 8 fields (70% reduction)
Build status: Compiles successfully
2026-02-17 14:39:48 -05:00
Jeremie Fraeys
a5c1a9fc0b
refactor: Phase 4 - split worker package into focused files
Split 551-line worker/core.go into single-concern files:
- worker/config.go (+44 lines)
  - Added config parsing: envInt(), parseCPUFromConfig(), parseGPUCountFromConfig(),
    parseGPUSlotsPerGPUFromConfig()
  - Now has all config logic in one place (440 lines total)
- worker/metrics.go (new file, 172 lines)
  - Extracted setupMetricsExporter() with ~30 Prometheus metric registrations
  - Isolated metrics logic for easy modification
- worker/factory.go (new file, 183 lines)
  - Extracted NewWorker() factory function
  - Moved prePullImages(), pullImage() from core.go
  - Centralized worker instantiation
- worker/worker.go (renamed from core.go, ~100 lines)
  - Now just defines Worker struct, MLServer, JupyterManager
  - Clean, focused file without mixed concerns
Lines redistributed: ~350 lines moved from monolithic core.go
Build status: Compiles successfully
2026-02-17 12:57:02 -05:00