Update internal/api/server_config.go to use centralized PathRegistry:
Changes:
- Update EnsureLogDirectory() to use config.FromEnv().LogDir() with EnsureDir()
- Update Validate() to use PathRegistry for default BasePath and DataDir
- Remove hardcoded /tmp/ml-experiments default
- Use paths.ExperimentsDir() and paths.DataDir() for consistent paths
Benefits:
- Consistent directory locations via PathRegistry
- Centralized directory creation with EnsureDir()
- Better error handling for directory creation
Update internal/experiment/manager.go to use centralized PathRegistry:
Changes:
- Add import for internal/config package
- Add NewManagerFromPaths() constructor using PathRegistry
- Update Initialize() to use config.FromEnv().ExperimentsDir() with EnsureDir()
- Update archiveExperiment() to use PathRegistry pattern
Benefits:
- Consistent experiment directory location via PathRegistry
- Centralized directory creation with EnsureDir()
- Backward compatible: existing NewManager() still works
- New code can use NewManagerFromPaths() for PathRegistry integration
Update internal/jupyter/service_manager.go to use centralized PathRegistry:
Changes:
- Import config package for PathRegistry access
- Update stateDir() to use config.FromEnv().JupyterStateDir()
- Update workspaceBaseDir() to use config.FromEnv().ActiveDataDir()
- Update trashBaseDir() to use config.FromEnv().JupyterStateDir()
- Update NewServiceManager() to use PathRegistry for workspace metadata file
- Update loadServices() to use PathRegistry for services file path
- Update saveServices() to use PathRegistry with EnsureDir()
- Rename parameter 'config' to 'svcConfig' to avoid shadowing import
Benefits:
- Consistent path management across codebase
- Centralized directory creation with EnsureDir()
- Environment variable override still supported (backward compatible)
- Proper error handling for directory creation failures
Add comprehensive Podman secrets support to prevent credential exposure:
New types and methods (internal/container/podman.go):
- PodmanSecret struct for secret definitions
- CreateSecret() - Create Podman secrets from sensitive data
- DeleteSecret() - Clean up secrets after use
- BuildSecretArgs() - Generate podman run arguments for secrets
- SanitizeContainerEnv() - Extract sensitive env vars as secrets
- ContainerConfig.Secrets field for secret list
Enhanced container lifecycle:
- StartContainer() now creates secrets before starting container
- Secrets automatically mounted via --secret flag
- Cleanup on failure to prevent secret leakage
- Secrets logged as count only (not content)
Jupyter service integration (internal/jupyter/service_manager.go):
- prepareContainerConfig() uses SanitizeContainerEnv()
- JUPYTER_TOKEN and JUPYTER_PASSWORD now use secrets
- Maintains backward compatibility with env var mounting
Security benefits:
- Credentials no longer visible in 'podman inspect' output
- Secrets not exposed via /proc/*/environ inside container
- Automatic cleanup prevents secret accumulation
- Compatible with existing Jupyter authentication
- Add stripTokenFromURL() helper function to remove tokens from URLs
- Use it when logging service start URLs
- Use it when logging connectivity test URLs
- Prevents sensitive tokens from being written to log files
- Move queue_spec_test.go from internal/queue/ to tests/unit/queue/
- Update imports to use github.com/jfraeys/fetch_ml/internal/queue
- Remove duplicate docker-compose.dev.yml from root (exists in deployments/)
- Fix spec tests: add required Status field, JobName field
- Fix loop variable capture in priority ordering test
- Fix missing closing brace between test functions
- Fix existing queue_test.go: change 50ms to 1s for Redis min duration
All tests pass: go test ./tests/unit/queue/...
Phase 2: Deterministic Manifests
- Add manifest.Validator with required field checking
- Support Validate() and ValidateStrict() modes
- Integrate validation into worker executor before execution
- Block execution if manifest missing commit_id or deps_manifest_sha256
Phase 5: Pinned Dependencies
- Add hermetic.dockerfile template with pinned system deps
- Frozen package versions: libblas3, libcudnn8, etc.
- Support for deps_manifest.json and requirements.txt with hashes
- Image tagging strategy: deps-<first-8-of-sha256>
Phase 8: Tests as Specifications
- Add queue_spec_test.go with executable scheduler specs
- Document priority ordering (higher first)
- Document FIFO tiebreaker for same priority
- Test cases for negative/zero priorities
Phase 10: Local Dev Parity
- Create root-level docker-compose.dev.yml
- Simplified from deployments/ for quick local dev
- Redis + API server + Worker with hot reload volumes
- Debug ports: 9101 (API), 6379 (Redis)
Move ExpandPath function and path-related utilities from internal/config to internal/storage where they belong.
Files updated:
- internal/worker/config.go: use storage.ExpandPath
- internal/network/ssh.go: use storage.ExpandPath
- cmd/data_manager/data_manager_config.go: use storage.ExpandPath
- internal/api/server_config.go: use storage.ExpandPath
internal/storage/paths.go already contained the canonical implementation.
Result: Path utilities now live in storage layer, config package focuses on configuration structs.
- VerifySnapshot: SHA256 verification using integrity package
- EnforceTaskProvenance: Strict and best-effort provenance validation
- RunJupyterTask: Full Jupyter service lifecycle (start/stop/remove/restore/list_packages)
- RunJob: Job execution using executor.JobRunner
- PrewarmNextOnce: Prewarming with queue integration
All methods now use new architecture components instead of placeholders
- Renamed selectDependencyManifest to SelectDependencyManifest (exported)
- Added re-export in worker package for backward compatibility
- Updated internal call in container.go to use exported function
- API helpers can now access via worker.SelectDependencyManifest
Build status: Compiles successfully
- Create jobRunner using NewJobRunner with local and container executors
- Assign jobRunner to Worker.runner field
- JobRunner available for future task execution orchestration
Build status: Compiles successfully
- Re-enabled all resource metrics (CPU, GPU, acquisition stats)
- Metrics are conditionally registered only when w.resources != nil
- Added nil check to prevent panics if resource manager not initialized
Build status: Compiles successfully
Created simplified.go demonstrating target architecture:
internal/worker/simplified.go (109 lines)
- SimplifiedWorker struct with 6 fields vs original 27 fields
- Uses composed dependencies from previous phases:
- lifecycle.RunLoop for task lifecycle management
- executor.JobRunner for job execution
- lifecycle.HealthMonitor for health tracking
- lifecycle.MetricsRecorder for metrics
Key improvements demonstrated:
- Dependency injection via SimplifiedWorkerConfig
- Clear separation of concerns
- No direct resource access (queue, metrics, etc.)
- Each component implements a defined interface
- Easy to test with mock implementations
Note: This is a demonstration of the target architecture.
The original Worker struct remains for backward compatibility.
Migration would happen incrementally in future PRs.
Build status: Compiles successfully
Created lifecycle package with foundational types for future extraction:
1. internal/worker/lifecycle/runloop.go (117 lines)
- TaskExecutor interface for task execution contract
- RunLoopConfig for run loop configuration
- RunLoop type with core orchestration logic
- MetricsRecorder and Logger interfaces for dependencies
- Start(), Stop() methods for loop control
- executeTask() method for task lifecycle management
2. internal/worker/lifecycle/health.go (52 lines)
- HealthMonitor type for health tracking
- RecordHeartbeat(), IsHealthy(), MarkUnhealthy() methods
- Heartbeater interface for heartbeat operations
- HeartbeatLoop() function for background heartbeats
Note: These are interface/type foundations for Phase 5.
The actual Worker struct methods remain in runloop.go until
Phase 5 when they'll migrate to use these abstractions.
Build status: Compiles successfully
Created interfaces package to break tight coupling:
1. internal/worker/interfaces/executor.go (30 lines)
- JobExecutor interface for job execution
- ExecutionEnv struct for execution context
- ExecutionResult struct for results
2. internal/worker/interfaces/tracker.go (20 lines)
- ProgressTracker interface for execution stages
- StageStart, StageComplete, StageFailed methods
- JobComplete for final status
3. internal/worker/interfaces/manifest.go (18 lines)
- ManifestWriter interface for manifest operations
- Upsert method for update/create
- BuildInitial method for creating new manifests
These interfaces will enable:
- Dependency injection in future phases
- Mocking for unit tests
- Clean separation between orchestration and execution
Build status: Compiles successfully
Updated api/ws_compat.go to properly delegate to api/ws package:
- NewWSHandler returns http.Handler interface (not interface{})
- All Opcode* constants re-exported from ws package
- Maintains backward compatibility for existing tests
This allows gradual migration of tests to use api/ws directly without
breaking the build. Tests can be updated incrementally.
Build status: Compiles successfully
- Add native_queue.go with CGO bindings for queue operations
- Add native_queue_stub.go for non-CGO builds
- Add hash_selector to choose between Go and native implementations
- Add native_bridge_libs.go for CGO builds with native_libs tag
- Add native_bridge_nocgo.go stub for non-CGO builds
- Update queue errors and task handling for native integration
- Update worker config and runloop for native library support