Add new scheduler component for distributed ML workload orchestration:
- Hub-based coordination for multi-worker clusters
- Pacing controller for rate limiting job submissions
- Priority queue with preemption support
- Port allocator for dynamic service discovery
- Protocol handlers for worker-scheduler communication
- Service manager with OS-specific implementations
- Connection management and state persistence
- Template system for service deployment
Includes comprehensive test suite:
- Unit tests for all core components
- Integration tests for distributed scenarios
- Benchmark tests for performance validation
- Mock fixtures for isolated testing
Refs: scheduler-architecture.md
Bug fixes and cleanup for test infrastructure:
- schema_test.go: Fix SchemaVersion reference with proper manifest import
- schema_test.go: Update all schema.json paths to internal/manifest location
- manifestenv.go: Remove unused helper functions (isArtifactsType, getPackagePath)
- nobaredetector.go: Fix exprToString syntax error, remove unused functions
All tests now pass without errors or warnings
Implement reproducibility crossover requirements:
- TestManifestEnvironmentCapture: Environment population with ConfigHash and DetectionMethod
- TestConfigHashPostDefaults: Hash computation after env expansion and defaults
Verifies manifest.Environment is properly populated for reproducibility tracking
Rename and enhance existing tests to align with coverage map:
- TestGPUDetectorAMDVendorAlias -> TestAMDAliasManifestRecord
- TestScanArtifacts_SkipsKnownPathsAndLogs -> TestScanExclusionsRecorded
- Add env var expansion verification to TestHIPAAValidation_InlineCredentials
- Record exclusions in manifest.Artifacts for audit trail
Fix ValidatePath to correctly resolve symlinks and handle edge cases:
- Resolve symlinks before boundary check to prevent traversal
- Handle macOS /private prefix correctly
- Add fallback for non-existent paths (parent directory resolution)
- Double boundary checks: before AND after symlink resolution
- Prevent race conditions between check and use
Update path traversal tests:
- Correct test expectations for "..." (three dots is valid filename, not traversal)
- Add tests for symlink escape attempts
- Add unicode attack tests
- Add deeply nested traversal tests
Security impact: Prevents path traversal via symlink following in artifact
scanning and other file operations.
Add MaxArtifactFiles and MaxArtifactTotalBytes to SandboxConfig:
- Default MaxArtifactFiles: 10,000 (configurable via SecurityDefaults)
- Default MaxArtifactTotalBytes: 100GB (configurable via SecurityDefaults)
- ApplySecurityDefaults() sets defaults if not specified
Enforce caps in scanArtifacts() during directory walk:
- Returns error immediately when MaxArtifactFiles exceeded
- Returns error immediately when MaxArtifactTotalBytes exceeded
- Prevents resource exhaustion attacks from malicious artifact trees
Update all call sites to pass SandboxConfig for cap enforcement:
- Native bridge libs updated to pass caps argument
- Benchmark tests updated with nil caps (unlimited for benchmarks)
- Unit tests updated with nil caps
Closes: artifact ingestion caps items from security plan
- Surface GPUDetectionInfo from parseGPUCountFromConfig for detection metadata
- Document FETCH_ML_TOTAL_CPU and FETCH_ML_GPU_SLOTS_PER_GPU env vars
- Add debug logging for all env var overrides to stderr
- Track config-layer auto-detection in GPUDetectionInfo.ConfigLayerAutoDetected
- Add --include-all flag to artifact scanner (includeAll parameter)
- Add AMD production mode enforcement (error in non-local mode)
- Add GPU detector unit tests for env overrides and AMD aliasing
- Replace worker.DirOverallSHA256HexParallel with worker.DirOverallSHA256Hex
- Fixes in dataset_hash_bench_test.go and hash_bench_test.go
- All benchmarks pass with native_libs build tag
- Fix duplicate check in security_test.go lint warning
- Mark SHA256 tests as Legacy for backward compatibility
- Convert TODO comments to documentation (task, handlers, privacy)
- Update user_manager_test to use GenerateAPIKey pattern
Reorganize tests for better structure and coverage:
- Move container/security_test.go from internal/ to tests/unit/container/
- Move related tests to proper unit test locations
- Delete orphaned test files (startup_blacklist_test.go)
- Add privacy middleware unit tests
- Add worker config unit tests
- Update E2E tests for homelab and websocket scenarios
- Update test fixtures with utility functions
- Add CLI helper script for arraylist fixes
- Move queue_spec_test.go from internal/queue/ to tests/unit/queue/
- Update imports to use github.com/jfraeys/fetch_ml/internal/queue
- Remove duplicate docker-compose.dev.yml from root (exists in deployments/)
- Fix spec tests: add required Status field, JobName field
- Fix loop variable capture in priority ordering test
- Fix missing closing brace between test functions
- Fix existing queue_test.go: change 50ms to 1s for Redis min duration
All tests pass: go test ./tests/unit/queue/...
- VerifySnapshot: SHA256 verification using integrity package
- EnforceTaskProvenance: Strict and best-effort provenance validation
- RunJupyterTask: Full Jupyter service lifecycle (start/stop/remove/restore/list_packages)
- RunJob: Job execution using executor.JobRunner
- PrewarmNextOnce: Prewarming with queue integration
All methods now use new architecture components instead of placeholders
- Changed package from worker to worker_test to match other test files
- Updated all type references to use worker.* prefix
- Fixed Worker field access to use exported fields (ID, Config, etc.)
Build status: Compiles successfully
Removed tests/unit/jupyter/ws_jupyter_errorcode_test.go which referenced
non-existent api.JupyterTaskErrorCode function.
This test was validating functionality that was removed during Phase 5
API refactoring. The jupyter error code logic is now handled in the
api/jupyter/ package.
Build status: Compiles successfully
- Add logs and debug end-to-end tests
- Add test helper utilities
- Improve test fixtures and templates
- Update API server and config lint commands
- Add multi-user database initialization
- Fix YAML tags in auth config struct (json -> yaml)
- Update CLI configs to use pre-hashed API keys
- Remove double hashing in WebSocket client
- Fix port mapping (9102 -> 9103) in CLI commands
- Update permission keys to use jobs:read, jobs:create, etc.
- Clean up all debug logging from CLI and server
- All user roles now authenticate correctly:
* Admin: Can queue jobs and see all jobs
* Researcher: Can queue jobs and see own jobs
* Analyst: Can see status (read-only access)
Multi-user authentication is now fully functional.
- Add end-to-end tests for complete workflow validation
- Include integration tests for API and database interactions
- Add unit tests for all major components and utilities
- Include performance tests for payload handling
- Add CLI API integration tests
- Include Podman container integration tests
- Add WebSocket and queue execution tests
- Include shell script tests for setup validation
Provides comprehensive test coverage ensuring platform reliability
and functionality across all components and interactions.