Extend worker capabilities with new execution plugins and security features:
- Jupyter plugin for notebook-based ML experiments
- vLLM plugin for LLM inference workloads
- Cross-platform process isolation (Unix/Windows)
- Network policy enforcement with platform-specific implementations
- Service manager integration for lifecycle management
- Scheduler backend integration for queue coordination
Update lifecycle management:
- Enhanced runloop with state transitions
- Service manager integration for plugin coordination
- Improved state persistence and recovery
Add test coverage:
- Unit tests for Jupyter and vLLM plugins
- Updated worker execution tests
Add new scheduler component for distributed ML workload orchestration:
- Hub-based coordination for multi-worker clusters
- Pacing controller for rate limiting job submissions
- Priority queue with preemption support
- Port allocator for dynamic service discovery
- Protocol handlers for worker-scheduler communication
- Service manager with OS-specific implementations
- Connection management and state persistence
- Template system for service deployment
Includes comprehensive test suite:
- Unit tests for all core components
- Integration tests for distributed scenarios
- Benchmark tests for performance validation
- Mock fixtures for isolated testing
Refs: scheduler-architecture.md
Copy just promtail-config.yml to temp root instead of entire monitoring/
directory. This fixes the mount error where promtail couldn't find its
config at the expected path.
Replace all .. with proper relative paths:
- Build context: Use '.' (current directory = project root when using --project-directory)
- Volume mounts: Use './data/...' instead of '../data/...'
- Config mounts: Use './configs/...' instead of '../configs/...'
The '..' fallback was incorrect - when --project-directory is set to repo root,
'..' would point to parent of repo instead of repo itself. Using '.' or
'./path' correctly resolves relative to project root.
Environment variables for data directories (SMOKE_TEST_DATA_DIR, PROD_DATA_DIR,
HOMELAB_DATA_DIR, LOCAL_DATA_DIR) are preserved for runtime customization.
Ensure FETCHML_REPO_ROOT is set in the env file passed to docker-compose.
This fixes path resolution so fallback paths don't incorrectly use parent directory.
Update all docker-compose files to use environment variables for data paths:
- docker-compose.local.yml: Use LOCAL_DATA_DIR with fallback to ../data/dev
- docker-compose.prod.yml: Use PROD_DATA_DIR with fallback to data/prod
- docker-compose.prod.smoke.yml: Use SMOKE_TEST_DATA_DIR with fallback
This allows smoke tests and local development to use temp directories
instead of repo-relative paths, avoiding file sharing permission issues
on macOS with Docker Desktop or Colima.
Promtail mounts monitoring configs from repo root which fails in Colima:
- Copy monitoring/ directory to temp SMOKE_TEST_DATA_DIR
- Update promtail volume path to use SMOKE_TEST_DATA_DIR for configs
- This ensures all mounts are from accessible temp directories
Use /tmp for smoke test data to avoid file sharing issues on macOS/Colima:
- smoke-test.sh: Create temp dir with mktemp, export SMOKE_TEST_DATA_DIR
- docker-compose.dev.yml: Use SMOKE_TEST_DATA_DIR with fallback to data/dev
- Remove file sharing permission checks (no longer needed with tmp)
This avoids Docker Desktop/Colima file sharing permission issues entirely
by using a system temp directory that's always accessible.
Detect if user is running Colima and provide appropriate fix instructions:
- Check for colima command presence
- If Colima detected: suggest virtiofs/sshfs mount options
- Show colima.yaml mount configuration example
- Include verification command: colima ssh -- ls ...
Maintains Docker Desktop instructions for non-Colima users.
Dockerfile targets systems without GPUs:
- Add -DBUILD_NVML_GPU=OFF to cmake in simple.Dockerfile
- Add BUILD_NVML_GPU option to native/CMakeLists.txt (default ON)
- Conditionally include nvml_gpu subdirectory
- Update all_native_libs target to exclude nvml_gpu when disabled
This allows native libraries (dataset_hash, queue_index) to build
without requiring NVIDIA drivers/libraries.
Fix build context resolution in smoke test scripts:
- docker-compose.dev.yml: Use ${FETCHML_REPO_ROOT:-..} for api-server and worker
- docker-compose.prod.smoke.yml: Simplify dockerfile path (remove redundant FETCHML_REPO_ROOT)
Previously used 'context: ..' which resolved incorrectly when docker-compose
was run with --project-directory. Now consistently uses FETCHML_REPO_ROOT env var
for proper path resolution in both dev and prod smoke tests.
Bug fixes and cleanup for test infrastructure:
- schema_test.go: Fix SchemaVersion reference with proper manifest import
- schema_test.go: Update all schema.json paths to internal/manifest location
- manifestenv.go: Remove unused helper functions (isArtifactsType, getPackagePath)
- nobaredetector.go: Fix exprToString syntax error, remove unused functions
All tests now pass without errors or warnings
Implement reproducibility crossover requirements:
- TestManifestEnvironmentCapture: Environment population with ConfigHash and DetectionMethod
- TestConfigHashPostDefaults: Hash computation after env expansion and defaults
Verifies manifest.Environment is properly populated for reproducibility tracking
Rename and enhance existing tests to align with coverage map:
- TestGPUDetectorAMDVendorAlias -> TestAMDAliasManifestRecord
- TestScanArtifacts_SkipsKnownPathsAndLogs -> TestScanExclusionsRecorded
- Add env var expansion verification to TestHIPAAValidation_InlineCredentials
- Record exclusions in manifest.Artifacts for audit trail
Fix ValidatePath to correctly resolve symlinks and handle edge cases:
- Resolve symlinks before boundary check to prevent traversal
- Handle macOS /private prefix correctly
- Add fallback for non-existent paths (parent directory resolution)
- Double boundary checks: before AND after symlink resolution
- Prevent race conditions between check and use
Update path traversal tests:
- Correct test expectations for "..." (three dots is valid filename, not traversal)
- Add tests for symlink escape attempts
- Add unicode attack tests
- Add deeply nested traversal tests
Security impact: Prevents path traversal via symlink following in artifact
scanning and other file operations.
Add MaxArtifactFiles and MaxArtifactTotalBytes to SandboxConfig:
- Default MaxArtifactFiles: 10,000 (configurable via SecurityDefaults)
- Default MaxArtifactTotalBytes: 100GB (configurable via SecurityDefaults)
- ApplySecurityDefaults() sets defaults if not specified
Enforce caps in scanArtifacts() during directory walk:
- Returns error immediately when MaxArtifactFiles exceeded
- Returns error immediately when MaxArtifactTotalBytes exceeded
- Prevents resource exhaustion attacks from malicious artifact trees
Update all call sites to pass SandboxConfig for cap enforcement:
- Native bridge libs updated to pass caps argument
- Benchmark tests updated with nil caps (unlimited for benchmarks)
- Unit tests updated with nil caps
Closes: artifact ingestion caps items from security plan
**scripts/benchmarks/run-benchmarks-local.sh:**
- Add support for native library benchmarks
**scripts/ci/test.sh:**
- Update CI test commands for new test structure
**scripts/dev/smoke-test.sh:**
- Improve smoke test reliability and output
**Payload Performance Test:**
- Add job cleanup after each iteration using DeleteJob()
- Ensure isolated memory measurements between test runs
**All Benchmark Tests:**
- General improvements and maintenance updates
**Makefile:**
- Add native build targets and test infrastructure
- Update benchmark and CI test commands
**cli/build.zig:**
- Build configuration updates for CLI compilation
**Cleanup:**
- Delete internal/worker/testutil.go (150 lines of unused test utilities)
- Remove unused stateDir() function from internal/jupyter/service_manager.go
- Silence unused variable warning in internal/worker/executor/container.go
Reduce worker polling interval from 5ms to 1ms for faster task pickup
Add 100ms buffer after job submission to allow queue to settle
Increase timeout from 30s to 60s to prevent flaky failures
Fixes intermittent timeout issues in integration tests
Add @mkdir -p tests/bin to profile-load, profile-load-norate, and profile-ws-queue targets
Fixes 'no such file or directory' error when writing CPU profile files
Replace bind mount with Docker named volume for Redis data
This fixes 'operation not permitted' errors on macOS Docker Desktop
where bind mounts fail due to file sharing restrictions