Remove deprecated components replaced by new scheduler:
- Delete internal/controller/pacing_controller.go (replaced by scheduler/pacing.go)
- Delete internal/manifest/schema_test.go (consolidated into tests/unit/)
- Delete internal/workertest/worker.go (consolidated into tests/fixtures/)
- Update .gitignore with scheduler binary and new patterns
Update comprehensive test coverage:
- E2E tests with scheduler integration
- Integration tests with tenant isolation
- Unit tests with security assertions
- Security tests with audit validation
- Audit verification tests
- Auth tests with tenant scoping
- Config validation tests
- Container security tests
- Worker tests with scheduler mock
- Environment pool tests
- Load tests with distributed patterns
- Test fixtures with scheduler support
- Update go.mod/go.sum with new dependencies
Update worker system for scheduler integration:
- Worker server with scheduler registration
- Configuration with scheduler endpoint support
- Artifact handling with integrity verification
- Container executor with supply chain validation
- Local executor enhancements
- GPU detection improvements (cross-platform)
- Error handling with execution context
- Factory pattern for executor instantiation
- Hash integrity with native library support
Update authentication system for multi-tenant support:
- API key management with tenant scoping
- Permission checks for multi-tenant operations
- Database layer with tenant isolation
- Keychain integration with audit logging
Update API layer for scheduler integration:
- WebSocket handlers with scheduler protocol support
- Jobs WebSocket endpoint with priority queue integration
- Validation middleware for scheduler messages
- Server configuration with security hardening
- Protocol definitions for worker-scheduler communication
- Dataset handlers with tenant isolation checks
- Response helpers with audit context
- OpenAPI spec updates for new endpoints
Extend worker capabilities with new execution plugins and security features:
- Jupyter plugin for notebook-based ML experiments
- vLLM plugin for LLM inference workloads
- Cross-platform process isolation (Unix/Windows)
- Network policy enforcement with platform-specific implementations
- Service manager integration for lifecycle management
- Scheduler backend integration for queue coordination
Update lifecycle management:
- Enhanced runloop with state transitions
- Service manager integration for plugin coordination
- Improved state persistence and recovery
Add test coverage:
- Unit tests for Jupyter and vLLM plugins
- Updated worker execution tests
Add new scheduler component for distributed ML workload orchestration:
- Hub-based coordination for multi-worker clusters
- Pacing controller for rate limiting job submissions
- Priority queue with preemption support
- Port allocator for dynamic service discovery
- Protocol handlers for worker-scheduler communication
- Service manager with OS-specific implementations
- Connection management and state persistence
- Template system for service deployment
Includes comprehensive test suite:
- Unit tests for all core components
- Integration tests for distributed scenarios
- Benchmark tests for performance validation
- Mock fixtures for isolated testing
Refs: scheduler-architecture.md
Copy just promtail-config.yml to temp root instead of entire monitoring/
directory. This fixes the mount error where promtail couldn't find its
config at the expected path.
Replace all .. with proper relative paths:
- Build context: Use '.' (current directory = project root when using --project-directory)
- Volume mounts: Use './data/...' instead of '../data/...'
- Config mounts: Use './configs/...' instead of '../configs/...'
The '..' fallback was incorrect - when --project-directory is set to repo root,
'..' would point to parent of repo instead of repo itself. Using '.' or
'./path' correctly resolves relative to project root.
Environment variables for data directories (SMOKE_TEST_DATA_DIR, PROD_DATA_DIR,
HOMELAB_DATA_DIR, LOCAL_DATA_DIR) are preserved for runtime customization.
Ensure FETCHML_REPO_ROOT is set in the env file passed to docker-compose.
This fixes path resolution so fallback paths don't incorrectly use parent directory.
Update all docker-compose files to use environment variables for data paths:
- docker-compose.local.yml: Use LOCAL_DATA_DIR with fallback to ../data/dev
- docker-compose.prod.yml: Use PROD_DATA_DIR with fallback to data/prod
- docker-compose.prod.smoke.yml: Use SMOKE_TEST_DATA_DIR with fallback
This allows smoke tests and local development to use temp directories
instead of repo-relative paths, avoiding file sharing permission issues
on macOS with Docker Desktop or Colima.
Promtail mounts monitoring configs from repo root which fails in Colima:
- Copy monitoring/ directory to temp SMOKE_TEST_DATA_DIR
- Update promtail volume path to use SMOKE_TEST_DATA_DIR for configs
- This ensures all mounts are from accessible temp directories
Use /tmp for smoke test data to avoid file sharing issues on macOS/Colima:
- smoke-test.sh: Create temp dir with mktemp, export SMOKE_TEST_DATA_DIR
- docker-compose.dev.yml: Use SMOKE_TEST_DATA_DIR with fallback to data/dev
- Remove file sharing permission checks (no longer needed with tmp)
This avoids Docker Desktop/Colima file sharing permission issues entirely
by using a system temp directory that's always accessible.
Detect if user is running Colima and provide appropriate fix instructions:
- Check for colima command presence
- If Colima detected: suggest virtiofs/sshfs mount options
- Show colima.yaml mount configuration example
- Include verification command: colima ssh -- ls ...
Maintains Docker Desktop instructions for non-Colima users.
Dockerfile targets systems without GPUs:
- Add -DBUILD_NVML_GPU=OFF to cmake in simple.Dockerfile
- Add BUILD_NVML_GPU option to native/CMakeLists.txt (default ON)
- Conditionally include nvml_gpu subdirectory
- Update all_native_libs target to exclude nvml_gpu when disabled
This allows native libraries (dataset_hash, queue_index) to build
without requiring NVIDIA drivers/libraries.
Fix build context resolution in smoke test scripts:
- docker-compose.dev.yml: Use ${FETCHML_REPO_ROOT:-..} for api-server and worker
- docker-compose.prod.smoke.yml: Simplify dockerfile path (remove redundant FETCHML_REPO_ROOT)
Previously used 'context: ..' which resolved incorrectly when docker-compose
was run with --project-directory. Now consistently uses FETCHML_REPO_ROOT env var
for proper path resolution in both dev and prod smoke tests.
Bug fixes and cleanup for test infrastructure:
- schema_test.go: Fix SchemaVersion reference with proper manifest import
- schema_test.go: Update all schema.json paths to internal/manifest location
- manifestenv.go: Remove unused helper functions (isArtifactsType, getPackagePath)
- nobaredetector.go: Fix exprToString syntax error, remove unused functions
All tests now pass without errors or warnings
Implement reproducibility crossover requirements:
- TestManifestEnvironmentCapture: Environment population with ConfigHash and DetectionMethod
- TestConfigHashPostDefaults: Hash computation after env expansion and defaults
Verifies manifest.Environment is properly populated for reproducibility tracking
Rename and enhance existing tests to align with coverage map:
- TestGPUDetectorAMDVendorAlias -> TestAMDAliasManifestRecord
- TestScanArtifacts_SkipsKnownPathsAndLogs -> TestScanExclusionsRecorded
- Add env var expansion verification to TestHIPAAValidation_InlineCredentials
- Record exclusions in manifest.Artifacts for audit trail
Fix ValidatePath to correctly resolve symlinks and handle edge cases:
- Resolve symlinks before boundary check to prevent traversal
- Handle macOS /private prefix correctly
- Add fallback for non-existent paths (parent directory resolution)
- Double boundary checks: before AND after symlink resolution
- Prevent race conditions between check and use
Update path traversal tests:
- Correct test expectations for "..." (three dots is valid filename, not traversal)
- Add tests for symlink escape attempts
- Add unicode attack tests
- Add deeply nested traversal tests
Security impact: Prevents path traversal via symlink following in artifact
scanning and other file operations.
Add MaxArtifactFiles and MaxArtifactTotalBytes to SandboxConfig:
- Default MaxArtifactFiles: 10,000 (configurable via SecurityDefaults)
- Default MaxArtifactTotalBytes: 100GB (configurable via SecurityDefaults)
- ApplySecurityDefaults() sets defaults if not specified
Enforce caps in scanArtifacts() during directory walk:
- Returns error immediately when MaxArtifactFiles exceeded
- Returns error immediately when MaxArtifactTotalBytes exceeded
- Prevents resource exhaustion attacks from malicious artifact trees
Update all call sites to pass SandboxConfig for cap enforcement:
- Native bridge libs updated to pass caps argument
- Benchmark tests updated with nil caps (unlimited for benchmarks)
- Unit tests updated with nil caps
Closes: artifact ingestion caps items from security plan
**scripts/benchmarks/run-benchmarks-local.sh:**
- Add support for native library benchmarks
**scripts/ci/test.sh:**
- Update CI test commands for new test structure
**scripts/dev/smoke-test.sh:**
- Improve smoke test reliability and output
**Payload Performance Test:**
- Add job cleanup after each iteration using DeleteJob()
- Ensure isolated memory measurements between test runs
**All Benchmark Tests:**
- General improvements and maintenance updates
**Makefile:**
- Add native build targets and test infrastructure
- Update benchmark and CI test commands
**cli/build.zig:**
- Build configuration updates for CLI compilation