Implement VaultProvider with Transit engine:
- AppRole, Kubernetes, and Token authentication
- Encrypt/Decrypt via /transit/encrypt and /transit/decrypt
- Key lifecycle via /transit/keys API
- Health check via /sys/health
Implement AWSProvider with SDK v2:
- Per-region key naming with alias prefix
- Encrypt/Decrypt via KMS SDK
- Key lifecycle (CreateKey, Disable, ScheduleDeletion, Enable)
- AWS endpoint support for LocalStack testing
Add KMSProvider interface for external key management systems:
- Encrypt/Decrypt operations for DEK wrapping
- Key lifecycle management (Create, Disable, ScheduleDeletion, Enable)
- HealthCheck and Close methods
Implement MemoryProvider for development/testing:
- XOR encryption with HMAC-SHA256 authentication
- Secure random key generation using crypto/rand
- MAC verification to detect wrong keys
Implement DEKCache per ADR-012:
- 15-minute TTL with configurable grace window (1 hour)
- LRU eviction with 1000 entry limit
- Cache key includes (tenantID, artifactID, kmsKeyID) for isolation
- Thread-safe operations with RWMutex
- Secure memory wiping on eviction/cleanup
Add config package with types:
- ProviderType enum (vault, aws, memory)
- VaultConfig with AppRole/Kubernetes/Token auth
- AWSConfig with region and alias prefix
- CacheConfig with TTL, MaxEntries, GraceWindow
- Validation methods for all config types
- Add plugin_quota.go with GPU quota management for scheduler
- Update scheduler hub and protocol for plugin support
- Add comprehensive plugin quota unit tests
- Update gang service and WebSocket queue integration tests
Remove deprecated components replaced by new scheduler:
- Delete internal/controller/pacing_controller.go (replaced by scheduler/pacing.go)
- Delete internal/manifest/schema_test.go (consolidated into tests/unit/)
- Delete internal/workertest/worker.go (consolidated into tests/fixtures/)
- Update .gitignore with scheduler binary and new patterns
Update worker system for scheduler integration:
- Worker server with scheduler registration
- Configuration with scheduler endpoint support
- Artifact handling with integrity verification
- Container executor with supply chain validation
- Local executor enhancements
- GPU detection improvements (cross-platform)
- Error handling with execution context
- Factory pattern for executor instantiation
- Hash integrity with native library support
Update authentication system for multi-tenant support:
- API key management with tenant scoping
- Permission checks for multi-tenant operations
- Database layer with tenant isolation
- Keychain integration with audit logging
Update API layer for scheduler integration:
- WebSocket handlers with scheduler protocol support
- Jobs WebSocket endpoint with priority queue integration
- Validation middleware for scheduler messages
- Server configuration with security hardening
- Protocol definitions for worker-scheduler communication
- Dataset handlers with tenant isolation checks
- Response helpers with audit context
- OpenAPI spec updates for new endpoints
Extend worker capabilities with new execution plugins and security features:
- Jupyter plugin for notebook-based ML experiments
- vLLM plugin for LLM inference workloads
- Cross-platform process isolation (Unix/Windows)
- Network policy enforcement with platform-specific implementations
- Service manager integration for lifecycle management
- Scheduler backend integration for queue coordination
Update lifecycle management:
- Enhanced runloop with state transitions
- Service manager integration for plugin coordination
- Improved state persistence and recovery
Add test coverage:
- Unit tests for Jupyter and vLLM plugins
- Updated worker execution tests
Add new scheduler component for distributed ML workload orchestration:
- Hub-based coordination for multi-worker clusters
- Pacing controller for rate limiting job submissions
- Priority queue with preemption support
- Port allocator for dynamic service discovery
- Protocol handlers for worker-scheduler communication
- Service manager with OS-specific implementations
- Connection management and state persistence
- Template system for service deployment
Includes comprehensive test suite:
- Unit tests for all core components
- Integration tests for distributed scenarios
- Benchmark tests for performance validation
- Mock fixtures for isolated testing
Refs: scheduler-architecture.md
Rename and enhance existing tests to align with coverage map:
- TestGPUDetectorAMDVendorAlias -> TestAMDAliasManifestRecord
- TestScanArtifacts_SkipsKnownPathsAndLogs -> TestScanExclusionsRecorded
- Add env var expansion verification to TestHIPAAValidation_InlineCredentials
- Record exclusions in manifest.Artifacts for audit trail
Fix ValidatePath to correctly resolve symlinks and handle edge cases:
- Resolve symlinks before boundary check to prevent traversal
- Handle macOS /private prefix correctly
- Add fallback for non-existent paths (parent directory resolution)
- Double boundary checks: before AND after symlink resolution
- Prevent race conditions between check and use
Update path traversal tests:
- Correct test expectations for "..." (three dots is valid filename, not traversal)
- Add tests for symlink escape attempts
- Add unicode attack tests
- Add deeply nested traversal tests
Security impact: Prevents path traversal via symlink following in artifact
scanning and other file operations.
Add MaxArtifactFiles and MaxArtifactTotalBytes to SandboxConfig:
- Default MaxArtifactFiles: 10,000 (configurable via SecurityDefaults)
- Default MaxArtifactTotalBytes: 100GB (configurable via SecurityDefaults)
- ApplySecurityDefaults() sets defaults if not specified
Enforce caps in scanArtifacts() during directory walk:
- Returns error immediately when MaxArtifactFiles exceeded
- Returns error immediately when MaxArtifactTotalBytes exceeded
- Prevents resource exhaustion attacks from malicious artifact trees
Update all call sites to pass SandboxConfig for cap enforcement:
- Native bridge libs updated to pass caps argument
- Benchmark tests updated with nil caps (unlimited for benchmarks)
- Unit tests updated with nil caps
Closes: artifact ingestion caps items from security plan
**Cleanup:**
- Delete internal/worker/testutil.go (150 lines of unused test utilities)
- Remove unused stateDir() function from internal/jupyter/service_manager.go
- Silence unused variable warning in internal/worker/executor/container.go
- Surface GPUDetectionInfo from parseGPUCountFromConfig for detection metadata
- Document FETCH_ML_TOTAL_CPU and FETCH_ML_GPU_SLOTS_PER_GPU env vars
- Add debug logging for all env var overrides to stderr
- Track config-layer auto-detection in GPUDetectionInfo.ConfigLayerAutoDetected
- Add --include-all flag to artifact scanner (includeAll parameter)
- Add AMD production mode enforcement (error in non-local mode)
- Add GPU detector unit tests for env overrides and AMD aliasing
- Add native/nvml_gpu/ C++ library wrapping NVIDIA Management Library
- Add Go bindings in internal/worker/gpu_nvml_native.go and gpu_nvml_stub.go
- Update gpu_detector.go to use NVML for accurate GPU count detection
- Update native/CMakeLists.txt to build nvml_gpu library
- Provides real-time GPU utilization, memory, temperature, clocks, power
- Falls back to environment variable when NVML unavailable
- Remove duplicate hash_selector.go (build tags handle switching)
- Fix benchmark to use worker.DirOverallSHA256Hex
- Fix snapshot_store.go to use integrity.DirOverallSHA256Hex directly
- Native tests pass, benchmarks now correctly test native vs Go
Go Worker (internal/worker/native_bridge_libs.go):
- Add global hashCtx with sync.Once for lazy initialization
- Eliminates 5-20ms fh_init/fh_cleanup per hash operation
- Uses runtime.NumCPU() for optimal thread count
- Log initialization time for observability
Zig CLI (cli/src/native/hash.zig):
- Add global_ctx with atomic flag and mutex
- Thread-safe initialization with double-check pattern
- Idempotent init() callable from multiple threads
- Log init time for debugging
Replace FETCHML_NATIVE_LIBS=1 environment variable with -tags native_libs:
Changes:
- internal/queue/native_queue.go: UseNativeQueue is now const true
- internal/queue/native_queue_stub.go: UseNativeQueue is now const false
- build/docker/simple.Dockerfile: Add -tags native_libs to go build
- deployments/docker-compose.dev.yml: Remove FETCHML_NATIVE_LIBS env var
- native/README.md: Update documentation for build tags
- scripts/test-native-with-redis.sh: New test script with Redis via docker-compose
Benefits:
- Compile-time enforcement (no runtime checks needed)
- Cleaner deployment (no env var management)
- Type safety (const vs var)
- Simpler testing with docker-compose Redis integration
- Fix duplicate check in security_test.go lint warning
- Mark SHA256 tests as Legacy for backward compatibility
- Convert TODO comments to documentation (task, handlers, privacy)
- Update user_manager_test to use GenerateAPIKey pattern
Reorganize tests for better structure and coverage:
- Move container/security_test.go from internal/ to tests/unit/container/
- Move related tests to proper unit test locations
- Delete orphaned test files (startup_blacklist_test.go)
- Add privacy middleware unit tests
- Add worker config unit tests
- Update E2E tests for homelab and websocket scenarios
- Update test fixtures with utility functions
- Add CLI helper script for arraylist fixes
Add comprehensive research context tracking to jobs:
- Narrative fields: hypothesis, context, intent, expected_outcome
- Experiment groups and tags for organization
- Run comparison (compare command) for diff analysis
- Run search (find command) with criteria filtering
- Run export (export command) for data portability
- Outcome setting (outcome command) for experiment validation
Update queue and requeue commands to support narrative fields.
Add narrative validation to manifest validator.
Add WebSocket handlers for compare, find, export, and outcome operations.
Includes E2E tests for phase 2 features.