Add MaxArtifactFiles and MaxArtifactTotalBytes to SandboxConfig:
- Default MaxArtifactFiles: 10,000 (configurable via SecurityDefaults)
- Default MaxArtifactTotalBytes: 100GB (configurable via SecurityDefaults)
- ApplySecurityDefaults() sets defaults if not specified
Enforce caps in scanArtifacts() during directory walk:
- Returns error immediately when MaxArtifactFiles exceeded
- Returns error immediately when MaxArtifactTotalBytes exceeded
- Prevents resource exhaustion attacks from malicious artifact trees
Update all call sites to pass SandboxConfig for cap enforcement:
- Native bridge libs updated to pass caps argument
- Benchmark tests updated with nil caps (unlimited for benchmarks)
- Unit tests updated with nil caps
Closes: artifact ingestion caps item from the security plan
**Cleanup:**
- Delete internal/worker/testutil.go (150 lines of unused test utilities)
- Remove unused stateDir() function from internal/jupyter/service_manager.go
- Silence unused variable warning in internal/worker/executor/container.go
- Surface GPUDetectionInfo from parseGPUCountFromConfig for detection metadata
- Document FETCH_ML_TOTAL_CPU and FETCH_ML_GPU_SLOTS_PER_GPU env vars
- Add debug logging for all env var overrides to stderr
- Track config-layer auto-detection in GPUDetectionInfo.ConfigLayerAutoDetected
- Add --include-all flag to artifact scanner (includeAll parameter)
- Add AMD production mode enforcement (error in non-local mode)
- Add GPU detector unit tests for env overrides and AMD aliasing
- Add native/nvml_gpu/ C++ library wrapping NVIDIA Management Library
- Add Go bindings in internal/worker/gpu_nvml_native.go and gpu_nvml_stub.go
- Update gpu_detector.go to use NVML for accurate GPU count detection
- Update native/CMakeLists.txt to build nvml_gpu library
- Provides real-time GPU utilization, memory, temperature, clocks, power
- Falls back to environment variable when NVML unavailable
- Remove duplicate hash_selector.go (build tags handle switching)
- Fix benchmark to use worker.DirOverallSHA256Hex
- Fix snapshot_store.go to use integrity.DirOverallSHA256Hex directly
- Native tests pass; benchmarks now correctly compare native vs Go
Go Worker (internal/worker/native_bridge_libs.go):
- Add global hashCtx with sync.Once for lazy initialization
- Eliminates 5-20ms fh_init/fh_cleanup per hash operation
- Uses runtime.NumCPU() for optimal thread count
- Log initialization time for observability
Zig CLI (cli/src/native/hash.zig):
- Add global_ctx with atomic flag and mutex
- Thread-safe initialization with double-check pattern
- Idempotent init() callable from multiple threads
- Log init time for debugging
Replace FETCHML_NATIVE_LIBS=1 environment variable with -tags native_libs:
Changes:
- internal/queue/native_queue.go: UseNativeQueue is now const true
- internal/queue/native_queue_stub.go: UseNativeQueue is now const false
- build/docker/simple.Dockerfile: Add -tags native_libs to go build
- deployments/docker-compose.dev.yml: Remove FETCHML_NATIVE_LIBS env var
- native/README.md: Update documentation for build tags
- scripts/test-native-with-redis.sh: New test script with Redis via docker-compose
Benefits:
- Compile-time enforcement (no runtime checks needed)
- Cleaner deployment (no env var management)
- Type safety (const vs var)
- Simpler testing with docker-compose Redis integration
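A sketch of what the stub side of the build-tag split might look like. The constant name and the tag come from this change; queueBackend is a hypothetical helper, and package main is used here only so the file runs standalone (the real file lives in internal/queue).

```go
//go:build !native_libs

// Sketch of native_queue_stub.go: compiled only when the native_libs tag is
// absent. The counterpart file guarded by "//go:build native_libs" would
// declare the same constant as true.
package main

import "fmt"

// UseNativeQueue is fixed at compile time by the build tag.
const UseNativeQueue = false

// queueBackend is a hypothetical helper; because UseNativeQueue is a constant,
// the compiler can drop the branch that is never taken.
func queueBackend() string {
	if UseNativeQueue {
		return "native"
	}
	return "go"
}

func main() {
	fmt.Println("queue backend:", queueBackend())
}
```

Building with `go build -tags native_libs` excludes this file and selects the counterpart instead, so exactly one definition of the constant exists in any build.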
- Fix lint warning about a duplicate check in security_test.go
- Mark SHA256 tests as Legacy for backward compatibility
- Convert TODO comments to documentation (task, handlers, privacy)
- Update user_manager_test to use GenerateAPIKey pattern
Reorganize tests for better structure and coverage:
- Move container/security_test.go from internal/ to tests/unit/container/
- Move related tests to proper unit test locations
- Delete orphaned test files (startup_blacklist_test.go)
- Add privacy middleware unit tests
- Add worker config unit tests
- Update E2E tests for homelab and websocket scenarios
- Update test fixtures with utility functions
- Add CLI helper script for arraylist fixes
Add comprehensive research context tracking to jobs:
- Narrative fields: hypothesis, context, intent, expected_outcome
- Experiment groups and tags for organization
- Run comparison (compare command) for diff analysis
- Run search (find command) with criteria filtering
- Run export (export command) for data portability
- Outcome setting (outcome command) for experiment validation
Update queue and requeue commands to support narrative fields.
Add narrative validation to manifest validator.
Add WebSocket handlers for compare, find, export, and outcome operations.
Includes E2E tests for phase 2 features.
Update internal/jupyter/workspace_metadata.go to use centralized PathRegistry:
Changes:
- Add import for internal/config package
- Update saveMetadata() to use config.FromEnv() for directory creation
- Replace os.MkdirAll with paths.EnsureDir() for metadata directory
Benefits:
- Consistent directory creation via PathRegistry
- Centralized path management for workspace metadata
- Better error handling for directory creation
Update internal/queue/filesystem_queue.go to use centralized PathRegistry:
Changes:
- Add import for internal/config package
- Update NewFilesystemQueue to use config.FromEnv() for directory creation
- Replace os.MkdirAll with paths.EnsureDir() for all queue directories:
  - pending/entries
  - running
  - finished
  - failed
Benefits:
- Consistent directory creation via PathRegistry
- Centralized path management for queue storage
- Better error handling for directory creation
Update internal/worker/snapshot_store.go to use centralized PathRegistry:
Changes:
- Add import for internal/config package
- Update ResolveSnapshot to use config.FromEnv() for directory creation
- Replace os.MkdirAll with paths.EnsureDir() for tmpRoot
- Replace os.MkdirAll with paths.EnsureDir() for extractDir
- Replace os.MkdirAll with paths.EnsureDir() for cacheDir parent
Benefits:
- Consistent directory creation via PathRegistry
- Centralized path management for snapshot storage
- Better error handling for directory creation
Update internal/worker/config.go to use centralized PathRegistry:
Changes:
- Initialize PathRegistry with config.FromEnv() in LoadConfig
- Update BasePath default to use paths.ExperimentsDir()
- Update DataDir default to use paths.DataDir()
- Simplify DataDir logic by using PathRegistry directly
Benefits:
- Consistent directory locations via PathRegistry
- Centralized path management across worker and api-server
- Simpler configuration with fewer conditional branches
Update internal/api/server_config.go to use centralized PathRegistry:
Changes:
- Update EnsureLogDirectory() to use config.FromEnv().LogDir() with EnsureDir()
- Update Validate() to use PathRegistry for default BasePath and DataDir
- Remove hardcoded /tmp/ml-experiments default
- Use paths.ExperimentsDir() and paths.DataDir() for consistent paths
Benefits:
- Consistent directory locations via PathRegistry
- Centralized directory creation with EnsureDir()
- Better error handling for directory creation
Update internal/experiment/manager.go to use centralized PathRegistry:
Changes:
- Add import for internal/config package
- Add NewManagerFromPaths() constructor using PathRegistry
- Update Initialize() to use config.FromEnv().ExperimentsDir() with EnsureDir()
- Update archiveExperiment() to use PathRegistry pattern
Benefits:
- Consistent experiment directory location via PathRegistry
- Centralized directory creation with EnsureDir()
- Backward compatible: existing NewManager() still works
- New code can use NewManagerFromPaths() for PathRegistry integration
Update internal/jupyter/service_manager.go to use centralized PathRegistry:
Changes:
- Import config package for PathRegistry access
- Update stateDir() to use config.FromEnv().JupyterStateDir()
- Update workspaceBaseDir() to use config.FromEnv().ActiveDataDir()
- Update trashBaseDir() to use config.FromEnv().JupyterStateDir()
- Update NewServiceManager() to use PathRegistry for workspace metadata file
- Update loadServices() to use PathRegistry for services file path
- Update saveServices() to use PathRegistry with EnsureDir()
- Rename parameter 'config' to 'svcConfig' to avoid shadowing import
Benefits:
- Consistent path management across codebase
- Centralized directory creation with EnsureDir()
- Environment variable override still supported (backward compatible)
- Proper error handling for directory creation failures
Add comprehensive Podman secrets support to prevent credential exposure:
New types and methods (internal/container/podman.go):
- PodmanSecret struct for secret definitions
- CreateSecret() - Create Podman secrets from sensitive data
- DeleteSecret() - Clean up secrets after use
- BuildSecretArgs() - Generate podman run arguments for secrets
- SanitizeContainerEnv() - Extract sensitive env vars as secrets
- ContainerConfig.Secrets field for secret list
Enhanced container lifecycle:
- StartContainer() now creates secrets before starting container
- Secrets automatically mounted via --secret flag
- Cleanup on failure to prevent secret leakage
- Secrets logged as count only (not content)
Jupyter service integration (internal/jupyter/service_manager.go):
- prepareContainerConfig() uses SanitizeContainerEnv()
- JUPYTER_TOKEN and JUPYTER_PASSWORD now use secrets
- Maintains backward compatibility with env var mounting
Security benefits:
- Credentials no longer visible in 'podman inspect' output
- Secrets not exposed via /proc/*/environ inside container
- Automatic cleanup prevents secret accumulation
- Compatible with existing Jupyter authentication
- Add stripTokenFromURL() helper function to remove tokens from URLs
- Use it when logging service start URLs
- Use it when logging connectivity test URLs
- Prevents sensitive tokens from being written to log files
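A sketch of what such a helper could look like, assuming the token travels as a `token` query parameter (the usual Jupyter URL form). The behavior on unparseable input is also an assumption; the actual stripTokenFromURL() may differ.

```go
package main

import (
	"fmt"
	"net/url"
)

// stripTokenFromURL returns raw with any "token" query parameter removed,
// so the result is safe to write to logs.
func stripTokenFromURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		// Assumption: never echo input that cannot be inspected.
		return "[unparseable URL redacted]"
	}
	q := u.Query()
	if _, ok := q["token"]; !ok {
		return raw
	}
	q.Del("token")
	u.RawQuery = q.Encode()
	return u.String()
}

func main() {
	fmt.Println(stripTokenFromURL("http://localhost:8888/lab?token=abc123"))
}
```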
- Move queue_spec_test.go from internal/queue/ to tests/unit/queue/
- Update imports to use github.com/jfraeys/fetch_ml/internal/queue
- Remove duplicate docker-compose.dev.yml from root (exists in deployments/)
- Fix spec tests: add required Status field, JobName field
- Fix loop variable capture in priority ordering test
- Fix missing closing brace between test functions
- Fix existing queue_test.go: change 50ms to 1s for Redis min duration
All tests pass: go test ./tests/unit/queue/...
Phase 2: Deterministic Manifests
- Add manifest.Validator with required field checking
- Support Validate() and ValidateStrict() modes
- Integrate validation into worker executor before execution
- Block execution if manifest missing commit_id or deps_manifest_sha256
Phase 5: Pinned Dependencies
- Add hermetic.dockerfile template with pinned system deps
- Frozen package versions: libblas3, libcudnn8, etc.
- Support for deps_manifest.json and requirements.txt with hashes
- Image tagging strategy: deps-<first-8-of-sha256>
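The tagging strategy can be sketched as below, assuming the digest is taken over the raw bytes of the dependency manifest (which file is hashed is not stated here). depsImageTag is a hypothetical helper name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// depsImageTag returns "deps-" followed by the first 8 hex characters of the
// SHA-256 digest of the manifest bytes.
func depsImageTag(manifest []byte) string {
	sum := sha256.Sum256(manifest)
	return "deps-" + hex.EncodeToString(sum[:])[:8]
}

func main() {
	manifest := []byte(`{"numpy": "1.26.4"}`) // illustrative content only
	fmt.Println(depsImageTag(manifest))
}
```

Because the tag is derived from content, two builds with identical pinned dependencies resolve to the same image tag.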
Phase 8: Tests as Specifications
- Add queue_spec_test.go with executable scheduler specs
- Document priority ordering (higher first)
- Document FIFO tiebreaker for same priority
- Test cases for negative/zero priorities
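The ordering rules documented by these specs can be sketched as a comparator: higher priority first, then FIFO within the same priority. The job type and its Seq enqueue counter are hypothetical and do not reflect the queue's real task struct.

```go
package main

import (
	"fmt"
	"sort"
)

// job is a hypothetical, minimal queue entry.
type job struct {
	ID       string
	Priority int
	Seq      int // enqueue order; lower means enqueued earlier
}

// orderJobs returns a copy sorted by priority (descending), then by
// enqueue order (ascending) as the FIFO tiebreaker.
func orderJobs(jobs []job) []job {
	out := append([]job(nil), jobs...)
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Priority != out[j].Priority {
			return out[i].Priority > out[j].Priority
		}
		return out[i].Seq < out[j].Seq
	})
	return out
}

func main() {
	jobs := []job{
		{ID: "a", Priority: 0, Seq: 1},
		{ID: "b", Priority: 5, Seq: 2},
		{ID: "c", Priority: 5, Seq: 3},
		{ID: "d", Priority: -1, Seq: 4},
	}
	for _, j := range orderJobs(jobs) {
		fmt.Println(j.ID, j.Priority)
	}
}
```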
Phase 10: Local Dev Parity
- Create root-level docker-compose.dev.yml
- Simplified from deployments/ for quick local dev
- Redis + API server + Worker with hot reload volumes
- Debug ports: 9101 (API), 6379 (Redis)
Move ExpandPath function and path-related utilities from internal/config to internal/storage where they belong.
Files updated:
- internal/worker/config.go: use storage.ExpandPath
- internal/network/ssh.go: use storage.ExpandPath
- cmd/data_manager/data_manager_config.go: use storage.ExpandPath
- internal/api/server_config.go: use storage.ExpandPath
internal/storage/paths.go already contained the canonical implementation.
Result: Path utilities now live in the storage layer; the config package focuses on configuration structs.