GPU detection refactor:
- Major rewrite of gpu_detector.go with unified detection interface
- Support for NVIDIA (NVML), AMD (ROCm), and Apple Metal
- Runtime GPU capability querying for scheduler matching
macOS improvements:
- gpu_macos.go: native Metal device enumeration and memory queries
- Support for Apple Silicon (M1/M2/M3) unified memory reporting
- Fall back to system_profiler on Intel Macs
Testing infrastructure:
- Add gpu_detector_mock.go for testing without hardware
- Update gpu_golden_test.go with platform-specific expectations
- Cross-platform GPU info validation
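A minimal sketch of how a unified detection interface and a hardware-free mock could fit together. The names `GPUInfo`, `Detector`, `MockDetector`, and `FirstAvailable` are hypothetical and are not taken from gpu_detector.go or gpu_detector_mock.go.

```go
package main

import (
	"errors"
	"fmt"
)

// GPUInfo is a hypothetical vendor-neutral description of one device.
type GPUInfo struct {
	Vendor      string // "nvidia", "amd", or "apple"
	Name        string
	MemoryBytes uint64
	Unified     bool // true when the GPU shares memory with the CPU
}

// Detector is the assumed shape of a unified detection interface.
type Detector interface {
	Detect() ([]GPUInfo, error)
}

// MockDetector returns canned results so tests can run without hardware.
type MockDetector struct {
	GPUs []GPUInfo
	Err  error
}

func (m MockDetector) Detect() ([]GPUInfo, error) { return m.GPUs, m.Err }

// FirstAvailable tries each detector in order and returns the first
// non-empty result, mirroring a "native backend, then fallback" strategy.
func FirstAvailable(detectors ...Detector) ([]GPUInfo, error) {
	var lastErr error
	for _, d := range detectors {
		gpus, err := d.Detect()
		if err != nil {
			lastErr = err
			continue
		}
		if len(gpus) > 0 {
			return gpus, nil
		}
	}
	if lastErr != nil {
		return nil, lastErr
	}
	return nil, errors.New("no GPU detected")
}

func main() {
	failing := MockDetector{Err: errors.New("backend not available")}
	apple := MockDetector{GPUs: []GPUInfo{
		{Vendor: "apple", Name: "mock-gpu", MemoryBytes: 16 << 30, Unified: true},
	}}
	gpus, err := FirstAvailable(failing, apple)
	fmt.Println(gpus, err)
}
```

Because the mock satisfies the same interface as a real backend, golden tests can feed platform-specific fixtures through the same code path the scheduler would use.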
Major WebSocket handler refactor:
- Rewrite ws/handler.go with structured message routing and backpressure
- Add connection lifecycle management with heartbeats and timeouts
- Implement graceful connection draining for zero-downtime restarts
Protocol improvements:
- Define structured protocol types in protocol.go for hub communication
- Add versioned message envelopes for backward compatibility
- Standardize error codes and response formats across WebSocket API
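A versioned envelope with a standardized error payload might look roughly like the sketch below. The field names, the JSON encoding, and `maxSupportedVersion` are assumptions for illustration and are not the types defined in protocol.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope is a hypothetical versioned wrapper around every message.
type Envelope struct {
	Version int             `json:"version"`
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload,omitempty"`
}

// ErrorPayload sketches a standardized error body.
type ErrorPayload struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// maxSupportedVersion is an assumed constant for this sketch.
const maxSupportedVersion = 2

// Decode parses an envelope and rejects versions this build does not
// understand. Older versions within range are still accepted.
func Decode(data []byte) (Envelope, error) {
	var env Envelope
	if err := json.Unmarshal(data, &env); err != nil {
		return Envelope{}, fmt.Errorf("decode envelope: %w", err)
	}
	if env.Version < 1 || env.Version > maxSupportedVersion {
		return Envelope{}, fmt.Errorf("unsupported protocol version %d", env.Version)
	}
	return env, nil
}

func main() {
	raw := []byte(`{"version":1,"type":"error","payload":{"code":"not_found","message":"run not found"}}`)
	env, err := Decode(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	var p ErrorPayload
	if err := json.Unmarshal(env.Payload, &p); err != nil {
		fmt.Println("payload failed:", err)
		return
	}
	fmt.Println(env.Type, p.Code, p.Message)
}
```

Keeping the payload as `json.RawMessage` lets the router inspect `Version` and `Type` first and defer payload parsing to the handler for that message type.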
Job streaming via WebSocket:
- Simplify ws/jobs.go with async job status streaming
- Add compression for high-volume job updates
Testing:
- Update websocket_e2e_test.go for new protocol semantics
- Add connection resilience tests
Add comprehensive capability routing system to scheduler hub:
- Capability-aware worker matching with requirement/offer negotiation
- Hub v2 protocol with structured message types and heartbeat management
- Worker capability advertisement and dynamic routing decisions
- Orphan recovery for disconnected workers with state reconciliation
- Template-based job scheduling with capability constraints
Add extensive test coverage:
- Unit tests for capability routing logic and heartbeat mechanics
- Unit tests for orphan recovery scenarios
- E2E tests for capability routing across multiple workers
- Hub capabilities integration tests
- Scheduler fixture helpers for test setup
Protocol improvements:
- Define structured protocol messages for hub-worker communication
- Add capability matching algorithm with scoring
- Implement graceful worker disconnection handling
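Capability matching with scoring can be sketched as a hard filter followed by a ranking. The `Requirement` and `Offer` fields and the best-fit scoring rule below are assumptions; the commit does not describe the actual scoring function.

```go
package main

import "fmt"

// Requirement is a hypothetical description of what a job needs.
type Requirement struct {
	MinGPUs   int
	MinVRAMGB int
	Labels    []string // every label must be offered by the worker
}

// Offer is a hypothetical capability advertisement from one worker.
type Offer struct {
	WorkerID string
	GPUs     int
	VRAMGB   int
	Labels   map[string]bool
}

// Score returns ok=false when the offer cannot satisfy the requirement.
// Otherwise it returns a best-fit score: the smaller the surplus, the
// higher the score, so large workers stay free for large jobs.
func Score(req Requirement, off Offer) (score int, ok bool) {
	if off.GPUs < req.MinGPUs || off.VRAMGB < req.MinVRAMGB {
		return 0, false
	}
	for _, l := range req.Labels {
		if !off.Labels[l] {
			return 0, false
		}
	}
	surplus := (off.GPUs-req.MinGPUs)*100 + (off.VRAMGB - req.MinVRAMGB)
	return -surplus, true
}

// BestWorker picks the highest-scoring eligible worker, if any.
func BestWorker(req Requirement, offers []Offer) (string, bool) {
	best, bestScore, found := "", 0, false
	for _, o := range offers {
		s, ok := Score(req, o)
		if !ok {
			continue
		}
		if !found || s > bestScore {
			best, bestScore, found = o.WorkerID, s, true
		}
	}
	return best, found
}

func main() {
	req := Requirement{MinGPUs: 2, MinVRAMGB: 16, Labels: []string{"cuda"}}
	offers := []Offer{
		{WorkerID: "big", GPUs: 8, VRAMGB: 80, Labels: map[string]bool{"cuda": true}},
		{WorkerID: "fit", GPUs: 2, VRAMGB: 24, Labels: map[string]bool{"cuda": true}},
		{WorkerID: "small", GPUs: 1, VRAMGB: 8, Labels: map[string]bool{"cuda": true}},
	}
	fmt.Println(BestWorker(req, offers))
}
```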
Comprehensive audit system for security and compliance:
- middleware/audit.go: HTTP request/response auditing middleware
  * Captures request details, user identity, response status
  * Chains audit events with cryptographic hashes for tamper detection
  * Configurable filtering for sensitive data redaction
- audit/chain.go: Blockchain-style audit log chaining
  * Each entry includes hash of previous entry
  * Tamper detection through hash verification
  * Supports incremental verification without full scan
- checkpoint.go: Periodic integrity checkpoints
  * Creates signed checkpoints for fast verification
  * Configurable checkpoint intervals
  * Recovery from last known good checkpoint
- rotation.go: Automatic log rotation and archival
  * Size-based and time-based rotation policies
  * Compressed archival with integrity seals
  * Retention policy enforcement
- sealed.go: Cryptographic sealing of audit logs
  * Digital signatures for log integrity
  * HSM support preparation
  * Exportable sealed bundles for external auditors
- verifier.go: Log verification and forensic analysis
  * Complete chain verification from genesis to latest
  * Detects gaps, tampering, unauthorized modifications
  * Forensic export for incident response
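The hash-chaining idea behind the audit log can be illustrated with a small sketch. The `Entry` layout and the `seq|prev|event` preimage are assumptions for illustration; the real format in audit/chain.go is not shown in this commit message.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Entry is a hypothetical audit record linked to its predecessor.
type Entry struct {
	Seq      int
	Event    string
	PrevHash string
	Hash     string
}

// hashEntry binds the sequence number, previous hash, and event together.
// The free-form event comes last so the preimage stays unambiguous.
func hashEntry(seq int, event, prev string) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%d|%s|%s", seq, prev, event)))
	return hex.EncodeToString(sum[:])
}

// Append adds an event whose hash covers the previous entry's hash.
func Append(chain []Entry, event string) []Entry {
	prev := ""
	if len(chain) > 0 {
		prev = chain[len(chain)-1].Hash
	}
	seq := len(chain)
	return append(chain, Entry{Seq: seq, Event: event, PrevHash: prev, Hash: hashEntry(seq, event, prev)})
}

// Verify walks the chain from genesis and returns the index of the first
// entry that is out of sequence, unlinked, or modified, or -1 if intact.
func Verify(chain []Entry) int {
	prev := ""
	for i, e := range chain {
		if e.Seq != i || e.PrevHash != prev || e.Hash != hashEntry(e.Seq, e.Event, e.PrevHash) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var chain []Entry
	for _, ev := range []string{"login alice", "read task 7", "logout alice"} {
		chain = Append(chain, ev)
	}
	fmt.Println("intact:", Verify(chain))
	chain[1].Event = "read task 8" // simulate tampering
	fmt.Println("after tampering:", Verify(chain))
}
```

A signed checkpoint would record the hash at some index so that later verification can start there instead of at genesis; that part is omitted here.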
Add new API endpoints and clean up handler interfaces:
- groups/handlers.go: New lab group management API
  * CRUD operations for lab groups
  * Member management with role assignment (admin/member/viewer)
  * Group listing and membership queries
- tokens/handlers.go: Token generation and validation endpoints
  * Create access tokens for public task sharing
  * Validate tokens for secure access
  * Token revocation and cleanup
- routes.go: Refactor handler registration
  * Integrate groups handler into WebSocket routes
  * Remove nil parameters from all handler constructors
  * Cleaner dependency injection pattern
- Handler interface cleanup across all modules:
  * jobs/handlers.go: Remove unused nil privacyEnforcer parameter
  * jupyter/handlers.go: Streamline initialization
  * scheduler/handlers.go: Consistent constructor signature
  * ws/handler.go: Add groups handler to dependencies
Add comprehensive authentication and authorization enhancements:
- tokens.go: New token management system for public task access and cloning
  * SHA-256 hashed token storage for security
  * Token generation, validation, and automatic cleanup
  * Support for public access and clone permissions
- api_key.go: Extend User struct with Groups field
  * Lab group membership (ml-lab, nlp-group)
  * Integration with permission system for group-based access
- flags.go: Security hardening - migrate to structured logging
  * Replace log.Printf with log/slog to prevent log injection attacks
  * Consistent structured output for all auth warnings
  * Safe handling of file paths and errors in logs
- permissions.go: Add task sharing permission constants
  * PermissionTasksReadOwn: Access own tasks
  * PermissionTasksReadLab: Access lab group tasks
  * PermissionTasksReadAll: Admin/institution-wide access
  * PermissionTasksShare: Grant access to other users
  * PermissionTasksClone: Create copies of shared tasks
  * CanAccessTask() method with visibility checks
- database.go: Improve error handling
  * Add structured error logging on row close failures
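The SHA-256 hashed token storage mentioned for tokens.go can be sketched as follows: only the hash is persisted, and the plaintext is shown to the caller once. The function names and the 32-byte token length are assumptions, not the actual tokens.go API.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// HashToken returns the hex SHA-256 digest that would be stored.
func HashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// NewToken generates a random token and the hash to persist.
// The plaintext is returned to the caller and never stored.
func NewToken() (plaintext, hash string, err error) {
	buf := make([]byte, 32) // assumed length for this sketch
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	plaintext = hex.EncodeToString(buf)
	return plaintext, HashToken(plaintext), nil
}

// Validate hashes the presented token and compares it with the stored
// hash in constant time.
func Validate(presented, storedHash string) bool {
	return subtle.ConstantTimeCompare([]byte(HashToken(presented)), []byte(storedHash)) == 1
}

func main() {
	token, stored, err := NewToken()
	if err != nil {
		fmt.Println("token generation failed:", err)
		return
	}
	fmt.Println("valid token accepted:", Validate(token, stored))
	fmt.Println("wrong token accepted:", Validate("guess", stored))
}
```

A plain unsalted hash is reasonable here only because the tokens are high-entropy random values; it would not be appropriate for user-chosen passwords.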
Add comprehensive database storage layer for new features:
- db_groups.go: Lab group management with members, roles (admin/member/viewer),
  and group-based task visibility queries
- db_tasks.go: Task visibility system (private/lab/institution/open),
  task sharing with expiry, public clone tokens, and optimized
  ListTasksForUser() for access control
- db_tokens.go: Secure token management for public task access and cloning,
  with SHA-256 hashed token storage and automatic cleanup
- db_audit.go: Audit log persistence with checkpoint chains, tamper
  detection, and log rotation support
- schema_sqlite.sql: Updated schema with:
  - groups, group_members tables
  - tasks.visibility enum, task_shares with expiry
  - access_tokens table with hashed tokens
  - audit_logs, audit_checkpoints tables
  - indexes for all foreign keys and query patterns
- db_experiments.go: Add CascadeVisibilityToTasks() for propagating
  visibility changes from experiments to associated tasks
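The four visibility levels suggest an access check along the lines of the sketch below. The `Task` and `User` shapes and the exact rules per level are assumptions; explicit shares with expiry and clone tokens are left out.

```go
package main

import "fmt"

// Visibility mirrors the private/lab/institution/open levels.
type Visibility string

const (
	VisibilityPrivate     Visibility = "private"
	VisibilityLab         Visibility = "lab"
	VisibilityInstitution Visibility = "institution"
	VisibilityOpen        Visibility = "open"
)

// Task and User are hypothetical, trimmed-down shapes for this sketch.
type Task struct {
	Owner      string
	Group      string
	Visibility Visibility
}

type User struct {
	Name   string
	Groups []string
}

// CanAccessTask applies assumed rules: open tasks are readable by anyone,
// owners always see their own tasks, institution tasks need any
// authenticated user, and lab tasks need membership in the task's group.
// A nil user stands for an unauthenticated caller.
func CanAccessTask(u *User, t Task) bool {
	if t.Visibility == VisibilityOpen {
		return true
	}
	if u == nil {
		return false
	}
	if u.Name == t.Owner {
		return true
	}
	switch t.Visibility {
	case VisibilityInstitution:
		return true
	case VisibilityLab:
		for _, g := range u.Groups {
			if g == t.Group {
				return true
			}
		}
	}
	return false
}

func main() {
	task := Task{Owner: "alice", Group: "ml-lab", Visibility: VisibilityLab}
	bob := &User{Name: "bob", Groups: []string{"ml-lab"}}
	eve := &User{Name: "eve", Groups: []string{"nlp-group"}}
	fmt.Println(CanAccessTask(bob, task), CanAccessTask(eve, task), CanAccessTask(nil, task))
}
```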
## Problem
TestEndToEndJobLifecycle was failing for two reasons:
1. Race condition: workers signaled ready before the job was processed, so they
   received MsgNoWork instead of MsgJobAssign
2. getTask() didn't check pendingAcceptance, so it returned nil for
   assigned-but-not-yet-accepted tasks
## Changes
### Test Fix (restart_recovery_test.go)
- Replace single-shot select with retry loop that re-signals workers as ready
- Handle both assignment and non-assignment messages correctly
- Add 10ms delay between non-assignment messages to allow job processing
- Use 2-second deadline with 100ms timeout intervals
### Scheduler Fix (hub.go)
- Extend getTask() to check pendingAcceptance map after batch/service queues
- Allows GetTask() to find tasks in 'assigned' state before acceptance
- Maintains backward compatibility with existing queue/running lookups
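The extended lookup can be pictured with the sketch below. It models every collection as a map keyed by task ID, which is a simplification; the real hub.go queues, field names, and the position of the running lookup are assumptions here.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical, trimmed-down task record.
type Task struct {
	ID    string
	State string
}

// Hub models each collection as a map for illustration only.
type Hub struct {
	mu                sync.RWMutex
	batchQueue        map[string]*Task
	serviceQueue      map[string]*Task
	pendingAcceptance map[string]*Task // assigned, not yet accepted by a worker
	running           map[string]*Task
}

// getTask checks the queues first, then pendingAcceptance, then running.
// Before the fix, a task in pendingAcceptance would fall through to nil.
func (h *Hub) getTask(id string) *Task {
	h.mu.RLock()
	defer h.mu.RUnlock()
	for _, m := range []map[string]*Task{h.batchQueue, h.serviceQueue, h.pendingAcceptance, h.running} {
		if t, ok := m[id]; ok { // reading a nil map is safe in Go
			return t
		}
	}
	return nil
}

func main() {
	h := &Hub{
		pendingAcceptance: map[string]*Task{"job-1": {ID: "job-1", State: "assigned"}},
	}
	fmt.Println(h.getTask("job-1").State)
	fmt.Println(h.getTask("missing") == nil)
}
```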
## Testing
`make test` now passes: 475 passed, 0 failed, 34 skipped
Enhance the ml info command to query the server when connected, falling back
to local manifests when offline. This unifies its behavior with other commands
such as run, exec, and cancel.
CLI changes:
- Add --local and --remote flags for explicit control
- Auto-detect connection state via mode.detect()
- queryRemoteRun(): Query server via WebSocket for run details
- queryLocalRun(): Read local run_manifest.json
- displayRunInfo(): Shared display logic for both sources
- Add connection status indicators (Remote: connecting.../connected)
WebSocket protocol:
- Add query_run_info opcode (0x28) to the CLI and server
- Add sendQueryRunInfo() method to ws/client.zig
- Protocol: [opcode:1][api_key_hash:16][run_id_len:1][run_id:var]
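The frame layout above can be encoded and decoded as in the sketch below. Only the opcode value and field widths come from the protocol line; the function names and error handling are assumptions and do not reflect the actual ws/handler.go code.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	OpQueryRunInfo byte = 0x28
	headerLen           = 1 + 16 + 1 // opcode + api_key_hash + run_id_len
)

// EncodeQueryRunInfo builds [opcode:1][api_key_hash:16][run_id_len:1][run_id:var].
// The one-byte length field limits run IDs to 255 bytes.
func EncodeQueryRunInfo(keyHash [16]byte, runID string) ([]byte, error) {
	if len(runID) > 255 {
		return nil, errors.New("run_id longer than 255 bytes")
	}
	buf := make([]byte, 0, headerLen+len(runID))
	buf = append(buf, OpQueryRunInfo)
	buf = append(buf, keyHash[:]...)
	buf = append(buf, byte(len(runID)))
	buf = append(buf, runID...)
	return buf, nil
}

// DecodeQueryRunInfo validates the opcode and lengths before slicing.
func DecodeQueryRunInfo(frame []byte) ([16]byte, string, error) {
	var keyHash [16]byte
	if len(frame) < headerLen {
		return keyHash, "", errors.New("frame too short")
	}
	if frame[0] != OpQueryRunInfo {
		return keyHash, "", fmt.Errorf("unexpected opcode 0x%02x", frame[0])
	}
	n := int(frame[17])
	if len(frame) != headerLen+n {
		return keyHash, "", errors.New("run_id length mismatch")
	}
	copy(keyHash[:], frame[1:17])
	return keyHash, string(frame[headerLen:]), nil
}

func main() {
	var keyHash [16]byte
	keyHash[0] = 0xAB
	frame, err := EncodeQueryRunInfo(keyHash, "abc123")
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Printf("% x\n", frame)
	_, runID, err := DecodeQueryRunInfo(frame)
	fmt.Println(runID, err)
}
```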
Server changes:
- Add handleQueryRunInfo() handler to ws/handler.go
- Returns run_id, job_name, user, timestamp, overall_sha, files_count
- Checks PermJobsRead permission
- Looks up run in experiment manager
Usage:
  ml info abc123           # Auto: tries remote, falls back to local
  ml info abc123 --local   # Force local manifest lookup
  ml info abc123 --remote  # Force remote query (fails if offline)
Remove two unused methods and one unused parameter identified by static analysis:
- canRequeue(): never integrated into scheduling flow
- runMetricsClient clientID param: accepted but never used
- getUsageLocked(): callers inline the logic
Fixes IDE warnings about unused code per AGENTS.md cleanup discipline.
Implement VaultProvider with Transit engine:
- AppRole, Kubernetes, and Token authentication
- Encrypt/Decrypt via /transit/encrypt and /transit/decrypt
- Key lifecycle via /transit/keys API
- Health check via /sys/health
Implement AWSProvider with SDK v2:
- Per-region key naming with alias prefix
- Encrypt/Decrypt via KMS SDK
- Key lifecycle (CreateKey, Disable, ScheduleDeletion, Enable)
- AWS endpoint support for LocalStack testing
Add KMSProvider interface for external key management systems:
- Encrypt/Decrypt operations for DEK wrapping
- Key lifecycle management (Create, Disable, ScheduleDeletion, Enable)
- HealthCheck and Close methods
Implement MemoryProvider for development/testing:
- XOR encryption with HMAC-SHA256 authentication
- Secure random key generation using crypto/rand
- MAC verification to detect wrong keys
Implement DEKCache per ADR-012:
- 15-minute TTL with configurable grace window (1 hour)
- LRU eviction with 1000 entry limit
- Cache key includes (tenantID, artifactID, kmsKeyID) for isolation
- Thread-safe operations with RWMutex
- Secure memory wiping on eviction/cleanup
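A rough sketch of a TTL plus LRU cache keyed by (tenantID, artifactID, kmsKeyID). It is not the ADR-012 implementation: the grace window is omitted, the type and method names are hypothetical, and a plain Mutex is used instead of an RWMutex because an LRU read also reorders entries.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
	"time"
)

// CacheKey isolates entries per tenant, artifact, and KMS key.
type CacheKey struct{ TenantID, ArtifactID, KMSKeyID string }

type entry struct {
	key     CacheKey
	dek     []byte
	expires time.Time
}

// DEKCache is a hypothetical TTL + LRU cache for unwrapped DEKs.
type DEKCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	max   int
	order *list.List // front = most recently used
	items map[CacheKey]*list.Element
}

func NewDEKCache(ttl time.Duration, max int) *DEKCache {
	return &DEKCache{ttl: ttl, max: max, order: list.New(), items: map[CacheKey]*list.Element{}}
}

// removeLocked zeroes the key material before dropping the entry.
func (c *DEKCache) removeLocked(el *list.Element) {
	e := el.Value.(*entry)
	for i := range e.dek {
		e.dek[i] = 0
	}
	delete(c.items, e.key)
	c.order.Remove(el)
}

// Put stores a private copy of dek and evicts the least recently used
// entries once the size limit is exceeded.
func (c *DEKCache) Put(k CacheKey, dek []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[k]; ok {
		c.removeLocked(el)
	}
	e := &entry{key: k, dek: append([]byte(nil), dek...), expires: time.Now().Add(c.ttl)}
	c.items[k] = c.order.PushFront(e)
	for c.order.Len() > c.max {
		c.removeLocked(c.order.Back())
	}
}

// Get returns a copy of the DEK, or false if it is missing or expired.
func (c *DEKCache) Get(k CacheKey) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[k]
	if !ok {
		return nil, false
	}
	e := el.Value.(*entry)
	if time.Now().After(e.expires) {
		c.removeLocked(el)
		return nil, false
	}
	c.order.MoveToFront(el)
	return append([]byte(nil), e.dek...), true
}

func main() {
	c := NewDEKCache(15*time.Minute, 1000)
	k := CacheKey{"tenant-a", "artifact-1", "kms-key-1"}
	c.Put(k, []byte{1, 2, 3})
	dek, ok := c.Get(k)
	fmt.Println(dek, ok)
}
```

Zeroing a byte slice is best-effort in Go, since the runtime may have copied the data elsewhere; the sketch only shows where the wipe would happen.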
Add config package with types:
- ProviderType enum (vault, aws, memory)
- VaultConfig with AppRole/Kubernetes/Token auth
- AWSConfig with region and alias prefix
- CacheConfig with TTL, MaxEntries, GraceWindow
- Validation methods for all config types
- Add plugin_quota.go with GPU quota management for scheduler
- Update scheduler hub and protocol for plugin support
- Add comprehensive plugin quota unit tests
- Update gang service and WebSocket queue integration tests
Remove deprecated components replaced by new scheduler:
- Delete internal/controller/pacing_controller.go (replaced by scheduler/pacing.go)
- Delete internal/manifest/schema_test.go (consolidated into tests/unit/)
- Delete internal/workertest/worker.go (consolidated into tests/fixtures/)
- Update .gitignore with scheduler binary and new patterns
Update worker system for scheduler integration:
- Worker server with scheduler registration
- Configuration with scheduler endpoint support
- Artifact handling with integrity verification
- Container executor with supply chain validation
- Local executor enhancements
- GPU detection improvements (cross-platform)
- Error handling with execution context
- Factory pattern for executor instantiation
- Hash integrity with native library support
Update authentication system for multi-tenant support:
- API key management with tenant scoping
- Permission checks for multi-tenant operations
- Database layer with tenant isolation
- Keychain integration with audit logging
Update API layer for scheduler integration:
- WebSocket handlers with scheduler protocol support
- Jobs WebSocket endpoint with priority queue integration
- Validation middleware for scheduler messages
- Server configuration with security hardening
- Protocol definitions for worker-scheduler communication
- Dataset handlers with tenant isolation checks
- Response helpers with audit context
- OpenAPI spec updates for new endpoints
Extend worker capabilities with new execution plugins and security features:
- Jupyter plugin for notebook-based ML experiments
- vLLM plugin for LLM inference workloads
- Cross-platform process isolation (Unix/Windows)
- Network policy enforcement with platform-specific implementations
- Service manager integration for lifecycle management
- Scheduler backend integration for queue coordination
Update lifecycle management:
- Enhanced runloop with state transitions
- Service manager integration for plugin coordination
- Improved state persistence and recovery
Add test coverage:
- Unit tests for Jupyter and vLLM plugins
- Updated worker execution tests
Add new scheduler component for distributed ML workload orchestration:
- Hub-based coordination for multi-worker clusters
- Pacing controller for rate limiting job submissions
- Priority queue with preemption support
- Port allocator for dynamic service discovery
- Protocol handlers for worker-scheduler communication
- Service manager with OS-specific implementations
- Connection management and state persistence
- Template system for service deployment
Includes comprehensive test suite:
- Unit tests for all core components
- Integration tests for distributed scenarios
- Benchmark tests for performance validation
- Mock fixtures for isolated testing
Refs: scheduler-architecture.md
Rename and enhance existing tests to align with coverage map:
- TestGPUDetectorAMDVendorAlias -> TestAMDAliasManifestRecord
- TestScanArtifacts_SkipsKnownPathsAndLogs -> TestScanExclusionsRecorded
- Add env var expansion verification to TestHIPAAValidation_InlineCredentials
- Record exclusions in manifest.Artifacts for audit trail
Fix ValidatePath to correctly resolve symlinks and handle edge cases:
- Resolve symlinks before boundary check to prevent traversal
- Handle macOS /private prefix correctly
- Add fallback for non-existent paths (parent directory resolution)
- Double boundary checks: before AND after symlink resolution
- Prevent race conditions between check and use
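The double boundary check can be sketched as below, using filepath.EvalSymlinks on both the base and the candidate. This is an illustration under assumptions, not the project's ValidatePath: the helper names are hypothetical, and the sketch does not close the race between check and use, which would need the caller to open the returned path without re-following links.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

var errEscape = errors.New("path escapes base directory")

// within reports whether target is base itself or lies beneath it.
// Both arguments are expected to be absolute, cleaned paths.
func within(base, target string) bool {
	rel, err := filepath.Rel(base, target)
	if err != nil {
		return false
	}
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

// resolve follows symlinks. When p does not exist yet, it resolves the
// nearest existing parent and re-appends the missing components.
func resolve(p string) (string, error) {
	real, err := filepath.EvalSymlinks(p)
	if err == nil || !errors.Is(err, os.ErrNotExist) {
		return real, err
	}
	// A dangling symlink could later point anywhere, so refuse it.
	if fi, lerr := os.Lstat(p); lerr == nil && fi.Mode()&os.ModeSymlink != 0 {
		return "", errEscape
	}
	parent := filepath.Dir(p)
	if parent == p {
		return "", err
	}
	realParent, perr := resolve(parent)
	if perr != nil {
		return "", perr
	}
	return filepath.Join(realParent, filepath.Base(p)), nil
}

// ValidatePath checks the boundary before and after symlink resolution.
func ValidatePath(base, candidate string) (string, error) {
	absBase, err := filepath.Abs(base)
	if err != nil {
		return "", err
	}
	joined := filepath.Join(absBase, candidate)
	if !within(absBase, joined) { // check 1: lexical only
		return "", errEscape
	}
	realBase, err := filepath.EvalSymlinks(absBase) // e.g. /var -> /private/var on macOS
	if err != nil {
		return "", err
	}
	real, err := resolve(joined)
	if err != nil {
		return "", err
	}
	if !within(realBase, real) { // check 2: after resolution
		return "", errEscape
	}
	return real, nil
}

// run builds a temp tree with an escaping symlink and checks a few cases.
func run() error {
	root, err := os.MkdirTemp("", "validate")
	if err != nil {
		return err
	}
	defer os.RemoveAll(root)
	base := filepath.Join(root, "base")
	if err := os.Mkdir(base, 0o755); err != nil {
		return err
	}
	if err := os.Symlink(root, filepath.Join(base, "link")); err != nil {
		return err
	}
	cases := map[string]bool{"new/file.txt": true, "...": true, "link/secret": false, "../x": false}
	for p, wantOK := range cases {
		if _, err := ValidatePath(base, p); (err == nil) != wantOK {
			return fmt.Errorf("%q: unexpected result: %v", p, err)
		}
	}
	return nil
}

func main() {
	fmt.Println("self-check error:", run())
}
```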
Update path traversal tests:
- Correct test expectations for "..." (three dots is a valid filename, not a traversal)
- Add tests for symlink escape attempts
- Add unicode attack tests
- Add deeply nested traversal tests
Security impact: Prevents path traversal via symlink following in artifact
scanning and other file operations.
Add MaxArtifactFiles and MaxArtifactTotalBytes to SandboxConfig:
- Default MaxArtifactFiles: 10,000 (configurable via SecurityDefaults)
- Default MaxArtifactTotalBytes: 100GB (configurable via SecurityDefaults)
- ApplySecurityDefaults() sets defaults if not specified
Enforce caps in scanArtifacts() during directory walk:
- Returns error immediately when MaxArtifactFiles exceeded
- Returns error immediately when MaxArtifactTotalBytes exceeded
- Prevents resource exhaustion attacks from malicious artifact trees
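Cap enforcement during the walk can be sketched with filepath.WalkDir, aborting as soon as either limit is crossed. `ArtifactCaps`, `ErrCapExceeded`, and this `scanArtifacts` signature are stand-ins for illustration; the real function takes a SandboxConfig and records artifacts rather than just counting them.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

var ErrCapExceeded = errors.New("artifact cap exceeded")

// ArtifactCaps is a hypothetical stand-in for the SandboxConfig fields.
type ArtifactCaps struct {
	MaxFiles      int
	MaxTotalBytes int64
}

// scanArtifacts counts files and bytes under root. With nil caps the scan
// is unlimited; otherwise it stops at the first file that exceeds a cap.
func scanArtifacts(root string, caps *ArtifactCaps) (int, int64, error) {
	var files int
	var total int64
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		files++
		total += info.Size()
		if caps == nil {
			return nil
		}
		if files > caps.MaxFiles {
			return fmt.Errorf("%w: more than %d files", ErrCapExceeded, caps.MaxFiles)
		}
		if total > caps.MaxTotalBytes {
			return fmt.Errorf("%w: more than %d bytes", ErrCapExceeded, caps.MaxTotalBytes)
		}
		return nil
	})
	return files, total, err
}

// run creates three 4-byte files and exercises both caps.
func run() error {
	root, err := os.MkdirTemp("", "artifacts")
	if err != nil {
		return err
	}
	defer os.RemoveAll(root)
	for i := 0; i < 3; i++ {
		name := filepath.Join(root, fmt.Sprintf("f%d.bin", i))
		if err := os.WriteFile(name, []byte("data"), 0o600); err != nil {
			return err
		}
	}
	if n, total, err := scanArtifacts(root, nil); err != nil || n != 3 || total != 12 {
		return fmt.Errorf("unlimited scan: n=%d total=%d err=%v", n, total, err)
	}
	if _, _, err := scanArtifacts(root, &ArtifactCaps{MaxFiles: 2, MaxTotalBytes: 1 << 20}); !errors.Is(err, ErrCapExceeded) {
		return fmt.Errorf("file cap not enforced: %v", err)
	}
	if _, _, err := scanArtifacts(root, &ArtifactCaps{MaxFiles: 10, MaxTotalBytes: 10}); !errors.Is(err, ErrCapExceeded) {
		return fmt.Errorf("byte cap not enforced: %v", err)
	}
	return nil
}

func main() {
	fmt.Println("self-check error:", run())
}
```

Returning the error from the walk callback stops the traversal immediately, so a malicious tree costs at most one file beyond the cap.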
Update all call sites to pass SandboxConfig for cap enforcement:
- Native bridge libs updated to pass caps argument
- Benchmark tests updated with nil caps (unlimited for benchmarks)
- Unit tests updated with nil caps
Closes: artifact ingestion caps items from security plan
**Cleanup:**
- Delete internal/worker/testutil.go (150 lines of unused test utilities)
- Remove unused stateDir() function from internal/jupyter/service_manager.go
- Silence unused variable warning in internal/worker/executor/container.go
- Surface GPUDetectionInfo from parseGPUCountFromConfig for detection metadata
- Document FETCH_ML_TOTAL_CPU and FETCH_ML_GPU_SLOTS_PER_GPU env vars
- Add debug logging for all env var overrides to stderr
- Track config-layer auto-detection in GPUDetectionInfo.ConfigLayerAutoDetected
- Add --include-all flag to artifact scanner (includeAll parameter)
- Add AMD production mode enforcement (error in non-local mode)
- Add GPU detector unit tests for env overrides and AMD aliasing