Refactor plugins to use interface for testability:
- Add PodmanInterface to container package (StartContainer, StopContainer, RemoveContainer)
- Update MLflow plugin to use container.PodmanInterface
- Update TensorBoard plugin to use container.PodmanInterface
- Add comprehensive mocked tests for all three plugins (wandb, mlflow, tensorboard)
- Coverage increased from 18% to 91.4%
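
A minimal sketch of how an interface like this enables mocked plugin tests. Only the three method names come from the commit; the parameter and return types, `mockPodman`, `runLifecycle`, and `errUnavailable` are assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// PodmanInterface lists the three operations named in the commit.
// The parameter and return types are assumptions for this sketch.
type PodmanInterface interface {
	StartContainer(ctx context.Context, name string) (string, error)
	StopContainer(ctx context.Context, id string) error
	RemoveContainer(ctx context.Context, id string) error
}

// errUnavailable is a hypothetical failure injected by tests.
var errUnavailable = errors.New("podman unavailable")

// mockPodman records every call and can be told to fail on start.
type mockPodman struct {
	calls    []string
	startErr error
}

func (m *mockPodman) StartContainer(_ context.Context, name string) (string, error) {
	m.calls = append(m.calls, "start:"+name)
	if m.startErr != nil {
		return "", m.startErr
	}
	return "id-" + name, nil
}

func (m *mockPodman) StopContainer(_ context.Context, id string) error {
	m.calls = append(m.calls, "stop:"+id)
	return nil
}

func (m *mockPodman) RemoveContainer(_ context.Context, id string) error {
	m.calls = append(m.calls, "remove:"+id)
	return nil
}

// runLifecycle is a hypothetical stand-in for plugin code: it depends only
// on the interface, so a test can pass the mock instead of real Podman.
func runLifecycle(p PodmanInterface, name string) error {
	ctx := context.Background()
	id, err := p.StartContainer(ctx, name)
	if err != nil {
		return fmt.Errorf("start %s: %w", name, err)
	}
	if err := p.StopContainer(ctx, id); err != nil {
		return fmt.Errorf("stop %s: %w", id, err)
	}
	return p.RemoveContainer(ctx, id)
}

func main() {
	ok := &mockPodman{}
	fmt.Println(runLifecycle(ok, "mlflow"), ok.calls)

	failing := &mockPodman{startErr: errUnavailable}
	fmt.Println(runLifecycle(failing, "mlflow"), failing.calls)
}
```

Because the plugin code sees only the interface, the failure path can be exercised without a container runtime on the test machine.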
Use atomic operations for shared variables:
- alertCount: atomic.Int32 for concurrent access
- lastAlert: atomic.Value for alert storage
Fixes data races detected by -race flag.
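
A short sketch of the pattern, assuming the two variables live on a struct that several goroutines update. `alertRecorder`, `record`, `snapshot`, and `fireConcurrently` are hypothetical names; `atomic.Int32` requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// alertRecorder is a hypothetical holder for the two shared variables.
type alertRecorder struct {
	alertCount atomic.Int32
	lastAlert  atomic.Value // must always store the same concrete type (string here)
}

// record is safe to call from many goroutines without a mutex.
func (r *alertRecorder) record(msg string) {
	r.alertCount.Add(1)
	r.lastAlert.Store(msg)
}

// snapshot reads both values; Load returns nil until the first Store,
// so the comma-ok assertion yields "" in that case.
func (r *alertRecorder) snapshot() (int32, string) {
	last, _ := r.lastAlert.Load().(string)
	return r.alertCount.Load(), last
}

// fireConcurrently records n alerts from n goroutines and waits for them.
func fireConcurrently(r *alertRecorder, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			r.record(fmt.Sprintf("alert-%d", i))
		}(i)
	}
	wg.Wait()
}

func main() {
	r := &alertRecorder{}
	fireConcurrently(r, 100)
	count, last := r.snapshot()
	fmt.Println(count, last != "") // prints "100 true"
}
```

Note that the counter and the last alert are each updated atomically but not together; a reader can observe a count that is one ahead of the stored alert.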
Fixes to support proper test coverage:
- db_jobs.go: UpdateJobStatus now checks RowsAffected and returns error
  for nonexistent jobs instead of silently succeeding
- db_audit.go: GetOldestAuditLogDate uses sql.NullString to parse SQLite
  datetime strings in YYYY-MM-DD HH:MM:SS format with RFC3339 fallback
- db_experiments.go: ListTasksForExperiment uses sql.NullString for
  nullable worker_id and error fields to prevent scan errors
- db_connect.go: DB struct adds isClosed state tracking with mutex;
  Close() now returns error on double close to match test expectations
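
The datetime handling in db_audit.go can be sketched roughly as follows. `parseAuditTime` and `nullable` are hypothetical helpers, and treating a NULL or empty column as "no audit logs yet" is an assumption, not something the commit states.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// sqliteLayout is Go's reference-time spelling of YYYY-MM-DD HH:MM:SS.
const sqliteLayout = "2006-01-02 15:04:05"

// parseAuditTime converts a nullable text column into a time.
// The bool result is false when the column was NULL or empty.
func parseAuditTime(raw sql.NullString) (time.Time, bool, error) {
	if !raw.Valid || raw.String == "" {
		return time.Time{}, false, nil
	}
	// Try the SQLite datetime format first.
	if t, err := time.Parse(sqliteLayout, raw.String); err == nil {
		return t, true, nil
	}
	// Fall back to RFC3339 for values written in that form.
	t, err := time.Parse(time.RFC3339, raw.String)
	if err != nil {
		return time.Time{}, false, fmt.Errorf("parse audit timestamp %q: %w", raw.String, err)
	}
	return t, true, nil
}

// nullable builds a sql.NullString the way Scan would for a text column,
// treating "" as NULL for the purposes of this demo.
func nullable(s string) sql.NullString {
	return sql.NullString{String: s, Valid: s != ""}
}

func main() {
	for _, s := range []string{"2024-03-01 12:30:00", "2024-03-01T12:30:00Z", "", "yesterday"} {
		t, ok, err := parseAuditTime(nullable(s))
		fmt.Println(t, ok, err)
	}
}
```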
Refactor getPluginVersion to accept PluginConfig parameter:
- Change signature from getPluginVersion(pluginName) to getPluginVersion(pluginName, cfg)
- Update all call sites to pass config
- Add TODO comment for future implementation querying actual plugin binary/container
Update plugin handlers to use dynamic version retrieval:
- GetV1Plugins: Use h.getPluginVersion(name, cfg) instead of hardcoded "1.0.0"
- PutV1PluginsPluginNameConfig: Pass newConfig to version retrieval
- GetV1PluginsPluginNameHealth: Use actual version from config
This prepares the API for dynamic version reporting while maintaining
backward compatibility with the current placeholder implementation.
Add CodeNotImplemented error constant (HTTP 501) for planned but unavailable features.
Refactor WebSocket packet handling from JSON to binary protocol for improved efficiency:
New packet structure:
- PacketTypeSuccess (0x00): [type:1][json_data:var]
- PacketTypeError (0x01): [type:1][code_len:1][code:var][msg_len:2][msg:var][details_len:2][details:var]
- PacketTypeData (0x02): Reserved for future use
Update SendErrorPacket:
- Build binary error packets with length-prefixed fields
- Use WriteMessage with websocket.BinaryMessage
Update SendSuccessPacket:
- Marshal data to JSON then wrap in binary packet
- Eliminates "success" wrapper field for cleaner protocol
Add helper functions:
- NewNotImplemented(feature) - Standard 501 error
- NewNotImplementedWithIssue(feature, issueURL) - 501 with GitHub reference
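
The error packet layout listed above can be sketched as an encoder and decoder. The commit does not state the byte order of the two-byte length fields, so big-endian is an assumption here, as are the function names and the sample error code string.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	PacketTypeSuccess byte = 0x00
	PacketTypeError   byte = 0x01
)

var errTruncated = errors.New("truncated packet")

// encodeErrorPacket builds
// [type:1][code_len:1][code:var][msg_len:2][msg:var][details_len:2][details:var].
// Big-endian lengths are an assumption of this sketch.
func encodeErrorPacket(code, msg, details string) ([]byte, error) {
	if len(code) > 0xFF {
		return nil, errors.New("code exceeds 255 bytes")
	}
	if len(msg) > 0xFFFF || len(details) > 0xFFFF {
		return nil, errors.New("msg or details exceeds 65535 bytes")
	}
	buf := make([]byte, 0, 1+1+len(code)+2+len(msg)+2+len(details))
	buf = append(buf, PacketTypeError, byte(len(code)))
	buf = append(buf, code...)
	buf = binary.BigEndian.AppendUint16(buf, uint16(len(msg)))
	buf = append(buf, msg...)
	buf = binary.BigEndian.AppendUint16(buf, uint16(len(details)))
	buf = append(buf, details...)
	return buf, nil
}

// readField16 reads one field with a two-byte length prefix.
func readField16(p []byte) (string, []byte, error) {
	if len(p) < 2 {
		return "", nil, errTruncated
	}
	n := int(binary.BigEndian.Uint16(p))
	p = p[2:]
	if len(p) < n {
		return "", nil, errTruncated
	}
	return string(p[:n]), p[n:], nil
}

// decodeErrorPacket reverses encodeErrorPacket and checks every length.
func decodeErrorPacket(p []byte) (code, msg, details string, err error) {
	if len(p) < 2 || p[0] != PacketTypeError {
		return "", "", "", errors.New("not an error packet")
	}
	n := int(p[1])
	p = p[2:]
	if len(p) < n {
		return "", "", "", errTruncated
	}
	code, p = string(p[:n]), p[n:]
	if msg, p, err = readField16(p); err != nil {
		return "", "", "", err
	}
	if details, _, err = readField16(p); err != nil {
		return "", "", "", err
	}
	return code, msg, details, nil
}

func main() {
	// "NOT_IMPLEMENTED" is a hypothetical code string for illustration.
	pkt, err := encodeErrorPacket("NOT_IMPLEMENTED", "feature unavailable", "")
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", pkt)
	fmt.Println(decodeErrorPacket(pkt))
}
```

The one-byte code length caps error codes at 255 bytes, while messages and details can each carry up to 65535 bytes.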
Move store package to improve reusability and follow Go project conventions:
- cmd/tui/internal/store/store.go -> internal/store/store.go
- cmd/tui/internal/store/store_test.go -> internal/store/store_test.go
This makes the store package available to other components beyond the TUI,
reducing coupling and enabling future reuse by API server, CLI, or other tools.
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/queue/* -> internal/queue/* (dedup, filesystem_fallback, queue_permissions, queue_spec, queue, sqlite_queue tests)
- tests/unit/gpu/* -> internal/resources/* (gpu_detector, gpu_golden tests)
- tests/unit/resources/* -> internal/resources/* (manager_test.go)
Update import paths in test files to reflect new locations.
Note: GPU tests are consolidated into the resources package since GPU detection is part of resource management. manager_test.go adds significant new test coverage (166 lines).
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/logging/* -> internal/logging/* (logging tests)
- tests/unit/manifest/* -> internal/manifest/* (run_manifest, schema tests)
- tests/unit/network/* -> internal/network/* (retry, ssh_pool, ssh tests)
- tests/unit/privacy/* -> internal/privacy/* (pii tests)
- tests/unit/metrics/* -> internal/prommetrics/* (metrics tests)
Update import paths in test files to reflect new locations.
Note: metrics_test.go moved from tests/unit/metrics/ to internal/prommetrics/ to match the actual package name.
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/config/* -> internal/config/* (constants, mode_paths, paths, validation)
- tests/unit/container/* -> internal/container/* (podman, security tests)
- tests/unit/envpool/* -> internal/envpool/* (envpool tests)
- tests/unit/errors/* -> internal/errtypes/* (errors_test.go moved to errtypes package)
- tests/unit/experiment/* -> internal/experiment/* (manager tests)
- tests/unit/jupyter/* -> internal/jupyter/* (config, package_blacklist, service_manager, trash_restore)
Update import paths in test files to reflect new locations.
Note: errors_test.go moved from tests/unit/errors/ to internal/errtypes/ to match the package structure.
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/api/* -> internal/api/* (WebSocket handlers, helpers, duplicate detection)
- tests/unit/audit/* -> internal/audit/* (alert, sealed, verifier tests)
- tests/unit/auth/* -> internal/auth/* (API key, keychain, user manager)
- tests/unit/crypto/kms/* -> internal/auth/kms/* (cache, protocol tests)
Update import paths in test files to reflect new locations.
Benefits:
- Tests live alongside the code they test
- Easier navigation and maintenance
- Clearer package boundaries
- Follows standard Go project layout
- Add DisableTLSForTesting to HubConfig for test environments
- Add IsUsingTLS() method to detect scheduler TLS status
- Update MockWorker to auto-select ws:// vs wss:// protocol
- Set DisableTLSForTesting: true in DefaultHubConfig
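
A rough sketch of how a mock worker might choose its URL scheme from this flag. The receiver of `IsUsingTLS()` and the `workerURL` helper are assumptions; the commit only names the field and the method.

```go
package main

import (
	"fmt"
	"net/url"
)

// HubConfig carries only the field relevant to this sketch.
type HubConfig struct {
	DisableTLSForTesting bool
}

// IsUsingTLS reports whether connections should use TLS.
// Placing it on HubConfig is an assumption of this sketch.
func (c HubConfig) IsUsingTLS() bool {
	return !c.DisableTLSForTesting
}

// workerURL is a hypothetical helper that picks ws:// or wss://.
func workerURL(cfg HubConfig, host, path string) string {
	scheme := "ws"
	if cfg.IsUsingTLS() {
		scheme = "wss"
	}
	u := url.URL{Scheme: scheme, Host: host, Path: path}
	return u.String()
}

func main() {
	fmt.Println(workerURL(HubConfig{DisableTLSForTesting: true}, "127.0.0.1:8080", "/ws/worker"))
	fmt.Println(workerURL(HubConfig{}, "hub.example.com", "/ws/worker"))
}
```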
GPU detection refactor:
- Major rewrite of gpu_detector.go with unified detection interface
- Support for NVIDIA (NVML), AMD (ROCm), and Apple Metal
- Runtime GPU capability querying for scheduler matching
macOS improvements:
- gpu_macos.go: native Metal device enumeration and memory queries
- Support for Apple Silicon (M1/M2/M3) unified memory reporting
- Fallback to system profiler for Intel Macs
Testing infrastructure:
- Add gpu_detector_mock.go for testing without hardware
- Update gpu_golden_test.go with platform-specific expectations
- Cross-platform GPU info validation
Major WebSocket handler refactor:
- Rewrite ws/handler.go with structured message routing and backpressure
- Add connection lifecycle management with heartbeats and timeouts
- Implement graceful connection draining for zero-downtime restarts
Protocol improvements:
- Define structured protocol types in protocol.go for hub communication
- Add versioned message envelopes for backward compatibility
- Standardize error codes and response formats across WebSocket API
Job streaming via WebSocket:
- Simplify ws/jobs.go with async job status streaming
- Add compression for high-volume job updates
Testing:
- Update websocket_e2e_test.go for new protocol semantics
- Add connection resilience tests
Add comprehensive capability routing system to scheduler hub:
- Capability-aware worker matching with requirement/offer negotiation
- Hub v2 protocol with structured message types and heartbeat management
- Worker capability advertisement and dynamic routing decisions
- Orphan recovery for disconnected workers with state reconciliation
- Template-based job scheduling with capability constraints
Add extensive test coverage:
- Unit tests for capability routing logic and heartbeat mechanics
- Unit tests for orphan recovery scenarios
- E2E tests for capability routing across multiple workers
- Hub capabilities integration tests
- Scheduler fixture helpers for test setup
Protocol improvements:
- Define structured protocol messages for hub-worker communication
- Add capability matching algorithm with scoring
- Implement graceful worker disconnection handling
Comprehensive audit system for security and compliance:
- middleware/audit.go: HTTP request/response auditing middleware
* Captures request details, user identity, response status
* Chains audit events with cryptographic hashes for tamper detection
* Configurable filtering for sensitive data redaction
- audit/chain.go: Blockchain-style audit log chaining
* Each entry includes hash of previous entry
* Tamper detection through hash verification
* Supports incremental verification without full scan
- checkpoint.go: Periodic integrity checkpoints
* Creates signed checkpoints for fast verification
* Configurable checkpoint intervals
* Recovery from last known good checkpoint
- rotation.go: Automatic log rotation and archival
* Size-based and time-based rotation policies
* Compressed archival with integrity seals
* Retention policy enforcement
- sealed.go: Cryptographic sealing of audit logs
* Digital signatures for log integrity
* HSM support preparation
* Exportable sealed bundles for external auditors
- verifier.go: Log verification and forensic analysis
* Complete chain verification from genesis to latest
* Detects gaps, tampering, unauthorized modifications
* Forensic export for incident response
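
The chaining idea in audit/chain.go can be illustrated with a small hash chain. This is a sketch under assumptions: the `entry` fields, the hashed string layout, and the function names are hypothetical, and signing, checkpoints, and rotation are left out.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// entry is a hypothetical audit record; each one stores the previous hash.
type entry struct {
	Seq      int
	Event    string
	PrevHash string
	Hash     string
}

// hashEntry hashes the sequence number, previous hash, and event together.
// The field layout is an assumption; Event goes last so "|" inside it is harmless.
func hashEntry(seq int, event, prev string) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%d|%s|%s", seq, prev, event)))
	return hex.EncodeToString(sum[:])
}

// appendEntry links a new event to the current tail of the chain.
func appendEntry(chain []entry, event string) []entry {
	prev := ""
	if len(chain) > 0 {
		prev = chain[len(chain)-1].Hash
	}
	seq := len(chain)
	return append(chain, entry{Seq: seq, Event: event, PrevHash: prev, Hash: hashEntry(seq, event, prev)})
}

// verifyFrom checks entries from index start onward and returns the index
// of the first inconsistent entry, or -1 if that part of the chain is intact.
// Starting after a trusted entry gives incremental verification.
func verifyFrom(chain []entry, start int) int {
	for i := start; i < len(chain); i++ {
		prev := ""
		if i > 0 {
			prev = chain[i-1].Hash
		}
		e := chain[i]
		if e.Seq != i || e.PrevHash != prev || e.Hash != hashEntry(e.Seq, e.Event, e.PrevHash) {
			return i
		}
	}
	return -1
}

func main() {
	var chain []entry
	for _, ev := range []string{"login alice", "read task 7", "logout alice"} {
		chain = appendEntry(chain, ev)
	}
	fmt.Println(verifyFrom(chain, 0)) // intact chain

	chain[1].Event = "read task 9" // simulate tampering
	fmt.Println(verifyFrom(chain, 0))
}
```

A plain hash chain detects edits to stored entries but not a wholesale rewrite by someone who can recompute every hash, which is where the signed checkpoints and seals described above come in.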
Add new API endpoints and clean up handler interfaces:
- groups/handlers.go: New lab group management API
* CRUD operations for lab groups
* Member management with role assignment (admin/member/viewer)
* Group listing and membership queries
- tokens/handlers.go: Token generation and validation endpoints
* Create access tokens for public task sharing
* Validate tokens for secure access
* Token revocation and cleanup
- routes.go: Refactor handler registration
* Integrate groups handler into WebSocket routes
* Remove nil parameters from all handler constructors
* Cleaner dependency injection pattern
- Handler interface cleanup across all modules:
* jobs/handlers.go: Remove unused nil privacyEnforcer parameter
* jupyter/handlers.go: Streamline initialization
* scheduler/handlers.go: Consistent constructor signature
* ws/handler.go: Add groups handler to dependencies
Add comprehensive authentication and authorization enhancements:
- tokens.go: New token management system for public task access and cloning
* SHA-256 hashed token storage for security
* Token generation, validation, and automatic cleanup
* Support for public access and clone permissions
- api_key.go: Extend User struct with Groups field
* Lab group membership (ml-lab, nlp-group)
* Integration with permission system for group-based access
- flags.go: Security hardening - migrate to structured logging
* Replace log.Printf with log/slog to prevent log injection attacks
* Consistent structured output for all auth warnings
* Safe handling of file paths and errors in logs
- permissions.go: Add task sharing permission constants
* PermissionTasksReadOwn: Access own tasks
* PermissionTasksReadLab: Access lab group tasks
* PermissionTasksReadAll: Admin/institution-wide access
* PermissionTasksShare: Grant access to other users
* PermissionTasksClone: Create copies of shared tasks
* CanAccessTask() method with visibility checks
- database.go: Improve error handling
* Add structured error logging on row close failures
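
A simplified sketch of what a `CanAccessTask()` check could look like. Only the permission constant names and the Groups field come from the commit; the string values, the `Task` type, and the exact rule order are assumptions, and sharing and clone permissions are not modeled.

```go
package main

import "fmt"

// Constant names come from the commit; the string values are assumptions.
const (
	PermissionTasksReadOwn = "tasks:read:own"
	PermissionTasksReadLab = "tasks:read:lab"
	PermissionTasksReadAll = "tasks:read:all"
)

// User carries group membership alongside granted permissions.
type User struct {
	Name        string
	Groups      []string
	Permissions map[string]bool
}

// Task is a hypothetical shape; only "private" and "lab" are handled here.
type Task struct {
	Owner      string
	Group      string
	Visibility string
}

func (u User) inGroup(g string) bool {
	for _, name := range u.Groups {
		if name == g {
			return true
		}
	}
	return false
}

// CanAccessTask applies the broadest rule first, then ownership, then
// lab-group visibility. Anything else is denied by default.
func (u User) CanAccessTask(t Task) bool {
	switch {
	case u.Permissions[PermissionTasksReadAll]:
		return true
	case t.Owner == u.Name:
		return u.Permissions[PermissionTasksReadOwn]
	case t.Visibility == "lab":
		return u.Permissions[PermissionTasksReadLab] && u.inGroup(t.Group)
	default:
		return false
	}
}

func main() {
	bob := User{Name: "bob", Groups: []string{"ml-lab"},
		Permissions: map[string]bool{PermissionTasksReadLab: true}}
	carol := User{Name: "carol", Groups: []string{"nlp-group"},
		Permissions: map[string]bool{PermissionTasksReadLab: true}}
	task := Task{Owner: "alice", Group: "ml-lab", Visibility: "lab"}

	fmt.Println(bob.CanAccessTask(task), carol.CanAccessTask(task))
}
```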
Add comprehensive database storage layer for new features:
- db_groups.go: Lab group management with members, roles (admin/member/viewer),
  and group-based task visibility queries
- db_tasks.go: Task visibility system (private/lab/institution/open),
  task sharing with expiry, public clone tokens, and optimized
  ListTasksForUser() for access control
- db_tokens.go: Secure token management for public task access and cloning,
  with SHA-256 hashed token storage and automatic cleanup
- db_audit.go: Audit log persistence with checkpoint chains, tamper
  detection, and log rotation support
- schema_sqlite.sql: Updated schema with:
* groups, group_members tables
* tasks.visibility enum, task_shares with expiry
* access_tokens table with hashed tokens
* audit_logs, audit_checkpoints tables
* indexes for all foreign keys and query patterns
- db_experiments.go: Add CascadeVisibilityToTasks() for propagating
  visibility changes from experiments to associated tasks
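
The hashed token storage mentioned for db_tokens.go follows a common pattern, sketched below: only the SHA-256 digest is persisted, and the plaintext is shown to the caller once. Function names and the 32-byte token size are assumptions; expiry and cleanup are not shown.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex SHA-256 digest that would be stored in the database.
func hashToken(plain string) string {
	sum := sha256.Sum256([]byte(plain))
	return hex.EncodeToString(sum[:])
}

// generateToken creates a random token and the hash to persist.
// The 32-byte length is an assumption of this sketch.
func generateToken() (plain, hashed string, err error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", "", fmt.Errorf("generate token: %w", err)
	}
	plain = hex.EncodeToString(raw)
	return plain, hashToken(plain), nil
}

// validateToken hashes the presented value and compares it to the stored
// hash in constant time.
func validateToken(presented, storedHash string) bool {
	got := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	plain, hashed, err := generateToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(plain), len(hashed))
	fmt.Println(validateToken(plain, hashed), validateToken("guess", hashed))
}
```

An unsalted fast hash is reasonable here only because the tokens are long random values; it would not be appropriate for user-chosen passwords.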
## Problem
TestEndToEndJobLifecycle was failing with two issues:
1. Race condition: Workers signaled ready before the job was processed and
   received MsgNoWork instead of MsgJobAssign
2. getTask() didn't check pendingAcceptance, so assigned-but-not-yet-accepted
   tasks returned nil
## Changes
### Test Fix (restart_recovery_test.go)
- Replace single-shot select with retry loop that re-signals workers as ready
- Handle both assignment and non-assignment messages correctly
- Add 10ms delay between non-assignment messages to allow job processing
- Use 2-second deadline with 100ms timeout intervals
### Scheduler Fix (hub.go)
- Extend getTask() to check pendingAcceptance map after batch/service queues
- Allows GetTask() to find tasks in 'assigned' state before acceptance
- Maintains backward compatibility with existing queue/running lookups
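
A sketch of the extended lookup, assuming the hub keeps one map per task state. The `Hub` and `Task` shapes are hypothetical, and where the running-task lookup sits relative to pendingAcceptance is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical minimal task record.
type Task struct {
	ID    string
	State string
}

// Hub holds one map per place a task can live; field names other than
// pendingAcceptance are assumptions for this sketch.
type Hub struct {
	mu                sync.RWMutex
	batchQueue        map[string]*Task
	serviceQueue      map[string]*Task
	pendingAcceptance map[string]*Task
	running           map[string]*Task
}

func newHub() *Hub {
	return &Hub{
		batchQueue:        map[string]*Task{},
		serviceQueue:      map[string]*Task{},
		pendingAcceptance: map[string]*Task{},
		running:           map[string]*Task{},
	}
}

// getTask checks the queues first, then tasks that were assigned but not yet
// accepted, then running tasks. Before the fix, the pendingAcceptance step
// was missing, so such tasks fell through and nil was returned.
func (h *Hub) getTask(id string) *Task {
	h.mu.RLock()
	defer h.mu.RUnlock()
	for _, m := range []map[string]*Task{h.batchQueue, h.serviceQueue, h.pendingAcceptance, h.running} {
		if t, ok := m[id]; ok {
			return t
		}
	}
	return nil
}

func main() {
	h := newHub()
	h.pendingAcceptance["job-1"] = &Task{ID: "job-1", State: "assigned"}
	fmt.Println(h.getTask("job-1") != nil, h.getTask("job-2") == nil)
}
```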
## Testing
make test now passes: 475 passed, 0 failed, 34 skipped
Enhance ml info to query server when connected, falling back to local
manifests when offline. Unifies behavior with other commands like run,
exec, and cancel.
CLI changes:
- Add --local and --remote flags for explicit control
- Auto-detect connection state via mode.detect()
- queryRemoteRun(): Query server via WebSocket for run details
- queryLocalRun(): Read local run_manifest.json
- displayRunInfo(): Shared display logic for both sources
- Add connection status indicators (Remote: connecting.../connected)
WebSocket protocol:
- Add query_run_info opcode (0x28) to cli and server
- Add sendQueryRunInfo() method to ws/client.zig
- Protocol: [opcode:1][api_key_hash:16][run_id_len:1][run_id:var]
Server changes:
- Add handleQueryRunInfo() handler to ws/handler.go
- Returns run_id, job_name, user, timestamp, overall_sha, files_count
- Checks PermJobsRead permission
- Looks up run in experiment manager
Usage:
    ml info abc123            # Auto: tries remote, falls back to local
    ml info abc123 --local    # Force local manifest lookup
    ml info abc123 --remote   # Force remote query (fails if offline)
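
For the wire format `[opcode:1][api_key_hash:16][run_id_len:1][run_id:var]`, a server-side encoder and decoder might look roughly like this. The function names are hypothetical, and rejecting an empty run ID is an assumption rather than documented behavior.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	OpQueryRunInfo byte = 0x28 // query_run_info opcode from the commit
	hashLen             = 16   // size of the api_key_hash field
)

// encodeQueryRunInfo lays out [opcode:1][api_key_hash:16][run_id_len:1][run_id:var].
func encodeQueryRunInfo(apiKeyHash [hashLen]byte, runID string) ([]byte, error) {
	// The one-byte length prefix limits run IDs to 255 bytes.
	if len(runID) == 0 || len(runID) > 0xFF {
		return nil, errors.New("run_id must be 1..255 bytes")
	}
	buf := make([]byte, 0, 1+hashLen+1+len(runID))
	buf = append(buf, OpQueryRunInfo)
	buf = append(buf, apiKeyHash[:]...)
	buf = append(buf, byte(len(runID)))
	buf = append(buf, runID...)
	return buf, nil
}

// decodeQueryRunInfo validates the opcode and lengths before slicing.
func decodeQueryRunInfo(p []byte) (apiKeyHash [hashLen]byte, runID string, err error) {
	if len(p) < 1+hashLen+1 || p[0] != OpQueryRunInfo {
		return apiKeyHash, "", errors.New("not a query_run_info packet")
	}
	copy(apiKeyHash[:], p[1:1+hashLen])
	n := int(p[1+hashLen])
	body := p[1+hashLen+1:]
	if len(body) != n {
		return apiKeyHash, "", errors.New("run_id length mismatch")
	}
	return apiKeyHash, string(body), nil
}

func main() {
	var hash [hashLen]byte
	hash[0] = 0xAB
	pkt, err := encodeQueryRunInfo(hash, "abc123")
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", pkt)
	fmt.Println(decodeQueryRunInfo(pkt))
}
```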
Remove two unused methods and one unused parameter identified by static analysis:
- canRequeue(): never integrated into scheduling flow
- runMetricsClient clientID param: accepted but never used
- getUsageLocked(): callers inline the logic
Fixes IDE warnings about unused code per AGENTS.md cleanup discipline.
Implement VaultProvider with Transit engine:
- AppRole, Kubernetes, and Token authentication
- Encrypt/Decrypt via /transit/encrypt and /transit/decrypt
- Key lifecycle via /transit/keys API
- Health check via /sys/health
Implement AWSProvider with SDK v2:
- Per-region key naming with alias prefix
- Encrypt/Decrypt via KMS SDK
- Key lifecycle (CreateKey, Disable, ScheduleDeletion, Enable)
- AWS endpoint support for LocalStack testing
Add KMSProvider interface for external key management systems:
- Encrypt/Decrypt operations for DEK wrapping
- Key lifecycle management (Create, Disable, ScheduleDeletion, Enable)
- HealthCheck and Close methods
Implement MemoryProvider for development/testing:
- XOR encryption with HMAC-SHA256 authentication
- Secure random key generation using crypto/rand
- MAC verification to detect wrong keys
Implement DEKCache per ADR-012:
- 15-minute TTL with configurable grace window (1 hour)
- LRU eviction with 1000 entry limit
- Cache key includes (tenantID, artifactID, kmsKeyID) for isolation
- Thread-safe operations with RWMutex
- Secure memory wiping on eviction/cleanup
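
A condensed sketch of a TTL plus LRU cache along these lines. It is not the ADR-012 design: the grace window is omitted, the type and method names are assumptions, and an injectable clock is added purely so the demo can expire entries without sleeping.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
	"time"
)

// cacheKey carries all three identifiers so tenants never share entries.
type cacheKey struct{ tenantID, artifactID, kmsKeyID string }

type cacheEntry struct {
	key     cacheKey
	dek     []byte
	expires time.Time
}

type DEKCache struct {
	mu         sync.RWMutex
	ttl        time.Duration
	maxEntries int
	order      *list.List // front is the most recently used entry
	items      map[cacheKey]*list.Element
	now        func() time.Time // injectable clock, an assumption for testing
}

func NewDEKCache(ttl time.Duration, maxEntries int, now func() time.Time) *DEKCache {
	return &DEKCache{ttl: ttl, maxEntries: maxEntries, order: list.New(),
		items: make(map[cacheKey]*list.Element), now: now}
}

// Put stores a private copy of the DEK and evicts least recently used
// entries once the limit is exceeded.
func (c *DEKCache) Put(k cacheKey, dek []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[k]; ok {
		c.removeLocked(el)
	}
	cp := append([]byte(nil), dek...)
	c.items[k] = c.order.PushFront(&cacheEntry{key: k, dek: cp, expires: c.now().Add(c.ttl)})
	for c.order.Len() > c.maxEntries {
		c.removeLocked(c.order.Back())
	}
}

// Get takes the write lock because a hit reorders the LRU list.
func (c *DEKCache) Get(k cacheKey) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[k]
	if !ok {
		return nil, false
	}
	e := el.Value.(*cacheEntry)
	if !c.now().Before(e.expires) {
		c.removeLocked(el) // expired: wipe and drop
		return nil, false
	}
	c.order.MoveToFront(el)
	return append([]byte(nil), e.dek...), true
}

func (c *DEKCache) Len() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.items)
}

// removeLocked zeroes the key bytes before dropping the entry.
func (c *DEKCache) removeLocked(el *list.Element) {
	e := c.order.Remove(el).(*cacheEntry)
	for i := range e.dek {
		e.dek[i] = 0
	}
	delete(c.items, e.key)
}

// fakeClock lets the demo move time forward without sleeping.
type fakeClock struct{ t time.Time }

func (f *fakeClock) now() time.Time       { return f.t }
func (f *fakeClock) advanceMinutes(n int) { f.t = f.t.Add(time.Duration(n) * time.Minute) }

func newDemoCache(maxEntries int) (*DEKCache, *fakeClock) {
	clock := &fakeClock{t: time.Unix(0, 0)}
	return NewDEKCache(15*time.Minute, maxEntries, clock.now), clock
}

func main() {
	c, clock := newDemoCache(2)
	k1, k2, k3 := cacheKey{"t1", "a1", "k"}, cacheKey{"t1", "a2", "k"}, cacheKey{"t2", "a1", "k"}
	c.Put(k1, []byte{1})
	c.Put(k2, []byte{2})
	c.Get(k1)            // touch k1 so k2 becomes least recently used
	c.Put(k3, []byte{3}) // evicts k2
	_, ok := c.Get(k2)
	fmt.Println(ok, c.Len())
	clock.advanceMinutes(16) // past the 15-minute TTL
	_, ok = c.Get(k1)
	fmt.Println(ok)
}
```

Zeroing a Go byte slice is best effort only, since the runtime may have copied the data elsewhere; the sketch shows the intent rather than a guarantee.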
Add config package with types:
- ProviderType enum (vault, aws, memory)
- VaultConfig with AppRole/Kubernetes/Token auth
- AWSConfig with region and alias prefix
- CacheConfig with TTL, MaxEntries, GraceWindow
- Validation methods for all config types
- Add plugin_quota.go with GPU quota management for scheduler
- Update scheduler hub and protocol for plugin support
- Add comprehensive plugin quota unit tests
- Update gang service and WebSocket queue integration tests
Remove deprecated components replaced by new scheduler:
- Delete internal/controller/pacing_controller.go (replaced by scheduler/pacing.go)
- Delete internal/manifest/schema_test.go (consolidated into tests/unit/)
- Delete internal/workertest/worker.go (consolidated into tests/fixtures/)
- Update .gitignore with scheduler binary and new patterns