Add Rust build cache cleanup to maintenance script:
- Add cleanup_rust_cache() function with cargo clean integration
- Check for cargo availability before attempting cleanup
- Show cache size before/after for visibility
- Add 'rust' subcommand for targeted Rust cache cleanup
- Include Rust cache in 'benchmarks' and 'all' cleanup operations
- Add aggressive cleanup support for Rust
- Update help text and disk usage display
Update detect-native.go to detect and report both C++ and Rust native libs:
- Split library detection into C++ (native/build/) and Rust (native_rust/target/release/) sections
- Display separate '=== C++ Native Libraries ===' and '=== Rust Native Libraries ===' headers
- Check for both .so and .dylib extensions for each platform
- Update help text to show 'make native-build' for C++ and 'make rust-build' for Rust
- Show benchmark file availability for both implementations
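A minimal sketch of the split detection described above, not the actual detect-native.go. The two directories, the section headers, and the make targets come from this commit; the helper names findLibs and report are hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// libExtensions lists the suffixes the commit says are checked.
var libExtensions = []string{".so", ".dylib"}

// findLibs returns every file directly inside dir that ends in one of
// libExtensions. A missing directory simply yields no matches.
func findLibs(dir string) []string {
	var found []string
	for _, ext := range libExtensions {
		matches, err := filepath.Glob(filepath.Join(dir, "*"+ext))
		if err != nil {
			continue // Glob only fails on a malformed pattern
		}
		found = append(found, matches...)
	}
	return found
}

// report prints one section and returns how many libraries it listed.
// (Hypothetical helper; the real tool may structure its output differently.)
func report(title, dir, buildHint string) int {
	fmt.Println(title)
	libs := findLibs(dir)
	if len(libs) == 0 {
		fmt.Printf("  none found in %s (run '%s')\n", dir, buildHint)
	}
	for _, lib := range libs {
		fmt.Println("  " + lib)
	}
	return len(libs)
}

func main() {
	report("=== C++ Native Libraries ===", "native/build", "make native-build")
	report("=== Rust Native Libraries ===", "native_rust/target/release", "make rust-build")
}
```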
Extend local benchmark runner to support Rust native libraries:
- Rename 'Step 1b native' to 'Step 1b C++ native' for clarity
- Add Step 1c for Rust native library benchmarks via cargo bench
- Check for native_rust/target/release/libqueue_index.{dylib,so}
- Separate result files: native_cpp_benchmark_results.txt vs native_rust_benchmark_results.txt
- Update summary output to show both C++ and Rust benchmark availability
Add new build script for Rust native libraries:
- Builds dataset_hash and queue_index crates via cargo
- Cross-platform support: Linux (.so), macOS (.dylib), Windows (.dll)
- Outputs to bin/native/ for consistency with C++ native libs
- Error handling for missing cargo installation
Separate C++ and Rust native library targets in the Makefile:
- Rename native/rust/ references to native_rust/ for new workspace location
- Split 'Native Libraries' help section into C++ and Rust categories
- Rename prod-with-native -> prod-with-rust, detect-regressions-native -> detect-regressions-rust
- Add rust-debug target, remove native-release (redundant)
- Add compare-benchmarks target for Rust/Go/C++ performance comparison
- Guard all rust-* targets with a cargo availability check so they fail cleanly when Rust is not installed
Remove the old native/rust/ directory. Its files were previously moved to the
native_rust/ workspace at the repository root, so this cleans up the deprecated
location left behind by the Rust workspace reorganization.
Refactor plugins to use interface for testability:
- Add PodmanInterface to container package (StartContainer, StopContainer, RemoveContainer)
- Update MLflow plugin to use container.PodmanInterface
- Update TensorBoard plugin to use container.PodmanInterface
- Add comprehensive mocked tests for all three plugins (wandb, mlflow, tensorboard)
- Coverage increased from 18% to 91.4%
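The testability gain can be illustrated with a small sketch. Only the interface name and its three method names come from this commit; the method signatures, the plugin type, and the mock are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// PodmanInterface uses the method names from the commit.
// The signatures are assumed; the real ones may take a context or options.
type PodmanInterface interface {
	StartContainer(name string) error
	StopContainer(name string) error
	RemoveContainer(name string) error
}

// mockPodman records calls instead of touching a real container runtime.
type mockPodman struct {
	calls     []string
	failStart bool // when true, StartContainer returns an error
}

func (m *mockPodman) StartContainer(name string) error {
	m.calls = append(m.calls, "start:"+name)
	if m.failStart {
		return errors.New("start failed")
	}
	return nil
}

func (m *mockPodman) StopContainer(name string) error {
	m.calls = append(m.calls, "stop:"+name)
	return nil
}

func (m *mockPodman) RemoveContainer(name string) error {
	m.calls = append(m.calls, "remove:"+name)
	return nil
}

// plugin is a hypothetical stand-in for the MLflow or TensorBoard plugin.
type plugin struct {
	podman    PodmanInterface
	container string
}

// Restart stops and then starts the plugin's container.
func (p *plugin) Restart() error {
	if err := p.podman.StopContainer(p.container); err != nil {
		return fmt.Errorf("stop %s: %w", p.container, err)
	}
	if err := p.podman.StartContainer(p.container); err != nil {
		return fmt.Errorf("start %s: %w", p.container, err)
	}
	return nil
}

func main() {
	mock := &mockPodman{}
	p := &plugin{podman: mock, container: "mlflow"}
	fmt.Println(p.Restart(), mock.calls)
}
```

Because the plugin depends on the interface rather than a concrete Podman client, both the success path and the failure path can be exercised without a container runtime.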
Use atomic operations for shared variables:
- alertCount: atomic.Int32 for concurrent access
- lastAlert: atomic.Value for alert storage
Fixes data races detected by the -race flag.
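A minimal sketch of the pattern, assuming the alert payload is a string and that both fields live on one struct (the commit names only the two variables and their types). atomic.Int32 requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// alertTracker is a hypothetical holder for the two shared variables.
type alertTracker struct {
	alertCount atomic.Int32
	lastAlert  atomic.Value // always stores a string in this sketch
}

// record is safe to call from many goroutines at once.
func (a *alertTracker) record(msg string) {
	a.alertCount.Add(1)
	a.lastAlert.Store(msg)
}

// last returns the most recently stored alert, or "" if none was stored.
func (a *alertTracker) last() string {
	v, _ := a.lastAlert.Load().(string)
	return v
}

// fire records n alerts concurrently and waits for all of them.
func fire(a *alertTracker, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			a.record(fmt.Sprintf("alert-%d", i))
		}(i)
	}
	wg.Wait()
}

func main() {
	var tracker alertTracker
	fire(&tracker, 100)
	fmt.Println(tracker.alertCount.Load(), tracker.last() != "")
}
```

Note that atomic.Value panics if values of different concrete types are stored in it, so every Store call should use the same type.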
Fix database behavior to support proper test coverage:
- db_jobs.go: UpdateJobStatus now checks RowsAffected and returns error
for nonexistent jobs instead of silently succeeding
- db_audit.go: GetOldestAuditLogDate uses sql.NullString to parse SQLite
datetime strings in YYYY-MM-DD HH:MM:SS format with RFC3339 fallback
- db_experiments.go: ListTasksForExperiment uses sql.NullString for
nullable worker_id and error fields to prevent scan errors
- db_connect.go: DB struct adds isClosed state tracking with mutex;
Close() now returns error on double close to match test expectations
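Two of the fixes above can be sketched without a database driver: the datetime parsing with RFC3339 fallback, and the double-close guard. The function and error names here are hypothetical, and the real DB struct also wraps a connection that this sketch omits.

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
	"sync"
	"time"
)

// sqliteLayout matches the YYYY-MM-DD HH:MM:SS strings SQLite returns.
const sqliteLayout = "2006-01-02 15:04:05"

// parseAuditTime converts a nullable datetime column into a time.Time.
// A NULL or empty value yields the zero time and no error.
func parseAuditTime(raw sql.NullString) (time.Time, error) {
	if !raw.Valid || raw.String == "" {
		return time.Time{}, nil
	}
	if ts, err := time.Parse(sqliteLayout, raw.String); err == nil {
		return ts, nil
	}
	return time.Parse(time.RFC3339, raw.String) // fallback format
}

// nullStr is a small helper that builds a valid sql.NullString.
func nullStr(s string) sql.NullString {
	return sql.NullString{String: s, Valid: true}
}

var errAlreadyClosed = errors.New("db: already closed")

// DB tracks closed state under a mutex (connection handle omitted).
type DB struct {
	mu       sync.Mutex
	isClosed bool
}

// Close returns an error on the second and later calls.
func (d *DB) Close() error {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.isClosed {
		return errAlreadyClosed
	}
	d.isClosed = true
	return nil
}

func main() {
	ts, err := parseAuditTime(nullStr("2024-01-02 03:04:05"))
	fmt.Println(ts.Year(), err)

	db := &DB{}
	fmt.Println(db.Close(), db.Close())
}
```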
Refactor getPluginVersion to accept PluginConfig parameter:
- Change signature from getPluginVersion(pluginName) to getPluginVersion(pluginName, cfg)
- Update all call sites to pass config
- Add TODO comment for future implementation querying actual plugin binary/container
Update plugin handlers to use dynamic version retrieval:
- GetV1Plugins: Use h.getPluginVersion(name, cfg) instead of hardcoded "1.0.0"
- PutV1PluginsPluginNameConfig: Pass newConfig to version retrieval
- GetV1PluginsPluginNameHealth: Use actual version from config
This prepares the API for dynamic version reporting while maintaining
backward compatibility with the current placeholder implementation.
Add CodeNotImplemented error constant (HTTP 501) for planned but unavailable features.
Refactor WebSocket packet handling from JSON to binary protocol for improved efficiency:
New packet structure:
- PacketTypeSuccess (0x00): [type:1][json_data:var]
- PacketTypeError (0x01): [type:1][code_len:1][code:var][msg_len:2][msg:var][details_len:2][details:var]
- PacketTypeData (0x02): Reserved for future use
Update SendErrorPacket:
- Build binary error packets with length-prefixed fields
- Use WriteMessage with websocket.BinaryMessage
Update SendSuccessPacket:
- Marshal data to JSON then wrap in binary packet
- Eliminates "success" wrapper field for cleaner protocol
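The packet layouts above can be sketched as a pair of encoders plus a decoder for round-trip checking. The field order and widths follow the commit; the byte order of the 2-byte length fields is not stated, so big-endian is an assumption here, and the function names are hypothetical. binary.BigEndian.AppendUint16 requires Go 1.19 or later.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
)

const (
	PacketTypeSuccess byte = 0x00
	PacketTypeError   byte = 0x01
)

// encodeErrorPacket builds
// [type:1][code_len:1][code][msg_len:2][msg][details_len:2][details].
func encodeErrorPacket(code, msg, details string) ([]byte, error) {
	if len(code) > 0xFF || len(msg) > 0xFFFF || len(details) > 0xFFFF {
		return nil, errors.New("field too long for its length prefix")
	}
	buf := make([]byte, 0, 6+len(code)+len(msg)+len(details))
	buf = append(buf, PacketTypeError, byte(len(code)))
	buf = append(buf, code...)
	buf = binary.BigEndian.AppendUint16(buf, uint16(len(msg))) // assumed byte order
	buf = append(buf, msg...)
	buf = binary.BigEndian.AppendUint16(buf, uint16(len(details)))
	buf = append(buf, details...)
	return buf, nil
}

// encodeSuccessPacket builds [type:1][json_data] with no wrapper field.
func encodeSuccessPacket(data any) ([]byte, error) {
	body, err := json.Marshal(data)
	if err != nil {
		return nil, err
	}
	return append([]byte{PacketTypeSuccess}, body...), nil
}

// decodeErrorPacket is the inverse of encodeErrorPacket.
func decodeErrorPacket(p []byte) (code, msg, details string, err error) {
	errShort := errors.New("truncated error packet")
	if len(p) < 2 || p[0] != PacketTypeError {
		return "", "", "", errors.New("not an error packet")
	}
	n := int(p[1])
	p = p[2:]
	if len(p) < n {
		return "", "", "", errShort
	}
	code, p = string(p[:n]), p[n:]
	// read16 consumes one 2-byte length prefix and the field it describes.
	read16 := func() (string, bool) {
		if len(p) < 2 {
			return "", false
		}
		n := int(binary.BigEndian.Uint16(p))
		p = p[2:]
		if len(p) < n {
			return "", false
		}
		s := string(p[:n])
		p = p[n:]
		return s, true
	}
	var ok bool
	if msg, ok = read16(); !ok {
		return "", "", "", errShort
	}
	if details, ok = read16(); !ok {
		return "", "", "", errShort
	}
	return code, msg, details, nil
}

func main() {
	pkt, _ := encodeErrorPacket("NOT_FOUND", "job missing", "")
	code, msg, details, err := decodeErrorPacket(pkt)
	fmt.Println(len(pkt), code, msg, details == "", err)
}
```

The resulting byte slice would then be sent with WriteMessage and websocket.BinaryMessage, as the commit describes.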
Add helper functions:
- NewNotImplemented(feature) - Standard 501 error
- NewNotImplementedWithIssue(feature, issueURL) - 501 with GitHub reference
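One possible shape for these helpers, shown as a sketch. The two constructor names and the 501 status come from the commits above; the APIError type, its fields, and the message wording are assumptions, and the issue URL in main is a placeholder.

```go
package main

import (
	"fmt"
	"net/http"
)

// CodeNotImplemented marks features that are planned but unavailable.
const CodeNotImplemented = "NOT_IMPLEMENTED"

// APIError is a hypothetical error type; the real one may differ.
type APIError struct {
	Code       string
	HTTPStatus int
	Message    string
	IssueURL   string // empty when there is no tracking issue
}

func (e *APIError) Error() string {
	if e.IssueURL != "" {
		return fmt.Sprintf("%s: %s (see %s)", e.Code, e.Message, e.IssueURL)
	}
	return fmt.Sprintf("%s: %s", e.Code, e.Message)
}

// NewNotImplemented returns a standard 501 error for the named feature.
func NewNotImplemented(feature string) *APIError {
	return &APIError{
		Code:       CodeNotImplemented,
		HTTPStatus: http.StatusNotImplemented,
		Message:    feature + " is not implemented",
	}
}

// NewNotImplementedWithIssue also points the caller at a tracking issue.
func NewNotImplementedWithIssue(feature, issueURL string) *APIError {
	e := NewNotImplemented(feature)
	e.IssueURL = issueURL
	return e
}

func main() {
	fmt.Println(NewNotImplemented("AMD GPU detection"))
	// Placeholder URL for illustration only.
	fmt.Println(NewNotImplementedWithIssue("AMD GPU detection", "https://example.invalid/issue"))
}
```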
Update test paths to reflect new test locations:
- Change tests/unit/... to internal/... in test, test-unit, and verify-audit targets
- Update test-coverage target to use correct coverpkg paths
- Add coverage summary output to test-coverage target
Clean up deleted test files from old locations:
- Remove tests/unit/crypto/kms/cache_test.go (now in internal/auth/kms/)
- Remove tests/unit/crypto/kms/protocol_test.go (now in internal/auth/kms/)
- Remove tests/unit/resources/manager_test.go (now in internal/resources/)
Move store package to improve reusability and follow Go project conventions:
- cmd/tui/internal/store/store.go -> internal/store/store.go
- cmd/tui/internal/store/store_test.go -> internal/store/store_test.go
This makes the store package available to other components beyond the TUI,
reducing coupling and enabling future reuse by API server, CLI, or other tools.
Move integration-appropriate tests from tests/unit/ to tests/integration/:
- tests/unit/simple_test.go -> tests/integration/simple_test.go
- tests/unit/deployments/traefik_compose_test.go -> tests/integration/traefik_compose_test.go
- tests/unit/worker_trust_test.go -> tests/integration/worker_trust_test.go
Update test package declarations and imports to reflect new locations.
These tests were misplaced in the unit tests directory but actually test
integration between components or external systems (Traefik, worker trust).
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/queue/* -> internal/queue/* (dedup, filesystem_fallback, queue_permissions, queue_spec, queue, sqlite_queue tests)
- tests/unit/gpu/* -> internal/resources/* (gpu_detector, gpu_golden tests)
- tests/unit/resources/* -> internal/resources/* (manager_test.go)
Update import paths in test files to reflect new locations.
Note: GPU tests consolidated into resources package since GPU detection is part of resource management. Manager tests show significant new test coverage (166 lines).
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/logging/* -> internal/logging/* (logging tests)
- tests/unit/manifest/* -> internal/manifest/* (run_manifest, schema tests)
- tests/unit/network/* -> internal/network/* (retry, ssh_pool, ssh tests)
- tests/unit/privacy/* -> internal/privacy/* (pii tests)
- tests/unit/metrics/* -> internal/prommetrics/* (metrics tests)
Update import paths in test files to reflect new locations.
Note: metrics_test.go moved from tests/unit/metrics/ to internal/prommetrics/ to match the actual package name.
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/config/* -> internal/config/* (constants, mode_paths, paths, validation)
- tests/unit/container/* -> internal/container/* (podman, security tests)
- tests/unit/envpool/* -> internal/envpool/* (envpool tests)
- tests/unit/errors/* -> internal/errtypes/* (errors_test.go moved to errtypes package)
- tests/unit/experiment/* -> internal/experiment/* (manager tests)
- tests/unit/jupyter/* -> internal/jupyter/* (config, package_blacklist, service_manager, trash_restore)
Update import paths in test files to reflect new locations.
Note: errors_test.go moved from tests/unit/errors/ to internal/errtypes/ to match the package structure.
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/api/* -> internal/api/* (WebSocket handlers, helpers, duplicate detection)
- tests/unit/audit/* -> internal/audit/* (alert, sealed, verifier tests)
- tests/unit/auth/* -> internal/auth/* (API key, keychain, user manager)
- tests/unit/crypto/kms/* -> internal/auth/kms/* (cache, protocol tests)
Update import paths in test files to reflect new locations.
Benefits:
- Tests live alongside the code they test
- Easier navigation and maintenance
- Clearer package boundaries
- Follows standard Go project layout
Add Known Limitations section to AGENTS.md documenting:
- AMD GPU not implemented (use NVIDIA, Apple Silicon, or CPU)
- 100+ node gang allocation stress testing not yet implemented
- Podman-in-Docker CI requires privileged mode, not yet automated
- Error handling patterns for unimplemented features
- Container usage rules (Docker for testing/deployments, Podman for experiments)
- Error codes table (NOT_IMPLEMENTED, NOT_FOUND, INVALID_CONFIGURATION)
Update testing documentation to reflect new test locations:
- Unit tests moved from tests/unit/ to internal/ (Go convention)
- Update all test file path references in security testing docs
- Create docker-tests.yml for merge-to-main CI pipeline
- Add mock GPU test matrix (NVIDIA, Metal, CPU-only)
- Add AGENTS.md with container architecture rules:
* Docker for CI/CD testing and deployments
* Podman for ML experiment isolation only
- Update .gitignore to track AGENTS.md
- Add DisableTLSForTesting to HubConfig for test environments
- Add IsUsingTLS() method to detect scheduler TLS status
- Update MockWorker to auto-select ws:// vs wss:// protocol
- Set DisableTLSForTesting: true in DefaultHubConfig
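A sketch of how the scheme selection might work. DisableTLSForTesting, HubConfig, and IsUsingTLS are named in the list above; attaching IsUsingTLS to the config, the workerURL helper, and the /ws path are assumptions.

```go
package main

import (
	"fmt"
	"net/url"
)

// HubConfig shows only the field relevant to this sketch.
type HubConfig struct {
	DisableTLSForTesting bool
}

// IsUsingTLS reports whether connections should be encrypted.
// (Assumed receiver; the real method may live on the hub itself.)
func (c HubConfig) IsUsingTLS() bool {
	return !c.DisableTLSForTesting
}

// workerURL is a hypothetical helper that picks ws:// or wss://
// to match the scheduler's TLS status.
func workerURL(cfg HubConfig, hostPort string) string {
	scheme := "ws"
	if cfg.IsUsingTLS() {
		scheme = "wss"
	}
	u := url.URL{Scheme: scheme, Host: hostPort, Path: "/ws"}
	return u.String()
}

func main() {
	fmt.Println(workerURL(HubConfig{DisableTLSForTesting: true}, "127.0.0.1:9000"))
	fmt.Println(workerURL(HubConfig{}, "127.0.0.1:9000"))
}
```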
Remove ID and GPUCount fields from batchJob in TestServiceSlotPoolSeparation
that were assigned but never used. The test only validates SlotPool values.
GPU detection refactor:
- Major rewrite of gpu_detector.go with unified detection interface
- Support for NVIDIA (NVML), AMD (ROCm), and Apple Metal
- Runtime GPU capability querying for scheduler matching
macOS improvements:
- gpu_macos.go: native Metal device enumeration and memory queries
- Support for Apple Silicon (M1/M2/M3) unified memory reporting
- Fallback to system profiler for Intel Macs
Testing infrastructure:
- Add gpu_detector_mock.go for testing without hardware
- Update gpu_golden_test.go with platform-specific expectations
- Cross-platform GPU info validation
Major WebSocket handler refactor:
- Rewrite ws/handler.go with structured message routing and backpressure
- Add connection lifecycle management with heartbeats and timeouts
- Implement graceful connection draining for zero-downtime restarts
Protocol improvements:
- Define structured protocol types in protocol.go for hub communication
- Add versioned message envelopes for backward compatibility
- Standardize error codes and response formats across WebSocket API
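A versioned envelope could look roughly like the sketch below. The commit does not give the envelope's fields or encoding, so the JSON shape, the field names, and the version range are all assumptions chosen to show the idea: the receiver checks the version before interpreting the payload.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// currentVersion is the newest envelope version this sketch understands.
const currentVersion = 2

// Envelope is a hypothetical wire format. Payload stays raw so that it
// can be decoded according to Type once the version has been accepted.
type Envelope struct {
	Version int             `json:"version"`
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload"`
}

// decodeEnvelope parses raw bytes and rejects unsupported versions.
func decodeEnvelope(raw []byte) (Envelope, error) {
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return Envelope{}, fmt.Errorf("decode envelope: %w", err)
	}
	if env.Version < 1 || env.Version > currentVersion {
		return Envelope{}, fmt.Errorf("unsupported envelope version %d", env.Version)
	}
	return env, nil
}

func main() {
	raw := []byte(`{"version":1,"type":"job_status","payload":{"id":"j1"}}`)
	env, err := decodeEnvelope(raw)
	fmt.Println(env.Version, env.Type, string(env.Payload), err)
}
```

Accepting every version from 1 up to currentVersion is what lets older clients keep working while newer message types are introduced.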
Job streaming via WebSocket:
- Simplify ws/jobs.go with async job status streaming
- Add compression for high-volume job updates
Testing:
- Update websocket_e2e_test.go for new protocol semantics
- Add connection resilience tests
New commands and modules:
- Add workers.zig command for worker management and status
- Add scheduler_client.zig for scheduler hub communication
- Add pii.zig utility for PII detection and redaction in logs/outputs
Improvements to existing commands:
- groups.zig: enhanced group management with capability metadata
- jupyter/mod.zig: improved Jupyter workspace lifecycle handling
- tasks.zig: better task status reporting and cancellation support
Networking and sync improvements:
- ws/client.zig: WebSocket client enhancements for hub protocol
- sync_manager.zig: improved sync with scheduler state and conflict resolution
- uuid.zig: optimized UUID generation for macOS and Linux
Database utilities:
- sqlite_embedded.zig: embedded SQLite for CLI-local state caching
Add comprehensive capability routing system to scheduler hub:
- Capability-aware worker matching with requirement/offer negotiation
- Hub v2 protocol with structured message types and heartbeat management
- Worker capability advertisement and dynamic routing decisions
- Orphan recovery for disconnected workers with state reconciliation
- Template-based job scheduling with capability constraints
Add extensive test coverage:
- Unit tests for capability routing logic and heartbeat mechanics
- Unit tests for orphan recovery scenarios
- E2E tests for capability routing across multiple workers
- Hub capabilities integration tests
- Scheduler fixture helpers for test setup
Protocol improvements:
- Define structured protocol messages for hub-worker communication
- Add capability matching algorithm with scoring
- Implement graceful worker disconnection handling
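To make "capability matching with scoring" concrete, here is a small sketch of one possible approach. The commit does not describe the actual algorithm, so the capability fields, the tightest-fit scoring rule, and all names are assumptions: a worker qualifies only if it meets every requirement, and among qualifying workers the one with the least surplus wins.

```go
package main

import "fmt"

// Capabilities describes either a job's requirements or a worker's offer.
// (Hypothetical fields for illustration.)
type Capabilities struct {
	GPUs     int
	MemoryGB int
	Tags     map[string]bool
}

// Worker pairs an ID with the capabilities it advertises.
type Worker struct {
	ID    string
	Offer Capabilities
}

// score reports whether offer satisfies req and, if so, how good the fit
// is. Higher is better; surplus capacity lowers the score so that large
// workers stay free for jobs that need them.
func score(req, offer Capabilities) (int, bool) {
	if offer.GPUs < req.GPUs || offer.MemoryGB < req.MemoryGB {
		return 0, false
	}
	for tag := range req.Tags {
		if !offer.Tags[tag] {
			return 0, false
		}
	}
	surplus := (offer.GPUs-req.GPUs)*10 + (offer.MemoryGB - req.MemoryGB)
	return -surplus, true
}

// pickWorker returns the ID of the best-scoring qualifying worker.
func pickWorker(req Capabilities, workers []Worker) (string, bool) {
	bestID, bestScore, found := "", 0, false
	for _, w := range workers {
		s, ok := score(req, w.Offer)
		if !ok {
			continue
		}
		if !found || s > bestScore {
			bestID, bestScore, found = w.ID, s, true
		}
	}
	return bestID, found
}

func main() {
	workers := []Worker{
		{ID: "big", Offer: Capabilities{GPUs: 8, MemoryGB: 512, Tags: map[string]bool{"cuda": true}}},
		{ID: "small", Offer: Capabilities{GPUs: 1, MemoryGB: 32, Tags: map[string]bool{"cuda": true}}},
		{ID: "cpu", Offer: Capabilities{MemoryGB: 64}},
	}
	req := Capabilities{GPUs: 1, MemoryGB: 16, Tags: map[string]bool{"cuda": true}}
	fmt.Println(pickWorker(req, workers))
}
```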