Commit graph

157 commits

Author SHA1 Message Date
Jeremie Fraeys
f827ee522a
test(tracking/plugins): add PodmanInterface and comprehensive plugin tests for 91% coverage
Refactor plugins to use interface for testability:
- Add PodmanInterface to container package (StartContainer, StopContainer, RemoveContainer)
- Update MLflow plugin to use container.PodmanInterface
- Update TensorBoard plugin to use container.PodmanInterface
- Add comprehensive mocked tests for all three plugins (wandb, mlflow, tensorboard)
- Coverage increased from 18% to 91.4%
2026-03-14 16:59:16 -04:00
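A minimal sketch of the interface-based mocking described in f827ee522a. Only the three method names come from the commit message; the parameter and return types, the mockPodman type, and the runLifecycle helper are assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// PodmanInterface lists the three methods named in the commit message.
// The signatures are assumed; the real ones may differ.
type PodmanInterface interface {
	StartContainer(ctx context.Context, name string) (string, error)
	StopContainer(ctx context.Context, id string) error
	RemoveContainer(ctx context.Context, id string) error
}

// errPodmanUnavailable is a hypothetical failure injected by the mock.
var errPodmanUnavailable = errors.New("podman unavailable")

// mockPodman is a hypothetical test double that records every call.
type mockPodman struct {
	calls    []string
	startErr error
}

func (m *mockPodman) StartContainer(_ context.Context, name string) (string, error) {
	m.calls = append(m.calls, "start:"+name)
	if m.startErr != nil {
		return "", m.startErr
	}
	return "id-" + name, nil
}

func (m *mockPodman) StopContainer(_ context.Context, id string) error {
	m.calls = append(m.calls, "stop:"+id)
	return nil
}

func (m *mockPodman) RemoveContainer(_ context.Context, id string) error {
	m.calls = append(m.calls, "remove:"+id)
	return nil
}

// runLifecycle stands in for plugin code that depends only on the interface:
// it starts a container, then stops and removes it.
func runLifecycle(p PodmanInterface, name string) error {
	ctx := context.Background()
	id, err := p.StartContainer(ctx, name)
	if err != nil {
		return fmt.Errorf("start %s: %w", name, err)
	}
	if err := p.StopContainer(ctx, id); err != nil {
		return fmt.Errorf("stop %s: %w", id, err)
	}
	return p.RemoveContainer(ctx, id)
}

func main() {
	ok := &mockPodman{}
	fmt.Println(runLifecycle(ok, "mlflow"), ok.calls)

	failing := &mockPodman{startErr: errPodmanUnavailable}
	fmt.Println(runLifecycle(failing, "tensorboard"))
}
```

Because the plugin code sees only the interface, tests can exercise both the success and failure paths without a Podman installation.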
Jeremie Fraeys
4b8adeacdc
test(crypto): add getter tests and property-based round-trip tests
Add tests for:
- GetPublicKey: returns correct public key
- GetKeyID: returns correct key ID
- Property-based round-trip: Sign -> Verify for various message types
  (empty, single char, unicode, large messages)

Coverage: GetPublicKey 100%, GetKeyID 100%
2026-03-13 23:41:01 -04:00
Jeremie Fraeys
04175a97ee
test(tracking): add comprehensive plugin registry tests
Add tests for tracking package main exports:
- NewRegistry, Register, Get plugin management
- ProvisionAll with empty, disabled, unregistered configs
- TeardownAll lifecycle management
- NewPortAllocator, Allocate, Release with proper port reuse
- StringSetting, ToolMode constants, ToolConfig structure

Coverage: 84.7%
2026-03-13 23:35:10 -04:00
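The port reuse behaviour tested in 04175a97ee can be sketched as follows. The names NewPortAllocator, Allocate, and Release come from the commit message; the signatures, the inclusive port range, and the lowest-free-port policy are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// PortAllocator hands out ports from an inclusive range (assumed design).
type PortAllocator struct {
	mu         sync.Mutex
	start, end int
	used       map[int]bool
}

// NewPortAllocator creates an allocator for ports in [start, end].
func NewPortAllocator(start, end int) *PortAllocator {
	return &PortAllocator{start: start, end: end, used: make(map[int]bool)}
}

// Allocate returns the lowest free port, so a released port is reused
// before any higher port is handed out.
func (a *PortAllocator) Allocate() (int, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for p := a.start; p <= a.end; p++ {
		if !a.used[p] {
			a.used[p] = true
			return p, nil
		}
	}
	return 0, fmt.Errorf("no free port in range %d-%d", a.start, a.end)
}

// Release marks a port as free again. Releasing an unknown port is a no-op.
func (a *PortAllocator) Release(port int) {
	a.mu.Lock()
	defer a.mu.Unlock()
	delete(a.used, port)
}

func main() {
	a := NewPortAllocator(6006, 6007)
	first, _ := a.Allocate()
	second, _ := a.Allocate()
	a.Release(first)
	reused, _ := a.Allocate()
	fmt.Println(first, second, reused)
}
```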
Jeremie Fraeys
f74f3fa730
test(security): fix race conditions in monitor tests
Use atomic operations for shared variables:
- alertCount: atomic.Int32 for concurrent access
- lastAlert: atomic.Value for alert storage

Fixes data races detected by -race flag.
2026-03-13 23:31:22 -04:00
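The race fix in f74f3fa730 follows a common pattern, sketched here. The variable names alertCount and lastAlert come from the commit message; the countAlerts helper and the alert payload format are hypothetical. atomic.Int32 requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countAlerts fires n concurrent alert callbacks and returns the number of
// alerts observed plus one of the stored alert payloads.
func countAlerts(n int) (int32, string) {
	var alertCount atomic.Int32 // safe concurrent counter
	var lastAlert atomic.Value  // holds the most recently stored alert

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			alertCount.Add(1)
			lastAlert.Store(fmt.Sprintf("alert-%d", i))
		}(i)
	}
	wg.Wait()

	// Load returns nil if nothing was stored, so use the two-value assertion.
	last, _ := lastAlert.Load().(string)
	return alertCount.Load(), last
}

func main() {
	count, last := countAlerts(100)
	fmt.Println(count, last != "")
}
```

With plain int and string variables the same code would be flagged by go test -race, since the callbacks write from multiple goroutines.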
Jeremie Fraeys
4da027868d
fix(storage): handle NULL values and state tracking in database operations
Fixes to support proper test coverage:

- db_jobs.go: UpdateJobStatus now checks RowsAffected and returns error
  for nonexistent jobs instead of silently succeeding
- db_audit.go: GetOldestAuditLogDate uses sql.NullString to parse SQLite
  datetime strings in YYYY-MM-DD HH:MM:SS format with RFC3339 fallback
- db_experiments.go: ListTasksForExperiment uses sql.NullString for
  nullable worker_id and error fields to prevent scan errors
- db_connect.go: DB struct adds isClosed state tracking with mutex;
  Close() now returns error on double close to match test expectations
2026-03-13 23:27:35 -04:00
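The three storage fixes in 4da027868d can be illustrated without a live database. This is a sketch under assumptions: parseAuditTime, checkUpdated, fakeResult, text, and the error values are hypothetical names, and the DB type here is a stand-in that models only the isClosed tracking.

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
	"sync"
	"time"
)

var (
	errJobNotFound   = errors.New("job not found")
	errAlreadyClosed = errors.New("database already closed")
)

// parseAuditTime tries the SQLite "YYYY-MM-DD HH:MM:SS" layout first and
// falls back to RFC3339. A NULL or empty column yields the zero time.
func parseAuditTime(v sql.NullString) (time.Time, error) {
	if !v.Valid || v.String == "" {
		return time.Time{}, nil
	}
	if ts, err := time.Parse("2006-01-02 15:04:05", v.String); err == nil {
		return ts, nil
	}
	ts, err := time.Parse(time.RFC3339, v.String)
	if err != nil {
		return time.Time{}, fmt.Errorf("parse audit time %q: %w", v.String, err)
	}
	return ts, nil
}

// checkUpdated turns "zero rows affected" into an explicit error.
func checkUpdated(res sql.Result) error {
	n, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if n == 0 {
		return errJobNotFound
	}
	return nil
}

// fakeResult is a test double for sql.Result reporting a fixed row count.
type fakeResult int64

func (r fakeResult) LastInsertId() (int64, error) { return 0, nil }
func (r fakeResult) RowsAffected() (int64, error) { return int64(r), nil }

// text builds a non-NULL column value for the examples below.
func text(s string) sql.NullString { return sql.NullString{String: s, Valid: true} }

// DB models only the closed-state tracking described in the commit.
type DB struct {
	mu       sync.Mutex
	isClosed bool
}

// Close returns an error on the second and later calls.
func (d *DB) Close() error {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.isClosed {
		return errAlreadyClosed
	}
	d.isClosed = true
	return nil
}

func main() {
	ts, err := parseAuditTime(text("2026-03-13 23:27:35"))
	fmt.Println(ts, err)
	fmt.Println(checkUpdated(fakeResult(0)))
	d := &DB{}
	fmt.Println(d.Close(), d.Close())
}
```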
Jeremie Fraeys
9b8d8e5281
test(tracking): add factory plugin loader tests
Add tests for:
- NewPluginLoader: factory creation
- RegisterFactory: custom factory registration
- LoadPluginsEmpty: empty plugin handling
- LoadPluginsDisabled: skip disabled plugins
- LoadPluginsUnknown: unknown plugin handling
- PluginConfigStructure: config field validation
- LoadPluginsMLflow, TensorBoard, Wandb: plugin type support

Coverage: 79.2%
2026-03-13 23:26:52 -04:00
Jeremie Fraeys
5d39dff6a0
test(store): extend store coverage with edge cases and concurrency
Add tests for:
- Close: proper resource cleanup
- ConcurrentLogMetrics: thread-safe metric logging
- GetRunMetricsEmpty: empty result handling
- GetRunParamsEmpty: empty result handling
- MarkRunSyncedNonexistent: graceful handling of missing runs

Coverage: 75.3%
2026-03-13 23:26:41 -04:00
Jeremie Fraeys
50b6506243
test(storage): add comprehensive storage layer tests
Add tests for:
- dataset: Redis dataset operations, transfer tracking
- db_audit: audit logging with hash chain, access tracking
- db_experiments: experiment metadata, dataset associations
- db_tasks: task listing with pagination for users and groups
- db_jobs: job CRUD, state transitions, worker assignment

Coverage: storage package ~40%+
2026-03-13 23:26:33 -04:00
Jeremie Fraeys
5057f02167
test(crypto,security): add tenant key manager and anomaly monitor tests
Add comprehensive tests for:
- crypto/tenant_keys: KMS integration, key rotation, encryption/decryption
- security/monitor: sliding window, anomaly detection, concurrent access

Coverage: crypto 65.1%, security 100%
2026-03-13 23:26:22 -04:00
Jeremie Fraeys
77542b7068
refactor: update API plugins version retrieval
Refactor getPluginVersion to accept PluginConfig parameter:
- Change signature from getPluginVersion(pluginName) to getPluginVersion(pluginName, cfg)
- Update all call sites to pass config
- Add TODO comment for future implementation querying actual plugin binary/container

Update plugin handlers to use dynamic version retrieval:
- GetV1Plugins: Use h.getPluginVersion(name, cfg) instead of hardcoded "1.0.0"
- PutV1PluginsPluginNameConfig: Pass newConfig to version retrieval
- GetV1PluginsPluginNameHealth: Use actual version from config

This prepares the API for dynamic version reporting while maintaining
backward compatibility with the current placeholder implementation.
2026-03-12 16:40:39 -04:00
Jeremie Fraeys
96dd604789
feat: implement WebSocket binary protocol and NOT_IMPLEMENTED error code
Add CodeNotImplemented error constant (HTTP 501) for planned but unavailable features.

Refactor WebSocket packet handling from JSON to binary protocol for improved efficiency:

New packet structure:
- PacketTypeSuccess (0x00): [type:1][json_data:var]
- PacketTypeError (0x01): [type:1][code_len:1][code:var][msg_len:2][msg:var][details_len:2][details:var]
- PacketTypeData (0x02): Reserved for future use

Update SendErrorPacket:
- Build binary error packets with length-prefixed fields
- Use WriteMessage with websocket.BinaryMessage

Update SendSuccessPacket:
- Marshal data to JSON then wrap in binary packet
- Eliminates "success" wrapper field for cleaner protocol

Add helper functions:
- NewNotImplemented(feature) - Standard 501 error
- NewNotImplementedWithIssue(feature, issueURL) - 501 with GitHub reference
2026-03-12 16:40:23 -04:00
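The packet layouts listed in 96dd604789 can be encoded as below. The commit does not state the byte order of the two-byte length fields; big-endian is assumed here, and the function names are hypothetical. The WebSocket write itself is omitted.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Packet type bytes as listed in the commit message.
const (
	PacketTypeSuccess byte = 0x00
	PacketTypeError   byte = 0x01
	PacketTypeData    byte = 0x02 // reserved for future use
)

// encodeErrorPacket builds
// [type:1][code_len:1][code:var][msg_len:2][msg:var][details_len:2][details:var].
// Two-byte lengths are written big-endian (an assumption).
func encodeErrorPacket(code, msg, details string) ([]byte, error) {
	if len(code) > 0xFF {
		return nil, fmt.Errorf("code too long: %d bytes", len(code))
	}
	if len(msg) > 0xFFFF || len(details) > 0xFFFF {
		return nil, fmt.Errorf("message or details exceed %d bytes", 0xFFFF)
	}
	var buf bytes.Buffer
	buf.WriteByte(PacketTypeError)
	buf.WriteByte(byte(len(code)))
	buf.WriteString(code)

	var n [2]byte
	binary.BigEndian.PutUint16(n[:], uint16(len(msg)))
	buf.Write(n[:])
	buf.WriteString(msg)

	binary.BigEndian.PutUint16(n[:], uint16(len(details)))
	buf.Write(n[:])
	buf.WriteString(details)
	return buf.Bytes(), nil
}

// encodeSuccessPacket builds [type:1][json_data:var] from already-marshaled JSON.
func encodeSuccessPacket(jsonData []byte) []byte {
	return append([]byte{PacketTypeSuccess}, jsonData...)
}

func main() {
	pkt, err := encodeErrorPacket("NOT_IMPLEMENTED", "feature unavailable", "")
	fmt.Printf("% x %v\n", pkt, err)
	fmt.Printf("% x\n", encodeSuccessPacket([]byte(`{"ok":true}`)))
}
```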
Jeremie Fraeys
d0266c4a90
refactor: scheduler hub bug fix, test helpers, and orphan recovery tests
Fix bug in scheduler hub orphan reconciliation:
- Move delete(h.pendingAcceptance, taskID) inside the requeue success block
- Prevents premature cleanup when requeue fails

Add comprehensive test infrastructure:
- hub_test_helpers.go: New test helper utilities (78 lines)
  - Mock scheduler components for isolated testing
  - Test fixture setup and teardown helpers

Refactor and enhance hub capabilities tests:
- Significant restructuring of hub_capabilities_test.go (213 lines changed)
- Improved test coverage for worker capability matching

Add comprehensive orphan recovery tests:
- internal/scheduler/orphan_recovery_test.go (451 lines)
- Tests orphaned job detection and recovery
- Covers requeue logic, timeout handling, state cleanup
2026-03-12 16:38:33 -04:00
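The orphan reconciliation fix in d0266c4a90 amounts to deleting the pendingAcceptance entry only after the requeue succeeds. A minimal sketch, assuming a simplified hub with a bounded queue; the field types, the requeue method, and reconcileOrphan are hypothetical stand-ins for the real scheduler code.

```go
package main

import (
	"errors"
	"fmt"
)

// hub is a simplified stand-in for the scheduler hub.
type hub struct {
	pendingAcceptance map[string]string // taskID -> workerID (assumed value type)
	queue             []string
	queueCap          int
}

// requeue fails when the queue is full, which lets the example exercise
// the failure path.
func (h *hub) requeue(taskID string) error {
	if len(h.queue) >= h.queueCap {
		return errors.New("queue full")
	}
	h.queue = append(h.queue, taskID)
	return nil
}

// reconcileOrphan requeues a task whose worker disappeared. The pending
// entry is removed only inside the success path, so a failed requeue
// leaves the task tracked for a later retry.
func (h *hub) reconcileOrphan(taskID string) error {
	if _, ok := h.pendingAcceptance[taskID]; !ok {
		return nil
	}
	if err := h.requeue(taskID); err != nil {
		return fmt.Errorf("requeue %s: %w", taskID, err)
	}
	delete(h.pendingAcceptance, taskID)
	return nil
}

func main() {
	h := &hub{pendingAcceptance: map[string]string{"t1": "w1"}, queueCap: 0}
	fmt.Println(h.reconcileOrphan("t1"), len(h.pendingAcceptance))

	h.queueCap = 1
	fmt.Println(h.reconcileOrphan("t1"), len(h.pendingAcceptance))
}
```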
Jeremie Fraeys
939faeb8e4
refactor: relocate store package from cmd/tui/internal to internal
Move store package to improve reusability and follow Go project conventions:
- cmd/tui/internal/store/store.go -> internal/store/store.go
- cmd/tui/internal/store/store_test.go -> internal/store/store_test.go

This makes the store package available to other components beyond the TUI,
reducing coupling and enabling future reuse by API server, CLI, or other tools.
2026-03-12 16:38:01 -04:00
Jeremie Fraeys
61660dc925
refactor: co-locate security, storage, telemetry, tracking, worker tests
Move unit tests from tests/unit/ to internal/ following Go conventions:

Security tests:
- tests/unit/security/* -> internal/security/* (audit, config_integrity, filetype, gpu_audit, hipaa_validation, manifest_filename, path_traversal, resource_quota, secrets)

Storage tests:
- tests/unit/storage/* -> internal/storage/* (db, experiment_metadata)

Telemetry tests:
- tests/unit/telemetry/* -> internal/telemetry/* (telemetry)

Tracking tests:
- tests/unit/reproducibility/* -> internal/tracking/* (config_hash, environment_capture)

Worker tests:
- tests/unit/worker/* -> internal/worker/* (artifacts, config, hash_bench, plugins/jupyter_task, plugins/vllm, prewarm_v1, run_manifest_execution, snapshot_stage, snapshot_store, worker)

Update import paths in test files to reflect new locations.
2026-03-12 16:37:03 -04:00
Jeremie Fraeys
74e06017b5
refactor: co-locate scheduler non-hub tests with source code
Move unit tests from tests/unit/scheduler/ to internal/scheduler/ following Go conventions:
- capability_routing_test.go - Worker capability-based job routing tests
- failure_scenarios_test.go - Scheduler failure handling and recovery tests
- heartbeat_test.go - Worker heartbeat monitoring tests
- plugin_quota_test.go - Plugin resource quota enforcement tests
- port_allocator_test.go - Dynamic port allocation for services tests
- priority_queue_test.go - Job priority queue implementation tests
- service_templates_test.go - Service template management tests
- state_store_test.go - Scheduler state persistence tests

Note: orphan_recovery_test.go excluded from this commit - will be handled with hub refactoring due to significant test changes.
2026-03-12 16:36:29 -04:00
Jeremie Fraeys
ee0b90cfc5
refactor: co-locate queue and resources tests, add manager tests
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/queue/* -> internal/queue/* (dedup, filesystem_fallback, queue_permissions, queue_spec, queue, sqlite_queue tests)
- tests/unit/gpu/* -> internal/resources/* (gpu_detector, gpu_golden tests)
- tests/unit/resources/* -> internal/resources/* (manager_test.go)

Update import paths in test files to reflect new locations.

Note: GPU tests consolidated into resources package since GPU detection is part of resource management. Manager tests show significant new test coverage (166 lines).
2026-03-12 16:36:02 -04:00
Jeremie Fraeys
ca6ad970c3
refactor: co-locate logging, manifest, network, privacy, prommetrics tests
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/logging/* -> internal/logging/* (logging tests)
- tests/unit/manifest/* -> internal/manifest/* (run_manifest, schema tests)
- tests/unit/network/* -> internal/network/* (retry, ssh_pool, ssh tests)
- tests/unit/privacy/* -> internal/privacy/* (pii tests)
- tests/unit/metrics/* -> internal/prommetrics/* (metrics tests)

Update import paths in test files to reflect new locations.

Note: metrics_test.go moved from tests/unit/metrics/ to internal/prommetrics/ to match the actual package name.
2026-03-12 16:35:37 -04:00
Jeremie Fraeys
cf84246115
refactor: co-locate config, container, envpool, errors, experiment, jupyter tests
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/config/* -> internal/config/* (constants, mode_paths, paths, validation)
- tests/unit/container/* -> internal/container/* (podman, security tests)
- tests/unit/envpool/* -> internal/envpool/* (envpool tests)
- tests/unit/errors/* -> internal/errtypes/* (errors_test.go moved to errtypes package)
- tests/unit/experiment/* -> internal/experiment/* (manager tests)
- tests/unit/jupyter/* -> internal/jupyter/* (config, package_blacklist, service_manager, trash_restore)

Update import paths in test files to reflect new locations.

Note: errors_test.go moved from tests/unit/errors/ to internal/errtypes/ to match the package structure.
2026-03-12 16:35:15 -04:00
Jeremie Fraeys
a4e2ecdbe6
refactor: co-locate api, audit, auth tests with source code
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/api/* -> internal/api/* (WebSocket handlers, helpers, duplicate detection)
- tests/unit/audit/* -> internal/audit/* (alert, sealed, verifier tests)
- tests/unit/auth/* -> internal/auth/* (API key, keychain, user manager)
- tests/unit/crypto/kms/* -> internal/auth/kms/* (cache, protocol tests)

Update import paths in test files to reflect new locations.

Benefits:
- Tests live alongside the code they test
- Easier navigation and maintenance
- Clearer package boundaries
- Follows standard Go project layout
2026-03-12 16:34:54 -04:00
Jeremie Fraeys
ca913e8878
feat(scheduler): add test mode config and TLS detection
- Add DisableTLSForTesting to HubConfig for test environments
- Add IsUsingTLS() method to detect scheduler TLS status
- Update MockWorker to auto-select ws:// vs wss:// protocol
- Set DisableTLSForTesting: true in DefaultHubConfig
2026-03-12 14:05:35 -04:00
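The protocol auto-selection added in ca913e8878 can be sketched as follows. DisableTLSForTesting and IsUsingTLS are named in the commit; the Hub struct layout and the workerURL helper are assumptions, and the real IsUsingTLS may consult more than this one flag.

```go
package main

import "fmt"

// HubConfig carries the test-mode switch named in the commit.
type HubConfig struct {
	DisableTLSForTesting bool
}

// Hub is a simplified stand-in for the scheduler hub.
type Hub struct {
	cfg HubConfig
}

// IsUsingTLS reports whether the hub serves TLS. In this sketch it depends
// only on the test flag.
func (h *Hub) IsUsingTLS() bool {
	return !h.cfg.DisableTLSForTesting
}

// workerURL picks ws:// or wss:// the way a mock worker might.
func workerURL(h *Hub, addr string) string {
	scheme := "wss"
	if !h.IsUsingTLS() {
		scheme = "ws"
	}
	return fmt.Sprintf("%s://%s", scheme, addr)
}

func main() {
	testHub := &Hub{cfg: HubConfig{DisableTLSForTesting: true}}
	prodHub := &Hub{}
	fmt.Println(workerURL(testHub, "localhost:7777"))
	fmt.Println(workerURL(prodHub, "scheduler.example:7777"))
}
```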
Jeremie Fraeys
2b1ef10514
test(chaos): add worker disconnect chaos test and queue improvements
Chaos testing:
- Add worker_disconnect_chaos_test.go for network partition resilience
- Test scheduler hub recovery and job reassignment scenarios

Queue layer updates:
- event_store.go: add event sourcing for queue operations
- native_queue.go: extend native queue with batch operations and indexing
2026-03-12 12:08:21 -04:00
Jeremie Fraeys
17170667e2
feat(worker): improve lifecycle management and vLLM plugin
Lifecycle improvements:
- runloop.go: refined state machine with better error recovery
- service_manager.go: service dependency management and health checks
- states.go: add states for capability advertisement and draining

Container execution:
- container.go: improved OCI runtime integration with supply chain checks
- Add image verification and signature validation
- Better resource limits enforcement for GPU/memory

vLLM plugin updates:
- vllm.go: support for vLLM 0.3+ with new engine arguments
- Add quantization-aware scheduling (AWQ, GPTQ, FP8)
- Improve model download and caching logic

Configuration:
- config.go: add capability advertisement configuration
- snapshot_store.go: improve snapshot management for checkpointing
2026-03-12 12:05:02 -04:00
Jeremie Fraeys
c18a8619fe
feat(api): add structured error package and refactor handlers
New error handling:
- Add internal/api/errors/errors.go with structured API error types
- Standardize error codes across all API endpoints
- Separate user-facing error messages from internal error details

Handler improvements:
- jupyter/handlers.go: better workspace lifecycle and error handling
- plugins/handlers.go: plugin management with validation
- groups/handlers.go: group CRUD with capability metadata
- jobs/handlers.go: job submission and monitoring improvements
- datasets/handlers.go: dataset upload/download with progress
- validate/handlers.go: manifest validation with detailed errors
- audit/handlers.go: audit log querying with filters

Server configuration:
- server_config.go: refined config loading with validation
- server_gen.go: improved code generation for OpenAPI specs
2026-03-12 12:04:46 -04:00
Jeremie Fraeys
37c4d4e9c7
feat(crypto,auth): harden KMS and improve permission handling
KMS improvements:
- cache.go: add LRU eviction with memory-bounded caches
- provider.go: refactor provider initialization and key rotation
- tenant_keys.go: per-tenant key isolation with envelope encryption

Auth layer updates:
- hybrid.go: refine hybrid auth flow for API key + JWT
- permissions_loader.go: faster permission caching with hot-reload
- validator.go: stricter validation with detailed error messages

Security middleware:
- security.go: add rate limiting headers and CORS refinement

Testing and benchmarks:
- Add KMS cache and protocol unit tests
- Add KMS benchmark tests for encryption throughput
- Update KMS integration tests for tenant isolation
2026-03-12 12:04:32 -04:00
Jeremie Fraeys
de83300962
feat(worker): refactor GPU detection with macOS Metal support
GPU detection refactor:
- Major rewrite of gpu_detector.go with unified detection interface
- Support for NVIDIA (NVML), AMD (ROCm), and Apple Metal
- Runtime GPU capability querying for scheduler matching

macOS improvements:
- gpu_macos.go: native Metal device enumeration and memory queries
- Support for Apple Silicon (M1/M2/M3) unified memory reporting
- Fallback to system profiler for Intel Macs

Testing infrastructure:
- Add gpu_detector_mock.go for testing without hardware
- Update gpu_golden_test.go with platform-specific expectations
- Cross-platform GPU info validation
2026-03-12 12:02:41 -04:00
Jeremie Fraeys
188cf55939
refactor(api): overhaul WebSocket handler and protocol layer
Major WebSocket handler refactor:
- Rewrite ws/handler.go with structured message routing and backpressure
- Add connection lifecycle management with heartbeats and timeouts
- Implement graceful connection draining for zero-downtime restarts

Protocol improvements:
- Define structured protocol types in protocol.go for hub communication
- Add versioned message envelopes for backward compatibility
- Standardize error codes and response formats across WebSocket API

Job streaming via WebSocket:
- Simplify ws/jobs.go with async job status streaming
- Add compression for high-volume job updates

Testing:
- Update websocket_e2e_test.go for new protocol semantics
- Add connection resilience tests
2026-03-12 12:01:21 -04:00
Jeremie Fraeys
57787e1e7b
feat(scheduler): implement capability-based routing and hub v2
Add comprehensive capability routing system to scheduler hub:
- Capability-aware worker matching with requirement/offer negotiation
- Hub v2 protocol with structured message types and heartbeat management
- Worker capability advertisement and dynamic routing decisions
- Orphan recovery for disconnected workers with state reconciliation
- Template-based job scheduling with capability constraints

Add extensive test coverage:
- Unit tests for capability routing logic and heartbeat mechanics
- Unit tests for orphan recovery scenarios
- E2E tests for capability routing across multiple workers
- Hub capabilities integration tests
- Scheduler fixture helpers for test setup

Protocol improvements:
- Define structured protocol messages for hub-worker communication
- Add capability matching algorithm with scoring
- Implement graceful worker disconnection handling
2026-03-12 12:00:05 -04:00
Jeremie Fraeys
13ffb81cab
fix: add CGO build tags to consistency tests, remove unused isHex function
2026-03-08 13:10:00 -04:00
Jeremie Fraeys
c74e91dd69
test: update test suite and remove deprecated privacy middleware
Test improvements:
- fixtures/: Updated mocks, fixtures with group context, SSH server, TUI driver
- integration/: WebSocket queue and handler tests with groups
- e2e/: WebSocket and TLS proxy end-to-end tests
- unit/api/ws_test.go: WebSocket API tests
- unit/scheduler/service_templates_test.go: Service template tests
- benchmarks/scheduler_bench_test.go: Performance benchmarks

Cleanup:
- Remove privacy middleware (replaced by audit system)
- Remove privacy_test.go
2026-03-08 13:03:55 -04:00
Jeremie Fraeys
4b2782f674
feat(domain): add task visibility and supporting infrastructure
Core domain and utility updates:

- domain/task.go: Task model with visibility system
  * Visibility enum: private, lab, institution, open
  * Group associations for lab-scoped access
  * CreatedBy tracking for ownership
  * Sharing metadata with expiry

- config/paths.go: Group-scoped data directories and audit log paths
- crypto/signing.go: Key management for audit sealing, token signature verification
- container/supply_chain.go: Image provenance tracking, vulnerability scanning
- fileutil/filetype.go: MIME type detection and security validation
- fileutil/secure.go: Protected file permissions, secure deletion
- jupyter/: Package and service manager updates
- experiment/manager.go: Visibility cascade from experiments to tasks
- network/ssh.go: SSH tunneling improvements
- queue/: Filesystem queue enhancements
2026-03-08 13:03:27 -04:00
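One way to read the four visibility levels introduced in 4b2782f674 is sketched below. The level names come from the commit; the access semantics (owner only, shared group, any authenticated user, anyone), the struct fields other than CreatedBy, and the function names are assumptions rather than the project's actual rules.

```go
package main

import "fmt"

// Visibility mirrors the enum values listed in the commit message.
type Visibility string

const (
	VisibilityPrivate     Visibility = "private"
	VisibilityLab         Visibility = "lab"
	VisibilityInstitution Visibility = "institution"
	VisibilityOpen        Visibility = "open"
)

// parseVisibility rejects values outside the four known levels.
func parseVisibility(s string) (Visibility, error) {
	switch v := Visibility(s); v {
	case VisibilityPrivate, VisibilityLab, VisibilityInstitution, VisibilityOpen:
		return v, nil
	default:
		return "", fmt.Errorf("unknown visibility %q", s)
	}
}

// User and Task are simplified stand-ins for the domain types.
type User struct {
	Name   string
	Groups []string
}

type Task struct {
	CreatedBy  string
	Visibility Visibility
	Groups     []string
}

// sharesGroup reports whether the two group lists intersect.
func sharesGroup(a, b []string) bool {
	seen := make(map[string]bool, len(a))
	for _, g := range a {
		seen[g] = true
	}
	for _, g := range b {
		if seen[g] {
			return true
		}
	}
	return false
}

// canAccess applies the assumed rules; a nil user is unauthenticated.
func canAccess(u *User, t Task) bool {
	if t.Visibility == VisibilityOpen {
		return true
	}
	if u == nil {
		return false
	}
	if u.Name == t.CreatedBy {
		return true
	}
	switch t.Visibility {
	case VisibilityInstitution:
		return true
	case VisibilityLab:
		return sharesGroup(u.Groups, t.Groups)
	default: // private or unrecognized: owner only
		return false
	}
}

func main() {
	task := Task{CreatedBy: "alice", Visibility: VisibilityLab, Groups: []string{"ml-lab"}}
	bob := &User{Name: "bob", Groups: []string{"ml-lab"}}
	carol := &User{Name: "carol", Groups: []string{"nlp-group"}}
	fmt.Println(canAccess(bob, task), canAccess(carol, task), canAccess(nil, task))
}
```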
Jeremie Fraeys
0b5e99f720
refactor(scheduler,worker): improve service management and GPU detection
Scheduler enhancements:
- auth.go: Group membership validation in authentication
- hub.go: Task distribution with group affinity
- port_allocator.go: Dynamic port allocation with conflict resolution
- scheduler_conn.go: Connection pooling and retry logic
- service_manager.go: Lifecycle management for scheduler services
- service_templates.go: Template-based service configuration
- state.go: Persistent state management with recovery

Worker improvements:
- config.go: Extended configuration for task visibility rules
- execution/setup.go: Sandboxed execution environment setup
- executor/container.go: Container runtime integration
- executor/runner.go: Task runner with visibility enforcement
- gpu_detector.go: Robust GPU detection (NVIDIA, AMD, Apple Silicon, CPU fallback)
- integrity/validate.go: Data integrity validation
- lifecycle/runloop.go: Improved runloop with graceful shutdown
- lifecycle/service_manager.go: Service lifecycle coordination
- process/isolation.go + isolation_unix.go: Process isolation with namespaces/cgroups
- tenant/manager.go: Multi-tenant resource isolation
- tenant/middleware.go: Tenant context propagation
- worker.go: Core worker with group-scoped task execution
2026-03-08 13:03:15 -04:00
Jeremie Fraeys
1c7205c0a0
feat(audit): add HTTP audit middleware and tamper-evident logging
Comprehensive audit system for security and compliance:

- middleware/audit.go: HTTP request/response auditing middleware
  * Captures request details, user identity, response status
  * Chains audit events with cryptographic hashes for tamper detection
  * Configurable filtering for sensitive data redaction

- audit/chain.go: Blockchain-style audit log chaining
  * Each entry includes hash of previous entry
  * Tamper detection through hash verification
  * Supports incremental verification without full scan

- checkpoint.go: Periodic integrity checkpoints
  * Creates signed checkpoints for fast verification
  * Configurable checkpoint intervals
  * Recovery from last known good checkpoint

- rotation.go: Automatic log rotation and archival
  * Size-based and time-based rotation policies
  * Compressed archival with integrity seals
  * Retention policy enforcement

- sealed.go: Cryptographic sealing of audit logs
  * Digital signatures for log integrity
  * HSM support preparation
  * Exportable sealed bundles for external auditors

- verifier.go: Log verification and forensic analysis
  * Complete chain verification from genesis to latest
  * Detects gaps, tampering, unauthorized modifications
  * Forensic export for incident response
2026-03-08 13:03:02 -04:00
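The hash chaining described for audit/chain.go in 1c7205c0a0 can be sketched as below. The Entry fields, the way the previous hash and the event are combined before hashing, and the function names are assumptions; the commit states only that each entry includes the hash of the previous entry. Signed checkpoints and rotation are left out.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Entry is a simplified audit record linked to its predecessor.
type Entry struct {
	Event    string
	PrevHash string
	Hash     string
}

// hashEntry binds an event to the previous entry's hash.
func hashEntry(prevHash, event string) string {
	sum := sha256.Sum256([]byte(prevHash + "\n" + event))
	return hex.EncodeToString(sum[:])
}

// appendEntry adds an event to the chain; the first entry has an empty PrevHash.
func appendEntry(chain []Entry, event string) []Entry {
	prev := ""
	if len(chain) > 0 {
		prev = chain[len(chain)-1].Hash
	}
	return append(chain, Entry{Event: event, PrevHash: prev, Hash: hashEntry(prev, event)})
}

// verify walks the chain from the first entry and reports the first entry
// whose link or hash does not match.
func verify(chain []Entry) error {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev {
			return fmt.Errorf("entry %d: broken link to previous entry", i)
		}
		if e.Hash != hashEntry(e.PrevHash, e.Event) {
			return fmt.Errorf("entry %d: content does not match hash", i)
		}
		prev = e.Hash
	}
	return nil
}

func main() {
	var chain []Entry
	for _, ev := range []string{"login alice", "submit job 42", "logout alice"} {
		chain = appendEntry(chain, ev)
	}
	fmt.Println(verify(chain))

	chain[1].Event = "submit job 43" // simulate tampering
	fmt.Println(verify(chain))
}
```

Editing any entry changes its hash, so verification fails at that entry unless every later hash is recomputed as well.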
Jeremie Fraeys
7e5ceec069
feat(api): add groups and tokens handlers, refactor routes
Add new API endpoints and clean up handler interfaces:

- groups/handlers.go: New lab group management API
  * CRUD operations for lab groups
  * Member management with role assignment (admin/member/viewer)
  * Group listing and membership queries

- tokens/handlers.go: Token generation and validation endpoints
  * Create access tokens for public task sharing
  * Validate tokens for secure access
  * Token revocation and cleanup

- routes.go: Refactor handler registration
  * Integrate groups handler into WebSocket routes
  * Remove nil parameters from all handler constructors
  * Cleaner dependency injection pattern

- Handler interface cleanup across all modules:
  * jobs/handlers.go: Remove unused nil privacyEnforcer parameter
  * jupyter/handlers.go: Streamline initialization
  * scheduler/handlers.go: Consistent constructor signature
  * ws/handler.go: Add groups handler to dependencies
2026-03-08 12:51:25 -04:00
Jeremie Fraeys
c52179dcbe
feat(auth): add token-based access and structured logging
Add comprehensive authentication and authorization enhancements:

- tokens.go: New token management system for public task access and cloning
  * SHA-256 hashed token storage for security
  * Token generation, validation, and automatic cleanup
  * Support for public access and clone permissions

- api_key.go: Extend User struct with Groups field
  * Lab group membership (ml-lab, nlp-group)
  * Integration with permission system for group-based access

- flags.go: Security hardening - migrate to structured logging
  * Replace log.Printf with log/slog to prevent log injection attacks
  * Consistent structured output for all auth warnings
  * Safe handling of file paths and errors in logs

- permissions.go: Add task sharing permission constants
  * PermissionTasksReadOwn: Access own tasks
  * PermissionTasksReadLab: Access lab group tasks
  * PermissionTasksReadAll: Admin/institution-wide access
  * PermissionTasksShare: Grant access to other users
  * PermissionTasksClone: Create copies of shared tasks
  * CanAccessTask() method with visibility checks

- database.go: Improve error handling
  * Add structured error logging on row close failures
2026-03-08 12:51:07 -04:00
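The SHA-256 hashed token storage mentioned for tokens.go in c52179dcbe typically looks like the sketch below: the plaintext token is shown to the caller once and only its digest is stored. The function names, the 32-byte token length, and the hex encoding are assumptions.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex SHA-256 digest that would be persisted.
func hashToken(plain string) string {
	sum := sha256.Sum256([]byte(plain))
	return hex.EncodeToString(sum[:])
}

// newToken generates a random token and the hash to store for it.
func newToken() (plain, stored string, err error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", "", fmt.Errorf("read random bytes: %w", err)
	}
	plain = hex.EncodeToString(raw)
	return plain, hashToken(plain), nil
}

// validate hashes the presented token and compares digests in constant time.
func validate(presented, storedHash string) bool {
	got := hashToken(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	plain, stored, err := newToken()
	if err != nil {
		fmt.Println("token generation failed:", err)
		return
	}
	fmt.Println(len(plain), validate(plain, stored), validate("guess", stored))
}
```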
Jeremie Fraeys
fbcf4d38e5
feat(storage): add groups, tasks, tokens, and audit database schemas
Add comprehensive database storage layer for new features:

- db_groups.go: Lab group management with members, roles (admin/member/viewer),
  and group-based task visibility queries

- db_tasks.go: Task visibility system (private/lab/institution/open),
  task sharing with expiry, public clone tokens, and optimized
  ListTasksForUser() for access control

- db_tokens.go: Secure token management for public task access and cloning,
  with SHA-256 hashed token storage and automatic cleanup

- db_audit.go: Audit log persistence with checkpoint chains, tamper
  detection, and log rotation support

- schema_sqlite.sql: Updated schema with:
  - groups, group_members tables
  - tasks.visibility enum, task_shares with expiry
  - access_tokens table with hashed tokens
  - audit_logs, audit_checkpoints tables
  - indexes for all foreign keys and query patterns

- db_experiments.go: Add CascadeVisibilityToTasks() for propagating
  visibility changes from experiments to associated tasks
2026-03-08 12:48:42 -04:00
Jeremie Fraeys
ba9a358412
fix(scheduler): resolve TestEndToEndJobLifecycle race and getTask bug
## Problem
TestEndToEndJobLifecycle was failing with two issues:
1. Race condition: Workers signaled ready before job was processed, receiving
   MsgNoWork instead of MsgJobAssign
2. getTask() didn't check pendingAcceptance - assigned-but-not-yet-accepted
   tasks returned nil

## Changes

### Test Fix (restart_recovery_test.go)
- Replace single-shot select with retry loop that re-signals workers as ready
- Handle both assignment and non-assignment messages correctly
- Add 10ms delay between non-assignment messages to allow job processing
- Use 2-second deadline with 100ms timeout intervals

### Scheduler Fix (hub.go)
- Extend getTask() to check pendingAcceptance map after batch/service queues
- Allows GetTask() to find tasks in 'assigned' state before acceptance
- Maintains backward compatibility with existing queue/running lookups

## Testing
`make test` now passes: 475 passed, 0 failed, 34 skipped
2026-03-05 14:40:43 -05:00
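The getTask() change in ba9a358412 extends the lookup to tasks that were assigned but not yet accepted. A minimal sketch, assuming each collection is a map keyed by task ID; the real hub uses batch and service queues, and the Task fields here are hypothetical.

```go
package main

import "fmt"

// Task is a simplified stand-in for the scheduler's task type.
type Task struct {
	ID    string
	State string
}

func (t *Task) String() string { return fmt.Sprintf("%s(%s)", t.ID, t.State) }

// hub holds the three places a task can live in this sketch.
type hub struct {
	queued            map[string]*Task
	running           map[string]*Task
	pendingAcceptance map[string]*Task
}

// getTask checks the existing queued and running lookups first, then falls
// back to pendingAcceptance so tasks in the "assigned" state are found.
func (h *hub) getTask(id string) *Task {
	for _, m := range []map[string]*Task{h.queued, h.running, h.pendingAcceptance} {
		if t, ok := m[id]; ok {
			return t
		}
	}
	return nil
}

func main() {
	h := &hub{
		queued:            map[string]*Task{"q1": {ID: "q1", State: "queued"}},
		pendingAcceptance: map[string]*Task{"a1": {ID: "a1", State: "assigned"}},
	}
	fmt.Println(h.getTask("q1"), h.getTask("a1"), h.getTask("missing") == nil)
}
```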
Jeremie Fraeys
c6a224d5fc
feat(cli,server): unify info command with remote/local support
Enhance ml info to query server when connected, falling back to local
manifests when offline. Unifies behavior with other commands like run,
exec, and cancel.

CLI changes:
- Add --local and --remote flags for explicit control
- Auto-detect connection state via mode.detect()
- queryRemoteRun(): Query server via WebSocket for run details
- queryLocalRun(): Read local run_manifest.json
- displayRunInfo(): Shared display logic for both sources
- Add connection status indicators (Remote: connecting.../connected)

WebSocket protocol:
- Add query_run_info opcode (0x28) to cli and server
- Add sendQueryRunInfo() method to ws/client.zig
- Protocol: [opcode:1][api_key_hash:16][run_id_len:1][run_id:var]

Server changes:
- Add handleQueryRunInfo() handler to ws/handler.go
- Returns run_id, job_name, user, timestamp, overall_sha, files_count
- Checks PermJobsRead permission
- Looks up run in experiment manager

Usage:
  ml info abc123              # Auto: tries remote, falls back to local
  ml info abc123 --local      # Force local manifest lookup
  ml info abc123 --remote     # Force remote query (fails if offline)
2026-03-05 12:07:00 -05:00
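The query_run_info frame from c6a224d5fc, [opcode:1][api_key_hash:16][run_id_len:1][run_id:var] with opcode 0x28, can be built and parsed as below. This is a Go sketch of the wire layout only; the function names are hypothetical, and the actual client side lives in ws/client.zig.

```go
package main

import "fmt"

// opQueryRunInfo is the opcode given in the commit message.
const opQueryRunInfo byte = 0x28

// headerLen covers the opcode, the 16-byte API key hash, and the length byte.
const headerLen = 1 + 16 + 1

// encodeQueryRunInfo builds the request frame.
func encodeQueryRunInfo(apiKeyHash [16]byte, runID string) ([]byte, error) {
	if len(runID) == 0 || len(runID) > 255 {
		return nil, fmt.Errorf("run ID length %d outside 1-255", len(runID))
	}
	pkt := make([]byte, 0, headerLen+len(runID))
	pkt = append(pkt, opQueryRunInfo)
	pkt = append(pkt, apiKeyHash[:]...)
	pkt = append(pkt, byte(len(runID)))
	pkt = append(pkt, runID...)
	return pkt, nil
}

// decodeQueryRunInfo validates the frame and extracts its fields.
func decodeQueryRunInfo(pkt []byte) (hash [16]byte, runID string, err error) {
	if len(pkt) < headerLen {
		return hash, "", fmt.Errorf("frame too short: %d bytes", len(pkt))
	}
	if pkt[0] != opQueryRunInfo {
		return hash, "", fmt.Errorf("unexpected opcode 0x%02x", pkt[0])
	}
	copy(hash[:], pkt[1:17])
	n := int(pkt[17])
	if len(pkt) != headerLen+n {
		return hash, "", fmt.Errorf("run ID length %d does not match frame size %d", n, len(pkt))
	}
	return hash, string(pkt[headerLen:]), nil
}

func main() {
	var hash [16]byte
	hash[0] = 0xAB
	pkt, err := encodeQueryRunInfo(hash, "abc123")
	fmt.Printf("% x %v\n", pkt, err)
	_, runID, err := decodeQueryRunInfo(pkt)
	fmt.Println(runID, err)
}
```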
Jeremie Fraeys
747579eae4
refactor: misc improvements across codebase
Various improvements:
- Makefile: build optimizations and native lib integration
- prune.zig: cleanup logic refinements
- status.zig: improved status reporting
- experiment_core.zig: core functionality updates
- progress.zig: progress bar improvements
- task.go: domain model updates for task handling

All tests pass.
2026-03-05 10:58:22 -05:00
Jeremie Fraeys
08ab628546
refactor(scheduler): remove dead code
Remove two unused methods and one unused parameter identified by static analysis:
- canRequeue(): never integrated into scheduling flow
- runMetricsClient clientID param: accepted but never used
- getUsageLocked(): callers inline the logic

Fixes IDE warnings about unused code per AGENTS.md cleanup discipline.
2026-03-04 13:35:18 -05:00
Jeremie Fraeys
7cd86fb88a
feat: add new API handlers, build scripts, and ADRs
- Introduce audit, plugin, and scheduler API handlers
- Add spec_embed.go for OpenAPI spec embedding
- Create modular build scripts (cli, go, native, cross-platform)
- Add deployment cleanup and health-check utilities
- New ADRs: hot reload, audit store, SSE updates, RBAC, caching, offline mode, KMS regions, tenant offboarding
- Add KMS configuration schema and worker variants
- Include KMS benchmark tests
2026-03-04 13:24:27 -05:00
Jeremie Fraeys
61081655d2
feat: enhance worker execution and scheduler service templates
- Refactor worker configuration management
- Improve container executor lifecycle handling
- Update runloop and worker core logic
- Enhance scheduler service template generation
- Remove obsolete 'scheduler' symlink/directory
2026-03-04 13:24:20 -05:00
Jeremie Fraeys
66f262d788
security: improve audit, crypto, and config handling
- Enhance audit checkpoint system
- Update KMS provider and tenant key management
- Refine configuration constants
- Improve TUI config handling
2026-03-04 13:23:42 -05:00
Jeremie Fraeys
a4f2c36069
feat: enhance task domain and scheduler protocol
- Update task domain model
- Improve scheduler hub and priority queue
- Enhance protocol definitions
- Update manifest schema and run handling
2026-03-04 13:23:38 -05:00
Jeremie Fraeys
1f495dfbb7
api: regenerate OpenAPI types and server code
- Update openapi.yaml spec
- Regenerate server_gen.go with oapi-codegen
- Update adapter, routes, and server configuration
2026-03-04 13:23:34 -05:00
Jeremie Fraeys
e1ec255ad2
refactor(crypto): integrate KMS with TenantKeyManager
Replace in-memory root keys with KMS interface:
- GenerateDataEncryptionKey: generate DEK, wrap via KMS, cache
- UnwrapDataEncryptionKey: cache check, KMS decrypt, cache store
- EncryptArtifact/DecryptArtifact: use DEK from KMS
- RotateTenantKey: create new KMS key, flush cache
- RevokeTenant: disable KMS key, schedule deletion per ADR-015

Remove deprecated methods: wrapKey, unwrapKey (replaced by KMS)
2026-03-03 19:14:27 -05:00
Jeremie Fraeys
7c03c8b5bd
feat(kms): add HashiCorp Vault and AWS KMS providers
Implement VaultProvider with Transit engine:
- AppRole, Kubernetes, and Token authentication
- Encrypt/Decrypt via /transit/encrypt and /transit/decrypt
- Key lifecycle via /transit/keys API
- Health check via /sys/health

Implement AWSProvider with SDK v2:
- Per-region key naming with alias prefix
- Encrypt/Decrypt via KMS SDK
- Key lifecycle (CreateKey, Disable, ScheduleDeletion, Enable)
- AWS endpoint support for LocalStack testing
2026-03-03 19:14:21 -05:00
Jeremie Fraeys
cb25677695
feat(kms): implement core KMS infrastructure with DEK cache
Add KMSProvider interface for external key management systems:
- Encrypt/Decrypt operations for DEK wrapping
- Key lifecycle management (Create, Disable, ScheduleDeletion, Enable)
- HealthCheck and Close methods

Implement MemoryProvider for development/testing:
- XOR encryption with HMAC-SHA256 authentication
- Secure random key generation using crypto/rand
- MAC verification to detect wrong keys

Implement DEKCache per ADR-012:
- 15-minute TTL with configurable grace window (1 hour)
- LRU eviction with 1000 entry limit
- Cache key includes (tenantID, artifactID, kmsKeyID) for isolation
- Thread-safe operations with RWMutex
- Secure memory wiping on eviction/cleanup

Add config package with types:
- ProviderType enum (vault, aws, memory)
- VaultConfig with AppRole/Kubernetes/Token auth
- AWSConfig with region and alias prefix
- CacheConfig with TTL, MaxEntries, GraceWindow
- Validation methods for all config types
2026-03-03 19:13:55 -05:00
Jeremie Fraeys
da104367d6
feat: add Plugin GPU Quota implementation and tests
- Add plugin_quota.go with GPU quota management for scheduler
- Update scheduler hub and protocol for plugin support
- Add comprehensive plugin quota unit tests
- Update gang service and WebSocket queue integration tests
2026-02-26 14:35:05 -05:00
Jeremie Fraeys
8f2495deb0
chore(cleanup): remove obsolete files and update .gitignore
Remove deprecated components replaced by new scheduler:
- Delete internal/controller/pacing_controller.go (replaced by scheduler/pacing.go)
- Delete internal/manifest/schema_test.go (consolidated into tests/unit/)
- Delete internal/workertest/worker.go (consolidated into tests/fixtures/)
- Update .gitignore with scheduler binary and new patterns
2026-02-26 12:09:18 -05:00
Jeremie Fraeys
4cdb68907e
refactor(utilities): update supporting modules for scheduler integration
Update utility modules:
- File utilities with secure file operations
- Environment pool with resource tracking
- Error types with scheduler error categories
- Logging with audit context support
- Network/SSH with connection pooling
- Privacy/PII handling with tenant boundaries
- Resource manager with scheduler allocation
- Security monitor with audit integration
- Tracking plugins (MLflow, TensorBoard) with auth
- Crypto signing with tenant keys
- Database init with multi-user support
2026-02-26 12:07:15 -05:00