fetch_ml

Author	SHA1	Message	Date
Jeremie Fraeys	f827ee522a	test(tracking/plugins): add PodmanInterface and comprehensive plugin tests for 91% coverage Refactor plugins to use interface for testability: - Add PodmanInterface to container package (StartContainer, StopContainer, RemoveContainer) - Update MLflow plugin to use container.PodmanInterface - Update TensorBoard plugin to use container.PodmanInterface - Add comprehensive mocked tests for all three plugins (wandb, mlflow, tensorboard) - Coverage increased from 18% to 91.4%	2026-03-14 16:59:16 -04:00
Jeremie Fraeys	4b8adeacdc	test(crypto): add getter tests and property-based round-trip tests Add tests for: - GetPublicKey: returns correct public key - GetKeyID: returns correct key ID - Property-based round-trip: Sign -> Verify for various message types (empty, single char, unicode, large messages) Coverage: GetPublicKey 100%, GetKeyID 100%	2026-03-13 23:41:01 -04:00
Jeremie Fraeys	04175a97ee	test(tracking): add comprehensive plugin registry tests Add tests for tracking package main exports: - NewRegistry, Register, Get plugin management - ProvisionAll with empty, disabled, unregistered configs - TeardownAll lifecycle management - NewPortAllocator, Allocate, Release with proper port reuse - StringSetting, ToolMode constants, ToolConfig structure Coverage: 84.7%	2026-03-13 23:35:10 -04:00
Jeremie Fraeys	f74f3fa730	test(security): fix race conditions in monitor tests Use atomic operations for shared variables: - alertCount: atomic.Int32 for concurrent access - lastAlert: atomic.Value for alert storage Fixes data races detected by -race flag.	2026-03-13 23:31:22 -04:00
Jeremie Fraeys	4da027868d	fix(storage): handle NULL values and state tracking in database operations Fixes to support proper test coverage: - db_jobs.go: UpdateJobStatus now checks RowsAffected and returns error for nonexistent jobs instead of silently succeeding - db_audit.go: GetOldestAuditLogDate uses sql.NullString to parse SQLite datetime strings in YYYY-MM-DD HH:MM:SS format with RFC3339 fallback - db_experiments.go: ListTasksForExperiment uses sql.NullString for nullable worker_id and error fields to prevent scan errors - db_connect.go: DB struct adds isClosed state tracking with mutex; Close() now returns error on double close to match test expectations	2026-03-13 23:27:35 -04:00
Jeremie Fraeys	9b8d8e5281	test(tracking): add factory plugin loader tests Add tests for: - NewPluginLoader: factory creation - RegisterFactory: custom factory registration - LoadPluginsEmpty: empty plugin handling - LoadPluginsDisabled: skip disabled plugins - LoadPluginsUnknown: unknown plugin handling - PluginConfigStructure: config field validation - LoadPluginsMLflow, TensorBoard, Wandb: plugin type support Coverage: 79.2%	2026-03-13 23:26:52 -04:00
Jeremie Fraeys	5d39dff6a0	test(store): extend store coverage with edge cases and concurrency Add tests for: - Close: proper resource cleanup - ConcurrentLogMetrics: thread-safe metric logging - GetRunMetricsEmpty: empty result handling - GetRunParamsEmpty: empty result handling - MarkRunSyncedNonexistent: graceful handling of missing runs Coverage: 75.3%	2026-03-13 23:26:41 -04:00
Jeremie Fraeys	50b6506243	test(storage): add comprehensive storage layer tests Add tests for: - dataset: Redis dataset operations, transfer tracking - db_audit: audit logging with hash chain, access tracking - db_experiments: experiment metadata, dataset associations - db_tasks: task listing with pagination for users and groups - db_jobs: job CRUD, state transitions, worker assignment Coverage: storage package ~40%+	2026-03-13 23:26:33 -04:00
Jeremie Fraeys	5057f02167	test(crypto,security): add tenant key manager and anomaly monitor tests Add comprehensive tests for: - crypto/tenant_keys: KMS integration, key rotation, encryption/decryption - security/monitor: sliding window, anomaly detection, concurrent access Coverage: crypto 65.1%, security 100%	2026-03-13 23:26:22 -04:00
Jeremie Fraeys	77542b7068	refactor: update API plugins version retrieval Refactor getPluginVersion to accept PluginConfig parameter: - Change signature from getPluginVersion(pluginName) to getPluginVersion(pluginName, cfg) - Update all call sites to pass config - Add TODO comment for future implementation querying actual plugin binary/container Update plugin handlers to use dynamic version retrieval: - GetV1Plugins: Use h.getPluginVersion(name, cfg) instead of hardcoded "1.0.0" - PutV1PluginsPluginNameConfig: Pass newConfig to version retrieval - GetV1PluginsPluginNameHealth: Use actual version from config This prepares the API for dynamic version reporting while maintaining backward compatibility with the current placeholder implementation.	2026-03-12 16:40:39 -04:00
Jeremie Fraeys	96dd604789	feat: implement WebSocket binary protocol and NOT_IMPLEMENTED error code Add CodeNotImplemented error constant (HTTP 501) for planned but unavailable features. Refactor WebSocket packet handling from JSON to binary protocol for improved efficiency: New packet structure: - PacketTypeSuccess (0x00): [type:1][json_data:var] - PacketTypeError (0x01): [type:1][code_len:1][code:var][msg_len:2][msg:var][details_len:2][details:var] - PacketTypeData (0x02): Reserved for future use Update SendErrorPacket: - Build binary error packets with length-prefixed fields - Use WriteMessage with websocket.BinaryMessage Update SendSuccessPacket: - Marshal data to JSON then wrap in binary packet - Eliminates "success" wrapper field for cleaner protocol Add helper functions: - NewNotImplemented(feature) - Standard 501 error - NewNotImplementedWithIssue(feature, issueURL) - 501 with GitHub reference	2026-03-12 16:40:23 -04:00
Jeremie Fraeys	d0266c4a90	refactor: scheduler hub bug fix, test helpers, and orphan recovery tests Fix bug in scheduler hub orphan reconciliation: - Move delete(h.pendingAcceptance, taskID) inside the requeue success block - Prevents premature cleanup when requeue fails Add comprehensive test infrastructure: - hub_test_helpers.go: New test helper utilities (78 lines) - Mock scheduler components for isolated testing - Test fixture setup and teardown helpers Refactor and enhance hub capabilities tests: - Significant restructuring of hub_capabilities_test.go (213 lines changed) - Improved test coverage for worker capability matching Add comprehensive orphan recovery tests: - internal/scheduler/orphan_recovery_test.go (451 lines) - Tests orphaned job detection and recovery - Covers requeue logic, timeout handling, state cleanup	2026-03-12 16:38:33 -04:00
Jeremie Fraeys	939faeb8e4	refactor: relocate store package from cmd/tui/internal to internal Move store package to improve reusability and follow Go project conventions: - cmd/tui/internal/store/store.go -> internal/store/store.go - cmd/tui/internal/store/store_test.go -> internal/store/store_test.go This makes the store package available to other components beyond the TUI, reducing coupling and enabling future reuse by API server, CLI, or other tools.	2026-03-12 16:38:01 -04:00
Jeremie Fraeys	61660dc925	refactor: co-locate security, storage, telemetry, tracking, worker tests Move unit tests from tests/unit/ to internal/ following Go conventions: Security tests: - tests/unit/security/* -> internal/security/* (audit, config_integrity, filetype, gpu_audit, hipaa_validation, manifest_filename, path_traversal, resource_quota, secrets) Storage tests: - tests/unit/storage/* -> internal/storage/* (db, experiment_metadata) Telemetry tests: - tests/unit/telemetry/* -> internal/telemetry/* (telemetry) Tracking tests: - tests/unit/reproducibility/* -> internal/tracking/* (config_hash, environment_capture) Worker tests: - tests/unit/worker/* -> internal/worker/* (artifacts, config, hash_bench, plugins/jupyter_task, plugins/vllm, prewarm_v1, run_manifest_execution, snapshot_stage, snapshot_store, worker) Update import paths in test files to reflect new locations.	2026-03-12 16:37:03 -04:00
Jeremie Fraeys	74e06017b5	refactor: co-locate scheduler non-hub tests with source code Move unit tests from tests/unit/scheduler/ to internal/scheduler/ following Go conventions: - capability_routing_test.go - Worker capability-based job routing tests - failure_scenarios_test.go - Scheduler failure handling and recovery tests - heartbeat_test.go - Worker heartbeat monitoring tests - plugin_quota_test.go - Plugin resource quota enforcement tests - port_allocator_test.go - Dynamic port allocation for services tests - priority_queue_test.go - Job priority queue implementation tests - service_templates_test.go - Service template management tests - state_store_test.go - Scheduler state persistence tests Note: orphan_recovery_test.go excluded from this commit - will be handled with hub refactoring due to significant test changes.	2026-03-12 16:36:29 -04:00
Jeremie Fraeys	ee0b90cfc5	refactor: co-locate queue and resources tests, add manager tests Move unit tests from tests/unit/ to internal/ following Go conventions: - tests/unit/queue/* -> internal/queue/* (dedup, filesystem_fallback, queue_permissions, queue_spec, queue, sqlite_queue tests) - tests/unit/gpu/* -> internal/resources/* (gpu_detector, gpu_golden tests) - tests/unit/resources/* -> internal/resources/* (manager_test.go) Update import paths in test files to reflect new locations. Note: GPU tests consolidated into resources package since GPU detection is part of resource management. Manager tests show significant new test coverage (166 lines).	2026-03-12 16:36:02 -04:00
Jeremie Fraeys	ca6ad970c3	refactor: co-locate logging, manifest, network, privacy, prommetrics tests Move unit tests from tests/unit/ to internal/ following Go conventions: - tests/unit/logging/* -> internal/logging/* (logging tests) - tests/unit/manifest/* -> internal/manifest/* (run_manifest, schema tests) - tests/unit/network/* -> internal/network/* (retry, ssh_pool, ssh tests) - tests/unit/privacy/* -> internal/privacy/* (pii tests) - tests/unit/metrics/* -> internal/prommetrics/* (metrics tests) Update import paths in test files to reflect new locations. Note: metrics_test.go moved from tests/unit/metrics/ to internal/prommetrics/ to match the actual package name.	2026-03-12 16:35:37 -04:00
Jeremie Fraeys	cf84246115	refactor: co-locate config, container, envpool, errors, experiment, jupyter tests Move unit tests from tests/unit/ to internal/ following Go conventions: - tests/unit/config/* -> internal/config/* (constants, mode_paths, paths, validation) - tests/unit/container/* -> internal/container/* (podman, security tests) - tests/unit/envpool/* -> internal/envpool/* (envpool tests) - tests/unit/errors/* -> internal/errtypes/* (errors_test.go moved to errtypes package) - tests/unit/experiment/* -> internal/experiment/* (manager tests) - tests/unit/jupyter/* -> internal/jupyter/* (config, package_blacklist, service_manager, trash_restore) Update import paths in test files to reflect new locations. Note: errors_test.go moved from tests/unit/errors/ to internal/errtypes/ to match the package structure.	2026-03-12 16:35:15 -04:00
Jeremie Fraeys	a4e2ecdbe6	refactor: co-locate api, audit, auth tests with source code Move unit tests from tests/unit/ to internal/ following Go conventions: - tests/unit/api/* -> internal/api/* (WebSocket handlers, helpers, duplicate detection) - tests/unit/audit/* -> internal/audit/* (alert, sealed, verifier tests) - tests/unit/auth/* -> internal/auth/* (API key, keychain, user manager) - tests/unit/crypto/kms/* -> internal/auth/kms/* (cache, protocol tests) Update import paths in test files to reflect new locations. Benefits: - Tests live alongside the code they test - Easier navigation and maintenance - Clearer package boundaries - Follows standard Go project layout	2026-03-12 16:34:54 -04:00
Jeremie Fraeys	ca913e8878	feat(scheduler): add test mode config and TLS detection - Add DisableTLSForTesting to HubConfig for test environments - Add IsUsingTLS() method to detect scheduler TLS status - Update MockWorker to auto-select ws:// vs wss:// protocol - Set DisableTLSForTesting: true in DefaultHubConfig	2026-03-12 14:05:35 -04:00
Jeremie Fraeys	2b1ef10514	test(chaos): add worker disconnect chaos test and queue improvements Chaos testing: - Add worker_disconnect_chaos_test.go for network partition resilience - Test scheduler hub recovery and job reassignment scenarios Queue layer updates: - event_store.go: add event sourcing for queue operations - native_queue.go: extend native queue with batch operations and indexing	2026-03-12 12:08:21 -04:00
Jeremie Fraeys	17170667e2	feat(worker): improve lifecycle management and vLLM plugin Lifecycle improvements: - runloop.go: refined state machine with better error recovery - service_manager.go: service dependency management and health checks - states.go: add states for capability advertisement and draining Container execution: - container.go: improved OCI runtime integration with supply chain checks - Add image verification and signature validation - Better resource limits enforcement for GPU/memory vLLM plugin updates: - vllm.go: support for vLLM 0.3+ with new engine arguments - Add quantization-aware scheduling (AWQ, GPTQ, FP8) - Improve model download and caching logic Configuration: - config.go: add capability advertisement configuration - snapshot_store.go: improve snapshot management for checkpointing	2026-03-12 12:05:02 -04:00
Jeremie Fraeys	c18a8619fe	feat(api): add structured error package and refactor handlers New error handling: - Add internal/api/errors/errors.go with structured API error types - Standardize error codes across all API endpoints - Add user-facing error messages vs internal error details separation Handler improvements: - jupyter/handlers.go: better workspace lifecycle and error handling - plugins/handlers.go: plugin management with validation - groups/handlers.go: group CRUD with capability metadata - jobs/handlers.go: job submission and monitoring improvements - datasets/handlers.go: dataset upload/download with progress - validate/handlers.go: manifest validation with detailed errors - audit/handlers.go: audit log querying with filters Server configuration: - server_config.go: refined config loading with validation - server_gen.go: improved code generation for OpenAPI specs	2026-03-12 12:04:46 -04:00
Jeremie Fraeys	37c4d4e9c7	feat(crypto,auth): harden KMS and improve permission handling KMS improvements: - cache.go: add LRU eviction with memory-bounded caches - provider.go: refactor provider initialization and key rotation - tenant_keys.go: per-tenant key isolation with envelope encryption Auth layer updates: - hybrid.go: refine hybrid auth flow for API key + JWT - permissions_loader.go: faster permission caching with hot-reload - validator.go: stricter validation with detailed error messages Security middleware: - security.go: add rate limiting headers and CORS refinement Testing and benchmarks: - Add KMS cache and protocol unit tests - Add KMS benchmark tests for encryption throughput - Update KMS integration tests for tenant isolation	2026-03-12 12:04:32 -04:00
Jeremie Fraeys	de83300962	feat(worker): refactor GPU detection with macOS Metal support GPU detection refactor: - Major rewrite of gpu_detector.go with unified detection interface - Support for NVIDIA (NVML), AMD (ROCm), and Apple Metal - Runtime GPU capability querying for scheduler matching macOS improvements: - gpu_macos.go: native Metal device enumeration and memory queries - Support for Apple Silicon (M1/M2/M3) unified memory reporting - Fallback to system profiler for Intel Macs Testing infrastructure: - Add gpu_detector_mock.go for testing without hardware - Update gpu_golden_test.go with platform-specific expectations - Cross-platform GPU info validation	2026-03-12 12:02:41 -04:00
Jeremie Fraeys	188cf55939	refactor(api): overhaul WebSocket handler and protocol layer Major WebSocket handler refactor: - Rewrite ws/handler.go with structured message routing and backpressure - Add connection lifecycle management with heartbeats and timeouts - Implement graceful connection draining for zero-downtime restarts Protocol improvements: - Define structured protocol types in protocol.go for hub communication - Add versioned message envelopes for backward compatibility - Standardize error codes and response formats across WebSocket API Job streaming via WebSocket: - Simplify ws/jobs.go with async job status streaming - Add compression for high-volume job updates Testing: - Update websocket_e2e_test.go for new protocol semantics - Add connection resilience tests	2026-03-12 12:01:21 -04:00
Jeremie Fraeys	57787e1e7b	feat(scheduler): implement capability-based routing and hub v2 Add comprehensive capability routing system to scheduler hub: - Capability-aware worker matching with requirement/offer negotiation - Hub v2 protocol with structured message types and heartbeat management - Worker capability advertisement and dynamic routing decisions - Orphan recovery for disconnected workers with state reconciliation - Template-based job scheduling with capability constraints Add extensive test coverage: - Unit tests for capability routing logic and heartbeat mechanics - Unit tests for orphan recovery scenarios - E2E tests for capability routing across multiple workers - Hub capabilities integration tests - Scheduler fixture helpers for test setup Protocol improvements: - Define structured protocol messages for hub-worker communication - Add capability matching algorithm with scoring - Implement graceful worker disconnection handling	2026-03-12 12:00:05 -04:00
Jeremie Fraeys	13ffb81cab	fix: add CGO build tags to consistency tests, remove unused isHex function	2026-03-08 13:10:00 -04:00
Jeremie Fraeys	c74e91dd69	test: update test suite and remove deprecated privacy middleware Test improvements: - fixtures/: Updated mocks, fixtures with group context, SSH server, TUI driver - integration/: WebSocket queue and handler tests with groups - e2e/: WebSocket and TLS proxy end-to-end tests - unit/api/ws_test.go: WebSocket API tests - unit/scheduler/service_templates_test.go: Service template tests - benchmarks/scheduler_bench_test.go: Performance benchmarks Cleanup: - Remove privacy middleware (replaced by audit system) - Remove privacy_test.go	2026-03-08 13:03:55 -04:00
Jeremie Fraeys	4b2782f674	feat(domain): add task visibility and supporting infrastructure Core domain and utility updates: - domain/task.go: Task model with visibility system * Visibility enum: private, lab, institution, open * Group associations for lab-scoped access * CreatedBy tracking for ownership * Sharing metadata with expiry - config/paths.go: Group-scoped data directories and audit log paths - crypto/signing.go: Key management for audit sealing, token signature verification - container/supply_chain.go: Image provenance tracking, vulnerability scanning - fileutil/filetype.go: MIME type detection and security validation - fileutil/secure.go: Protected file permissions, secure deletion - jupyter/: Package and service manager updates - experiment/manager.go: Visibility cascade from experiments to tasks - network/ssh.go: SSH tunneling improvements - queue/: Filesystem queue enhancements	2026-03-08 13:03:27 -04:00
Jeremie Fraeys	0b5e99f720	refactor(scheduler,worker): improve service management and GPU detection Scheduler enhancements: - auth.go: Group membership validation in authentication - hub.go: Task distribution with group affinity - port_allocator.go: Dynamic port allocation with conflict resolution - scheduler_conn.go: Connection pooling and retry logic - service_manager.go: Lifecycle management for scheduler services - service_templates.go: Template-based service configuration - state.go: Persistent state management with recovery Worker improvements: - config.go: Extended configuration for task visibility rules - execution/setup.go: Sandboxed execution environment setup - executor/container.go: Container runtime integration - executor/runner.go: Task runner with visibility enforcement - gpu_detector.go: Robust GPU detection (NVIDIA, AMD, Apple Silicon, CPU fallback) - integrity/validate.go: Data integrity validation - lifecycle/runloop.go: Improved runloop with graceful shutdown - lifecycle/service_manager.go: Service lifecycle coordination - process/isolation.go + isolation_unix.go: Process isolation with namespaces/cgroups - tenant/manager.go: Multi-tenant resource isolation - tenant/middleware.go: Tenant context propagation - worker.go: Core worker with group-scoped task execution	2026-03-08 13:03:15 -04:00
Jeremie Fraeys	1c7205c0a0	feat(audit): add HTTP audit middleware and tamper-evident logging Comprehensive audit system for security and compliance: - middleware/audit.go: HTTP request/response auditing middleware * Captures request details, user identity, response status * Chains audit events with cryptographic hashes for tamper detection * Configurable filtering for sensitive data redaction - audit/chain.go: Blockchain-style audit log chaining * Each entry includes hash of previous entry * Tamper detection through hash verification * Supports incremental verification without full scan - checkpoint.go: Periodic integrity checkpoints * Creates signed checkpoints for fast verification * Configurable checkpoint intervals * Recovery from last known good checkpoint - rotation.go: Automatic log rotation and archival * Size-based and time-based rotation policies * Compressed archival with integrity seals * Retention policy enforcement - sealed.go: Cryptographic sealing of audit logs * Digital signatures for log integrity * HSM support preparation * Exportable sealed bundles for external auditors - verifier.go: Log verification and forensic analysis * Complete chain verification from genesis to latest * Detects gaps, tampering, unauthorized modifications * Forensic export for incident response	2026-03-08 13:03:02 -04:00
Jeremie Fraeys	7e5ceec069	feat(api): add groups and tokens handlers, refactor routes Add new API endpoints and clean up handler interfaces: - groups/handlers.go: New lab group management API * CRUD operations for lab groups * Member management with role assignment (admin/member/viewer) * Group listing and membership queries - tokens/handlers.go: Token generation and validation endpoints * Create access tokens for public task sharing * Validate tokens for secure access * Token revocation and cleanup - routes.go: Refactor handler registration * Integrate groups handler into WebSocket routes * Remove nil parameters from all handler constructors * Cleaner dependency injection pattern - Handler interface cleanup across all modules: * jobs/handlers.go: Remove unused nil privacyEnforcer parameter * jupyter/handlers.go: Streamline initialization * scheduler/handlers.go: Consistent constructor signature * ws/handler.go: Add groups handler to dependencies	2026-03-08 12:51:25 -04:00
Jeremie Fraeys	c52179dcbe	feat(auth): add token-based access and structured logging Add comprehensive authentication and authorization enhancements: - tokens.go: New token management system for public task access and cloning * SHA-256 hashed token storage for security * Token generation, validation, and automatic cleanup * Support for public access and clone permissions - api_key.go: Extend User struct with Groups field * Lab group membership (ml-lab, nlp-group) * Integration with permission system for group-based access - flags.go: Security hardening - migrate to structured logging * Replace log.Printf with log/slog to prevent log injection attacks * Consistent structured output for all auth warnings * Safe handling of file paths and errors in logs - permissions.go: Add task sharing permission constants * PermissionTasksReadOwn: Access own tasks * PermissionTasksReadLab: Access lab group tasks * PermissionTasksReadAll: Admin/institution-wide access * PermissionTasksShare: Grant access to other users * PermissionTasksClone: Create copies of shared tasks * CanAccessTask() method with visibility checks - database.go: Improve error handling * Add structured error logging on row close failures	2026-03-08 12:51:07 -04:00
Jeremie Fraeys	fbcf4d38e5	feat(storage): add groups, tasks, tokens, and audit database schemas Add comprehensive database storage layer for new features: - db_groups.go: Lab group management with members, roles (admin/member/viewer), and group-based task visibility queries - db_tasks.go: Task visibility system (private/lab/institution/open), task sharing with expiry, public clone tokens, and optimized ListTasksForUser() for access control - db_tokens.go: Secure token management for public task access and cloning, with SHA-256 hashed token storage and automatic cleanup - db_audit.go: Audit log persistence with checkpoint chains, tamper detection, and log rotation support - schema_sqlite.sql: Updated schema with: - groups, group_members tables - tasks.visibility enum, task_shares with expiry - access_tokens table with hashed tokens - audit_logs, audit_checkpoints tables - indexes for all foreign keys and query patterns - db_experiments.go: Add CascadeVisibilityToTasks() for propagating visibility changes from experiments to associated tasks	2026-03-08 12:48:42 -04:00
Jeremie Fraeys	ba9a358412	fix(scheduler): resolve TestEndToEndJobLifecycle race and getTask bug ## Problem TestEndToEndJobLifecycle was failing with two issues: 1. Race condition: Workers signaled ready before job was processed, receiving MsgNoWork instead of MsgJobAssign 2. getTask() didn't check pendingAcceptance - assigned-but-not-yet-accepted tasks returned nil ## Changes ### Test Fix (restart_recovery_test.go) - Replace single-shot select with retry loop that re-signals workers as ready - Handle both assignment and non-assignment messages correctly - Add 10ms delay between non-assignment messages to allow job processing - Use 2-second deadline with 100ms timeout intervals ### Scheduler Fix (hub.go) - Extend getTask() to check pendingAcceptance map after batch/service queues - Allows GetTask() to find tasks in 'assigned' state before acceptance - Maintains backward compatibility with existing queue/running lookups ## Testing make test now passes: 475 passed, 0 failed, 34 skipped	2026-03-05 14:40:43 -05:00
Jeremie Fraeys	c6a224d5fc	feat(cli,server): unify info command with remote/local support Enhance ml info to query server when connected, falling back to local manifests when offline. Unifies behavior with other commands like run, exec, and cancel. CLI changes: - Add --local and --remote flags for explicit control - Auto-detect connection state via mode.detect() - queryRemoteRun(): Query server via WebSocket for run details - queryLocalRun(): Read local run_manifest.json - displayRunInfo(): Shared display logic for both sources - Add connection status indicators (Remote: connecting.../connected) WebSocket protocol: - Add query_run_info opcode (0x28) to cli and server - Add sendQueryRunInfo() method to ws/client.zig - Protocol: [opcode:1][api_key_hash:16][run_id_len:1][run_id:var] Server changes: - Add handleQueryRunInfo() handler to ws/handler.go - Returns run_id, job_name, user, timestamp, overall_sha, files_count - Checks PermJobsRead permission - Looks up run in experiment manager Usage: ml info abc123 # Auto: tries remote, falls back to local ml info abc123 --local # Force local manifest lookup ml info abc123 --remote # Force remote query (fails if offline)	2026-03-05 12:07:00 -05:00
Jeremie Fraeys	747579eae4	refactor: misc improvements across codebase Various improvements: - Makefile: build optimizations and native lib integration - prune.zig: cleanup logic refinements - status.zig: improved status reporting - experiment_core.zig: core functionality updates - progress.zig: progress bar improvements - task.go: domain model updates for task handling All tests pass.	2026-03-05 10:58:22 -05:00
Jeremie Fraeys	08ab628546	refactor(scheduler): remove dead code Remove three unused methods/parameter identified by static analysis: - canRequeue(): never integrated into scheduling flow - runMetricsClient clientID param: accepted but never used - getUsageLocked(): callers inline the logic Fixes IDE warnings about unused code per AGENTS.md cleanup discipline.	2026-03-04 13:35:18 -05:00
Jeremie Fraeys	7cd86fb88a	feat: add new API handlers, build scripts, and ADRs - Introduce audit, plugin, and scheduler API handlers - Add spec_embed.go for OpenAPI spec embedding - Create modular build scripts (cli, go, native, cross-platform) - Add deployment cleanup and health-check utilities - New ADRs: hot reload, audit store, SSE updates, RBAC, caching, offline mode, KMS regions, tenant offboarding - Add KMS configuration schema and worker variants - Include KMS benchmark tests	2026-03-04 13:24:27 -05:00
Jeremie Fraeys	61081655d2	feat: enhance worker execution and scheduler service templates - Refactor worker configuration management - Improve container executor lifecycle handling - Update runloop and worker core logic - Enhance scheduler service template generation - Remove obsolete 'scheduler' symlink/directory	2026-03-04 13:24:20 -05:00
Jeremie Fraeys	66f262d788	security: improve audit, crypto, and config handling - Enhance audit checkpoint system - Update KMS provider and tenant key management - Refine configuration constants - Improve TUI config handling	2026-03-04 13:23:42 -05:00
Jeremie Fraeys	a4f2c36069	feat: enhance task domain and scheduler protocol - Update task domain model - Improve scheduler hub and priority queue - Enhance protocol definitions - Update manifest schema and run handling	2026-03-04 13:23:38 -05:00
Jeremie Fraeys	1f495dfbb7	api: regenerate OpenAPI types and server code - Update openapi.yaml spec - Regenerate server_gen.go with oapi-codegen - Update adapter, routes, and server configuration	2026-03-04 13:23:34 -05:00
Jeremie Fraeys	e1ec255ad2	refactor(crypto): integrate KMS with TenantKeyManager Replace in-memory root keys with KMS interface: - GenerateDataEncryptionKey: generate DEK, wrap via KMS, cache - UnwrapDataEncryptionKey: cache check, KMS decrypt, cache store - EncryptArtifact/DecryptArtifact: use DEK from KMS - RotateTenantKey: create new KMS key, flush cache - RevokeTenant: disable KMS key, schedule deletion per ADR-015 Remove deprecated methods: wrapKey, unwrapKey (replaced by KMS)	2026-03-03 19:14:27 -05:00
Jeremie Fraeys	7c03c8b5bd	feat(kms): add HashiCorp Vault and AWS KMS providers Implement VaultProvider with Transit engine: - AppRole, Kubernetes, and Token authentication - Encrypt/Decrypt via /transit/encrypt and /transit/decrypt - Key lifecycle via /transit/keys API - Health check via /sys/health Implement AWSProvider with SDK v2: - Per-region key naming with alias prefix - Encrypt/Decrypt via KMS SDK - Key lifecycle (CreateKey, Disable, ScheduleDeletion, Enable) - AWS endpoint support for LocalStack testing	2026-03-03 19:14:21 -05:00
Jeremie Fraeys	cb25677695	feat(kms): implement core KMS infrastructure with DEK cache Add KMSProvider interface for external key management systems: - Encrypt/Decrypt operations for DEK wrapping - Key lifecycle management (Create, Disable, ScheduleDeletion, Enable) - HealthCheck and Close methods Implement MemoryProvider for development/testing: - XOR encryption with HMAC-SHA256 authentication - Secure random key generation using crypto/rand - MAC verification to detect wrong keys Implement DEKCache per ADR-012: - 15-minute TTL with configurable grace window (1 hour) - LRU eviction with 1000 entry limit - Cache key includes (tenantID, artifactID, kmsKeyID) for isolation - Thread-safe operations with RWMutex - Secure memory wiping on eviction/cleanup Add config package with types: - ProviderType enum (vault, aws, memory) - VaultConfig with AppRole/Kubernetes/Token auth - AWSConfig with region and alias prefix - CacheConfig with TTL, MaxEntries, GraceWindow - Validation methods for all config types	2026-03-03 19:13:55 -05:00
Jeremie Fraeys	da104367d6	feat: add Plugin GPU Quota implementation and tests - Add plugin_quota.go with GPU quota management for scheduler - Update scheduler hub and protocol for plugin support - Add comprehensive plugin quota unit tests - Update gang service and WebSocket queue integration tests	2026-02-26 14:35:05 -05:00
Jeremie Fraeys	8f2495deb0	chore(cleanup): remove obsolete files and update .gitignore Remove deprecated components replaced by new scheduler: - Delete internal/controller/pacing_controller.go (replaced by scheduler/pacing.go) - Delete internal/manifest/schema_test.go (consolidated into tests/unit/) - Delete internal/workertest/worker.go (consolidated into tests/fixtures/) - Update .gitignore with scheduler binary and new patterns	2026-02-26 12:09:18 -05:00
Jeremie Fraeys	4cdb68907e	refactor(utilities): update supporting modules for scheduler integration Update utility modules: - File utilities with secure file operations - Environment pool with resource tracking - Error types with scheduler error categories - Logging with audit context support - Network/SSH with connection pooling - Privacy/PII handling with tenant boundaries - Resource manager with scheduler allocation - Security monitor with audit integration - Tracking plugins (MLflow, TensorBoard) with auth - Crypto signing with tenant keys - Database init with multi-user support	2026-02-26 12:07:15 -05:00
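
Note on commit cb25677695: the DEK cache it describes (TTL expiry, LRU eviction with an entry limit, a composite cache key of tenantID, artifactID, and kmsKeyID, and wiping of key material on eviction) can be illustrated with a minimal Go sketch. This is not the repository's code. The names DEKCache, NewDEKCache, Put, Get, and wipe are hypothetical, the grace window is omitted, and a plain sync.Mutex stands in for the RWMutex mentioned in the commit because every lookup here also updates the LRU order.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
	"time"
)

// cacheKey mirrors the (tenantID, artifactID, kmsKeyID) tuple named in the
// commit message, so entries from different tenants never collide.
type cacheKey struct {
	tenantID, artifactID, kmsKeyID string
}

type entry struct {
	key     cacheKey
	dek     []byte
	expires time.Time
}

// DEKCache is a hypothetical, simplified stand-in for the cache in the commit.
type DEKCache struct {
	mu         sync.Mutex // assumption: plain Mutex, since Get reorders the LRU list
	ttl        time.Duration
	maxEntries int
	order      *list.List // front = most recently used
	items      map[cacheKey]*list.Element
}

func NewDEKCache(maxEntries int, ttl time.Duration) *DEKCache {
	return &DEKCache{
		ttl:        ttl,
		maxEntries: maxEntries,
		order:      list.New(),
		items:      make(map[cacheKey]*list.Element),
	}
}

// wipe zeroes key material. This is best effort only: Go does not guarantee
// that no other copies of the bytes exist in memory.
func wipe(b []byte) {
	for i := range b {
		b[i] = 0
	}
}

// removeLocked wipes and drops one entry; the caller must hold c.mu.
func (c *DEKCache) removeLocked(el *list.Element) {
	e := el.Value.(*entry)
	wipe(e.dek)
	delete(c.items, e.key)
	c.order.Remove(el)
}

// Put stores a private copy of dek and evicts least recently used entries
// once the cache holds more than maxEntries.
func (c *DEKCache) Put(tenantID, artifactID, kmsKeyID string, dek []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := cacheKey{tenantID, artifactID, kmsKeyID}
	if el, ok := c.items[k]; ok {
		c.removeLocked(el)
	}
	stored := append([]byte(nil), dek...)
	c.items[k] = c.order.PushFront(&entry{key: k, dek: stored, expires: time.Now().Add(c.ttl)})
	for c.order.Len() > c.maxEntries {
		c.removeLocked(c.order.Back())
	}
}

// Get returns a copy of the cached DEK, or false if it is absent or expired.
func (c *DEKCache) Get(tenantID, artifactID, kmsKeyID string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[cacheKey{tenantID, artifactID, kmsKeyID}]
	if !ok {
		return nil, false
	}
	e := el.Value.(*entry)
	if time.Now().After(e.expires) {
		c.removeLocked(el)
		return nil, false
	}
	c.order.MoveToFront(el)
	return append([]byte(nil), e.dek...), true
}

func main() {
	// 1000 entries and a 15-minute TTL are the values stated in the commit message.
	c := NewDEKCache(1000, 15*time.Minute)
	c.Put("tenant-a", "artifact-1", "kms-key-1", []byte("example-dek-bytes"))
	_, sameTenant := c.Get("tenant-a", "artifact-1", "kms-key-1")
	_, otherTenant := c.Get("tenant-b", "artifact-1", "kms-key-1")
	fmt.Println(sameTenant, otherTenant)
}
```

Returning copies from Get keeps a later wipe of the cached slice from zeroing a key that a caller is still using, at the cost of one more copy of the key material in memory.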

157 commits