fetch_ml

Author	SHA1	Message	Date
Jeremie Fraeys	2b1ef10514	test(chaos): add worker disconnect chaos test and queue improvements Chaos testing: - Add worker_disconnect_chaos_test.go for network partition resilience - Test scheduler hub recovery and job reassignment scenarios Queue layer updates: - event_store.go: add event sourcing for queue operations - native_queue.go: extend native queue with batch operations and indexing	2026-03-12 12:08:21 -04:00
Jeremie Fraeys	17170667e2	feat(worker): improve lifecycle management and vLLM plugin Lifecycle improvements: - runloop.go: refined state machine with better error recovery - service_manager.go: service dependency management and health checks - states.go: add states for capability advertisement and draining Container execution: - container.go: improved OCI runtime integration with supply chain checks - Add image verification and signature validation - Better resource limits enforcement for GPU/memory vLLM plugin updates: - vllm.go: support for vLLM 0.3+ with new engine arguments - Add quantization-aware scheduling (AWQ, GPTQ, FP8) - Improve model download and caching logic Configuration: - config.go: add capability advertisement configuration - snapshot_store.go: improve snapshot management for checkpointing	2026-03-12 12:05:02 -04:00
Jeremie Fraeys	c18a8619fe	feat(api): add structured error package and refactor handlers New error handling: - Add internal/api/errors/errors.go with structured API error types - Standardize error codes across all API endpoints - Add user-facing error messages vs internal error details separation Handler improvements: - jupyter/handlers.go: better workspace lifecycle and error handling - plugins/handlers.go: plugin management with validation - groups/handlers.go: group CRUD with capability metadata - jobs/handlers.go: job submission and monitoring improvements - datasets/handlers.go: dataset upload/download with progress - validate/handlers.go: manifest validation with detailed errors - audit/handlers.go: audit log querying with filters Server configuration: - server_config.go: refined config loading with validation - server_gen.go: improved code generation for OpenAPI specs	2026-03-12 12:04:46 -04:00
Jeremie Fraeys	37c4d4e9c7	feat(crypto,auth): harden KMS and improve permission handling KMS improvements: - cache.go: add LRU eviction with memory-bounded caches - provider.go: refactor provider initialization and key rotation - tenant_keys.go: per-tenant key isolation with envelope encryption Auth layer updates: - hybrid.go: refine hybrid auth flow for API key + JWT - permissions_loader.go: faster permission caching with hot-reload - validator.go: stricter validation with detailed error messages Security middleware: - security.go: add rate limiting headers and CORS refinement Testing and benchmarks: - Add KMS cache and protocol unit tests - Add KMS benchmark tests for encryption throughput - Update KMS integration tests for tenant isolation	2026-03-12 12:04:32 -04:00
Jeremie Fraeys	de83300962	feat(worker): refactor GPU detection with macOS Metal support GPU detection refactor: - Major rewrite of gpu_detector.go with unified detection interface - Support for NVIDIA (NVML), AMD (ROCm), and Apple Metal - Runtime GPU capability querying for scheduler matching macOS improvements: - gpu_macos.go: native Metal device enumeration and memory queries - Support for Apple Silicon (M1/M2/M3) unified memory reporting - Fallback to system profiler for Intel Macs Testing infrastructure: - Add gpu_detector_mock.go for testing without hardware - Update gpu_golden_test.go with platform-specific expectations - Cross-platform GPU info validation	2026-03-12 12:02:41 -04:00
Jeremie Fraeys	188cf55939	refactor(api): overhaul WebSocket handler and protocol layer Major WebSocket handler refactor: - Rewrite ws/handler.go with structured message routing and backpressure - Add connection lifecycle management with heartbeats and timeouts - Implement graceful connection draining for zero-downtime restarts Protocol improvements: - Define structured protocol types in protocol.go for hub communication - Add versioned message envelopes for backward compatibility - Standardize error codes and response formats across WebSocket API Job streaming via WebSocket: - Simplify ws/jobs.go with async job status streaming - Add compression for high-volume job updates Testing: - Update websocket_e2e_test.go for new protocol semantics - Add connection resilience tests	2026-03-12 12:01:21 -04:00
Jeremie Fraeys	57787e1e7b	feat(scheduler): implement capability-based routing and hub v2 Add comprehensive capability routing system to scheduler hub: - Capability-aware worker matching with requirement/offer negotiation - Hub v2 protocol with structured message types and heartbeat management - Worker capability advertisement and dynamic routing decisions - Orphan recovery for disconnected workers with state reconciliation - Template-based job scheduling with capability constraints Add extensive test coverage: - Unit tests for capability routing logic and heartbeat mechanics - Unit tests for orphan recovery scenarios - E2E tests for capability routing across multiple workers - Hub capabilities integration tests - Scheduler fixture helpers for test setup Protocol improvements: - Define structured protocol messages for hub-worker communication - Add capability matching algorithm with scoring - Implement graceful worker disconnection handling	2026-03-12 12:00:05 -04:00
Jeremie Fraeys	13ffb81cab	fix: add CGO build tags to consistency tests, remove unused isHex function	2026-03-08 13:10:00 -04:00
Jeremie Fraeys	c74e91dd69	test: update test suite and remove deprecated privacy middleware Test improvements: - fixtures/: Updated mocks, fixtures with group context, SSH server, TUI driver - integration/: WebSocket queue and handler tests with groups - e2e/: WebSocket and TLS proxy end-to-end tests - unit/api/ws_test.go: WebSocket API tests - unit/scheduler/service_templates_test.go: Service template tests - benchmarks/scheduler_bench_test.go: Performance benchmarks Cleanup: - Remove privacy middleware (replaced by audit system) - Remove privacy_test.go	2026-03-08 13:03:55 -04:00
Jeremie Fraeys	4b2782f674	feat(domain): add task visibility and supporting infrastructure Core domain and utility updates: - domain/task.go: Task model with visibility system * Visibility enum: private, lab, institution, open * Group associations for lab-scoped access * CreatedBy tracking for ownership * Sharing metadata with expiry - config/paths.go: Group-scoped data directories and audit log paths - crypto/signing.go: Key management for audit sealing, token signature verification - container/supply_chain.go: Image provenance tracking, vulnerability scanning - fileutil/filetype.go: MIME type detection and security validation - fileutil/secure.go: Protected file permissions, secure deletion - jupyter/: Package and service manager updates - experiment/manager.go: Visibility cascade from experiments to tasks - network/ssh.go: SSH tunneling improvements - queue/: Filesystem queue enhancements	2026-03-08 13:03:27 -04:00
Jeremie Fraeys	0b5e99f720	refactor(scheduler,worker): improve service management and GPU detection Scheduler enhancements: - auth.go: Group membership validation in authentication - hub.go: Task distribution with group affinity - port_allocator.go: Dynamic port allocation with conflict resolution - scheduler_conn.go: Connection pooling and retry logic - service_manager.go: Lifecycle management for scheduler services - service_templates.go: Template-based service configuration - state.go: Persistent state management with recovery Worker improvements: - config.go: Extended configuration for task visibility rules - execution/setup.go: Sandboxed execution environment setup - executor/container.go: Container runtime integration - executor/runner.go: Task runner with visibility enforcement - gpu_detector.go: Robust GPU detection (NVIDIA, AMD, Apple Silicon, CPU fallback) - integrity/validate.go: Data integrity validation - lifecycle/runloop.go: Improved runloop with graceful shutdown - lifecycle/service_manager.go: Service lifecycle coordination - process/isolation.go + isolation_unix.go: Process isolation with namespaces/cgroups - tenant/manager.go: Multi-tenant resource isolation - tenant/middleware.go: Tenant context propagation - worker.go: Core worker with group-scoped task execution	2026-03-08 13:03:15 -04:00
Jeremie Fraeys	1c7205c0a0	feat(audit): add HTTP audit middleware and tamper-evident logging Comprehensive audit system for security and compliance: - middleware/audit.go: HTTP request/response auditing middleware * Captures request details, user identity, response status * Chains audit events with cryptographic hashes for tamper detection * Configurable filtering for sensitive data redaction - audit/chain.go: Blockchain-style audit log chaining * Each entry includes hash of previous entry * Tamper detection through hash verification * Supports incremental verification without full scan - checkpoint.go: Periodic integrity checkpoints * Creates signed checkpoints for fast verification * Configurable checkpoint intervals * Recovery from last known good checkpoint - rotation.go: Automatic log rotation and archival * Size-based and time-based rotation policies * Compressed archival with integrity seals * Retention policy enforcement - sealed.go: Cryptographic sealing of audit logs * Digital signatures for log integrity * HSM support preparation * Exportable sealed bundles for external auditors - verifier.go: Log verification and forensic analysis * Complete chain verification from genesis to latest * Detects gaps, tampering, unauthorized modifications * Forensic export for incident response	2026-03-08 13:03:02 -04:00
Jeremie Fraeys	7e5ceec069	feat(api): add groups and tokens handlers, refactor routes Add new API endpoints and clean up handler interfaces: - groups/handlers.go: New lab group management API * CRUD operations for lab groups * Member management with role assignment (admin/member/viewer) * Group listing and membership queries - tokens/handlers.go: Token generation and validation endpoints * Create access tokens for public task sharing * Validate tokens for secure access * Token revocation and cleanup - routes.go: Refactor handler registration * Integrate groups handler into WebSocket routes * Remove nil parameters from all handler constructors * Cleaner dependency injection pattern - Handler interface cleanup across all modules: * jobs/handlers.go: Remove unused nil privacyEnforcer parameter * jupyter/handlers.go: Streamline initialization * scheduler/handlers.go: Consistent constructor signature * ws/handler.go: Add groups handler to dependencies	2026-03-08 12:51:25 -04:00
Jeremie Fraeys	c52179dcbe	feat(auth): add token-based access and structured logging Add comprehensive authentication and authorization enhancements: - tokens.go: New token management system for public task access and cloning * SHA-256 hashed token storage for security * Token generation, validation, and automatic cleanup * Support for public access and clone permissions - api_key.go: Extend User struct with Groups field * Lab group membership (ml-lab, nlp-group) * Integration with permission system for group-based access - flags.go: Security hardening - migrate to structured logging * Replace log.Printf with log/slog to prevent log injection attacks * Consistent structured output for all auth warnings * Safe handling of file paths and errors in logs - permissions.go: Add task sharing permission constants * PermissionTasksReadOwn: Access own tasks * PermissionTasksReadLab: Access lab group tasks * PermissionTasksReadAll: Admin/institution-wide access * PermissionTasksShare: Grant access to other users * PermissionTasksClone: Create copies of shared tasks * CanAccessTask() method with visibility checks - database.go: Improve error handling * Add structured error logging on row close failures	2026-03-08 12:51:07 -04:00
Jeremie Fraeys	fbcf4d38e5	feat(storage): add groups, tasks, tokens, and audit database schemas Add comprehensive database storage layer for new features: - db_groups.go: Lab group management with members, roles (admin/member/viewer), and group-based task visibility queries - db_tasks.go: Task visibility system (private/lab/institution/open), task sharing with expiry, public clone tokens, and optimized ListTasksForUser() for access control - db_tokens.go: Secure token management for public task access and cloning, with SHA-256 hashed token storage and automatic cleanup - db_audit.go: Audit log persistence with checkpoint chains, tamper detection, and log rotation support - schema_sqlite.sql: Updated schema with: - groups, group_members tables - tasks.visibility enum, task_shares with expiry - access_tokens table with hashed tokens - audit_logs, audit_checkpoints tables - indexes for all foreign keys and query patterns - db_experiments.go: Add CascadeVisibilityToTasks() for propagating visibility changes from experiments to associated tasks	2026-03-08 12:48:42 -04:00
Jeremie Fraeys	ba9a358412	fix(scheduler): resolve TestEndToEndJobLifecycle race and getTask bug ## Problem TestEndToEndJobLifecycle was failing with two issues: 1. Race condition: Workers signaled ready before job was processed, receiving MsgNoWork instead of MsgJobAssign 2. getTask() didn't check pendingAcceptance - assigned-but-not-yet-accepted tasks returned nil ## Changes ### Test Fix (restart_recovery_test.go) - Replace single-shot select with retry loop that re-signals workers as ready - Handle both assignment and non-assignment messages correctly - Add 10ms delay between non-assignment messages to allow job processing - Use 2-second deadline with 100ms timeout intervals ### Scheduler Fix (hub.go) - Extend getTask() to check pendingAcceptance map after batch/service queues - Allows GetTask() to find tasks in 'assigned' state before acceptance - Maintains backward compatibility with existing queue/running lookups ## Testing make test now passes: 475 passed, 0 failed, 34 skipped	2026-03-05 14:40:43 -05:00
Jeremie Fraeys	c6a224d5fc	feat(cli,server): unify info command with remote/local support Enhance ml info to query server when connected, falling back to local manifests when offline. Unifies behavior with other commands like run, exec, and cancel. CLI changes: - Add --local and --remote flags for explicit control - Auto-detect connection state via mode.detect() - queryRemoteRun(): Query server via WebSocket for run details - queryLocalRun(): Read local run_manifest.json - displayRunInfo(): Shared display logic for both sources - Add connection status indicators (Remote: connecting.../connected) WebSocket protocol: - Add query_run_info opcode (0x28) to cli and server - Add sendQueryRunInfo() method to ws/client.zig - Protocol: [opcode:1][api_key_hash:16][run_id_len:1][run_id:var] Server changes: - Add handleQueryRunInfo() handler to ws/handler.go - Returns run_id, job_name, user, timestamp, overall_sha, files_count - Checks PermJobsRead permission - Looks up run in experiment manager Usage: ml info abc123 # Auto: tries remote, falls back to local ml info abc123 --local # Force local manifest lookup ml info abc123 --remote # Force remote query (fails if offline)	2026-03-05 12:07:00 -05:00
Jeremie Fraeys	747579eae4	refactor: misc improvements across codebase Various improvements: - Makefile: build optimizations and native lib integration - prune.zig: cleanup logic refinements - status.zig: improved status reporting - experiment_core.zig: core functionality updates - progress.zig: progress bar improvements - task.go: domain model updates for task handling All tests pass.	2026-03-05 10:58:22 -05:00
Jeremie Fraeys	08ab628546	refactor(scheduler): remove dead code Remove three unused methods/parameter identified by static analysis: - canRequeue(): never integrated into scheduling flow - runMetricsClient clientID param: accepted but never used - getUsageLocked(): callers inline the logic Fixes IDE warnings about unused code per AGENTS.md cleanup discipline.	2026-03-04 13:35:18 -05:00
Jeremie Fraeys	7cd86fb88a	feat: add new API handlers, build scripts, and ADRs - Introduce audit, plugin, and scheduler API handlers - Add spec_embed.go for OpenAPI spec embedding - Create modular build scripts (cli, go, native, cross-platform) - Add deployment cleanup and health-check utilities - New ADRs: hot reload, audit store, SSE updates, RBAC, caching, offline mode, KMS regions, tenant offboarding - Add KMS configuration schema and worker variants - Include KMS benchmark tests	2026-03-04 13:24:27 -05:00
Jeremie Fraeys	61081655d2	feat: enhance worker execution and scheduler service templates - Refactor worker configuration management - Improve container executor lifecycle handling - Update runloop and worker core logic - Enhance scheduler service template generation - Remove obsolete 'scheduler' symlink/directory	2026-03-04 13:24:20 -05:00
Jeremie Fraeys	66f262d788	security: improve audit, crypto, and config handling - Enhance audit checkpoint system - Update KMS provider and tenant key management - Refine configuration constants - Improve TUI config handling	2026-03-04 13:23:42 -05:00
Jeremie Fraeys	a4f2c36069	feat: enhance task domain and scheduler protocol - Update task domain model - Improve scheduler hub and priority queue - Enhance protocol definitions - Update manifest schema and run handling	2026-03-04 13:23:38 -05:00
Jeremie Fraeys	1f495dfbb7	api: regenerate OpenAPI types and server code - Update openapi.yaml spec - Regenerate server_gen.go with oapi-codegen - Update adapter, routes, and server configuration	2026-03-04 13:23:34 -05:00
Jeremie Fraeys	e1ec255ad2	refactor(crypto): integrate KMS with TenantKeyManager Replace in-memory root keys with KMS interface: - GenerateDataEncryptionKey: generate DEK, wrap via KMS, cache - UnwrapDataEncryptionKey: cache check, KMS decrypt, cache store - EncryptArtifact/DecryptArtifact: use DEK from KMS - RotateTenantKey: create new KMS key, flush cache - RevokeTenant: disable KMS key, schedule deletion per ADR-015 Remove deprecated methods: wrapKey, unwrapKey (replaced by KMS)	2026-03-03 19:14:27 -05:00
Jeremie Fraeys	7c03c8b5bd	feat(kms): add HashiCorp Vault and AWS KMS providers Implement VaultProvider with Transit engine: - AppRole, Kubernetes, and Token authentication - Encrypt/Decrypt via /transit/encrypt and /transit/decrypt - Key lifecycle via /transit/keys API - Health check via /sys/health Implement AWSProvider with SDK v2: - Per-region key naming with alias prefix - Encrypt/Decrypt via KMS SDK - Key lifecycle (CreateKey, Disable, ScheduleDeletion, Enable) - AWS endpoint support for LocalStack testing	2026-03-03 19:14:21 -05:00
Jeremie Fraeys	cb25677695	feat(kms): implement core KMS infrastructure with DEK cache Add KMSProvider interface for external key management systems: - Encrypt/Decrypt operations for DEK wrapping - Key lifecycle management (Create, Disable, ScheduleDeletion, Enable) - HealthCheck and Close methods Implement MemoryProvider for development/testing: - XOR encryption with HMAC-SHA256 authentication - Secure random key generation using crypto/rand - MAC verification to detect wrong keys Implement DEKCache per ADR-012: - 15-minute TTL with configurable grace window (1 hour) - LRU eviction with 1000 entry limit - Cache key includes (tenantID, artifactID, kmsKeyID) for isolation - Thread-safe operations with RWMutex - Secure memory wiping on eviction/cleanup Add config package with types: - ProviderType enum (vault, aws, memory) - VaultConfig with AppRole/Kubernetes/Token auth - AWSConfig with region and alias prefix - CacheConfig with TTL, MaxEntries, GraceWindow - Validation methods for all config types	2026-03-03 19:13:55 -05:00
Jeremie Fraeys	da104367d6	feat: add Plugin GPU Quota implementation and tests - Add plugin_quota.go with GPU quota management for scheduler - Update scheduler hub and protocol for plugin support - Add comprehensive plugin quota unit tests - Update gang service and WebSocket queue integration tests	2026-02-26 14:35:05 -05:00
Jeremie Fraeys	8f2495deb0	chore(cleanup): remove obsolete files and update .gitignore Remove deprecated components replaced by new scheduler: - Delete internal/controller/pacing_controller.go (replaced by scheduler/pacing.go) - Delete internal/manifest/schema_test.go (consolidated into tests/unit/) - Delete internal/workertest/worker.go (consolidated into tests/fixtures/) - Update .gitignore with scheduler binary and new patterns	2026-02-26 12:09:18 -05:00
Jeremie Fraeys	4cdb68907e	refactor(utilities): update supporting modules for scheduler integration Update utility modules: - File utilities with secure file operations - Environment pool with resource tracking - Error types with scheduler error categories - Logging with audit context support - Network/SSH with connection pooling - Privacy/PII handling with tenant boundaries - Resource manager with scheduler allocation - Security monitor with audit integration - Tracking plugins (MLflow, TensorBoard) with auth - Crypto signing with tenant keys - Database init with multi-user support	2026-02-26 12:07:15 -05:00
Jeremie Fraeys	6866ba9366	refactor(queue): integrate scheduler backend and storage improvements Update queue and storage systems for scheduler integration: - Queue backend with scheduler coordination - Filesystem queue with batch operations - Deduplication with tenant-aware keys - Storage layer with audit logging hooks - Domain models (Task, Events, Errors) with scheduler fields - Database layer with tenant isolation - Dataset storage with integrity checks	2026-02-26 12:06:46 -05:00
Jeremie Fraeys	6b2c377680	refactor(jupyter): enhance security and scheduler integration Update Jupyter integration for security and scheduler support: - Enhanced security configuration with audit logging - Health monitoring with scheduler event integration - Package manager with network policy enforcement - Service manager with lifecycle hooks - Network manager with tenant isolation - Workspace metadata with tenant tags - Config with resource limits - Podman container integration improvements - Experiment manager with tracking integration - Manifest runner with security checks	2026-02-26 12:06:35 -05:00
Jeremie Fraeys	3fb6902fa1	feat(worker): integrate scheduler endpoints and security hardening Update worker system for scheduler integration: - Worker server with scheduler registration - Configuration with scheduler endpoint support - Artifact handling with integrity verification - Container executor with supply chain validation - Local executor enhancements - GPU detection improvements (cross-platform) - Error handling with execution context - Factory pattern for executor instantiation - Hash integrity with native library support	2026-02-26 12:06:16 -05:00
Jeremie Fraeys	ef11d88a75	refactor(auth): add tenant scoping and permission enhancements Update authentication system for multi-tenant support: - API key management with tenant scoping - Permission checks for multi-tenant operations - Database layer with tenant isolation - Keychain integration with audit logging	2026-02-26 12:06:08 -05:00
Jeremie Fraeys	420de879ff	feat(api): integrate scheduler protocol and WebSocket enhancements Update API layer for scheduler integration: - WebSocket handlers with scheduler protocol support - Jobs WebSocket endpoint with priority queue integration - Validation middleware for scheduler messages - Server configuration with security hardening - Protocol definitions for worker-scheduler communication - Dataset handlers with tenant isolation checks - Response helpers with audit context - OpenAPI spec updates for new endpoints	2026-02-26 12:05:57 -05:00
Jeremie Fraeys	95adcba437	feat(worker): add Jupyter/vLLM plugins and process isolation Extend worker capabilities with new execution plugins and security features: - Jupyter plugin for notebook-based ML experiments - vLLM plugin for LLM inference workloads - Cross-platform process isolation (Unix/Windows) - Network policy enforcement with platform-specific implementations - Service manager integration for lifecycle management - Scheduler backend integration for queue coordination Update lifecycle management: - Enhanced runloop with state transitions - Service manager integration for plugin coordination - Improved state persistence and recovery Add test coverage: - Unit tests for Jupyter and vLLM plugins - Updated worker execution tests	2026-02-26 12:03:59 -05:00
Jeremie Fraeys	a981e89005	feat(security): add audit subsystem and tenant isolation Implement comprehensive audit and security infrastructure: - Immutable audit logs with platform-specific backends (Linux/Other) - Sealed log entries with tamper-evident checksums - Audit alert system for real-time security notifications - Log rotation with retention policies - Checkpoint-based audit verification Add multi-tenant security features: - Tenant manager with quota enforcement - Middleware for tenant authentication/authorization - Per-tenant cryptographic key isolation - Supply chain security for container verification - Cross-platform secure file utilities (Unix/Windows) Add test coverage: - Unit tests for audit alerts and sealed logs - Platform-specific audit backend tests	2026-02-26 12:03:45 -05:00
Jeremie Fraeys	43e6446587	feat(scheduler): implement multi-tenant job scheduler with gang scheduling Add new scheduler component for distributed ML workload orchestration: - Hub-based coordination for multi-worker clusters - Pacing controller for rate limiting job submissions - Priority queue with preemption support - Port allocator for dynamic service discovery - Protocol handlers for worker-scheduler communication - Service manager with OS-specific implementations - Connection management and state persistence - Template system for service deployment Includes comprehensive test suite: - Unit tests for all core components - Integration tests for distributed scenarios - Benchmark tests for performance validation - Mock fixtures for isolated testing Refs: scheduler-architecture.md	2026-02-26 12:03:23 -05:00
Jeremie Fraeys	8f9bcef754	test(phase-3): prerequisite security and reproducibility tests Implement 4 prerequisite test requirements: - TestConfigIntegrityVerification: Config signing, tamper detection, hash stability - TestManifestFilenameNonce: Cryptographic nonce generation and filename patterns - TestGPUDetectionAudit: Structured logging of GPU detection at startup - TestResourceEnvVarParsing: Resource env var parsing and override behavior Also update manifest run_manifest.go: - Add nonce-based filename support to WriteToDir - Add nonce-based file detection to LoadFromDir	2026-02-23 20:25:26 -05:00
Jeremie Fraeys	f71352202e	test(phase-1-2): naming alignment and partial test completion Rename and enhance existing tests to align with coverage map: - TestGPUDetectorAMDVendorAlias -> TestAMDAliasManifestRecord - TestScanArtifacts_SkipsKnownPathsAndLogs -> TestScanExclusionsRecorded - Add env var expansion verification to TestHIPAAValidation_InlineCredentials - Record exclusions in manifest.Artifacts for audit trail	2026-02-23 20:25:07 -05:00
Jeremie Fraeys	17d5c75e33	fix(security): Path validation improvements for symlink resolution Fix ValidatePath to correctly resolve symlinks and handle edge cases: - Resolve symlinks before boundary check to prevent traversal - Handle macOS /private prefix correctly - Add fallback for non-existent paths (parent directory resolution) - Double boundary checks: before AND after symlink resolution - Prevent race conditions between check and use Update path traversal tests: - Correct test expectations for "..." (three dots is valid filename, not traversal) - Add tests for symlink escape attempts - Add unicode attack tests - Add deeply nested traversal tests Security impact: Prevents path traversal via symlink following in artifact scanning and other file operations.	2026-02-23 19:44:16 -05:00
Jeremie Fraeys	58c1a5fa58	feat(audit): Tamper-evident audit chain verification system Add ChainVerifier for cryptographic audit log verification: - VerifyLogFile(): Validates entire audit chain integrity - Detects tampering at specific event index (FirstTampered) - Returns chain root hash for external verification - GetChainRootHash(): Standalone hash computation - VerifyAndAlert(): Boolean tampering detection with logging Add audit-verifier CLI tool: - Standalone binary for audit chain verification - Takes log path argument and reports tampering Update audit logger for chain integrity: - Each event includes sequence number and hash chain - SHA-256 linking: hash_n = SHA-256(prev_hash || event_n) - Tamper detection through hash chain validation Add comprehensive test coverage: - Empty log handling - Valid chain verification - Tampering detection with modification - Root hash consistency - Alert mechanism tests Part of: V.7 audit verification from security plan	2026-02-23 19:43:50 -05:00
Jeremie Fraeys	4a4d3de8e1	feat(security): Manifest security - nonce generation, environment tracking, schema validation Add cryptographically secure manifest filename nonce generation: - GenerateManifestNonce() creates 16-byte random nonce (32 hex chars) - GenerateManifestFilename() creates unique filenames: run_manifest_<nonce>.json - Prevents enumeration attacks on manifest files Add ExecutionEnvironment struct to manifest: - Captures ConfigHash for reproducibility verification - Records GPU detection method (auto-detected, env override, config, etc.) - Records sandbox settings (NoNewPrivileges, DropAllCaps, NetworkMode) - Records compliance mode and manifest nonce - Records artifact scan exclusions with reason Add JSON Schema validation: - schema.json: Canonical schema for manifest validation - schema_version.go: Schema versioning and compatibility checking - schema_test.go: Drift detection with SHA-256 hash verification - Validates required fields (run_id, environment.config_hash, etc.) - Validates compliance_mode enum values (hipaa, standard) - Validates no negative sizes in artifacts Closes: manifest nonce, environment tracking, scan exclusions from security plan	2026-02-23 19:43:39 -05:00
Jeremie Fraeys	9434f4c8e6	feat(security): Artifact ingestion caps enforcement Add MaxArtifactFiles and MaxArtifactTotalBytes to SandboxConfig: - Default MaxArtifactFiles: 10,000 (configurable via SecurityDefaults) - Default MaxArtifactTotalBytes: 100GB (configurable via SecurityDefaults) - ApplySecurityDefaults() sets defaults if not specified Enforce caps in scanArtifacts() during directory walk: - Returns error immediately when MaxArtifactFiles exceeded - Returns error immediately when MaxArtifactTotalBytes exceeded - Prevents resource exhaustion attacks from malicious artifact trees Update all call sites to pass SandboxConfig for cap enforcement: - Native bridge libs updated to pass caps argument - Benchmark tests updated with nil caps (unlimited for benchmarks) - Unit tests updated with nil caps Closes: artifact ingestion caps items from security plan	2026-02-23 19:43:28 -05:00
Jeremie Fraeys	a8180f1f26	feat(security): HIPAA compliance mode and PHI denylist validation Add compliance_mode field to Config with strict HIPAA validation: - Requires SnapshotStore.Secure=true in HIPAA mode - Requires NetworkMode="none" for tenant isolation - Requires non-empty SeccompProfile - Requires NoNewPrivileges=true - Enforces credentials via environment variables only (no inline YAML) Add PHI denylist validation for AllowedSecrets: - Blocks secrets matching patterns: patient, ssn, mrn, medical_record, diagnosis, dob, birth, mrn_number, patient_id, patient_name - Prevents accidental PHI exfiltration via secret channels Add comprehensive test coverage in hipaa_validation_test.go: - Network mode enforcement tests - NoNewPrivileges requirement tests - Seccomp profile validation tests - Inline credential rejection tests - PHI denylist validation tests Closes: compliance_mode, PHI denylist items from security plan	2026-02-23 19:43:19 -05:00
Jeremie Fraeys	fc2459977c	refactor(worker): update worker tests and native bridge Worker Refactoring: - Update internal/worker/factory.go, worker.go, snapshot_store.go - Update native_bridge.go and native_bridge_nocgo.go for native library integration Test Updates: - Update all worker unit tests for new interfaces - Update chaos tests - Update container/podman_test.go - Add internal/workertest/worker.go for shared test utilities Documentation: - Update native/README.md	2026-02-23 18:04:22 -05:00
Jeremie Fraeys	a70d8aad8e	refactor: remove dead code and fix unused variables Cleanup: - Delete internal/worker/testutil.go (150 lines of unused test utilities) - Remove unused stateDir() function from internal/jupyter/service_manager.go - Silence unused variable warning in internal/worker/executor/container.go	2026-02-23 18:03:38 -05:00
Jeremie Fraeys	92aab06d76	feat(security): implement comprehensive security hardening phases 1-5,7 Implements defense-in-depth security for HIPAA and multi-tenant requirements: Phase 1 - File Ingestion Security: - SecurePathValidator with symlink resolution and path boundary enforcement in internal/fileutil/secure.go - Magic bytes validation for ML artifacts (safetensors, GGUF, HDF5, numpy) in internal/fileutil/filetype.go - Dangerous extension blocking (.pt, .pkl, .exe, .sh, .zip) - Upload limits (10GB size, 100MB/s rate, 10 uploads/min) Phase 2 - Sandbox Hardening: - ApplySecurityDefaults() with secure-by-default principle - network_mode: none, read_only_root: true, no_new_privileges: true - drop_all_caps: true, user_ns: true, run_as_uid/gid: 1000 - PodmanSecurityConfig and BuildSecurityArgs() in internal/container/podman.go - BuildPodmanCommand now accepts full security configuration - Container executor passes SandboxConfig to Podman command builder - configs/seccomp/default-hardened.json blocks dangerous syscalls (ptrace, mount, reboot, kexec_load, open_by_handle_at) Phase 3 - Secrets Management: - expandSecrets() for environment variable expansion using ${VAR} syntax - validateNoPlaintextSecrets() with entropy-based detection - Pattern matching for AWS, GitHub, GitLab, OpenAI, Stripe tokens - Shannon entropy calculation (>4 bits/char triggers detection) - Secrets expanded during LoadConfig() before validation Phase 5 - HIPAA Audit Logging: - Tamper-evident chain hashing with SHA-256 in internal/audit/audit.go - Event struct extended with PrevHash, EventHash, SequenceNum - File access event types: EventFileRead, EventFileWrite, EventFileDelete - LogFileAccess() helper for HIPAA compliance - VerifyChain() function for tamper detection Supporting Changes: - Add DeleteJob() and DeleteJobsByPrefix() to storage package - Integrate SecurePathValidator in artifact scanning	2026-02-23 18:00:33 -05:00
Jeremie Fraeys	3b194ff2e8	feat: GPU detection transparency and artifact scanner improvements - Surface GPUDetectionInfo from parseGPUCountFromConfig for detection metadata - Document FETCH_ML_TOTAL_CPU and FETCH_ML_GPU_SLOTS_PER_GPU env vars - Add debug logging for all env var overrides to stderr - Track config-layer auto-detection in GPUDetectionInfo.ConfigLayerAutoDetected - Add --include-all flag to artifact scanner (includeAll parameter) - Add AMD production mode enforcement (error in non-local mode) - Add GPU detector unit tests for env overrides and AMD aliasing	2026-02-23 12:29:34 -05:00
Jeremie Fraeys	1b0781dc68	fix(auth): make DeleteAPIKey resilient to keyring errors DeleteAPIKey now ignores primary keyring errors (e.g., dbus unavailable) and always cleans up the fallback store	2026-02-21 21:19:46 -05:00

137 commits
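
The audit-chain commit (58c1a5fa58) describes its linking rule as hash_n = SHA-256(prev_hash || event_n). A minimal Go sketch of that rule follows. The names chainHash, buildChain, and firstTampered are hypothetical and are not the repository's ChainVerifier API; the empty genesis hash and the raw-byte concatenation are assumptions, since the commit message does not state an encoding.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chainHash returns SHA-256(prevHash || event).
// Assumption: the raw previous digest bytes are followed directly by the event bytes.
func chainHash(prevHash, event []byte) []byte {
	h := sha256.New()
	h.Write(prevHash)
	h.Write(event)
	return h.Sum(nil)
}

// buildChain computes the running hash after each event.
// Assumption: the chain starts from an empty previous hash.
func buildChain(events [][]byte) [][]byte {
	hashes := make([][]byte, 0, len(events))
	var prev []byte
	for _, e := range events {
		prev = chainHash(prev, e)
		hashes = append(hashes, prev)
	}
	return hashes
}

// firstTampered recomputes the chain and returns the index of the first event
// whose hash differs from the recorded one, or -1 if every link matches.
func firstTampered(events, recorded [][]byte) int {
	var prev []byte
	for i, e := range events {
		got := chainHash(prev, e)
		if i >= len(recorded) || !bytes.Equal(got, recorded[i]) {
			return i
		}
		prev = got
	}
	return -1
}

func main() {
	events := [][]byte{[]byte("login"), []byte("file_read"), []byte("logout")}
	recorded := buildChain(events)
	fmt.Println("root:", hex.EncodeToString(recorded[len(recorded)-1]))
	fmt.Println("intact:", firstTampered(events, recorded))

	// Modify one event after the hashes were recorded.
	events[1] = []byte("file_write")
	fmt.Println("tampered at:", firstTampered(events, recorded))
}
```

Because each hash covers the previous one, changing an event invalidates its own link and every later link, so reporting the first mismatching index is enough to locate the earliest modification.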