fetch_ml

Author	SHA1	Message	Date
Jeremie Fraeys	4b8adeacdc	test(crypto): add getter tests and property-based round-trip tests Add tests for: - GetPublicKey: returns correct public key - GetKeyID: returns correct key ID - Property-based round-trip: Sign -> Verify for various message types (empty, single char, unicode, large messages) Coverage: GetPublicKey 100%, GetKeyID 100%	2026-03-13 23:41:01 -04:00
Jeremie Fraeys	04175a97ee	test(tracking): add comprehensive plugin registry tests Add tests for tracking package main exports: - NewRegistry, Register, Get plugin management - ProvisionAll with empty, disabled, unregistered configs - TeardownAll lifecycle management - NewPortAllocator, Allocate, Release with proper port reuse - StringSetting, ToolMode constants, ToolConfig structure Coverage: 84.7%	2026-03-13 23:35:10 -04:00
Jeremie Fraeys	f74f3fa730	test(security): fix race conditions in monitor tests Use atomic operations for shared variables: - alertCount: atomic.Int32 for concurrent access - lastAlert: atomic.Value for alert storage Fixes data races detected by -race flag.	2026-03-13 23:31:22 -04:00
Jeremie Fraeys	4da027868d	fix(storage): handle NULL values and state tracking in database operations Fixes to support proper test coverage: - db_jobs.go: UpdateJobStatus now checks RowsAffected and returns error for nonexistent jobs instead of silently succeeding - db_audit.go: GetOldestAuditLogDate uses sql.NullString to parse SQLite datetime strings in YYYY-MM-DD HH:MM:SS format with RFC3339 fallback - db_experiments.go: ListTasksForExperiment uses sql.NullString for nullable worker_id and error fields to prevent scan errors - db_connect.go: DB struct adds isClosed state tracking with mutex; Close() now returns error on double close to match test expectations	2026-03-13 23:27:35 -04:00
Jeremie Fraeys	9b8d8e5281	test(tracking): add factory plugin loader tests Add tests for: - NewPluginLoader: factory creation - RegisterFactory: custom factory registration - LoadPluginsEmpty: empty plugin handling - LoadPluginsDisabled: skip disabled plugins - LoadPluginsUnknown: unknown plugin handling - PluginConfigStructure: config field validation - LoadPluginsMLflow, TensorBoard, Wandb: plugin type support Coverage: 79.2%	2026-03-13 23:26:52 -04:00
Jeremie Fraeys	5d39dff6a0	test(store): extend store coverage with edge cases and concurrency Add tests for: - Close: proper resource cleanup - ConcurrentLogMetrics: thread-safe metric logging - GetRunMetricsEmpty: empty result handling - GetRunParamsEmpty: empty result handling - MarkRunSyncedNonexistent: graceful handling of missing runs Coverage: 75.3%	2026-03-13 23:26:41 -04:00
Jeremie Fraeys	50b6506243	test(storage): add comprehensive storage layer tests Add tests for: - dataset: Redis dataset operations, transfer tracking - db_audit: audit logging with hash chain, access tracking - db_experiments: experiment metadata, dataset associations - db_tasks: task listing with pagination for users and groups - db_jobs: job CRUD, state transitions, worker assignment Coverage: storage package ~40%+	2026-03-13 23:26:33 -04:00
Jeremie Fraeys	5057f02167	test(crypto,security): add tenant key manager and anomaly monitor tests Add comprehensive tests for: - crypto/tenant_keys: KMS integration, key rotation, encryption/decryption - security/monitor: sliding window, anomaly detection, concurrent access Coverage: crypto 65.1%, security 100%	2026-03-13 23:26:22 -04:00
Jeremie Fraeys	77542b7068	refactor: update API plugins version retrieval Refactor getPluginVersion to accept PluginConfig parameter: - Change signature from getPluginVersion(pluginName) to getPluginVersion(pluginName, cfg) - Update all call sites to pass config - Add TODO comment for future implementation querying actual plugin binary/container Update plugin handlers to use dynamic version retrieval: - GetV1Plugins: Use h.getPluginVersion(name, cfg) instead of hardcoded "1.0.0" - PutV1PluginsPluginNameConfig: Pass newConfig to version retrieval - GetV1PluginsPluginNameHealth: Use actual version from config This prepares the API for dynamic version reporting while maintaining backward compatibility with the current placeholder implementation.	2026-03-12 16:40:39 -04:00
Jeremie Fraeys	96dd604789	feat: implement WebSocket binary protocol and NOT_IMPLEMENTED error code Add CodeNotImplemented error constant (HTTP 501) for planned but unavailable features. Refactor WebSocket packet handling from JSON to binary protocol for improved efficiency: New packet structure: - PacketTypeSuccess (0x00): [type:1][json_data:var] - PacketTypeError (0x01): [type:1][code_len:1][code:var][msg_len:2][msg:var][details_len:2][details:var] - PacketTypeData (0x02): Reserved for future use Update SendErrorPacket: - Build binary error packets with length-prefixed fields - Use WriteMessage with websocket.BinaryMessage Update SendSuccessPacket: - Marshal data to JSON then wrap in binary packet - Eliminates "success" wrapper field for cleaner protocol Add helper functions: - NewNotImplemented(feature) - Standard 501 error - NewNotImplementedWithIssue(feature, issueURL) - 501 with GitHub reference	2026-03-12 16:40:23 -04:00
Jeremie Fraeys	8a30acf661	build: update Makefile test paths and clean up old test references Update test paths to reflect new test locations: - Change tests/unit/... to internal/... in test, test-unit, and verify-audit targets - Update test-coverage target to use correct coverpkg paths - Add coverage summary output to test-coverage target Clean up deleted test files from old locations: - Remove tests/unit/crypto/kms/cache_test.go (now in internal/auth/kms/) - Remove tests/unit/crypto/kms/protocol_test.go (now in internal/auth/kms/) - Remove tests/unit/resources/manager_test.go (now in internal/resources/)	2026-03-12 16:39:52 -04:00
Jeremie Fraeys	d0266c4a90	refactor: scheduler hub bug fix, test helpers, and orphan recovery tests Fix bug in scheduler hub orphan reconciliation: - Move delete(h.pendingAcceptance, taskID) inside the requeue success block - Prevents premature cleanup when requeue fails Add comprehensive test infrastructure: - hub_test_helpers.go: New test helper utilities (78 lines) - Mock scheduler components for isolated testing - Test fixture setup and teardown helpers Refactor and enhance hub capabilities tests: - Significant restructuring of hub_capabilities_test.go (213 lines changed) - Improved test coverage for worker capability matching Add comprehensive orphan recovery tests: - internal/scheduler/orphan_recovery_test.go (451 lines) - Tests orphaned job detection and recovery - Covers requeue logic, timeout handling, state cleanup	2026-03-12 16:38:33 -04:00
Jeremie Fraeys	939faeb8e4	refactor: relocate store package from cmd/tui/internal to internal Move store package to improve reusability and follow Go project conventions: - cmd/tui/internal/store/store.go -> internal/store/store.go - cmd/tui/internal/store/store_test.go -> internal/store/store_test.go This makes the store package available to other components beyond the TUI, reducing coupling and enabling future reuse by API server, CLI, or other tools.	2026-03-12 16:38:01 -04:00
Jeremie Fraeys	7ff2c6c487	refactor: reorganize integration tests Move integration-appropriate tests from tests/unit/ to tests/integration/: - tests/unit/simple_test.go -> tests/integration/simple_test.go - tests/unit/deployments/traefik_compose_test.go -> tests/integration/traefik_compose_test.go - tests/unit/worker_trust_test.go -> tests/integration/worker_trust_test.go Update test package declarations and imports to reflect new locations. These tests were misplaced in the unit tests directory but actually test integration between components or external systems (Traefik, worker trust).	2026-03-12 16:37:42 -04:00
Jeremie Fraeys	61660dc925	refactor: co-locate security, storage, telemetry, tracking, worker tests Move unit tests from tests/unit/ to internal/ following Go conventions: Security tests: - tests/unit/security/* -> internal/security/* (audit, config_integrity, filetype, gpu_audit, hipaa_validation, manifest_filename, path_traversal, resource_quota, secrets) Storage tests: - tests/unit/storage/* -> internal/storage/* (db, experiment_metadata) Telemetry tests: - tests/unit/telemetry/* -> internal/telemetry/* (telemetry) Tracking tests: - tests/unit/reproducibility/* -> internal/tracking/* (config_hash, environment_capture) Worker tests: - tests/unit/worker/* -> internal/worker/* (artifacts, config, hash_bench, plugins/jupyter_task, plugins/vllm, prewarm_v1, run_manifest_execution, snapshot_stage, snapshot_store, worker) Update import paths in test files to reflect new locations.	2026-03-12 16:37:03 -04:00
Jeremie Fraeys	74e06017b5	refactor: co-locate scheduler non-hub tests with source code Move unit tests from tests/unit/scheduler/ to internal/scheduler/ following Go conventions: - capability_routing_test.go - Worker capability-based job routing tests - failure_scenarios_test.go - Scheduler failure handling and recovery tests - heartbeat_test.go - Worker heartbeat monitoring tests - plugin_quota_test.go - Plugin resource quota enforcement tests - port_allocator_test.go - Dynamic port allocation for services tests - priority_queue_test.go - Job priority queue implementation tests - service_templates_test.go - Service template management tests - state_store_test.go - Scheduler state persistence tests Note: orphan_recovery_test.go excluded from this commit - will be handled with hub refactoring due to significant test changes.	2026-03-12 16:36:29 -04:00
Jeremie Fraeys	ee0b90cfc5	refactor: co-locate queue and resources tests, add manager tests Move unit tests from tests/unit/ to internal/ following Go conventions: - tests/unit/queue/* -> internal/queue/* (dedup, filesystem_fallback, queue_permissions, queue_spec, queue, sqlite_queue tests) - tests/unit/gpu/* -> internal/resources/* (gpu_detector, gpu_golden tests) - tests/unit/resources/* -> internal/resources/* (manager_test.go) Update import paths in test files to reflect new locations. Note: GPU tests consolidated into resources package since GPU detection is part of resource management. Manager tests show significant new test coverage (166 lines).	2026-03-12 16:36:02 -04:00
Jeremie Fraeys	ca6ad970c3	refactor: co-locate logging, manifest, network, privacy, prommetrics tests Move unit tests from tests/unit/ to internal/ following Go conventions: - tests/unit/logging/* -> internal/logging/* (logging tests) - tests/unit/manifest/* -> internal/manifest/* (run_manifest, schema tests) - tests/unit/network/* -> internal/network/* (retry, ssh_pool, ssh tests) - tests/unit/privacy/* -> internal/privacy/* (pii tests) - tests/unit/metrics/* -> internal/prommetrics/* (metrics tests) Update import paths in test files to reflect new locations. Note: metrics_test.go moved from tests/unit/metrics/ to internal/prommetrics/ to match the actual package name.	2026-03-12 16:35:37 -04:00
Jeremie Fraeys	cf84246115	refactor: co-locate config, container, envpool, errors, experiment, jupyter tests Move unit tests from tests/unit/ to internal/ following Go conventions: - tests/unit/config/* -> internal/config/* (constants, mode_paths, paths, validation) - tests/unit/container/* -> internal/container/* (podman, security tests) - tests/unit/envpool/* -> internal/envpool/* (envpool tests) - tests/unit/errors/* -> internal/errtypes/* (errors_test.go moved to errtypes package) - tests/unit/experiment/* -> internal/experiment/* (manager tests) - tests/unit/jupyter/* -> internal/jupyter/* (config, package_blacklist, service_manager, trash_restore) Update import paths in test files to reflect new locations. Note: errors_test.go moved from tests/unit/errors/ to internal/errtypes/ to match the package structure.	2026-03-12 16:35:15 -04:00
Jeremie Fraeys	a4e2ecdbe6	refactor: co-locate api, audit, auth tests with source code Move unit tests from tests/unit/ to internal/ following Go conventions: - tests/unit/api/* -> internal/api/* (WebSocket handlers, helpers, duplicate detection) - tests/unit/audit/* -> internal/audit/* (alert, sealed, verifier tests) - tests/unit/auth/* -> internal/auth/* (API key, keychain, user manager) - tests/unit/crypto/kms/* -> internal/auth/kms/* (cache, protocol tests) Update import paths in test files to reflect new locations. Benefits: - Tests live alongside the code they test - Easier navigation and maintenance - Clearer package boundaries - Follows standard Go project layout	2026-03-12 16:34:54 -04:00
Jeremie Fraeys	b00fa236db	docs: add Known Limitations section and testing structure updates Add Known Limitations section to AGENTS.md documenting: - AMD GPU not implemented (use NVIDIA, Apple Silicon, or CPU) - 100+ node gang allocation stress testing not yet implemented - Podman-in-Docker CI requires privileged mode, not yet automated - Error handling patterns for unimplemented features - Container usage rules (Docker for testing/deployments, Podman for experiments) - Error codes table (NOT_IMPLEMENTED, NOT_FOUND, INVALID_CONFIGURATION) Update testing documentation to reflect new test locations: - Unit tests moved from tests/unit/ to internal/ (Go convention) - Update all test file path references in security testing docs	2026-03-12 16:33:19 -04:00
Jeremie Fraeys	6646f3a382	ci(docker): add test workflow and container architecture docs - Create docker-tests.yml for merge-to-main CI pipeline - Add mock GPU test matrix (NVIDIA, Metal, CPU-only) - Add AGENTS.md with container architecture rules: * Docker for CI/CD testing and deployments * Podman for ML experiment isolation only - Update .gitignore to track AGENTS.md	2026-03-12 14:05:53 -04:00
Jeremie Fraeys	6af85ddaf6	feat(tests): enable stress and long-running test suites Stress Tests: - TestStress_WorkerConnectBurst: 30 workers, p99 latency validation - TestStress_JobSubmissionBurst: 1K job submissions - TestStress_WorkerChurn: 50 connect/disconnect cycles, memory leak detection - TestStress_ConcurrentScheduling: 10 workers x 20 jobs contention Long-Running Tests: - TestLongRunning_MemoryLeak: heap growth monitoring - TestLongRunning_OrphanRecovery: worker death/requeue stability - TestLongRunning_WebSocketStability: 20 worker connection stability Infrastructure: - Add testreport package with JSON output, flaky test tracking - Add TestTimer for timing/budget enforcement - Add WaitForEvent, WaitForTaskStatus helpers - Fix worker IDs to use valid bench-worker token patterns	2026-03-12 14:05:45 -04:00
Jeremie Fraeys	ca913e8878	feat(scheduler): add test mode config and TLS detection - Add DisableTLSForTesting to HubConfig for test environments - Add IsUsingTLS() method to detect scheduler TLS status - Update MockWorker to auto-select ws:// vs wss:// protocol - Set DisableTLSForTesting: true in DefaultHubConfig	2026-03-12 14:05:35 -04:00
Jeremie Fraeys	c5524562e9	test(scheduler): remove unused fields in service slot pool separation test Remove ID and GPUCount fields from batchJob in TestServiceSlotPoolSeparation that were assigned but never used. The test only validates SlotPool values.	2026-03-12 12:10:33 -04:00
Jeremie Fraeys	a49e8f593c	chore(tools): update fetchml-vet analyzers Analyzer improvements: - hipaacomplete.go: refined HIPAA compliance checks - manifestenv.go: environment variable validation in manifests - nobaredetector.go: detection of bare credential exposures - noinlinecredentials.go: inline credential scanning improvements	2026-03-12 12:09:34 -04:00
Jeremie Fraeys	2bd7f97ae2	test(integration,unit): update test suites for new features and APIs Integration test updates: - jupyter_experiment_test.go: update for new workspace handling - run_manifest_test.go: reproducibility manifest validation - secrets_integration_test.go: KMS and secret provider tests - storage_redis_integration_test.go: Redis-backed storage tests Unit test updates: - response_helpers_test.go: API response helper tests - config_hash_test.go: configuration hashing for reproducibility - filetype_test.go: security file type detection tests Load testing: - load_test.go: scheduler load and stress tests	2026-03-12 12:09:15 -04:00
Jeremie Fraeys	2b1ef10514	test(chaos): add worker disconnect chaos test and queue improvements Chaos testing: - Add worker_disconnect_chaos_test.go for network partition resilience - Test scheduler hub recovery and job reassignment scenarios Queue layer updates: - event_store.go: add event sourcing for queue operations - native_queue.go: extend native queue with batch operations and indexing	2026-03-12 12:08:21 -04:00
Jeremie Fraeys	93d6d63d8d	chore(deploy): update Docker compose files and add MinIO lifecycle policies Docker Compose updates: - docker-compose.dev.yml: add GPU support, local scheduler and worker - docker-compose.staging.yml: production-like staging with SSL termination - docker-compose.test.yml: ephemeral test environment with seeded data MinIO lifecycle management: - Add lifecycle-dev.json: 7-day retention for dev artifacts - Add lifecycle-staging.json: 30-day retention with transition to cold Build improvements: - Makefile: add native library build targets and cross-platform support - scripts/release/cleanup.sh: improved artifact cleanup with dry-run mode	2026-03-12 12:06:16 -04:00
Jeremie Fraeys	17170667e2	feat(worker): improve lifecycle management and vLLM plugin Lifecycle improvements: - runloop.go: refined state machine with better error recovery - service_manager.go: service dependency management and health checks - states.go: add states for capability advertisement and draining Container execution: - container.go: improved OCI runtime integration with supply chain checks - Add image verification and signature validation - Better resource limits enforcement for GPU/memory vLLM plugin updates: - vllm.go: support for vLLM 0.3+ with new engine arguments - Add quantization-aware scheduling (AWQ, GPTQ, FP8) - Improve model download and caching logic Configuration: - config.go: add capability advertisement configuration - snapshot_store.go: improve snapshot management for checkpointing	2026-03-12 12:05:02 -04:00
Jeremie Fraeys	c18a8619fe	feat(api): add structured error package and refactor handlers New error handling: - Add internal/api/errors/errors.go with structured API error types - Standardize error codes across all API endpoints - Add user-facing error messages vs internal error details separation Handler improvements: - jupyter/handlers.go: better workspace lifecycle and error handling - plugins/handlers.go: plugin management with validation - groups/handlers.go: group CRUD with capability metadata - jobs/handlers.go: job submission and monitoring improvements - datasets/handlers.go: dataset upload/download with progress - validate/handlers.go: manifest validation with detailed errors - audit/handlers.go: audit log querying with filters Server configuration: - server_config.go: refined config loading with validation - server_gen.go: improved code generation for OpenAPI specs	2026-03-12 12:04:46 -04:00
Jeremie Fraeys	37c4d4e9c7	feat(crypto,auth): harden KMS and improve permission handling KMS improvements: - cache.go: add LRU eviction with memory-bounded caches - provider.go: refactor provider initialization and key rotation - tenant_keys.go: per-tenant key isolation with envelope encryption Auth layer updates: - hybrid.go: refine hybrid auth flow for API key + JWT - permissions_loader.go: faster permission caching with hot-reload - validator.go: stricter validation with detailed error messages Security middleware: - security.go: add rate limiting headers and CORS refinement Testing and benchmarks: - Add KMS cache and protocol unit tests - Add KMS benchmark tests for encryption throughput - Update KMS integration tests for tenant isolation	2026-03-12 12:04:32 -04:00
Jeremie Fraeys	de83300962	feat(worker): refactor GPU detection with macOS Metal support GPU detection refactor: - Major rewrite of gpu_detector.go with unified detection interface - Support for NVIDIA (NVML), AMD (ROCm), and Apple Metal - Runtime GPU capability querying for scheduler matching macOS improvements: - gpu_macos.go: native Metal device enumeration and memory queries - Support for Apple Silicon (M1/M2/M3) unified memory reporting - Fallback to system profiler for Intel Macs Testing infrastructure: - Add gpu_detector_mock.go for testing without hardware - Update gpu_golden_test.go with platform-specific expectations - Cross-platform GPU info validation	2026-03-12 12:02:41 -04:00
Jeremie Fraeys	188cf55939	refactor(api): overhaul WebSocket handler and protocol layer Major WebSocket handler refactor: - Rewrite ws/handler.go with structured message routing and backpressure - Add connection lifecycle management with heartbeats and timeouts - Implement graceful connection draining for zero-downtime restarts Protocol improvements: - Define structured protocol types in protocol.go for hub communication - Add versioned message envelopes for backward compatibility - Standardize error codes and response formats across WebSocket API Job streaming via WebSocket: - Simplify ws/jobs.go with async job status streaming - Add compression for high-volume job updates Testing: - Update websocket_e2e_test.go for new protocol semantics - Add connection resilience tests	2026-03-12 12:01:21 -04:00
Jeremie Fraeys	ad3be36a6d	feat(cli): add workers command, scheduler client, and PII utilities New commands and modules: - Add workers.zig command for worker management and status - Add scheduler_client.zig for scheduler hub communication - Add pii.zig utility for PII detection and redaction in logs/outputs Improvements to existing commands: - groups.zig: enhanced group management with capability metadata - jupyter/mod.zig: improved Jupyter workspace lifecycle handling - tasks.zig: better task status reporting and cancellation support Networking and sync improvements: - ws/client.zig: WebSocket client enhancements for hub protocol - sync_manager.zig: improved sync with scheduler state and conflict resolution - uuid.zig: optimized UUID generation for macOS and Linux Database utilities: - sqlite_embedded.zig: embedded SQLite for CLI-local state caching	2026-03-12 12:00:49 -04:00
Jeremie Fraeys	57787e1e7b	feat(scheduler): implement capability-based routing and hub v2 Add comprehensive capability routing system to scheduler hub: - Capability-aware worker matching with requirement/offer negotiation - Hub v2 protocol with structured message types and heartbeat management - Worker capability advertisement and dynamic routing decisions - Orphan recovery for disconnected workers with state reconciliation - Template-based job scheduling with capability constraints Add extensive test coverage: - Unit tests for capability routing logic and heartbeat mechanics - Unit tests for orphan recovery scenarios - E2E tests for capability routing across multiple workers - Hub capabilities integration tests - Scheduler fixture helpers for test setup Protocol improvements: - Define structured protocol messages for hub-worker communication - Add capability matching algorithm with scoring - Implement graceful worker disconnection handling	2026-03-12 12:00:05 -04:00
Jeremie Fraeys	13ffb81cab	fix: add CGO build tags to consistency tests, remove unused isHex function	2026-03-08 13:10:00 -04:00
Jeremie Fraeys	7eee31d721	chore: cleanup and miscellaneous updates - .gitignore: Add reports/ and .api-keys - examples/jupyter_experiment_integration.py: Update for new API - podman/scripts/: CLI integration, secure runner, ML tool testing - tools/: Performance regression detector, profiler utilities	2026-03-08 13:04:01 -04:00
Jeremie Fraeys	c74e91dd69	test: update test suite and remove deprecated privacy middleware Test improvements: - fixtures/: Updated mocks, fixtures with group context, SSH server, TUI driver - integration/: WebSocket queue and handler tests with groups - e2e/: WebSocket and TLS proxy end-to-end tests - unit/api/ws_test.go: WebSocket API tests - unit/scheduler/service_templates_test.go: Service template tests - benchmarks/scheduler_bench_test.go: Performance benchmarks Cleanup: - Remove privacy middleware (replaced by audit system) - Remove privacy_test.go	2026-03-08 13:03:55 -04:00
Jeremie Fraeys	cb142213fa	chore(build): update build system, Dockerfiles, and dependencies Build and deployment improvements: Makefile: - Native library build targets with ASan support - Cross-platform compilation helpers - Performance benchmark targets - Security scan integration Docker: - secure-prod.Dockerfile: Hardened production image (non-root, minimal surface) - simple.Dockerfile: Lightweight development image Scripts: - build/: Go and native library build scripts, cross-platform builds - ci/: checks.sh, test.sh, verify-paths.sh for validation - benchmarks/: Local performance testing and regression tracking - dev/: Monitoring setup Dependencies: Update to latest stable with security patches Commands: - api-server/main.go: Server initialization updates - data_manager/data_sync.go: Data sync with visibility - errors/main.go: Error handling improvements - tui/: TUI improvements for group management	2026-03-08 13:03:48 -04:00
Jeremie Fraeys	4b2782f674	feat(domain): add task visibility and supporting infrastructure Core domain and utility updates: - domain/task.go: Task model with visibility system * Visibility enum: private, lab, institution, open * Group associations for lab-scoped access * CreatedBy tracking for ownership * Sharing metadata with expiry - config/paths.go: Group-scoped data directories and audit log paths - crypto/signing.go: Key management for audit sealing, token signature verification - container/supply_chain.go: Image provenance tracking, vulnerability scanning - fileutil/filetype.go: MIME type detection and security validation - fileutil/secure.go: Protected file permissions, secure deletion - jupyter/: Package and service manager updates - experiment/manager.go: Visibility cascade from experiments to tasks - network/ssh.go: SSH tunneling improvements - queue/: Filesystem queue enhancements	2026-03-08 13:03:27 -04:00
Jeremie Fraeys	0b5e99f720	refactor(scheduler,worker): improve service management and GPU detection Scheduler enhancements: - auth.go: Group membership validation in authentication - hub.go: Task distribution with group affinity - port_allocator.go: Dynamic port allocation with conflict resolution - scheduler_conn.go: Connection pooling and retry logic - service_manager.go: Lifecycle management for scheduler services - service_templates.go: Template-based service configuration - state.go: Persistent state management with recovery Worker improvements: - config.go: Extended configuration for task visibility rules - execution/setup.go: Sandboxed execution environment setup - executor/container.go: Container runtime integration - executor/runner.go: Task runner with visibility enforcement - gpu_detector.go: Robust GPU detection (NVIDIA, AMD, Apple Silicon, CPU fallback) - integrity/validate.go: Data integrity validation - lifecycle/runloop.go: Improved runloop with graceful shutdown - lifecycle/service_manager.go: Service lifecycle coordination - process/isolation.go + isolation_unix.go: Process isolation with namespaces/cgroups - tenant/manager.go: Multi-tenant resource isolation - tenant/middleware.go: Tenant context propagation - worker.go: Core worker with group-scoped task execution	2026-03-08 13:03:15 -04:00
Jeremie Fraeys	5ae997ceb3	feat(cli): add groups and tasks commands with visibility controls New Zig CLI commands for lab management: - groups.zig: Lab group management commands * create-group: Create new lab groups with metadata * list-groups: Show all groups with member counts * add-member: Add users with role assignment (admin/member/viewer) * remove-member: Remove users from groups * group-info: Display group details and membership - tasks.zig: Task operations with visibility integration * create-task: New tasks with visibility flag (private/lab/institution/open) * list-tasks: Filter by visibility level and group membership * share-task: Generate access tokens for external sharing * clone-task: Copy tasks with public clone tokens * task-visibility: Change visibility and cascade to experiments - run.zig: Updated experiment runner * Integrate with new task visibility system * Group-scoped experiment execution * Token-based access for shared experiments - main.zig: Command registration updates * Wire up new groups and tasks commands * Updated help text and command discovery	2026-03-08 13:03:10 -04:00
Jeremie Fraeys	1c7205c0a0	feat(audit): add HTTP audit middleware and tamper-evident logging Comprehensive audit system for security and compliance: - middleware/audit.go: HTTP request/response auditing middleware * Captures request details, user identity, response status * Chains audit events with cryptographic hashes for tamper detection * Configurable filtering for sensitive data redaction - audit/chain.go: Blockchain-style audit log chaining * Each entry includes hash of previous entry * Tamper detection through hash verification * Supports incremental verification without full scan - checkpoint.go: Periodic integrity checkpoints * Creates signed checkpoints for fast verification * Configurable checkpoint intervals * Recovery from last known good checkpoint - rotation.go: Automatic log rotation and archival * Size-based and time-based rotation policies * Compressed archival with integrity seals * Retention policy enforcement - sealed.go: Cryptographic sealing of audit logs * Digital signatures for log integrity * HSM support preparation * Exportable sealed bundles for external auditors - verifier.go: Log verification and forensic analysis * Complete chain verification from genesis to latest * Detects gaps, tampering, unauthorized modifications * Forensic export for incident response	2026-03-08 13:03:02 -04:00
Jeremie Fraeys	7e5ceec069	feat(api): add groups and tokens handlers, refactor routes Add new API endpoints and clean up handler interfaces: - groups/handlers.go: New lab group management API * CRUD operations for lab groups * Member management with role assignment (admin/member/viewer) * Group listing and membership queries - tokens/handlers.go: Token generation and validation endpoints * Create access tokens for public task sharing * Validate tokens for secure access * Token revocation and cleanup - routes.go: Refactor handler registration * Integrate groups handler into WebSocket routes * Remove nil parameters from all handler constructors * Cleaner dependency injection pattern - Handler interface cleanup across all modules: * jobs/handlers.go: Remove unused nil privacyEnforcer parameter * jupyter/handlers.go: Streamline initialization * scheduler/handlers.go: Consistent constructor signature * ws/handler.go: Add groups handler to dependencies	2026-03-08 12:51:25 -04:00
Jeremie Fraeys	c52179dcbe	feat(auth): add token-based access and structured logging Add comprehensive authentication and authorization enhancements: - tokens.go: New token management system for public task access and cloning * SHA-256 hashed token storage for security * Token generation, validation, and automatic cleanup * Support for public access and clone permissions - api_key.go: Extend User struct with Groups field * Lab group membership (ml-lab, nlp-group) * Integration with permission system for group-based access - flags.go: Security hardening - migrate to structured logging * Replace log.Printf with log/slog to prevent log injection attacks * Consistent structured output for all auth warnings * Safe handling of file paths and errors in logs - permissions.go: Add task sharing permission constants * PermissionTasksReadOwn: Access own tasks * PermissionTasksReadLab: Access lab group tasks * PermissionTasksReadAll: Admin/institution-wide access * PermissionTasksShare: Grant access to other users * PermissionTasksClone: Create copies of shared tasks * CanAccessTask() method with visibility checks - database.go: Improve error handling * Add structured error logging on row close failures	2026-03-08 12:51:07 -04:00
Jeremie Fraeys	fbcf4d38e5	feat(storage): add groups, tasks, tokens, and audit database schemas Add comprehensive database storage layer for new features: - db_groups.go: Lab group management with members, roles (admin/member/viewer), and group-based task visibility queries - db_tasks.go: Task visibility system (private/lab/institution/open), task sharing with expiry, public clone tokens, and optimized ListTasksForUser() for access control - db_tokens.go: Secure token management for public task access and cloning, with SHA-256 hashed token storage and automatic cleanup - db_audit.go: Audit log persistence with checkpoint chains, tamper detection, and log rotation support - schema_sqlite.sql: Updated schema with: - groups, group_members tables - tasks.visibility enum, task_shares with expiry - access_tokens table with hashed tokens - audit_logs, audit_checkpoints tables - indexes for all foreign keys and query patterns - db_experiments.go: Add CascadeVisibilityToTasks() for propagating visibility changes from experiments to associated tasks	2026-03-08 12:48:42 -04:00
Jeremie Fraeys	a239f3a14f	test(consistency): add dataset hash consistency test suite Add cross-implementation consistency tests for dataset hash functionality: ## Test Fixtures - Single file, nested directories, and multiple file test cases - Expected hashes in JSON format for validation ## Test Infrastructure - harness.go: Common test utilities and reference implementation runner - dataset_hash_test.go: Consistency test cases comparing implementations - cmd/update.go: Tool to regenerate expected hashes from reference ## Purpose Ensures hash implementations (Go, C++, Zig) produce identical results across all supported platforms and implementations.	2026-03-05 14:41:14 -05:00
Jeremie Fraeys	8e5af0da2d	fix(build): resolve shell error in test_summary macro ## Problem test_summary macro was failing with 'integer expression expected' because grep -c output contained newlines, breaking the [ -gt 0 ] comparison. ## Fix - Add \| tr -d '\n' to strip newlines from grep -c output - Add 2>/dev/null to comparison to suppress any edge case errors ## Result Clean test summary output without shell errors	2026-03-05 14:40:48 -05:00
Jeremie Fraeys	ba9a358412	fix(scheduler): resolve TestEndToEndJobLifecycle race and getTask bug ## Problem TestEndToEndJobLifecycle was failing with two issues: 1. Race condition: Workers signaled ready before job was processed, receiving MsgNoWork instead of MsgJobAssign 2. getTask() didn't check pendingAcceptance - assigned-but-not-yet-accepted tasks returned nil ## Changes ### Test Fix (restart_recovery_test.go) - Replace single-shot select with retry loop that re-signals workers as ready - Handle both assignment and non-assignment messages correctly - Add 10ms delay between non-assignment messages to allow job processing - Use 2-second deadline with 100ms timeout intervals ### Scheduler Fix (hub.go) - Extend getTask() to check pendingAcceptance map after batch/service queues - Allows GetTask() to find tasks in 'assigned' state before acceptance - Maintains backward compatibility with existing queue/running lookups ## Testing make test now passes: 475 passed, 0 failed, 34 skipped	2026-03-05 14:40:43 -05:00
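
Note on commit 96dd604789: the error packet layout it lists, [type:1][code_len:1][code:var][msg_len:2][msg:var][details_len:2][details:var], can be sketched in Go roughly as follows. This is a minimal illustration and not the repository's SendErrorPacket. The function names encodeErrorPacket and decodeErrorPacket are hypothetical, and big-endian byte order for the two-byte length fields is an assumption, since the commit message does not state the byte order.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Packet type bytes as listed in the commit message.
const (
	PacketTypeSuccess byte = 0x00
	PacketTypeError   byte = 0x01
	PacketTypeData    byte = 0x02 // reserved for future use
)

var errTruncated = errors.New("truncated error packet")

// encodeErrorPacket (hypothetical name) builds
// [type:1][code_len:1][code:var][msg_len:2][msg:var][details_len:2][details:var].
// ASSUMPTION: the two-byte length fields are big-endian.
func encodeErrorPacket(code, msg, details string) ([]byte, error) {
	if len(code) > 0xFF {
		return nil, errors.New("code longer than 255 bytes")
	}
	if len(msg) > 0xFFFF || len(details) > 0xFFFF {
		return nil, errors.New("message or details longer than 65535 bytes")
	}
	buf := make([]byte, 0, 1+1+len(code)+2+len(msg)+2+len(details))
	buf = append(buf, PacketTypeError, byte(len(code)))
	buf = append(buf, code...)
	buf = binary.BigEndian.AppendUint16(buf, uint16(len(msg)))
	buf = append(buf, msg...)
	buf = binary.BigEndian.AppendUint16(buf, uint16(len(details)))
	buf = append(buf, details...)
	return buf, nil
}

// decodeErrorPacket (hypothetical name) reverses encodeErrorPacket and
// rejects packets that are too short for their declared lengths.
func decodeErrorPacket(pkt []byte) (code, msg, details string, err error) {
	if len(pkt) < 2 || pkt[0] != PacketTypeError {
		return "", "", "", errors.New("not an error packet")
	}
	rest := pkt[1:]
	n := int(rest[0])
	if len(rest) < 1+n {
		return "", "", "", errTruncated
	}
	code = string(rest[1 : 1+n])
	rest = rest[1+n:]
	// msg and details share the same 16-bit length-prefixed encoding.
	for _, dst := range []*string{&msg, &details} {
		if len(rest) < 2 {
			return "", "", "", errTruncated
		}
		n = int(binary.BigEndian.Uint16(rest))
		if len(rest) < 2+n {
			return "", "", "", errTruncated
		}
		*dst = string(rest[2 : 2+n])
		rest = rest[2+n:]
	}
	return code, msg, details, nil
}

func main() {
	// The code and message strings here are illustrative values only.
	pkt, err := encodeErrorPacket("NOT_IMPLEMENTED", "feature not available", "")
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", pkt)
	fmt.Println(decodeErrorPacket(pkt))
}
```

Length-prefixing each field lets a reader skip or validate fields without scanning for delimiters, which is presumably why the commit moved away from a JSON envelope for errors. The resulting bytes would be sent as a binary WebSocket frame, as the commit message describes.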

443 commits