Commit graph

452 commits

Author SHA1 Message Date
Jeremie Fraeys
f87db5d062
feat(scripts/maintenance): add Rust cache cleanup support
Add Rust build cache cleanup to maintenance script:
- Add cleanup_rust_cache() function with cargo clean integration
- Check for cargo availability before attempting cleanup
- Show cache size before/after for visibility
- Add 'rust' subcommand for targeted Rust cache cleanup
- Include Rust cache in 'benchmarks' and 'all' cleanup operations
- Add aggressive cleanup support for Rust
- Update help text and disk usage display
2026-03-23 15:22:27 -04:00
Jeremie Fraeys
b5ca575c6f
feat(scripts/dev): extend native detection for both C++ and Rust libraries
Update detect-native.go to detect and report both C++ and Rust native libs:
- Split library detection into C++ (native/build/) and Rust (native_rust/target/release/) sections
- Display separate '=== C++ Native Libraries ===' and '=== Rust Native Libraries ===' headers
- Check for both .so and .dylib extensions for each platform
- Update help text to show 'make native-build' for C++ and 'make rust-build' for Rust
- Show benchmark file availability for both implementations
2026-03-23 15:21:25 -04:00
Jeremie Fraeys
f4cce2f610
feat(scripts/benchmarks): add Rust native library benchmark support
Extend local benchmark runner to support Rust native libraries:
- Rename Step 1b native -> Step 1b C++ native for clarity
- Add Step 1c for Rust native library benchmarks via cargo bench
- Check for native_rust/target/release/libqueue_index.{dylib,so}
- Separate result files: native_cpp_benchmark_results.txt vs native_rust_benchmark_results.txt
- Update summary output to show both C++ and Rust benchmark availability
2026-03-23 15:20:14 -04:00
Jeremie Fraeys
3bfaffc735
feat(scripts): add build-rust.sh for cross-platform Rust native library builds
Add new build script for Rust native libraries:
- Builds dataset_hash and queue_index crates via cargo
- Cross-platform support: Linux (.so), macOS (.dylib), Windows (.dll)
- Outputs to bin/native/ for consistency with C++ native libs
- Error handling for missing cargo installation
2026-03-23 15:19:09 -04:00
Jeremie Fraeys
1d7be5c829
refactor(makefile): reorganize native library targets for C++ and Rust separation
Separate C++ and Rust native library targets in the Makefile:
- Rename native/rust/ references to native_rust/ for new workspace location
- Split 'Native Libraries' help section into C++ and Rust categories
- Rename prod-with-native -> prod-with-rust, detect-regressions-native -> detect-regressions-rust
- Add rust-debug target, remove native-release (redundant)
- Add compare-benchmarks target for Rust/Go/C++ performance comparison
- Update all rust-* targets with proper cargo availability checks
- Add cargo existence checks to prevent errors when Rust is not installed
2026-03-23 15:17:43 -04:00
Jeremie Fraeys
e67d18900e
chore(native): remove old native/rust directory
Remove the old native/rust/ directory - files were previously moved to native_rust/ workspace at the repository root.
This cleans up the deprecated location after the Rust workspace reorganization.
2026-03-23 15:16:26 -04:00
Jeremie Fraeys
6949287fb3
feat(native_rust): implement BLAKE3 dataset_hash and priority queue_index
Implements two production-ready Rust native libraries:

## dataset_hash (BLAKE3-based hashing)
- FFI exports: ds_hash_file, ds_hash_directory_batch, ds_hash_directory_combined
- BLAKE3 hashing for files and directory trees
- Hidden file filtering (respects .hidden and _prefix files)
- Prometheus-compatible metrics export
- Comprehensive integration tests (12 tests)
- Benchmarks: hash_file_1kb (~14µs), hash_file_1mb (~610µs), dir_100files (~1.6ms)

## queue_index (priority queue)
- FFI exports: 25+ functions matching C++ API
  - Lifecycle: qi_open, qi_close
  - Task ops: add_tasks, update_tasks, remove_tasks, get_task_by_id
  - Queue ops: get_next_batch, peek_next, mark_completed
  - Priority: get_next_priority_task, peek_priority_task
  - Query: get_all_tasks, get_tasks_by_status, get_task_count
  - Retry/DLQ: retry_task, move_to_dlq
  - Lease: renew_lease, release_lease
  - Maintenance: rebuild_index, compact_index
- BinaryHeap-based priority queue with correct Ord (max-heap)
- Memory-mapped storage with safe Rust wrappers
- Panic-safe FFI boundaries using catch_unwind
- Comprehensive integration tests (7 tests, 1 ignored for persistence)
- Benchmarks: add_100 (~60µs), get_10 (~24ns), priority (~5µs)

## Architecture
- Cargo workspace with shared common crate
- Criterion benchmarks for both crates
- Rust 1.85.0 toolchain pinned
- Zero compiler warnings
- All 19 tests passing

Compare: make compare-benchmarks (Rust/Go/C++ comparison)
2026-03-23 12:52:13 -04:00
Jeremie Fraeys
7efefa1933
feat(native): implement Rust native layer as a test
- queue_index: mmap-based priority queue with safe storage wrapper
- dataset_hash: BLAKE3 parallel hashing with rayon
- common: FFI utilities with panic recovery
- Minimal deps: ~20 total (rayon, blake3, memmap2, walkdir, chrono)
- Drop crossbeam, prometheus - use stdlib + manual metrics
- Makefile: cargo build targets, help text updated
- Forgejo CI: clippy, tests, miri, cargo-deny
- C FFI compatible with existing Go bindings
2026-03-14 17:45:58 -04:00
Jeremie Fraeys
f827ee522a
test(tracking/plugins): add PodmanInterface and comprehensive plugin tests for 91% coverage
Refactor plugins to use interface for testability:
- Add PodmanInterface to container package (StartContainer, StopContainer, RemoveContainer)
- Update MLflow plugin to use container.PodmanInterface
- Update TensorBoard plugin to use container.PodmanInterface
- Add comprehensive mocked tests for all three plugins (wandb, mlflow, tensorboard)
- Coverage increased from 18% to 91.4%
2026-03-14 16:59:16 -04:00
Jeremie Fraeys
4b8adeacdc
test(crypto): add getter tests and property-based round-trip tests
Add tests for:
- GetPublicKey: returns correct public key
- GetKeyID: returns correct key ID
- Property-based round-trip: Sign -> Verify for various message types
  (empty, single char, unicode, large messages)

Coverage: GetPublicKey 100%, GetKeyID 100%
2026-03-13 23:41:01 -04:00
Jeremie Fraeys
04175a97ee
test(tracking): add comprehensive plugin registry tests
Add tests for tracking package main exports:
- NewRegistry, Register, Get plugin management
- ProvisionAll with empty, disabled, unregistered configs
- TeardownAll lifecycle management
- NewPortAllocator, Allocate, Release with proper port reuse
- StringSetting, ToolMode constants, ToolConfig structure

Coverage: 84.7%
2026-03-13 23:35:10 -04:00
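The "Allocate, Release with proper port reuse" behaviour tested in commit 04175a97ee can be illustrated with a minimal Go sketch. The constructor arguments, the lowest-free-port policy, and the errNoPorts value are assumptions made for illustration; the commit message does not show the real signatures in the tracking package.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNoPorts is a hypothetical error value for an exhausted range.
var errNoPorts = errors.New("no free ports in range")

// PortAllocator hands out ports from an inclusive range [start, end].
type PortAllocator struct {
	mu         sync.Mutex
	start, end int
	inUse      map[int]bool
}

func NewPortAllocator(start, end int) *PortAllocator {
	return &PortAllocator{start: start, end: end, inUse: make(map[int]bool)}
}

// Allocate returns the lowest free port, so a released port is handed out
// again before any higher, never-used port.
func (p *PortAllocator) Allocate() (int, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for port := p.start; port <= p.end; port++ {
		if !p.inUse[port] {
			p.inUse[port] = true
			return port, nil
		}
	}
	return 0, errNoPorts
}

// Release marks a port as free; releasing an unallocated port is a no-op.
func (p *PortAllocator) Release(port int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.inUse, port)
}

func main() {
	alloc := NewPortAllocator(9000, 9002)
	a, _ := alloc.Allocate()
	b, _ := alloc.Allocate()
	alloc.Release(a)
	c, _ := alloc.Allocate() // reuses the port released above
	fmt.Println(a, b, c)
}
```

A test for port reuse then only needs to allocate, release, and allocate again, and compare the first and last values.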
Jeremie Fraeys
f74f3fa730
test(security): fix race conditions in monitor tests
Use atomic operations for shared variables:
- alertCount: atomic.Int32 for concurrent access
- lastAlert: atomic.Value for alert storage

Fixes data races detected by -race flag.
2026-03-13 23:31:22 -04:00
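The pattern named in commit f74f3fa730 (an atomic.Int32 counter plus an atomic.Value holding the latest alert) might look roughly like the sketch below. The alert and recorder types and the fire helper are hypothetical; only the use of sync/atomic for alertCount and lastAlert comes from the commit message.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// alert is a hypothetical stand-in for whatever the monitor reports.
type alert struct {
	Message string
}

// recorder collects alerts from concurrent callbacks without a mutex.
type recorder struct {
	alertCount atomic.Int32 // number of alerts observed
	lastAlert  atomic.Value // always stores a value of type alert
}

// onAlert is safe to call from many goroutines at once.
func (r *recorder) onAlert(a alert) {
	r.alertCount.Add(1)
	r.lastAlert.Store(a)
}

// fire calls onAlert n times concurrently and waits for all calls to finish.
func fire(r *recorder, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			r.onAlert(alert{Message: fmt.Sprintf("alert-%d", i)})
		}(i)
	}
	wg.Wait()
}

func main() {
	var r recorder
	fire(&r, 100)
	last, _ := r.lastAlert.Load().(alert)
	fmt.Println(r.alertCount.Load(), last.Message != "")
}
```

With plain int and struct fields instead of the atomic types, running the same code under go test -race would be expected to report a data race on both variables.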
Jeremie Fraeys
4da027868d
fix(storage): handle NULL values and state tracking in database operations
Fixes to support proper test coverage:

- db_jobs.go: UpdateJobStatus now checks RowsAffected and returns error
  for nonexistent jobs instead of silently succeeding
- db_audit.go: GetOldestAuditLogDate uses sql.NullString to parse SQLite
  datetime strings in YYYY-MM-DD HH:MM:SS format with RFC3339 fallback
- db_experiments.go: ListTasksForExperiment uses sql.NullString for
  nullable worker_id and error fields to prevent scan errors
- db_connect.go: DB struct adds isClosed state tracking with mutex;
  Close() now returns error on double close to match test expectations
2026-03-13 23:27:35 -04:00
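The db_audit.go change in commit 4da027868d (scan into sql.NullString, parse the SQLite "YYYY-MM-DD HH:MM:SS" format, fall back to RFC3339) can be sketched as below. The function name parseNullableTime, the text helper, and the choice to report NULL as ok=false are assumptions; the commit message does not show the real code.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// sqliteLayout is the Go layout for "YYYY-MM-DD HH:MM:SS".
const sqliteLayout = "2006-01-02 15:04:05"

// text wraps a non-NULL column value (hypothetical convenience helper).
func text(s string) sql.NullString {
	return sql.NullString{String: s, Valid: true}
}

// parseNullableTime returns ok=false for a NULL column, tries the SQLite
// layout first, and then falls back to RFC 3339.
func parseNullableTime(ns sql.NullString) (t time.Time, ok bool, err error) {
	if !ns.Valid {
		return time.Time{}, false, nil
	}
	if t, err = time.Parse(sqliteLayout, ns.String); err == nil {
		return t, true, nil
	}
	if t, err = time.Parse(time.RFC3339, ns.String); err == nil {
		return t, true, nil
	}
	return time.Time{}, false, fmt.Errorf("unrecognized datetime %q: %w", ns.String, err)
}

func main() {
	inputs := []sql.NullString{
		{}, // NULL, e.g. an empty audit log table
		text("2026-03-13 23:27:35"),
		text("2026-03-13T23:27:35-04:00"),
	}
	for _, in := range inputs {
		t, ok, err := parseNullableTime(in)
		fmt.Println(t, ok, err)
	}
}
```

Scanning into sql.NullString rather than time.Time or string is what avoids the scan error when the column is NULL.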
Jeremie Fraeys
9b8d8e5281
test(tracking): add factory plugin loader tests
Add tests for:
- NewPluginLoader: factory creation
- RegisterFactory: custom factory registration
- LoadPluginsEmpty: empty plugin handling
- LoadPluginsDisabled: skip disabled plugins
- LoadPluginsUnknown: unknown plugin handling
- PluginConfigStructure: config field validation
- LoadPluginsMLflow, TensorBoard, Wandb: plugin type support

Coverage: 79.2%
2026-03-13 23:26:52 -04:00
Jeremie Fraeys
5d39dff6a0
test(store): extend store coverage with edge cases and concurrency
Add tests for:
- Close: proper resource cleanup
- ConcurrentLogMetrics: thread-safe metric logging
- GetRunMetricsEmpty: empty result handling
- GetRunParamsEmpty: empty result handling
- MarkRunSyncedNonexistent: graceful handling of missing runs

Coverage: 75.3%
2026-03-13 23:26:41 -04:00
Jeremie Fraeys
50b6506243
test(storage): add comprehensive storage layer tests
Add tests for:
- dataset: Redis dataset operations, transfer tracking
- db_audit: audit logging with hash chain, access tracking
- db_experiments: experiment metadata, dataset associations
- db_tasks: task listing with pagination for users and groups
- db_jobs: job CRUD, state transitions, worker assignment

Coverage: storage package roughly 40% or higher
2026-03-13 23:26:33 -04:00
Jeremie Fraeys
5057f02167
test(crypto,security): add tenant key manager and anomaly monitor tests
Add comprehensive tests for:
- crypto/tenant_keys: KMS integration, key rotation, encryption/decryption
- security/monitor: sliding window, anomaly detection, concurrent access

Coverage: crypto 65.1%, security 100%
2026-03-13 23:26:22 -04:00
Jeremie Fraeys
77542b7068
refactor: update API plugins version retrieval
Refactor getPluginVersion to accept PluginConfig parameter:
- Change signature from getPluginVersion(pluginName) to getPluginVersion(pluginName, cfg)
- Update all call sites to pass config
- Add TODO comment for future implementation querying actual plugin binary/container

Update plugin handlers to use dynamic version retrieval:
- GetV1Plugins: Use h.getPluginVersion(name, cfg) instead of hardcoded "1.0.0"
- PutV1PluginsPluginNameConfig: Pass newConfig to version retrieval
- GetV1PluginsPluginNameHealth: Use actual version from config

This prepares the API for dynamic version reporting while maintaining
backward compatibility with the current placeholder implementation.
2026-03-12 16:40:39 -04:00
Jeremie Fraeys
96dd604789
feat: implement WebSocket binary protocol and NOT_IMPLEMENTED error code
Add CodeNotImplemented error constant (HTTP 501) for planned but unavailable features.

Refactor WebSocket packet handling from JSON to binary protocol for improved efficiency:

New packet structure:
- PacketTypeSuccess (0x00): [type:1][json_data:var]
- PacketTypeError (0x01): [type:1][code_len:1][code:var][msg_len:2][msg:var][details_len:2][details:var]
- PacketTypeData (0x02): Reserved for future use

Update SendErrorPacket:
- Build binary error packets with length-prefixed fields
- Use WriteMessage with websocket.BinaryMessage

Update SendSuccessPacket:
- Marshal data to JSON then wrap in binary packet
- Eliminates "success" wrapper field for cleaner protocol

Add helper functions:
- NewNotImplemented(feature) - Standard 501 error
- NewNotImplementedWithIssue(feature, issueURL) - 501 with GitHub reference
2026-03-12 16:40:23 -04:00
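The error packet layout listed in commit 96dd604789, [type:1][code_len:1][code:var][msg_len:2][msg:var][details_len:2][details:var], can be sketched as an encoder and decoder pair. Big-endian byte order for the two-byte lengths and the function names are assumptions; the commit message states neither.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	PacketTypeSuccess byte = 0x00
	PacketTypeError   byte = 0x01
)

// encodeErrorPacket builds a length-prefixed error packet.
// ASSUMPTION: 2-byte lengths are big-endian.
func encodeErrorPacket(code, msg, details string) ([]byte, error) {
	if len(code) > 0xFF || len(msg) > 0xFFFF || len(details) > 0xFFFF {
		return nil, errors.New("field too long for its length prefix")
	}
	buf := make([]byte, 0, 6+len(code)+len(msg)+len(details))
	buf = append(buf, PacketTypeError, byte(len(code)))
	buf = append(buf, code...)
	buf = binary.BigEndian.AppendUint16(buf, uint16(len(msg)))
	buf = append(buf, msg...)
	buf = binary.BigEndian.AppendUint16(buf, uint16(len(details)))
	buf = append(buf, details...)
	return buf, nil
}

// decodeErrorPacket reverses encodeErrorPacket and rejects truncated input.
func decodeErrorPacket(p []byte) (code, msg, details string, err error) {
	short := errors.New("truncated error packet")
	if len(p) < 2 || p[0] != PacketTypeError {
		return "", "", "", errors.New("not an error packet")
	}
	n := int(p[1])
	p = p[2:]
	if len(p) < n {
		return "", "", "", short
	}
	code, p = string(p[:n]), p[n:]
	// readField consumes one 2-byte length prefix plus its payload.
	readField := func() (string, bool) {
		if len(p) < 2 {
			return "", false
		}
		n := int(binary.BigEndian.Uint16(p))
		if len(p) < 2+n {
			return "", false
		}
		s := string(p[2 : 2+n])
		p = p[2+n:]
		return s, true
	}
	var ok bool
	if msg, ok = readField(); !ok {
		return "", "", "", short
	}
	if details, ok = readField(); !ok {
		return "", "", "", short
	}
	return code, msg, details, nil
}

func main() {
	pkt, err := encodeErrorPacket("NOT_IMPLEMENTED", "feature not available", "")
	if err != nil {
		panic(err)
	}
	code, msg, details, err := decodeErrorPacket(pkt)
	fmt.Println(len(pkt), code, msg, details, err)
}
```

The one-byte code length caps error codes at 255 bytes, while messages and details can each reach 65535 bytes.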
Jeremie Fraeys
8a30acf661
build: update Makefile test paths and clean up old test references
Update test paths to reflect new test locations:
- Change tests/unit/... to internal/... in test, test-unit, and verify-audit targets
- Update test-coverage target to use correct coverpkg paths
- Add coverage summary output to test-coverage target

Clean up deleted test files from old locations:
- Remove tests/unit/crypto/kms/cache_test.go (now in internal/auth/kms/)
- Remove tests/unit/crypto/kms/protocol_test.go (now in internal/auth/kms/)
- Remove tests/unit/resources/manager_test.go (now in internal/resources/)
2026-03-12 16:39:52 -04:00
Jeremie Fraeys
d0266c4a90
refactor: scheduler hub bug fix, test helpers, and orphan recovery tests
Fix bug in scheduler hub orphan reconciliation:
- Move delete(h.pendingAcceptance, taskID) inside the requeue success block
- Prevents premature cleanup when requeue fails

Add comprehensive test infrastructure:
- hub_test_helpers.go: New test helper utilities (78 lines)
  - Mock scheduler components for isolated testing
  - Test fixture setup and teardown helpers

Refactor and enhance hub capabilities tests:
- Significant restructuring of hub_capabilities_test.go (213 lines changed)
- Improved test coverage for worker capability matching

Add comprehensive orphan recovery tests:
- internal/scheduler/orphan_recovery_test.go (451 lines)
- Tests orphaned job detection and recovery
- Covers requeue logic, timeout handling, state cleanup
2026-03-12 16:38:33 -04:00
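The orphan reconciliation fix in commit d0266c4a90 comes down to where the delete call sits relative to the requeue result. The following is a simplified sketch: the hub type, its requeue field, and the reconcileOrphans name are hypothetical, and only the pendingAcceptance map and the "delete only after a successful requeue" rule come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// errQueueFull is a hypothetical requeue failure.
var errQueueFull = errors.New("queue full")

// hub is a stripped-down stand-in for the scheduler hub.
type hub struct {
	pendingAcceptance map[string]struct{}
	requeue           func(taskID string) error
}

// reconcileOrphans requeues each orphaned task and forgets it only after the
// requeue succeeded, so a failed task is still tracked on the next pass.
func (h *hub) reconcileOrphans(orphans []string) (failed []string) {
	for _, taskID := range orphans {
		if err := h.requeue(taskID); err != nil {
			// Before the fix, the delete ran regardless of this error
			// and the task was lost from pendingAcceptance.
			failed = append(failed, taskID)
			continue
		}
		delete(h.pendingAcceptance, taskID)
	}
	return failed
}

func main() {
	h := &hub{
		pendingAcceptance: map[string]struct{}{"a": {}, "b": {}},
		requeue: func(taskID string) error {
			if taskID == "b" {
				return errQueueFull
			}
			return nil
		},
	}
	failed := h.reconcileOrphans([]string{"a", "b"})
	fmt.Println(failed, len(h.pendingAcceptance))
}
```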
Jeremie Fraeys
939faeb8e4
refactor: relocate store package from cmd/tui/internal to internal
Move store package to improve reusability and follow Go project conventions:
- cmd/tui/internal/store/store.go -> internal/store/store.go
- cmd/tui/internal/store/store_test.go -> internal/store/store_test.go

This makes the store package available to other components beyond the TUI,
reducing coupling and enabling future reuse by API server, CLI, or other tools.
2026-03-12 16:38:01 -04:00
Jeremie Fraeys
7ff2c6c487
refactor: reorganize integration tests
Move integration-appropriate tests from tests/unit/ to tests/integration/:
- tests/unit/simple_test.go -> tests/integration/simple_test.go
- tests/unit/deployments/traefik_compose_test.go -> tests/integration/traefik_compose_test.go
- tests/unit/worker_trust_test.go -> tests/integration/worker_trust_test.go

Update test package declarations and imports to reflect new locations.

These tests were misplaced in the unit tests directory but actually test
integration between components or external systems (Traefik, worker trust).
2026-03-12 16:37:42 -04:00
Jeremie Fraeys
61660dc925
refactor: co-locate security, storage, telemetry, tracking, worker tests
Move unit tests from tests/unit/ to internal/ following Go conventions:

Security tests:
- tests/unit/security/* -> internal/security/* (audit, config_integrity, filetype, gpu_audit, hipaa_validation, manifest_filename, path_traversal, resource_quota, secrets)

Storage tests:
- tests/unit/storage/* -> internal/storage/* (db, experiment_metadata)

Telemetry tests:
- tests/unit/telemetry/* -> internal/telemetry/* (telemetry)

Tracking tests:
- tests/unit/reproducibility/* -> internal/tracking/* (config_hash, environment_capture)

Worker tests:
- tests/unit/worker/* -> internal/worker/* (artifacts, config, hash_bench, plugins/jupyter_task, plugins/vllm, prewarm_v1, run_manifest_execution, snapshot_stage, snapshot_store, worker)

Update import paths in test files to reflect new locations.
2026-03-12 16:37:03 -04:00
Jeremie Fraeys
74e06017b5
refactor: co-locate scheduler non-hub tests with source code
Move unit tests from tests/unit/scheduler/ to internal/scheduler/ following Go conventions:
- capability_routing_test.go - Worker capability-based job routing tests
- failure_scenarios_test.go - Scheduler failure handling and recovery tests
- heartbeat_test.go - Worker heartbeat monitoring tests
- plugin_quota_test.go - Plugin resource quota enforcement tests
- port_allocator_test.go - Dynamic port allocation for services tests
- priority_queue_test.go - Job priority queue implementation tests
- service_templates_test.go - Service template management tests
- state_store_test.go - Scheduler state persistence tests

Note: orphan_recovery_test.go is excluded from this commit; it will be handled with the hub refactoring because of its significant test changes.
2026-03-12 16:36:29 -04:00
Jeremie Fraeys
ee0b90cfc5
refactor: co-locate queue and resources tests, add manager tests
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/queue/* -> internal/queue/* (dedup, filesystem_fallback, queue_permissions, queue_spec, queue, sqlite_queue tests)
- tests/unit/gpu/* -> internal/resources/* (gpu_detector, gpu_golden tests)
- tests/unit/resources/* -> internal/resources/* (manager_test.go)

Update import paths in test files to reflect new locations.

Note: GPU tests consolidated into resources package since GPU detection is part of resource management. Manager tests show significant new test coverage (166 lines).
2026-03-12 16:36:02 -04:00
Jeremie Fraeys
ca6ad970c3
refactor: co-locate logging, manifest, network, privacy, prommetrics tests
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/logging/* -> internal/logging/* (logging tests)
- tests/unit/manifest/* -> internal/manifest/* (run_manifest, schema tests)
- tests/unit/network/* -> internal/network/* (retry, ssh_pool, ssh tests)
- tests/unit/privacy/* -> internal/privacy/* (pii tests)
- tests/unit/metrics/* -> internal/prommetrics/* (metrics tests)

Update import paths in test files to reflect new locations.

Note: metrics_test.go moved from tests/unit/metrics/ to internal/prommetrics/ to match the actual package name.
2026-03-12 16:35:37 -04:00
Jeremie Fraeys
cf84246115
refactor: co-locate config, container, envpool, errors, experiment, jupyter tests
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/config/* -> internal/config/* (constants, mode_paths, paths, validation)
- tests/unit/container/* -> internal/container/* (podman, security tests)
- tests/unit/envpool/* -> internal/envpool/* (envpool tests)
- tests/unit/errors/* -> internal/errtypes/* (errors_test.go moved to errtypes package)
- tests/unit/experiment/* -> internal/experiment/* (manager tests)
- tests/unit/jupyter/* -> internal/jupyter/* (config, package_blacklist, service_manager, trash_restore)

Update import paths in test files to reflect new locations.

Note: errors_test.go moved from tests/unit/errors/ to internal/errtypes/ to match the package structure.
2026-03-12 16:35:15 -04:00
Jeremie Fraeys
a4e2ecdbe6
refactor: co-locate api, audit, auth tests with source code
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/api/* -> internal/api/* (WebSocket handlers, helpers, duplicate detection)
- tests/unit/audit/* -> internal/audit/* (alert, sealed, verifier tests)
- tests/unit/auth/* -> internal/auth/* (API key, keychain, user manager)
- tests/unit/crypto/kms/* -> internal/auth/kms/* (cache, protocol tests)

Update import paths in test files to reflect new locations.

Benefits:
- Tests live alongside the code they test
- Easier navigation and maintenance
- Clearer package boundaries
- Follows standard Go project layout
2026-03-12 16:34:54 -04:00
Jeremie Fraeys
b00fa236db
docs: add Known Limitations section and testing structure updates
Add Known Limitations section to AGENTS.md documenting:
- AMD GPU not implemented (use NVIDIA, Apple Silicon, or CPU)
- 100+ node gang allocation stress testing not yet implemented
- Podman-in-Docker CI requires privileged mode, not yet automated
- Error handling patterns for unimplemented features
- Container usage rules (Docker for testing/deployments, Podman for experiments)
- Error codes table (NOT_IMPLEMENTED, NOT_FOUND, INVALID_CONFIGURATION)

Update testing documentation to reflect new test locations:
- Unit tests moved from tests/unit/ to internal/ (Go convention)
- Update all test file path references in security testing docs
2026-03-12 16:33:19 -04:00
Jeremie Fraeys
6646f3a382
ci(docker): add test workflow and container architecture docs
- Create docker-tests.yml for merge-to-main CI pipeline
- Add mock GPU test matrix (NVIDIA, Metal, CPU-only)
- Add AGENTS.md with container architecture rules:
  * Docker for CI/CD testing and deployments
  * Podman for ML experiment isolation only
- Update .gitignore to track AGENTS.md
2026-03-12 14:05:53 -04:00
Jeremie Fraeys
6af85ddaf6
feat(tests): enable stress and long-running test suites
Stress Tests:
- TestStress_WorkerConnectBurst: 30 workers, p99 latency validation
- TestStress_JobSubmissionBurst: 1K job submissions
- TestStress_WorkerChurn: 50 connect/disconnect cycles, memory leak detection
- TestStress_ConcurrentScheduling: 10 workers x 20 jobs contention

Long-Running Tests:
- TestLongRunning_MemoryLeak: heap growth monitoring
- TestLongRunning_OrphanRecovery: worker death/requeue stability
- TestLongRunning_WebSocketStability: 20 worker connection stability

Infrastructure:
- Add testreport package with JSON output, flaky test tracking
- Add TestTimer for timing/budget enforcement
- Add WaitForEvent, WaitForTaskStatus helpers
- Fix worker IDs to use valid bench-worker token patterns
2026-03-12 14:05:45 -04:00
Jeremie Fraeys
ca913e8878
feat(scheduler): add test mode config and TLS detection
- Add DisableTLSForTesting to HubConfig for test environments
- Add IsUsingTLS() method to detect scheduler TLS status
- Update MockWorker to auto-select ws:// vs wss:// protocol
- Set DisableTLSForTesting: true in DefaultHubConfig
2026-03-12 14:05:35 -04:00
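The ws:// versus wss:// selection described in commit ca913e8878 could be sketched as follows. Only the names DisableTLSForTesting, HubConfig, and IsUsingTLS come from the commit message; the Hub struct, the workerURL helper, and the rule that TLS is simply the negation of the test flag are assumptions.

```go
package main

import (
	"fmt"
	"net/url"
)

// HubConfig holds only the field relevant to this sketch.
type HubConfig struct {
	DisableTLSForTesting bool
}

// Hub is a hypothetical stand-in for the scheduler hub.
type Hub struct {
	cfg HubConfig
}

// IsUsingTLS reports whether clients should connect with TLS.
// ASSUMPTION: TLS is on unless the test flag disables it.
func (h *Hub) IsUsingTLS() bool {
	return !h.cfg.DisableTLSForTesting
}

// workerURL picks the WebSocket scheme a mock worker should dial.
func workerURL(h *Hub, hostPort, path string) string {
	u := url.URL{Scheme: "wss", Host: hostPort, Path: path}
	if !h.IsUsingTLS() {
		u.Scheme = "ws"
	}
	return u.String()
}

func main() {
	testHub := &Hub{cfg: HubConfig{DisableTLSForTesting: true}}
	prodHub := &Hub{cfg: HubConfig{}}
	fmt.Println(workerURL(testHub, "127.0.0.1:8080", "/ws/worker"))
	fmt.Println(workerURL(prodHub, "scheduler.example:8443", "/ws/worker"))
}
```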
Jeremie Fraeys
c5524562e9
test(scheduler): remove unused fields in service slot pool separation test
Remove ID and GPUCount fields from batchJob in TestServiceSlotPoolSeparation
that were assigned but never used. The test only validates SlotPool values.
2026-03-12 12:10:33 -04:00
Jeremie Fraeys
a49e8f593c
chore(tools): update fetchml-vet analyzers
Analyzer improvements:
- hipaacomplete.go: refined HIPAA compliance checks
- manifestenv.go: environment variable validation in manifests
- nobaredetector.go: detection of bare credential exposures
- noinlinecredentials.go: inline credential scanning improvements
2026-03-12 12:09:34 -04:00
Jeremie Fraeys
2bd7f97ae2
test(integration,unit): update test suites for new features and APIs
Integration test updates:
- jupyter_experiment_test.go: update for new workspace handling
- run_manifest_test.go: reproducibility manifest validation
- secrets_integration_test.go: KMS and secret provider tests
- storage_redis_integration_test.go: Redis-backed storage tests

Unit test updates:
- response_helpers_test.go: API response helper tests
- config_hash_test.go: configuration hashing for reproducibility
- filetype_test.go: security file type detection tests

Load testing:
- load_test.go: scheduler load and stress tests
2026-03-12 12:09:15 -04:00
Jeremie Fraeys
2b1ef10514
test(chaos): add worker disconnect chaos test and queue improvements
Chaos testing:
- Add worker_disconnect_chaos_test.go for network partition resilience
- Test scheduler hub recovery and job reassignment scenarios

Queue layer updates:
- event_store.go: add event sourcing for queue operations
- native_queue.go: extend native queue with batch operations and indexing
2026-03-12 12:08:21 -04:00
Jeremie Fraeys
93d6d63d8d
chore(deploy): update Docker compose files and add MinIO lifecycle policies
Docker Compose updates:
- docker-compose.dev.yml: add GPU support, local scheduler and worker
- docker-compose.staging.yml: production-like staging with SSL termination
- docker-compose.test.yml: ephemeral test environment with seeded data

MinIO lifecycle management:
- Add lifecycle-dev.json: 7-day retention for dev artifacts
- Add lifecycle-staging.json: 30-day retention with transition to cold

Build improvements:
- Makefile: add native library build targets and cross-platform support
- scripts/release/cleanup.sh: improved artifact cleanup with dry-run mode
2026-03-12 12:06:16 -04:00
Jeremie Fraeys
17170667e2
feat(worker): improve lifecycle management and vLLM plugin
Lifecycle improvements:
- runloop.go: refined state machine with better error recovery
- service_manager.go: service dependency management and health checks
- states.go: add states for capability advertisement and draining

Container execution:
- container.go: improved OCI runtime integration with supply chain checks
- Add image verification and signature validation
- Better resource limits enforcement for GPU/memory

vLLM plugin updates:
- vllm.go: support for vLLM 0.3+ with new engine arguments
- Add quantization-aware scheduling (AWQ, GPTQ, FP8)
- Improve model download and caching logic

Configuration:
- config.go: add capability advertisement configuration
- snapshot_store.go: improve snapshot management for checkpointing
2026-03-12 12:05:02 -04:00
Jeremie Fraeys
c18a8619fe
feat(api): add structured error package and refactor handlers
New error handling:
- Add internal/api/errors/errors.go with structured API error types
- Standardize error codes across all API endpoints
- Add user-facing error messages vs internal error details separation

Handler improvements:
- jupyter/handlers.go: better workspace lifecycle and error handling
- plugins/handlers.go: plugin management with validation
- groups/handlers.go: group CRUD with capability metadata
- jobs/handlers.go: job submission and monitoring improvements
- datasets/handlers.go: dataset upload/download with progress
- validate/handlers.go: manifest validation with detailed errors
- audit/handlers.go: audit log querying with filters

Server configuration:
- server_config.go: refined config loading with validation
- server_gen.go: improved code generation for OpenAPI specs
2026-03-12 12:04:46 -04:00
Jeremie Fraeys
37c4d4e9c7
feat(crypto,auth): harden KMS and improve permission handling
KMS improvements:
- cache.go: add LRU eviction with memory-bounded caches
- provider.go: refactor provider initialization and key rotation
- tenant_keys.go: per-tenant key isolation with envelope encryption

Auth layer updates:
- hybrid.go: refine hybrid auth flow for API key + JWT
- permissions_loader.go: faster permission caching with hot-reload
- validator.go: stricter validation with detailed error messages

Security middleware:
- security.go: add rate limiting headers and CORS refinement

Testing and benchmarks:
- Add KMS cache and protocol unit tests
- Add KMS benchmark tests for encryption throughput
- Update KMS integration tests for tenant isolation
2026-03-12 12:04:32 -04:00
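The "LRU eviction with memory-bounded caches" item for cache.go in commit 37c4d4e9c7 can be illustrated with a small byte-budgeted LRU. This is a generic sketch built on container/list, not the project's KMS cache: the type and method names are hypothetical, only value bytes are counted toward the budget, and the sketch is not safe for concurrent use.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is the payload stored in each list element.
type entry struct {
	key   string
	value []byte
}

// lruCache evicts least recently used entries once the total size of the
// stored values exceeds maxBytes.
type lruCache struct {
	maxBytes, usedBytes int
	order               *list.List // front = most recently used
	items               map[string]*list.Element
}

func newLRUCache(maxBytes int) *lruCache {
	return &lruCache{maxBytes: maxBytes, order: list.New(), items: make(map[string]*list.Element)}
}

// Get returns the value and marks the key as most recently used.
func (c *lruCache) Get(key string) ([]byte, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Put inserts or replaces a value, then evicts from the back of the list
// until the byte budget is respected. A value larger than the whole budget
// is therefore evicted immediately.
func (c *lruCache) Put(key string, value []byte) {
	if el, ok := c.items[key]; ok {
		e := el.Value.(*entry)
		c.usedBytes += len(value) - len(e.value)
		e.value = value
		c.order.MoveToFront(el)
	} else {
		c.items[key] = c.order.PushFront(&entry{key: key, value: value})
		c.usedBytes += len(value)
	}
	for c.usedBytes > c.maxBytes && c.order.Len() > 0 {
		e := c.order.Remove(c.order.Back()).(*entry)
		delete(c.items, e.key)
		c.usedBytes -= len(e.value)
	}
}

func main() {
	c := newLRUCache(8)
	c.Put("a", []byte("aaaa"))
	c.Put("b", []byte("bbbb"))
	c.Get("a")                 // "a" becomes most recently used
	c.Put("c", []byte("cccc")) // over budget: evicts "b"
	_, hasB := c.Get("b")
	fmt.Println(hasB, c.usedBytes)
}
```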
Jeremie Fraeys
de83300962
feat(worker): refactor GPU detection with macOS Metal support
GPU detection refactor:
- Major rewrite of gpu_detector.go with unified detection interface
- Support for NVIDIA (NVML), AMD (ROCm), and Apple Metal
- Runtime GPU capability querying for scheduler matching

macOS improvements:
- gpu_macos.go: native Metal device enumeration and memory queries
- Support for Apple Silicon (M1/M2/M3) unified memory reporting
- Fallback to system profiler for Intel Macs

Testing infrastructure:
- Add gpu_detector_mock.go for testing without hardware
- Update gpu_golden_test.go with platform-specific expectations
- Cross-platform GPU info validation
2026-03-12 12:02:41 -04:00
Jeremie Fraeys
188cf55939
refactor(api): overhaul WebSocket handler and protocol layer
Major WebSocket handler refactor:
- Rewrite ws/handler.go with structured message routing and backpressure
- Add connection lifecycle management with heartbeats and timeouts
- Implement graceful connection draining for zero-downtime restarts

Protocol improvements:
- Define structured protocol types in protocol.go for hub communication
- Add versioned message envelopes for backward compatibility
- Standardize error codes and response formats across WebSocket API

Job streaming via WebSocket:
- Simplify ws/jobs.go with async job status streaming
- Add compression for high-volume job updates

Testing:
- Update websocket_e2e_test.go for new protocol semantics
- Add connection resilience tests
2026-03-12 12:01:21 -04:00
Jeremie Fraeys
ad3be36a6d
feat(cli): add workers command, scheduler client, and PII utilities
New commands and modules:
- Add workers.zig command for worker management and status
- Add scheduler_client.zig for scheduler hub communication
- Add pii.zig utility for PII detection and redaction in logs/outputs

Improvements to existing commands:
- groups.zig: enhanced group management with capability metadata
- jupyter/mod.zig: improved Jupyter workspace lifecycle handling
- tasks.zig: better task status reporting and cancellation support

Networking and sync improvements:
- ws/client.zig: WebSocket client enhancements for hub protocol
- sync_manager.zig: improved sync with scheduler state and conflict resolution
- uuid.zig: optimized UUID generation for macOS and Linux

Database utilities:
- sqlite_embedded.zig: embedded SQLite for CLI-local state caching
2026-03-12 12:00:49 -04:00
Jeremie Fraeys
57787e1e7b
feat(scheduler): implement capability-based routing and hub v2
Add comprehensive capability routing system to scheduler hub:
- Capability-aware worker matching with requirement/offer negotiation
- Hub v2 protocol with structured message types and heartbeat management
- Worker capability advertisement and dynamic routing decisions
- Orphan recovery for disconnected workers with state reconciliation
- Template-based job scheduling with capability constraints

Add extensive test coverage:
- Unit tests for capability routing logic and heartbeat mechanics
- Unit tests for orphan recovery scenarios
- E2E tests for capability routing across multiple workers
- Hub capabilities integration tests
- Scheduler fixture helpers for test setup

Protocol improvements:
- Define structured protocol messages for hub-worker communication
- Add capability matching algorithm with scoring
- Implement graceful worker disconnection handling
2026-03-12 12:00:05 -04:00
Jeremie Fraeys
13ffb81cab
fix: add CGO build tags to consistency tests, remove unused isHex function
2026-03-08 13:10:00 -04:00
Jeremie Fraeys
7eee31d721
chore: cleanup and miscellaneous updates
- .gitignore: Add reports/ and .api-keys
- examples/jupyter_experiment_integration.py: Update for new API
- podman/scripts/: CLI integration, secure runner, ML tool testing
- tools/: Performance regression detector, profiler utilities
2026-03-08 13:04:01 -04:00
Jeremie Fraeys
c74e91dd69
test: update test suite and remove deprecated privacy middleware
Test improvements:
- fixtures/: Updated mocks, fixtures with group context, SSH server, TUI driver
- integration/: WebSocket queue and handler tests with groups
- e2e/: WebSocket and TLS proxy end-to-end tests
- unit/api/ws_test.go: WebSocket API tests
- unit/scheduler/service_templates_test.go: Service template tests
- benchmarks/scheduler_bench_test.go: Performance benchmarks

Cleanup:
- Remove privacy middleware (replaced by audit system)
- Remove privacy_test.go
2026-03-08 13:03:55 -04:00
Jeremie Fraeys
cb142213fa
chore(build): update build system, Dockerfiles, and dependencies
Build and deployment improvements:

Makefile:
- Native library build targets with ASan support
- Cross-platform compilation helpers
- Performance benchmark targets
- Security scan integration

Docker:
- secure-prod.Dockerfile: Hardened production image (non-root, minimal surface)
- simple.Dockerfile: Lightweight development image

Scripts:
- build/: Go and native library build scripts, cross-platform builds
- ci/: checks.sh, test.sh, verify-paths.sh for validation
- benchmarks/: Local performance testing and regression tracking
- dev/: Monitoring setup

Dependencies: Update to latest stable with security patches

Commands:
- api-server/main.go: Server initialization updates
- data_manager/data_sync.go: Data sync with visibility
- errors/main.go: Error handling improvements
- tui/: TUI improvements for group management
2026-03-08 13:03:48 -04:00
Jeremie Fraeys
4b2782f674
feat(domain): add task visibility and supporting infrastructure
Core domain and utility updates:

- domain/task.go: Task model with visibility system
  * Visibility enum: private, lab, institution, open
  * Group associations for lab-scoped access
  * CreatedBy tracking for ownership
  * Sharing metadata with expiry

- config/paths.go: Group-scoped data directories and audit log paths
- crypto/signing.go: Key management for audit sealing, token signature verification
- container/supply_chain.go: Image provenance tracking, vulnerability scanning
- fileutil/filetype.go: MIME type detection and security validation
- fileutil/secure.go: Protected file permissions, secure deletion
- jupyter/: Package and service manager updates
- experiment/manager.go: Visibility cascade from experiments to tasks
- network/ssh.go: SSH tunneling improvements
- queue/: Filesystem queue enhancements
2026-03-08 13:03:27 -04:00
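The visibility levels listed for domain/task.go in commit 4b2782f674 (private, lab, institution, open) could be modelled as a small Go enum. The integer representation, the constant names, the case-insensitive parser, and the choice to fall back to private on unknown input are all assumptions; the commit message only names the four levels.

```go
package main

import (
	"fmt"
	"strings"
)

// Visibility orders the levels from most to least restrictive.
type Visibility int

const (
	VisibilityPrivate Visibility = iota
	VisibilityLab
	VisibilityInstitution
	VisibilityOpen
)

var visibilityNames = [...]string{"private", "lab", "institution", "open"}

// String returns the lowercase name, or "unknown" for out-of-range values.
func (v Visibility) String() string {
	if v < 0 || int(v) >= len(visibilityNames) {
		return "unknown"
	}
	return visibilityNames[v]
}

// ParseVisibility accepts the four names case-insensitively. On unknown
// input it returns the most restrictive level together with an error.
func ParseVisibility(s string) (Visibility, error) {
	for i, name := range visibilityNames {
		if strings.EqualFold(s, name) {
			return Visibility(i), nil
		}
	}
	return VisibilityPrivate, fmt.Errorf("unknown visibility %q", s)
}

func main() {
	for _, s := range []string{"lab", "Open", "public"} {
		v, err := ParseVisibility(s)
		fmt.Println(v, err)
	}
}
```

Returning the most restrictive level alongside the error means a caller that ignores the error never widens access by accident.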