Commit graph

88 commits

Author SHA1 Message Date
Jeremie Fraeys
ca6ad970c3
refactor: co-locate logging, manifest, network, privacy, prommetrics tests
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/logging/* -> internal/logging/* (logging tests)
- tests/unit/manifest/* -> internal/manifest/* (run_manifest, schema tests)
- tests/unit/network/* -> internal/network/* (retry, ssh_pool, ssh tests)
- tests/unit/privacy/* -> internal/privacy/* (pii tests)
- tests/unit/metrics/* -> internal/prommetrics/* (metrics tests)

Update import paths in test files to reflect new locations.

Note: metrics_test.go moved from tests/unit/metrics/ to internal/prommetrics/ to match the actual package name.
2026-03-12 16:35:37 -04:00
Jeremie Fraeys
cf84246115
refactor: co-locate config, container, envpool, errors, experiment, jupyter tests
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/config/* -> internal/config/* (constants, mode_paths, paths, validation)
- tests/unit/container/* -> internal/container/* (podman, security tests)
- tests/unit/envpool/* -> internal/envpool/* (envpool tests)
- tests/unit/errors/* -> internal/errtypes/* (errors_test.go moved to errtypes package)
- tests/unit/experiment/* -> internal/experiment/* (manager tests)
- tests/unit/jupyter/* -> internal/jupyter/* (config, package_blacklist, service_manager, trash_restore)

Update import paths in test files to reflect new locations.

Note: errors_test.go moved from tests/unit/errors/ to internal/errtypes/ to match the package structure.
2026-03-12 16:35:15 -04:00
Jeremie Fraeys
a4e2ecdbe6
refactor: co-locate api, audit, auth tests with source code
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/api/* -> internal/api/* (WebSocket handlers, helpers, duplicate detection)
- tests/unit/audit/* -> internal/audit/* (alert, sealed, verifier tests)
- tests/unit/auth/* -> internal/auth/* (API key, keychain, user manager)
- tests/unit/crypto/kms/* -> internal/auth/kms/* (cache, protocol tests)

Update import paths in test files to reflect new locations.

Benefits:
- Tests live alongside the code they test
- Easier navigation and maintenance
- Clearer package boundaries
- Follows standard Go project layout
2026-03-12 16:34:54 -04:00
Jeremie Fraeys
6af85ddaf6
feat(tests): enable stress and long-running test suites
Stress Tests:
- TestStress_WorkerConnectBurst: 30 workers, p99 latency validation
- TestStress_JobSubmissionBurst: 1K job submissions
- TestStress_WorkerChurn: 50 connect/disconnect cycles, memory leak detection
- TestStress_ConcurrentScheduling: 10 workers x 20 jobs contention

Long-Running Tests:
- TestLongRunning_MemoryLeak: heap growth monitoring
- TestLongRunning_OrphanRecovery: worker death/requeue stability
- TestLongRunning_WebSocketStability: 20 worker connection stability

Infrastructure:
- Add testreport package with JSON output, flaky test tracking
- Add TestTimer for timing/budget enforcement
- Add WaitForEvent, WaitForTaskStatus helpers
- Fix worker IDs to use valid bench-worker token patterns
2026-03-12 14:05:45 -04:00
Jeremie Fraeys
ca913e8878
feat(scheduler): add test mode config and TLS detection
- Add DisableTLSForTesting to HubConfig for test environments
- Add IsUsingTLS() method to detect scheduler TLS status
- Update MockWorker to auto-select ws:// vs wss:// protocol
- Set DisableTLSForTesting: true in DefaultHubConfig
2026-03-12 14:05:35 -04:00
Jeremie Fraeys
c5524562e9
test(scheduler): remove unused fields in service slot pool separation test
Remove ID and GPUCount fields from batchJob in TestServiceSlotPoolSeparation
that were assigned but never used. The test only validates SlotPool values.
2026-03-12 12:10:33 -04:00
Jeremie Fraeys
2bd7f97ae2
test(integration,unit): update test suites for new features and APIs
Integration test updates:
- jupyter_experiment_test.go: update for new workspace handling
- run_manifest_test.go: reproducibility manifest validation
- secrets_integration_test.go: KMS and secret provider tests
- storage_redis_integration_test.go: Redis-backed storage tests

Unit test updates:
- response_helpers_test.go: API response helper tests
- config_hash_test.go: configuration hashing for reproducibility
- filetype_test.go: security file type detection tests

Load testing:
- load_test.go: scheduler load and stress tests
2026-03-12 12:09:15 -04:00
Jeremie Fraeys
2b1ef10514
test(chaos): add worker disconnect chaos test and queue improvements
Chaos testing:
- Add worker_disconnect_chaos_test.go for network partition resilience
- Test scheduler hub recovery and job reassignment scenarios

Queue layer updates:
- event_store.go: add event sourcing for queue operations
- native_queue.go: extend native queue with batch operations and indexing
2026-03-12 12:08:21 -04:00
Jeremie Fraeys
17170667e2
feat(worker): improve lifecycle management and vLLM plugin
Lifecycle improvements:
- runloop.go: refined state machine with better error recovery
- service_manager.go: service dependency management and health checks
- states.go: add states for capability advertisement and draining

Container execution:
- container.go: improved OCI runtime integration with supply chain checks
- Add image verification and signature validation
- Better resource limits enforcement for GPU/memory

vLLM plugin updates:
- vllm.go: support for vLLM 0.3+ with new engine arguments
- Add quantization-aware scheduling (AWQ, GPTQ, FP8)
- Improve model download and caching logic

Configuration:
- config.go: add capability advertisement configuration
- snapshot_store.go: improve snapshot management for checkpointing
2026-03-12 12:05:02 -04:00
Jeremie Fraeys
37c4d4e9c7
feat(crypto,auth): harden KMS and improve permission handling
KMS improvements:
- cache.go: add LRU eviction with memory-bounded caches
- provider.go: refactor provider initialization and key rotation
- tenant_keys.go: per-tenant key isolation with envelope encryption

Auth layer updates:
- hybrid.go: refine hybrid auth flow for API key + JWT
- permissions_loader.go: faster permission caching with hot-reload
- validator.go: stricter validation with detailed error messages

Security middleware:
- security.go: add rate limiting headers and CORS refinement

Testing and benchmarks:
- Add KMS cache and protocol unit tests
- Add KMS benchmark tests for encryption throughput
- Update KMS integration tests for tenant isolation
2026-03-12 12:04:32 -04:00
Jeremie Fraeys
de83300962
feat(worker): refactor GPU detection with macOS Metal support
GPU detection refactor:
- Major rewrite of gpu_detector.go with unified detection interface
- Support for NVIDIA (NVML), AMD (ROCm), and Apple Metal
- Runtime GPU capability querying for scheduler matching

macOS improvements:
- gpu_macos.go: native Metal device enumeration and memory queries
- Support for Apple Silicon (M1/M2/M3) unified memory reporting
- Fallback to system profiler for Intel Macs

Testing infrastructure:
- Add gpu_detector_mock.go for testing without hardware
- Update gpu_golden_test.go with platform-specific expectations
- Cross-platform GPU info validation
2026-03-12 12:02:41 -04:00
Jeremie Fraeys
188cf55939
refactor(api): overhaul WebSocket handler and protocol layer
Major WebSocket handler refactor:
- Rewrite ws/handler.go with structured message routing and backpressure
- Add connection lifecycle management with heartbeats and timeouts
- Implement graceful connection draining for zero-downtime restarts

Protocol improvements:
- Define structured protocol types in protocol.go for hub communication
- Add versioned message envelopes for backward compatibility
- Standardize error codes and response formats across WebSocket API

Job streaming via WebSocket:
- Simplify ws/jobs.go with async job status streaming
- Add compression for high-volume job updates

Testing:
- Update websocket_e2e_test.go for new protocol semantics
- Add connection resilience tests
2026-03-12 12:01:21 -04:00
Jeremie Fraeys
57787e1e7b
feat(scheduler): implement capability-based routing and hub v2
Add comprehensive capability routing system to scheduler hub:
- Capability-aware worker matching with requirement/offer negotiation
- Hub v2 protocol with structured message types and heartbeat management
- Worker capability advertisement and dynamic routing decisions
- Orphan recovery for disconnected workers with state reconciliation
- Template-based job scheduling with capability constraints

Add extensive test coverage:
- Unit tests for capability routing logic and heartbeat mechanics
- Unit tests for orphan recovery scenarios
- E2E tests for capability routing across multiple workers
- Hub capabilities integration tests
- Scheduler fixture helpers for test setup

Protocol improvements:
- Define structured protocol messages for hub-worker communication
- Add capability matching algorithm with scoring
- Implement graceful worker disconnection handling
2026-03-12 12:00:05 -04:00
Jeremie Fraeys
13ffb81cab
fix: add CGO build tags to consistency tests, remove unused isHex function 2026-03-08 13:10:00 -04:00
Jeremie Fraeys
c74e91dd69
test: update test suite and remove deprecated privacy middleware
Test improvements:
- fixtures/: Updated mocks, fixtures with group context, SSH server, TUI driver
- integration/: WebSocket queue and handler tests with groups
- e2e/: WebSocket and TLS proxy end-to-end tests
- unit/api/ws_test.go: WebSocket API tests
- unit/scheduler/service_templates_test.go: Service template tests
- benchmarks/scheduler_bench_test.go: Performance benchmarks

Cleanup:
- Remove privacy middleware (replaced by audit system)
- Remove privacy_test.go
2026-03-08 13:03:55 -04:00
Jeremie Fraeys
a239f3a14f
test(consistency): add dataset hash consistency test suite
Add cross-implementation consistency tests for dataset hash functionality:

## Test Fixtures
- Single file, nested directories, and multiple file test cases
- Expected hashes in JSON format for validation

## Test Infrastructure
- harness.go: Common test utilities and reference implementation runner
- dataset_hash_test.go: Consistency test cases comparing implementations
- cmd/update.go: Tool to regenerate expected hashes from reference

## Purpose
Ensures hash implementations (Go, C++, Zig) produce identical results
across all supported platforms and implementations.
2026-03-05 14:41:14 -05:00
Jeremie Fraeys
ba9a358412
fix(scheduler): resolve TestEndToEndJobLifecycle race and getTask bug
## Problem
TestEndToEndJobLifecycle was failing with two issues:
1. Race condition: Workers signaled ready before job was processed, receiving
   MsgNoWork instead of MsgJobAssign
2. getTask() didn't check pendingAcceptance - assigned-but-not-yet-accepted
   tasks returned nil

## Changes

### Test Fix (restart_recovery_test.go)
- Replace single-shot select with retry loop that re-signals workers as ready
- Handle both assignment and non-assignment messages correctly
- Add 10ms delay between non-assignment messages to allow job processing
- Use 2-second deadline with 100ms timeout intervals

### Scheduler Fix (hub.go)
- Extend getTask() to check pendingAcceptance map after batch/service queues
- Allows GetTask() to find tasks in 'assigned' state before acceptance
- Maintains backward compatibility with existing queue/running lookups

## Testing
make test now passes: 475 passed, 0 failed, 34 skipped
2026-03-05 14:40:43 -05:00
Jeremie Fraeys
69951ce5a1
test: update test infrastructure and documentation
Update tests and documentation:
- native/README.md: document C++ native library plans
- restart_recovery_test.go: scheduler restart/recovery tests
- scheduler_fixture.go: test fixtures for scheduler
- hash_test.zig: SHA-NI hash tests (WIP)

Improves test coverage and documentation.
2026-03-04 20:25:48 -05:00
Jeremie Fraeys
7cd86fb88a
feat: add new API handlers, build scripts, and ADRs
Some checks failed
Build Pipeline / Sign HIPAA Config (push) Has been skipped
Build Pipeline / Generate SLSA Provenance (push) Has been skipped
Checkout test / test (push) Successful in 6s
CI Pipeline / Test (ubuntu-latest on self-hosted) (push) Failing after 1s
CI Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI Pipeline / Security Scan (push) Has been skipped
CI Pipeline / Test Scripts (push) Has been skipped
CI Pipeline / Test Native Libraries (push) Has been skipped
CI Pipeline / Native Library Build Matrix (push) Has been skipped
Contract Tests / Spec Drift Detection (push) Failing after 11s
Contract Tests / API Contract Tests (push) Has been skipped
Deploy API Docs / Build API Documentation (push) Failing after 5s
Deploy API Docs / Deploy to GitHub Pages (push) Has been skipped
Documentation / build-and-publish (push) Failing after 40s
Test Matrix / test-native-vs-pure (cgo) (push) Failing after 14s
Test Matrix / test-native-vs-pure (native) (push) Failing after 35s
Test Matrix / test-native-vs-pure (pure) (push) Failing after 18s
CI Pipeline / Trigger Build Workflow (push) Failing after 1s
Build CLI with Embedded SQLite / build (arm64, aarch64-linux) (push) Has been cancelled
Build CLI with Embedded SQLite / build (x86_64, x86_64-linux) (push) Has been cancelled
Build CLI with Embedded SQLite / build-macos (arm64) (push) Has been cancelled
Build CLI with Embedded SQLite / build-macos (x86_64) (push) Has been cancelled
Security Scan / Security Analysis (push) Has been cancelled
Security Scan / Native Library Security (push) Has been cancelled
Verification & Maintenance / V.1 - Schema Drift Detection (push) Has been cancelled
Verification & Maintenance / V.4 - Custom Go Vet Analyzers (push) Has been cancelled
Verification & Maintenance / V.7 - Audit Chain Integrity (push) Has been cancelled
Verification & Maintenance / V.6 - Extended Security Scanning (push) Has been cancelled
Verification & Maintenance / V.10 - OpenSSF Scorecard (push) Has been cancelled
Verification & Maintenance / Verification Summary (push) Has been cancelled
- Introduce audit, plugin, and scheduler API handlers
- Add spec_embed.go for OpenAPI spec embedding
- Create modular build scripts (cli, go, native, cross-platform)
- Add deployment cleanup and health-check utilities
- New ADRs: hot reload, audit store, SSE updates, RBAC, caching, offline mode, KMS regions, tenant offboarding
- Add KMS configuration schema and worker variants
- Include KMS benchmark tests
2026-03-04 13:24:27 -05:00
Jeremie Fraeys
5f53104fcd
test: modernize test suite for streamlined infrastructure
- Update E2E tests for consolidated docker-compose.test.yml
- Remove references to obsolete logs-debug.yml
- Enhance test fixtures and utilities
- Improve integration test coverage for KMS, queue, scheduler
- Update unit tests for config constants and worker execution
- Modernize cleanup-status.sh with new Makefile targets
2026-03-04 13:24:24 -05:00
Jeremie Fraeys
7948639b1e
docs: update documentation for streamlined Makefile
- Replace 'make test-full' with 'make test' throughout docs
- Replace 'make self-cleanup' with 'make clean'
- Replace 'make tech-excellence' with 'make complete-suite'
- Replace 'make deploy-up' with 'make dev-up'
- Update docker-compose commands to docker compose v2
- Update CI workflow to use new Makefile targets
2026-03-04 13:22:29 -05:00
Jeremie Fraeys
e08feae6ab
chore: migrate scripts from docker-compose v1 to v2
- Update all scripts to use 'docker compose' instead of 'docker-compose'
- Fix compose file paths after consolidation (test.yml, prod.yml)
- Update cleanup.sh to handle --profile debug and --profile smoke
- Update test fixtures to reference consolidated compose files
2026-03-04 13:22:26 -05:00
Jeremie Fraeys
16343e6c2a
test(kms): add comprehensive unit and integration tests
Unit tests for DEK cache:
- Put/Get operations, TTL expiry, LRU eviction
- Tenant isolation, flush/clear, stats, empty DEK rejection

Unit tests for KMS protocol:
- Encrypt/decrypt round-trip with MemoryProvider
- Multi-tenant isolation (wrong key fails MAC verification)
- Cache hit verification, key rotation flow
- Health check protocol

Integration tests with testcontainers:
- VaultProvider with hashicorp/vault:1.15 container
- AWSProvider with localstack/localstack container
- TenantKeyManager end-to-end with MemoryProvider
2026-03-03 19:14:31 -05:00
Jeremie Fraeys
da104367d6
feat: add Plugin GPU Quota implementation and tests
Some checks failed
Build Pipeline / Build Binaries (push) Failing after 1m59s
Build Pipeline / Build Docker Images (push) Has been skipped
Build Pipeline / Sign HIPAA Config (push) Has been skipped
Build Pipeline / Generate SLSA Provenance (push) Has been skipped
Checkout test / test (push) Successful in 5s
CI Pipeline / Test (ubuntu-latest on self-hosted) (push) Failing after 1s
CI Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI Pipeline / Security Scan (push) Has been skipped
CI Pipeline / Test Scripts (push) Has been skipped
CI Pipeline / Test Native Libraries (push) Has been skipped
CI Pipeline / Native Library Build Matrix (push) Has been skipped
Documentation / build-and-publish (push) Failing after 35s
CI Pipeline / Trigger Build Workflow (push) Failing after 0s
Security Scan / Security Analysis (push) Has been cancelled
Security Scan / Native Library Security (push) Has been cancelled
Verification & Maintenance / V.1 - Schema Drift Detection (push) Has been cancelled
Verification & Maintenance / V.4 - Custom Go Vet Analyzers (push) Has been cancelled
Verification & Maintenance / V.7 - Audit Chain Integrity (push) Has been cancelled
Verification & Maintenance / V.6 - Extended Security Scanning (push) Has been cancelled
Verification & Maintenance / V.10 - OpenSSF Scorecard (push) Has been cancelled
Verification & Maintenance / Verification Summary (push) Has been cancelled
- Add plugin_quota.go with GPU quota management for scheduler

- Update scheduler hub and protocol for plugin support

- Add comprehensive plugin quota unit tests

- Update gang service and WebSocket queue integration tests
2026-02-26 14:35:05 -05:00
Jeremie Fraeys
d87c556afa
test(all): update test suite for scheduler and security features
Update comprehensive test coverage:
- E2E tests with scheduler integration
- Integration tests with tenant isolation
- Unit tests with security assertions
- Security tests with audit validation
- Audit verification tests
- Auth tests with tenant scoping
- Config validation tests
- Container security tests
- Worker tests with scheduler mock
- Environment pool tests
- Load tests with distributed patterns
- Test fixtures with scheduler support
- Update go.mod/go.sum with new dependencies
2026-02-26 12:08:46 -05:00
Jeremie Fraeys
95adcba437
feat(worker): add Jupyter/vLLM plugins and process isolation
Extend worker capabilities with new execution plugins and security features:
- Jupyter plugin for notebook-based ML experiments
- vLLM plugin for LLM inference workloads
- Cross-platform process isolation (Unix/Windows)
- Network policy enforcement with platform-specific implementations
- Service manager integration for lifecycle management
- Scheduler backend integration for queue coordination

Update lifecycle management:
- Enhanced runloop with state transitions
- Service manager integration for plugin coordination
- Improved state persistence and recovery

Add test coverage:
- Unit tests for Jupyter and vLLM plugins
- Updated worker execution tests
2026-02-26 12:03:59 -05:00
Jeremie Fraeys
a981e89005
feat(security): add audit subsystem and tenant isolation
Implement comprehensive audit and security infrastructure:
- Immutable audit logs with platform-specific backends (Linux/Other)
- Sealed log entries with tamper-evident checksums
- Audit alert system for real-time security notifications
- Log rotation with retention policies
- Checkpoint-based audit verification

Add multi-tenant security features:
- Tenant manager with quota enforcement
- Middleware for tenant authentication/authorization
- Per-tenant cryptographic key isolation
- Supply chain security for container verification
- Cross-platform secure file utilities (Unix/Windows)

Add test coverage:
- Unit tests for audit alerts and sealed logs
- Platform-specific audit backend tests
2026-02-26 12:03:45 -05:00
Jeremie Fraeys
43e6446587
feat(scheduler): implement multi-tenant job scheduler with gang scheduling
Add new scheduler component for distributed ML workload orchestration:
- Hub-based coordination for multi-worker clusters
- Pacing controller for rate limiting job submissions
- Priority queue with preemption support
- Port allocator for dynamic service discovery
- Protocol handlers for worker-scheduler communication
- Service manager with OS-specific implementations
- Connection management and state persistence
- Template system for service deployment

Includes comprehensive test suite:
- Unit tests for all core components
- Integration tests for distributed scenarios
- Benchmark tests for performance validation
- Mock fixtures for isolated testing

Refs: scheduler-architecture.md
2026-02-26 12:03:23 -05:00
Jeremie Fraeys
6fc2e373c1
fix: resolve IDE warnings and test errors
Bug fixes and cleanup for test infrastructure:

- schema_test.go: Fix SchemaVersion reference with proper manifest import
- schema_test.go: Update all schema.json paths to internal/manifest location
- manifestenv.go: Remove unused helper functions (isArtifactsType, getPackagePath)
- nobaredetector.go: Fix exprToString syntax error, remove unused functions

All tests now pass without errors or warnings
2026-02-23 20:26:20 -05:00
Jeremie Fraeys
e0aae73cf4
test(phase-7-9): audit verification, fault injection, integration tests
Implement V.7, V.9, and integration test requirements:

Audit Verification (V.7):
- TestAuditVerificationJob: Chain verification and tamper detection

Fault Injection (V.9):
- TestNVMLUnavailableProvenanceFail, TestManifestWritePartialFailure
- TestRedisUnavailableQueueBehavior, TestAuditLogUnavailableHaltsJob
- TestConfigHashFailureProvenanceClosed, TestDiskFullDuringArtifactScan

Integration Tests:
- TestCrossTenantIsolation: Filesystem isolation verification
- TestRunManifestReproducibility: Cross-run reproducibility
- TestAuditLogPHIRedaction: PHI leak prevention
2026-02-23 20:26:01 -05:00
Jeremie Fraeys
80370e9f4a
test(phase-6): property-based tests with gopter
Implement property-based invariant verification:

- TestPropertyConfigHashAlwaysPresent: Valid configs produce non-empty hash
- TestPropertyConfigHashDeterministic: Same config produces same hash
- TestPropertyDetectionSourceAlwaysValid: CreateDetectorWithInfo returns valid source
- TestPropertyProvenanceFailClosed: Strict mode fails on incomplete env
- TestPropertyScanArtifactsNeverNilEnvironment: Artifacts can hold Environment
- TestPropertyManifestEnvironmentSurvivesRoundtrip: Environment survives write/load

Uses gopter for property-based testing with deterministic seeds
2026-02-23 20:25:49 -05:00
Jeremie Fraeys
9f9d75dd68
test(phase-4): reproducibility crossover tests
Implement reproducibility crossover requirements:

- TestManifestEnvironmentCapture: Environment population with ConfigHash and DetectionMethod
- TestConfigHashPostDefaults: Hash computation after env expansion and defaults

Verifies manifest.Environment is properly populated for reproducibility tracking
2026-02-23 20:25:37 -05:00
Jeremie Fraeys
8f9bcef754
test(phase-3): prerequisite security and reproducibility tests
Implement 4 prerequisite test requirements:

- TestConfigIntegrityVerification: Config signing, tamper detection, hash stability
- TestManifestFilenameNonce: Cryptographic nonce generation and filename patterns
- TestGPUDetectionAudit: Structured logging of GPU detection at startup
- TestResourceEnvVarParsing: Resource env var parsing and override behavior

Also update manifest run_manifest.go:
- Add nonce-based filename support to WriteToDir
- Add nonce-based file detection to LoadFromDir
2026-02-23 20:25:26 -05:00
Jeremie Fraeys
f71352202e
test(phase-1-2): naming alignment and partial test completion
Rename and enhance existing tests to align with coverage map:
- TestGPUDetectorAMDVendorAlias -> TestAMDAliasManifestRecord
- TestScanArtifacts_SkipsKnownPathsAndLogs -> TestScanExclusionsRecorded
- Add env var expansion verification to TestHIPAAValidation_InlineCredentials
- Record exclusions in manifest.Artifacts for audit trail
2026-02-23 20:25:07 -05:00
Jeremie Fraeys
b33c6c4878
test(security): Add PHI denylist tests to secrets validation
Add comprehensive PHI detection tests:
- patient_id rejection
- ssn rejection
- medical_record_number rejection
- diagnosis_code rejection
- Mixed secrets with PHI rejection
- Normal secrets acceptance (HF_TOKEN, WANDB_API_KEY, etc.)

Ensures AllowedSecrets PHI denylist validation works correctly
across all PHI pattern variations.

Part of: PHI denylist validation from security plan
2026-02-23 19:44:33 -05:00
Jeremie Fraeys
17d5c75e33
fix(security): Path validation improvements for symlink resolution
Fix ValidatePath to correctly resolve symlinks and handle edge cases:
- Resolve symlinks before boundary check to prevent traversal
- Handle macOS /private prefix correctly
- Add fallback for non-existent paths (parent directory resolution)
- Double boundary checks: before AND after symlink resolution
- Prevent race conditions between check and use

Update path traversal tests:
- Correct test expectations for "..." (three dots is valid filename, not traversal)
- Add tests for symlink escape attempts
- Add unicode attack tests
- Add deeply nested traversal tests

Security impact: Prevents path traversal via symlink following in artifact
scanning and other file operations.
2026-02-23 19:44:16 -05:00
Jeremie Fraeys
651318bc93
test(security): Integration tests for sandbox escape and secrets handling
Add sandbox escape integration tests:
- Container breakout attempts via privileged mode
- Host path mounting restrictions
- Network namespace isolation verification
- Capability dropping validation
- Seccomp profile enforcement

Add secrets integration tests:
- End-to-end credential expansion testing
- PHI denylist enforcement in real configs
- Environment variable reference resolution
- Plaintext secret detection across config boundaries
- Secret rotation workflow validation

Tests run with real container runtime (Podman/Docker) when available.
Provides defense-in-depth beyond unit tests.

Part of: security integration testing from security plan
2026-02-23 19:44:07 -05:00
Jeremie Fraeys
58c1a5fa58
feat(audit): Tamper-evident audit chain verification system
Add ChainVerifier for cryptographic audit log verification:
- VerifyLogFile(): Validates entire audit chain integrity
- Detects tampering at specific event index (FirstTampered)
- Returns chain root hash for external verification
- GetChainRootHash(): Standalone hash computation
- VerifyAndAlert(): Boolean tampering detection with logging

Add audit-verifier CLI tool:
- Standalone binary for audit chain verification
- Takes log path argument and reports tampering

Update audit logger for chain integrity:
- Each event includes sequence number and hash chain
- SHA-256 linking: hash_n = SHA-256(prev_hash || event_n)
- Tamper detection through hash chain validation

Add comprehensive test coverage:
- Empty log handling
- Valid chain verification
- Tampering detection with modification
- Root hash consistency
- Alert mechanism tests

Part of: V.7 audit verification from security plan
2026-02-23 19:43:50 -05:00
Jeremie Fraeys
9434f4c8e6
feat(security): Artifact ingestion caps enforcement
Add MaxArtifactFiles and MaxArtifactTotalBytes to SandboxConfig:
- Default MaxArtifactFiles: 10,000 (configurable via SecurityDefaults)
- Default MaxArtifactTotalBytes: 100GB (configurable via SecurityDefaults)
- ApplySecurityDefaults() sets defaults if not specified

Enforce caps in scanArtifacts() during directory walk:
- Returns error immediately when MaxArtifactFiles exceeded
- Returns error immediately when MaxArtifactTotalBytes exceeded
- Prevents resource exhaustion attacks from malicious artifact trees

Update all call sites to pass SandboxConfig for cap enforcement:
- Native bridge libs updated to pass caps argument
- Benchmark tests updated with nil caps (unlimited for benchmarks)
- Unit tests updated with nil caps

Closes: artifact ingestion caps items from security plan
2026-02-23 19:43:28 -05:00
Jeremie Fraeys
a8180f1f26
feat(security): HIPAA compliance mode and PHI denylist validation
Add compliance_mode field to Config with strict HIPAA validation:
- Requires SnapshotStore.Secure=true in HIPAA mode
- Requires NetworkMode="none" for tenant isolation
- Requires non-empty SeccompProfile
- Requires NoNewPrivileges=true
- Enforces credentials via environment variables only (no inline YAML)

Add PHI denylist validation for AllowedSecrets:
- Blocks secrets matching patterns: patient, ssn, mrn, medical_record,
  diagnosis, dob, birth, mrn_number, patient_id, patient_name
- Prevents accidental PHI exfiltration via secret channels

Add comprehensive test coverage in hipaa_validation_test.go:
- Network mode enforcement tests
- NoNewPrivileges requirement tests
- Seccomp profile validation tests
- Inline credential rejection tests
- PHI denylist validation tests

Closes: compliance_mode, PHI denylist items from security plan
2026-02-23 19:43:19 -05:00
Jeremie Fraeys
fc2459977c
refactor(worker): update worker tests and native bridge
**Worker Refactoring:**
- Update internal/worker/factory.go, worker.go, snapshot_store.go
- Update native_bridge.go and native_bridge_nocgo.go for native library integration

**Test Updates:**
- Update all worker unit tests for new interfaces
- Update chaos tests
- Update container/podman_test.go
- Add internal/workertest/worker.go for shared test utilities

**Documentation:**
- Update native/README.md
2026-02-23 18:04:22 -05:00
Jeremie Fraeys
4b8df60e83
deploy: clean up docker-compose configurations
Remove unnecessary service definitions and debug logging configuration
from docker-compose files across all deployment environments.
2026-02-23 18:04:09 -05:00
Jeremie Fraeys
be67cb77d3
test(benchmarks): update benchmark tests with job cleanup and improvements
**Payload Performance Test:**
- Add job cleanup after each iteration using DeleteJob()
- Ensure isolated memory measurements between test runs

**All Benchmark Tests:**
- General improvements and maintenance updates
2026-02-23 18:03:54 -05:00
Jeremie Fraeys
fccced6bb3
test(security): add comprehensive security unit tests
Adds 13 security tests across 4 files for hardening verification:

**Path Traversal Tests (path_traversal_test.go):**
- TestSecurePathValidator_ValidRelativePath
- TestSecurePathValidator_PathTraversalBlocked
- TestSecurePathValidator_SymlinkEscape
- Tests symlink resolution and path boundary enforcement

**File Type Validation Tests (filetype_test.go):**
- TestValidateFileType_AllowedTypes
- TestValidateFileType_DangerousTypesBlocked
- TestValidateModelFile
- Tests magic bytes validation and dangerous extension blocking

**Secrets Management Tests (secrets_test.go):**
- TestExpandSecrets_BasicExpansion
- TestExpandSecrets_NestedAndMissingVars
- TestValidateNoPlaintextSecrets_HeuristicDetection
- Tests env variable expansion and plaintext secret detection with entropy

**Audit Logging Tests (audit_test.go):**
- TestAuditLogger_ChainIntegrity
- TestAuditLogger_VerifyChain
- TestAuditLogger_LogFileAccess
- TestAuditLogger_Disabled
- Tests tamper-evident chain hashing and file access logging
2026-02-23 18:00:45 -05:00
Jeremie Fraeys
ec9e845bb6
fix(test): Fix WebSocketQueue test timeout and race conditions
Reduce worker polling interval from 5ms to 1ms for faster task pickup

Add 100ms buffer after job submission to allow queue to settle

Increase timeout from 30s to 60s to prevent flaky failures

Fixes intermittent timeout issues in integration tests
2026-02-23 14:38:18 -05:00
Jeremie Fraeys
ab20212d07
test: Update duplicate detection tests 2026-02-23 14:14:21 -05:00
Jeremie Fraeys
3b194ff2e8
feat: GPU detection transparency and artifact scanner improvements
Some checks failed
Build CLI with Embedded SQLite / build (arm64, aarch64-linux) (push) Waiting to run
Build CLI with Embedded SQLite / build (x86_64, x86_64-linux) (push) Waiting to run
Build CLI with Embedded SQLite / build-macos (arm64) (push) Waiting to run
Build CLI with Embedded SQLite / build-macos (x86_64) (push) Waiting to run
Security Scan / Security Analysis (push) Waiting to run
Security Scan / Native Library Security (push) Waiting to run
Checkout test / test (push) Successful in 6s
CI/CD Pipeline / Test (push) Failing after 1s
CI/CD Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI/CD Pipeline / Build (push) Has been skipped
CI/CD Pipeline / Test Scripts (push) Has been skipped
CI/CD Pipeline / Test Native Libraries (push) Has been skipped
CI/CD Pipeline / GPU Golden Test Matrix (push) Has been skipped
Documentation / build-and-publish (push) Failing after 39s
CI/CD Pipeline / Docker Build (push) Has been skipped
- Surface GPUDetectionInfo from parseGPUCountFromConfig for detection metadata
- Document FETCH_ML_TOTAL_CPU and FETCH_ML_GPU_SLOTS_PER_GPU env vars
- Add debug logging for all env var overrides to stderr
- Track config-layer auto-detection in GPUDetectionInfo.ConfigLayerAutoDetected
- Add --include-all flag to artifact scanner (includeAll parameter)
- Add AMD production mode enforcement (error in non-local mode)
- Add GPU detector unit tests for env overrides and AMD aliasing
2026-02-23 12:29:34 -05:00
Jeremie Fraeys
47ad62648d
test(e2e): skip gracefully when Redis unavailable, fix cross-device link
Some checks failed
Checkout test / test (push) Successful in 4s
CI/CD Pipeline / Test (push) Failing after 1s
CI/CD Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI/CD Pipeline / Build (push) Has been skipped
CI/CD Pipeline / Test Scripts (push) Has been skipped
CI/CD Pipeline / Test Native Libraries (push) Has been skipped
Documentation / build-and-publish (push) Failing after 33s
CI/CD Pipeline / Docker Build (push) Has been skipped
Security Scan / Security Analysis (push) Has been cancelled
Security Scan / Native Library Security (push) Has been cancelled
- StartTemporaryRedis now skips tests instead of failing when redis-server unavailable

- Fix homelab_e2e_test cross-device link issue using CopyDir instead of Rename
2026-02-21 21:20:47 -05:00
Jeremie Fraeys
bf4a8bcf78
test(auth): skip keychain tests when dbus unavailable
Some checks failed
CI/CD Pipeline / Docker Build (push) Blocked by required conditions
Security Scan / Security Analysis (push) Waiting to run
Security Scan / Native Library Security (push) Waiting to run
Checkout test / test (push) Successful in 4s
CI/CD Pipeline / Test (push) Failing after 1s
CI/CD Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI/CD Pipeline / Build (push) Has been skipped
CI/CD Pipeline / Test Scripts (push) Has been skipped
CI/CD Pipeline / Test Native Libraries (push) Has been skipped
Documentation / build-and-publish (push) Has been cancelled
2026-02-21 21:20:03 -05:00
Jeremie Fraeys
e557313e08
fix: context reuse benchmark uses temp directory
- Replace hardcoded testdata path with b.TempDir()
- Add createSmallDataset helper for self-contained benchmarks
- Fixes FAIL: BenchmarkContextReuse / BenchmarkSequentialHashes
2026-02-21 14:38:00 -05:00