Commit graph

118 commits

Author SHA1 Message Date
Jeremie Fraeys
7cd86fb88a
feat: add new API handlers, build scripts, and ADRs
Some checks failed
Build Pipeline / Sign HIPAA Config (push) Has been skipped
Build Pipeline / Generate SLSA Provenance (push) Has been skipped
Checkout test / test (push) Successful in 6s
CI Pipeline / Test (ubuntu-latest on self-hosted) (push) Failing after 1s
CI Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI Pipeline / Security Scan (push) Has been skipped
CI Pipeline / Test Scripts (push) Has been skipped
CI Pipeline / Test Native Libraries (push) Has been skipped
CI Pipeline / Native Library Build Matrix (push) Has been skipped
Contract Tests / Spec Drift Detection (push) Failing after 11s
Contract Tests / API Contract Tests (push) Has been skipped
Deploy API Docs / Build API Documentation (push) Failing after 5s
Deploy API Docs / Deploy to GitHub Pages (push) Has been skipped
Documentation / build-and-publish (push) Failing after 40s
Test Matrix / test-native-vs-pure (cgo) (push) Failing after 14s
Test Matrix / test-native-vs-pure (native) (push) Failing after 35s
Test Matrix / test-native-vs-pure (pure) (push) Failing after 18s
CI Pipeline / Trigger Build Workflow (push) Failing after 1s
Build CLI with Embedded SQLite / build (arm64, aarch64-linux) (push) Has been cancelled
Build CLI with Embedded SQLite / build (x86_64, x86_64-linux) (push) Has been cancelled
Build CLI with Embedded SQLite / build-macos (arm64) (push) Has been cancelled
Build CLI with Embedded SQLite / build-macos (x86_64) (push) Has been cancelled
Security Scan / Security Analysis (push) Has been cancelled
Security Scan / Native Library Security (push) Has been cancelled
Verification & Maintenance / V.1 - Schema Drift Detection (push) Has been cancelled
Verification & Maintenance / V.4 - Custom Go Vet Analyzers (push) Has been cancelled
Verification & Maintenance / V.7 - Audit Chain Integrity (push) Has been cancelled
Verification & Maintenance / V.6 - Extended Security Scanning (push) Has been cancelled
Verification & Maintenance / V.10 - OpenSSF Scorecard (push) Has been cancelled
Verification & Maintenance / Verification Summary (push) Has been cancelled
- Introduce audit, plugin, and scheduler API handlers
- Add spec_embed.go for OpenAPI spec embedding
- Create modular build scripts (cli, go, native, cross-platform)
- Add deployment cleanup and health-check utilities
- New ADRs: hot reload, audit store, SSE updates, RBAC, caching, offline mode, KMS regions, tenant offboarding
- Add KMS configuration schema and worker variants
- Include KMS benchmark tests
2026-03-04 13:24:27 -05:00
Jeremie Fraeys
61081655d2
feat: enhance worker execution and scheduler service templates
- Refactor worker configuration management
- Improve container executor lifecycle handling
- Update runloop and worker core logic
- Enhance scheduler service template generation
- Remove obsolete 'scheduler' symlink/directory
2026-03-04 13:24:20 -05:00
Jeremie Fraeys
66f262d788
security: improve audit, crypto, and config handling
- Enhance audit checkpoint system
- Update KMS provider and tenant key management
- Refine configuration constants
- Improve TUI config handling
2026-03-04 13:23:42 -05:00
Jeremie Fraeys
a4f2c36069
feat: enhance task domain and scheduler protocol
- Update task domain model
- Improve scheduler hub and priority queue
- Enhance protocol definitions
- Update manifest schema and run handling
2026-03-04 13:23:38 -05:00
Jeremie Fraeys
1f495dfbb7
api: regenerate OpenAPI types and server code
- Update openapi.yaml spec
- Regenerate server_gen.go with oapi-codegen
- Update adapter, routes, and server configuration
2026-03-04 13:23:34 -05:00
Jeremie Fraeys
e1ec255ad2
refactor(crypto): integrate KMS with TenantKeyManager
Replace in-memory root keys with KMS interface:
- GenerateDataEncryptionKey: generate DEK, wrap via KMS, cache
- UnwrapDataEncryptionKey: cache check, KMS decrypt, cache store
- EncryptArtifact/DecryptArtifact: use DEK from KMS
- RotateTenantKey: create new KMS key, flush cache
- RevokeTenant: disable KMS key, schedule deletion per ADR-015

Remove deprecated methods: wrapKey, unwrapKey (replaced by KMS)
2026-03-03 19:14:27 -05:00
Jeremie Fraeys
7c03c8b5bd
feat(kms): add HashiCorp Vault and AWS KMS providers
Implement VaultProvider with Transit engine:
- AppRole, Kubernetes, and Token authentication
- Encrypt/Decrypt via /transit/encrypt and /transit/decrypt
- Key lifecycle via /transit/keys API
- Health check via /sys/health

Implement AWSProvider with SDK v2:
- Per-region key naming with alias prefix
- Encrypt/Decrypt via KMS SDK
- Key lifecycle (CreateKey, Disable, ScheduleDeletion, Enable)
- AWS endpoint support for LocalStack testing
2026-03-03 19:14:21 -05:00
Jeremie Fraeys
cb25677695
feat(kms): implement core KMS infrastructure with DEK cache
Add KMSProvider interface for external key management systems:
- Encrypt/Decrypt operations for DEK wrapping
- Key lifecycle management (Create, Disable, ScheduleDeletion, Enable)
- HealthCheck and Close methods

Implement MemoryProvider for development/testing:
- XOR encryption with HMAC-SHA256 authentication
- Secure random key generation using crypto/rand
- MAC verification to detect wrong keys

Implement DEKCache per ADR-012:
- 15-minute TTL with configurable grace window (1 hour)
- LRU eviction with 1000 entry limit
- Cache key includes (tenantID, artifactID, kmsKeyID) for isolation
- Thread-safe operations with RWMutex
- Secure memory wiping on eviction/cleanup

Add config package with types:
- ProviderType enum (vault, aws, memory)
- VaultConfig with AppRole/Kubernetes/Token auth
- AWSConfig with region and alias prefix
- CacheConfig with TTL, MaxEntries, GraceWindow
- Validation methods for all config types
2026-03-03 19:13:55 -05:00
Jeremie Fraeys
da104367d6
feat: add Plugin GPU Quota implementation and tests
Some checks failed
Build Pipeline / Build Binaries (push) Failing after 1m59s
Build Pipeline / Build Docker Images (push) Has been skipped
Build Pipeline / Sign HIPAA Config (push) Has been skipped
Build Pipeline / Generate SLSA Provenance (push) Has been skipped
Checkout test / test (push) Successful in 5s
CI Pipeline / Test (ubuntu-latest on self-hosted) (push) Failing after 1s
CI Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI Pipeline / Security Scan (push) Has been skipped
CI Pipeline / Test Scripts (push) Has been skipped
CI Pipeline / Test Native Libraries (push) Has been skipped
CI Pipeline / Native Library Build Matrix (push) Has been skipped
Documentation / build-and-publish (push) Failing after 35s
CI Pipeline / Trigger Build Workflow (push) Failing after 0s
Security Scan / Security Analysis (push) Has been cancelled
Security Scan / Native Library Security (push) Has been cancelled
Verification & Maintenance / V.1 - Schema Drift Detection (push) Has been cancelled
Verification & Maintenance / V.4 - Custom Go Vet Analyzers (push) Has been cancelled
Verification & Maintenance / V.7 - Audit Chain Integrity (push) Has been cancelled
Verification & Maintenance / V.6 - Extended Security Scanning (push) Has been cancelled
Verification & Maintenance / V.10 - OpenSSF Scorecard (push) Has been cancelled
Verification & Maintenance / Verification Summary (push) Has been cancelled
- Add plugin_quota.go with GPU quota management for scheduler

- Update scheduler hub and protocol for plugin support

- Add comprehensive plugin quota unit tests

- Update gang service and WebSocket queue integration tests
2026-02-26 14:35:05 -05:00
Jeremie Fraeys
8f2495deb0
chore(cleanup): remove obsolete files and update .gitignore
Remove deprecated components replaced by new scheduler:
- Delete internal/controller/pacing_controller.go (replaced by scheduler/pacing.go)
- Delete internal/manifest/schema_test.go (consolidated into tests/unit/)
- Delete internal/workertest/worker.go (consolidated into tests/fixtures/)
- Update .gitignore with scheduler binary and new patterns
2026-02-26 12:09:18 -05:00
Jeremie Fraeys
4cdb68907e
refactor(utilities): update supporting modules for scheduler integration
Update utility modules:
- File utilities with secure file operations
- Environment pool with resource tracking
- Error types with scheduler error categories
- Logging with audit context support
- Network/SSH with connection pooling
- Privacy/PII handling with tenant boundaries
- Resource manager with scheduler allocation
- Security monitor with audit integration
- Tracking plugins (MLflow, TensorBoard) with auth
- Crypto signing with tenant keys
- Database init with multi-user support
2026-02-26 12:07:15 -05:00
Jeremie Fraeys
6866ba9366
refactor(queue): integrate scheduler backend and storage improvements
Update queue and storage systems for scheduler integration:
- Queue backend with scheduler coordination
- Filesystem queue with batch operations
- Deduplication with tenant-aware keys
- Storage layer with audit logging hooks
- Domain models (Task, Events, Errors) with scheduler fields
- Database layer with tenant isolation
- Dataset storage with integrity checks
2026-02-26 12:06:46 -05:00
Jeremie Fraeys
6b2c377680
refactor(jupyter): enhance security and scheduler integration
Update Jupyter integration for security and scheduler support:
- Enhanced security configuration with audit logging
- Health monitoring with scheduler event integration
- Package manager with network policy enforcement
- Service manager with lifecycle hooks
- Network manager with tenant isolation
- Workspace metadata with tenant tags
- Config with resource limits
- Podman container integration improvements
- Experiment manager with tracking integration
- Manifest runner with security checks
2026-02-26 12:06:35 -05:00
Jeremie Fraeys
3fb6902fa1
feat(worker): integrate scheduler endpoints and security hardening
Update worker system for scheduler integration:
- Worker server with scheduler registration
- Configuration with scheduler endpoint support
- Artifact handling with integrity verification
- Container executor with supply chain validation
- Local executor enhancements
- GPU detection improvements (cross-platform)
- Error handling with execution context
- Factory pattern for executor instantiation
- Hash integrity with native library support
2026-02-26 12:06:16 -05:00
Jeremie Fraeys
ef11d88a75
refactor(auth): add tenant scoping and permission enhancements
Update authentication system for multi-tenant support:
- API key management with tenant scoping
- Permission checks for multi-tenant operations
- Database layer with tenant isolation
- Keychain integration with audit logging
2026-02-26 12:06:08 -05:00
Jeremie Fraeys
420de879ff
feat(api): integrate scheduler protocol and WebSocket enhancements
Update API layer for scheduler integration:
- WebSocket handlers with scheduler protocol support
- Jobs WebSocket endpoint with priority queue integration
- Validation middleware for scheduler messages
- Server configuration with security hardening
- Protocol definitions for worker-scheduler communication
- Dataset handlers with tenant isolation checks
- Response helpers with audit context
- OpenAPI spec updates for new endpoints
2026-02-26 12:05:57 -05:00
Jeremie Fraeys
95adcba437
feat(worker): add Jupyter/vLLM plugins and process isolation
Extend worker capabilities with new execution plugins and security features:
- Jupyter plugin for notebook-based ML experiments
- vLLM plugin for LLM inference workloads
- Cross-platform process isolation (Unix/Windows)
- Network policy enforcement with platform-specific implementations
- Service manager integration for lifecycle management
- Scheduler backend integration for queue coordination

Update lifecycle management:
- Enhanced runloop with state transitions
- Service manager integration for plugin coordination
- Improved state persistence and recovery

Add test coverage:
- Unit tests for Jupyter and vLLM plugins
- Updated worker execution tests
2026-02-26 12:03:59 -05:00
Jeremie Fraeys
a981e89005
feat(security): add audit subsystem and tenant isolation
Implement comprehensive audit and security infrastructure:
- Immutable audit logs with platform-specific backends (Linux/Other)
- Sealed log entries with tamper-evident checksums
- Audit alert system for real-time security notifications
- Log rotation with retention policies
- Checkpoint-based audit verification

Add multi-tenant security features:
- Tenant manager with quota enforcement
- Middleware for tenant authentication/authorization
- Per-tenant cryptographic key isolation
- Supply chain security for container verification
- Cross-platform secure file utilities (Unix/Windows)

Add test coverage:
- Unit tests for audit alerts and sealed logs
- Platform-specific audit backend tests
2026-02-26 12:03:45 -05:00
Jeremie Fraeys
43e6446587
feat(scheduler): implement multi-tenant job scheduler with gang scheduling
Add new scheduler component for distributed ML workload orchestration:
- Hub-based coordination for multi-worker clusters
- Pacing controller for rate limiting job submissions
- Priority queue with preemption support
- Port allocator for dynamic service discovery
- Protocol handlers for worker-scheduler communication
- Service manager with OS-specific implementations
- Connection management and state persistence
- Template system for service deployment

Includes comprehensive test suite:
- Unit tests for all core components
- Integration tests for distributed scenarios
- Benchmark tests for performance validation
- Mock fixtures for isolated testing

Refs: scheduler-architecture.md
2026-02-26 12:03:23 -05:00
Jeremie Fraeys
8f9bcef754
test(phase-3): prerequisite security and reproducibility tests
Implement 4 prerequisite test requirements:

- TestConfigIntegrityVerification: Config signing, tamper detection, hash stability
- TestManifestFilenameNonce: Cryptographic nonce generation and filename patterns
- TestGPUDetectionAudit: Structured logging of GPU detection at startup
- TestResourceEnvVarParsing: Resource env var parsing and override behavior

Also update manifest run_manifest.go:
- Add nonce-based filename support to WriteToDir
- Add nonce-based file detection to LoadFromDir
2026-02-23 20:25:26 -05:00
Jeremie Fraeys
f71352202e
test(phase-1-2): naming alignment and partial test completion
Rename and enhance existing tests to align with coverage map:
- TestGPUDetectorAMDVendorAlias -> TestAMDAliasManifestRecord
- TestScanArtifacts_SkipsKnownPathsAndLogs -> TestScanExclusionsRecorded
- Add env var expansion verification to TestHIPAAValidation_InlineCredentials
- Record exclusions in manifest.Artifacts for audit trail
2026-02-23 20:25:07 -05:00
Jeremie Fraeys
17d5c75e33
fix(security): Path validation improvements for symlink resolution
Fix ValidatePath to correctly resolve symlinks and handle edge cases:
- Resolve symlinks before boundary check to prevent traversal
- Handle macOS /private prefix correctly
- Add fallback for non-existent paths (parent directory resolution)
- Double boundary checks: before AND after symlink resolution
- Prevent race conditions between check and use

Update path traversal tests:
- Correct test expectations for "..." (three dots is valid filename, not traversal)
- Add tests for symlink escape attempts
- Add unicode attack tests
- Add deeply nested traversal tests

Security impact: Prevents path traversal via symlink following in artifact
scanning and other file operations.
2026-02-23 19:44:16 -05:00
Jeremie Fraeys
58c1a5fa58
feat(audit): Tamper-evident audit chain verification system
Add ChainVerifier for cryptographic audit log verification:
- VerifyLogFile(): Validates entire audit chain integrity
- Detects tampering at specific event index (FirstTampered)
- Returns chain root hash for external verification
- GetChainRootHash(): Standalone hash computation
- VerifyAndAlert(): Boolean tampering detection with logging

Add audit-verifier CLI tool:
- Standalone binary for audit chain verification
- Takes log path argument and reports tampering

Update audit logger for chain integrity:
- Each event includes sequence number and hash chain
- SHA-256 linking: hash_n = SHA-256(prev_hash || event_n)
- Tamper detection through hash chain validation

Add comprehensive test coverage:
- Empty log handling
- Valid chain verification
- Tampering detection with modification
- Root hash consistency
- Alert mechanism tests

Part of: V.7 audit verification from security plan
2026-02-23 19:43:50 -05:00
Jeremie Fraeys
4a4d3de8e1
feat(security): Manifest security - nonce generation, environment tracking, schema validation
Add cryptographically secure manifest filename nonce generation:
- GenerateManifestNonce() creates 16-byte random nonce (32 hex chars)
- GenerateManifestFilename() creates unique filenames: run_manifest_<nonce>.json
- Prevents enumeration attacks on manifest files

Add ExecutionEnvironment struct to manifest:
- Captures ConfigHash for reproducibility verification
- Records GPU detection method (auto-detected, env override, config, etc.)
- Records sandbox settings (NoNewPrivileges, DropAllCaps, NetworkMode)
- Records compliance mode and manifest nonce
- Records artifact scan exclusions with reason

Add JSON Schema validation:
- schema.json: Canonical schema for manifest validation
- schema_version.go: Schema versioning and compatibility checking
- schema_test.go: Drift detection with SHA-256 hash verification
- Validates required fields (run_id, environment.config_hash, etc.)
- Validates compliance_mode enum values (hipaa, standard)
- Validates no negative sizes in artifacts

Closes: manifest nonce, environment tracking, scan exclusions from security plan
2026-02-23 19:43:39 -05:00
Jeremie Fraeys
9434f4c8e6
feat(security): Artifact ingestion caps enforcement
Add MaxArtifactFiles and MaxArtifactTotalBytes to SandboxConfig:
- Default MaxArtifactFiles: 10,000 (configurable via SecurityDefaults)
- Default MaxArtifactTotalBytes: 100GB (configurable via SecurityDefaults)
- ApplySecurityDefaults() sets defaults if not specified

Enforce caps in scanArtifacts() during directory walk:
- Returns error immediately when MaxArtifactFiles exceeded
- Returns error immediately when MaxArtifactTotalBytes exceeded
- Prevents resource exhaustion attacks from malicious artifact trees

Update all call sites to pass SandboxConfig for cap enforcement:
- Native bridge libs updated to pass caps argument
- Benchmark tests updated with nil caps (unlimited for benchmarks)
- Unit tests updated with nil caps

Closes: artifact ingestion caps items from security plan
2026-02-23 19:43:28 -05:00
Jeremie Fraeys
a8180f1f26
feat(security): HIPAA compliance mode and PHI denylist validation
Add compliance_mode field to Config with strict HIPAA validation:
- Requires SnapshotStore.Secure=true in HIPAA mode
- Requires NetworkMode="none" for tenant isolation
- Requires non-empty SeccompProfile
- Requires NoNewPrivileges=true
- Enforces credentials via environment variables only (no inline YAML)

Add PHI denylist validation for AllowedSecrets:
- Blocks secrets matching patterns: patient, ssn, mrn, medical_record,
  diagnosis, dob, birth, mrn_number, patient_id, patient_name
- Prevents accidental PHI exfiltration via secret channels

Add comprehensive test coverage in hipaa_validation_test.go:
- Network mode enforcement tests
- NoNewPrivileges requirement tests
- Seccomp profile validation tests
- Inline credential rejection tests
- PHI denylist validation tests

Closes: compliance_mode, PHI denylist items from security plan
2026-02-23 19:43:19 -05:00
Jeremie Fraeys
fc2459977c
refactor(worker): update worker tests and native bridge
**Worker Refactoring:**
- Update internal/worker/factory.go, worker.go, snapshot_store.go
- Update native_bridge.go and native_bridge_nocgo.go for native library integration

**Test Updates:**
- Update all worker unit tests for new interfaces
- Update chaos tests
- Update container/podman_test.go
- Add internal/workertest/worker.go for shared test utilities

**Documentation:**
- Update native/README.md
2026-02-23 18:04:22 -05:00
Jeremie Fraeys
a70d8aad8e
refactor: remove dead code and fix unused variables
**Cleanup:**
- Delete internal/worker/testutil.go (150 lines of unused test utilities)
- Remove unused stateDir() function from internal/jupyter/service_manager.go
- Silence unused variable warning in internal/worker/executor/container.go
2026-02-23 18:03:38 -05:00
Jeremie Fraeys
92aab06d76
feat(security): implement comprehensive security hardening phases 1-5,7
Implements defense-in-depth security for HIPAA and multi-tenant requirements:

**Phase 1 - File Ingestion Security:**
- SecurePathValidator with symlink resolution and path boundary enforcement
  in internal/fileutil/secure.go
- Magic bytes validation for ML artifacts (safetensors, GGUF, HDF5, numpy)
  in internal/fileutil/filetype.go
- Dangerous extension blocking (.pt, .pkl, .exe, .sh, .zip)
- Upload limits (10GB size, 100MB/s rate, 10 uploads/min)

**Phase 2 - Sandbox Hardening:**
- ApplySecurityDefaults() with secure-by-default principle
  - network_mode: none, read_only_root: true, no_new_privileges: true
  - drop_all_caps: true, user_ns: true, run_as_uid/gid: 1000
- PodmanSecurityConfig and BuildSecurityArgs() in internal/container/podman.go
- BuildPodmanCommand now accepts full security configuration
- Container executor passes SandboxConfig to Podman command builder
- configs/seccomp/default-hardened.json blocks dangerous syscalls
  (ptrace, mount, reboot, kexec_load, open_by_handle_at)

**Phase 3 - Secrets Management:**
- expandSecrets() for environment variable expansion using ${VAR} syntax
- validateNoPlaintextSecrets() with entropy-based detection
- Pattern matching for AWS, GitHub, GitLab, OpenAI, Stripe tokens
- Shannon entropy calculation (>4 bits/char triggers detection)
- Secrets expanded during LoadConfig() before validation

**Phase 5 - HIPAA Audit Logging:**
- Tamper-evident chain hashing with SHA-256 in internal/audit/audit.go
- Event struct extended with PrevHash, EventHash, SequenceNum
- File access event types: EventFileRead, EventFileWrite, EventFileDelete
- LogFileAccess() helper for HIPAA compliance
- VerifyChain() function for tamper detection

**Supporting Changes:**
- Add DeleteJob() and DeleteJobsByPrefix() to storage package
- Integrate SecurePathValidator in artifact scanning
2026-02-23 18:00:33 -05:00
Jeremie Fraeys
3b194ff2e8
feat: GPU detection transparency and artifact scanner improvements
Some checks failed
Build CLI with Embedded SQLite / build (arm64, aarch64-linux) (push) Waiting to run
Build CLI with Embedded SQLite / build (x86_64, x86_64-linux) (push) Waiting to run
Build CLI with Embedded SQLite / build-macos (arm64) (push) Waiting to run
Build CLI with Embedded SQLite / build-macos (x86_64) (push) Waiting to run
Security Scan / Security Analysis (push) Waiting to run
Security Scan / Native Library Security (push) Waiting to run
Checkout test / test (push) Successful in 6s
CI/CD Pipeline / Test (push) Failing after 1s
CI/CD Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI/CD Pipeline / Build (push) Has been skipped
CI/CD Pipeline / Test Scripts (push) Has been skipped
CI/CD Pipeline / Test Native Libraries (push) Has been skipped
CI/CD Pipeline / GPU Golden Test Matrix (push) Has been skipped
Documentation / build-and-publish (push) Failing after 39s
CI/CD Pipeline / Docker Build (push) Has been skipped
- Surface GPUDetectionInfo from parseGPUCountFromConfig for detection metadata
- Document FETCH_ML_TOTAL_CPU and FETCH_ML_GPU_SLOTS_PER_GPU env vars
- Add debug logging for all env var overrides to stderr
- Track config-layer auto-detection in GPUDetectionInfo.ConfigLayerAutoDetected
- Add --include-all flag to artifact scanner (includeAll parameter)
- Add AMD production mode enforcement (error in non-local mode)
- Add GPU detector unit tests for env overrides and AMD aliasing
2026-02-23 12:29:34 -05:00
Jeremie Fraeys
1b0781dc68
fix(auth): make DeleteAPIKey resilient to keyring errors
Some checks failed
Security Scan / Security Analysis (push) Waiting to run
Security Scan / Native Library Security (push) Waiting to run
Checkout test / test (push) Successful in 4s
CI/CD Pipeline / Test (push) Has been cancelled
CI/CD Pipeline / Dev Compose Smoke Test (push) Has been cancelled
CI/CD Pipeline / Build (push) Has been cancelled
CI/CD Pipeline / Test Scripts (push) Has been cancelled
CI/CD Pipeline / Test Native Libraries (push) Has been cancelled
CI/CD Pipeline / Docker Build (push) Has been cancelled
Documentation / build-and-publish (push) Has been cancelled
DeleteAPIKey now ignores primary keyring errors (e.g., dbus unavailable)

and always cleans up the fallback store
2026-02-21 21:19:46 -05:00
Jeremie Fraeys
be39b37aec
feat: native GPU detection and NVML bridge for macOS and Linux
- Add dynamic NVML loading for Linux GPU detection
- Add macOS GPU detection via IOKit framework
- Add Zig NVML wrapper for cross-platform GPU queries
- Update native bridge to support platform-specific GPU libs
- Add CMake support for NVML dynamic library
2026-02-21 17:59:59 -05:00
Jeremie Fraeys
05b7af6991
feat: implement NVML-based GPU monitoring
- Add native/nvml_gpu/ C++ library wrapping NVIDIA Management Library
- Add Go bindings in internal/worker/gpu_nvml_native.go and gpu_nvml_stub.go
- Update gpu_detector.go to use NVML for accurate GPU count detection
- Update native/CMakeLists.txt to build nvml_gpu library
- Provides real-time GPU utilization, memory, temperature, clocks, power
- Falls back to environment variable when NVML unavailable
2026-02-21 15:16:09 -05:00
Jeremie Fraeys
e557313e08
fix: context reuse benchmark uses temp directory
- Replace hardcoded testdata path with b.TempDir()
- Add createSmallDataset helper for self-contained benchmarks
- Fixes FAIL: BenchmarkContextReuse / BenchmarkSequentialHashes
2026-02-21 14:38:00 -05:00
Jeremie Fraeys
158c525bef
fix: resolve benchmark and build tag conflicts
- Remove duplicate hash_selector.go (build tags handle switching)
- Fix benchmark to use worker.DirOverallSHA256Hex
- Fix snapshot_store.go to use integrity.DirOverallSHA256Hex directly
- Native tests pass, benchmarks now correctly test native vs Go
2026-02-21 14:26:48 -05:00
Jeremie Fraeys
90d702823b
fix: correct C type cast and add context reuse benchmark
- Fix C.uint32_t cast for runtime.NumCPU() in native_bridge_libs.go
- Add context_reuse_bench_test.go to verify performance gains
- All native tests pass (8/8)
- Benchmarks functional
2026-02-21 14:20:40 -05:00
Jeremie Fraeys
d1ac558107
perf: implement context reuse
Go Worker (internal/worker/native_bridge_libs.go):
- Add global hashCtx with sync.Once for lazy initialization
- Eliminates 5-20ms fh_init/fh_cleanup per hash operation
- Uses runtime.NumCPU() for optimal thread count
- Log initialization time for observability

Zig CLI (cli/src/native/hash.zig):
- Add global_ctx with atomic flag and mutex
- Thread-safe initialization with double-check pattern
- Idempotent init() callable from multiple threads
- Log init time for debugging
2026-02-21 14:19:14 -05:00
Jeremie Fraeys
48d00b8322
feat: integrate native queue backend into worker and API
- Add QueueBackendNative constant to backend.go
- Add case for native queue in NewBackend() switch
- Native queue uses same FilesystemPath config
- Build tag -tags native_libs enables native implementation

Native library integration now complete:
- dataset_hash: Worker (hash_selector), CLI (verify auto-hash)
- queue_index: Worker/API (backend selection with 'native' type)
2026-02-21 14:11:10 -05:00
Jeremie Fraeys
c89d970210
refactor: migrate from env var to build tags for native libs
Replace FETCHML_NATIVE_LIBS=1 environment variable with -tags native_libs:

Changes:
- internal/queue/native_queue.go: UseNativeQueue is now const true
- internal/queue/native_queue_stub.go: UseNativeQueue is now const false
- build/docker/simple.Dockerfile: Add -tags native_libs to go build
- deployments/docker-compose.dev.yml: Remove FETCHML_NATIVE_LIBS env var
- native/README.md: Update documentation for build tags
- scripts/test-native-with-redis.sh: New test script with Redis via docker-compose

Benefits:
- Compile-time enforcement (no runtime checks needed)
- Cleaner deployment (no env var management)
- Type safety (const vs var)
- Simpler testing with docker-compose Redis integration
2026-02-21 13:43:58 -05:00
Jeremie Fraeys
23e5f3d1dc
refactor(api): internal refactoring for TUI and worker modules
- Refactor internal/worker and internal/queue packages
- Update cmd/tui for monitoring interface
- Update test configurations
2026-02-20 15:51:23 -05:00
Jeremie Fraeys
6028779239
feat: update CLI, TUI, and security documentation
- Add safety checks to Zig build
- Add TUI with job management and narrative views
- Add WebSocket support and export services
- Add smart configuration defaults
- Update API routes with security headers
- Update SECURITY.md with comprehensive policy
- Add Makefile security scanning targets
2026-02-19 15:35:05 -05:00
Jeremie Fraeys
02811c0ffe
fix: resolve TODOs and standardize tests
- Fix duplicate check in security_test.go lint warning
- Mark SHA256 tests as Legacy for backward compatibility
- Convert TODO comments to documentation (task, handlers, privacy)
- Update user_manager_test to use GenerateAPIKey pattern
2026-02-19 15:34:59 -05:00
Jeremie Fraeys
37aad7ae87
feat: add manifest signing and native hashing support
- Integrate RunManifest.Validate with existing Validator
- Add manifest Sign() and Verify() methods
- Add native C++ hashing libraries (dataset_hash, queue_index)
- Add native bridge for Go/C++ integration
- Add deduplication support in queue
2026-02-19 15:34:39 -05:00
Jeremie Fraeys
a3f9bf8731
feat: implement tamper-evident audit logging
- Add hash-chained audit log entries for tamper detection
- Add EventRecorder interface for structured event logging
- Add TaskEvent helper method for consistent event emission
2026-02-19 15:34:28 -05:00
Jeremie Fraeys
e4d286f2e5
feat: add security monitoring and validation framework
- Implement anomaly detection monitor (brute force, path traversal, etc.)
- Add input validation framework with safety rules
- Add environment-based secrets manager with redaction
- Add security test suite for path traversal and injection
- Add CI security scanning workflow
2026-02-19 15:34:25 -05:00
Jeremie Fraeys
34aaba8f17
feat: implement Argon2id hashing and Ed25519 manifest signing
- Add Argon2id-based API key hashing with salt support
- Implement Ed25519 manifest signing (key generation, sign, verify)
- Add gen-keys CLI tool for manifest signing keys
- Fix hash-key command to hash provided key (not generate new one)
- Complete isHex helper function
2026-02-19 15:34:20 -05:00
Jeremie Fraeys
27c8b08a16
test: Reorganize and add unit tests
Reorganize tests for better structure and coverage:
- Move container/security_test.go from internal/ to tests/unit/container/
- Move related tests to proper unit test locations
- Delete orphaned test files (startup_blacklist_test.go)
- Add privacy middleware unit tests
- Add worker config unit tests
- Update E2E tests for homelab and websocket scenarios
- Update test fixtures with utility functions
- Add CLI helper script for arraylist fixes
2026-02-18 21:28:13 -05:00
Jeremie Fraeys
4756348c48
feat: Worker sandboxing and security configuration
Add security hardening features for worker execution:
- Worker config with sandboxing options (network_mode, read_only, secrets)
- Execution setup with security context propagation
- Podman container runtime security enhancements
- Security configuration management in config package
- Add homelab-sandbox.yaml example configuration

Supports running jobs in isolated, restricted environments.
2026-02-18 21:27:59 -05:00
Jeremie Fraeys
cb826b74a3
feat: WebSocket API infrastructure improvements
Enhance WebSocket client and server components:
- Add new WebSocket opcodes (CompareRuns, FindRuns, ExportRun, SetRunOutcome)
- Improve WebSocket client with additional response handlers
- Add crypto utilities for secure WebSocket communications
- Add I/O utilities for WebSocket payload handling
- Enhance validation for WebSocket message payloads
- Update routes for new WebSocket endpoints
- Improve monitor and validate command WebSocket integrations
2026-02-18 21:27:48 -05:00
Jeremie Fraeys
aaeef69bab
feat: Privacy and PII detection
Add privacy protection features to prevent accidental PII leakage:
- PII detection engine supporting emails, phone numbers, SSNs, credit cards
- CLI privacy command for scanning files and text
- Privacy middleware for API request/response filtering
- Suggestion utility for privacy-preserving alternatives

Integrates PII scanning into manifest validation for narrative fields.
2026-02-18 21:27:23 -05:00