Commit graph

12 commits

Author SHA1 Message Date
Jeremie Fraeys
a8180f1f26
feat(security): HIPAA compliance mode and PHI denylist validation
Add compliance_mode field to Config with strict HIPAA validation:
- Requires SnapshotStore.Secure=true in HIPAA mode
- Requires NetworkMode="none" for tenant isolation
- Requires non-empty SeccompProfile
- Requires NoNewPrivileges=true
- Enforces credentials via environment variables only (no inline YAML)

Add PHI denylist validation for AllowedSecrets:
- Blocks secrets matching patterns: patient, ssn, mrn, medical_record,
  diagnosis, dob, birth, mrn_number, patient_id, patient_name
- Prevents accidental PHI exfiltration via secret channels

Add comprehensive test coverage in hipaa_validation_test.go:
- Network mode enforcement tests
- NoNewPrivileges requirement tests
- Seccomp profile validation tests
- Inline credential rejection tests
- PHI denylist validation tests

Closes: compliance_mode, PHI denylist items from security plan
2026-02-23 19:43:19 -05:00
Jeremie Fraeys
92aab06d76
feat(security): implement comprehensive security hardening phases 1-5,7
Implements defense-in-depth security for HIPAA and multi-tenant requirements:

**Phase 1 - File Ingestion Security:**
- SecurePathValidator with symlink resolution and path boundary enforcement
  in internal/fileutil/secure.go
- Magic bytes validation for ML artifacts (safetensors, GGUF, HDF5, numpy)
  in internal/fileutil/filetype.go
- Dangerous extension blocking (.pt, .pkl, .exe, .sh, .zip)
- Upload limits (10GB size, 100MB/s rate, 10 uploads/min)

**Phase 2 - Sandbox Hardening:**
- ApplySecurityDefaults() with secure-by-default principle
  - network_mode: none, read_only_root: true, no_new_privileges: true
  - drop_all_caps: true, user_ns: true, run_as_uid/gid: 1000
- PodmanSecurityConfig and BuildSecurityArgs() in internal/container/podman.go
- BuildPodmanCommand now accepts full security configuration
- Container executor passes SandboxConfig to Podman command builder
- configs/seccomp/default-hardened.json blocks dangerous syscalls
  (ptrace, mount, reboot, kexec_load, open_by_handle_at)

**Phase 3 - Secrets Management:**
- expandSecrets() for environment variable expansion using ${VAR} syntax
- validateNoPlaintextSecrets() with entropy-based detection
- Pattern matching for AWS, GitHub, GitLab, OpenAI, Stripe tokens
- Shannon entropy calculation (>4 bits/char triggers detection)
- Secrets expanded during LoadConfig() before validation

**Phase 5 - HIPAA Audit Logging:**
- Tamper-evident chain hashing with SHA-256 in internal/audit/audit.go
- Event struct extended with PrevHash, EventHash, SequenceNum
- File access event types: EventFileRead, EventFileWrite, EventFileDelete
- LogFileAccess() helper for HIPAA compliance
- VerifyChain() function for tamper detection

**Supporting Changes:**
- Add DeleteJob() and DeleteJobsByPrefix() to storage package
- Integrate SecurePathValidator in artifact scanning
2026-02-23 18:00:33 -05:00
Jeremie Fraeys
3b194ff2e8
feat: GPU detection transparency and artifact scanner improvements
Some checks failed
Build CLI with Embedded SQLite / build (arm64, aarch64-linux) (push) Waiting to run
Build CLI with Embedded SQLite / build (x86_64, x86_64-linux) (push) Waiting to run
Build CLI with Embedded SQLite / build-macos (arm64) (push) Waiting to run
Build CLI with Embedded SQLite / build-macos (x86_64) (push) Waiting to run
Security Scan / Security Analysis (push) Waiting to run
Security Scan / Native Library Security (push) Waiting to run
Checkout test / test (push) Successful in 6s
CI/CD Pipeline / Test (push) Failing after 1s
CI/CD Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI/CD Pipeline / Build (push) Has been skipped
CI/CD Pipeline / Test Scripts (push) Has been skipped
CI/CD Pipeline / Test Native Libraries (push) Has been skipped
CI/CD Pipeline / GPU Golden Test Matrix (push) Has been skipped
Documentation / build-and-publish (push) Failing after 39s
CI/CD Pipeline / Docker Build (push) Has been skipped
- Surface GPUDetectionInfo from parseGPUCountFromConfig for detection metadata
- Document FETCH_ML_TOTAL_CPU and FETCH_ML_GPU_SLOTS_PER_GPU env vars
- Add debug logging for all env var overrides to stderr
- Track config-layer auto-detection in GPUDetectionInfo.ConfigLayerAutoDetected
- Add --include-all flag to artifact scanner (includeAll parameter)
- Add AMD production mode enforcement (error in non-local mode)
- Add GPU detector unit tests for env overrides and AMD aliasing
2026-02-23 12:29:34 -05:00
Jeremie Fraeys
be39b37aec
feat: native GPU detection and NVML bridge for macOS and Linux
- Add dynamic NVML loading for Linux GPU detection
- Add macOS GPU detection via IOKit framework
- Add Zig NVML wrapper for cross-platform GPU queries
- Update native bridge to support platform-specific GPU libs
- Add CMake support for NVML dynamic library
2026-02-21 17:59:59 -05:00
Jeremie Fraeys
4756348c48
feat: Worker sandboxing and security configuration
Add security hardening features for worker execution:
- Worker config with sandboxing options (network_mode, read_only, secrets)
- Execution setup with security context propagation
- Podman container runtime security enhancements
- Security configuration management in config package
- Add homelab-sandbox.yaml example configuration

Supports running jobs in isolated, restricted environments.
2026-02-18 21:27:59 -05:00
Jeremie Fraeys
a5059c5231
refactor: adopt PathRegistry in worker config
Update internal/worker/config.go to use centralized PathRegistry:

Changes:
- Initialize PathRegistry with config.FromEnv() in LoadConfig
- Update BasePath default to use paths.ExperimentsDir()
- Update DataDir default to use paths.DataDir()
- Simplify DataDir logic by using PathRegistry directly

Benefits:
- Consistent directory locations via PathRegistry
- Centralized path management across worker and api-server
- Simpler configuration with fewer conditional branches
2026-02-18 16:55:18 -05:00
Jeremie Fraeys
3775bc3ee0
refactor: replace panic with error returns and update maintenance
- Replace 9 panic() calls in smart_defaults.go with error returns
- Add ErrUnknownProfile error type for better error handling
- Update all callers (worker/config.go, tui/config.go, tui/cli_config.go, tui/main.go)
- Update CHANGELOG.md with recent WebSocket handler improvements
- Add metrics persistence, dataset handlers, and test organization notes
- Config validation passes (make configlint)
- All tests pass (go test ./tests/unit/api/ws)
2026-02-18 14:44:21 -05:00
Jeremie Fraeys
320e6fd409
refactor(dependency-hygiene): Move path functions from config to storage
Move ExpandPath function and path-related utilities from internal/config to internal/storage where they belong.

Files updated:
- internal/worker/config.go: use storage.ExpandPath
- internal/network/ssh.go: use storage.ExpandPath
- cmd/data_manager/data_manager_config.go: use storage.ExpandPath
- internal/api/server_config.go: use storage.ExpandPath

internal/storage/paths.go already contained the canonical implementation.

Result: Path utilities now live in storage layer, config package focuses on configuration structs.
2026-02-17 21:15:23 -05:00
Jeremie Fraeys
a5c1a9fc0b
refactor: Phase 4 - split worker package into focused files
Split 551-line worker/core.go into single-concern files:

- worker/config.go (+44 lines)
  - Added config parsing: envInt(), parseCPUFromConfig(), parseGPUCountFromConfig()
  - parseGPUSlotsPerGPUFromConfig()
  - Now has all config logic in one place (440 lines total)

- worker/metrics.go (new file, 172 lines)
  - Extracted setupMetricsExporter() with ~30 Prometheus metric registrations
  - Isolated metrics logic for easy modification

- worker/factory.go (new file, 183 lines)
  - Extracted NewWorker() factory function
  - Moved prePullImages(), pullImage() from core.go
  - Centralized worker instantiation

- worker/worker.go (renamed from core.go, ~100 lines)
  - Now just defines Worker struct, MLServer, JupyterManager
  - Clean, focused file without mixed concerns

Lines redistributed: ~350 lines moved from monolithic core.go
Build status: Compiles successfully
2026-02-17 12:57:02 -05:00
Jeremie Fraeys
a93b6715fd
feat: add native library bridge and queue integration
- Add native_queue.go with CGO bindings for queue operations
- Add native_queue_stub.go for non-CGO builds
- Add hash_selector to choose between Go and native implementations
- Add native_bridge_libs.go for CGO builds with native_libs tag
- Add native_bridge_nocgo.go stub for non-CGO builds
- Update queue errors and task handling for native integration
- Update worker config and runloop for native library support
2026-02-16 20:38:30 -05:00
Jeremie Fraeys
2e701340e5
feat(core): API, worker, queue, and manifest improvements
- Add protocol buffer optimizations (internal/api/protocol.go)
- Add filesystem queue backend (internal/queue/filesystem_queue.go)
- Add run manifest support (internal/manifest/run_manifest.go)
- Worker and jupyter task refinements
- Exported test wrappers for benchmarking
2026-02-12 12:05:17 -05:00
Jeremie Fraeys
82034c68f3 feat(worker): add integrity checks, snapshot staging, and prewarm support 2026-01-05 12:31:13 -05:00