Commit graph

6 commits

Author SHA1 Message Date
Jeremie Fraeys
3fb6902fa1
feat(worker): integrate scheduler endpoints and security hardening
Update worker system for scheduler integration:
- Worker server with scheduler registration
- Configuration with scheduler endpoint support
- Artifact handling with integrity verification
- Container executor with supply chain validation
- Local executor enhancements
- GPU detection improvements (cross-platform)
- Error handling with execution context
- Factory pattern for executor instantiation
- Hash integrity with native library support
2026-02-26 12:06:16 -05:00
Jeremie Fraeys
92aab06d76
feat(security): implement comprehensive security hardening phases 1-5,7
Implements defense-in-depth security for HIPAA and multi-tenant requirements:

**Phase 1 - File Ingestion Security:**
- SecurePathValidator with symlink resolution and path boundary enforcement
  in internal/fileutil/secure.go
- Magic bytes validation for ML artifacts (safetensors, GGUF, HDF5, numpy)
  in internal/fileutil/filetype.go
- Dangerous extension blocking (.pt, .pkl, .exe, .sh, .zip)
- Upload limits (10GB size, 100MB/s rate, 10 uploads/min)

**Phase 2 - Sandbox Hardening:**
- ApplySecurityDefaults() with secure-by-default principle
  - network_mode: none, read_only_root: true, no_new_privileges: true
  - drop_all_caps: true, user_ns: true, run_as_uid/gid: 1000
- PodmanSecurityConfig and BuildSecurityArgs() in internal/container/podman.go
- BuildPodmanCommand now accepts full security configuration
- Container executor passes SandboxConfig to Podman command builder
- configs/seccomp/default-hardened.json blocks dangerous syscalls
  (ptrace, mount, reboot, kexec_load, open_by_handle_at)

**Phase 3 - Secrets Management:**
- expandSecrets() for environment variable expansion using ${VAR} syntax
- validateNoPlaintextSecrets() with entropy-based detection
- Pattern matching for AWS, GitHub, GitLab, OpenAI, Stripe tokens
- Shannon entropy calculation (>4 bits/char triggers detection)
- Secrets expanded during LoadConfig() before validation

**Phase 5 - HIPAA Audit Logging:**
- Tamper-evident chain hashing with SHA-256 in internal/audit/audit.go
- Event struct extended with PrevHash, EventHash, SequenceNum
- File access event types: EventFileRead, EventFileWrite, EventFileDelete
- LogFileAccess() helper for HIPAA compliance
- VerifyChain() function for tamper detection

**Supporting Changes:**
- Add DeleteJob() and DeleteJobsByPrefix() to storage package
- Integrate SecurePathValidator in artifact scanning
2026-02-23 18:00:33 -05:00
Jeremie Fraeys
7194826871
feat: implement research-grade maintainability phases 1,3,4,7
Phase 1: Event Sourcing
- Add TaskEvent types (queued, started, completed, failed, etc.)
- Create EventStore with Redis Streams (append-only)
- Support event querying by task ID and time range

Phase 3: Diagnosable Failures
- Enhance TaskExecutionError with Context map, Timestamp, Recoverable flag
- Update container.go to populate error context (image, GPU, duration)
- Add WithContext helper for building error context
- Create cmd/errors CLI for querying task errors

Phase 4: Testable Security
- Add security fields to PodmanConfig (Privileged, Network, ReadOnlyMounts)
- Create ValidateSecurityPolicy() with ErrSecurityViolation
- Add security contract tests (privileged rejection, host network rejection)
- Tests serve as executable security documentation

Phase 7: Reproducible Builds
- Add BuildHash and BuildTime ldflags to Makefile
- Create verify-build target for reproducibility testing
- Add -version and -verify flags to api-server

All tests pass:
- go test ./internal/errtypes/...
- go test ./internal/container/... -run Security
- go test ./internal/queue/...
- go build ./cmd/api-server/...
2026-02-18 15:27:50 -05:00
Jeremie Fraeys
a1ce267b86
feat: Implement all worker stub methods with real functionality
- VerifySnapshot: SHA256 verification using integrity package
- EnforceTaskProvenance: Strict and best-effort provenance validation
- RunJupyterTask: Full Jupyter service lifecycle (start/stop/remove/restore/list_packages)
- RunJob: Job execution using executor.JobRunner
- PrewarmNextOnce: Prewarming with queue integration

All methods now use new architecture components instead of placeholders
2026-02-17 17:37:56 -05:00
Jeremie Fraeys
4c8c9dfe4b
refactor: Export SelectDependencyManifest for API helpers
- Renamed selectDependencyManifest to SelectDependencyManifest (exported)
- Added re-export in worker package for backward compatibility
- Updated internal call in container.go to use exported function
- API helpers can now access via worker.SelectDependencyManifest

Build status: Compiles successfully
2026-02-17 16:45:59 -05:00
Jeremie Fraeys
22f3d66f1d
refactor: Phase 2 - Extract executor implementations
Created executor package with extracted job execution logic:

1. internal/worker/executor/local.go (104 lines)
   - LocalExecutor implements JobExecutor interface
   - Execute() method for local bash script execution
   - generateScript() helper for creating experiment scripts

2. internal/worker/executor/container.go (229 lines)
   - ContainerExecutor implements JobExecutor interface
   - Execute() method for podman container execution
   - EnvironmentPool interface for image caching
   - Tracking tool provisioning (MLflow, TensorBoard, Wandb)
   - Volume and cache setup
   - selectDependencyManifest() helper

3. internal/worker/executor/runner.go (131 lines)
   - JobRunner orchestrates execution
   - ExecutionMode enum (Auto, Local, Container)
   - Run() method with directory setup and executor selection
   - finalize() for success/failure handling

Key design decisions:
- Executors depend on interfaces (ManifestWriter, not Worker)
- JobRunner composes both executors
- No direct Worker dependencies in executor package
- SetupJobDirectories reused from execution package

Build status: Compiles successfully
2026-02-17 14:14:04 -05:00