Commit graph

329 commits

Author SHA1 Message Date
Jeremie Fraeys
98a0d42213
deploy: consolidate docker-compose files using profiles
- Merge logs-debug.yml into test.yml with 'debug' profile
- Merge local.yml into dev.yml with 'local' profile
- Merge prod.smoke.yml into prod.yml with 'smoke' profile
- Reduces compose files from 8 to 5, simplifies maintenance
- Update TEST_COMPOSE to use deployments/docker-compose.test.yml
2026-03-04 13:22:17 -05:00
Jeremie Fraeys
16343e6c2a
test(kms): add comprehensive unit and integration tests
Unit tests for DEK cache:
- Put/Get operations, TTL expiry, LRU eviction
- Tenant isolation, flush/clear, stats, empty DEK rejection

Unit tests for KMS protocol:
- Encrypt/decrypt round-trip with MemoryProvider
- Multi-tenant isolation (wrong key fails MAC verification)
- Cache hit verification, key rotation flow
- Health check protocol

Integration tests with testcontainers:
- VaultProvider with hashicorp/vault:1.15 container
- AWSProvider with localstack/localstack container
- TenantKeyManager end-to-end with MemoryProvider
2026-03-03 19:14:31 -05:00
Jeremie Fraeys
e1ec255ad2
refactor(crypto): integrate KMS with TenantKeyManager
Replace in-memory root keys with KMS interface:
- GenerateDataEncryptionKey: generate DEK, wrap via KMS, cache
- UnwrapDataEncryptionKey: cache check, KMS decrypt, cache store
- EncryptArtifact/DecryptArtifact: use DEK from KMS
- RotateTenantKey: create new KMS key, flush cache
- RevokeTenant: disable KMS key, schedule deletion per ADR-015
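The generate/unwrap flow listed above is a form of envelope encryption. The following is a minimal sketch of it, not the repository's code: the two-method `KMS` interface, the `fakeKMS` stand-in, the plain map used as a cache, and the function names `GenerateDEK` and `UnwrapDEK` are all assumptions made for illustration.

```go
package main

import (
	"crypto/rand"
	"errors"
)

// KMS is a hypothetical subset of a provider interface: it only wraps and
// unwraps data encryption keys (DEKs) under a named key.
type KMS interface {
	Encrypt(keyID string, plaintext []byte) ([]byte, error)
	Decrypt(keyID string, ciphertext []byte) ([]byte, error)
}

// fakeKMS is a toy stand-in for tests. XOR with a fixed byte is NOT
// encryption; a real provider would call Vault Transit or AWS KMS.
type fakeKMS struct{ disabled map[string]bool }

func (f *fakeKMS) xor(keyID string, in []byte) ([]byte, error) {
	if f.disabled[keyID] {
		return nil, errors.New("kms: key disabled")
	}
	out := make([]byte, len(in))
	for i, b := range in {
		out[i] = b ^ 0x5a
	}
	return out, nil
}

func (f *fakeKMS) Encrypt(id string, p []byte) ([]byte, error) { return f.xor(id, p) }
func (f *fakeKMS) Decrypt(id string, c []byte) ([]byte, error) { return f.xor(id, c) }

// GenerateDEK creates a random 32-byte DEK and wraps it via the KMS.
// Only the wrapped form should be persisted next to the artifact.
func GenerateDEK(k KMS, keyID string) (dek, wrapped []byte, err error) {
	dek = make([]byte, 32)
	if _, err = rand.Read(dek); err != nil {
		return nil, nil, err
	}
	if wrapped, err = k.Encrypt(keyID, dek); err != nil {
		return nil, nil, err
	}
	return dek, wrapped, nil
}

// UnwrapDEK checks the cache first, then asks the KMS and stores the result.
func UnwrapDEK(k KMS, cache map[string][]byte, keyID string, wrapped []byte) ([]byte, error) {
	cacheKey := keyID + "|" + string(wrapped)
	if dek, ok := cache[cacheKey]; ok {
		return dek, nil
	}
	dek, err := k.Decrypt(keyID, wrapped)
	if err != nil {
		return nil, err
	}
	cache[cacheKey] = dek
	return dek, nil
}
```

In this sketch a cached DEK stays usable even after the KMS key is disabled, which illustrates why a rotation or revocation step would also need to flush the cache.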

Remove deprecated methods: wrapKey, unwrapKey (replaced by KMS)
2026-03-03 19:14:27 -05:00
Jeremie Fraeys
7c03c8b5bd
feat(kms): add HashiCorp Vault and AWS KMS providers
Implement VaultProvider with Transit engine:
- AppRole, Kubernetes, and Token authentication
- Encrypt/Decrypt via /transit/encrypt and /transit/decrypt
- Key lifecycle via /transit/keys API
- Health check via /sys/health

Implement AWSProvider with SDK v2:
- Per-region key naming with alias prefix
- Encrypt/Decrypt via KMS SDK
- Key lifecycle (CreateKey, Disable, ScheduleDeletion, Enable)
- AWS endpoint support for LocalStack testing
2026-03-03 19:14:21 -05:00
Jeremie Fraeys
cb25677695
feat(kms): implement core KMS infrastructure with DEK cache
Add KMSProvider interface for external key management systems:
- Encrypt/Decrypt operations for DEK wrapping
- Key lifecycle management (Create, Disable, ScheduleDeletion, Enable)
- HealthCheck and Close methods

Implement MemoryProvider for development/testing:
- XOR encryption with HMAC-SHA256 authentication
- Secure random key generation using crypto/rand
- MAC verification to detect wrong keys
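The commit does not say how the XOR keystream is produced, so the sketch below makes its own choices: it derives the keystream from HMAC-SHA256 over a random nonce and a block counter, and appends an HMAC-SHA256 tag over the nonce and ciphertext. The names `Seal`, `Open`, `prf`, and `xorStream`, the 16-byte nonce, and the `nonce || ciphertext || tag` layout are assumptions. Like the MemoryProvider it illustrates, this is meant for development and testing only.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/binary"
	"errors"
)

const nonceSize = 16

// prf returns HMAC-SHA256(key, label || data...). The label separates the
// keystream use ('k') from the MAC use ('m') of the same key.
func prf(key []byte, label byte, data ...[]byte) []byte {
	m := hmac.New(sha256.New, key)
	m.Write([]byte{label})
	for _, d := range data {
		m.Write(d)
	}
	return m.Sum(nil)
}

// xorStream XORs src with a keystream derived from key, nonce, and a counter.
// Applying it twice with the same key and nonce restores the input.
func xorStream(key, nonce, src []byte) []byte {
	out := make([]byte, len(src))
	var ctr [8]byte
	for off := 0; off < len(src); off += sha256.Size {
		binary.BigEndian.PutUint64(ctr[:], uint64(off/sha256.Size))
		block := prf(key, 'k', nonce, ctr[:])
		for i := 0; i < sha256.Size && off+i < len(src); i++ {
			out[off+i] = src[off+i] ^ block[i]
		}
	}
	return out
}

// Seal returns nonce || ciphertext || tag.
func Seal(key, plaintext []byte) ([]byte, error) {
	nonce := make([]byte, nonceSize)
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	ct := xorStream(key, nonce, plaintext)
	tag := prf(key, 'm', nonce, ct)
	out := append(nonce, ct...)
	return append(out, tag...), nil
}

// Open verifies the tag in constant time before decrypting, so a wrong key
// or a modified blob is rejected instead of yielding garbage.
func Open(key, sealed []byte) ([]byte, error) {
	if len(sealed) < nonceSize+sha256.Size {
		return nil, errors.New("sealed blob too short")
	}
	nonce := sealed[:nonceSize]
	ct := sealed[nonceSize : len(sealed)-sha256.Size]
	tag := sealed[len(sealed)-sha256.Size:]
	if !hmac.Equal(tag, prf(key, 'm', nonce, ct)) {
		return nil, errors.New("MAC verification failed")
	}
	return xorStream(key, nonce, ct), nil
}
```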

Implement DEKCache per ADR-012:
- 15-minute TTL with configurable grace window (1 hour)
- LRU eviction with 1000 entry limit
- Cache key includes (tenantID, artifactID, kmsKeyID) for isolation
- Thread-safe operations with RWMutex
- Secure memory wiping on eviction/cleanup
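A cache with these properties can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's `DEKCache`: the type and method names are hypothetical, the grace window is omitted, and a plain `sync.Mutex` replaces the `RWMutex` named in the commit because `Get` reorders the LRU list and therefore writes.

```go
package main

import (
	"container/list"
	"sync"
	"time"
)

// cacheKey mirrors the (tenantID, artifactID, kmsKeyID) tuple, so entries
// from different tenants or KMS keys can never collide.
type cacheKey struct{ TenantID, ArtifactID, KMSKeyID string }

type entry struct {
	key     cacheKey
	dek     []byte
	expires time.Time
}

// DEKCache is an LRU cache with a per-entry TTL.
type DEKCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	max   int
	order *list.List                 // front = most recently used
	items map[cacheKey]*list.Element // key -> element holding *entry
}

func NewDEKCache(ttl time.Duration, max int) *DEKCache {
	return &DEKCache{ttl: ttl, max: max, order: list.New(),
		items: make(map[cacheKey]*list.Element)}
}

// remove drops an element and zeroes its key material. Caller holds c.mu.
func (c *DEKCache) remove(el *list.Element) {
	e := el.Value.(*entry)
	for i := range e.dek {
		e.dek[i] = 0
	}
	delete(c.items, e.key)
	c.order.Remove(el)
}

// Put stores a private copy of dek and evicts the least recently used
// entries beyond the size limit. Empty DEKs are rejected.
func (c *DEKCache) Put(k cacheKey, dek []byte) bool {
	if len(dek) == 0 {
		return false
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[k]; ok {
		c.remove(el)
	}
	e := &entry{key: k, dek: append([]byte(nil), dek...), expires: time.Now().Add(c.ttl)}
	c.items[k] = c.order.PushFront(e)
	for c.order.Len() > c.max {
		c.remove(c.order.Back())
	}
	return true
}

// Get returns a copy of the DEK if it is present and not expired.
func (c *DEKCache) Get(k cacheKey) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[k]
	if !ok {
		return nil, false
	}
	e := el.Value.(*entry)
	if !time.Now().Before(e.expires) {
		c.remove(el) // expired: wipe and report a miss
		return nil, false
	}
	c.order.MoveToFront(el)
	return append([]byte(nil), e.dek...), true
}
```

With the values from the commit, a caller would construct it as `NewDEKCache(15*time.Minute, 1000)`. Returning copies keeps callers from holding a slice that the cache later zeroes on eviction.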

Add config package with types:
- ProviderType enum (vault, aws, memory)
- VaultConfig with AppRole/Kubernetes/Token auth
- AWSConfig with region and alias prefix
- CacheConfig with TTL, MaxEntries, GraceWindow
- Validation methods for all config types
2026-03-03 19:13:55 -05:00
Jeremie Fraeys
da104367d6
feat: add Plugin GPU Quota implementation and tests
- Add plugin_quota.go with GPU quota management for scheduler
- Update scheduler hub and protocol for plugin support
- Add comprehensive plugin quota unit tests
- Update gang service and WebSocket queue integration tests
2026-02-26 14:35:05 -05:00
Jeremie Fraeys
ef05f200ba
build: add Scheduler & Services section to Makefile help
- Add new help section for scheduler and service targets
- Document dev-up, dev-down, prod-up, prod-down targets
2026-02-26 14:34:58 -05:00
Jeremie Fraeys
a653a2d0ed
ci: add plugin, quota, and scheduler tests to workflows
- Add plugin quota, service templates, scheduler tests to ci.yml
- Add vLLM plugin and audit logging test steps
- Add plugin configuration validation to security-modes-test.yml:
  - Verify HIPAA mode disables plugins
  - Verify standard mode enables plugins with security
  - Verify dev mode enables plugins with relaxed security
2026-02-26 14:34:49 -05:00
Jeremie Fraeys
b3a0c78903
config: add Plugin GPU Quota, plugins, and audit logging to configs
- Add Plugin GPU Quota config section to scheduler.yaml.example
- Add audit logging config to homelab-secure.yaml (HIPAA-compliant)
- Add Jupyter and vLLM plugin configs to all worker configs:
  - Security settings (passwords, trusted channels, blocked packages)
  - Resource limits (GPU, memory, CPU)
  - Model cache paths and quantization options for vLLM
- Disable plugins in HIPAA deployment mode for compliance
- Update deployments README with plugin services and GPU quotas
2026-02-26 14:34:42 -05:00
Jeremie Fraeys
90ea18555c
docs: add vLLM workflow and cross-link documentation
- Add new vLLM workflow documentation (vllm-workflow.md)
- Update scheduler-architecture.md with Plugin GPU Quota and audit logging
- Add See Also sections to jupyter-workflow.md, quick-start.md,
  configuration-reference.md for better navigation
- Update landing page and index with vLLM and scheduler links
- Cross-link all documentation for improved discoverability
2026-02-26 13:04:39 -05:00
Jeremie Fraeys
8f2495deb0
chore(cleanup): remove obsolete files and update .gitignore
Remove deprecated components replaced by new scheduler:
- Delete internal/controller/pacing_controller.go (replaced by scheduler/pacing.go)
- Delete internal/manifest/schema_test.go (consolidated into tests/unit/)
- Delete internal/workertest/worker.go (consolidated into tests/fixtures/)
- Update .gitignore with scheduler binary and new patterns
2026-02-26 12:09:18 -05:00
Jeremie Fraeys
dddc2913e1
chore(tools): update scripts, native libs, and documentation
Update tooling and documentation:
- Smoke test script with scheduler health checks
- Release cleanup script
- Native test scripts with Redis integration
- TUI SSH test script
- Performance regression detector with scheduler metrics
- Profiler with distributed tracing
- Native CMake with test targets
- Dataset hash tests
- Storage symlink resistance tests
- Configuration reference documentation updates
2026-02-26 12:08:58 -05:00
Jeremie Fraeys
d87c556afa
test(all): update test suite for scheduler and security features
Update comprehensive test coverage:
- E2E tests with scheduler integration
- Integration tests with tenant isolation
- Unit tests with security assertions
- Security tests with audit validation
- Audit verification tests
- Auth tests with tenant scoping
- Config validation tests
- Container security tests
- Worker tests with scheduler mock
- Environment pool tests
- Load tests with distributed patterns
- Test fixtures with scheduler support
- Update go.mod/go.sum with new dependencies
2026-02-26 12:08:46 -05:00
Jeremie Fraeys
c459285cab
chore(deploy): update deployment configs and TUI for scheduler
Update deployment and CLI tooling:
- TUI models (jobs, state) with scheduler data
- TUI store with scheduler endpoints
- TUI config with scheduler settings
- Deployment Makefile with scheduler targets
- Deploy script with scheduler registration
- Docker Compose files with scheduler services
- Remove obsolete Dockerfiles (api-server, full-prod, test)
- Update remaining Dockerfiles with scheduler integration
2026-02-26 12:08:31 -05:00
Jeremie Fraeys
4cdb68907e
refactor(utilities): update supporting modules for scheduler integration
Update utility modules:
- File utilities with secure file operations
- Environment pool with resource tracking
- Error types with scheduler error categories
- Logging with audit context support
- Network/SSH with connection pooling
- Privacy/PII handling with tenant boundaries
- Resource manager with scheduler allocation
- Security monitor with audit integration
- Tracking plugins (MLflow, TensorBoard) with auth
- Crypto signing with tenant keys
- Database init with multi-user support
2026-02-26 12:07:15 -05:00
Jeremie Fraeys
6866ba9366
refactor(queue): integrate scheduler backend and storage improvements
Update queue and storage systems for scheduler integration:
- Queue backend with scheduler coordination
- Filesystem queue with batch operations
- Deduplication with tenant-aware keys
- Storage layer with audit logging hooks
- Domain models (Task, Events, Errors) with scheduler fields
- Database layer with tenant isolation
- Dataset storage with integrity checks
2026-02-26 12:06:46 -05:00
Jeremie Fraeys
6b2c377680
refactor(jupyter): enhance security and scheduler integration
Update Jupyter integration for security and scheduler support:
- Enhanced security configuration with audit logging
- Health monitoring with scheduler event integration
- Package manager with network policy enforcement
- Service manager with lifecycle hooks
- Network manager with tenant isolation
- Workspace metadata with tenant tags
- Config with resource limits
- Podman container integration improvements
- Experiment manager with tracking integration
- Manifest runner with security checks
2026-02-26 12:06:35 -05:00
Jeremie Fraeys
3fb6902fa1
feat(worker): integrate scheduler endpoints and security hardening
Update worker system for scheduler integration:
- Worker server with scheduler registration
- Configuration with scheduler endpoint support
- Artifact handling with integrity verification
- Container executor with supply chain validation
- Local executor enhancements
- GPU detection improvements (cross-platform)
- Error handling with execution context
- Factory pattern for executor instantiation
- Hash integrity with native library support
2026-02-26 12:06:16 -05:00
Jeremie Fraeys
ef11d88a75
refactor(auth): add tenant scoping and permission enhancements
Update authentication system for multi-tenant support:
- API key management with tenant scoping
- Permission checks for multi-tenant operations
- Database layer with tenant isolation
- Keychain integration with audit logging
2026-02-26 12:06:08 -05:00
Jeremie Fraeys
420de879ff
feat(api): integrate scheduler protocol and WebSocket enhancements
Update API layer for scheduler integration:
- WebSocket handlers with scheduler protocol support
- Jobs WebSocket endpoint with priority queue integration
- Validation middleware for scheduler messages
- Server configuration with security hardening
- Protocol definitions for worker-scheduler communication
- Dataset handlers with tenant isolation checks
- Response helpers with audit context
- OpenAPI spec updates for new endpoints
2026-02-26 12:05:57 -05:00
Jeremie Fraeys
9b2d5986a3
docs(architecture): add technical documentation for scheduler and security
Add comprehensive architecture documentation:
- scheduler-architecture.md - Design of distributed job scheduler
  - Hub coordination model
  - Gang scheduling algorithm
  - Service discovery mechanisms
  - Failure recovery strategies

- multi-tenant-security.md - Security isolation patterns
  - Tenant boundary enforcement
  - Resource quota management
  - Cross-tenant data protection

- runtime-security.md - Operational security guidelines
  - Container security configurations
  - Network policy enforcement
  - Audit logging requirements
2026-02-26 12:04:33 -05:00
Jeremie Fraeys
685f79c4a7
ci(deploy): add Forgejo workflows and deployment automation
Add CI/CD pipelines for Forgejo/GitHub Actions:
- build.yml - Main build pipeline with matrix builds
- deploy-staging.yml - Automated staging deployment
- deploy-prod.yml - Production deployment with rollback support
- security-modes-test.yml - Security mode validation tests

Add deployment artifacts:
- docker-compose.staging.yml for staging environment
- ROLLBACK.md with rollback procedures and playbooks

Supports multi-environment deployment workflow with proper
gates between staging and production.
2026-02-26 12:04:23 -05:00
Jeremie Fraeys
86f9ae5a7e
docs(config): reorganize configuration structure and add documentation
Restructure configuration files for better organization:
- Add scheduler configuration examples (scheduler.yaml.example)
- Reorganize worker configs into subdirectories:
  - distributed/ - Multi-node cluster configurations
  - standalone/ - Single-node deployment configs
- Add environment-specific configs:
  - dev-local.yaml, docker-dev.yaml, docker-prod.yaml
  - homelab-secure.yaml, worker-prod.toml
- Add deployment configs for different security modes:
  - docker-standard.yaml, docker-hipaa.yaml, docker-dev.yaml

Add documentation:
- configs/README.md with configuration guidelines
- configs/SECURITY.md with security configuration best practices
2026-02-26 12:04:11 -05:00
Jeremie Fraeys
95adcba437
feat(worker): add Jupyter/vLLM plugins and process isolation
Extend worker capabilities with new execution plugins and security features:
- Jupyter plugin for notebook-based ML experiments
- vLLM plugin for LLM inference workloads
- Cross-platform process isolation (Unix/Windows)
- Network policy enforcement with platform-specific implementations
- Service manager integration for lifecycle management
- Scheduler backend integration for queue coordination

Update lifecycle management:
- Enhanced runloop with state transitions
- Service manager integration for plugin coordination
- Improved state persistence and recovery

Add test coverage:
- Unit tests for Jupyter and vLLM plugins
- Updated worker execution tests
2026-02-26 12:03:59 -05:00
Jeremie Fraeys
a981e89005
feat(security): add audit subsystem and tenant isolation
Implement comprehensive audit and security infrastructure:
- Immutable audit logs with platform-specific backends (Linux/Other)
- Sealed log entries with tamper-evident checksums
- Audit alert system for real-time security notifications
- Log rotation with retention policies
- Checkpoint-based audit verification

Add multi-tenant security features:
- Tenant manager with quota enforcement
- Middleware for tenant authentication/authorization
- Per-tenant cryptographic key isolation
- Supply chain security for container verification
- Cross-platform secure file utilities (Unix/Windows)

Add test coverage:
- Unit tests for audit alerts and sealed logs
- Platform-specific audit backend tests
2026-02-26 12:03:45 -05:00
Jeremie Fraeys
43e6446587
feat(scheduler): implement multi-tenant job scheduler with gang scheduling
Add new scheduler component for distributed ML workload orchestration:
- Hub-based coordination for multi-worker clusters
- Pacing controller for rate limiting job submissions
- Priority queue with preemption support
- Port allocator for dynamic service discovery
- Protocol handlers for worker-scheduler communication
- Service manager with OS-specific implementations
- Connection management and state persistence
- Template system for service deployment
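As an illustration of the "priority queue with preemption support" item, here is a minimal sketch built on Go's `container/heap`. The `Job`, `Queue`, `Submit`, `Next`, and `ShouldPreempt` names are hypothetical, and the rules used here (higher number wins, ties served in submission order, preempt only when the queue head strictly outranks the running job) are assumptions, not the scheduler's documented policy.

```go
package main

import "container/heap"

// Job is a hypothetical queued unit of work.
type Job struct {
	ID       string
	Priority int    // higher runs first
	seq      uint64 // submission order, breaks ties FIFO
}

// jobHeap implements heap.Interface as a max-heap on Priority.
type jobHeap []*Job

func (h jobHeap) Len() int { return len(h) }
func (h jobHeap) Less(i, j int) bool {
	if h[i].Priority != h[j].Priority {
		return h[i].Priority > h[j].Priority
	}
	return h[i].seq < h[j].seq
}
func (h jobHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *jobHeap) Push(x any)   { *h = append(*h, x.(*Job)) }
func (h *jobHeap) Pop() any {
	old := *h
	n := len(old)
	j := old[n-1]
	old[n-1] = nil // drop the reference so the job can be collected
	*h = old[:n-1]
	return j
}

// Queue wraps the heap and assigns submission sequence numbers.
// It is not safe for concurrent use without external locking.
type Queue struct {
	h    jobHeap
	next uint64
}

// Submit enqueues a job with the given priority.
func (q *Queue) Submit(id string, priority int) {
	heap.Push(&q.h, &Job{ID: id, Priority: priority, seq: q.next})
	q.next++
}

// Next removes and returns the highest-priority job, if any.
func (q *Queue) Next() (*Job, bool) {
	if q.h.Len() == 0 {
		return nil, false
	}
	return heap.Pop(&q.h).(*Job), true
}

// ShouldPreempt reports whether the queue head strictly outranks a running job.
func (q *Queue) ShouldPreempt(running *Job) bool {
	return q.h.Len() > 0 && q.h[0].Priority > running.Priority
}
```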

Includes comprehensive test suite:
- Unit tests for all core components
- Integration tests for distributed scenarios
- Benchmark tests for performance validation
- Mock fixtures for isolated testing

Refs: scheduler-architecture.md
2026-02-26 12:03:23 -05:00
Jeremie Fraeys
6e0e7d9d2e
fix(smoke-test): copy promtail config file instead of directory
Copy just promtail-config.yml to temp root instead of entire monitoring/
directory. This fixes the mount error where promtail couldn't find its
config at the expected path.
2026-02-24 11:57:35 -05:00
Jeremie Fraeys
bcc432a524
fix(deployments): use relative paths instead of FETCHML_REPO_ROOT with wrong fallback
Replace all .. with proper relative paths:

- Build context: Use '.' (current directory = project root when using --project-directory)
- Volume mounts: Use './data/...' instead of '../data/...'
- Config mounts: Use './configs/...' instead of '../configs/...'

The '..' fallback was incorrect - when --project-directory is set to repo root,
'..' would point to parent of repo instead of repo itself. Using '.' or
'./path' correctly resolves relative to project root.

Environment variables for data directories (SMOKE_TEST_DATA_DIR, PROD_DATA_DIR,
HOMELAB_DATA_DIR, LOCAL_DATA_DIR) are preserved for runtime customization.
2026-02-24 11:53:19 -05:00
Jeremie Fraeys
cebcb6115f
fix(smoke-test): add FETCHML_REPO_ROOT to env file
Ensure FETCHML_REPO_ROOT is set in the env file passed to docker-compose.
This fixes path resolution so fallback paths don't incorrectly use parent directory.
2026-02-24 11:48:10 -05:00
Jeremie Fraeys
3ff5ef320a
fix(deployments): add HOMELAB_DATA_DIR support to homelab-secure
Update docker-compose.homelab-secure.yml to use HOMELAB_DATA_DIR
environment variable with fallback to data/homelab for all volume mounts.
2026-02-24 11:43:38 -05:00
Jeremie Fraeys
5691b06876
fix(deployments): add env var support for data directories
Update all docker-compose files to use environment variables for data paths:

- docker-compose.local.yml: Use LOCAL_DATA_DIR with fallback to ../data/dev
- docker-compose.prod.yml: Use PROD_DATA_DIR with fallback to data/prod
- docker-compose.prod.smoke.yml: Use SMOKE_TEST_DATA_DIR with fallback

This allows smoke tests and local development to use temp directories
instead of repo-relative paths, avoiding file sharing permission issues
on macOS with Docker Desktop or Colima.
2026-02-24 11:43:11 -05:00
Jeremie Fraeys
ce4106a837
fix(smoke-test): copy monitoring configs to temp directory
Promtail mounts monitoring configs from repo root which fails in Colima:

- Copy monitoring/ directory to temp SMOKE_TEST_DATA_DIR
- Update promtail volume path to use SMOKE_TEST_DATA_DIR for configs
- This ensures all mounts are from accessible temp directories
2026-02-24 11:40:32 -05:00
Jeremie Fraeys
225ef5bfb5
fix(smoke-test): use actual env file instead of process substitution
Process substitution <(echo ...) doesn't work with docker-compose.
Write the env file to an actual temp file instead.
2026-02-24 11:38:18 -05:00
Jeremie Fraeys
bff2336db2
fix(smoke-test): use temp directory for smoke test data
Use /tmp for smoke test data to avoid file sharing issues on macOS/Colima:

- smoke-test.sh: Create temp dir with mktemp, export SMOKE_TEST_DATA_DIR
- docker-compose.dev.yml: Use SMOKE_TEST_DATA_DIR with fallback to data/dev
- Remove file sharing permission checks (no longer needed with tmp)

This avoids Docker Desktop/Colima file sharing permission issues entirely
by using a system temp directory that's always accessible.
2026-02-24 11:37:45 -05:00
Jeremie Fraeys
d3a861063f
fix(smoke-test): add Colima-specific file sharing instructions
Detect if user is running Colima and provide appropriate fix instructions:

- Check for colima command presence
- If Colima detected: suggest virtiofs/sshfs mount options
- Show colima.yaml mount configuration example
- Include verification command: colima ssh -- ls ...

Maintains Docker Desktop instructions for non-Colima users.
2026-02-24 11:35:58 -05:00
Jeremie Fraeys
00f938861c
fix(smoke-test): add Docker file sharing permission check for macOS
Add pre-flight check to detect Docker Desktop file sharing issues:

- After creating data directories, verify Docker can access them
- If access fails, print helpful error message with fix instructions
- Directs users to Docker Desktop Settings -> Resources -> File sharing

Prevents confusing 'operation not permitted' errors during smoke tests.
2026-02-24 11:35:23 -05:00
Jeremie Fraeys
8a054169ad
fix(docker): skip NVML GPU build for non-GPU systems
Dockerfile targets systems without GPUs:

- Add -DBUILD_NVML_GPU=OFF to cmake in simple.Dockerfile
- Add BUILD_NVML_GPU option to native/CMakeLists.txt (default ON)
- Conditionally include nvml_gpu subdirectory
- Update all_native_libs target to exclude nvml_gpu when disabled

This allows native libraries (dataset_hash, queue_index) to build
without requiring NVIDIA drivers/libraries.
2026-02-23 20:47:13 -05:00
Jeremie Fraeys
2a41032414
fix(deployments): fix docker-compose build context paths
Fix build context resolution in smoke test scripts:

- docker-compose.dev.yml: Use ${FETCHML_REPO_ROOT:-..} for api-server and worker
- docker-compose.prod.smoke.yml: Simplify dockerfile path (remove redundant FETCHML_REPO_ROOT)

Previously used 'context: ..' which resolved incorrectly when docker-compose
was run with --project-directory. Now consistently uses FETCHML_REPO_ROOT env var
for proper path resolution in both dev and prod smoke tests.
2026-02-23 20:30:07 -05:00
Jeremie Fraeys
6fc2e373c1
fix: resolve IDE warnings and test errors
Bug fixes and cleanup for test infrastructure:

- schema_test.go: Fix SchemaVersion reference with proper manifest import
- schema_test.go: Update all schema.json paths to internal/manifest location
- manifestenv.go: Remove unused helper functions (isArtifactsType, getPackagePath)
- nobaredetector.go: Fix exprToString syntax error, remove unused functions

All tests now pass without errors or warnings
2026-02-23 20:26:20 -05:00
Jeremie Fraeys
799afb9efa
docs: update coverage map and development documentation
Comprehensive documentation updates for 100% test coverage:

- TEST_COVERAGE_MAP.md: 49/49 requirements marked complete (100% coverage)
- CHANGELOG.md: Document Phase 8 test coverage implementation
- DEVELOPMENT.md: Add testing strategy and property-based test guidelines
- README.md: Add Testing & Security section with coverage highlights

All security and reproducibility requirements now tracked and tested
2026-02-23 20:26:13 -05:00
Jeremie Fraeys
e0aae73cf4
test(phase-7-9): audit verification, fault injection, integration tests
Implement V.7, V.9, and integration test requirements:

Audit Verification (V.7):
- TestAuditVerificationJob: Chain verification and tamper detection

Fault Injection (V.9):
- TestNVMLUnavailableProvenanceFail, TestManifestWritePartialFailure
- TestRedisUnavailableQueueBehavior, TestAuditLogUnavailableHaltsJob
- TestConfigHashFailureProvenanceClosed, TestDiskFullDuringArtifactScan

Integration Tests:
- TestCrossTenantIsolation: Filesystem isolation verification
- TestRunManifestReproducibility: Cross-run reproducibility
- TestAuditLogPHIRedaction: PHI leak prevention
2026-02-23 20:26:01 -05:00
Jeremie Fraeys
80370e9f4a
test(phase-6): property-based tests with gopter
Implement property-based invariant verification:

- TestPropertyConfigHashAlwaysPresent: Valid configs produce non-empty hash
- TestPropertyConfigHashDeterministic: Same config produces same hash
- TestPropertyDetectionSourceAlwaysValid: CreateDetectorWithInfo returns valid source
- TestPropertyProvenanceFailClosed: Strict mode fails on incomplete env
- TestPropertyScanArtifactsNeverNilEnvironment: Artifacts can hold Environment
- TestPropertyManifestEnvironmentSurvivesRoundtrip: Environment survives write/load

Uses gopter for property-based testing with deterministic seeds
2026-02-23 20:25:49 -05:00
Jeremie Fraeys
9f9d75dd68
test(phase-4): reproducibility crossover tests
Implement reproducibility crossover requirements:

- TestManifestEnvironmentCapture: Environment population with ConfigHash and DetectionMethod
- TestConfigHashPostDefaults: Hash computation after env expansion and defaults

Verifies manifest.Environment is properly populated for reproducibility tracking
2026-02-23 20:25:37 -05:00
Jeremie Fraeys
8f9bcef754
test(phase-3): prerequisite security and reproducibility tests
Implement 4 prerequisite test requirements:

- TestConfigIntegrityVerification: Config signing, tamper detection, hash stability
- TestManifestFilenameNonce: Cryptographic nonce generation and filename patterns
- TestGPUDetectionAudit: Structured logging of GPU detection at startup
- TestResourceEnvVarParsing: Resource env var parsing and override behavior

Also update manifest run_manifest.go:
- Add nonce-based filename support to WriteToDir
- Add nonce-based file detection to LoadFromDir
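The commit does not give the filename pattern, so the sketch below assumes one: `run_manifest_<32 hex chars>.json`, with the nonce drawn from `crypto/rand`. The prefix, suffix, nonce length, and the function names are all hypothetical. It shows one way a writer could generate such a name and a loader could recognize it.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"strings"
)

const (
	manifestPrefix = "run_manifest_" // assumed prefix, not taken from the repository
	manifestSuffix = ".json"
	nonceBytes     = 16 // 128-bit nonce, hex-encoded to 32 characters
)

// NewManifestFilename returns a filename containing a fresh random nonce,
// which makes the name hard to predict or pre-create.
func NewManifestFilename() (string, error) {
	nonce := make([]byte, nonceBytes)
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	return manifestPrefix + hex.EncodeToString(nonce) + manifestSuffix, nil
}

// IsManifestFilename reports whether name matches the assumed pattern:
// prefix, exactly 2*nonceBytes hex digits, suffix.
func IsManifestFilename(name string) bool {
	if !strings.HasPrefix(name, manifestPrefix) || !strings.HasSuffix(name, manifestSuffix) {
		return false
	}
	mid := strings.TrimSuffix(strings.TrimPrefix(name, manifestPrefix), manifestSuffix)
	if len(mid) != 2*nonceBytes {
		return false
	}
	_, err := hex.DecodeString(mid)
	return err == nil
}
```

A loader could list the directory and keep only entries for which `IsManifestFilename` returns true.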
2026-02-23 20:25:26 -05:00
Jeremie Fraeys
f71352202e
test(phase-1-2): naming alignment and partial test completion
Rename and enhance existing tests to align with coverage map:
- TestGPUDetectorAMDVendorAlias -> TestAMDAliasManifestRecord
- TestScanArtifacts_SkipsKnownPathsAndLogs -> TestScanExclusionsRecorded
- Add env var expansion verification to TestHIPAAValidation_InlineCredentials
- Record exclusions in manifest.Artifacts for audit trail
2026-02-23 20:25:07 -05:00
Jeremie Fraeys
a769d9a430
chore(deps): Update dependencies for verification and security features
Add dependencies for verification framework:
- golang.org/x/tools/go/analysis (custom linting)
- Testing framework updates

Add dependencies for audit system:
- crypto/sha256 for chain hashing
- encoding/hex for hash representation
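Both packages are part of Go's standard library. A hash chain built with them can be sketched as follows; the `Entry` layout, the zero-byte separator, and the `Append`/`Verify` names are assumptions for illustration and say nothing about the actual audit log format.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// Entry is a hypothetical audit record linked to its predecessor by hash.
type Entry struct {
	Message  string
	PrevHash string // hex hash of the previous entry, "" for the first
	Hash     string // hex SHA-256 over PrevHash, a separator, and Message
}

// hashEntry computes the chained hash. The zero byte separates the two
// fields; PrevHash is hex or empty, so the encoding is unambiguous.
func hashEntry(prev, msg string) string {
	h := sha256.New()
	h.Write([]byte(prev))
	h.Write([]byte{0})
	h.Write([]byte(msg))
	return hex.EncodeToString(h.Sum(nil))
}

// Append adds a message to the chain, linking it to the current tail.
func Append(chain []Entry, msg string) []Entry {
	prev := ""
	if n := len(chain); n > 0 {
		prev = chain[n-1].Hash
	}
	return append(chain, Entry{Message: msg, PrevHash: prev, Hash: hashEntry(prev, msg)})
}

// Verify recomputes every link and returns the index of the first entry
// that does not match, or -1 if the chain is intact.
func Verify(chain []Entry) int {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || e.Hash != hashEntry(prev, e.Message) {
			return i
		}
		prev = e.Hash
	}
	return -1
}
```

Because each hash covers the previous one, editing an entry invalidates it and forces every later hash to be recomputed, which is what makes tampering detectable. An attacker who can rewrite the whole log can still rebuild the chain, so the tail hash needs to be anchored somewhere else.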

All dependencies verified compatible with Go 1.25+ toolchain.
2026-02-23 19:44:41 -05:00
Jeremie Fraeys
b33c6c4878
test(security): Add PHI denylist tests to secrets validation
Add comprehensive PHI detection tests:
- patient_id rejection
- ssn rejection
- medical_record_number rejection
- diagnosis_code rejection
- Mixed secrets with PHI rejection
- Normal secrets acceptance (HF_TOKEN, WANDB_API_KEY, etc.)

Ensures AllowedSecrets PHI denylist validation works correctly
across all PHI pattern variations.
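The behaviour these tests exercise can be sketched as follows. The four patterns come from the list above; the case-insensitive substring matching and the function name `ValidateAllowedSecrets` are assumptions, and the real validator may match differently.

```go
package main

import (
	"fmt"
	"strings"
)

// phiPatterns lists name fragments that suggest protected health information.
var phiPatterns = []string{
	"patient_id",
	"ssn",
	"medical_record_number",
	"diagnosis_code",
}

// ValidateAllowedSecrets rejects the whole list if any secret name contains
// a PHI pattern, compared case-insensitively. Substring matching is a
// deliberately blunt assumption: it can also flag unrelated names that
// happen to contain "ssn".
func ValidateAllowedSecrets(names []string) error {
	for _, name := range names {
		lower := strings.ToLower(name)
		for _, p := range phiPatterns {
			if strings.Contains(lower, p) {
				return fmt.Errorf("secret %q matches PHI pattern %q", name, p)
			}
		}
	}
	return nil
}
```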

Part of: PHI denylist validation from security plan
2026-02-23 19:44:33 -05:00
Jeremie Fraeys
fe75b6e27a
build(verification): Add Makefile targets and CI for verification suite
Add verification targets to Makefile:
- verify-schema: Check manifest schema hasn't drifted (V.1)
- test-schema-validation: Test schema validation with examples
- lint-custom: Build and run fetchml-vet analyzers (V.4)
- verify-audit: Run audit chain verification tests (V.7)
- verify-audit-chain: CLI tool for verifying specific log files
- verify-all: Run all verification checks (CI target)
- verify-quick: Fast checks for development
- verify-full: Comprehensive verification with unit/integration tests

Add install targets for verification tools:
- install-property-test-deps: gopter for property-based testing (V.2)
- install-mutation-test-deps: go-mutesting for mutation testing (V.3)
- install-security-scan-deps: gosec, nancy for supply chain (V.6)
- install-scorecard: OpenSSF Scorecard (V.10)

Add Forgejo CI workflow (.forgejo/workflows/verification.yml):
- Runs on every push and PR
- Schema drift detection
- Custom linting
- Audit chain verification
- Security scanning integration

Add verification documentation (docs/src/verification.md):
- V.1: Schema validation details
- V.4: Custom linting rules
- V.7: Audit chain verification
- CI integration guide
2026-02-23 19:44:25 -05:00
Jeremie Fraeys
17d5c75e33
fix(security): Path validation improvements for symlink resolution
Fix ValidatePath to correctly resolve symlinks and handle edge cases:
- Resolve symlinks before boundary check to prevent traversal
- Handle macOS /private prefix correctly
- Add fallback for non-existent paths (parent directory resolution)
- Double boundary checks: before AND after symlink resolution
- Prevent race conditions between check and use
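The checks listed above can be sketched as follows. This is a simplified illustration, not the repository's `ValidatePath`: the signature and the helpers `within` and `resolve` are assumptions, and the sketch does not address the check-versus-use race, which needs handling at the point where the file is opened.

```go
package main

import (
	"errors"
	"os"
	"path/filepath"
	"strings"
)

// within reports whether target is base itself or lies beneath it. Both
// paths must be absolute. A component named "..." is an ordinary filename.
func within(base, target string) bool {
	rel, err := filepath.Rel(base, target)
	if err != nil {
		return false
	}
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

// resolve follows symlinks. For a path that does not exist yet, it resolves
// the parent directory and re-appends the final element.
func resolve(p string) (string, error) {
	r, err := filepath.EvalSymlinks(p)
	if err == nil {
		return r, nil
	}
	if !errors.Is(err, os.ErrNotExist) {
		return "", err
	}
	if _, lerr := os.Lstat(p); lerr == nil {
		return "", errors.New("dangling symlink") // link exists, target does not
	}
	parent, err := filepath.EvalSymlinks(filepath.Dir(p))
	if err != nil {
		return "", err
	}
	return filepath.Join(parent, filepath.Base(p)), nil
}

// ValidatePath joins userPath onto base and checks the boundary twice:
// lexically first, then again after symlink resolution.
func ValidatePath(base, userPath string) (string, error) {
	absBase, err := filepath.Abs(base)
	if err != nil {
		return "", err
	}
	// Resolving the base as well covers prefixes such as /private on macOS.
	realBase, err := filepath.EvalSymlinks(absBase)
	if err != nil {
		return "", err
	}
	candidate := filepath.Join(absBase, userPath) // Join also cleans ".." segments
	if !within(absBase, candidate) {
		return "", errors.New("path escapes base directory")
	}
	resolved, err := resolve(candidate)
	if err != nil {
		return "", err
	}
	if !within(realBase, resolved) {
		return "", errors.New("path escapes base directory via symlink")
	}
	return resolved, nil
}
```

Comparing with `filepath.Rel` rather than a raw string prefix avoids treating `/database` as being inside `/data`.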

Update path traversal tests:
- Correct test expectations for "..." (three dots is valid filename, not traversal)
- Add tests for symlink escape attempts
- Add unicode attack tests
- Add deeply nested traversal tests

Security impact: Prevents path traversal via symlink following in artifact
scanning and other file operations.
2026-02-23 19:44:16 -05:00
Jeremie Fraeys
651318bc93
test(security): Integration tests for sandbox escape and secrets handling
Add sandbox escape integration tests:
- Container breakout attempts via privileged mode
- Host path mounting restrictions
- Network namespace isolation verification
- Capability dropping validation
- Seccomp profile enforcement

Add secrets integration tests:
- End-to-end credential expansion testing
- PHI denylist enforcement in real configs
- Environment variable reference resolution
- Plaintext secret detection across config boundaries
- Secret rotation workflow validation

Tests run with real container runtime (Podman/Docker) when available.
Provides defense-in-depth beyond unit tests.

Part of: security integration testing from security plan
2026-02-23 19:44:07 -05:00