Commit graph

32 commits

Author SHA1 Message Date
Jeremie Fraeys
66f262d788
security: improve audit, crypto, and config handling
- Enhance audit checkpoint system
- Update KMS provider and tenant key management
- Refine configuration constants
- Improve TUI config handling
2026-03-04 13:23:42 -05:00
Jeremie Fraeys
c459285cab
chore(deploy): update deployment configs and TUI for scheduler
Update deployment and CLI tooling:
- TUI models (jobs, state) with scheduler data
- TUI store with scheduler endpoints
- TUI config with scheduler settings
- Deployment Makefile with scheduler targets
- Deploy script with scheduler registration
- Docker Compose files with scheduler services
- Remove obsolete Dockerfiles (api-server, full-prod, test)
- Update remaining Dockerfiles with scheduler integration
2026-02-26 12:08:31 -05:00
Jeremie Fraeys
4cdb68907e
refactor(utilities): update supporting modules for scheduler integration
Update utility modules:
- File utilities with secure file operations
- Environment pool with resource tracking
- Error types with scheduler error categories
- Logging with audit context support
- Network/SSH with connection pooling
- Privacy/PII handling with tenant boundaries
- Resource manager with scheduler allocation
- Security monitor with audit integration
- Tracking plugins (MLflow, TensorBoard) with auth
- Crypto signing with tenant keys
- Database init with multi-user support
2026-02-26 12:07:15 -05:00
Jeremie Fraeys
3fb6902fa1
feat(worker): integrate scheduler endpoints and security hardening
Update worker system for scheduler integration:
- Worker server with scheduler registration
- Configuration with scheduler endpoint support
- Artifact handling with integrity verification
- Container executor with supply chain validation
- Local executor enhancements
- GPU detection improvements (cross-platform)
- Error handling with execution context
- Factory pattern for executor instantiation
- Hash integrity with native library support
2026-02-26 12:06:16 -05:00
Jeremie Fraeys
43e6446587
feat(scheduler): implement multi-tenant job scheduler with gang scheduling
Add new scheduler component for distributed ML workload orchestration:
- Hub-based coordination for multi-worker clusters
- Pacing controller for rate limiting job submissions
- Priority queue with preemption support
- Port allocator for dynamic service discovery
- Protocol handlers for worker-scheduler communication
- Service manager with OS-specific implementations
- Connection management and state persistence
- Template system for service deployment

Includes comprehensive test suite:
- Unit tests for all core components
- Integration tests for distributed scenarios
- Benchmark tests for performance validation
- Mock fixtures for isolated testing

Refs: scheduler-architecture.md
2026-02-26 12:03:23 -05:00
Jeremie Fraeys
58c1a5fa58
feat(audit): Tamper-evident audit chain verification system
Add ChainVerifier for cryptographic audit log verification:
- VerifyLogFile(): Validates entire audit chain integrity
- Detects tampering at specific event index (FirstTampered)
- Returns chain root hash for external verification
- GetChainRootHash(): Standalone hash computation
- VerifyAndAlert(): Boolean tampering detection with logging

Add audit-verifier CLI tool:
- Standalone binary for audit chain verification
- Takes log path argument and reports tampering

Update audit logger for chain integrity:
- Each event includes sequence number and hash chain
- SHA-256 linking: hash_n = SHA-256(prev_hash || event_n)
- Tamper detection through hash chain validation

Add comprehensive test coverage:
- Empty log handling
- Valid chain verification
- Tampering detection with modification
- Root hash consistency
- Alert mechanism tests

Part of: V.7 audit verification from security plan
2026-02-23 19:43:50 -05:00
Jeremie Fraeys
abd27bf0a2
refactor(go): Update Go commands and TUI controller
Update api-server and gen-keys main files

Update TUI controller commands, helpers, and settings
2026-02-23 14:13:14 -05:00
Jeremie Fraeys
ed7b5032a9
build: update Makefile and TUI controller integration
Some checks failed
Build CLI with Embedded SQLite / build (arm64, aarch64-linux) (push) Waiting to run
Build CLI with Embedded SQLite / build (x86_64, x86_64-linux) (push) Waiting to run
Build CLI with Embedded SQLite / build-macos (arm64) (push) Waiting to run
Build CLI with Embedded SQLite / build-macos (x86_64) (push) Waiting to run
CI/CD Pipeline / Docker Build (push) Blocked by required conditions
Security Scan / Security Analysis (push) Waiting to run
Security Scan / Native Library Security (push) Waiting to run
Checkout test / test (push) Successful in 6s
CI with Native Libraries / Check Build Environment (push) Successful in 12s
CI/CD Pipeline / Test (push) Failing after 5m15s
CI/CD Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI/CD Pipeline / Build (push) Has been skipped
CI/CD Pipeline / Test Scripts (push) Has been skipped
CI/CD Pipeline / Security Scan (push) Failing after 4m49s
Contract Tests / Spec Drift Detection (push) Failing after 13s
Contract Tests / API Contract Tests (push) Has been skipped
Deploy API Docs / Build API Documentation (push) Failing after 36s
Deploy API Docs / Deploy to GitHub Pages (push) Has been skipped
Documentation / build-and-publish (push) Failing after 26s
CI with Native Libraries / Build and Test Native Libraries (push) Has been cancelled
CI with Native Libraries / Build Release Libraries (push) Has been cancelled
2026-02-21 18:00:09 -05:00
Jeremie Fraeys
20fde4f79d
feat: integrate NVML GPU monitoring into TUI
- Update TUI controller loadGPU() to use NVML when available
- Prioritize NVML over nvidia-smi command for better performance
- Show additional metrics: power draw, SM clock when available
- Maintain fallback to nvidia-smi and system_profiler
2026-02-21 15:17:22 -05:00
Jeremie Fraeys
7efe8bbfbf
native: security hardening, research trustworthiness, and CVE mitigations
Security Fixes:
- CVE-2024-45339: Add O_EXCL flag to temp file creation in storage_write_entries()
  Prevents symlink attacks on predictable .tmp file paths
- CVE-2025-47290: Use openat_nofollow() in storage_open()
  Closes TOCTOU race condition via path_sanitizer infrastructure
- CVE-2025-0838: Add MAX_BATCH_SIZE=10000 to add_tasks()
  Prevents integer overflow in batch operations

Research Trustworthiness (dataset_hash):
- Deterministic file ordering: std::sort after collect_files()
- Recursive directory traversal: depth-limited with cycle detection
- Documented exclusions: hidden files and special files noted in API

Bug Fixes:
- R1: storage_init path validation for non-existent directories
- R2: safe_strncpy return value check before strcat
- R3: parallel_hash 256-file cap replaced with std::vector
- R4: wire qi_compact_index/qi_rebuild_index stubs
- R5: CompletionLatch race condition fix (hold mutex during decrement)
- R6: ARMv8 SHA256 transform fix (save abcd_pre before vsha256hq_u32)
- R7: fuzz_index_storage header format fix
- R8: enforce null termination in add_tasks/update_tasks
- R9: use 64 bytes (not 65) in combined hash to exclude null terminator
- R10: status field persistence in save()

New Tests:
- test_recursive_dataset.cpp: Verify deterministic recursive hashing
- test_storage_symlink_resistance.cpp: Verify CVE-2024-45339 fix
- test_queue_index_batch_limit.cpp: Verify CVE-2025-0838 fix
- test_sha256_arm_kat.cpp: ARMv8 known-answer tests
- test_storage_init_new_dir.cpp: F1 verification
- test_parallel_hash_large_dir.cpp: F3 verification
- test_queue_index_compact.cpp: F4 verification

All 8 native tests passing. Library ready for research lab deployment.
2026-02-21 13:33:45 -05:00
Jeremie Fraeys
a3b957dcc0
refactor(cli): Update build system and core infrastructure
- Makefile: Update build targets for native library integration
- build.zig: Add SQLite linking and native hash library support
- scripts/build_rsync.sh: Update rsync embedded binary build process
- scripts/build_sqlite.sh: Add SQLite constants generation script
- src/assets/README.md: Document embedded asset structure
- src/utils/rsync_embedded_binary.zig: Update for new build layout
2026-02-20 21:39:51 -05:00
Jeremie Fraeys
7c4a59012b
feat(tui): Add SQLite support for local mode
- store/store.go: New SQLite storage for TUI local mode
  - Open() with WAL mode and NORMAL synchronous
  - Schema initialization for ml_experiments, ml_runs, ml_metrics, ml_params, ml_tags
  - GetUnsyncedRuns(), GetRunsByExperiment(), MarkRunSynced()
  - GetRunMetrics(), GetRunParams() for run details
- config/config.go: Add local mode configuration fields
  - DBPath, ForceLocal, ProjectRoot fields
  - Experiment struct with Name and Entrypoint
  - IsLocalMode() and GetDBPath() helper methods
- go.mod: Add modernc.org/sqlite v1.36.0 dependency
2026-02-20 21:28:49 -05:00
Jeremie Fraeys
23e5f3d1dc
refactor(api): internal refactoring for TUI and worker modules
- Refactor internal/worker and internal/queue packages
- Update cmd/tui for monitoring interface
- Update test configurations
2026-02-20 15:51:23 -05:00
Jeremie Fraeys
6028779239
feat: update CLI, TUI, and security documentation
- Add safety checks to Zig build
- Add TUI with job management and narrative views
- Add WebSocket support and export services
- Add smart configuration defaults
- Update API routes with security headers
- Update SECURITY.md with comprehensive policy
- Add Makefile security scanning targets
2026-02-19 15:35:05 -05:00
Jeremie Fraeys
34aaba8f17
feat: implement Argon2id hashing and Ed25519 manifest signing
- Add Argon2id-based API key hashing with salt support
- Implement Ed25519 manifest signing (key generation, sign, verify)
- Add gen-keys CLI tool for manifest signing keys
- Fix hash-key command to hash provided key (not generate new one)
- Complete isHex helper function
2026-02-19 15:34:20 -05:00
Jeremie Fraeys
412d7b82e9
security: implement comprehensive secrets protection
Critical fixes:
- Add SanitizeConnectionString() in storage/db_connect.go to remove passwords
- Add SecureEnvVar() in api/factory.go to clear env vars after reading (JWT_SECRET)
- Clear DB password from config after connection

Logging improvements:
- Enhance logging/sanitize.go with patterns for:
  - PostgreSQL connection strings
  - Generic connection string passwords
  - HTTP Authorization headers
  - Private keys

CLI security:
- Add --security-audit flag to api-server for security checks:
  - Config file permissions
  - Exposed environment variables
  - Running as root
  - API key file permissions
- Add warning when --api-key flag used (process list exposure)

Files changed:
- internal/storage/db_connect.go
- internal/api/factory.go
- internal/logging/sanitize.go
- internal/auth/flags.go
- cmd/api-server/main.go
2026-02-18 16:18:09 -05:00
Jeremie Fraeys
7194826871
feat: implement research-grade maintainability phases 1,3,4,7
Phase 1: Event Sourcing
- Add TaskEvent types (queued, started, completed, failed, etc.)
- Create EventStore with Redis Streams (append-only)
- Support event querying by task ID and time range

Phase 3: Diagnosable Failures
- Enhance TaskExecutionError with Context map, Timestamp, Recoverable flag
- Update container.go to populate error context (image, GPU, duration)
- Add WithContext helper for building error context
- Create cmd/errors CLI for querying task errors

Phase 4: Testable Security
- Add security fields to PodmanConfig (Privileged, Network, ReadOnlyMounts)
- Create ValidateSecurityPolicy() with ErrSecurityViolation
- Add security contract tests (privileged rejection, host network rejection)
- Tests serve as executable security documentation

Phase 7: Reproducible Builds
- Add BuildHash and BuildTime ldflags to Makefile
- Create verify-build target for reproducibility testing
- Add -version and -verify flags to api-server

All tests pass:
- go test ./internal/errtypes/...
- go test ./internal/container/... -run Security
- go test ./internal/queue/...
- go build ./cmd/api-server/...
2026-02-18 15:27:50 -05:00
Jeremie Fraeys
b889b5403d
docs: clean up verbose comments in TUI main.go 2026-02-18 14:45:44 -05:00
Jeremie Fraeys
3775bc3ee0
refactor: replace panic with error returns and update maintenance
- Replace 9 panic() calls in smart_defaults.go with error returns
- Add ErrUnknownProfile error type for better error handling
- Update all callers (worker/config.go, tui/config.go, tui/cli_config.go, tui/main.go)
- Update CHANGELOG.md with recent WebSocket handler improvements
- Add metrics persistence, dataset handlers, and test organization notes
- Config validation passes (make configlint)
- All tests pass (go test ./tests/unit/api/ws)
2026-02-18 14:44:21 -05:00
Jeremie Fraeys
320e6fd409
refactor(dependency-hygiene): Move path functions from config to storage
Move ExpandPath function and path-related utilities from internal/config to internal/storage where they belong.

Files updated:
- internal/worker/config.go: use storage.ExpandPath
- internal/network/ssh.go: use storage.ExpandPath
- cmd/data_manager/data_manager_config.go: use storage.ExpandPath
- internal/api/server_config.go: use storage.ExpandPath

internal/storage/paths.go already contained the canonical implementation.

Result: Path utilities now live in storage layer, config package focuses on configuration structs.
2026-02-17 21:15:23 -05:00
Jeremie Fraeys
dbf96020af
refactor(dependency-hygiene): Fix Redis leak, simplify TUI wrapper, clean go.mod
Phase 1: Fix Redis Schema Leak
- Create internal/storage/dataset.go with DatasetStore abstraction
- Remove all direct Redis calls from cmd/data_manager/data_sync.go
- data_manager now uses DatasetStore for transfer tracking and metadata

Phase 2: Simplify TUI Services
- Embed *queue.TaskQueue directly in services.TaskQueue
- Eliminate 60% of wrapper boilerplate (203 -> ~100 lines)
- Keep only TUI-specific methods (EnqueueTask, GetJobStatus, experiment methods)

Phase 5: Clean go.mod Dependencies
- Remove duplicate go-redis/redis/v8 dependency
- Migrate internal/storage/migrate.go to redis/go-redis/v9
- Separate test-only deps (miniredis, testify) into own block

Results:
- Zero direct Redis calls in cmd/
- 60% fewer lines in TUI services
- Cleaner dependency structure
2026-02-17 21:13:49 -05:00
Jeremie Fraeys
3187ff26ea
refactor: complete maintainability phases 1-9 and fix all tests
Test fixes (all 41 test packages now pass):
- Fix ComputeTaskProvenance - add dataset_specs JSON output
- Fix EnforceTaskProvenance - populate all metadata fields in best-effort mode
- Fix PrewarmNextOnce - preserve prewarm state when queue empty
- Fix RunManifest directory creation in SetupJobDirectories
- Add ManifestWriter to test worker (simpleManifestWriter)
- Fix worker ID mismatch (use cfg.WorkerID)
- Fix WebSocket binary protocol responses
- Implement all WebSocket handlers: QueueJob, QueueJobWithSnapshot, StatusRequest,
  CancelJob, Prune, ValidateRequest (with run manifest validation), LogMetric,
  GetExperiment, DatasetList/Register/Info/Search

Maintainability phases completed:
- Phases 1-6: Domain types, error system, config boundaries, worker/API/queue splits
- Phase 7: TUI cleanup - reorganize model package (jobs.go, messages.go, styles.go, keys.go)
- Phase 8: MLServer unification - consolidate worker + TUI into internal/network/mlserver.go
- Phase 9: CI enforcement - add scripts/ci-checks.sh with 5 checks:
  * No internal/ -> cmd/ imports
  * domain/ has zero internal imports
  * File size limit (500 lines, rigid)
  * No circular imports
  * Package naming conventions

Documentation:
- Add docs/src/file-naming-conventions.md
- Add make ci-checks target

Lines changed: +756/-36 (WebSocket fixes), +518/-320 (TUI), +263/-20 (Phase 8-9)
2026-02-17 20:32:14 -05:00
Jeremie Fraeys
fb2bbbaae5
refactor: Phase 7 - TUI cleanup - reorganize model package
Phase 7 of the monorepo maintainability plan:

New files created:
- model/jobs.go - Job type, JobStatus constants, list.Item interface
- model/messages.go - tea.Msg types (JobsLoadedMsg, StatusMsg, TickMsg, etc.)
- model/styles.go - NewJobListDelegate(), JobListTitleStyle(), SpinnerStyle()
- model/keys.go - KeyMap struct, DefaultKeys() function

Modified files:
- model/state.go - reduced from 226 to ~130 lines
  - Removed: Job, JobStatus, KeyMap, Keys, inline styles
  - Kept: State struct, domain re-exports, ViewMode, DatasetInfo, InitialState()
- controller/commands.go - use model. prefix for message types
- controller/controller.go - use model. prefix for message types
- controller/settings.go - use model.SettingsContentMsg

Deleted files:
- controller/keys.go (moved to model/keys.go since State references KeyMap)

Result:
- No file >150 lines in model/ package
- Single concern per file: state, jobs, messages, styles, keys
- All 41 test packages pass
2026-02-17 20:22:04 -05:00
Jeremie Fraeys
6580917ba8
refactor: extract domain types and consolidate error system (Phases 1-2)
Phase 1: Extract Domain Types
=============================
- Create internal/domain/ package with canonical types:
  - domain/task.go: Task, Attempt structs
  - domain/tracking.go: TrackingConfig and MLflow/TensorBoard/Wandb configs
  - domain/dataset.go: DatasetSpec
  - domain/status.go: JobStatus constants
  - domain/errors.go: FailureClass system with classification functions
  - domain/doc.go: package documentation

- Update queue/task.go to re-export domain types (backward compatibility)
- Update TUI model/state.go to use domain types via type aliases
- Simplify TUI services: remove ~60 lines of conversion functions

Phase 2: Delete ErrorCategory System
====================================
- Remove deprecated ErrorCategory type and constants
- Remove TaskError struct and related functions
- Remove mapping functions: ClassifyError, IsRetryable, GetUserMessage, RetryDelay
- Update all queue implementations to use domain.FailureClass directly:
  - queue/metrics.go: RecordTaskFailure/Retry now take FailureClass
  - queue/queue.go: RetryTask uses domain.ClassifyFailure
  - queue/filesystem_queue.go: RetryTask and MoveToDeadLetterQueue updated
  - queue/sqlite_queue.go: RetryTask and MoveToDeadLetterQueue updated

Lines eliminated: ~190 lines of conversion and mapping code
Result: Single source of truth for domain types and error classification
2026-02-17 12:34:28 -05:00
Jeremie Fraeys
1dcc1e11d5
chore(build): update build system, scripts, and additional tests
- Update Makefile with native build targets (preparing for C++)
- Add profiler and performance regression detector commands
- Update CI/testing scripts
- Add additional unit tests for API, jupyter, queue, manifest
2026-02-12 12:05:55 -05:00
Jeremie Fraeys
94112f0af5 ci: align workflows, build scripts, and docs with current architecture 2026-01-05 12:34:23 -05:00
Jeremie Fraeys
5ef24e4c6d feat(cli): add validate/info commands and improve protocol handling 2026-01-05 12:31:20 -05:00
Jeremie Fraeys
82034c68f3 feat(worker): add integrity checks, snapshot staging, and prewarm support 2026-01-05 12:31:13 -05:00
Jeremie Fraeys
add4a90e62 feat(api): refactor websocket handlers; add health and prometheus middleware 2026-01-05 12:31:07 -05:00
Jeremie Fraeys
cd5640ebd2 Slim and secure: move scripts, clean configs, remove secrets
- Move ci-test.sh and setup.sh to scripts/
- Trim docs/src/zig-cli.md to current structure
- Replace hardcoded secrets with placeholders in configs
- Update .gitignore to block .env*, secrets/, keys, build artifacts
- Slim README.md to reflect current CLI/TUI split
- Add cleanup trap to ci-test.sh
- Ensure no secrets are committed
2025-12-07 13:57:51 -05:00
Jeremie Fraeys
ea15af1833 Fix multi-user authentication and clean up debug code
- Fix YAML tags in auth config struct (json -> yaml)
- Update CLI configs to use pre-hashed API keys
- Remove double hashing in WebSocket client
- Fix port mapping (9102 -> 9103) in CLI commands
- Update permission keys to use jobs:read, jobs:create, etc.
- Clean up all debug logging from CLI and server
- All user roles now authenticate correctly:
  * Admin: Can queue jobs and see all jobs
  * Researcher: Can queue jobs and see own jobs
  * Analyst: Can see status (read-only access)

Multi-user authentication is now fully functional.
2025-12-06 12:35:32 -05:00
Jeremie Fraeys
803677be57 feat: implement Go backend with comprehensive API and internal packages
- Add API server with WebSocket support and REST endpoints
- Implement authentication system with API keys and permissions
- Add task queue system with Redis backend and error handling
- Include storage layer with database migrations and schemas
- Add comprehensive logging, metrics, and telemetry
- Implement security middleware and network utilities
- Add experiment management and container orchestration
- Include configuration management with smart defaults
2025-12-04 16:53:53 -05:00