Jeremie Fraeys
4da027868d
fix(storage): handle NULL values and state tracking in database operations
...
Fixes to support proper test coverage:
- db_jobs.go: UpdateJobStatus now checks RowsAffected and returns error
for nonexistent jobs instead of silently succeeding
- db_audit.go: GetOldestAuditLogDate uses sql.NullString to parse SQLite
datetime strings in YYYY-MM-DD HH:MM:SS format with RFC3339 fallback
- db_experiments.go: ListTasksForExperiment uses sql.NullString for
nullable worker_id and error fields to prevent scan errors
- db_connect.go: DB struct adds isClosed state tracking with mutex;
Close() now returns error on double close to match test expectations
2026-03-13 23:27:35 -04:00
Jeremie Fraeys
50b6506243
test(storage): add comprehensive storage layer tests
...
Add tests for:
- dataset: Redis dataset operations, transfer tracking
- db_audit: audit logging with hash chain, access tracking
- db_experiments: experiment metadata, dataset associations
- db_tasks: task listing with pagination for users and groups
- db_jobs: job CRUD, state transitions, worker assignment
Coverage: storage package ~40%+
2026-03-13 23:26:33 -04:00
Jeremie Fraeys
61660dc925
refactor: co-locate security, storage, telemetry, tracking, worker tests
...
Move unit tests from tests/unit/ to internal/ following Go conventions:
Security tests:
- tests/unit/security/* -> internal/security/* (audit, config_integrity, filetype, gpu_audit, hipaa_validation, manifest_filename, path_traversal, resource_quota, secrets)
Storage tests:
- tests/unit/storage/* -> internal/storage/* (db, experiment_metadata)
Telemetry tests:
- tests/unit/telemetry/* -> internal/telemetry/* (telemetry)
Tracking tests:
- tests/unit/reproducibility/* -> internal/tracking/* (config_hash, environment_capture)
Worker tests:
- tests/unit/worker/* -> internal/worker/* (artifacts, config, hash_bench, plugins/jupyter_task, plugins/vllm, prewarm_v1, run_manifest_execution, snapshot_stage, snapshot_store, worker)
Update import paths in test files to reflect new locations.
2026-03-12 16:37:03 -04:00
Jeremie Fraeys
2b1ef10514
test(chaos): add worker disconnect chaos test and queue improvements
...
Chaos testing:
- Add worker_disconnect_chaos_test.go for network partition resilience
- Test scheduler hub recovery and job reassignment scenarios
Queue layer updates:
- event_store.go: add event sourcing for queue operations
- native_queue.go: extend native queue with batch operations and indexing
2026-03-12 12:08:21 -04:00
Jeremie Fraeys
fbcf4d38e5
feat(storage): add groups, tasks, tokens, and audit database schemas
...
Add comprehensive database storage layer for new features:
- db_groups.go: Lab group management with members, roles (admin/member/viewer),
and group-based task visibility queries
- db_tasks.go: Task visibility system (private/lab/institution/open),
task sharing with expiry, public clone tokens, and optimized
ListTasksForUser() for access control
- db_tokens.go: Secure token management for public task access and cloning,
with SHA-256 hashed token storage and automatic cleanup
- db_audit.go: Audit log persistence with checkpoint chains, tamper
detection, and log rotation support
- schema_sqlite.sql: Updated schema with:
- groups, group_members tables
- tasks.visibility enum, task_shares with expiry
- access_tokens table with hashed tokens
- audit_logs, audit_checkpoints tables
- indexes for all foreign keys and query patterns
- db_experiments.go: Add CascadeVisibilityToTasks() for propagating
visibility changes from experiments to associated tasks
2026-03-08 12:48:42 -04:00
Jeremie Fraeys
6866ba9366
refactor(queue): integrate scheduler backend and storage improvements
...
Update queue and storage systems for scheduler integration:
- Queue backend with scheduler coordination
- Filesystem queue with batch operations
- Deduplication with tenant-aware keys
- Storage layer with audit logging hooks
- Domain models (Task, Events, Errors) with scheduler fields
- Database layer with tenant isolation
- Dataset storage with integrity checks
2026-02-26 12:06:46 -05:00
Jeremie Fraeys
92aab06d76
feat(security): implement comprehensive security hardening phases 1-5,7
...
Implements defense-in-depth security for HIPAA and multi-tenant requirements:
**Phase 1 - File Ingestion Security:**
- SecurePathValidator with symlink resolution and path boundary enforcement
in internal/fileutil/secure.go
- Magic bytes validation for ML artifacts (safetensors, GGUF, HDF5, numpy)
in internal/fileutil/filetype.go
- Dangerous extension blocking (.pt, .pkl, .exe, .sh, .zip)
- Upload limits (10GB size, 100MB/s rate, 10 uploads/min)
**Phase 2 - Sandbox Hardening:**
- ApplySecurityDefaults() with secure-by-default principle
- network_mode: none, read_only_root: true, no_new_privileges: true
- drop_all_caps: true, user_ns: true, run_as_uid/gid: 1000
- PodmanSecurityConfig and BuildSecurityArgs() in internal/container/podman.go
- BuildPodmanCommand now accepts full security configuration
- Container executor passes SandboxConfig to Podman command builder
- configs/seccomp/default-hardened.json blocks dangerous syscalls
(ptrace, mount, reboot, kexec_load, open_by_handle_at)
**Phase 3 - Secrets Management:**
- expandSecrets() for environment variable expansion using ${VAR} syntax
- validateNoPlaintextSecrets() with entropy-based detection
- Pattern matching for AWS, GitHub, GitLab, OpenAI, Stripe tokens
- Shannon entropy calculation (>4 bits/char triggers detection)
- Secrets expanded during LoadConfig() before validation
**Phase 5 - HIPAA Audit Logging:**
- Tamper-evident chain hashing with SHA-256 in internal/audit/audit.go
- Event struct extended with PrevHash, EventHash, SequenceNum
- File access event types: EventFileRead, EventFileWrite, EventFileDelete
- LogFileAccess() helper for HIPAA compliance
- VerifyChain() function for tamper detection
**Supporting Changes:**
- Add DeleteJob() and DeleteJobsByPrefix() to storage package
- Integrate SecurePathValidator in artifact scanning
2026-02-23 18:00:33 -05:00
Jeremie Fraeys
23e5f3d1dc
refactor(api): internal refactoring for TUI and worker modules
...
- Refactor internal/worker and internal/queue packages
- Update cmd/tui for monitoring interface
- Update test configurations
2026-02-20 15:51:23 -05:00
Jeremie Fraeys
412d7b82e9
security: implement comprehensive secrets protection
...
Critical fixes:
- Add SanitizeConnectionString() in storage/db_connect.go to remove passwords
- Add SecureEnvVar() in api/factory.go to clear env vars after reading (JWT_SECRET)
- Clear DB password from config after connection
Logging improvements:
- Enhance logging/sanitize.go with patterns for:
- PostgreSQL connection strings
- Generic connection string passwords
- HTTP Authorization headers
- Private keys
CLI security:
- Add --security-audit flag to api-server for security checks:
- Config file permissions
- Exposed environment variables
- Running as root
- API key file permissions
- Add warning when --api-key flag used (process list exposure)
Files changed:
- internal/storage/db_connect.go
- internal/api/factory.go
- internal/logging/sanitize.go
- internal/auth/flags.go
- cmd/api-server/main.go
2026-02-18 16:18:09 -05:00
Jeremie Fraeys
10e6416e11
refactor: update WebSocket handlers and database schemas
...
- Update datasets handlers with improved error handling
- Refactor WebSocket handler for better organization
- Clean up jobs.go handler implementation
- Add websocket_metrics table to Postgres and SQLite schemas
2026-02-18 14:36:30 -05:00
Jeremie Fraeys
de877a3030
feat: implement WebSocket handler improvements and metrics persistence
...
- Add websocket_metrics table to SQLite and Postgres schemas
- Create db_metrics.go with RecordMetric, GetMetrics, GetMetricSummary methods
- Integrate metrics persistence into handleLogMetric WebSocket handler
- Remove duplicate db_datasets.go to fix type mismatches
- Move tests to tests/unit/api/ws/ following project structure
- Add payload parsing tests for handleLogMetric, handleGetExperiment, handleStatusRequest
- Update handler.go line count to 541 (still under 500 limit target)
2026-02-18 14:36:05 -05:00
Jeremie Fraeys
dbf96020af
refactor(dependency-hygiene): Fix Redis leak, simplify TUI wrapper, clean go.mod
...
Phase 1: Fix Redis Schema Leak
- Create internal/storage/dataset.go with DatasetStore abstraction
- Remove all direct Redis calls from cmd/data_manager/data_sync.go
- data_manager now uses DatasetStore for transfer tracking and metadata
Phase 2: Simplify TUI Services
- Embed *queue.TaskQueue directly in services.TaskQueue
- Eliminate 60% of wrapper boilerplate (203 -> ~100 lines)
- Keep only TUI-specific methods (EnqueueTask, GetJobStatus, experiment methods)
Phase 5: Clean go.mod Dependencies
- Remove duplicate go-redis/redis/v8 dependency
- Migrate internal/storage/migrate.go to redis/go-redis/v9
- Separate test-only deps (miniredis, testify) into own block
Results:
- Zero direct Redis calls in cmd/
- 60% fewer lines in TUI services
- Cleaner dependency structure
2026-02-17 21:13:49 -05:00
Jeremie Fraeys
d1bef0a450
refactor: Phase 3 - fix config/storage boundaries
...
Move schema ownership to infrastructure layer:
- Redis keys: config/constants.go -> queue/keys.go (TaskQueueKey, TaskPrefix, etc.)
- Filesystem paths: config/paths.go -> storage/paths.go (JobPaths)
- Create config/shared.go with RedisConfig, SSHConfig
- Update all imports: worker/, api/helpers, api/ws_jobs, api/ws_validate
- Clean up: remove duplicates from queue/task.go, queue/queue.go, config/paths.go
Build status: Compiles successfully
2026-02-17 12:49:53 -05:00
Jeremie Fraeys
6ff5324e74
refactor(storage,queue): split storage layer and add sqlite queue backend
2026-01-05 12:31:02 -05:00
Jeremie Fraeys
ea15af1833
Fix multi-user authentication and clean up debug code
...
- Fix YAML tags in auth config struct (json -> yaml)
- Update CLI configs to use pre-hashed API keys
- Remove double hashing in WebSocket client
- Fix port mapping (9102 -> 9103) in CLI commands
- Update permission keys to use jobs:read, jobs:create, etc.
- Clean up all debug logging from CLI and server
- All user roles now authenticate correctly:
* Admin: Can queue jobs and see all jobs
* Researcher: Can queue jobs and see own jobs
* Analyst: Can see status (read-only access)
Multi-user authentication is now fully functional.
2025-12-06 12:35:32 -05:00
Jeremie Fraeys
803677be57
feat: implement Go backend with comprehensive API and internal packages
...
- Add API server with WebSocket support and REST endpoints
- Implement authentication system with API keys and permissions
- Add task queue system with Redis backend and error handling
- Include storage layer with database migrations and schemas
- Add comprehensive logging, metrics, and telemetry
- Implement security middleware and network utilities
- Add experiment management and container orchestration
- Include configuration management with smart defaults
2025-12-04 16:53:53 -05:00