Commit graph

118 commits

Author SHA1 Message Date
Jeremie Fraeys
d1bef0a450
refactor: Phase 3 - fix config/storage boundaries
Move schema ownership to infrastructure layer:

- Redis keys: config/constants.go -> queue/keys.go (TaskQueueKey, TaskPrefix, etc.)

- Filesystem paths: config/paths.go -> storage/paths.go (JobPaths)

- Create config/shared.go with RedisConfig, SSHConfig

- Update all imports: worker/, api/helpers, api/ws_jobs, api/ws_validate

- Clean up: remove duplicates from queue/task.go, queue/queue.go, config/paths.go

Build status: Compiles successfully
2026-02-17 12:49:53 -05:00
Jeremie Fraeys
6580917ba8
refactor: extract domain types and consolidate error system (Phases 1-2)
Phase 1: Extract Domain Types
=============================
- Create internal/domain/ package with canonical types:
  - domain/task.go: Task, Attempt structs
  - domain/tracking.go: TrackingConfig and MLflow/TensorBoard/Wandb configs
  - domain/dataset.go: DatasetSpec
  - domain/status.go: JobStatus constants
  - domain/errors.go: FailureClass system with classification functions
  - domain/doc.go: package documentation

- Update queue/task.go to re-export domain types (backward compatibility)
- Update TUI model/state.go to use domain types via type aliases
- Simplify TUI services: remove ~60 lines of conversion functions

Phase 2: Delete ErrorCategory System
====================================
- Remove deprecated ErrorCategory type and constants
- Remove TaskError struct and related functions
- Remove mapping functions: ClassifyError, IsRetryable, GetUserMessage, RetryDelay
- Update all queue implementations to use domain.FailureClass directly:
  - queue/metrics.go: RecordTaskFailure/Retry now take FailureClass
  - queue/queue.go: RetryTask uses domain.ClassifyFailure
  - queue/filesystem_queue.go: RetryTask and MoveToDeadLetterQueue updated
  - queue/sqlite_queue.go: RetryTask and MoveToDeadLetterQueue updated

Lines eliminated: ~190 lines of conversion and mapping code
Result: Single source of truth for domain types and error classification
2026-02-17 12:34:28 -05:00
Jeremie Fraeys
a93b6715fd
feat: add native library bridge and queue integration
- Add native_queue.go with CGO bindings for queue operations
- Add native_queue_stub.go for non-CGO builds
- Add hash_selector to choose between Go and native implementations
- Add native_bridge_libs.go for CGO builds with native_libs tag
- Add native_bridge_nocgo.go stub for non-CGO builds
- Update queue errors and task handling for native integration
- Update worker config and runloop for native library support
2026-02-16 20:38:30 -05:00
Jeremie Fraeys
b05470b30a
refactor: improve API structure and WebSocket protocol
- Extract WebSocket protocol handling to dedicated module
- Add helper functions for DB operations, validation, and responses
- Improve WebSocket frame handling and opcodes
- Refactor dataset, job, and Jupyter handlers
- Add duplicate detection processing
2026-02-16 20:38:12 -05:00
Jeremie Fraeys
43d241c28d
feat: implement C++ native libraries for performance-critical operations
- Add arena allocator for zero-allocation hot paths
- Add thread pool for parallel operations
- Add mmap utilities for memory-mapped I/O
- Implement queue_index with heap-based priority queue
- Implement dataset_hash with SIMD support (SHA-NI, ARMv8)
- Add runtime SIMD detection for cross-platform correctness
- Add comprehensive tests and benchmarks
2026-02-16 20:38:04 -05:00
Jeremie Fraeys
d408a60eb1
ci: push all workflow updates
Some checks failed
Documentation / build-and-publish (push) Waiting to run
Test / test (push) Waiting to run
Checkout test / test (push) Successful in 5s
CI with Native Libraries / test-native (push) Has been cancelled
CI with Native Libraries / build-release (push) Has been cancelled
2026-02-12 13:28:15 -05:00
Jeremie Fraeys
2e701340e5
feat(core): API, worker, queue, and manifest improvements
- Add protocol buffer optimizations (internal/api/protocol.go)
- Add filesystem queue backend (internal/queue/filesystem_queue.go)
- Add run manifest support (internal/manifest/run_manifest.go)
- Worker and jupyter task refinements
- Exported test wrappers for benchmarking
2026-02-12 12:05:17 -05:00
Jeremie Fraeys
72b4b29ecd
perf: add profiling benchmarks and parallel Go baseline for C++ optimization
Add comprehensive benchmarking suite for C++ optimization targets:
- tests/benchmarks/dataset_hash_bench_test.go - dirOverallSHA256Hex profiling
- tests/benchmarks/queue_bench_test.go - filesystem queue profiling
- tests/benchmarks/artifact_and_snapshot_bench_test.go - scanArtifacts/extractTarGz profiling
- tests/unit/worker/artifacts_test.go - moved from internal/ for clean separation

Add parallel Go implementation as baseline for C++ comparison:
- internal/worker/data_integrity.go: dirOverallSHA256HexParallel() with worker pool
- Benchmarks show 2.1x speedup (3.97ms -> 1.90ms) vs sequential

Exported wrappers for testing:
- ScanArtifacts() - artifact scanning
- ExtractTarGz() - tar.gz extraction
- DirOverallSHA256HexParallel() - parallel hashing

Profiling results (Apple M2 Ultra):
- dirOverallSHA256Hex: 78% syscall overhead (target for mmap C++)
- rebuildIndex: 96% syscall overhead (target for binary index C++)
- scanArtifacts: 87% syscall overhead (target for fast traversal C++)
- extractTarGz: 95% syscall overhead (target for parallel gzip C++)

Related: C++ optimization strategy in memory 5d5f0bb6
2026-02-12 12:04:02 -05:00
Jeremie Fraeys
c0eeeda940 feat(experiment): improve experiment lifecycle and update first-experiment guide 2026-01-05 12:37:34 -05:00
Jeremie Fraeys
6b771e4a50 feat(jupyter): improve runtime management and update security/workflow docs 2026-01-05 12:37:27 -05:00
Jeremie Fraeys
dab680a60d feat(tracking): add pluggable tracking backends and audit support 2026-01-05 12:33:57 -05:00
Jeremie Fraeys
82034c68f3 feat(worker): add integrity checks, snapshot staging, and prewarm support 2026-01-05 12:31:13 -05:00
Jeremie Fraeys
add4a90e62 feat(api): refactor websocket handlers; add health and prometheus middleware 2026-01-05 12:31:07 -05:00
Jeremie Fraeys
6ff5324e74 refactor(storage,queue): split storage layer and add sqlite queue backend 2026-01-05 12:31:02 -05:00
Jeremie Fraeys
cd5640ebd2 Slim and secure: move scripts, clean configs, remove secrets
- Move ci-test.sh and setup.sh to scripts/
- Trim docs/src/zig-cli.md to current structure
- Replace hardcoded secrets with placeholders in configs
- Update .gitignore to block .env*, secrets/, keys, build artifacts
- Slim README.md to reflect current CLI/TUI split
- Add cleanup trap to ci-test.sh
- Ensure no secrets are committed
2025-12-07 13:57:51 -05:00
Jeremie Fraeys
ea15af1833 Fix multi-user authentication and clean up debug code
- Fix YAML tags in auth config struct (json -> yaml)
- Update CLI configs to use pre-hashed API keys
- Remove double hashing in WebSocket client
- Fix port mapping (9102 -> 9103) in CLI commands
- Update permission keys to use jobs:read, jobs:create, etc.
- Clean up all debug logging from CLI and server
- All user roles now authenticate correctly:
  * Admin: Can queue jobs and see all jobs
  * Researcher: Can queue jobs and see own jobs
  * Analyst: Can see status (read-only access)

Multi-user authentication is now fully functional.
2025-12-06 12:35:32 -05:00
Jeremie Fraeys
10a3afaafb fix: update production environment variable check
- Change FETCH_ML_ENV check from 'production' to 'prod'
- Aligns with common environment naming conventions
- Fixes authentication validation for production deployment
2025-12-04 17:06:32 -05:00
Jeremie Fraeys
803677be57 feat: implement Go backend with comprehensive API and internal packages
- Add API server with WebSocket support and REST endpoints
- Implement authentication system with API keys and permissions
- Add task queue system with Redis backend and error handling
- Include storage layer with database migrations and schemas
- Add comprehensive logging, metrics, and telemetry
- Implement security middleware and network utilities
- Add experiment management and container orchestration
- Include configuration management with smart defaults
2025-12-04 16:53:53 -05:00