Jeremie Fraeys
4c8c9dfe4b
refactor: Export SelectDependencyManifest for API helpers
...
- Renamed selectDependencyManifest to SelectDependencyManifest (exported)
- Added re-export in worker package for backward compatibility
- Updated internal call in container.go to use exported function
- API helpers can now access via worker.SelectDependencyManifest
Build status: Compiles successfully
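
A re-export like this can be as small as a package-level variable. The sketch below collapses both packages into one file so that it runs on its own. The function signature, the manifest file names, and their priority order are assumptions, since the log does not show them.

```go
package main

import "fmt"

// SelectDependencyManifest stands in for the exported executor function.
// The signature and the candidate file names are hypothetical.
func SelectDependencyManifest(files []string) (string, bool) {
	present := make(map[string]bool, len(files))
	for _, f := range files {
		present[f] = true
	}
	// Assumed priority order: the first candidate that exists wins.
	for _, candidate := range []string{"environment.yml", "pyproject.toml", "requirements.txt"} {
		if present[candidate] {
			return candidate, true
		}
	}
	return "", false
}

// WorkerSelectDependencyManifest mimics the re-export. In the worker package
// it would read: var SelectDependencyManifest = executor.SelectDependencyManifest
var WorkerSelectDependencyManifest = SelectDependencyManifest

func main() {
	name, ok := WorkerSelectDependencyManifest([]string{"train.py", "requirements.txt"})
	fmt.Println(name, ok)
}
```

Existing callers keep compiling against the worker package while new callers can import the executor package directly.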
2026-02-17 16:45:59 -05:00
Jeremie Fraeys
085c23f66a
refactor(phase7): Initialize JobRunner in factory.go
...
- Create jobRunner using NewJobRunner with local and container executors
- Assign jobRunner to Worker.runner field
- JobRunner available for future task execution orchestration
Build status: Compiles successfully
2026-02-17 16:40:03 -05:00
Jeremie Fraeys
51698d60de
refactor(phase7): Restore resource metrics in metrics.go
...
- Re-enabled all resource metrics (CPU, GPU, acquisition stats)
- Metrics are conditionally registered only when w.resources != nil
- Added nil check to prevent panics if resource manager not initialized
Build status: Compiles successfully
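
The nil guard described above might look roughly like this. It is a minimal sketch that uses a plain map instead of a Prometheus registry; the metric names and the ResourceManager fields are made up for illustration.

```go
package main

import "fmt"

// ResourceManager is a stand-in for the worker's resource manager.
type ResourceManager struct{ FreeCPU, FreeGPU int }

// Worker holds an optional resource manager.
type Worker struct{ resources *ResourceManager }

// gauges maps a metric name to the function that reads its current value.
type gauges map[string]func() float64

// setupMetrics always registers the base metric and adds the resource metrics
// only when a resource manager is present, so no gauge dereferences nil.
func (w *Worker) setupMetrics() gauges {
	g := gauges{"worker_up": func() float64 { return 1 }}
	if w.resources == nil {
		return g
	}
	g["cpu_free"] = func() float64 { return float64(w.resources.FreeCPU) }
	g["gpu_free"] = func() float64 { return float64(w.resources.FreeGPU) }
	return g
}

func main() {
	bare := (&Worker{}).setupMetrics()
	full := (&Worker{resources: &ResourceManager{FreeCPU: 4, FreeGPU: 1}}).setupMetrics()
	fmt.Println(len(bare), len(full), full["cpu_free"]())
}
```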
2026-02-17 16:38:47 -05:00
Jeremie Fraeys
1ba67e419d
refactor(phase7): Integrate resource manager into Worker
...
- Added resources field to Worker struct
- Updated factory.go to pass resource manager to Worker
- Removed placeholder discard of resource manager
Build status: Compiles successfully
2026-02-17 16:37:33 -05:00
Jeremie Fraeys
a7360869f8
refactor: Implement TaskExecutorAdapter and Worker.runningCount()
...
- Created executor/TaskExecutorAdapter implementing lifecycle.TaskExecutor
- Properly wires LocalExecutor and ContainerExecutor through adapter
- Worker.runningCount() now delegates to runLoop.RunningCount()
- Added lifecycle.RunLoop.RunningCount() public method
- factory.go creates proper executor chain instead of placeholder
Build status: Compiles successfully
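
The adapter idea can be sketched as follows. The Task type, the method names on both interfaces, and the routing rule (a UseContainer flag) are assumptions; only the names TaskExecutorAdapter, TaskExecutor, LocalExecutor, and ContainerExecutor come from the commit.

```go
package main

import (
	"context"
	"fmt"
)

// Task is a hypothetical task shape.
type Task struct {
	ID           string
	UseContainer bool
}

// JobExecutor is what the local and container executors implement.
type JobExecutor interface {
	Execute(ctx context.Context, t Task) error
}

// TaskExecutor is the contract the run loop depends on.
type TaskExecutor interface {
	ExecuteTask(ctx context.Context, t Task) error
}

// fakeExecutor records the IDs of the tasks it was asked to run.
type fakeExecutor struct{ ran []string }

func (e *fakeExecutor) Execute(_ context.Context, t Task) error {
	e.ran = append(e.ran, t.ID)
	return nil
}

// TaskExecutorAdapter routes each task to one of the wrapped executors.
type TaskExecutorAdapter struct {
	local, container JobExecutor
}

func (a *TaskExecutorAdapter) ExecuteTask(ctx context.Context, t Task) error {
	if t.UseContainer {
		return a.container.Execute(ctx, t)
	}
	return a.local.Execute(ctx, t)
}

// Compile-time check that the adapter satisfies the run loop's interface.
var _ TaskExecutor = (*TaskExecutorAdapter)(nil)

// demo runs one local and one container task through the adapter.
func demo() (localRan, containerRan []string) {
	local, container := &fakeExecutor{}, &fakeExecutor{}
	var exec TaskExecutor = &TaskExecutorAdapter{local: local, container: container}
	ctx := context.Background()
	_ = exec.ExecuteTask(ctx, Task{ID: "t1"})
	_ = exec.ExecuteTask(ctx, Task{ID: "t2", UseContainer: true})
	return local.ran, container.ran
}

func main() {
	localRan, containerRan := demo()
	fmt.Println(localRan, containerRan)
}
```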
2026-02-17 16:15:41 -05:00
Jeremie Fraeys
38fa017b8e
refactor: Phase 6 - Complete migration, remove legacy files
...
BREAKING CHANGE: Legacy worker files removed, Worker struct simplified
Changes:
1. worker.go - Simplified to 8 fields using composed dependencies:
   - runLoop, runner, metrics, health (from new packages)
   - Removed: server, queue, running, datasetCache, ctx, cancel, etc.
2. factory.go - Updated NewWorker to use new structure
   - Uses lifecycle.NewRunLoop
   - Integrates jupyter.Manager properly
3. Removed legacy files:
   - execution.go (1,016 lines)
   - data_integrity.go (929 lines)
   - runloop.go (555 lines)
   - jupyter_task.go (144 lines)
   - simplified.go (demonstration no longer needed)
4. Fixed references to use new packages:
   - hash_selector.go -> integrity.DirOverallSHA256Hex
   - snapshot_store.go -> integrity.NormalizeSHA256ChecksumHex
   - metrics.go - Removed resource-dependent metrics temporarily
5. Added RecordQueueLatency to metrics.Metrics for lifecycle.MetricsRecorder
Worker struct: 27 fields -> 8 fields (70% reduction)
Build status: Compiles successfully
2026-02-17 14:39:48 -05:00
Jeremie Fraeys
94bb52d09c
refactor: Phase 5 - Simplified Worker demonstration
...
Created simplified.go demonstrating target architecture:
internal/worker/simplified.go (109 lines)
- SimplifiedWorker struct with 6 fields vs original 27 fields
- Uses composed dependencies from previous phases:
  - lifecycle.RunLoop for task lifecycle management
  - executor.JobRunner for job execution
  - lifecycle.HealthMonitor for health tracking
  - lifecycle.MetricsRecorder for metrics
Key improvements demonstrated:
- Dependency injection via SimplifiedWorkerConfig
- Clear separation of concerns
- No direct resource access (queue, metrics, etc.)
- Each component implements a defined interface
- Easy to test with mock implementations
Note: This is a demonstration of the target architecture.
The original Worker struct remains for backward compatibility.
Migration would happen incrementally in future PRs.
Build status: Compiles successfully
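
A reduced sketch of the injection pattern, with two dependencies instead of the full set. The interface methods, the config fields, and the health check in Start are assumptions; the point is only that the worker sees interfaces, so tests can pass mocks.

```go
package main

import (
	"errors"
	"fmt"
)

// RunLoop and HealthMonitor are hypothetical narrow interfaces.
type RunLoop interface {
	Start() error
	Stop()
}

type HealthMonitor interface {
	IsHealthy() bool
}

// SimplifiedWorkerConfig carries every dependency the worker needs.
type SimplifiedWorkerConfig struct {
	ID      string
	RunLoop RunLoop
	Health  HealthMonitor
}

// SimplifiedWorker only holds what was injected.
type SimplifiedWorker struct {
	id      string
	runLoop RunLoop
	health  HealthMonitor
}

// NewSimplifiedWorker rejects configs with missing dependencies.
func NewSimplifiedWorker(cfg SimplifiedWorkerConfig) (*SimplifiedWorker, error) {
	if cfg.RunLoop == nil || cfg.Health == nil {
		return nil, errors.New("simplified worker: missing dependency")
	}
	return &SimplifiedWorker{id: cfg.ID, runLoop: cfg.RunLoop, health: cfg.Health}, nil
}

// Start refuses to start the loop while the worker is unhealthy.
func (w *SimplifiedWorker) Start() error {
	if !w.health.IsHealthy() {
		return fmt.Errorf("worker %s: unhealthy, not starting", w.id)
	}
	return w.runLoop.Start()
}

// Mock implementations of the kind a unit test would use.
type mockLoop struct{ started bool }

func (m *mockLoop) Start() error { m.started = true; return nil }
func (m *mockLoop) Stop()        { m.started = false }

type mockHealth struct{ ok bool }

func (m mockHealth) IsHealthy() bool { return m.ok }

func main() {
	loop := &mockLoop{}
	w, err := NewSimplifiedWorker(SimplifiedWorkerConfig{ID: "w1", RunLoop: loop, Health: mockHealth{ok: true}})
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println(w.Start(), loop.started)
}
```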
2026-02-17 14:24:40 -05:00
Jeremie Fraeys
062b78cbe0
refactor: Phase 4 - Extract lifecycle types and interfaces
...
Created lifecycle package with foundational types for future extraction:
1. internal/worker/lifecycle/runloop.go (117 lines)
   - TaskExecutor interface for task execution contract
   - RunLoopConfig for run loop configuration
   - RunLoop type with core orchestration logic
   - MetricsRecorder and Logger interfaces for dependencies
   - Start(), Stop() methods for loop control
   - executeTask() method for task lifecycle management
2. internal/worker/lifecycle/health.go (52 lines)
   - HealthMonitor type for health tracking
   - RecordHeartbeat(), IsHealthy(), MarkUnhealthy() methods
   - Heartbeater interface for heartbeat operations
   - HeartbeatLoop() function for background heartbeats
Note: These are interface/type foundations for Phase 5.
The actual Worker struct methods remain in runloop.go until
Phase 5 when they'll migrate to use these abstractions.
Build status: Compiles successfully
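
One plausible shape for the HealthMonitor methods listed above. The timeout rule and the choice that a heartbeat clears a previous MarkUnhealthy are assumptions, not something the commit states.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// HealthMonitor reports healthy while heartbeats keep arriving in time.
type HealthMonitor struct {
	mu        sync.Mutex
	timeout   time.Duration
	lastBeat  time.Time
	unhealthy bool
}

func NewHealthMonitor(timeout time.Duration) *HealthMonitor {
	return &HealthMonitor{timeout: timeout, lastBeat: time.Now()}
}

// RecordHeartbeat notes a fresh heartbeat and clears any unhealthy mark
// (the clearing behavior is an assumption).
func (h *HealthMonitor) RecordHeartbeat() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastBeat = time.Now()
	h.unhealthy = false
}

// MarkUnhealthy forces the monitor into the unhealthy state.
func (h *HealthMonitor) MarkUnhealthy() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.unhealthy = true
}

// IsHealthy is true when not marked unhealthy and the last heartbeat is recent.
func (h *HealthMonitor) IsHealthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return !h.unhealthy && time.Since(h.lastBeat) <= h.timeout
}

// demo walks through the state changes. The negative timeout makes the last
// monitor stale immediately, without sleeping.
func demo() (fresh, marked, recovered, stale bool) {
	h := NewHealthMonitor(time.Hour)
	fresh = h.IsHealthy()
	h.MarkUnhealthy()
	marked = h.IsHealthy()
	h.RecordHeartbeat()
	recovered = h.IsHealthy()
	stale = NewHealthMonitor(-time.Second).IsHealthy()
	return
}

func main() {
	fmt.Println(demo())
}
```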
2026-02-17 14:22:58 -05:00
Jeremie Fraeys
3248279c01
refactor: Phase 3 - Extract data integrity layer
...
Created integrity package with extracted data utilities:
1. internal/worker/integrity/hash.go (113 lines)
   - FileSHA256Hex() - SHA256 hash of single file
   - NormalizeSHA256ChecksumHex() - Checksum normalization
   - DirOverallSHA256Hex() - Directory hash (sequential)
   - DirOverallSHA256HexParallel() - Directory hash (parallel workers)
2. internal/worker/integrity/validate.go (76 lines)
   - DatasetVerifier type for dataset validation
   - VerifyDatasetSpecs() method for checksum validation
   - ProvenanceCalculator type for provenance computation
   - ComputeProvenance() method for task provenance
Note: Used 'integrity' instead of 'data' due to .gitignore conflict
(data/ directory is ignored for experiment artifacts)
Functions extracted from data_integrity.go:
- fileSHA256Hex → FileSHA256Hex
- normalizeSHA256ChecksumHex → NormalizeSHA256ChecksumHex
- dirOverallSHA256HexGo → DirOverallSHA256Hex
- dirOverallSHA256HexParallel → DirOverallSHA256HexParallel
- verifyDatasetSpecs logic → DatasetVerifier
- computeTaskProvenance logic → ProvenanceCalculator
Build status: Compiles successfully
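
To illustrate what checksum normalization could involve, here is a sketch using only the standard library. The accepted input forms (surrounding whitespace, upper case, an optional `sha256:` prefix) are assumptions about NormalizeSHA256ChecksumHex, and BytesSHA256Hex is a hypothetical in-memory stand-in for FileSHA256Hex.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// BytesSHA256Hex returns the lowercase hex SHA-256 digest of data.
func BytesSHA256Hex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// NormalizeSHA256ChecksumHex trims whitespace, lowercases, strips an optional
// "sha256:" prefix, and verifies that 64 hex characters remain.
func NormalizeSHA256ChecksumHex(s string) (string, error) {
	s = strings.ToLower(strings.TrimSpace(s))
	s = strings.TrimPrefix(s, "sha256:")
	if len(s) != sha256.Size*2 {
		return "", fmt.Errorf("checksum has %d characters, want %d", len(s), sha256.Size*2)
	}
	if _, err := hex.DecodeString(s); err != nil {
		return "", fmt.Errorf("checksum is not valid hex: %w", err)
	}
	return s, nil
}

func main() {
	digest := BytesSHA256Hex([]byte("abc"))
	normalized, err := NormalizeSHA256ChecksumHex("  SHA256:" + strings.ToUpper(digest) + "\n")
	fmt.Println(normalized == digest, err)
}
```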
2026-02-17 14:20:41 -05:00
Jeremie Fraeys
22f3d66f1d
refactor: Phase 2 - Extract executor implementations
...
Created executor package with extracted job execution logic:
1. internal/worker/executor/local.go (104 lines)
   - LocalExecutor implements JobExecutor interface
   - Execute() method for local bash script execution
   - generateScript() helper for creating experiment scripts
2. internal/worker/executor/container.go (229 lines)
   - ContainerExecutor implements JobExecutor interface
   - Execute() method for podman container execution
   - EnvironmentPool interface for image caching
   - Tracking tool provisioning (MLflow, TensorBoard, Wandb)
   - Volume and cache setup
   - selectDependencyManifest() helper
3. internal/worker/executor/runner.go (131 lines)
   - JobRunner orchestrates execution
   - ExecutionMode enum (Auto, Local, Container)
   - Run() method with directory setup and executor selection
   - finalize() for success/failure handling
Key design decisions:
- Executors depend on interfaces (ManifestWriter, not Worker)
- JobRunner composes both executors
- No direct Worker dependencies in executor package
- SetupJobDirectories reused from execution package
Build status: Compiles successfully
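
Executor selection by mode could be sketched like this. The constant names, the single-argument Execute signature, and the rule that Auto prefers the container executor are assumptions; directory setup and finalize() are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// ExecutionMode mirrors the Auto/Local/Container enum from the commit.
type ExecutionMode int

const (
	ModeAuto ExecutionMode = iota
	ModeLocal
	ModeContainer
)

// JobExecutor is a simplified, hypothetical executor contract.
type JobExecutor interface {
	Execute(job string) error
}

// recorder is a fake executor that remembers which jobs it ran.
type recorder struct{ jobs []string }

func (r *recorder) Execute(job string) error {
	r.jobs = append(r.jobs, job)
	return nil
}

// JobRunner composes both executors and picks one per run.
type JobRunner struct {
	local, container JobExecutor
}

func (r *JobRunner) selectExecutor(mode ExecutionMode) (JobExecutor, error) {
	switch mode {
	case ModeLocal:
		if r.local == nil {
			return nil, errors.New("local executor not configured")
		}
		return r.local, nil
	case ModeContainer:
		if r.container == nil {
			return nil, errors.New("container executor not configured")
		}
		return r.container, nil
	case ModeAuto:
		// Assumed policy: prefer isolation, fall back to local execution.
		if r.container != nil {
			return r.container, nil
		}
		if r.local != nil {
			return r.local, nil
		}
		return nil, errors.New("no executor configured")
	default:
		return nil, fmt.Errorf("unknown execution mode %d", mode)
	}
}

// Run selects an executor for the mode and hands the job to it.
func (r *JobRunner) Run(job string, mode ExecutionMode) error {
	ex, err := r.selectExecutor(mode)
	if err != nil {
		return err
	}
	return ex.Execute(job)
}

func main() {
	loc, ctr := &recorder{}, &recorder{}
	runner := &JobRunner{local: loc, container: ctr}
	fmt.Println(runner.Run("job-1", ModeAuto), runner.Run("job-2", ModeLocal))
	fmt.Println(loc.jobs, ctr.jobs)
}
```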
2026-02-17 14:14:04 -05:00
Jeremie Fraeys
ae0a370fb4
refactor: Phase 1 - Extract worker interfaces
...
Created interfaces package to break tight coupling:
1. internal/worker/interfaces/executor.go (30 lines)
   - JobExecutor interface for job execution
   - ExecutionEnv struct for execution context
   - ExecutionResult struct for results
2. internal/worker/interfaces/tracker.go (20 lines)
   - ProgressTracker interface for execution stages
   - StageStart, StageComplete, StageFailed methods
   - JobComplete for final status
3. internal/worker/interfaces/manifest.go (18 lines)
   - ManifestWriter interface for manifest operations
   - Upsert method for update/create
   - BuildInitial method for creating new manifests
These interfaces will enable:
- Dependency injection in future phases
- Mocking for unit tests
- Clean separation between orchestration and execution
Build status: Compiles successfully
2026-02-17 14:10:03 -05:00
Jeremie Fraeys
c46be7f815
refactor: Phase 4 deferred - Extract GPU utilities and execution helpers
...
Extracted from execution.go to focused packages:
1. internal/worker/gpu.go (60 lines)
   - gpuVisibleDevicesString() - GPU device string formatting
   - filterExistingDevicePaths() - Device path filtering
   - gpuVisibleEnvVarName() - GPU env var selection
   - Reuses GPUType constants from gpu_detector.go
2. internal/worker/execution/setup.go (108 lines)
   - SetupJobDirectories() - Job directory creation
   - CopyDir() - Directory tree copying
   - copyFile() - Single file copy helper
3. internal/worker/execution/snapshot.go (52 lines)
   - StageSnapshot() - Snapshot staging for jobs
   - StageSnapshotFromPath() - Snapshot staging from path
Updated execution.go:
- Removed 64 lines of GPU utilities (now in gpu.go)
- Reduced from 1,082 to ~1,018 lines
- Still contains main execution flow (runJob, executeJob, etc.)
Build status: Compiles successfully
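
As an illustration of the two GPU helpers, the sketch below formats device indices and picks a visibility variable per vendor. The GPUType constant names and values, the function signatures, and the vendor-to-variable mapping are assumptions. CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES are the variables honored by the CUDA and ROCm runtimes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// GPUType values here are hypothetical; the real constants live in gpu_detector.go.
type GPUType string

const (
	GPUTypeNVIDIA GPUType = "nvidia"
	GPUTypeAMD    GPUType = "amd"
)

// gpuVisibleEnvVarName returns the variable that limits device visibility
// for the vendor runtime, or "" when the type is not recognized.
func gpuVisibleEnvVarName(t GPUType) string {
	switch t {
	case GPUTypeNVIDIA:
		return "CUDA_VISIBLE_DEVICES"
	case GPUTypeAMD:
		return "HIP_VISIBLE_DEVICES"
	default:
		return ""
	}
}

// gpuVisibleDevicesString renders device indices as a comma-separated list.
func gpuVisibleDevicesString(ids []int) string {
	parts := make([]string, len(ids))
	for i, id := range ids {
		parts[i] = strconv.Itoa(id)
	}
	return strings.Join(parts, ",")
}

// gpuEnv builds a NAME=value entry suitable for a process environment.
func gpuEnv(t GPUType, ids []int) (string, bool) {
	name := gpuVisibleEnvVarName(t)
	if name == "" || len(ids) == 0 {
		return "", false
	}
	return name + "=" + gpuVisibleDevicesString(ids), true
}

func main() {
	entry, ok := gpuEnv(GPUTypeNVIDIA, []int{0, 2})
	fmt.Println(entry, ok)
}
```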
2026-02-17 14:03:11 -05:00
Jeremie Fraeys
d8cc2a4efa
refactor: Migrate all test imports from api to api/ws package
...
Updated 6 test files to use proper api/ws package imports:
1. tests/e2e/websocket_e2e_test.go
   - api.NewWSHandler → ws.NewHandler
2. tests/e2e/wss_reverse_proxy_e2e_test.go
   - api.NewWSHandler → ws.NewHandler
3. tests/integration/ws_handler_integration_test.go
   - api.NewWSHandler → wspkg.NewHandler
   - api.Opcode* → wspkg.Opcode*
4. tests/integration/websocket_queue_integration_test.go
   - api.NewWSHandler → wspkg.NewHandler
   - api.Opcode* → wspkg.Opcode*
5. tests/unit/api/ws_test.go
   - api.NewWSHandler → wspkg.NewHandler
   - api.Opcode* → wspkg.Opcode*
6. tests/unit/api/ws_jobs_args_test.go
   - api.Opcode* → wspkg.Opcode*
Removed api/ws_compat.go shim as all tests now use proper imports.
Build status: Compiles successfully
2026-02-17 13:52:20 -05:00
Jeremie Fraeys
83ca393ebc
fix: Add proper WebSocket compatibility shim for test imports
...
Updated api/ws_compat.go to properly delegate to api/ws package:
- NewWSHandler returns http.Handler interface (not interface{})
- All Opcode* constants re-exported from ws package
- Maintains backward compatibility for existing tests
This allows gradual migration of tests to use api/ws directly without
breaking the build. Tests can be updated incrementally.
Build status: Compiles successfully
2026-02-17 13:47:47 -05:00
Jeremie Fraeys
f191f7f68d
refactor: Phase 6 - Queue Restructure
...
Created subpackages for queue implementations:
- queue/redis/queue.go (165 lines) - Redis-based queue implementation
- queue/sqlite/queue.go (194 lines) - SQLite-based queue implementation
- queue/filesystem/queue.go (159 lines) - Filesystem-based queue implementation
Build status: Compiles successfully
2026-02-17 13:41:06 -05:00
Jeremie Fraeys
d9c5750ed8
refactor: Phase 5 cleanup - Remove original ws_*.go files
...
Removed original monolithic WebSocket handler files after extracting
to focused packages:
Deleted:
- ws_jobs.go (1,365 lines) → Extracted to api/jobs/handlers.go
- ws_jupyter.go (512 lines) → Extracted to api/jupyter/handlers.go
- ws_validate.go (523 lines) → Extracted to api/validate/handlers.go
- ws_handler.go (379 lines) → Extracted to api/ws/handler.go
- ws_datasets.go (174 lines) - Functionality not migrated
- ws_tls_auth.go (101 lines) - Functionality not migrated
Updated:
- routes.go - Changed NewWSHandler → ws.NewHandler
Lines deleted: 3,054 across the six monolithic files
Build status: Compiles successfully
2026-02-17 13:33:00 -05:00
Jeremie Fraeys
f0ffbb4a3d
refactor: Phase 5 complete - API packages extracted
...
Extracted all deferred API packages from monolithic ws_*.go files:
- api/routes.go (75 lines) - Extracted route registration from server.go
- api/errors.go (108 lines) - Standardized error responses and error codes
- api/jobs/handlers.go (271 lines) - Job WebSocket handlers
* HandleAnnotateRun, HandleSetRunNarrative
* HandleCancelJob, HandlePruneJobs, HandleListJobs
- api/jupyter/handlers.go (244 lines) - Jupyter WebSocket handlers
* HandleStartJupyter, HandleStopJupyter
* HandleListJupyter, HandleListJupyterPackages
* HandleRemoveJupyter, HandleRestoreJupyter
- api/validate/handlers.go (163 lines) - Validation WebSocket handlers
* HandleValidate, HandleGetValidateStatus, HandleListValidations
- api/ws/handler.go (298 lines) - WebSocket handler framework
* Core WebSocket handling logic
* Opcode constants and error codes
Lines redistributed: ~1,150 lines from ws_jobs.go (1,365), ws_jupyter.go (512),
ws_validate.go (523), ws_handler.go (379) into focused packages.
Note: Original ws_*.go files still present - cleanup in next commit.
Build status: Compiles successfully
2026-02-17 13:25:58 -05:00
Jeremie Fraeys
db7fbbd8d5
refactor: Phase 5 - split API package into focused files
...
Reorganized internal/api/ package to follow single-concern principle:
- api/factory.go (new file, 257 lines)
  - Extracted component initialization from server.go
  - initializeComponents(), setupLogger(), initExperimentManager()
  - initTaskQueue(), initDatabase(), initDatabaseSchema()
  - initSecurity(), initJupyterServiceManager(), initAuditLogger()
- api/middleware.go (new file, 31 lines)
  - Extracted wrapWithMiddleware() - security middleware chain
  - Centralized auth, rate limiting, CORS, security headers
- api/server.go (reduced from 446 to 212 lines)
  - Now focused on Server lifecycle: NewServer, Start, WaitForShutdown, Close
  - Removed initialization logic (moved to factory.go)
  - Removed middleware wrapper (moved to middleware.go)
- api/metrics_middleware.go (existing, 64 lines)
  - Already had wrapWithMetrics(), left in place
Lines redistributed: ~180 lines from monolithic server.go
Build status: Compiles successfully
2026-02-17 13:11:02 -05:00
Jeremie Fraeys
a5c1a9fc0b
refactor: Phase 4 - split worker package into focused files
...
Split 551-line worker/core.go into single-concern files:
- worker/config.go (+44 lines)
  - Added config parsing: envInt(), parseCPUFromConfig(), parseGPUCountFromConfig(), parseGPUSlotsPerGPUFromConfig()
  - Now has all config logic in one place (440 lines total)
- worker/metrics.go (new file, 172 lines)
  - Extracted setupMetricsExporter() with ~30 Prometheus metric registrations
  - Isolated metrics logic for easy modification
- worker/factory.go (new file, 183 lines)
  - Extracted NewWorker() factory function
  - Moved prePullImages(), pullImage() from core.go
  - Centralized worker instantiation
- worker/worker.go (renamed from core.go, ~100 lines)
  - Now just defines Worker struct, MLServer, JupyterManager
  - Clean, focused file without mixed concerns
Lines redistributed: ~350 lines moved from monolithic core.go
Build status: Compiles successfully
2026-02-17 12:57:02 -05:00
Jeremie Fraeys
d1bef0a450
refactor: Phase 3 - fix config/storage boundaries
...
Move schema ownership to infrastructure layer:
- Redis keys: config/constants.go -> queue/keys.go (TaskQueueKey, TaskPrefix, etc.)
- Filesystem paths: config/paths.go -> storage/paths.go (JobPaths)
- Create config/shared.go with RedisConfig, SSHConfig
- Update all imports: worker/, api/helpers, api/ws_jobs, api/ws_validate
- Clean up: remove duplicates from queue/task.go, queue/queue.go, config/paths.go
Build status: Compiles successfully
2026-02-17 12:49:53 -05:00
Jeremie Fraeys
6580917ba8
refactor: extract domain types and consolidate error system (Phases 1-2)
...
Phase 1: Extract Domain Types
=============================
- Create internal/domain/ package with canonical types:
  - domain/task.go: Task, Attempt structs
  - domain/tracking.go: TrackingConfig and MLflow/TensorBoard/Wandb configs
  - domain/dataset.go: DatasetSpec
  - domain/status.go: JobStatus constants
  - domain/errors.go: FailureClass system with classification functions
  - domain/doc.go: package documentation
- Update queue/task.go to re-export domain types (backward compatibility)
- Update TUI model/state.go to use domain types via type aliases
- Simplify TUI services: remove ~60 lines of conversion functions
Phase 2: Delete ErrorCategory System
====================================
- Remove deprecated ErrorCategory type and constants
- Remove TaskError struct and related functions
- Remove mapping functions: ClassifyError, IsRetryable, GetUserMessage, RetryDelay
- Update all queue implementations to use domain.FailureClass directly:
  - queue/metrics.go: RecordTaskFailure/Retry now take FailureClass
  - queue/queue.go: RetryTask uses domain.ClassifyFailure
  - queue/filesystem_queue.go: RetryTask and MoveToDeadLetterQueue updated
  - queue/sqlite_queue.go: RetryTask and MoveToDeadLetterQueue updated
Lines eliminated: ~190 lines of conversion and mapping code
Result: Single source of truth for domain types and error classification
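
A single classification function that every queue backend calls might look like the sketch below. The class names, the errors each class matches, and the ShouldRetry helper are all hypothetical; only the names FailureClass and ClassifyFailure appear in the commit.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io/fs"
)

// FailureClass is a coarse label that retry logic can switch on.
type FailureClass string

const (
	FailureTransient FailureClass = "transient"
	FailurePermanent FailureClass = "permanent"
	FailureUnknown   FailureClass = "unknown"
)

// ClassifyFailure maps an error to a class by unwrapping it with errors.Is.
func ClassifyFailure(err error) FailureClass {
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		return FailureTransient
	case errors.Is(err, fs.ErrNotExist), errors.Is(err, fs.ErrPermission):
		return FailurePermanent
	default:
		return FailureUnknown
	}
}

// ShouldRetry is the one place that turns a class into a retry decision.
func ShouldRetry(c FailureClass) bool {
	return c != FailurePermanent
}

// Example errors, wrapped the way call sites usually wrap them.
var (
	errPullTimeout = fmt.Errorf("pull image: %w", context.DeadlineExceeded)
	errNoScript    = fmt.Errorf("open train.py: %w", fs.ErrNotExist)
)

func main() {
	for _, err := range []error{errPullTimeout, errNoScript, errors.New("exit status 1")} {
		class := ClassifyFailure(err)
		fmt.Printf("%v -> %s (retry: %t)\n", err, class, ShouldRetry(class))
	}
}
```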
2026-02-17 12:34:28 -05:00
Jeremie Fraeys
a93b6715fd
feat: add native library bridge and queue integration
...
- Add native_queue.go with CGO bindings for queue operations
- Add native_queue_stub.go for non-CGO builds
- Add hash_selector to choose between Go and native implementations
- Add native_bridge_libs.go for CGO builds with native_libs tag
- Add native_bridge_nocgo.go stub for non-CGO builds
- Update queue errors and task handling for native integration
- Update worker config and runloop for native library support
2026-02-16 20:38:30 -05:00
Jeremie Fraeys
b05470b30a
refactor: improve API structure and WebSocket protocol
...
- Extract WebSocket protocol handling to dedicated module
- Add helper functions for DB operations, validation, and responses
- Improve WebSocket frame handling and opcodes
- Refactor dataset, job, and Jupyter handlers
- Add duplicate detection processing
2026-02-16 20:38:12 -05:00
Jeremie Fraeys
43d241c28d
feat: implement C++ native libraries for performance-critical operations
...
- Add arena allocator for zero-allocation hot paths
- Add thread pool for parallel operations
- Add mmap utilities for memory-mapped I/O
- Implement queue_index with heap-based priority queue
- Implement dataset_hash with SIMD support (SHA-NI, ARMv8)
- Add runtime SIMD detection for cross-platform correctness
- Add comprehensive tests and benchmarks
2026-02-16 20:38:04 -05:00
Jeremie Fraeys
d408a60eb1
ci: push all workflow updates
2026-02-12 13:28:15 -05:00
Jeremie Fraeys
2e701340e5
feat(core): API, worker, queue, and manifest improvements
...
- Add protocol buffer optimizations (internal/api/protocol.go)
- Add filesystem queue backend (internal/queue/filesystem_queue.go)
- Add run manifest support (internal/manifest/run_manifest.go)
- Worker and jupyter task refinements
- Exported test wrappers for benchmarking
2026-02-12 12:05:17 -05:00
Jeremie Fraeys
72b4b29ecd
perf: add profiling benchmarks and parallel Go baseline for C++ optimization
...
Add comprehensive benchmarking suite for C++ optimization targets:
- tests/benchmarks/dataset_hash_bench_test.go - dirOverallSHA256Hex profiling
- tests/benchmarks/queue_bench_test.go - filesystem queue profiling
- tests/benchmarks/artifact_and_snapshot_bench_test.go - scanArtifacts/extractTarGz profiling
- tests/unit/worker/artifacts_test.go - moved from internal/ for clean separation
Add parallel Go implementation as baseline for C++ comparison:
- internal/worker/data_integrity.go: dirOverallSHA256HexParallel() with worker pool
- Benchmarks show 2.1x speedup (3.97ms -> 1.90ms) vs sequential
Exported wrappers for testing:
- ScanArtifacts() - artifact scanning
- ExtractTarGz() - tar.gz extraction
- DirOverallSHA256HexParallel() - parallel hashing
Profiling results (Apple M2 Ultra):
- dirOverallSHA256Hex: 78% syscall overhead (target for mmap C++)
- rebuildIndex: 96% syscall overhead (target for binary index C++)
- scanArtifacts: 87% syscall overhead (target for fast traversal C++)
- extractTarGz: 95% syscall overhead (target for parallel gzip C++)
Related: C++ optimization strategy in memory 5d5f0bb6
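
The worker-pool approach to directory hashing can be sketched as follows. This is not the project's dirOverallSHA256HexParallel: the function name dirHashParallel, the way per-file digests are combined (sorted `digest  relative/path` lines hashed again), and the worker count are assumptions chosen to keep the result independent of scheduling.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sort"
	"sync"
)

// fileSHA256Hex streams one file through SHA-256.
func fileSHA256Hex(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// dirHashParallel hashes every regular file under root with a pool of workers,
// then combines the per-file digests in sorted path order.
func dirHashParallel(root string, workers int) (string, error) {
	if workers < 1 {
		workers = 1
	}
	var paths []string
	err := filepath.WalkDir(root, func(p string, d os.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.Type().IsRegular() {
			paths = append(paths, p)
		}
		return nil
	})
	if err != nil {
		return "", err
	}
	sort.Strings(paths)

	sums := make([]string, len(paths))
	errs := make([]error, len(paths))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs { // each index is written by exactly one worker
				sums[i], errs[i] = fileSHA256Hex(paths[i])
			}
		}()
	}
	for i := range paths {
		jobs <- i
	}
	close(jobs)
	wg.Wait()

	overall := sha256.New()
	for i, p := range paths {
		if errs[i] != nil {
			return "", errs[i]
		}
		rel, err := filepath.Rel(root, p)
		if err != nil {
			return "", err
		}
		fmt.Fprintf(overall, "%s  %s\n", sums[i], filepath.ToSlash(rel))
	}
	return hex.EncodeToString(overall.Sum(nil)), nil
}

// demo hashes a small temporary directory with 1 and 4 workers.
func demo() (seq, par string, err error) {
	dir, err := os.MkdirTemp("", "dirhash")
	if err != nil {
		return
	}
	defer os.RemoveAll(dir)
	for name, body := range map[string]string{"a.txt": "alpha", "b.txt": "beta", "c.txt": "gamma"} {
		if err = os.WriteFile(filepath.Join(dir, name), []byte(body), 0o644); err != nil {
			return
		}
	}
	if seq, err = dirHashParallel(dir, 1); err != nil {
		return
	}
	par, err = dirHashParallel(dir, 4)
	return
}

func main() {
	seq, par, err := demo()
	fmt.Println(seq == par, err)
}
```

Because each worker writes only to its own slot and the final digest is built in sorted order, the result does not depend on the number of workers.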
2026-02-12 12:04:02 -05:00
Jeremie Fraeys
c0eeeda940
feat(experiment): improve experiment lifecycle and update first-experiment guide
2026-01-05 12:37:34 -05:00
Jeremie Fraeys
6b771e4a50
feat(jupyter): improve runtime management and update security/workflow docs
2026-01-05 12:37:27 -05:00
Jeremie Fraeys
dab680a60d
feat(tracking): add pluggable tracking backends and audit support
2026-01-05 12:33:57 -05:00
Jeremie Fraeys
82034c68f3
feat(worker): add integrity checks, snapshot staging, and prewarm support
2026-01-05 12:31:13 -05:00
Jeremie Fraeys
add4a90e62
feat(api): refactor websocket handlers; add health and prometheus middleware
2026-01-05 12:31:07 -05:00
Jeremie Fraeys
6ff5324e74
refactor(storage,queue): split storage layer and add sqlite queue backend
2026-01-05 12:31:02 -05:00
Jeremie Fraeys
cd5640ebd2
Slim and secure: move scripts, clean configs, remove secrets
...
- Move ci-test.sh and setup.sh to scripts/
- Trim docs/src/zig-cli.md to current structure
- Replace hardcoded secrets with placeholders in configs
- Update .gitignore to block .env*, secrets/, keys, build artifacts
- Slim README.md to reflect current CLI/TUI split
- Add cleanup trap to ci-test.sh
- Ensure no secrets are committed
2025-12-07 13:57:51 -05:00
Jeremie Fraeys
ea15af1833
Fix multi-user authentication and clean up debug code
...
- Fix YAML tags in auth config struct (json -> yaml)
- Update CLI configs to use pre-hashed API keys
- Remove double hashing in WebSocket client
- Fix port mapping (9102 -> 9103) in CLI commands
- Update permission keys to use jobs:read, jobs:create, etc.
- Clean up all debug logging from CLI and server
- All user roles now authenticate correctly:
* Admin: Can queue jobs and see all jobs
* Researcher: Can queue jobs and see own jobs
* Analyst: Can see status (read-only access)
Multi-user authentication is now fully functional.
2025-12-06 12:35:32 -05:00
Jeremie Fraeys
10a3afaafb
fix: update production environment variable check
...
- Change FETCH_ML_ENV check from 'production' to 'prod'
- Aligns with common environment naming conventions
- Fixes authentication validation for production deployment
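
The check after this fix presumably reduces to a string comparison against `prod`. A minimal sketch follows, with a hypothetical isProduction helper that takes the lookup function as a parameter so it can be tested without touching the real environment.

```go
package main

import (
	"fmt"
	"os"
)

// envVarName is the variable named in the commit message.
const envVarName = "FETCH_ML_ENV"

// isProduction reports whether the environment is set to "prod".
// Note that the older value "production" no longer matches.
func isProduction(getenv func(string) string) bool {
	return getenv(envVarName) == "prod"
}

func main() {
	fmt.Println("production:", isProduction(os.Getenv))
}
```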
2025-12-04 17:06:32 -05:00
Jeremie Fraeys
803677be57
feat: implement Go backend with comprehensive API and internal packages
...
- Add API server with WebSocket support and REST endpoints
- Implement authentication system with API keys and permissions
- Add task queue system with Redis backend and error handling
- Include storage layer with database migrations and schemas
- Add comprehensive logging, metrics, and telemetry
- Implement security middleware and network utilities
- Add experiment management and container orchestration
- Include configuration management with smart defaults
2025-12-04 16:53:53 -05:00