103 commits

Author SHA1 Message Date
Jeremie Fraeys
f191f7f68d
refactor: Phase 6 - Queue Restructure
Created subpackages for queue implementations:

- queue/redis/queue.go (165 lines) - Redis-based queue implementation
- queue/sqlite/queue.go (194 lines) - SQLite-based queue implementation
- queue/filesystem/queue.go (159 lines) - Filesystem-based queue implementation

Build status: Compiles successfully
2026-02-17 13:41:06 -05:00
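Note on f191f7f68d: three backend subpackages usually imply a shared contract that each one satisfies. The following is a minimal sketch of that idea only. The `Queue` interface, the `Task` fields, `ErrEmpty`, and the in-memory backend are all hypothetical names chosen for illustration; the commit does not show the real interface.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Task is a hypothetical, trimmed-down task record.
type Task struct {
	ID string
}

// ErrEmpty is returned when no task is available (assumed sentinel).
var ErrEmpty = errors.New("queue: empty")

// Queue is an assumed common contract that each backend subpackage
// (redis, sqlite, filesystem) could satisfy independently.
type Queue interface {
	Enqueue(t Task) error
	Dequeue() (Task, error)
	Len() int
}

// memoryQueue is an in-memory stand-in used only to show the shape.
type memoryQueue struct {
	mu    sync.Mutex
	tasks []Task
}

// NewMemoryQueue returns the stand-in backend behind the interface.
func NewMemoryQueue() Queue { return &memoryQueue{} }

func (q *memoryQueue) Enqueue(t Task) error {
	if t.ID == "" {
		return errors.New("queue: task ID required")
	}
	q.mu.Lock()
	defer q.mu.Unlock()
	q.tasks = append(q.tasks, t)
	return nil
}

func (q *memoryQueue) Dequeue() (Task, error) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.tasks) == 0 {
		return Task{}, ErrEmpty
	}
	t := q.tasks[0]
	q.tasks = q.tasks[1:]
	return t, nil
}

func (q *memoryQueue) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.tasks)
}

func main() {
	var q Queue = NewMemoryQueue()
	_ = q.Enqueue(Task{ID: "job-1"})
	t, err := q.Dequeue()
	fmt.Println(t.ID, err)
}
```

Callers that depend only on the interface can switch backends by changing a single constructor call, which is the usual motivation for a per-backend subpackage layout.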
Jeremie Fraeys
d9c5750ed8
refactor: Phase 5 cleanup - Remove original ws_*.go files
Removed original monolithic WebSocket handler files after extracting
to focused packages:

Deleted:
- ws_jobs.go (1,365 lines) → Extracted to api/jobs/handlers.go
- ws_jupyter.go (512 lines) → Extracted to api/jupyter/handlers.go
- ws_validate.go (523 lines) → Extracted to api/validate/handlers.go
- ws_handler.go (379 lines) → Extracted to api/ws/handler.go
- ws_datasets.go (174 lines) - Functionality not migrated
- ws_tls_auth.go (101 lines) - Functionality not migrated

Updated:
- routes.go - Changed NewWSHandler → ws.NewHandler

Lines deleted: ~3,050 lines from the six monolithic files
Build status: Compiles successfully
2026-02-17 13:33:00 -05:00
Jeremie Fraeys
f0ffbb4a3d
refactor: Phase 5 complete - API packages extracted
Extracted all deferred API packages from monolithic ws_*.go files:

- api/routes.go (75 lines) - Extracted route registration from server.go
- api/errors.go (108 lines) - Standardized error responses and error codes
- api/jobs/handlers.go (271 lines) - Job WebSocket handlers
  * HandleAnnotateRun, HandleSetRunNarrative
  * HandleCancelJob, HandlePruneJobs, HandleListJobs
- api/jupyter/handlers.go (244 lines) - Jupyter WebSocket handlers
  * HandleStartJupyter, HandleStopJupyter
  * HandleListJupyter, HandleListJupyterPackages
  * HandleRemoveJupyter, HandleRestoreJupyter
- api/validate/handlers.go (163 lines) - Validation WebSocket handlers
  * HandleValidate, HandleGetValidateStatus, HandleListValidations
- api/ws/handler.go (298 lines) - WebSocket handler framework
  * Core WebSocket handling logic
  * Opcode constants and error codes

Lines redistributed: ~1,150 lines from ws_jobs.go (1,365), ws_jupyter.go (512),
ws_validate.go (523), ws_handler.go (379) into focused packages.

Note: Original ws_*.go files still present - cleanup in next commit.
Build status: Compiles successfully
2026-02-17 13:25:58 -05:00
Jeremie Fraeys
db7fbbd8d5
refactor: Phase 5 - split API package into focused files
Reorganized internal/api/ package to follow single-concern principle:

- api/factory.go (new file, 257 lines)
  - Extracted component initialization from server.go
  - initializeComponents(), setupLogger(), initExperimentManager()
  - initTaskQueue(), initDatabase(), initDatabaseSchema()
  - initSecurity(), initJupyterServiceManager(), initAuditLogger()

- api/middleware.go (new file, 31 lines)
  - Extracted wrapWithMiddleware() - security middleware chain
  - Centralized auth, rate limiting, CORS, security headers

- api/server.go (reduced from 446 to 212 lines)
  - Now focused on Server lifecycle: NewServer, Start, WaitForShutdown, Close
  - Removed initialization logic (moved to factory.go)
  - Removed middleware wrapper (moved to middleware.go)

- api/metrics_middleware.go (existing, 64 lines)
  - Already had wrapWithMetrics(), left in place

Lines redistributed: ~180 lines from monolithic server.go
Build status: Compiles successfully
2026-02-17 13:11:02 -05:00
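Note on db7fbbd8d5: the commit describes `wrapWithMiddleware()` as a security middleware chain. A minimal sketch of how such a chain can be composed with `net/http` follows. The `chain` helper, the two middlewares, and the `X-API-Key` header are assumptions for illustration, not the repository's code.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Middleware wraps a handler with one cross-cutting concern.
type Middleware func(http.Handler) http.Handler

// chain applies middlewares so that the first one listed runs outermost.
func chain(h http.Handler, mws ...Middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// securityHeaders is a hypothetical middleware that sets a response header.
func securityHeaders(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Content-Type-Options", "nosniff")
		next.ServeHTTP(w, r)
	})
}

// requireAPIKey is a hypothetical auth check: it rejects requests that
// carry no X-API-Key header.
func requireAPIKey(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-API-Key") == "" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends one request through the chain and returns the status code.
func statusFor(apiKey string) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	h := chain(ok, securityHeaders, requireAPIKey)
	req := httptest.NewRequest(http.MethodGet, "/health", nil)
	if apiKey != "" {
		req.Header.Set("X-API-Key", apiKey)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(""), statusFor("secret")) // 401 200
}
```

Because `securityHeaders` is outermost here, rejected requests still receive the security header. Ordering is the main design decision in a chain like this.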
Jeremie Fraeys
a5c1a9fc0b
refactor: Phase 4 - split worker package into focused files
Split 551-line worker/core.go into single-concern files:

- worker/config.go (+44 lines)
  - Added config parsing: envInt(), parseCPUFromConfig(), parseGPUCountFromConfig()
  - parseGPUSlotsPerGPUFromConfig()
  - Now has all config logic in one place (440 lines total)

- worker/metrics.go (new file, 172 lines)
  - Extracted setupMetricsExporter() with ~30 Prometheus metric registrations
  - Isolated metrics logic for easy modification

- worker/factory.go (new file, 183 lines)
  - Extracted NewWorker() factory function
  - Moved prePullImages(), pullImage() from core.go
  - Centralized worker instantiation

- worker/worker.go (renamed from core.go, ~100 lines)
  - Now just defines Worker struct, MLServer, JupyterManager
  - Clean, focused file without mixed concerns

Lines redistributed: ~350 lines moved from monolithic core.go
Build status: Compiles successfully
2026-02-17 12:57:02 -05:00
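Note on a5c1a9fc0b: the commit names an `envInt()` config helper but does not show its signature. The sketch below is one plausible shape for an environment-backed integer lookup with a fallback. `envIntOr`, `parseIntOr`, and the `DEMO_*` variable names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseIntOr converts raw to an int, returning fallback when raw is not
// a valid integer. Surrounding whitespace is ignored.
func parseIntOr(raw string, fallback int) int {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return fallback
	}
	return n
}

// envIntOr reads an integer from the environment, returning fallback
// when the variable is unset or malformed.
func envIntOr(name string, fallback int) int {
	raw, ok := os.LookupEnv(name)
	if !ok {
		return fallback
	}
	return parseIntOr(raw, fallback)
}

func main() {
	os.Setenv("DEMO_WORKER_CPU", " 8 ")
	fmt.Println(envIntOr("DEMO_WORKER_CPU", 2))       // set and valid: 8
	fmt.Println(envIntOr("DEMO_WORKER_GPU_UNSET", 0)) // unset: 0
}
```

Splitting the parsing from the environment lookup keeps the parsing testable without mutating process state.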
Jeremie Fraeys
d1bef0a450
refactor: Phase 3 - fix config/storage boundaries
Move schema ownership to infrastructure layer:

- Redis keys: config/constants.go -> queue/keys.go (TaskQueueKey, TaskPrefix, etc.)

- Filesystem paths: config/paths.go -> storage/paths.go (JobPaths)

- Create config/shared.go with RedisConfig, SSHConfig

- Update all imports: worker/, api/helpers, api/ws_jobs, api/ws_validate

- Clean up: remove duplicates from queue/task.go, queue/queue.go, config/paths.go

Build status: Compiles successfully
2026-02-17 12:49:53 -05:00
Jeremie Fraeys
6580917ba8
refactor: extract domain types and consolidate error system (Phases 1-2)
Phase 1: Extract Domain Types
=============================
- Create internal/domain/ package with canonical types:
  - domain/task.go: Task, Attempt structs
  - domain/tracking.go: TrackingConfig and MLflow/TensorBoard/Wandb configs
  - domain/dataset.go: DatasetSpec
  - domain/status.go: JobStatus constants
  - domain/errors.go: FailureClass system with classification functions
  - domain/doc.go: package documentation

- Update queue/task.go to re-export domain types (backward compatibility)
- Update TUI model/state.go to use domain types via type aliases
- Simplify TUI services: remove ~60 lines of conversion functions

Phase 2: Delete ErrorCategory System
====================================
- Remove deprecated ErrorCategory type and constants
- Remove TaskError struct and related functions
- Remove mapping functions: ClassifyError, IsRetryable, GetUserMessage, RetryDelay
- Update all queue implementations to use domain.FailureClass directly:
  - queue/metrics.go: RecordTaskFailure/Retry now take FailureClass
  - queue/queue.go: RetryTask uses domain.ClassifyFailure
  - queue/filesystem_queue.go: RetryTask and MoveToDeadLetterQueue updated
  - queue/sqlite_queue.go: RetryTask and MoveToDeadLetterQueue updated

Lines eliminated: ~190 lines of conversion and mapping code
Result: Single source of truth for domain types and error classification
2026-02-17 12:34:28 -05:00
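Note on 6580917ba8: two techniques are named above, a single failure classification function and re-exporting domain types through type aliases for backward compatibility. The sketch below shows both in one file. The class values, sentinel errors, `Retryable`, and `LegacyTask` are hypothetical; only the names `FailureClass`, `ClassifyFailure`, and `Task` appear in the commit message, and their real definitions are not shown there.

```go
package main

import (
	"errors"
	"fmt"
)

// FailureClass labels why a task failed (values are assumed).
type FailureClass string

const (
	FailureNone      FailureClass = ""
	FailureTransient FailureClass = "transient"
	FailurePermanent FailureClass = "permanent"
)

// Hypothetical sentinel errors used to drive classification.
var (
	ErrTimeout      = errors.New("timeout")
	ErrInvalidInput = errors.New("invalid input")
)

// ClassifyFailure maps an error to a class in one place, so callers do
// not need a separate category-to-class mapping layer.
func ClassifyFailure(err error) FailureClass {
	switch {
	case err == nil:
		return FailureNone
	case errors.Is(err, ErrTimeout):
		return FailureTransient
	default:
		return FailurePermanent
	}
}

// Retryable reports whether a retry is worthwhile for this class.
func (c FailureClass) Retryable() bool { return c == FailureTransient }

// Task is the canonical type (it would live in the domain package).
type Task struct {
	ID          string
	LastFailure FailureClass
}

// LegacyTask stands in for a re-export such as `type Task = domain.Task`
// in another package. An alias is the same type, so no conversion
// functions are needed between the two names.
type LegacyTask = Task

func main() {
	err := fmt.Errorf("pull image: %w", ErrTimeout)
	var t LegacyTask = Task{ID: "job-1", LastFailure: ClassifyFailure(err)}
	fmt.Println(t.ID, t.LastFailure, t.LastFailure.Retryable())
	fmt.Println(ClassifyFailure(ErrInvalidInput))
}
```

An alias (`=`) differs from a defined type: values flow between the two names without casts, which is what allows conversion helpers to be deleted.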
Jeremie Fraeys
e286fd7769
chore: clean up temporary and build artifacts
- Remove old mermaid-init.js from assets directory (moved to static)
- Remove mem.prof profiling artifact
2026-02-16 20:39:34 -05:00
Jeremie Fraeys
fabb8fa1ee
docs: remove debug command from CLI reference
2026-02-16 20:39:20 -05:00
Jeremie Fraeys
fb4c91f4c5
chore: add review workflow and test updates
- Add Windsurf review workflow configuration
- Add logs debug test for CLI
- Update main test file
2026-02-16 20:39:09 -05:00
Jeremie Fraeys
355d2e311a
docs: update README and CHANGELOG
- Update project documentation with latest features
- Update manage-artifacts.sh script
2026-02-16 20:38:57 -05:00
Jeremie Fraeys
b177e603da
chore: update Docker build configuration
2026-02-16 20:38:50 -05:00
Jeremie Fraeys
6cc02b5efc
docs: add native libraries documentation and smoke tests
- Add comprehensive native-libraries.md documentation
- Add smoke-test-native.sh for testing native library builds
- Document build process, architecture, and testing strategy
2026-02-16 20:38:46 -05:00
Jeremie Fraeys
a93b6715fd
feat: add native library bridge and queue integration
- Add native_queue.go with CGO bindings for queue operations
- Add native_queue_stub.go for non-CGO builds
- Add hash_selector to choose between Go and native implementations
- Add native_bridge_libs.go for CGO builds with native_libs tag
- Add native_bridge_nocgo.go stub for non-CGO builds
- Update queue errors and task handling for native integration
- Update worker config and runloop for native library support
2026-02-16 20:38:30 -05:00
Jeremie Fraeys
8b4e1753d1
chore: update configurations and deployment files
- Add Redis secure configuration
- Update worker configurations for homelab and Docker
- Add Forgejo workflow configurations
- Update docker-compose files with improved networking
- Add Caddy configurations for different environments
2026-02-16 20:38:19 -05:00
Jeremie Fraeys
7305e2bc21
test: add comprehensive test coverage and command improvements
- Add logs and debug end-to-end tests
- Add test helper utilities
- Improve test fixtures and templates
- Update API server and config lint commands
- Add multi-user database initialization
2026-02-16 20:38:15 -05:00
Jeremie Fraeys
b05470b30a
refactor: improve API structure and WebSocket protocol
- Extract WebSocket protocol handling to dedicated module
- Add helper functions for DB operations, validation, and responses
- Improve WebSocket frame handling and opcodes
- Refactor dataset, job, and Jupyter handlers
- Add duplicate detection processing
2026-02-16 20:38:12 -05:00
Jeremie Fraeys
1147958e15
feat: enhance CLI with improved commands and WebSocket handling
- Refactor command structure for better organization
- Improve WebSocket client frame handling
- Add response handler improvements
- Update queue, requeue, and status commands
- Add security module for CLI authentication
2026-02-16 20:38:08 -05:00
Jeremie Fraeys
43d241c28d
feat: implement C++ native libraries for performance-critical operations
- Add arena allocator for zero-allocation hot paths
- Add thread pool for parallel operations
- Add mmap utilities for memory-mapped I/O
- Implement queue_index with heap-based priority queue
- Implement dataset_hash with SIMD support (SHA-NI, ARMv8)
- Add runtime SIMD detection for cross-platform correctness
- Add comprehensive tests and benchmarks
2026-02-16 20:38:04 -05:00
Jeremie Fraeys
d673bce216
docs: fix mermaid graphs and update outdated content
- Fix mermaid graph syntax errors (escape parentheses in node labels)
- Move mermaid-init.js to Hugo static directory for correct MIME type
- Update Future Extensions section in cli-tui-ux-contract-v1.md to match current roadmap
- Add ADR-004 through ADR-007 documenting C++ native optimization strategy
2026-02-16 20:37:38 -05:00
Jeremie Fraeys
3bd118f2d3
ci: fix Hugo installation - aggressively remove old versions and use absolute paths
Some checks failed
Checkout test / test (push) Successful in 6s
CI with Native Libraries / Check Build Environment (push) Successful in 10s
CI/CD Pipeline / Test (push) Failing after 5m7s
CI/CD Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI/CD Pipeline / Build (push) Has been skipped
CI/CD Pipeline / Test Scripts (push) Has been skipped
CI/CD Pipeline / Security Scan (push) Failing after 4m52s
Documentation / build-and-publish (push) Failing after 25s
CI with Native Libraries / Build and Test Native Libraries (push) Failing after 16m19s
CI with Native Libraries / Build Release Libraries (push) Has been skipped
CI/CD Pipeline / Docker Build (push) Has been skipped
2026-02-12 18:08:25 -05:00
Jeremie Fraeys
52b20ca7ae
ci: trigger workflow test
Some checks failed
Checkout test / test (push) Successful in 6s
Documentation / build-and-publish (push) Failing after 33s
2026-02-12 18:07:07 -05:00
Jeremie Fraeys
580288b166
fix(ci): remove codecov/codecov-action - not available on Forgejo instance
Some checks failed
Checkout test / test (push) Successful in 4s
CI with Native Libraries / Check Build Environment (push) Successful in 12s
CI/CD Pipeline / Test (push) Failing after 5m8s
CI/CD Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI/CD Pipeline / Build (push) Has been skipped
CI/CD Pipeline / Test Scripts (push) Has been skipped
CI/CD Pipeline / Security Scan (push) Failing after 4m52s
Documentation / build-and-publish (push) Failing after 37s
CI with Native Libraries / Build and Test Native Libraries (push) Failing after 16m16s
CI with Native Libraries / Build Release Libraries (push) Has been skipped
CI/CD Pipeline / Docker Build (push) Has been skipped
2026-02-12 14:16:32 -05:00
Jeremie Fraeys
bdcb134582
ci: replace setup-zig action with manual installation for Forgejo compatibility
Some checks failed
CI with Native Libraries / Build and Test Native Libraries (push) Blocked by required conditions
CI with Native Libraries / Build Release Libraries (push) Blocked by required conditions
Documentation / build-and-publish (push) Waiting to run
Checkout test / test (push) Successful in 5s
CI with Native Libraries / Check Build Environment (push) Successful in 12s
CI/CD Pipeline / Test (push) Failing after 20s
CI/CD Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI/CD Pipeline / Build (push) Has been skipped
CI/CD Pipeline / Test Scripts (push) Has been skipped
CI/CD Pipeline / Security Scan (push) Has been cancelled
CI/CD Pipeline / Docker Build (push) Has been cancelled
2026-02-12 14:14:37 -05:00
Jeremie Fraeys
4c82237608
ci: performance optimizations - native lib caching, docker layer caching, path filtering, apt caching
Some checks failed
CI with Native Libraries / Build and Test Native Libraries (push) Blocked by required conditions
CI with Native Libraries / Build Release Libraries (push) Blocked by required conditions
Documentation / build-and-publish (push) Waiting to run
Checkout test / test (push) Successful in 5s
CI with Native Libraries / Check Build Environment (push) Successful in 12s
CI/CD Pipeline / Test (push) Failing after 1m0s
CI/CD Pipeline / Dev Compose Smoke Test (push) Has been skipped
CI/CD Pipeline / Build (push) Has been skipped
CI/CD Pipeline / Test Scripts (push) Has been skipped
CI/CD Pipeline / Security Scan (push) Has been cancelled
CI/CD Pipeline / Docker Build (push) Has been cancelled
2026-02-12 14:12:25 -05:00
Jeremie Fraeys
5695d847ca
ci: add Redis service to ci-native.yml for test isolation
Some checks failed
Checkout test / test (push) Successful in 5s
CI with Native Libraries / Check Build Environment (push) Successful in 12s
Documentation / build-and-publish (push) Failing after 37s
CI with Native Libraries / Build and Test Native Libraries (push) Has been cancelled
CI with Native Libraries / Build Release Libraries (push) Has been cancelled
2026-02-12 14:09:57 -05:00
Jeremie Fraeys
1de9cc2738
fix: add missing C++ headers for queue and condition_variable
Some checks failed
Checkout test / test (push) Successful in 4s
CI with Native Libraries / Check Build Environment (push) Successful in 12s
Documentation / build-and-publish (push) Failing after 33s
CI with Native Libraries / Build and Test Native Libraries (push) Failing after 6m36s
CI with Native Libraries / Build Release Libraries (push) Has been skipped
2026-02-12 13:56:09 -05:00
Jeremie Fraeys
1f0881e70f
ci: install cmake in test-native job (jobs don't share packages)
Some checks failed
Checkout test / test (push) Successful in 6s
CI with Native Libraries / Check Build Environment (push) Successful in 12s
Documentation / build-and-publish (push) Failing after 39s
CI with Native Libraries / Build and Test Native Libraries (push) Failing after 31s
CI with Native Libraries / Build Release Libraries (push) Has been skipped
2026-02-12 13:54:06 -05:00
Jeremie Fraeys
51eb6f9d0d
ci: update Hugo to 0.146.0 for hugo-book theme compatibility
Some checks failed
Checkout test / test (push) Successful in 4s
CI with Native Libraries / Check Build Environment (push) Successful in 11s
Documentation / build-and-publish (push) Failing after 37s
CI with Native Libraries / Build and Test Native Libraries (push) Failing after 17s
CI with Native Libraries / Build Release Libraries (push) Has been skipped
2026-02-12 13:49:56 -05:00
Jeremie Fraeys
98a156110e
ci: revert to Go 1.25.0 to match go.mod
Some checks failed
Checkout test / test (push) Successful in 4s
CI with Native Libraries / Check Build Environment (push) Successful in 11s
Documentation / build-and-publish (push) Failing after 40s
CI with Native Libraries / Build and Test Native Libraries (push) Failing after 16s
CI with Native Libraries / Build Release Libraries (push) Has been skipped
2026-02-12 13:48:31 -05:00
Jeremie Fraeys
d003b3de64
ci: replace slow setup-go with fast shell script in all workflows
Some checks failed
Checkout test / test (push) Successful in 5s
CI with Native Libraries / Check Build Environment (push) Successful in 12s
Documentation / build-and-publish (push) Failing after 32s
CI with Native Libraries / Build and Test Native Libraries (push) Failing after 5s
CI with Native Libraries / Build Release Libraries (push) Has been skipped
2026-02-12 13:41:59 -05:00
Jeremie Fraeys
8b95f2b5d2
ci: use fast Go setup that skips download if already installed
Some checks failed
CI with Native Libraries / Build and Test Native Libraries (push) Blocked by required conditions
CI with Native Libraries / Build Release Libraries (push) Blocked by required conditions
Checkout test / test (push) Successful in 5s
CI with Native Libraries / Check Build Environment (push) Successful in 11s
Documentation / build-and-publish (push) Has been cancelled
Test / test (push) Successful in 1s
2026-02-12 13:39:14 -05:00
Jeremie Fraeys
b4f2b3e785
ci: auto-install cmake and build dependencies
Some checks failed
CI with Native Libraries / Build and Test Native Libraries (push) Blocked by required conditions
CI with Native Libraries / Build Release Libraries (push) Blocked by required conditions
Test / test (push) Waiting to run
Checkout test / test (push) Successful in 5s
CI with Native Libraries / Check Build Environment (push) Successful in 12s
Documentation / build-and-publish (push) Has been cancelled
2026-02-12 13:35:48 -05:00
Jeremie Fraeys
06b388d692
ci: add timeouts and pre-flight checks to native CI
Some checks failed
Test / test (push) Waiting to run
Checkout test / test (push) Successful in 4s
CI with Native Libraries / Check Build Environment (push) Failing after 1s
CI with Native Libraries / Build and Test Native Libraries (push) Has been skipped
CI with Native Libraries / Build Release Libraries (push) Has been skipped
Documentation / build-and-publish (push) Has been cancelled
2026-02-12 13:32:59 -05:00
Jeremie Fraeys
d408a60eb1
ci: push all workflow updates
Some checks failed
Documentation / build-and-publish (push) Waiting to run
Test / test (push) Waiting to run
Checkout test / test (push) Successful in 5s
CI with Native Libraries / test-native (push) Has been cancelled
CI with Native Libraries / build-release (push) Has been cancelled
2026-02-12 13:28:15 -05:00
Jeremie Fraeys
06690230e2
ci: add native library CI workflow
Some checks failed
Checkout test / test (push) Waiting to run
Documentation / build-and-publish (push) Waiting to run
CI with Native Libraries / test-native (push) Failing after 5m35s
CI with Native Libraries / build-release (push) Has been skipped
Test / test (push) Successful in 1s
2026-02-12 13:21:37 -05:00
Jeremie Fraeys
d3f1a1841a
ci: fix invalid secrets syntax in job-level if
Some checks are pending
Checkout test / test (push) Waiting to run
Documentation / build-and-publish (push) Waiting to run
Test / test (push) Successful in 1s
2026-02-12 13:18:59 -05:00
Jeremie Fraeys
19853dbdf6
ci: add test workflow and ignore Instruments traces
Some checks failed
Test / test (push) Successful in 1s
Checkout test / test (push) Has been cancelled
Documentation / build-and-publish (push) Has been cancelled
2026-02-12 13:13:24 -05:00
Jeremie Fraeys
9a9c411bc9
ci: trigger workflows after runner updates
Some checks are pending
Checkout test / test (push) Waiting to run
Documentation / build-and-publish (push) Waiting to run
2026-02-12 13:11:27 -05:00
Jeremie Fraeys
75faba70fa
ci: trigger workflows after runner updates
Some checks failed
Documentation / build-and-publish (push) Has been cancelled
Checkout test / test (push) Has been cancelled
2026-02-12 13:08:53 -05:00
Jeremie Fraeys
2854d3df95
chore(cleanup): remove legacy artifacts and add tooling configs
- Remove .github/ directory (migrated to .forgejo/)
- Remove .local-artifacts/ benchmark results
- Add AGENTS.md for coding assistants
- Add .windsurf/rules/ for development guidelines
- Update .gitignore
Some checks failed
Documentation / build-and-publish (push) Has been cancelled
Checkout test / test (push) Has been cancelled
2026-02-12 12:06:09 -05:00
Jeremie Fraeys
1dcc1e11d5
chore(build): update build system, scripts, and additional tests
- Update Makefile with native build targets (preparing for C++)
- Add profiler and performance regression detector commands
- Update CI/testing scripts
- Add additional unit tests for API, jupyter, queue, manifest
2026-02-12 12:05:55 -05:00
Jeremie Fraeys
2209ae24c6
chore(config): update configurations and deployment scripts
- Update API server and worker config schemas
- Refine Docker Compose configurations (dev/prod)
- Update deployment scripts and documentation
2026-02-12 12:05:37 -05:00
Jeremie Fraeys
5144d291cb
docs: comprehensive documentation updates
- Add architecture, CI/CD, CLI reference documentation
- Update installation, operations, and quick-start guides
- Add Jupyter workflow and queue documentation
- New landing page and research runner plan
2026-02-12 12:05:27 -05:00
Jeremie Fraeys
2e701340e5
feat(core): API, worker, queue, and manifest improvements
- Add protocol buffer optimizations (internal/api/protocol.go)
- Add filesystem queue backend (internal/queue/filesystem_queue.go)
- Add run manifest support (internal/manifest/run_manifest.go)
- Worker and jupyter task refinements
- Exported test wrappers for benchmarking
2026-02-12 12:05:17 -05:00
Jeremie Fraeys
8e3fa94322
feat(cli): enhance Zig CLI with new commands and improved networking
- Add new commands: annotate, narrative, requeue
- Refactor WebSocket client into modular components (net/ws/)
- Add rsync embedded binary support
- Improve error handling and response packet processing
- Update build.zig and completions
2026-02-12 12:05:10 -05:00
Jeremie Fraeys
df5d872021
ci: migrate from GitHub to Forgejo/Gitea
- Add Forgejo workflow files (.forgejo/workflows/)
- Add Gitea templates (.gitea/ISSUE_TEMPLATE/, .gitea/PULL_REQUEST_TEMPLATE.md)
- Remove legacy .github/ workflows and templates
2026-02-12 12:05:00 -05:00
Jeremie Fraeys
72b4b29ecd
perf: add profiling benchmarks and parallel Go baseline for C++ optimization
Add comprehensive benchmarking suite for C++ optimization targets:
- tests/benchmarks/dataset_hash_bench_test.go - dirOverallSHA256Hex profiling
- tests/benchmarks/queue_bench_test.go - filesystem queue profiling
- tests/benchmarks/artifact_and_snapshot_bench_test.go - scanArtifacts/extractTarGz profiling
- tests/unit/worker/artifacts_test.go - moved from internal/ for clean separation

Add parallel Go implementation as baseline for C++ comparison:
- internal/worker/data_integrity.go: dirOverallSHA256HexParallel() with worker pool
- Benchmarks show 2.1x speedup (3.97ms -> 1.90ms) vs sequential

Exported wrappers for testing:
- ScanArtifacts() - artifact scanning
- ExtractTarGz() - tar.gz extraction
- DirOverallSHA256HexParallel() - parallel hashing

Profiling results (Apple M2 Ultra):
- dirOverallSHA256Hex: 78% syscall overhead (target for mmap C++)
- rebuildIndex: 96% syscall overhead (target for binary index C++)
- scanArtifacts: 87% syscall overhead (target for fast traversal C++)
- extractTarGz: 95% syscall overhead (target for parallel gzip C++)

Related: C++ optimization strategy in memory 5d5f0bb6
2026-02-12 12:04:02 -05:00
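Note on 72b4b29ecd: the commit mentions a parallel directory hash built on a worker pool. The sketch below shows the general technique under stated assumptions: per-file SHA-256 digests are computed by a fixed pool of goroutines, then folded together with relative paths in walk order so the result does not depend on scheduling. The function name, the combining scheme, and the `demo` helper are hypothetical and are not the repository's `dirOverallSHA256HexParallel()`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"sync"
)

// hashFile returns the SHA-256 digest of one file's contents.
func hashFile(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return nil, err
	}
	return h.Sum(nil), nil
}

// dirSHA256HexParallel hashes every regular file under root with a pool
// of workers, then combines path and digest pairs in lexical walk order.
func dirSHA256HexParallel(root string, workers int) (string, error) {
	var paths []string
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err == nil && d.Type().IsRegular() {
			paths = append(paths, p)
		}
		return err
	})
	if err != nil {
		return "", err
	}
	if workers < 1 {
		workers = 1
	}
	sums := make([][]byte, len(paths))
	errs := make([]error, len(paths))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs { // each index is written by one worker only
				sums[i], errs[i] = hashFile(paths[i])
			}
		}()
	}
	for i := range paths {
		jobs <- i
	}
	close(jobs)
	wg.Wait()

	overall := sha256.New()
	for i, p := range paths {
		if errs[i] != nil {
			return "", errs[i]
		}
		rel, err := filepath.Rel(root, p)
		if err != nil {
			return "", err
		}
		overall.Write([]byte(filepath.ToSlash(rel)))
		overall.Write([]byte{0}) // separator between path and digest
		overall.Write(sums[i])
	}
	return hex.EncodeToString(overall.Sum(nil)), nil
}

// demo builds a small temporary tree and hashes it with 1 and 4 workers.
func demo() (seq, par string, err error) {
	dir, err := os.MkdirTemp("", "hashdemo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)
	for i := 0; i < 5; i++ {
		name := filepath.Join(dir, fmt.Sprintf("f%d.txt", i))
		if err = os.WriteFile(name, []byte(fmt.Sprint(i)), 0o644); err != nil {
			return "", "", err
		}
	}
	if seq, err = dirSHA256HexParallel(dir, 1); err != nil {
		return "", "", err
	}
	par, err = dirSHA256HexParallel(dir, 4)
	return seq, par, err
}

func main() {
	seq, par, err := demo()
	fmt.Println(seq == par, len(seq), err)
}
```

Writing results into a pre-sized slice by index avoids a mutex and keeps the final digest independent of which worker finishes first.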
Jeremie Fraeys
eba4b4f766
testing checkout on docker
All checks were successful
Checkout test / test (push) Successful in 36s
2026-01-19 14:51:38 -05:00
Jeremie Fraeys
9832fc6f1d
testing checkout on docker
Some checks failed
Checkout test / test (push) Failing after 17s
2026-01-19 14:48:15 -05:00