Jeremie Fraeys
3187ff26ea
refactor: complete maintainability phases 1-9 and fix all tests
...
Test fixes (all 41 test packages now pass):
- Fix ComputeTaskProvenance - add dataset_specs JSON output
- Fix EnforceTaskProvenance - populate all metadata fields in best-effort mode
- Fix PrewarmNextOnce - preserve prewarm state when queue empty
- Fix RunManifest directory creation in SetupJobDirectories
- Add ManifestWriter to test worker (simpleManifestWriter)
- Fix worker ID mismatch (use cfg.WorkerID)
- Fix WebSocket binary protocol responses
- Implement all WebSocket handlers: QueueJob, QueueJobWithSnapshot, StatusRequest,
CancelJob, Prune, ValidateRequest (with run manifest validation), LogMetric,
GetExperiment, DatasetList/Register/Info/Search
Maintainability phases completed:
- Phases 1-6: Domain types, error system, config boundaries, worker/API/queue splits
- Phase 7: TUI cleanup - reorganize model package (jobs.go, messages.go, styles.go, keys.go)
- Phase 8: MLServer unification - consolidate worker + TUI into internal/network/mlserver.go
- Phase 9: CI enforcement - add scripts/ci-checks.sh with 5 checks:
* No internal/ -> cmd/ imports
* domain/ has zero internal imports
* File size limit (500 lines, rigid)
* No circular imports
* Package naming conventions
Documentation:
- Add docs/src/file-naming-conventions.md
- Add make ci-checks target
Lines changed: +756/-36 (WebSocket fixes), +518/-320 (TUI), +263/-20 (Phase 8-9)
2026-02-17 20:32:14 -05:00
Jeremie Fraeys
fb2bbbaae5
refactor: Phase 7 - TUI cleanup - reorganize model package
...
Phase 7 of the monorepo maintainability plan:
New files created:
- model/jobs.go - Job type, JobStatus constants, list.Item interface
- model/messages.go - tea.Msg types (JobsLoadedMsg, StatusMsg, TickMsg, etc.)
- model/styles.go - NewJobListDelegate(), JobListTitleStyle(), SpinnerStyle()
- model/keys.go - KeyMap struct, DefaultKeys() function
Modified files:
- model/state.go - reduced from 226 to ~130 lines
- Removed: Job, JobStatus, KeyMap, Keys, inline styles
- Kept: State struct, domain re-exports, ViewMode, DatasetInfo, InitialState()
- controller/commands.go - use model. prefix for message types
- controller/controller.go - use model. prefix for message types
- controller/settings.go - use model.SettingsContentMsg
Deleted files:
- controller/keys.go (moved to model/keys.go since State references KeyMap)
Result:
- No file >150 lines in model/ package
- Single concern per file: state, jobs, messages, styles, keys
- All 41 test packages pass
2026-02-17 20:22:04 -05:00
Jeremie Fraeys
6580917ba8
refactor: extract domain types and consolidate error system (Phases 1-2)
...
Phase 1: Extract Domain Types
=============================
- Create internal/domain/ package with canonical types:
- domain/task.go: Task, Attempt structs
- domain/tracking.go: TrackingConfig and MLflow/TensorBoard/Wandb configs
- domain/dataset.go: DatasetSpec
- domain/status.go: JobStatus constants
- domain/errors.go: FailureClass system with classification functions
- domain/doc.go: package documentation
- Update queue/task.go to re-export domain types (backward compatibility)
- Update TUI model/state.go to use domain types via type aliases
- Simplify TUI services: remove ~60 lines of conversion functions
Phase 2: Delete ErrorCategory System
====================================
- Remove deprecated ErrorCategory type and constants
- Remove TaskError struct and related functions
- Remove mapping functions: ClassifyError, IsRetryable, GetUserMessage, RetryDelay
- Update all queue implementations to use domain.FailureClass directly:
- queue/metrics.go: RecordTaskFailure/Retry now take FailureClass
- queue/queue.go: RetryTask uses domain.ClassifyFailure
- queue/filesystem_queue.go: RetryTask and MoveToDeadLetterQueue updated
- queue/sqlite_queue.go: RetryTask and MoveToDeadLetterQueue updated
Lines eliminated: ~190 lines of conversion and mapping code
Result: Single source of truth for domain types and error classification
2026-02-17 12:34:28 -05:00
Jeremie Fraeys
1dcc1e11d5
chore(build): update build system, scripts, and additional tests
...
- Update Makefile with native build targets (preparing for C++)
- Add profiler and performance regression detector commands
- Update CI/testing scripts
- Add additional unit tests for API, jupyter, queue, manifest
2026-02-12 12:05:55 -05:00
Jeremie Fraeys
94112f0af5
ci: align workflows, build scripts, and docs with current architecture
2026-01-05 12:34:23 -05:00
Jeremie Fraeys
5ef24e4c6d
feat(cli): add validate/info commands and improve protocol handling
2026-01-05 12:31:20 -05:00
Jeremie Fraeys
82034c68f3
feat(worker): add integrity checks, snapshot staging, and prewarm support
2026-01-05 12:31:13 -05:00
Jeremie Fraeys
add4a90e62
feat(api): refactor websocket handlers; add health and prometheus middleware
2026-01-05 12:31:07 -05:00
Jeremie Fraeys
cd5640ebd2
Slim and secure: move scripts, clean configs, remove secrets
...
- Move ci-test.sh and setup.sh to scripts/
- Trim docs/src/zig-cli.md to current structure
- Replace hardcoded secrets with placeholders in configs
- Update .gitignore to block .env*, secrets/, keys, build artifacts
- Slim README.md to reflect current CLI/TUI split
- Add cleanup trap to ci-test.sh
- Ensure no secrets are committed
2025-12-07 13:57:51 -05:00
Jeremie Fraeys
ea15af1833
Fix multi-user authentication and clean up debug code
...
- Fix YAML tags in auth config struct (json -> yaml)
- Update CLI configs to use pre-hashed API keys
- Remove double hashing in WebSocket client
- Fix port mapping (9102 -> 9103) in CLI commands
- Update permission keys to use jobs:read, jobs:create, etc.
- Clean up all debug logging from CLI and server
- All user roles now authenticate correctly:
* Admin: Can queue jobs and see all jobs
* Researcher: Can queue jobs and see own jobs
* Analyst: Can see status (read-only access)
Multi-user authentication is now fully functional.
2025-12-06 12:35:32 -05:00
Jeremie Fraeys
803677be57
feat: implement Go backend with comprehensive API and internal packages
...
- Add API server with WebSocket support and REST endpoints
- Implement authentication system with API keys and permissions
- Add task queue system with Redis backend and error handling
- Include storage layer with database migrations and schemas
- Add comprehensive logging, metrics, and telemetry
- Implement security middleware and network utilities
- Add experiment management and container orchestration
- Include configuration management with smart defaults
2025-12-04 16:53:53 -05:00