Jeremie Fraeys
a4f2c36069
feat: enhance task domain and scheduler protocol
- Update task domain model
- Improve scheduler hub and priority queue
- Enhance protocol definitions
- Update manifest schema and run handling
2026-03-04 13:23:38 -05:00
Jeremie Fraeys
6866ba9366
refactor(queue): integrate scheduler backend and storage improvements
Update queue and storage systems for scheduler integration:
- Queue backend with scheduler coordination
- Filesystem queue with batch operations
- Deduplication with tenant-aware keys
- Storage layer with audit logging hooks
- Domain models (Task, Events, Errors) with scheduler fields
- Database layer with tenant isolation
- Dataset storage with integrity checks
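
A minimal sketch of what a tenant-aware deduplication key could look like, assuming the key is a hash over the tenant ID and the task payload. The function name `dedupKey`, its parameters, and the choice of SHA-256 are assumptions for illustration; the commit message does not describe the actual key format.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// dedupKey is a hypothetical helper: it scopes a deduplication key to a
// tenant so that identical payloads from different tenants do not collide.
// The NUL separator keeps ("ab", "c") distinct from ("a", "bc"), assuming
// tenant IDs never contain a NUL byte.
func dedupKey(tenantID, payload string) string {
	h := sha256.New()
	h.Write([]byte(tenantID))
	h.Write([]byte{0})
	h.Write([]byte(payload))
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := dedupKey("tenant-a", "train --epochs 10")
	b := dedupKey("tenant-b", "train --epochs 10")
	fmt.Println(a == b) // same payload, different tenants
}
```

Hashing the tenant ID into the key, rather than keeping one dedup table per tenant, keeps the lookup path identical for every tenant.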
2026-02-26 12:06:46 -05:00
Jeremie Fraeys
23e5f3d1dc
refactor(api): internal refactoring for TUI and worker modules
- Refactor internal/worker and internal/queue packages
- Update cmd/tui for monitoring interface
- Update test configurations
2026-02-20 15:51:23 -05:00
Jeremie Fraeys
02811c0ffe
fix: resolve TODOs and standardize tests
- Fix lint warning for duplicate check in security_test.go
- Mark SHA256 tests as Legacy for backward compatibility
- Convert TODO comments to documentation (task, handlers, privacy)
- Update user_manager_test to use GenerateAPIKey pattern
2026-02-19 15:34:59 -05:00
Jeremie Fraeys
7194826871
feat: implement research-grade maintainability phases 1, 3, 4, 7
Phase 1: Event Sourcing
- Add TaskEvent types (queued, started, completed, failed, etc.)
- Create EventStore with Redis Streams (append-only)
- Support event querying by task ID and time range
Phase 3: Diagnosable Failures
- Enhance TaskExecutionError with Context map, Timestamp, Recoverable flag
- Update container.go to populate error context (image, GPU, duration)
- Add WithContext helper for building error context
- Create cmd/errors CLI for querying task errors
Phase 4: Testable Security
- Add security fields to PodmanConfig (Privileged, Network, ReadOnlyMounts)
- Create ValidateSecurityPolicy() with ErrSecurityViolation
- Add security contract tests (privileged rejection, host network rejection)
- Tests serve as executable security documentation
Phase 7: Reproducible Builds
- Add BuildHash and BuildTime ldflags to Makefile
- Create verify-build target for reproducibility testing
- Add -version and -verify flags to api-server
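
For context, build metadata is usually injected in Go through the linker's `-X` flag, which overwrites package-level string variables at link time. The sketch below assumes the variables live in package `main` and that `-version` simply prints them; the `versionString` helper and the output format are hypothetical, and the `-verify` flag is omitted because the commit does not describe its behavior.

```go
package main

import (
	"flag"
	"fmt"
)

// These defaults are replaced at link time, for example with:
//   go build -ldflags "-X main.BuildHash=<commit> -X main.BuildTime=<timestamp>"
var (
	BuildHash = "unknown"
	BuildTime = "unknown"
)

// versionString is a hypothetical helper that formats the build metadata.
func versionString() string {
	return fmt.Sprintf("build %s (%s)", BuildHash, BuildTime)
}

func main() {
	showVersion := flag.Bool("version", false, "print build information and exit")
	flag.Parse()

	if *showVersion {
		fmt.Println(versionString())
		return
	}
	fmt.Println("server would start here")
}
```

For two builds of the same commit to match byte for byte, the injected timestamp has to be derived from the commit rather than from the wall clock.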
All tests pass:
- go test ./internal/errtypes/...
- go test ./internal/container/... -run Security
- go test ./internal/queue/...
- go build ./cmd/api-server/...
2026-02-18 15:27:50 -05:00
Jeremie Fraeys
6580917ba8
refactor: extract domain types and consolidate error system (Phases 1-2)
Phase 1: Extract Domain Types
=============================
- Create internal/domain/ package with canonical types:
  - domain/task.go: Task, Attempt structs
  - domain/tracking.go: TrackingConfig and MLflow/TensorBoard/Wandb configs
  - domain/dataset.go: DatasetSpec
  - domain/status.go: JobStatus constants
  - domain/errors.go: FailureClass system with classification functions
  - domain/doc.go: package documentation
- Update queue/task.go to re-export domain types (backward compatibility)
- Update TUI model/state.go to use domain types via type aliases
- Simplify TUI services: remove ~60 lines of conversion functions
Phase 2: Delete ErrorCategory System
====================================
- Remove deprecated ErrorCategory type and constants
- Remove TaskError struct and related functions
- Remove mapping functions: ClassifyError, IsRetryable, GetUserMessage, RetryDelay
- Update all queue implementations to use domain.FailureClass directly:
  - queue/metrics.go: RecordTaskFailure/Retry now take FailureClass
  - queue/queue.go: RetryTask uses domain.ClassifyFailure
  - queue/filesystem_queue.go: RetryTask and MoveToDeadLetterQueue updated
  - queue/sqlite_queue.go: RetryTask and MoveToDeadLetterQueue updated
Lines eliminated: ~190 lines of conversion and mapping code
Result: Single source of truth for domain types and error classification
2026-02-17 12:34:28 -05:00