Commit graph

3 commits

Author SHA1 Message Date
Jeremie Fraeys
0b5e99f720
refactor(scheduler,worker): improve service management and GPU detection
Scheduler enhancements:
- auth.go: Group membership validation in authentication
- hub.go: Task distribution with group affinity
- port_allocator.go: Dynamic port allocation with conflict resolution
- scheduler_conn.go: Connection pooling and retry logic
- service_manager.go: Lifecycle management for scheduler services
- service_templates.go: Template-based service configuration
- state.go: Persistent state management with recovery

Worker improvements:
- config.go: Extended configuration for task visibility rules
- execution/setup.go: Sandboxed execution environment setup
- executor/container.go: Container runtime integration
- executor/runner.go: Task runner with visibility enforcement
- gpu_detector.go: Robust GPU detection (NVIDIA, AMD, Apple Silicon, CPU fallback)
- integrity/validate.go: Data integrity validation
- lifecycle/runloop.go: Improved runloop with graceful shutdown
- lifecycle/service_manager.go: Service lifecycle coordination
- process/isolation.go + isolation_unix.go: Process isolation with namespaces/cgroups
- tenant/manager.go: Multi-tenant resource isolation
- tenant/middleware.go: Tenant context propagation
- worker.go: Core worker with group-scoped task execution
2026-03-08 13:03:15 -04:00
Jeremie Fraeys
8271277dc3
feat: implement research-grade maintainability phases 2, 5, 8, 10
Phase 2: Deterministic Manifests
- Add manifest.Validator with required field checking
- Support Validate() and ValidateStrict() modes
- Integrate validation into worker executor before execution
- Block execution if manifest missing commit_id or deps_manifest_sha256

Phase 5: Pinned Dependencies
- Add hermetic.dockerfile template with pinned system deps
- Frozen package versions: libblas3, libcudnn8, etc.
- Support for deps_manifest.json and requirements.txt with hashes
- Image tagging strategy: deps-<first-8-of-sha256>

Phase 8: Tests as Specifications
- Add queue_spec_test.go with executable scheduler specs
- Document priority ordering (higher first)
- Document FIFO tiebreaker for same priority
- Test cases for negative/zero priorities

Phase 10: Local Dev Parity
- Create root-level docker-compose.dev.yml
- Simplified from deployments/ for quick local dev
- Redis + API server + Worker with hot reload volumes
- Debug ports: 9101 (API), 6379 (Redis)
2026-02-18 15:34:28 -05:00
Jeremie Fraeys
22f3d66f1d
refactor: Phase 2 - Extract executor implementations
Created executor package with extracted job execution logic:

1. internal/worker/executor/local.go (104 lines)
   - LocalExecutor implements JobExecutor interface
   - Execute() method for local bash script execution
   - generateScript() helper for creating experiment scripts

2. internal/worker/executor/container.go (229 lines)
   - ContainerExecutor implements JobExecutor interface
   - Execute() method for podman container execution
   - EnvironmentPool interface for image caching
   - Tracking tool provisioning (MLflow, TensorBoard, Wandb)
   - Volume and cache setup
   - selectDependencyManifest() helper

3. internal/worker/executor/runner.go (131 lines)
   - JobRunner orchestrates execution
   - ExecutionMode enum (Auto, Local, Container)
   - Run() method with directory setup and executor selection
   - finalize() for success/failure handling

Key design decisions:
- Executors depend on interfaces (ManifestWriter, not Worker)
- JobRunner composes both executors
- No direct Worker dependencies in executor package
- SetupJobDirectories reused from execution package

Build status: Compiles successfully
2026-02-17 14:14:04 -05:00