Commit graph

7 commits

Author SHA1 Message Date
Jeremie Fraeys
17170667e2
feat(worker): improve lifecycle management and vLLM plugin
Lifecycle improvements:
- runloop.go: refined state machine with better error recovery
- service_manager.go: service dependency management and health checks
- states.go: add states for capability advertisement and draining

Container execution:
- container.go: improved OCI runtime integration with supply chain checks
- Add image verification and signature validation
- Better resource limits enforcement for GPU/memory

vLLM plugin updates:
- vllm.go: support for vLLM 0.3+ with new engine arguments
- Add quantization-aware scheduling (AWQ, GPTQ, FP8)
- Improve model download and caching logic

Configuration:
- config.go: add capability advertisement configuration
- snapshot_store.go: improve snapshot management for checkpointing
2026-03-12 12:05:02 -04:00
Jeremie Fraeys
0b5e99f720
refactor(scheduler,worker): improve service management and GPU detection
Scheduler enhancements:
- auth.go: Group membership validation in authentication
- hub.go: Task distribution with group affinity
- port_allocator.go: Dynamic port allocation with conflict resolution
- scheduler_conn.go: Connection pooling and retry logic
- service_manager.go: Lifecycle management for scheduler services
- service_templates.go: Template-based service configuration
- state.go: Persistent state management with recovery

Worker improvements:
- config.go: Extended configuration for task visibility rules
- execution/setup.go: Sandboxed execution environment setup
- executor/container.go: Container runtime integration
- executor/runner.go: Task runner with visibility enforcement
- gpu_detector.go: Robust GPU detection (NVIDIA, AMD, Apple Silicon, CPU fallback)
- integrity/validate.go: Data integrity validation
- lifecycle/runloop.go: Improved runloop with graceful shutdown
- lifecycle/service_manager.go: Service lifecycle coordination
- process/isolation.go + isolation_unix.go: Process isolation with namespaces/cgroups
- tenant/manager.go: Multi-tenant resource isolation
- tenant/middleware.go: Tenant context propagation
- worker.go: Core worker with group-scoped task execution
2026-03-08 13:03:15 -04:00
Jeremie Fraeys
61081655d2
feat: enhance worker execution and scheduler service templates
- Refactor worker configuration management
- Improve container executor lifecycle handling
- Update runloop and worker core logic
- Enhance scheduler service template generation
- Remove obsolete 'scheduler' symlink/directory
2026-03-04 13:24:20 -05:00
Jeremie Fraeys
95adcba437
feat(worker): add Jupyter/vLLM plugins and process isolation
Extend worker capabilities with new execution plugins and security features:
- Jupyter plugin for notebook-based ML experiments
- vLLM plugin for LLM inference workloads
- Cross-platform process isolation (Unix/Windows)
- Network policy enforcement with platform-specific implementations
- Service manager integration for lifecycle management
- Scheduler backend integration for queue coordination

Update lifecycle management:
- Enhanced runloop with state transitions
- Service manager integration for plugin coordination
- Improved state persistence and recovery

Add test coverage:
- Unit tests for Jupyter and vLLM plugins
- Updated worker execution tests
2026-02-26 12:03:59 -05:00
Jeremie Fraeys
23e5f3d1dc
refactor(api): internal refactoring for TUI and worker modules
- Refactor internal/worker and internal/queue packages
- Update cmd/tui for monitoring interface
- Update test configurations
2026-02-20 15:51:23 -05:00
Jeremie Fraeys
a7360869f8
refactor: Implement TaskExecutorAdapter and Worker.runningCount()
- Created executor/TaskExecutorAdapter implementing lifecycle.TaskExecutor
- Properly wires LocalExecutor and ContainerExecutor through adapter
- Worker.runningCount() now delegates to runLoop.RunningCount()
- Added lifecycle.RunLoop.RunningCount() public method
- factory.go creates proper executor chain instead of placeholder

Build status: Compiles successfully
2026-02-17 16:15:41 -05:00
Jeremie Fraeys
062b78cbe0
refactor: Phase 4 - Extract lifecycle types and interfaces
Created lifecycle package with foundational types for future extraction:

1. internal/worker/lifecycle/runloop.go (117 lines)
   - TaskExecutor interface for task execution contract
   - RunLoopConfig for run loop configuration
   - RunLoop type with core orchestration logic
   - MetricsRecorder and Logger interfaces for dependencies
   - Start(), Stop() methods for loop control
   - executeTask() method for task lifecycle management

2. internal/worker/lifecycle/health.go (52 lines)
   - HealthMonitor type for health tracking
   - RecordHeartbeat(), IsHealthy(), MarkUnhealthy() methods
   - Heartbeater interface for heartbeat operations
   - HeartbeatLoop() function for background heartbeats

Note: These are interface/type foundations for Phase 5.
The actual Worker struct methods remain in runloop.go until
Phase 5 when they'll migrate to use these abstractions.

Build status: Compiles successfully
2026-02-17 14:22:58 -05:00