fetch_ml/internal
Jeremie Fraeys ba9a358412
fix(scheduler): resolve TestEndToEndJobLifecycle race and getTask bug
## Problem
TestEndToEndJobLifecycle was failing for two reasons:
1. Race condition: workers signaled ready before the job had been processed,
   so they received MsgNoWork instead of MsgJobAssign
2. getTask() didn't check pendingAcceptance, so it returned nil for tasks
   that were assigned but not yet accepted

## Changes

### Test Fix (restart_recovery_test.go)
- Replace the single-shot select with a retry loop that re-signals workers as ready
- Handle both assignment and non-assignment messages correctly
- Sleep 10ms after each non-assignment message to give the job time to be processed
- Bound the loop with a 2-second deadline, waiting up to 100ms per attempt

### Scheduler Fix (hub.go)
- Extend getTask() to check the pendingAcceptance map after the batch/service queues
- Allow GetTask() to find tasks that are in the 'assigned' state but not yet accepted
- Maintain backward compatibility with the existing queue/running lookups
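
As an illustration of the lookup order, the sketch below checks the queues and running tasks first and falls back to pendingAcceptance last. It assumes a simplified hub: the `Hub` and `Task` types, their field names other than pendingAcceptance, the `findInQueue` helper, and the position of the running lookup relative to the new check are all assumptions, not the actual contents of hub.go.

```go
package main

import (
	"fmt"
	"sync"
)

// Task and Hub are hypothetical, simplified versions of the scheduler types.
type Task struct {
	ID     string
	Status string
}

type Hub struct {
	mu                sync.Mutex
	batchQueue        []*Task
	serviceQueue      []*Task
	running           map[string]*Task
	pendingAcceptance map[string]*Task // assigned to a worker, not yet accepted
}

// findInQueue returns the task with the given ID, or nil if it is not queued.
func findInQueue(queue []*Task, id string) *Task {
	for _, t := range queue {
		if t.ID == id {
			return t
		}
	}
	return nil
}

// getTask keeps the existing queue/running lookups and adds a final
// fallback to pendingAcceptance, so a task in the 'assigned' state is found.
func (h *Hub) getTask(id string) *Task {
	h.mu.Lock()
	defer h.mu.Unlock()

	if t := findInQueue(h.batchQueue, id); t != nil {
		return t
	}
	if t := findInQueue(h.serviceQueue, id); t != nil {
		return t
	}
	if t, ok := h.running[id]; ok {
		return t
	}
	// New fallback: without it, an assigned-but-unaccepted task yields nil.
	if t, ok := h.pendingAcceptance[id]; ok {
		return t
	}
	return nil
}

func main() {
	h := &Hub{
		running:           map[string]*Task{},
		pendingAcceptance: map[string]*Task{"t1": {ID: "t1", Status: "assigned"}},
	}
	fmt.Println(h.getTask("t1") != nil, h.getTask("missing") == nil)
}
```

Because the new check runs only after the earlier lookups miss, callers that relied on the queue or running results see no change in behavior.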

## Testing
make test now passes: 475 passed, 0 failed, 34 skipped
2026-03-05 14:40:43 -05:00

| Name | Last commit | Date |
|------|-------------|------|
| api | feat(cli,server): unify info command with remote/local support | 2026-03-05 12:07:00 -05:00 |
| audit | security: improve audit, crypto, and config handling | 2026-03-04 13:23:42 -05:00 |
| auth | refactor(auth): add tenant scoping and permission enhancements | 2026-02-26 12:06:08 -05:00 |
| config | security: improve audit, crypto, and config handling | 2026-03-04 13:23:42 -05:00 |
| container | refactor(jupyter): enhance security and scheduler integration | 2026-02-26 12:06:35 -05:00 |
| crypto | security: improve audit, crypto, and config handling | 2026-03-04 13:23:42 -05:00 |
| domain | refactor: misc improvements across codebase | 2026-03-05 10:58:22 -05:00 |
| envpool | refactor(utilities): update supporting modules for scheduler integration | 2026-02-26 12:07:15 -05:00 |
| errtypes | refactor(utilities): update supporting modules for scheduler integration | 2026-02-26 12:07:15 -05:00 |
| experiment | refactor(jupyter): enhance security and scheduler integration | 2026-02-26 12:06:35 -05:00 |
| fileutil | refactor(utilities): update supporting modules for scheduler integration | 2026-02-26 12:07:15 -05:00 |
| jupyter | refactor(jupyter): enhance security and scheduler integration | 2026-02-26 12:06:35 -05:00 |
| logging | refactor(utilities): update supporting modules for scheduler integration | 2026-02-26 12:07:15 -05:00 |
| manifest | feat: enhance task domain and scheduler protocol | 2026-03-04 13:23:38 -05:00 |
| metrics | refactor: Phase 6 - Complete migration, remove legacy files | 2026-02-17 14:39:48 -05:00 |
| middleware | fix: resolve TODOs and standardize tests | 2026-02-19 15:34:59 -05:00 |
| network | refactor(utilities): update supporting modules for scheduler integration | 2026-02-26 12:07:15 -05:00 |
| privacy | refactor(utilities): update supporting modules for scheduler integration | 2026-02-26 12:07:15 -05:00 |
| prommetrics | feat(api): refactor websocket handlers; add health and prometheus middleware | 2026-01-05 12:31:07 -05:00 |
| queue | refactor(queue): integrate scheduler backend and storage improvements | 2026-02-26 12:06:46 -05:00 |
| resources | refactor(utilities): update supporting modules for scheduler integration | 2026-02-26 12:07:15 -05:00 |
| scheduler | fix(scheduler): resolve TestEndToEndJobLifecycle race and getTask bug | 2026-03-05 14:40:43 -05:00 |
| security | refactor(utilities): update supporting modules for scheduler integration | 2026-02-26 12:07:15 -05:00 |
| storage | refactor(queue): integrate scheduler backend and storage improvements | 2026-02-26 12:06:46 -05:00 |
| telemetry | Fix multi-user authentication and clean up debug code | 2025-12-06 12:35:32 -05:00 |
| tracking | refactor(utilities): update supporting modules for scheduler integration | 2026-02-26 12:07:15 -05:00 |
| validation | feat: add security monitoring and validation framework | 2026-02-19 15:34:25 -05:00 |
| worker | feat: enhance worker execution and scheduler service templates | 2026-03-04 13:24:20 -05:00 |