Major WebSocket handler refactor:
- Rewrite ws/handler.go with structured message routing and backpressure
- Add connection lifecycle management with heartbeats and timeouts
- Implement graceful connection draining for zero-downtime restarts
Protocol improvements:
- Define structured protocol types in protocol.go for hub communication
- Add versioned message envelopes for backward compatibility
- Standardize error codes and response formats across WebSocket API
Job streaming via WebSocket:
- Simplify ws/jobs.go with async job status streaming
- Add compression for high-volume job updates
Testing:
- Update websocket_e2e_test.go for new protocol semantics
- Add connection resilience tests
Add comprehensive authentication and authorization enhancements:
- tokens.go: New token management system for public task access and cloning
* SHA-256 hashed token storage for security
* Token generation, validation, and automatic cleanup
* Support for public access and clone permissions
- api_key.go: Extend User struct with Groups field
* Lab group membership (ml-lab, nlp-group)
* Integration with permission system for group-based access
- flags.go: Security hardening - migrate to structured logging
* Replace log.Printf with log/slog to prevent log injection attacks
* Consistent structured output for all auth warnings
* Safe handling of file paths and errors in logs
- permissions.go: Add task sharing permission constants
* PermissionTasksReadOwn: Access own tasks
* PermissionTasksReadLab: Access lab group tasks
* PermissionTasksReadAll: Admin/institution-wide access
* PermissionTasksShare: Grant access to other users
* PermissionTasksClone: Create copies of shared tasks
* CanAccessTask() method with visibility checks
- database.go: Improve error handling
* Add structured error logging on row close failures
Enhance ml info to query server when connected, falling back to local
manifests when offline. Unifies behavior with other commands like run,
exec, and cancel.
CLI changes:
- Add --local and --remote flags for explicit control
- Auto-detect connection state via mode.detect()
- queryRemoteRun(): Query server via WebSocket for run details
- queryLocalRun(): Read local run_manifest.json
- displayRunInfo(): Shared display logic for both sources
- Add connection status indicators (Remote: connecting.../connected)
WebSocket protocol:
- Add query_run_info opcode (0x28) to cli and server
- Add sendQueryRunInfo() method to ws/client.zig
- Protocol: [opcode:1][api_key_hash:16][run_id_len:1][run_id:var]
Server changes:
- Add handleQueryRunInfo() handler to ws/handler.go
- Returns run_id, job_name, user, timestamp, overall_sha, files_count
- Checks PermJobsRead permission
- Looks up run in experiment manager
Usage:
ml info abc123 # Auto: tries remote, falls back to local
ml info abc123 --local # Force local manifest lookup
ml info abc123 --remote # Force remote query (fails if offline)
Update API layer for scheduler integration:
- WebSocket handlers with scheduler protocol support
- Jobs WebSocket endpoint with priority queue integration
- Validation middleware for scheduler messages
- Server configuration with security hardening
- Protocol definitions for worker-scheduler communication
- Dataset handlers with tenant isolation checks
- Response helpers with audit context
- OpenAPI spec updates for new endpoints
Add comprehensive research context tracking to jobs:
- Narrative fields: hypothesis, context, intent, expected_outcome
- Experiment groups and tags for organization
- Run comparison (compare command) for diff analysis
- Run search (find command) with criteria filtering
- Run export (export command) for data portability
- Outcome setting (outcome command) for experiment validation
Update queue and requeue commands to support narrative fields.
Add narrative validation to manifest validator.
Add WebSocket handlers for compare, find, export, and outcome operations.
Includes E2E tests for phase 2 features.
- VerifySnapshot: SHA256 verification using integrity package
- EnforceTaskProvenance: Strict and best-effort provenance validation
- RunJupyterTask: Full Jupyter service lifecycle (start/stop/remove/restore/list_packages)
- RunJob: Job execution using executor.JobRunner
- PrewarmNextOnce: Prewarming with queue integration
All methods now use new architecture components instead of placeholders