Add authentication and authorization enhancements:
- tokens.go: New token management system for public task access and cloning
* SHA-256 hashed token storage for security
* Token generation, validation, and automatic cleanup
* Support for public access and clone permissions
- api_key.go: Extend User struct with Groups field
* Lab group membership (ml-lab, nlp-group)
* Integration with permission system for group-based access
- flags.go: Security hardening - migrate to structured logging
* Replace log.Printf with log/slog to mitigate log injection (untrusted values are logged as escaped attributes, not formatted into the message)
* Consistent structured output for all auth warnings
* Safe handling of file paths and errors in logs
- permissions.go: Add task sharing permission constants
* PermissionTasksReadOwn: Access own tasks
* PermissionTasksReadLab: Access lab group tasks
* PermissionTasksReadAll: Admin/institution-wide access
* PermissionTasksShare: Grant access to other users
* PermissionTasksClone: Create copies of shared tasks
* CanAccessTask() method with visibility checks
- database.go: Improve error handling
* Add structured error logging on row close failures

Enhance ml info to query the server when connected and fall back to local
manifests when offline. This unifies its behavior with other commands such
as run, exec, and cancel.
CLI changes:
- Add --local and --remote flags for explicit control
- Auto-detect connection state via mode.detect()
- queryRemoteRun(): Query server via WebSocket for run details
- queryLocalRun(): Read local run_manifest.json
- displayRunInfo(): Shared display logic for both sources
- Add connection status indicators (Remote: connecting.../connected)
WebSocket protocol:
- Add query_run_info opcode (0x28) to the CLI and server
- Add sendQueryRunInfo() method to ws/client.zig
- Protocol: [opcode:1][api_key_hash:16][run_id_len:1][run_id:var]
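
The frame layout above can be illustrated with a small encoder and decoder. This is a sketch under assumptions: the field widths are taken to be bytes, the function names `encodeQueryRunInfo` and `decodeQueryRunInfo` are hypothetical, and the 16-byte hash in `main` is a placeholder because the derivation of api_key_hash is not described here.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	opQueryRunInfo byte = 0x28       // opcode named in the commit message
	headerLen           = 1 + 16 + 1 // opcode + api_key_hash + run_id_len
)

// encodeQueryRunInfo builds [opcode:1][api_key_hash:16][run_id_len:1][run_id:var].
// The one-byte length field limits run_id to 255 bytes.
func encodeQueryRunInfo(apiKeyHash [16]byte, runID string) ([]byte, error) {
	if len(runID) > 255 {
		return nil, errors.New("run_id exceeds 255 bytes")
	}
	frame := make([]byte, 0, headerLen+len(runID))
	frame = append(frame, opQueryRunInfo)
	frame = append(frame, apiKeyHash[:]...)
	frame = append(frame, byte(len(runID)))
	frame = append(frame, runID...)
	return frame, nil
}

// decodeQueryRunInfo parses a frame and rejects truncated or
// inconsistent input instead of reading past the buffer.
func decodeQueryRunInfo(frame []byte) ([16]byte, string, error) {
	var hash [16]byte
	if len(frame) < headerLen {
		return hash, "", errors.New("frame too short")
	}
	if frame[0] != opQueryRunInfo {
		return hash, "", errors.New("unexpected opcode")
	}
	copy(hash[:], frame[1:17])
	n := int(frame[17])
	if len(frame) != headerLen+n {
		return hash, "", errors.New("run_id length mismatch")
	}
	return hash, string(frame[headerLen:]), nil
}

func main() {
	var hash [16]byte // placeholder value, not a real key hash
	copy(hash[:], "0123456789abcdef")
	frame, err := encodeQueryRunInfo(hash, "abc123")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes, opcode 0x%02x\n", len(frame), frame[0])
	_, runID, err := decodeQueryRunInfo(frame)
	fmt.Println(runID, err)
}
```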
Server changes:
- Add handleQueryRunInfo() handler to ws/handler.go
- Returns run_id, job_name, user, timestamp, overall_sha, files_count
- Checks PermJobsRead permission
- Looks up run in experiment manager
Usage:
  ml info abc123            # Auto: tries remote, falls back to local
  ml info abc123 --local    # Force local manifest lookup
  ml info abc123 --remote   # Force remote query (fails if offline)

Update API layer for scheduler integration:
- WebSocket handlers with scheduler protocol support
- Jobs WebSocket endpoint with priority queue integration
- Validation middleware for scheduler messages
- Server configuration with security hardening
- Protocol definitions for worker-scheduler communication
- Dataset handlers with tenant isolation checks
- Response helpers with audit context
- OpenAPI spec updates for new endpoints

Add research context tracking to jobs:
- Narrative fields: hypothesis, context, intent, expected_outcome
- Experiment groups and tags for organization
- Run comparison (compare command) for diff analysis
- Run search (find command) with criteria filtering
- Run export (export command) for data portability
- Outcome setting (outcome command) for experiment validation
Update queue and requeue commands to support narrative fields.
Add narrative validation to manifest validator.
Add WebSocket handlers for compare, find, export, and outcome operations.
Include E2E tests for phase 2 features.

- VerifySnapshot: SHA-256 verification using integrity package
- EnforceTaskProvenance: Strict and best-effort provenance validation
- RunJupyterTask: Full Jupyter service lifecycle (start/stop/remove/restore/list_packages)
- RunJob: Job execution using executor.JobRunner
- PrewarmNextOnce: Prewarming with queue integration
All methods now use the new architecture components instead of placeholders.