Major WebSocket handler refactor:
- Rewrite ws/handler.go with structured message routing and backpressure
- Add connection lifecycle management with heartbeats and timeouts
- Implement graceful connection draining for zero-downtime restarts
Protocol improvements:
- Define structured protocol types in protocol.go for hub communication
- Add versioned message envelopes for backward compatibility
- Standardize error codes and response formats across WebSocket API
Job streaming via WebSocket:
- Simplify ws/jobs.go with async job status streaming
- Add compression for high-volume job updates
Testing:
- Update websocket_e2e_test.go for new protocol semantics
- Add connection resilience tests
Add comprehensive capability routing system to scheduler hub:
- Capability-aware worker matching with requirement/offer negotiation
- Hub v2 protocol with structured message types and heartbeat management
- Worker capability advertisement and dynamic routing decisions
- Orphan recovery for disconnected workers with state reconciliation
- Template-based job scheduling with capability constraints
Add extensive test coverage:
- Unit tests for capability routing logic and heartbeat mechanics
- Unit tests for orphan recovery scenarios
- E2E tests for capability routing across multiple workers
- Hub capabilities integration tests
- Scheduler fixture helpers for test setup
Protocol improvements:
- Define structured protocol messages for hub-worker communication
- Add capability matching algorithm with scoring
- Implement graceful worker disconnection handling
Comprehensive audit system for security and compliance:
- middleware/audit.go: HTTP request/response auditing middleware
* Captures request details, user identity, response status
* Chains audit events with cryptographic hashes for tamper detection
* Configurable filtering for sensitive data redaction
- audit/chain.go: Blockchain-style audit log chaining
* Each entry includes hash of previous entry
* Tamper detection through hash verification
* Supports incremental verification without full scan
- checkpoint.go: Periodic integrity checkpoints
* Creates signed checkpoints for fast verification
* Configurable checkpoint intervals
* Recovery from last known good checkpoint
- rotation.go: Automatic log rotation and archival
* Size-based and time-based rotation policies
* Compressed archival with integrity seals
* Retention policy enforcement
- sealed.go: Cryptographic sealing of audit logs
* Digital signatures for log integrity
* HSM support preparation
* Exportable sealed bundles for external auditors
- verifier.go: Log verification and forensic analysis
* Complete chain verification from genesis to latest
* Detects gaps, tampering, unauthorized modifications
* Forensic export for incident response
Add new API endpoints and clean up handler interfaces:
- groups/handlers.go: New lab group management API
* CRUD operations for lab groups
* Member management with role assignment (admin/member/viewer)
* Group listing and membership queries
- tokens/handlers.go: Token generation and validation endpoints
* Create access tokens for public task sharing
* Validate tokens for secure access
* Token revocation and cleanup
- routes.go: Refactor handler registration
* Integrate groups handler into WebSocket routes
* Remove nil parameters from all handler constructors
* Cleaner dependency injection pattern
- Handler interface cleanup across all modules:
* jobs/handlers.go: Remove unused nil privacyEnforcer parameter
* jupyter/handlers.go: Streamline initialization
* scheduler/handlers.go: Consistent constructor signature
* ws/handler.go: Add groups handler to dependencies
Add comprehensive authentication and authorization enhancements:
- tokens.go: New token management system for public task access and cloning
* SHA-256 hashed token storage for security
* Token generation, validation, and automatic cleanup
* Support for public access and clone permissions
- api_key.go: Extend User struct with Groups field
* Lab group membership (ml-lab, nlp-group)
* Integration with permission system for group-based access
- flags.go: Security hardening - migrate to structured logging
* Replace log.Printf with log/slog to prevent log injection attacks
* Consistent structured output for all auth warnings
* Safe handling of file paths and errors in logs
- permissions.go: Add task sharing permission constants
* PermissionTasksReadOwn: Access own tasks
* PermissionTasksReadLab: Access lab group tasks
* PermissionTasksReadAll: Admin/institution-wide access
* PermissionTasksShare: Grant access to other users
* PermissionTasksClone: Create copies of shared tasks
* CanAccessTask() method with visibility checks
- database.go: Improve error handling
* Add structured error logging on row close failures
Enhance ml info to query server when connected, falling back to local
manifests when offline. Unifies behavior with other commands like run,
exec, and cancel.
CLI changes:
- Add --local and --remote flags for explicit control
- Auto-detect connection state via mode.detect()
- queryRemoteRun(): Query server via WebSocket for run details
- queryLocalRun(): Read local run_manifest.json
- displayRunInfo(): Shared display logic for both sources
- Add connection status indicators (Remote: connecting.../connected)
WebSocket protocol:
- Add query_run_info opcode (0x28) to cli and server
- Add sendQueryRunInfo() method to ws/client.zig
- Protocol: [opcode:1][api_key_hash:16][run_id_len:1][run_id:var]
Server changes:
- Add handleQueryRunInfo() handler to ws/handler.go
- Returns run_id, job_name, user, timestamp, overall_sha, files_count
- Checks PermJobsRead permission
- Looks up run in experiment manager
Usage:
ml info abc123 # Auto: tries remote, falls back to local
ml info abc123 --local # Force local manifest lookup
ml info abc123 --remote # Force remote query (fails if offline)
Update API layer for scheduler integration:
- WebSocket handlers with scheduler protocol support
- Jobs WebSocket endpoint with priority queue integration
- Validation middleware for scheduler messages
- Server configuration with security hardening
- Protocol definitions for worker-scheduler communication
- Dataset handlers with tenant isolation checks
- Response helpers with audit context
- OpenAPI spec updates for new endpoints
- Fix duplicate check in security_test.go lint warning
- Mark SHA256 tests as Legacy for backward compatibility
- Convert TODO comments to documentation (task, handlers, privacy)
- Update user_manager_test to use GenerateAPIKey pattern
Add comprehensive research context tracking to jobs:
- Narrative fields: hypothesis, context, intent, expected_outcome
- Experiment groups and tags for organization
- Run comparison (compare command) for diff analysis
- Run search (find command) with criteria filtering
- Run export (export command) for data portability
- Outcome setting (outcome command) for experiment validation
Update queue and requeue commands to support narrative fields.
Add narrative validation to manifest validator.
Add WebSocket handlers for compare, find, export, and outcome operations.
Includes E2E tests for phase 2 features.
Update internal/api/server_config.go to use centralized PathRegistry:
Changes:
- Update EnsureLogDirectory() to use config.FromEnv().LogDir() with EnsureDir()
- Update Validate() to use PathRegistry for default BasePath and DataDir
- Remove hardcoded /tmp/ml-experiments default
- Use paths.ExperimentsDir() and paths.DataDir() for consistent paths
Benefits:
- Consistent directory locations via PathRegistry
- Centralized directory creation with EnsureDir()
- Better error handling for directory creation
Move ExpandPath function and path-related utilities from internal/config to internal/storage where they belong.
Files updated:
- internal/worker/config.go: use storage.ExpandPath
- internal/network/ssh.go: use storage.ExpandPath
- cmd/data_manager/data_manager_config.go: use storage.ExpandPath
- internal/api/server_config.go: use storage.ExpandPath
internal/storage/paths.go already contained the canonical implementation.
Result: Path utilities now live in storage layer, config package focuses on configuration structs.
- VerifySnapshot: SHA256 verification using integrity package
- EnforceTaskProvenance: Strict and best-effort provenance validation
- RunJupyterTask: Full Jupyter service lifecycle (start/stop/remove/restore/list_packages)
- RunJob: Job execution using executor.JobRunner
- PrewarmNextOnce: Prewarming with queue integration
All methods now use new architecture components instead of placeholders
- Renamed selectDependencyManifest to SelectDependencyManifest (exported)
- Added re-export in worker package for backward compatibility
- Updated internal call in container.go to use exported function
- API helpers can now access via worker.SelectDependencyManifest
Build status: Compiles successfully
Updated api/ws_compat.go to properly delegate to api/ws package:
- NewWSHandler returns http.Handler interface (not interface{})
- All Opcode* constants re-exported from ws package
- Maintains backward compatibility for existing tests
This allows gradual migration of tests to use api/ws directly without
breaking the build. Tests can be updated incrementally.
Build status: Compiles successfully
- Move ci-test.sh and setup.sh to scripts/
- Trim docs/src/zig-cli.md to current structure
- Replace hardcoded secrets with placeholders in configs
- Update .gitignore to block .env*, secrets/, keys, build artifacts
- Slim README.md to reflect current CLI/TUI split
- Add cleanup trap to ci-test.sh
- Ensure no secrets are committed
- Fix YAML tags in auth config struct (json -> yaml)
- Update CLI configs to use pre-hashed API keys
- Remove double hashing in WebSocket client
- Fix port mapping (9102 -> 9103) in CLI commands
- Update permission keys to use jobs:read, jobs:create, etc.
- Clean up all debug logging from CLI and server
- All user roles now authenticate correctly:
* Admin: Can queue jobs and see all jobs
* Researcher: Can queue jobs and see own jobs
* Analyst: Can see status (read-only access)
Multi-user authentication is now fully functional.
- Add API server with WebSocket support and REST endpoints
- Implement authentication system with API keys and permissions
- Add task queue system with Redis backend and error handling
- Include storage layer with database migrations and schemas
- Add comprehensive logging, metrics, and telemetry
- Implement security middleware and network utilities
- Add experiment management and container orchestration
- Include configuration management with smart defaults