Commit graph

424 commits

Author SHA1 Message Date
Jeremie Fraeys
a4e2ecdbe6
refactor: co-locate api, audit, auth tests with source code
Move unit tests from tests/unit/ to internal/ following Go conventions:
- tests/unit/api/* -> internal/api/* (WebSocket handlers, helpers, duplicate detection)
- tests/unit/audit/* -> internal/audit/* (alert, sealed, verifier tests)
- tests/unit/auth/* -> internal/auth/* (API key, keychain, user manager)
- tests/unit/crypto/kms/* -> internal/auth/kms/* (cache, protocol tests)

Update import paths in test files to reflect new locations.

Benefits:
- Tests live alongside the code they test
- Easier navigation and maintenance
- Clearer package boundaries
- Follows standard Go project layout
2026-03-12 16:34:54 -04:00
Jeremie Fraeys
b00fa236db
docs: add Known Limitations section and testing structure updates
Add Known Limitations section to AGENTS.md documenting:
- AMD GPU not implemented (use NVIDIA, Apple Silicon, or CPU)
- 100+ node gang allocation stress testing not yet implemented
- Podman-in-Docker CI requires privileged mode, not yet automated
- Error handling patterns for unimplemented features
- Container usage rules (Docker for testing/deployments, Podman for experiments)
- Error codes table (NOT_IMPLEMENTED, NOT_FOUND, INVALID_CONFIGURATION)

Update testing documentation to reflect new test locations:
- Unit tests moved from tests/unit/ to internal/ (Go convention)
- Update all test file path references in security testing docs
2026-03-12 16:33:19 -04:00
Jeremie Fraeys
6646f3a382
ci(docker): add test workflow and container architecture docs
- Create docker-tests.yml for merge-to-main CI pipeline
- Add mock GPU test matrix (NVIDIA, Metal, CPU-only)
- Add AGENTS.md with container architecture rules:
  * Docker for CI/CD testing and deployments
  * Podman for ML experiment isolation only
- Update .gitignore to track AGENTS.md
2026-03-12 14:05:53 -04:00
Jeremie Fraeys
6af85ddaf6
feat(tests): enable stress and long-running test suites
Stress Tests:
- TestStress_WorkerConnectBurst: 30 workers, p99 latency validation
- TestStress_JobSubmissionBurst: 1K job submissions
- TestStress_WorkerChurn: 50 connect/disconnect cycles, memory leak detection
- TestStress_ConcurrentScheduling: 10 workers x 20 jobs contention

Long-Running Tests:
- TestLongRunning_MemoryLeak: heap growth monitoring
- TestLongRunning_OrphanRecovery: worker death/requeue stability
- TestLongRunning_WebSocketStability: 20 worker connection stability

Infrastructure:
- Add testreport package with JSON output, flaky test tracking
- Add TestTimer for timing/budget enforcement
- Add WaitForEvent, WaitForTaskStatus helpers
- Fix worker IDs to use valid bench-worker token patterns
2026-03-12 14:05:45 -04:00
Jeremie Fraeys
ca913e8878
feat(scheduler): add test mode config and TLS detection
- Add DisableTLSForTesting to HubConfig for test environments
- Add IsUsingTLS() method to detect scheduler TLS status
- Update MockWorker to auto-select ws:// vs wss:// protocol
- Set DisableTLSForTesting: true in DefaultHubConfig
2026-03-12 14:05:35 -04:00
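The scheme selection described in ca913e8878 can be sketched roughly as follows. This is a minimal illustration, not the project's code: only `DisableTLSForTesting`, `HubConfig`, and `IsUsingTLS()` come from the commit message; attaching the method to `HubConfig`, the `WorkerURL` helper, and the `/ws/worker` path are assumptions.

```go
package main

import (
	"fmt"
	"net/url"
)

// HubConfig mirrors only the one field named in the commit message.
type HubConfig struct {
	DisableTLSForTesting bool
}

// IsUsingTLS reports whether the hub would serve TLS under this config.
// Placing the method on HubConfig is an assumption made for this sketch.
func (c HubConfig) IsUsingTLS() bool {
	return !c.DisableTLSForTesting
}

// WorkerURL picks ws:// or wss:// from the TLS status.
// The "/ws/worker" path is a placeholder, not a documented endpoint.
func WorkerURL(cfg HubConfig, hostPort string) string {
	scheme := "ws"
	if cfg.IsUsingTLS() {
		scheme = "wss"
	}
	u := url.URL{Scheme: scheme, Host: hostPort, Path: "/ws/worker"}
	return u.String()
}

func main() {
	fmt.Println(WorkerURL(HubConfig{DisableTLSForTesting: true}, "localhost:8080"))
	// prints: ws://localhost:8080/ws/worker
	fmt.Println(WorkerURL(HubConfig{}, "hub.example.org:443"))
	// prints: wss://hub.example.org:443/ws/worker
}
```

Because the commit sets `DisableTLSForTesting: true` in `DefaultHubConfig`, a non-test deployment built on that default would presumably need to switch the flag off explicitly.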
Jeremie Fraeys
c5524562e9
test(scheduler): remove unused fields in service slot pool separation test
Remove ID and GPUCount fields from batchJob in TestServiceSlotPoolSeparation
that were assigned but never used. The test only validates SlotPool values.
2026-03-12 12:10:33 -04:00
Jeremie Fraeys
a49e8f593c
chore(tools): update fetchml-vet analyzers
Analyzer improvements:
- hipaacomplete.go: refined HIPAA compliance checks
- manifestenv.go: environment variable validation in manifests
- nobaredetector.go: detection of bare credential exposures
- noinlinecredentials.go: inline credential scanning improvements
2026-03-12 12:09:34 -04:00
Jeremie Fraeys
2bd7f97ae2
test(integration,unit): update test suites for new features and APIs
Integration test updates:
- jupyter_experiment_test.go: update for new workspace handling
- run_manifest_test.go: reproducibility manifest validation
- secrets_integration_test.go: KMS and secret provider tests
- storage_redis_integration_test.go: Redis-backed storage tests

Unit test updates:
- response_helpers_test.go: API response helper tests
- config_hash_test.go: configuration hashing for reproducibility
- filetype_test.go: security file type detection tests

Load testing:
- load_test.go: scheduler load and stress tests
2026-03-12 12:09:15 -04:00
Jeremie Fraeys
2b1ef10514
test(chaos): add worker disconnect chaos test and queue improvements
Chaos testing:
- Add worker_disconnect_chaos_test.go for network partition resilience
- Test scheduler hub recovery and job reassignment scenarios

Queue layer updates:
- event_store.go: add event sourcing for queue operations
- native_queue.go: extend native queue with batch operations and indexing
2026-03-12 12:08:21 -04:00
Jeremie Fraeys
93d6d63d8d
chore(deploy): update Docker compose files and add MinIO lifecycle policies
Docker Compose updates:
- docker-compose.dev.yml: add GPU support, local scheduler and worker
- docker-compose.staging.yml: production-like staging with SSL termination
- docker-compose.test.yml: ephemeral test environment with seeded data

MinIO lifecycle management:
- Add lifecycle-dev.json: 7-day retention for dev artifacts
- Add lifecycle-staging.json: 30-day retention with transition to cold

Build improvements:
- Makefile: add native library build targets and cross-platform support
- scripts/release/cleanup.sh: improved artifact cleanup with dry-run mode
2026-03-12 12:06:16 -04:00
Jeremie Fraeys
17170667e2
feat(worker): improve lifecycle management and vLLM plugin
Lifecycle improvements:
- runloop.go: refined state machine with better error recovery
- service_manager.go: service dependency management and health checks
- states.go: add states for capability advertisement and draining

Container execution:
- container.go: improved OCI runtime integration with supply chain checks
- Add image verification and signature validation
- Better resource limits enforcement for GPU/memory

vLLM plugin updates:
- vllm.go: support for vLLM 0.3+ with new engine arguments
- Add quantization-aware scheduling (AWQ, GPTQ, FP8)
- Improve model download and caching logic

Configuration:
- config.go: add capability advertisement configuration
- snapshot_store.go: improve snapshot management for checkpointing
2026-03-12 12:05:02 -04:00
Jeremie Fraeys
c18a8619fe
feat(api): add structured error package and refactor handlers
New error handling:
- Add internal/api/errors/errors.go with structured API error types
- Standardize error codes across all API endpoints
- Separate user-facing error messages from internal error details

Handler improvements:
- jupyter/handlers.go: better workspace lifecycle and error handling
- plugins/handlers.go: plugin management with validation
- groups/handlers.go: group CRUD with capability metadata
- jobs/handlers.go: job submission and monitoring improvements
- datasets/handlers.go: dataset upload/download with progress
- validate/handlers.go: manifest validation with detailed errors
- audit/handlers.go: audit log querying with filters

Server configuration:
- server_config.go: refined config loading with validation
- server_gen.go: improved code generation for OpenAPI specs
2026-03-12 12:04:46 -04:00
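One way to keep user-facing messages apart from internal details, as c18a8619fe describes, is an error type whose internal cause is never serialized. The sketch below is illustrative only: the `APIError` type, its fields, and the `INTERNAL` fallback code are assumptions; `NOT_FOUND` is one of the codes listed in the AGENTS.md commit above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// APIError carries a stable code and a user-facing message.
// The unexported cause is kept for logs and is never marshalled.
type APIError struct {
	Code    string `json:"code"`
	Message string `json:"message"`
	cause   error
}

// Error renders the full detail, intended for server-side logs.
func (e *APIError) Error() string {
	if e.cause != nil {
		return e.Code + ": " + e.Message + ": " + e.cause.Error()
	}
	return e.Code + ": " + e.Message
}

// Unwrap exposes the internal cause to errors.Is / errors.As.
func (e *APIError) Unwrap() error { return e.cause }

// NotFound builds a NOT_FOUND error around an optional internal cause.
func NotFound(msg string, cause error) *APIError {
	return &APIError{Code: "NOT_FOUND", Message: msg, cause: cause}
}

// PublicJSON renders only what a client should see. Errors that are not
// APIError values collapse to a generic body (the INTERNAL code is hypothetical).
func PublicJSON(err error) string {
	var apiErr *APIError
	if !errors.As(err, &apiErr) {
		apiErr = &APIError{Code: "INTERNAL", Message: "internal error"}
	}
	b, _ := json.Marshal(apiErr)
	return string(b)
}

func main() {
	err := NotFound("run not found", errors.New("sql: no rows in result set"))
	fmt.Println(PublicJSON(err)) // prints: {"code":"NOT_FOUND","message":"run not found"}
	fmt.Println(err)             // prints: NOT_FOUND: run not found: sql: no rows in result set
}
```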
Jeremie Fraeys
37c4d4e9c7
feat(crypto,auth): harden KMS and improve permission handling
KMS improvements:
- cache.go: add LRU eviction with memory-bounded caches
- provider.go: refactor provider initialization and key rotation
- tenant_keys.go: per-tenant key isolation with envelope encryption

Auth layer updates:
- hybrid.go: refine hybrid auth flow for API key + JWT
- permissions_loader.go: faster permission caching with hot-reload
- validator.go: stricter validation with detailed error messages

Security middleware:
- security.go: add rate limiting headers and CORS refinement

Testing and benchmarks:
- Add KMS cache and protocol unit tests
- Add KMS benchmark tests for encryption throughput
- Update KMS integration tests for tenant isolation
2026-03-12 12:04:32 -04:00
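The "LRU eviction with memory-bounded caches" item in 37c4d4e9c7 might look something like the sketch below. It assumes the bound is the total byte length of cached values; the `LRU` type and its methods are hypothetical, and the sketch omits locking, TTLs, and key zeroization that a real KMS cache would likely need.

```go
package main

import (
	"container/list"
	"fmt"
)

// item is one cached key/value pair stored in the recency list.
type item struct {
	key   string
	value []byte
}

// LRU evicts least-recently-used entries once total value bytes exceed maxBytes.
// It is not safe for concurrent use.
type LRU struct {
	maxBytes int
	curBytes int
	order    *list.List // front = most recently used
	items    map[string]*list.Element
}

func NewLRU(maxBytes int) *LRU {
	return &LRU{maxBytes: maxBytes, order: list.New(), items: make(map[string]*list.Element)}
}

// Get returns the value and marks the entry as recently used.
func (c *LRU) Get(key string) ([]byte, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*item).value, true
}

// Put inserts or updates an entry, then evicts from the back until the
// cache fits its byte budget. A value larger than the budget is evicted at once.
func (c *LRU) Put(key string, value []byte) {
	if el, ok := c.items[key]; ok {
		it := el.Value.(*item)
		c.curBytes += len(value) - len(it.value)
		it.value = value
		c.order.MoveToFront(el)
	} else {
		c.items[key] = c.order.PushFront(&item{key: key, value: value})
		c.curBytes += len(value)
	}
	for c.curBytes > c.maxBytes && c.order.Len() > 0 {
		it := c.order.Remove(c.order.Back()).(*item)
		delete(c.items, it.key)
		c.curBytes -= len(it.value)
	}
}

func main() {
	c := NewLRU(8)
	c.Put("a", []byte("1234"))
	c.Put("b", []byte("1234"))
	c.Get("a")                 // "a" becomes most recently used
	c.Put("c", []byte("1234")) // over budget: "b" is evicted
	_, okB := c.Get("b")
	_, okA := c.Get("a")
	fmt.Println(okA, okB) // prints: true false
}
```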
Jeremie Fraeys
de83300962
feat(worker): refactor GPU detection with macOS Metal support
GPU detection refactor:
- Major rewrite of gpu_detector.go with unified detection interface
- Support for NVIDIA (NVML), AMD (ROCm), and Apple Metal
- Runtime GPU capability querying for scheduler matching

macOS improvements:
- gpu_macos.go: native Metal device enumeration and memory queries
- Support for Apple Silicon (M1/M2/M3) unified memory reporting
- Fallback to system profiler for Intel Macs

Testing infrastructure:
- Add gpu_detector_mock.go for testing without hardware
- Update gpu_golden_test.go with platform-specific expectations
- Cross-platform GPU info validation
2026-03-12 12:02:41 -04:00
Jeremie Fraeys
188cf55939
refactor(api): overhaul WebSocket handler and protocol layer
Major WebSocket handler refactor:
- Rewrite ws/handler.go with structured message routing and backpressure
- Add connection lifecycle management with heartbeats and timeouts
- Implement graceful connection draining for zero-downtime restarts

Protocol improvements:
- Define structured protocol types in protocol.go for hub communication
- Add versioned message envelopes for backward compatibility
- Standardize error codes and response formats across WebSocket API

Job streaming via WebSocket:
- Simplify ws/jobs.go with async job status streaming
- Add compression for high-volume job updates

Testing:
- Update websocket_e2e_test.go for new protocol semantics
- Add connection resilience tests
2026-03-12 12:01:21 -04:00
Jeremie Fraeys
ad3be36a6d
feat(cli): add workers command, scheduler client, and PII utilities
New commands and modules:
- Add workers.zig command for worker management and status
- Add scheduler_client.zig for scheduler hub communication
- Add pii.zig utility for PII detection and redaction in logs/outputs

Improvements to existing commands:
- groups.zig: enhanced group management with capability metadata
- jupyter/mod.zig: improved Jupyter workspace lifecycle handling
- tasks.zig: better task status reporting and cancellation support

Networking and sync improvements:
- ws/client.zig: WebSocket client enhancements for hub protocol
- sync_manager.zig: improved sync with scheduler state and conflict resolution
- uuid.zig: optimized UUID generation for macOS and Linux

Database utilities:
- sqlite_embedded.zig: embedded SQLite for CLI-local state caching
2026-03-12 12:00:49 -04:00
Jeremie Fraeys
57787e1e7b
feat(scheduler): implement capability-based routing and hub v2
Add comprehensive capability routing system to scheduler hub:
- Capability-aware worker matching with requirement/offer negotiation
- Hub v2 protocol with structured message types and heartbeat management
- Worker capability advertisement and dynamic routing decisions
- Orphan recovery for disconnected workers with state reconciliation
- Template-based job scheduling with capability constraints

Add extensive test coverage:
- Unit tests for capability routing logic and heartbeat mechanics
- Unit tests for orphan recovery scenarios
- E2E tests for capability routing across multiple workers
- Hub capabilities integration tests
- Scheduler fixture helpers for test setup

Protocol improvements:
- Define structured protocol messages for hub-worker communication
- Add capability matching algorithm with scoring
- Implement graceful worker disconnection handling
2026-03-12 12:00:05 -04:00
Jeremie Fraeys
13ffb81cab
fix: add CGO build tags to consistency tests, remove unused isHex function
2026-03-08 13:10:00 -04:00
Jeremie Fraeys
7eee31d721
chore: cleanup and miscellaneous updates
- .gitignore: Add reports/ and .api-keys
- examples/jupyter_experiment_integration.py: Update for new API
- podman/scripts/: CLI integration, secure runner, ML tool testing
- tools/: Performance regression detector, profiler utilities
2026-03-08 13:04:01 -04:00
Jeremie Fraeys
c74e91dd69
test: update test suite and remove deprecated privacy middleware
Test improvements:
- fixtures/: Updated mocks, fixtures with group context, SSH server, TUI driver
- integration/: WebSocket queue and handler tests with groups
- e2e/: WebSocket and TLS proxy end-to-end tests
- unit/api/ws_test.go: WebSocket API tests
- unit/scheduler/service_templates_test.go: Service template tests
- benchmarks/scheduler_bench_test.go: Performance benchmarks

Cleanup:
- Remove privacy middleware (replaced by audit system)
- Remove privacy_test.go
2026-03-08 13:03:55 -04:00
Jeremie Fraeys
cb142213fa
chore(build): update build system, Dockerfiles, and dependencies
Build and deployment improvements:

Makefile:
- Native library build targets with ASan support
- Cross-platform compilation helpers
- Performance benchmark targets
- Security scan integration

Docker:
- secure-prod.Dockerfile: Hardened production image (non-root, minimal surface)
- simple.Dockerfile: Lightweight development image

Scripts:
- build/: Go and native library build scripts, cross-platform builds
- ci/: checks.sh, test.sh, verify-paths.sh for validation
- benchmarks/: Local performance testing and regression tracking
- dev/: Monitoring setup

Dependencies: Update to latest stable with security patches

Commands:
- api-server/main.go: Server initialization updates
- data_manager/data_sync.go: Data sync with visibility
- errors/main.go: Error handling improvements
- tui/: TUI improvements for group management
2026-03-08 13:03:48 -04:00
Jeremie Fraeys
4b2782f674
feat(domain): add task visibility and supporting infrastructure
Core domain and utility updates:

- domain/task.go: Task model with visibility system
  * Visibility enum: private, lab, institution, open
  * Group associations for lab-scoped access
  * CreatedBy tracking for ownership
  * Sharing metadata with expiry

- config/paths.go: Group-scoped data directories and audit log paths
- crypto/signing.go: Key management for audit sealing, token signature verification
- container/supply_chain.go: Image provenance tracking, vulnerability scanning
- fileutil/filetype.go: MIME type detection and security validation
- fileutil/secure.go: Protected file permissions, secure deletion
- jupyter/: Package and service manager updates
- experiment/manager.go: Visibility cascade from experiments to tasks
- network/ssh.go: SSH tunneling improvements
- queue/: Filesystem queue enhancements
2026-03-08 13:03:27 -04:00
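The visibility levels listed in 4b2782f674 (private, lab, institution, open) suggest an access check along the lines below. The semantics are assumptions made for illustration: owners always have access, "institution" means any authenticated user, and "open" means anyone. Sharing metadata with expiry is not modelled, and the type and function names are hypothetical.

```go
package main

import "fmt"

// Visibility follows the four levels named in the commit message.
type Visibility int

const (
	Private Visibility = iota
	Lab
	Institution
	Open
)

// Task holds only the fields this sketch needs.
type Task struct {
	CreatedBy  string
	Group      string
	Visibility Visibility
}

// User is identified by name; an empty name stands for an anonymous caller.
type User struct {
	Name   string
	Groups []string
}

// CanAccess applies the assumed rules: owner first, then visibility level.
func CanAccess(u User, t Task) bool {
	if u.Name != "" && u.Name == t.CreatedBy {
		return true
	}
	switch t.Visibility {
	case Open:
		return true
	case Institution:
		return u.Name != "" // assumed: any authenticated user
	case Lab:
		for _, g := range u.Groups {
			if g == t.Group {
				return true
			}
		}
	}
	return false // Private, or Lab without matching membership
}

func main() {
	task := Task{CreatedBy: "alice", Group: "ml-lab", Visibility: Lab}
	fmt.Println(CanAccess(User{Name: "bob", Groups: []string{"ml-lab"}}, task))      // prints: true
	fmt.Println(CanAccess(User{Name: "carol", Groups: []string{"nlp-group"}}, task)) // prints: false
}
```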
Jeremie Fraeys
0b5e99f720
refactor(scheduler,worker): improve service management and GPU detection
Scheduler enhancements:
- auth.go: Group membership validation in authentication
- hub.go: Task distribution with group affinity
- port_allocator.go: Dynamic port allocation with conflict resolution
- scheduler_conn.go: Connection pooling and retry logic
- service_manager.go: Lifecycle management for scheduler services
- service_templates.go: Template-based service configuration
- state.go: Persistent state management with recovery

Worker improvements:
- config.go: Extended configuration for task visibility rules
- execution/setup.go: Sandboxed execution environment setup
- executor/container.go: Container runtime integration
- executor/runner.go: Task runner with visibility enforcement
- gpu_detector.go: Robust GPU detection (NVIDIA, AMD, Apple Silicon, CPU fallback)
- integrity/validate.go: Data integrity validation
- lifecycle/runloop.go: Improved runloop with graceful shutdown
- lifecycle/service_manager.go: Service lifecycle coordination
- process/isolation.go + isolation_unix.go: Process isolation with namespaces/cgroups
- tenant/manager.go: Multi-tenant resource isolation
- tenant/middleware.go: Tenant context propagation
- worker.go: Core worker with group-scoped task execution
2026-03-08 13:03:15 -04:00
Jeremie Fraeys
5ae997ceb3
feat(cli): add groups and tasks commands with visibility controls
New Zig CLI commands for lab management:

- groups.zig: Lab group management commands
  * create-group: Create new lab groups with metadata
  * list-groups: Show all groups with member counts
  * add-member: Add users with role assignment (admin/member/viewer)
  * remove-member: Remove users from groups
  * group-info: Display group details and membership

- tasks.zig: Task operations with visibility integration
  * create-task: New tasks with visibility flag (private/lab/institution/open)
  * list-tasks: Filter by visibility level and group membership
  * share-task: Generate access tokens for external sharing
  * clone-task: Copy tasks with public clone tokens
  * task-visibility: Change visibility and cascade to experiments

- run.zig: Updated experiment runner
  * Integrate with new task visibility system
  * Group-scoped experiment execution
  * Token-based access for shared experiments

- main.zig: Command registration updates
  * Wire up new groups and tasks commands
  * Updated help text and command discovery
2026-03-08 13:03:10 -04:00
Jeremie Fraeys
1c7205c0a0
feat(audit): add HTTP audit middleware and tamper-evident logging
Comprehensive audit system for security and compliance:

- middleware/audit.go: HTTP request/response auditing middleware
  * Captures request details, user identity, response status
  * Chains audit events with cryptographic hashes for tamper detection
  * Configurable filtering for sensitive data redaction

- audit/chain.go: Blockchain-style audit log chaining
  * Each entry includes hash of previous entry
  * Tamper detection through hash verification
  * Supports incremental verification without full scan

- checkpoint.go: Periodic integrity checkpoints
  * Creates signed checkpoints for fast verification
  * Configurable checkpoint intervals
  * Recovery from last known good checkpoint

- rotation.go: Automatic log rotation and archival
  * Size-based and time-based rotation policies
  * Compressed archival with integrity seals
  * Retention policy enforcement

- sealed.go: Cryptographic sealing of audit logs
  * Digital signatures for log integrity
  * HSM support preparation
  * Exportable sealed bundles for external auditors

- verifier.go: Log verification and forensic analysis
  * Complete chain verification from genesis to latest
  * Detects gaps, tampering, unauthorized modifications
  * Forensic export for incident response
2026-03-08 13:03:02 -04:00
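The hash chaining described for audit/chain.go in 1c7205c0a0 (each entry includes the hash of the previous entry) can be illustrated with a small sketch. SHA-256, the entry layout, and the function names are assumptions; the commit message only says "cryptographic hashes". Checkpoints, signatures, and rotation are left out.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Entry is one audit record linked to its predecessor by hash.
type Entry struct {
	Payload  string
	PrevHash string
	Hash     string
}

// hashEntry binds a payload to the previous entry's hash.
func hashEntry(prevHash, payload string) string {
	sum := sha256.Sum256([]byte(prevHash + "\n" + payload))
	return hex.EncodeToString(sum[:])
}

// Append adds a record whose hash covers the previous entry's hash.
// The genesis entry uses an empty previous hash.
func Append(chain []Entry, payload string) []Entry {
	prev := ""
	if len(chain) > 0 {
		prev = chain[len(chain)-1].Hash
	}
	return append(chain, Entry{Payload: payload, PrevHash: prev, Hash: hashEntry(prev, payload)})
}

// Verify walks the chain from genesis and returns the index of the first
// entry that fails verification, or -1 if the chain is intact.
func Verify(chain []Entry) int {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || e.Hash != hashEntry(prev, e.Payload) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var chain []Entry
	chain = Append(chain, "login user=alice")
	chain = Append(chain, "submit job=42")
	chain = Append(chain, "logout user=alice")
	fmt.Println(Verify(chain)) // prints: -1

	chain[1].Payload = "submit job=43" // tamper with the middle entry
	fmt.Println(Verify(chain))         // prints: 1
}
```

Editing an entry without recomputing every later hash is detected at the edited index. An attacker who can rewrite the whole log could still rebuild the chain, which is presumably why the commit adds signed checkpoints and sealing on top.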
Jeremie Fraeys
7e5ceec069
feat(api): add groups and tokens handlers, refactor routes
Add new API endpoints and clean up handler interfaces:

- groups/handlers.go: New lab group management API
  * CRUD operations for lab groups
  * Member management with role assignment (admin/member/viewer)
  * Group listing and membership queries

- tokens/handlers.go: Token generation and validation endpoints
  * Create access tokens for public task sharing
  * Validate tokens for secure access
  * Token revocation and cleanup

- routes.go: Refactor handler registration
  * Integrate groups handler into WebSocket routes
  * Remove nil parameters from all handler constructors
  * Cleaner dependency injection pattern

- Handler interface cleanup across all modules:
  * jobs/handlers.go: Remove unused nil privacyEnforcer parameter
  * jupyter/handlers.go: Streamline initialization
  * scheduler/handlers.go: Consistent constructor signature
  * ws/handler.go: Add groups handler to dependencies
2026-03-08 12:51:25 -04:00
Jeremie Fraeys
c52179dcbe
feat(auth): add token-based access and structured logging
Add comprehensive authentication and authorization enhancements:

- tokens.go: New token management system for public task access and cloning
  * SHA-256 hashed token storage for security
  * Token generation, validation, and automatic cleanup
  * Support for public access and clone permissions

- api_key.go: Extend User struct with Groups field
  * Lab group membership (ml-lab, nlp-group)
  * Integration with permission system for group-based access

- flags.go: Security hardening - migrate to structured logging
  * Replace log.Printf with log/slog to prevent log injection attacks
  * Consistent structured output for all auth warnings
  * Safe handling of file paths and errors in logs

- permissions.go: Add task sharing permission constants
  * PermissionTasksReadOwn: Access own tasks
  * PermissionTasksReadLab: Access lab group tasks
  * PermissionTasksReadAll: Admin/institution-wide access
  * PermissionTasksShare: Grant access to other users
  * PermissionTasksClone: Create copies of shared tasks
  * CanAccessTask() method with visibility checks

- database.go: Improve error handling
  * Add structured error logging on row close failures
2026-03-08 12:51:07 -04:00
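The move from `log.Printf` to `log/slog` in c52179dcbe matters because attribute values are encoded by the handler instead of being spliced into a format string. The sketch below shows the effect with the standard JSON handler (Go 1.21+); the message text, attribute keys, and helper names are made up for illustration.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"log/slog"
	"strings"
)

// warnUnreadable logs the path and error as structured attributes,
// so control characters in either value are escaped by the handler.
func warnUnreadable(logger *slog.Logger, path string, err error) {
	logger.Warn("auth config unreadable", "path", path, "error", err)
}

// capture logs one warning for the given path and returns the raw output.
func capture(path string) string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	warnUnreadable(logger, path, errors.New("permission denied"))
	return buf.String()
}

// lineCount reports how many newline-terminated records the output holds.
func lineCount(s string) int { return strings.Count(s, "\n") }

func main() {
	// A path containing a newline would forge a second line under log.Printf.
	out := capture("/etc/app/keys\nlevel=ERROR msg=\"forged entry\"")
	fmt.Print(out)
	fmt.Println("lines:", lineCount(out)) // prints: lines: 1
}
```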
Jeremie Fraeys
fbcf4d38e5
feat(storage): add groups, tasks, tokens, and audit database schemas
Add comprehensive database storage layer for new features:

- db_groups.go: Lab group management with members, roles (admin/member/viewer),
  and group-based task visibility queries

- db_tasks.go: Task visibility system (private/lab/institution/open),
  task sharing with expiry, public clone tokens, and optimized
  ListTasksForUser() for access control

- db_tokens.go: Secure token management for public task access and cloning,
  with SHA-256 hashed token storage and automatic cleanup

- db_audit.go: Audit log persistence with checkpoint chains, tamper
  detection, and log rotation support

- schema_sqlite.sql: Updated schema with:
  - groups, group_members tables
  - tasks.visibility enum, task_shares with expiry
  - access_tokens table with hashed tokens
  - audit_logs, audit_checkpoints tables
  - indexes for all foreign keys and query patterns

- db_experiments.go: Add CascadeVisibilityToTasks() for propagating
  visibility changes from experiments to associated tasks
2026-03-08 12:48:42 -04:00
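The "SHA-256 hashed token storage" mentioned for db_tokens.go in fbcf4d38e5 usually means the database keeps only a digest while the plaintext token is shown to the user once. A minimal sketch under that assumption follows; the function names, the 32-byte token size, and the hex encoding are hypothetical.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// HashToken returns the hex SHA-256 digest that would be stored.
func HashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// NewToken returns a random plaintext token for the user and the digest to persist.
func NewToken() (plaintext, storedHash string, err error) {
	raw := make([]byte, 32)
	if _, err = rand.Read(raw); err != nil {
		return "", "", err
	}
	plaintext = hex.EncodeToString(raw)
	return plaintext, HashToken(plaintext), nil
}

// Validate hashes the presented token and compares digests in constant time.
func Validate(presented, storedHash string) bool {
	got := HashToken(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	token, stored, err := NewToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(stored))                 // prints: 64
	fmt.Println(Validate(token, stored))     // prints: true
	fmt.Println(Validate(token+"x", stored)) // prints: false
}
```

Storing only the digest means a leaked table does not reveal usable tokens, at the cost that a lost token cannot be recovered, only reissued.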
Jeremie Fraeys
a239f3a14f
test(consistency): add dataset hash consistency test suite
Add cross-implementation consistency tests for dataset hash functionality:

## Test Fixtures
- Single file, nested directories, and multiple file test cases
- Expected hashes in JSON format for validation

## Test Infrastructure
- harness.go: Common test utilities and reference implementation runner
- dataset_hash_test.go: Consistency test cases comparing implementations
- cmd/update.go: Tool to regenerate expected hashes from reference

## Purpose
Ensures the Go, C++, and Zig hash implementations produce identical results
on all supported platforms.
2026-03-05 14:41:14 -05:00
Jeremie Fraeys
8e5af0da2d
fix(build): resolve shell error in test_summary macro
## Problem
test_summary macro was failing with 'integer expression expected' because
grep -c output contained newlines, breaking the [  -gt 0 ] comparison.

## Fix
- Add | tr -d '\n' to strip newlines from grep -c output
- Add 2>/dev/null to comparison to suppress any edge case errors

## Result
Clean test summary output without shell errors
2026-03-05 14:40:48 -05:00
Jeremie Fraeys
ba9a358412
fix(scheduler): resolve TestEndToEndJobLifecycle race and getTask bug
## Problem
TestEndToEndJobLifecycle was failing with two issues:
1. Race condition: Workers signaled ready before job was processed, receiving
   MsgNoWork instead of MsgJobAssign
2. getTask() didn't check pendingAcceptance - assigned-but-not-yet-accepted
   tasks returned nil

## Changes

### Test Fix (restart_recovery_test.go)
- Replace single-shot select with retry loop that re-signals workers as ready
- Handle both assignment and non-assignment messages correctly
- Add 10ms delay between non-assignment messages to allow job processing
- Use 2-second deadline with 100ms timeout intervals

### Scheduler Fix (hub.go)
- Extend getTask() to check pendingAcceptance map after batch/service queues
- Allows GetTask() to find tasks in 'assigned' state before acceptance
- Maintains backward compatibility with existing queue/running lookups

## Testing
make test now passes: 475 passed, 0 failed, 34 skipped
2026-03-05 14:40:43 -05:00
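The scheduler fix in ba9a358412 extends `getTask()` to consult the `pendingAcceptance` map after the batch and service queues. The sketch below shows that lookup order under simplified assumptions: the `Hub` and `Task` shapes are invented, locking is omitted, and the position of the running-task lookup relative to `pendingAcceptance` is a guess.

```go
package main

import "fmt"

// Task is a minimal stand-in for the scheduler's task record.
type Task struct {
	ID     string
	Status string
}

// Hub holds the places a task can live in this sketch.
type Hub struct {
	batchQueue        []*Task
	serviceQueue      []*Task
	pendingAcceptance map[string]*Task // assigned, not yet accepted by a worker
	running           map[string]*Task
}

// getTask searches the queues first, then pendingAcceptance, then running.
// Before the fix, a task sitting only in pendingAcceptance would yield nil.
func (h *Hub) getTask(id string) *Task {
	for _, q := range [][]*Task{h.batchQueue, h.serviceQueue} {
		for _, t := range q {
			if t.ID == id {
				return t
			}
		}
	}
	if t, ok := h.pendingAcceptance[id]; ok {
		return t
	}
	if t, ok := h.running[id]; ok {
		return t
	}
	return nil
}

func main() {
	h := &Hub{
		batchQueue:        []*Task{{ID: "t1", Status: "queued"}},
		pendingAcceptance: map[string]*Task{"t2": {ID: "t2", Status: "assigned"}},
	}
	fmt.Println(h.getTask("t1").Status)  // prints: queued
	fmt.Println(h.getTask("t2").Status)  // prints: assigned
	fmt.Println(h.getTask("t3") == nil)  // prints: true
}
```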
Jeremie Fraeys
8ee98eaf7f
refactor(cli): improve hash utilities and command structure
## Changes
- Refactor hash.zig utilities for better performance and maintainability
- Clean up command structure in run.zig for clarity
- Simplify main.zig entry point organization
2026-03-05 14:39:54 -05:00
Jeremie Fraeys
eb88d403a1
refactor(cli): update experiment and run commands with ConnectionContext
- Refactor experiment.zig to use common.ConnectionContext for WebSocket connections
  - Eliminates duplicate connection setup code in createExperiment, listExperiments, showExperiment
  - Reduces boilerplate: api_key_hash generation, ws_url construction, client lifecycle
- Major updates to run.zig for improved job execution flow
- Update sync.zig with minor improvements

This refactoring reduces code duplication and centralizes connection
management across CLI commands that communicate with the server.
2026-03-05 13:13:09 -05:00
Jeremie Fraeys
ccb4e15877
refactor(cli): rename exec/ and queue/ directories for clarity
- Rename cli/src/commands/exec/ → executor/ (4 files)
- Rename cli/src/commands/queue/ → submission/ (4 files)
- Create new submission/index.zig, delete queue/index.zig

The new names better reflect the purpose of these modules:
- 'executor' for local/remote execution logic
- 'submission' for job submission and queue management

This is a pure rename with no functional changes.
2026-03-05 13:12:59 -05:00
Jeremie Fraeys
8b83d60452
docs: update changelog, readme and CLI reference for 0.1.0
- Add 0.1.0 release entry to CHANGELOG.md with CLI and C++ native libs highlights
- Update README.md with current project status
- Sync CLI reference documentation with recent command changes
2026-03-05 13:12:52 -05:00
Jeremie Fraeys
e3dad32029
ci: consolidate CLI build workflow and add static rsync
- Add musl-tools to build-cli.yml for static linking support
- Move rsync build from ci.yml into build-cli.yml workflow
- Fix SQLite year parameter in both workflows
- Remove redundant RSYNC_VERSION env var from ci.yml

This consolidates the CLI artifact build process into a single
workflow file, making the CI pipeline easier to maintain.
2026-03-05 13:12:45 -05:00
Jeremie Fraeys
c4b6ae5d0c
refactor(cli): remove obsolete printUsage from exec/mod.zig
Exec is now an internal module used by 'ml run', not a standalone
command. Remove the misleading 'ml exec' usage documentation and
replace with simple internal module message.
2026-03-05 12:23:42 -05:00
Jeremie Fraeys
a36a5e4522
feat(cli): add execution_mode config setting for local/remote/auto preference
Add execution_mode enum (local/remote/auto) to config for persistent
control over command execution behavior. Removes --local/--remote flags
from commands to simplify user workflow - no need to check server
connection status manually.

Changes:
- config.zig: Add ExecutionMode enum, execution_mode field, parsing/serialization
- mode.zig: Update detect() to check execution_mode == .local
- init.zig: Add --mode flag (local/remote/auto) for setting during init
- info.zig: Use config execution_mode, removed --local/--remote flags
- run.zig: Use config execution_mode, removed --local/--remote flags
- exec/mod.zig: Use config execution_mode, removed --local/--remote flags

Priority order for determining execution mode:
1. Config setting (execution_mode: local/remote/auto)
2. Auto-detect only if config is 'auto'

Users set mode once during init:
  ml init --mode=local     # Always use local
  ml init --mode=remote    # Always use remote
  ml init --mode=auto      # Auto-detect (default)
2026-03-05 12:18:30 -05:00
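The priority order in a36a5e4522 (explicit config setting first, auto-detection only when the setting is `auto`) can be sketched as below. The CLI itself is written in Zig; this Go version is only an illustration, and the `Resolve` function and its probe callback are hypothetical.

```go
package main

import "fmt"

// ExecutionMode mirrors the three values named in the commit message.
type ExecutionMode string

const (
	ModeLocal  ExecutionMode = "local"
	ModeRemote ExecutionMode = "remote"
	ModeAuto   ExecutionMode = "auto"
)

// Resolve returns the effective mode. An explicit local/remote setting wins
// outright; the reachability probe runs only for "auto" (treated here as the
// fallback for an unset value, since the commit names it as the default).
func Resolve(configured ExecutionMode, serverReachable func() bool) ExecutionMode {
	switch configured {
	case ModeLocal, ModeRemote:
		return configured
	default:
		if serverReachable() {
			return ModeRemote
		}
		return ModeLocal
	}
}

func main() {
	offline := func() bool { return false }
	online := func() bool { return true }

	fmt.Println(Resolve(ModeRemote, offline)) // prints: remote
	fmt.Println(Resolve(ModeAuto, online))    // prints: remote
	fmt.Println(Resolve(ModeAuto, offline))   // prints: local
}
```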
Jeremie Fraeys
cf8115c670
feat(cli): standardize connection handling across commands
Add isConnected() method to common.ConnectionContext to check WebSocket
client connection state. Migrate all server-connected commands to use
the standardized ConnectionContext pattern:

- jupyter/lifecycle.zig: Replace local ConnectionCtx with common.ConnectionContext
- status.zig: Use ConnectionContext, remove manual connection boilerplate,
  add connection status indicators (connecting/connected)
- cancel.zig: Use ConnectionContext for server cancel operations
- dataset.zig: Use ConnectionContext for list/register/info/search operations
- exec/remote.zig: Use ConnectionContext for remote job execution

Benefits:
- Eliminates ~160 lines of duplicated connection boilerplate
- Consistent error handling and cleanup across commands
- Single point of change for connection logic
- Adds runtime connection state visibility to status command
2026-03-05 12:07:41 -05:00
Jeremie Fraeys
c6a224d5fc
feat(cli,server): unify info command with remote/local support
Enhance ml info to query server when connected, falling back to local
manifests when offline. Unifies behavior with other commands like run,
exec, and cancel.

CLI changes:
- Add --local and --remote flags for explicit control
- Auto-detect connection state via mode.detect()
- queryRemoteRun(): Query server via WebSocket for run details
- queryLocalRun(): Read local run_manifest.json
- displayRunInfo(): Shared display logic for both sources
- Add connection status indicators (Remote: connecting.../connected)

WebSocket protocol:
- Add query_run_info opcode (0x28) to cli and server
- Add sendQueryRunInfo() method to ws/client.zig
- Protocol: [opcode:1][api_key_hash:16][run_id_len:1][run_id:var]

Server changes:
- Add handleQueryRunInfo() handler to ws/handler.go
- Returns run_id, job_name, user, timestamp, overall_sha, files_count
- Checks PermJobsRead permission
- Looks up run in experiment manager

Usage:
  ml info abc123              # Auto: tries remote, falls back to local
  ml info abc123 --local      # Force local manifest lookup
  ml info abc123 --remote     # Force remote query (fails if offline)
2026-03-05 12:07:00 -05:00
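The wire layout given in c6a224d5fc, `[opcode:1][api_key_hash:16][run_id_len:1][run_id:var]` with opcode `0x28`, can be encoded and decoded as in this sketch. Only the layout and the opcode value come from the commit message; the function names, the error handling, and the rule rejecting an empty run ID are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// OpQueryRunInfo is the query_run_info opcode stated in the commit message.
const OpQueryRunInfo byte = 0x28

// apiKeyHashLen is the fixed width of the api_key_hash field.
const apiKeyHashLen = 16

// EncodeQueryRunInfo builds [opcode:1][api_key_hash:16][run_id_len:1][run_id:var].
func EncodeQueryRunInfo(apiKeyHash [apiKeyHashLen]byte, runID string) ([]byte, error) {
	if len(runID) == 0 || len(runID) > 255 {
		return nil, errors.New("run_id must be 1..255 bytes to fit a one-byte length")
	}
	frame := make([]byte, 0, 1+apiKeyHashLen+1+len(runID))
	frame = append(frame, OpQueryRunInfo)
	frame = append(frame, apiKeyHash[:]...)
	frame = append(frame, byte(len(runID)))
	frame = append(frame, runID...)
	return frame, nil
}

// DecodeQueryRunInfo parses a frame and checks that its declared run_id
// length matches the bytes actually present.
func DecodeQueryRunInfo(frame []byte) (hash [apiKeyHashLen]byte, runID string, err error) {
	const header = 1 + apiKeyHashLen + 1
	if len(frame) < header {
		return hash, "", errors.New("frame too short")
	}
	if frame[0] != OpQueryRunInfo {
		return hash, "", fmt.Errorf("unexpected opcode 0x%02x", frame[0])
	}
	copy(hash[:], frame[1:1+apiKeyHashLen])
	n := int(frame[header-1])
	if len(frame) != header+n {
		return hash, "", errors.New("run_id length mismatch")
	}
	return hash, string(frame[header:]), nil
}

func main() {
	var hash [apiKeyHashLen]byte
	copy(hash[:], "0123456789abcdef")
	frame, err := EncodeQueryRunInfo(hash, "abc123")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes, opcode 0x%02x\n", len(frame), frame[0]) // prints: 24 bytes, opcode 0x28
}
```

The one-byte length field caps run IDs at 255 bytes, which is why the encoder rejects longer values instead of truncating them.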
Jeremie Fraeys
68062831b0
refactor(cli): remove redundant doc comments from command files
Removed duplicate help text from doc comments:
- log.zig: Removed usage examples (in printUsage)
- annotate.zig: Removed usage examples (in printUsage)
- experiment.zig: Removed usage examples (in printUsage)

Rationale: printUsage() already contains detailed help text.
Doc comments should not duplicate this information.

All tests pass.
2026-03-05 11:06:28 -05:00
Jeremie Fraeys
747579eae4
refactor: misc improvements across codebase
Various improvements:
- Makefile: build optimizations and native lib integration
- prune.zig: cleanup logic refinements
- status.zig: improved status reporting
- experiment_core.zig: core functionality updates
- progress.zig: progress bar improvements
- task.go: domain model updates for task handling

All tests pass.
2026-03-05 10:58:22 -05:00
Jeremie Fraeys
3e557e4565
fix(cli): correct const qualifier in jupyter lifecycle
Fixed compilation error in jupyter/lifecycle.zig:
- Changed 'const client' to 'var client' in ConnectionCtx.init()
- Allows errdefer client.close() to work correctly
- close() requires mutable reference to ws.Client

All tests pass.
2026-03-05 10:57:57 -05:00
Jeremie Fraeys
cb018934e1
feat(cli): add shared dataset_hash utility and automatic hashing
Created utils/dataset_hash.zig:
- computeDatasetHash(allocator, path) -> [64]u8
- Returns fixed 64-char hex string (stack allocated)
- Provides verifyDatasetIntegrity() for hash comparison
- Enables testing against native C++ implementations

Updated dataset.zig:
- verifyDataset() now automatically computes hash during verification
- Uses utils/dataset_hash.zig for hash computation
- Hash displayed in JSON output for reference
- No separate 'dataset hash' command needed

Benefits:
- Single source of truth for dataset hashing
- Testable independently for correctness verification
- Automatic during dataset verify operation
2026-03-05 10:57:39 -05:00
Jeremie Fraeys
e2673be8b5
feat(cli): unify exec, queue, run into single 'run' command
Since the app is not yet released, removed the old commands entirely:
- Deleted exec.zig (533 lines) - modularized version
- Deleted queue.zig (1248 lines) - complete removal
- Unified all functionality into run.zig

New unified 'ml run' command features:
- Auto-detects local vs remote execution via mode.detect()
- Supports --local and --remote flags to force execution mode
- Includes all resource options: --cpu, --memory, --gpu
- Research context: --hypothesis, --context, --intent, --tags
- Validation modes: --dry-run, --validate, --explain
- Uses modular exec/remote.zig and exec/local.zig for execution

Dispatcher updates (main.zig):
- Removed 'e' (exec) handler
- Removed 'q' (queue) handler
- Updated help text to show unified command

Import cleanup (commands.zig):
- Removed queue.zig import

Total code reduction: ~1,700 lines
All tests pass.
2026-03-05 10:57:00 -05:00
Jeremie Fraeys
9b4bd1b103
refactor(cli): standardize dataset command format and remove redundant hash command
Standardized dataset.zig with proper doc comment format:
- Added /// doc comment with usage and subcommand descriptions
- Follows same format as other commands

Removed dataset_hash.zig:
- Hash computation is already automatic in 'dataset verify'
- Standalone 'ml dataset hash' command was redundant
- Users can use 'ml dataset verify <path>' to get hash

All tests pass.
2026-03-05 10:30:55 -05:00
Jeremie Fraeys
b99cd6b0e3
feat(cli): unify exec, queue, run into single 'run' command
Since the app is not yet released, removed the old commands entirely:
- Deleted exec.zig (533 lines)
- Deleted queue.zig (1248 lines)
- Unified functionality into run.zig

New unified 'ml run' command:
- Auto-detects local vs remote execution
- Supports --local and --remote flags to force mode
- Includes all features: priority, resources, research context
- Single command for all execution needs

Updated main.zig dispatcher:
- Removed 'e' (exec) handler
- Removed 'q' (queue) handler
- Updated help text

Total reduction: ~1,700 lines of code
All tests pass.
2026-03-05 10:22:37 -05:00
Jeremie Fraeys
0d05ec0317
refactor(cli): consolidate duplicate functions into common.zig
Move shared utility functions from queue.zig to common.zig:
- buildNarrativeJson() - was duplicated in queue.zig, exec/dryrun.zig, exec/remote.zig
- formatNextSteps() - was duplicated in queue.zig
- dryRun() - was duplicated in exec/dryrun.zig
- JobOptions struct - shared configuration options

Added common.zig import to queue.zig and updated all references.

Reduction: ~80 lines of duplicate code removed
All tests pass.
2026-03-05 10:12:44 -05:00
Jeremie Fraeys
ab7da26d77
refactor(cli): remove unused has_tracking variable from queue.zig
The has_tracking variable was set but never read. Removed:
- Variable declaration (line 140)
- 6 assignments across tracking flag handlers

Cleanup only, no functional changes.

All tests pass.
2026-03-05 10:03:05 -05:00
Jeremie Fraeys
6316e4d702
refactor(cli): modularize exec.zig (533 lines)
Break down exec.zig into focused modules:
- exec/mod.zig - Main entry point and command dispatch (211 lines)
- exec/remote.zig - Remote execution via WebSocket (87 lines)
- exec/local.zig - Local execution with fork/exec (137 lines)
- exec/dryrun.zig - Dry-run preview functionality (53 lines)

Original exec.zig now acts as backward-compatible wrapper.

Benefits:
- Each module <150 lines (highly maintainable)
- Clear separation: remote vs local vs dry-run logic
- Easier to test individual execution paths
- Original 533-line file split into 4 focused modules

All tests pass.
2026-03-05 09:59:00 -05:00