From 1aed78839baa45b75ccbc36a2f94adb1911c3882 Mon Sep 17 00:00:00 2001 From: Jeremie Fraeys Date: Mon, 5 Jan 2026 12:37:46 -0500 Subject: [PATCH] docs(dev): document validate workflow, CLI/TUI UX contract, and consolidate dev/testing docs --- docs/src/cicd.md | 2 +- docs/src/cli-reference.md | 116 +++++-- docs/src/cli-tui-ux-contract-v1.md | 337 +++++++++++++++++++ docs/src/configuration-reference.md | 135 ++++---- docs/src/dev-quick-start.md | 30 ++ docs/src/quick-start-testing.md | 185 ----------- docs/src/testing-guide.md | 50 --- docs/src/testing-protocol.md | 258 --------------- docs/src/testing.md | 492 ++++++++++++++++++++++++++-- docs/src/troubleshooting.md | 57 ++-- docs/src/validate.md | 101 ++++++ docs/src/zig-cli.md | 2 + 12 files changed, 1119 insertions(+), 646 deletions(-) create mode 100644 docs/src/cli-tui-ux-contract-v1.md create mode 100644 docs/src/dev-quick-start.md delete mode 100644 docs/src/quick-start-testing.md delete mode 100644 docs/src/testing-guide.md delete mode 100644 docs/src/testing-protocol.md create mode 100644 docs/src/validate.md diff --git a/docs/src/cicd.md b/docs/src/cicd.md index 44baad0..4ad86a9 100644 --- a/docs/src/cicd.md +++ b/docs/src/cicd.md @@ -95,7 +95,7 @@ make test go test ./internal/queue/... 
# Build CLI -cd cli && zig build dev +cd cli && zig build --release=fast # Run formatters and linters make lint diff --git a/docs/src/cli-reference.md b/docs/src/cli-reference.md index 12942e7..41c1a0c 100644 --- a/docs/src/cli-reference.md +++ b/docs/src/cli-reference.md @@ -34,6 +34,9 @@ High-performance command-line interface for experiment management, written in Zi | `cancel` | Cancel running job | `ml cancel job123` | | `prune` | Clean up old experiments | `ml prune --keep 10` | | `watch` | Auto-sync directory on changes | `ml watch ./project --queue` | +| `jupyter` | Manage Jupyter notebook services | `ml jupyter start --name my-nb` | +| `validate` | Validate provenance/integrity for a commit or task | `ml validate --verbose` | +| `info` | Show run info from `run_manifest.json` | `ml info ` | ### Command Details @@ -78,6 +81,9 @@ ml queue my-job --commit abc123 --priority 8 - Priority queuing system - API key authentication +**Notes:** +- Tasks support optional `snapshot_id` and `dataset_specs` fields server-side (for provenance and dataset resolution). + #### `watch` - Auto-Sync Monitoring ```bash # Watch directory for changes @@ -108,12 +114,78 @@ ml monitor ``` Launches TUI interface via SSH for real-time monitoring. +#### `status` - System Status + +`ml status --json` returns a JSON object including an optional `prewarm` field when worker prewarming is active: + +```json +{ + "prewarm": [ + { + "worker_id": "worker-1", + "task_id": "", + "started_at": "2025-01-01T00:00:00Z", + "updated_at": "2025-01-01T00:00:05Z", + "phase": "datasets", + "dataset_count": 2 + } + ] +} +``` + #### `cancel` - Job Cancellation ```bash ml cancel running-job-id ``` Cancels currently running jobs by ID. +#### `jupyter` - Jupyter Notebook Management + +Manage Jupyter notebook services via WebSocket protocol. 
+ +```bash +# Start a Jupyter service +ml jupyter start --name my-notebook --workspace /path/to/workspace + +# Start with password protection +ml jupyter start --name my-notebook --workspace /path/to/workspace --password mypass + +# List running services +ml jupyter list + +# Stop a service +ml jupyter stop service-id-12345 + +# Check service status +ml jupyter status +``` + +**Features:** +- WebSocket-based binary protocol for low latency +- Secure API key authentication (SHA256 hashed) +- Real-time service management +- Workspace isolation + +**Common Use Cases:** +```bash +# Development workflow +ml jupyter start --name dev-notebook --workspace ./notebooks +# ... do development work ... +ml jupyter stop dev-service-123 + +# Team collaboration +ml jupyter start --name team-analysis --workspace /shared/analysis --password teampass + +# Multiple services +ml jupyter list # View all running services +``` + +**Security:** +- API keys are hashed before transmission +- Password protection for notebooks +- Workspace path validation +- Service ID-based authorization + ### Configuration The Zig CLI reads configuration from `~/.ml/config.toml`: @@ -144,7 +216,7 @@ Main HTTPS API server for experiment management. go run ./cmd/api-server/main.go # With configuration -./bin/api-server --config configs/config-local.yaml +./bin/api-server --config configs/api/dev.yaml ``` **Features:** @@ -160,9 +232,6 @@ Terminal User Interface for monitoring experiments. ```bash # Launch TUI go run ./cmd/tui/main.go - -# With custom config -./tui --config configs/config-local.yaml ``` **Features:** @@ -187,10 +256,10 @@ Configuration validation and linting tool. 
```bash # Validate configuration -./configlint configs/config-local.yaml +./configlint configs/api/dev.yaml # Check schema compliance -./configlint --schema configs/schema/config_schema.yaml +./configlint --schema configs/schema/api_server_config.yaml ``` ## Management Script (`./tools/manage.sh`) Simple service management for your homelab. ./tools/manage.sh cleanup # Clean project artifacts ``` -## Setup Script (`./setup.sh`) - -One-command homelab setup. - -### Usage -```bash -# Full setup -./setup.sh - -# Setup includes: -# - SSL certificate generation -# - Configuration creation -# - Build all components -# - Start Redis -# - Setup Fail2Ban (if available) -``` - ## API Testing Test the API with curl: ```bash # Health check -curl -k -H 'X-API-Key: password' https://localhost:9101/health +curl -f http://localhost:8080/health # List experiments -curl -k -H 'X-API-Key: password' https://localhost:9101/experiments +curl -H 'X-API-Key: password' http://localhost:8080/experiments # Submit experiment -curl -k -X POST -H 'X-API-Key: password' \ +curl -X POST -H 'X-API-Key: password' \ -H 'Content-Type: application/json' \ -d '{"name":"test","config":{"type":"basic"}}' \ - https://localhost:9101/experiments + http://localhost:8080/experiments ``` ## Zig CLI Architecture @@ -269,7 +321,7 @@ The Zig CLI is designed for performance and reliability: ## Configuration -Main configuration file: `configs/config-local.yaml` +Main configuration file: `configs/api/dev.yaml` ### Key Settings ```yaml @@ -277,14 +329,14 @@ auth: enabled: true api_keys: dev_user: - hash: "2baf1f40105d9501fe319a8ec463fdf4325a2a5df445adf3f572f626253678c9" + hash: "CHANGE_ME_SHA256_DEV_USER_KEY" admin: true roles: - admin permissions: '*': true researcher_user: - hash: "ef92b778ba7a6c8f2150019a5678047b6a9a2b95cef8189518f9b35c54d2e3ae" + hash: "CHANGE_ME_SHA256_RESEARCHER_USER_KEY" admin: false roles: - researcher @@ -385,8 +437,8 @@ telnet worker.local 9100 **Authentication
failed:** ```bash -# Check API key in config-local.yaml -grep -A 5 "api_keys:" configs/config-local.yaml +# Check API key in config +grep -A 5 "api_keys:" configs/api/dev.yaml ``` **Redis connection failed:** diff --git a/docs/src/cli-tui-ux-contract-v1.md b/docs/src/cli-tui-ux-contract-v1.md new file mode 100644 index 0000000..014413d --- /dev/null +++ b/docs/src/cli-tui-ux-contract-v1.md @@ -0,0 +1,337 @@ +# FetchML CLI/TUI UX Contract v1 + +This document defines the user experience contract for FetchML v1, focusing on clean, predictable CLI/TUI interactions without mode flags. + +## Core Principles + +1. **Thin CLI**: Local CLI does minimal validation; authoritative checks happen server-side +2. **No Mode Flags**: Commands do what they say; no `--mode` or similar flags +3. **Predictable Defaults**: Sensible defaults that work for most use cases +4. **Graceful Degradation**: JSON output for automation, human-friendly output for interactive use +5. **Explicit Operations**: `--dry-run`, `--validate`, `--explain` are explicit, not implied + +## Commands v1 + +### Core Workflow Commands + +#### `ml queue <job-name> [options]` +Submit a job for execution.
+ +**Basic Usage:** +```bash +ml queue my-experiment +``` + +**Options:** +- `--commit <commit-id>`: Specify commit ID (default: current git HEAD) +- `--priority <1-10>`: Job priority (default: 5) +- `--cpu <cores>`: CPU cores requested (default: 2) +- `--memory <gb>`: Memory in GB (default: 8) +- `--gpu <count>`: GPU count (default: 0) +- `--gpu-memory <gb>`: GPU memory budget (default: auto) + +**Dry Run:** +```bash +ml queue my-experiment --dry-run +# Output: JSON with what would be submitted, validation results +``` + +**Validate Only:** +```bash +ml queue my-experiment --validate +# Output: Validation results without submitting +``` + +**Explain:** +```bash +ml queue my-experiment --explain +# Output: Human-readable explanation of what will happen +``` + +**JSON Output:** +When using `--json`, the response may include a `prewarm` field describing best-effort worker prewarming activity (e.g. dataset prefetch for the next queued task). + +```bash +ml queue my-experiment --json +# Output: Structured JSON response +``` + +#### `ml status [job-name]` +Show job status. + +**Basic Usage:** +```bash +ml status # All jobs summary +ml status my-experiment # Specific job details +``` + +**Options:** +- `--json`: JSON output +- `--watch`: Watch mode (refresh every 2s) +- `--limit <n>`: Limit number of jobs shown (default: 20) + +#### `ml cancel <job-name>` +Cancel a running or queued job. + +**Basic Usage:** +```bash +ml cancel my-experiment +``` + +**Options:** +- `--force`: Force cancel even if running +- `--json`: JSON output + +### Experiment Management + +#### `ml experiment init <name>` +Initialize a new experiment directory. + +**Basic Usage:** +```bash +ml experiment init my-project +``` + +**Options:** +- `--template <name>`: Use experiment template +- `--dry-run`: Show what would be created + +#### `ml experiment list` +List available experiments. + +**Options:** +- `--json`: JSON output +- `--limit <n>`: Limit results + +#### `ml experiment show <name>` +Show experiment details.
+ +**Options:** +- `--json`: JSON output +- `--manifest`: Show content integrity manifest + +### Dataset Management + +#### `ml dataset list` +List available datasets. + +**Options:** +- `--json`: JSON output +- `--synced-only`: Show only synced datasets + +#### `ml dataset sync <dataset-name>` +Sync a dataset from NAS to the ML server. + +**Options:** +- `--dry-run`: Show what would be synced +- `--validate`: Validate dataset integrity after sync + +### Monitoring & TUI + +#### `ml monitor` +Launch TUI for real-time monitoring (runs over SSH). + +**Basic Usage:** +```bash +ml monitor +``` + +**TUI Controls:** +- `Ctrl+C`: Exit TUI +- `q`: Quit +- `r`: Refresh +- `j/k`: Navigate jobs +- `Enter`: Job details +- `c`: Cancel selected job + +#### `ml watch <job-name>` +Watch a specific job's output. + +**Options:** +- `--follow`: Follow log output (default) +- `--tail <n>`: Show last n lines + +## Global Options + +These options work with any command: + +- `--json`: Output structured JSON instead of human-readable format +- `--config <path>`: Use custom config file (default: ~/.ml/config.toml) +- `--verbose`: Verbose output +- `--quiet`: Minimal output +- `--help`: Show help for command + +## Defaults Configuration + +### Default Job Resources +```toml +[defaults] +cpu = 2 # CPU cores +memory = 8 # GB +gpu = 0 # GPU count +gpu_memory = "auto" # Auto-detect or specify GB +priority = 5 # Job priority (1-10) +``` + +### Default Behavior +- **Commit ID**: Current git HEAD (must be clean working directory) +- **Working Directory**: Current directory for experiment files +- **Output**: Human-readable format unless `--json` specified +- **Validation**: Server-side authoritative validation + +## Error Handling + +### Exit Codes +- `0`: Success +- `1`: General error +- `2`: Invalid arguments +- `3`: Validation failed +- `4`: Network/connection error +- `5`: Server error + +### Error Output Format +**Human-readable:** +``` +Error: Experiment validation failed + - Missing dependency manifest (environment.yml,
poetry.lock, pyproject.toml, or requirements.txt) + - Train script not found: train.py +``` + +**JSON:** +```json +{ + "error": "validation_failed", + "message": "Experiment validation failed", + "details": [ + {"field": "dependency_manifest", "error": "missing", "supported": ["environment.yml", "poetry.lock", "pyproject.toml", "requirements.txt"]}, + {"field": "train_script", "error": "not_found", "expected": "train.py"} + ] +} +``` + +## Ctrl+C Semantics + +### Command Cancellation +- **Ctrl+C during `ml queue --dry-run`**: Immediate exit, no side effects +- **Ctrl+C during `ml queue`**: Attempt to cancel submission, show status +- **Ctrl+C during `ml status --watch`**: Exit watch mode +- **Ctrl+C during `ml monitor`**: Gracefully exit TUI +- **Ctrl+C during `ml watch`**: Stop following logs, show final status + +### Graceful Shutdown +1. Signal interrupt to server (if applicable) +2. Clean up local resources +3. Display current status +4. Exit with appropriate code + +## JSON Output Schema + +### Job Submission Response +```json +{ + "job_id": "uuid-string", + "job_name": "my-experiment", + "status": "queued", + "commit_id": "abc123...", + "submitted_at": "2025-01-01T12:00:00Z", + "estimated_start": "2025-01-01T12:05:00Z", + "resources": { + "cpu": 2, + "memory_gb": 8, + "gpu": 1, + "gpu_memory_gb": 16 + } +} +``` + +### Status Response +```json +{ + "jobs": [ + { + "job_id": "uuid-string", + "job_name": "my-experiment", + "status": "running", + "progress": 0.75, + "started_at": "2025-01-01T12:05:00Z", + "estimated_completion": "2025-01-01T12:30:00Z", + "node": "worker-01" + } + ], + "total": 1, + "showing": 1 +} +``` + +## Examples + +### Typical Workflow +```bash +# 1. Initialize experiment +ml experiment init my-project +cd my-project + +# 2. Validate experiment locally +ml queue . --validate --dry-run + +# 3. Submit job +ml queue . --priority 8 --gpu 1 + +# 4. Monitor progress +ml status . +ml watch . + +# 5. Check results +ml status . 
--json +``` + +### Automation Script +```bash +#!/bin/bash +# Submit job and wait for completion +JOB_ID=$(ml queue my-experiment --json | jq -r '.job_id') + +echo "Submitted job: $JOB_ID" + +# Wait for completion +while true; do + STATUS=$(ml status $JOB_ID --json | jq -r '.jobs[0].status') + echo "Status: $STATUS" + + if [[ "$STATUS" == "completed" || "$STATUS" == "failed" ]]; then + break + fi + + sleep 10 +done + +ml status $JOB_ID +``` + +## Implementation Notes + +### Server-side Validation +- CLI performs minimal local checks (git status, file existence) +- All authoritative validation happens on worker +- Validation failures are propagated back to CLI with clear error messages + +### Trust Contract Integration +- Every job submission includes commit ID and content integrity manifest +- Worker validates both before execution +- Any mismatch causes hard-fail with detailed error reporting + +### Resource Management +- Resource requests are validated against available capacity +- Jobs are queued based on priority and resource availability +- Resource usage is tracked and reported in status + +## Future Extensions + +The v1 contract is intentionally minimal but designed for extension: + +- **v1.1**: Add job dependencies and workflows +- **v1.2**: Add experiment templates and scaffolding +- **v1.3**: Add distributed execution across multiple workers +- **v2.0**: Add advanced scheduling and resource optimization + +All extensions will maintain backward compatibility with the v1 contract. 
diff --git a/docs/src/configuration-reference.md b/docs/src/configuration-reference.md index 111b4e3..069d2b5 100644 --- a/docs/src/configuration-reference.md +++ b/docs/src/configuration-reference.md @@ -7,14 +7,14 @@ This document provides a comprehensive reference for all configuration options i ## Environment Configurations ### Local Development -**File:** `configs/environments/config-local.yaml` +**File:** `configs/api/dev.yaml` ```yaml auth: enabled: true - apikeys: + api_keys: dev_user: - hash: "2baf1f40105d9501fe319a8ec463fdf4325a2a5df445adf3f572f626253678c9" + hash: "CHANGE_ME_SHA256_DEV_USER_KEY" admin: true roles: ["admin"] permissions: @@ -35,14 +35,14 @@ security: ``` ### Multi-User Setup -**File:** `configs/environments/config-multi-user.yaml` +**File:** `configs/api/multi-user.yaml` ```yaml auth: enabled: true - apikeys: + api_keys: admin_user: - hash: "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8" + hash: "CHANGE_ME_SHA256_ADMIN_USER_KEY" admin: true roles: ["user", "admin"] permissions: @@ -51,7 +51,7 @@ auth: delete: true researcher1: - hash: "ef92b778ba7a6c8f2150019a5678047b6a9a2b95cef8189518f9b35c54d2e3ae" + hash: "CHANGE_ME_SHA256_RESEARCHER1_KEY" admin: false roles: ["user", "researcher"] permissions: @@ -61,7 +61,7 @@ auth: jobs:delete: false analyst1: - hash: "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3" + hash: "CHANGE_ME_SHA256_ANALYST1_KEY" admin: false roles: ["user", "analyst"] permissions: @@ -72,12 +72,12 @@ auth: ``` ### Production -**File:** `configs/environments/config-prod.yaml` +**File:** `configs/api/prod.yaml` ```yaml auth: enabled: true - apikeys: + api_keys: # Production users configured here server: @@ -98,13 +98,14 @@ security: - "10.0.0.0/8" redis: - url: "redis://redis:6379" - max_connections: 10 + addr: "redis:6379" + password: "" + db: 0 logging: level: "info" file: "/app/logs/app.log" - audit_file: "/app/logs/audit.log" + audit_log: "/app/logs/audit.log" ``` ## Worker 
Configurations @@ -113,59 +114,80 @@ logging: **File:** `configs/workers/worker-prod.toml` ```toml -[worker] -name = "production-worker" -id = "worker-prod-1" +worker_id = "worker-prod-01" +base_path = "/data/ml-experiments" +max_workers = 4 -[server] -host = "api-server" -port = 9101 -api_key = "your-api-key-here" +redis_addr = "localhost:6379" +redis_password = "CHANGE_ME_REDIS_PASSWORD" +redis_db = 0 -[execution] -max_concurrent_jobs = 2 -timeout_minutes = 60 -retry_attempts = 3 +host = "localhost" +user = "ml-user" +port = 22 +ssh_key = "~/.ssh/id_rsa" + +podman_image = "ml-training:latest" +gpu_vendor = "none" +gpu_visible_devices = [] +gpu_devices = [] +container_workspace = "/workspace" +container_results = "/results" +train_script = "train.py" [resources] -memory_limit = "4Gi" -cpu_limit = "2" -gpu_enabled = false +max_workers = 4 +desired_rps_per_worker = 2 +podman_cpus = "4" +podman_memory = "16g" -[storage] -work_dir = "/tmp/fetchml-jobs" -cleanup_interval_minutes = 30 +[metrics] +enabled = true +listen_addr = ":9100" +``` + +```toml +# Production Worker (NVIDIA, UUID-based GPU selection) +worker_id = "worker-prod-01" +base_path = "/data/ml-experiments" + +podman_image = "ml-training:latest" +gpu_vendor = "nvidia" +gpu_visible_device_ids = ["GPU-REPLACE_WITH_REAL_UUID"] +gpu_devices = ["/dev/dri"] +container_workspace = "/workspace" +container_results = "/results" +train_script = "train.py" ``` ### Docker Worker -**File:** `configs/workers/worker-docker.yaml` +**File:** `configs/workers/docker.yaml` ```yaml -worker: - name: "docker-worker" - id: "worker-docker-1" +worker_id: "docker-worker" +base_path: "/tmp/fetchml-jobs" +train_script: "train.py" -server: - host: "api-server" - port: 9101 - api_key: "your-api-key-here" +redis_addr: "redis:6379" +redis_password: "" +redis_db: 0 -execution: - max_concurrent_jobs: 1 - timeout_minutes: 30 - retry_attempts: 3 +local_mode: true -resources: - memory_limit: "2Gi" - cpu_limit: "1" - gpu_enabled: false 
+max_workers: 1 +poll_interval_seconds: 5 -docker: +podman_image: "python:3.9-slim" +container_workspace: "/workspace" +container_results: "/results" +gpu_devices: [] +gpu_vendor: "none" +gpu_visible_devices: [] + +metrics: enabled: true - image: "fetchml/worker:latest" - volume_mounts: - - "/tmp:/tmp" - - "/var/run/docker.sock:/var/run/docker.sock" + listen_addr: ":9100" +metrics_flush_interval: "500ms" ``` ## CLI Configuration @@ -181,7 +203,7 @@ worker_base = "/app" worker_port = 22 [auth] -api_key = "your-hashed-api-key" +api_key = "" [cli] default_timeout = 30 @@ -199,7 +221,7 @@ worker_base = "/app" worker_port = 22 [auth] -api_key = "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8" +api_key = "" ``` **Researcher Config:** `~/.ml/config-researcher.toml` @@ -211,7 +233,7 @@ worker_base = "/app" worker_port = 22 [auth] -api_key = "ef92b778ba7a6c8f2150019a5678047b6a9a2b95cef8189518f9b35c54d2e3ae" +api_key = "" ``` **Analyst Config:** `~/.ml/config-analyst.toml` @@ -223,7 +245,7 @@ worker_base = "/app" worker_port = 22 [auth] -api_key = "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3" +api_key = "" ``` ## Configuration Options @@ -298,7 +320,6 @@ api_key = "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3" |----------|---------|-------------| | `FETCHML_CONFIG` | - | Path to config file | | `FETCHML_LOG_LEVEL` | "info" | Override log level | -| `FETCHML_REDIS_URL` | - | Override Redis URL | | `CLI_CONFIG` | - | Path to CLI config file | ## Troubleshooting @@ -324,7 +345,7 @@ api_key = "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3" ```bash # Validate server configuration -go run cmd/api-server/main.go --config configs/environments/config-local.yaml --validate +go run cmd/api-server/main.go --config configs/api/dev.yaml --validate # Test CLI configuration ./cli/zig-out/bin/ml status --debug diff --git a/docs/src/dev-quick-start.md b/docs/src/dev-quick-start.md new file mode 100644 index 
0000000..4898e45 --- /dev/null +++ b/docs/src/dev-quick-start.md @@ -0,0 +1,30 @@ +# Development Quick Start + +This page is the developer-focused entrypoint for working on FetchML. + +## Prerequisites + +- Go +- Zig +- Docker / Docker Compose + +## Quick setup + +```bash +# Clone +git clone https://github.com/jfraeys/fetch_ml.git +cd fetch_ml + +# Start dev environment +make dev-up + +# Run tests +make test +``` + +## Next + +- See `testing.md` for test workflows. +- See `architecture.md` for system structure. +- See `zig-cli.md` for CLI build details. +- See the repository root `DEVELOPMENT.md` for the full development guide. diff --git a/docs/src/quick-start-testing.md b/docs/src/quick-start-testing.md deleted file mode 100644 index 5349d83..0000000 --- a/docs/src/quick-start-testing.md +++ /dev/null @@ -1,185 +0,0 @@ -# Quick Start Testing Guide - -## Overview - -This guide provides the fastest way to test the FetchML multi-user authentication system. - -## Prerequisites - -- Docker and Docker Compose installed -- CLI built: `cd cli && zig build` -- Test configs available in `~/.ml/` - -## 5-Minute Test - -### 1. Clean Environment -```bash -make self-cleanup -``` - -### 2. Start Services -```bash -docker-compose -f deployments/docker-compose.prod.yml up -d -``` - -### 3. Test Authentication -```bash -make test-auth -``` - -### 4. Check Results -You should see: -- Admin user: Full access, shows all jobs -- Researcher user: Own jobs only -- Analyst user: Read-only access - -### 5. 
Clean Up -```bash -make self-cleanup -``` - -## Detailed Testing - -### Multi-User Authentication Test - -```bash -# Test each user role -cp ~/.ml/config-admin.toml ~/.ml/config.toml && ./cli/zig-out/bin/ml status -cp ~/.ml/config-researcher.toml ~/.ml/config.toml && ./cli/zig-out/bin/ml status -cp ~/.ml/config-analyst.toml ~/.ml/config.toml && ./cli/zig-out/bin/ml status -``` - -### Job Queueing Test - -```bash -# Admin can queue jobs -cp ~/.ml/config-admin.toml ~/.ml/config.toml -echo "admin job" | ./cli/zig-out/bin/ml queue admin-test - -# Researcher can queue jobs -cp ~/.ml/config-researcher.toml ~/.ml/config.toml -echo "research job" | ./cli/zig-out/bin/ml queue research-test - -# Analyst cannot queue jobs (should fail) -cp ~/.ml/config-analyst.toml ~/.ml/config.toml -echo "analysis job" | ./cli/zig-out/bin/ml queue analysis-test -``` - -### Status Verification - -```bash -# Check what each user can see -make test-auth -``` - -## Expected Results - -### Admin User Output -``` -Status retrieved for user: admin_user (admin: true) -Tasks: X total, X queued, X running, X failed, X completed -``` - -### Researcher User Output -``` -Status retrieved for user: researcher1 (admin: false) -Tasks: X total, X queued, X running, X failed, X completed -``` - -### Analyst User Output -``` -Status retrieved for user: analyst1 (admin: false) -Tasks: X total, X queued, X running, X failed, X completed -``` - -## Troubleshooting - -### Server Not Running -```bash -# Check containers -docker ps --filter "name=ml-" - -# Start services -docker-compose -f deployments/docker-compose.prod.yml up -d - -# Check logs -docker logs ml-prod-api -``` - -### Authentication Failures -```bash -# Check config files -ls ~/.ml/config-*.toml - -# Verify API keys -cat ~/.ml/config-admin.toml -``` - -### Connection Issues -```bash -# Test API directly -curl -I http://localhost:9103/health - -# Check ports -netstat -an | grep 9103 -``` - -## Advanced Testing - -### Full Test Suite -```bash -make 
test-full -``` - -### Performance Testing -```bash -./scripts/benchmarks/run-benchmarks-local.sh -``` - -### Cleanup Status -```bash -make test-status -``` - -## Configuration Files - -### Test Configs Location -- `~/.ml/config-admin.toml` - Admin user -- `~/.ml/config-researcher.toml` - Researcher user -- `~/.ml/config-analyst.toml` - Analyst user - -### Server Configs -- `configs/environments/config-multi-user.yaml` - Multi-user setup -- `configs/environments/config-local.yaml` - Local development - -## Next Steps - -1. **Review Documentation** - - [Testing Protocol](testing-protocol.md) - - [Configuration Reference](configuration-reference.md) - - [Testing Guide](testing-guide.md) - -2. **Explore Features** - - Job queueing and management - - WebSocket communication - - Role-based permissions - -3. **Production Setup** - - TLS configuration - - Security hardening - - Monitoring setup - -## Help - -### Common Commands -```bash -make help # Show all commands -make test-auth # Quick auth test -make self-cleanup # Clean environment -make test-status # Check system status -``` - -### Get Help -- Check logs: `docker logs ml-prod-api` -- Review documentation in `docs/src/` -- Use `--debug` flag with CLI commands \ No newline at end of file diff --git a/docs/src/testing-guide.md b/docs/src/testing-guide.md deleted file mode 100644 index 3be9198..0000000 --- a/docs/src/testing-guide.md +++ /dev/null @@ -1,50 +0,0 @@ -# Testing Guide - -## Quick Start - -The FetchML project includes comprehensive testing tools. 
- -## Testing Commands - -### Quick Tests -```bash -make test-auth # Test multi-user authentication -make test-status # Check cleanup status -make self-cleanup # Clean environment -``` - -### Full Test Suite -```bash -make test-full # Run complete test suite -``` - -## Expected Results - -### Admin User -Status retrieved for user: admin_user (admin: true) -Tasks: X total, X queued, X running, X failed, X completed - -### Researcher User -Status retrieved for user: researcher1 (admin: false) -Tasks: X total, X queued, X running, X failed, X completed - -### Analyst User -Status retrieved for user: analyst1 (admin: false) -Tasks: X total, X queued, X running, X failed, X completed - -## Troubleshooting - -### Authentication Failures -- Check API key in ~/.ml/config.toml -- Verify server is running with auth enabled - -### Container Issues -- Check Docker daemon is running -- Verify ports 9100, 9103 are available -- Review logs: docker logs ml-prod-api - -## Cleanup -```bash -make self-cleanup # Interactive cleanup -make auto-cleanup # Setup daily auto-cleanup -``` \ No newline at end of file diff --git a/docs/src/testing-protocol.md b/docs/src/testing-protocol.md deleted file mode 100644 index ec0075c..0000000 --- a/docs/src/testing-protocol.md +++ /dev/null @@ -1,258 +0,0 @@ -# Testing Protocol - -This document outlines the comprehensive testing protocol for the FetchML project. - -## Overview - -The testing protocol is designed to ensure: -- Multi-user authentication works correctly -- API functionality is reliable -- CLI commands function properly -- Docker containers run as expected -- Performance meets requirements - -## Test Categories - -### 1. 
Authentication Tests - -#### 1.1 Multi-User Authentication -```bash -# Test admin user -cp ~/.ml/config-admin.toml ~/.ml/config.toml -./cli/zig-out/bin/ml status -# Expected: Shows admin status and all jobs - -# Test researcher user -cp ~/.ml/config-researcher.toml ~/.ml/config.toml -./cli/zig-out/bin/ml status -# Expected: Shows researcher status and own jobs only - -# Test analyst user -cp ~/.ml/config-analyst.toml ~/.ml/config.toml -./cli/zig-out/bin/ml status -# Expected: Shows analyst status, read-only access -``` - -#### 1.2 API Key Validation -```bash -# Test invalid API key -echo "invalid_key" > ~/.ml/config.toml -./cli/zig-out/bin/ml status -# Expected: Authentication failed error - -# Test missing API key -rm ~/.ml/config.toml -./cli/zig-out/bin/ml status -# Expected: API key not configured error -``` - -### 2. CLI Functionality Tests - -#### 2.1 Job Queueing -```bash -# Test job queueing with different users -cp ~/.ml/config-admin.toml ~/.ml/config.toml -echo "test job" | ./cli/zig-out/bin/ml queue test-job - -cp ~/.ml/config-researcher.toml ~/.ml/config.toml -echo "research job" | ./cli/zig-out/bin/ml queue research-job - -cp ~/.ml/config-analyst.toml ~/.ml/config.toml -echo "analysis job" | ./cli/zig-out/bin/ml queue analysis-job -# Expected: Admin and researcher can queue, analyst cannot -``` - -#### 2.2 Status Checking -```bash -# Check status after job queueing -./cli/zig-out/bin/ml status -# Expected: Shows jobs based on user permissions -``` - -### 3. 
Docker Container Tests - -#### 3.1 Container Startup -```bash -# Start production environment -docker-compose -f deployments/docker-compose.prod.yml up -d - -# Check container status -docker ps --filter "name=ml-" -# Expected: All containers running and healthy -``` - -#### 3.2 Port Accessibility -```bash -# Test API server port -curl -I http://localhost:9103/health -# Expected: 200 OK response - -# Test metrics port -curl -I http://localhost:9100/metrics -# Expected: 200 OK response -``` - -#### 3.3 Container Cleanup -```bash -# Test cleanup script -./scripts/maintenance/cleanup.sh --dry-run -./scripts/maintenance/cleanup.sh --force -# Expected: Containers stopped and removed -``` - -### 4. Performance Tests - -#### 4.1 API Performance -```bash -# Run API benchmarks -./scripts/benchmarks/run-benchmarks-local.sh -# Expected: Response times under 100ms for basic operations -``` - -#### 4.2 Load Testing -```bash -# Run load tests -go test -v ./tests/load/... -# Expected: System handles concurrent requests without degradation -``` - -### 5. Integration Tests - -#### 5.1 End-to-End Workflow -```bash -# Complete workflow test -cp ~/.ml/config-admin.toml ~/.ml/config.toml - -# Queue job -echo "integration test" | ./cli/zig-out/bin/ml queue integration-test - -# Check status -./cli/zig-out/bin/ml status - -# Verify job appears in queue -# Expected: Job queued and visible in status -``` - -#### 5.2 WebSocket Communication -```bash -# Test WebSocket handshake -./cli/zig-out/bin/ml status -# Expected: Successful WebSocket upgrade and response -``` - -## Test Execution Order - -### Phase 1: Environment Setup -1. Clean up any existing containers -2. Start fresh Docker environment -3. Verify all services are running - -### Phase 2: Authentication Testing -1. Test all user roles (admin, researcher, analyst) -2. Test invalid authentication scenarios -3. Verify role-based permissions - -### Phase 3: Functional Testing -1. Test CLI commands (queue, status) -2. 
Test API endpoints -3. Test WebSocket communication - -### Phase 4: Integration Testing -1. Test complete workflows -2. Test error scenarios -3. Test cleanup procedures - -### Phase 5: Performance Testing -1. Run benchmarks -2. Perform load testing -3. Validate performance metrics - -## Automated Testing - -### Continuous Integration Tests -```bash -# Run all tests -make test - -# Run specific test categories -make test-unit -make test-integration -make test-e2e -``` - -### Pre-deployment Checklist -```bash -# Complete test suite -./scripts/testing/run-full-test-suite.sh - -# Performance validation -./scripts/benchmarks/run-benchmarks-local.sh - -# Security validation -./scripts/security/security-scan.sh -``` - -## Test Data Management - -### Test Users -- **admin_user**: Full access, can see all jobs -- **researcher1**: Can create and view own jobs -- **analyst1**: Read-only access, cannot create jobs - -### Test Jobs -- **test-job**: Basic job for testing -- **research-job**: Research-specific job -- **analysis-job**: Analysis-specific job - -## Troubleshooting - -### Common Issues - -#### Authentication Failures -- Check API key configuration -- Verify server is running with auth enabled -- Check YAML config syntax - -#### Container Issues -- Verify Docker daemon is running -- Check port conflicts -- Review container logs - -#### Performance Issues -- Monitor resource usage -- Check for memory leaks -- Verify database connections - -### Debug Commands -```bash -# Check container logs -docker logs ml-prod-api - -# Check system resources -docker stats - -# Verify network connectivity -docker network ls -``` - -## Test Results Documentation - -All test results should be documented in: -- `test-results/` directory -- Performance benchmarks -- Integration test reports -- Security scan results - -## Maintenance - -### Regular Tasks -- Update test data periodically -- Review and update test cases -- Maintain test infrastructure -- Monitor test performance - -### Test 
Environment -- Keep test environment isolated -- Use consistent test data -- Regular cleanup of test artifacts -- Monitor test resource usage diff --git a/docs/src/testing.md b/docs/src/testing.md index 76a8bb5..002c5e9 100644 --- a/docs/src/testing.md +++ b/docs/src/testing.md @@ -1,10 +1,62 @@ # Testing Guide -Comprehensive testing documentation for FetchML platform. +Comprehensive testing documentation for FetchML platform with integrated monitoring. ## Quick Start Testing -For a fast 5-minute testing experience, see the **[Quick Start Testing Guide](quick-start-testing.md)**. +### 5-Minute Fast Test + +```bash +# Clean environment +make self-cleanup + +# Start development stack with monitoring +make dev-up + +# Quick authentication test +make test-auth + +# Clean up +make dev-down +``` + +**Expected Results**: +- Admin user: Full access, shows all jobs +- Researcher user: Own jobs only +- Analyst user: Read-only access + +## Test Environment Setup + +### Development Environment with Monitoring + +```bash +# Start development stack with monitoring +make dev-up + +# Verify all services are running +make dev-status + +# Run tests against running services +make test + +# Check monitoring during tests +# Grafana: http://localhost:3000 (admin/admin123) +``` + +### Test Environment Verification + +```bash +# Verify API server +curl -f http://localhost:8080/health + +# Verify monitoring services +curl -f http://localhost:3000/api/health +curl -f http://localhost:9090/api/v1/query?query=up +curl -f http://localhost:3100/ready + +# Verify Redis +docker exec ml-experiments-redis redis-cli ping +``` ## Test Types @@ -12,6 +64,9 @@ For a fast 5-minute testing experience, see the **[Quick Start Testing Guide](qu ```bash make test-unit # Go unit tests only cd cli && zig build test # Zig CLI tests + +# Unit tests live under tests/unit/ (including tests that cover internal/ packages) +go test ./tests/unit/... 
``` ### Integration Tests @@ -22,6 +77,9 @@ make test-integration # API and database integration ### End-to-End Tests ```bash make test-e2e # Full workflow testing + +# Podman E2E is opt-in because it builds/runs containers +FETCH_ML_E2E_PODMAN=1 go test ./tests/e2e/... # Enables TestPodmanIntegration ``` ### All Tests @@ -30,66 +88,308 @@ make test # Run complete test suite make test-coverage # With coverage report ``` -## Docker Testing +## Deployment-Specific Testing + +### Development Environment Testing -### Development Environment ```bash -docker-compose up -d +# Start dev stack +make dev-up + +# Run tests with monitoring make test -docker-compose down + +# View test results in Grafana +# Load Test Performance dashboard +# System Health dashboard ``` ### Production Environment Testing + ```bash -docker-compose -f docker-compose.prod.yml up -d +cd deployments +make prod-up + +# Test production deployment make test-auth # Multi-user auth test make self-cleanup # Clean up after testing + +# Verify production monitoring +curl -f https://your-domain.com/health +``` + +### Homelab Secure Testing + +```bash +cd deployments +make homelab-up + +# Test secure deployment +make test-auth +make test-ssl # SSL/TLS testing +``` + +## Authentication Testing Protocol + +### Multi-User Authentication + +```bash +# Test admin user +cp ~/.ml/config-admin.toml ~/.ml/config.toml +./cli/zig-out/bin/ml status +# Expected: Shows admin status and all jobs + +# Test researcher user +cp ~/.ml/config-researcher.toml ~/.ml/config.toml +./cli/zig-out/bin/ml status +# Expected: Shows researcher status and own jobs only + +# Test analyst user +cp ~/.ml/config-analyst.toml ~/.ml/config.toml +./cli/zig-out/bin/ml status +# Expected: Shows analyst status, read-only access +``` + +### API Key Validation + +```bash +# Test invalid API key +echo "invalid_key" > ~/.ml/config.toml +./cli/zig-out/bin/ml status +# Expected: Authentication failed error + +# Test missing API key +rm ~/.ml/config.toml 
+./cli/zig-out/bin/ml status +# Expected: API key not configured error +``` + +### Job Queueing by Role + +```bash +# Admin can queue jobs +cp ~/.ml/config-admin.toml ~/.ml/config.toml +echo "admin job" | ./cli/zig-out/bin/ml queue admin-test + +# Researcher can queue jobs +cp ~/.ml/config-researcher.toml ~/.ml/config.toml +echo "research job" | ./cli/zig-out/bin/ml queue research-test + +# Analyst cannot queue jobs (should fail) +cp ~/.ml/config-analyst.toml ~/.ml/config.toml +echo "analysis job" | ./cli/zig-out/bin/ml queue analysis-test +# Expected: Permission denied error ``` ## Performance Testing +### Load Testing with Monitoring + +```bash +# Start monitoring +make dev-up + +# Run load tests +make load-test + +# Monitor performance in real-time +# Grafana: http://localhost:3000 +# Check: Request rates, response times, error rates +``` + ### Benchmark Suite -```bash -./scripts/benchmarks/run-benchmarks-local.sh -``` - -### Load Testing -```bash -make test-load # API load testing -``` - -## Authentication Testing - -Multi-user authentication testing is fully covered in the **[Quick Start Testing Guide](quick-start-testing.md)**. 
```bash -make test-auth # Quick auth role testing +# Run benchmarks with performance tracking +./scripts/track_performance.sh + +# Or run directly +make benchmark-local + +# View results in Grafana dashboards ``` +### Performance Monitoring During Tests + +**Key Metrics to Watch**: +- API response times (95th percentile) +- Error rates (should be < 1%) +- Memory usage trends +- CPU utilization +- Request throughput + ## CLI Testing ### Build and Test CLI + ```bash -cd cli && zig build dev -./cli/zig-out/dev/ml --help +cd cli && zig build --release=fast +./zig-out/bin/ml --help zig build test ``` ### CLI Integration Tests + ```bash -make test-cli # CLI-specific integration tests + ``` +## Docker Container Testing + +### Container Startup + +```bash +# Start environment +make dev-up + +# Check container status +docker ps --filter "name=ml-" +# Expected: All containers running and healthy +``` + +### Port Accessibility + +```bash +# Test API server port +curl -I http://localhost:8080/health +# Expected: 200 OK response + +# Test metrics port +curl -I http://localhost:9100/metrics +# Expected: 200 OK response + +# Test monitoring ports +curl -I http://localhost:3000/api/health +curl -I http://localhost:9090/api/v1/query?query=up +``` + +### Container Cleanup + +```bash +# Test cleanup script +make self-cleanup +make dev-down +# Expected: Containers stopped and removed +``` + +## Monitoring During Testing + +### Real-Time Monitoring + +```bash +# Access Grafana during tests +http://localhost:3000 (admin/admin123) + +# Key Dashboards: +# - Load Test Performance: Request metrics, response times +# - System Health: Service status, resource usage +# - Log Analysis: Error logs, service logs +``` + +### Test Metrics Collection + +**Automatically Collected**: +- HTTP request metrics +- Response time histograms +- Error counters +- Resource utilization +- Log aggregation + +**Manual Test Markers**: +```bash +# Mark test start in logs +echo "TEST_START: $(date)" | tee -a 
/logs/test.log + +# Mark test completion +echo "TEST_END: $(date)" | tee -a /logs/test.log +``` + +## Test Execution Protocol + +### Phase 1: Environment Setup +1. Clean up any existing containers +2. Start fresh Docker environment with monitoring +3. Verify all services are running + +### Phase 2: Authentication Testing +1. Test all user roles (admin, researcher, analyst) +2. Test invalid authentication scenarios +3. Verify role-based permissions + +### Phase 3: Functional Testing +1. Test CLI commands (queue, status) +2. Test API endpoints +3. Test WebSocket communication + +### Phase 4: Integration Testing +1. Test complete workflows +2. Test error scenarios +3. Test cleanup procedures + +### Phase 5: Performance Testing +1. Run benchmarks +2. Perform load testing +3. Validate performance metrics + ## Troubleshooting Tests ### Common Issues -- **Server not running**: Check with `docker ps --filter "name=ml-"` -- **Authentication failures**: Verify configs in `~/.ml/config-*.toml` -- **Connection issues**: Test API with `curl -I http://localhost:9103/health` + +**Server Not Running**: +```bash +# Check service status +make dev-status + +# Check container logs +docker logs ml-experiments-api +docker logs ml-experiments-grafana +``` + +**Authentication Failures**: +```bash +# Verify configs +ls ~/.ml/config-*.toml + +# Check API health +curl -I http://localhost:8080/health + +# Monitor auth logs in Grafana +# Log Analysis dashboard -> filter "auth" +``` + +**Performance Issues**: +```bash +# Check resource usage in Grafana +# System Health dashboard + +# Check API response times +# Load Test Performance dashboard + +# Identify bottlenecks +# Prometheus: http://localhost:9090 +``` + +**Monitoring Issues**: +```bash +# Re-setup monitoring provisioning (Grafana datasources/providers) +python3 scripts/setup_monitoring.py + +# Restart Grafana +docker restart ml-experiments-grafana + +# Check datasource connectivity +# Grafana -> Configuration -> Data Sources +``` ### 
Debug Mode + ```bash -make test-debug # Run tests with verbose output +# Enable debug logging +export LOG_LEVEL=debug +make test + +# Monitor debug logs +# Grafana Log Analysis dashboard ``` ## Test Configuration @@ -103,12 +403,27 @@ make test-debug # Run tests with verbose output - `tests/fixtures/` - Test data and examples - `tests/benchmarks/` - Performance test data +### Monitoring Configuration +- `monitoring/grafana/provisioning/` - Auto-provisioned datasources +- `monitoring/grafana/dashboards/` - Auto-provisioned dashboards +- `monitoring/prometheus/prometheus.yml` - Metrics collection + ## Continuous Integration +### CI Pipeline with Monitoring + Tests run automatically on: -- Pull requests (full suite) -- Main branch commits (unit + integration) -- Releases (full suite + benchmarks) +- **Pull requests**: Full suite + performance benchmarks +- **Main branch**: Unit + integration tests +- **Releases**: Full suite + benchmarks + security scans + +### CI Monitoring + +During CI runs: +- Performance metrics collected +- Test results tracked in Grafana +- Regression detection +- Automated alerts on failures ## Writing Tests @@ -117,13 +432,120 @@ Tests run automatically on: - Integration tests: `tests/e2e/` directory - Benchmark tests: `tests/benchmarks/` directory +### Test with Monitoring + +```go +// Add custom metrics for tests +var testRequests = prometheus.NewCounterVec( + prometheus.CounterOpts{ + Name: "test_requests_total", + Help: "Total number of test requests", + }, + []string{"method", "status"}, +) + +// Log test events for monitoring +log.WithFields(log.Fields{ + "test": "integration_test", + "operation": "api_call", + "status": "success", +}).Info("Test operation completed") +``` + ### Zig Tests - CLI tests: `cli/tests/` directory - Follow Zig testing conventions -## See Also +## Test Result Analysis -- **[Quick Start Testing Guide](quick-start-testing.md)** - Fast 5-minute testing -- **[Testing Protocol](testing-protocol.md)** - Detailed 
testing procedures -- **[Configuration Reference](configuration-reference.md)** - Test setup -- **[Troubleshooting](troubleshooting.md)** - Common issues +### Grafana Dashboard Analysis + +**Load Test Performance**: +- Request rates over time +- Response time percentiles +- Error rate trends +- Throughput metrics + +**System Health**: +- Service availability +- Resource utilization +- Memory usage patterns +- CPU consumption + +**Log Analysis**: +- Error patterns +- Warning frequency +- Service log aggregation +- Debug information + +### Performance Regression Detection + +```bash +# Track performance over time +./scripts/track_performance.sh + +# Compare with baseline +# Grafana: Compare current run with historical data + +# Alert on regressions +# Set up Grafana alerts for performance degradation +``` + +## Test Cleanup + +### Automated Cleanup + +```bash +# Clean up test data +make self-cleanup + +# Clean up Docker resources +make clean-all + +# Reset monitoring data +docker volume rm monitoring_prometheus_data +docker volume rm monitoring_grafana_data +``` + +### Manual Cleanup + +```bash +# Stop test environment +make dev-down + +# Remove test artifacts +rm -rf ~/.ml/config-*.toml +rm -rf test-results/ +``` + +## Expected Test Results + +### Admin User +``` +Status retrieved for user: admin_user (admin: true) +Tasks: X total, X queued, X running, X failed, X completed +``` + +### Researcher User +``` +Status retrieved for user: researcher1 (admin: false) +Tasks: X total, X queued, X running, X failed, X completed +``` + +### Analyst User +``` +Status retrieved for user: analyst1 (admin: false) +Tasks: X total, X queued, X running, X failed, X completed +``` + +## Common Commands Reference + +```bash +make help # Show all commands +make test-auth # Quick auth test +make self-cleanup # Clean environment +make test-status # Check system status +make dev-up # Start dev environment +make dev-down # Stop dev environment +make dev-status # Check dev status +``` \ No 
newline at end of file diff --git a/docs/src/troubleshooting.md b/docs/src/troubleshooting.md index 8ff6cfb..1a1e5a8 100644 --- a/docs/src/troubleshooting.md +++ b/docs/src/troubleshooting.md @@ -7,41 +7,36 @@ Common issues and solutions for Fetch ML. ### Services Not Starting ```bash -# Check Docker status -docker-compose ps +# Check container status +docker ps --filter "name=ml-" -# Restart services -docker-compose down && docker-compose up -d (testing only) - -# Check logs -docker-compose logs -f +# Restart development stack +make dev-down +make dev-up ``` ### API Not Responding ```bash # Check health endpoint -curl http://localhost:9101/health +curl http://localhost:8080/health # Check if port is in use -lsof -i :9101 +lsof -i :8080 +lsof -i :8443 # Kill process on port -kill -9 $(lsof -ti :9101) +kill -9 $(lsof -ti :8080) ``` -### Database Issues +### Database / Redis Issues ```bash -# Check database connection -docker-compose exec postgres psql -U postgres -d fetch_ml +# Check Redis from container +docker exec ml-experiments-redis redis-cli ping -# Reset database -docker-compose down postgres -docker-compose up -d (testing only) postgres - -# Check Redis -docker-compose exec redis redis-cli ping +# Check API can reach database (via health endpoint) +curl -f http://localhost:8080/health || echo "API not healthy" ``` ## Common Errors @@ -53,7 +48,7 @@ docker-compose exec redis redis-cli ping ### Database Errors - **Connection failed**: Verify database type and connection params -- **No such table**: Run migrations with `--migrate` (see [Development Setup](development-setup.md)) +- **No such table**: Run migrations with `--migrate` (see [Quick Start](quick-start.md)) ### Container Errors - **Runtime not found**: Set `runtime: docker (testing only)` in config @@ -65,15 +60,15 @@ docker-compose exec redis redis-cli ping ## Development Issues - **Build fails**: `go mod tidy` and `cd cli && rm -rf zig-out zig-cache` -- **Tests fail**: Start test dependencies with 
`docker-compose up -d` or `make test-auth` +- **Tests fail**: Ensure dev stack is running with `make dev-up` or use `make test-auth` ## CLI Issues -- **Not found**: `cd cli && zig build dev` +- **Not found**: `cd cli && zig build --release=fast` - **Connection errors**: Check `--server` and `--api-key` ## Network Issues -- **Port conflicts**: `lsof -i :9101` and kill processes -- **Firewall**: Allow ports 9101, 6379, 5432 +- **Port conflicts**: `lsof -i :8080` / `lsof -i :8443` and kill processes +- **Firewall**: Allow ports 8080, 8443, 6379, 5432 ## Configuration Issues - **Invalid YAML**: `python3 -c "import yaml; yaml.safe_load(open('config.yaml'))"` @@ -82,13 +77,19 @@ docker-compose exec redis redis-cli ping ## Debug Information ```bash ./bin/api-server --version -docker-compose ps -docker-compose logs api-server | grep ERROR +docker ps --filter "name=ml-" +docker logs ml-experiments-api | grep ERROR ``` ## Emergency Reset ```bash -docker-compose down -v +# Stop and remove all dev containers and volumes +make dev-down +docker volume prune + +# Remove local data if needed rm -rf data/ results/ *.db -docker-compose up -d (testing only) + +# Start fresh dev stack +make dev-up ``` diff --git a/docs/src/validate.md b/docs/src/validate.md new file mode 100644 index 0000000..7e91dd9 --- /dev/null +++ b/docs/src/validate.md @@ -0,0 +1,101 @@ +--- +layout: page +title: "Validation (ml validate)" +permalink: /validate/ +--- + +# Validation (`ml validate`) + +The `ml validate` command verifies experiment integrity and provenance. 
+
+It can be run against:
+
+- A **commit id** (validates the experiment tree + dependency manifest)
+- A **task id** (additionally validates the run’s `run_manifest.json` provenance and lifecycle)
+
+## CLI usage
+
+```bash
+# Validate by commit
+ml validate <commit-id> [--json] [--verbose]
+
+# Validate by task
+ml validate --task <task-id> [--json] [--verbose]
+```
+
+### Output modes
+
+- Default (human): prints a summary with `errors`, `warnings`, and `failed_checks`.
+- `--verbose`: prints all checks under `checks` and includes `expected`/`actual`/`details` when present.
+- `--json`: prints the raw JSON payload.
+
+## Report shape
+
+The API returns a JSON report of the form:
+
+- `ok`: overall boolean
+- `commit_id`: commit being validated (if known)
+- `task_id`: task being validated (when validating by task)
+- `checks`: map of check name → `{ ok, expected?, actual?, details? }`
+- `errors`: list of high-level failures
+- `warnings`: list of non-fatal issues
+- `ts`: UTC timestamp
+
+## Check semantics
+
+- For **task statuses** `running`, `completed`, or `failed`, run-manifest issues are treated as **errors**.
+- For **queued/pending** tasks, run-manifest issues are usually **warnings** (the job may not have started yet).
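The report shape above lends itself to simple post-processing. As an illustrative sketch (not part of the CLI — the helper names here are hypothetical), a payload saved with `ml validate --json` can be scanned for failing checks; the field names follow the report shape documented above:

```python
import json

def failing_checks(report: dict) -> list[str]:
    """Return names of checks whose `ok` flag is false (per the documented report shape)."""
    checks = report.get("checks", {})
    return [name for name, result in checks.items() if not result.get("ok", False)]

def summarize(report: dict) -> str:
    """One-line verdict, loosely mirroring the human-readable output mode."""
    if report.get("ok"):
        return "ok"
    failed = failing_checks(report)
    return f"FAILED: {len(report.get('errors', []))} error(s), failed checks: {', '.join(failed)}"

if __name__ == "__main__":
    # Sample payload shaped like the example report in this document.
    report = json.loads("""{
      "ok": false,
      "task_id": "task-run-manifest-location-mismatch",
      "checks": {
        "run_manifest": {"ok": true},
        "run_manifest_location": {"ok": false, "expected": "running", "actual": "finished"}
      },
      "errors": ["run manifest location mismatch"]
    }""")
    print(summarize(report))
```

This kind of filter is useful in CI, where a non-empty `failing_checks` result can gate a deployment step.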
+ +## Notable checks + +### Experiment integrity + +- `experiment_manifest`: validates the experiment manifest (content-addressed integrity) +- `deps_manifest`: validates that a dependency manifest exists and can be hashed +- `expected_manifest_overall_sha`: compares the task’s recorded manifest SHA to the current manifest SHA +- `expected_deps_manifest`: compares the task’s recorded deps manifest name/SHA to what exists on disk + +### Run manifest provenance (task validation) + +- `run_manifest`: whether `run_manifest.json` could be found and loaded +- `run_manifest_location`: verifies the manifest was found in the expected bucket: + - `pending` for queued/pending + - `running` for running + - `finished` for completed + - `failed` for failed +- `run_manifest_task_id`: task id match +- `run_manifest_commit_id`: commit id match +- `run_manifest_deps`: deps manifest name/SHA match +- `run_manifest_snapshot_id`: snapshot id match (when snapshot is part of the task) +- `run_manifest_snapshot_sha256`: snapshot sha256 match (when snapshot sha is recorded) + +### Run manifest lifecycle (task validation) + +- `run_manifest_lifecycle`: + - `running`: must have `started_at`, must not have `ended_at`/`exit_code` + - `completed`/`failed`: must have `started_at`, `ended_at`, `exit_code`, and `ended_at >= started_at` + - `queued`/`pending`: must not have `ended_at`/`exit_code` + +## Example report (task validation) + +```json +{ + "ok": false, + "commit_id": "6161616161616161616161616161616161616161", + "task_id": "task-run-manifest-location-mismatch", + "checks": { + "experiment_manifest": {"ok": true}, + "deps_manifest": {"ok": true, "actual": "requirements.txt:..."}, + "run_manifest": {"ok": true}, + "run_manifest_location": { + "ok": false, + "expected": "running", + "actual": "finished" + } + }, + "errors": [ + "run manifest location mismatch" + ], + "ts": "2025-12-17T18:43:00Z" +} +``` diff --git a/docs/src/zig-cli.md b/docs/src/zig-cli.md index 4bfbe51..c2ab7af 100644 --- 
a/docs/src/zig-cli.md +++ b/docs/src/zig-cli.md @@ -47,6 +47,8 @@ export FETCH_ML_CLI_API_KEY="prod-key" - `ml dataset list` – list datasets - `ml monitor` – launch TUI over SSH (remote UI) +`ml status --json` may include an optional `prewarm` field when the worker is prefetching datasets for the next queued task. + ## Build flavors - `make all` – release‑small (default)
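Because `prewarm` is optional, scripts consuming `ml status --json` should tolerate its absence. A minimal sketch (illustrative only, not part of the CLI; field names taken from the `prewarm` example in the CLI reference):

```python
import json

def prewarm_summary(status: dict) -> list[str]:
    """Summarize each active prewarm entry; returns [] when the optional field is absent."""
    lines = []
    for entry in status.get("prewarm", []):  # `prewarm` may be missing entirely
        lines.append(
            f"{entry.get('worker_id', '?')}: phase={entry.get('phase', '?')} "
            f"datasets={entry.get('dataset_count', 0)}"
        )
    return lines

if __name__ == "__main__":
    # Sample payload shaped like the `prewarm` example in the CLI reference.
    status = json.loads(
        '{"prewarm": [{"worker_id": "worker-1", "phase": "datasets", "dataset_count": 2}]}'
    )
    for line in prewarm_summary(status):
        print(line)
```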