docs(dev): document validate workflow, CLI/TUI UX contract, and consolidate dev/testing docs

Jeremie Fraeys 2026-01-05 12:37:46 -05:00
parent 8157f73a70
commit 1aed78839b
12 changed files with 1119 additions and 646 deletions

View file

@ -95,7 +95,7 @@ make test
go test ./internal/queue/...
# Build CLI
cd cli && zig build dev
cd cli && zig build --release=fast
# Run formatters and linters
make lint

View file

@ -34,6 +34,9 @@ High-performance command-line interface for experiment management, written in Zi
| `cancel` | Cancel running job | `ml cancel job123` |
| `prune` | Clean up old experiments | `ml prune --keep 10` |
| `watch` | Auto-sync directory on changes | `ml watch ./project --queue` |
| `jupyter` | Manage Jupyter notebook services | `ml jupyter start --name my-nb` |
| `validate` | Validate provenance/integrity for a commit or task | `ml validate <commit_id> --verbose` |
| `info` | Show run info from `run_manifest.json` | `ml info <run_dir>` |
### Command Details
@ -78,6 +81,9 @@ ml queue my-job --commit abc123 --priority 8
- Priority queuing system
- API key authentication
**Notes:**
- Tasks support optional `snapshot_id` and `dataset_specs` fields server-side (for provenance and dataset resolution).
#### `watch` - Auto-Sync Monitoring
```bash
# Watch directory for changes
@ -108,12 +114,78 @@ ml monitor
```
Launches the TUI over SSH for real-time monitoring.
#### `status` - System Status
`ml status --json` returns a JSON object including an optional `prewarm` field when worker prewarming is active:
```json
{
"prewarm": [
{
"worker_id": "worker-1",
"task_id": "<task-id>",
"started_at": "2025-01-01T00:00:00Z",
"updated_at": "2025-01-01T00:00:05Z",
"phase": "datasets",
"dataset_count": 2
}
]
}
```
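Scripts can inspect the optional `prewarm` array with `jq` (already used elsewhere in this guide). A minimal sketch against a canned sample; the field names are taken from the example above, and the sample value itself is hypothetical:

```shell
# Hypothetical sample of `ml status --json` output; field names follow the example above.
status_json='{"prewarm":[{"worker_id":"worker-1","task_id":"t-1","phase":"datasets","dataset_count":2}]}'

# Print each prewarming worker and its current phase; prints nothing if `prewarm` is absent.
echo "$status_json" | jq -r '.prewarm[]? | "\(.worker_id) \(.phase)"'
```

The `[]?` iterator makes the pipeline a no-op when no prewarming is active, so the same command is safe to run unconditionally.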
#### `cancel` - Job Cancellation
```bash
ml cancel running-job-id
```
Cancels a currently running job by ID.
#### `jupyter` - Jupyter Notebook Management
Manage Jupyter notebook services via WebSocket protocol.
```bash
# Start a Jupyter service
ml jupyter start --name my-notebook --workspace /path/to/workspace
# Start with password protection
ml jupyter start --name my-notebook --workspace /path/to/workspace --password mypass
# List running services
ml jupyter list
# Stop a service
ml jupyter stop service-id-12345
# Check service status
ml jupyter status
```
**Features:**
- WebSocket-based binary protocol for low latency
- Secure API key authentication (SHA256 hashed)
- Real-time service management
- Workspace isolation
**Common Use Cases:**
```bash
# Development workflow
ml jupyter start --name dev-notebook --workspace ./notebooks
# ... do development work ...
ml jupyter stop dev-service-123
# Team collaboration
ml jupyter start --name team-analysis --workspace /shared/analysis --password teampass
# Multiple services
ml jupyter list # View all running services
```
**Security:**
- API keys are hashed before transmission
- Password protection for notebooks
- Workspace path validation
- Service ID-based authorization
### Configuration
The Zig CLI reads configuration from `~/.ml/config.toml`:
@ -144,7 +216,7 @@ Main HTTPS API server for experiment management.
go run ./cmd/api-server/main.go
# With configuration
./bin/api-server --config configs/config-local.yaml
./bin/api-server --config configs/api/dev.yaml
```
**Features:**
@ -160,9 +232,6 @@ Terminal User Interface for monitoring experiments.
```bash
# Launch TUI
go run ./cmd/tui/main.go
# With custom config
./tui --config configs/config-local.yaml
```
**Features:**
@ -187,10 +256,10 @@ Configuration validation and linting tool.
```bash
# Validate configuration
./configlint configs/config-local.yaml
./configlint configs/api/dev.yaml
# Check schema compliance
./configlint --schema configs/schema/config_schema.yaml
./configlint --schema configs/schema/api_server_config.yaml
```
## Management Script (`./tools/manage.sh`)
@ -208,39 +277,22 @@ Simple service management for your homelab.
./tools/manage.sh cleanup # Clean project artifacts
```
## Setup Script (`./setup.sh`)
One-command homelab setup.
### Usage
```bash
# Full setup
./setup.sh
# Setup includes:
# - SSL certificate generation
# - Configuration creation
# - Build all components
# - Start Redis
# - Setup Fail2Ban (if available)
```
## API Testing
Test the API with curl:
```bash
# Health check
curl -k -H 'X-API-Key: password' https://localhost:9101/health
curl -f http://localhost:8080/health
# List experiments
curl -k -H 'X-API-Key: password' https://localhost:9101/experiments
curl -H 'X-API-Key: password' http://localhost:8080/experiments
# Submit experiment
curl -k -X POST -H 'X-API-Key: password' \
curl -X POST -H 'X-API-Key: password' \
-H 'Content-Type: application/json' \
-d '{"name":"test","config":{"type":"basic"}}' \
https://localhost:9101/experiments
http://localhost:8080/experiments
```
## Zig CLI Architecture
@ -269,7 +321,7 @@ The Zig CLI is designed for performance and reliability:
## Configuration
Main configuration file: `configs/config-local.yaml`
Main configuration file: `configs/api/dev.yaml`
### Key Settings
```yaml
@ -277,14 +329,14 @@ auth:
enabled: true
api_keys:
dev_user:
hash: "2baf1f40105d9501fe319a8ec463fdf4325a2a5df445adf3f572f626253678c9"
hash: "CHANGE_ME_SHA256_DEV_USER_KEY"
admin: true
roles:
- admin
permissions:
'*': true
researcher_user:
hash: "ef92b778ba7a6c8f2150019a5678047b6a9a2b95cef8189518f9b35c54d2e3ae"
hash: "CHANGE_ME_SHA256_RESEARCHER_USER_KEY"
admin: false
roles:
- researcher
@ -385,8 +437,8 @@ telnet worker.local 9100
**Authentication failed:**
```bash
# Check API key in config-local.yaml
grep -A 5 "api_keys:" configs/config-local.yaml
# Check API key in config
grep -A 5 "api_keys:" configs/api/dev.yaml
```
**Redis connection failed:**

View file

@ -0,0 +1,337 @@
# FetchML CLI/TUI UX Contract v1
This document defines the user experience contract for FetchML v1, focusing on clean, predictable CLI/TUI interactions without mode flags.
## Core Principles
1. **Thin CLI**: Local CLI does minimal validation; authoritative checks happen server-side
2. **No Mode Flags**: Commands do what they say; no `--mode` or similar flags
3. **Predictable Defaults**: Sensible defaults that work for most use cases
4. **Graceful Degradation**: JSON output for automation, human-friendly output for interactive use
5. **Explicit Operations**: `--dry-run`, `--validate`, `--explain` are explicit, not implied
## Commands v1
### Core Workflow Commands
#### `ml queue <job-name> [options]`
Submit a job for execution.
**Basic Usage:**
```bash
ml queue my-experiment
```
**Options:**
- `--commit <sha>`: Specify commit ID (default: current git HEAD)
- `--priority <1-10>`: Job priority (default: 5)
- `--cpu <cores>`: CPU cores requested (default: 2)
- `--memory <gb>`: Memory in GB (default: 8)
- `--gpu <count>`: GPU count (default: 0)
- `--gpu-memory <gb>`: GPU memory budget (default: auto)
**Dry Run:**
```bash
ml queue my-experiment --dry-run
# Output: JSON with what would be submitted, validation results
```
**Validate Only:**
```bash
ml queue my-experiment --validate
# Output: Validation results without submitting
```
**Explain:**
```bash
ml queue my-experiment --explain
# Output: Human-readable explanation of what will happen
```
**JSON Output:**
When using `--json`, the response may include a `prewarm` field describing best-effort worker prewarming activity (e.g. dataset prefetch for the next queued task).
```bash
ml queue my-experiment --json
# Output: Structured JSON response
```
#### `ml status [job-name]`
Show job status.
**Basic Usage:**
```bash
ml status # All jobs summary
ml status my-experiment # Specific job details
```
**Options:**
- `--json`: JSON output
- `--watch`: Watch mode (refresh every 2s)
- `--limit <n>`: Limit number of jobs shown (default: 20)
#### `ml cancel <job-name>`
Cancel a running or queued job.
**Basic Usage:**
```bash
ml cancel my-experiment
```
**Options:**
- `--force`: Force cancel even if running
- `--json`: JSON output
### Experiment Management
#### `ml experiment init <name>`
Initialize a new experiment directory.
**Basic Usage:**
```bash
ml experiment init my-project
```
**Options:**
- `--template <name>`: Use experiment template
- `--dry-run`: Show what would be created
#### `ml experiment list`
List available experiments.
**Options:**
- `--json`: JSON output
- `--limit <n>`: Limit results
#### `ml experiment show <commit-id>`
Show experiment details.
**Options:**
- `--json`: JSON output
- `--manifest`: Show content integrity manifest
### Dataset Management
#### `ml dataset list`
List available datasets.
**Options:**
- `--json`: JSON output
- `--synced-only`: Show only synced datasets
#### `ml dataset sync <dataset-name>`
Sync a dataset from NAS to ML server.
**Options:**
- `--dry-run`: Show what would be synced
- `--validate`: Validate dataset integrity after sync
### Monitoring & TUI
#### `ml monitor`
Launch TUI for real-time monitoring (runs over SSH).
**Basic Usage:**
```bash
ml monitor
```
**TUI Controls:**
- `Ctrl+C`: Exit TUI
- `q`: Quit
- `r`: Refresh
- `j/k`: Navigate jobs
- `Enter`: Job details
- `c`: Cancel selected job
#### `ml watch <job-name>`
Watch a specific job's output.
**Options:**
- `--follow`: Follow log output (default)
- `--tail <n>`: Show last n lines
## Global Options
These options work with any command:
- `--json`: Output structured JSON instead of human-readable format
- `--config <path>`: Use custom config file (default: ~/.ml/config.toml)
- `--verbose`: Verbose output
- `--quiet`: Minimal output
- `--help`: Show help for command
## Defaults Configuration
### Default Job Resources
```toml
[defaults]
cpu = 2 # CPU cores
memory = 8 # GB
gpu = 0 # GPU count
gpu_memory = "auto" # Auto-detect or specify GB
priority = 5 # Job priority (1-10)
```
### Default Behavior
- **Commit ID**: Current git HEAD (requires a clean working directory)
- **Working Directory**: Current directory for experiment files
- **Output**: Human-readable format unless `--json` specified
- **Validation**: Server-side authoritative validation
## Error Handling
### Exit Codes
- `0`: Success
- `1`: General error
- `2`: Invalid arguments
- `3`: Validation failed
- `4`: Network/connection error
- `5`: Server error
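Wrapper scripts can branch on these codes. A purely illustrative helper (it does not invoke the CLI; the mapping mirrors the table above):

```shell
# Map an `ml` exit code to the meaning listed above.
explain_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "general error" ;;
    2) echo "invalid arguments" ;;
    3) echo "validation failed" ;;
    4) echo "network/connection error" ;;
    5) echo "server error" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# Usage in a wrapper script:
#   ml queue my-experiment --validate
#   explain_exit $?
```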
### Error Output Format
**Human-readable:**
```
Error: Experiment validation failed
- Missing dependency manifest (environment.yml, poetry.lock, pyproject.toml, or requirements.txt)
- Train script not found: train.py
```
**JSON:**
```json
{
"error": "validation_failed",
"message": "Experiment validation failed",
"details": [
{"field": "dependency_manifest", "error": "missing", "supported": ["environment.yml", "poetry.lock", "pyproject.toml", "requirements.txt"]},
{"field": "train_script", "error": "not_found", "expected": "train.py"}
]
}
```
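In automation, the `details` array can be flattened for logging with `jq` (a sketch assuming the schema above; the sample response is hypothetical):

```shell
# Hypothetical error response matching the JSON schema above.
err_json='{"error":"validation_failed","message":"Experiment validation failed","details":[{"field":"train_script","error":"not_found","expected":"train.py"}]}'

# Flatten the details array into one "field: error" line each, e.g. for CI logs.
echo "$err_json" | jq -r '.details[] | "\(.field): \(.error)"'
```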
## Ctrl+C Semantics
### Command Cancellation
- **Ctrl+C during `ml queue --dry-run`**: Immediate exit, no side effects
- **Ctrl+C during `ml queue`**: Attempt to cancel submission, show status
- **Ctrl+C during `ml status --watch`**: Exit watch mode
- **Ctrl+C during `ml monitor`**: Gracefully exit TUI
- **Ctrl+C during `ml watch`**: Stop following logs, show final status
### Graceful Shutdown
1. Signal interrupt to server (if applicable)
2. Clean up local resources
3. Display current status
4. Exit with appropriate code
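The sequence above can be sketched for a shell wrapper around a long-running command; the real CLI handles this internally, so this is only a shape to follow in your own scripts:

```shell
#!/bin/sh
# Sketch of the graceful-shutdown sequence above.
cleanup() {
  echo "interrupted: cleaning up"   # steps 1-3: notify server, clean up, show status
  exit 130                          # step 4: conventional SIGINT exit code (128 + 2)
}
trap cleanup INT
```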
## JSON Output Schema
### Job Submission Response
```json
{
"job_id": "uuid-string",
"job_name": "my-experiment",
"status": "queued",
"commit_id": "abc123...",
"submitted_at": "2025-01-01T12:00:00Z",
"estimated_start": "2025-01-01T12:05:00Z",
"resources": {
"cpu": 2,
"memory_gb": 8,
"gpu": 1,
"gpu_memory_gb": 16
}
}
```
### Status Response
```json
{
"jobs": [
{
"job_id": "uuid-string",
"job_name": "my-experiment",
"status": "running",
"progress": 0.75,
"started_at": "2025-01-01T12:05:00Z",
"estimated_completion": "2025-01-01T12:30:00Z",
"node": "worker-01"
}
],
"total": 1,
"showing": 1
}
```
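A polling script can derive a "still running" flag from this shape with `jq` (assumed field names from the schema above; the sample response is hypothetical):

```shell
# Hypothetical status response matching the schema above.
status_json='{"jobs":[{"job_id":"u1","job_name":"my-experiment","status":"running"}],"total":1,"showing":1}'

# Count jobs that have not reached a terminal state (completed/failed).
echo "$status_json" | jq '[.jobs[] | select(.status != "completed" and .status != "failed")] | length'
```

A zero result means every listed job has finished, which is a convenient loop-exit condition.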
## Examples
### Typical Workflow
```bash
# 1. Initialize experiment
ml experiment init my-project
cd my-project
# 2. Validate experiment (authoritative checks run server-side; nothing is submitted)
ml queue . --validate
# 3. Submit job
ml queue . --priority 8 --gpu 1
# 4. Monitor progress
ml status .
ml watch .
# 5. Check results
ml status . --json
```
### Automation Script
```bash
#!/bin/bash
# Submit job and wait for completion
JOB_ID=$(ml queue my-experiment --json | jq -r '.job_id')
echo "Submitted job: $JOB_ID"
# Wait for completion
while true; do
STATUS=$(ml status "$JOB_ID" --json | jq -r '.jobs[0].status')
echo "Status: $STATUS"
if [[ "$STATUS" == "completed" || "$STATUS" == "failed" ]]; then
break
fi
sleep 10
done
ml status "$JOB_ID"
```
## Implementation Notes
### Server-side Validation
- CLI performs minimal local checks (git status, file existence)
- All authoritative validation happens on worker
- Validation failures are propagated back to CLI with clear error messages
### Trust Contract Integration
- Every job submission includes commit ID and content integrity manifest
- Worker validates both before execution
- Any mismatch causes hard-fail with detailed error reporting
### Resource Management
- Resource requests are validated against available capacity
- Jobs are queued based on priority and resource availability
- Resource usage is tracked and reported in status
## Future Extensions
The v1 contract is intentionally minimal but designed for extension:
- **v1.1**: Add job dependencies and workflows
- **v1.2**: Add experiment templates and scaffolding
- **v1.3**: Add distributed execution across multiple workers
- **v2.0**: Add advanced scheduling and resource optimization
All extensions will maintain backward compatibility with the v1 contract.

View file

@ -7,14 +7,14 @@ This document provides a comprehensive reference for all configuration options i
## Environment Configurations
### Local Development
**File:** `configs/environments/config-local.yaml`
**File:** `configs/api/dev.yaml`
```yaml
auth:
enabled: true
apikeys:
api_keys:
dev_user:
hash: "2baf1f40105d9501fe319a8ec463fdf4325a2a5df445adf3f572f626253678c9"
hash: "CHANGE_ME_SHA256_DEV_USER_KEY"
admin: true
roles: ["admin"]
permissions:
@ -35,14 +35,14 @@ security:
```
### Multi-User Setup
**File:** `configs/environments/config-multi-user.yaml`
**File:** `configs/api/multi-user.yaml`
```yaml
auth:
enabled: true
apikeys:
api_keys:
admin_user:
hash: "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8"
hash: "CHANGE_ME_SHA256_ADMIN_USER_KEY"
admin: true
roles: ["user", "admin"]
permissions:
@ -51,7 +51,7 @@ auth:
delete: true
researcher1:
hash: "ef92b778ba7a6c8f2150019a5678047b6a9a2b95cef8189518f9b35c54d2e3ae"
hash: "CHANGE_ME_SHA256_RESEARCHER1_KEY"
admin: false
roles: ["user", "researcher"]
permissions:
@ -61,7 +61,7 @@ auth:
jobs:delete: false
analyst1:
hash: "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3"
hash: "CHANGE_ME_SHA256_ANALYST1_KEY"
admin: false
roles: ["user", "analyst"]
permissions:
@ -72,12 +72,12 @@ auth:
```
### Production
**File:** `configs/environments/config-prod.yaml`
**File:** `configs/api/prod.yaml`
```yaml
auth:
enabled: true
apikeys:
api_keys:
# Production users configured here
server:
@ -98,13 +98,14 @@ security:
- "10.0.0.0/8"
redis:
url: "redis://redis:6379"
max_connections: 10
addr: "redis:6379"
password: ""
db: 0
logging:
level: "info"
file: "/app/logs/app.log"
audit_file: "/app/logs/audit.log"
audit_log: "/app/logs/audit.log"
```
## Worker Configurations
@ -113,59 +114,80 @@ logging:
**File:** `configs/workers/worker-prod.toml`
```toml
[worker]
name = "production-worker"
id = "worker-prod-1"
worker_id = "worker-prod-01"
base_path = "/data/ml-experiments"
max_workers = 4
[server]
host = "api-server"
port = 9101
api_key = "your-api-key-here"
redis_addr = "localhost:6379"
redis_password = "CHANGE_ME_REDIS_PASSWORD"
redis_db = 0
[execution]
max_concurrent_jobs = 2
timeout_minutes = 60
retry_attempts = 3
host = "localhost"
user = "ml-user"
port = 22
ssh_key = "~/.ssh/id_rsa"
podman_image = "ml-training:latest"
gpu_vendor = "none"
gpu_visible_devices = []
gpu_devices = []
container_workspace = "/workspace"
container_results = "/results"
train_script = "train.py"
[resources]
memory_limit = "4Gi"
cpu_limit = "2"
gpu_enabled = false
max_workers = 4
desired_rps_per_worker = 2
podman_cpus = "4"
podman_memory = "16g"
[storage]
work_dir = "/tmp/fetchml-jobs"
cleanup_interval_minutes = 30
[metrics]
enabled = true
listen_addr = ":9100"
```
```toml
# Production Worker (NVIDIA, UUID-based GPU selection)
worker_id = "worker-prod-01"
base_path = "/data/ml-experiments"
podman_image = "ml-training:latest"
gpu_vendor = "nvidia"
gpu_visible_device_ids = ["GPU-REPLACE_WITH_REAL_UUID"]
gpu_devices = ["/dev/dri"]
container_workspace = "/workspace"
container_results = "/results"
train_script = "train.py"
```
### Docker Worker
**File:** `configs/workers/worker-docker.yaml`
**File:** `configs/workers/docker.yaml`
```yaml
worker:
name: "docker-worker"
id: "worker-docker-1"
worker_id: "docker-worker"
base_path: "/tmp/fetchml-jobs"
train_script: "train.py"
server:
host: "api-server"
port: 9101
api_key: "your-api-key-here"
redis_addr: "redis:6379"
redis_password: ""
redis_db: 0
execution:
max_concurrent_jobs: 1
timeout_minutes: 30
retry_attempts: 3
local_mode: true
resources:
memory_limit: "2Gi"
cpu_limit: "1"
gpu_enabled: false
max_workers: 1
poll_interval_seconds: 5
docker:
podman_image: "python:3.9-slim"
container_workspace: "/workspace"
container_results: "/results"
gpu_devices: []
gpu_vendor: "none"
gpu_visible_devices: []
metrics:
enabled: true
image: "fetchml/worker:latest"
volume_mounts:
- "/tmp:/tmp"
- "/var/run/docker.sock:/var/run/docker.sock"
listen_addr: ":9100"
metrics_flush_interval: "500ms"
```
## CLI Configuration
@ -181,7 +203,7 @@ worker_base = "/app"
worker_port = 22
[auth]
api_key = "your-hashed-api-key"
api_key = "<your-api-key>"
[cli]
default_timeout = 30
@ -199,7 +221,7 @@ worker_base = "/app"
worker_port = 22
[auth]
api_key = "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8"
api_key = "<admin-api-key>"
```
**Researcher Config:** `~/.ml/config-researcher.toml`
@ -211,7 +233,7 @@ worker_base = "/app"
worker_port = 22
[auth]
api_key = "ef92b778ba7a6c8f2150019a5678047b6a9a2b95cef8189518f9b35c54d2e3ae"
api_key = "<researcher-api-key>"
```
**Analyst Config:** `~/.ml/config-analyst.toml`
@ -223,7 +245,7 @@ worker_base = "/app"
worker_port = 22
[auth]
api_key = "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3"
api_key = "<analyst-api-key>"
```
## Configuration Options
@ -298,7 +320,6 @@ api_key = "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3"
|----------|---------|-------------|
| `FETCHML_CONFIG` | - | Path to config file |
| `FETCHML_LOG_LEVEL` | "info" | Override log level |
| `FETCHML_REDIS_URL` | - | Override Redis URL |
| `CLI_CONFIG` | - | Path to CLI config file |
## Troubleshooting
@ -324,7 +345,7 @@ api_key = "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3"
```bash
# Validate server configuration
go run cmd/api-server/main.go --config configs/environments/config-local.yaml --validate
go run cmd/api-server/main.go --config configs/api/dev.yaml --validate
# Test CLI configuration
./cli/zig-out/bin/ml status --debug

View file

@ -0,0 +1,30 @@
# Development Quick Start
This page is the developer-focused entry point for working on FetchML.
## Prerequisites
- Go
- Zig
- Docker / Docker Compose
## Quick setup
```bash
# Clone
git clone https://github.com/jfraeys/fetch_ml.git
cd fetch_ml
# Start dev environment
make dev-up
# Run tests
make test
```
## Next
- See `testing.md` for test workflows.
- See `architecture.md` for system structure.
- See `zig-cli.md` for CLI build details.
- See the repository root `DEVELOPMENT.md` for the full development guide.

View file

@ -1,185 +0,0 @@
# Quick Start Testing Guide
## Overview
This guide provides the fastest way to test the FetchML multi-user authentication system.
## Prerequisites
- Docker and Docker Compose installed
- CLI built: `cd cli && zig build`
- Test configs available in `~/.ml/`
## 5-Minute Test
### 1. Clean Environment
```bash
make self-cleanup
```
### 2. Start Services
```bash
docker-compose -f deployments/docker-compose.prod.yml up -d
```
### 3. Test Authentication
```bash
make test-auth
```
### 4. Check Results
You should see:
- Admin user: Full access, shows all jobs
- Researcher user: Own jobs only
- Analyst user: Read-only access
### 5. Clean Up
```bash
make self-cleanup
```
## Detailed Testing
### Multi-User Authentication Test
```bash
# Test each user role
cp ~/.ml/config-admin.toml ~/.ml/config.toml && ./cli/zig-out/bin/ml status
cp ~/.ml/config-researcher.toml ~/.ml/config.toml && ./cli/zig-out/bin/ml status
cp ~/.ml/config-analyst.toml ~/.ml/config.toml && ./cli/zig-out/bin/ml status
```
### Job Queueing Test
```bash
# Admin can queue jobs
cp ~/.ml/config-admin.toml ~/.ml/config.toml
echo "admin job" | ./cli/zig-out/bin/ml queue admin-test
# Researcher can queue jobs
cp ~/.ml/config-researcher.toml ~/.ml/config.toml
echo "research job" | ./cli/zig-out/bin/ml queue research-test
# Analyst cannot queue jobs (should fail)
cp ~/.ml/config-analyst.toml ~/.ml/config.toml
echo "analysis job" | ./cli/zig-out/bin/ml queue analysis-test
```
### Status Verification
```bash
# Check what each user can see
make test-auth
```
## Expected Results
### Admin User Output
```
Status retrieved for user: admin_user (admin: true)
Tasks: X total, X queued, X running, X failed, X completed
```
### Researcher User Output
```
Status retrieved for user: researcher1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
```
### Analyst User Output
```
Status retrieved for user: analyst1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
```
## Troubleshooting
### Server Not Running
```bash
# Check containers
docker ps --filter "name=ml-"
# Start services
docker-compose -f deployments/docker-compose.prod.yml up -d
# Check logs
docker logs ml-prod-api
```
### Authentication Failures
```bash
# Check config files
ls ~/.ml/config-*.toml
# Verify API keys
cat ~/.ml/config-admin.toml
```
### Connection Issues
```bash
# Test API directly
curl -I http://localhost:9103/health
# Check ports
netstat -an | grep 9103
```
## Advanced Testing
### Full Test Suite
```bash
make test-full
```
### Performance Testing
```bash
./scripts/benchmarks/run-benchmarks-local.sh
```
### Cleanup Status
```bash
make test-status
```
## Configuration Files
### Test Configs Location
- `~/.ml/config-admin.toml` - Admin user
- `~/.ml/config-researcher.toml` - Researcher user
- `~/.ml/config-analyst.toml` - Analyst user
### Server Configs
- `configs/environments/config-multi-user.yaml` - Multi-user setup
- `configs/environments/config-local.yaml` - Local development
## Next Steps
1. **Review Documentation**
- [Testing Protocol](testing-protocol.md)
- [Configuration Reference](configuration-reference.md)
- [Testing Guide](testing-guide.md)
2. **Explore Features**
- Job queueing and management
- WebSocket communication
- Role-based permissions
3. **Production Setup**
- TLS configuration
- Security hardening
- Monitoring setup
## Help
### Common Commands
```bash
make help # Show all commands
make test-auth # Quick auth test
make self-cleanup # Clean environment
make test-status # Check system status
```
### Get Help
- Check logs: `docker logs ml-prod-api`
- Review documentation in `docs/src/`
- Use `--debug` flag with CLI commands

View file

@ -1,50 +0,0 @@
# Testing Guide
## Quick Start
The FetchML project includes comprehensive testing tools.
## Testing Commands
### Quick Tests
```bash
make test-auth # Test multi-user authentication
make test-status # Check cleanup status
make self-cleanup # Clean environment
```
### Full Test Suite
```bash
make test-full # Run complete test suite
```
## Expected Results
### Admin User
Status retrieved for user: admin_user (admin: true)
Tasks: X total, X queued, X running, X failed, X completed
### Researcher User
Status retrieved for user: researcher1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
### Analyst User
Status retrieved for user: analyst1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
## Troubleshooting
### Authentication Failures
- Check API key in ~/.ml/config.toml
- Verify server is running with auth enabled
### Container Issues
- Check Docker daemon is running
- Verify ports 9100, 9103 are available
- Review logs: docker logs ml-prod-api
## Cleanup
```bash
make self-cleanup # Interactive cleanup
make auto-cleanup # Setup daily auto-cleanup
```

View file

@ -1,258 +0,0 @@
# Testing Protocol
This document outlines the comprehensive testing protocol for the FetchML project.
## Overview
The testing protocol is designed to ensure:
- Multi-user authentication works correctly
- API functionality is reliable
- CLI commands function properly
- Docker containers run as expected
- Performance meets requirements
## Test Categories
### 1. Authentication Tests
#### 1.1 Multi-User Authentication
```bash
# Test admin user
cp ~/.ml/config-admin.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows admin status and all jobs
# Test researcher user
cp ~/.ml/config-researcher.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows researcher status and own jobs only
# Test analyst user
cp ~/.ml/config-analyst.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows analyst status, read-only access
```
#### 1.2 API Key Validation
```bash
# Test invalid API key
echo "invalid_key" > ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Authentication failed error
# Test missing API key
rm ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: API key not configured error
```
### 2. CLI Functionality Tests
#### 2.1 Job Queueing
```bash
# Test job queueing with different users
cp ~/.ml/config-admin.toml ~/.ml/config.toml
echo "test job" | ./cli/zig-out/bin/ml queue test-job
cp ~/.ml/config-researcher.toml ~/.ml/config.toml
echo "research job" | ./cli/zig-out/bin/ml queue research-job
cp ~/.ml/config-analyst.toml ~/.ml/config.toml
echo "analysis job" | ./cli/zig-out/bin/ml queue analysis-job
# Expected: Admin and researcher can queue, analyst cannot
```
#### 2.2 Status Checking
```bash
# Check status after job queueing
./cli/zig-out/bin/ml status
# Expected: Shows jobs based on user permissions
```
### 3. Docker Container Tests
#### 3.1 Container Startup
```bash
# Start production environment
docker-compose -f deployments/docker-compose.prod.yml up -d
# Check container status
docker ps --filter "name=ml-"
# Expected: All containers running and healthy
```
#### 3.2 Port Accessibility
```bash
# Test API server port
curl -I http://localhost:9103/health
# Expected: 200 OK response
# Test metrics port
curl -I http://localhost:9100/metrics
# Expected: 200 OK response
```
#### 3.3 Container Cleanup
```bash
# Test cleanup script
./scripts/maintenance/cleanup.sh --dry-run
./scripts/maintenance/cleanup.sh --force
# Expected: Containers stopped and removed
```
### 4. Performance Tests
#### 4.1 API Performance
```bash
# Run API benchmarks
./scripts/benchmarks/run-benchmarks-local.sh
# Expected: Response times under 100ms for basic operations
```
#### 4.2 Load Testing
```bash
# Run load tests
go test -v ./tests/load/...
# Expected: System handles concurrent requests without degradation
```
### 5. Integration Tests
#### 5.1 End-to-End Workflow
```bash
# Complete workflow test
cp ~/.ml/config-admin.toml ~/.ml/config.toml
# Queue job
echo "integration test" | ./cli/zig-out/bin/ml queue integration-test
# Check status
./cli/zig-out/bin/ml status
# Verify job appears in queue
# Expected: Job queued and visible in status
```
#### 5.2 WebSocket Communication
```bash
# Test WebSocket handshake
./cli/zig-out/bin/ml status
# Expected: Successful WebSocket upgrade and response
```
## Test Execution Order
### Phase 1: Environment Setup
1. Clean up any existing containers
2. Start fresh Docker environment
3. Verify all services are running
### Phase 2: Authentication Testing
1. Test all user roles (admin, researcher, analyst)
2. Test invalid authentication scenarios
3. Verify role-based permissions
### Phase 3: Functional Testing
1. Test CLI commands (queue, status)
2. Test API endpoints
3. Test WebSocket communication
### Phase 4: Integration Testing
1. Test complete workflows
2. Test error scenarios
3. Test cleanup procedures
### Phase 5: Performance Testing
1. Run benchmarks
2. Perform load testing
3. Validate performance metrics
## Automated Testing
### Continuous Integration Tests
```bash
# Run all tests
make test
# Run specific test categories
make test-unit
make test-integration
make test-e2e
```
### Pre-deployment Checklist
```bash
# Complete test suite
./scripts/testing/run-full-test-suite.sh
# Performance validation
./scripts/benchmarks/run-benchmarks-local.sh
# Security validation
./scripts/security/security-scan.sh
```
## Test Data Management
### Test Users
- **admin_user**: Full access, can see all jobs
- **researcher1**: Can create and view own jobs
- **analyst1**: Read-only access, cannot create jobs
### Test Jobs
- **test-job**: Basic job for testing
- **research-job**: Research-specific job
- **analysis-job**: Analysis-specific job
## Troubleshooting
### Common Issues
#### Authentication Failures
- Check API key configuration
- Verify server is running with auth enabled
- Check YAML config syntax
#### Container Issues
- Verify Docker daemon is running
- Check port conflicts
- Review container logs
#### Performance Issues
- Monitor resource usage
- Check for memory leaks
- Verify database connections
### Debug Commands
```bash
# Check container logs
docker logs ml-prod-api
# Check system resources
docker stats
# Verify network connectivity
docker network ls
```
## Test Results Documentation
All test results should be documented in:
- `test-results/` directory
- Performance benchmarks
- Integration test reports
- Security scan results
## Maintenance
### Regular Tasks
- Update test data periodically
- Review and update test cases
- Maintain test infrastructure
- Monitor test performance
### Test Environment
- Keep test environment isolated
- Use consistent test data
- Regular cleanup of test artifacts
- Monitor test resource usage

View file

@ -1,10 +1,62 @@
# Testing Guide
Comprehensive testing documentation for FetchML platform.
Comprehensive testing documentation for the FetchML platform with integrated monitoring.
## Quick Start Testing
For a fast 5-minute testing experience, see the **[Quick Start Testing Guide](quick-start-testing.md)**.
### 5-Minute Fast Test
```bash
# Clean environment
make self-cleanup
# Start development stack with monitoring
make dev-up
# Quick authentication test
make test-auth
# Clean up
make dev-down
```
**Expected Results**:
- Admin user: Full access, shows all jobs
- Researcher user: Own jobs only
- Analyst user: Read-only access
## Test Environment Setup
### Development Environment with Monitoring
```bash
# Start development stack with monitoring
make dev-up
# Verify all services are running
make dev-status
# Run tests against running services
make test
# Check monitoring during tests
# Grafana: http://localhost:3000 (admin/admin123)
```
### Test Environment Verification
```bash
# Verify API server
curl -f http://localhost:8080/health
# Verify monitoring services
curl -f http://localhost:3000/api/health
curl -f http://localhost:9090/api/v1/query?query=up
curl -f http://localhost:3100/ready
# Verify Redis
docker exec ml-experiments-redis redis-cli ping
```
## Test Types
@ -12,6 +64,9 @@ For a fast 5-minute testing experience, see the **[Quick Start Testing Guide](qu
```bash
make test-unit # Go unit tests only
cd cli && zig build test # Zig CLI tests
# Unit tests live under tests/unit/ (including tests that cover internal/ packages)
go test ./tests/unit/...
```
### Integration Tests
@ -22,6 +77,9 @@ make test-integration # API and database integration
### End-to-End Tests
```bash
make test-e2e # Full workflow testing
# Podman E2E is opt-in because it builds/runs containers
FETCH_ML_E2E_PODMAN=1 go test ./tests/e2e/... # Enables TestPodmanIntegration
```
### All Tests
@ -30,66 +88,308 @@ make test # Run complete test suite
make test-coverage # With coverage report
```
## Docker Testing
## Deployment-Specific Testing
### Development Environment Testing
### Development Environment
```bash
docker-compose up -d
# Start dev stack
make dev-up
# Run tests with monitoring
make test
docker-compose down
# View test results in Grafana
# Load Test Performance dashboard
# System Health dashboard
```
### Production Environment Testing
```bash
docker-compose -f docker-compose.prod.yml up -d
cd deployments
make prod-up
# Test production deployment
make test-auth # Multi-user auth test
make self-cleanup # Clean up after testing
# Verify production monitoring
curl -f https://your-domain.com/health
```
### Homelab Secure Testing
```bash
cd deployments
make homelab-up
# Test secure deployment
make test-auth
make test-ssl # SSL/TLS testing
```
## Authentication Testing Protocol
### Multi-User Authentication
```bash
# Test admin user
cp ~/.ml/config-admin.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows admin status and all jobs
# Test researcher user
cp ~/.ml/config-researcher.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows researcher status and own jobs only
# Test analyst user
cp ~/.ml/config-analyst.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows analyst status, read-only access
```
### API Key Validation
```bash
# Test invalid API key
echo "invalid_key" > ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Authentication failed error
# Test missing API key
rm ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: API key not configured error
```
### Job Queueing by Role
```bash
# Admin can queue jobs
cp ~/.ml/config-admin.toml ~/.ml/config.toml
echo "admin job" | ./cli/zig-out/bin/ml queue admin-test
# Researcher can queue jobs
cp ~/.ml/config-researcher.toml ~/.ml/config.toml
echo "research job" | ./cli/zig-out/bin/ml queue research-test
# Analyst cannot queue jobs (should fail)
cp ~/.ml/config-analyst.toml ~/.ml/config.toml
echo "analysis job" | ./cli/zig-out/bin/ml queue analysis-test
# Expected: Permission denied error
```
## Performance Testing
### Load Testing with Monitoring
```bash
# Start monitoring
make dev-up
# Run load tests
make load-test
# Monitor performance in real-time
# Grafana: http://localhost:3000
# Check: Request rates, response times, error rates
```
### Benchmark Suite
```bash
./scripts/benchmarks/run-benchmarks-local.sh
```
### Load Testing
```bash
make test-load # API load testing
```
## Authentication Testing
Multi-user authentication testing is fully covered in the **[Quick Start Testing Guide](quick-start-testing.md)**.
```bash
make test-auth # Quick auth role testing
# Run benchmarks with performance tracking
./scripts/track_performance.sh
# Or run directly
make benchmark-local
# View results in Grafana dashboards
```
### Performance Monitoring During Tests
**Key Metrics to Watch**:
- API response times (95th percentile)
- Error rates (should be < 1%)
- Memory usage trends
- CPU utilization
- Request throughput
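The metrics above can be watched as PromQL queries in Grafana panels or the Prometheus UI. A sketch of the kind of queries involved is below; the metric and label names (`http_request_duration_seconds_bucket`, `http_requests_total`, `status`) follow common Prometheus conventions and are assumptions, not the platform's actual metric names.

```promql
# 95th percentile API response time over the last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate as a fraction of all requests (should stay < 0.01)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
```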
## CLI Testing
### Build and Test CLI
```bash
cd cli && zig build dev
./cli/zig-out/dev/ml --help
cd cli && zig build --release=fast
./zig-out/bin/ml --help
zig build test
```
### CLI Integration Tests
```bash
make test-cli # CLI-specific integration tests
```
## Docker Container Testing
### Container Startup
```bash
# Start environment
make dev-up
# Check container status
docker ps --filter "name=ml-"
# Expected: All containers running and healthy
```
### Port Accessibility
```bash
# Test API server port
curl -I http://localhost:8080/health
# Expected: 200 OK response
# Test metrics port
curl -I http://localhost:9100/metrics
# Expected: 200 OK response
# Test monitoring ports
curl -I http://localhost:3000/api/health
curl -I "http://localhost:9090/api/v1/query?query=up"
```
### Container Cleanup
```bash
# Test cleanup script
make self-cleanup
make dev-down
# Expected: Containers stopped and removed
```
## Monitoring During Testing
### Real-Time Monitoring
```bash
# Access Grafana during tests
# Grafana: http://localhost:3000 (admin/admin123)
# Key Dashboards:
# - Load Test Performance: Request metrics, response times
# - System Health: Service status, resource usage
# - Log Analysis: Error logs, service logs
```
### Test Metrics Collection
**Automatically Collected**:
- HTTP request metrics
- Response time histograms
- Error counters
- Resource utilization
- Log aggregation
**Manual Test Markers**:
```bash
# Mark test start in logs
echo "TEST_START: $(date)" | tee -a /logs/test.log
# Mark test completion
echo "TEST_END: $(date)" | tee -a /logs/test.log
```
## Test Execution Protocol
### Phase 1: Environment Setup
1. Clean up any existing containers
2. Start fresh Docker environment with monitoring
3. Verify all services are running
### Phase 2: Authentication Testing
1. Test all user roles (admin, researcher, analyst)
2. Test invalid authentication scenarios
3. Verify role-based permissions
### Phase 3: Functional Testing
1. Test CLI commands (queue, status)
2. Test API endpoints
3. Test WebSocket communication
### Phase 4: Integration Testing
1. Test complete workflows
2. Test error scenarios
3. Test cleanup procedures
### Phase 5: Performance Testing
1. Run benchmarks
2. Perform load testing
3. Validate performance metrics
## Troubleshooting Tests
### Common Issues
- **Server not running**: Check with `docker ps --filter "name=ml-"`
- **Authentication failures**: Verify configs in `~/.ml/config-*.toml`
- **Connection issues**: Test API with `curl -I http://localhost:9103/health`
**Server Not Running**:
```bash
# Check service status
make dev-status
# Check container logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana
```
**Authentication Failures**:
```bash
# Verify configs
ls ~/.ml/config-*.toml
# Check API health
curl -I http://localhost:8080/health
# Monitor auth logs in Grafana
# Log Analysis dashboard -> filter "auth"
```
**Performance Issues**:
```bash
# Check resource usage in Grafana
# System Health dashboard
# Check API response times
# Load Test Performance dashboard
# Identify bottlenecks
# Prometheus: http://localhost:9090
```
**Monitoring Issues**:
```bash
# Re-setup monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py
# Restart Grafana
docker restart ml-experiments-grafana
# Check datasource connectivity
# Grafana -> Configuration -> Data Sources
```
### Debug Mode
```bash
make test-debug # Run tests with verbose output
# Enable debug logging
export LOG_LEVEL=debug
make test
# Monitor debug logs
# Grafana Log Analysis dashboard
```
## Test Configuration
@ -103,12 +403,27 @@ make test-debug # Run tests with verbose output
- `tests/fixtures/` - Test data and examples
- `tests/benchmarks/` - Performance test data
### Monitoring Configuration
- `monitoring/grafana/provisioning/` - Auto-provisioned datasources
- `monitoring/grafana/dashboards/` - Auto-provisioned dashboards
- `monitoring/prometheus/prometheus.yml` - Metrics collection
## Continuous Integration
### CI Pipeline with Monitoring
Tests run automatically on:
- Pull requests (full suite)
- Main branch commits (unit + integration)
- Releases (full suite + benchmarks)
- **Pull requests**: Full suite + performance benchmarks
- **Main branch**: Unit + integration tests
- **Releases**: Full suite + benchmarks + security scans
### CI Monitoring
During CI runs:
- Performance metrics collected
- Test results tracked in Grafana
- Regression detection
- Automated alerts on failures
## Writing Tests
@ -117,13 +432,120 @@ Tests run automatically on:
- Integration tests: `tests/e2e/` directory
- Benchmark tests: `tests/benchmarks/` directory
### Test with Monitoring
```go
// Add custom metrics for tests
var testRequests = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "test_requests_total",
Help: "Total number of test requests",
},
[]string{"method", "status"},
)
// Log test events for monitoring
log.WithFields(log.Fields{
"test": "integration_test",
"operation": "api_call",
"status": "success",
}).Info("Test operation completed")
```
### Zig Tests
- CLI tests: `cli/tests/` directory
- Follow Zig testing conventions
## See Also
## Test Result Analysis
- **[Quick Start Testing Guide](quick-start-testing.md)** - Fast 5-minute testing
- **[Testing Protocol](testing-protocol.md)** - Detailed testing procedures
- **[Configuration Reference](configuration-reference.md)** - Test setup
- **[Troubleshooting](troubleshooting.md)** - Common issues
### Grafana Dashboard Analysis
**Load Test Performance**:
- Request rates over time
- Response time percentiles
- Error rate trends
- Throughput metrics
**System Health**:
- Service availability
- Resource utilization
- Memory usage patterns
- CPU consumption
**Log Analysis**:
- Error patterns
- Warning frequency
- Service log aggregation
- Debug information
### Performance Regression Detection
```bash
# Track performance over time
./scripts/track_performance.sh
# Compare with baseline
# Grafana: Compare current run with historical data
# Alert on regressions
# Set up Grafana alerts for performance degradation
```
## Test Cleanup
### Automated Cleanup
```bash
# Clean up test data
make self-cleanup
# Clean up Docker resources
make clean-all
# Reset monitoring data
docker volume rm monitoring_prometheus_data
docker volume rm monitoring_grafana_data
```
### Manual Cleanup
```bash
# Stop test environment
make dev-down
# Remove test artifacts
rm -rf ~/.ml/config-*.toml
rm -rf test-results/
```
## Expected Test Results
### Admin User
```
Status retrieved for user: admin_user (admin: true)
Tasks: X total, X queued, X running, X failed, X completed
```
### Researcher User
```
Status retrieved for user: researcher1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
```
### Analyst User
```
Status retrieved for user: analyst1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
```
## Common Commands Reference
```bash
make help # Show all commands
make test-auth # Quick auth test
make self-cleanup # Clean environment
make test-status # Check system status
make dev-up # Start dev environment
make dev-down # Stop dev environment
make dev-status # Check dev status
```

View file

@ -7,41 +7,36 @@ Common issues and solutions for Fetch ML.
### Services Not Starting
```bash
# Check Docker status
docker-compose ps
# Check container status
docker ps --filter "name=ml-"
# Restart services
docker-compose down && docker-compose up -d  # testing only
# Check logs
docker-compose logs -f
# Restart development stack
make dev-down
make dev-up
```
### API Not Responding
```bash
# Check health endpoint
curl http://localhost:9101/health
curl http://localhost:8080/health
# Check if port is in use
lsof -i :9101
lsof -i :8080
lsof -i :8443
# Kill process on port
kill -9 $(lsof -ti :9101)
kill -9 $(lsof -ti :8080)
```
### Database Issues
### Database / Redis Issues
```bash
# Check database connection
docker-compose exec postgres psql -U postgres -d fetch_ml
# Check Redis from container
docker exec ml-experiments-redis redis-cli ping
# Reset database
docker-compose down postgres
docker-compose up -d postgres  # testing only
# Check Redis
docker-compose exec redis redis-cli ping
# Check API can reach database (via health endpoint)
curl -f http://localhost:8080/health || echo "API not healthy"
```
## Common Errors
@ -53,7 +48,7 @@ docker-compose exec redis redis-cli ping
### Database Errors
- **Connection failed**: Verify database type and connection params
- **No such table**: Run migrations with `--migrate` (see [Development Setup](development-setup.md))
- **No such table**: Run migrations with `--migrate` (see [Quick Start](quick-start.md))
### Container Errors
- **Runtime not found**: Set `runtime: docker` in config (testing only)
@ -65,15 +60,15 @@ docker-compose exec redis redis-cli ping
## Development Issues
- **Build fails**: `go mod tidy` and `cd cli && rm -rf zig-out zig-cache`
- **Tests fail**: Start test dependencies with `docker-compose up -d` or `make test-auth`
- **Tests fail**: Ensure dev stack is running with `make dev-up` or use `make test-auth`
## CLI Issues
- **Not found**: `cd cli && zig build dev`
- **Not found**: `cd cli && zig build --release=fast`
- **Connection errors**: Check `--server` and `--api-key`
## Network Issues
- **Port conflicts**: `lsof -i :9101` and kill processes
- **Firewall**: Allow ports 9101, 6379, 5432
- **Port conflicts**: `lsof -i :8080` / `lsof -i :8443` and kill processes
- **Firewall**: Allow ports 8080, 8443, 6379, 5432
## Configuration Issues
- **Invalid YAML**: `python3 -c "import yaml; yaml.safe_load(open('config.yaml'))"`
@ -82,13 +77,19 @@ docker-compose exec redis redis-cli ping
## Debug Information
```bash
./bin/api-server --version
docker-compose ps
docker-compose logs api-server | grep ERROR
docker ps --filter "name=ml-"
docker logs ml-experiments-api | grep ERROR
```
## Emergency Reset
```bash
docker-compose down -v
# Stop and remove all dev containers and volumes
make dev-down
docker volume prune
# Remove local data if needed
rm -rf data/ results/ *.db
docker-compose up -d  # testing only
# Start fresh dev stack
make dev-up
```

101
docs/src/validate.md Normal file
View file

@ -0,0 +1,101 @@
---
layout: page
title: "Validation (ml validate)"
permalink: /validate/
---
# Validation (`ml validate`)
The `ml validate` command verifies experiment integrity and provenance.
It can be run against:
- A **commit id** (validates the experiment tree + dependency manifest)
- A **task id** (additionally validates the run's `run_manifest.json` provenance and lifecycle)
## CLI usage
```bash
# Validate by commit
ml validate <commit_id> [--json] [--verbose]
# Validate by task
ml validate --task <task_id> [--json] [--verbose]
```
### Output modes
- Default (human): prints a summary with `errors`, `warnings`, and `failed_checks`.
- `--verbose`: prints all checks under `checks` and includes `expected/actual/details` when present.
- `--json`: prints the raw JSON payload.
## Report shape
The API returns a JSON report of the form:
- `ok`: overall boolean
- `commit_id`: commit being validated (if known)
- `task_id`: task being validated (when validating by task)
- `checks`: map of check name → `{ ok, expected?, actual?, details? }`
- `errors`: list of high-level failures
- `warnings`: list of non-fatal issues
- `ts`: UTC timestamp
## Check semantics
- For **task statuses** `running`, `completed`, or `failed`, run-manifest issues are treated as **errors**.
- For **queued/pending** tasks, run-manifest issues are usually **warnings** (the job may not have started yet).
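The status-to-severity rule above can be sketched as a small function. This is illustrative only (the status strings mirror the docs; the function is not the server's code):

```go
package main

import "fmt"

// severityForRunManifestIssue applies the documented rule: run-manifest
// problems are errors for tasks that have (or should have) started, and
// warnings for tasks still waiting in the queue.
func severityForRunManifestIssue(taskStatus string) string {
	switch taskStatus {
	case "running", "completed", "failed":
		return "error"
	case "queued", "pending":
		return "warning"
	default:
		return "warning"
	}
}

func main() {
	fmt.Println(severityForRunManifestIssue("running")) // error
	fmt.Println(severityForRunManifestIssue("queued"))  // warning
}
```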
## Notable checks
### Experiment integrity
- `experiment_manifest`: validates the experiment manifest (content-addressed integrity)
- `deps_manifest`: validates that a dependency manifest exists and can be hashed
- `expected_manifest_overall_sha`: compares the task's recorded manifest SHA to the current manifest SHA
- `expected_deps_manifest`: compares the task's recorded deps manifest name/SHA to what exists on disk
### Run manifest provenance (task validation)
- `run_manifest`: whether `run_manifest.json` could be found and loaded
- `run_manifest_location`: verifies the manifest was found in the expected bucket:
- `pending` for queued/pending
- `running` for running
- `finished` for completed
- `failed` for failed
- `run_manifest_task_id`: task id match
- `run_manifest_commit_id`: commit id match
- `run_manifest_deps`: deps manifest name/SHA match
- `run_manifest_snapshot_id`: snapshot id match (when snapshot is part of the task)
- `run_manifest_snapshot_sha256`: snapshot sha256 match (when snapshot sha is recorded)
### Run manifest lifecycle (task validation)
- `run_manifest_lifecycle`:
- `running`: must have `started_at`, must not have `ended_at`/`exit_code`
- `completed`/`failed`: must have `started_at`, `ended_at`, `exit_code`, and `ended_at >= started_at`
- `queued`/`pending`: must not have `ended_at`/`exit_code`
## Example report (task validation)
```json
{
"ok": false,
"commit_id": "6161616161616161616161616161616161616161",
"task_id": "task-run-manifest-location-mismatch",
"checks": {
"experiment_manifest": {"ok": true},
"deps_manifest": {"ok": true, "actual": "requirements.txt:..."},
"run_manifest": {"ok": true},
"run_manifest_location": {
"ok": false,
"expected": "running",
"actual": "finished"
}
},
"errors": [
"run manifest location mismatch"
],
"ts": "2025-12-17T18:43:00Z"
}
```

View file

@ -47,6 +47,8 @@ export FETCH_ML_CLI_API_KEY="prod-key"
- `ml dataset list` list datasets
- `ml monitor` launch TUI over SSH (remote UI)
`ml status --json` may include an optional `prewarm` field when the worker is prefetching datasets for the next queued task.
## Build flavors
- `make all` ReleaseSmall (default)