docs(dev): document validate workflow, CLI/TUI UX contract, and consolidate dev/testing docs

Jeremie Fraeys 2026-01-05 12:37:46 -05:00
parent 8157f73a70
commit 1aed78839b
12 changed files with 1119 additions and 646 deletions

View file

@ -95,7 +95,7 @@ make test
go test ./internal/queue/...
# Build CLI
cd cli && zig build dev
cd cli && zig build --release=fast
# Run formatters and linters
make lint

View file

@ -34,6 +34,9 @@ High-performance command-line interface for experiment management, written in Zi
| `cancel` | Cancel running job | `ml cancel job123` |
| `prune` | Clean up old experiments | `ml prune --keep 10` |
| `watch` | Auto-sync directory on changes | `ml watch ./project --queue` |
| `jupyter` | Manage Jupyter notebook services | `ml jupyter start --name my-nb` |
| `validate` | Validate provenance/integrity for a commit or task | `ml validate <commit_id> --verbose` |
| `info` | Show run info from `run_manifest.json` | `ml info <run_dir>` |
### Command Details
@ -78,6 +81,9 @@ ml queue my-job --commit abc123 --priority 8
- Priority queuing system
- API key authentication
**Notes:**
- Tasks support optional `snapshot_id` and `dataset_specs` fields server-side (for provenance and dataset resolution).
#### `watch` - Auto-Sync Monitoring
```bash
# Watch directory for changes
@ -108,12 +114,78 @@ ml monitor
```
Launches the TUI over SSH for real-time monitoring.
#### `status` - System Status
`ml status --json` returns a JSON object including an optional `prewarm` field when worker prewarming is active:
```json
{
"prewarm": [
{
"worker_id": "worker-1",
"task_id": "<task-id>",
"started_at": "2025-01-01T00:00:00Z",
"updated_at": "2025-01-01T00:00:05Z",
"phase": "datasets",
"dataset_count": 2
}
]
}
```
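Scripts can inspect the optional `prewarm` array with `jq` (already used elsewhere in this guide). A minimal sketch against a canned sample; the field names are taken from the example above, and the sample value itself is hypothetical:

```shell
# Hypothetical sample of `ml status --json` output; field names follow the example above.
status_json='{"prewarm":[{"worker_id":"worker-1","task_id":"t-1","phase":"datasets","dataset_count":2}]}'

# Print each prewarming worker and its current phase; prints nothing if `prewarm` is absent.
echo "$status_json" | jq -r '.prewarm[]? | "\(.worker_id) \(.phase)"'
```

The `[]?` iterator makes the pipeline a no-op when no prewarming is active, so the same command is safe to run unconditionally.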
#### `cancel` - Job Cancellation
```bash
ml cancel running-job-id
```
Cancels a currently running job by ID.
#### `jupyter` - Jupyter Notebook Management
Manage Jupyter notebook services via WebSocket protocol.
```bash
# Start a Jupyter service
ml jupyter start --name my-notebook --workspace /path/to/workspace
# Start with password protection
ml jupyter start --name my-notebook --workspace /path/to/workspace --password mypass
# List running services
ml jupyter list
# Stop a service
ml jupyter stop service-id-12345
# Check service status
ml jupyter status
```
**Features:**
- WebSocket-based binary protocol for low latency
- Secure API key authentication (SHA256 hashed)
- Real-time service management
- Workspace isolation
**Common Use Cases:**
```bash
# Development workflow
ml jupyter start --name dev-notebook --workspace ./notebooks
# ... do development work ...
ml jupyter stop dev-service-123
# Team collaboration
ml jupyter start --name team-analysis --workspace /shared/analysis --password teampass
# Multiple services
ml jupyter list # View all running services
```
**Security:**
- API keys are hashed before transmission
- Password protection for notebooks
- Workspace path validation
- Service ID-based authorization
### Configuration
The Zig CLI reads configuration from `~/.ml/config.toml`:
@ -144,7 +216,7 @@ Main HTTPS API server for experiment management.
go run ./cmd/api-server/main.go
# With configuration
./bin/api-server --config configs/config-local.yaml
./bin/api-server --config configs/api/dev.yaml
```
**Features:**
@ -160,9 +232,6 @@ Terminal User Interface for monitoring experiments.
```bash
# Launch TUI
go run ./cmd/tui/main.go
# With custom config
./tui --config configs/config-local.yaml
```
**Features:**
@ -187,10 +256,10 @@ Configuration validation and linting tool.
```bash
# Validate configuration
./configlint configs/config-local.yaml
./configlint configs/api/dev.yaml
# Check schema compliance
./configlint --schema configs/schema/config_schema.yaml
./configlint --schema configs/schema/api_server_config.yaml
```
## Management Script (`./tools/manage.sh`)
@ -208,39 +277,22 @@ Simple service management for your homelab.
./tools/manage.sh cleanup # Clean project artifacts
```
## Setup Script (`./setup.sh`)
One-command homelab setup.
### Usage
```bash
# Full setup
./setup.sh
# Setup includes:
# - SSL certificate generation
# - Configuration creation
# - Build all components
# - Start Redis
# - Setup Fail2Ban (if available)
```
## API Testing
Test the API with curl:
```bash
# Health check
curl -k -H 'X-API-Key: password' https://localhost:9101/health
curl -f http://localhost:8080/health
# List experiments
curl -k -H 'X-API-Key: password' https://localhost:9101/experiments
curl -H 'X-API-Key: password' http://localhost:8080/experiments
# Submit experiment
curl -k -X POST -H 'X-API-Key: password' \
curl -X POST -H 'X-API-Key: password' \
-H 'Content-Type: application/json' \
-d '{"name":"test","config":{"type":"basic"}}' \
https://localhost:9101/experiments
http://localhost:8080/experiments
```
## Zig CLI Architecture
@ -269,7 +321,7 @@ The Zig CLI is designed for performance and reliability:
## Configuration
Main configuration file: `configs/config-local.yaml`
Main configuration file: `configs/api/dev.yaml`
### Key Settings
```yaml
@ -277,14 +329,14 @@ auth:
enabled: true
api_keys:
dev_user:
hash: "2baf1f40105d9501fe319a8ec463fdf4325a2a5df445adf3f572f626253678c9"
hash: "CHANGE_ME_SHA256_DEV_USER_KEY"
admin: true
roles:
- admin
permissions:
'*': true
researcher_user:
hash: "ef92b778ba7a6c8f2150019a5678047b6a9a2b95cef8189518f9b35c54d2e3ae"
hash: "CHANGE_ME_SHA256_RESEARCHER_USER_KEY"
admin: false
roles:
- researcher
@ -385,8 +437,8 @@ telnet worker.local 9100
**Authentication failed:**
```bash
# Check API key in config-local.yaml
grep -A 5 "api_keys:" configs/config-local.yaml
# Check API key in config
grep -A 5 "api_keys:" configs/api/dev.yaml
```
**Redis connection failed:**

View file

@ -0,0 +1,337 @@
# FetchML CLI/TUI UX Contract v1
This document defines the user experience contract for FetchML v1, focusing on clean, predictable CLI/TUI interactions without mode flags.
## Core Principles
1. **Thin CLI**: Local CLI does minimal validation; authoritative checks happen server-side
2. **No Mode Flags**: Commands do what they say; no `--mode` or similar flags
3. **Predictable Defaults**: Sensible defaults that work for most use cases
4. **Graceful Degradation**: JSON output for automation, human-friendly output for interactive use
5. **Explicit Operations**: `--dry-run`, `--validate`, `--explain` are explicit, not implied
## Commands v1
### Core Workflow Commands
#### `ml queue <job-name> [options]`
Submit a job for execution.
**Basic Usage:**
```bash
ml queue my-experiment
```
**Options:**
- `--commit <sha>`: Specify commit ID (default: current git HEAD)
- `--priority <1-10>`: Job priority (default: 5)
- `--cpu <cores>`: CPU cores requested (default: 2)
- `--memory <gb>`: Memory in GB (default: 8)
- `--gpu <count>`: GPU count (default: 0)
- `--gpu-memory <gb>`: GPU memory budget (default: auto)
**Dry Run:**
```bash
ml queue my-experiment --dry-run
# Output: JSON with what would be submitted, validation results
```
**Validate Only:**
```bash
ml queue my-experiment --validate
# Output: Validation results without submitting
```
**Explain:**
```bash
ml queue my-experiment --explain
# Output: Human-readable explanation of what will happen
```
**JSON Output:**
When using `--json`, the response may include a `prewarm` field describing best-effort worker prewarming activity (e.g. dataset prefetch for the next queued task).
```bash
ml queue my-experiment --json
# Output: Structured JSON response
```
#### `ml status [job-name]`
Show job status.
**Basic Usage:**
```bash
ml status # All jobs summary
ml status my-experiment # Specific job details
```
**Options:**
- `--json`: JSON output
- `--watch`: Watch mode (refresh every 2s)
- `--limit <n>`: Limit number of jobs shown (default: 20)
#### `ml cancel <job-name>`
Cancel a running or queued job.
**Basic Usage:**
```bash
ml cancel my-experiment
```
**Options:**
- `--force`: Force cancel even if running
- `--json`: JSON output
### Experiment Management
#### `ml experiment init <name>`
Initialize a new experiment directory.
**Basic Usage:**
```bash
ml experiment init my-project
```
**Options:**
- `--template <name>`: Use experiment template
- `--dry-run`: Show what would be created
#### `ml experiment list`
List available experiments.
**Options:**
- `--json`: JSON output
- `--limit <n>`: Limit results
#### `ml experiment show <commit-id>`
Show experiment details.
**Options:**
- `--json`: JSON output
- `--manifest`: Show content integrity manifest
### Dataset Management
#### `ml dataset list`
List available datasets.
**Options:**
- `--json`: JSON output
- `--synced-only`: Show only synced datasets
#### `ml dataset sync <dataset-name>`
Sync a dataset from NAS to ML server.
**Options:**
- `--dry-run`: Show what would be synced
- `--validate`: Validate dataset integrity after sync
### Monitoring & TUI
#### `ml monitor`
Launch TUI for real-time monitoring (runs over SSH).
**Basic Usage:**
```bash
ml monitor
```
**TUI Controls:**
- `Ctrl+C`: Exit TUI
- `q`: Quit
- `r`: Refresh
- `j/k`: Navigate jobs
- `Enter`: Job details
- `c`: Cancel selected job
#### `ml watch <job-name>`
Watch a specific job's output.
**Options:**
- `--follow`: Follow log output (default)
- `--tail <n>`: Show last n lines
## Global Options
These options work with any command:
- `--json`: Output structured JSON instead of human-readable format
- `--config <path>`: Use custom config file (default: ~/.ml/config.toml)
- `--verbose`: Verbose output
- `--quiet`: Minimal output
- `--help`: Show help for command
## Defaults Configuration
### Default Job Resources
```toml
[defaults]
cpu = 2 # CPU cores
memory = 8 # GB
gpu = 0 # GPU count
gpu_memory = "auto" # Auto-detect or specify GB
priority = 5 # Job priority (1-10)
```
### Default Behavior
- **Commit ID**: Current git HEAD (requires a clean working directory)
- **Working Directory**: Current directory for experiment files
- **Output**: Human-readable format unless `--json` specified
- **Validation**: Server-side authoritative validation
## Error Handling
### Exit Codes
- `0`: Success
- `1`: General error
- `2`: Invalid arguments
- `3`: Validation failed
- `4`: Network/connection error
- `5`: Server error
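Wrapper scripts can branch on these codes. A purely illustrative helper (it does not invoke the CLI; the mapping mirrors the table above):

```shell
# Map an `ml` exit code to the meaning listed above.
explain_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "general error" ;;
    2) echo "invalid arguments" ;;
    3) echo "validation failed" ;;
    4) echo "network/connection error" ;;
    5) echo "server error" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# Usage in a wrapper script:
#   ml queue my-experiment --validate
#   explain_exit $?
```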
### Error Output Format
**Human-readable:**
```
Error: Experiment validation failed
- Missing dependency manifest (environment.yml, poetry.lock, pyproject.toml, or requirements.txt)
- Train script not found: train.py
```
**JSON:**
```json
{
"error": "validation_failed",
"message": "Experiment validation failed",
"details": [
{"field": "dependency_manifest", "error": "missing", "supported": ["environment.yml", "poetry.lock", "pyproject.toml", "requirements.txt"]},
{"field": "train_script", "error": "not_found", "expected": "train.py"}
]
}
```
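In automation, the `details` array can be flattened for logging with `jq` (a sketch assuming the schema above; the sample response is hypothetical):

```shell
# Hypothetical error response matching the JSON schema above.
err_json='{"error":"validation_failed","message":"Experiment validation failed","details":[{"field":"train_script","error":"not_found","expected":"train.py"}]}'

# Flatten the details array into one "field: error" line each, e.g. for CI logs.
echo "$err_json" | jq -r '.details[] | "\(.field): \(.error)"'
```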
## Ctrl+C Semantics
### Command Cancellation
- **Ctrl+C during `ml queue --dry-run`**: Immediate exit, no side effects
- **Ctrl+C during `ml queue`**: Attempt to cancel submission, show status
- **Ctrl+C during `ml status --watch`**: Exit watch mode
- **Ctrl+C during `ml monitor`**: Gracefully exit TUI
- **Ctrl+C during `ml watch`**: Stop following logs, show final status
### Graceful Shutdown
1. Signal interrupt to server (if applicable)
2. Clean up local resources
3. Display current status
4. Exit with appropriate code
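The sequence above can be sketched for a shell wrapper around a long-running command; the real CLI handles this internally, so this is only a shape to follow in your own scripts:

```shell
#!/bin/sh
# Sketch of the graceful-shutdown sequence above.
cleanup() {
  echo "interrupted: cleaning up"   # steps 1-3: notify server, clean up, show status
  exit 130                          # step 4: conventional SIGINT exit code (128 + 2)
}
trap cleanup INT
```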
## JSON Output Schema
### Job Submission Response
```json
{
"job_id": "uuid-string",
"job_name": "my-experiment",
"status": "queued",
"commit_id": "abc123...",
"submitted_at": "2025-01-01T12:00:00Z",
"estimated_start": "2025-01-01T12:05:00Z",
"resources": {
"cpu": 2,
"memory_gb": 8,
"gpu": 1,
"gpu_memory_gb": 16
}
}
```
### Status Response
```json
{
"jobs": [
{
"job_id": "uuid-string",
"job_name": "my-experiment",
"status": "running",
"progress": 0.75,
"started_at": "2025-01-01T12:05:00Z",
"estimated_completion": "2025-01-01T12:30:00Z",
"node": "worker-01"
}
],
"total": 1,
"showing": 1
}
```
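A polling script can derive a "still running" flag from this shape with `jq` (assumed field names from the schema above; the sample response is hypothetical):

```shell
# Hypothetical status response matching the schema above.
status_json='{"jobs":[{"job_id":"u1","job_name":"my-experiment","status":"running"}],"total":1,"showing":1}'

# Count jobs that have not reached a terminal state (completed/failed).
echo "$status_json" | jq '[.jobs[] | select(.status != "completed" and .status != "failed")] | length'
```

A zero result means every listed job has finished, which is a convenient loop-exit condition.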
## Examples
### Typical Workflow
```bash
# 1. Initialize experiment
ml experiment init my-project
cd my-project
# 2. Validate experiment (authoritative checks run server-side; nothing is submitted)
ml queue . --validate
# 3. Submit job
ml queue . --priority 8 --gpu 1
# 4. Monitor progress
ml status .
ml watch .
# 5. Check results
ml status . --json
```
### Automation Script
```bash
#!/bin/bash
# Submit job and wait for completion
JOB_ID=$(ml queue my-experiment --json | jq -r '.job_id')
echo "Submitted job: $JOB_ID"
# Wait for completion
while true; do
STATUS=$(ml status "$JOB_ID" --json | jq -r '.jobs[0].status')
echo "Status: $STATUS"
if [[ "$STATUS" == "completed" || "$STATUS" == "failed" ]]; then
break
fi
sleep 10
done
ml status "$JOB_ID"
```
## Implementation Notes
### Server-side Validation
- CLI performs minimal local checks (git status, file existence)
- All authoritative validation happens on worker
- Validation failures are propagated back to CLI with clear error messages
### Trust Contract Integration
- Every job submission includes commit ID and content integrity manifest
- Worker validates both before execution
- Any mismatch causes hard-fail with detailed error reporting
### Resource Management
- Resource requests are validated against available capacity
- Jobs are queued based on priority and resource availability
- Resource usage is tracked and reported in status
## Future Extensions
The v1 contract is intentionally minimal but designed for extension:
- **v1.1**: Add job dependencies and workflows
- **v1.2**: Add experiment templates and scaffolding
- **v1.3**: Add distributed execution across multiple workers
- **v2.0**: Add advanced scheduling and resource optimization
All extensions will maintain backward compatibility with the v1 contract.

View file

@ -7,14 +7,14 @@ This document provides a comprehensive reference for all configuration options i
## Environment Configurations
### Local Development
**File:** `configs/environments/config-local.yaml`
**File:** `configs/api/dev.yaml`
```yaml
auth:
enabled: true
apikeys:
api_keys:
dev_user:
hash: "2baf1f40105d9501fe319a8ec463fdf4325a2a5df445adf3f572f626253678c9"
hash: "CHANGE_ME_SHA256_DEV_USER_KEY"
admin: true
roles: ["admin"]
permissions:
@ -35,14 +35,14 @@ security:
```
### Multi-User Setup
**File:** `configs/environments/config-multi-user.yaml`
**File:** `configs/api/multi-user.yaml`
```yaml
auth:
enabled: true
apikeys:
api_keys:
admin_user:
hash: "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8"
hash: "CHANGE_ME_SHA256_ADMIN_USER_KEY"
admin: true
roles: ["user", "admin"]
permissions:
@ -51,7 +51,7 @@ auth:
delete: true
researcher1:
hash: "ef92b778ba7a6c8f2150019a5678047b6a9a2b95cef8189518f9b35c54d2e3ae"
hash: "CHANGE_ME_SHA256_RESEARCHER1_KEY"
admin: false
roles: ["user", "researcher"]
permissions:
@ -61,7 +61,7 @@ auth:
jobs:delete: false
analyst1:
hash: "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3"
hash: "CHANGE_ME_SHA256_ANALYST1_KEY"
admin: false
roles: ["user", "analyst"]
permissions:
@ -72,12 +72,12 @@ auth:
```
### Production
**File:** `configs/environments/config-prod.yaml`
**File:** `configs/api/prod.yaml`
```yaml
auth:
enabled: true
apikeys:
api_keys:
# Production users configured here
server:
@ -98,13 +98,14 @@ security:
- "10.0.0.0/8"
redis:
url: "redis://redis:6379"
max_connections: 10
addr: "redis:6379"
password: ""
db: 0
logging:
level: "info"
file: "/app/logs/app.log"
audit_file: "/app/logs/audit.log"
audit_log: "/app/logs/audit.log"
```
## Worker Configurations
@ -113,59 +114,80 @@ logging:
**File:** `configs/workers/worker-prod.toml`
```toml
[worker]
name = "production-worker"
id = "worker-prod-1"
worker_id = "worker-prod-01"
base_path = "/data/ml-experiments"
max_workers = 4
[server]
host = "api-server"
port = 9101
api_key = "your-api-key-here"
redis_addr = "localhost:6379"
redis_password = "CHANGE_ME_REDIS_PASSWORD"
redis_db = 0
[execution]
max_concurrent_jobs = 2
timeout_minutes = 60
retry_attempts = 3
host = "localhost"
user = "ml-user"
port = 22
ssh_key = "~/.ssh/id_rsa"
podman_image = "ml-training:latest"
gpu_vendor = "none"
gpu_visible_devices = []
gpu_devices = []
container_workspace = "/workspace"
container_results = "/results"
train_script = "train.py"
[resources]
memory_limit = "4Gi"
cpu_limit = "2"
gpu_enabled = false
max_workers = 4
desired_rps_per_worker = 2
podman_cpus = "4"
podman_memory = "16g"
[storage]
work_dir = "/tmp/fetchml-jobs"
cleanup_interval_minutes = 30
[metrics]
enabled = true
listen_addr = ":9100"
```
```toml
# Production Worker (NVIDIA, UUID-based GPU selection)
worker_id = "worker-prod-01"
base_path = "/data/ml-experiments"
podman_image = "ml-training:latest"
gpu_vendor = "nvidia"
gpu_visible_device_ids = ["GPU-REPLACE_WITH_REAL_UUID"]
gpu_devices = ["/dev/dri"]
container_workspace = "/workspace"
container_results = "/results"
train_script = "train.py"
```
### Docker Worker
**File:** `configs/workers/worker-docker.yaml`
**File:** `configs/workers/docker.yaml`
```yaml
worker:
name: "docker-worker"
id: "worker-docker-1"
worker_id: "docker-worker"
base_path: "/tmp/fetchml-jobs"
train_script: "train.py"
server:
host: "api-server"
port: 9101
api_key: "your-api-key-here"
redis_addr: "redis:6379"
redis_password: ""
redis_db: 0
execution:
max_concurrent_jobs: 1
timeout_minutes: 30
retry_attempts: 3
local_mode: true
resources:
memory_limit: "2Gi"
cpu_limit: "1"
gpu_enabled: false
max_workers: 1
poll_interval_seconds: 5
docker:
podman_image: "python:3.9-slim"
container_workspace: "/workspace"
container_results: "/results"
gpu_devices: []
gpu_vendor: "none"
gpu_visible_devices: []
metrics:
enabled: true
image: "fetchml/worker:latest"
volume_mounts:
- "/tmp:/tmp"
- "/var/run/docker.sock:/var/run/docker.sock"
listen_addr: ":9100"
metrics_flush_interval: "500ms"
```
## CLI Configuration
@ -181,7 +203,7 @@ worker_base = "/app"
worker_port = 22
[auth]
api_key = "your-hashed-api-key"
api_key = "<your-api-key>"
[cli]
default_timeout = 30
@ -199,7 +221,7 @@ worker_base = "/app"
worker_port = 22
[auth]
api_key = "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8"
api_key = "<admin-api-key>"
```
**Researcher Config:** `~/.ml/config-researcher.toml`
@ -211,7 +233,7 @@ worker_base = "/app"
worker_port = 22
[auth]
api_key = "ef92b778ba7a6c8f2150019a5678047b6a9a2b95cef8189518f9b35c54d2e3ae"
api_key = "<researcher-api-key>"
```
**Analyst Config:** `~/.ml/config-analyst.toml`
@ -223,7 +245,7 @@ worker_base = "/app"
worker_port = 22
[auth]
api_key = "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3"
api_key = "<analyst-api-key>"
```
## Configuration Options
@ -298,7 +320,6 @@ api_key = "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3"
|----------|---------|-------------|
| `FETCHML_CONFIG` | - | Path to config file |
| `FETCHML_LOG_LEVEL` | "info" | Override log level |
| `FETCHML_REDIS_URL` | - | Override Redis URL |
| `CLI_CONFIG` | - | Path to CLI config file |
## Troubleshooting
@ -324,7 +345,7 @@ api_key = "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3"
```bash
# Validate server configuration
go run cmd/api-server/main.go --config configs/environments/config-local.yaml --validate
go run cmd/api-server/main.go --config configs/api/dev.yaml --validate
# Test CLI configuration
./cli/zig-out/bin/ml status --debug

View file

@ -0,0 +1,30 @@
# Development Quick Start
This page is the developer-focused entry point for working on FetchML.
## Prerequisites
- Go
- Zig
- Docker / Docker Compose
## Quick setup
```bash
# Clone
git clone https://github.com/jfraeys/fetch_ml.git
cd fetch_ml
# Start dev environment
make dev-up
# Run tests
make test
```
## Next
- See `testing.md` for test workflows.
- See `architecture.md` for system structure.
- See `zig-cli.md` for CLI build details.
- See the repository root `DEVELOPMENT.md` for the full development guide.

View file

@ -1,185 +0,0 @@
# Quick Start Testing Guide
## Overview
This guide provides the fastest way to test the FetchML multi-user authentication system.
## Prerequisites
- Docker and Docker Compose installed
- CLI built: `cd cli && zig build`
- Test configs available in `~/.ml/`
## 5-Minute Test
### 1. Clean Environment
```bash
make self-cleanup
```
### 2. Start Services
```bash
docker-compose -f deployments/docker-compose.prod.yml up -d
```
### 3. Test Authentication
```bash
make test-auth
```
### 4. Check Results
You should see:
- Admin user: Full access, shows all jobs
- Researcher user: Own jobs only
- Analyst user: Read-only access
### 5. Clean Up
```bash
make self-cleanup
```
## Detailed Testing
### Multi-User Authentication Test
```bash
# Test each user role
cp ~/.ml/config-admin.toml ~/.ml/config.toml && ./cli/zig-out/bin/ml status
cp ~/.ml/config-researcher.toml ~/.ml/config.toml && ./cli/zig-out/bin/ml status
cp ~/.ml/config-analyst.toml ~/.ml/config.toml && ./cli/zig-out/bin/ml status
```
### Job Queueing Test
```bash
# Admin can queue jobs
cp ~/.ml/config-admin.toml ~/.ml/config.toml
echo "admin job" | ./cli/zig-out/bin/ml queue admin-test
# Researcher can queue jobs
cp ~/.ml/config-researcher.toml ~/.ml/config.toml
echo "research job" | ./cli/zig-out/bin/ml queue research-test
# Analyst cannot queue jobs (should fail)
cp ~/.ml/config-analyst.toml ~/.ml/config.toml
echo "analysis job" | ./cli/zig-out/bin/ml queue analysis-test
```
### Status Verification
```bash
# Check what each user can see
make test-auth
```
## Expected Results
### Admin User Output
```
Status retrieved for user: admin_user (admin: true)
Tasks: X total, X queued, X running, X failed, X completed
```
### Researcher User Output
```
Status retrieved for user: researcher1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
```
### Analyst User Output
```
Status retrieved for user: analyst1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
```
## Troubleshooting
### Server Not Running
```bash
# Check containers
docker ps --filter "name=ml-"
# Start services
docker-compose -f deployments/docker-compose.prod.yml up -d
# Check logs
docker logs ml-prod-api
```
### Authentication Failures
```bash
# Check config files
ls ~/.ml/config-*.toml
# Verify API keys
cat ~/.ml/config-admin.toml
```
### Connection Issues
```bash
# Test API directly
curl -I http://localhost:9103/health
# Check ports
netstat -an | grep 9103
```
## Advanced Testing
### Full Test Suite
```bash
make test-full
```
### Performance Testing
```bash
./scripts/benchmarks/run-benchmarks-local.sh
```
### Cleanup Status
```bash
make test-status
```
## Configuration Files
### Test Configs Location
- `~/.ml/config-admin.toml` - Admin user
- `~/.ml/config-researcher.toml` - Researcher user
- `~/.ml/config-analyst.toml` - Analyst user
### Server Configs
- `configs/environments/config-multi-user.yaml` - Multi-user setup
- `configs/environments/config-local.yaml` - Local development
## Next Steps
1. **Review Documentation**
- [Testing Protocol](testing-protocol.md)
- [Configuration Reference](configuration-reference.md)
- [Testing Guide](testing-guide.md)
2. **Explore Features**
- Job queueing and management
- WebSocket communication
- Role-based permissions
3. **Production Setup**
- TLS configuration
- Security hardening
- Monitoring setup
## Help
### Common Commands
```bash
make help # Show all commands
make test-auth # Quick auth test
make self-cleanup # Clean environment
make test-status # Check system status
```
### Get Help
- Check logs: `docker logs ml-prod-api`
- Review documentation in `docs/src/`
- Use `--debug` flag with CLI commands

View file

@ -1,50 +0,0 @@
# Testing Guide
## Quick Start
The FetchML project includes comprehensive testing tools.
## Testing Commands
### Quick Tests
```bash
make test-auth # Test multi-user authentication
make test-status # Check cleanup status
make self-cleanup # Clean environment
```
### Full Test Suite
```bash
make test-full # Run complete test suite
```
## Expected Results
### Admin User
Status retrieved for user: admin_user (admin: true)
Tasks: X total, X queued, X running, X failed, X completed
### Researcher User
Status retrieved for user: researcher1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
### Analyst User
Status retrieved for user: analyst1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
## Troubleshooting
### Authentication Failures
- Check API key in ~/.ml/config.toml
- Verify server is running with auth enabled
### Container Issues
- Check Docker daemon is running
- Verify ports 9100, 9103 are available
- Review logs: docker logs ml-prod-api
## Cleanup
```bash
make self-cleanup # Interactive cleanup
make auto-cleanup # Setup daily auto-cleanup
```

View file

@ -1,258 +0,0 @@
# Testing Protocol
This document outlines the comprehensive testing protocol for the FetchML project.
## Overview
The testing protocol is designed to ensure:
- Multi-user authentication works correctly
- API functionality is reliable
- CLI commands function properly
- Docker containers run as expected
- Performance meets requirements
## Test Categories
### 1. Authentication Tests
#### 1.1 Multi-User Authentication
```bash
# Test admin user
cp ~/.ml/config-admin.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows admin status and all jobs
# Test researcher user
cp ~/.ml/config-researcher.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows researcher status and own jobs only
# Test analyst user
cp ~/.ml/config-analyst.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows analyst status, read-only access
```
#### 1.2 API Key Validation
```bash
# Test invalid API key
echo "invalid_key" > ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Authentication failed error
# Test missing API key
rm ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: API key not configured error
```
### 2. CLI Functionality Tests
#### 2.1 Job Queueing
```bash
# Test job queueing with different users
cp ~/.ml/config-admin.toml ~/.ml/config.toml
echo "test job" | ./cli/zig-out/bin/ml queue test-job
cp ~/.ml/config-researcher.toml ~/.ml/config.toml
echo "research job" | ./cli/zig-out/bin/ml queue research-job
cp ~/.ml/config-analyst.toml ~/.ml/config.toml
echo "analysis job" | ./cli/zig-out/bin/ml queue analysis-job
# Expected: Admin and researcher can queue, analyst cannot
```
#### 2.2 Status Checking
```bash
# Check status after job queueing
./cli/zig-out/bin/ml status
# Expected: Shows jobs based on user permissions
```
### 3. Docker Container Tests
#### 3.1 Container Startup
```bash
# Start production environment
docker-compose -f deployments/docker-compose.prod.yml up -d
# Check container status
docker ps --filter "name=ml-"
# Expected: All containers running and healthy
```
#### 3.2 Port Accessibility
```bash
# Test API server port
curl -I http://localhost:9103/health
# Expected: 200 OK response
# Test metrics port
curl -I http://localhost:9100/metrics
# Expected: 200 OK response
```
#### 3.3 Container Cleanup
```bash
# Test cleanup script
./scripts/maintenance/cleanup.sh --dry-run
./scripts/maintenance/cleanup.sh --force
# Expected: Containers stopped and removed
```
### 4. Performance Tests
#### 4.1 API Performance
```bash
# Run API benchmarks
./scripts/benchmarks/run-benchmarks-local.sh
# Expected: Response times under 100ms for basic operations
```
#### 4.2 Load Testing
```bash
# Run load tests
go test -v ./tests/load/...
# Expected: System handles concurrent requests without degradation
```
### 5. Integration Tests
#### 5.1 End-to-End Workflow
```bash
# Complete workflow test
cp ~/.ml/config-admin.toml ~/.ml/config.toml
# Queue job
echo "integration test" | ./cli/zig-out/bin/ml queue integration-test
# Check status
./cli/zig-out/bin/ml status
# Verify job appears in queue
# Expected: Job queued and visible in status
```
#### 5.2 WebSocket Communication
```bash
# Test WebSocket handshake
./cli/zig-out/bin/ml status
# Expected: Successful WebSocket upgrade and response
```
## Test Execution Order
### Phase 1: Environment Setup
1. Clean up any existing containers
2. Start fresh Docker environment
3. Verify all services are running
### Phase 2: Authentication Testing
1. Test all user roles (admin, researcher, analyst)
2. Test invalid authentication scenarios
3. Verify role-based permissions
### Phase 3: Functional Testing
1. Test CLI commands (queue, status)
2. Test API endpoints
3. Test WebSocket communication
### Phase 4: Integration Testing
1. Test complete workflows
2. Test error scenarios
3. Test cleanup procedures
### Phase 5: Performance Testing
1. Run benchmarks
2. Perform load testing
3. Validate performance metrics
## Automated Testing
### Continuous Integration Tests
```bash
# Run all tests
make test
# Run specific test categories
make test-unit
make test-integration
make test-e2e
```
### Pre-deployment Checklist
```bash
# Complete test suite
./scripts/testing/run-full-test-suite.sh
# Performance validation
./scripts/benchmarks/run-benchmarks-local.sh
# Security validation
./scripts/security/security-scan.sh
```
## Test Data Management
### Test Users
- **admin_user**: Full access, can see all jobs
- **researcher1**: Can create and view own jobs
- **analyst1**: Read-only access, cannot create jobs
### Test Jobs
- **test-job**: Basic job for testing
- **research-job**: Research-specific job
- **analysis-job**: Analysis-specific job
## Troubleshooting
### Common Issues
#### Authentication Failures
- Check API key configuration
- Verify server is running with auth enabled
- Check YAML config syntax
#### Container Issues
- Verify Docker daemon is running
- Check port conflicts
- Review container logs
#### Performance Issues
- Monitor resource usage
- Check for memory leaks
- Verify database connections
### Debug Commands
```bash
# Check container logs
docker logs ml-prod-api
# Check system resources
docker stats
# Verify network connectivity
docker network ls
```
## Test Results Documentation
All test results should be documented in:
- `test-results/` directory
- Performance benchmarks
- Integration test reports
- Security scan results
## Maintenance
### Regular Tasks
- Update test data periodically
- Review and update test cases
- Maintain test infrastructure
- Monitor test performance
### Test Environment
- Keep test environment isolated
- Use consistent test data
- Regular cleanup of test artifacts
- Monitor test resource usage

View file

@ -1,10 +1,62 @@
# Testing Guide
Comprehensive testing documentation for FetchML platform.
Comprehensive testing documentation for the FetchML platform with integrated monitoring.
## Quick Start Testing
For a fast 5-minute testing experience, see the **[Quick Start Testing Guide](quick-start-testing.md)**.
### 5-Minute Fast Test
```bash
# Clean environment
make self-cleanup
# Start development stack with monitoring
make dev-up
# Quick authentication test
make test-auth
# Clean up
make dev-down
```
**Expected Results**:
- Admin user: Full access, shows all jobs
- Researcher user: Own jobs only
- Analyst user: Read-only access
## Test Environment Setup
### Development Environment with Monitoring
```bash
# Start development stack with monitoring
make dev-up
# Verify all services are running
make dev-status
# Run tests against running services
make test
# Check monitoring during tests
# Grafana: http://localhost:3000 (admin/admin123)
```
### Test Environment Verification
```bash
# Verify API server
curl -f http://localhost:8080/health
# Verify monitoring services
curl -f http://localhost:3000/api/health
curl -f http://localhost:9090/api/v1/query?query=up
curl -f http://localhost:3100/ready
# Verify Redis
docker exec ml-experiments-redis redis-cli ping
```
## Test Types
@ -12,6 +64,9 @@ For a fast 5-minute testing experience, see the **[Quick Start Testing Guide](qu
```bash
make test-unit # Go unit tests only
cd cli && zig build test # Zig CLI tests
# Unit tests live under tests/unit/ (including tests that cover internal/ packages)
go test ./tests/unit/...
```
### Integration Tests
@ -22,6 +77,9 @@ make test-integration # API and database integration
### End-to-End Tests
```bash
make test-e2e # Full workflow testing
# Podman E2E is opt-in because it builds/runs containers
FETCH_ML_E2E_PODMAN=1 go test ./tests/e2e/... # Enables TestPodmanIntegration
```
### All Tests
@ -30,66 +88,308 @@ make test # Run complete test suite
make test-coverage # With coverage report
```
## Docker Testing
## Deployment-Specific Testing
### Development Environment Testing
### Development Environment
```bash
docker-compose up -d
# Start dev stack
make dev-up
# Run tests with monitoring
make test
docker-compose down
# View test results in Grafana
# Load Test Performance dashboard
# System Health dashboard
```
### Production Environment Testing
```bash
docker-compose -f docker-compose.prod.yml up -d
cd deployments
make prod-up
# Test production deployment
make test-auth # Multi-user auth test
make self-cleanup # Clean up after testing
# Verify production monitoring
curl -f https://your-domain.com/health
```
### Homelab Secure Testing
```bash
cd deployments
make homelab-up
# Test secure deployment
make test-auth
make test-ssl # SSL/TLS testing
```
## Authentication Testing Protocol
### Multi-User Authentication
```bash
# Test admin user
cp ~/.ml/config-admin.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows admin status and all jobs
# Test researcher user
cp ~/.ml/config-researcher.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows researcher status and own jobs only
# Test analyst user
cp ~/.ml/config-analyst.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows analyst status, read-only access
```
### API Key Validation
```bash
# Test invalid API key
echo "invalid_key" > ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Authentication failed error
# Test missing API key
rm ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: API key not configured error
```
### Job Queueing by Role
```bash
# Admin can queue jobs
cp ~/.ml/config-admin.toml ~/.ml/config.toml
echo "admin job" | ./cli/zig-out/bin/ml queue admin-test
# Researcher can queue jobs
cp ~/.ml/config-researcher.toml ~/.ml/config.toml
echo "research job" | ./cli/zig-out/bin/ml queue research-test
# Analyst cannot queue jobs (should fail)
cp ~/.ml/config-analyst.toml ~/.ml/config.toml
echo "analysis job" | ./cli/zig-out/bin/ml queue analysis-test
# Expected: Permission denied error
```
## Performance Testing
### Load Testing with Monitoring
```bash
# Start monitoring
make dev-up
# Run load tests
make load-test
# Monitor performance in real-time
# Grafana: http://localhost:3000
# Check: Request rates, response times, error rates
```
### Benchmark Suite
```bash
./scripts/benchmarks/run-benchmarks-local.sh
```
### Load Testing
```bash
make test-load # API load testing
```
## Authentication Testing
Multi-user authentication testing is fully covered in the **[Quick Start Testing Guide](quick-start-testing.md)**.
```bash
make test-auth # Quick auth role testing
# Run benchmarks with performance tracking
./scripts/track_performance.sh
# Or run directly
make benchmark-local
# View results in Grafana dashboards
```
### Performance Monitoring During Tests
**Key Metrics to Watch**:
- API response times (95th percentile)
- Error rates (should be < 1%)
- Memory usage trends
- CPU utilization
- Request throughput
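The metrics above can be watched as PromQL queries in Grafana panels or the Prometheus UI. A sketch of the kind of queries involved is below; the metric and label names (`http_request_duration_seconds_bucket`, `http_requests_total`, `status`) follow common Prometheus conventions and are assumptions, not the platform's actual metric names.

```promql
# 95th percentile API response time over the last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate as a fraction of all requests (should stay < 0.01)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
```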
## CLI Testing
### Build and Test CLI
```bash
cd cli && zig build dev
./cli/zig-out/dev/ml --help
cd cli && zig build --release=fast
./zig-out/bin/ml --help
zig build test
```
### CLI Integration Tests
```bash
make test-cli # CLI-specific integration tests
```
## Docker Container Testing
### Container Startup
```bash
# Start environment
make dev-up
# Check container status
docker ps --filter "name=ml-"
# Expected: All containers running and healthy
```
### Port Accessibility
```bash
# Test API server port
curl -I http://localhost:8080/health
# Expected: 200 OK response
# Test metrics port
curl -I http://localhost:9100/metrics
# Expected: 200 OK response
# Test monitoring ports
curl -I http://localhost:3000/api/health
curl -I "http://localhost:9090/api/v1/query?query=up"
```
### Container Cleanup
```bash
# Test cleanup script
make self-cleanup
make dev-down
# Expected: Containers stopped and removed
```
## Monitoring During Testing
### Real-Time Monitoring
```bash
# Access Grafana during tests
# Grafana: http://localhost:3000 (admin/admin123)
# Key Dashboards:
# - Load Test Performance: Request metrics, response times
# - System Health: Service status, resource usage
# - Log Analysis: Error logs, service logs
```
### Test Metrics Collection
**Automatically Collected**:
- HTTP request metrics
- Response time histograms
- Error counters
- Resource utilization
- Log aggregation
**Manual Test Markers**:
```bash
# Mark test start in logs
echo "TEST_START: $(date)" | tee -a /logs/test.log
# Mark test completion
echo "TEST_END: $(date)" | tee -a /logs/test.log
```
## Test Execution Protocol
### Phase 1: Environment Setup
1. Clean up any existing containers
2. Start fresh Docker environment with monitoring
3. Verify all services are running
### Phase 2: Authentication Testing
1. Test all user roles (admin, researcher, analyst)
2. Test invalid authentication scenarios
3. Verify role-based permissions
### Phase 3: Functional Testing
1. Test CLI commands (queue, status)
2. Test API endpoints
3. Test WebSocket communication
### Phase 4: Integration Testing
1. Test complete workflows
2. Test error scenarios
3. Test cleanup procedures
### Phase 5: Performance Testing
1. Run benchmarks
2. Perform load testing
3. Validate performance metrics
## Troubleshooting Tests
### Common Issues
- **Server not running**: Check with `docker ps --filter "name=ml-"`
- **Authentication failures**: Verify configs in `~/.ml/config-*.toml`
- **Connection issues**: Test API with `curl -I http://localhost:9103/health`
**Server Not Running**:
```bash
# Check service status
make dev-status
# Check container logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana
```
**Authentication Failures**:
```bash
# Verify configs
ls ~/.ml/config-*.toml
# Check API health
curl -I http://localhost:8080/health
# Monitor auth logs in Grafana
# Log Analysis dashboard -> filter "auth"
```
**Performance Issues**:
```bash
# Check resource usage in Grafana
# System Health dashboard
# Check API response times
# Load Test Performance dashboard
# Identify bottlenecks
# Prometheus: http://localhost:9090
```
**Monitoring Issues**:
```bash
# Re-setup monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py
# Restart Grafana
docker restart ml-experiments-grafana
# Check datasource connectivity
# Grafana -> Configuration -> Data Sources
```
### Debug Mode
```bash
make test-debug # Run tests with verbose output
# Enable debug logging
export LOG_LEVEL=debug
make test
# Monitor debug logs
# Grafana Log Analysis dashboard
```
## Test Configuration
@ -103,12 +403,27 @@ make test-debug # Run tests with verbose output
- `tests/fixtures/` - Test data and examples
- `tests/benchmarks/` - Performance test data
### Monitoring Configuration
- `monitoring/grafana/provisioning/` - Auto-provisioned datasources
- `monitoring/grafana/dashboards/` - Auto-provisioned dashboards
- `monitoring/prometheus/prometheus.yml` - Metrics collection
## Continuous Integration
### CI Pipeline with Monitoring
Tests run automatically on:
- Pull requests (full suite)
- Main branch commits (unit + integration)
- Releases (full suite + benchmarks)
- **Pull requests**: Full suite + performance benchmarks
- **Main branch**: Unit + integration tests
- **Releases**: Full suite + benchmarks + security scans
### CI Monitoring
During CI runs:
- Performance metrics collected
- Test results tracked in Grafana
- Regression detection
- Automated alerts on failures
## Writing Tests
@ -117,13 +432,120 @@ Tests run automatically on:
- Integration tests: `tests/e2e/` directory
- Benchmark tests: `tests/benchmarks/` directory
### Test with Monitoring
```go
// Add custom metrics for tests
var testRequests = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "test_requests_total",
Help: "Total number of test requests",
},
[]string{"method", "status"},
)
// Log test events for monitoring
log.WithFields(log.Fields{
"test": "integration_test",
"operation": "api_call",
"status": "success",
}).Info("Test operation completed")
```
### Zig Tests
- CLI tests: `cli/tests/` directory
- Follow Zig testing conventions
## See Also
## Test Result Analysis
- **[Quick Start Testing Guide](quick-start-testing.md)** - Fast 5-minute testing
- **[Testing Protocol](testing-protocol.md)** - Detailed testing procedures
- **[Configuration Reference](configuration-reference.md)** - Test setup
- **[Troubleshooting](troubleshooting.md)** - Common issues
### Grafana Dashboard Analysis
**Load Test Performance**:
- Request rates over time
- Response time percentiles
- Error rate trends
- Throughput metrics
**System Health**:
- Service availability
- Resource utilization
- Memory usage patterns
- CPU consumption
**Log Analysis**:
- Error patterns
- Warning frequency
- Service log aggregation
- Debug information
### Performance Regression Detection
```bash
# Track performance over time
./scripts/track_performance.sh
# Compare with baseline
# Grafana: Compare current run with historical data
# Alert on regressions
# Set up Grafana alerts for performance degradation
```
## Test Cleanup
### Automated Cleanup
```bash
# Clean up test data
make self-cleanup
# Clean up Docker resources
make clean-all
# Reset monitoring data
docker volume rm monitoring_prometheus_data
docker volume rm monitoring_grafana_data
```
### Manual Cleanup
```bash
# Stop test environment
make dev-down
# Remove test artifacts
rm -rf ~/.ml/config-*.toml
rm -rf test-results/
```
## Expected Test Results
### Admin User
```
Status retrieved for user: admin_user (admin: true)
Tasks: X total, X queued, X running, X failed, X completed
```
### Researcher User
```
Status retrieved for user: researcher1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
```
### Analyst User
```
Status retrieved for user: analyst1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
```
## Common Commands Reference
```bash
make help # Show all commands
make test-auth # Quick auth test
make self-cleanup # Clean environment
make test-status # Check system status
make dev-up # Start dev environment
make dev-down # Stop dev environment
make dev-status # Check dev status
```

View file

@ -7,41 +7,36 @@ Common issues and solutions for Fetch ML.
### Services Not Starting
```bash
# Check Docker status
docker-compose ps
# Check container status
docker ps --filter "name=ml-"
# Restart services
docker-compose down && docker-compose up -d  # testing only
# Check logs
docker-compose logs -f
# Restart development stack
make dev-down
make dev-up
```
### API Not Responding
```bash
# Check health endpoint
curl http://localhost:9101/health
curl http://localhost:8080/health
# Check if port is in use
lsof -i :9101
lsof -i :8080
lsof -i :8443
# Kill process on port
kill -9 $(lsof -ti :9101)
kill -9 $(lsof -ti :8080)
```
### Database Issues
### Database / Redis Issues
```bash
# Check database connection
docker-compose exec postgres psql -U postgres -d fetch_ml
# Check Redis from container
docker exec ml-experiments-redis redis-cli ping
# Reset database
docker-compose down postgres
docker-compose up -d postgres  # testing only
# Check Redis
docker-compose exec redis redis-cli ping
# Check API can reach database (via health endpoint)
curl -f http://localhost:8080/health || echo "API not healthy"
```
## Common Errors
@ -53,7 +48,7 @@ docker-compose exec redis redis-cli ping
### Database Errors
- **Connection failed**: Verify database type and connection params
- **No such table**: Run migrations with `--migrate` (see [Development Setup](development-setup.md))
- **No such table**: Run migrations with `--migrate` (see [Quick Start](quick-start.md))
### Container Errors
- **Runtime not found**: Set `runtime: docker` in config (testing only)
@ -65,15 +60,15 @@ docker-compose exec redis redis-cli ping
## Development Issues
- **Build fails**: `go mod tidy` and `cd cli && rm -rf zig-out zig-cache`
- **Tests fail**: Start test dependencies with `docker-compose up -d` or `make test-auth`
- **Tests fail**: Ensure dev stack is running with `make dev-up` or use `make test-auth`
## CLI Issues
- **Not found**: `cd cli && zig build dev`
- **Not found**: `cd cli && zig build --release=fast`
- **Connection errors**: Check `--server` and `--api-key`
## Network Issues
- **Port conflicts**: `lsof -i :9101` and kill processes
- **Firewall**: Allow ports 9101, 6379, 5432
- **Port conflicts**: `lsof -i :8080` / `lsof -i :8443` and kill processes
- **Firewall**: Allow ports 8080, 8443, 6379, 5432
## Configuration Issues
- **Invalid YAML**: `python3 -c "import yaml; yaml.safe_load(open('config.yaml'))"`
@ -82,13 +77,19 @@ docker-compose exec redis redis-cli ping
## Debug Information
```bash
./bin/api-server --version
docker-compose ps
docker-compose logs api-server | grep ERROR
docker ps --filter "name=ml-"
docker logs ml-experiments-api | grep ERROR
```
## Emergency Reset
```bash
docker-compose down -v
# Stop and remove all dev containers and volumes
make dev-down
docker volume prune
# Remove local data if needed
rm -rf data/ results/ *.db
docker-compose up -d  # testing only
# Start fresh dev stack
make dev-up
```

101
docs/src/validate.md Normal file
View file

@ -0,0 +1,101 @@
---
layout: page
title: "Validation (ml validate)"
permalink: /validate/
---
# Validation (`ml validate`)
The `ml validate` command verifies experiment integrity and provenance.
It can be run against:
- A **commit id** (validates the experiment tree + dependency manifest)
- A **task id** (additionally validates the run's `run_manifest.json` provenance and lifecycle)
## CLI usage
```bash
# Validate by commit
ml validate <commit_id> [--json] [--verbose]
# Validate by task
ml validate --task <task_id> [--json] [--verbose]
```
### Output modes
- Default (human): prints a summary with `errors`, `warnings`, and `failed_checks`.
- `--verbose`: prints all checks under `checks` and includes `expected/actual/details` when present.
- `--json`: prints the raw JSON payload.
## Report shape
The API returns a JSON report of the form:
- `ok`: overall boolean
- `commit_id`: commit being validated (if known)
- `task_id`: task being validated (when validating by task)
- `checks`: map of check name → `{ ok, expected?, actual?, details? }`
- `errors`: list of high-level failures
- `warnings`: list of non-fatal issues
- `ts`: UTC timestamp
## Check semantics
- For **task statuses** `running`, `completed`, or `failed`, run-manifest issues are treated as **errors**.
- For **queued/pending** tasks, run-manifest issues are usually **warnings** (the job may not have started yet).
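The status-to-severity rule above can be sketched as a small function. This is illustrative only (the status strings mirror the docs; the function is not the server's code):

```go
package main

import "fmt"

// severityForRunManifestIssue applies the documented rule: run-manifest
// problems are errors for tasks that have (or should have) started, and
// warnings for tasks still waiting in the queue.
func severityForRunManifestIssue(taskStatus string) string {
	switch taskStatus {
	case "running", "completed", "failed":
		return "error"
	case "queued", "pending":
		return "warning"
	default:
		return "warning"
	}
}

func main() {
	fmt.Println(severityForRunManifestIssue("running")) // error
	fmt.Println(severityForRunManifestIssue("queued"))  // warning
}
```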
## Notable checks
### Experiment integrity
- `experiment_manifest`: validates the experiment manifest (content-addressed integrity)
- `deps_manifest`: validates that a dependency manifest exists and can be hashed
- `expected_manifest_overall_sha`: compares the task's recorded manifest SHA to the current manifest SHA
- `expected_deps_manifest`: compares the task's recorded deps manifest name/SHA to what exists on disk
### Run manifest provenance (task validation)
- `run_manifest`: whether `run_manifest.json` could be found and loaded
- `run_manifest_location`: verifies the manifest was found in the expected bucket:
- `pending` for queued/pending
- `running` for running
- `finished` for completed
- `failed` for failed
- `run_manifest_task_id`: task id match
- `run_manifest_commit_id`: commit id match
- `run_manifest_deps`: deps manifest name/SHA match
- `run_manifest_snapshot_id`: snapshot id match (when snapshot is part of the task)
- `run_manifest_snapshot_sha256`: snapshot sha256 match (when snapshot sha is recorded)
### Run manifest lifecycle (task validation)
- `run_manifest_lifecycle`:
- `running`: must have `started_at`, must not have `ended_at`/`exit_code`
- `completed`/`failed`: must have `started_at`, `ended_at`, `exit_code`, and `ended_at >= started_at`
- `queued`/`pending`: must not have `ended_at`/`exit_code`
## Example report (task validation)
```json
{
"ok": false,
"commit_id": "6161616161616161616161616161616161616161",
"task_id": "task-run-manifest-location-mismatch",
"checks": {
"experiment_manifest": {"ok": true},
"deps_manifest": {"ok": true, "actual": "requirements.txt:..."},
"run_manifest": {"ok": true},
"run_manifest_location": {
"ok": false,
"expected": "running",
"actual": "finished"
}
},
"errors": [
"run manifest location mismatch"
],
"ts": "2025-12-17T18:43:00Z"
}
```

View file

@ -47,6 +47,8 @@ export FETCH_ML_CLI_API_KEY="prod-key"
- `ml dataset list` list datasets
- `ml monitor` launch TUI over SSH (remote UI)
`ml status --json` may include an optional `prewarm` field when the worker is prefetching datasets for the next queued task.
## Build flavors
- `make all` ReleaseSmall (default)