docs(dev): document validate workflow, CLI/TUI UX contract, and consolidate dev/testing docs

parent 8157f73a70
commit 1aed78839b

12 changed files with 1119 additions and 646 deletions
@@ -95,7 +95,7 @@ make test
go test ./internal/queue/...

# Build CLI
-cd cli && zig build dev
+cd cli && zig build --release=fast

# Run formatters and linters
make lint
@@ -34,6 +34,9 @@ High-performance command-line interface for experiment management, written in Zig
| `cancel` | Cancel running job | `ml cancel job123` |
| `prune` | Clean up old experiments | `ml prune --keep 10` |
| `watch` | Auto-sync directory on changes | `ml watch ./project --queue` |
+| `jupyter` | Manage Jupyter notebook services | `ml jupyter start --name my-nb` |
+| `validate` | Validate provenance/integrity for a commit or task | `ml validate <commit_id> --verbose` |
+| `info` | Show run info from `run_manifest.json` | `ml info <run_dir>` |

### Command Details
@@ -78,6 +81,9 @@ ml queue my-job --commit abc123 --priority 8
- Priority queuing system
- API key authentication

+**Notes:**
+- Tasks support optional `snapshot_id` and `dataset_specs` fields server-side (for provenance and dataset resolution).
+
#### `watch` - Auto-Sync Monitoring
```bash
# Watch directory for changes
@@ -108,12 +114,78 @@ ml monitor
```
Launches TUI interface via SSH for real-time monitoring.

#### `status` - System Status

`ml status --json` returns a JSON object including an optional `prewarm` field when worker prewarming is active:

```json
{
  "prewarm": [
    {
      "worker_id": "worker-1",
      "task_id": "<task-id>",
      "started_at": "2025-01-01T00:00:00Z",
      "updated_at": "2025-01-01T00:00:05Z",
      "phase": "datasets",
      "dataset_count": 2
    }
  ]
}
```

#### `cancel` - Job Cancellation
```bash
ml cancel running-job-id
```
Cancels currently running jobs by ID.

#### `jupyter` - Jupyter Notebook Management

Manage Jupyter notebook services via WebSocket protocol.

```bash
# Start a Jupyter service
ml jupyter start --name my-notebook --workspace /path/to/workspace

# Start with password protection
ml jupyter start --name my-notebook --workspace /path/to/workspace --password mypass

# List running services
ml jupyter list

# Stop a service
ml jupyter stop service-id-12345

# Check service status
ml jupyter status
```

**Features:**
- WebSocket-based binary protocol for low latency
- Secure API key authentication (SHA256 hashed)
- Real-time service management
- Workspace isolation

**Common Use Cases:**
```bash
# Development workflow
ml jupyter start --name dev-notebook --workspace ./notebooks
# ... do development work ...
ml jupyter stop dev-service-123

# Team collaboration
ml jupyter start --name team-analysis --workspace /shared/analysis --password teampass

# Multiple services
ml jupyter list   # View all running services
```

**Security:**
- API keys are hashed before transmission
- Password protection for notebooks
- Workspace path validation
- Service ID-based authorization

### Configuration

The Zig CLI reads configuration from `~/.ml/config.toml`:
@@ -144,7 +216,7 @@ Main HTTPS API server for experiment management.
go run ./cmd/api-server/main.go

# With configuration
-./bin/api-server --config configs/config-local.yaml
+./bin/api-server --config configs/api/dev.yaml
```

**Features:**
@@ -160,9 +232,6 @@ Terminal User Interface for monitoring experiments.
```bash
# Launch TUI
go run ./cmd/tui/main.go
-
-# With custom config
-./tui --config configs/config-local.yaml
```

**Features:**
@@ -187,10 +256,10 @@ Configuration validation and linting tool.

```bash
# Validate configuration
-./configlint configs/config-local.yaml
+./configlint configs/api/dev.yaml

# Check schema compliance
-./configlint --schema configs/schema/config_schema.yaml
+./configlint --schema configs/schema/api_server_config.yaml
```

## Management Script (`./tools/manage.sh`)
@@ -208,39 +277,22 @@ Simple service management for your homelab.
./tools/manage.sh cleanup   # Clean project artifacts
```

-## Setup Script (`./setup.sh`)
-
-One-command homelab setup.
-
-### Usage
-```bash
-# Full setup
-./setup.sh
-
-# Setup includes:
-# - SSL certificate generation
-# - Configuration creation
-# - Build all components
-# - Start Redis
-# - Setup Fail2Ban (if available)
-```
-
## API Testing

Test the API with curl:

```bash
# Health check
-curl -k -H 'X-API-Key: password' https://localhost:9101/health
+curl -f http://localhost:8080/health

# List experiments
-curl -k -H 'X-API-Key: password' https://localhost:9101/experiments
+curl -H 'X-API-Key: password' http://localhost:8080/experiments

# Submit experiment
-curl -k -X POST -H 'X-API-Key: password' \
+curl -X POST -H 'X-API-Key: password' \
   -H 'Content-Type: application/json' \
   -d '{"name":"test","config":{"type":"basic"}}' \
-  https://localhost:9101/experiments
+  http://localhost:8080/experiments
```

## Zig CLI Architecture
@@ -269,7 +321,7 @@ The Zig CLI is designed for performance and reliability:

## Configuration

-Main configuration file: `configs/config-local.yaml`
+Main configuration file: `configs/api/dev.yaml`

### Key Settings
```yaml
@@ -277,14 +329,14 @@ auth:
  enabled: true
  api_keys:
    dev_user:
-      hash: "2baf1f40105d9501fe319a8ec463fdf4325a2a5df445adf3f572f626253678c9"
+      hash: "CHANGE_ME_SHA256_DEV_USER_KEY"
      admin: true
      roles:
        - admin
      permissions:
        '*': true
    researcher_user:
-      hash: "ef92b778ba7a6c8f2150019a5678047b6a9a2b95cef8189518f9b35c54d2e3ae"
+      hash: "CHANGE_ME_SHA256_RESEARCHER_USER_KEY"
      admin: false
      roles:
        - researcher
@@ -385,8 +437,8 @@ telnet worker.local 9100

**Authentication failed:**
```bash
-# Check API key in config-local.yaml
-grep -A 5 "api_keys:" configs/config-local.yaml
+# Check API key in config
+grep -A 5 "api_keys:" configs/api/dev.yaml
```

**Redis connection failed:**
337  docs/src/cli-tui-ux-contract-v1.md  Normal file
@@ -0,0 +1,337 @@
# FetchML CLI/TUI UX Contract v1

This document defines the user experience contract for FetchML v1, focusing on clean, predictable CLI/TUI interactions without mode flags.

## Core Principles

1. **Thin CLI**: Local CLI does minimal validation; authoritative checks happen server-side
2. **No Mode Flags**: Commands do what they say; no `--mode` or similar flags
3. **Predictable Defaults**: Sensible defaults that work for most use cases
4. **Graceful Degradation**: JSON output for automation, human-friendly output for interactive use
5. **Explicit Operations**: `--dry-run`, `--validate`, `--explain` are explicit, not implied

## Commands v1

### Core Workflow Commands

#### `ml queue <job-name> [options]`
Submit a job for execution.

**Basic Usage:**
```bash
ml queue my-experiment
```

**Options:**
- `--commit <sha>`: Specify commit ID (default: current git HEAD)
- `--priority <1-10>`: Job priority (default: 5)
- `--cpu <cores>`: CPU cores requested (default: 2)
- `--memory <gb>`: Memory in GB (default: 8)
- `--gpu <count>`: GPU count (default: 0)
- `--gpu-memory <gb>`: GPU memory budget (default: auto)

**Dry Run:**
```bash
ml queue my-experiment --dry-run
# Output: JSON with what would be submitted, validation results
```

**Validate Only:**
```bash
ml queue my-experiment --validate
# Output: Validation results without submitting
```

**Explain:**
```bash
ml queue my-experiment --explain
# Output: Human-readable explanation of what will happen
```

**JSON Output:**
When using `--json`, the response may include a `prewarm` field describing best-effort worker prewarming activity (e.g. dataset prefetch for the next queued task).

```bash
ml queue my-experiment --json
# Output: Structured JSON response
```

#### `ml status [job-name]`
Show job status.

**Basic Usage:**
```bash
ml status                 # All jobs summary
ml status my-experiment   # Specific job details
```

**Options:**
- `--json`: JSON output
- `--watch`: Watch mode (refresh every 2s)
- `--limit <n>`: Limit number of jobs shown (default: 20)

#### `ml cancel <job-name>`
Cancel a running or queued job.

**Basic Usage:**
```bash
ml cancel my-experiment
```

**Options:**
- `--force`: Force cancel even if running
- `--json`: JSON output

### Experiment Management

#### `ml experiment init <name>`
Initialize a new experiment directory.

**Basic Usage:**
```bash
ml experiment init my-project
```

**Options:**
- `--template <name>`: Use experiment template
- `--dry-run`: Show what would be created

#### `ml experiment list`
List available experiments.

**Options:**
- `--json`: JSON output
- `--limit <n>`: Limit results

#### `ml experiment show <commit-id>`
Show experiment details.

**Options:**
- `--json`: JSON output
- `--manifest`: Show content integrity manifest

### Dataset Management

#### `ml dataset list`
List available datasets.

**Options:**
- `--json`: JSON output
- `--synced-only`: Show only synced datasets

#### `ml dataset sync <dataset-name>`
Sync a dataset from NAS to ML server.

**Options:**
- `--dry-run`: Show what would be synced
- `--validate`: Validate dataset integrity after sync

### Monitoring & TUI

#### `ml monitor`
Launch TUI for real-time monitoring (runs over SSH).

**Basic Usage:**
```bash
ml monitor
```

**TUI Controls:**
- `Ctrl+C`: Exit TUI
- `q`: Quit
- `r`: Refresh
- `j/k`: Navigate jobs
- `Enter`: Job details
- `c`: Cancel selected job

#### `ml watch <job-name>`
Watch a specific job's output.

**Options:**
- `--follow`: Follow log output (default)
- `--tail <n>`: Show last n lines

## Global Options

These options work with any command:

- `--json`: Output structured JSON instead of human-readable format
- `--config <path>`: Use custom config file (default: `~/.ml/config.toml`)
- `--verbose`: Verbose output
- `--quiet`: Minimal output
- `--help`: Show help for command

## Defaults Configuration

### Default Job Resources
```toml
[defaults]
cpu = 2               # CPU cores
memory = 8            # GB
gpu = 0               # GPU count
gpu_memory = "auto"   # Auto-detect or specify GB
priority = 5          # Job priority (1-10)
```

### Default Behavior
- **Commit ID**: Current git HEAD (must be clean working directory)
- **Working Directory**: Current directory for experiment files
- **Output**: Human-readable format unless `--json` specified
- **Validation**: Server-side authoritative validation

## Error Handling

### Exit Codes
- `0`: Success
- `1`: General error
- `2`: Invalid arguments
- `3`: Validation failed
- `4`: Network/connection error
- `5`: Server error

### Error Output Format
**Human-readable:**
```
Error: Experiment validation failed
  - Missing dependency manifest (environment.yml, poetry.lock, pyproject.toml, or requirements.txt)
  - Train script not found: train.py
```

**JSON:**
```json
{
  "error": "validation_failed",
  "message": "Experiment validation failed",
  "details": [
    {"field": "dependency_manifest", "error": "missing", "supported": ["environment.yml", "poetry.lock", "pyproject.toml", "requirements.txt"]},
    {"field": "train_script", "error": "not_found", "expected": "train.py"}
  ]
}
```

## Ctrl+C Semantics

### Command Cancellation
- **Ctrl+C during `ml queue --dry-run`**: Immediate exit, no side effects
- **Ctrl+C during `ml queue`**: Attempt to cancel submission, show status
- **Ctrl+C during `ml status --watch`**: Exit watch mode
- **Ctrl+C during `ml monitor`**: Gracefully exit TUI
- **Ctrl+C during `ml watch`**: Stop following logs, show final status

### Graceful Shutdown
1. Signal interrupt to server (if applicable)
2. Clean up local resources
3. Display current status
4. Exit with appropriate code

## JSON Output Schema

### Job Submission Response
```json
{
  "job_id": "uuid-string",
  "job_name": "my-experiment",
  "status": "queued",
  "commit_id": "abc123...",
  "submitted_at": "2025-01-01T12:00:00Z",
  "estimated_start": "2025-01-01T12:05:00Z",
  "resources": {
    "cpu": 2,
    "memory_gb": 8,
    "gpu": 1,
    "gpu_memory_gb": 16
  }
}
```

### Status Response
```json
{
  "jobs": [
    {
      "job_id": "uuid-string",
      "job_name": "my-experiment",
      "status": "running",
      "progress": 0.75,
      "started_at": "2025-01-01T12:05:00Z",
      "estimated_completion": "2025-01-01T12:30:00Z",
      "node": "worker-01"
    }
  ],
  "total": 1,
  "showing": 1
}
```

## Examples

### Typical Workflow
```bash
# 1. Initialize experiment
ml experiment init my-project
cd my-project

# 2. Validate experiment locally
ml queue . --validate --dry-run

# 3. Submit job
ml queue . --priority 8 --gpu 1

# 4. Monitor progress
ml status .
ml watch .

# 5. Check results
ml status . --json
```

### Automation Script
```bash
#!/bin/bash
# Submit job and wait for completion
JOB_ID=$(ml queue my-experiment --json | jq -r '.job_id')

echo "Submitted job: $JOB_ID"

# Wait for completion
while true; do
  STATUS=$(ml status "$JOB_ID" --json | jq -r '.jobs[0].status')
  echo "Status: $STATUS"

  if [[ "$STATUS" == "completed" || "$STATUS" == "failed" ]]; then
    break
  fi

  sleep 10
done

ml status "$JOB_ID"
```

## Implementation Notes

### Server-side Validation
- CLI performs minimal local checks (git status, file existence)
- All authoritative validation happens on the worker
- Validation failures are propagated back to the CLI with clear error messages

### Trust Contract Integration
- Every job submission includes a commit ID and content integrity manifest
- The worker validates both before execution
- Any mismatch causes a hard fail with detailed error reporting

### Resource Management
- Resource requests are validated against available capacity
- Jobs are queued based on priority and resource availability
- Resource usage is tracked and reported in status

## Future Extensions

The v1 contract is intentionally minimal but designed for extension:

- **v1.1**: Add job dependencies and workflows
- **v1.2**: Add experiment templates and scaffolding
- **v1.3**: Add distributed execution across multiple workers
- **v2.0**: Add advanced scheduling and resource optimization

All extensions will maintain backward compatibility with the v1 contract.
@@ -7,14 +7,14 @@ This document provides a comprehensive reference for all configuration options i
## Environment Configurations

### Local Development
-**File:** `configs/environments/config-local.yaml`
+**File:** `configs/api/dev.yaml`

```yaml
auth:
  enabled: true
-  apikeys:
+  api_keys:
    dev_user:
-      hash: "2baf1f40105d9501fe319a8ec463fdf4325a2a5df445adf3f572f626253678c9"
+      hash: "CHANGE_ME_SHA256_DEV_USER_KEY"
      admin: true
      roles: ["admin"]
      permissions:
@@ -35,14 +35,14 @@ security:
```

### Multi-User Setup
-**File:** `configs/environments/config-multi-user.yaml`
+**File:** `configs/api/multi-user.yaml`

```yaml
auth:
  enabled: true
-  apikeys:
+  api_keys:
    admin_user:
-      hash: "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8"
+      hash: "CHANGE_ME_SHA256_ADMIN_USER_KEY"
      admin: true
      roles: ["user", "admin"]
      permissions:
@@ -51,7 +51,7 @@ auth:
        delete: true

    researcher1:
-      hash: "ef92b778ba7a6c8f2150019a5678047b6a9a2b95cef8189518f9b35c54d2e3ae"
+      hash: "CHANGE_ME_SHA256_RESEARCHER1_KEY"
      admin: false
      roles: ["user", "researcher"]
      permissions:
@@ -61,7 +61,7 @@ auth:
        jobs:delete: false

    analyst1:
-      hash: "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3"
+      hash: "CHANGE_ME_SHA256_ANALYST1_KEY"
      admin: false
      roles: ["user", "analyst"]
      permissions:
@@ -72,12 +72,12 @@ auth:
```

### Production
-**File:** `configs/environments/config-prod.yaml`
+**File:** `configs/api/prod.yaml`

```yaml
auth:
  enabled: true
-  apikeys:
+  api_keys:
    # Production users configured here

server:
@@ -98,13 +98,14 @@ security:
    - "10.0.0.0/8"

redis:
-  url: "redis://redis:6379"
-  max_connections: 10
+  addr: "redis:6379"
+  password: ""
+  db: 0

logging:
  level: "info"
  file: "/app/logs/app.log"
-  audit_file: "/app/logs/audit.log"
+  audit_log: "/app/logs/audit.log"
```

## Worker Configurations
@@ -113,59 +114,80 @@ logging:
**File:** `configs/workers/worker-prod.toml`

```toml
[worker]
name = "production-worker"
id = "worker-prod-1"
worker_id = "worker-prod-01"
base_path = "/data/ml-experiments"
max_workers = 4

[server]
host = "api-server"
port = 9101
api_key = "your-api-key-here"
redis_addr = "localhost:6379"
redis_password = "CHANGE_ME_REDIS_PASSWORD"
redis_db = 0

[execution]
max_concurrent_jobs = 2
timeout_minutes = 60
retry_attempts = 3
host = "localhost"
user = "ml-user"
port = 22
ssh_key = "~/.ssh/id_rsa"

podman_image = "ml-training:latest"
gpu_vendor = "none"
gpu_visible_devices = []
gpu_devices = []
container_workspace = "/workspace"
container_results = "/results"
train_script = "train.py"

[resources]
memory_limit = "4Gi"
cpu_limit = "2"
gpu_enabled = false
max_workers = 4
desired_rps_per_worker = 2
podman_cpus = "4"
podman_memory = "16g"

[storage]
work_dir = "/tmp/fetchml-jobs"
cleanup_interval_minutes = 30

[metrics]
enabled = true
listen_addr = ":9100"
```

```toml
# Production Worker (NVIDIA, UUID-based GPU selection)
worker_id = "worker-prod-01"
base_path = "/data/ml-experiments"

podman_image = "ml-training:latest"
gpu_vendor = "nvidia"
gpu_visible_device_ids = ["GPU-REPLACE_WITH_REAL_UUID"]
gpu_devices = ["/dev/dri"]
container_workspace = "/workspace"
container_results = "/results"
train_script = "train.py"
```

### Docker Worker
**File:** `configs/workers/worker-docker.yaml`
**File:** `configs/workers/docker.yaml`

```yaml
worker:
  name: "docker-worker"
  id: "worker-docker-1"
  worker_id: "docker-worker"
  base_path: "/tmp/fetchml-jobs"
  train_script: "train.py"

server:
  host: "api-server"
  port: 9101
  api_key: "your-api-key-here"
  redis_addr: "redis:6379"
  redis_password: ""
  redis_db: 0

execution:
  max_concurrent_jobs: 1
  timeout_minutes: 30
  retry_attempts: 3
  local_mode: true

resources:
  memory_limit: "2Gi"
  cpu_limit: "1"
  gpu_enabled: false
  max_workers: 1
  poll_interval_seconds: 5

docker:
  podman_image: "python:3.9-slim"
  container_workspace: "/workspace"
  container_results: "/results"
  gpu_devices: []
  gpu_vendor: "none"
  gpu_visible_devices: []

metrics:
  enabled: true
  image: "fetchml/worker:latest"
  volume_mounts:
    - "/tmp:/tmp"
    - "/var/run/docker.sock:/var/run/docker.sock"
  listen_addr: ":9100"
  metrics_flush_interval: "500ms"
```

## CLI Configuration
@@ -181,7 +203,7 @@ worker_base = "/app"
worker_port = 22

[auth]
-api_key = "your-hashed-api-key"
+api_key = "<your-api-key>"

[cli]
default_timeout = 30
@@ -199,7 +221,7 @@ worker_base = "/app"
worker_port = 22

[auth]
-api_key = "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8"
+api_key = "<admin-api-key>"
```

**Researcher Config:** `~/.ml/config-researcher.toml`
@@ -211,7 +233,7 @@ worker_base = "/app"
worker_port = 22

[auth]
-api_key = "ef92b778ba7a6c8f2150019a5678047b6a9a2b95cef8189518f9b35c54d2e3ae"
+api_key = "<researcher-api-key>"
```

**Analyst Config:** `~/.ml/config-analyst.toml`
@@ -223,7 +245,7 @@ worker_base = "/app"
worker_port = 22

[auth]
-api_key = "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3"
+api_key = "<analyst-api-key>"
```

## Configuration Options
@@ -298,7 +320,6 @@ api_key = "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3"
|----------|---------|-------------|
| `FETCHML_CONFIG` | - | Path to config file |
| `FETCHML_LOG_LEVEL` | "info" | Override log level |
| `FETCHML_REDIS_URL` | - | Override Redis URL |
| `CLI_CONFIG` | - | Path to CLI config file |

## Troubleshooting
|
|||
|
||||
```bash
|
||||
# Validate server configuration
|
||||
go run cmd/api-server/main.go --config configs/environments/config-local.yaml --validate
|
||||
go run cmd/api-server/main.go --config configs/api/dev.yaml --validate
|
||||
|
||||
# Test CLI configuration
|
||||
./cli/zig-out/bin/ml status --debug
|
||||
|
|
|
30  docs/src/dev-quick-start.md  Normal file
@@ -0,0 +1,30 @@
# Development Quick Start

This page is the developer-focused entrypoint for working on FetchML.

## Prerequisites

- Go
- Zig
- Docker / Docker Compose

## Quick setup

```bash
# Clone
git clone https://github.com/jfraeys/fetch_ml.git
cd fetch_ml

# Start dev environment
make dev-up

# Run tests
make test
```

## Next

- See `testing.md` for test workflows.
- See `architecture.md` for system structure.
- See `zig-cli.md` for CLI build details.
- See the repository root `DEVELOPMENT.md` for the full development guide.
@@ -1,185 +0,0 @@
# Quick Start Testing Guide

## Overview

This guide provides the fastest way to test the FetchML multi-user authentication system.

## Prerequisites

- Docker and Docker Compose installed
- CLI built: `cd cli && zig build`
- Test configs available in `~/.ml/`

## 5-Minute Test

### 1. Clean Environment
```bash
make self-cleanup
```

### 2. Start Services
```bash
docker-compose -f deployments/docker-compose.prod.yml up -d
```

### 3. Test Authentication
```bash
make test-auth
```

### 4. Check Results
You should see:
- Admin user: Full access, shows all jobs
- Researcher user: Own jobs only
- Analyst user: Read-only access

### 5. Clean Up
```bash
make self-cleanup
```

## Detailed Testing

### Multi-User Authentication Test

```bash
# Test each user role
cp ~/.ml/config-admin.toml ~/.ml/config.toml && ./cli/zig-out/bin/ml status
cp ~/.ml/config-researcher.toml ~/.ml/config.toml && ./cli/zig-out/bin/ml status
cp ~/.ml/config-analyst.toml ~/.ml/config.toml && ./cli/zig-out/bin/ml status
```

### Job Queueing Test

```bash
# Admin can queue jobs
cp ~/.ml/config-admin.toml ~/.ml/config.toml
echo "admin job" | ./cli/zig-out/bin/ml queue admin-test

# Researcher can queue jobs
cp ~/.ml/config-researcher.toml ~/.ml/config.toml
echo "research job" | ./cli/zig-out/bin/ml queue research-test

# Analyst cannot queue jobs (should fail)
cp ~/.ml/config-analyst.toml ~/.ml/config.toml
echo "analysis job" | ./cli/zig-out/bin/ml queue analysis-test
```

### Status Verification

```bash
# Check what each user can see
make test-auth
```

## Expected Results

### Admin User Output
```
Status retrieved for user: admin_user (admin: true)
Tasks: X total, X queued, X running, X failed, X completed
```

### Researcher User Output
```
Status retrieved for user: researcher1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
```

### Analyst User Output
```
Status retrieved for user: analyst1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
```

## Troubleshooting

### Server Not Running
```bash
# Check containers
docker ps --filter "name=ml-"

# Start services
docker-compose -f deployments/docker-compose.prod.yml up -d

# Check logs
docker logs ml-prod-api
```

### Authentication Failures
```bash
# Check config files
ls ~/.ml/config-*.toml

# Verify API keys
cat ~/.ml/config-admin.toml
```

### Connection Issues
```bash
# Test API directly
curl -I http://localhost:9103/health

# Check ports
netstat -an | grep 9103
```

## Advanced Testing

### Full Test Suite
```bash
make test-full
```

### Performance Testing
```bash
./scripts/benchmarks/run-benchmarks-local.sh
```

### Cleanup Status
```bash
make test-status
```

## Configuration Files

### Test Configs Location
- `~/.ml/config-admin.toml` - Admin user
- `~/.ml/config-researcher.toml` - Researcher user
- `~/.ml/config-analyst.toml` - Analyst user

### Server Configs
- `configs/environments/config-multi-user.yaml` - Multi-user setup
- `configs/environments/config-local.yaml` - Local development

## Next Steps

1. **Review Documentation**
   - [Testing Protocol](testing-protocol.md)
   - [Configuration Reference](configuration-reference.md)
   - [Testing Guide](testing-guide.md)

2. **Explore Features**
   - Job queueing and management
   - WebSocket communication
   - Role-based permissions

3. **Production Setup**
   - TLS configuration
   - Security hardening
   - Monitoring setup

## Help

### Common Commands
```bash
make help           # Show all commands
make test-auth      # Quick auth test
make self-cleanup   # Clean environment
make test-status    # Check system status
```

### Get Help
- Check logs: `docker logs ml-prod-api`
- Review documentation in `docs/src/`
- Use `--debug` flag with CLI commands
@@ -1,50 +0,0 @@
# Testing Guide

## Quick Start

The FetchML project includes comprehensive testing tools.

## Testing Commands

### Quick Tests
```bash
make test-auth      # Test multi-user authentication
make test-status    # Check cleanup status
make self-cleanup   # Clean environment
```

### Full Test Suite
```bash
make test-full      # Run complete test suite
```

## Expected Results

### Admin User
Status retrieved for user: admin_user (admin: true)
Tasks: X total, X queued, X running, X failed, X completed

### Researcher User
Status retrieved for user: researcher1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed

### Analyst User
Status retrieved for user: analyst1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed

## Troubleshooting

### Authentication Failures
- Check API key in `~/.ml/config.toml`
- Verify server is running with auth enabled

### Container Issues
- Check Docker daemon is running
- Verify ports 9100, 9103 are available
- Review logs: `docker logs ml-prod-api`

## Cleanup
```bash
make self-cleanup   # Interactive cleanup
make auto-cleanup   # Setup daily auto-cleanup
```
@@ -1,258 +0,0 @@
# Testing Protocol

This document outlines the comprehensive testing protocol for the FetchML project.

## Overview

The testing protocol is designed to ensure:
- Multi-user authentication works correctly
- API functionality is reliable
- CLI commands function properly
- Docker containers run as expected
- Performance meets requirements

## Test Categories

### 1. Authentication Tests

#### 1.1 Multi-User Authentication
```bash
# Test admin user
cp ~/.ml/config-admin.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows admin status and all jobs

# Test researcher user
cp ~/.ml/config-researcher.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows researcher status and own jobs only

# Test analyst user
cp ~/.ml/config-analyst.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows analyst status, read-only access
```

#### 1.2 API Key Validation
```bash
# Test invalid API key
echo "invalid_key" > ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Authentication failed error

# Test missing API key
rm ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: API key not configured error
```

### 2. CLI Functionality Tests

#### 2.1 Job Queueing
```bash
# Test job queueing with different users
cp ~/.ml/config-admin.toml ~/.ml/config.toml
echo "test job" | ./cli/zig-out/bin/ml queue test-job

cp ~/.ml/config-researcher.toml ~/.ml/config.toml
echo "research job" | ./cli/zig-out/bin/ml queue research-job

cp ~/.ml/config-analyst.toml ~/.ml/config.toml
echo "analysis job" | ./cli/zig-out/bin/ml queue analysis-job
# Expected: Admin and researcher can queue, analyst cannot
```

#### 2.2 Status Checking
```bash
# Check status after job queueing
./cli/zig-out/bin/ml status
# Expected: Shows jobs based on user permissions
```

### 3. Docker Container Tests

#### 3.1 Container Startup
```bash
# Start production environment
docker-compose -f deployments/docker-compose.prod.yml up -d

# Check container status
docker ps --filter "name=ml-"
# Expected: All containers running and healthy
```

#### 3.2 Port Accessibility
```bash
# Test API server port
curl -I http://localhost:9103/health
# Expected: 200 OK response

# Test metrics port
curl -I http://localhost:9100/metrics
# Expected: 200 OK response
```

#### 3.3 Container Cleanup
```bash
# Test cleanup script
./scripts/maintenance/cleanup.sh --dry-run
./scripts/maintenance/cleanup.sh --force
# Expected: Containers stopped and removed
```

### 4. Performance Tests

#### 4.1 API Performance
```bash
# Run API benchmarks
./scripts/benchmarks/run-benchmarks-local.sh
# Expected: Response times under 100ms for basic operations
```

#### 4.2 Load Testing
```bash
# Run load tests
go test -v ./tests/load/...
# Expected: System handles concurrent requests without degradation
```

### 5. Integration Tests

#### 5.1 End-to-End Workflow
```bash
# Complete workflow test
cp ~/.ml/config-admin.toml ~/.ml/config.toml

# Queue job
echo "integration test" | ./cli/zig-out/bin/ml queue integration-test

# Check status
./cli/zig-out/bin/ml status

# Verify job appears in queue
# Expected: Job queued and visible in status
```

#### 5.2 WebSocket Communication
```bash
# Test WebSocket handshake
./cli/zig-out/bin/ml status
# Expected: Successful WebSocket upgrade and response
```

## Test Execution Order

### Phase 1: Environment Setup
1. Clean up any existing containers
2. Start fresh Docker environment
3. Verify all services are running

### Phase 2: Authentication Testing
1. Test all user roles (admin, researcher, analyst)
2. Test invalid authentication scenarios
3. Verify role-based permissions

### Phase 3: Functional Testing
1. Test CLI commands (queue, status)
2. Test API endpoints
3. Test WebSocket communication

### Phase 4: Integration Testing
1. Test complete workflows
2. Test error scenarios
3. Test cleanup procedures

### Phase 5: Performance Testing
1. Run benchmarks
2. Perform load testing
3. Validate performance metrics

## Automated Testing

### Continuous Integration Tests
```bash
# Run all tests
make test

# Run specific test categories
make test-unit
make test-integration
make test-e2e
```

### Pre-deployment Checklist
```bash
# Complete test suite
./scripts/testing/run-full-test-suite.sh

# Performance validation
./scripts/benchmarks/run-benchmarks-local.sh

# Security validation
./scripts/security/security-scan.sh
```

## Test Data Management

### Test Users
- **admin_user**: Full access, can see all jobs
- **researcher1**: Can create and view own jobs
- **analyst1**: Read-only access, cannot create jobs

### Test Jobs
- **test-job**: Basic job for testing
- **research-job**: Research-specific job
- **analysis-job**: Analysis-specific job

## Troubleshooting

### Common Issues

#### Authentication Failures
- Check API key configuration
- Verify server is running with auth enabled
- Check YAML config syntax

#### Container Issues
- Verify Docker daemon is running
- Check port conflicts
- Review container logs

#### Performance Issues
- Monitor resource usage
- Check for memory leaks
- Verify database connections

### Debug Commands
```bash
# Check container logs
docker logs ml-prod-api

# Check system resources
docker stats

# Verify network connectivity
docker network ls
```

## Test Results Documentation

All test results should be documented in:
- `test-results/` directory
- Performance benchmarks
- Integration test reports
- Security scan results

## Maintenance

### Regular Tasks
- Update test data periodically
- Review and update test cases
- Maintain test infrastructure
- Monitor test performance

### Test Environment
- Keep test environment isolated
- Use consistent test data
- Regular cleanup of test artifacts
- Monitor test resource usage
@@ -1,10 +1,62 @@
# Testing Guide

Comprehensive testing documentation for FetchML platform.
Comprehensive testing documentation for FetchML platform with integrated monitoring.

## Quick Start Testing

For a fast 5-minute testing experience, see the **[Quick Start Testing Guide](quick-start-testing.md)**.
### 5-Minute Fast Test

```bash
# Clean environment
make self-cleanup

# Start development stack with monitoring
make dev-up

# Quick authentication test
make test-auth

# Clean up
make dev-down
```

**Expected Results**:
- Admin user: Full access, shows all jobs
- Researcher user: Own jobs only
- Analyst user: Read-only access

## Test Environment Setup

### Development Environment with Monitoring

```bash
# Start development stack with monitoring
make dev-up

# Verify all services are running
make dev-status

# Run tests against running services
make test

# Check monitoring during tests
# Grafana: http://localhost:3000 (admin/admin123)
```

### Test Environment Verification

```bash
# Verify API server
curl -f http://localhost:8080/health

# Verify monitoring services
curl -f http://localhost:3000/api/health
curl -f http://localhost:9090/api/v1/query?query=up
curl -f http://localhost:3100/ready

# Verify Redis
docker exec ml-experiments-redis redis-cli ping
```

## Test Types
@@ -12,6 +64,9 @@ For a fast 5-minute testing experience, see the **[Quick Start Testing Guide](qu
```bash
make test-unit            # Go unit tests only
cd cli && zig build test  # Zig CLI tests

# Unit tests live under tests/unit/ (including tests that cover internal/ packages)
go test ./tests/unit/...
```

### Integration Tests
@@ -22,6 +77,9 @@ make test-integration # API and database integration
### End-to-End Tests
```bash
make test-e2e  # Full workflow testing

# Podman E2E is opt-in because it builds/runs containers
FETCH_ML_E2E_PODMAN=1 go test ./tests/e2e/...  # Enables TestPodmanIntegration
```

### All Tests
@@ -30,66 +88,308 @@ make test # Run complete test suite
make test-coverage # With coverage report
```

## Docker Testing
## Deployment-Specific Testing

### Development Environment Testing

### Development Environment
```bash
docker-compose up -d
# Start dev stack
make dev-up

# Run tests with monitoring
make test
docker-compose down

# View test results in Grafana
# Load Test Performance dashboard
# System Health dashboard
```

### Production Environment Testing

```bash
docker-compose -f docker-compose.prod.yml up -d
cd deployments
make prod-up

# Test production deployment
make test-auth     # Multi-user auth test
make self-cleanup  # Clean up after testing

# Verify production monitoring
curl -f https://your-domain.com/health
```

### Homelab Secure Testing

```bash
cd deployments
make homelab-up

# Test secure deployment
make test-auth
make test-ssl  # SSL/TLS testing
```

## Authentication Testing Protocol

### Multi-User Authentication

```bash
# Test admin user
cp ~/.ml/config-admin.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows admin status and all jobs

# Test researcher user
cp ~/.ml/config-researcher.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows researcher status and own jobs only

# Test analyst user
cp ~/.ml/config-analyst.toml ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Shows analyst status, read-only access
```

### API Key Validation

```bash
# Test invalid API key
echo "invalid_key" > ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: Authentication failed error

# Test missing API key
rm ~/.ml/config.toml
./cli/zig-out/bin/ml status
# Expected: API key not configured error
```

### Job Queueing by Role

```bash
# Admin can queue jobs
cp ~/.ml/config-admin.toml ~/.ml/config.toml
echo "admin job" | ./cli/zig-out/bin/ml queue admin-test

# Researcher can queue jobs
cp ~/.ml/config-researcher.toml ~/.ml/config.toml
echo "research job" | ./cli/zig-out/bin/ml queue research-test

# Analyst cannot queue jobs (should fail)
cp ~/.ml/config-analyst.toml ~/.ml/config.toml
echo "analysis job" | ./cli/zig-out/bin/ml queue analysis-test
# Expected: Permission denied error
```

## Performance Testing

### Load Testing with Monitoring

```bash
# Start monitoring
make dev-up

# Run load tests
make load-test

# Monitor performance in real-time
# Grafana: http://localhost:3000
# Check: Request rates, response times, error rates
```

### Benchmark Suite
```bash
./scripts/benchmarks/run-benchmarks-local.sh
```

### Load Testing
```bash
make test-load  # API load testing
```

## Authentication Testing

Multi-user authentication testing is fully covered in the **[Quick Start Testing Guide](quick-start-testing.md)**.

```bash
make test-auth  # Quick auth role testing
# Run benchmarks with performance tracking
./scripts/track_performance.sh

# Or run directly
make benchmark-local

# View results in Grafana dashboards
```

### Performance Monitoring During Tests

**Key Metrics to Watch**:
- API response times (95th percentile)
- Error rates (should be < 1%)
- Memory usage trends
- CPU utilization
- Request throughput
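The tail-latency target above can be made concrete: p95 is the nearest-rank 95th percentile of observed request latencies. A minimal Python sketch of how such a figure is derived (the sample latencies are illustrative, not from a real run):

```python
def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with pct% of values <= it."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative request latencies in milliseconds from a hypothetical load test
latencies_ms = [12, 15, 14, 90, 18, 22, 13, 250, 17, 16]

print("p95 latency (ms):", percentile(latencies_ms, 95))
print("median latency (ms):", percentile(latencies_ms, 50))
```

In practice these percentiles come from the Prometheus histograms collected by the dev stack rather than hand-rolled code; the sketch only shows what the number means.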
## CLI Testing

### Build and Test CLI

```bash
cd cli && zig build dev
./cli/zig-out/dev/ml --help
cd cli && zig build --release=fast
./zig-out/bin/ml --help
zig build test
```

### CLI Integration Tests

```bash
make test-cli  # CLI-specific integration tests
```

## Docker Container Testing

### Container Startup

```bash
# Start environment
make dev-up

# Check container status
docker ps --filter "name=ml-"
# Expected: All containers running and healthy
```

### Port Accessibility

```bash
# Test API server port
curl -I http://localhost:8080/health
# Expected: 200 OK response

# Test metrics port
curl -I http://localhost:9100/metrics
# Expected: 200 OK response

# Test monitoring ports
curl -I http://localhost:3000/api/health
curl -I http://localhost:9090/api/v1/query?query=up
```

### Container Cleanup

```bash
# Test cleanup script
make self-cleanup
make dev-down
# Expected: Containers stopped and removed
```

## Monitoring During Testing

### Real-Time Monitoring

```bash
# Access Grafana during tests
# http://localhost:3000 (admin/admin123)

# Key Dashboards:
# - Load Test Performance: Request metrics, response times
# - System Health: Service status, resource usage
# - Log Analysis: Error logs, service logs
```

### Test Metrics Collection

**Automatically Collected**:
- HTTP request metrics
- Response time histograms
- Error counters
- Resource utilization
- Log aggregation

**Manual Test Markers**:
```bash
# Mark test start in logs
echo "TEST_START: $(date)" | tee -a /logs/test.log

# Mark test completion
echo "TEST_END: $(date)" | tee -a /logs/test.log
```

## Test Execution Protocol

### Phase 1: Environment Setup
1. Clean up any existing containers
2. Start fresh Docker environment with monitoring
3. Verify all services are running

### Phase 2: Authentication Testing
1. Test all user roles (admin, researcher, analyst)
2. Test invalid authentication scenarios
3. Verify role-based permissions

### Phase 3: Functional Testing
1. Test CLI commands (queue, status)
2. Test API endpoints
3. Test WebSocket communication

### Phase 4: Integration Testing
1. Test complete workflows
2. Test error scenarios
3. Test cleanup procedures

### Phase 5: Performance Testing
1. Run benchmarks
2. Perform load testing
3. Validate performance metrics

## Troubleshooting Tests

### Common Issues
- **Server not running**: Check with `docker ps --filter "name=ml-"`
- **Authentication failures**: Verify configs in `~/.ml/config-*.toml`
- **Connection issues**: Test API with `curl -I http://localhost:9103/health`

**Server Not Running**:
```bash
# Check service status
make dev-status

# Check container logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana
```

**Authentication Failures**:
```bash
# Verify configs
ls ~/.ml/config-*.toml

# Check API health
curl -I http://localhost:8080/health

# Monitor auth logs in Grafana
# Log Analysis dashboard -> filter "auth"
```

**Performance Issues**:
```bash
# Check resource usage in Grafana
# System Health dashboard

# Check API response times
# Load Test Performance dashboard

# Identify bottlenecks
# Prometheus: http://localhost:9090
```

**Monitoring Issues**:
```bash
# Re-setup monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py

# Restart Grafana
docker restart ml-experiments-grafana

# Check datasource connectivity
# Grafana -> Configuration -> Data Sources
```

### Debug Mode

```bash
make test-debug  # Run tests with verbose output
# Enable debug logging
export LOG_LEVEL=debug
make test

# Monitor debug logs
# Grafana Log Analysis dashboard
```

## Test Configuration
@@ -103,12 +403,27 @@ make test-debug # Run tests with verbose output
- `tests/fixtures/` - Test data and examples
- `tests/benchmarks/` - Performance test data

### Monitoring Configuration
- `monitoring/grafana/provisioning/` - Auto-provisioned datasources
- `monitoring/grafana/dashboards/` - Auto-provisioned dashboards
- `monitoring/prometheus/prometheus.yml` - Metrics collection

## Continuous Integration

### CI Pipeline with Monitoring

Tests run automatically on:
- Pull requests (full suite)
- Main branch commits (unit + integration)
- Releases (full suite + benchmarks)
- **Pull requests**: Full suite + performance benchmarks
- **Main branch**: Unit + integration tests
- **Releases**: Full suite + benchmarks + security scans

### CI Monitoring

During CI runs:
- Performance metrics collected
- Test results tracked in Grafana
- Regression detection
- Automated alerts on failures

## Writing Tests
@@ -117,13 +432,120 @@ Tests run automatically on:
- Integration tests: `tests/e2e/` directory
- Benchmark tests: `tests/benchmarks/` directory

### Test with Monitoring

```go
// Add custom metrics for tests
var testRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "test_requests_total",
        Help: "Total number of test requests",
    },
    []string{"method", "status"},
)

// Log test events for monitoring
log.WithFields(log.Fields{
    "test":      "integration_test",
    "operation": "api_call",
    "status":    "success",
}).Info("Test operation completed")
```

### Zig Tests
- CLI tests: `cli/tests/` directory
- Follow Zig testing conventions

## See Also
## Test Result Analysis

- **[Quick Start Testing Guide](quick-start-testing.md)** - Fast 5-minute testing
- **[Testing Protocol](testing-protocol.md)** - Detailed testing procedures
- **[Configuration Reference](configuration-reference.md)** - Test setup
- **[Troubleshooting](troubleshooting.md)** - Common issues
### Grafana Dashboard Analysis

**Load Test Performance**:
- Request rates over time
- Response time percentiles
- Error rate trends
- Throughput metrics

**System Health**:
- Service availability
- Resource utilization
- Memory usage patterns
- CPU consumption

**Log Analysis**:
- Error patterns
- Warning frequency
- Service log aggregation
- Debug information

### Performance Regression Detection

```bash
# Track performance over time
./scripts/track_performance.sh

# Compare with baseline
# Grafana: Compare current run with historical data

# Alert on regressions
# Set up Grafana alerts for performance degradation
```

## Test Cleanup

### Automated Cleanup

```bash
# Clean up test data
make self-cleanup

# Clean up Docker resources
make clean-all

# Reset monitoring data
docker volume rm monitoring_prometheus_data
docker volume rm monitoring_grafana_data
```

### Manual Cleanup

```bash
# Stop test environment
make dev-down

# Remove test artifacts
rm -rf ~/.ml/config-*.toml
rm -rf test-results/
```

## Expected Test Results

### Admin User
```
Status retrieved for user: admin_user (admin: true)
Tasks: X total, X queued, X running, X failed, X completed
```

### Researcher User
```
Status retrieved for user: researcher1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
```

### Analyst User
```
Status retrieved for user: analyst1 (admin: false)
Tasks: X total, X queued, X running, X failed, X completed
```

## Common Commands Reference

```bash
make help          # Show all commands
make test-auth     # Quick auth test
make self-cleanup  # Clean environment
make test-status   # Check system status
make dev-up        # Start dev environment
make dev-down      # Stop dev environment
make dev-status    # Check dev status
```
@@ -7,41 +7,36 @@ Common issues and solutions for Fetch ML.
### Services Not Starting

```bash
# Check Docker status
docker-compose ps
# Check container status
docker ps --filter "name=ml-"

# Restart services
docker-compose down && docker-compose up -d  # testing only

# Check logs
docker-compose logs -f
# Restart development stack
make dev-down
make dev-up
```

### API Not Responding

```bash
# Check health endpoint
curl http://localhost:9101/health
curl http://localhost:8080/health

# Check if port is in use
lsof -i :9101
lsof -i :8080
lsof -i :8443

# Kill process on port
kill -9 $(lsof -ti :9101)
kill -9 $(lsof -ti :8080)
```

### Database Issues
### Database / Redis Issues

```bash
# Check database connection
docker-compose exec postgres psql -U postgres -d fetch_ml
# Check Redis from container
docker exec ml-experiments-redis redis-cli ping

# Reset database
docker-compose down postgres
docker-compose up -d postgres  # testing only

# Check Redis
docker-compose exec redis redis-cli ping
# Check API can reach database (via health endpoint)
curl -f http://localhost:8080/health || echo "API not healthy"
```

## Common Errors
@@ -53,7 +48,7 @@ docker-compose exec redis redis-cli ping

### Database Errors
- **Connection failed**: Verify database type and connection params
- **No such table**: Run migrations with `--migrate` (see [Development Setup](development-setup.md))
- **No such table**: Run migrations with `--migrate` (see [Quick Start](quick-start.md))

### Container Errors
- **Runtime not found**: Set `runtime: docker (testing only)` in config
@@ -65,15 +60,15 @@ docker-compose exec redis redis-cli ping

## Development Issues
- **Build fails**: `go mod tidy` and `cd cli && rm -rf zig-out zig-cache`
- **Tests fail**: Start test dependencies with `docker-compose up -d` or `make test-auth`
- **Tests fail**: Ensure dev stack is running with `make dev-up` or use `make test-auth`

## CLI Issues
- **Not found**: `cd cli && zig build dev`
- **Not found**: `cd cli && zig build --release=fast`
- **Connection errors**: Check `--server` and `--api-key`

## Network Issues
- **Port conflicts**: `lsof -i :9101` and kill processes
- **Firewall**: Allow ports 9101, 6379, 5432
- **Port conflicts**: `lsof -i :8080` / `lsof -i :8443` and kill processes
- **Firewall**: Allow ports 8080, 8443, 6379, 5432

## Configuration Issues
- **Invalid YAML**: `python3 -c "import yaml; yaml.safe_load(open('config.yaml'))"`
@@ -82,13 +77,19 @@ docker-compose exec redis redis-cli ping

## Debug Information
```bash
./bin/api-server --version
docker-compose ps
docker-compose logs api-server | grep ERROR
docker ps --filter "name=ml-"
docker logs ml-experiments-api | grep ERROR
```

## Emergency Reset
```bash
docker-compose down -v
# Stop and remove all dev containers and volumes
make dev-down
docker volume prune

# Remove local data if needed
rm -rf data/ results/ *.db
docker-compose up -d  # testing only

# Start fresh dev stack
make dev-up
```
docs/src/validate.md (new file, 101 lines)

@@ -0,0 +1,101 @@
---
layout: page
title: "Validation (ml validate)"
permalink: /validate/
---

# Validation (`ml validate`)

The `ml validate` command verifies experiment integrity and provenance.

It can be run against:

- A **commit id** (validates the experiment tree + dependency manifest)
- A **task id** (additionally validates the run’s `run_manifest.json` provenance and lifecycle)

## CLI usage

```bash
# Validate by commit
ml validate <commit_id> [--json] [--verbose]

# Validate by task
ml validate --task <task_id> [--json] [--verbose]
```

### Output modes

- Default (human): prints a summary with `errors`, `warnings`, and `failed_checks`.
- `--verbose`: prints all checks under `checks` and includes `expected/actual/details` when present.
- `--json`: prints the raw JSON payload.

## Report shape

The API returns a JSON report of the form:

- `ok`: overall boolean
- `commit_id`: commit being validated (if known)
- `task_id`: task being validated (when validating by task)
- `checks`: map of check name → `{ ok, expected?, actual?, details? }`
- `errors`: list of high-level failures
- `warnings`: list of non-fatal issues
- `ts`: UTC timestamp
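A consumer of the `--json` output can walk the `checks` map directly; a minimal Python sketch against a report of the shape above (the embedded report literal is a trimmed, illustrative example, not real output):

```python
import json

# Illustrative `ml validate --json` output, trimmed to the fields used below
report_json = """
{
  "ok": false,
  "commit_id": "abc123",
  "checks": {
    "experiment_manifest": {"ok": true},
    "run_manifest_location": {"ok": false, "expected": "running", "actual": "finished"}
  },
  "errors": ["run manifest location mismatch"],
  "warnings": []
}
"""

report = json.loads(report_json)

# Collect every failed check, keeping expected/actual when present
failed = {
    name: check
    for name, check in report.get("checks", {}).items()
    if not check.get("ok", False)
}

for name, check in failed.items():
    detail = ""
    if "expected" in check:
        detail = f" (expected={check['expected']!r}, actual={check.get('actual')!r})"
    print(f"FAIL {name}{detail}")

print("overall:", "ok" if report["ok"] else "failed")
```

In a real pipeline the report would come from `ml validate <commit_id> --json` rather than a string literal.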
## Check semantics

- For **task statuses** `running`, `completed`, or `failed`, run-manifest issues are treated as **errors**.
- For **queued/pending** tasks, run-manifest issues are usually **warnings** (the job may not have started yet).

## Notable checks

### Experiment integrity

- `experiment_manifest`: validates the experiment manifest (content-addressed integrity)
- `deps_manifest`: validates that a dependency manifest exists and can be hashed
- `expected_manifest_overall_sha`: compares the task’s recorded manifest SHA to the current manifest SHA
- `expected_deps_manifest`: compares the task’s recorded deps manifest name/SHA to what exists on disk

### Run manifest provenance (task validation)

- `run_manifest`: whether `run_manifest.json` could be found and loaded
- `run_manifest_location`: verifies the manifest was found in the expected bucket:
  - `pending` for queued/pending
  - `running` for running
  - `finished` for completed
  - `failed` for failed
- `run_manifest_task_id`: task id match
- `run_manifest_commit_id`: commit id match
- `run_manifest_deps`: deps manifest name/SHA match
- `run_manifest_snapshot_id`: snapshot id match (when snapshot is part of the task)
- `run_manifest_snapshot_sha256`: snapshot sha256 match (when snapshot sha is recorded)

### Run manifest lifecycle (task validation)

- `run_manifest_lifecycle`:
  - `running`: must have `started_at`, must not have `ended_at`/`exit_code`
  - `completed`/`failed`: must have `started_at`, `ended_at`, `exit_code`, and `ended_at >= started_at`
  - `queued`/`pending`: must not have `ended_at`/`exit_code`
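The lifecycle rules above can be restated as a small predicate; field and status names follow the bullets here, but this is an illustrative re-statement in Python, not the server’s implementation:

```python
def lifecycle_ok(status, manifest):
    """Return True if run_manifest fields are consistent with the task status."""
    started = manifest.get("started_at") is not None
    ended = manifest.get("ended_at") is not None
    exited = manifest.get("exit_code") is not None

    if status == "running":
        # Must have started; must not carry end markers yet
        return started and not ended and not exited
    if status in ("completed", "failed"):
        # Must have full start/end/exit, with the end no earlier than the start
        return (started and ended and exited
                and manifest["ended_at"] >= manifest["started_at"])
    if status in ("queued", "pending"):
        # Job has not run: no end markers allowed
        return not ended and not exited
    return False
```

ISO-8601 UTC timestamps like those in `run_manifest.json` compare correctly as strings, which is why the `>=` comparison above works without parsing.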
## Example report (task validation)

```json
{
  "ok": false,
  "commit_id": "6161616161616161616161616161616161616161",
  "task_id": "task-run-manifest-location-mismatch",
  "checks": {
    "experiment_manifest": {"ok": true},
    "deps_manifest": {"ok": true, "actual": "requirements.txt:..."},
    "run_manifest": {"ok": true},
    "run_manifest_location": {
      "ok": false,
      "expected": "running",
      "actual": "finished"
    }
  },
  "errors": [
    "run manifest location mismatch"
  ],
  "ts": "2025-12-17T18:43:00Z"
}
```
|
@ -47,6 +47,8 @@ export FETCH_ML_CLI_API_KEY="prod-key"
|
|||
- `ml dataset list` – list datasets
|
||||
- `ml monitor` – launch TUI over SSH (remote UI)
|
||||
|
||||
`ml status --json` may include an optional `prewarm` field when the worker is prefetching datasets for the next queued task.
|
||||
|
||||
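Because `prewarm` is optional, scripts should read it defensively; a minimal Python sketch over captured `ml status --json` output (the sample payload below is illustrative and trimmed to the prewarm fields):

```python
import json

# Illustrative capture of `ml status --json`; real output has more fields
status_json = '{"prewarm": [{"worker_id": "worker-1", "task_id": "t-1", "phase": "datasets"}]}'

status = json.loads(status_json)

# Absent `prewarm` simply means no prefetch is in progress
for entry in status.get("prewarm", []):
    print(f"{entry['worker_id']} prewarming {entry['task_id']} ({entry['phase']})")
```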
## Build flavors

- `make all` – release‑small (default)