docs(ops): consolidate deployment and performance monitoring docs for Caddy-based setup

This commit is contained in:
Jeremie Fraeys 2026-01-05 12:37:40 -05:00
parent c0eeeda940
commit 8157f73a70
5 changed files with 1234 additions and 915 deletions

## Overview
The ML Experiment Manager supports multiple deployment methods, from local development to production setups with integrated monitoring.
## TLS / WSS Policy
- The Zig CLI currently supports `ws://` only (native `wss://` is not implemented).
- For production, use a reverse proxy (Caddy) to terminate TLS/WSS (Approach A) and keep the API server on internal HTTP/WS.
- If you need remote CLI access, use one of:
- an SSH tunnel to the internal `ws://` endpoint
- a private network/VPN so `ws://` is not exposed to the public Internet
- When `server.tls.enabled: false`, the API server still runs on plain HTTP/WS internally. In development, access it via Caddy at `http://localhost:8080/health`.
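For Approach A, a minimal Caddyfile sketch (the hostname is a placeholder, and the backend port assumes the API server's default `:9100`):

```
ml.example.com {
	# Caddy obtains and renews TLS certificates automatically.
	# WebSocket upgrade requests are proxied transparently, so
	# clients connect with wss:// while the backend stays ws://.
	reverse_proxy localhost:9100
}
```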
## Data Directories
- `base_path` is where experiment directories live.
- `data_dir` is used for dataset/snapshot materialization and integrity validation.
- If you want `ml validate` to check snapshots/datasets, you must mount `data_dir` into the API server container.
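Assuming `base_path` and `data_dir` point at the paths below, the corresponding mounts might look like this compose sketch (host paths and the `api-server` service name are illustrative):

```yaml
services:
  api-server:
    volumes:
      - ./data/experiments:/data/ml-experiments   # base_path: experiment directories
      - ./data/datasets:/data/datasets            # data_dir: required for `ml validate` snapshot/dataset checks
```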
## Quick Start
### Development Deployment with Monitoring
```bash
# Clone repository
git clone https://github.com/your-org/fetch_ml.git
cd fetch_ml
# Start development stack with monitoring
make dev-up

# Alternative: use deployment script
cd deployments && make dev-up

# Check status
make dev-status

# View logs
docker-compose logs -f api-server
```
**Access Services:**
- **API Server (via Caddy)**: http://localhost:8080
- **API Server (via Caddy + internal TLS)**: https://localhost:8443
- **Grafana**: http://localhost:3000 (admin/admin123)
- **Prometheus**: http://localhost:9090
- **Loki**: http://localhost:3100
## Deployment Options
### 1. Development Environment
**Purpose**: Local development with full monitoring stack
**Container Runtimes:**
- **Docker Compose**: For testing and development only
- **Podman**: For production experiment execution
**Services**: API Server, Redis, Prometheus, Grafana, Loki, Promtail
**Configuration**:
```bash
# Using Makefile (recommended)
make dev-up
make dev-down
make dev-status

# Using deployment script
cd deployments
make dev-up
make dev-down
make dev-status
```
**Features**:
- Auto-provisioned Grafana dashboards
- Real-time metrics and logs
- Hot reload for development
- Local data persistence
### 2. Production Environment
**Purpose**: Production deployment with security
**Services**: API Server, Worker, Redis with authentication
**Configuration**:
```bash
cd deployments
make prod-up
make prod-down
make prod-status
```
**Features**:
- Secure Redis with authentication
- TLS/WSS via reverse proxy termination (Caddy)
- Production-optimized configurations
- Health checks and restart policies
### 3. Homelab Secure Environment
**Purpose**: Secure homelab deployment
**Services**: API Server, Redis, Caddy reverse proxy
**Configuration**:
```bash
cd deployments
make homelab-up
make homelab-down
make homelab-status
```
**Features**:
- Caddy reverse proxy
- TLS termination
- Network isolation
- External networks
## Environment Setup
### Development Environment
```bash
# Copy example environment
cp deployments/env.dev.example .env

# Edit as needed
vim .env
```
**Key Variables**:
- `LOG_LEVEL=info`
- `GRAFANA_ADMIN_PASSWORD=admin123`
### Production Environment
```bash
# Copy example environment
cp deployments/env.prod.example .env

# Edit with production values
vim .env
```
**Key Variables**:
- `REDIS_PASSWORD=your-secure-password`
- `JWT_SECRET=your-jwt-secret`
- `SSL_CERT_PATH=/path/to/cert`
## Monitoring Setup
### Automatic Configuration
Monitoring dashboards and datasources are auto-provisioned:
```bash
# Setup monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py

# Start services (includes monitoring)
make dev-up
```
### Available Dashboards
1. **Load Test Performance**: Request rates, response times, error rates
2. **System Health**: Service status, memory, CPU usage
3. **Log Analysis**: Error logs, service logs, log aggregation
### Manual Configuration
If auto-provisioning fails:
1. **Access Grafana**: http://localhost:3000
2. **Add Data Sources**:
- Prometheus: http://prometheus:9090
- Loki: http://loki:3100
3. **Import Dashboards**: From `monitoring/grafana/dashboards/`
## Testing Procedures
### Pre-Deployment Testing
```bash
# Run unit tests
make test-unit
# Run integration tests
make test-integration
# Run full test suite
make test
# Run with coverage
make test-coverage
```
### Load Testing
```bash
# Run load tests
make load-test

# Run specific load scenarios
make benchmark-local

# Track performance over time
./scripts/track_performance.sh
```
## Monitoring & Logging
### Health Checks
- HTTP: `GET /health`
- WebSocket: Connection test
- Redis: Ping check
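As an illustration of the HTTP check, a minimal probe sketch in Python (the base URL is whatever host/port your deployment exposes; `check_health` is our name, not part of the project):

```python
import urllib.request
import urllib.error

def check_health(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx status
        return False
```

The same check works against the Caddy endpoint (`http://localhost:8080`) or the internal API port, depending on where you probe.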
### Metrics
- Prometheus metrics at `/metrics`
- Custom application metrics
- Container resource usage
### Logging
- Structured JSON logging
- Log levels: DEBUG, INFO, WARN, ERROR
- Centralized logging via Loki and Promtail
## Security
### TLS Configuration
```bash
# Generate self-signed cert (development)
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes

# Production - use Let's Encrypt
certbot certonly --standalone -d ml-experiments.example.com
```
Verify services after deployment:
```bash
# Check service health
curl -f http://localhost:8080/health

# Check monitoring services
curl -f http://localhost:3000/api/health
curl -f "http://localhost:9090/api/v1/query?query=up"
curl -f http://localhost:3100/ready
```
### Network Security
- Firewall rules (ports 9100, 9101, 6379)
- VPN access for internal services
- API key authentication
- Rate limiting
## Performance Tuning
### Resource Allocation
FetchML now centralizes pacing and container limits under a `resources` section in every server/worker config. Example for a homelab box:
```yaml
resources:
  max_workers: 1
  desired_rps_per_worker: 2  # conservative pacing per worker
  podman_cpus: "2"           # Podman --cpus, keeps host responsive
  podman_memory: "8g"        # Podman --memory, isolates experiment installs
```
For high-end machines (e.g., M2 Ultra, 18 performance cores / 64GB RAM), start with:
```yaml
resources:
  max_workers: 2              # two concurrent experiments
  desired_rps_per_worker: 5   # faster job submission
  podman_cpus: "8"
  podman_memory: "32g"
```
Adjust upward only if experiments stay GPU-bound; keeping Podman limits in place ensures users can install packages inside the container without jeopardizing the host.
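As a sanity check when tuning, the ceiling on job-submission throughput is simply the product of the two pacing settings; a quick sketch (the helper is illustrative, not part of FetchML):

```python
def aggregate_rps(max_workers: int, desired_rps_per_worker: float) -> float:
    """Upper bound on job submissions per second across all workers."""
    return max_workers * desired_rps_per_worker

# Homelab profile: 1 worker x 2 rps -> ceiling of 2 jobs/sec
print(aggregate_rps(1, 2))
# High-end profile: 2 workers x 5 rps -> ceiling of 10 jobs/sec
print(aggregate_rps(2, 5))
```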
### Scaling Strategies
- Horizontal pod autoscaling
- Redis clustering
- Load balancing
- CDN for static assets
## Backup & Recovery
### Data Backup
```bash
# Backup experiment data
docker-compose exec redis redis-cli BGSAVE
docker cp $(docker-compose ps -q redis):/data/dump.rdb ./redis-backup.rdb
# Backup data volume
docker run --rm -v ml-experiments_redis_data:/data -v $(pwd):/backup alpine tar czf /backup/redis-backup.tar.gz -C /data .
```
### Disaster Recovery
1. Restore Redis data
2. Restart services
3. Verify experiment metadata
4. Test API endpoints
## Troubleshooting
### Common Issues
**Port Conflicts**:
```bash
# Check port usage
lsof -i :9101
lsof -i :3000
lsof -i :9090

# Kill conflicting processes
kill -9 <PID>
```
**Container Issues**:
```bash
# View container logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana

# Restart services
make dev-restart

# Clean restart
make dev-down && make dev-up
```
**Monitoring Issues**:
```bash
# Re-setup monitoring configuration
python3 scripts/setup_monitoring.py

# Restart Grafana only
docker restart ml-experiments-grafana
```
### Performance Issues
**High Memory Usage**:
- Check Grafana dashboards for memory metrics
- Adjust Prometheus retention in `prometheus.yml`
- Monitor log retention in `loki-config.yml`
**Slow Response Times**:
- Check network connectivity between containers
- Verify Redis performance
- Review API server logs for bottlenecks
## Maintenance
### Regular Tasks
**Weekly**:
- Check Grafana dashboards for anomalies
- Review log files for errors
- Verify backup procedures
**Monthly**:
- Update Docker images
- Clean up old Docker volumes
- Review and rotate secrets
### Backup Procedures
**Data Backup**:
```bash
# Backup application data
docker run --rm -v ml_data:/data -v $(pwd):/backup alpine tar czf /backup/data-backup.tar.gz -C /data .

# Backup monitoring data
docker run --rm -v prometheus_data:/data -v $(pwd):/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data .
```
**Configuration Backup**:
```bash
# Backup configurations
tar czf config-backup.tar.gz monitoring/ deployments/ configs/
```
## Security Considerations
### Development Environment
- Change default Grafana password
- Use environment variables for secrets
- Monitor container logs for security events
### Production Environment
- Enable Redis authentication
- Use SSL/TLS certificates
- Implement network segmentation
- Regular security updates
- Monitor access logs
## Performance Optimization
### Resource Limits
**Development**:
```yaml
# docker-compose.dev.yml
services:
  api-server:
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '0.5'
```
**Production**:
```yaml
# docker-compose.prod.yml
services:
  api-server:
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '2.0'
```
### Monitoring Optimization
**Prometheus**:
- Adjust scrape intervals
- Configure retention periods
- Use recording rules for frequent queries
**Loki**:
- Configure log retention
- Use log sampling for high-volume sources
- Optimize label cardinality
## Non-Docker Production (systemd)
This project can be run in production without Docker. The recommended model is:
- Run `api-server` and `worker` as systemd services.
- Terminate TLS/WSS at Caddy and keep the API server on internal plain HTTP/WS.
The unit templates below are copy-paste friendly, but you must adjust paths, users, and config locations to your environment.
### `fetchml-api.service`
```ini
[Unit]
Description=FetchML API Server
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml
Environment=LOG_LEVEL=info
ExecStart=/usr/local/bin/api-server -config /etc/fetchml/api.yaml
Restart=on-failure
RestartSec=2
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/fetchml /var/log/fetchml
[Install]
WantedBy=multi-user.target
```
### `fetchml-worker.service`
```ini
[Unit]
Description=FetchML Worker
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml
Environment=LOG_LEVEL=info
ExecStart=/usr/local/bin/worker -config /etc/fetchml/worker.yaml
Restart=on-failure
RestartSec=2
NoNewPrivileges=true
PrivateTmp=true
[Install]
WantedBy=multi-user.target
```
### Optional: `caddy.service`
Most distros ship a Caddy systemd unit. If you do not have one available, you can use this template.
```ini
[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
NoNewPrivileges=true
Restart=on-failure
[Install]
WantedBy=multi-user.target
```
## Migration Guide
### From Development to Production
1. **Export Data**:
```bash
docker exec ml-data redis-cli BGSAVE
docker cp ml-data:/data/dump.rdb ./redis-backup.rdb
```
2. **Update Configuration**:
```bash
cp deployments/env.dev.example deployments/env.prod.example
# Edit with production values
```
3. **Deploy Production**:
```bash
cd deployments
make prod-up
```
4. **Import Data**:
```bash
docker cp ./redis-backup.rdb ml-prod-redis:/data/dump.rdb
docker restart ml-prod-redis
```
## Support
For deployment issues:
1. Check the troubleshooting section
2. Review container logs
3. Verify network connectivity
4. Check resource usage in Grafana

# Performance Monitoring
Comprehensive performance monitoring system for Fetch ML with CI/CD integration, profiling, and production deployment.
## Quick Start
### 5-Minute Setup
```bash
# Start monitoring stack
make dev-up
# Run benchmarks
make benchmark
# View results in Grafana
open http://localhost:3000
```
### Basic Profiling
```bash
# CPU profiling
make profile-load-norate
# View interactive profile
go tool pprof -http=:8080 cpu_load.out
```
## Architecture
**Development**: Docker Compose with integrated monitoring
**Production**: Podman + systemd (Linux)
**CI/CD**: GitHub Actions → Prometheus Pushgateway → Grafana
## Components
### 1. Development Monitoring (Docker Compose)
**Services**:
- **Grafana**: http://localhost:3000 (admin/admin123)
- **Prometheus**: http://localhost:9090
- **Loki**: http://localhost:3100
- **Promtail**: Log aggregation
**Configuration**:
```bash
# Start dev stack with monitoring
make dev-up
# Verify services
curl -f http://localhost:3000/api/health
curl -f http://localhost:9090/api/v1/query?query=up
curl -f http://localhost:3100/ready
```
### 2. Production Monitoring (Podman + systemd)
**Architecture**:
- Each service runs as separate Podman container
- Managed by systemd for automatic restarts
- Proper lifecycle management
**Setup**:
```bash
# Run production setup script
sudo ./scripts/setup-monitoring-prod.sh /data/monitoring ml-user ml-group
# Start services
sudo systemctl start prometheus loki promtail grafana
sudo systemctl enable prometheus loki promtail grafana
```
**Access**:
- URL: `http://YOUR_SERVER_IP:3000`
- Username: `admin`
- Password: `admin` (change on first login)
### 3. CI/CD Integration
**GitHub Actions Workflow**:
- **Triggers**: Push to main/develop, PRs, daily schedule, manual
- **Function**: Runs benchmarks and pushes metrics to Prometheus
**Setup**:
```bash
make monitoring-performance
```
This starts:
- Grafana: http://localhost:3001 (admin/admin)
- Loki: http://localhost:3100
- Pushgateway: http://localhost:9091
Add the Pushgateway URL as a GitHub repository secret:
```
PROMETHEUS_PUSHGATEWAY_URL=http://your-pushgateway:9091
```
## Performance Testing
### Benchmarks
```bash
# Run benchmarks locally
make benchmark
# Run with detailed output
go test -bench=. -benchmem ./tests/benchmarks/...
# Run specific benchmark
go test -bench=BenchmarkName -benchmem ./tests/benchmarks/...
# Run with race detection
go test -race -bench=. ./tests/benchmarks/...
```
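The CI workflow turns lines of `go test -bench` output into the pushed `benchmark_*` metrics; a sketch of that parsing step (the exact workflow implementation may differ, and the output format assumed is the standard Go benchmark line shown in the comment):

```python
import re

# Typical `go test -bench=. -benchmem` line, e.g.:
# BenchmarkAPIServerCreateJobSimple-8   30000   42653 ns/op   13518 B/op   98 allocs/op
BENCH_RE = re.compile(
    r"^(Benchmark\S+?)(?:-\d+)?\s+\d+\s+([\d.]+) ns/op"
    r"(?:\s+([\d.]+) B/op)?(?:\s+([\d.]+) allocs/op)?"
)

def parse_bench_line(line: str):
    """Parse one benchmark result line; return None for non-benchmark lines."""
    m = BENCH_RE.match(line.strip())
    if not m:
        return None
    name, ns, mem, allocs = m.groups()
    return {
        "benchmark": name,                                  # label for the pushed metric
        "time_per_op": float(ns),                           # -> benchmark_time_per_op
        "memory_per_op": float(mem) if mem else None,       # -> benchmark_memory_per_op
        "allocs_per_op": float(allocs) if allocs else None, # -> benchmark_allocs_per_op
    }
```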
### Load Testing
```bash
# Run load test suite
make load-test

# Start Redis for load tests
docker run -d -p 6379:6379 redis:alpine
```
### CPU Profiling
#### HTTP Load Test Profiling
```bash
# CPU profile MediumLoad HTTP test (no rate limiting - recommended)
make profile-load-norate

# CPU profile MediumLoad HTTP test (with rate limiting)
make profile-load
```
**Analyze Results**:
```bash
# View interactive profile (web UI)
go tool pprof -http=:8081 cpu_load.out
# View interactive profile (terminal)
go tool pprof cpu_load.out
# Generate flame graph (requires Brendan Gregg's FlameGraph scripts on PATH)
go tool pprof -raw cpu_load.out | stackcollapse-go.pl | flamegraph.pl > cpu_flame.svg
# View top functions
go tool pprof -top cpu_load.out
```
#### WebSocket Queue Profiling
```bash
# CPU profile WebSocket → Redis queue → worker path
make profile-ws-queue
# View interactive profile
go tool pprof -http=:8082 cpu_ws.out
```
### Profiling Tips
- Use `profile-load-norate` for cleaner CPU profiles (no rate limiting delays)
- Profiles run for 60 seconds by default
- Requires Redis running on localhost:6379
- Results show throughput, latency, and error rate metrics
## Grafana Dashboards
### Development Dashboards
**Access**: http://localhost:3000 (admin/admin123)
**Available Dashboards**:
1. **Load Test Performance**: Request metrics, response times, error rates
2. **System Health**: Service status, resource usage, memory, CPU
3. **Log Analysis**: Error logs, service logs, log aggregation
### Production Dashboards
**Auto-loaded Dashboards**:
- **ML Task Queue Monitoring** (metrics)
- **Application Logs** (Loki logs)
### Key Metrics
- `benchmark_time_per_op` - Execution time
- `benchmark_memory_per_op` - Memory usage
- `benchmark_allocs_per_op` - Allocation count
- HTTP request rates and response times
- Error rates and system health metrics
### Worker Resource Metrics
The worker exposes a Prometheus endpoint (default `:9100/metrics`) which includes ResourceManager and task execution metrics.
**Resource availability**:
- `fetchml_resources_cpu_total` - Total CPU tokens managed by the worker.
- `fetchml_resources_cpu_free` - Currently free CPU tokens.
- `fetchml_resources_gpu_slots_total{gpu_index="N"}` - Total GPU slots per GPU index.
- `fetchml_resources_gpu_slots_free{gpu_index="N"}` - Free GPU slots per GPU index.
**Acquisition pressure**:
- `fetchml_resources_acquire_total` - Total resource acquisition attempts.
- `fetchml_resources_acquire_wait_total` - Number of acquisitions that had to wait.
- `fetchml_resources_acquire_timeout_total` - Number of acquisitions that timed out.
- `fetchml_resources_acquire_wait_seconds_total` - Total time spent waiting for resources.
**Why these help**:
- Debug why runs slow down under load (wait time increases).
- Confirm GPU slot sharing is working (free slots fluctuate as expected).
- Detect saturation and timeouts before tasks start failing.
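For example, the average time spent waiting per acquisition over the last 5 minutes can be derived from the counters above (a PromQL sketch, not a shipped recording rule):

```promql
rate(fetchml_resources_acquire_wait_seconds_total[5m])
  / rate(fetchml_resources_acquire_total[5m])
```

Sustained growth in this ratio means workers are queuing on CPU tokens or GPU slots.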
### Prometheus Scrape Example (Worker)
If you run the worker locally on your machine (default metrics port `:9100`) and Prometheus runs in Docker Compose, use `host.docker.internal`:
```yaml
scrape_configs:
  - job_name: 'worker'
    static_configs:
      - targets: ['host.docker.internal:9100']
    metrics_path: /metrics
    scrape_interval: 15s
```
## Production Deployment
### Prerequisites
- Linux distribution with systemd (Rocky/RHEL/CentOS, Ubuntu/Debian, Arch, SUSE)
- Production app already deployed
- Root or sudo access
- Ports 3000, 9090, 3100 available
### Service Configuration
**Prometheus**:
- **Port**: 9090
- **Config**: `/etc/fetch_ml/monitoring/prometheus/prometheus.yml`
- **Data**: `/data/monitoring/prometheus`
- **Purpose**: Scrapes metrics from API server
**Loki**:
- **Port**: 3100
- **Config**: `/etc/fetch_ml/monitoring/loki-config.yml`
- **Data**: `/data/monitoring/loki`
- **Purpose**: Log aggregation
**Promtail**:
- **Config**: `/etc/fetch_ml/monitoring/promtail-config.yml`
- **Log Source**: `/var/log/fetch_ml/*.log`
- **Purpose**: Ships logs to Loki
**Grafana**:
- **Port**: 3000
- **Config**: `/etc/fetch_ml/monitoring/grafana/provisioning`
- **Data**: `/data/monitoring/grafana`
- **Dashboards**: `/var/lib/grafana/dashboards`
### Management Commands
```bash
# Check status
sudo systemctl status prometheus grafana loki promtail
# View logs
sudo journalctl -u prometheus -f
sudo journalctl -u grafana -f
sudo journalctl -u loki -f
sudo journalctl -u promtail -f
# Restart services
sudo systemctl restart prometheus
sudo systemctl restart grafana
# Stop all monitoring
sudo systemctl stop prometheus grafana loki promtail
```
## Data Retention
### Prometheus
Default: 15 days. Retention is controlled by a command-line flag rather than a `prometheus.yml` setting; add it where the Prometheus service is launched:
```
--storage.tsdb.retention.time=30d
```
### Loki
Default: 30 days. Edit `/etc/fetch_ml/monitoring/loki-config.yml` (recent Loki versions also require `retention_enabled: true` in the `compactor` section for deletion to run):
```yaml
limits_config:
  retention_period: 30d
```
## Security
### Firewall
**RHEL/Rocky/Fedora (firewalld)**:
```bash
# Remove public access
sudo firewall-cmd --permanent --remove-port=3000/tcp
sudo firewall-cmd --permanent --remove-port=9090/tcp
# Add specific source
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port port="3000" protocol="tcp" accept'
sudo firewall-cmd --reload
```
**Ubuntu/Debian (ufw)**:
```bash
# Remove public access
sudo ufw delete allow 3000/tcp
sudo ufw delete allow 9090/tcp
# Add specific source
sudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp
```
### Authentication
Change Grafana admin password:
1. Login to Grafana
2. User menu → Profile → Change Password
### TLS (Optional)
For HTTPS, configure reverse proxy (nginx/Apache) in front of Grafana.
## Performance Regression Detection
```bash
# Detect regressions against the stored baseline
make detect-regressions

# Capture current performance numbers
go test -bench=. -benchmem ./tests/benchmarks/... | tee current.txt
# Track performance over time
./scripts/track_performance.sh
```
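The comparison the regression step performs boils down to a relative-change check; a minimal sketch (the 10% threshold and the function name are our illustration, not the script's actual logic):

```python
def is_regression(baseline_ns: float, current_ns: float, threshold: float = 0.10) -> bool:
    """Flag a regression when time/op grows by more than `threshold` (10% default)."""
    return current_ns > baseline_ns * (1 + threshold)

# ~17% slower than baseline -> flagged
print(is_regression(42653, 50000))
# Within the 10% noise band -> not flagged
print(is_regression(42653, 43000))
```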
## Troubleshooting
### Development Issues
**No metrics in Grafana?**
```bash
# Check services
docker ps --filter "name=ml-"

# View Pushgateway metrics
curl http://localhost:9091/metrics

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Test manual metric push
echo "test_metric 123" | curl --data-binary @- http://localhost:9091/metrics/job/test

# Check monitoring services
curl http://localhost:3000/api/health
curl "http://localhost:9090/api/v1/query?query=up"
```
**Workflow failing?**
- Verify GitHub secret configuration
- Check workflow logs in GitHub Actions
**Profiling Issues**:
```bash
# Run the profile manually if the Make target fails
go test ./tests/load -run TestLoadProfile_Medium -count=1 -cpuprofile cpu_load.out -v -args -profile-norate

# Start Redis if it is not running
docker run -d -p 6379:6379 redis:alpine

# Check for port conflicts
lsof -i :3000 # Grafana
lsof -i :8080 # pprof web UI
lsof -i :6379 # Redis
```
### Production Issues
**Grafana shows no data**:
```bash
# Check if Prometheus is reachable
curl http://localhost:9090/-/healthy
# Check datasource in Grafana
# Settings → Data Sources → Prometheus → Save & Test
```
**Loki not receiving logs**:
```bash
# Check Promtail is running
sudo systemctl status promtail
# Verify log file exists
ls -l /var/log/fetch_ml/
# Check Promtail can reach Loki
curl http://localhost:3100/ready
```
**Podman containers not starting**:
```bash
# Check pod status
sudo -u ml-user podman pod ps
sudo -u ml-user podman ps -a
# Remove and recreate
sudo -u ml-user podman pod stop monitoring
sudo -u ml-user podman pod rm monitoring
sudo systemctl restart prometheus
```
## Backup and Recovery
### Backup Procedures
```bash
# Development backup
docker run --rm -v prometheus_data:/data -v $(pwd):/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data .
docker run --rm -v grafana_data:/data -v $(pwd):/backup alpine tar czf /backup/grafana-backup.tar.gz -C /data .
# Production backup
sudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana
sudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus
```
### Configuration Backup
```bash
# Backup configurations
tar czf monitoring-config-backup.tar.gz monitoring/ deployments/
```
## Updates and Maintenance
### Development Updates
```bash
# Update monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py
# Restart services
make dev-down && make dev-up
```
### Production Updates
```bash
# Pull latest images
sudo -u ml-user podman pull docker.io/grafana/grafana:latest
sudo -u ml-user podman pull docker.io/prom/prometheus:latest
sudo -u ml-user podman pull docker.io/grafana/loki:latest
sudo -u ml-user podman pull docker.io/grafana/promtail:latest
# Restart services to use new images
sudo systemctl restart grafana prometheus loki promtail
```
### Regular Maintenance
**Weekly**:
- Check Grafana dashboards for anomalies
- Review log files for errors
- Verify backup procedures
**Monthly**:
- Update Docker/Podman images
- Clean up old data volumes
- Review and rotate secrets
## Metrics Reference
### Worker Metrics
**Task Processing**:
- `fetchml_tasks_processed_total` - Total tasks processed successfully
- `fetchml_tasks_failed_total` - Total tasks failed
- `fetchml_tasks_active` - Currently active tasks
- `fetchml_tasks_queued` - Current queue depth
**Data Transfer**:
- `fetchml_data_transferred_bytes_total` - Total bytes transferred
- `fetchml_data_fetch_time_seconds_total` - Total time fetching datasets
- `fetchml_execution_time_seconds_total` - Total task execution time
**Prewarming**:
- `fetchml_prewarm_env_hit_total` - Environment prewarm hits (warm image existed)
- `fetchml_prewarm_env_miss_total` - Environment prewarm misses (warm image not found)
- `fetchml_prewarm_env_built_total` - Environment images built for prewarming
- `fetchml_prewarm_env_time_seconds_total` - Total time building prewarm images
- `fetchml_prewarm_snapshot_hit_total` - Snapshot prewarm hits (found in .prewarm/)
- `fetchml_prewarm_snapshot_miss_total` - Snapshot prewarm misses (not in .prewarm/)
- `fetchml_prewarm_snapshot_built_total` - Snapshots prewarmed into .prewarm/
- `fetchml_prewarm_snapshot_time_seconds_total` - Total time prewarming snapshots
**Resources**:
- `fetchml_resources_cpu_total` - Total CPU tokens
- `fetchml_resources_cpu_free` - Free CPU tokens
- `fetchml_resources_gpu_slots_total` - Total GPU slots per index
- `fetchml_resources_gpu_slots_free` - Free GPU slots per index
### API Server Metrics
**HTTP**:
- `fetchml_http_requests_total` - Total HTTP requests
- `fetchml_http_duration_seconds` - HTTP request duration
**WebSocket**:
- `fetchml_websocket_connections` - Active WebSocket connections
- `fetchml_websocket_messages_total` - Total WebSocket messages
- `fetchml_websocket_duration_seconds` - Message processing duration
- `fetchml_websocket_errors_total` - WebSocket errors
**Jupyter**:
- `fetchml_jupyter_services` - Jupyter services count
- `fetchml_jupyter_operations_total` - Jupyter operations
## Worker Configuration: Prewarming
### Prewarm Flag
Enable Phase 1 prewarming in worker configuration:
```yaml
# worker-config.yaml
prewarm_enabled: true # Default: false (opt-in)
```
**Behavior**:
- When `false`: No prewarming loops run
- When `true`: Worker stages next snapshot and fetches datasets when idle
**What gets prewarmed**:
1. **Snapshots**: Copied to `.prewarm/snapshots/<taskID>/`
2. **Datasets**: Fetched to `.prewarm/datasets/` (if `auto_fetch_data: true`)
3. **Environment images**: Warmed in envpool cache (if deps manifest exists)
**Execution path**:
- During task execution, `StageSnapshotFromPath` checks `.prewarm/snapshots/<taskID>/`
- If found: **Hit** - Renames prewarmed directory into job (fast)
- If not found: **Miss** - Copies from snapshot store (slower)
**Metrics impact**:
- Prewarm hits reduce task startup latency
- Metrics track hit/miss ratios and prewarm timing
- Use `fetchml_prewarm_snapshot_*` metrics to monitor effectiveness
### Grafana Dashboards
**Prewarm Performance Dashboard**:
1. Import `monitoring/grafana/dashboards/prewarm-performance.txt` into Grafana
2. Shows hit rates, build times, and efficiency metrics
3. Use for monitoring prewarm effectiveness
**Worker Resources Dashboard**:
- Added prewarm panels to existing worker-resources dashboard
- Environment and snapshot hit rate percentages
- Prewarm hits vs misses graphs
- Build time and build count metrics
### Prometheus Queries
**Hit Rate Calculations**:
```promql
# Environment prewarm hit rate
100 * (fetchml_prewarm_env_hit_total / clamp_min(fetchml_prewarm_env_hit_total + fetchml_prewarm_env_miss_total, 1))
# Snapshot prewarm hit rate
100 * (fetchml_prewarm_snapshot_hit_total / clamp_min(fetchml_prewarm_snapshot_hit_total + fetchml_prewarm_snapshot_miss_total, 1))
```
**Rate-based Monitoring**:
```promql
# Prewarm activity rate
rate(fetchml_prewarm_env_hit_total[5m])
rate(fetchml_prewarm_snapshot_hit_total[5m])
# Build time rate
rate(fetchml_prewarm_env_time_seconds_total[5m])
rate(fetchml_prewarm_snapshot_time_seconds_total[5m])
```
## Advanced Usage
### Custom Dashboards
1. **Access Grafana**: http://localhost:3000
2. **Create Dashboard**: + → Dashboard
3. **Add Panels**: Use Prometheus queries
4. **Save Dashboard**: Export JSON for sharing
### Alerting
Set up Grafana alerts for:
- Performance regressions (>10% degradation)
- Missing benchmark data
- High memory allocation rates
- High error rates (> 5%)
- Slow response times (> 1s)
- Service downtime
- Resource exhaustion
- Low prewarm hit rates (< 50%)
### Retention
Configure appropriate retention periods:
- Raw metrics: 30 days
- Aggregated data: 1 year
- Dashboard snapshots: Permanent
### Custom Metrics
Add custom metrics to your Go code:
```go
import "github.com/prometheus/client_golang/prometheus"

var (
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "HTTP request duration in seconds",
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    // Metrics must be registered before they appear on /metrics.
    prometheus.MustRegister(requestDuration)
}

// Record metrics
requestDuration.WithLabelValues("GET", "/api/v1/jobs").Observe(duration)
```
## Integration with Existing Workflows
The benchmark monitoring integrates seamlessly with:
- **CI/CD pipelines**: Automatic execution
- **Code reviews**: Performance impact visible
- **Release management**: Performance trends over time
- **Development**: Local testing with same metrics
## Future Enhancements
Potential improvements:
1. **Automated performance regression alerts**
2. **Performance budgets and gates**
3. **Comparative analysis across branches**
4. **Integration with load testing results**
5. **Performance impact scoring**
## Support
For issues:
1. Check this documentation
2. Review GitHub Actions logs
3. Verify monitoring stack status
4. Consult Grafana/Prometheus docs
## See Also
- **[Testing Guide](testing.md)** - Testing with monitoring
- **[Deployment Guide](deployment.md)** - Deployment procedures
- **[Architecture Guide](architecture.md)** - System architecture
- **[Troubleshooting](troubleshooting.md)** - Common issues
---
*Last updated: December 2024*


# Performance Monitoring Quick Start
Get started with performance monitoring and profiling in 5 minutes.
## Quick Start Options
### Option 1: Basic Benchmarking
```bash
# Run benchmarks
make benchmark
# View results in Grafana
open http://localhost:3001
```
### Option 2: CPU Profiling
```bash
# Generate CPU profile
make profile-load-norate
# View interactive profile
go tool pprof -http=:8080 cpu_load.out
```
### Option 3: Full Monitoring Stack
```bash
# Start monitoring services
make monitoring-performance
# Run benchmarks with metrics collection
make benchmark
# View in Grafana dashboard
open http://localhost:3001
```
## Prerequisites
- Docker and Docker Compose
- Go 1.21 or later
- Redis (for load tests)
- GitHub repository (for CI/CD integration)
## 1. Setup & Installation
### Start Monitoring Stack (Optional)
For full metrics visualization:
```bash
make monitoring-performance
```
This starts:
- **Grafana**: http://localhost:3001 (admin/admin)
- **Pushgateway**: http://localhost:9091
- **Loki**: http://localhost:3100
### Start Redis (Required for Load Tests)
```bash
docker run -d -p 6379:6379 redis:alpine
```
## 2. Performance Testing
### Benchmarks
```bash
# Run benchmarks locally
make benchmark
# Or run with detailed output
go test -bench=. -benchmem ./tests/benchmarks/...
```
### Load Testing
```bash
# Run load test suite
make load-test
```
## 3. CPU Profiling
### HTTP Load Test Profiling
```bash
# CPU profile MediumLoad HTTP test (with rate limiting)
make profile-load
# CPU profile MediumLoad HTTP test (no rate limiting - recommended)
make profile-load-norate
```
**Analyze Results:**
```bash
# View interactive profile (web UI)
go tool pprof -http=:8081 cpu_load.out
# View interactive profile (terminal)
go tool pprof cpu_load.out
# Generate flame graph (requires the FlameGraph scripts on PATH)
go tool pprof -raw cpu_load.out | stackcollapse-go.pl | flamegraph.pl > cpu_flame.svg
# View top functions
go tool pprof -top cpu_load.out
```
Web UI: http://localhost:8081
### WebSocket Queue Profiling
```bash
# CPU profile WebSocket → Redis queue → worker path
make profile-ws-queue
```
**Analyze Results:**
```bash
# View interactive profile (web UI)
go tool pprof -http=:8082 cpu_ws.out
# View interactive profile (terminal)
go tool pprof cpu_ws.out
```
### Profiling Tips
- Use `profile-load-norate` for cleaner CPU profiles (no rate limiting delays)
- Profiles run for 60 seconds by default
- Requires Redis running on localhost:6379
- Results show throughput, latency, and error rate metrics
## 4. Results & Visualization
### Grafana Dashboard
Open: http://localhost:3001 (admin/admin)
Navigate to the **Performance Dashboard** to see:
- Real-time benchmark results
- Historical trends
- Performance comparisons
### Key Metrics
- `benchmark_time_per_op` - Execution time
- `benchmark_memory_per_op` - Memory usage
- `benchmark_allocs_per_op` - Allocation count
## 5. CI/CD Integration
### Setup GitHub Integration
Add GitHub secret:
```
PROMETHEUS_PUSHGATEWAY_URL=http://your-pushgateway:9091
```
Now benchmarks run automatically on:
- Every push to main/develop
- Pull requests
- Daily schedule
### Verify Integration
1. Push code to trigger workflow
2. Check Pushgateway: http://localhost:9091/metrics
3. View metrics in Grafana
## 6. Troubleshooting
### Monitoring Stack Issues
**No metrics in Grafana?**
```bash
# Check services
docker ps --filter "name=monitoring"
# Check Pushgateway
curl http://localhost:9091/metrics
```
**Workflow failing?**
- Verify GitHub secret configuration
- Check workflow logs in GitHub Actions
### Profiling Issues
**Flag error like "flag provided but not defined: -test.paniconexit0"**
```bash
# This should be fixed now, but if it persists:
go test ./tests/load -run TestLoadProfile_Medium -count=1 -cpuprofile cpu_load.out -v -args -profile-norate
```
**Redis not available?**
```bash
# Start Redis for profiling tests
docker run -d -p 6379:6379 redis:alpine
# Check profile file generated
ls -la cpu_load.out
```
**Port conflicts?**
```bash
# Check if ports are in use
lsof -i :3001 # Grafana
lsof -i :8080 # pprof web UI
lsof -i :6379 # Redis
```
## 7. Advanced Usage
### Performance Regression Detection
```bash
# Create baseline
make detect-regressions
# Analyze current performance
go test -bench=. -benchmem ./tests/benchmarks/... | tee current.txt
```
### Custom Benchmarks
```bash
# Run specific benchmark
go test -bench=BenchmarkName -benchmem ./tests/benchmarks/...
# Run with race detection
go test -race -bench=. ./tests/benchmarks/...
```
## 8. Further Reading
- [Full Documentation](performance-monitoring.md)
- [Dashboard Customization](performance-monitoring.md#grafana-dashboard)
- [Alert Configuration](performance-monitoring.md#alerting)
- [Architecture Guide](architecture.md)
- [Testing Guide](testing.md)
---
*Ready in 5 minutes!*


# Production Monitoring Deployment Guide (Linux)
This guide covers deploying the monitoring stack (Prometheus, Grafana, Loki, Promtail) on Linux production servers.
## Architecture
**Testing**: Docker Compose (macOS/Linux)
**Production**: Podman + systemd (Linux)
**Important**: Docker Compose is for testing only. Podman runs the actual ML experiments in production.
Each service runs as a separate Podman container managed by systemd for automatic restarts and proper lifecycle management.
## Prerequisites
**Container Runtimes:**
- **Docker Compose**: For testing and development only
- **Podman**: For production experiment execution
- Linux distribution with systemd (Rocky/RHEL/CentOS, Ubuntu/Debian, Arch, SUSE, etc.)
- Production app already deployed (see `scripts/setup-prod.sh`)
- Root or sudo access
- Ports 3000, 9090, 3100 available
## Quick Setup
### 1. Run Setup Script
```bash
cd /path/to/fetch_ml
sudo ./scripts/setup-monitoring-prod.sh /data/monitoring ml-user ml-group
```
This will:
- Create directory structure at `/data/monitoring`
- Copy configuration files to `/etc/fetch_ml/monitoring`
- Create systemd services for each component
- Set up firewall rules
### 2. Start Services
```bash
# Start all monitoring services
sudo systemctl start prometheus
sudo systemctl start loki
sudo systemctl start promtail
sudo systemctl start grafana
# Enable on boot
sudo systemctl enable prometheus loki promtail grafana
```
### 3. Access Grafana
- URL: `http://YOUR_SERVER_IP:3000`
- Username: `admin`
- Password: `admin` (change on first login)
Dashboards will auto-load:
- **ML Task Queue Monitoring** (metrics)
- **Application Logs** (Loki logs)
## Service Details
### Prometheus
- **Port**: 9090
- **Config**: `/etc/fetch_ml/monitoring/prometheus.yml`
- **Data**: `/data/monitoring/prometheus`
- **Purpose**: Scrapes metrics from API server
### Loki
- **Port**: 3100
- **Config**: `/etc/fetch_ml/monitoring/loki-config.yml`
- **Data**: `/data/monitoring/loki`
- **Purpose**: Log aggregation
### Promtail
- **Config**: `/etc/fetch_ml/monitoring/promtail-config.yml`
- **Log Source**: `/var/log/fetch_ml/*.log`
- **Purpose**: Ships logs to Loki
### Grafana
- **Port**: 3000
- **Config**: `/etc/fetch_ml/monitoring/grafana/provisioning`
- **Data**: `/data/monitoring/grafana`
- **Dashboards**: `/var/lib/grafana/dashboards`
## Management Commands
```bash
# Check status
sudo systemctl status prometheus grafana loki promtail
# View logs
sudo journalctl -u prometheus -f
sudo journalctl -u grafana -f
sudo journalctl -u loki -f
sudo journalctl -u promtail -f
# Restart services
sudo systemctl restart prometheus
sudo systemctl restart grafana
# Stop all monitoring
sudo systemctl stop prometheus grafana loki promtail
```
## Data Retention
### Prometheus
Default: 15 days. Retention is a Prometheus command-line flag, not a `prometheus.yml` setting; add it to the container/service arguments (e.g., in the systemd unit):
```bash
--storage.tsdb.retention.time=30d
```
### Loki
Default: 30 days. Edit `/etc/fetch_ml/monitoring/loki-config.yml` (deletion also requires the Loki compactor to run with `retention_enabled: true`):
```yaml
limits_config:
  retention_period: 30d
```
## Security
### Firewall
The setup script automatically configures firewall rules using the detected firewall manager (firewalld or ufw).
For manual firewall configuration:
**RHEL/Rocky/Fedora (firewalld)**:
```bash
# Remove public access
sudo firewall-cmd --permanent --remove-port=3000/tcp
sudo firewall-cmd --permanent --remove-port=9090/tcp
# Add specific source
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port port="3000" protocol="tcp" accept'
sudo firewall-cmd --reload
```
**Ubuntu/Debian (ufw)**:
```bash
# Remove public access
sudo ufw delete allow 3000/tcp
sudo ufw delete allow 9090/tcp
# Add specific source
sudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp
```
### Authentication
Change Grafana admin password:
1. Login to Grafana
2. User menu → Profile → Change Password
### TLS (Optional)
For HTTPS, configure a reverse proxy (e.g., Caddy or nginx) in front of Grafana.
## Troubleshooting
### Grafana shows no data
```bash
# Check if Prometheus is reachable
curl http://localhost:9090/-/healthy
# Check datasource in Grafana
# Settings → Data Sources → Prometheus → Save & Test
```
### Loki not receiving logs
```bash
# Check Promtail is running
sudo systemctl status promtail
# Verify log file exists
ls -l /var/log/fetch_ml/
# Check Promtail can reach Loki
curl http://localhost:3100/ready
```
### Podman containers not starting
```bash
# Check pod status
sudo -u ml-user podman pod ps
sudo -u ml-user podman ps -a
# Remove and recreate
sudo -u ml-user podman pod stop monitoring
sudo -u ml-user podman pod rm monitoring
sudo systemctl restart prometheus
```
## Backup
```bash
# Backup Grafana dashboards and data
sudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana
# Backup Prometheus data
sudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus
```
## Updates
```bash
# Pull latest images
sudo -u ml-user podman pull docker.io/grafana/grafana:latest
sudo -u ml-user podman pull docker.io/prom/prometheus:latest
sudo -u ml-user podman pull docker.io/grafana/loki:latest
sudo -u ml-user podman pull docker.io/grafana/promtail:latest
# Restart services to use new images
sudo systemctl restart grafana prometheus loki promtail
```


# Quick Start
Get Fetch ML running in minutes with Docker Compose and integrated monitoring.
## Prerequisites
- **Docker Compose**: For testing and development only
- **Podman**: For production experiment execution
**Requirements:**
- Go 1.21+
- Zig 0.11+
- Docker Compose (testing only)
- 4GB+ RAM
- 2GB+ disk space
- Git
## One-Command Setup
# Clone and start
git clone https://github.com/jfraeys/fetch_ml.git
cd fetch_ml
make dev-up
# Wait for services (30 seconds)
sleep 30
# Verify setup
curl http://localhost:8080/health
```
Note: the development compose runs the API server over HTTP/WS for CLI compatibility. For HTTPS/WSS, terminate TLS at a reverse proxy.
**Access Services:**
- **API Server (via Caddy)**: http://localhost:8080
- **API Server (via Caddy + internal TLS)**: https://localhost:8443
- **Grafana**: http://localhost:3000 (admin/admin123)
- **Prometheus**: http://localhost:9090
- **Loki**: http://localhost:3100
## Development Setup
### Build Components
```bash
# Build all components
make build
# Development build
make dev
```
### Start Services
```bash
# Start development stack with monitoring
make dev-up
# Check status
make dev-status
# Stop services
make dev-down
```
### Verify Setup
```bash
# Check API health
curl -f http://localhost:8080/health
# Check monitoring services
curl -f http://localhost:3000/api/health
curl -f http://localhost:9090/api/v1/query?query=up
curl -f http://localhost:3100/ready
# Check Redis
docker exec ml-experiments-redis redis-cli ping
```
## First Experiment
```bash
# Submit a simple ML job (see [First Experiment](first-experiment.md) for details)
curl -X POST http://localhost:9101/api/v1/jobs \
-H "Content-Type: application/json" \
-H "X-API-Key: admin" \
-d '{
"job_name": "hello-world",
"args": "--echo Hello World",
"priority": 1
}'
# Check job status
curl http://localhost:9101/api/v1/jobs \
-H "X-API-Key: admin"
```
## CLI Access
### 1. Setup CLI
```bash
# Build CLI
cd cli && zig build --release=fast
# Initialize CLI config
./cli/zig-out/bin/ml init
```
### 2. Queue Job
```bash
# Simple test job
echo "test experiment" | ./cli/zig-out/bin/ml queue test-job
# Check status
./cli/zig-out/bin/ml status
```
### 3. Monitor Progress
```bash
# View in Grafana
open http://localhost:3000
# Check logs in Grafana Log Analysis dashboard
# Or view container logs
docker logs ml-experiments-api -f
```
## Key Commands
### Development Commands
```bash
make help # Show all commands
make build # Build all components
make dev-up # Start dev environment
make dev-down # Stop dev environment
make dev-status # Check dev status
make test # Run tests
make test-unit # Run unit tests
make test-integration # Run integration tests
```
### CLI Commands
```bash
# Build CLI
cd cli && zig build --release=fast
# Common operations
./cli/zig-out/bin/ml status # Check system status
./cli/zig-out/bin/ml queue job-name # Queue job
./cli/zig-out/bin/ml list # List jobs
./cli/zig-out/bin/ml help # Show help
```
### Monitoring Commands
```bash
# Access monitoring services
open http://localhost:3000 # Grafana
open http://localhost:9090 # Prometheus
open http://localhost:3100 # Loki
# (Optional) Re-generate Grafana provisioning (datasources/providers)
python3 scripts/setup_monitoring.py
```
## Configuration
### Environment Setup
```bash
# Copy example environment
cp deployments/env.dev.example .env
# Edit as needed
vim .env
```
**Key Variables**:
- `LOG_LEVEL=info`
- `GRAFANA_ADMIN_PASSWORD=admin123`
### CLI Configuration
```bash
# Setup CLI config
mkdir -p ~/.ml
# Create config file if needed
touch ~/.ml/config.toml
# Edit configuration
vim ~/.ml/config.toml
```
## Testing
### Quick Test
```bash
# 5-minute authentication test
make test-auth
# Clean up
make self-cleanup
```
### Full Test Suite
```bash
# Run all tests
make test
# Run with coverage
make test-coverage
# Run specific test types
make test-unit
make test-integration
make test-e2e
```
### Load Testing
```bash
# Run load tests
make load-test
# Run benchmarks
make benchmark
# Track performance
./scripts/track_performance.sh
```
## Troubleshooting
### Common Issues
**Port Conflicts**:
```bash
# Check port usage
lsof -i :8080
lsof -i :8443
lsof -i :3000
lsof -i :9090
# Kill conflicting processes
kill -9 <PID>
```
**Build Issues**:
```bash
# Fix Go modules
go mod tidy
# Fix Zig build
cd cli && rm -rf zig-out zig-cache && zig build --release=fast
```
**Container Issues**:
```bash
# Check container status
docker ps --filter "name=ml-"
# View logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana
# Restart services
make dev-down && make dev-up
```
**Monitoring Issues**:
```bash
# Re-setup monitoring
python3 scripts/setup_monitoring.py
# Restart Grafana
docker restart ml-experiments-grafana
# Check datasources in Grafana
# Settings → Data Sources → Test connection
```
### Debug Mode
```bash
# Enable debug logging
export LOG_LEVEL=debug
make dev-up
```
## Next Steps
### Explore Features
1. **Job Management**: Queue and monitor ML experiments
2. **WebSocket Communication**: Real-time updates
3. **Multi-User Authentication**: Role-based access control
4. **Performance Monitoring**: Grafana dashboards and metrics
5. **Log Aggregation**: Centralized logging with Loki
### Advanced Configuration
- **Production Setup**: See [Deployment Guide](deployment.md)
- **Performance Monitoring**: See [Performance Monitoring](performance-monitoring.md)
- **Testing Procedures**: See [Testing Guide](testing.md)
- **CLI Reference**: See [CLI Reference](cli-reference.md)
### Production Deployment
For production deployment:
1. Review [Deployment Guide](deployment.md)
2. Set up production monitoring
3. Configure security and authentication
4. Set up backup procedures
## Help and Support
### Get Help
```bash
make help # Show all available commands
./cli/zig-out/bin/ml --help # CLI help
```
### Documentation
- **[Testing Guide](testing.md)** - Comprehensive testing procedures
- **[Deployment Guide](deployment.md)** - Production deployment
- **[Performance Monitoring](performance-monitoring.md)** - Monitoring setup
- **[Architecture Guide](architecture.md)** - System architecture
- **[Troubleshooting](troubleshooting.md)** - Common issues
### Community
- Check logs: `docker logs ml-experiments-api`
- Review documentation in `docs/src/`
- Use `--debug` flag with CLI commands for detailed output
---
*Ready in minutes!*