From 8157f73a70eb9ae41603c0bf0b316c51c13cbc36 Mon Sep 17 00:00:00 2001 From: Jeremie Fraeys Date: Mon, 5 Jan 2026 12:37:40 -0500 Subject: [PATCH] docs(ops): consolidate deployment and performance monitoring docs for Caddy-based setup --- docs/src/deployment.md | 618 +++++++++++++++--------- docs/src/performance-monitoring.md | 723 +++++++++++++++++++++------- docs/src/performance-quick-start.md | 245 ---------- docs/src/production-monitoring.md | 217 --------- docs/src/quick-start.md | 346 ++++++++++--- 5 files changed, 1234 insertions(+), 915 deletions(-) delete mode 100644 docs/src/performance-quick-start.md delete mode 100644 docs/src/production-monitoring.md diff --git a/docs/src/deployment.md b/docs/src/deployment.md index 5005ec6..505e2f2 100644 --- a/docs/src/deployment.md +++ b/docs/src/deployment.md @@ -2,302 +2,480 @@ ## Overview -The ML Experiment Manager supports multiple deployment methods from local development to homelab Docker setups. +The ML Experiment Manager supports multiple deployment methods from local development to production setups with integrated monitoring. + +## TLS / WSS Policy + +- The Zig CLI currently supports `ws://` only (native `wss://` is not implemented). +- For production, use a reverse proxy (Caddy) to terminate TLS/WSS (Approach A) and keep the API server on internal HTTP/WS. +- If you need remote CLI access, use one of: + - an SSH tunnel to the internal `ws://` endpoint + - a private network/VPN so `ws://` is not exposed to the public Internet +- When `server.tls.enabled: false`, the API server still runs on plain HTTP/WS internally. In development, access it via Caddy at `http://localhost:8080/health`. + +## Data Directories + +- `base_path` is where experiment directories live. +- `data_dir` is used for dataset/snapshot materialization and integrity validation. +- If you want `ml validate` to check snapshots/datasets, you must mount `data_dir` into the API server container. 
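+A minimal sketch of how these two paths can be wired into the API server container (the service name and host paths are illustrative, not taken from the repo):
+
+```yaml
+# compose override sketch — host paths and service name are placeholders
+services:
+  api-server:
+    volumes:
+      - ./data/experiments:/data/ml-experiments   # base_path: experiment directories
+      - ./data/datasets:/data/ml-datasets         # data_dir: lets `ml validate` reach snapshots/datasets
+```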
## Quick Start -### Docker Compose (Recommended for Development) +### Development Deployment with Monitoring ```bash -# Clone repository -git clone https://github.com/your-org/fetch_ml.git -cd fetch_ml +# Start development stack with monitoring +make dev-up -# Start all services -docker-compose up -d (testing only) +# Alternative: Use deployment script +cd deployments && make dev-up # Check status -docker-compose ps - -# View logs -docker-compose logs -f api-server +make dev-status ``` -Access the API at `http://localhost:9100` +**Access Services:** +- **API Server (via Caddy)**: http://localhost:8080 +- **API Server (via Caddy + internal TLS)**: https://localhost:8443 +- **Grafana**: http://localhost:3000 (admin/admin123) +- **Prometheus**: http://localhost:9090 +- **Loki**: http://localhost:3100 ## Deployment Options -### 1. Local Development +### 1. Development Environment -#### Prerequisites +**Purpose**: Local development with full monitoring stack -**Container Runtimes:** -- **Docker Compose**: For testing and development only -- **Podman**: For production experiment execution -- Go 1.25+ -- Zig 0.15.2 -- Redis 7+ -- Docker & Docker Compose (optional) +**Services**: API Server, Redis, Prometheus, Grafana, Loki, Promtail -#### Manual Setup +**Configuration**: ```bash -# Start Redis -redis-server +# Using Makefile (recommended) +make dev-up +make dev-down +make dev-status -# Build and run Go server -go build -o bin/api-server ./cmd/api-server -./bin/api-server -config configs/config-local.yaml - -# Build Zig CLI -cd cli -zig build prod -./zig-out/bin/ml --help +# Using deployment script +cd deployments +make dev-up +make dev-down +make dev-status ``` -### 2. Docker Deployment +**Features**: +- Auto-provisioned Grafana dashboards +- Real-time metrics and logs +- Hot reload for development +- Local data persistence -#### Build Image +### 2. 
Production Environment + +**Purpose**: Production deployment with security + +**Services**: API Server, Worker, Redis with authentication + +**Configuration**: ```bash -docker build -t ml-experiment-manager:latest . +cd deployments +make prod-up +make prod-down +make prod-status ``` -#### Run Container +**Features**: +- Secure Redis with authentication +- TLS/WSS via reverse proxy termination (Caddy) +- Production-optimized configurations +- Health checks and restart policies + +### 3. Homelab Secure Environment + +**Purpose**: Secure homelab deployment + +**Services**: API Server, Redis, Caddy reverse proxy + +**Configuration**: ```bash -docker run -d \ - --name ml-api \ - -p 9100:9100 \ - -p 9101:9101 \ - -v $(pwd)/configs:/app/configs:ro \ - -v experiment-data:/data/ml-experiments \ - ml-experiment-manager:latest +cd deployments +make homelab-up +make homelab-down +make homelab-status ``` -#### Docker Compose -```bash -# Development mode (uses root docker-compose.yml) -docker-compose up -d +**Features**: +- Caddy reverse proxy +- TLS termination +- Network isolation +- External networks -# Production deployment -docker-compose -f deployments/docker-compose.prod.yml up -d +## Environment Setup -# Secure homelab deployment -docker-compose -f deployments/docker-compose.homelab-secure.yml up -d - -# With custom configuration -docker-compose -f deployments/docker-compose.prod.yml up --env-file .env.prod -``` - -### 3. Homelab Setup +### Development Environment ```bash -# Use the simple setup script -./setup.sh +# Copy example environment +cp deployments/env.dev.example .env -# Or manually with Docker Compose -docker-compose up -d (testing only) +# Edit as needed +vim .env ``` -### 4. 
Cloud Deployment +**Key Variables**: +- `LOG_LEVEL=info` +- `GRAFANA_ADMIN_PASSWORD=admin123` + +### Production Environment -#### AWS ECS ```bash -# Build and push to ECR -aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY -docker build -t $ECR_REGISTRY/ml-experiment-manager:latest . -docker push $ECR_REGISTRY/ml-experiment-manager:latest +# Copy example environment +cp deployments/env.prod.example .env -# Deploy with ECS CLI -ecs-cli compose --project-name ml-experiment-manager up +# Edit with production values +vim .env ``` -#### Google Cloud Run +**Key Variables**: +- `REDIS_PASSWORD=your-secure-password` +- `JWT_SECRET=your-jwt-secret` +- `SSL_CERT_PATH=/path/to/cert` + +## Monitoring Setup + +### Automatic Configuration + +Monitoring dashboards and datasources are auto-provisioned: + ```bash -# Build and push -gcloud builds submit --tag gcr.io/$PROJECT_ID/ml-experiment-manager +# Setup monitoring provisioning (Grafana datasources/providers) +python3 scripts/setup_monitoring.py -# Deploy -gcloud run deploy ml-experiment-manager \ - --image gcr.io/$PROJECT_ID/ml-experiment-manager \ - --platform managed \ - --region us-central1 \ - --allow-unauthenticated +# Start services (includes monitoring) +make dev-up ``` -## Configuration +### Available Dashboards -### Environment Variables -```yaml -# configs/config-local.yaml -base_path: "/data/ml-experiments" -auth: - enabled: true - api_keys: - - "your-production-api-key" -server: - address: ":9100" - tls: - enabled: true - cert_file: "/app/ssl/cert.pem" - key_file: "/app/ssl/key.pem" +1. **Load Test Performance**: Request rates, response times, error rates +2. **System Health**: Service status, memory, CPU usage +3. **Log Analysis**: Error logs, service logs, log aggregation + +### Manual Configuration + +If auto-provisioning fails: + +1. **Access Grafana**: http://localhost:3000 +2. **Add Data Sources**: + - Prometheus: http://prometheus:9090 + - Loki: http://loki:3100 +3. 
**Import Dashboards**: From `monitoring/grafana/dashboards/` + +## Testing Procedures + +### Pre-Deployment Testing + +```bash +# Run unit tests +make test-unit + +# Run integration tests +make test-integration + +# Run full test suite +make test + +# Run with coverage +make test-coverage ``` -### Docker Compose Environment -```yaml -# docker-compose.yml -version: '3.8' -services: - api-server: - environment: - - REDIS_URL=redis://redis:6379 - - LOG_LEVEL=info - volumes: - - ./configs:/configs:ro - - ./data:/data/experiments -``` +### Load Testing -## Monitoring & Logging +```bash +# Run load tests +make load-test + +# Run specific load scenarios +make benchmark-local + +# Track performance over time +./scripts/track_performance.sh +``` ### Health Checks -- HTTP: `GET /health` -- WebSocket: Connection test -- Redis: Ping check -### Metrics -- Prometheus metrics at `/metrics` -- Custom application metrics -- Container resource usage - -### Logging -- Structured JSON logging -- Log levels: DEBUG, INFO, WARN, ERROR -- Centralized logging via ELK stack - -## Security - -### TLS Configuration ```bash -# Generate self-signed cert (development) -openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes +# Check service health +curl -f http://localhost:8080/health -# Production - use Let's Encrypt -certbot certonly --standalone -d ml-experiments.example.com +# Check monitoring services +curl -f http://localhost:3000/api/health +curl -f http://localhost:9090/api/v1/query?query=up +curl -f http://localhost:3100/ready ``` -### Network Security -- Firewall rules (ports 9100, 9101, 6379) -- VPN access for internal services -- API key authentication -- Rate limiting - -## Performance Tuning - -### Resource Allocation -FetchML now centralizes pacing and container limits under a `resources` section in every server/worker config. 
Example for a homelab box: -```yaml -resources: - max_workers: 1 - desired_rps_per_worker: 2 # conservative pacing per worker - podman_cpus: "2" # Podman --cpus, keeps host responsive - podman_memory: "8g" # Podman --memory, isolates experiment installs -``` - -For high-end machines (e.g., M2 Ultra, 18 performance cores / 64 GB RAM), start with: -```yaml -resources: - max_workers: 2 # two concurrent experiments - desired_rps_per_worker: 5 # faster job submission - podman_cpus: "8" - podman_memory: "32g" -``` -Adjust upward only if experiments stay GPU-bound; keeping Podman limits in place ensures users can install packages inside the container without jeopardizing the host. - -### Scaling Strategies -- Horizontal pod autoscaling -- Redis clustering -- Load balancing -- CDN for static assets - -## Backup & Recovery - -### Data Backup -```bash -# Backup experiment data -docker-compose exec redis redis-cli BGSAVE -docker cp $(docker-compose ps -q redis):/data/dump.rdb ./redis-backup.rdb - -# Backup data volume -docker run --rm -v ml-experiments_redis_data:/data -v $(pwd):/backup alpine tar czf /backup/redis-backup.tar.gz -C /data . -``` - -### Disaster Recovery -1. Restore Redis data -2. Restart services -3. Verify experiment metadata -4. 
Test API endpoints - ## Troubleshooting ### Common Issues -#### API Server Not Starting +**Port Conflicts**: ```bash -# Check logs -docker-compose logs api-server +# Check port usage +lsof -i :9101 +lsof -i :3000 +lsof -i :9090 -# Check configuration -cat configs/config-local.yaml - -# Check Redis connection -docker-compose exec redis redis-cli ping +# Kill conflicting processes +kill -9 ``` -#### WebSocket Connection Issues +**Container Issues**: ```bash -# Test WebSocket -wscat -c ws://localhost:9100/ws +# View container logs +docker logs ml-experiments-api +docker logs ml-experiments-grafana -# Check TLS -openssl s_client -connect localhost:9101 -servername localhost +# Restart services +make dev-restart + +# Clean restart +make dev-down && make dev-up ``` -#### Performance Issues +**Monitoring Issues**: ```bash -# Check resource usage -docker-compose exec api-server ps aux +# Re-setup monitoring configuration +python3 scripts/setup_monitoring.py -# Check Redis memory -docker-compose exec redis redis-cli info memory +# Restart Grafana only +docker restart ml-experiments-grafana ``` -### Debug Mode +### Performance Issues + +**High Memory Usage**: +- Check Grafana dashboards for memory metrics +- Adjust Prometheus retention in `prometheus.yml` +- Monitor log retention in `loki-config.yml` + +**Slow Response Times**: +- Check network connectivity between containers +- Verify Redis performance +- Review API server logs for bottlenecks + +## Maintenance + +### Regular Tasks + +**Weekly**: +- Check Grafana dashboards for anomalies +- Review log files for errors +- Verify backup procedures + +**Monthly**: +- Update Docker images +- Clean up old Docker volumes +- Review and rotate secrets + +### Backup Procedures + +**Data Backup**: ```bash -# Enable debug logging -export LOG_LEVEL=debug -./bin/api-server -config configs/config-local.yaml +# Backup application data +docker run --rm -v ml_data:/data -v $(pwd):/backup alpine tar czf /backup/data-backup.tar.gz -C /data . 
+ +# Backup monitoring data +docker run --rm -v prometheus_data:/data -v $(pwd):/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data . ``` -## CI/CD Integration +**Configuration Backup**: +```bash +# Backup configurations +tar czf config-backup.tar.gz monitoring/ deployments/ configs/ +``` -### GitHub Actions -- Automated testing on PR -- Multi-platform builds -- Security scanning -- Automatic releases +## Security Considerations -### Deployment Pipeline -1. Code commit → GitHub -2. CI/CD pipeline triggers -3. Build and test -4. Security scan -5. Deploy to staging -6. Run integration tests -7. Deploy to production -8. Post-deployment verification +### Development Environment +- Change default Grafana password +- Use environment variables for secrets +- Monitor container logs for security events + +### Production Environment +- Enable Redis authentication +- Use SSL/TLS certificates +- Implement network segmentation +- Regular security updates +- Monitor access logs + +## Performance Optimization + +### Resource Limits + +**Development**: +```yaml +# docker-compose.dev.yml +services: + api-server: + deploy: + resources: + limits: + memory: 512M + cpus: '0.5' +``` + +**Production**: +```yaml +# docker-compose.prod.yml +services: + api-server: + deploy: + resources: + limits: + memory: 2G + cpus: '2.0' +``` + +### Monitoring Optimization + +**Prometheus**: +- Adjust scrape intervals +- Configure retention periods +- Use recording rules for frequent queries + +**Loki**: +- Configure log retention +- Use log sampling for high-volume sources +- Optimize label cardinality + +## Non-Docker Production (systemd) + +This project can be run in production without Docker. The recommended model is: + +- Run `api-server` and `worker` as systemd services. +- Terminate TLS/WSS at Caddy and keep the API server on internal plain HTTP/WS. + +The unit templates below are copy-paste friendly, but you must adjust paths, users, and config locations to your environment. 
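As a reference point for the Caddy side of this model, a minimal site block looks like the sketch below; `ml.example.com` and the upstream `localhost:9100` are placeholders to adjust to your own hostname and internal API server address:

```caddyfile
# /etc/caddy/Caddyfile — hostname and upstream are placeholders
ml.example.com {
    # Caddy terminates TLS (certificates are obtained automatically)
    # and proxies both HTTP and WebSocket traffic to the internal API server.
    reverse_proxy localhost:9100
}
```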
+ +### `fetchml-api.service` + +```ini +[Unit] +Description=FetchML API Server +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +User=fetchml +Group=fetchml +WorkingDirectory=/var/lib/fetchml + +Environment=LOG_LEVEL=info + +ExecStart=/usr/local/bin/api-server -config /etc/fetchml/api.yaml +Restart=on-failure +RestartSec=2 + +NoNewPrivileges=true +PrivateTmp=true +ProtectSystem=strict +ProtectHome=true +ReadWritePaths=/var/lib/fetchml /var/log/fetchml + +[Install] +WantedBy=multi-user.target +``` + +### `fetchml-worker.service` + +```ini +[Unit] +Description=FetchML Worker +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +User=fetchml +Group=fetchml +WorkingDirectory=/var/lib/fetchml + +Environment=LOG_LEVEL=info + +ExecStart=/usr/local/bin/worker -config /etc/fetchml/worker.yaml +Restart=on-failure +RestartSec=2 + +NoNewPrivileges=true +PrivateTmp=true + +[Install] +WantedBy=multi-user.target +``` + +### Optional: `caddy.service` + +Most distros ship a Caddy systemd unit. If you do not have one available, you can use this template. + +```ini +[Unit] +Description=Caddy +Documentation=https://caddyserver.com/docs/ +After=network-online.target +Wants=network-online.target + +[Service] +Type=notify +User=caddy +Group=caddy +ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile +ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile +TimeoutStopSec=5s +LimitNOFILE=1048576 +LimitNPROC=512 +PrivateTmp=true +ProtectSystem=full +AmbientCapabilities=CAP_NET_BIND_SERVICE +CapabilityBoundingSet=CAP_NET_BIND_SERVICE +NoNewPrivileges=true +Restart=on-failure + +[Install] +WantedBy=multi-user.target +``` + +## Migration Guide + +### From Development to Production + +1. **Export Data**: + ```bash + docker exec ml-data redis-cli BGSAVE + docker cp ml-data:/data/dump.rdb ./redis-backup.rdb + ``` + +2. 
**Update Configuration**:
   ```bash
   cp deployments/env.prod.example .env
   # Edit with production values
   ```

3. **Deploy Production**:
   ```bash
   cd deployments
   make prod-up
   ```

4. **Import Data**:
   ```bash
   docker cp ./redis-backup.rdb ml-prod-redis:/data/dump.rdb
   docker restart ml-prod-redis
   ```

## Support

For deployment issues:
-1. Check this guide
-2. Review logs
-3. Check GitHub Issues
-4. Contact maintainers
+1. Check troubleshooting section
+2. Review container logs
+3. Verify network connectivity
+4. Check resource usage in Grafana
\ No newline at end of file
diff --git a/docs/src/performance-monitoring.md b/docs/src/performance-monitoring.md
index e44be21..1683401 100644
--- a/docs/src/performance-monitoring.md
+++ b/docs/src/performance-monitoring.md
@@ -1,231 +1,626 @@
 # Performance Monitoring
 
-This document describes the performance monitoring system for Fetch ML, which automatically tracks benchmark metrics through CI/CD integration with Prometheus and Grafana.
+A comprehensive performance monitoring system for Fetch ML, covering CI/CD integration, profiling, and production deployment. 
-## Overview +## Quick Start -The performance monitoring system provides: +### 5-Minute Setup -- **Automatic benchmark execution** on every CI/CD run -- **Real-time metrics collection** via Prometheus Pushgateway -- **Historical trend visualization** in Grafana dashboards -- **Performance regression detection** -- **Cross-commit comparisons** +```bash +# Start monitoring stack +make dev-up + +# Run benchmarks +make benchmark + +# View results in Grafana +open http://localhost:3000 +``` + +### Basic Profiling + +```bash +# CPU profiling +make profile-load-norate + +# View interactive profile +go tool pprof -http=:8080 cpu_load.out +``` ## Architecture +**Development**: Docker Compose with integrated monitoring +**Production**: Podman + systemd (Linux) +**CI/CD**: GitHub Actions → Prometheus Pushgateway → Grafana + ``` GitHub Actions → Benchmark Tests → Prometheus Pushgateway → Prometheus → Grafana Dashboard ``` ## Components -### 1. GitHub Actions Workflow -- **File**: `.github/workflows/benchmark-metrics.yml` +### 1. Development Monitoring (Docker Compose) + +**Services**: +- **Grafana**: http://localhost:3000 (admin/admin123) +- **Prometheus**: http://localhost:9090 +- **Loki**: http://localhost:3100 +- **Promtail**: Log aggregation + +**Configuration**: +```bash +# Start dev stack with monitoring +make dev-up + +# Verify services +curl -f http://localhost:3000/api/health +curl -f http://localhost:9090/api/v1/query?query=up +curl -f http://localhost:3100/ready +``` + +### 2. 
Production Monitoring (Podman + systemd) + +**Architecture**: +- Each service runs as separate Podman container +- Managed by systemd for automatic restarts +- Proper lifecycle management + +**Setup**: +```bash +# Run production setup script +sudo ./scripts/setup-monitoring-prod.sh /data/monitoring ml-user ml-group + +# Start services +sudo systemctl start prometheus loki promtail grafana +sudo systemctl enable prometheus loki promtail grafana +``` + +**Access**: +- URL: `http://YOUR_SERVER_IP:3000` +- Username: `admin` +- Password: `admin` (change on first login) + +### 3. CI/CD Integration + +**GitHub Actions Workflow**: - **Triggers**: Push to main/develop, PRs, daily schedule, manual - **Function**: Runs benchmarks and pushes metrics to Prometheus -### 2. Prometheus Pushgateway -- **Port**: 9091 -- **Purpose**: Receives benchmark metrics from CI/CD runs -- **URL**: `http://localhost:9091` - -### 3. Prometheus Server -- **Configuration**: `monitoring/prometheus.yml` -- **Scrapes**: Pushgateway for benchmark metrics -- **Retention**: Configurable retention period - -### 4. Grafana Dashboard -- **Location**: `monitoring/dashboards/performance-dashboard.json` -- **Visualizations**: Performance trends, regressions, comparisons -- **Access**: http://localhost:3001 - -## Setup - -### 1. Start Monitoring Stack - +**Setup**: ```bash -make monitoring-performance -``` - -This starts: -- Grafana: http://localhost:3001 (admin/admin) -- Loki: http://localhost:3100 -- Pushgateway: http://localhost:9091 - -### 2. Configure GitHub Secrets - -Add this secret to your GitHub repository: - -``` +# Add GitHub secret PROMETHEUS_PUSHGATEWAY_URL=http://your-pushgateway:9091 ``` -### 3. Verify Integration +## Performance Testing -1. Push code to trigger the workflow -2. Check Pushgateway: http://localhost:9091 -3. 
View metrics in Grafana dashboard - -## Available Metrics - -### Benchmark Metrics - -- `benchmark_time_per_op` - Time per operation in nanoseconds -- `benchmark_memory_per_op` - Memory per operation in bytes -- `benchmark_allocs_per_op` - Allocations per operation - -Labels: -- `benchmark` - Benchmark name (sanitized) -- `job` - Always "benchmark" -- `instance` - GitHub Actions run ID - -### Example Metrics Output - -``` -benchmark_time_per_op{benchmark="BenchmarkAPIServerCreateJobSimple"} 42653 -benchmark_memory_per_op{benchmark="BenchmarkAPIServerCreateJobSimple"} 13518 -benchmark_allocs_per_op{benchmark="BenchmarkAPIServerCreateJobSimple"} 98 -``` - -## Usage - -### Manual Benchmark Execution +### Benchmarks ```bash # Run benchmarks locally make benchmark -# View results in console +# Or run with detailed output go test -bench=. -benchmem ./tests/benchmarks/... + +# Run specific benchmark +go test -bench=BenchmarkName -benchmem ./tests/benchmarks/... + +# Run with race detection +go test -race -bench=. ./tests/benchmarks/... ``` -### Automated Monitoring +### Load Testing -The system automatically runs benchmarks on: +```bash +# Run load test suite +make load-test -- **Every push** to main/develop branches -- **Pull requests** to main branch -- **Daily schedule** at 6:00 AM UTC -- **Manual trigger** via GitHub Actions UI +# Start Redis for load tests +docker run -d -p 6379:6379 redis:alpine +``` -### Viewing Results +### CPU Profiling -1. **Grafana Dashboard**: http://localhost:3001 -2. **Pushgateway**: http://localhost:9091/metrics -3. 
**Prometheus**: http://localhost:9090/targets +#### HTTP Load Test Profiling -## Configuration +```bash +# CPU profile MediumLoad HTTP test (no rate limiting - recommended) +make profile-load-norate -### Prometheus Configuration +# CPU profile MediumLoad HTTP test (with rate limiting) +make profile-load +``` -Edit `monitoring/prometheus.yml` to adjust: +**Analyze Results**: +```bash +# View interactive profile (web UI) +go tool pprof -http=:8081 cpu_load.out + +# View interactive profile (terminal) +go tool pprof cpu_load.out + +# Generate flame graph +go tool pprof -raw cpu_load.out | go-flamegraph.pl > cpu_flame.svg + +# View top functions +go tool pprof -top cpu_load.out +``` + +#### WebSocket Queue Profiling + +```bash +# CPU profile WebSocket → Redis queue → worker path +make profile-ws-queue + +# View interactive profile +go tool pprof -http=:8082 cpu_ws.out +``` + +### Profiling Tips + +- Use `profile-load-norate` for cleaner CPU profiles (no rate limiting delays) +- Profiles run for 60 seconds by default +- Requires Redis running on localhost:6379 +- Results show throughput, latency, and error rate metrics + +## Grafana Dashboards + +### Development Dashboards + +**Access**: http://localhost:3000 (admin/admin123) + +**Available Dashboards**: +1. **Load Test Performance**: Request metrics, response times, error rates +2. **System Health**: Service status, resource usage, memory, CPU +3. 
**Log Analysis**: Error logs, service logs, log aggregation + +### Production Dashboards + +**Auto-loaded Dashboards**: +- **ML Task Queue Monitoring** (metrics) +- **Application Logs** (Loki logs) + +### Key Metrics + +- `benchmark_time_per_op` - Execution time +- `benchmark_memory_per_op` - Memory usage +- `benchmark_allocs_per_op` - Allocation count +- HTTP request rates and response times +- Error rates and system health metrics + +### Worker Resource Metrics + +The worker exposes a Prometheus endpoint (default `:9100/metrics`) which includes ResourceManager and task execution metrics. + +**Resource availability**: +- `fetchml_resources_cpu_total` - Total CPU tokens managed by the worker. +- `fetchml_resources_cpu_free` - Currently free CPU tokens. +- `fetchml_resources_gpu_slots_total{gpu_index="N"}` - Total GPU slots per GPU index. +- `fetchml_resources_gpu_slots_free{gpu_index="N"}` - Free GPU slots per GPU index. + +**Acquisition pressure**: +- `fetchml_resources_acquire_total` - Total resource acquisition attempts. +- `fetchml_resources_acquire_wait_total` - Number of acquisitions that had to wait. +- `fetchml_resources_acquire_timeout_total` - Number of acquisitions that timed out. +- `fetchml_resources_acquire_wait_seconds_total` - Total time spent waiting for resources. + +**Why these help**: +- Debug why runs slow down under load (wait time increases). +- Confirm GPU slot sharing is working (free slots fluctuate as expected). +- Detect saturation and timeouts before tasks start failing. 
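The acquisition counters combine into simple rate queries; for example (a sketch using the metric names listed above):

```promql
# Fraction of acquisitions that had to wait, over a 5m window
rate(fetchml_resources_acquire_wait_total[5m])
  / clamp_min(rate(fetchml_resources_acquire_total[5m]), 1e-9)

# Average seconds waited per waiting acquisition
rate(fetchml_resources_acquire_wait_seconds_total[5m])
  / clamp_min(rate(fetchml_resources_acquire_wait_total[5m]), 1e-9)
```

A sustained rise in either query is the saturation signal described above, visible before acquisitions start timing out.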
+ +### Prometheus Scrape Example (Worker) + +If you run the worker locally on your machine (default metrics port `:9100`) and Prometheus runs in Docker Compose, use `host.docker.internal`: ```yaml scrape_configs: - - job_name: 'benchmark' + - job_name: 'worker' static_configs: - - targets: ['pushgateway:9091'] + - targets: ['host.docker.internal:9100'] + - targets: ['worker:9100'] metrics_path: /metrics - honor_labels: true - scrape_interval: 15s ``` -### Grafana Dashboard +## Production Deployment -Customize the dashboard in `monitoring/dashboards/performance-dashboard.json`: +### Prerequisites -- Add new panels -- Modify queries -- Adjust visualization types -- Set up alerts +- Linux distribution with systemd (Rocky/RHEL/CentOS, Ubuntu/Debian, Arch, SUSE) +- Production app already deployed +- Root or sudo access +- Ports 3000, 9090, 3100 available + +### Service Configuration + +**Prometheus**: +- **Port**: 9090 +- **Config**: `/etc/fetch_ml/monitoring/prometheus/prometheus.yml` +- **Data**: `/data/monitoring/prometheus` +- **Purpose**: Scrapes metrics from API server + +**Loki**: +- **Port**: 3100 +- **Config**: `/etc/fetch_ml/monitoring/loki-config.yml` +- **Data**: `/data/monitoring/loki` +- **Purpose**: Log aggregation + +**Promtail**: +- **Config**: `/etc/fetch_ml/monitoring/promtail-config.yml` +- **Log Source**: `/var/log/fetch_ml/*.log` +- **Purpose**: Ships logs to Loki + +**Grafana**: +- **Port**: 3000 +- **Config**: `/etc/fetch_ml/monitoring/grafana/provisioning` +- **Data**: `/data/monitoring/grafana` +- **Dashboards**: `/var/lib/grafana/dashboards` + +### Management Commands + +```bash +# Check status +sudo systemctl status prometheus grafana loki promtail + +# View logs +sudo journalctl -u prometheus -f +sudo journalctl -u grafana -f +sudo journalctl -u loki -f +sudo journalctl -u promtail -f + +# Restart services +sudo systemctl restart prometheus +sudo systemctl restart grafana + +# Stop all monitoring +sudo systemctl stop prometheus grafana loki 
promtail
```

## Data Retention

### Prometheus
Default: 15 days. Retention is controlled by a command-line flag (it is not a `prometheus.yml` setting); add the flag to the Prometheus container arguments or systemd unit:
```bash
--storage.tsdb.retention.time=30d
```

### Loki
Default: 30 days. Edit `/etc/fetch_ml/monitoring/loki-config.yml`:
```yaml
limits_config:
  retention_period: 30d
```

## Security

### Firewall

**RHEL/Rocky/Fedora (firewalld)**:
```bash
# Remove public access
sudo firewall-cmd --permanent --remove-port=3000/tcp
sudo firewall-cmd --permanent --remove-port=9090/tcp

# Add specific source
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port port="3000" protocol="tcp" accept'
sudo firewall-cmd --reload
```

**Ubuntu/Debian (ufw)**:
```bash
# Remove public access
sudo ufw delete allow 3000/tcp
sudo ufw delete allow 9090/tcp

# Add specific source
sudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp
```

### Authentication

Change the Grafana admin password:
1. Log in to Grafana
2. User menu → Profile → Change Password

### TLS (Optional)

For HTTPS, put a reverse proxy (Caddy, as used elsewhere in this setup, or nginx/Apache) in front of Grafana.

## Performance Regression Detection

```bash
# Create baseline
make detect-regressions

# Analyze current performance
go test -bench=. -benchmem ./tests/benchmarks/... | tee current.json

# Track performance over time
./scripts/track_performance.sh
```

## Troubleshooting

-### Common Issues
-
-1. **Metrics not appearing in Grafana**
   - Check Pushgateway: http://localhost:9091
   - Verify Prometheus targets: http://localhost:9090/targets
   - Check GitHub Actions logs
-
-2. **GitHub Actions workflow failing**
   - Verify `PROMETHEUS_PUSHGATEWAY_URL` secret
   - Check workflow syntax
   - Review benchmark execution logs
-
-3. 
**Pushgateway not receiving metrics**
   - Verify URL accessibility from CI/CD
   - Check network connectivity
   - Review curl command in workflow

-### Debug Commands
+### Development Issues

+**No metrics in Grafana?**
```bash
-# Check running services
-docker ps --filter "name=monitoring"
+# Check services
+docker ps --filter "name=ml-"

-# View Pushgateway metrics
-curl http://localhost:9091/metrics
-
-# Check Prometheus targets
-curl http://localhost:9090/api/v1/targets
-
-# Test manual metric push
-echo "test_metric 123" | curl --data-binary @- http://localhost:9091/metrics/job/test
+# Check monitoring services
+curl http://localhost:3000/api/health
+curl http://localhost:9090/api/v1/query?query=up
```

-## Best Practices
+**Workflow failing?**
+- Verify GitHub secret configuration
+- Check workflow logs in GitHub Actions

-### Benchmark Naming
+**Profiling Issues**:
+```bash
+# If the make target fails with flag errors, run the profiling test directly
+go test ./tests/load -run TestLoadProfile_Medium -count=1 -cpuprofile cpu_load.out -v -args -profile-norate

-Use consistent naming conventions:
-- `BenchmarkAPIServerCreateJob`
-- `BenchmarkMLExperimentTraining`
-- `BenchmarkDatasetOperations`
+# Redis not available? Start a local instance
+docker run -d -p 6379:6379 redis:alpine
+
+# Port conflicts
+lsof -i :3000  # Grafana
+lsof -i :8080  # pprof web UI
+lsof -i :6379  # Redis
+```
+
+### Production Issues
+
+**Grafana shows no data**:
+```bash
+# Check if Prometheus is reachable
+curl http://localhost:9090/-/healthy
+
+# Check datasource in Grafana
+# Settings → Data Sources → Prometheus → Save & Test
+```
+
+**Loki not receiving logs**:
+```bash
+# Check Promtail is running
+sudo systemctl status promtail
+
+# Verify log file exists
+ls -l /var/log/fetch_ml/
+
+# Check Promtail can reach Loki
+curl http://localhost:3100/ready
+```
+
+**Podman containers not starting**:
+```bash
+# Check pod status
+sudo -u ml-user podman pod ps
+sudo -u ml-user podman ps -a
+
+# Remove and recreate
+sudo -u ml-user podman pod stop monitoring
+sudo 
-u ml-user podman pod rm monitoring +sudo systemctl restart prometheus +``` + +## Backup and Recovery + +### Backup Procedures + +```bash +# Development backup +docker run --rm -v prometheus_data:/data -v $(pwd):/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data . +docker run --rm -v grafana_data:/data -v $(pwd):/backup alpine tar czf /backup/grafana-backup.tar.gz -C /data . + +# Production backup +sudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana +sudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus +``` + +### Configuration Backup + +```bash +# Backup configurations +tar czf monitoring-config-backup.tar.gz monitoring/ deployments/ +``` + +## Updates and Maintenance + +### Development Updates + +```bash +# Update monitoring provisioning (Grafana datasources/providers) +python3 scripts/setup_monitoring.py + +# Restart services +make dev-down && make dev-up +``` + +### Production Updates + +```bash +# Pull latest images +sudo -u ml-user podman pull docker.io/grafana/grafana:latest +sudo -u ml-user podman pull docker.io/prom/prometheus:latest +sudo -u ml-user podman pull docker.io/grafana/loki:latest +sudo -u ml-user podman pull docker.io/grafana/promtail:latest + +# Restart services to use new images +sudo systemctl restart grafana prometheus loki promtail +``` + +### Regular Maintenance + +**Weekly**: +- Check Grafana dashboards for anomalies +- Review log files for errors +- Verify backup procedures + +**Monthly**: +- Update Docker/Podman images +- Clean up old data volumes +- Review and rotate secrets + +## Metrics Reference + +### Worker Metrics + +**Task Processing**: +- `fetchml_tasks_processed_total` - Total tasks processed successfully +- `fetchml_tasks_failed_total` - Total tasks failed +- `fetchml_tasks_active` - Currently active tasks +- `fetchml_tasks_queued` - Current queue depth + +**Data Transfer**: +- `fetchml_data_transferred_bytes_total` - Total bytes transferred +- `fetchml_data_fetch_time_seconds_total` - 
Total time fetching datasets +- `fetchml_execution_time_seconds_total` - Total task execution time + +**Prewarming**: +- `fetchml_prewarm_env_hit_total` - Environment prewarm hits (warm image existed) +- `fetchml_prewarm_env_miss_total` - Environment prewarm misses (warm image not found) +- `fetchml_prewarm_env_built_total` - Environment images built for prewarming +- `fetchml_prewarm_env_time_seconds_total` - Total time building prewarm images +- `fetchml_prewarm_snapshot_hit_total` - Snapshot prewarm hits (found in .prewarm/) +- `fetchml_prewarm_snapshot_miss_total` - Snapshot prewarm misses (not in .prewarm/) +- `fetchml_prewarm_snapshot_built_total` - Snapshots prewarmed into .prewarm/ +- `fetchml_prewarm_snapshot_time_seconds_total` - Total time prewarming snapshots + +**Resources**: +- `fetchml_resources_cpu_total` - Total CPU tokens +- `fetchml_resources_cpu_free` - Free CPU tokens +- `fetchml_resources_gpu_slots_total` - Total GPU slots per index +- `fetchml_resources_gpu_slots_free` - Free GPU slots per index + +### API Server Metrics + +**HTTP**: +- `fetchml_http_requests_total` - Total HTTP requests +- `fetchml_http_duration_seconds` - HTTP request duration + +**WebSocket**: +- `fetchml_websocket_connections` - Active WebSocket connections +- `fetchml_websocket_messages_total` - Total WebSocket messages +- `fetchml_websocket_duration_seconds` - Message processing duration +- `fetchml_websocket_errors_total` - WebSocket errors + +**Jupyter**: +- `fetchml_jupyter_services` - Jupyter services count +- `fetchml_jupyter_operations_total` - Jupyter operations + +## Worker Configuration: Prewarming + +### Prewarm Flag + +Enable Phase 1 prewarming in worker configuration: + +```yaml +# worker-config.yaml +prewarm_enabled: true # Default: false (opt-in) +``` + +**Behavior**: +- When `false`: No prewarming loops run +- When `true`: Worker stages next snapshot and fetches datasets when idle + +**What gets prewarmed**: +1. 
**Snapshots**: Copied to `.prewarm/snapshots//`
2. **Datasets**: Fetched to `.prewarm/datasets/` (if `auto_fetch_data: true`)
3. **Environment images**: Warmed in envpool cache (if deps manifest exists)

**Execution path**:
- During task execution, `StageSnapshotFromPath` checks `.prewarm/snapshots//`
- If found: **Hit** - Renames prewarmed directory into job (fast)
- If not found: **Miss** - Copies from snapshot store (slower)

**Metrics impact**:
- Prewarm hits reduce task startup latency
- Metrics track hit/miss ratios and prewarm timing
- Use `fetchml_prewarm_snapshot_*` metrics to monitor effectiveness

### Grafana Dashboards

**Prewarm Performance Dashboard**:
1. Import `monitoring/grafana/dashboards/prewarm-performance.txt` into Grafana
2. Shows hit rates, build times, and efficiency metrics
3. Use for monitoring prewarm effectiveness

**Worker Resources Dashboard**:
- Added prewarm panels to existing worker-resources dashboard
- Environment and snapshot hit rate percentages
- Prewarm hits vs misses graphs
- Build time and build count metrics

### Prometheus Queries

**Hit Rate Calculations**:
```promql
# Environment prewarm hit rate
100 * (fetchml_prewarm_env_hit_total / clamp_min(fetchml_prewarm_env_hit_total + fetchml_prewarm_env_miss_total, 1))

# Snapshot prewarm hit rate
100 * (fetchml_prewarm_snapshot_hit_total / clamp_min(fetchml_prewarm_snapshot_hit_total + fetchml_prewarm_snapshot_miss_total, 1))
```

**Rate-based Monitoring**:
```promql
# Prewarm activity rate
rate(fetchml_prewarm_env_hit_total[5m])
rate(fetchml_prewarm_snapshot_hit_total[5m])

# Build time rate
rate(fetchml_prewarm_env_time_seconds_total[5m])
rate(fetchml_prewarm_snapshot_time_seconds_total[5m])
```

## Advanced Usage

### Custom Dashboards

1. **Access Grafana**: http://localhost:3000
2. **Create Dashboard**: Click + → Dashboard
3. **Add Panels**: Use Prometheus queries
4. 
**Save Dashboard**: Export JSON for sharing

### Alerting

Set up Grafana alerts for:
- High error rates (> 5%)
- Slow response times (> 1s)
- Service downtime
- Resource exhaustion
- Low prewarm hit rates (< 50%)

### Custom Metrics

Add custom metrics to your Go code:

```go
import "github.com/prometheus/client_golang/prometheus"

var (
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "HTTP request duration in seconds",
        },
        []string{"method", "endpoint"},
    )
)

// Record metrics
requestDuration.WithLabelValues("GET", "/api/v1/jobs").Observe(duration)
```

## See Also

- 
Consult Grafana/Prometheus docs - ---- - -*Last updated: December 2024* +- **[Testing Guide](testing.md)** - Testing with monitoring +- **[Deployment Guide](deployment.md)** - Deployment procedures +- **[Architecture Guide](architecture.md)** - System architecture +- **[Troubleshooting](troubleshooting.md)** - Common issues \ No newline at end of file diff --git a/docs/src/performance-quick-start.md b/docs/src/performance-quick-start.md deleted file mode 100644 index ae3279f..0000000 --- a/docs/src/performance-quick-start.md +++ /dev/null @@ -1,245 +0,0 @@ -# Performance Monitoring Quick Start - -Get started with performance monitoring and profiling in 5 minutes. - -## Quick Start Options - -### Option 1: Basic Benchmarking -```bash -# Run benchmarks -make benchmark - -# View results in Grafana -open http://localhost:3001 -``` - -### Option 2: CPU Profiling -```bash -# Generate CPU profile -make profile-load-norate - -# View interactive profile -go tool pprof -http=:8080 cpu_load.out -``` - -### Option 3: Full Monitoring Stack -```bash -# Start monitoring services -make monitoring-performance - -# Run benchmarks with metrics collection -make benchmark - -# View in Grafana dashboard -open http://localhost:3001 -``` - -## Prerequisites - -- Docker and Docker Compose -- Go 1.21 or later -- Redis (for load tests) -- GitHub repository (for CI/CD integration) - -## 1. Setup & Installation - -### Start Monitoring Stack (Optional) - -For full metrics visualization: - -```bash -make monitoring-performance -``` - -This starts: -- **Grafana**: http://localhost:3001 (admin/admin) -- **Pushgateway**: http://localhost:9091 -- **Loki**: http://localhost:3100 - -### Start Redis (Required for Load Tests) - -```bash -docker run -d -p 6379:6379 redis:alpine -``` - -## 2. Performance Testing - -### Benchmarks - -```bash -# Run benchmarks locally -make benchmark - -# Or run with detailed output -go test -bench=. -benchmem ./tests/benchmarks/... 
-``` - -### Load Testing - -```bash -# Run load test suite -make load-test -``` - -## 3. CPU Profiling - -### HTTP Load Test Profiling - -```bash -# CPU profile MediumLoad HTTP test (with rate limiting) -make profile-load - -# CPU profile MediumLoad HTTP test (no rate limiting - recommended) -make profile-load-norate -``` - -**Analyze Results:** -```bash -# View interactive profile (web UI) -go tool pprof -http=:8081 cpu_load.out - -# View interactive profile (terminal) -go tool pprof cpu_load.out - -# Generate flame graph -go tool pprof -raw cpu_load.out | go-flamegraph.pl > cpu_flame.svg - -# View top functions -go tool pprof -top cpu_load.out -``` - -Web UI: http://localhost:8080 - -### WebSocket Queue Profiling - -```bash -# CPU profile WebSocket → Redis queue → worker path -make profile-ws-queue -``` - -**Analyze Results:** -```bash -# View interactive profile (web UI) -go tool pprof -http=:8082 cpu_ws.out - -# View interactive profile (terminal) -go tool pprof cpu_ws.out -``` - -### Profiling Tips - -- Use `profile-load-norate` for cleaner CPU profiles (no rate limiting delays) -- Profiles run for 60 seconds by default -- Requires Redis running on localhost:6379 -- Results show throughput, latency, and error rate metrics - -## 4. Results & Visualization - -### Grafana Dashboard - -Open: http://localhost:3001 (admin/admin) - -Navigate to the **Performance Dashboard** to see: -- Real-time benchmark results -- Historical trends -- Performance comparisons - -### Key Metrics - -- `benchmark_time_per_op` - Execution time -- `benchmark_memory_per_op` - Memory usage -- `benchmark_allocs_per_op` - Allocation count - -## 5. CI/CD Integration - -### Setup GitHub Integration - -Add GitHub secret: -``` -PROMETHEUS_PUSHGATEWAY_URL=http://your-pushgateway:9091 -``` - -Now benchmarks run automatically on: -- Every push to main/develop -- Pull requests -- Daily schedule - -### Verify Integration - -1. Push code to trigger workflow -2. 
Check Pushgateway: http://localhost:9091/metrics -3. View metrics in Grafana - -## 6. Troubleshooting - -### Monitoring Stack Issues - -**No metrics in Grafana?** -```bash -# Check services -docker ps --filter "name=monitoring" - -# Check Pushgateway -curl http://localhost:9091/metrics -``` - -**Workflow failing?** -- Verify GitHub secret configuration -- Check workflow logs in GitHub Actions - -### Profiling Issues - -**Flag error like "flag provided but not defined: -test.paniconexit0"** -```bash -# This should be fixed now, but if it persists: -go test ./tests/load -run TestLoadProfile_Medium -count=1 -cpuprofile cpu_load.out -v -args -profile-norate -``` - -**Redis not available?** -```bash -# Start Redis for profiling tests -docker run -d -p 6379:6379 redis:alpine - -# Check profile file generated -ls -la cpu_load.out -``` - -**Port conflicts?** -```bash -# Check if ports are in use -lsof -i :3001 # Grafana -lsof -i :8080 # pprof web UI -lsof -i :6379 # Redis -``` - -## 7. Advanced Usage - -### Performance Regression Detection -```bash -# Create baseline -make detect-regressions - -# Analyze current performance -go test -bench=. -benchmem ./tests/benchmarks/... | tee current.json -``` - -### Custom Benchmarks -```bash -# Run specific benchmark -go test -bench=BenchmarkName -benchmem ./tests/benchmarks/... - -# Run with race detection -go test -race -bench=. ./tests/benchmarks/... -``` - -## 8. 
Further Reading - -- [Full Documentation](performance-monitoring.md) -- [Dashboard Customization](performance-monitoring.md#grafana-dashboard) -- [Alert Configuration](performance-monitoring.md#alerting) -- [Architecture Guide](architecture.md) -- [Testing Guide](testing.md) - ---- - -*Ready in 5 minutes!* diff --git a/docs/src/production-monitoring.md b/docs/src/production-monitoring.md deleted file mode 100644 index e08f693..0000000 --- a/docs/src/production-monitoring.md +++ /dev/null @@ -1,217 +0,0 @@ -# Production Monitoring Deployment Guide (Linux) - -This guide covers deploying the monitoring stack (Prometheus, Grafana, Loki, Promtail) on Linux production servers. - -## Architecture - -**Testing**: Docker Compose (macOS/Linux) -**Production**: Podman + systemd (Linux) - -**Important**: Docker is for testing only. Podman is used for running actual ML experiments in production. - -**Dev (Testing)**: Docker Compose -**Prod (Experiments)**: Podman + systemd - -Each service runs as a separate Podman container managed by systemd for automatic restarts and proper lifecycle management. - -## Prerequisites - -**Container Runtimes:** -- **Docker Compose**: For testing and development only -- **Podman**: For production experiment execution - -- Linux distribution with systemd (Rocky/RHEL/CentOS, Ubuntu/Debian, Arch, SUSE, etc.) -- Production app already deployed (see `scripts/setup-prod.sh`) -- Root or sudo access -- Ports 3000, 9090, 3100 available - -## Quick Setup - -### 1. Run Setup Script -```bash -cd /path/to/fetch_ml -sudo ./scripts/setup-monitoring-prod.sh /data/monitoring ml-user ml-group -``` - -This will: -- Create directory structure at `/data/monitoring` -- Copy configuration files to `/etc/fetch_ml/monitoring` -- Create systemd services for each component -- Set up firewall rules - -### 2. 
Start Services -```bash -# Start all monitoring services -sudo systemctl start prometheus -sudo systemctl start loki -sudo systemctl start promtail -sudo systemctl start grafana - -# Enable on boot -sudo systemctl enable prometheus loki promtail grafana -``` - -### 3. Access Grafana -- URL: `http://YOUR_SERVER_IP:3000` -- Username: `admin` -- Password: `admin` (change on first login) - -Dashboards will auto-load: -- **ML Task Queue Monitoring** (metrics) -- **Application Logs** (Loki logs) - -## Service Details - -### Prometheus -- **Port**: 9090 -- **Config**: `/etc/fetch_ml/monitoring/prometheus.yml` -- **Data**: `/data/monitoring/prometheus` -- **Purpose**: Scrapes metrics from API server - -### Loki -- **Port**: 3100 -- **Config**: `/etc/fetch_ml/monitoring/loki-config.yml` -- **Data**: `/data/monitoring/loki` -- **Purpose**: Log aggregation - -### Promtail -- **Config**: `/etc/fetch_ml/monitoring/promtail-config.yml` -- **Log Source**: `/var/log/fetch_ml/*.log` -- **Purpose**: Ships logs to Loki - -### Grafana -- **Port**: 3000 -- **Config**: `/etc/fetch_ml/monitoring/grafana/provisioning` -- **Data**: `/data/monitoring/grafana` -- **Dashboards**: `/var/lib/grafana/dashboards` - -## Management Commands - -```bash -# Check status -sudo systemctl status prometheus grafana loki promtail - -# View logs -sudo journalctl -u prometheus -f -sudo journalctl -u grafana -f -sudo journalctl -u loki -f -sudo journalctl -u promtail -f - -# Restart services -sudo systemctl restart prometheus -sudo systemctl restart grafana - -# Stop all monitoring -sudo systemctl stop prometheus grafana loki promtail -``` - -## Data Retention - -### Prometheus -Default: 15 days. Edit `/etc/fetch_ml/monitoring/prometheus.yml`: -```yaml -storage: - tsdb: - retention.time: 30d -``` - -### Loki -Default: 30 days. 
Edit `/etc/fetch_ml/monitoring/loki-config.yml`: -```yaml -limits_config: - retention_period: 30d -``` - -## Security - -### Firewall -The setup script automatically configures firewall rules using the detected firewall manager (firewalld or ufw). - -For manual firewall configuration: - -**RHEL/Rocky/Fedora (firewalld)**: -```bash -# Remove public access -sudo firewall-cmd --permanent --remove-port=3000/tcp -sudo firewall-cmd --permanent --remove-port=9090/tcp - -# Add specific source -sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port port="3000" protocol="tcp" accept' -sudo firewall-cmd --reload -``` - -**Ubuntu/Debian (ufw)**: -```bash -# Remove public access -sudo ufw delete allow 3000/tcp -sudo ufw delete allow 9090/tcp - -# Add specific source -sudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp -``` - -### Authentication -Change Grafana admin password: -1. Login to Grafana -2. User menu → Profile → Change Password - -### TLS (Optional) -For HTTPS, configure reverse proxy (nginx/Apache) in front of Grafana. 
- -## Troubleshooting - -### Grafana shows no data -```bash -# Check if Prometheus is reachable -curl http://localhost:9090/-/healthy - -# Check datasource in Grafana -# Settings → Data Sources → Prometheus → Save & Test -``` - -### Loki not receiving logs -```bash -# Check Promtail is running -sudo systemctl status promtail - -# Verify log file exists -ls -l /var/log/fetch_ml/ - -# Check Promtail can reach Loki -curl http://localhost:3100/ready -``` - -### Podman containers not starting -```bash -# Check pod status -sudo -u ml-user podman pod ps -sudo -u ml-user podman ps -a - -# Remove and recreate -sudo -u ml-user podman pod stop monitoring -sudo -u ml-user podman pod rm monitoring -sudo systemctl restart prometheus -``` - -## Backup - -```bash -# Backup Grafana dashboards and data -sudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana - -# Backup Prometheus data -sudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus -``` - -## Updates - -```bash -# Pull latest images -sudo -u ml-user podman pull docker.io/grafana/grafana:latest -sudo -u ml-user podman pull docker.io/prom/prometheus:latest -sudo -u ml-user podman pull docker.io/grafana/loki:latest -sudo -u ml-user podman pull docker.io/grafana/promtail:latest - -# Restart services to use new images -sudo systemctl restart grafana prometheus loki promtail -``` diff --git a/docs/src/quick-start.md b/docs/src/quick-start.md index 342d3f9..3b275e6 100644 --- a/docs/src/quick-start.md +++ b/docs/src/quick-start.md @@ -1,6 +1,6 @@ # Quick Start -Get Fetch ML running in minutes with Docker Compose. +Get Fetch ML running in minutes with Docker Compose and integrated monitoring. ## Prerequisites @@ -8,9 +8,13 @@ Get Fetch ML running in minutes with Docker Compose. 
- **Docker Compose**: For testing and development only
- **Podman**: For production experiment execution

**Requirements:**
- Go 1.21+
- Zig 0.11+
- Docker Compose (testing only)
- 4GB+ RAM
- 2GB+ disk space
- Git

## One-Command Setup

```bash
# Clone and start
git clone https://github.com/jfraeys/fetch_ml.git
cd fetch_ml
make dev-up

# Wait for services (30 seconds)
sleep 30

# Verify setup
curl http://localhost:8080/health
```

Note: the development compose runs the API server over HTTP/WS for CLI compatibility. For HTTPS/WSS, terminate TLS at a reverse proxy.

**Access Services:**
- **API Server (via Caddy)**: http://localhost:8080
- **API Server (via Caddy + internal TLS)**: https://localhost:8443
- **Grafana**: http://localhost:3000 (admin/admin123)
- **Prometheus**: http://localhost:9090
- **Loki**: http://localhost:3100

## Development Setup

### Build Components

```bash
# Build all components
make build

# Development build
make dev
```

### Start Services

```bash
# Start development stack with monitoring
make dev-up

# Check status
make dev-status

# Stop services
make dev-down
```

### Verify Setup

```bash
# Check API health
curl -f http://localhost:8080/health

# Check monitoring services
curl -f http://localhost:3000/api/health
curl -f http://localhost:9090/api/v1/query?query=up
curl -f http://localhost:3100/ready

# Check Redis
docker exec ml-experiments-redis redis-cli ping
```

## First Experiment

### 1. Setup CLI

```bash
# Build CLI
cd cli && zig build --release=fast

# Initialize CLI config
./cli/zig-out/bin/ml init
```

### 2. Queue Job

```bash
# Simple test job
echo "test experiment" | ./cli/zig-out/bin/ml queue test-job

# Check status
./cli/zig-out/bin/ml status
```

### 3. 
Monitor Progress

```bash
# View in Grafana
open http://localhost:3000

# Check logs in Grafana Log Analysis dashboard
# Or view container logs
docker logs ml-experiments-api -f
```

## Key Commands

### Development Commands

```bash
make help # Show all commands
make build # Build all components
make dev-up # Start dev environment
make dev-down # Stop dev environment
make dev-status # Check dev status
make test # Run tests
make test-unit # Run unit tests
make test-integration # Run integration tests
```

### CLI Commands

```bash
# Build CLI
cd cli && zig build --release=fast

# Common operations
./cli/zig-out/bin/ml status # Check system status
./cli/zig-out/bin/ml queue job-name # Queue job
./cli/zig-out/bin/ml list # List jobs
./cli/zig-out/bin/ml help # Show help
```

### Monitoring Commands

```bash
# Access monitoring services
open http://localhost:3000 # Grafana
open http://localhost:9090 # Prometheus
open http://localhost:3100 # Loki

# (Optional) Re-generate Grafana provisioning (datasources/providers)
python3 scripts/setup_monitoring.py
```

## Configuration

### Environment Setup

```bash
# Copy example environment
cp deployments/env.dev.example .env

# Edit as needed
vim .env
```

**Key Variables**:
- `LOG_LEVEL=info`
- `GRAFANA_ADMIN_PASSWORD=admin123`

### CLI Configuration

```bash
# Setup CLI config
mkdir -p ~/.ml

# Create config file if needed
touch ~/.ml/config.toml

# Edit configuration
vim ~/.ml/config.toml
```

## Testing

### Quick Test

```bash
# 5-minute authentication test
make test-auth

# Clean up
make self-cleanup
```

### Full Test Suite

```bash
# Run all tests
make 
test

# Run with coverage
make test-coverage

# Run specific test types
make test-unit
make test-integration
make test-e2e
```

### Load Testing

```bash
# Run load tests
make load-test

# Run benchmarks
make benchmark

# Track performance
./scripts/track_performance.sh
```

## Troubleshooting

### Common Issues

**Port Conflicts**:
```bash
# Check port usage
lsof -i :8080
lsof -i :8443
lsof -i :3000
lsof -i :9090

# Kill conflicting processes
kill -9 <PID>
```

**Build Issues**:
```bash
# Fix Go modules
go mod tidy

# Fix Zig build
cd cli && rm -rf zig-out zig-cache && zig build --release=fast
```

**Container Issues**:
```bash
# Check container status
docker ps --filter "name=ml-"

# View logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana

# Restart services
make dev-down && make dev-up
```

**Monitoring Issues**:
```bash
# Re-setup monitoring
python3 scripts/setup_monitoring.py

# Restart Grafana
docker restart ml-experiments-grafana

# Check datasources in Grafana
# Settings → Data Sources → Test connection
```

### Debug Mode

```bash
# Enable debug logging
export LOG_LEVEL=debug
make dev-up
```

## Next Steps

### Explore Features

1. **Job Management**: Queue and monitor ML experiments
2. **WebSocket Communication**: Real-time updates
3. **Multi-User Authentication**: Role-based access control
4. **Performance Monitoring**: Grafana dashboards and metrics
5. 
**Log Aggregation**: Centralized logging with Loki + +### Advanced Configuration + +- **Production Setup**: See [Deployment Guide](deployment.md) +- **Performance Monitoring**: See [Performance Monitoring](performance-monitoring.md) +- **Testing Procedures**: See [Testing Guide](testing.md) +- **CLI Reference**: See [CLI Reference](cli-reference.md) + +### Production Deployment + +For production deployment: +1. Review [Deployment Guide](deployment.md) +2. Set up production monitoring +3. Configure security and authentication +4. Set up backup procedures + +## Help and Support + +### Get Help + +```bash +make help # Show all available commands +./cli/zig-out/bin/ml --help # CLI help +``` + +### Documentation + +- **[Testing Guide](testing.md)** - Comprehensive testing procedures +- **[Deployment Guide](deployment.md)** - Production deployment +- **[Performance Monitoring](performance-monitoring.md)** - Monitoring setup +- **[Architecture Guide](architecture.md)** - System architecture +- **[Troubleshooting](troubleshooting.md)** - Common issues + +### Community + +- Check logs: `docker logs ml-experiments-api` +- Review documentation in `docs/src/` +- Use `--debug` flag with CLI commands for detailed output + +--- + +*Ready in minutes!* \ No newline at end of file
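The one-command setup above waits a fixed 30 seconds (`sleep 30`) before checking health. Polling until the endpoint responds is more robust; here is a minimal sketch (the `wait_for` helper name is ours, and the example URL is the Caddy endpoint the quick start exposes):

```shell
#!/bin/sh
# wait_for CMD [TRIES]: run CMD once per second until it succeeds,
# or give up after TRIES attempts (default 30). Returns 0 on success.
wait_for() {
  cmd=$1
  tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if sh -c "$cmd" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example (assumes the dev stack from `make dev-up` is starting):
# wait_for "curl -fsS http://localhost:8080/health" 60 && echo "API is up"
```

The same helper works for the Grafana (`:3000/api/health`) and Loki (`:3100/ready`) checks shown earlier.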