fetch_ml/docs/src/performance-monitoring.md

Performance Monitoring

Comprehensive performance monitoring system for Fetch ML with CI/CD integration, profiling, and production deployment.

Quick Start

5-Minute Setup

# Start monitoring stack
make dev-up

# Run benchmarks
make benchmark

# View results in Grafana
open http://localhost:3000

Basic Profiling

# CPU profiling
make profile-load-norate

# View interactive profile
go tool pprof -http=:8080 cpu_load.out

Architecture

Development: Docker Compose with integrated monitoring
Production: Podman + systemd (Linux)
CI/CD: GitHub Actions → Prometheus Pushgateway → Grafana

GitHub Actions → Benchmark Tests → Prometheus Pushgateway → Prometheus → Grafana Dashboard

Components

1. Development Monitoring (Docker Compose)

Services:

  • Grafana (port 3000) - dashboards
  • Prometheus (port 9090) - metrics
  • Loki (port 3100) - log aggregation
  • Promtail - log shipping

Configuration:

# Start dev stack with monitoring
make dev-up

# Verify services
curl -f http://localhost:3000/api/health
curl -f http://localhost:9090/api/v1/query?query=up
curl -f http://localhost:3100/ready

2. Production Monitoring (Podman + systemd)

Architecture:

  • Each service runs as separate Podman container
  • Managed by systemd for automatic restarts
  • Proper lifecycle management

Setup:

# Run production setup script
sudo ./scripts/setup-monitoring-prod.sh /data/monitoring ml-user ml-group

# Start services
sudo systemctl start prometheus loki promtail grafana
sudo systemctl enable prometheus loki promtail grafana

Access:

  • URL: http://YOUR_SERVER_IP:3000
  • Username: admin
  • Password: admin (change on first login)

3. CI/CD Integration

GitHub Actions Workflow:

  • Triggers: Push to main/develop, PRs, daily schedule, manual
  • Function: Runs benchmarks and pushes metrics to Prometheus

Setup:

# Add GitHub secret
PROMETHEUS_PUSHGATEWAY_URL=http://your-pushgateway:9091

Performance Testing

Benchmarks

# Run benchmarks locally
make benchmark

# Or run with detailed output
go test -bench=. -benchmem ./tests/benchmarks/...

# Run specific benchmark
go test -bench=BenchmarkName -benchmem ./tests/benchmarks/...

# Run with race detection
go test -race -bench=. ./tests/benchmarks/...

Load Testing

# Run load test suite
make load-test

# Start Redis for load tests
docker run -d -p 6379:6379 redis:alpine

CPU Profiling

HTTP Load Test Profiling

# CPU profile MediumLoad HTTP test (no rate limiting - recommended)
make profile-load-norate

# CPU profile MediumLoad HTTP test (with rate limiting)
make profile-load

Analyze Results:

# View interactive profile (web UI)
go tool pprof -http=:8081 cpu_load.out

# View interactive profile (terminal)
go tool pprof cpu_load.out

# Generate flame graph (requires Brendan Gregg's FlameGraph scripts on PATH;
# the pprof -http UI also includes a built-in flame graph view)
go tool pprof -raw cpu_load.out | stackcollapse-go.pl | flamegraph.pl > cpu_flame.svg

# View top functions
go tool pprof -top cpu_load.out

WebSocket Queue Profiling

# CPU profile WebSocket → Redis queue → worker path
make profile-ws-queue

# View interactive profile
go tool pprof -http=:8082 cpu_ws.out

Profiling Tips

  • Use profile-load-norate for cleaner CPU profiles (no rate limiting delays)
  • Profiles run for 60 seconds by default
  • Requires Redis running on localhost:6379
  • Results show throughput, latency, and error rate metrics

Grafana Dashboards

Development Dashboards

Access: http://localhost:3000 (admin/admin123)

Available Dashboards:

  1. Load Test Performance: Request metrics, response times, error rates
  2. System Health: Service status, resource usage, memory, CPU
  3. Log Analysis: Error logs, service logs, log aggregation

Production Dashboards

Auto-loaded Dashboards:

  • ML Task Queue Monitoring (metrics)
  • Application Logs (Loki logs)

Key Metrics

  • benchmark_time_per_op - Execution time
  • benchmark_memory_per_op - Memory usage
  • benchmark_allocs_per_op - Allocation count
  • HTTP request rates and response times
  • Error rates and system health metrics

Worker Resource Metrics

The worker exposes a Prometheus endpoint (default :9100/metrics) which includes ResourceManager and task execution metrics.

Resource availability:

  • fetchml_resources_cpu_total - Total CPU tokens managed by the worker.
  • fetchml_resources_cpu_free - Currently free CPU tokens.
  • fetchml_resources_gpu_slots_total{gpu_index="N"} - Total GPU slots per GPU index.
  • fetchml_resources_gpu_slots_free{gpu_index="N"} - Free GPU slots per GPU index.

Acquisition pressure:

  • fetchml_resources_acquire_total - Total resource acquisition attempts.
  • fetchml_resources_acquire_wait_total - Number of acquisitions that had to wait.
  • fetchml_resources_acquire_timeout_total - Number of acquisitions that timed out.
  • fetchml_resources_acquire_wait_seconds_total - Total time spent waiting for resources.

Why these help:

  • Debug why runs slow down under load (wait time increases).
  • Confirm GPU slot sharing is working (free slots fluctuate as expected).
  • Detect saturation and timeouts before tasks start failing.
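For example, wait pressure can be turned into two ratios using the counters above (PromQL, 5-minute windows):

```promql
# Average wait per waiting acquisition (seconds)
rate(fetchml_resources_acquire_wait_seconds_total[5m])
  / clamp_min(rate(fetchml_resources_acquire_wait_total[5m]), 1e-9)

# Fraction of acquisitions that had to wait
rate(fetchml_resources_acquire_wait_total[5m])
  / clamp_min(rate(fetchml_resources_acquire_total[5m]), 1e-9)
```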

Prometheus Scrape Example (Worker)

If you run the worker locally on your machine (default metrics port :9100) and Prometheus runs in Docker Compose, scrape it via host.docker.internal; if the worker instead runs inside the same Compose network, use its service name:

scrape_configs:
  - job_name: 'worker'
    metrics_path: /metrics
    static_configs:
      - targets: ['host.docker.internal:9100']  # worker on the host
      # - targets: ['worker:9100']              # worker inside Compose

Production Deployment

Prerequisites

  • Linux distribution with systemd (Rocky/RHEL/CentOS, Ubuntu/Debian, Arch, SUSE)
  • Production app already deployed
  • Root or sudo access
  • Ports 3000, 9090, 3100 available

Service Configuration

Prometheus:

  • Port: 9090
  • Config: /etc/fetch_ml/monitoring/prometheus/prometheus.yml
  • Data: /data/monitoring/prometheus
  • Purpose: Scrapes metrics from API server

Loki:

  • Port: 3100
  • Config: /etc/fetch_ml/monitoring/loki-config.yml
  • Data: /data/monitoring/loki
  • Purpose: Log aggregation

Promtail:

  • Config: /etc/fetch_ml/monitoring/promtail-config.yml
  • Log Source: /var/log/fetch_ml/*.log
  • Purpose: Ships logs to Loki

Grafana:

  • Port: 3000
  • Config: /etc/fetch_ml/monitoring/grafana/provisioning
  • Data: /data/monitoring/grafana
  • Dashboards: /var/lib/grafana/dashboards

Management Commands

# Check status
sudo systemctl status prometheus grafana loki promtail

# View logs
sudo journalctl -u prometheus -f
sudo journalctl -u grafana -f
sudo journalctl -u loki -f
sudo journalctl -u promtail -f

# Restart services
sudo systemctl restart prometheus
sudo systemctl restart grafana

# Stop all monitoring
sudo systemctl stop prometheus grafana loki promtail

Data Retention

Prometheus

Default: 15 days. Prometheus takes retention as a command-line flag, not a prometheus.yml setting; add it to the container arguments in the systemd unit (or the setup script):

--storage.tsdb.retention.time=30d

Loki

Default: 30 days. Edit /etc/fetch_ml/monitoring/loki-config.yml (note that Loki only enforces retention when the compactor runs with retention_enabled: true):

limits_config:
  retention_period: 30d

Security

Firewall

RHEL/Rocky/Fedora (firewalld):

# Remove public access
sudo firewall-cmd --permanent --remove-port=3000/tcp
sudo firewall-cmd --permanent --remove-port=9090/tcp

# Add specific source
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port port="3000" protocol="tcp" accept'
sudo firewall-cmd --reload

Ubuntu/Debian (ufw):

# Remove public access
sudo ufw delete allow 3000/tcp
sudo ufw delete allow 9090/tcp

# Add specific source
sudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp

Authentication

Change Grafana admin password:

  1. Login to Grafana
  2. User menu → Profile → Change Password

TLS (Optional)

For HTTPS, configure reverse proxy (nginx/Apache) in front of Grafana.

Performance Regression Detection

# Compare against the stored baseline
make detect-regressions

# Analyze current performance
go test -bench=. -benchmem ./tests/benchmarks/... | tee current.json

# Track performance over time
./scripts/track_performance.sh

Troubleshooting

Development Issues

No metrics in Grafana?

# Check services
docker ps --filter "name=ml-"

# Check monitoring services
curl http://localhost:3000/api/health
curl http://localhost:9090/api/v1/query?query=up

Workflow failing?

  • Verify GitHub secret configuration
  • Check workflow logs in GitHub Actions

Profiling Issues:

# make target failing with flag errors? Run the profiling test directly
go test ./tests/load -run TestLoadProfile_Medium -count=1 -cpuprofile cpu_load.out -v -args -profile-norate

# Redis not available? Start it
docker run -d -p 6379:6379 redis:alpine

# Port conflicts
lsof -i :3000  # Grafana
lsof -i :8080  # pprof web UI
lsof -i :6379  # Redis

Production Issues

Grafana shows no data:

# Check if Prometheus is reachable
curl http://localhost:9090/-/healthy

# Check datasource in Grafana
# Settings → Data Sources → Prometheus → Save & Test

Loki not receiving logs:

# Check Promtail is running
sudo systemctl status promtail

# Verify log file exists
ls -l /var/log/fetch_ml/

# Check Promtail can reach Loki
curl http://localhost:3100/ready

Podman containers not starting:

# Check pod status
sudo -u ml-user podman pod ps
sudo -u ml-user podman ps -a

# Remove and recreate
sudo -u ml-user podman pod stop monitoring
sudo -u ml-user podman pod rm monitoring
sudo systemctl restart prometheus

Backup and Recovery

Backup Procedures

# Development backup
docker run --rm -v prometheus_data:/data -v $(pwd):/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data .
docker run --rm -v grafana_data:/data -v $(pwd):/backup alpine tar czf /backup/grafana-backup.tar.gz -C /data .

# Production backup
sudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana
sudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus

Configuration Backup

# Backup configurations
tar czf monitoring-config-backup.tar.gz monitoring/ deployments/

Updates and Maintenance

Development Updates

# Update monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py

# Restart services
make dev-down && make dev-up

Production Updates

# Pull latest images
sudo -u ml-user podman pull docker.io/grafana/grafana:latest
sudo -u ml-user podman pull docker.io/prom/prometheus:latest
sudo -u ml-user podman pull docker.io/grafana/loki:latest
sudo -u ml-user podman pull docker.io/grafana/promtail:latest

# Restart services to use new images
sudo systemctl restart grafana prometheus loki promtail

Regular Maintenance

Weekly:

  • Check Grafana dashboards for anomalies
  • Review log files for errors
  • Verify backup procedures

Monthly:

  • Update Docker/Podman images
  • Clean up old data volumes
  • Review and rotate secrets

Metrics Reference

Worker Metrics

Task Processing:

  • fetchml_tasks_processed_total - Total tasks processed successfully
  • fetchml_tasks_failed_total - Total tasks failed
  • fetchml_tasks_active - Currently active tasks
  • fetchml_tasks_queued - Current queue depth

Data Transfer:

  • fetchml_data_transferred_bytes_total - Total bytes transferred
  • fetchml_data_fetch_time_seconds_total - Total time fetching datasets
  • fetchml_execution_time_seconds_total - Total task execution time

Prewarming:

  • fetchml_prewarm_env_hit_total - Environment prewarm hits (warm image existed)
  • fetchml_prewarm_env_miss_total - Environment prewarm misses (warm image not found)
  • fetchml_prewarm_env_built_total - Environment images built for prewarming
  • fetchml_prewarm_env_time_seconds_total - Total time building prewarm images
  • fetchml_prewarm_snapshot_hit_total - Snapshot prewarm hits (found in .prewarm/)
  • fetchml_prewarm_snapshot_miss_total - Snapshot prewarm misses (not in .prewarm/)
  • fetchml_prewarm_snapshot_built_total - Snapshots prewarmed into .prewarm/
  • fetchml_prewarm_snapshot_time_seconds_total - Total time prewarming snapshots

Resources:

  • fetchml_resources_cpu_total - Total CPU tokens
  • fetchml_resources_cpu_free - Free CPU tokens
  • fetchml_resources_gpu_slots_total - Total GPU slots per index
  • fetchml_resources_gpu_slots_free - Free GPU slots per index

API Server Metrics

HTTP:

  • fetchml_http_requests_total - Total HTTP requests
  • fetchml_http_duration_seconds - HTTP request duration

WebSocket:

  • fetchml_websocket_connections - Active WebSocket connections
  • fetchml_websocket_messages_total - Total WebSocket messages
  • fetchml_websocket_duration_seconds - Message processing duration
  • fetchml_websocket_errors_total - WebSocket errors

Jupyter:

  • fetchml_jupyter_services - Jupyter services count
  • fetchml_jupyter_operations_total - Jupyter operations
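Useful derived queries over these series (assuming fetchml_http_duration_seconds is exported as a histogram, which is not confirmed here):

```promql
# p95 HTTP latency
histogram_quantile(0.95, sum by (le) (rate(fetchml_http_duration_seconds_bucket[5m])))

# WebSocket error ratio
rate(fetchml_websocket_errors_total[5m])
  / clamp_min(rate(fetchml_websocket_messages_total[5m]), 1e-9)
```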

Worker Configuration: Prewarming

Prewarm Flag

Enable Phase 1 prewarming in worker configuration:

# worker-config.yaml
prewarm_enabled: true  # Default: false (opt-in)

Behavior:

  • When false: No prewarming loops run
  • When true: Worker stages next snapshot and fetches datasets when idle

What gets prewarmed:

  1. Snapshots: Copied to .prewarm/snapshots/<taskID>/
  2. Datasets: Fetched to .prewarm/datasets/ (if auto_fetch_data: true)
  3. Environment images: Warmed in envpool cache (if deps manifest exists)

Execution path:

  • During task execution, StageSnapshotFromPath checks .prewarm/snapshots/<taskID>/
  • If found: Hit - Renames prewarmed directory into job (fast)
  • If not found: Miss - Copies from snapshot store (slower)

Metrics impact:

  • Prewarm hits reduce task startup latency
  • Metrics track hit/miss ratios and prewarm timing
  • Use fetchml_prewarm_snapshot_* metrics to monitor effectiveness

Grafana Dashboards

Prewarm Performance Dashboard:

  1. Import monitoring/grafana/dashboards/prewarm-performance.txt into Grafana
  2. Shows hit rates, build times, and efficiency metrics
  3. Use for monitoring prewarm effectiveness

Worker Resources Dashboard:

  • Added prewarm panels to existing worker-resources dashboard
  • Environment and snapshot hit rate percentages
  • Prewarm hits vs misses graphs
  • Build time and build count metrics

Prometheus Queries

Hit Rate Calculations:

# Environment prewarm hit rate
100 * (fetchml_prewarm_env_hit_total / clamp_min(fetchml_prewarm_env_hit_total + fetchml_prewarm_env_miss_total, 1))

# Snapshot prewarm hit rate
100 * (fetchml_prewarm_snapshot_hit_total / clamp_min(fetchml_prewarm_snapshot_hit_total + fetchml_prewarm_snapshot_miss_total, 1))

Rate-based Monitoring:

# Prewarm activity rate
rate(fetchml_prewarm_env_hit_total[5m])
rate(fetchml_prewarm_snapshot_hit_total[5m])

# Build time rate
rate(fetchml_prewarm_env_time_seconds_total[5m])
rate(fetchml_prewarm_snapshot_time_seconds_total[5m])

Advanced Usage

Custom Dashboards

  1. Access Grafana: http://localhost:3000
  2. Create Dashboard: + → Dashboard
  3. Add Panels: Use Prometheus queries
  4. Save Dashboard: Export JSON for sharing

Alerting

Set up Grafana alerts for:

  • High error rates (> 5%)
  • Slow response times (> 1s)
  • Service downtime
  • Resource exhaustion
  • Low prewarm hit rates (< 50%)
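The last bullet can be codified by reusing the hit-rate expression from the Prometheus Queries section; a sketch as a Prometheus alerting rule (file location depends on your setup, and Grafana alerts can use the same expression):

```yaml
groups:
  - name: prewarm
    rules:
      - alert: LowPrewarmHitRate
        expr: |
          100 * (fetchml_prewarm_snapshot_hit_total
            / clamp_min(fetchml_prewarm_snapshot_hit_total
                        + fetchml_prewarm_snapshot_miss_total, 1)) < 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Snapshot prewarm hit rate below 50%"
```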

Custom Metrics

Add custom metrics to your Go code:

import "github.com/prometheus/client_golang/prometheus"

var (
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "HTTP request duration in seconds",
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    // Register with the default registry so promhttp serves it
    prometheus.MustRegister(requestDuration)
}

// Record metrics
requestDuration.WithLabelValues("GET", "/api/v1/jobs").Observe(duration)

See Also