Performance Monitoring
Comprehensive performance monitoring system for Fetch ML with CI/CD integration, profiling, and production deployment.
Quick Start
5-Minute Setup
# Start monitoring stack
make dev-up
# Run benchmarks
make benchmark
# View results in Grafana
open http://localhost:3000
Basic Profiling
# CPU profiling
make profile-load-norate
# View interactive profile
go tool pprof -http=:8080 cpu_load.out
Architecture
Development: Docker Compose with integrated monitoring
Production: Podman + systemd (Linux)
CI/CD: GitHub Actions → Prometheus Pushgateway → Grafana
GitHub Actions → Benchmark Tests → Prometheus Pushgateway → Prometheus → Grafana Dashboard
Components
1. Development Monitoring (Docker Compose)
Services:
- Grafana: http://localhost:3000 (admin/admin123)
- Prometheus: http://localhost:9090
- Loki: http://localhost:3100
- Promtail: Log aggregation
Configuration:
# Start dev stack with monitoring
make dev-up
# Verify services
curl -f http://localhost:3000/api/health
curl -f http://localhost:9090/api/v1/query?query=up
curl -f http://localhost:3100/ready
2. Production Monitoring (Podman + systemd)
Architecture:
- Each service runs as separate Podman container
- Managed by systemd for automatic restarts
- Proper lifecycle management
Setup:
# Set up monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py
# Set up systemd services for production monitoring
# See systemd unit files in deployments/systemd/
sudo systemctl start prometheus loki promtail grafana
sudo systemctl enable prometheus loki promtail grafana
Access:
- URL: http://YOUR_SERVER_IP:3000
- Username: admin
- Password: admin (change on first login)
3. CI/CD Integration
GitHub Actions Workflow:
- Triggers: Push to main/develop, PRs, daily schedule, manual
- Function: Runs benchmarks and pushes metrics to Prometheus
Setup:
# Add GitHub secret
PROMETHEUS_PUSHGATEWAY_URL=http://your-pushgateway:9091
Performance Testing
Benchmarks
# Run benchmarks locally
make benchmark
# Or run with detailed output
go test -bench=. -benchmem ./tests/benchmarks/...
# Run specific benchmark
go test -bench=BenchmarkName -benchmem ./tests/benchmarks/...
# Run with race detection
go test -race -bench=. ./tests/benchmarks/...
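These make targets run standard `testing.B` benchmarks. As a self-contained illustration (the function and workload here are hypothetical, not taken from tests/benchmarks), such a benchmark and its per-op counters look like:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// Hypothetical benchmark in the style of the suite: measures joining path
// segments. b.ReportAllocs enables the allocs/op column in -benchmem output.
func BenchmarkJoin(b *testing.B) {
	parts := []string{"api", "v1", "jobs"}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = strings.Join(parts, "/")
	}
}

func main() {
	// testing.Benchmark runs a benchmark outside `go test`, so this file
	// can be executed directly with `go run`.
	r := testing.Benchmark(BenchmarkJoin)
	fmt.Printf("%d ns/op, %d allocs/op\n", r.NsPerOp(), r.AllocsPerOp())
}
```

The same ns/op, B/op, and allocs/op numbers are what the CI workflow pushes to the Pushgateway as `benchmark_*` metrics.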
Load Testing
# Run load test suite
make load-test
# Start Redis for load tests
docker run -d -p 6379:6379 redis:alpine
CPU Profiling
HTTP Load Test Profiling
# CPU profile MediumLoad HTTP test (no rate limiting - recommended)
make profile-load-norate
# CPU profile MediumLoad HTTP test (with rate limiting)
make profile-load
Analyze Results:
# View interactive profile (web UI)
go tool pprof -http=:8081 cpu_load.out
# View interactive profile (terminal)
go tool pprof cpu_load.out
# Generate flame graph (requires Brendan Gregg's FlameGraph scripts on PATH)
go tool pprof -raw cpu_load.out | stackcollapse-go.pl | flamegraph.pl > cpu_flame.svg
# View top functions
go tool pprof -top cpu_load.out
WebSocket Queue Profiling
# CPU profile WebSocket → Redis queue → worker path
make profile-ws-queue
# View interactive profile
go tool pprof -http=:8082 cpu_ws.out
Profiling Tips
- Use profile-load-norate for cleaner CPU profiles (no rate limiting delays)
- Profiles run for 60 seconds by default
- Requires Redis running on localhost:6379
- Results show throughput, latency, and error rate metrics
Grafana Dashboards
Development Dashboards
Access: http://localhost:3000 (admin/admin123)
Available Dashboards:
- Load Test Performance: Request metrics, response times, error rates
- System Health: Service status, resource usage, memory, CPU
- Log Analysis: Error logs, service logs, log aggregation
Production Dashboards
Auto-loaded Dashboards:
- ML Task Queue Monitoring (metrics)
- Application Logs (Loki logs)
Key Metrics
- benchmark_time_per_op - Execution time
- benchmark_memory_per_op - Memory usage
- benchmark_allocs_per_op - Allocation count
- HTTP request rates and response times
- Error rates and system health metrics
Worker Resource Metrics
The worker exposes a Prometheus endpoint (default :9100/metrics) which includes ResourceManager and task execution metrics.
Resource availability:
- fetchml_resources_cpu_total - Total CPU tokens managed by the worker.
- fetchml_resources_cpu_free - Currently free CPU tokens.
- fetchml_resources_gpu_slots_total{gpu_index="N"} - Total GPU slots per GPU index.
- fetchml_resources_gpu_slots_free{gpu_index="N"} - Free GPU slots per GPU index.
Acquisition pressure:
- fetchml_resources_acquire_total - Total resource acquisition attempts.
- fetchml_resources_acquire_wait_total - Number of acquisitions that had to wait.
- fetchml_resources_acquire_timeout_total - Number of acquisitions that timed out.
- fetchml_resources_acquire_wait_seconds_total - Total time spent waiting for resources.
Why these help:
- Debug why runs slow down under load (wait time increases).
- Confirm GPU slot sharing is working (free slots fluctuate as expected).
- Detect saturation and timeouts before tasks start failing.
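To make the acquire/wait counters concrete, here is a toy sketch (not the real ResourceManager, whose internals are not shown in this document) of a CPU-token pool whose fast and slow acquisition paths correspond to the counters above:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Illustrative token semaphore; field comments name the metric each
// counter would feed in a real exporter.
type cpuPool struct {
	tokens    chan struct{}
	acquires  atomic.Int64 // fetchml_resources_acquire_total
	waits     atomic.Int64 // fetchml_resources_acquire_wait_total
	waitNanos atomic.Int64 // feeds fetchml_resources_acquire_wait_seconds_total
}

func newCPUPool(n int) *cpuPool {
	p := &cpuPool{tokens: make(chan struct{}, n)}
	for i := 0; i < n; i++ {
		p.tokens <- struct{}{}
	}
	return p
}

func (p *cpuPool) acquire() {
	p.acquires.Add(1)
	select {
	case <-p.tokens:
		// Fast path: a token was free, no wait recorded.
	default:
		// Slow path: block until another task releases a token.
		start := time.Now()
		p.waits.Add(1)
		<-p.tokens
		p.waitNanos.Add(int64(time.Since(start)))
	}
}

func (p *cpuPool) release() { p.tokens <- struct{}{} }

func main() {
	p := newCPUPool(1)
	p.acquire() // token free: fast path
	done := make(chan struct{})
	go func() {
		p.acquire() // pool exhausted: must wait for the release below
		close(done)
	}()
	for p.waits.Load() == 0 {
		time.Sleep(time.Millisecond) // wait until the goroutine is blocked
	}
	p.release()
	<-done
	// Rising waits (and wait seconds) under load is exactly the saturation
	// signal these metrics surface.
	fmt.Printf("acquires=%d waits=%d\n", p.acquires.Load(), p.waits.Load())
}
```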
Prometheus Scrape Example (Worker)
If you run the worker locally on your machine (default metrics port :9100) and Prometheus runs in Docker Compose, use host.docker.internal:
scrape_configs:
  - job_name: 'worker'
    metrics_path: /metrics
    static_configs:
      - targets: ['host.docker.internal:9100']  # worker running on the host
      # - targets: ['worker:9100']              # worker running as a Compose service instead
Production Deployment
Prerequisites
- Linux distribution with systemd (Rocky/RHEL/CentOS, Ubuntu/Debian, Arch, SUSE)
- Production app already deployed
- Root or sudo access
- Ports 3000, 9090, 3100 available
Service Configuration
Prometheus:
- Port: 9090
- Config: /etc/fetch_ml/monitoring/prometheus/prometheus.yml
- Data: /data/monitoring/prometheus
- Purpose: Scrapes metrics from API server
Loki:
- Port: 3100
- Config: /etc/fetch_ml/monitoring/loki-config.yml
- Data: /data/monitoring/loki
- Purpose: Log aggregation
Promtail:
- Config: /etc/fetch_ml/monitoring/promtail-config.yml
- Log Source: /var/log/fetch_ml/*.log
- Purpose: Ships logs to Loki
Grafana:
- Port: 3000
- Config: /etc/fetch_ml/monitoring/grafana/provisioning
- Data: /data/monitoring/grafana
- Dashboards: /var/lib/grafana/dashboards
Management Commands
# Check status
sudo systemctl status prometheus grafana loki promtail
# View logs
sudo journalctl -u prometheus -f
sudo journalctl -u grafana -f
sudo journalctl -u loki -f
sudo journalctl -u promtail -f
# Restart services
sudo systemctl restart prometheus
sudo systemctl restart grafana
# Stop all monitoring
sudo systemctl stop prometheus grafana loki promtail
Data Retention
Prometheus
Default: 15 days. Prometheus configures retention with a command-line flag rather than in prometheus.yml; add the flag to the Prometheus service unit (or container arguments):
--storage.tsdb.retention.time=30d
Loki
Default: 30 days. Edit /etc/fetch_ml/monitoring/loki-config.yml:
limits_config:
  retention_period: 30d
Security
Firewall
RHEL/Rocky/Fedora (firewalld):
# Remove public access
sudo firewall-cmd --permanent --remove-port=3000/tcp
sudo firewall-cmd --permanent --remove-port=9090/tcp
# Add specific source
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port port="3000" protocol="tcp" accept'
sudo firewall-cmd --reload
Ubuntu/Debian (ufw):
# Remove public access
sudo ufw delete allow 3000/tcp
sudo ufw delete allow 9090/tcp
# Add specific source
sudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp
Authentication
Change Grafana admin password:
- Login to Grafana
- User menu → Profile → Change Password
TLS (Optional)
For HTTPS, configure reverse proxy (nginx/Apache) in front of Grafana.
Performance Regression Detection
# Detect regressions against the stored baseline
make detect-regressions
# Analyze current performance
go test -bench=. -benchmem ./tests/benchmarks/... | tee current.json
# Track performance over time
./scripts/track_performance.sh
Troubleshooting
Development Issues
No metrics in Grafana?
# Check services
docker ps --filter "name=ml-"
# Check monitoring services
curl http://localhost:3000/api/health
curl http://localhost:9090/api/v1/query?query=up
Workflow failing?
- Verify GitHub secret configuration
- Check workflow logs in GitHub Actions
Profiling Issues:
# Run the profiling test directly if the make target fails (test flags go after -args)
go test ./tests/load -run TestLoadProfile_Medium -count=1 -cpuprofile cpu_load.out -v -args -profile-norate
# Start Redis if it is not running
docker run -d -p 6379:6379 redis:alpine
# Port conflicts
lsof -i :3000 # Grafana
lsof -i :8080 # pprof web UI
lsof -i :6379 # Redis
Production Issues
Grafana shows no data:
# Check if Prometheus is reachable
curl http://localhost:9090/-/healthy
# Check datasource in Grafana
# Settings → Data Sources → Prometheus → Save & Test
Loki not receiving logs:
# Check Promtail is running
sudo systemctl status promtail
# Verify log file exists
ls -l /var/log/fetch_ml/
# Check Promtail can reach Loki
curl http://localhost:3100/ready
Podman containers not starting:
# Check pod status
sudo -u ml-user podman pod ps
sudo -u ml-user podman ps -a
# Remove and recreate
sudo -u ml-user podman pod stop monitoring
sudo -u ml-user podman pod rm monitoring
sudo systemctl restart prometheus
Backup and Recovery
Backup Procedures
# Development backup
docker run --rm -v prometheus_data:/data -v $(pwd):/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data .
docker run --rm -v grafana_data:/data -v $(pwd):/backup alpine tar czf /backup/grafana-backup.tar.gz -C /data .
# Production backup
sudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana
sudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus
Configuration Backup
# Backup configurations
tar czf monitoring-config-backup.tar.gz monitoring/ deployments/
Updates and Maintenance
Development Updates
# Update monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py
# Restart services
make dev-down && make dev-up
Production Updates
# Pull latest images
sudo -u ml-user podman pull docker.io/grafana/grafana:latest
sudo -u ml-user podman pull docker.io/prom/prometheus:latest
sudo -u ml-user podman pull docker.io/grafana/loki:latest
sudo -u ml-user podman pull docker.io/grafana/promtail:latest
# Restart services to use new images
sudo systemctl restart grafana prometheus loki promtail
Regular Maintenance
Weekly:
- Check Grafana dashboards for anomalies
- Review log files for errors
- Verify backup procedures
Monthly:
- Update Docker/Podman images
- Clean up old data volumes
- Review and rotate secrets
Metrics Reference
Worker Metrics
Task Processing:
- fetchml_tasks_processed_total - Total tasks processed successfully
- fetchml_tasks_failed_total - Total tasks failed
- fetchml_tasks_active - Currently active tasks
- fetchml_tasks_queued - Current queue depth
Data Transfer:
- fetchml_data_transferred_bytes_total - Total bytes transferred
- fetchml_data_fetch_time_seconds_total - Total time fetching datasets
- fetchml_execution_time_seconds_total - Total task execution time
Prewarming:
- fetchml_prewarm_env_hit_total - Environment prewarm hits (warm image existed)
- fetchml_prewarm_env_miss_total - Environment prewarm misses (warm image not found)
- fetchml_prewarm_env_built_total - Environment images built for prewarming
- fetchml_prewarm_env_time_seconds_total - Total time building prewarm images
- fetchml_prewarm_snapshot_hit_total - Snapshot prewarm hits (found in .prewarm/)
- fetchml_prewarm_snapshot_miss_total - Snapshot prewarm misses (not in .prewarm/)
- fetchml_prewarm_snapshot_built_total - Snapshots prewarmed into .prewarm/
- fetchml_prewarm_snapshot_time_seconds_total - Total time prewarming snapshots
Resources:
- fetchml_resources_cpu_total - Total CPU tokens
- fetchml_resources_cpu_free - Free CPU tokens
- fetchml_resources_gpu_slots_total - Total GPU slots per index
- fetchml_resources_gpu_slots_free - Free GPU slots per index
API Server Metrics
HTTP:
- fetchml_http_requests_total - Total HTTP requests
- fetchml_http_duration_seconds - HTTP request duration
WebSocket:
- fetchml_websocket_connections - Active WebSocket connections
- fetchml_websocket_messages_total - Total WebSocket messages
- fetchml_websocket_duration_seconds - Message processing duration
- fetchml_websocket_errors_total - WebSocket errors
Jupyter:
- fetchml_jupyter_services - Jupyter services count
- fetchml_jupyter_operations_total - Jupyter operations
Worker Configuration: Prewarming
Prewarm Flag
Enable prewarming in worker configuration:
# worker-config.yaml
prewarm_enabled: true # Default: false (opt-in)
Behavior:
- When false: No prewarming loops run
- When true: Worker stages the next snapshot and fetches datasets when idle
What gets prewarmed:
- Snapshots: Copied to .prewarm/snapshots/<taskID>/
- Datasets: Fetched to .prewarm/datasets/ (if auto_fetch_data: true)
- Environment images: Warmed in envpool cache (if deps manifest exists)
Execution path:
- During task execution, StageSnapshotFromPath checks .prewarm/snapshots/<taskID>/
- If found: Hit - renames the prewarmed directory into the job (fast)
- If not found: Miss - copies from the snapshot store (slower)
Metrics impact:
- Prewarm hits reduce task startup latency
- Metrics track hit/miss ratios and prewarm timing
- Use fetchml_prewarm_snapshot_* metrics to monitor effectiveness
Grafana Dashboards
Prewarm Performance Dashboard:
- Import monitoring/grafana/dashboards/prewarm-performance.txt into Grafana
- Shows hit rates, build times, and efficiency metrics
- Use for monitoring prewarm effectiveness
Worker Resources Dashboard:
- Added prewarm panels to existing worker-resources dashboard
- Environment and snapshot hit rate percentages
- Prewarm hits vs misses graphs
- Build time and build count metrics
Prometheus Queries
Hit Rate Calculations:
# Environment prewarm hit rate
100 * (fetchml_prewarm_env_hit_total / clamp_min(fetchml_prewarm_env_hit_total + fetchml_prewarm_env_miss_total, 1))
# Snapshot prewarm hit rate
100 * (fetchml_prewarm_snapshot_hit_total / clamp_min(fetchml_prewarm_snapshot_hit_total + fetchml_prewarm_snapshot_miss_total, 1))
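The clamp_min guard keeps an idle worker (zero hits and misses) reporting 0% instead of dividing by zero. The same formula, mirrored in Go for post-processing scraped values (helper name is ours, for illustration only):

```go
package main

import (
	"fmt"
	"math"
)

// hitRate mirrors the PromQL above: clamp the denominator to at least 1
// so zero activity yields 0% rather than a division by zero.
func hitRate(hits, misses float64) float64 {
	return 100 * hits / math.Max(hits+misses, 1)
}

func main() {
	fmt.Println(hitRate(3, 1)) // 75
	fmt.Println(hitRate(0, 0)) // 0 (clamped denominator)
}
```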
Rate-based Monitoring:
# Prewarm activity rate
rate(fetchml_prewarm_env_hit_total[5m])
rate(fetchml_prewarm_snapshot_hit_total[5m])
# Build time rate
rate(fetchml_prewarm_env_time_seconds_total[5m])
rate(fetchml_prewarm_snapshot_time_seconds_total[5m])
Advanced Usage
Custom Dashboards
- Access Grafana: http://localhost:3000
- Create Dashboard: + → Dashboard
- Add Panels: Use Prometheus queries
- Save Dashboard: Export JSON for sharing
Alerting
Set up Grafana alerts for:
- High error rates (> 5%)
- Slow response times (> 1s)
- Service downtime
- Resource exhaustion
- Low prewarm hit rates (< 50%)
Custom Metrics
Add custom metrics to your Go code:
import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // promauto registers the metric with the default registry at creation,
    // so the /metrics handler exports it automatically.
    requestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "HTTP request duration in seconds",
        },
        []string{"method", "endpoint"},
    )
)

// Record metrics (duration is in seconds)
requestDuration.WithLabelValues("GET", "/api/v1/jobs").Observe(duration)
See Also
- Testing Guide - Testing with monitoring
- Deployment Guide - Deployment procedures
- Architecture Guide - System architecture
- Troubleshooting - Common issues