fetch_ml/docs/src/performance-monitoring.md

Performance Monitoring

Comprehensive performance monitoring system for Fetch ML with CI/CD integration, profiling, and production deployment.

Quick Start

5-Minute Setup

# Start monitoring stack
make dev-up

# Run benchmarks
make benchmark

# View results in Grafana
open http://localhost:3000

Basic Profiling

# CPU profiling
make profile-load-norate

# View interactive profile
go tool pprof -http=:8080 cpu_load.out

Architecture

Development: Docker Compose with integrated monitoring
Production: Podman + systemd (Linux)
CI/CD: GitHub Actions → Prometheus Pushgateway → Grafana

GitHub Actions → Benchmark Tests → Prometheus Pushgateway → Prometheus → Grafana Dashboard

Components

1. Development Monitoring (Docker Compose)

Services:

  • Grafana (port 3000) - dashboards
  • Prometheus (port 9090) - metrics
  • Loki (port 3100) - log aggregation
  • Promtail - log shipping

Configuration:

# Start dev stack with monitoring
make dev-up

# Verify services
curl -f http://localhost:3000/api/health
curl -f http://localhost:9090/api/v1/query?query=up
curl -f http://localhost:3100/ready

2. Production Monitoring (Podman + systemd)

Architecture:

  • Each service runs as separate Podman container
  • Managed by systemd for automatic restarts
  • Proper lifecycle management

Setup:

# Run production setup script
sudo ./scripts/setup-monitoring-prod.sh /data/monitoring ml-user ml-group

# Start services
sudo systemctl start prometheus loki promtail grafana
sudo systemctl enable prometheus loki promtail grafana

Access:

  • URL: http://YOUR_SERVER_IP:3000
  • Username: admin
  • Password: admin (change on first login)

3. CI/CD Integration

GitHub Actions Workflow:

  • Triggers: Push to main/develop, PRs, daily schedule, manual
  • Function: Runs benchmarks and pushes metrics to Prometheus

Setup:

# Add GitHub secret
PROMETHEUS_PUSHGATEWAY_URL=http://your-pushgateway:9091

Performance Testing

Benchmarks

# Run benchmarks locally
make benchmark

# Or run with detailed output
go test -bench=. -benchmem ./tests/benchmarks/...

# Run specific benchmark
go test -bench=BenchmarkName -benchmem ./tests/benchmarks/...

# Run with race detection
go test -race -bench=. ./tests/benchmarks/...

Load Testing

# Run load test suite
make load-test

# Start Redis for load tests
docker run -d -p 6379:6379 redis:alpine

CPU Profiling

HTTP Load Test Profiling

# CPU profile MediumLoad HTTP test (no rate limiting - recommended)
make profile-load-norate

# CPU profile MediumLoad HTTP test (with rate limiting)
make profile-load

Analyze Results:

# View interactive profile (web UI)
go tool pprof -http=:8081 cpu_load.out

# View interactive profile (terminal)
go tool pprof cpu_load.out

# Generate flame graph (requires Brendan Gregg's FlameGraph scripts on PATH;
# the pprof -http UI also includes a built-in flame graph view)
go tool pprof -raw cpu_load.out | stackcollapse-go.pl | flamegraph.pl > cpu_flame.svg

# View top functions
go tool pprof -top cpu_load.out

WebSocket Queue Profiling

# CPU profile WebSocket → Redis queue → worker path
make profile-ws-queue

# View interactive profile
go tool pprof -http=:8082 cpu_ws.out

Profiling Tips

  • Use profile-load-norate for cleaner CPU profiles (no rate limiting delays)
  • Profiles run for 60 seconds by default
  • Requires Redis running on localhost:6379
  • Results show throughput, latency, and error rate metrics

Grafana Dashboards

Development Dashboards

Access: http://localhost:3000 (admin/admin123)

Available Dashboards:

  1. Load Test Performance: Request metrics, response times, error rates
  2. System Health: Service status, resource usage, memory, CPU
  3. Log Analysis: Error logs, service logs, log aggregation

Production Dashboards

Auto-loaded Dashboards:

  • ML Task Queue Monitoring (metrics)
  • Application Logs (Loki logs)

Key Metrics

  • benchmark_time_per_op - Execution time
  • benchmark_memory_per_op - Memory usage
  • benchmark_allocs_per_op - Allocation count
  • HTTP request rates and response times
  • Error rates and system health metrics

Worker Resource Metrics

The worker exposes a Prometheus endpoint (default :9100/metrics) which includes ResourceManager and task execution metrics.

Resource availability:

  • fetchml_resources_cpu_total - Total CPU tokens managed by the worker.
  • fetchml_resources_cpu_free - Currently free CPU tokens.
  • fetchml_resources_gpu_slots_total{gpu_index="N"} - Total GPU slots per GPU index.
  • fetchml_resources_gpu_slots_free{gpu_index="N"} - Free GPU slots per GPU index.

Acquisition pressure:

  • fetchml_resources_acquire_total - Total resource acquisition attempts.
  • fetchml_resources_acquire_wait_total - Number of acquisitions that had to wait.
  • fetchml_resources_acquire_timeout_total - Number of acquisitions that timed out.
  • fetchml_resources_acquire_wait_seconds_total - Total time spent waiting for resources.

Why these help:

  • Debug why runs slow down under load (wait time increases).
  • Confirm GPU slot sharing is working (free slots fluctuate as expected).
  • Detect saturation and timeouts before tasks start failing.
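For example, wait pressure can be turned into two ratios using the counters above (PromQL, 5-minute windows):

```promql
# Average wait per waiting acquisition (seconds)
rate(fetchml_resources_acquire_wait_seconds_total[5m])
  / clamp_min(rate(fetchml_resources_acquire_wait_total[5m]), 1e-9)

# Fraction of acquisitions that had to wait
rate(fetchml_resources_acquire_wait_total[5m])
  / clamp_min(rate(fetchml_resources_acquire_total[5m]), 1e-9)
```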

Prometheus Scrape Example (Worker)

If you run the worker locally on your machine (default metrics port :9100) and Prometheus runs in Docker Compose, scrape it via host.docker.internal; if the worker instead runs inside the same Compose network, use its service name:

scrape_configs:
  - job_name: 'worker'
    metrics_path: /metrics
    static_configs:
      - targets: ['host.docker.internal:9100']  # worker on the host
      # - targets: ['worker:9100']              # worker inside Compose

Production Deployment

Prerequisites

  • Linux distribution with systemd (Rocky/RHEL/CentOS, Ubuntu/Debian, Arch, SUSE)
  • Production app already deployed
  • Root or sudo access
  • Ports 3000, 9090, 3100 available

Service Configuration

Prometheus:

  • Port: 9090
  • Config: /etc/fetch_ml/monitoring/prometheus/prometheus.yml
  • Data: /data/monitoring/prometheus
  • Purpose: Scrapes metrics from API server

Loki:

  • Port: 3100
  • Config: /etc/fetch_ml/monitoring/loki-config.yml
  • Data: /data/monitoring/loki
  • Purpose: Log aggregation

Promtail:

  • Config: /etc/fetch_ml/monitoring/promtail-config.yml
  • Log Source: /var/log/fetch_ml/*.log
  • Purpose: Ships logs to Loki

Grafana:

  • Port: 3000
  • Config: /etc/fetch_ml/monitoring/grafana/provisioning
  • Data: /data/monitoring/grafana
  • Dashboards: /var/lib/grafana/dashboards

Management Commands

# Check status
sudo systemctl status prometheus grafana loki promtail

# View logs
sudo journalctl -u prometheus -f
sudo journalctl -u grafana -f
sudo journalctl -u loki -f
sudo journalctl -u promtail -f

# Restart services
sudo systemctl restart prometheus
sudo systemctl restart grafana

# Stop all monitoring
sudo systemctl stop prometheus grafana loki promtail

Data Retention

Prometheus

Default: 15 days. Prometheus takes retention as a command-line flag, not a prometheus.yml setting; add it to the container arguments in the systemd unit (or the setup script):

--storage.tsdb.retention.time=30d

Loki

Default: 30 days. Edit /etc/fetch_ml/monitoring/loki-config.yml (note that Loki only enforces retention when the compactor runs with retention_enabled: true):

limits_config:
  retention_period: 30d

Security

Firewall

RHEL/Rocky/Fedora (firewalld):

# Remove public access
sudo firewall-cmd --permanent --remove-port=3000/tcp
sudo firewall-cmd --permanent --remove-port=9090/tcp

# Add specific source
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port port="3000" protocol="tcp" accept'
sudo firewall-cmd --reload

Ubuntu/Debian (ufw):

# Remove public access
sudo ufw delete allow 3000/tcp
sudo ufw delete allow 9090/tcp

# Add specific source
sudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp

Authentication

Change Grafana admin password:

  1. Login to Grafana
  2. User menu → Profile → Change Password

TLS (Optional)

For HTTPS, configure reverse proxy (nginx/Apache) in front of Grafana.

Performance Regression Detection

# Compare against the stored baseline
make detect-regressions

# Analyze current performance
go test -bench=. -benchmem ./tests/benchmarks/... | tee current.json

# Track performance over time
./scripts/track_performance.sh

Troubleshooting

Development Issues

No metrics in Grafana?

# Check services
docker ps --filter "name=ml-"

# Check monitoring services
curl http://localhost:3000/api/health
curl http://localhost:9090/api/v1/query?query=up

Workflow failing?

  • Verify GitHub secret configuration
  • Check workflow logs in GitHub Actions

Profiling Issues:

# make target failing with flag errors? Run the profiling test directly
go test ./tests/load -run TestLoadProfile_Medium -count=1 -cpuprofile cpu_load.out -v -args -profile-norate

# Redis not available? Start it
docker run -d -p 6379:6379 redis:alpine

# Port conflicts
lsof -i :3000  # Grafana
lsof -i :8080  # pprof web UI
lsof -i :6379  # Redis

Production Issues

Grafana shows no data:

# Check if Prometheus is reachable
curl http://localhost:9090/-/healthy

# Check datasource in Grafana
# Settings → Data Sources → Prometheus → Save & Test

Loki not receiving logs:

# Check Promtail is running
sudo systemctl status promtail

# Verify log file exists
ls -l /var/log/fetch_ml/

# Check Promtail can reach Loki
curl http://localhost:3100/ready

Podman containers not starting:

# Check pod status
sudo -u ml-user podman pod ps
sudo -u ml-user podman ps -a

# Remove and recreate
sudo -u ml-user podman pod stop monitoring
sudo -u ml-user podman pod rm monitoring
sudo systemctl restart prometheus

Backup and Recovery

Backup Procedures

# Development backup
docker run --rm -v prometheus_data:/data -v $(pwd):/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data .
docker run --rm -v grafana_data:/data -v $(pwd):/backup alpine tar czf /backup/grafana-backup.tar.gz -C /data .

# Production backup
sudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana
sudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus

Configuration Backup

# Backup configurations
tar czf monitoring-config-backup.tar.gz monitoring/ deployments/

Updates and Maintenance

Development Updates

# Update monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py

# Restart services
make dev-down && make dev-up

Production Updates

# Pull latest images
sudo -u ml-user podman pull docker.io/grafana/grafana:latest
sudo -u ml-user podman pull docker.io/prom/prometheus:latest
sudo -u ml-user podman pull docker.io/grafana/loki:latest
sudo -u ml-user podman pull docker.io/grafana/promtail:latest

# Restart services to use new images
sudo systemctl restart grafana prometheus loki promtail

Regular Maintenance

Weekly:

  • Check Grafana dashboards for anomalies
  • Review log files for errors
  • Verify backup procedures

Monthly:

  • Update Docker/Podman images
  • Clean up old data volumes
  • Review and rotate secrets

Metrics Reference

Worker Metrics

Task Processing:

  • fetchml_tasks_processed_total - Total tasks processed successfully
  • fetchml_tasks_failed_total - Total tasks failed
  • fetchml_tasks_active - Currently active tasks
  • fetchml_tasks_queued - Current queue depth

Data Transfer:

  • fetchml_data_transferred_bytes_total - Total bytes transferred
  • fetchml_data_fetch_time_seconds_total - Total time fetching datasets
  • fetchml_execution_time_seconds_total - Total task execution time

Prewarming:

  • fetchml_prewarm_env_hit_total - Environment prewarm hits (warm image existed)
  • fetchml_prewarm_env_miss_total - Environment prewarm misses (warm image not found)
  • fetchml_prewarm_env_built_total - Environment images built for prewarming
  • fetchml_prewarm_env_time_seconds_total - Total time building prewarm images
  • fetchml_prewarm_snapshot_hit_total - Snapshot prewarm hits (found in .prewarm/)
  • fetchml_prewarm_snapshot_miss_total - Snapshot prewarm misses (not in .prewarm/)
  • fetchml_prewarm_snapshot_built_total - Snapshots prewarmed into .prewarm/
  • fetchml_prewarm_snapshot_time_seconds_total - Total time prewarming snapshots

Resources:

  • fetchml_resources_cpu_total - Total CPU tokens
  • fetchml_resources_cpu_free - Free CPU tokens
  • fetchml_resources_gpu_slots_total - Total GPU slots per index
  • fetchml_resources_gpu_slots_free - Free GPU slots per index

API Server Metrics

HTTP:

  • fetchml_http_requests_total - Total HTTP requests
  • fetchml_http_duration_seconds - HTTP request duration

WebSocket:

  • fetchml_websocket_connections - Active WebSocket connections
  • fetchml_websocket_messages_total - Total WebSocket messages
  • fetchml_websocket_duration_seconds - Message processing duration
  • fetchml_websocket_errors_total - WebSocket errors

Jupyter:

  • fetchml_jupyter_services - Jupyter services count
  • fetchml_jupyter_operations_total - Jupyter operations
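Useful derived queries over these series (assuming fetchml_http_duration_seconds is exported as a histogram, which is not confirmed here):

```promql
# p95 HTTP latency
histogram_quantile(0.95, sum by (le) (rate(fetchml_http_duration_seconds_bucket[5m])))

# WebSocket error ratio
rate(fetchml_websocket_errors_total[5m])
  / clamp_min(rate(fetchml_websocket_messages_total[5m]), 1e-9)
```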

Worker Configuration: Prewarming

Prewarm Flag

Enable Phase 1 prewarming in worker configuration:

# worker-config.yaml
prewarm_enabled: true  # Default: false (opt-in)

Behavior:

  • When false: No prewarming loops run
  • When true: Worker stages next snapshot and fetches datasets when idle

What gets prewarmed:

  1. Snapshots: Copied to .prewarm/snapshots/<taskID>/
  2. Datasets: Fetched to .prewarm/datasets/ (if auto_fetch_data: true)
  3. Environment images: Warmed in envpool cache (if deps manifest exists)

Execution path:

  • During task execution, StageSnapshotFromPath checks .prewarm/snapshots/<taskID>/
  • If found: Hit - Renames prewarmed directory into job (fast)
  • If not found: Miss - Copies from snapshot store (slower)

Metrics impact:

  • Prewarm hits reduce task startup latency
  • Metrics track hit/miss ratios and prewarm timing
  • Use fetchml_prewarm_snapshot_* metrics to monitor effectiveness

Grafana Dashboards

Prewarm Performance Dashboard:

  1. Import monitoring/grafana/dashboards/prewarm-performance.txt into Grafana
  2. Shows hit rates, build times, and efficiency metrics
  3. Use for monitoring prewarm effectiveness

Worker Resources Dashboard:

  • Added prewarm panels to existing worker-resources dashboard
  • Environment and snapshot hit rate percentages
  • Prewarm hits vs misses graphs
  • Build time and build count metrics

Prometheus Queries

Hit Rate Calculations:

# Environment prewarm hit rate
100 * (fetchml_prewarm_env_hit_total / clamp_min(fetchml_prewarm_env_hit_total + fetchml_prewarm_env_miss_total, 1))

# Snapshot prewarm hit rate
100 * (fetchml_prewarm_snapshot_hit_total / clamp_min(fetchml_prewarm_snapshot_hit_total + fetchml_prewarm_snapshot_miss_total, 1))

Rate-based Monitoring:

# Prewarm activity rate
rate(fetchml_prewarm_env_hit_total[5m])
rate(fetchml_prewarm_snapshot_hit_total[5m])

# Build time rate
rate(fetchml_prewarm_env_time_seconds_total[5m])
rate(fetchml_prewarm_snapshot_time_seconds_total[5m])

Advanced Usage

Custom Dashboards

  1. Access Grafana: http://localhost:3000
  2. Create Dashboard: + → Dashboard
  3. Add Panels: Use Prometheus queries
  4. Save Dashboard: Export JSON for sharing

Alerting

Set up Grafana alerts for:

  • High error rates (> 5%)
  • Slow response times (> 1s)
  • Service downtime
  • Resource exhaustion
  • Low prewarm hit rates (< 50%)
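The last bullet can be codified by reusing the hit-rate expression from the Prometheus Queries section; a sketch as a Prometheus alerting rule (file location depends on your setup, and Grafana alerts can use the same expression):

```yaml
groups:
  - name: prewarm
    rules:
      - alert: LowPrewarmHitRate
        expr: |
          100 * (fetchml_prewarm_snapshot_hit_total
            / clamp_min(fetchml_prewarm_snapshot_hit_total
                        + fetchml_prewarm_snapshot_miss_total, 1)) < 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Snapshot prewarm hit rate below 50%"
```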

Custom Metrics

Add custom metrics to your Go code:

import "github.com/prometheus/client_golang/prometheus"

var (
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "HTTP request duration in seconds",
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    // Register with the default registry so promhttp serves it
    prometheus.MustRegister(requestDuration)
}

// Record metrics
requestDuration.WithLabelValues("GET", "/api/v1/jobs").Observe(duration)

See Also