docs(ops): consolidate deployment and performance monitoring docs for Caddy-based setup

This commit is contained in:
Jeremie Fraeys 2026-01-05 12:37:40 -05:00
parent c0eeeda940
commit 8157f73a70
5 changed files with 1234 additions and 915 deletions

## Overview
The ML Experiment Manager supports multiple deployment methods, from local development to production setups with integrated monitoring.
## TLS / WSS Policy
- The Zig CLI currently supports `ws://` only (native `wss://` is not implemented).
- For production, use a reverse proxy (Caddy) to terminate TLS/WSS (Approach A) and keep the API server on internal HTTP/WS.
- If you need remote CLI access, use one of:
- an SSH tunnel to the internal `ws://` endpoint
- a private network/VPN so `ws://` is not exposed to the public Internet
- When `server.tls.enabled: false`, the API server still runs on plain HTTP/WS internally. In development, access it via Caddy at `http://localhost:8080/health`.
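For Approach A, a minimal Caddyfile sketch (the hostname is a placeholder, and the backend port assumes the API server's default `:9100`):

```
ml.example.com {
	# Caddy obtains and renews TLS certificates automatically.
	# WebSocket upgrade requests are proxied transparently, so
	# clients connect with wss:// while the backend stays ws://.
	reverse_proxy localhost:9100
}
```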
## Data Directories
- `base_path` is where experiment directories live.
- `data_dir` is used for dataset/snapshot materialization and integrity validation.
- If you want `ml validate` to check snapshots/datasets, you must mount `data_dir` into the API server container.
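Assuming `base_path` and `data_dir` point at the paths below, the corresponding mounts might look like this compose sketch (host paths and the `api-server` service name are illustrative):

```yaml
services:
  api-server:
    volumes:
      - ./data/experiments:/data/ml-experiments   # base_path: experiment directories
      - ./data/datasets:/data/datasets            # data_dir: required for `ml validate` snapshot/dataset checks
```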
## Quick Start
### Development Deployment with Monitoring
```bash
# Clone repository
git clone https://github.com/your-org/fetch_ml.git
cd fetch_ml
# Start development stack with monitoring
make dev-up

# Alternative: use deployment script
cd deployments && make dev-up

# Check status
make dev-status

# View logs
docker-compose logs -f api-server
```
**Access Services:**
- **API Server (via Caddy)**: http://localhost:8080
- **API Server (via Caddy + internal TLS)**: https://localhost:8443
- **Grafana**: http://localhost:3000 (admin/admin123)
- **Prometheus**: http://localhost:9090
- **Loki**: http://localhost:3100
## Deployment Options
### 1. Development Environment
**Purpose**: Local development with full monitoring stack
**Container Runtimes:**
- **Docker Compose**: For testing and development only
- **Podman**: For production experiment execution
**Services**: API Server, Redis, Prometheus, Grafana, Loki, Promtail
**Configuration**:
```bash
# Using Makefile (recommended)
make dev-up
make dev-down
make dev-status

# Using deployment script
cd deployments
make dev-up
make dev-down
make dev-status
```
**Features**:
- Auto-provisioned Grafana dashboards
- Real-time metrics and logs
- Hot reload for development
- Local data persistence
### 2. Production Environment
**Purpose**: Production deployment with security
**Services**: API Server, Worker, Redis with authentication
**Configuration**:
```bash
cd deployments
make prod-up
make prod-down
make prod-status
```
**Features**:
- Secure Redis with authentication
- TLS/WSS via reverse proxy termination (Caddy)
- Production-optimized configurations
- Health checks and restart policies
### 3. Homelab Secure Environment
**Purpose**: Secure homelab deployment
**Services**: API Server, Redis, Caddy reverse proxy
**Configuration**:
```bash
cd deployments
make homelab-up
make homelab-down
make homelab-status
```
**Features**:
- Caddy reverse proxy
- TLS termination
- Network isolation
- External networks
## Environment Setup
### Development Environment
```bash
# Copy example environment
cp deployments/env.dev.example .env

# Edit as needed
vim .env
```
**Key Variables**:
- `LOG_LEVEL=info`
- `GRAFANA_ADMIN_PASSWORD=admin123`
### Production Environment
```bash
# Copy example environment
cp deployments/env.prod.example .env

# Edit with production values
vim .env
```
**Key Variables**:
- `REDIS_PASSWORD=your-secure-password`
- `JWT_SECRET=your-jwt-secret`
- `SSL_CERT_PATH=/path/to/cert`
## Monitoring Setup
### Automatic Configuration
Monitoring dashboards and datasources are auto-provisioned:
```bash
# Setup monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py

# Start services (includes monitoring)
make dev-up
```
### Available Dashboards
1. **Load Test Performance**: Request rates, response times, error rates
2. **System Health**: Service status, memory, CPU usage
3. **Log Analysis**: Error logs, service logs, log aggregation
### Manual Configuration
If auto-provisioning fails:
1. **Access Grafana**: http://localhost:3000
2. **Add Data Sources**:
- Prometheus: http://prometheus:9090
- Loki: http://loki:3100
3. **Import Dashboards**: From `monitoring/grafana/dashboards/`
## Testing Procedures
### Pre-Deployment Testing
```bash
# Run unit tests
make test-unit
# Run integration tests
make test-integration
# Run full test suite
make test
# Run with coverage
make test-coverage
```
### Load Testing
```bash
# Run load tests
make load-test

# Run specific load scenarios
make benchmark-local

# Track performance over time
./scripts/track_performance.sh
```
## Monitoring & Logging
### Health Checks
- HTTP: `GET /health`
- WebSocket: Connection test
- Redis: Ping check
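As an illustration of the HTTP check, a minimal probe sketch in Python (the base URL is whatever host/port your deployment exposes; `check_health` is our name, not part of the project):

```python
import urllib.request
import urllib.error

def check_health(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx status
        return False
```

The same check works against the Caddy endpoint (`http://localhost:8080`) or the internal API port, depending on where you probe.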
### Metrics
- Prometheus metrics at `/metrics`
- Custom application metrics
- Container resource usage
### Logging
- Structured JSON logging
- Log levels: DEBUG, INFO, WARN, ERROR
- Centralized logging via Loki and Promtail
## Security
### TLS Configuration
```bash
# Generate self-signed cert (development)
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes

# Production - use Let's Encrypt
certbot certonly --standalone -d ml-experiments.example.com
```
Verify services after deployment:
```bash
# Check service health
curl -f http://localhost:8080/health

# Check monitoring services
curl -f http://localhost:3000/api/health
curl -f "http://localhost:9090/api/v1/query?query=up"
curl -f http://localhost:3100/ready
```
### Network Security
- Firewall rules (ports 9100, 9101, 6379)
- VPN access for internal services
- API key authentication
- Rate limiting
## Performance Tuning
### Resource Allocation
FetchML now centralizes pacing and container limits under a `resources` section in every server/worker config. Example for a homelab box:
```yaml
resources:
  max_workers: 1
  desired_rps_per_worker: 2  # conservative pacing per worker
  podman_cpus: "2"           # Podman --cpus, keeps host responsive
  podman_memory: "8g"        # Podman --memory, isolates experiment installs
```
For high-end machines (e.g., M2 Ultra, 18 performance cores / 64GB RAM), start with:
```yaml
resources:
  max_workers: 2              # two concurrent experiments
  desired_rps_per_worker: 5   # faster job submission
  podman_cpus: "8"
  podman_memory: "32g"
```
Adjust upward only if experiments stay GPU-bound; keeping Podman limits in place ensures users can install packages inside the container without jeopardizing the host.
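As a sanity check when tuning, the ceiling on job-submission throughput is simply the product of the two pacing settings; a quick sketch (the helper is illustrative, not part of FetchML):

```python
def aggregate_rps(max_workers: int, desired_rps_per_worker: float) -> float:
    """Upper bound on job submissions per second across all workers."""
    return max_workers * desired_rps_per_worker

# Homelab profile: 1 worker x 2 rps -> ceiling of 2 jobs/sec
print(aggregate_rps(1, 2))
# High-end profile: 2 workers x 5 rps -> ceiling of 10 jobs/sec
print(aggregate_rps(2, 5))
```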
### Scaling Strategies
- Horizontal pod autoscaling
- Redis clustering
- Load balancing
- CDN for static assets
## Backup & Recovery
### Data Backup
```bash
# Backup experiment data
docker-compose exec redis redis-cli BGSAVE
docker cp $(docker-compose ps -q redis):/data/dump.rdb ./redis-backup.rdb
# Backup data volume
docker run --rm -v ml-experiments_redis_data:/data -v $(pwd):/backup alpine tar czf /backup/redis-backup.tar.gz -C /data .
```
### Disaster Recovery
1. Restore Redis data
2. Restart services
3. Verify experiment metadata
4. Test API endpoints
## Troubleshooting
### Common Issues
**Port Conflicts**:
```bash
# Check port usage
lsof -i :9101
lsof -i :3000
lsof -i :9090

# Kill conflicting processes
kill -9 <PID>
```
**Container Issues**:
```bash
# View container logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana

# Restart services
make dev-restart

# Clean restart
make dev-down && make dev-up
```
**Monitoring Issues**:
```bash
# Re-setup monitoring configuration
python3 scripts/setup_monitoring.py

# Restart Grafana only
docker restart ml-experiments-grafana
```
### Performance Issues
**High Memory Usage**:
- Check Grafana dashboards for memory metrics
- Adjust Prometheus retention in `prometheus.yml`
- Monitor log retention in `loki-config.yml`
**Slow Response Times**:
- Check network connectivity between containers
- Verify Redis performance
- Review API server logs for bottlenecks
## Maintenance
### Regular Tasks
**Weekly**:
- Check Grafana dashboards for anomalies
- Review log files for errors
- Verify backup procedures
**Monthly**:
- Update Docker images
- Clean up old Docker volumes
- Review and rotate secrets
### Backup Procedures
**Data Backup**:
```bash
# Backup application data
docker run --rm -v ml_data:/data -v $(pwd):/backup alpine tar czf /backup/data-backup.tar.gz -C /data .

# Backup monitoring data
docker run --rm -v prometheus_data:/data -v $(pwd):/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data .
```
**Configuration Backup**:
```bash
# Backup configurations
tar czf config-backup.tar.gz monitoring/ deployments/ configs/
```
## Security Considerations
### Development Environment
- Change default Grafana password
- Use environment variables for secrets
- Monitor container logs for security events
### Production Environment
- Enable Redis authentication
- Use SSL/TLS certificates
- Implement network segmentation
- Regular security updates
- Monitor access logs
## Performance Optimization
### Resource Limits
**Development**:
```yaml
# docker-compose.dev.yml
services:
  api-server:
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '0.5'
```
**Production**:
```yaml
# docker-compose.prod.yml
services:
  api-server:
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '2.0'
```
### Monitoring Optimization
**Prometheus**:
- Adjust scrape intervals
- Configure retention periods
- Use recording rules for frequent queries
**Loki**:
- Configure log retention
- Use log sampling for high-volume sources
- Optimize label cardinality
## Non-Docker Production (systemd)
This project can be run in production without Docker. The recommended model is:
- Run `api-server` and `worker` as systemd services.
- Terminate TLS/WSS at Caddy and keep the API server on internal plain HTTP/WS.
The unit templates below are copy-paste friendly, but you must adjust paths, users, and config locations to your environment.
### `fetchml-api.service`
```ini
[Unit]
Description=FetchML API Server
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml
Environment=LOG_LEVEL=info
ExecStart=/usr/local/bin/api-server -config /etc/fetchml/api.yaml
Restart=on-failure
RestartSec=2
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/fetchml /var/log/fetchml
[Install]
WantedBy=multi-user.target
```
### `fetchml-worker.service`
```ini
[Unit]
Description=FetchML Worker
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml
Environment=LOG_LEVEL=info
ExecStart=/usr/local/bin/worker -config /etc/fetchml/worker.yaml
Restart=on-failure
RestartSec=2
NoNewPrivileges=true
PrivateTmp=true
[Install]
WantedBy=multi-user.target
```
### Optional: `caddy.service`
Most distros ship a Caddy systemd unit. If you do not have one available, you can use this template.
```ini
[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
NoNewPrivileges=true
Restart=on-failure
[Install]
WantedBy=multi-user.target
```
## Migration Guide
### From Development to Production
1. **Export Data**:
```bash
docker exec ml-data redis-cli BGSAVE
docker cp ml-data:/data/dump.rdb ./redis-backup.rdb
```
2. **Update Configuration**:
```bash
cp deployments/env.dev.example deployments/env.prod.example
# Edit with production values
```
3. **Deploy Production**:
```bash
cd deployments
make prod-up
```
4. **Import Data**:
```bash
docker cp ./redis-backup.rdb ml-prod-redis:/data/dump.rdb
docker restart ml-prod-redis
```
## Support
For deployment issues:
1. Check the troubleshooting section
2. Review container logs
3. Verify network connectivity
4. Check resource usage in Grafana

# Performance Monitoring
Comprehensive performance monitoring system for Fetch ML with CI/CD integration, profiling, and production deployment.
## Quick Start
### 5-Minute Setup
```bash
# Start monitoring stack
make dev-up
# Run benchmarks
make benchmark
# View results in Grafana
open http://localhost:3000
```
### Basic Profiling
```bash
# CPU profiling
make profile-load-norate
# View interactive profile
go tool pprof -http=:8080 cpu_load.out
```
## Architecture
**Development**: Docker Compose with integrated monitoring
**Production**: Podman + systemd (Linux)
**CI/CD**: GitHub Actions → Prometheus Pushgateway → Grafana
## Components
### 1. Development Monitoring (Docker Compose)
**Services**:
- **Grafana**: http://localhost:3000 (admin/admin123)
- **Prometheus**: http://localhost:9090
- **Loki**: http://localhost:3100
- **Promtail**: Log aggregation
**Configuration**:
```bash
# Start dev stack with monitoring
make dev-up
# Verify services
curl -f http://localhost:3000/api/health
curl -f http://localhost:9090/api/v1/query?query=up
curl -f http://localhost:3100/ready
```
### 2. Production Monitoring (Podman + systemd)
**Architecture**:
- Each service runs as separate Podman container
- Managed by systemd for automatic restarts
- Proper lifecycle management
**Setup**:
```bash
# Run production setup script
sudo ./scripts/setup-monitoring-prod.sh /data/monitoring ml-user ml-group
# Start services
sudo systemctl start prometheus loki promtail grafana
sudo systemctl enable prometheus loki promtail grafana
```
**Access**:
- URL: `http://YOUR_SERVER_IP:3000`
- Username: `admin`
- Password: `admin` (change on first login)
### 3. CI/CD Integration
**GitHub Actions Workflow**:
- **Triggers**: Push to main/develop, PRs, daily schedule, manual
- **Function**: Runs benchmarks and pushes metrics to Prometheus
**Setup**:
```bash
make monitoring-performance
```
This starts:
- Grafana: http://localhost:3001 (admin/admin)
- Loki: http://localhost:3100
- Pushgateway: http://localhost:9091
Add the Pushgateway URL as a GitHub repository secret:
```
PROMETHEUS_PUSHGATEWAY_URL=http://your-pushgateway:9091
```
## Performance Testing
### Benchmarks
```bash
# Run benchmarks locally
make benchmark
# Run with detailed output
go test -bench=. -benchmem ./tests/benchmarks/...
# Run specific benchmark
go test -bench=BenchmarkName -benchmem ./tests/benchmarks/...
# Run with race detection
go test -race -bench=. ./tests/benchmarks/...
```
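The CI workflow turns lines of `go test -bench` output into the pushed `benchmark_*` metrics; a sketch of that parsing step (the exact workflow implementation may differ, and the output format assumed is the standard Go benchmark line shown in the comment):

```python
import re

# Typical `go test -bench=. -benchmem` line, e.g.:
# BenchmarkAPIServerCreateJobSimple-8   30000   42653 ns/op   13518 B/op   98 allocs/op
BENCH_RE = re.compile(
    r"^(Benchmark\S+?)(?:-\d+)?\s+\d+\s+([\d.]+) ns/op"
    r"(?:\s+([\d.]+) B/op)?(?:\s+([\d.]+) allocs/op)?"
)

def parse_bench_line(line: str):
    """Parse one benchmark result line; return None for non-benchmark lines."""
    m = BENCH_RE.match(line.strip())
    if not m:
        return None
    name, ns, mem, allocs = m.groups()
    return {
        "benchmark": name,                                  # label for the pushed metric
        "time_per_op": float(ns),                           # -> benchmark_time_per_op
        "memory_per_op": float(mem) if mem else None,       # -> benchmark_memory_per_op
        "allocs_per_op": float(allocs) if allocs else None, # -> benchmark_allocs_per_op
    }
```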
### Load Testing
```bash
# Run load test suite
make load-test

# Start Redis for load tests
docker run -d -p 6379:6379 redis:alpine
```
### CPU Profiling
#### HTTP Load Test Profiling
```bash
# CPU profile MediumLoad HTTP test (no rate limiting - recommended)
make profile-load-norate

# CPU profile MediumLoad HTTP test (with rate limiting)
make profile-load
```
**Analyze Results**:
```bash
# View interactive profile (web UI)
go tool pprof -http=:8081 cpu_load.out
# View interactive profile (terminal)
go tool pprof cpu_load.out
# Generate flame graph (requires Brendan Gregg's FlameGraph scripts on PATH)
go tool pprof -raw cpu_load.out | stackcollapse-go.pl | flamegraph.pl > cpu_flame.svg
# View top functions
go tool pprof -top cpu_load.out
```
#### WebSocket Queue Profiling
```bash
# CPU profile WebSocket → Redis queue → worker path
make profile-ws-queue
# View interactive profile
go tool pprof -http=:8082 cpu_ws.out
```
### Profiling Tips
- Use `profile-load-norate` for cleaner CPU profiles (no rate limiting delays)
- Profiles run for 60 seconds by default
- Requires Redis running on localhost:6379
- Results show throughput, latency, and error rate metrics
## Grafana Dashboards
### Development Dashboards
**Access**: http://localhost:3000 (admin/admin123)
**Available Dashboards**:
1. **Load Test Performance**: Request metrics, response times, error rates
2. **System Health**: Service status, resource usage, memory, CPU
3. **Log Analysis**: Error logs, service logs, log aggregation
### Production Dashboards
**Auto-loaded Dashboards**:
- **ML Task Queue Monitoring** (metrics)
- **Application Logs** (Loki logs)
### Key Metrics
- `benchmark_time_per_op` - Execution time
- `benchmark_memory_per_op` - Memory usage
- `benchmark_allocs_per_op` - Allocation count
- HTTP request rates and response times
- Error rates and system health metrics
### Worker Resource Metrics
The worker exposes a Prometheus endpoint (default `:9100/metrics`) which includes ResourceManager and task execution metrics.
**Resource availability**:
- `fetchml_resources_cpu_total` - Total CPU tokens managed by the worker.
- `fetchml_resources_cpu_free` - Currently free CPU tokens.
- `fetchml_resources_gpu_slots_total{gpu_index="N"}` - Total GPU slots per GPU index.
- `fetchml_resources_gpu_slots_free{gpu_index="N"}` - Free GPU slots per GPU index.
**Acquisition pressure**:
- `fetchml_resources_acquire_total` - Total resource acquisition attempts.
- `fetchml_resources_acquire_wait_total` - Number of acquisitions that had to wait.
- `fetchml_resources_acquire_timeout_total` - Number of acquisitions that timed out.
- `fetchml_resources_acquire_wait_seconds_total` - Total time spent waiting for resources.
**Why these help**:
- Debug why runs slow down under load (wait time increases).
- Confirm GPU slot sharing is working (free slots fluctuate as expected).
- Detect saturation and timeouts before tasks start failing.
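For example, the average time spent waiting per acquisition over the last 5 minutes can be derived from the counters above (a PromQL sketch, not a shipped recording rule):

```promql
rate(fetchml_resources_acquire_wait_seconds_total[5m])
  / rate(fetchml_resources_acquire_total[5m])
```

Sustained growth in this ratio means workers are queuing on CPU tokens or GPU slots.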
### Prometheus Scrape Example (Worker)
If you run the worker locally on your machine (default metrics port `:9100`) and Prometheus runs in Docker Compose, use `host.docker.internal`:
```yaml
scrape_configs:
  - job_name: 'worker'
    static_configs:
      - targets: ['host.docker.internal:9100']
    metrics_path: /metrics
    scrape_interval: 15s
```
## Production Deployment
### Prerequisites
- Linux distribution with systemd (Rocky/RHEL/CentOS, Ubuntu/Debian, Arch, SUSE)
- Production app already deployed
- Root or sudo access
- Ports 3000, 9090, 3100 available
### Service Configuration
**Prometheus**:
- **Port**: 9090
- **Config**: `/etc/fetch_ml/monitoring/prometheus/prometheus.yml`
- **Data**: `/data/monitoring/prometheus`
- **Purpose**: Scrapes metrics from API server
**Loki**:
- **Port**: 3100
- **Config**: `/etc/fetch_ml/monitoring/loki-config.yml`
- **Data**: `/data/monitoring/loki`
- **Purpose**: Log aggregation
**Promtail**:
- **Config**: `/etc/fetch_ml/monitoring/promtail-config.yml`
- **Log Source**: `/var/log/fetch_ml/*.log`
- **Purpose**: Ships logs to Loki
**Grafana**:
- **Port**: 3000
- **Config**: `/etc/fetch_ml/monitoring/grafana/provisioning`
- **Data**: `/data/monitoring/grafana`
- **Dashboards**: `/var/lib/grafana/dashboards`
### Management Commands
```bash
# Check status
sudo systemctl status prometheus grafana loki promtail
# View logs
sudo journalctl -u prometheus -f
sudo journalctl -u grafana -f
sudo journalctl -u loki -f
sudo journalctl -u promtail -f
# Restart services
sudo systemctl restart prometheus
sudo systemctl restart grafana
# Stop all monitoring
sudo systemctl stop prometheus grafana loki promtail
```
## Data Retention
### Prometheus
Default: 15 days. Retention is controlled by a command-line flag rather than a `prometheus.yml` setting; add it where the Prometheus service is launched:
```
--storage.tsdb.retention.time=30d
```
### Loki
Default: 30 days. Edit `/etc/fetch_ml/monitoring/loki-config.yml` (recent Loki versions also require `retention_enabled: true` in the `compactor` section for deletion to run):
```yaml
limits_config:
  retention_period: 30d
```
## Security
### Firewall
**RHEL/Rocky/Fedora (firewalld)**:
```bash
# Remove public access
sudo firewall-cmd --permanent --remove-port=3000/tcp
sudo firewall-cmd --permanent --remove-port=9090/tcp
# Add specific source
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port port="3000" protocol="tcp" accept'
sudo firewall-cmd --reload
```
**Ubuntu/Debian (ufw)**:
```bash
# Remove public access
sudo ufw delete allow 3000/tcp
sudo ufw delete allow 9090/tcp
# Add specific source
sudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp
```
### Authentication
Change Grafana admin password:
1. Login to Grafana
2. User menu → Profile → Change Password
### TLS (Optional)
For HTTPS, configure reverse proxy (nginx/Apache) in front of Grafana.
## Performance Regression Detection
```bash
# Detect regressions against the stored baseline
make detect-regressions

# Capture current performance numbers
go test -bench=. -benchmem ./tests/benchmarks/... | tee current.txt
# Track performance over time
./scripts/track_performance.sh
```
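The comparison the regression step performs boils down to a relative-change check; a minimal sketch (the 10% threshold and the function name are our illustration, not the script's actual logic):

```python
def is_regression(baseline_ns: float, current_ns: float, threshold: float = 0.10) -> bool:
    """Flag a regression when time/op grows by more than `threshold` (10% default)."""
    return current_ns > baseline_ns * (1 + threshold)

# ~17% slower than baseline -> flagged
print(is_regression(42653, 50000))
# Within the 10% noise band -> not flagged
print(is_regression(42653, 43000))
```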
## Troubleshooting
### Development Issues
**No metrics in Grafana?**
```bash
# Check services
docker ps --filter "name=ml-"

# View Pushgateway metrics
curl http://localhost:9091/metrics

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Test manual metric push
echo "test_metric 123" | curl --data-binary @- http://localhost:9091/metrics/job/test

# Check monitoring services
curl http://localhost:3000/api/health
curl "http://localhost:9090/api/v1/query?query=up"
```
**Workflow failing?**
- Verify GitHub secret configuration
- Check workflow logs in GitHub Actions
**Profiling Issues**:
```bash
# Run the profile manually if the Make target fails
go test ./tests/load -run TestLoadProfile_Medium -count=1 -cpuprofile cpu_load.out -v -args -profile-norate

# Start Redis if it is not running
docker run -d -p 6379:6379 redis:alpine

# Check for port conflicts
lsof -i :3000 # Grafana
lsof -i :8080 # pprof web UI
lsof -i :6379 # Redis
```
### Production Issues
**Grafana shows no data**:
```bash
# Check if Prometheus is reachable
curl http://localhost:9090/-/healthy
# Check datasource in Grafana
# Settings → Data Sources → Prometheus → Save & Test
```
**Loki not receiving logs**:
```bash
# Check Promtail is running
sudo systemctl status promtail
# Verify log file exists
ls -l /var/log/fetch_ml/
# Check Promtail can reach Loki
curl http://localhost:3100/ready
```
**Podman containers not starting**:
```bash
# Check pod status
sudo -u ml-user podman pod ps
sudo -u ml-user podman ps -a
# Remove and recreate
sudo -u ml-user podman pod stop monitoring
sudo -u ml-user podman pod rm monitoring
sudo systemctl restart prometheus
```
## Backup and Recovery
### Backup Procedures
```bash
# Development backup
docker run --rm -v prometheus_data:/data -v $(pwd):/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data .
docker run --rm -v grafana_data:/data -v $(pwd):/backup alpine tar czf /backup/grafana-backup.tar.gz -C /data .
# Production backup
sudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana
sudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus
```
### Configuration Backup
```bash
# Backup configurations
tar czf monitoring-config-backup.tar.gz monitoring/ deployments/
```
## Updates and Maintenance
### Development Updates
```bash
# Update monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py
# Restart services
make dev-down && make dev-up
```
### Production Updates
```bash
# Pull latest images
sudo -u ml-user podman pull docker.io/grafana/grafana:latest
sudo -u ml-user podman pull docker.io/prom/prometheus:latest
sudo -u ml-user podman pull docker.io/grafana/loki:latest
sudo -u ml-user podman pull docker.io/grafana/promtail:latest
# Restart services to use new images
sudo systemctl restart grafana prometheus loki promtail
```
### Regular Maintenance
**Weekly**:
- Check Grafana dashboards for anomalies
- Review log files for errors
- Verify backup procedures
**Monthly**:
- Update Docker/Podman images
- Clean up old data volumes
- Review and rotate secrets
## Metrics Reference
### Worker Metrics
**Task Processing**:
- `fetchml_tasks_processed_total` - Total tasks processed successfully
- `fetchml_tasks_failed_total` - Total tasks failed
- `fetchml_tasks_active` - Currently active tasks
- `fetchml_tasks_queued` - Current queue depth
**Data Transfer**:
- `fetchml_data_transferred_bytes_total` - Total bytes transferred
- `fetchml_data_fetch_time_seconds_total` - Total time fetching datasets
- `fetchml_execution_time_seconds_total` - Total task execution time
**Prewarming**:
- `fetchml_prewarm_env_hit_total` - Environment prewarm hits (warm image existed)
- `fetchml_prewarm_env_miss_total` - Environment prewarm misses (warm image not found)
- `fetchml_prewarm_env_built_total` - Environment images built for prewarming
- `fetchml_prewarm_env_time_seconds_total` - Total time building prewarm images
- `fetchml_prewarm_snapshot_hit_total` - Snapshot prewarm hits (found in .prewarm/)
- `fetchml_prewarm_snapshot_miss_total` - Snapshot prewarm misses (not in .prewarm/)
- `fetchml_prewarm_snapshot_built_total` - Snapshots prewarmed into .prewarm/
- `fetchml_prewarm_snapshot_time_seconds_total` - Total time prewarming snapshots
**Resources**:
- `fetchml_resources_cpu_total` - Total CPU tokens
- `fetchml_resources_cpu_free` - Free CPU tokens
- `fetchml_resources_gpu_slots_total` - Total GPU slots per index
- `fetchml_resources_gpu_slots_free` - Free GPU slots per index
### API Server Metrics
**HTTP**:
- `fetchml_http_requests_total` - Total HTTP requests
- `fetchml_http_duration_seconds` - HTTP request duration
**WebSocket**:
- `fetchml_websocket_connections` - Active WebSocket connections
- `fetchml_websocket_messages_total` - Total WebSocket messages
- `fetchml_websocket_duration_seconds` - Message processing duration
- `fetchml_websocket_errors_total` - WebSocket errors
**Jupyter**:
- `fetchml_jupyter_services` - Jupyter services count
- `fetchml_jupyter_operations_total` - Jupyter operations
## Worker Configuration: Prewarming
### Prewarm Flag
Enable Phase 1 prewarming in worker configuration:
```yaml
# worker-config.yaml
prewarm_enabled: true # Default: false (opt-in)
```
**Behavior**:
- When `false`: No prewarming loops run
- When `true`: Worker stages next snapshot and fetches datasets when idle
**What gets prewarmed**:
1. **Snapshots**: Copied to `.prewarm/snapshots/<taskID>/`
2. **Datasets**: Fetched to `.prewarm/datasets/` (if `auto_fetch_data: true`)
3. **Environment images**: Warmed in envpool cache (if deps manifest exists)
**Execution path**:
- During task execution, `StageSnapshotFromPath` checks `.prewarm/snapshots/<taskID>/`
- If found: **Hit** - Renames prewarmed directory into job (fast)
- If not found: **Miss** - Copies from snapshot store (slower)
**Metrics impact**:
- Prewarm hits reduce task startup latency
- Metrics track hit/miss ratios and prewarm timing
- Use `fetchml_prewarm_snapshot_*` metrics to monitor effectiveness
### Grafana Dashboards
**Prewarm Performance Dashboard**:
1. Import `monitoring/grafana/dashboards/prewarm-performance.txt` into Grafana
2. Shows hit rates, build times, and efficiency metrics
3. Use for monitoring prewarm effectiveness
**Worker Resources Dashboard**:
- Added prewarm panels to existing worker-resources dashboard
- Environment and snapshot hit rate percentages
- Prewarm hits vs misses graphs
- Build time and build count metrics
### Prometheus Queries
**Hit Rate Calculations**:
```promql
# Environment prewarm hit rate
100 * (fetchml_prewarm_env_hit_total / clamp_min(fetchml_prewarm_env_hit_total + fetchml_prewarm_env_miss_total, 1))
# Snapshot prewarm hit rate
100 * (fetchml_prewarm_snapshot_hit_total / clamp_min(fetchml_prewarm_snapshot_hit_total + fetchml_prewarm_snapshot_miss_total, 1))
```
**Rate-based Monitoring**:
```promql
# Prewarm activity rate
rate(fetchml_prewarm_env_hit_total[5m])
rate(fetchml_prewarm_snapshot_hit_total[5m])
# Build time rate
rate(fetchml_prewarm_env_time_seconds_total[5m])
rate(fetchml_prewarm_snapshot_time_seconds_total[5m])
```
## Advanced Usage
### Custom Dashboards
1. **Access Grafana**: http://localhost:3000
2. **Create Dashboard**: + → Dashboard
3. **Add Panels**: Use Prometheus queries
4. **Save Dashboard**: Export JSON for sharing
### Alerting
Set up Grafana alerts for:
- Performance regressions (>10% degradation)
- Missing benchmark data
- High memory allocation rates
- High error rates (> 5%)
- Slow response times (> 1s)
- Service downtime
- Resource exhaustion
- Low prewarm hit rates (< 50%)
### Retention
Configure appropriate retention periods:
- Raw metrics: 30 days
- Aggregated data: 1 year
- Dashboard snapshots: Permanent
### Custom Metrics
Add custom metrics to your Go code:
```go
import "github.com/prometheus/client_golang/prometheus"

var (
    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "HTTP request duration in seconds",
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    // Metrics must be registered before they appear on /metrics.
    prometheus.MustRegister(requestDuration)
}

// Record metrics
requestDuration.WithLabelValues("GET", "/api/v1/jobs").Observe(duration)
```
## Integration with Existing Workflows
The benchmark monitoring integrates seamlessly with:
- **CI/CD pipelines**: Automatic execution
- **Code reviews**: Performance impact visible
- **Release management**: Performance trends over time
- **Development**: Local testing with same metrics
## Future Enhancements
Potential improvements:
1. **Automated performance regression alerts**
2. **Performance budgets and gates**
3. **Comparative analysis across branches**
4. **Integration with load testing results**
5. **Performance impact scoring**
## Support
For issues:
1. Check this documentation
2. Review GitHub Actions logs
3. Verify monitoring stack status
4. Consult Grafana/Prometheus docs
## See Also
- **[Testing Guide](testing.md)** - Testing with monitoring
- **[Deployment Guide](deployment.md)** - Deployment procedures
- **[Architecture Guide](architecture.md)** - System architecture
- **[Troubleshooting](troubleshooting.md)** - Common issues
---
*Last updated: December 2024*


# Performance Monitoring Quick Start
Get started with performance monitoring and profiling in 5 minutes.
## Quick Start Options
### Option 1: Basic Benchmarking
```bash
# Run benchmarks
make benchmark
# View results in Grafana
open http://localhost:3001
```
### Option 2: CPU Profiling
```bash
# Generate CPU profile
make profile-load-norate
# View interactive profile
go tool pprof -http=:8080 cpu_load.out
```
### Option 3: Full Monitoring Stack
```bash
# Start monitoring services
make monitoring-performance
# Run benchmarks with metrics collection
make benchmark
# View in Grafana dashboard
open http://localhost:3001
```
## Prerequisites
- Docker and Docker Compose
- Go 1.21 or later
- Redis (for load tests)
- GitHub repository (for CI/CD integration)
## 1. Setup & Installation
### Start Monitoring Stack (Optional)
For full metrics visualization:
```bash
make monitoring-performance
```
This starts:
- **Grafana**: http://localhost:3001 (admin/admin)
- **Pushgateway**: http://localhost:9091
- **Loki**: http://localhost:3100
### Start Redis (Required for Load Tests)
```bash
docker run -d -p 6379:6379 redis:alpine
```
## 2. Performance Testing
### Benchmarks
```bash
# Run benchmarks locally
make benchmark
# Or run with detailed output
go test -bench=. -benchmem ./tests/benchmarks/...
```
### Load Testing
```bash
# Run load test suite
make load-test
```
## 3. CPU Profiling
### HTTP Load Test Profiling
```bash
# CPU profile MediumLoad HTTP test (with rate limiting)
make profile-load
# CPU profile MediumLoad HTTP test (no rate limiting - recommended)
make profile-load-norate
```
**Analyze Results:**
```bash
# View interactive profile (web UI)
go tool pprof -http=:8081 cpu_load.out
# View interactive profile (terminal)
go tool pprof cpu_load.out
# Generate flame graph (requires the FlameGraph scripts on PATH)
go tool pprof -raw cpu_load.out | stackcollapse-go.pl | flamegraph.pl > cpu_flame.svg
# View top functions
go tool pprof -top cpu_load.out
```
Web UI: http://localhost:8081
### WebSocket Queue Profiling
```bash
# CPU profile WebSocket → Redis queue → worker path
make profile-ws-queue
```
**Analyze Results:**
```bash
# View interactive profile (web UI)
go tool pprof -http=:8082 cpu_ws.out
# View interactive profile (terminal)
go tool pprof cpu_ws.out
```
### Profiling Tips
- Use `profile-load-norate` for cleaner CPU profiles (no rate limiting delays)
- Profiles run for 60 seconds by default
- Requires Redis running on localhost:6379
- Results show throughput, latency, and error rate metrics
## 4. Results & Visualization
### Grafana Dashboard
Open: http://localhost:3001 (admin/admin)
Navigate to the **Performance Dashboard** to see:
- Real-time benchmark results
- Historical trends
- Performance comparisons
### Key Metrics
- `benchmark_time_per_op` - Execution time
- `benchmark_memory_per_op` - Memory usage
- `benchmark_allocs_per_op` - Allocation count
## 5. CI/CD Integration
### Setup GitHub Integration
Add GitHub secret:
```
PROMETHEUS_PUSHGATEWAY_URL=http://your-pushgateway:9091
```
Now benchmarks run automatically on:
- Every push to main/develop
- Pull requests
- Daily schedule
### Verify Integration
1. Push code to trigger workflow
2. Check Pushgateway: http://localhost:9091/metrics
3. View metrics in Grafana
## 6. Troubleshooting
### Monitoring Stack Issues
**No metrics in Grafana?**
```bash
# Check services
docker ps --filter "name=monitoring"
# Check Pushgateway
curl http://localhost:9091/metrics
```
**Workflow failing?**
- Verify GitHub secret configuration
- Check workflow logs in GitHub Actions
### Profiling Issues
**Flag error like "flag provided but not defined: -test.paniconexit0"**
```bash
# This should be fixed now, but if it persists:
go test ./tests/load -run TestLoadProfile_Medium -count=1 -cpuprofile cpu_load.out -v -args -profile-norate
```
**Redis not available?**
```bash
# Start Redis for profiling tests
docker run -d -p 6379:6379 redis:alpine
# Check profile file generated
ls -la cpu_load.out
```
**Port conflicts?**
```bash
# Check if ports are in use
lsof -i :3001 # Grafana
lsof -i :8080 # pprof web UI
lsof -i :6379 # Redis
```
## 7. Advanced Usage
### Performance Regression Detection
```bash
# Create baseline
make detect-regressions
# Analyze current performance
go test -bench=. -benchmem ./tests/benchmarks/... | tee current.txt
```
### Custom Benchmarks
```bash
# Run specific benchmark
go test -bench=BenchmarkName -benchmem ./tests/benchmarks/...
# Run with race detection
go test -race -bench=. ./tests/benchmarks/...
```
## 8. Further Reading
- [Full Documentation](performance-monitoring.md)
- [Dashboard Customization](performance-monitoring.md#grafana-dashboard)
- [Alert Configuration](performance-monitoring.md#alerting)
- [Architecture Guide](architecture.md)
- [Testing Guide](testing.md)
---
*Ready in 5 minutes!*


# Production Monitoring Deployment Guide (Linux)
This guide covers deploying the monitoring stack (Prometheus, Grafana, Loki, Promtail) on Linux production servers.
## Architecture
**Testing**: Docker Compose (macOS/Linux)
**Production**: Podman + systemd (Linux)
**Important**: Docker Compose is for testing only. Podman runs the actual ML experiments in production.
Each service runs as a separate Podman container managed by systemd for automatic restarts and proper lifecycle management.
## Prerequisites
**Container Runtimes:**
- **Docker Compose**: For testing and development only
- **Podman**: For production experiment execution
- Linux distribution with systemd (Rocky/RHEL/CentOS, Ubuntu/Debian, Arch, SUSE, etc.)
- Production app already deployed (see `scripts/setup-prod.sh`)
- Root or sudo access
- Ports 3000, 9090, 3100 available
## Quick Setup
### 1. Run Setup Script
```bash
cd /path/to/fetch_ml
sudo ./scripts/setup-monitoring-prod.sh /data/monitoring ml-user ml-group
```
This will:
- Create directory structure at `/data/monitoring`
- Copy configuration files to `/etc/fetch_ml/monitoring`
- Create systemd services for each component
- Set up firewall rules
### 2. Start Services
```bash
# Start all monitoring services
sudo systemctl start prometheus
sudo systemctl start loki
sudo systemctl start promtail
sudo systemctl start grafana
# Enable on boot
sudo systemctl enable prometheus loki promtail grafana
```
### 3. Access Grafana
- URL: `http://YOUR_SERVER_IP:3000`
- Username: `admin`
- Password: `admin` (change on first login)
Dashboards will auto-load:
- **ML Task Queue Monitoring** (metrics)
- **Application Logs** (Loki logs)
## Service Details
### Prometheus
- **Port**: 9090
- **Config**: `/etc/fetch_ml/monitoring/prometheus.yml`
- **Data**: `/data/monitoring/prometheus`
- **Purpose**: Scrapes metrics from API server
### Loki
- **Port**: 3100
- **Config**: `/etc/fetch_ml/monitoring/loki-config.yml`
- **Data**: `/data/monitoring/loki`
- **Purpose**: Log aggregation
### Promtail
- **Config**: `/etc/fetch_ml/monitoring/promtail-config.yml`
- **Log Source**: `/var/log/fetch_ml/*.log`
- **Purpose**: Ships logs to Loki
### Grafana
- **Port**: 3000
- **Config**: `/etc/fetch_ml/monitoring/grafana/provisioning`
- **Data**: `/data/monitoring/grafana`
- **Dashboards**: `/var/lib/grafana/dashboards`
## Management Commands
```bash
# Check status
sudo systemctl status prometheus grafana loki promtail
# View logs
sudo journalctl -u prometheus -f
sudo journalctl -u grafana -f
sudo journalctl -u loki -f
sudo journalctl -u promtail -f
# Restart services
sudo systemctl restart prometheus
sudo systemctl restart grafana
# Stop all monitoring
sudo systemctl stop prometheus grafana loki promtail
```
## Data Retention
### Prometheus
Default: 15 days. Retention is a Prometheus command-line flag, not a `prometheus.yml` setting; add it to the container/service arguments (e.g., in the systemd unit):
```bash
--storage.tsdb.retention.time=30d
```
### Loki
Default: 30 days. Edit `/etc/fetch_ml/monitoring/loki-config.yml` (deletion also requires the Loki compactor to run with `retention_enabled: true`):
```yaml
limits_config:
  retention_period: 30d
```
## Security
### Firewall
The setup script automatically configures firewall rules using the detected firewall manager (firewalld or ufw).
For manual firewall configuration:
**RHEL/Rocky/Fedora (firewalld)**:
```bash
# Remove public access
sudo firewall-cmd --permanent --remove-port=3000/tcp
sudo firewall-cmd --permanent --remove-port=9090/tcp
# Add specific source
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port port="3000" protocol="tcp" accept'
sudo firewall-cmd --reload
```
**Ubuntu/Debian (ufw)**:
```bash
# Remove public access
sudo ufw delete allow 3000/tcp
sudo ufw delete allow 9090/tcp
# Add specific source
sudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp
```
### Authentication
Change Grafana admin password:
1. Login to Grafana
2. User menu → Profile → Change Password
### TLS (Optional)
For HTTPS, configure a reverse proxy (e.g., Caddy or nginx) in front of Grafana.
## Troubleshooting
### Grafana shows no data
```bash
# Check if Prometheus is reachable
curl http://localhost:9090/-/healthy
# Check datasource in Grafana
# Settings → Data Sources → Prometheus → Save & Test
```
### Loki not receiving logs
```bash
# Check Promtail is running
sudo systemctl status promtail
# Verify log file exists
ls -l /var/log/fetch_ml/
# Check Promtail can reach Loki
curl http://localhost:3100/ready
```
### Podman containers not starting
```bash
# Check pod status
sudo -u ml-user podman pod ps
sudo -u ml-user podman ps -a
# Remove and recreate
sudo -u ml-user podman pod stop monitoring
sudo -u ml-user podman pod rm monitoring
sudo systemctl restart prometheus
```
## Backup
```bash
# Backup Grafana dashboards and data
sudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana
# Backup Prometheus data
sudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus
```
## Updates
```bash
# Pull latest images
sudo -u ml-user podman pull docker.io/grafana/grafana:latest
sudo -u ml-user podman pull docker.io/prom/prometheus:latest
sudo -u ml-user podman pull docker.io/grafana/loki:latest
sudo -u ml-user podman pull docker.io/grafana/promtail:latest
# Restart services to use new images
sudo systemctl restart grafana prometheus loki promtail
```


# Quick Start
Get Fetch ML running in minutes with Docker Compose and integrated monitoring.
## Prerequisites
- **Docker Compose**: For testing and development only
- **Podman**: For production experiment execution
**Requirements:**
- Go 1.21+
- Zig 0.11+
- Docker Compose (testing only)
- 4GB+ RAM
- 2GB+ disk space
- Git
## One-Command Setup
# Clone and start
git clone https://github.com/jfraeys/fetch_ml.git
cd fetch_ml
make dev-up
# Wait for services (30 seconds)
sleep 30
# Verify setup
curl http://localhost:8080/health
```
Note: the development compose runs the API server over HTTP/WS for CLI compatibility. For HTTPS/WSS, terminate TLS at a reverse proxy.
**Access Services:**
- **API Server (via Caddy)**: http://localhost:8080
- **API Server (via Caddy + internal TLS)**: https://localhost:8443
- **Grafana**: http://localhost:3000 (admin/admin123)
- **Prometheus**: http://localhost:9090
- **Loki**: http://localhost:3100
## Development Setup
### Build Components
```bash
# Build all components
make build
# Development build
make dev
```
### Start Services
```bash
# Start development stack with monitoring
make dev-up
# Check status
make dev-status
# Stop services
make dev-down
```
### Verify Setup
```bash
# Check API health
curl -f http://localhost:8080/health
# Check monitoring services
curl -f http://localhost:3000/api/health
curl -f http://localhost:9090/api/v1/query?query=up
curl -f http://localhost:3100/ready
# Check Redis
docker exec ml-experiments-redis redis-cli ping
```
## First Experiment
```bash
# Submit a simple ML job (see [First Experiment](first-experiment.md) for details)
curl -X POST http://localhost:9101/api/v1/jobs \
-H "Content-Type: application/json" \
-H "X-API-Key: admin" \
-d '{
"job_name": "hello-world",
"args": "--echo Hello World",
"priority": 1
}'
# Check job status
curl http://localhost:9101/api/v1/jobs \
-H "X-API-Key: admin"
```
## CLI Access
### 1. Setup CLI
```bash
# Build CLI
cd cli && zig build --release=fast
# Initialize CLI config
./cli/zig-out/bin/ml init
```
### 2. Queue Job
```bash
# Simple test job
echo "test experiment" | ./cli/zig-out/bin/ml queue test-job
# Check status
./cli/zig-out/bin/ml status
```
### 3. Monitor Progress
```bash
# View in Grafana
open http://localhost:3000
# Check logs in Grafana Log Analysis dashboard
# Or view container logs
docker logs ml-experiments-api -f
```
## Key Commands
### Development Commands
```bash
make help # Show all commands
make build # Build all components
make dev-up # Start dev environment
make dev-down # Stop dev environment
make dev-status # Check dev status
make test # Run tests
make test-unit # Run unit tests
make test-integration # Run integration tests
```
### CLI Commands
```bash
# Build CLI
cd cli && zig build --release=fast
# Common operations
./cli/zig-out/bin/ml status # Check system status
./cli/zig-out/bin/ml queue job-name # Queue job
./cli/zig-out/bin/ml list # List jobs
./cli/zig-out/bin/ml help # Show help
```
### Monitoring Commands
```bash
# Access monitoring services
open http://localhost:3000 # Grafana
open http://localhost:9090 # Prometheus
open http://localhost:3100 # Loki
# (Optional) Re-generate Grafana provisioning (datasources/providers)
python3 scripts/setup_monitoring.py
```
## Configuration
### Environment Setup
```bash
# Copy example environment
cp deployments/env.dev.example .env
# Edit as needed
vim .env
```
**Key Variables**:
- `LOG_LEVEL=info`
- `GRAFANA_ADMIN_PASSWORD=admin123`
### CLI Configuration
```bash
# Setup CLI config
mkdir -p ~/.ml
# Create config file if needed
touch ~/.ml/config.toml
# Edit configuration
vim ~/.ml/config.toml
```
## Testing
### Quick Test
```bash
# 5-minute authentication test
make test-auth
# Clean up
make self-cleanup
```
### Full Test Suite
```bash
# Run all tests
make test
# Run with coverage
make test-coverage
# Run specific test types
make test-unit
make test-integration
make test-e2e
```
### Load Testing
```bash
# Run load tests
make load-test
# Run benchmarks
make benchmark
# Track performance
./scripts/track_performance.sh
```
## Troubleshooting
### Common Issues
**Port Conflicts**:
```bash
# Check port usage
lsof -i :8080
lsof -i :8443
lsof -i :3000
lsof -i :9090
# Kill conflicting processes
kill -9 <PID>
```
**Build Issues**:
```bash
# Fix Go modules
go mod tidy
# Fix Zig build
cd cli && rm -rf zig-out zig-cache && zig build --release=fast
```
**Container Issues**:
```bash
# Check container status
docker ps --filter "name=ml-"
# View logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana
# Restart services
make dev-down && make dev-up
```
**Monitoring Issues**:
```bash
# Re-setup monitoring
python3 scripts/setup_monitoring.py
# Restart Grafana
docker restart ml-experiments-grafana
# Check datasources in Grafana
# Settings → Data Sources → Test connection
```
### Debug Mode
```bash
# Enable debug logging
export LOG_LEVEL=debug
make dev-up
```
## Next Steps
### Explore Features
1. **Job Management**: Queue and monitor ML experiments
2. **WebSocket Communication**: Real-time updates
3. **Multi-User Authentication**: Role-based access control
4. **Performance Monitoring**: Grafana dashboards and metrics
5. **Log Aggregation**: Centralized logging with Loki
### Advanced Configuration
- **Production Setup**: See [Deployment Guide](deployment.md)
- **Performance Monitoring**: See [Performance Monitoring](performance-monitoring.md)
- **Testing Procedures**: See [Testing Guide](testing.md)
- **CLI Reference**: See [CLI Reference](cli-reference.md)
### Production Deployment
For production deployment:
1. Review [Deployment Guide](deployment.md)
2. Set up production monitoring
3. Configure security and authentication
4. Set up backup procedures
## Help and Support
### Get Help
```bash
make help # Show all available commands
./cli/zig-out/bin/ml --help # CLI help
```
### Documentation
- **[Testing Guide](testing.md)** - Comprehensive testing procedures
- **[Deployment Guide](deployment.md)** - Production deployment
- **[Performance Monitoring](performance-monitoring.md)** - Monitoring setup
- **[Architecture Guide](architecture.md)** - System architecture
- **[Troubleshooting](troubleshooting.md)** - Common issues
### Community
- Check logs: `docker logs ml-experiments-api`
- Review documentation in `docs/src/`
- Use `--debug` flag with CLI commands for detailed output
---
*Ready in minutes!*