docs(ops): consolidate deployment and performance monitoring docs for Caddy-based setup

Parent: c0eeeda940
Commit: 8157f73a70
5 changed files with 1234 additions and 915 deletions
## Overview

The ML Experiment Manager supports multiple deployment methods, from local development to production setups with integrated monitoring.
## TLS / WSS Policy

- The Zig CLI currently supports `ws://` only (native `wss://` is not implemented).
- For production, use a reverse proxy (Caddy) to terminate TLS/WSS (Approach A) and keep the API server on internal HTTP/WS.
- If you need remote CLI access, use one of:
  - an SSH tunnel to the internal `ws://` endpoint
  - a private network/VPN so `ws://` is not exposed to the public Internet
- When `server.tls.enabled: false`, the API server still runs on plain HTTP/WS internally. In development, access it via Caddy at `http://localhost:8080/health`.
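The reverse-proxy approach above can be sketched in a minimal Caddyfile (the hostname and internal port are placeholders, not this project's actual config):

```
ml.example.com {
    # Caddy obtains and renews TLS certificates automatically;
    # reverse_proxy also forwards WebSocket upgrades, so wss:// from
    # clients becomes ws:// to the internal API server.
    reverse_proxy localhost:9100
}
```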
## Data Directories

- `base_path` is where experiment directories live.
- `data_dir` is used for dataset/snapshot materialization and integrity validation.
- If you want `ml validate` to check snapshots/datasets, you must mount `data_dir` into the API server container.
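As a sketch, mounting both directories into the API server container could look like this in Compose (the volume name and host path are assumptions, not the project's actual compose file):

```yaml
services:
  api-server:
    volumes:
      - experiment-data:/data/ml-experiments   # base_path: experiment directories
      - ./data:/data/datasets                  # data_dir: required for `ml validate`
```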
## Quick Start

### Development Deployment with Monitoring
```bash
# Clone repository
git clone https://github.com/your-org/fetch_ml.git
cd fetch_ml

# Start development stack with monitoring
make dev-up

# Alternative: use the deployment Makefile directly
cd deployments && make dev-up

# Check status
make dev-status

# View logs
docker-compose logs -f api-server
```
**Access Services:**

- **API Server (via Caddy)**: http://localhost:8080
- **API Server (via Caddy + internal TLS)**: https://localhost:8443
- **Grafana**: http://localhost:3000 (admin/admin123)
- **Prometheus**: http://localhost:9090
- **Loki**: http://localhost:3100
## Deployment Options

### 1. Development Environment

**Purpose**: Local development with full monitoring stack
**Container Runtimes**:

- **Docker Compose**: For testing and development only
- **Podman**: For production experiment execution

**Prerequisites**:

- Go 1.25+
- Zig 0.15.2
- Redis 7+
- Docker & Docker Compose (optional)

**Services**: API Server, Redis, Prometheus, Grafana, Loki, Promtail
**Configuration**:

```bash
# Using Makefile (recommended)
make dev-up
make dev-down
make dev-status

# Or via the deployment Makefile
cd deployments
make dev-up
```

#### Manual Setup

```bash
# Start Redis
redis-server

# Build and run Go server
go build -o bin/api-server ./cmd/api-server
./bin/api-server -config configs/config-local.yaml

# Build Zig CLI
cd cli
zig build prod
./zig-out/bin/ml --help
```
**Features**:

- Auto-provisioned Grafana dashboards
- Real-time metrics and logs
- Hot reload for development
- Local data persistence
### 2. Production Environment

**Purpose**: Production deployment with security hardening

**Services**: API Server, Worker, Redis with authentication

**Configuration**:
```bash
# Build the image
docker build -t ml-experiment-manager:latest .

# Manage the production stack
cd deployments
make prod-up
make prod-down
make prod-status
```
**Features**:

- Secure Redis with authentication
- TLS/WSS via reverse proxy termination (Caddy)
- Production-optimized configurations
- Health checks and restart policies
### 3. Homelab Secure Environment

**Purpose**: Secure homelab deployment

**Services**: API Server, Redis, Caddy reverse proxy

**Configuration**:
```bash
cd deployments
make homelab-up
make homelab-down
make homelab-status
```

Manual alternative with `docker run`:

```bash
docker run -d \
  --name ml-api \
  -p 9100:9100 \
  -p 9101:9101 \
  -v $(pwd)/configs:/app/configs:ro \
  -v experiment-data:/data/ml-experiments \
  ml-experiment-manager:latest
```
**Features**:

- Caddy reverse proxy
- TLS termination
- Network isolation
- External networks

#### Docker Compose (manual alternative)

```bash
# Development mode (uses root docker-compose.yml; testing only)
docker-compose up -d

# Production deployment
docker-compose -f deployments/docker-compose.prod.yml up -d

# Secure homelab deployment
docker-compose -f deployments/docker-compose.homelab-secure.yml up -d

# With custom configuration (--env-file must precede the subcommand)
docker-compose -f deployments/docker-compose.prod.yml --env-file .env.prod up -d
```

## Environment Setup
### Development Environment

```bash
# Use the simple setup script
./setup.sh

# Or manually: copy the example environment
cp deployments/env.dev.example .env

# Edit as needed
vim .env
```
**Key Variables**:

- `LOG_LEVEL=info`
- `GRAFANA_ADMIN_PASSWORD=admin123`

### Production Environment
```bash
# Copy example environment
cp deployments/env.prod.example .env

# Edit with production values
vim .env
```

**Key Variables**:

- `REDIS_PASSWORD=your-secure-password`
- `JWT_SECRET=your-jwt-secret`
- `SSL_CERT_PATH=/path/to/cert`
## Monitoring Setup

### Automatic Configuration

Monitoring dashboards and datasources are auto-provisioned:

```bash
# Setup monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py

# Start services (includes monitoring)
make dev-up
```
## Configuration

### Server Configuration

```yaml
# configs/config-local.yaml
base_path: "/data/ml-experiments"

auth:
  enabled: true
  api_keys:
    - "your-production-api-key"

server:
  address: ":9100"
  tls:
    enabled: true
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"
```

### Available Dashboards

1. **Load Test Performance**: Request rates, response times, error rates
2. **System Health**: Service status, memory, CPU usage
3. **Log Analysis**: Error logs, service logs, log aggregation
### Manual Configuration

If auto-provisioning fails:

1. **Access Grafana**: http://localhost:3000
2. **Add Data Sources**:
   - Prometheus: http://prometheus:9090
   - Loki: http://loki:3100
3. **Import Dashboards**: From `monitoring/grafana/dashboards/`
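If clicking through the UI is not an option, the same data sources can be provisioned as files. A minimal sketch (the file path is an assumption; the schema follows Grafana's provisioning format):

```yaml
# e.g. monitoring/grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```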
## Testing Procedures

### Pre-Deployment Testing

```bash
# Run unit tests
make test-unit

# Run integration tests
make test-integration

# Run full test suite
make test

# Run with coverage
make test-coverage
```
### Docker Compose Environment

```yaml
# docker-compose.yml
version: '3.8'
services:
  api-server:
    environment:
      - REDIS_URL=redis://redis:6379
      - LOG_LEVEL=info
    volumes:
      - ./configs:/configs:ro
      - ./data:/data/experiments
```

### Load Testing

```bash
# Run load tests
make load-test

# Run specific load scenarios
make benchmark-local

# Track performance over time
./scripts/track_performance.sh
```

## Monitoring & Logging
### Health Checks

- HTTP: `GET /health`
- WebSocket: Connection test
- Redis: Ping check

### Metrics

- Prometheus metrics at `/metrics`
- Custom application metrics
- Container resource usage

### Logging

- Structured JSON logging
- Log levels: DEBUG, INFO, WARN, ERROR
- Centralized logging via Loki and Promtail
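To illustrate the structured-JSON format, here is a minimal Python sketch (field names are illustrative; the API server's actual Go log schema may differ):

```python
import json
import logging

# Minimal structured-JSON formatter sketch. The "level"/"msg"/"logger"
# field names are assumptions for illustration only.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "msg": record.getMessage(),
            "logger": record.name,
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("api-server")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("experiment started")
```

Each log line is then a single JSON object, which is what Promtail ships to Loki for querying.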
## Security

### TLS Configuration

```bash
# Generate self-signed cert (development)
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes

# Production - use Let's Encrypt
certbot certonly --standalone -d ml-experiments.example.com
```

### Service Health Verification

```bash
# Check service health
curl -f http://localhost:8080/health

# Check monitoring services
curl -f http://localhost:3000/api/health
curl -f "http://localhost:9090/api/v1/query?query=up"
curl -f http://localhost:3100/ready
```
### Network Security

- Firewall rules (ports 9100, 9101, 6379)
- VPN access for internal services
- API key authentication
- Rate limiting
## Performance Tuning

### Resource Allocation

FetchML now centralizes pacing and container limits under a `resources` section in every server/worker config. Example for a homelab box:

```yaml
resources:
  max_workers: 1
  desired_rps_per_worker: 2   # conservative pacing per worker
  podman_cpus: "2"            # Podman --cpus, keeps host responsive
  podman_memory: "8g"         # Podman --memory, isolates experiment installs
```

For high-end machines (e.g., M2 Ultra, 18 performance cores / 64 GB RAM), start with:

```yaml
resources:
  max_workers: 2              # two concurrent experiments
  desired_rps_per_worker: 5   # faster job submission
  podman_cpus: "8"
  podman_memory: "32g"
```

Adjust upward only if experiments stay GPU-bound; keeping Podman limits in place ensures users can install packages inside the container without jeopardizing the host.
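When sizing a box, the aggregate limits implied by a `resources` section are just `max_workers` times the per-worker limits. A small sketch of that arithmetic (the helper and its input shape are illustrative, not part of FetchML):

```python
# Sanity-check the aggregate Podman limits implied by a `resources`
# section. Field names mirror the config examples above.
def aggregate_limits(resources):
    workers = resources["max_workers"]
    return {
        "total_cpus": workers * int(resources["podman_cpus"]),
        "total_memory_gb": workers * int(resources["podman_memory"].rstrip("g")),
        "total_rps": workers * resources["desired_rps_per_worker"],
    }

homelab = {"max_workers": 1, "desired_rps_per_worker": 2,
           "podman_cpus": "2", "podman_memory": "8g"}
high_end = {"max_workers": 2, "desired_rps_per_worker": 5,
            "podman_cpus": "8", "podman_memory": "32g"}

print(aggregate_limits(homelab))   # 2 CPUs, 8 GB total
print(aggregate_limits(high_end))  # 16 CPUs, 64 GB total
```

The high-end profile above commits 16 CPUs and 64 GB in the worst case, which is why it targets an 18-core / 64 GB machine.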
### Scaling Strategies

- Horizontal pod autoscaling
- Redis clustering
- Load balancing
- CDN for static assets
## Backup & Recovery

### Data Backup

```bash
# Backup Redis data
docker-compose exec redis redis-cli BGSAVE
docker cp $(docker-compose ps -q redis):/data/dump.rdb ./redis-backup.rdb

# Backup data volume
docker run --rm -v ml-experiments_redis_data:/data -v $(pwd):/backup alpine tar czf /backup/redis-backup.tar.gz -C /data .
```
### Disaster Recovery

1. Restore Redis data
2. Restart services
3. Verify experiment metadata
4. Test API endpoints
## Troubleshooting

### Common Issues

#### API Server Not Starting

```bash
# Check logs
docker-compose logs api-server

# Check configuration
cat configs/config-local.yaml

# Check Redis connection
docker-compose exec redis redis-cli ping
```

#### Port Conflicts

```bash
# Check port usage
lsof -i :9101
lsof -i :3000
lsof -i :9090

# Kill conflicting processes
kill -9 <PID>
```
#### WebSocket Connection Issues

```bash
# Test WebSocket
wscat -c ws://localhost:9100/ws

# Check TLS
openssl s_client -connect localhost:9101 -servername localhost
```

#### Container Issues

```bash
# View container logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana

# Restart services
make dev-restart

# Clean restart
make dev-down && make dev-up
```
#### Performance Issues

```bash
# Check resource usage
docker-compose exec api-server ps aux

# Check Redis memory
docker-compose exec redis redis-cli info memory
```

#### Monitoring Issues

```bash
# Re-setup monitoring configuration
python3 scripts/setup_monitoring.py

# Restart Grafana only
docker restart ml-experiments-grafana
```
### Debug Mode

```bash
# Enable debug logging
export LOG_LEVEL=debug
./bin/api-server -config configs/config-local.yaml
```

### Performance Issues

**High Memory Usage**:
- Check Grafana dashboards for memory metrics
- Adjust Prometheus retention in `prometheus.yml`
- Monitor log retention in `loki-config.yml`

**Slow Response Times**:
- Check network connectivity between containers
- Verify Redis performance
- Review API server logs for bottlenecks

## Maintenance

### Regular Tasks

**Weekly**:
- Check Grafana dashboards for anomalies
- Review log files for errors
- Verify backup procedures

**Monthly**:
- Update Docker images
- Clean up old Docker volumes
- Review and rotate secrets

### Backup Procedures

**Data Backup**:

```bash
# Backup application data
docker run --rm -v ml_data:/data -v $(pwd):/backup alpine tar czf /backup/data-backup.tar.gz -C /data .

# Backup monitoring data
docker run --rm -v prometheus_data:/data -v $(pwd):/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data .
```
**Configuration Backup**:

```bash
# Backup configurations
tar czf config-backup.tar.gz monitoring/ deployments/ configs/
```

## CI/CD Integration
### GitHub Actions

- Automated testing on PR
- Multi-platform builds
- Security scanning
- Automatic releases
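As a sketch, a workflow with these capabilities might start from a skeleton like this (job names and steps are illustrative, not the repository's actual workflow):

```yaml
name: ci
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test        # unit + integration suites
  security-scan:
    needs: build-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "run scanner here"   # placeholder scan step
```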
### Deployment Pipeline

1. Code commit → GitHub
2. CI/CD pipeline triggers
3. Build and test
4. Security scan
5. Deploy to staging
6. Run integration tests
7. Deploy to production
8. Post-deployment verification

## Security Considerations
### Development Environment

- Change default Grafana password
- Use environment variables for secrets
- Monitor container logs for security events

### Production Environment

- Enable Redis authentication
- Use SSL/TLS certificates
- Implement network segmentation
- Regular security updates
- Monitor access logs
## Performance Optimization

### Resource Limits

**Development**:

```yaml
# docker-compose.dev.yml
services:
  api-server:
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '0.5'
```

**Production**:

```yaml
# docker-compose.prod.yml
services:
  api-server:
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '2.0'
```
### Monitoring Optimization

**Prometheus**:
- Adjust scrape intervals
- Configure retention periods
- Use recording rules for frequent queries

**Loki**:
- Configure log retention
- Use log sampling for high-volume sources
- Optimize label cardinality
## Non-Docker Production (systemd)

This project can be run in production without Docker. The recommended model is:

- Run `api-server` and `worker` as systemd services.
- Terminate TLS/WSS at Caddy and keep the API server on internal plain HTTP/WS.

The unit templates below are copy-paste friendly, but you must adjust paths, users, and config locations to your environment.

### `fetchml-api.service`
```ini
[Unit]
Description=FetchML API Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml

Environment=LOG_LEVEL=info

ExecStart=/usr/local/bin/api-server -config /etc/fetchml/api.yaml
Restart=on-failure
RestartSec=2

NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/fetchml /var/log/fetchml

[Install]
WantedBy=multi-user.target
```
### `fetchml-worker.service`

```ini
[Unit]
Description=FetchML Worker
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml

Environment=LOG_LEVEL=info

ExecStart=/usr/local/bin/worker -config /etc/fetchml/worker.yaml
Restart=on-failure
RestartSec=2

NoNewPrivileges=true
PrivateTmp=true

[Install]
WantedBy=multi-user.target
```
### Optional: `caddy.service`

Most distros ship a Caddy systemd unit. If you do not have one available, you can use this template.

```ini
[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
NoNewPrivileges=true
Restart=on-failure

[Install]
WantedBy=multi-user.target
```
## Migration Guide

### From Development to Production

1. **Export Data**:
   ```bash
   docker exec ml-data redis-cli BGSAVE
   docker cp ml-data:/data/dump.rdb ./redis-backup.rdb
   ```

2. **Update Configuration**:
   ```bash
   cp deployments/env.prod.example .env
   # Edit with production values
   ```

3. **Deploy Production**:
   ```bash
   cd deployments
   make prod-up
   ```

4. **Import Data**:
   ```bash
   docker cp ./redis-backup.rdb ml-prod-redis:/data/dump.rdb
   docker restart ml-prod-redis
   ```
## Support

For deployment issues:

1. Check the troubleshooting section in this guide
2. Review container logs
3. Verify network connectivity
4. Check resource usage in Grafana
5. Check GitHub Issues or contact the maintainers
# Performance Monitoring

Comprehensive performance monitoring system for Fetch ML with CI/CD integration, profiling, and production deployment.

## Overview

The performance monitoring system provides:

- **Automatic benchmark execution** on every CI/CD run
- **Real-time metrics collection** via Prometheus Pushgateway
- **Historical trend visualization** in Grafana dashboards
- **Performance regression detection**
- **Cross-commit comparisons**

## Quick Start

### 5-Minute Setup

```bash
# Start monitoring stack
make dev-up

# Run benchmarks
make benchmark

# View results in Grafana
open http://localhost:3000
```
### Basic Profiling

```bash
# CPU profiling
make profile-load-norate

# View interactive profile
go tool pprof -http=:8080 cpu_load.out
```
## Architecture

- **Development**: Docker Compose with integrated monitoring
- **Production**: Podman + systemd (Linux)
- **CI/CD**: GitHub Actions → Prometheus Pushgateway → Grafana

```
GitHub Actions → Benchmark Tests → Prometheus Pushgateway → Prometheus → Grafana Dashboard
```
## Components

### 1. Development Monitoring (Docker Compose)

**Services**:
- **Grafana**: http://localhost:3000 (admin/admin123)
- **Prometheus**: http://localhost:9090
- **Loki**: http://localhost:3100
- **Promtail**: Log aggregation

**Configuration**:

```bash
# Start dev stack with monitoring
make dev-up

# Verify services
curl -f http://localhost:3000/api/health
curl -f "http://localhost:9090/api/v1/query?query=up"
curl -f http://localhost:3100/ready
```
### 2. Production Monitoring (Podman + systemd)

**Architecture**:
- Each service runs as a separate Podman container
- Managed by systemd for automatic restarts
- Proper lifecycle management

**Setup**:

```bash
# Run production setup script
sudo ./scripts/setup-monitoring-prod.sh /data/monitoring ml-user ml-group

# Start services
sudo systemctl start prometheus loki promtail grafana
sudo systemctl enable prometheus loki promtail grafana
```

**Access**:
- URL: `http://YOUR_SERVER_IP:3000`
- Username: `admin`
- Password: `admin` (change on first login)
### 3. CI/CD Integration

**GitHub Actions Workflow**:
- **File**: `.github/workflows/benchmark-metrics.yml`
- **Triggers**: Push to main/develop, PRs, daily schedule, manual
- **Function**: Runs benchmarks and pushes metrics to Prometheus

**Prometheus Pushgateway**:
- **Port**: 9091
- **Purpose**: Receives benchmark metrics from CI/CD runs
- **URL**: `http://localhost:9091`

**Prometheus Server**:
- **Configuration**: `monitoring/prometheus.yml`
- **Scrapes**: Pushgateway for benchmark metrics
- **Retention**: Configurable retention period

**Grafana Dashboard**:
- **Location**: `monitoring/dashboards/performance-dashboard.json`
- **Visualizations**: Performance trends, regressions, comparisons
- **Access**: http://localhost:3001
## Setup

### 1. Start Monitoring Stack

```bash
make monitoring-performance
```

This starts:

- Grafana: http://localhost:3001 (admin/admin)
- Loki: http://localhost:3100
- Pushgateway: http://localhost:9091

### 2. Configure GitHub Secrets

Add this secret to your GitHub repository:

```
PROMETHEUS_PUSHGATEWAY_URL=http://your-pushgateway:9091
```
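For reference, the push step amounts to formatting metrics in Prometheus exposition format and POSTing them to `PROMETHEUS_PUSHGATEWAY_URL`. A minimal Python sketch of the formatting (the results dict shape and helper names are assumptions, not the workflow's actual code):

```python
# Build a Pushgateway payload in Prometheus exposition format.
def exposition_payload(results):
    lines = []
    for bench, metrics in sorted(results.items()):
        for metric, value in sorted(metrics.items()):
            lines.append(f'{metric}{{benchmark="{bench}"}} {value}')
    return "\n".join(lines) + "\n"

# Pushgateway groups metrics by URL path labels: /metrics/job/<job>/instance/<instance>.
def push_url(base, job, instance):
    return f"{base}/metrics/job/{job}/instance/{instance}"

payload = exposition_payload(
    {"BenchmarkAPIServerCreateJobSimple": {"benchmark_time_per_op": 42653}})
print(push_url("http://localhost:9091", "benchmark", "run-123"))
print(payload)
```

The payload string would then be sent with an HTTP POST (or PUT) to that URL.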
### 3. Verify Integration

1. Push code to trigger the workflow
2. Check Pushgateway: http://localhost:9091
3. View metrics in Grafana dashboard
## Available Metrics

### Benchmark Metrics

- `benchmark_time_per_op` - Time per operation in nanoseconds
- `benchmark_memory_per_op` - Memory per operation in bytes
- `benchmark_allocs_per_op` - Allocations per operation

Labels:

- `benchmark` - Benchmark name (sanitized)
- `job` - Always "benchmark"
- `instance` - GitHub Actions run ID

### Example Metrics Output

```
benchmark_time_per_op{benchmark="BenchmarkAPIServerCreateJobSimple"} 42653
benchmark_memory_per_op{benchmark="BenchmarkAPIServerCreateJobSimple"} 13518
benchmark_allocs_per_op{benchmark="BenchmarkAPIServerCreateJobSimple"} 98
```
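Lines in this format are easy to post-process, e.g. for ad-hoc regression checks. A small Python sketch (an illustration, not part of the toolchain) that parses them back into numbers:

```python
import re

# Matches lines like: benchmark_time_per_op{benchmark="Foo"} 42653
LINE = re.compile(r'^(\w+)\{benchmark="([^"]+)"\}\s+([0-9.]+)$')

def parse_benchmark_metrics(text):
    out = {}
    for line in text.strip().splitlines():
        m = LINE.match(line.strip())
        if m:
            metric, bench, value = m.groups()
            out.setdefault(bench, {})[metric] = float(value)
    return out

sample = '''
benchmark_time_per_op{benchmark="BenchmarkAPIServerCreateJobSimple"} 42653
benchmark_memory_per_op{benchmark="BenchmarkAPIServerCreateJobSimple"} 13518
benchmark_allocs_per_op{benchmark="BenchmarkAPIServerCreateJobSimple"} 98
'''
print(parse_benchmark_metrics(sample))
```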
## Performance Testing

### Benchmarks
```bash
# Run benchmarks locally
make benchmark

# Run with detailed output
go test -bench=. -benchmem ./tests/benchmarks/...

# Run specific benchmark
go test -bench=BenchmarkName -benchmem ./tests/benchmarks/...

# Run with race detection
go test -race -bench=. ./tests/benchmarks/...
```
### Automated Monitoring

The system automatically runs benchmarks on:

- **Every push** to main/develop branches
- **Pull requests** to main branch
- **Daily schedule** at 6:00 AM UTC
- **Manual trigger** via GitHub Actions UI

### Load Testing

```bash
# Start Redis for load tests
docker run -d -p 6379:6379 redis:alpine

# Run load test suite
make load-test
```
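The CI triggers listed under Automated Monitoring map onto a workflow `on:` block roughly like this (a sketch; the actual `.github/workflows/benchmark-metrics.yml` may differ):

```yaml
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 6 * * *'   # daily at 6:00 UTC
  workflow_dispatch: {}    # manual trigger from the Actions UI
```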
### Viewing Results

1. **Grafana Dashboard**: http://localhost:3001
2. **Pushgateway**: http://localhost:9091/metrics
3. **Prometheus**: http://localhost:9090/targets

### CPU Profiling

#### HTTP Load Test Profiling
```bash
# CPU profile MediumLoad HTTP test (no rate limiting - recommended)
make profile-load-norate

# CPU profile MediumLoad HTTP test (with rate limiting)
make profile-load
```

**Analyze Results**:

```bash
# View interactive profile (web UI)
go tool pprof -http=:8081 cpu_load.out

# View interactive profile (terminal)
go tool pprof cpu_load.out

# Generate flame graph (requires Brendan Gregg's FlameGraph scripts)
go tool pprof -raw cpu_load.out | stackcollapse-go.pl | flamegraph.pl > cpu_flame.svg

# View top functions
go tool pprof -top cpu_load.out
```
#### WebSocket Queue Profiling

```bash
# CPU profile WebSocket → Redis queue → worker path
make profile-ws-queue

# View interactive profile
go tool pprof -http=:8082 cpu_ws.out
```

### Profiling Tips

- Use `profile-load-norate` for cleaner CPU profiles (no rate limiting delays)
- Profiles run for 60 seconds by default
- Requires Redis running on localhost:6379
- Results show throughput, latency, and error rate metrics
## Grafana Dashboards

### Development Dashboards

**Access**: http://localhost:3000 (admin/admin123)

**Available Dashboards**:
1. **Load Test Performance**: Request metrics, response times, error rates
2. **System Health**: Service status, resource usage, memory, CPU
3. **Log Analysis**: Error logs, service logs, log aggregation

### Production Dashboards

**Auto-loaded Dashboards**:
- **ML Task Queue Monitoring** (metrics)
- **Application Logs** (Loki logs)

### Key Metrics

- `benchmark_time_per_op` - Execution time
- `benchmark_memory_per_op` - Memory usage
- `benchmark_allocs_per_op` - Allocation count
- HTTP request rates and response times
- Error rates and system health metrics
### Worker Resource Metrics

The worker exposes a Prometheus endpoint (default `:9100/metrics`) which includes ResourceManager and task execution metrics.

**Resource availability**:

- `fetchml_resources_cpu_total` - Total CPU tokens managed by the worker.
- `fetchml_resources_cpu_free` - Currently free CPU tokens.
- `fetchml_resources_gpu_slots_total{gpu_index="N"}` - Total GPU slots per GPU index.
- `fetchml_resources_gpu_slots_free{gpu_index="N"}` - Free GPU slots per GPU index.

**Acquisition pressure**:

- `fetchml_resources_acquire_total` - Total resource acquisition attempts.
- `fetchml_resources_acquire_wait_total` - Number of acquisitions that had to wait.
- `fetchml_resources_acquire_timeout_total` - Number of acquisitions that timed out.
- `fetchml_resources_acquire_wait_seconds_total` - Total time spent waiting for resources.

**Why these help**:

- Debug why runs slow down under load (wait time increases).
- Confirm GPU slot sharing is working (free slots fluctuate as expected).
- Detect saturation and timeouts before tasks start failing.
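The acquisition counters combine into simple saturation signals. A Python sketch of the arithmetic (the counter names match this doc; the snapshot dict is a hypothetical scrape, not a real API):

```python
# Derive pressure signals from a snapshot of the worker's resource counters.
def resource_pressure(counters):
    acquires = counters["fetchml_resources_acquire_total"]
    waits = counters["fetchml_resources_acquire_wait_total"]
    wait_seconds = counters["fetchml_resources_acquire_wait_seconds_total"]
    return {
        # Fraction of acquisitions that had to wait: rising means saturation.
        "wait_ratio": waits / acquires if acquires else 0.0,
        # Average wait per waiting acquisition.
        "avg_wait_seconds": wait_seconds / waits if waits else 0.0,
    }

snapshot = {
    "fetchml_resources_acquire_total": 200,
    "fetchml_resources_acquire_wait_total": 50,
    "fetchml_resources_acquire_wait_seconds_total": 25.0,
}
print(resource_pressure(snapshot))
```

In Prometheus itself the same signals are usually expressed with `rate()` over these counters rather than raw totals.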
### Prometheus Scrape Example (Worker)

If you run the worker locally on your machine (default metrics port `:9100`) and Prometheus runs in Docker Compose, use `host.docker.internal`:

```yaml
scrape_configs:
  - job_name: 'worker'
    static_configs:
      - targets: ['host.docker.internal:9100']
      # If the worker also runs in Compose, target it by service name instead:
      # - targets: ['worker:9100']
    metrics_path: /metrics
    scrape_interval: 15s
```
### Grafana Dashboard

Customize the dashboard in `monitoring/dashboards/performance-dashboard.json`:

- Add new panels
- Modify queries
- Adjust visualization types
- Set up alerts

## Production Deployment

### Prerequisites

- Linux distribution with systemd (Rocky/RHEL/CentOS, Ubuntu/Debian, Arch, SUSE)
- Production app already deployed
- Root or sudo access
- Ports 3000, 9090, 3100 available
### Service Configuration

**Prometheus**:
- **Port**: 9090
- **Config**: `/etc/fetch_ml/monitoring/prometheus/prometheus.yml`
- **Data**: `/data/monitoring/prometheus`
- **Purpose**: Scrapes metrics from API server

**Loki**:
- **Port**: 3100
- **Config**: `/etc/fetch_ml/monitoring/loki-config.yml`
- **Data**: `/data/monitoring/loki`
- **Purpose**: Log aggregation

**Promtail**:
- **Config**: `/etc/fetch_ml/monitoring/promtail-config.yml`
- **Log Source**: `/var/log/fetch_ml/*.log`
- **Purpose**: Ships logs to Loki

**Grafana**:
- **Port**: 3000
- **Config**: `/etc/fetch_ml/monitoring/grafana/provisioning`
- **Data**: `/data/monitoring/grafana`
- **Dashboards**: `/var/lib/grafana/dashboards`
### Management Commands

```bash
# Check status
sudo systemctl status prometheus grafana loki promtail

# View logs
sudo journalctl -u prometheus -f
sudo journalctl -u grafana -f
sudo journalctl -u loki -f
sudo journalctl -u promtail -f

# Restart services
sudo systemctl restart prometheus
sudo systemctl restart grafana

# Stop all monitoring
sudo systemctl stop prometheus grafana loki promtail
```

## Data Retention

### Prometheus

Default: 15 days. Edit `/etc/fetch_ml/monitoring/prometheus/prometheus.yml`:

```yaml
storage:
  tsdb:
    retention.time: 30d
```

### Loki

Default: 30 days. Edit `/etc/fetch_ml/monitoring/loki-config.yml`:

```yaml
limits_config:
  retention_period: 30d
```
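When choosing a retention period, it helps to sanity-check disk usage. Prometheus TSDB size scales roughly with ingestion rate × retention × bytes per sample; the numbers below (1000 samples/s, ~2 bytes/sample after compression) are illustrative assumptions, not measurements from this deployment:

```python
def prometheus_disk_bytes(samples_per_sec: float, retention_days: int,
                          bytes_per_sample: float = 2.0) -> float:
    """Rough TSDB size estimate: ingestion rate x retention x bytes/sample."""
    return samples_per_sec * retention_days * 86400 * bytes_per_sample

# e.g. 1000 samples/s for 30 days at ~2 bytes/sample is about 5.2 GB
print(round(prometheus_disk_bytes(1000, 30) / 1e9, 1))
```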
## Security

### Firewall

**RHEL/Rocky/Fedora (firewalld)**:

```bash
# Remove public access
sudo firewall-cmd --permanent --remove-port=3000/tcp
sudo firewall-cmd --permanent --remove-port=9090/tcp

# Add specific source
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port port="3000" protocol="tcp" accept'
sudo firewall-cmd --reload
```

**Ubuntu/Debian (ufw)**:

```bash
# Remove public access
sudo ufw delete allow 3000/tcp
sudo ufw delete allow 9090/tcp

# Add specific source
sudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp
```

### Authentication

Change the Grafana admin password:

1. Log in to Grafana
2. User menu → Profile → Change Password

### TLS (Optional)

For HTTPS, configure a reverse proxy (nginx/Apache) in front of Grafana.
## Performance Regression Detection

```bash
# Create baseline
make detect-regressions

# Analyze current performance
go test -bench=. -benchmem ./tests/benchmarks/... | tee current.json

# Track performance over time
./scripts/track_performance.sh
```
## Troubleshooting

### Common Issues

1. **Metrics not appearing in Grafana**
   - Check Pushgateway: http://localhost:9091
   - Verify Prometheus targets: http://localhost:9090/targets
   - Check GitHub Actions logs

2. **GitHub Actions workflow failing**
   - Verify the `PROMETHEUS_PUSHGATEWAY_URL` secret
   - Check workflow syntax
   - Review benchmark execution logs

3. **Pushgateway not receiving metrics**
   - Verify URL accessibility from CI/CD
   - Check network connectivity
   - Review the curl command in the workflow

### Development Issues

**No metrics in Grafana?**

```bash
# Check services
docker ps --filter "name=ml-"

# View Pushgateway metrics
curl http://localhost:9091/metrics

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Test manual metric push
echo "test_metric 123" | curl --data-binary @- http://localhost:9091/metrics/job/test

# Check monitoring services
curl http://localhost:3000/api/health
curl http://localhost:9090/api/v1/query?query=up
```
**Workflow failing?**

- Verify GitHub secret configuration
- Check workflow logs in GitHub Actions

**Profiling Issues**:

```bash
# Flag error handling
go test ./tests/load -run TestLoadProfile_Medium -count=1 -cpuprofile cpu_load.out -v -args -profile-norate

# Redis not available
docker run -d -p 6379:6379 redis:alpine

# Port conflicts
lsof -i :3000  # Grafana
lsof -i :8080  # pprof web UI
lsof -i :6379  # Redis
```

**Benchmark Naming**:

Use consistent naming conventions:

- `BenchmarkAPIServerCreateJob`
- `BenchmarkMLExperimentTraining`
- `BenchmarkDatasetOperations`
### Production Issues

**Grafana shows no data**:

```bash
# Check if Prometheus is reachable
curl http://localhost:9090/-/healthy

# Check datasource in Grafana
# Settings → Data Sources → Prometheus → Save & Test
```

**Loki not receiving logs**:

```bash
# Check Promtail is running
sudo systemctl status promtail

# Verify log file exists
ls -l /var/log/fetch_ml/

# Check Promtail can reach Loki
curl http://localhost:3100/ready
```

**Podman containers not starting**:

```bash
# Check pod status
sudo -u ml-user podman pod ps
sudo -u ml-user podman ps -a

# Remove and recreate
sudo -u ml-user podman pod stop monitoring
sudo -u ml-user podman pod rm monitoring
sudo systemctl restart prometheus
```
## Backup and Recovery

### Backup Procedures

```bash
# Development backup
docker run --rm -v prometheus_data:/data -v $(pwd):/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data .
docker run --rm -v grafana_data:/data -v $(pwd):/backup alpine tar czf /backup/grafana-backup.tar.gz -C /data .

# Production backup
sudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana
sudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus
```

### Configuration Backup

```bash
# Backup configurations
tar czf monitoring-config-backup.tar.gz monitoring/ deployments/
```
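Backups are only useful if they restore. A self-contained sketch of the same archive/extract round trip the commands above rely on, using throwaway paths under `/tmp` (all paths here are illustrative):

```shell
# Create sample data, archive it like the backups above, then restore and verify.
mkdir -p /tmp/fetchml_backup_demo/data /tmp/fetchml_backup_demo/restore
echo "grafana.db contents" > /tmp/fetchml_backup_demo/data/grafana.db
tar -czf /tmp/fetchml_backup_demo/backup.tar.gz -C /tmp/fetchml_backup_demo/data .
tar -xzf /tmp/fetchml_backup_demo/backup.tar.gz -C /tmp/fetchml_backup_demo/restore
cat /tmp/fetchml_backup_demo/restore/grafana.db
```

For real restores, stop the service first so Prometheus or Grafana does not write to the data directory mid-extract.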
## Updates and Maintenance

### Development Updates

```bash
# Update monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py

# Restart services
make dev-down && make dev-up
```

### Production Updates

```bash
# Pull latest images
sudo -u ml-user podman pull docker.io/grafana/grafana:latest
sudo -u ml-user podman pull docker.io/prom/prometheus:latest
sudo -u ml-user podman pull docker.io/grafana/loki:latest
sudo -u ml-user podman pull docker.io/grafana/promtail:latest

# Restart services to use new images
sudo systemctl restart grafana prometheus loki promtail
```

### Regular Maintenance

**Weekly**:
- Check Grafana dashboards for anomalies
- Review log files for errors
- Verify backup procedures

**Monthly**:
- Update Docker/Podman images
- Clean up old data volumes
- Review and rotate secrets
## Metrics Reference

### Worker Metrics

**Task Processing**:
- `fetchml_tasks_processed_total` - Total tasks processed successfully
- `fetchml_tasks_failed_total` - Total tasks failed
- `fetchml_tasks_active` - Currently active tasks
- `fetchml_tasks_queued` - Current queue depth

**Data Transfer**:
- `fetchml_data_transferred_bytes_total` - Total bytes transferred
- `fetchml_data_fetch_time_seconds_total` - Total time fetching datasets
- `fetchml_execution_time_seconds_total` - Total task execution time

**Prewarming**:
- `fetchml_prewarm_env_hit_total` - Environment prewarm hits (warm image existed)
- `fetchml_prewarm_env_miss_total` - Environment prewarm misses (warm image not found)
- `fetchml_prewarm_env_built_total` - Environment images built for prewarming
- `fetchml_prewarm_env_time_seconds_total` - Total time building prewarm images
- `fetchml_prewarm_snapshot_hit_total` - Snapshot prewarm hits (found in `.prewarm/`)
- `fetchml_prewarm_snapshot_miss_total` - Snapshot prewarm misses (not in `.prewarm/`)
- `fetchml_prewarm_snapshot_built_total` - Snapshots prewarmed into `.prewarm/`
- `fetchml_prewarm_snapshot_time_seconds_total` - Total time prewarming snapshots

**Resources**:
- `fetchml_resources_cpu_total` - Total CPU tokens
- `fetchml_resources_cpu_free` - Free CPU tokens
- `fetchml_resources_gpu_slots_total` - Total GPU slots per index
- `fetchml_resources_gpu_slots_free` - Free GPU slots per index

### API Server Metrics

**HTTP**:
- `fetchml_http_requests_total` - Total HTTP requests
- `fetchml_http_duration_seconds` - HTTP request duration

**WebSocket**:
- `fetchml_websocket_connections` - Active WebSocket connections
- `fetchml_websocket_messages_total` - Total WebSocket messages
- `fetchml_websocket_duration_seconds` - Message processing duration
- `fetchml_websocket_errors_total` - WebSocket errors

**Jupyter**:
- `fetchml_jupyter_services` - Jupyter services count
- `fetchml_jupyter_operations_total` - Jupyter operations
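If `fetchml_http_duration_seconds` is exported as a histogram with buckets (an assumption worth verifying against your server's `/metrics` output), p95 request latency over five minutes can be queried as:

```promql
histogram_quantile(0.95, sum by (le) (rate(fetchml_http_duration_seconds_bucket[5m])))
```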
## Worker Configuration: Prewarming

### Prewarm Flag

Enable Phase 1 prewarming in worker configuration:

```yaml
# worker-config.yaml
prewarm_enabled: true  # Default: false (opt-in)
```

**Behavior**:
- When `false`: No prewarming loops run
- When `true`: Worker stages next snapshot and fetches datasets when idle

**What gets prewarmed**:
1. **Snapshots**: Copied to `.prewarm/snapshots/<taskID>/`
2. **Datasets**: Fetched to `.prewarm/datasets/` (if `auto_fetch_data: true`)
3. **Environment images**: Warmed in envpool cache (if deps manifest exists)

**Execution path**:
- During task execution, `StageSnapshotFromPath` checks `.prewarm/snapshots/<taskID>/`
- If found: **Hit** - Renames prewarmed directory into job (fast)
- If not found: **Miss** - Copies from snapshot store (slower)

**Metrics impact**:
- Prewarm hits reduce task startup latency
- Metrics track hit/miss ratios and prewarm timing
- Use `fetchml_prewarm_snapshot_*` metrics to monitor effectiveness
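The hit/miss decision in the execution path above can be sketched as follows. This is an illustrative Python model of the behavior, not the actual Go implementation; function and path names are hypothetical:

```python
import os
import shutil

def stage_snapshot(task_id: str, prewarm_root: str, store_root: str, job_dir: str) -> str:
    """Prefer the prewarmed copy (cheap rename); fall back to copying from the snapshot store."""
    prewarmed = os.path.join(prewarm_root, "snapshots", task_id)
    target = os.path.join(job_dir, "snapshot")
    if os.path.isdir(prewarmed):
        os.rename(prewarmed, target)  # hit: rename the prewarmed directory into the job (fast)
        return "hit"
    shutil.copytree(os.path.join(store_root, task_id), target)  # miss: full copy (slower)
    return "miss"
```

The rename is what makes a hit cheap: on the same filesystem it is a metadata operation, independent of snapshot size, while a miss pays for a full copy.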
### Grafana Dashboards

**Prewarm Performance Dashboard**:
1. Import `monitoring/grafana/dashboards/prewarm-performance.txt` into Grafana
2. Shows hit rates, build times, and efficiency metrics
3. Use for monitoring prewarm effectiveness

**Worker Resources Dashboard**:
- Added prewarm panels to existing worker-resources dashboard
- Environment and snapshot hit rate percentages
- Prewarm hits vs misses graphs
- Build time and build count metrics

### Prometheus Queries

**Hit Rate Calculations**:

```promql
# Environment prewarm hit rate
100 * (fetchml_prewarm_env_hit_total / clamp_min(fetchml_prewarm_env_hit_total + fetchml_prewarm_env_miss_total, 1))

# Snapshot prewarm hit rate
100 * (fetchml_prewarm_snapshot_hit_total / clamp_min(fetchml_prewarm_snapshot_hit_total + fetchml_prewarm_snapshot_miss_total, 1))
```

**Rate-based Monitoring**:

```promql
# Prewarm activity rate
rate(fetchml_prewarm_env_hit_total[5m])
rate(fetchml_prewarm_snapshot_hit_total[5m])

# Build time rate
rate(fetchml_prewarm_env_time_seconds_total[5m])
rate(fetchml_prewarm_snapshot_time_seconds_total[5m])
```
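The hit-rate arithmetic used by these queries is easy to sanity-check outside Prometheus; a minimal sketch mirroring the `clamp_min(denominator, 1)` zero-division guard:

```python
def hit_rate(hits: int, misses: int) -> float:
    """Percentage of prewarm lookups that hit, guarding against a zero denominator."""
    return 100.0 * hits / max(hits + misses, 1)

print(hit_rate(30, 10))  # 75.0
```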
## Advanced Usage

### Custom Dashboards

1. **Access Grafana**: http://localhost:3000
2. **Create Dashboard**: + → Dashboard
3. **Add Panels**: Use Prometheus queries
4. **Save Dashboard**: Export JSON for sharing

### Alerting

Set up Grafana alerts for:

- Performance regressions (>10% degradation)
- Missing benchmark data
- High memory allocation rates
- High error rates (> 5%)
- Slow response times (> 1s)
- Service downtime
- Resource exhaustion
- Low prewarm hit rates (< 50%)
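As one concrete example, the low prewarm hit rate condition can be expressed as a Prometheus alerting rule; the group and alert names below are illustrative, only the metric names come from this project:

```yaml
groups:
  - name: fetchml-prewarm
    rules:
      - alert: LowPrewarmHitRate
        expr: |
          100 * (fetchml_prewarm_snapshot_hit_total
            / clamp_min(fetchml_prewarm_snapshot_hit_total + fetchml_prewarm_snapshot_miss_total, 1)) < 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Snapshot prewarm hit rate below 50%"
```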
### Custom Metrics

Add custom metrics to your Go code:

```go
import "github.com/prometheus/client_golang/prometheus"

var (
	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "http_request_duration_seconds",
			Help: "HTTP request duration in seconds",
		},
		[]string{"method", "endpoint"},
	)
)

func init() {
	// Register with the default registry so the metric is exposed via /metrics.
	prometheus.MustRegister(requestDuration)
}

// Record metrics
requestDuration.WithLabelValues("GET", "/api/v1/jobs").Observe(duration)
```

### Retention

Configure appropriate retention periods:

- Raw metrics: 30 days
- Aggregated data: 1 year
- Dashboard snapshots: Permanent

## Integration with Existing Workflows

The benchmark monitoring integrates seamlessly with:

- **CI/CD pipelines**: Automatic execution
- **Code reviews**: Performance impact visible
- **Release management**: Performance trends over time
- **Development**: Local testing with same metrics
## Future Enhancements

Potential improvements:

1. **Automated performance regression alerts**
2. **Performance budgets and gates**
3. **Comparative analysis across branches**
4. **Integration with load testing results**
5. **Performance impact scoring**

## Support

For issues:

1. Check this documentation
2. Review GitHub Actions logs
3. Verify monitoring stack status
4. Consult Grafana/Prometheus docs

## See Also

- **[Testing Guide](testing.md)** - Testing with monitoring
- **[Deployment Guide](deployment.md)** - Deployment procedures
- **[Architecture Guide](architecture.md)** - System architecture
- **[Troubleshooting](troubleshooting.md)** - Common issues

---

*Last updated: December 2024*
@ -1,245 +0,0 @@
# Performance Monitoring Quick Start

Get started with performance monitoring and profiling in 5 minutes.

## Quick Start Options

### Option 1: Basic Benchmarking

```bash
# Run benchmarks
make benchmark

# View results in Grafana
open http://localhost:3001
```

### Option 2: CPU Profiling

```bash
# Generate CPU profile
make profile-load-norate

# View interactive profile
go tool pprof -http=:8080 cpu_load.out
```

### Option 3: Full Monitoring Stack

```bash
# Start monitoring services
make monitoring-performance

# Run benchmarks with metrics collection
make benchmark

# View in Grafana dashboard
open http://localhost:3001
```

## Prerequisites

- Docker and Docker Compose
- Go 1.21 or later
- Redis (for load tests)
- GitHub repository (for CI/CD integration)

## 1. Setup & Installation

### Start Monitoring Stack (Optional)

For full metrics visualization:

```bash
make monitoring-performance
```

This starts:
- **Grafana**: http://localhost:3001 (admin/admin)
- **Pushgateway**: http://localhost:9091
- **Loki**: http://localhost:3100

### Start Redis (Required for Load Tests)

```bash
docker run -d -p 6379:6379 redis:alpine
```

## 2. Performance Testing

### Benchmarks

```bash
# Run benchmarks locally
make benchmark

# Or run with detailed output
go test -bench=. -benchmem ./tests/benchmarks/...
```

### Load Testing

```bash
# Run load test suite
make load-test
```

## 3. CPU Profiling

### HTTP Load Test Profiling

```bash
# CPU profile MediumLoad HTTP test (with rate limiting)
make profile-load

# CPU profile MediumLoad HTTP test (no rate limiting - recommended)
make profile-load-norate
```

**Analyze Results:**

```bash
# View interactive profile (web UI)
go tool pprof -http=:8081 cpu_load.out

# View interactive profile (terminal)
go tool pprof cpu_load.out

# Generate flame graph
go tool pprof -raw cpu_load.out | go-flamegraph.pl > cpu_flame.svg

# View top functions
go tool pprof -top cpu_load.out
```

Web UI: http://localhost:8080

### WebSocket Queue Profiling

```bash
# CPU profile WebSocket → Redis queue → worker path
make profile-ws-queue
```

**Analyze Results:**

```bash
# View interactive profile (web UI)
go tool pprof -http=:8082 cpu_ws.out

# View interactive profile (terminal)
go tool pprof cpu_ws.out
```

### Profiling Tips

- Use `profile-load-norate` for cleaner CPU profiles (no rate limiting delays)
- Profiles run for 60 seconds by default
- Requires Redis running on localhost:6379
- Results show throughput, latency, and error rate metrics

## 4. Results & Visualization

### Grafana Dashboard

Open: http://localhost:3001 (admin/admin)

Navigate to the **Performance Dashboard** to see:
- Real-time benchmark results
- Historical trends
- Performance comparisons

### Key Metrics

- `benchmark_time_per_op` - Execution time
- `benchmark_memory_per_op` - Memory usage
- `benchmark_allocs_per_op` - Allocation count

## 5. CI/CD Integration

### Setup GitHub Integration

Add GitHub secret:

```
PROMETHEUS_PUSHGATEWAY_URL=http://your-pushgateway:9091
```

Now benchmarks run automatically on:
- Every push to main/develop
- Pull requests
- Daily schedule

### Verify Integration

1. Push code to trigger workflow
2. Check Pushgateway: http://localhost:9091/metrics
3. View metrics in Grafana

## 6. Troubleshooting

### Monitoring Stack Issues

**No metrics in Grafana?**

```bash
# Check services
docker ps --filter "name=monitoring"

# Check Pushgateway
curl http://localhost:9091/metrics
```

**Workflow failing?**

- Verify GitHub secret configuration
- Check workflow logs in GitHub Actions

### Profiling Issues

**Flag error like "flag provided but not defined: -test.paniconexit0"**

```bash
# This should be fixed now, but if it persists:
go test ./tests/load -run TestLoadProfile_Medium -count=1 -cpuprofile cpu_load.out -v -args -profile-norate
```

**Redis not available?**

```bash
# Start Redis for profiling tests
docker run -d -p 6379:6379 redis:alpine

# Check profile file generated
ls -la cpu_load.out
```

**Port conflicts?**

```bash
# Check if ports are in use
lsof -i :3001  # Grafana
lsof -i :8080  # pprof web UI
lsof -i :6379  # Redis
```

## 7. Advanced Usage

### Performance Regression Detection

```bash
# Create baseline
make detect-regressions

# Analyze current performance
go test -bench=. -benchmem ./tests/benchmarks/... | tee current.json
```

### Custom Benchmarks

```bash
# Run specific benchmark
go test -bench=BenchmarkName -benchmem ./tests/benchmarks/...

# Run with race detection
go test -race -bench=. ./tests/benchmarks/...
```

## 8. Further Reading

- [Full Documentation](performance-monitoring.md)
- [Dashboard Customization](performance-monitoring.md#grafana-dashboard)
- [Alert Configuration](performance-monitoring.md#alerting)
- [Architecture Guide](architecture.md)
- [Testing Guide](testing.md)

---

*Ready in 5 minutes!*
@ -1,217 +0,0 @@
# Production Monitoring Deployment Guide (Linux)

This guide covers deploying the monitoring stack (Prometheus, Grafana, Loki, Promtail) on Linux production servers.

## Architecture

**Testing**: Docker Compose (macOS/Linux)
**Production**: Podman + systemd (Linux)

**Important**: Docker is for testing only. Podman is used for running actual ML experiments in production.

Each service runs as a separate Podman container managed by systemd for automatic restarts and proper lifecycle management.

## Prerequisites

**Container Runtimes:**

- **Docker Compose**: For testing and development only
- **Podman**: For production experiment execution

**System:**

- Linux distribution with systemd (Rocky/RHEL/CentOS, Ubuntu/Debian, Arch, SUSE, etc.)
- Production app already deployed (see `scripts/setup-prod.sh`)
- Root or sudo access
- Ports 3000, 9090, 3100 available

## Quick Setup

### 1. Run Setup Script

```bash
cd /path/to/fetch_ml
sudo ./scripts/setup-monitoring-prod.sh /data/monitoring ml-user ml-group
```

This will:
- Create directory structure at `/data/monitoring`
- Copy configuration files to `/etc/fetch_ml/monitoring`
- Create systemd services for each component
- Set up firewall rules

### 2. Start Services

```bash
# Start all monitoring services
sudo systemctl start prometheus
sudo systemctl start loki
sudo systemctl start promtail
sudo systemctl start grafana

# Enable on boot
sudo systemctl enable prometheus loki promtail grafana
```

### 3. Access Grafana

- URL: `http://YOUR_SERVER_IP:3000`
- Username: `admin`
- Password: `admin` (change on first login)

Dashboards will auto-load:
- **ML Task Queue Monitoring** (metrics)
- **Application Logs** (Loki logs)

## Service Details

### Prometheus
- **Port**: 9090
- **Config**: `/etc/fetch_ml/monitoring/prometheus.yml`
- **Data**: `/data/monitoring/prometheus`
- **Purpose**: Scrapes metrics from API server

### Loki
- **Port**: 3100
- **Config**: `/etc/fetch_ml/monitoring/loki-config.yml`
- **Data**: `/data/monitoring/loki`
- **Purpose**: Log aggregation

### Promtail
- **Config**: `/etc/fetch_ml/monitoring/promtail-config.yml`
- **Log Source**: `/var/log/fetch_ml/*.log`
- **Purpose**: Ships logs to Loki

### Grafana
- **Port**: 3000
- **Config**: `/etc/fetch_ml/monitoring/grafana/provisioning`
- **Data**: `/data/monitoring/grafana`
- **Dashboards**: `/var/lib/grafana/dashboards`

## Management Commands

```bash
# Check status
sudo systemctl status prometheus grafana loki promtail

# View logs
sudo journalctl -u prometheus -f
sudo journalctl -u grafana -f
sudo journalctl -u loki -f
sudo journalctl -u promtail -f

# Restart services
sudo systemctl restart prometheus
sudo systemctl restart grafana

# Stop all monitoring
sudo systemctl stop prometheus grafana loki promtail
```

## Data Retention

### Prometheus

Default: 15 days. Edit `/etc/fetch_ml/monitoring/prometheus.yml`:

```yaml
storage:
  tsdb:
    retention.time: 30d
```

### Loki

Default: 30 days. Edit `/etc/fetch_ml/monitoring/loki-config.yml`:

```yaml
limits_config:
  retention_period: 30d
```

## Security

### Firewall

The setup script automatically configures firewall rules using the detected firewall manager (firewalld or ufw).

For manual firewall configuration:

**RHEL/Rocky/Fedora (firewalld)**:

```bash
# Remove public access
sudo firewall-cmd --permanent --remove-port=3000/tcp
sudo firewall-cmd --permanent --remove-port=9090/tcp

# Add specific source
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.0/24" port port="3000" protocol="tcp" accept'
sudo firewall-cmd --reload
```

**Ubuntu/Debian (ufw)**:

```bash
# Remove public access
sudo ufw delete allow 3000/tcp
sudo ufw delete allow 9090/tcp

# Add specific source
sudo ufw allow from 10.0.0.0/24 to any port 3000 proto tcp
```

### Authentication

Change Grafana admin password:
1. Login to Grafana
2. User menu → Profile → Change Password

### TLS (Optional)

For HTTPS, configure reverse proxy (nginx/Apache) in front of Grafana.

## Troubleshooting

### Grafana shows no data

```bash
# Check if Prometheus is reachable
curl http://localhost:9090/-/healthy

# Check datasource in Grafana
# Settings → Data Sources → Prometheus → Save & Test
```

### Loki not receiving logs

```bash
# Check Promtail is running
sudo systemctl status promtail

# Verify log file exists
ls -l /var/log/fetch_ml/

# Check Promtail can reach Loki
curl http://localhost:3100/ready
```

### Podman containers not starting

```bash
# Check pod status
sudo -u ml-user podman pod ps
sudo -u ml-user podman ps -a

# Remove and recreate
sudo -u ml-user podman pod stop monitoring
sudo -u ml-user podman pod rm monitoring
sudo systemctl restart prometheus
```

## Backup

```bash
# Backup Grafana dashboards and data
sudo tar -czf grafana-backup.tar.gz /data/monitoring/grafana

# Backup Prometheus data
sudo tar -czf prometheus-backup.tar.gz /data/monitoring/prometheus
```

## Updates

```bash
# Pull latest images
sudo -u ml-user podman pull docker.io/grafana/grafana:latest
sudo -u ml-user podman pull docker.io/prom/prometheus:latest
sudo -u ml-user podman pull docker.io/grafana/loki:latest
sudo -u ml-user podman pull docker.io/grafana/promtail:latest

# Restart services to use new images
sudo systemctl restart grafana prometheus loki promtail
```
@ -1,6 +1,6 @@
|
|||
# Quick Start
|
||||
|
||||
Get Fetch ML running in minutes with Docker Compose.
|
||||
Get Fetch ML running in minutes with Docker Compose and integrated monitoring.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
|
|
@ -8,9 +8,13 @@ Get Fetch ML running in minutes with Docker Compose.
|
|||
- **Docker Compose**: For testing and development only
|
||||
- **Podman**: For production experiment execution
|
||||
|
||||
**Requirements:**
|
||||
- Go 1.21+
|
||||
- Zig 0.11+
|
||||
- Docker Compose (testing only)
|
||||
- 4GB+ RAM
|
||||
- 2GB+ disk space
|
||||
- Git
|
||||
|
||||
## One-Command Setup
|
||||
|
||||
|
|
@ -18,108 +22,312 @@ Get Fetch ML running in minutes with Docker Compose.
|
|||
# Clone and start
|
||||
git clone https://github.com/jfraeys/fetch_ml.git
|
||||
cd fetch_ml
|
||||
docker-compose up -d (testing only)
|
||||
make dev-up
|
||||
|
||||
# Wait for services (30 seconds)
|
||||
sleep 30
|
||||
|
||||
# Verify setup
|
||||
curl http://localhost:9101/health
|
||||
curl http://localhost:8080/health
|
||||
```
|
||||
|
||||
Note: the development compose runs the API server over HTTP/WS for CLI compatibility. For HTTPS/WSS, terminate TLS at a reverse proxy.
|
||||
|
||||
**Access Services:**
|
||||
- **API Server (via Caddy)**: http://localhost:8080
|
||||
- **API Server (via Caddy + internal TLS)**: https://localhost:8443
|
||||
- **Grafana**: http://localhost:3000 (admin/admin123)
|
||||
- **Prometheus**: http://localhost:9090
|
||||
- **Loki**: http://localhost:3100
|
||||
|
||||
## Development Setup
|
||||
|
||||
### Build Components
|
||||
|
||||
```bash
|
||||
# Build all components
|
||||
make build
|
||||
|
||||
# Development build
|
||||
make dev
|
||||
```
|
||||
|
||||
### Start Services
|
||||
|
||||
```bash
|
||||
# Start development stack with monitoring
|
||||
make dev-up
|
||||
|
||||
# Check status
|
||||
make dev-status
|
||||
|
||||
# Stop services
|
||||
make dev-down
|
||||
```
|
||||
|
||||
### Verify Setup
|
||||
|
||||
```bash
|
||||
# Check API health
|
||||
curl -f http://localhost:8080/health
|
||||
|
||||
# Check monitoring services
|
||||
curl -f http://localhost:3000/api/health
|
||||
curl -f http://localhost:9090/api/v1/query?query=up
|
||||
curl -f http://localhost:3100/ready
|
||||
|
||||
# Check Redis
|
||||
docker exec ml-experiments-redis redis-cli ping
|
||||
```
|
||||
|
||||
## First Experiment
|
||||
|
||||
```bash
|
||||
# Submit a simple ML job (see [First Experiment](first-experiment.md) for details)
|
||||
curl -X POST http://localhost:9101/api/v1/jobs \
|
||||
-H "Content-Type: application/json" \
|
||||
-H "X-API-Key: admin" \
|
||||
-d '{
|
||||
"job_name": "hello-world",
|
||||
"args": "--echo Hello World",
|
||||
"priority": 1
|
||||
}'
|
||||
|
||||
# Check job status
|
||||
curl http://localhost:9101/api/v1/jobs \
|
||||
-H "X-API-Key: admin"
|
||||
```

## CLI Access

### 1. Setup CLI

```bash
# Build CLI (development or release)
cd cli && zig build dev
cd cli && zig build --release=fast

# Initialize CLI config
./cli/zig-out/bin/ml init
```

### 2. Queue Job

```bash
# Simple test job
echo "test experiment" | ./cli/zig-out/bin/ml queue test-job

# List jobs
./cli/zig-out/dev/ml --server http://localhost:9101 list-jobs

# Submit new job
./cli/zig-out/dev/ml --server http://localhost:9101 submit \
  --name "test-job" --args "--epochs 10"

# Check status
./cli/zig-out/bin/ml status
```

## Local Mode (Zero-Install)

Run workers locally without Redis or SSH for development and testing:

```bash
# Start a local worker (uses configs/worker-dev.yaml)
./cmd/worker/worker -config configs/worker-dev.yaml

# In another terminal, submit a job to the local worker
curl -X POST http://localhost:9101/api/v1/jobs \
  -H "Content-Type: application/json" \
  -H "X-API-Key: admin" \
  -d '{
    "job_name": "local-test",
    "args": "--echo Local Mode Works",
    "priority": 1
  }'

# The worker executes locally using:
# - Local command execution (no SSH)
# - Local job directories (pending/running/finished)
# - In-memory task queue (no Redis required)
```
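
The pending/running/finished job directories mentioned above can be pictured as a tiny file-based state machine. This is an illustrative sketch, not the worker's actual implementation — `advance_job` is a hypothetical helper, and it assumes a job is a single file under `base_path`:

```shell
# advance_job BASE JOB: move a job file one step through
# pending/ -> running/ -> finished/ under the BASE directory.
advance_job() {
  base="$1"; job="$2"
  if [ -f "$base/pending/$job" ]; then
    mv "$base/pending/$job" "$base/running/$job"
  elif [ -f "$base/running/$job" ]; then
    mv "$base/running/$job" "$base/finished/$job"
  else
    echo "advance_job: $job not found in pending/ or running/" >&2
    return 1
  fi
}

# With base_path "./jobs": advance_job ./jobs my-job
```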

Local mode configuration (`configs/worker-dev.yaml`):

```yaml
local_mode: true    # Enable local execution
base_path: "./jobs" # Local job directory
redis_addr: ""      # Optional: skip Redis
host: ""            # Optional: skip SSH
```

### 3. Monitor Progress

```bash
# View in Grafana
open http://localhost:3000

# Check logs in Grafana Log Analysis dashboard
# Or view container logs
docker logs ml-experiments-api -f
```

## Related Documentation

- [Installation Guide](installation.md) - Detailed setup options
- [First Experiment](first-experiment.md) - Complete ML workflow
- [Development Setup](development-setup.md) - Local development
- [Security](security.md) - Authentication and permissions

## Key Commands

### Development Commands

```bash
make help             # Show all commands
make build            # Build all components
make dev-up           # Start dev environment
make dev-down         # Stop dev environment
make dev-status       # Check dev status
make test             # Run tests
make test-unit        # Run unit tests
make test-integration # Run integration tests
```

### CLI Commands

```bash
# Build CLI
cd cli && zig build --release=fast

# Common operations
./cli/zig-out/bin/ml status   # Check system status
./cli/zig-out/bin/ml queue job-name # Queue job
./cli/zig-out/bin/ml list     # List jobs
./cli/zig-out/bin/ml help     # Show help
```

### Monitoring Commands

```bash
# Access monitoring services
open http://localhost:3000 # Grafana
open http://localhost:9090 # Prometheus
open http://localhost:3100 # Loki

# (Optional) Re-generate Grafana provisioning (datasources/providers)
python3 scripts/setup_monitoring.py
```

## Configuration

### Environment Setup

```bash
# Copy example environment
cp deployments/env.dev.example .env

# Edit as needed
vim .env
```

**Key Variables**:

- `LOG_LEVEL=info`
- `GRAFANA_ADMIN_PASSWORD=admin123`

### CLI Configuration

```bash
# Setup CLI config
mkdir -p ~/.ml

# Create config file if needed
touch ~/.ml/config.toml

# Edit configuration
vim ~/.ml/config.toml
```
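
A starting point for `~/.ml/config.toml` might look like the following. The key names are illustrative assumptions, not a documented schema — check the [CLI Reference](cli-reference.md) for the actual options. The values mirror the dev defaults used elsewhere in this guide (API on port 9101, API key `admin`):

```toml
# Hypothetical example — key names are assumptions, adjust to your deployment
server = "http://localhost:9101"
api_key = "admin"
```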

## Testing

### Quick Test

```bash
# 5-minute authentication test
make test-auth

# Clean up
make self-cleanup
```

### Full Test Suite

```bash
# Run all tests
make test

# Run with coverage
make test-coverage

# Run specific test types
make test-unit
make test-integration
make test-e2e
```

### Load Testing

```bash
# Run load tests
make load-test

# Run benchmarks
make benchmark

# Track performance
./scripts/track_performance.sh
```
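
For quick ad-hoc comparisons alongside `make benchmark`, a coarse wall-clock wrapper can time repeated runs of any command (a sketch; `time_runs` is a hypothetical helper with second-level granularity only — use the real benchmark targets for anything precise):

```shell
# time_runs N CMD...: run CMD N times, print total and average wall time in seconds.
time_runs() {
  n="$1"; shift
  start=$(date +%s)
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" > /dev/null
    i=$((i + 1))
  done
  end=$(date +%s)
  echo "runs=$n total=$((end - start))s avg=$(( (end - start) / n ))s"
}

# Example: time_runs 5 curl -fsS http://localhost:9101/health
```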

## Troubleshooting

### Common Issues

**Services not starting?**

```bash
# Check logs
docker-compose logs
```

**Port Conflicts**:

```bash
# Check port usage
lsof -i :8080
lsof -i :8443
lsof -i :3000
lsof -i :9090

# Kill conflicting processes
kill -9 <PID>
```
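
A small lookup mapping the default dev ports to their services makes the `lsof` output quicker to interpret (a sketch; `port_service` is a hypothetical helper, and the mapping assumes the default port assignments used in this guide):

```shell
# port_service PORT: name the service that owns a default dev port.
port_service() {
  case "$1" in
    8080|8443) echo "Caddy (HTTP/TLS entrypoint)" ;;
    9101)      echo "API server" ;;
    3000)      echo "Grafana" ;;
    9090)      echo "Prometheus" ;;
    3100)      echo "Loki" ;;
    *)         echo "unknown" ;;
  esac
}

# Example: for p in 8080 8443 3000 9090; do echo "$p -> $(port_service "$p")"; done
```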

**Build Issues**:

```bash
# Fix Go modules
go mod tidy

# Fix Zig build
cd cli && rm -rf zig-out zig-cache && zig build --release=fast
```

**Container Issues**:

```bash
# Check container status
docker ps --filter "name=ml-"

# View logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana

# Restart services (calling docker-compose directly is for testing only; prefer make)
docker-compose down && docker-compose up -d
make dev-down && make dev-up
```

**API not responding?**

```bash
# Check health
curl http://localhost:9101/health

# Verify ports
docker-compose ps
```

**Monitoring Issues**:

```bash
# Re-setup monitoring
python3 scripts/setup_monitoring.py

# Restart Grafana
docker restart ml-experiments-grafana

# Check datasources in Grafana
# Settings → Data Sources → Test connection
```

**Permission denied?**

```bash
# Check API key
curl -H "X-API-Key: admin" http://localhost:9101/api/v1/jobs
```

### Debug Mode

```bash
# Enable debug logging
export LOG_LEVEL=debug
make dev-up
```

## Next Steps

### Explore Features

1. **Job Management**: Queue and monitor ML experiments
2. **WebSocket Communication**: Real-time updates
3. **Multi-User Authentication**: Role-based access control
4. **Performance Monitoring**: Grafana dashboards and metrics
5. **Log Aggregation**: Centralized logging with Loki

### Advanced Configuration

- **Production Setup**: See [Deployment Guide](deployment.md)
- **Performance Monitoring**: See [Performance Monitoring](performance-monitoring.md)
- **Testing Procedures**: See [Testing Guide](testing.md)
- **CLI Reference**: See [CLI Reference](cli-reference.md)

### Production Deployment

For production deployment:

1. Review [Deployment Guide](deployment.md)
2. Set up production monitoring
3. Configure security and authentication
4. Set up backup procedures

## Help and Support

### Get Help

```bash
make help                   # Show all available commands
./cli/zig-out/bin/ml --help # CLI help
```

### Documentation

- **[Testing Guide](testing.md)** - Comprehensive testing procedures
- **[Deployment Guide](deployment.md)** - Production deployment
- **[Performance Monitoring](performance-monitoring.md)** - Monitoring setup
- **[Architecture Guide](architecture.md)** - System architecture
- **[Troubleshooting](troubleshooting.md)** - Common issues

### Community

- Check logs: `docker logs ml-experiments-api`
- Review documentation in `docs/src/`
- Use `--debug` flag with CLI commands for detailed output

---

*Ready in minutes!*