ML Experiment Manager - Deployment Guide
Overview
The ML Experiment Manager supports multiple deployment methods, from local development to production setups with integrated monitoring.
TLS / WSS Policy
- The Zig CLI currently supports `ws://` only (native `wss://` is not implemented).
- For production, use a reverse proxy (Caddy) to terminate TLS/WSS (Approach A) and keep the API server on internal HTTP/WS.
- If you need remote CLI access, use one of:
  - an SSH tunnel to the internal `ws://` endpoint
  - a private network/VPN so `ws://` is not exposed to the public Internet
- When `server.tls.enabled: false`, the API server still runs on plain HTTP/WS internally. In development, access it via Caddy at `http://localhost:8080/health`.
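As an illustration of Approach A, a minimal Caddy site block might look like the following. The hostname and upstream port are assumptions; substitute the values for your deployment. Caddy proxies WebSocket upgrade requests through `reverse_proxy` without extra configuration, so `ws://` clients behind the proxy become `wss://` clients externally.

```caddyfile
ml.example.com {
    # TLS is terminated here; the API server stays on internal plain HTTP/WS
    reverse_proxy 127.0.0.1:8080
}
```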
Data Directories
- `base_path` is where experiment directories live.
- `data_dir` is used for dataset/snapshot materialization and integrity validation.
- If you want `ml validate` to check snapshots/datasets, you must mount `data_dir` into the API server container.
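A sketch of how these two paths might appear in the API server config. The `storage:` nesting and the concrete paths are assumptions for illustration; match them to your actual `api.yaml` schema and container mounts.

```yaml
# api.yaml fragment (key nesting and paths are assumptions)
storage:
  base_path: /var/lib/fetchml/experiments   # experiment directories
  data_dir: /var/lib/fetchml/data           # datasets/snapshots for `ml validate`
```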
Quick Start
Development Deployment with Monitoring
# Start development stack with monitoring
make dev-up
# Alternative: Use deployment script
cd deployments && make dev-up
# Check status
make dev-status
Access Services:
- API Server (via Caddy): http://localhost:8080
- API Server (via Caddy + internal TLS): https://localhost:8443
- Grafana: http://localhost:3000 (admin/admin123)
- Prometheus: http://localhost:9090
- Loki: http://localhost:3100
Deployment Options
1. Development Environment
Purpose: Local development with full monitoring stack
Services: API Server, Redis, Prometheus, Grafana, Loki, Promtail
Configuration:
# Using Makefile (recommended)
make dev-up
make dev-down
make dev-status
# Using deployment script
cd deployments
make dev-up
make dev-down
make dev-status
Features:
- Auto-provisioned Grafana dashboards
- Real-time metrics and logs
- Hot reload for development
- Local data persistence
2. Production Environment
Purpose: Production deployment with security
Services: API Server, Worker, Redis with authentication
Configuration:
cd deployments
make prod-up
make prod-down
make prod-status
Features:
- Secure Redis with authentication
- TLS/WSS via reverse proxy termination (Caddy)
- Production-optimized configurations
- Health checks and restart policies
3. Homelab Secure Environment
Purpose: Secure homelab deployment
Services: API Server, Redis, Caddy reverse proxy
Configuration:
cd deployments
make homelab-up
make homelab-down
make homelab-status
Features:
- Caddy reverse proxy
- TLS termination
- Network isolation
- External networks
Environment Setup
Development Environment
# Copy example environment
cp deployments/env.dev.example .env
# Edit as needed
vim .env
Key Variables:
- `LOG_LEVEL=info`
- `GRAFANA_ADMIN_PASSWORD=admin123`
Production Environment
# Copy example environment
cp deployments/env.prod.example .env
# Edit with production values
vim .env
Key Variables:
- `REDIS_PASSWORD=your-secure-password`
- `JWT_SECRET=your-jwt-secret`
- `SSL_CERT_PATH=/path/to/cert`
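One way to populate these with strong values rather than hand-typed strings is to generate them (this assumes `openssl` is installed; lengths are a reasonable default, not a project requirement):

```shell
# Generate random secrets for the production .env
REDIS_PASSWORD=$(openssl rand -hex 32)
JWT_SECRET=$(openssl rand -base64 48)
printf 'REDIS_PASSWORD=%s\nJWT_SECRET=%s\n' "$REDIS_PASSWORD" "$JWT_SECRET"
```

Append the output to `.env` (or redirect it there directly) and keep the file out of version control.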
Monitoring Setup
Automatic Configuration
Monitoring dashboards and datasources are auto-provisioned:
# Setup monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py
# Start services (includes monitoring)
make dev-up
Available Dashboards
- Load Test Performance: Request rates, response times, error rates
- System Health: Service status, memory, CPU usage
- Log Analysis: Error logs, service logs, log aggregation
Manual Configuration
If auto-provisioning fails:
- Access Grafana: http://localhost:3000
- Add Data Sources:
- Prometheus: http://prometheus:9090
- Loki: http://loki:3100
- Import Dashboards: from `monitoring/grafana/dashboards/`
Testing Procedures
Pre-Deployment Testing
# Run unit tests
make test-unit
# Run integration tests
make test-integration
# Run full test suite
make test
# Run with coverage
make test-coverage
Load Testing
# Run load tests
make load-test
# Run specific load scenarios
make benchmark-local
# Track performance over time
./scripts/track_performance.sh
Health Checks
# Check service health
curl -f http://localhost:8080/health
# Check monitoring services
curl -f http://localhost:3000/api/health
curl -f http://localhost:9090/api/v1/query?query=up
curl -f http://localhost:3100/ready
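The checks above can be wrapped in a small reusable probe, useful in cron jobs or CI. This is a sketch; the endpoint list matches the services above, and `--max-time` guards against hangs:

```shell
# Probe an endpoint; returns non-zero if it is down or slow
check_health() {
  curl -fsS --max-time 5 "$1" > /dev/null
}

# Example: probe the API server through Caddy
check_health http://localhost:8080/health && echo "api: OK" || echo "api: DOWN"
```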
Troubleshooting
Common Issues
Port Conflicts:
# Check port usage
lsof -i :9101
lsof -i :3000
lsof -i :9090
# Kill conflicting processes
kill -9 <PID>
Container Issues:
# View container logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana
# Restart services
make dev-restart
# Clean restart
make dev-down && make dev-up
Monitoring Issues:
# Re-setup monitoring configuration
python3 scripts/setup_monitoring.py
# Restart Grafana only
docker restart ml-experiments-grafana
Performance Issues
High Memory Usage:
- Check Grafana dashboards for memory metrics
- Adjust Prometheus retention in `prometheus.yml`
- Monitor log retention in `loki-config.yml`
Slow Response Times:
- Check network connectivity between containers
- Verify Redis performance
- Review API server logs for bottlenecks
Maintenance
Regular Tasks
Weekly:
- Check Grafana dashboards for anomalies
- Review log files for errors
- Verify backup procedures
Monthly:
- Update Docker images
- Clean up old Docker volumes
- Review and rotate secrets
Backup Procedures
Data Backup:
# Backup application data
docker run --rm -v ml_data:/data -v $(pwd):/backup alpine tar czf /backup/data-backup.tar.gz -C /data .
# Backup monitoring data
docker run --rm -v prometheus_data:/data -v $(pwd):/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data .
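To restore, reverse the archive direction: mount the volume and extract into it. The round trip can be sketched with plain `tar` (the docker variant simply runs the same commands inside an `alpine` container with the volume mounted at `/data`; paths below are illustrative):

```shell
# Create sample data, archive it, then restore it elsewhere
rm -rf /tmp/ml_demo
mkdir -p /tmp/ml_demo/data /tmp/ml_demo/restore
echo "hello" > /tmp/ml_demo/data/file.txt
tar czf /tmp/ml_demo/data-backup.tar.gz -C /tmp/ml_demo/data .
tar xzf /tmp/ml_demo/data-backup.tar.gz -C /tmp/ml_demo/restore
cat /tmp/ml_demo/restore/file.txt
```

Stop the container (or scrape target) before restoring so the service does not write to the volume mid-extract.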
Configuration Backup:
# Backup configurations
tar czf config-backup.tar.gz monitoring/ deployments/ configs/
Security Considerations
Development Environment
- Change default Grafana password
- Use environment variables for secrets
- Monitor container logs for security events
Production Environment
- Enable Redis authentication
- Use SSL/TLS certificates
- Implement network segmentation
- Regular security updates
- Monitor access logs
Performance Optimization
Resource Limits
Development:
# docker-compose.dev.yml
services:
api-server:
deploy:
resources:
limits:
memory: 512M
cpus: '0.5'
Production:
# docker-compose.prod.yml
services:
api-server:
deploy:
resources:
limits:
memory: 2G
cpus: '2.0'
Monitoring Optimization
Prometheus:
- Adjust scrape intervals
- Configure retention periods
- Use recording rules for frequent queries
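A recording rule precomputes an expensive query on a schedule so dashboards read a cheap, pre-aggregated series instead. A minimal sketch (the metric and rule names are assumptions; use whatever your exporters actually emit):

```yaml
# prometheus rule file fragment (metric/rule names are assumptions)
groups:
  - name: fetchml_recording
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
```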
Loki:
- Configure log retention
- Use log sampling for high-volume sources
- Optimize label cardinality
Non-Docker Production (systemd)
This project can be run in production without Docker. The recommended model is:
- Run `api-server` and `worker` as systemd services.
- Terminate TLS/WSS at Caddy and keep the API server on internal plain HTTP/WS.
The unit templates below are copy-paste friendly, but you must adjust paths, users, and config locations to your environment.
fetchml-api.service
[Unit]
Description=FetchML API Server
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml
Environment=LOG_LEVEL=info
ExecStart=/usr/local/bin/api-server -config /etc/fetchml/api.yaml
Restart=on-failure
RestartSec=2
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/fetchml /var/log/fetchml
[Install]
WantedBy=multi-user.target
fetchml-worker.service
[Unit]
Description=FetchML Worker
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml
Environment=LOG_LEVEL=info
ExecStart=/usr/local/bin/worker -config /etc/fetchml/worker.yaml
Restart=on-failure
RestartSec=2
NoNewPrivileges=true
PrivateTmp=true
[Install]
WantedBy=multi-user.target
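After placing the unit files in `/etc/systemd/system/`, the standard systemd workflow activates them:

```shell
# Reload unit definitions, then enable and start both services
sudo systemctl daemon-reload
sudo systemctl enable --now fetchml-api.service fetchml-worker.service

# Inspect status and recent logs
systemctl status fetchml-api.service
journalctl -u fetchml-api.service -n 50
```

Note that `ProtectSystem=strict` in the API unit makes the filesystem read-only except for `ReadWritePaths`; if the server writes anywhere else (e.g. a different `data_dir`), add that path there.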
Optional: caddy.service
Most distros ship a Caddy systemd unit. If you do not have one available, you can use this template.
[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
NoNewPrivileges=true
Restart=on-failure
[Install]
WantedBy=multi-user.target
Migration Guide
From Development to Production
1. Export Data:

   docker exec ml-data redis-cli BGSAVE
   docker cp ml-data:/data/dump.rdb ./redis-backup.rdb

2. Update Configuration:

   cp deployments/env.prod.example .env
   # Edit with production values

3. Deploy Production:

   cd deployments
   make prod-up

4. Import Data:

   docker cp ./redis-backup.rdb ml-prod-redis:/data/dump.rdb
   docker restart ml-prod-redis
Support
For deployment issues:
- Check troubleshooting section
- Review container logs
- Verify network connectivity
- Check resource usage in Grafana