
ML Experiment Manager - Deployment Guide

Overview

The ML Experiment Manager supports several deployment methods, from local development through production, all with integrated monitoring.

TLS / WSS Policy

  • The Zig CLI currently supports ws:// only (native wss:// is not implemented).
  • For production, use a reverse proxy (Caddy) to terminate TLS/WSS (Approach A) and keep the API server on internal HTTP/WS.
  • If you need remote CLI access, use one of:
    • an SSH tunnel to the internal ws:// endpoint
    • a private network/VPN so ws:// is not exposed to the public Internet
  • When server.tls.enabled: false, the API server still runs on plain HTTP/WS internally. In development, access it via Caddy at http://localhost:8080/health.
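For Approach A, a minimal Caddyfile sketch is shown below; the hostname ml.example.com and the internal port 8080 are placeholders for your own values. Caddy's reverse_proxy forwards WebSocket upgrades unchanged, so external wss:// clients reach the internal ws:// endpoint without needing TLS support in the CLI:

```caddyfile
ml.example.com {
    # Caddy obtains and renews the TLS certificate automatically.
    # reverse_proxy passes WebSocket Upgrade requests through, so
    # wss://ml.example.com/... terminates here and continues as ws://.
    reverse_proxy localhost:8080
}
```

For remote CLI access without exposing ws:// publicly, an SSH tunnel works: `ssh -L 8080:localhost:8080 user@server`, then point the CLI at ws://localhost:8080 (user, host, and port are placeholders).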

Data Directories

  • base_path is where experiment directories live.
  • data_dir is used for dataset/snapshot materialization and integrity validation.
  • If you want ml validate to check snapshots/datasets, you must mount data_dir into the API server container.
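As an illustration, a compose fragment that mounts both directories into the API server container might look like this (the host paths and container mount points are hypothetical; match them to base_path and data_dir in your api.yaml):

```yaml
services:
  api-server:
    volumes:
      - /srv/fetchml/experiments:/experiments   # base_path: experiment dirs
      - /srv/fetchml/data:/data                 # data_dir: snapshots/datasets
```

With the data_dir mount in place, ml validate can reach snapshots/datasets from inside the container.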

Quick Start

Development Deployment with Monitoring

# Start development stack with monitoring
make dev-up

# Alternative: run the same targets from the deployments directory
cd deployments && make dev-up

# Check status
make dev-status

Access Services:

  • API server (via Caddy): http://localhost:8080 (health: /health)
  • Grafana: http://localhost:3000 (admin / GRAFANA_ADMIN_PASSWORD)
  • Prometheus: http://localhost:9090
  • Loki: http://localhost:3100

Deployment Options

1. Development Environment

Purpose: Local development with full monitoring stack

Services: API Server, Redis, Prometheus, Grafana, Loki, Promtail

Configuration:

# Using Makefile (recommended)
make dev-up
make dev-down
make dev-status

# From the deployments directory
cd deployments
make dev-up
make dev-down
make dev-status

Features:

  • Auto-provisioned Grafana dashboards
  • Real-time metrics and logs
  • Hot reload for development
  • Local data persistence

2. Production Environment

Purpose: Production deployment with security

Services: API Server, Worker, Redis with authentication

Configuration:

cd deployments
make prod-up
make prod-down
make prod-status

Features:

  • Secure Redis with authentication
  • TLS/WSS via reverse proxy termination (Caddy)
  • Production-optimized configurations
  • Health checks and restart policies

3. Homelab Secure Environment

Purpose: Secure homelab deployment

Services: API Server, Redis, Caddy reverse proxy

Configuration:

cd deployments
make homelab-up
make homelab-down
make homelab-status

Features:

  • Caddy reverse proxy
  • TLS termination
  • Network isolation
  • External networks

Environment Setup

Development Environment

# Copy example environment
cp deployments/env.dev.example .env

# Edit as needed
vim .env

Key Variables:

  • LOG_LEVEL=info
  • GRAFANA_ADMIN_PASSWORD=admin123

Production Environment

# Copy example environment
cp deployments/env.prod.example .env

# Edit with production values
vim .env

Key Variables:

  • REDIS_PASSWORD=your-secure-password
  • JWT_SECRET=your-jwt-secret
  • SSL_CERT_PATH=/path/to/cert

Monitoring Setup

Automatic Configuration

Monitoring dashboards and datasources are auto-provisioned:

# Setup monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py

# Start services (includes monitoring)
make dev-up

Available Dashboards

  1. Load Test Performance: Request rates, response times, error rates
  2. System Health: Service status, memory, CPU usage
  3. Log Analysis: Error logs, service logs, log aggregation

Manual Configuration

If auto-provisioning fails:

  1. Access Grafana: http://localhost:3000
  2. Add Data Sources: Prometheus (http://localhost:9090) and Loki (http://localhost:3100)
  3. Import Dashboards: From monitoring/grafana/dashboards/

Testing Procedures

Pre-Deployment Testing

# Run unit tests
make test-unit

# Run integration tests
make test-integration

# Run full test suite
make test

# Run with coverage
make test-coverage

Load Testing

# Run load tests
make load-test

# Run specific load scenarios
make benchmark-local

# Track performance over time
./scripts/track_performance.sh

Health Checks

# Check service health
curl -f http://localhost:8080/health

# Check monitoring services
curl -f http://localhost:3000/api/health
curl -f http://localhost:9090/api/v1/query?query=up
curl -f http://localhost:3100/ready
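The checks above can be wrapped in a small fail-fast helper. A minimal sketch (the 5-second timeout is an arbitrary choice):

```shell
#!/bin/sh
# check URL NAME -> prints "OK <name>" or "FAIL <name> (<url>)";
# returns curl's exit status so callers can stop at the first failure.
check() {
  if curl -fsS --max-time 5 "$1" >/dev/null 2>&1; then
    echo "OK   $2"
  else
    echo "FAIL $2 ($1)"
    return 1
  fi
}
```

Chain the probes with `&&` to abort on the first failure, e.g. `check http://localhost:8080/health "API server" && check http://localhost:3000/api/health "Grafana"`.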

Troubleshooting

Common Issues

Port Conflicts:

# Check port usage
lsof -i :9101
lsof -i :3000
lsof -i :9090

# Kill conflicting processes
kill -9 <PID>

Container Issues:

# View container logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana

# Restart services
make dev-restart

# Clean restart
make dev-down && make dev-up

Monitoring Issues:

# Re-setup monitoring configuration
python3 scripts/setup_monitoring.py

# Restart Grafana only
docker restart ml-experiments-grafana

Performance Issues

High Memory Usage:

  • Check Grafana dashboards for memory metrics
  • Adjust Prometheus retention (set via the --storage.tsdb.retention.time flag on the Prometheus container)
  • Monitor log retention in loki-config.yml

Slow Response Times:

  • Check network connectivity between containers
  • Verify Redis performance
  • Review API server logs for bottlenecks

Maintenance

Regular Tasks

Weekly:

  • Check Grafana dashboards for anomalies
  • Review log files for errors
  • Verify backup procedures

Monthly:

  • Update Docker images
  • Clean up old Docker volumes
  • Review and rotate secrets

Backup Procedures

Data Backup:

# Backup application data
docker run --rm -v ml_data:/data -v "$(pwd)":/backup alpine tar czf /backup/data-backup.tar.gz -C /data .

# Backup monitoring data
docker run --rm -v prometheus_data:/data -v "$(pwd)":/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data .

Configuration Backup:

# Backup configurations
tar czf config-backup.tar.gz monitoring/ deployments/ configs/
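The one-off command above can be wrapped in a timestamped helper so repeated backups do not overwrite each other. A sketch (the directory list mirrors this guide; adjust to your layout):

```shell
#!/bin/sh
# backup_configs [DEST] -> writes DEST/config-backup-<timestamp>.tar.gz
# from the config directories used elsewhere in this guide, then
# prints the archive path on success.
backup_configs() {
  dest="${1:-.}"
  stamp="$(date +%Y%m%d-%H%M%S)"
  out="$dest/config-backup-$stamp.tar.gz"
  tar czf "$out" monitoring/ deployments/ configs/ && echo "$out"
}
```

Usage: `backup_configs /var/backups` from the repository root.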

Security Considerations

Development Environment

  • Change default Grafana password
  • Use environment variables for secrets
  • Monitor container logs for security events

Production Environment

  • Enable Redis authentication
  • Use SSL/TLS certificates
  • Implement network segmentation
  • Regular security updates
  • Monitor access logs

Performance Optimization

Resource Limits

Development:

# docker-compose.dev.yml
services:
  api-server:
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '0.5'

Production:

# docker-compose.prod.yml
services:
  api-server:
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '2.0'

Monitoring Optimization

Prometheus:

  • Adjust scrape intervals
  • Configure retention periods
  • Use recording rules for frequent queries

Loki:

  • Configure log retention
  • Use log sampling for high-volume sources
  • Optimize label cardinality
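As a concrete illustration of the Prometheus knobs (service name and values are assumptions): scrape frequency lives in prometheus.yml, while retention is set with a command-line flag, typically in the container's command:

```yaml
# prometheus.yml -- scrape frequency (global, overridable per job)
global:
  scrape_interval: 30s   # longer interval => fewer samples stored

# docker-compose fragment -- retention is a Prometheus flag, not a YAML key
services:
  prometheus:
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
```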

Non-Docker Production (systemd)

This project can be run in production without Docker. The recommended model is:

  • Run api-server and worker as systemd services.
  • Terminate TLS/WSS at Caddy and keep the API server on internal plain HTTP/WS.

The unit templates below are copy-paste friendly, but you must adjust paths, users, and config locations to your environment.

fetchml-api.service

[Unit]
Description=FetchML API Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml

Environment=LOG_LEVEL=info

ExecStart=/usr/local/bin/api-server -config /etc/fetchml/api.yaml
Restart=on-failure
RestartSec=2

NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/fetchml /var/log/fetchml

[Install]
WantedBy=multi-user.target

fetchml-worker.service

[Unit]
Description=FetchML Worker
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml

Environment=LOG_LEVEL=info

ExecStart=/usr/local/bin/worker -config /etc/fetchml/worker.yaml
Restart=on-failure
RestartSec=2

NoNewPrivileges=true
PrivateTmp=true

[Install]
WantedBy=multi-user.target

Optional: caddy.service

Most distros ship a Caddy systemd unit. If you do not have one available, you can use this template.

[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
NoNewPrivileges=true
Restart=on-failure

[Install]
WantedBy=multi-user.target
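Once the unit files are in place (names as in the templates above), the usual systemd workflow applies. A sketch, assuming the units were saved in the current directory and you are running as root:

```shell
# Install the units, reload systemd, and start everything.
install -m 644 fetchml-api.service fetchml-worker.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable --now fetchml-api fetchml-worker

# Inspect status and follow logs.
systemctl status fetchml-api fetchml-worker
journalctl -u fetchml-api -f
```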

Migration Guide

From Development to Production

  1. Export Data:

    docker exec ml-data redis-cli BGSAVE
    docker cp ml-data:/data/dump.rdb ./redis-backup.rdb
    
  2. Update Configuration:

    cp deployments/env.prod.example .env
    # Edit with production values
    
  3. Deploy Production:

    cd deployments
    make prod-up
    
  4. Import Data:

    docker cp ./redis-backup.rdb ml-prod-redis:/data/dump.rdb
    docker restart ml-prod-redis
    

Support

For deployment issues:

  1. Check troubleshooting section
  2. Review container logs
  3. Verify network connectivity
  4. Check resource usage in Grafana