ML Experiment Manager - Deployment Guide
Overview
The ML Experiment Manager supports multiple deployment methods, from local development to production setups with integrated monitoring.
TLS / WSS Policy
- The Zig CLI currently supports `ws://` only (native `wss://` is not implemented).
- For production, use a reverse proxy (Caddy) to terminate TLS/WSS (Approach A) and keep the API server on internal HTTP/WS.
- If you need remote CLI access, use one of:
  - an SSH tunnel to the internal `ws://` endpoint
  - a private network/VPN so `ws://` is not exposed to the public Internet
- When `server.tls.enabled: false`, the API server still runs on plain HTTP/WS internally. In development, access it via Caddy at `http://localhost:8080/health`.
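As an illustration of Approach A, a minimal Caddy site block might look like the following. The hostname and upstream port are assumptions; substitute the values for your deployment. Caddy proxies WebSocket upgrade requests through `reverse_proxy` without extra configuration, so `ws://` clients behind the proxy become `wss://` clients externally.

```caddyfile
ml.example.com {
    # TLS is terminated here; the API server stays on internal plain HTTP/WS
    reverse_proxy 127.0.0.1:8080
}
```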
Data Directories
- `base_path` is where experiment directories live.
- `data_dir` is used for dataset/snapshot materialization and integrity validation.
- If you want `ml validate` to check snapshots/datasets, you must mount `data_dir` into the API server container.
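A sketch of how these two paths might appear in the API server config. The `storage:` nesting and the concrete paths are assumptions for illustration; match them to your actual `api.yaml` schema and container mounts.

```yaml
# api.yaml fragment (key nesting and paths are assumptions)
storage:
  base_path: /var/lib/fetchml/experiments   # experiment directories
  data_dir: /var/lib/fetchml/data           # datasets/snapshots for `ml validate`
```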
Quick Start
Development Deployment with Monitoring
# Start development stack with monitoring
make dev-up
# Alternative: Use deployment script
cd deployments && make dev-up
# Check status
make dev-status
Access Services:
- API Server (via Caddy): http://localhost:8080
- API Server (via Caddy + internal TLS): https://localhost:8443
- Grafana: http://localhost:3000 (admin/admin123)
- Prometheus: http://localhost:9090
- Loki: http://localhost:3100
Deployment Options
1. Development Environment
Purpose: Local development with full monitoring stack
Services: API Server, Redis, Prometheus, Grafana, Loki, Promtail
Configuration:
# Using Makefile (recommended)
make dev-up
make dev-down
make dev-status
# Using deployment script
cd deployments
make dev-up
make dev-down
make dev-status
Features:
- Auto-provisioned Grafana dashboards
- Real-time metrics and logs
- Hot reload for development
- Local data persistence
2. Production Environment
Purpose: Production deployment with security
Services: API Server, Worker, Redis with authentication
Configuration:
cd deployments
make prod-up
make prod-down
make prod-status
Features:
- Secure Redis with authentication
- TLS/WSS via reverse proxy termination (Caddy)
- Production-optimized configurations
- Health checks and restart policies
3. Homelab Secure Environment
Purpose: Secure homelab deployment
Services: API Server, Redis, Caddy reverse proxy
Configuration:
cd deployments
make homelab-up
make homelab-down
make homelab-status
Features:
- Caddy reverse proxy
- TLS termination
- Network isolation
- External networks
Environment Setup
Development Environment
# Copy example environment
cp deployments/env.dev.example .env
# Edit as needed
vim .env
Key Variables:
- `LOG_LEVEL=info`
- `GRAFANA_ADMIN_PASSWORD=admin123`
Production Environment
# Copy example environment
cp deployments/env.prod.example .env
# Edit with production values
vim .env
Key Variables:
- `REDIS_PASSWORD=your-secure-password`
- `JWT_SECRET=your-jwt-secret`
- `SSL_CERT_PATH=/path/to/cert`
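One way to populate these with strong values rather than hand-typed strings is to generate them (this assumes `openssl` is installed; lengths are a reasonable default, not a project requirement):

```shell
# Generate random secrets for the production .env
REDIS_PASSWORD=$(openssl rand -hex 32)
JWT_SECRET=$(openssl rand -base64 48)
printf 'REDIS_PASSWORD=%s\nJWT_SECRET=%s\n' "$REDIS_PASSWORD" "$JWT_SECRET"
```

Append the output to `.env` (or redirect it there directly) and keep the file out of version control.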
Monitoring Setup
Automatic Configuration
Monitoring dashboards and datasources are auto-provisioned:
# Setup monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py
# Start services (includes monitoring)
make dev-up
Available Dashboards
- Load Test Performance: Request rates, response times, error rates
- System Health: Service status, memory, CPU usage
- Log Analysis: Error logs, service logs, log aggregation
Manual Configuration
If auto-provisioning fails:
- Access Grafana: http://localhost:3000
- Add Data Sources:
- Prometheus: http://prometheus:9090
- Loki: http://loki:3100
- Import Dashboards: from `monitoring/grafana/dashboards/`
Testing Procedures
Pre-Deployment Testing
# Run unit tests
make test-unit
# Run integration tests
make test-integration
# Run full test suite
make test
# Run with coverage
make test-coverage
Load Testing
# Run load tests
make load-test
# Run specific load scenarios
make benchmark-local
# Track performance over time
./scripts/track_performance.sh
Health Checks
# Check service health
curl -f http://localhost:8080/health
# Check monitoring services
curl -f http://localhost:3000/api/health
curl -f http://localhost:9090/api/v1/query?query=up
curl -f http://localhost:3100/ready
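The checks above can be wrapped in a small reusable probe, useful in cron jobs or CI. This is a sketch; the endpoint list matches the services above, and `--max-time` guards against hangs:

```shell
# Probe an endpoint; returns non-zero if it is down or slow
check_health() {
  curl -fsS --max-time 5 "$1" > /dev/null
}

# Example: probe the API server through Caddy
check_health http://localhost:8080/health && echo "api: OK" || echo "api: DOWN"
```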
Troubleshooting
Common Issues
Port Conflicts:
# Check port usage
lsof -i :9101
lsof -i :3000
lsof -i :9090
# Kill conflicting processes
kill -9 <PID>
Container Issues:
# View container logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana
# Restart services
make dev-restart
# Clean restart
make dev-down && make dev-up
Monitoring Issues:
# Re-setup monitoring configuration
python3 scripts/setup_monitoring.py
# Restart Grafana only
docker restart ml-experiments-grafana
Performance Issues
High Memory Usage:
- Check Grafana dashboards for memory metrics
- Adjust Prometheus retention in `prometheus.yml`
- Monitor log retention in `loki-config.yml`
Slow Response Times:
- Check network connectivity between containers
- Verify Redis performance
- Review API server logs for bottlenecks
Maintenance
Regular Tasks
Weekly:
- Check Grafana dashboards for anomalies
- Review log files for errors
- Verify backup procedures
Monthly:
- Update Docker images
- Clean up old Docker volumes
- Review and rotate secrets
Backup Procedures
Data Backup:
# Backup application data
docker run --rm -v ml_data:/data -v $(pwd):/backup alpine tar czf /backup/data-backup.tar.gz -C /data .
# Backup monitoring data
docker run --rm -v prometheus_data:/data -v $(pwd):/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data .
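To restore, reverse the archive direction: mount the volume and extract into it. The round trip can be sketched with plain `tar` (the docker variant simply runs the same commands inside an `alpine` container with the volume mounted at `/data`; paths below are illustrative):

```shell
# Create sample data, archive it, then restore it elsewhere
rm -rf /tmp/ml_demo
mkdir -p /tmp/ml_demo/data /tmp/ml_demo/restore
echo "hello" > /tmp/ml_demo/data/file.txt
tar czf /tmp/ml_demo/data-backup.tar.gz -C /tmp/ml_demo/data .
tar xzf /tmp/ml_demo/data-backup.tar.gz -C /tmp/ml_demo/restore
cat /tmp/ml_demo/restore/file.txt
```

Stop the container (or scrape target) before restoring so the service does not write to the volume mid-extract.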
Configuration Backup:
# Backup configurations
tar czf config-backup.tar.gz monitoring/ deployments/ configs/
Security Considerations
Development Environment
- Change default Grafana password
- Use environment variables for secrets
- Monitor container logs for security events
Production Environment
- Enable Redis authentication
- Use SSL/TLS certificates
- Implement network segmentation
- Regular security updates
- Monitor access logs
Performance Optimization
Resource Limits
Development:
# docker-compose.dev.yml
services:
api-server:
deploy:
resources:
limits:
memory: 512M
cpus: '0.5'
Production:
# docker-compose.prod.yml
services:
api-server:
deploy:
resources:
limits:
memory: 2G
cpus: '2.0'
Monitoring Optimization
Prometheus:
- Adjust scrape intervals
- Configure retention periods
- Use recording rules for frequent queries
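A recording rule precomputes an expensive query on a schedule so dashboards read a cheap, pre-aggregated series instead. A minimal sketch (the metric and rule names are assumptions; use whatever your exporters actually emit):

```yaml
# prometheus rule file fragment (metric/rule names are assumptions)
groups:
  - name: fetchml_recording
    rules:
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
```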
Loki:
- Configure log retention
- Use log sampling for high-volume sources
- Optimize label cardinality
Non-Docker Production (systemd)
This project can be run in production without Docker. The recommended model is:
- Run `api-server` and `worker` as systemd services.
- Terminate TLS/WSS at Caddy and keep the API server on internal plain HTTP/WS.
The unit templates below are copy-paste friendly, but you must adjust paths, users, and config locations to your environment.
fetchml-api.service
[Unit]
Description=FetchML API Server
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml
Environment=LOG_LEVEL=info
ExecStart=/usr/local/bin/api-server -config /etc/fetchml/api.yaml
Restart=on-failure
RestartSec=2
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/fetchml /var/log/fetchml
[Install]
WantedBy=multi-user.target
fetchml-worker.service
[Unit]
Description=FetchML Worker
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml
Environment=LOG_LEVEL=info
ExecStart=/usr/local/bin/worker -config /etc/fetchml/worker.yaml
Restart=on-failure
RestartSec=2
NoNewPrivileges=true
PrivateTmp=true
[Install]
WantedBy=multi-user.target
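After placing the unit files in `/etc/systemd/system/`, the standard systemd workflow activates them:

```shell
# Reload unit definitions, then enable and start both services
sudo systemctl daemon-reload
sudo systemctl enable --now fetchml-api.service fetchml-worker.service

# Inspect status and recent logs
systemctl status fetchml-api.service
journalctl -u fetchml-api.service -n 50
```

Note that `ProtectSystem=strict` in the API unit makes the filesystem read-only except for `ReadWritePaths`; if the server writes anywhere else (e.g. a different `data_dir`), add that path there.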
Optional: caddy.service
Most distros ship a Caddy systemd unit. If you do not have one available, you can use this template.
[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
NoNewPrivileges=true
Restart=on-failure
[Install]
WantedBy=multi-user.target
Migration Guide
From Development to Production
1. Export Data:

   docker exec ml-data redis-cli BGSAVE
   docker cp ml-data:/data/dump.rdb ./redis-backup.rdb

2. Update Configuration:

   cp deployments/env.prod.example .env
   # Edit with production values

3. Deploy Production:

   cd deployments
   make prod-up

4. Import Data:

   docker cp ./redis-backup.rdb ml-prod-redis:/data/dump.rdb
   docker restart ml-prod-redis
Support
For deployment issues:
- Check troubleshooting section
- Review container logs
- Verify network connectivity
- Check resource usage in Grafana