# ML Experiment Manager - Deployment Guide

## Overview

The ML Experiment Manager supports multiple deployment methods, from local development to production setups with integrated monitoring.

## TLS / WSS Policy

- The Zig CLI currently supports `ws://` only (native `wss://` is not implemented).
- For production, use a reverse proxy (Caddy) to terminate TLS/WSS (Approach A) and keep the API server on internal HTTP/WS.
- If you need remote CLI access, use one of:
  - an SSH tunnel to the internal `ws://` endpoint
  - a private network/VPN so `ws://` is not exposed to the public Internet
- When `server.tls.enabled: false`, the API server still runs on plain HTTP/WS internally. In development, access it via Caddy at `http://localhost:8080/health`.

## Data Directories

- `base_path` is where experiment directories live.
- `data_dir` is used for dataset/snapshot materialization and integrity validation.
- If you want `ml validate` to check snapshots/datasets, you must mount `data_dir` into the API server container.

## Quick Start

### Development Deployment with Monitoring

```bash
# Start development stack with monitoring
make dev-up

# Alternative: use the deployment script
cd deployments && make dev-up

# Check status
make dev-status
```

**Access Services:**

- **API Server (via Caddy)**: http://localhost:8080
- **API Server (via Caddy + internal TLS)**: https://localhost:8443
- **Grafana**: http://localhost:3000 (admin/admin123)
- **Prometheus**: http://localhost:9090
- **Loki**: http://localhost:3100

## Deployment Options
### 1. Development Environment

**Purpose**: Local development with full monitoring stack

**Services**: API Server, Redis, Prometheus, Grafana, Loki, Promtail

**Configuration**:

```bash
# Using Makefile (recommended)
make dev-up
make dev-down
make dev-status

# Using deployment script
cd deployments
make dev-up
make dev-down
make dev-status
```

**Features**:

- Auto-provisioned Grafana dashboards
- Real-time metrics and logs
- Hot reload for development
- Local data persistence

### 2. Production Environment

**Purpose**: Production deployment with security

**Services**: API Server, Worker, Redis with authentication

**Configuration**:

```bash
cd deployments
make prod-up
make prod-down
make prod-status
```

**Features**:

- Secure Redis with authentication
- TLS/WSS via reverse proxy termination (Caddy)
- Production-optimized configurations
- Health checks and restart policies

### 3. Homelab Secure Environment

**Purpose**: Secure homelab deployment

**Services**: API Server, Redis, Caddy reverse proxy

**Configuration**:

```bash
cd deployments
make homelab-up
make homelab-down
make homelab-status
```

**Features**:

- Caddy reverse proxy
- TLS termination
- Network isolation
- External networks

## Environment Setup

### Development Environment

```bash
# Copy example environment
cp deployments/env.dev.example .env

# Edit as needed
vim .env
```

**Key Variables**:

- `LOG_LEVEL=info`
- `GRAFANA_ADMIN_PASSWORD=admin123`

### Production Environment

```bash
# Copy example environment
cp deployments/env.prod.example .env

# Edit with production values
vim .env
```

**Key Variables**:

- `REDIS_PASSWORD=your-secure-password`
- `JWT_SECRET=your-jwt-secret`
- `SSL_CERT_PATH=/path/to/cert`

## Monitoring Setup

### Automatic Configuration

Monitoring dashboards and datasources are auto-provisioned:

```bash
# Set up monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py

# Start services (includes monitoring)
make dev-up
```

### Available Dashboards
1. **Load Test Performance**: Request rates, response times, error rates
2. **System Health**: Service status, memory, CPU usage
3. **Log Analysis**: Error logs, service logs, log aggregation

### Manual Configuration

If auto-provisioning fails:

1. **Access Grafana**: http://localhost:3000
2. **Add Data Sources**:
   - Prometheus: http://prometheus:9090
   - Loki: http://loki:3100
3. **Import Dashboards**: From `monitoring/grafana/dashboards/`

## Testing Procedures

### Pre-Deployment Testing

```bash
# Run unit tests
make test-unit

# Run integration tests
make test-integration

# Run full test suite
make test

# Run with coverage
make test-coverage
```

### Load Testing

```bash
# Run load tests
make load-test

# Run specific load scenarios
make benchmark-local

# Track performance over time
./scripts/track_performance.sh
```

### Health Checks

```bash
# Check service health
curl -f http://localhost:8080/health

# Check monitoring services
curl -f http://localhost:3000/api/health
curl -f 'http://localhost:9090/api/v1/query?query=up'
curl -f http://localhost:3100/ready
```

## Troubleshooting

### Common Issues

**Port Conflicts**:

```bash
# Check port usage
lsof -i :9101
lsof -i :3000
lsof -i :9090

# Kill a conflicting process by PID
kill -9 <PID>
```

**Container Issues**:

```bash
# View container logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana

# Restart services
make dev-restart

# Clean restart
make dev-down && make dev-up
```

**Monitoring Issues**:

```bash
# Re-run monitoring configuration setup
python3 scripts/setup_monitoring.py

# Restart Grafana only
docker restart ml-experiments-grafana
```

### Performance Issues

**High Memory Usage**:

- Check Grafana dashboards for memory metrics
- Adjust Prometheus retention in `prometheus.yml`
- Monitor log retention in `loki-config.yml`

**Slow Response Times**:

- Check network connectivity between containers
- Verify Redis performance
- Review API server logs for bottlenecks

## Maintenance

### Regular Tasks

**Weekly**:

- Check Grafana
dashboards for anomalies
- Review log files for errors
- Verify backup procedures

**Monthly**:

- Update Docker images
- Clean up old Docker volumes
- Review and rotate secrets

### Backup Procedures

**Data Backup**:

```bash
# Back up application data
docker run --rm -v ml_data:/data -v $(pwd):/backup alpine tar czf /backup/data-backup.tar.gz -C /data .

# Back up monitoring data
docker run --rm -v prometheus_data:/data -v $(pwd):/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data .
```

**Configuration Backup**:

```bash
# Back up configurations
tar czf config-backup.tar.gz monitoring/ deployments/ configs/
```

## Security Considerations

### Development Environment

- Change the default Grafana password
- Use environment variables for secrets
- Monitor container logs for security events

### Production Environment

- Enable Redis authentication
- Use SSL/TLS certificates
- Implement network segmentation
- Apply regular security updates
- Monitor access logs

## Performance Optimization

### Resource Limits

**Development**:

```yaml
# docker-compose.dev.yml
services:
  api-server:
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '0.5'
```

**Production**:

```yaml
# docker-compose.prod.yml
services:
  api-server:
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '2.0'
```

### Monitoring Optimization

**Prometheus**:

- Adjust scrape intervals
- Configure retention periods
- Use recording rules for frequent queries

**Loki**:

- Configure log retention
- Use log sampling for high-volume sources
- Optimize label cardinality

## Non-Docker Production (systemd)

This project can be run in production without Docker. The recommended model is:

- Run `api-server` and `worker` as systemd services.
- Terminate TLS/WSS at Caddy and keep the API server on internal plain HTTP/WS.

The unit templates below are copy-paste friendly, but you must adjust paths, users, and config locations to your environment.
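Installing the templates generally means creating the service user, putting binaries and unit files in place, and enabling the units. The sketch below is a dry run, not project tooling: the `run` helper only prints each step (so nothing here needs root), and the exact `useradd`/`install` flags are illustrative assumptions — adapt them to your distribution before executing for real.

```shell
#!/usr/bin/env sh
# Dry-run sketch of installing the unit templates in this guide.
# The run() helper prints each step instead of executing it; replace
# 'echo' with actual execution once the steps have been reviewed.
run() { echo "+ $*"; }

# Create the dedicated service user (paths match the unit templates)
run useradd --system --home /var/lib/fetchml --shell /usr/sbin/nologin fetchml

# Create state, log, and config directories
run install -d -o fetchml -g fetchml /var/lib/fetchml /var/log/fetchml /etc/fetchml

# Install the unit files and activate the services
run install -m 644 fetchml-api.service fetchml-worker.service /etc/systemd/system/
run systemctl daemon-reload
run systemctl enable --now fetchml-api.service fetchml-worker.service
```

After a real install, `systemctl status fetchml-api fetchml-worker` and `journalctl -u fetchml-api` are the usual places to confirm the services came up.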
### `fetchml-api.service`

```ini
[Unit]
Description=FetchML API Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml
Environment=LOG_LEVEL=info
ExecStart=/usr/local/bin/api-server -config /etc/fetchml/api.yaml
Restart=on-failure
RestartSec=2
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/fetchml /var/log/fetchml

[Install]
WantedBy=multi-user.target
```

### `fetchml-worker.service`

```ini
[Unit]
Description=FetchML Worker
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml
Environment=LOG_LEVEL=info
ExecStart=/usr/local/bin/worker -config /etc/fetchml/worker.yaml
Restart=on-failure
RestartSec=2
NoNewPrivileges=true
PrivateTmp=true

[Install]
WantedBy=multi-user.target
```

### Optional: `caddy.service`

Most distros ship a Caddy systemd unit. If you do not have one available, you can use this template.

```ini
[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
NoNewPrivileges=true
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

## Migration Guide

### From Development to Production

1. **Export Data**:

   ```bash
   docker exec ml-data redis-cli BGSAVE
   docker cp ml-data:/data/dump.rdb ./redis-backup.rdb
   ```

2. **Update Configuration**:

   ```bash
   cp deployments/env.prod.example .env
   # Edit with production values
   ```

3. **Deploy Production**:

   ```bash
   cd deployments
   make prod-up
   ```
4. **Import Data**:

   ```bash
   docker cp ./redis-backup.rdb ml-prod-redis:/data/dump.rdb
   docker restart ml-prod-redis
   ```

## Support

For deployment issues:

1. Check the troubleshooting section above
2. Review container logs
3. Verify network connectivity
4. Check resource usage in Grafana
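The checklist above can be bundled into a quick triage pass. This is a sketch rather than project tooling: the `check` helper is hypothetical, and the URLs and container name are the ones used earlier in this guide — adjust for your deployment.

```shell
#!/usr/bin/env sh
# Quick triage: hit each health endpoint and report pass/fail, then pull
# recent API server logs. Failures print FAIL instead of aborting the run.
check() {
  if curl -fsS --max-time 5 "$1" >/dev/null 2>&1; then
    echo "ok $1"
  else
    echo "FAIL $1"
  fi
}

check http://localhost:8080/health                     # API server (via Caddy)
check http://localhost:3000/api/health                 # Grafana
check 'http://localhost:9090/api/v1/query?query=up'    # Prometheus
check http://localhost:3100/ready                      # Loki

# Recent container logs for the API server
docker logs --tail 50 ml-experiments-api 2>/dev/null \
  || echo "FAIL docker logs ml-experiments-api"
```

Any `FAIL` line points at the service to investigate first; resource usage is then best inspected in the Grafana dashboards.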