# ML Experiment Manager - Deployment Guide

## Overview

The ML Experiment Manager supports multiple deployment methods, from local development to homelab Docker setups.

## Quick Start

### Docker Compose (Recommended for Development)

```bash
# Clone repository
git clone https://github.com/your-org/fetch_ml.git
cd fetch_ml

# Start all services (development/testing only)
docker-compose up -d

# Check status
docker-compose ps

# View logs
docker-compose logs -f api-server
```

Access the API at `http://localhost:9100`.

## Deployment Options

### 1. Local Development

#### Prerequisites

**Container Runtimes:**

- **Docker Compose**: For testing and development only
- **Podman**: For production experiment execution

**Tooling:**

- Go 1.25+
- Zig 0.15.2
- Redis 7+
- Docker & Docker Compose (optional)

#### Manual Setup

```bash
# Start Redis
redis-server

# Build and run the Go server
go build -o bin/api-server ./cmd/api-server
./bin/api-server -config configs/config-local.yaml

# Build the Zig CLI
cd cli
zig build prod
./zig-out/bin/ml --help
```

### 2. Docker Deployment

#### Build Image

```bash
docker build -t ml-experiment-manager:latest .
```

#### Run Container

```bash
docker run -d \
  --name ml-api \
  -p 9100:9100 \
  -p 9101:9101 \
  -v $(pwd)/configs:/app/configs:ro \
  -v experiment-data:/data/ml-experiments \
  ml-experiment-manager:latest
```

#### Docker Compose

```bash
# Development mode (uses root docker-compose.yml)
docker-compose up -d

# Production deployment
docker-compose -f deployments/docker-compose.prod.yml up -d

# Secure homelab deployment
docker-compose -f deployments/docker-compose.homelab-secure.yml up -d

# With custom configuration (--env-file is a top-level flag, before the subcommand)
docker-compose --env-file .env.prod -f deployments/docker-compose.prod.yml up -d
```

### 3. Homelab Setup

```bash
# Use the simple setup script
./setup.sh

# Or manually with Docker Compose (testing only)
docker-compose up -d
```
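For an always-on homelab box, it can help to layer a restart policy and a container health check on top of the stock compose file. A minimal sketch: the `api-server` service name and the `/health` endpoint are taken from elsewhere in this guide, but the override values themselves are assumptions, and the healthcheck assumes `curl` exists in the image:

```yaml
# docker-compose.override.yml -- sketch; Compose merges this file automatically
# when it sits next to docker-compose.yml.
services:
  api-server:
    restart: unless-stopped
    healthcheck:
      # Assumes curl is available inside the container image.
      test: ["CMD", "curl", "-f", "http://localhost:9100/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```

Because Compose picks up `docker-compose.override.yml` by default, the plain `docker-compose up -d` commands above keep working unchanged.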
### 4. Cloud Deployment

#### AWS ECS

```bash
# Build and push to ECR
aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY
docker build -t $ECR_REGISTRY/ml-experiment-manager:latest .
docker push $ECR_REGISTRY/ml-experiment-manager:latest

# Deploy with ECS CLI
ecs-cli compose --project-name ml-experiment-manager up
```

#### Google Cloud Run

```bash
# Build and push
gcloud builds submit --tag gcr.io/$PROJECT_ID/ml-experiment-manager

# Deploy
gcloud run deploy ml-experiment-manager \
  --image gcr.io/$PROJECT_ID/ml-experiment-manager \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated
```

## Configuration

### Configuration File

```yaml
# configs/config-local.yaml
base_path: "/data/ml-experiments"
auth:
  enabled: true
  api_keys:
    - "your-production-api-key"
server:
  address: ":9100"
  tls:
    enabled: true
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"
```

### Docker Compose Environment

```yaml
# docker-compose.yml
version: '3.8'
services:
  api-server:
    environment:
      - REDIS_URL=redis://redis:6379
      - LOG_LEVEL=info
    volumes:
      - ./configs:/configs:ro
      - ./data:/data/experiments
```

## Monitoring & Logging

### Health Checks

- HTTP: `GET /health`
- WebSocket: Connection test
- Redis: Ping check

### Metrics

- Prometheus metrics at `/metrics`
- Custom application metrics
- Container resource usage

### Logging

- Structured JSON logging
- Log levels: DEBUG, INFO, WARN, ERROR
- Centralized logging via ELK stack

## Security

### TLS Configuration

```bash
# Generate self-signed cert (development)
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes

# Production - use Let's Encrypt
certbot certonly --standalone -d ml-experiments.example.com
```

### Network Security

- Firewall rules (ports 9100, 9101, 6379)
- VPN access for internal services
- API key authentication
- Rate limiting

## Performance Tuning

### Resource Allocation

FetchML now centralizes pacing and container limits under a `resources` section in
every server/worker config. Example for a homelab box:

```yaml
resources:
  max_workers: 1
  desired_rps_per_worker: 2   # conservative pacing per worker
  podman_cpus: "2"            # Podman --cpus, keeps host responsive
  podman_memory: "8g"         # Podman --memory, isolates experiment installs
```

For high-end machines (e.g., M2 Ultra, 18 performance cores / 64 GB RAM), start with:

```yaml
resources:
  max_workers: 2              # two concurrent experiments
  desired_rps_per_worker: 5   # faster job submission
  podman_cpus: "8"
  podman_memory: "32g"
```

Adjust upward only if experiments stay GPU-bound; keeping Podman limits in place ensures users can install packages inside the container without jeopardizing the host.

### Scaling Strategies

- Horizontal pod autoscaling
- Redis clustering
- Load balancing
- CDN for static assets

## Backup & Recovery

### Data Backup

```bash
# Backup experiment data
docker-compose exec redis redis-cli BGSAVE
docker cp $(docker-compose ps -q redis):/data/dump.rdb ./redis-backup.rdb

# Backup data volume
docker run --rm -v ml-experiments_redis_data:/data -v $(pwd):/backup alpine tar czf /backup/redis-backup.tar.gz -C /data .
```

### Disaster Recovery

1. Restore Redis data
2. Restart services
3. Verify experiment metadata
4. Test API endpoints
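Before step 1, it is worth sanity-checking the snapshot file: Redis RDB dumps begin with the ASCII magic `REDIS` followed by a version number. A small sketch, with the filename matching the backup example above; the `printf` line creates a stand-in file only so the sketch is self-contained, so point `backup` at your real dump instead:

```shell
# Verify an RDB snapshot looks valid before copying it back into the container.
backup=redis-backup.rdb

# Stand-in snapshot for illustration only; use your real ./redis-backup.rdb.
printf 'REDIS0011' > "$backup"

# RDB files start with the 5-byte magic "REDIS".
magic=$(head -c 5 "$backup")
if [ "$magic" = "REDIS" ]; then
  echo "ok: $backup has a valid RDB header"
else
  echo "error: $backup is not an RDB snapshot" >&2
  exit 1
fi
```

This catches truncated or mislabeled backups before they overwrite the live `dump.rdb`.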
## Troubleshooting

### Common Issues

#### API Server Not Starting

```bash
# Check logs
docker-compose logs api-server

# Check configuration
cat configs/config-local.yaml

# Check Redis connection
docker-compose exec redis redis-cli ping
```

#### WebSocket Connection Issues

```bash
# Test WebSocket
wscat -c ws://localhost:9100/ws

# Check TLS
openssl s_client -connect localhost:9101 -servername localhost
```

#### Performance Issues

```bash
# Check resource usage
docker-compose exec api-server ps aux

# Check Redis memory
docker-compose exec redis redis-cli info memory
```

### Debug Mode

```bash
# Enable debug logging
export LOG_LEVEL=debug
./bin/api-server -config configs/config-local.yaml
```

## CI/CD Integration

### GitHub Actions

- Automated testing on PR
- Multi-platform builds
- Security scanning
- Automatic releases

### Deployment Pipeline

1. Code commit → GitHub
2. CI/CD pipeline triggers
3. Build and test
4. Security scan
5. Deploy to staging
6. Run integration tests
7. Deploy to production
8. Post-deployment verification

## Support

For deployment issues:

1. Check this guide
2. Review logs
3. Check GitHub Issues
4. Contact maintainers
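When escalating to the maintainers, bundling the basic checks from the Troubleshooting section into one artifact saves a round trip. A sketch, assuming the `api-server` and `redis` service names from the compose examples in this guide:

```shell
# Collect basic diagnostics into one file to attach to a GitHub issue.
# Service names (api-server, redis) are the ones used in this guide's
# compose examples; adjust if your deployment differs.
{
  echo "== docker-compose version =="
  docker-compose version 2>&1
  echo "== api-server logs (last 50 lines) =="
  docker-compose logs --tail=50 api-server 2>&1
  echo "== redis ping =="
  docker-compose exec -T redis redis-cli ping 2>&1
} > diagnostics.txt
echo "wrote diagnostics.txt"
```

Errors from any individual check are captured in the file rather than aborting the script, so a partial report is still produced on a broken deployment.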