ML Experiment Manager - Deployment Guide
Overview
The ML Experiment Manager supports several deployment methods, from local development and homelab Docker setups to cloud platforms (AWS ECS, Google Cloud Run).
Quick Start
Docker Compose (Recommended for Development)
# Clone repository
git clone https://github.com/your-org/fetch_ml.git
cd fetch_ml
# Start all services (testing only)
docker-compose up -d
# Check status
docker-compose ps
# View logs
docker-compose logs -f api-server
Access the API at http://localhost:9100
Deployment Options
1. Local Development
Prerequisites
Container runtimes:
- Docker Compose: for testing and development only
- Podman: for production experiment execution
Build tools and services:
- Go 1.25+
- Zig 0.15.2
- Redis 7+
- Docker & Docker Compose (optional)
Manual Setup
# Start Redis
redis-server
# Build and run Go server
go build -o bin/api-server ./cmd/api-server
./bin/api-server -config configs/config-local.yaml
# Build Zig CLI
cd cli
zig build prod
./zig-out/bin/ml --help
2. Docker Deployment
Build Image
docker build -t ml-experiment-manager:latest .
Run Container
docker run -d \
  --name ml-api \
  -p 9100:9100 \
  -p 9101:9101 \
  -v $(pwd)/configs:/app/configs:ro \
  -v experiment-data:/data/ml-experiments \
  ml-experiment-manager:latest
Docker Compose
# Development mode (uses root docker-compose.yml)
docker-compose up -d
# Production deployment
docker-compose -f deployments/docker-compose.prod.yml up -d
# Secure homelab deployment
docker-compose -f deployments/docker-compose.homelab-secure.yml up -d
# With custom configuration (--env-file is a global option and must precede the subcommand)
docker-compose --env-file .env.prod -f deployments/docker-compose.prod.yml up
3. Homelab Setup
# Use the simple setup script
./setup.sh
# Or manually with Docker Compose (testing only)
docker-compose up -d
4. Cloud Deployment
AWS ECS
# Build and push to ECR
aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY
docker build -t $ECR_REGISTRY/ml-experiment-manager:latest .
docker push $ECR_REGISTRY/ml-experiment-manager:latest
# Deploy with ECS CLI
ecs-cli compose --project-name ml-experiment-manager up
Google Cloud Run
# Build and push
gcloud builds submit --tag gcr.io/$PROJECT_ID/ml-experiment-manager
# Deploy
gcloud run deploy ml-experiment-manager \
  --image gcr.io/$PROJECT_ID/ml-experiment-manager \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated
Configuration
Configuration File
# configs/config-local.yaml
base_path: "/data/ml-experiments"
auth:
  enabled: true
  api_keys:
    - "your-production-api-key"
server:
  address: ":9100"
  tls:
    enabled: true
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"
Docker Compose Environment
# docker-compose.yml
version: '3.8'
services:
  api-server:
    environment:
      - REDIS_URL=redis://redis:6379
      - LOG_LEVEL=info
    volumes:
      - ./configs:/configs:ro
      - ./data:/data/experiments
Monitoring & Logging
Health Checks
- HTTP: GET /health
- WebSocket: Connection test
- Redis: Ping check
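The HTTP and Redis checks above could be scripted roughly as follows. This is a sketch under assumptions: it presumes `curl` and `redis-cli` are on the PATH and that `/health` answers without an HTTP error when the server is healthy. The function names are illustrative and not part of the project. The WebSocket check is left out because it needs a WebSocket client such as `wscat`.

```shell
# Hypothetical helper functions; names and timeouts are assumptions.

# check_http URL: succeeds unless the request fails or returns HTTP 400 or above.
check_http() {
    curl --fail --silent --show-error --max-time 5 --output /dev/null "$1"
}

# check_redis HOST PORT: succeeds when Redis answers PING with PONG.
check_redis() {
    reply=$(redis-cli -h "$1" -p "$2" ping 2>/dev/null) || return 1
    [ "$reply" = "PONG" ]
}

# run_health_checks API_URL REDIS_HOST REDIS_PORT
# Prints one line per check and returns non-zero if any check failed.
run_health_checks() {
    rc=0
    if check_http "$1/health"; then echo "http: ok"; else echo "http: FAILED"; rc=1; fi
    if check_redis "$2" "$3"; then echo "redis: ok"; else echo "redis: FAILED"; rc=1; fi
    return "$rc"
}
```

With the default ports from this guide, a call would look like `run_health_checks http://localhost:9100 localhost 6379`.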
Metrics
- Prometheus metrics at /metrics
- Custom application metrics
- Container resource usage
Logging
- Structured JSON logging
- Log levels: DEBUG, INFO, WARN, ERROR
- Centralized logging via ELK stack
Security
TLS Configuration
# Generate self-signed cert (development)
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes
# Production - use Let's Encrypt
certbot certonly --standalone -d ml-experiments.example.com
Network Security
- Firewall rules (ports 9100, 9101, 6379)
- VPN access for internal services
- API key authentication
- Rate limiting
Performance Tuning
Resource Allocation
FetchML now centralizes pacing and container limits under a resources section in every server/worker config. Example for a homelab box:
resources:
  max_workers: 1
  desired_rps_per_worker: 2   # conservative pacing per worker
  podman_cpus: "2"            # Podman --cpus, keeps host responsive
  podman_memory: "8g"         # Podman --memory, isolates experiment installs
For high-end machines (e.g., M2 Ultra, 16 performance cores / 64 GB RAM), start with:
resources:
  max_workers: 2              # two concurrent experiments
  desired_rps_per_worker: 5   # faster job submission
  podman_cpus: "8"
  podman_memory: "32g"
Adjust upward only if experiments stay GPU-bound; keeping Podman limits in place ensures users can install packages inside the container without jeopardizing the host.
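To make the two Podman settings concrete, the sketch below prints the kind of command line they correspond to, going by the comments in the config (`podman_cpus` maps to `--cpus`, `podman_memory` to `--memory`). How the worker actually invokes Podman is not documented here, so the function name and the image name are hypothetical.

```shell
# podman_limit_cmd CPUS MEMORY IMAGE
# Prints (does not run) a podman command with the resource limits applied.
# Hypothetical helper; the real worker may pass additional flags.
podman_limit_cmd() {
    printf 'podman run --rm --cpus %s --memory %s %s\n' "$1" "$2" "$3"
}

# Homelab values from the config; "experiment-image" is a placeholder name.
podman_limit_cmd 2 8g experiment-image
# prints: podman run --rm --cpus 2 --memory 8g experiment-image
```

Running the printed command by hand is one way to confirm that a given limit leaves the host responsive before committing it to the config.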
Scaling Strategies
- Horizontal pod autoscaling
- Redis clustering
- Load balancing
- CDN for static assets
Backup & Recovery
Data Backup
# Back up Redis data (SAVE blocks until the snapshot is on disk; BGSAVE returns before it is written)
docker-compose exec redis redis-cli SAVE
docker cp $(docker-compose ps -q redis):/data/dump.rdb ./redis-backup.rdb
# Backup data volume
docker run --rm -v ml-experiments_redis_data:/data -v $(pwd):/backup alpine tar czf /backup/redis-backup.tar.gz -C /data .
Disaster Recovery
- Restore Redis data
- Restart services
- Verify experiment metadata
- Test API endpoints
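For the Docker Compose setup, the recovery steps above might look like the following. This is a sketch, not a tested procedure. It assumes the `redis-backup.rdb` file from the backup step, a Compose service named `redis` that keeps its snapshot at `/data/dump.rdb`, and that append-only persistence (AOF) is disabled, since Redis would otherwise load the AOF file instead of the snapshot. Commands are only printed unless `DRY_RUN=0` is set. The `run` and `restore_redis` names are illustrative, and verifying experiment metadata is left as a manual step.

```shell
# run CMD...: prints the command; executes it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# restore_redis BACKUP_FILE CONTAINER_ID
# Stops Redis, replaces its snapshot, restarts it, then checks Redis and the API.
restore_redis() {
    run docker-compose stop redis
    run docker cp "$1" "$2:/data/dump.rdb"
    run docker-compose start redis
    run docker-compose exec -T redis redis-cli ping
    run curl --fail --silent http://localhost:9100/health
}
```

Look up the container ID while Redis is still running, for example `DRY_RUN=0 restore_redis ./redis-backup.rdb "$(docker-compose ps -q redis)"`. Reviewing the dry-run output first is a cheap way to confirm the paths before anything is overwritten.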
Troubleshooting
Common Issues
API Server Not Starting
# Check logs
docker-compose logs api-server
# Check configuration
cat configs/config-local.yaml
# Check Redis connection
docker-compose exec redis redis-cli ping
WebSocket Connection Issues
# Test WebSocket
wscat -c ws://localhost:9100/ws
# Check TLS
openssl s_client -connect localhost:9101 -servername localhost
Performance Issues
# Check resource usage
docker-compose exec api-server ps aux
# Check Redis memory
docker-compose exec redis redis-cli info memory
Debug Mode
# Enable debug logging
export LOG_LEVEL=debug
./bin/api-server -config configs/config-local.yaml
CI/CD Integration
GitHub Actions
- Automated testing on PR
- Multi-platform builds
- Security scanning
- Automatic releases
Deployment Pipeline
- Code commit → GitHub
- CI/CD pipeline triggers
- Build and test
- Security scan
- Deploy to staging
- Run integration tests
- Deploy to production
- Post-deployment verification
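The final step, post-deployment verification, is often a small polling loop in the pipeline. A minimal sketch, assuming `curl` is available and that the `/health` endpoint listed under Health Checks signals readiness. The function name, attempt count, and delay are illustrative choices, not project settings.

```shell
# wait_until_healthy URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until curl succeeds or the attempts run out.
wait_until_healthy() {
    url=$1
    attempts=${2:-10}
    delay=${3:-3}
    i=1
    while [ "$i" -le "$attempts" ]; do
        if curl --fail --silent --max-time 5 --output /dev/null "$url"; then
            echo "healthy after $i attempt(s)"
            return 0
        fi
        # Pause between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
        i=$((i + 1))
    done
    echo "not healthy after $attempts attempt(s)" >&2
    return 1
}
```

A pipeline step could call `wait_until_healthy http://localhost:9100/health 20 5` and fail the deployment when the function returns non-zero.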
Support
For deployment issues:
- Check this guide
- Review logs
- Check GitHub Issues
- Contact maintainers