
ML Experiment Manager - Deployment Guide

Overview

The ML Experiment Manager supports multiple deployment methods, from local development to homelab Docker setups.

Quick Start

# Clone repository
git clone https://github.com/your-org/fetch_ml.git
cd fetch_ml

# Start all services
docker-compose up -d  # testing and development only

# Check status
docker-compose ps

# View logs
docker-compose logs -f api-server

Access the API at http://localhost:9100

Deployment Options

1. Local Development

Prerequisites

Container Runtimes:

  • Docker & Docker Compose: testing and development only
  • Podman: production experiment execution

Toolchain:

  • Go 1.25+
  • Zig 0.15.2
  • Redis 7+
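
Before the manual setup below, it can help to confirm that the tools above are installed. A minimal sketch — it only checks PATH presence, not the versions listed:

```shell
#!/bin/sh
# Check that each required tool from the prerequisites list is on PATH.
# Illustrative only: does not verify versions (Go 1.25+, Zig 0.15.2, Redis 7+).
for tool in go zig redis-server docker; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done
echo "prerequisite check complete"
```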

Manual Setup

# Start Redis
redis-server

# Build and run Go server
go build -o bin/api-server ./cmd/api-server
./bin/api-server -config configs/config-local.yaml

# Build Zig CLI
cd cli
zig build prod
./zig-out/bin/ml --help

2. Docker Deployment

Build Image

docker build -t ml-experiment-manager:latest .

Run Container

docker run -d \
  --name ml-api \
  -p 9100:9100 \
  -p 9101:9101 \
  -v $(pwd)/configs:/app/configs:ro \
  -v experiment-data:/data/ml-experiments \
  ml-experiment-manager:latest

Docker Compose

# Development mode (uses root docker-compose.yml)
docker-compose up -d

# Production deployment
docker-compose -f deployments/docker-compose.prod.yml up -d

# Secure homelab deployment
docker-compose -f deployments/docker-compose.homelab-secure.yml up -d

# With custom configuration
docker-compose --env-file .env.prod -f deployments/docker-compose.prod.yml up -d
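
The env file passed via `--env-file` holds the variables the compose file interpolates. A hypothetical `.env.prod` sketch — the variable names here mirror the Docker Compose Environment example further down; match them to what `deployments/docker-compose.prod.yml` actually references:

```
# .env.prod — illustrative only
REDIS_URL=redis://redis:6379
LOG_LEVEL=info
```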

3. Homelab Setup

# Use the simple setup script
./setup.sh

# Or manually with Docker Compose
docker-compose up -d  # testing only

4. Cloud Deployment

AWS ECS

# Build and push to ECR
aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY
docker build -t $ECR_REGISTRY/ml-experiment-manager:latest .
docker push $ECR_REGISTRY/ml-experiment-manager:latest

# Deploy with ECS CLI
ecs-cli compose --project-name ml-experiment-manager up

Google Cloud Run

# Build and push
gcloud builds submit --tag gcr.io/$PROJECT_ID/ml-experiment-manager

# Deploy
gcloud run deploy ml-experiment-manager \
  --image gcr.io/$PROJECT_ID/ml-experiment-manager \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated

Configuration

Configuration File

# configs/config-local.yaml
base_path: "/data/ml-experiments"
auth:
  enabled: true
  api_keys:
    - "your-production-api-key"
server:
  address: ":9100"
  tls:
    enabled: true
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"
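
As a sanity check before starting the server, a script can verify that the config file defines the expected top-level keys. A minimal sketch — the required-key list is an assumption based on the example above; adjust it to your config:

```shell
#!/bin/sh
# Verify a config file contains the expected top-level keys.
# Uses a temp file holding the example config from above; point `cfg`
# at configs/config-local.yaml in real use.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
base_path: "/data/ml-experiments"
auth:
  enabled: true
server:
  address: ":9100"
EOF
for key in base_path auth server; do
  grep -q "^${key}:" "$cfg" || { echo "missing required key: $key"; exit 1; }
done
echo "config OK"
rm -f "$cfg"
```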

Docker Compose Environment

# docker-compose.yml
version: '3.8'
services:
  api-server:
    environment:
      - REDIS_URL=redis://redis:6379
      - LOG_LEVEL=info
    volumes:
      - ./configs:/configs:ro
      - ./data:/data/experiments

Monitoring & Logging

Health Checks

  • HTTP: GET /health
  • WebSocket: Connection test
  • Redis: Ping check
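
The HTTP health endpoint can also back a Docker Compose healthcheck. A sketch, assuming `curl` is available inside the image and the intervals are tuned to taste:

```yaml
services:
  api-server:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9100/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```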

Metrics

  • Prometheus metrics at /metrics
  • Custom application metrics
  • Container resource usage
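
To scrape the `/metrics` endpoint, a Prometheus config fragment might look like this — the job name is arbitrary and the target assumes the default port from this guide:

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: "ml-experiment-manager"
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:9100"]
```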

Logging

  • Structured JSON logging
  • Log levels: DEBUG, INFO, WARN, ERROR
  • Centralized logging via ELK stack
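
For reference, a structured log line might look like the following. The field names are an assumption for illustration; check the server's actual output before building ELK parsers against it:

```json
{"level":"info","ts":"2025-12-06T13:46:40-05:00","msg":"experiment started","experiment_id":"exp-123"}
```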

Security

TLS Configuration

# Generate self-signed cert (development)
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes

# Production - use Let's Encrypt
certbot certonly --standalone -d ml-experiments.example.com
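
Whichever way the certificates are produced, they must end up where the server config expects them. A compose fragment mounting a local `ssl/` directory to match the `cert_file`/`key_file` paths from the configuration example — the host-side directory name is an assumption:

```yaml
services:
  api-server:
    volumes:
      - ./ssl:/app/ssl:ro   # must match cert_file/key_file in the server config
```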

Network Security

  • Firewall rules (ports 9100, 9101, 6379)
  • VPN access for internal services
  • API key authentication
  • Rate limiting
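
If the API sits behind a reverse proxy, rate limiting can be enforced there rather than in the application. A hypothetical nginx sketch — the upstream port comes from this guide's defaults, the domain from the certbot example above, and the limits are arbitrary starting points:

```nginx
# Per-client rate limit for the API (fragment)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

server {
    listen 443 ssl;
    server_name ml-experiments.example.com;

    location / {
        limit_req zone=api burst=20;
        proxy_pass http://127.0.0.1:9100;
    }
}
```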

Performance Tuning

Resource Allocation

FetchML now centralizes pacing and container limits under a resources section in every server/worker config. Example for a homelab box:

resources:
  max_workers: 1
  desired_rps_per_worker: 2   # conservative pacing per worker
  podman_cpus: "2"            # Podman --cpus, keeps host responsive
  podman_memory: "8g"         # Podman --memory, isolates experiment installs

For high-end machines (e.g., M2 Ultra, 18 performance cores / 64GB RAM), start with:

resources:
  max_workers: 2              # two concurrent experiments
  desired_rps_per_worker: 5   # faster job submission
  podman_cpus: "8"
  podman_memory: "32g"

Adjust upward only if experiments stay GPU-bound; keeping Podman limits in place ensures users can install packages inside the container without jeopardizing the host.

Scaling Strategies

  • Horizontal pod autoscaling
  • Redis clustering
  • Load balancing
  • CDN for static assets

Backup & Recovery

Data Backup

# Backup experiment data
docker-compose exec redis redis-cli BGSAVE
docker cp $(docker-compose ps -q redis):/data/dump.rdb ./redis-backup.rdb

# Backup data volume
docker run --rm -v ml-experiments_redis_data:/data -v $(pwd):/backup alpine tar czf /backup/redis-backup.tar.gz -C /data .
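
To automate the snapshot, the same commands can run from cron. A hypothetical crontab entry, assuming the repository lives at `/opt/fetch_ml` and a `backups/` directory exists (note `exec -T` for a non-interactive shell, and the escaped `%` cron requires):

```
# nightly at 02:00
0 2 * * * cd /opt/fetch_ml && docker-compose exec -T redis redis-cli BGSAVE && docker cp "$(docker-compose ps -q redis)":/data/dump.rdb backups/dump-$(date +\%F).rdb
```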

Disaster Recovery

  1. Restore Redis data
  2. Restart services
  3. Verify experiment metadata
  4. Test API endpoints

Troubleshooting

Common Issues

API Server Not Starting

# Check logs
docker-compose logs api-server

# Check configuration
cat configs/config-local.yaml

# Check Redis connection
docker-compose exec redis redis-cli ping

WebSocket Connection Issues

# Test WebSocket
wscat -c ws://localhost:9100/ws

# Check TLS
openssl s_client -connect localhost:9101 -servername localhost

Performance Issues

# Check resource usage
docker-compose exec api-server ps aux

# Check Redis memory
docker-compose exec redis redis-cli info memory

Debug Mode

# Enable debug logging
export LOG_LEVEL=debug
./bin/api-server -config configs/config-local.yaml

CI/CD Integration

GitHub Actions

  • Automated testing on PR
  • Multi-platform builds
  • Security scanning
  • Automatic releases

Deployment Pipeline

  1. Code commit → GitHub
  2. CI/CD pipeline triggers
  3. Build and test
  4. Security scan
  5. Deploy to staging
  6. Run integration tests
  7. Deploy to production
  8. Post-deployment verification

Support

For deployment issues:

  1. Check this guide
  2. Review logs
  3. Check GitHub Issues
  4. Contact maintainers