ML Experiment Manager - Deployment Guide

Overview

The ML Experiment Manager supports several deployment methods, from local development and homelab Docker setups to cloud platforms such as AWS ECS and Google Cloud Run.

Quick Start

# Clone repository
git clone https://github.com/your-org/fetch_ml.git
cd fetch_ml

# Start all services (Docker Compose is for testing only)
docker-compose up -d

# Check status
docker-compose ps

# View logs
docker-compose logs -f api-server

Access the API at http://localhost:9100

Deployment Options

1. Local Development

Prerequisites

Container Runtimes:

  • Docker Compose: for testing and development only
  • Podman: for production experiment execution

Other requirements:

  • Go 1.25+
  • Zig 0.15.2
  • Redis 7+
  • Docker & Docker Compose (optional)

Manual Setup

# Start Redis
redis-server

# Build and run Go server
go build -o bin/api-server ./cmd/api-server
./bin/api-server -config configs/config-local.yaml

# Build Zig CLI
cd cli
zig build prod
./zig-out/bin/ml --help

2. Docker Deployment

Build Image

docker build -t ml-experiment-manager:latest .

Run Container

docker run -d \
  --name ml-api \
  -p 9100:9100 \
  -p 9101:9101 \
  -v "$(pwd)/configs:/app/configs:ro" \
  -v experiment-data:/data/ml-experiments \
  ml-experiment-manager:latest

Docker Compose

# Detached mode (runs in the background)
docker-compose -f docker-compose.yml up -d

# Foreground mode with logs streamed to the terminal
docker-compose -f docker-compose.yml up

3. Homelab Setup

# Use the simple setup script
./setup.sh

# Or manually with Docker Compose (testing only)
docker-compose up -d

4. Cloud Deployment

AWS ECS

# Build and push to ECR
aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY
docker build -t $ECR_REGISTRY/ml-experiment-manager:latest .
docker push $ECR_REGISTRY/ml-experiment-manager:latest

# Deploy with ECS CLI
ecs-cli compose --project-name ml-experiment-manager up

Google Cloud Run

# Build and push
gcloud builds submit --tag gcr.io/$PROJECT_ID/ml-experiment-manager

# Deploy
gcloud run deploy ml-experiment-manager \
  --image gcr.io/$PROJECT_ID/ml-experiment-manager \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated

Configuration

Configuration File

# configs/config-local.yaml
base_path: "/data/ml-experiments"
auth:
  enabled: true
  api_keys:
    - "your-production-api-key"
server:
  address: ":9100"
  tls:
    enabled: true
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"

Docker Compose Environment

# docker-compose.yml
version: '3.8'
services:
  api-server:
    environment:
      - REDIS_URL=redis://redis:6379
      - LOG_LEVEL=info
    volumes:
      - ./configs:/configs:ro
      - ./data:/data/ml-experiments

Monitoring & Logging

Health Checks

  • HTTP: GET /health
  • WebSocket: Connection test
  • Redis: Ping check
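The HTTP check can be scripted as a readiness poll. The following is a minimal sketch, assuming the API is reachable over plain HTTP at http://localhost:9100 (as in Quick Start) and that a healthy server answers GET /health with status 200. The function name `wait_for_health` and its retry defaults are illustrative and not part of the project.

```shell
# wait_for_health [URL] [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until it answers HTTP 200 or the attempts run out.
# Assumption: a 200 response means healthy; the response body is ignored.
wait_for_health() {
  local url="${1:-http://localhost:9100/health}"
  local attempts="${2:-10}"
  local delay="${3:-2}"
  local i=1
  local status=""
  while [ "$i" -le "$attempts" ]; do
    # -s: silent, -o /dev/null: discard the body, -w: print only the status code
    status="$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)"
    if [ "$status" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "unhealthy: last status was ${status:-none}" >&2
  return 1
}
```

For example, `wait_for_health http://localhost:9100/health 30 1` waits up to roughly 30 seconds. If TLS is enabled in the server configuration, pass an `https://` URL instead.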

Metrics

  • Prometheus metrics at /metrics
  • Custom application metrics
  • Container resource usage
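To spot-check a single metric from the command line, the Prometheus text exposition format can be filtered with awk. This is a sketch only: `metric_value` is a hypothetical helper, it handles only metrics without labels, and `go_goroutines` is an assumed metric name (the Prometheus Go client exports it by default, but this guide does not list the metrics the server actually exposes).

```shell
# metric_value NAME
# Reads Prometheus text exposition format on stdin and prints the value of
# the first sample whose name is exactly NAME (unlabelled metrics only).
# Exits non-zero if no such sample is found.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; found = 1; exit } END { exit !found }'
}

# Usage (assumes the API listens on localhost:9100):
#   curl -s http://localhost:9100/metrics | metric_value go_goroutines
```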

Logging

  • Structured JSON logging
  • Log levels: DEBUG, INFO, WARN, ERROR
  • Centralized logging via ELK stack

Security

TLS Configuration

# Generate self-signed cert (development)
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes

# Production - use Let's Encrypt
certbot certonly --standalone -d ml-experiments.example.com
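Certificates from either source expire, so it can help to check the remaining validity before a deployment. The sketch below wraps `openssl x509 -checkend`, which exits non-zero if the certificate expires within the given number of seconds. The helper name `cert_valid_for_days` and the 30-day default are illustrative assumptions.

```shell
# cert_valid_for_days CERT_FILE [DAYS]
# Succeeds if CERT_FILE is still valid DAYS days from now (default: 30).
cert_valid_for_days() {
  local cert="$1"
  local days="${2:-30}"
  # -checkend expects seconds; there are 86400 seconds in a day
  openssl x509 -checkend "$((days * 86400))" -noout -in "$cert" >/dev/null
}
```

A typical call would be `cert_valid_for_days cert.pem 14 || echo "certificate expires within 14 days"`.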

Network Security

  • Firewall rules for the API ports (9100, 9101) and the Redis port (6379); keep 6379 closed to external traffic
  • VPN access for internal services
  • API key authentication
  • Rate limiting

Performance Tuning

Resource Allocation

# Kubernetes container resource requests and limits
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"
    cpu: "1000m"

Scaling Strategies

  • Horizontal pod autoscaling
  • Redis clustering
  • Load balancing
  • CDN for static assets

Backup & Recovery

Data Backup

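# Back up Redis data (BGSAVE is asynchronous, so wait for it to finish before copying)
docker-compose exec -T redis redis-cli BGSAVE
until docker-compose exec -T redis redis-cli INFO persistence | grep -q 'rdb_bgsave_in_progress:0'; do sleep 1; done
docker cp "$(docker-compose ps -q redis)":/data/dump.rdb ./redis-backup.rdb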

# Back up the Redis data volume
docker run --rm -v ml-experiments_redis_data:/data -v "$(pwd)":/backup alpine tar czf /backup/redis-backup.tar.gz -C /data .

Disaster Recovery

  1. Restore Redis data
  2. Restart services
  3. Verify experiment metadata
  4. Test API endpoints
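The first two steps can be scripted roughly as follows. This is a sketch under several assumptions: the Compose service is named `redis` and keeps its data in `/data` (as in the Data Backup commands), the backup is the `redis-backup.rdb` file produced there, and Redis is not using append-only-file persistence (with AOF enabled, Redis loads the AOF instead of `dump.rdb`). The `COMPOSE` and `DOCKER` variables and the `restore_redis` name are hypothetical conveniences.

```shell
# Commands are overridable, e.g. COMPOSE="docker compose" for Compose v2.
COMPOSE="${COMPOSE:-docker-compose}"
DOCKER="${DOCKER:-docker}"

# restore_redis [BACKUP_FILE]
# Replaces the Redis dump with BACKUP_FILE and restarts the redis service.
restore_redis() {
  local backup="${1:-./redis-backup.rdb}"
  local cid
  if [ ! -f "$backup" ]; then
    echo "backup not found: $backup" >&2
    return 1
  fi
  # Look up the container ID while Redis is still running.
  cid="$($COMPOSE ps -q redis)"
  if [ -z "$cid" ]; then
    echo "redis container not found" >&2
    return 1
  fi
  # Stop Redis first so it cannot overwrite the restored dump on shutdown.
  $COMPOSE stop redis || return 1
  $DOCKER cp "$backup" "$cid:/data/dump.rdb" || return 1
  $COMPOSE start redis || return 1
  # Expect PONG once Redis has loaded the dump.
  $COMPOSE exec -T redis redis-cli ping
}
```

Run it as `restore_redis ./redis-backup.rdb`. On large datasets the final ping may need a short retry while Redis is still loading, after which steps 3 and 4 can proceed.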

Troubleshooting

Common Issues

API Server Not Starting

# Check logs
docker-compose logs api-server

# Check configuration
cat configs/config-local.yaml

# Check Redis connection
docker-compose exec redis redis-cli ping

WebSocket Connection Issues

# Test WebSocket
wscat -c ws://localhost:9100/ws

# Check TLS
openssl s_client -connect localhost:9101 -servername localhost

Performance Issues

# Check resource usage
docker-compose exec api-server ps aux

# Check Redis memory
docker-compose exec redis redis-cli info memory

Debug Mode

# Enable debug logging
export LOG_LEVEL=debug
./bin/api-server -config configs/config-local.yaml

CI/CD Integration

GitHub Actions

  • Automated testing on PR
  • Multi-platform builds
  • Security scanning
  • Automatic releases

Deployment Pipeline

  1. Code commit → GitHub
  2. CI/CD pipeline triggers
  3. Build and test
  4. Security scan
  5. Deploy to staging
  6. Run integration tests
  7. Deploy to production
  8. Post-deployment verification
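The ordering matters because each stage gates the next: a failed security scan or integration test should stop the rollout before production. The sketch below shows that gating logic in shell. The stage function names are hypothetical placeholders, and a real pipeline would typically express this in the CI system's own configuration.

```shell
# run_pipeline STAGE...
# Runs each stage (a command or function name) in order and stops at the
# first failure, so later stages never run after a failed one.
run_pipeline() {
  local stage
  for stage in "$@"; do
    echo "==> $stage"
    if ! "$stage"; then
      echo "stage failed: $stage" >&2
      return 1
    fi
  done
  echo "pipeline succeeded"
}

# Hypothetical stages; replace the bodies with real commands.
build_and_test()    { echo "build and test"; }
security_scan()     { echo "security scan"; }
deploy_staging()    { echo "deploy to staging"; }
integration_tests() { echo "integration tests"; }
deploy_production() { echo "deploy to production"; }
verify_deployment() { echo "post-deployment verification"; }
```

A full run would be `run_pipeline build_and_test security_scan deploy_staging integration_tests deploy_production verify_deployment`.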

Support

For deployment issues:

  1. Check this guide
  2. Review logs
  3. Check GitHub Issues
  4. Contact maintainers