Jeremie Fraeys 605829dfc3 Update documentation for new features and fix outdated references

- Update deployment.md to reference new deployments/ directory structure
- Update CLI reference with new multi-user authentication system
- Add roles and permissions examples to configuration
- Fix docker-compose paths in testing documentation
- Remove references to non-existent docker-compose.test.yml
- Update troubleshooting with correct test commands
- Remove misplaced README files from test directories

2025-12-06 13:46:40 -05:00

ML Experiment Manager - Deployment Guide

Overview

The ML Experiment Manager supports several deployment methods, from local development and homelab Docker setups to cloud platforms (AWS ECS, Google Cloud Run).

Quick Start

Docker Compose (Recommended for Development)

# Clone repository
git clone https://github.com/your-org/fetch_ml.git
cd fetch_ml

# Start all services (testing only)
docker-compose up -d

# Check status
docker-compose ps

# View logs
docker-compose logs -f api-server

Access the API at http://localhost:9100

Deployment Options

1. Local Development

Prerequisites

Container Runtimes:

Docker Compose (optional): for testing and development only
Podman: for production experiment execution

Build tools and services:

Go 1.25+
Zig 0.15.2
Redis 7+

Manual Setup

# Start Redis
redis-server

# Build and run Go server
go build -o bin/api-server ./cmd/api-server
./bin/api-server -config configs/config-local.yaml

# Build Zig CLI
cd cli
zig build prod
./zig-out/bin/ml --help

2. Docker Deployment

Build Image

docker build -t ml-experiment-manager:latest .

Run Container

docker run -d \
  --name ml-api \
  -p 9100:9100 \
  -p 9101:9101 \
  -v "$(pwd)/configs:/app/configs:ro" \
  -v experiment-data:/data/ml-experiments \
  ml-experiment-manager:latest

Docker Compose

# Development mode (uses root docker-compose.yml)
docker-compose up -d

# Production deployment
docker-compose -f deployments/docker-compose.prod.yml up -d

# Secure homelab deployment
docker-compose -f deployments/docker-compose.homelab-secure.yml up -d

# With custom configuration (--env-file is a global option and goes before the subcommand)
docker-compose --env-file .env.prod -f deployments/docker-compose.prod.yml up

3. Homelab Setup

# Use the simple setup script
./setup.sh

# Or manually with Docker Compose (testing only)
docker-compose up -d

4. Cloud Deployment

AWS ECS

# Build and push to ECR
aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY
docker build -t $ECR_REGISTRY/ml-experiment-manager:latest .
docker push $ECR_REGISTRY/ml-experiment-manager:latest

# Deploy with ECS CLI
ecs-cli compose --project-name ml-experiment-manager up

Google Cloud Run

# Build and push
gcloud builds submit --tag gcr.io/$PROJECT_ID/ml-experiment-manager

# Deploy
gcloud run deploy ml-experiment-manager \
  --image gcr.io/$PROJECT_ID/ml-experiment-manager \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated

Configuration

Configuration File

# configs/config-local.yaml
base_path: "/data/ml-experiments"
auth:
  enabled: true
  api_keys:
    - "your-production-api-key"
server:
  address: ":9100"
  tls:
    enabled: true
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"

Docker Compose Environment

# docker-compose.yml
version: '3.8'
services:
  api-server:
    environment:
      - REDIS_URL=redis://redis:6379
      - LOG_LEVEL=info
    volumes:
      - ./configs:/configs:ro
      - ./data:/data/experiments

Monitoring & Logging

Health Checks

HTTP: GET /health
WebSocket: Connection test
Redis: Ping check
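
The HTTP and Redis checks above can be scripted for a quick smoke test. The following is a minimal sketch, not part of the project: the function names are hypothetical, the default base URL is the one from Quick Start, and it assumes `curl` and `redis-cli` are installed on the host and that Redis is reachable on its default port 6379.

```shell
#!/bin/sh
# Hypothetical helpers for probing the health checks listed above.

# Build the health URL from a base URL (default taken from Quick Start).
health_url() {
  printf '%s/health\n' "${1:-http://localhost:9100}"
}

# Return 0 if GET /health answers with a successful HTTP status.
check_api_health() {
  curl -fsS --max-time 5 -o /dev/null "$(health_url "$1")"
}

# Return 0 if Redis answers PING with PONG.
# $1 = host (default localhost), $2 = port (default 6379)
check_redis_health() {
  [ "$(redis-cli -h "${1:-localhost}" -p "${2:-6379}" ping 2>/dev/null)" = "PONG" ]
}
```

For example, `check_api_health && check_redis_health && echo healthy` reports success only when both probes pass. If Redis is not published to the host, the `docker-compose exec redis redis-cli ping` command from Troubleshooting is the equivalent check.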

Metrics

Prometheus metrics at /metrics
Custom application metrics
Container resource usage
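
To spot-check a single metric without a Prometheus server, the text exposition format can be filtered on the command line. This is a minimal sketch: `metric_value` is a hypothetical helper, and `go_goroutines` is only an example name that Go's Prometheus client exports by default. Whether this server exposes it, and on which port `/metrics` is served, are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# print the value of the first unlabelled sample with the given name.
# Comment lines (# HELP, # TYPE) and labelled samples are skipped because
# their first field never equals the bare metric name.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Example (assumes /metrics is served on the API port from Quick Start):
#   curl -fsS http://localhost:9100/metrics | metric_value go_goroutines
```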

Logging

Structured JSON logging
Log levels: DEBUG, INFO, WARN, ERROR
Centralized logging via ELK stack

Security

TLS Configuration

# Generate self-signed cert (development)
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes

# Production - use Let's Encrypt
certbot certonly --standalone -d ml-experiments.example.com

Network Security

Firewall rules (ports 9100, 9101, 6379)
VPN access for internal services
API key authentication
Rate limiting

Performance Tuning

Resource Allocation

FetchML now centralizes pacing and container limits under a resources section in every server/worker config. Example for a homelab box:

resources:
  max_workers: 1
  desired_rps_per_worker: 2   # conservative pacing per worker
  podman_cpus: "2"            # Podman --cpus, keeps host responsive
  podman_memory: "8g"         # Podman --memory, isolates experiment installs

For high-end machines (e.g., M2 Ultra, 16 performance cores / 64 GB RAM), start with:

resources:
  max_workers: 2              # two concurrent experiments
  desired_rps_per_worker: 5   # faster job submission
  podman_cpus: "8"
  podman_memory: "32g"

Adjust upward only if experiments stay GPU-bound; keeping Podman limits in place ensures users can install packages inside the container without jeopardizing the host.
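
To see what these settings imply, the sketch below turns the `podman_cpus` and `podman_memory` values into the Podman flags named in the config comments and computes the worst-case CPU budget when every worker is busy. The function names are hypothetical, whole-number CPU counts and one container per worker are assumed, and how FetchML actually invokes Podman is not shown in this guide.

```shell
#!/bin/sh
# Print the Podman limit flags for one experiment container.
# $1 = podman_cpus, $2 = podman_memory
podman_limit_flags() {
  printf '%s\n' "--cpus=$1 --memory=$2"
}

# Worst-case CPU budget: every worker runs one container at its CPU limit.
# $1 = max_workers, $2 = podman_cpus (whole number)
total_cpu_budget() {
  echo $(( $1 * $2 ))
}

podman_limit_flags 2 8g   # homelab values; prints: --cpus=2 --memory=8g
total_cpu_budget 2 8      # high-end values; prints: 16
```

With the high-end values, two workers can also claim up to 2 × 32g = 64g of memory, which is all of the RAM in that example, so lower `podman_memory` if other services share the machine.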

Scaling Strategies

Horizontal pod autoscaling
Redis clustering
Load balancing
CDN for static assets

Backup & Recovery

Data Backup

# Back up experiment data
# BGSAVE runs in the background; confirm it has finished (LASTSAVE advances) before copying
docker-compose exec redis redis-cli BGSAVE
docker cp "$(docker-compose ps -q redis)":/data/dump.rdb ./redis-backup.rdb

# Back up the data volume
docker run --rm -v ml-experiments_redis_data:/data -v "$(pwd)":/backup alpine tar czf /backup/redis-backup.tar.gz -C /data .

Disaster Recovery

Restore Redis data
Restart services
Verify experiment metadata
Test API endpoints
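
The recovery steps above can be sketched as a script. This is an illustration under stated assumptions, not a project script: the Compose services are assumed to be named `redis` and `api-server`, Redis is assumed to use RDB persistence without AOF, and the dump is assumed to come from the `docker cp` command in Data Backup. The helper names and the `DRY_RUN` switch are hypothetical.

```shell
#!/bin/sh
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1 = path to the RDB backup (default ./redis-backup.rdb)
restore_redis() (
  backup="${1:-./redis-backup.rdb}"
  # Resolve the container ID before stopping the service.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    cid="REDIS_CONTAINER_ID"
  else
    cid="$(docker-compose ps -q redis)" || exit 1
    [ -n "$cid" ] || exit 1
  fi
  # Stop Redis first so a shutdown save cannot overwrite the restored dump.
  run docker-compose stop redis &&
  # Step 1: restore Redis data.
  run docker cp "$backup" "$cid:/data/dump.rdb" &&
  # Step 2: restart services.
  run docker-compose start redis &&
  run docker-compose restart api-server &&
  # Step 3: rough metadata check (key count only).
  run docker-compose exec -T redis redis-cli dbsize &&
  # Step 4: test an API endpoint.
  run curl -fsS http://localhost:9100/health
)
```

Running `DRY_RUN=1 restore_redis ./redis-backup.rdb` prints the plan without touching any container. If AOF is enabled, Redis loads the append-only file at startup instead of `dump.rdb`, so this sketch does not apply unchanged.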

Troubleshooting

Common Issues

API Server Not Starting

# Check logs
docker-compose logs api-server

# Check configuration
cat configs/config-local.yaml

# Check Redis connection
docker-compose exec redis redis-cli ping

WebSocket Connection Issues

# Test WebSocket
wscat -c ws://localhost:9100/ws

# Check TLS
openssl s_client -connect localhost:9101 -servername localhost

Performance Issues

# Check resource usage
docker-compose exec api-server ps aux

# Check Redis memory
docker-compose exec redis redis-cli info memory

Debug Mode

# Enable debug logging
export LOG_LEVEL=debug
./bin/api-server -config configs/config-local.yaml

CI/CD Integration

GitHub Actions

Automated testing on PR
Multi-platform builds
Security scanning
Automatic releases

Deployment Pipeline

Code commit → GitHub
CI/CD pipeline triggers
Build and test
Security scan
Deploy to staging
Run integration tests
Deploy to production
Post-deployment verification

Support

For deployment issues:

Check this guide
Review logs
Check GitHub Issues
Contact maintainers
