# ML Experiment Manager - Deployment Guide

## Overview

The ML Experiment Manager supports multiple deployment methods, from local development to homelab Docker setups.

## Quick Start

### Docker Compose (Recommended for Development)

```bash
# Clone repository
git clone https://github.com/your-org/fetch_ml.git
cd fetch_ml

# Start all services (development/testing only)
docker-compose up -d

# Check status
docker-compose ps

# View logs
docker-compose logs -f api-server
```

Access the API at `http://localhost:9100`.

## Deployment Options

### 1. Local Development

#### Prerequisites

**Container Runtimes:**

- **Docker Compose**: For testing and development only
- **Podman**: For production experiment execution

**Tooling:**

- Go 1.25+
- Zig 0.15.2
- Redis 7+
- Docker & Docker Compose (optional)

#### Manual Setup

```bash
# Start Redis
redis-server

# Build and run the Go server
go build -o bin/api-server ./cmd/api-server
./bin/api-server -config configs/config-local.yaml

# Build the Zig CLI
cd cli
zig build prod
./zig-out/bin/ml --help
```

### 2. Docker Deployment

#### Build Image

```bash
docker build -t ml-experiment-manager:latest .
```

#### Run Container

```bash
docker run -d \
  --name ml-api \
  -p 9100:9100 \
  -p 9101:9101 \
  -v $(pwd)/configs:/app/configs:ro \
  -v experiment-data:/data/ml-experiments \
  ml-experiment-manager:latest
```

#### Docker Compose

```bash
# Development mode (uses root docker-compose.yml)
docker-compose up -d

# Production deployment
docker-compose -f deployments/docker-compose.prod.yml up -d

# Secure homelab deployment
docker-compose -f deployments/docker-compose.homelab-secure.yml up -d

# With custom configuration (--env-file is a top-level flag, before the subcommand)
docker-compose --env-file .env.prod -f deployments/docker-compose.prod.yml up -d
```

### 3. Homelab Setup

```bash
# Use the simple setup script
./setup.sh

# Or manually with Docker Compose (testing only)
docker-compose up -d
```
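For an always-on homelab box, it can help to layer a restart policy and a container health check on top of the stock compose file. A minimal sketch: the `api-server` service name and the `/health` endpoint are taken from elsewhere in this guide, but the override values themselves are assumptions, and the healthcheck assumes `curl` exists in the image:

```yaml
# docker-compose.override.yml -- sketch; Compose merges this file automatically
# when it sits next to docker-compose.yml.
services:
  api-server:
    restart: unless-stopped
    healthcheck:
      # Assumes curl is available inside the container image.
      test: ["CMD", "curl", "-f", "http://localhost:9100/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```

Because Compose picks up `docker-compose.override.yml` by default, the plain `docker-compose up -d` commands above keep working unchanged.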
### 4. Cloud Deployment

#### AWS ECS

```bash
# Build and push to ECR
aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY
docker build -t $ECR_REGISTRY/ml-experiment-manager:latest .
docker push $ECR_REGISTRY/ml-experiment-manager:latest

# Deploy with ECS CLI
ecs-cli compose --project-name ml-experiment-manager up
```

#### Google Cloud Run

```bash
# Build and push
gcloud builds submit --tag gcr.io/$PROJECT_ID/ml-experiment-manager

# Deploy
gcloud run deploy ml-experiment-manager \
  --image gcr.io/$PROJECT_ID/ml-experiment-manager \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated
```

## Configuration

### Configuration File

```yaml
# configs/config-local.yaml
base_path: "/data/ml-experiments"
auth:
  enabled: true
  api_keys:
    - "your-production-api-key"
server:
  address: ":9100"
  tls:
    enabled: true
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"
```

### Docker Compose Environment

```yaml
# docker-compose.yml
version: '3.8'
services:
  api-server:
    environment:
      - REDIS_URL=redis://redis:6379
      - LOG_LEVEL=info
    volumes:
      - ./configs:/configs:ro
      - ./data:/data/experiments
```

## Monitoring & Logging

### Health Checks

- HTTP: `GET /health`
- WebSocket: Connection test
- Redis: Ping check

### Metrics

- Prometheus metrics at `/metrics`
- Custom application metrics
- Container resource usage

### Logging

- Structured JSON logging
- Log levels: DEBUG, INFO, WARN, ERROR
- Centralized logging via ELK stack

## Security

### TLS Configuration

```bash
# Generate self-signed cert (development)
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365 -nodes

# Production - use Let's Encrypt
certbot certonly --standalone -d ml-experiments.example.com
```

### Network Security

- Firewall rules (ports 9100, 9101, 6379)
- VPN access for internal services
- API key authentication
- Rate limiting

## Performance Tuning

### Resource Allocation

FetchML now centralizes pacing and container limits under a `resources` section in
every server/worker config. Example for a homelab box:

```yaml
resources:
  max_workers: 1
  desired_rps_per_worker: 2   # conservative pacing per worker
  podman_cpus: "2"            # Podman --cpus, keeps host responsive
  podman_memory: "8g"         # Podman --memory, isolates experiment installs
```

For high-end machines (e.g., M2 Ultra, 18 performance cores / 64 GB RAM), start with:

```yaml
resources:
  max_workers: 2              # two concurrent experiments
  desired_rps_per_worker: 5   # faster job submission
  podman_cpus: "8"
  podman_memory: "32g"
```

Adjust upward only if experiments stay GPU-bound; keeping Podman limits in place ensures users can install packages inside the container without jeopardizing the host.

### Scaling Strategies

- Horizontal pod autoscaling
- Redis clustering
- Load balancing
- CDN for static assets

## Backup & Recovery

### Data Backup

```bash
# Backup experiment data
docker-compose exec redis redis-cli BGSAVE
docker cp $(docker-compose ps -q redis):/data/dump.rdb ./redis-backup.rdb

# Backup data volume
docker run --rm -v ml-experiments_redis_data:/data -v $(pwd):/backup alpine tar czf /backup/redis-backup.tar.gz -C /data .
```

### Disaster Recovery

1. Restore Redis data
2. Restart services
3. Verify experiment metadata
4. Test API endpoints
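Before step 1, it is worth sanity-checking the snapshot file: Redis RDB dumps begin with the ASCII magic `REDIS` followed by a version number. A small sketch, with the filename matching the backup example above; the `printf` line creates a stand-in file only so the sketch is self-contained, so point `backup` at your real dump instead:

```shell
# Verify an RDB snapshot looks valid before copying it back into the container.
backup=redis-backup.rdb

# Stand-in snapshot for illustration only; use your real ./redis-backup.rdb.
printf 'REDIS0011' > "$backup"

# RDB files start with the 5-byte magic "REDIS".
magic=$(head -c 5 "$backup")
if [ "$magic" = "REDIS" ]; then
  echo "ok: $backup has a valid RDB header"
else
  echo "error: $backup is not an RDB snapshot" >&2
  exit 1
fi
```

This catches truncated or mislabeled backups before they overwrite the live `dump.rdb`.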
## Troubleshooting

### Common Issues

#### API Server Not Starting

```bash
# Check logs
docker-compose logs api-server

# Check configuration
cat configs/config-local.yaml

# Check Redis connection
docker-compose exec redis redis-cli ping
```

#### WebSocket Connection Issues

```bash
# Test WebSocket
wscat -c ws://localhost:9100/ws

# Check TLS
openssl s_client -connect localhost:9101 -servername localhost
```

#### Performance Issues

```bash
# Check resource usage
docker-compose exec api-server ps aux

# Check Redis memory
docker-compose exec redis redis-cli info memory
```

### Debug Mode

```bash
# Enable debug logging
export LOG_LEVEL=debug
./bin/api-server -config configs/config-local.yaml
```

## CI/CD Integration

### GitHub Actions

- Automated testing on PR
- Multi-platform builds
- Security scanning
- Automatic releases

### Deployment Pipeline

1. Code commit → GitHub
2. CI/CD pipeline triggers
3. Build and test
4. Security scan
5. Deploy to staging
6. Run integration tests
7. Deploy to production
8. Post-deployment verification

## Support

For deployment issues:

1. Check this guide
2. Review logs
3. Check GitHub Issues
4. Contact maintainers
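When escalating to the maintainers, bundling the basic checks from the Troubleshooting section into one artifact saves a round trip. A sketch, assuming the `api-server` and `redis` service names from the compose examples in this guide:

```shell
# Collect basic diagnostics into one file to attach to a GitHub issue.
# Service names (api-server, redis) are the ones used in this guide's
# compose examples; adjust if your deployment differs.
{
  echo "== docker-compose version =="
  docker-compose version 2>&1
  echo "== api-server logs (last 50 lines) =="
  docker-compose logs --tail=50 api-server 2>&1
  echo "== redis ping =="
  docker-compose exec -T redis redis-cli ping 2>&1
} > diagnostics.txt
echo "wrote diagnostics.txt"
```

Errors from any individual check are captured in the file rather than aborting the script, so a partial report is still produced on a broken deployment.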