# ML Experiment Manager - Deployment Guide

## Overview

The ML Experiment Manager supports multiple deployment methods, from local development to production setups with integrated monitoring.

## TLS / WSS Policy

- The Zig CLI currently supports `ws://` only (native `wss://` is not implemented).
- For production, use a reverse proxy (Caddy) to terminate TLS/WSS (Approach A) and keep the API server on internal HTTP/WS.
- If you need remote CLI access, use one of:
  - an SSH tunnel to the internal `ws://` endpoint
  - a private network/VPN so `ws://` is not exposed to the public Internet
- When `server.tls.enabled: false`, the API server still runs on plain HTTP/WS internally. In development, access it via Caddy at `http://localhost:8080/health`.

## Data Directories

- `base_path` is where experiment directories live.
- `data_dir` is used for dataset/snapshot materialization and integrity validation.
- If you want `ml validate` to check snapshots/datasets, you must mount `data_dir` into the API server container.

## Quick Start

### Development Deployment with Monitoring

```bash
# Start development stack with monitoring
make dev-up

# Alternative: use the deployment script
cd deployments && make dev-up

# Check status
make dev-status
```

**Access Services:**

- **API Server (via Caddy)**: http://localhost:8080
- **API Server (via Caddy + internal TLS)**: https://localhost:8443
- **Grafana**: http://localhost:3000 (admin/admin123)
- **Prometheus**: http://localhost:9090
- **Loki**: http://localhost:3100

## Deployment Options
### 1. Development Environment

**Purpose**: Local development with full monitoring stack

**Services**: API Server, Redis, Prometheus, Grafana, Loki, Promtail

**Configuration**:

```bash
# Using Makefile (recommended)
make dev-up
make dev-down
make dev-status

# Using deployment script
cd deployments
make dev-up
make dev-down
make dev-status
```

**Features**:

- Auto-provisioned Grafana dashboards
- Real-time metrics and logs
- Hot reload for development
- Local data persistence

### 2. Production Environment

**Purpose**: Production deployment with security

**Services**: API Server, Worker, Redis with authentication

**Configuration**:

```bash
cd deployments
make prod-up
make prod-down
make prod-status
```

**Features**:

- Secure Redis with authentication
- TLS/WSS via reverse proxy termination (Caddy)
- Production-optimized configurations
- Health checks and restart policies

### 3. Homelab Secure Environment

**Purpose**: Secure homelab deployment

**Services**: API Server, Redis, Caddy reverse proxy

**Configuration**:

```bash
cd deployments
make homelab-up
make homelab-down
make homelab-status
```

**Features**:

- Caddy reverse proxy
- TLS termination
- Network isolation
- External networks

## Environment Setup

### Development Environment

```bash
# Copy example environment
cp deployments/env.dev.example .env

# Edit as needed
vim .env
```

**Key Variables**:

- `LOG_LEVEL=info`
- `GRAFANA_ADMIN_PASSWORD=admin123`

### Production Environment

```bash
# Copy example environment
cp deployments/env.prod.example .env

# Edit with production values
vim .env
```

**Key Variables**:

- `REDIS_PASSWORD=your-secure-password`
- `JWT_SECRET=your-jwt-secret`
- `SSL_CERT_PATH=/path/to/cert`

## Monitoring Setup

### Automatic Configuration

Monitoring dashboards and datasources are auto-provisioned:

```bash
# Set up monitoring provisioning (Grafana datasources/providers)
python3 scripts/setup_monitoring.py

# Start services (includes monitoring)
make dev-up
```

### Available Dashboards
1. **Load Test Performance**: Request rates, response times, error rates
2. **System Health**: Service status, memory, CPU usage
3. **Log Analysis**: Error logs, service logs, log aggregation

### Manual Configuration

If auto-provisioning fails:

1. **Access Grafana**: http://localhost:3000
2. **Add Data Sources**:
   - Prometheus: http://prometheus:9090
   - Loki: http://loki:3100
3. **Import Dashboards**: From `monitoring/grafana/dashboards/`

## Testing Procedures

### Pre-Deployment Testing

```bash
# Run unit tests
make test-unit

# Run integration tests
make test-integration

# Run full test suite
make test

# Run with coverage
make test-coverage
```

### Load Testing

```bash
# Run load tests
make load-test

# Run specific load scenarios
make benchmark-local

# Track performance over time
./scripts/track_performance.sh
```

### Health Checks

```bash
# Check service health
curl -f http://localhost:8080/health

# Check monitoring services
curl -f http://localhost:3000/api/health
curl -f 'http://localhost:9090/api/v1/query?query=up'
curl -f http://localhost:3100/ready
```

## Troubleshooting

### Common Issues

**Port Conflicts**:

```bash
# Check port usage
lsof -i :9101
lsof -i :3000
lsof -i :9090

# Kill a conflicting process by PID
kill -9 <PID>
```

**Container Issues**:

```bash
# View container logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana

# Restart services
make dev-restart

# Clean restart
make dev-down && make dev-up
```

**Monitoring Issues**:

```bash
# Re-run monitoring configuration setup
python3 scripts/setup_monitoring.py

# Restart Grafana only
docker restart ml-experiments-grafana
```

### Performance Issues

**High Memory Usage**:

- Check Grafana dashboards for memory metrics
- Adjust Prometheus retention in `prometheus.yml`
- Monitor log retention in `loki-config.yml`

**Slow Response Times**:

- Check network connectivity between containers
- Verify Redis performance
- Review API server logs for bottlenecks

## Maintenance

### Regular Tasks

**Weekly**:

- Check Grafana
dashboards for anomalies
- Review log files for errors
- Verify backup procedures

**Monthly**:

- Update Docker images
- Clean up old Docker volumes
- Review and rotate secrets

### Backup Procedures

**Data Backup**:

```bash
# Back up application data
docker run --rm -v ml_data:/data -v $(pwd):/backup alpine tar czf /backup/data-backup.tar.gz -C /data .

# Back up monitoring data
docker run --rm -v prometheus_data:/data -v $(pwd):/backup alpine tar czf /backup/prometheus-backup.tar.gz -C /data .
```

**Configuration Backup**:

```bash
# Back up configurations
tar czf config-backup.tar.gz monitoring/ deployments/ configs/
```

## Security Considerations

### Development Environment

- Change the default Grafana password
- Use environment variables for secrets
- Monitor container logs for security events

### Production Environment

- Enable Redis authentication
- Use SSL/TLS certificates
- Implement network segmentation
- Apply regular security updates
- Monitor access logs

## Performance Optimization

### Resource Limits

**Development**:

```yaml
# docker-compose.dev.yml
services:
  api-server:
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '0.5'
```

**Production**:

```yaml
# docker-compose.prod.yml
services:
  api-server:
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '2.0'
```

### Monitoring Optimization

**Prometheus**:

- Adjust scrape intervals
- Configure retention periods
- Use recording rules for frequent queries

**Loki**:

- Configure log retention
- Use log sampling for high-volume sources
- Optimize label cardinality

## Non-Docker Production (systemd)

This project can be run in production without Docker. The recommended model is:

- Run `api-server` and `worker` as systemd services.
- Terminate TLS/WSS at Caddy and keep the API server on internal plain HTTP/WS.

The unit templates below are copy-paste friendly, but you must adjust paths, users, and config locations to your environment.
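Installing the templates generally means creating the service user, putting binaries and unit files in place, and enabling the units. The sketch below is a dry run, not project tooling: the `run` helper only prints each step (so nothing here needs root), and the exact `useradd`/`install` flags are illustrative assumptions — adapt them to your distribution before executing for real.

```shell
#!/usr/bin/env sh
# Dry-run sketch of installing the unit templates in this guide.
# The run() helper prints each step instead of executing it; replace
# 'echo' with actual execution once the steps have been reviewed.
run() { echo "+ $*"; }

# Create the dedicated service user (paths match the unit templates)
run useradd --system --home /var/lib/fetchml --shell /usr/sbin/nologin fetchml

# Create state, log, and config directories
run install -d -o fetchml -g fetchml /var/lib/fetchml /var/log/fetchml /etc/fetchml

# Install the unit files and activate the services
run install -m 644 fetchml-api.service fetchml-worker.service /etc/systemd/system/
run systemctl daemon-reload
run systemctl enable --now fetchml-api.service fetchml-worker.service
```

After a real install, `systemctl status fetchml-api fetchml-worker` and `journalctl -u fetchml-api` are the usual places to confirm the services came up.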
### `fetchml-api.service`

```ini
[Unit]
Description=FetchML API Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml
Environment=LOG_LEVEL=info
ExecStart=/usr/local/bin/api-server -config /etc/fetchml/api.yaml
Restart=on-failure
RestartSec=2
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/fetchml /var/log/fetchml

[Install]
WantedBy=multi-user.target
```

### `fetchml-worker.service`

```ini
[Unit]
Description=FetchML Worker
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=fetchml
Group=fetchml
WorkingDirectory=/var/lib/fetchml
Environment=LOG_LEVEL=info
ExecStart=/usr/local/bin/worker -config /etc/fetchml/worker.yaml
Restart=on-failure
RestartSec=2
NoNewPrivileges=true
PrivateTmp=true

[Install]
WantedBy=multi-user.target
```

### Optional: `caddy.service`

Most distros ship a Caddy systemd unit. If you do not have one available, you can use this template.

```ini
[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
NoNewPrivileges=true
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

## Migration Guide

### From Development to Production

1. **Export Data**:

   ```bash
   docker exec ml-data redis-cli BGSAVE
   docker cp ml-data:/data/dump.rdb ./redis-backup.rdb
   ```

2. **Update Configuration**:

   ```bash
   cp deployments/env.prod.example .env
   # Edit with production values
   ```

3. **Deploy Production**:

   ```bash
   cd deployments
   make prod-up
   ```
4. **Import Data**:

   ```bash
   docker cp ./redis-backup.rdb ml-prod-redis:/data/dump.rdb
   docker restart ml-prod-redis
   ```

## Support

For deployment issues:

1. Check the troubleshooting section above
2. Review container logs
3. Verify network connectivity
4. Check resource usage in Grafana
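The checklist above can be bundled into a quick triage pass. This is a sketch rather than project tooling: the `check` helper is hypothetical, and the URLs and container name are the ones used earlier in this guide — adjust for your deployment.

```shell
#!/usr/bin/env sh
# Quick triage: hit each health endpoint and report pass/fail, then pull
# recent API server logs. Failures print FAIL instead of aborting the run.
check() {
  if curl -fsS --max-time 5 "$1" >/dev/null 2>&1; then
    echo "ok $1"
  else
    echo "FAIL $1"
  fi
}

check http://localhost:8080/health                     # API server (via Caddy)
check http://localhost:3000/api/health                 # Grafana
check 'http://localhost:9090/api/v1/query?query=up'    # Prometheus
check http://localhost:3100/ready                      # Loki

# Recent container logs for the API server
docker logs --tail 50 ml-experiments-api 2>/dev/null \
  || echo "FAIL docker logs ml-experiments-api"
```

Any `FAIL` line points at the service to investigate first; resource usage is then best inspected in the Grafana dashboards.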