# Docker Compose Deployments
This directory contains Docker Compose configurations for different deployment environments.
## Environment Configurations
### Development (`docker-compose.dev.yml`)
- Full development stack with monitoring
- Includes: API, Worker, Redis, MinIO (snapshots), Prometheus, Grafana, Loki, Promtail
- Optimized for local development and testing
- **Usage**: `docker-compose -f deployments/docker-compose.dev.yml up -d`
### Homelab - Secure (`docker-compose.homelab-secure.yml`)
- Secure homelab deployment with authentication and a Caddy reverse proxy
- TLS is terminated at the reverse proxy (Approach A)
- Includes: API, Redis (password protected), Caddy reverse proxy
- **Usage**: `docker-compose -f deployments/docker-compose.homelab-secure.yml up -d`
### Production (`docker-compose.prod.yml`)
- Production deployment configuration
- Optimized for performance and security
- External services assumed (Redis, monitoring)
- **Usage**: `docker-compose -f deployments/docker-compose.prod.yml up -d`
Note: `docker-compose.prod.yml` is a reproducible staging/testing harness. Real production deployments do not require Docker; you can run the Go services directly (systemd) and use Caddy for TLS/WSS termination.
## TLS / WSS Policy
- The Zig CLI currently supports `ws://` only (native `wss://` is not implemented).
- Production deployments terminate TLS/WSS at a reverse proxy (Caddy in `docker-compose.prod.yml`) and keep the API server on internal `ws://`.
- Homelab deployments terminate TLS/WSS at a reverse proxy (Caddy) and keep the API server on internal `ws://`.
- Health checks in compose files should use `http://localhost:9101/health` when `server.tls.enabled: false`.
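The proxy-termination pattern above can be sketched as a minimal Caddyfile. The hostname and the upstream service name (`api-server`) are placeholders — substitute your own; Caddy v2 obtains certificates automatically and proxies WebSocket upgrades without extra configuration, so the CLI's `ws://` traffic arrives at clients as `wss://`:

```caddyfile
# Sketch only: replace ml.example.com and api-server:9101 with your values.
ml.example.com {
    # Proxies HTTP and WebSocket traffic to the internal ws:// API server.
    reverse_proxy api-server:9101
}
```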
## Required Volume Mounts
- `base_path` (experiments) must be writable by the API server.
- `data_dir` should be mounted if you want snapshot/dataset integrity validation via `ml validate`.
For the default configs:
- `base_path`: `/data/experiments` (dev/homelab configs) or `/app/data/experiments` (prod configs)
- `data_dir`: `/data/active`
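As a sketch, the mounts might look like the following in a compose file, assuming host directories `./data/experiments` and `./data/active` (adjust to your layout):

```yaml
services:
  api-server:
    volumes:
      - ./data/experiments:/data/experiments   # base_path: must be writable by the API server
      - ./data/active:/data/active:ro          # data_dir: enables `ml validate` integrity checks
```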
## Quick Start
```bash
# Development (most common)
docker-compose -f deployments/docker-compose.dev.yml up -d
# Check status
docker-compose -f deployments/docker-compose.dev.yml ps
# View logs
docker-compose -f deployments/docker-compose.dev.yml logs -f api-server
# Stop services
docker-compose -f deployments/docker-compose.dev.yml down
```
## Dev: MinIO-backed snapshots (smoke test)
The dev compose file provisions a MinIO bucket and uploads a small example snapshot object at:
`s3://fetchml-snapshots/snapshots/snap-1.tar.gz`
To queue a task that forces the worker to pull the snapshot from MinIO:
1. Start the dev stack:
`docker-compose -f deployments/docker-compose.dev.yml up -d`
2. Read the `snapshot_sha256` printed by the init job:
`docker-compose -f deployments/docker-compose.dev.yml logs minio-init`
3. Queue a job using the snapshot fields:
`ml queue <job-name> --snapshot-id snap-1 --snapshot-sha256 <snapshot_sha256>`
## Smoke tests
- `make dev-smoke` runs the development stack smoke test.
- `make prod-smoke` runs a Docker-based staging smoke test for the production stack, using a localhost-only Caddy configuration.
Note: `ml queue` by itself will generate a random commit ID. For full provenance enforcement (manifest + dependency manifest), use `ml sync ./your-project --queue` so the server has real code + dependency files.
Examples:
- `ml queue train-mnist --priority 3 --snapshot-id snap-1 --snapshot-sha256 <snapshot_sha256>`
- `ml queue train-a train-b train-c --priority 5 --snapshot-id snap-1 --snapshot-sha256 <snapshot_sha256>`
## Environment Variables
Create a `.env` file in the project root:
```bash
# Grafana
GRAFANA_ADMIN_PASSWORD=your_secure_password
# API Configuration
LOG_LEVEL=info
# TLS (for secure deployments)
TLS_CERT_PATH=/app/ssl/cert.pem
TLS_KEY_PATH=/app/ssl/key.pem
```
## Service Ports
| Service | Development | Homelab | Production |
|---------|-------------|---------|------------|
| API Server | 9101 | 9101 | 9101 |
| Redis | 6379 | 6379 | - |
| Prometheus | 9090 | - | - |
| Grafana | 3000 | - | - |
| Loki | 3100 | - | - |
| JupyterLab | 8888* | 8888* | - |
| vLLM | 8000* | 8000* | - |
*Plugin service ports are dynamically allocated from the 8000-9000 range by the scheduler.
## Plugin Services
The deployment configurations include support for interactive ML services:
### Jupyter Notebook/Lab
- **Image**: `quay.io/jupyter/base-notebook:latest`
- **Security**: Trusted channels (conda-forge, defaults), blocked packages (HTTP clients)
- **Resources**: Configurable GPU/memory limits
- **Access**: Via scheduler-assigned port (8000-9000 range)
### vLLM Inference
- **Image**: `vllm/vllm-openai:latest`
- **Features**: OpenAI-compatible API, quantization support (AWQ, GPTQ, FP8)
- **Model Cache**: Configurable path for model storage
- **Resources**: Multi-GPU tensor parallelism support
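In normal operation the scheduler launches vLLM with these settings, but for reference, a standalone compose service exercising the same options might look like this sketch (the model name and cache path are placeholders):

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    # Placeholder model; --quantization and --tensor-parallel-size map to the
    # quantization and multi-GPU options described above.
    command: >
      --model TheBloke/Llama-2-7B-AWQ
      --quantization awq
      --tensor-parallel-size 1
      --download-dir /models
    volumes:
      - ./model-cache:/models        # configurable model cache path
    ports:
      - "8000:8000"                  # OpenAI-compatible API
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```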
## Scheduler GPU Quotas
The scheduler supports GPU quota management for plugin services:
- **Global Limit**: Total GPUs across all plugins
- **Per-User Limits**: GPU and service count per user
- **Per-Plugin Limits**: vLLM and Jupyter-specific limits
- **User Overrides**: Special permissions for admins/researchers
See `configs/scheduler/scheduler.yaml.example` for quota configuration.
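The quota hierarchy above might be expressed roughly as follows. This is an illustrative sketch only — the field names here are assumptions, so consult `configs/scheduler/scheduler.yaml.example` for the actual schema:

```yaml
# Illustrative only; field names may differ from the real schema.
plugin_gpu_quota:
  global_limit: 8          # total GPUs across all plugin services
  per_user:
    max_gpus: 2            # per-user GPU cap
    max_services: 3        # per-user running-service cap
  per_plugin:
    vllm:
      max_gpus: 4
    jupyter:
      max_gpus: 1
  user_overrides:
    alice:                 # e.g. an admin or researcher account
      max_gpus: 4
```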
## Monitoring
- **Development**: Full monitoring stack included
- **Homelab**: Basic monitoring (configurable)
- **Production**: External monitoring assumed
## Security Notes
- If you need HTTPS externally, terminate TLS at a reverse proxy
- API keys should be managed via environment variables
- Database credentials should use secrets management in production
- **HIPAA deployments**: Plugins are disabled by default for compliance