# Docker Compose Deployments

This directory contains Docker Compose configurations for different deployment environments.

## Environment Configurations

### Development (`docker-compose.dev.yml`)

- Full development stack with monitoring
- Includes: API, Worker, Redis, MinIO (snapshots), Prometheus, Grafana, Loki, Promtail
- Optimized for local development and testing
- **Usage**: `docker-compose -f deployments/docker-compose.dev.yml up -d`

### Homelab - Secure (`docker-compose.homelab-secure.yml`)

- Secure homelab deployment with authentication and a Caddy reverse proxy
- TLS is terminated at the reverse proxy (Approach A)
- Includes: API, Redis (password protected), Caddy reverse proxy
- **Usage**: `docker-compose -f deployments/docker-compose.homelab-secure.yml up -d`

### Production (`docker-compose.prod.yml`)

- Production deployment configuration
- Optimized for performance and security
- External services assumed (Redis, monitoring)
- **Usage**: `docker-compose -f deployments/docker-compose.prod.yml up -d`

Note: `docker-compose.prod.yml` is a reproducible staging/testing harness. Real production deployments do not require Docker; you can run the Go services directly (via systemd) and use Caddy for TLS/WSS termination.

## TLS / WSS Policy

- The Zig CLI currently supports `ws://` only (native `wss://` is not implemented).
- Production deployments terminate TLS/WSS at a reverse proxy (Caddy in `docker-compose.prod.yml`) and keep the API server on internal `ws://`.
- Homelab deployments terminate TLS/WSS at a reverse proxy (Caddy) and keep the API server on internal `ws://`.
- Health checks in compose files should use `http://localhost:9101/health` when `server.tls.enabled: false`.

## Required Volume Mounts

- `base_path` (experiments) must be writable by the API server.
- `data_dir` should be mounted if you want snapshot/dataset integrity validation via `ml validate`.
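The health-check and mount requirements above can be sketched as a compose fragment. This is a sketch only: the service name, image tag, and the presence of `curl` in the image are assumptions; the paths and port come from the defaults described in this document.

```yaml
services:
  api-server:                       # hypothetical service name
    image: fetchml/api:latest       # hypothetical image tag
    ports:
      - "9101:9101"
    volumes:
      - ./data/experiments:/data/experiments   # base_path: must be writable by the API server
      - ./data/active:/data/active             # data_dir: needed for `ml validate`
    healthcheck:
      # Plain HTTP, because server.tls.enabled is false and TLS terminates at the proxy
      test: ["CMD", "curl", "-f", "http://localhost:9101/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```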
For the default configs:

- `base_path`: `/data/experiments` (dev/homelab configs) or `/app/data/experiments` (prod configs)
- `data_dir`: `/data/active`

## Quick Start

```bash
# Development (most common)
docker-compose -f deployments/docker-compose.dev.yml up -d

# Check status
docker-compose -f deployments/docker-compose.dev.yml ps

# View logs
docker-compose -f deployments/docker-compose.dev.yml logs -f api-server

# Stop services
docker-compose -f deployments/docker-compose.dev.yml down
```

## Dev: MinIO-backed snapshots (smoke test)

The dev compose file provisions a MinIO bucket and uploads a small example snapshot object at `s3://fetchml-snapshots/snapshots/snap-1.tar.gz`.

To queue a task that forces the worker to pull the snapshot from MinIO:

1. Start the dev stack: `docker-compose -f deployments/docker-compose.dev.yml up -d`
2. Read the `snapshot_sha256` printed by the init job: `docker-compose -f deployments/docker-compose.dev.yml logs minio-init`
3. Queue a job using the snapshot fields: `ml queue --snapshot-id snap-1 --snapshot-sha256 <sha256>`

## Smoke tests

- `make dev-smoke` runs the development stack smoke test.
- `make prod-smoke` runs a Docker-based staging smoke test for the production stack, using a localhost-only Caddy configuration.

Note: `ml queue` by itself will generate a random commit ID. For full provenance enforcement (manifest + dependency manifest), use `ml sync ./your-project --queue` so the server has real code and dependency files.
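Step 2 reads the digest from the init job's logs; if you have a snapshot tarball locally, you can also compute the digest yourself and pass it to `--snapshot-sha256`. A minimal sketch (the file names below are stand-ins for illustration, not objects provisioned by the dev stack):

```bash
# Build a stand-in snapshot archive for illustration; in real use,
# point sha256sum at your actual snapshot tarball.
echo "example payload" > model.bin
tar -czf snap-1.tar.gz model.bin

# Compute the digest to pass as --snapshot-sha256
SNAP_SHA=$(sha256sum snap-1.tar.gz | cut -d' ' -f1)
echo "$SNAP_SHA"
```

The printed value should match the `snapshot_sha256` the server expects; a mismatch means the worker will reject the snapshot during integrity validation.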
Examples:

- `ml queue train-mnist --priority 3 --snapshot-id snap-1 --snapshot-sha256 <sha256>`
- `ml queue train-a train-b train-c --priority 5 --snapshot-id snap-1 --snapshot-sha256 <sha256>`

## Environment Variables

Create a `.env` file in the project root:

```bash
# Grafana
GRAFANA_ADMIN_PASSWORD=your_secure_password

# API Configuration
LOG_LEVEL=info

# TLS (for secure deployments)
TLS_CERT_PATH=/app/ssl/cert.pem
TLS_KEY_PATH=/app/ssl/key.pem
```

## Service Ports

| Service | Development | Homelab | Production |
|---------|-------------|---------|------------|
| API Server | 9101 | 9101 | 9101 |
| Redis | 6379 | 6379 | - |
| Prometheus | 9090 | - | - |
| Grafana | 3000 | - | - |
| Loki | 3100 | - | - |
| JupyterLab | 8888* | 8888* | - |
| vLLM | 8000* | 8000* | - |

\*Plugin service ports are dynamically allocated from the 8000-9000 range by the scheduler.

## Plugin Services

The deployment configurations include support for interactive ML services:

### Jupyter Notebook/Lab

- **Image**: `quay.io/jupyter/base-notebook:latest`
- **Security**: Trusted channels (conda-forge, defaults), blocked packages (HTTP clients)
- **Resources**: Configurable GPU/memory limits
- **Access**: Via scheduler-assigned port (8000-9000 range)

### vLLM Inference

- **Image**: `vllm/vllm-openai:latest`
- **Features**: OpenAI-compatible API, quantization support (AWQ, GPTQ, FP8)
- **Model Cache**: Configurable path for model storage
- **Resources**: Multi-GPU tensor parallelism support

## Scheduler GPU Quotas

The scheduler supports GPU quota management for plugin services:

- **Global Limit**: Total GPUs across all plugins
- **Per-User Limits**: GPU and service count per user
- **Per-Plugin Limits**: vLLM- and Jupyter-specific limits
- **User Overrides**: Special permissions for admins/researchers

See `configs/scheduler/scheduler.yaml.example` for quota configuration.
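The quota levels above might look like the following fragment. All field names here are hypothetical, shown only to illustrate the four levels (global, per-user, per-plugin, overrides); consult `configs/scheduler/scheduler.yaml.example` for the actual schema.

```yaml
# Hypothetical field names -- see configs/scheduler/scheduler.yaml.example
# for the real quota configuration schema.
gpu_quotas:
  global_limit: 8          # total GPUs across all plugin services
  per_user:
    max_gpus: 2            # GPUs per user
    max_services: 3        # running services per user
  per_plugin:
    vllm:
      max_gpus: 4
    jupyter:
      max_gpus: 1
  overrides:
    alice:                 # e.g. an admin or researcher account
      max_gpus: 4
```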
## Monitoring

- **Development**: Full monitoring stack included
- **Homelab**: Basic monitoring (configurable)
- **Production**: External monitoring assumed

## Security Notes

- If you need HTTPS externally, terminate TLS at a reverse proxy.
- API keys should be managed via environment variables.
- Database credentials should use secrets management in production.
- **HIPAA deployments**: Plugins are disabled by default for compliance.