fetch_ml/deployments
Jeremie Fraeys 98a0d42213
deploy: consolidate docker-compose files using profiles
- Merge logs-debug.yml into test.yml with 'debug' profile
- Merge local.yml into dev.yml with 'local' profile
- Merge prod.smoke.yml into prod.yml with 'smoke' profile
- Reduces compose files from 8 to 5, simplifies maintenance
- Update TEST_COMPOSE to use deployments/docker-compose.test.yml
2026-03-04 13:22:17 -05:00

Docker Compose Deployments

This directory contains Docker Compose configurations for different deployment environments.

Environment Configurations

Development (docker-compose.dev.yml)

  • Full development stack with monitoring
  • Includes: API, Worker, Redis, MinIO (snapshots), Prometheus, Grafana, Loki, Promtail
  • Optimized for local development and testing
  • Usage: docker-compose -f deployments/docker-compose.dev.yml up -d

Homelab - Secure (docker-compose.homelab-secure.yml)

  • Secure homelab deployment with authentication and a Caddy reverse proxy
  • TLS is terminated at the reverse proxy (Approach A)
  • Includes: API, Redis (password protected), Caddy reverse proxy
  • Usage: docker-compose -f deployments/docker-compose.homelab-secure.yml up -d

Production (docker-compose.prod.yml)

  • Production deployment configuration
  • Optimized for performance and security
  • External services assumed (Redis, monitoring)
  • Usage: docker-compose -f deployments/docker-compose.prod.yml up -d

Note: docker-compose.prod.yml is a reproducible staging/testing harness for the production configuration. Real production deployments do not require Docker: you can run the Go services directly (for example, under systemd) and use Caddy for TLS/WSS termination.

TLS / WSS Policy

  • The Zig CLI currently supports ws:// only (native wss:// is not implemented).
  • Production and homelab deployments both terminate TLS/WSS at a Caddy reverse proxy (for the production stack, the Caddy service in docker-compose.prod.yml) and keep the API server on internal ws://.
  • Health checks in compose files should use http://localhost:9101/health when server.tls.enabled: false.
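The health endpoint above can also be polled from a script, for example to wait for the API before running smoke tests. The following is a minimal sketch, not part of the repository: the helper name wait_for_health and its default attempt count and delay are assumptions, and it requires curl on the machine running it.

```shell
# Poll a health endpoint until it answers with a success status,
# or give up after a fixed number of attempts.
# Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_health() {
    url=$1
    attempts=${2:-30}   # assumed default, tune for your stack
    delay=${3:-2}       # assumed default, in seconds
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -f: treat HTTP errors (>= 400) as failure, -s: no progress output
        if curl -fs -o /dev/null "$url"; then
            echo "healthy after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "not healthy after $attempts attempt(s): $url" >&2
    return 1
}
```

With TLS disabled on the API server, a call might look like: wait_for_health http://localhost:9101/health 30 2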

Required Volume Mounts

  • base_path (experiments) must be writable by the API server.
  • data_dir should be mounted if you want snapshot/dataset integrity validation via ml validate.

For the default configs:

  • base_path: /data/experiments (dev/homelab configs) or /app/data/experiments (prod configs)
  • data_dir: /data/active
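A quick way to confirm these mounts is to test the paths from inside the API container. The sketch below is an illustration only: check_mounts is a hypothetical helper, and it assumes that an existing directory is enough for data_dir while base_path must also be writable.

```shell
# Verify the two mount points described above.
# Usage: check_mounts BASE_PATH [DATA_DIR]
check_mounts() {
    base_path=$1
    data_dir=${2:-}     # optional: only needed for integrity validation
    # base_path must exist and be writable by the current user
    if [ ! -d "$base_path" ] || [ ! -w "$base_path" ]; then
        echo "base_path is not a writable directory: $base_path" >&2
        return 1
    fi
    # data_dir is only checked for presence when it is given
    if [ -n "$data_dir" ] && [ ! -d "$data_dir" ]; then
        echo "data_dir is not mounted: $data_dir" >&2
        return 1
    fi
    echo "mounts ok"
}
```

For the default dev/homelab paths, this would be called as: check_mounts /data/experiments /data/active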

Quick Start

# Development (most common)
docker-compose -f deployments/docker-compose.dev.yml up -d

# Check status
docker-compose -f deployments/docker-compose.dev.yml ps

# View logs
docker-compose -f deployments/docker-compose.dev.yml logs -f api-server

# Stop services
docker-compose -f deployments/docker-compose.dev.yml down

Dev: MinIO-backed snapshots (smoke test)

The dev compose file provisions a MinIO bucket and uploads a small example snapshot object to:

s3://fetchml-snapshots/snapshots/snap-1.tar.gz

To queue a task that forces the worker to pull the snapshot from MinIO:

  1. Start the dev stack: docker-compose -f deployments/docker-compose.dev.yml up -d

  2. Read the snapshot_sha256 printed by the init job: docker-compose -f deployments/docker-compose.dev.yml logs minio-init

  3. Queue a job using the snapshot fields: ml queue <job-name> --snapshot-id snap-1 --snapshot-sha256 <snapshot_sha256>
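Steps 2 and 3 can be chained in a script. The sketch below assumes the init job prints the digest as a 64-character lowercase hex string somewhere in its log output; the exact log format is not documented here, so treat the pattern as a guess and check it against your own logs. The helper name extract_sha256 is hypothetical, and grep -o is a common extension (GNU and BSD grep) rather than strict POSIX.

```shell
# Read log text on stdin and print the last 64-character
# lowercase hex string found (prints nothing if there is none).
extract_sha256() {
    grep -oE '[0-9a-f]{64}' | tail -n 1
}

# Possible usage against the dev stack (not executed here):
#   sha=$(docker-compose -f deployments/docker-compose.dev.yml logs minio-init | extract_sha256)
#   ml queue <job-name> --snapshot-id snap-1 --snapshot-sha256 "$sha"
```

If the variable comes back empty, read the minio-init logs by hand as described in step 2.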

Note: ml queue by itself will generate a random commit ID. For full provenance enforcement (manifest + dependency manifest), use ml sync ./your-project --queue so the server has real code + dependency files.

Examples:

  • ml queue train-mnist --priority 3 --snapshot-id snap-1 --snapshot-sha256 <snapshot_sha256>
  • ml queue train-a train-b train-c --priority 5 --snapshot-id snap-1 --snapshot-sha256 <snapshot_sha256>

Smoke tests

  • make dev-smoke runs the development stack smoke test.
  • make prod-smoke runs a Docker-based staging smoke test for the production stack, using a localhost-only Caddy configuration.

Environment Variables

Create a .env file in the project root:

# Grafana
GRAFANA_ADMIN_PASSWORD=your_secure_password

# API Configuration
LOG_LEVEL=info

# TLS (for secure deployments)
TLS_CERT_PATH=/app/ssl/cert.pem
TLS_KEY_PATH=/app/ssl/key.pem
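Before starting a stack, it can help to confirm that the .env file actually defines the variables you rely on. This is a minimal sketch under the assumption that the file uses plain NAME=value lines as shown above; check_env_file is a hypothetical helper, not something shipped in this directory.

```shell
# Succeed only if every named variable has a non-empty
# "NAME=value" line in the given env file.
# Usage: check_env_file FILE NAME...
check_env_file() {
    env_file=$1
    shift
    missing=0
    for name in "$@"; do
        # ".+" requires at least one character after the "="
        if ! grep -Eq "^${name}=.+" "$env_file"; then
            echo "missing or empty: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For a secure deployment, a call might look like: check_env_file .env GRAFANA_ADMIN_PASSWORD LOG_LEVEL TLS_CERT_PATH TLS_KEY_PATH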

Service Ports

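| Service    | Development | Homelab | Production |
|------------|-------------|---------|------------|
| API Server | 9101        | 9101    | 9101       |
| Redis      | 6379        | 6379    | -          |
| Prometheus | 9090        | -       | -          |
| Grafana    | 3000        | -       | -          |
| Loki       | 3100        | -       | -          |
| JupyterLab | 8888*       | 8888*   | -          |
| vLLM       | 8000*       | 8000*   | -          |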

*Plugin service ports are dynamically allocated from the 8000-9000 range by the scheduler.
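When writing firewall rules or debugging a plugin service, a small check like the one below can tell you whether a port falls in the scheduler's plugin range. It is a sketch only: in_plugin_port_range is a hypothetical helper, and treating both 8000 and 9000 as inside the range is an assumption, since the bounds are not stated as inclusive or exclusive.

```shell
# Return success if PORT is a number within 8000-9000
# (bounds assumed inclusive).
# Usage: in_plugin_port_range PORT
in_plugin_port_range() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 8000 ] && [ "$1" -le 9000 ]
}
```

For example, in_plugin_port_range 8888 succeeds, while in_plugin_port_range 9101 fails because the API server port lies outside the plugin range.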

Plugin Services

The deployment configurations include support for interactive ML services:

Jupyter Notebook/Lab

  • Image: quay.io/jupyter/base-notebook:latest
  • Security: Packages are limited to trusted channels (conda-forge, defaults), and blocked packages include HTTP clients
  • Resources: Configurable GPU/memory limits
  • Access: Via scheduler-assigned port (8000-9000 range)

vLLM Inference

  • Image: vllm/vllm-openai:latest
  • Features: OpenAI-compatible API, quantization support (AWQ, GPTQ, FP8)
  • Model Cache: Configurable path for model storage
  • Resources: Multi-GPU tensor parallelism support

Scheduler GPU Quotas

The scheduler supports GPU quota management for plugin services:

  • Global Limit: Total GPUs across all plugins
  • Per-User Limits: GPU and service count per user
  • Per-Plugin Limits: vLLM and Jupyter-specific limits
  • User Overrides: Special permissions for admins/researchers

See configs/scheduler/scheduler.yaml.example for quota configuration.

Monitoring

  • Development: Full monitoring stack included
  • Homelab: Basic monitoring (configurable)
  • Production: External monitoring assumed

Security Notes

  • If you need HTTPS externally, terminate TLS at a reverse proxy.
  • API keys should be managed via environment variables
  • Database credentials should use secrets management in production
  • HIPAA deployments: Plugins are disabled by default for compliance