fetch_ml/configs/README.md
Jeremie Fraeys 8a7e7695f4
config: consolidate and cleanup configuration files
- Remove redundant config examples (distributed/, standalone/, examples/)
- Delete dev-local.yaml variants (use dev.yaml with env vars)
- Delete prod.yaml (use multi-user.yaml or homelab-secure.yaml)
- Clean up worker configs: remove docker.yaml, homelab-sandbox.yaml
- Update remaining configs with current best practices
- Simplify config schema and documentation
2026-03-04 13:22:52 -05:00

fetch_ml Configuration Guide

Quick Start

# Development with 2 workers
cd deployments
CONFIG_DIR=../configs docker-compose -f docker-compose.dev.yml up -d

# Scale to 4 workers
docker-compose -f docker-compose.dev.yml up -d --scale worker=4

# Production with scheduler
CONFIG_DIR=../configs docker-compose -f docker-compose.prod.yml up -d

Key Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| CONFIG_DIR | Path to config directory | ../configs |
| DATA_DIR | Path to data directory | ./data/<env> |
| LOG_LEVEL | Logging level | info |
| REDIS_URL | Redis connection URL | redis://redis:6379 |
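
The documented defaults can be reproduced in a wrapper script with POSIX parameter expansion. This is a minimal sketch, not part of the repository: the function name `fetchml_env`, the `FETCHML_ENV` variable standing in for `<env>`, and its fallback of `dev` are all assumptions made for illustration.

```shell
# Sketch only: print each variable with the documented default applied.
# FETCHML_ENV is a hypothetical name for the "<env>" part of DATA_DIR.
fetchml_env() (
  env_name="${FETCHML_ENV:-dev}"
  echo "CONFIG_DIR=${CONFIG_DIR:-../configs}"
  echo "DATA_DIR=${DATA_DIR:-./data/$env_name}"
  echo "LOG_LEVEL=${LOG_LEVEL:-info}"
  echo "REDIS_URL=${REDIS_URL:-redis://redis:6379}"
)
```

Running `fetchml_env` before `docker-compose up` is a quick way to see which values are in effect for the current shell.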

Architecture Overview

┌─────────────────┐     ┌─────────────┐     ┌─────────────────┐
│   API Server    │────▶│   Redis     │◀────│   Scheduler     │
│  (with builtin  │     │   Queue     │     │  (in api-server)│
│   scheduler)    │     └─────────────┘     └─────────────────┘
└─────────────────┘            │                    │
         │                     │                    │
         │            ┌────────┴────────┐          │
         │            ▼                 ▼          │
         │     ┌─────────┐        ┌─────────┐      │
         └────▶│ Worker 1│        │ Worker 2│      │
               │ (Podman)│        │ (Podman)│      │
               └─────────┘        └─────────┘      │
                     │                  │          │
                     └──────────────────┴──────────┘
                              Heartbeats

The scheduler is built into the API server and manages multiple workers dynamically.

Configuration Structure

configs/
├── api/
│   ├── dev.yaml              # Development API config
│   ├── multi-user.yaml       # Production multi-worker
│   └── homelab-secure.yaml   # Homelab secure config
├── worker/
│   ├── docker-dev.yaml       # Development worker
│   ├── docker-prod.yaml      # Production worker
│   ├── docker-staging.yaml   # Staging worker
│   ├── docker-standard.yaml  # Standard compliance
│   └── homelab-secure.yaml   # Homelab secure worker
└── schema/
    └── *.yaml                # Validation schemas

Scheduler Configuration

The scheduler is configured in the API server config:

# configs/api/multi-user.yaml
resources:
  max_workers: 4              # Max concurrent workers
  desired_rps_per_worker: 3   # Target requests/sec per worker

scheduler:
  enabled: true
  strategy: "round-robin"     # round-robin, least-loaded, priority
  max_concurrent_jobs: 16     # Max jobs across all workers
  queue:
    type: "redis"
    redis_addr: "redis:6379"
  worker_discovery:
    mode: "dynamic"           # dynamic or static
    heartbeat_timeout: "30s"
    health_check_interval: "10s"
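
With `heartbeat_timeout: "30s"` here and `heartbeat_interval: "10s"` in the worker config, a worker has roughly three heartbeat intervals before it is considered lost. The scheduler's actual liveness logic is not shown in this README, so the sketch below is an assumed rule inferred from the setting names; `worker_is_alive` is a hypothetical helper.

```shell
# Sketch only: liveness rule assumed from the config names, not taken from
# the scheduler source. Times are Unix seconds; the timeout is in seconds.
worker_is_alive() (
  last_seen="$1"; now="$2"; timeout="${3:-30}"
  # Alive while the time since the last heartbeat is within the timeout.
  [ $(( now - last_seen )) -le "$timeout" ]
)
```

For example, `worker_is_alive "$last_seen" "$(date +%s)" 30 || echo "worker timed out"`.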

Scheduling Strategies

| Strategy | Description | Use Case |
|----------|-------------|----------|
| round-robin | Distribute evenly across workers | Balanced load |
| least-loaded | Send to worker with fewest jobs | Variable job sizes |
| priority | Respect job priorities first | Mixed priority workloads |
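
To make the difference between the strategies concrete, here is a minimal sketch of each selection rule. It illustrates the idea only and is not the scheduler's code: the function names, the `worker:active_jobs` argument format, and the convention that a larger number means higher priority are all assumptions.

```shell
# Sketch only: illustrates the three strategies; not the scheduler's code.

# round-robin: the Nth job (0-based) goes to worker N mod <worker count>.
pick_round_robin() (
  n="$1"; shift
  idx=$(( n % $# ))
  shift "$idx"
  echo "$1"
)

# least-loaded: arguments are "worker:active_jobs"; pick the smallest count.
pick_least_loaded() (
  best=""; best_jobs=""
  for entry in "$@"; do
    id="${entry%%:*}"; jobs="${entry##*:}"
    if [ -z "$best" ] || [ "$jobs" -lt "$best_jobs" ]; then
      best="$id"; best_jobs="$jobs"
    fi
  done
  echo "$best"
)

# priority: stdin lines are "job priority"; print job names, highest
# priority first (assumes a larger number means higher priority).
order_by_priority() {
  sort -k2,2nr | cut -d' ' -f1
}
```

For example, `pick_least_loaded w1:3 w2:1 w3:2` selects `w2`, the worker with the fewest active jobs.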

Worker Configuration

Workers connect to the scheduler via the Redis queue:

# configs/worker/docker-prod.yaml
backend:
  type: "redis"
  redis:
    addr: "redis:6379"
    password: ""           # Set via REDIS_PASSWORD env var
    db: 0

worker:
  id: "${FETCHML_WORKER_ID}"  # Unique worker ID
  mode: "distributed"         # Uses scheduler via Redis
  heartbeat_interval: "10s"
  max_concurrent_jobs: 4      # Jobs this worker can run

sandbox:
  type: "podman"
  podman:
    socket: "/run/podman/podman.sock"
    cpus: "2"
    memory: "4Gi"
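
The worker-level `max_concurrent_jobs: 4` and the scheduler-level `max_concurrent_jobs: 16` line up when four workers are running (4 × 4 = 16). The sketch below assumes the scheduler value acts as a global ceiling on top of the per-worker limits, which this README does not state explicitly; `effective_capacity` is a hypothetical helper.

```shell
# Sketch only: effective job slots =
#   min(workers * per-worker limit, scheduler-wide limit)
effective_capacity() (
  workers="$1"; per_worker="$2"; scheduler_cap="$3"
  total=$(( workers * per_worker ))
  if [ "$total" -gt "$scheduler_cap" ]; then
    total="$scheduler_cap"
  fi
  echo "$total"
)
```

Under that assumption, scaling down to 2 workers gives `effective_capacity 2 4 16`, i.e. 8 slots, while scaling up to 8 workers would still be capped at 16 unless the scheduler limit is raised as well.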

Scaling Workers

# Development - 2 workers by default
docker-compose -f deployments/docker-compose.dev.yml up -d

# Scale to 4 workers
docker-compose -f deployments/docker-compose.dev.yml up -d --scale worker=4

# Scale down to 1 worker
docker-compose -f deployments/docker-compose.dev.yml up -d --scale worker=1

Kubernetes / Manual Deployment

# Each worker needs unique ID
export FETCHML_WORKER_ID="worker-$(hostname)-$(date +%s)"
./worker -config configs/worker/docker-prod.yaml
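
Because `date +%s` has one-second resolution, two workers started on the same host within the same second would receive the same ID. A minimal sketch that also mixes in the launching shell's process ID; the function name `make_worker_id` is hypothetical and the ID format is only a suggestion.

```shell
# Sketch only: hostname + start time + PID of the launching shell.
# $$ differs between separately launched shells on the same host.
make_worker_id() {
  echo "worker-$(hostname)-$(date +%s)-$$"
}
```

Usage: `export FETCHML_WORKER_ID="$(make_worker_id)"` before starting `./worker`.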

Environment-Specific Setups

Development (docker-compose.dev.yml)

  • 2 workers by default
  • Redis for queue
  • Local MinIO for storage
  • Caddy reverse proxy

make dev-up          # Start with 2 workers
make dev-up SCALE=4  # Start with 4 workers

Production (docker-compose.prod.yml)

  • 4 workers configured
  • Redis cluster recommended
  • External MinIO/S3
  • Health checks enabled

CONFIG_DIR=./configs DATA_DIR=/var/lib/fetchml \
  docker-compose -f deployments/docker-compose.prod.yml up -d

Staging (docker-compose.staging.yml)

  • 2 workers
  • Audit logging enabled
  • Same configuration as production, at a smaller scale

Monitoring

Check Worker Status

# Via API
curl http://localhost:9101/api/v1/workers

# Via Redis
redis-cli LRANGE fetchml:workers 0 -1
redis-cli HGETALL fetchml:worker:status

View Logs

# All workers
docker-compose -f deployments/docker-compose.dev.yml logs worker

# Specific worker (by container name)
docker logs ml-experiments-worker-1
docker logs ml-experiments-worker-2

Troubleshooting

Workers Not Registering

  1. Check the Redis connection: redis-cli ping (expect PONG)
  2. Verify that the worker config sets mode: "distributed"
  3. Check that the scheduler is enabled in the API server config (scheduler.enabled: true)
  4. Review the worker logs: docker logs <worker-container>

Jobs Stuck in Queue

  1. Check worker capacity: confirm the workers have not all reached max_concurrent_jobs
  2. Verify that the worker containers are running and healthy: docker ps
  3. Check the Redis queue length: redis-cli LLEN fetchml:queue:pending
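
The Redis checks above can be scripted. A minimal sketch, assuming `redis-cli` is on the PATH and reaches the same Redis instance as the workers; the `REDIS_CLI` override, the function name `check_queue`, and the default threshold of 100 pending jobs are hypothetical.

```shell
# Sketch only. REDIS_CLI may be pointed at a wrapper script if redis-cli
# is not installed locally.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# Report whether Redis answers and how many jobs are waiting.
# Returns non-zero if Redis is unreachable or the backlog exceeds
# the threshold given as the first argument (default 100).
check_queue() (
  threshold="${1:-100}"
  if [ "$("$REDIS_CLI" ping 2>/dev/null)" != "PONG" ]; then
    echo "redis: unreachable"
    exit 1
  fi
  pending="$("$REDIS_CLI" LLEN fetchml:queue:pending)"
  echo "pending jobs: $pending"
  [ "$pending" -le "$threshold" ]
)
```

For example, `check_queue 50 || echo "queue backlog is growing, check the workers"`.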

Worker ID Collisions

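Ensure FETCHML_WORKER_ID is unique per worker instance. Note that Compose substitutes ${...} once, from the environment of the shell that runs docker-compose: HOSTNAME and RANDOM are bash shell variables that are typically not exported, so export the values you need first, and be aware that replicas created by --scale share the same substituted value.

environment:
  - FETCHML_WORKER_ID=${HOSTNAME}-${COMPOSE_PROJECT_NAME}-${RANDOM}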

Security Notes

  • Workers run in privileged mode so that they can launch Podman containers
  • Redis should be firewalled (not exposed publicly in prod)
  • Worker-to-scheduler communication is via Redis only
  • No direct API-to-worker connections required

See Also

  • deployments/README.md - Deployment environments
  • docs/src/deployment.md - Full deployment guide
  • docs/src/cicd.md - CI/CD workflows