# fetch_ml Configuration Guide

## Quick Start

### Docker Compose (Recommended)

```bash
# Development with 2 workers
cd deployments
CONFIG_DIR=../configs docker-compose -f docker-compose.dev.yml up -d

# Scale to 4 workers
docker-compose -f docker-compose.dev.yml up -d --scale worker=4

# Production with scheduler
CONFIG_DIR=../configs docker-compose -f docker-compose.prod.yml up -d
```

### Key Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `CONFIG_DIR` | Path to config directory | `../configs` |
| `DATA_DIR` | Path to data directory | `./data/<env>` |
| `LOG_LEVEL` | Logging level | `info` |
| `REDIS_URL` | Redis connection URL | `redis://redis:6379` |
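
The defaults in the table can be mirrored in a wrapper script so that unset variables fall back to the documented values. This is a minimal sketch and not part of fetch_ml: `resolve_fetchml_env` and `FETCHML_ENV` (standing in for the `<env>` placeholder) are hypothetical names.

```bash
# Hypothetical helper: export the documented variables, keeping any value
# the caller already set and falling back to the defaults from the table.
resolve_fetchml_env() {
  # FETCHML_ENV is an assumed name for the <env> part of DATA_DIR.
  env_name="${FETCHML_ENV:-dev}"
  export CONFIG_DIR="${CONFIG_DIR:-../configs}"
  export DATA_DIR="${DATA_DIR:-./data/${env_name}}"
  export LOG_LEVEL="${LOG_LEVEL:-info}"
  export REDIS_URL="${REDIS_URL:-redis://redis:6379}"
}
```

Calling `resolve_fetchml_env` before `docker-compose ... up -d` leaves explicitly set values such as `LOG_LEVEL=debug` untouched.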

## Architecture Overview

```
┌─────────────────┐     ┌─────────────┐     ┌─────────────────┐
│   API Server    │────▶│    Redis    │◀────│   Scheduler     │
│  (with builtin  │     │    Queue    │     │  (in api-server)│
│   scheduler)    │     └─────────────┘     └─────────────────┘
└─────────────────┘            │                     │
           │                   │                     │
           │          ┌────────┴────────┐            │
           │          ▼                 ▼            │
           │     ┌─────────┐       ┌─────────┐       │
           └────▶│ Worker 1│       │ Worker 2│       │
                 │ (Podman)│       │ (Podman)│       │
                 └─────────┘       └─────────┘       │
                      │                 │            │
                      └─────────────────┴────────────┘
                                 Heartbeats
```

The scheduler is built into the API server and manages multiple workers dynamically.

## Configuration Structure

```
configs/
├── api/
│   ├── dev.yaml              # Development API config
│   ├── multi-user.yaml       # Production multi-worker
│   └── homelab-secure.yaml   # Homelab secure config
├── worker/
│   ├── docker-dev.yaml       # Development worker
│   ├── docker-prod.yaml      # Production worker
│   ├── docker-staging.yaml   # Staging worker
│   ├── docker-standard.yaml  # Standard compliance
│   └── homelab-secure.yaml   # Homelab secure worker
└── schema/
    └── *.yaml                # Validation schemas
```

## Scheduler Configuration

The scheduler is configured in the API server config:

```yaml
# configs/api/multi-user.yaml
resources:
  max_workers: 4              # Max concurrent workers
  desired_rps_per_worker: 3   # Target requests/sec per worker

scheduler:
  enabled: true
  strategy: "round-robin"     # round-robin, least-loaded, priority
  max_concurrent_jobs: 16     # Max jobs across all workers
  queue:
    type: "redis"
    redis_addr: "redis:6379"
  worker_discovery:
    mode: "dynamic"           # dynamic or static
    heartbeat_timeout: "30s"
    health_check_interval: "10s"
```
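
To make the `heartbeat_timeout` setting concrete, the check below sketches how a worker might be classified as lost once its last heartbeat is older than the timeout. It is an illustration under assumptions, not the scheduler's code: `is_worker_stale` is a hypothetical helper, timestamps are Unix seconds, and treating a gap of exactly 30 seconds as still alive is a guess.

```bash
# Hypothetical helper: succeed (exit 0) when the last heartbeat is older
# than the timeout. All arguments are integers in seconds.
is_worker_stale() {
  last_heartbeat=$1
  now=$2
  timeout=$3
  [ $(( now - last_heartbeat )) -gt "$timeout" ]
}

# Example: last heartbeat at t=100, checked at t=131 with a 30s timeout.
if is_worker_stale 100 131 30; then
  echo "worker considered lost"
fi
```

With the worker-side `heartbeat_interval: "10s"` shown in the worker config, a 30-second timeout under this reading tolerates roughly two missed heartbeats before the third one is overdue.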

### Scheduling Strategies

| Strategy | Description | Use Case |
|----------|-------------|----------|
| `round-robin` | Distribute evenly across workers | Balanced load |
| `least-loaded` | Send to worker with fewest jobs | Variable job sizes |
| `priority` | Respect job priorities first | Mixed priority workloads |
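
The difference between the first two strategies can be sketched in a few lines of shell. This illustrates the selection rules only and is not how the fetch_ml scheduler is implemented; `pick_round_robin`, `pick_least_loaded`, and the `id:jobs` argument format are hypothetical.

```bash
# Hypothetical helper: pick_round_robin <counter> <worker>...
# Selects the worker at position (counter mod number-of-workers).
# Expects at least one worker.
pick_round_robin() (
  counter=$1
  shift
  index=$(( counter % $# ))
  shift "$index"
  echo "$1"
)

# Hypothetical helper: pick_least_loaded <id:jobs>...
# Selects the worker with the fewest running jobs (first one wins ties).
pick_least_loaded() (
  best_id=""
  best_jobs=""
  for entry in "$@"; do
    id=${entry%%:*}
    jobs=${entry##*:}
    if [ -z "$best_jobs" ] || [ "$jobs" -lt "$best_jobs" ]; then
      best_id=$id
      best_jobs=$jobs
    fi
  done
  echo "$best_id"
)

pick_round_robin 4 w1 w2 w3        # prints w2 (4 mod 3 = 1, zero-based)
pick_least_loaded w1:3 w2:1 w3:2   # prints w2
```

Round-robin ignores current load, which is why the table suggests `least-loaded` when job sizes vary.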

## Worker Configuration

Workers connect to the scheduler through the Redis queue:

```yaml
# configs/worker/docker-prod.yaml
backend:
  type: "redis"
  redis:
    addr: "redis:6379"
    password: ""                # Set via REDIS_PASSWORD env var
    db: 0

worker:
  id: "${FETCHML_WORKER_ID}"    # Unique worker ID
  mode: "distributed"           # Uses scheduler via Redis
  heartbeat_interval: "10s"
  max_concurrent_jobs: 4        # Jobs this worker can run

sandbox:
  type: "podman"
  podman:
    socket: "/run/podman/podman.sock"
    cpus: "2"
    memory: "4Gi"
```
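
The two `max_concurrent_jobs` settings interact: each worker caps its own jobs, and the scheduler's value is described as the maximum across all workers. Assuming the scheduler value acts as a global ceiling, the effective capacity can be estimated as sketched below; `effective_capacity` is a hypothetical helper, not a fetch_ml command.

```bash
# Hypothetical helper: min(workers * jobs_per_worker, scheduler_cap).
effective_capacity() (
  workers=$1
  jobs_per_worker=$2
  scheduler_cap=$3
  total=$(( workers * jobs_per_worker ))
  if [ "$total" -gt "$scheduler_cap" ]; then
    total=$scheduler_cap
  fi
  echo "$total"
)

effective_capacity 4 4 16   # prints 16
effective_capacity 2 4 16   # prints 8
```

With the sample values (4 workers, 4 jobs each, scheduler limit 16) both limits coincide; scaling down to 2 workers halves the capacity even though the scheduler limit is unchanged.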

## Scaling Workers

### Docker Compose (Recommended)

```bash
# Development - 2 workers by default
docker-compose -f deployments/docker-compose.dev.yml up -d

# Scale to 4 workers
docker-compose -f deployments/docker-compose.dev.yml up -d --scale worker=4

# Scale down to 1 worker
docker-compose -f deployments/docker-compose.dev.yml up -d --scale worker=1
```
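
Since the scale commands differ only in the environment name and the worker count, they can be generated by a small wrapper. The sketch below only prints the command so it can be reviewed before running; `scale_command` is a hypothetical helper, and the file naming follows the `deployments/docker-compose.<env>.yml` pattern used in this guide.

```bash
# Hypothetical helper: print the scale command for an environment after
# checking that the worker count is a positive integer.
scale_command() (
  env_name=$1
  count=$2
  case "$count" in
    ''|*[!0-9]*|0)
      echo "worker count must be a positive integer" >&2
      exit 1
      ;;
  esac
  echo "docker-compose -f deployments/docker-compose.${env_name}.yml up -d --scale worker=${count}"
)

scale_command dev 4
# prints: docker-compose -f deployments/docker-compose.dev.yml up -d --scale worker=4
```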

### Kubernetes / Manual Deployment

```bash
# Each worker needs a unique ID
export FETCHML_WORKER_ID="worker-$(hostname)-$(date +%s)"
./worker -config configs/worker/docker-prod.yaml
```

## Environment-Specific Setups

### Development (docker-compose.dev.yml)

- 2 workers by default
- Redis for queue
- Local MinIO for storage
- Caddy reverse proxy

```bash
make dev-up           # Start with 2 workers
make dev-up SCALE=4   # Start with 4 workers
```

### Production (docker-compose.prod.yml)

- 4 workers configured
- Redis cluster recommended
- External MinIO/S3
- Health checks enabled

```bash
CONFIG_DIR=./configs DATA_DIR=/var/lib/fetchml \
  docker-compose -f deployments/docker-compose.prod.yml up -d
```

### Staging (docker-compose.staging.yml)

- 2 workers
- Audit logging enabled
- Same setup as production, at a smaller scale

## Monitoring

### Check Worker Status

```bash
# Via API
curl http://localhost:9101/api/v1/workers

# Via Redis
redis-cli LRANGE fetchml:workers 0 -1
redis-cli HGETALL fetchml:worker:status
```

### View Logs

```bash
# All workers
docker-compose -f deployments/docker-compose.dev.yml logs worker

# Specific worker (by container name)
docker logs ml-experiments-worker-1
docker logs ml-experiments-worker-2
```

## Troubleshooting

### Workers Not Registering

1. Check the Redis connection: `redis-cli ping`
2. Verify that the worker config has `mode: distributed`
3. Check that the scheduler is enabled in the API server config
4. Review worker logs: `docker logs <worker-container>`
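
When workers start before Redis is ready, the connection check in step 1 can be retried instead of run once. The following is a generic retry sketch, not a fetch_ml tool; `wait_for` is a hypothetical helper that runs any command up to a given number of times.

```bash
# Hypothetical helper: wait_for <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
wait_for() (
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Succeed as soon as the command exits with status 0.
    "$@" >/dev/null 2>&1 && exit 0
    # Pause only if another attempt follows.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$(( i + 1 ))
  done
  exit 1
)
```

For example, `wait_for 10 2 redis-cli ping` retries the ping up to ten times with two seconds between attempts and returns a non-zero status if Redis never answers.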

### Jobs Stuck in Queue

1. Check worker capacity: confirm that running jobs have not reached `max_concurrent_jobs`
2. Verify that workers are healthy: `docker ps`
3. Check the Redis queue length: `redis-cli LLEN fetchml:queue:pending`
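
### Worker ID Collisions

Ensure `FETCHML_WORKER_ID` is unique per worker instance. Compose substitutes `${...}` from the host environment when it reads the file, and every replica started with `--scale` receives the same value. `RANDOM` and `HOSTNAME` are shell variables that bash does not export by default, so they may expand to empty strings; check the resolved value with `docker-compose config`.
```yaml
environment:
  - FETCHML_WORKER_ID=${HOSTNAME}-${COMPOSE_PROJECT_NAME}-${RANDOM}
```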
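
One way to get a distinct ID per container is to generate it when the container starts rather than in the Compose file. The sketch below assumes the worker image has a POSIX shell, `od`, and `/dev/urandom`; `generate_worker_id` is a hypothetical helper and the ID format is only a suggestion.

```bash
# Hypothetical helper: build an ID from the container's hostname, the
# current Unix time, and 4 random bytes rendered as 8 hex characters.
generate_worker_id() {
  suffix=$(od -An -N4 -tx1 /dev/urandom | tr -d ' \n')
  echo "worker-$(uname -n)-$(date +%s)-${suffix}"
}
```

An entrypoint script could then run `export FETCHML_WORKER_ID="$(generate_worker_id)"` before starting `./worker -config configs/worker/docker-prod.yaml`. The random suffix keeps IDs distinct even when two workers start on the same host within the same second.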

## Security Notes

- Workers run in privileged mode so that they can run Podman containers
- Redis should be firewalled (not exposed publicly in prod)
- Worker-to-scheduler communication is via Redis only
- No direct API-to-worker connections are required

## See Also

- `deployments/README.md` - Deployment environments
- `docs/src/deployment.md` - Full deployment guide
- `docs/src/cicd.md` - CI/CD workflows