---
layout: page
title: "Operations Runbook"
permalink: /operations/
nav_order: 6
---
# Operations Runbook
Operational guide for troubleshooting and maintaining the ML experiment system.
## Task Queue Operations
### Monitoring Queue Health
```redis
# Check queue depth
ZCARD task:queue
# List pending tasks
ZRANGE task:queue 0 -1 WITHSCORES
# Check dead letter queue
KEYS task:dlq:*
```
### Handling Stuck Tasks
**Symptom:** Tasks stuck in "running" status
**Diagnosis:**
```bash
# Check for expired leases
redis-cli GET task:{task-id}
# Look for a LeaseExpiry timestamp in the past
```
**Remediation:**
Tasks with expired leases are reclaimed automatically every minute. To force immediate reclamation:
```bash
# Restart worker to trigger reclaim cycle
systemctl restart ml-worker
```
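The reclaim decision itself is simple: any task whose lease expiry is in the past is eligible to be re-queued. A minimal sketch of that check (function and variable names here are illustrative, not the worker's actual code):

```python
from datetime import datetime, timedelta, timezone

def lease_expired(lease_expiry, now=None):
    """A task whose lease expiry is in the past is eligible for reclaim."""
    now = now or datetime.now(timezone.utc)
    return lease_expiry < now

# Example: a lease granted 31 minutes ago with the default 30m duration
granted = datetime.now(timezone.utc) - timedelta(minutes=31)
expiry = granted + timedelta(minutes=30)
print(lease_expired(expiry))  # True: the reclaim job would re-queue this task
```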
### Dead Letter Queue Management
**View failed tasks:**
```redis
KEYS task:dlq:*
```
**Inspect failed task:**
```redis
GET task:dlq:{task-id}
```
**Retry from DLQ:**
```bash
# Manual retry (requires custom script)
# 1. Get task from DLQ
# 2. Reset retry count
# 3. Re-queue task
```
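The three steps above can be sketched in Python. This uses in-memory dicts as stand-ins for the `task:dlq:{task-id}` keys and the `task:queue` sorted set; a real script would perform the same operations through a Redis client (e.g. redis-py), and the task fields shown are illustrative:

```python
import json
import time

# In-memory stand-ins for the Redis keys named above.
dlq = {"task:dlq:abc123": json.dumps({"id": "abc123", "retry_count": 3, "payload": "train"})}
queue = {}  # task id -> score, standing in for the task:queue sorted set

def retry_from_dlq(task_id):
    task = json.loads(dlq.pop(f"task:dlq:{task_id}"))  # 1. Get task from DLQ
    task["retry_count"] = 0                            # 2. Reset retry count
    queue[task["id"]] = time.time()                    # 3. Re-queue (ZADD equivalent)

retry_from_dlq("abc123")
print("abc123" in queue, len(dlq))  # True 0
```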
### Worker Crashes
**Symptom:** Worker disappeared mid-task
**What Happens:**
1. Lease expires after 30 minutes (default)
2. Background reclaim job detects expired lease
3. Task is retried (up to 3 attempts)
4. After max retries → Dead Letter Queue
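The retry-or-DLQ decision in steps 3-4 can be sketched as a pure function (a simplification of the actual worker logic):

```python
MAX_RETRIES = 3  # matches the "up to 3 attempts" policy above

def next_step(retry_count):
    """After a lease expires, retry the task until attempts are exhausted."""
    return "retry" if retry_count < MAX_RETRIES else "dead_letter_queue"

print([next_step(n) for n in range(4)])
# ['retry', 'retry', 'retry', 'dead_letter_queue']
```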
**Prevention:**
- Monitor worker heartbeats
- Set up alerts for worker down
- Use process manager (systemd, supervisor)
## Worker Operations
### Graceful Shutdown
```bash
# Send SIGTERM for graceful shutdown
kill -TERM $(pgrep ml-worker)
# Worker will:
# 1. Stop accepting new tasks
# 2. Finish active tasks (up to 5min timeout)
# 3. Release all leases
# 4. Exit cleanly
```
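In code, graceful shutdown usually amounts to a SIGTERM handler that flips a flag the task loop checks before claiming new work. A minimal sketch (the handler body is illustrative, not the worker's actual implementation):

```python
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    """Stop accepting new tasks; the main loop finishes active ones, then exits."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate receiving SIGTERM in-process:
signal.raise_signal(signal.SIGTERM)
print(shutting_down)  # True
```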
### Force Shutdown
```bash
# Force kill (leases will be reclaimed automatically)
kill -9 $(pgrep ml-worker)
```
### Worker Heartbeat Monitoring
```redis
# Check worker heartbeats
HGETALL worker:heartbeat
# Example output:
# worker-abc123 1701234567
# worker-def456 1701234580
```
**Alert if:** Heartbeat timestamp > 5 minutes old
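That staleness check is easy to script from the `worker:heartbeat` hash. A sketch, assuming Unix-timestamp values as in the example output above:

```python
import time

STALE_AFTER = 300  # seconds; alert when a heartbeat is more than 5 minutes old

def stale_workers(heartbeats, now=None):
    """Return worker IDs whose last heartbeat is older than the threshold."""
    now = now or time.time()
    return [w for w, ts in heartbeats.items() if now - int(ts) > STALE_AFTER]

now = 1701234900
print(stale_workers({"worker-abc123": 1701234567, "worker-def456": 1701234880}, now))
# ['worker-abc123']  (333s old > 300s; the other is only 20s old)
```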
## Redis Operations
### Backup
```bash
# Manual backup (note: SAVE blocks Redis; use BGSAVE on a live instance)
redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb
```
### Restore
```bash
# Stop Redis
systemctl stop redis
# Restore snapshot
cp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb
# Start Redis
systemctl start redis
```
### Memory Management
```redis
# Check memory usage
INFO memory
# Last resort: clear the current database
FLUSHDB # DANGER: Deletes ALL keys, including queued tasks!
```
## Common Issues
### Issue: Queue Growing Unbounded
**Symptoms:**
- `ZCARD task:queue` keeps increasing
- No workers processing tasks
**Diagnosis:**
```bash
# Check worker status
systemctl status ml-worker
# Check logs
journalctl -u ml-worker -n 100
```
**Resolution:**
1. Verify workers are running
2. Check Redis connectivity
3. Verify lease configuration
### Issue: High Retry Rate
**Symptoms:**
- Many tasks in DLQ
- `retry_count` field high on tasks
**Diagnosis:**
```bash
# Check worker logs for errors
journalctl -u ml-worker | grep "retry"
# Look for patterns (network issues, resource limits, etc)
```
**Resolution:**
- Fix underlying issue (network, resources, etc)
- Adjust retry limits if permanent failures
- Increase task timeout if jobs are slow
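To know whether the warning threshold below (retry rate above 10%) has been crossed, compute the rate from task counters. A sketch (the counter names are illustrative; derive them from your logs or metrics):

```python
def retry_rate(retried, completed_first_try):
    """Fraction of finished tasks that needed at least one retry."""
    total = retried + completed_first_try
    return retried / total if total else 0.0

print(retry_rate(12, 88))  # 0.12 -> above the 10% warning threshold
```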
### Issue: Leases Expiring Prematurely
**Symptoms:**
- Tasks retried even though worker is healthy
- Logs show "lease expired" frequently
**Diagnosis:**
```bash
# Check worker config
grep -A3 "lease" configs/worker-config.yaml
```
Output:
```yaml
task_lease_duration: 30m # Too short?
heartbeat_interval: 1m # Too infrequent?
```
**Resolution:**
```yaml
# Increase lease duration for long-running jobs
task_lease_duration: 60m
heartbeat_interval: 30s # More frequent heartbeats
```
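A useful sanity check when tuning these two values: the lease should cover several heartbeat intervals, so a single delayed beat cannot look like an expired lease. A sketch (the factor of 5 is an assumption, not a documented requirement):

```python
def lease_config_ok(lease_s, heartbeat_s, margin=5):
    """Require the lease to span several heartbeats (margin is illustrative)."""
    return lease_s >= margin * heartbeat_s

print(lease_config_ok(30 * 60, 60))  # True: 30m lease, 1m heartbeat
print(lease_config_ok(60, 60))       # False: lease equals heartbeat interval
```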
## Performance Tuning
### Worker Concurrency
```yaml
# worker-config.yaml
max_workers: 4 # Number of parallel tasks
# Adjust based on:
# - CPU cores available
# - Memory per task
# - GPU availability
```
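The comments above suggest a simple sizing rule: take the tighter of the CPU and memory limits. A sketch of that arithmetic (the function and its parameters are illustrative, not part of the worker):

```python
import os

def suggested_max_workers(mem_gb, mem_per_task_gb, cores=None):
    """Take the tighter of the CPU and memory limits; always allow at least one."""
    cores = cores or os.cpu_count() or 1
    by_memory = int(mem_gb // mem_per_task_gb)
    return max(1, min(cores, by_memory))

print(suggested_max_workers(mem_gb=16, mem_per_task_gb=4, cores=8))  # 4
```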
### Redis Configuration
```conf
# /etc/redis/redis.conf
# Persistence
save 900 1
save 300 10
# Memory
maxmemory 2gb
maxmemory-policy noeviction
# Performance
tcp-keepalive 300
timeout 0
```
## Alerting Rules
### Critical Alerts
1. **Worker Down** (no heartbeat > 5min)
2. **Queue Depth** > 1000 tasks
3. **DLQ Growth** > 100 tasks/hour
4. **Redis Down** (connection failed)
### Warning Alerts
1. **High Retry Rate** > 10% of tasks
2. **Slow Queue Drain** (depth increasing over 1 hour)
3. **Worker Memory** > 80% usage
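The critical rules above translate directly into an evaluation function, useful as a starting point for wiring these thresholds into your alerting system (parameter names are illustrative):

```python
def critical_alerts(queue_depth, dlq_growth_per_hour, heartbeat_age_s, redis_up):
    """Evaluate the critical alert rules listed above; return the ones firing."""
    fired = []
    if heartbeat_age_s > 300:
        fired.append("worker down")
    if queue_depth > 1000:
        fired.append("queue depth")
    if dlq_growth_per_hour > 100:
        fired.append("dlq growth")
    if not redis_up:
        fired.append("redis down")
    return fired

print(critical_alerts(1500, 10, 60, True))  # ['queue depth']
```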
## Health Checks
```bash
#!/bin/bash
# health-check.sh

# Check Redis
redis-cli PING > /dev/null || echo "Redis DOWN"

# Check worker heartbeat (missing heartbeat counts as stale)
WORKER_ID=$(cat /var/run/ml-worker.pid)
LAST_HB=$(redis-cli HGET worker:heartbeat "$WORKER_ID")
NOW=$(date +%s)
if [ -z "$LAST_HB" ] || [ $((NOW - LAST_HB)) -gt 300 ]; then
  echo "Worker heartbeat stale"
fi

# Check queue depth
DEPTH=$(redis-cli ZCARD task:queue)
if [ "$DEPTH" -gt 1000 ]; then
  echo "Queue depth critical: $DEPTH"
fi
```
## Runbook Checklist
### Daily Operations
- [ ] Check queue depth
- [ ] Verify worker heartbeats
- [ ] Review DLQ for patterns
- [ ] Check Redis memory usage
### Weekly Operations
- [ ] Review retry rates
- [ ] Analyze failed task patterns
- [ ] Backup Redis snapshot
- [ ] Review worker logs
### Monthly Operations
- [ ] Performance tuning review
- [ ] Capacity planning
- [ ] Update documentation
- [ ] Test disaster recovery
---
**For homelab setups:**
Most of these operations can be simplified. Focus on:
- Basic monitoring (queue depth, worker status)
- Periodic Redis backups
- Graceful shutdowns for maintenance