
---
layout: page
title: Operations Runbook
permalink: /operations/
nav_order: 6
---

# Operations Runbook

Operational guide for troubleshooting and maintaining the ML experiment system.

## Task Queue Operations

### Monitoring Queue Health

```shell
# Check queue depth
redis-cli ZCARD task:queue

# List pending tasks with their scores
redis-cli ZRANGE task:queue 0 -1 WITHSCORES

# List dead letter queue entries (KEYS blocks Redis; fine on small keyspaces)
redis-cli KEYS 'task:dlq:*'
```
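
On a large keyspace, `KEYS` blocks the server; the same DLQ check can be sketched non-blockingly with `redis-cli --scan` (the `task:dlq:*` key pattern is taken from this runbook):

```shell
# Count DLQ entries without blocking Redis: SCAN iterates incrementally.
# REDIS_CMD is only a dry-run override hook; it defaults to redis-cli.
dlq_count() {
  ${REDIS_CMD:-redis-cli} --scan --pattern 'task:dlq:*' | wc -l
}
```

`dlq_count` prints the number of failed tasks currently parked in the DLQ.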

### Handling Stuck Tasks

**Symptom:** Tasks stuck in "running" status.

**Diagnosis:**

```shell
# Inspect the task record and check whether its lease expiry is in the past
redis-cli GET task:{task-id}
```

**Remediation:** Tasks with expired leases are automatically reclaimed every minute. To force immediate reclamation:

```shell
# Restart the worker to trigger a reclaim cycle
systemctl restart ml-worker
```

### Dead Letter Queue Management

**View failed tasks:**

```shell
redis-cli KEYS 'task:dlq:*'
```

**Inspect a failed task:**

```shell
redis-cli GET task:dlq:{task-id}
```

**Retry from DLQ:**

```shell
# Manual retry (requires a custom script):
# 1. Get the task from the DLQ
# 2. Reset its retry count
# 3. Re-queue the task
```
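
The manual retry steps can be sketched as a small helper. This is a sketch, not a shipped tool: the key layout (`task:dlq:<id>` payload, `task:queue` sorted set scored by timestamp) is taken from this runbook, and the retry-count reset is left as a comment because the task payload schema is not specified here:

```shell
# Hypothetical DLQ retry helper -- adapt key names to your deployment.
# REDIS_CMD is a dry-run override hook; it defaults to redis-cli.
retry_from_dlq() {
  local task_id="$1"
  # 1. Move the payload out of the DLQ back to its live key
  ${REDIS_CMD:-redis-cli} RENAME "task:dlq:${task_id}" "task:${task_id}"
  # 2. Reset the retry count here (depends on how your task payload stores it)
  # 3. Re-queue with the current timestamp as the score
  ${REDIS_CMD:-redis-cli} ZADD task:queue "$(date +%s)" "${task_id}"
}
```

A dry run (`REDIS_CMD=echo retry_from_dlq some-task`) prints the Redis commands without touching the server.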

### Worker Crashes

**Symptom:** Worker disappeared mid-task.

**What happens:**

1. Lease expires after 30 minutes (default)
2. Background reclaim job detects the expired lease
3. Task is retried (up to 3 attempts)
4. After max retries, the task moves to the dead letter queue

**Prevention:**

- Monitor worker heartbeats
- Set up alerts for a worker going down
- Run workers under a process manager (systemd, supervisor)
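
For the last point, a minimal systemd unit looks like the following; the binary and config paths are illustrative, not taken from this repo:

```ini
# /etc/systemd/system/ml-worker.service (hypothetical paths)
[Unit]
Description=FetchML worker
After=network-online.target redis.service

[Service]
ExecStart=/usr/local/bin/ml-worker --config /etc/ml-worker/worker-config.yaml
Restart=always
RestartSec=5
# Allow graceful shutdown time to finish active tasks (5 min + margin)
TimeoutStopSec=330

[Install]
WantedBy=multi-user.target
```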

## Worker Operations

### Graceful Shutdown

```shell
# Send SIGTERM for a graceful shutdown
kill -TERM $(pgrep ml-worker)

# The worker will:
# 1. Stop accepting new tasks
# 2. Finish active tasks (up to a 5-minute timeout)
# 3. Release all leases
# 4. Exit cleanly
```

### Force Shutdown

```shell
# Force kill (leases will be reclaimed automatically)
kill -9 $(pgrep ml-worker)
```

### Worker Heartbeat Monitoring

```shell
# Check worker heartbeats (fields are worker IDs, values are Unix timestamps)
redis-cli HGETALL worker:heartbeat

# Example output:
# worker-abc123 1701234567
# worker-def456 1701234580
```

**Alert if:** a heartbeat timestamp is more than 5 minutes old.

## Redis Operations

### Backup

```shell
# Manual backup (SAVE blocks the server; BGSAVE is the non-blocking alternative)
redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb
```

### Restore

```shell
# Stop Redis
systemctl stop redis

# Restore the snapshot
cp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb

# Start Redis
systemctl start redis
```
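
The two backup commands can be wrapped into a dated backup with simple retention. A sketch, with assumed defaults: the paths and the keep-7 retention are illustrative, not project settings:

```shell
# Hypothetical backup helper: blocking SAVE, dated copy, keep the 7 newest.
# REDIS_CMD is a dry-run override hook; it defaults to redis-cli.
backup_redis() {
  local src="${1:-/var/lib/redis/dump.rdb}" dest_dir="${2:-/backup}"
  ${REDIS_CMD:-redis-cli} SAVE   # blocking; run off-peak, or use BGSAVE and poll LASTSAVE
  cp "$src" "${dest_dir}/redis-$(date +%Y%m%d).rdb"
  # Drop everything but the 7 newest snapshots
  ls -1t "${dest_dir}"/redis-*.rdb | tail -n +8 | xargs -r rm -f
}
```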

### Memory Management

```shell
# Check memory usage
redis-cli INFO memory

# Evict old data if needed
redis-cli FLUSHDB  # DANGER: clears ALL data in the current database!
```

## Common Issues

### Issue: Queue Growing Unbounded

**Symptoms:**

- `ZCARD task:queue` keeps increasing
- No workers processing tasks

**Diagnosis:**

```shell
# Check worker status
systemctl status ml-worker

# Check the last 100 log lines
journalctl -u ml-worker -n 100
```

**Resolution:**

1. Verify workers are running
2. Check Redis connectivity
3. Verify lease configuration

### Issue: High Retry Rate

**Symptoms:**

- Many tasks in the DLQ
- High `retry_count` values on tasks

**Diagnosis:**

```shell
# Check worker logs for retry errors
journalctl -u ml-worker | grep "retry"

# Look for patterns (network issues, resource limits, etc.)
```

**Resolution:**

- Fix the underlying issue (network, resources, etc.)
- Adjust retry limits when failures are permanent
- Increase the task timeout if jobs are legitimately slow

### Issue: Leases Expiring Prematurely

**Symptoms:**

- Tasks retried even though the worker is healthy
- Logs show "lease expired" frequently

**Diagnosis:**

```shell
# Check the worker config
grep -A3 "lease" configs/worker-config.yaml
```

```yaml
task_lease_duration: 30m  # Too short?
heartbeat_interval: 1m    # Too infrequent?
```

**Resolution:**

```yaml
# Increase the lease duration for long-running jobs
task_lease_duration: 60m
heartbeat_interval: 30s  # More frequent heartbeats
```

## Performance Tuning

### Worker Concurrency

```yaml
# worker-config.yaml
max_workers: 4  # Number of parallel tasks

# Adjust based on:
# - CPU cores available
# - Memory per task
# - GPU availability
```
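
The sizing guidance above can be turned into a quick back-of-the-envelope check: take the smaller of the CPU-bound and memory-bound worker counts. The per-task memory figure is purely illustrative:

```shell
# Suggest a max_workers value from CPU cores and total memory (Linux only).
# per_task_gb defaults to an illustrative 4 GiB per task.
suggest_max_workers() {
  local per_task_gb="${1:-4}" cores mem_gb
  cores=$(nproc)
  mem_gb=$(awk '/MemTotal/ {printf "%d", $2/1048576}' /proc/meminfo)
  local by_mem=$((mem_gb / per_task_gb))
  # Take the smaller of the two limits, but always at least 1
  local n=$(( cores < by_mem ? cores : by_mem ))
  [ "$n" -lt 1 ] && n=1
  echo "$n"
}
```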

### Redis Configuration

```conf
# /etc/redis/redis.conf

# Persistence: snapshot after 900s if >=1 key changed, or 300s if >=10 changed
save 900 1
save 300 10

# Memory: cap at 2 GiB and refuse writes rather than silently evict task data
maxmemory 2gb
maxmemory-policy noeviction

# Performance
tcp-keepalive 300
timeout 0
```

## Alerting Rules

### Critical Alerts

1. **Worker down** (no heartbeat for > 5 min)
2. **Queue depth** > 1000 tasks
3. **DLQ growth** > 100 tasks/hour
4. **Redis down** (connection failed)

### Warning Alerts

1. **High retry rate** (> 10% of tasks)
2. **Slow queue drain** (depth increasing over 1 hour)
3. **Worker memory** > 80% usage

## Health Checks

```shell
#!/bin/bash
# health-check.sh

# Check Redis (redis-cli exits non-zero if it cannot connect)
redis-cli PING > /dev/null || echo "Redis DOWN"

# Check worker heartbeat; also flag a missing heartbeat entry
WORKER_ID=$(cat /var/run/ml-worker.pid)
LAST_HB=$(redis-cli HGET worker:heartbeat "$WORKER_ID")
NOW=$(date +%s)
if [ -z "$LAST_HB" ] || [ $((NOW - LAST_HB)) -gt 300 ]; then
  echo "Worker heartbeat stale or missing"
fi

# Check queue depth
DEPTH=$(redis-cli ZCARD task:queue)
if [ "$DEPTH" -gt 1000 ]; then
  echo "Queue depth critical: $DEPTH"
fi
```
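
To run the check periodically, a crontab entry like the following works; the install path is an assumption:

```
*/5 * * * * /usr/local/bin/health-check.sh 2>&1 | logger -t ml-health
```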

## Runbook Checklist

### Daily Operations

- Check queue depth
- Verify worker heartbeats
- Review the DLQ for patterns
- Check Redis memory usage

### Weekly Operations

- Review retry rates
- Analyze failed-task patterns
- Back up a Redis snapshot
- Review worker logs

### Monthly Operations

- Performance tuning review
- Capacity planning
- Update documentation
- Test disaster recovery

**For homelab setups:** most of these operations can be simplified. Focus on:

- Basic monitoring (queue depth, worker status)
- Periodic Redis backups
- Graceful shutdowns for maintenance