
---
title: Operations Runbook
url: /operations/
weight: 6
---

# Operations Runbook

Operational guide for troubleshooting and maintaining the ML experiment system.

## Task Queue Operations

### Monitoring Queue Health

```shell
# Check queue depth
redis-cli ZCARD task:queue

# List pending tasks with their scores
redis-cli ZRANGE task:queue 0 -1 WITHSCORES

# Check the dead letter queue (prefer SCAN in production; KEYS blocks the server)
redis-cli KEYS task:dlq:*
```

### Handling Stuck Tasks

**Symptom:** Tasks stuck in "running" status.

**Diagnosis:**

```shell
# Check for expired leases
redis-cli GET task:{task-id}
# Look for a LeaseExpiry timestamp in the past
```

**Remediation:** Tasks with expired leases are automatically reclaimed every minute. To force immediate reclamation:

```shell
# Restart a worker to trigger the reclaim cycle
systemctl restart ml-worker
```

### Dead Letter Queue Management

View failed tasks:

```shell
redis-cli KEYS task:dlq:*
```

Inspect a failed task:

```shell
redis-cli GET task:dlq:{task-id}
```

Retry from the DLQ (requires a custom script):

1. Get the task payload from the DLQ.
2. Reset its retry count.
3. Re-queue the task.
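
The payload transformation for step 2 can be sketched in Python (the JSON format and the `retry_count`/`status` field names are assumptions based on the retry-rate section below; adjust to the real task schema):

```python
import json

def prepare_requeue(dlq_payload: str) -> str:
    """Reset the retry counter on a DLQ payload before re-queuing it.

    Assumes the task is stored as JSON with `retry_count` and `status`
    fields; the actual task format may differ.
    """
    task = json.loads(dlq_payload)
    task["retry_count"] = 0
    task["status"] = "pending"
    return json.dumps(task)

# Example: a payload fetched with `GET task:dlq:{task-id}`
failed = json.dumps({"id": "task-42", "retry_count": 3, "status": "failed"})
print(prepare_requeue(failed))
```

The transformed payload would then be added back to `task:queue` (`ZADD`) and the DLQ key deleted.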

### Worker Crashes

**Symptom:** Worker disappeared mid-task.

**What happens:**

1. The lease expires after 30 minutes (default).
2. The background reclaim job detects the expired lease.
3. The task is retried (up to 3 attempts).
4. After max retries, the task moves to the dead letter queue.
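
Condensed into code, the decision the reclaim job makes for each task looks roughly like this (an illustrative Python sketch; the function and constant names are not from the actual implementation):

```python
import time

LEASE_DURATION_S = 30 * 60  # default lease: 30 minutes
MAX_RETRIES = 3

def reclaim_action(lease_expiry: float, retry_count: int, now: float) -> str:
    """What the background reclaim job does with one task."""
    if now < lease_expiry:
        return "leave"        # lease still valid; worker presumed alive
    if retry_count < MAX_RETRIES:
        return "retry"        # re-queue with retry_count + 1
    return "dead-letter"      # move to task:dlq:{task-id}

now = time.time()
print(reclaim_action(now + LEASE_DURATION_S, 0, now))  # → leave
print(reclaim_action(now - 1, 2, now))                 # → retry
print(reclaim_action(now - 1, 3, now))                 # → dead-letter
```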

**Prevention:**

- Monitor worker heartbeats
- Set up alerts for worker-down conditions
- Run workers under a process manager (systemd, supervisor)

## Worker Operations

### Graceful Shutdown

```shell
# Send SIGTERM for a graceful shutdown
kill -TERM $(pgrep ml-worker)
```

On SIGTERM the worker will:

1. Stop accepting new tasks.
2. Finish active tasks (up to a 5-minute timeout).
3. Release all of its leases.
4. Exit cleanly.
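
The drain sequence can be sketched as a signal handler (illustrative Python; the real ml-worker implementation is not shown in these docs, and `claim_next_task`/`release_all_leases` are hypothetical names):

```python
import os
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    """Step 1: flag the loop so no new tasks are claimed."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# Worker loop sketch:
# while not shutting_down:
#     task = claim_next_task()   # hypothetical
#     run(task)                  # step 2: finish active work (with timeout)
# release_all_leases()           # step 3, then the process exits (step 4)

os.kill(os.getpid(), signal.SIGTERM)  # simulate systemd stopping the unit
print(shutting_down)  # → True
```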

### Force Shutdown

```shell
# Force kill (leases will be reclaimed automatically)
kill -9 $(pgrep ml-worker)
```

### Worker Heartbeat Monitoring

```shell
# Check worker heartbeats (fields are worker IDs, values are Unix timestamps)
redis-cli HGETALL worker:heartbeat

# Example output:
# worker-abc123 1701234567
# worker-def456 1701234580
```

**Alert if:** a heartbeat timestamp is more than 5 minutes old.
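
That rule can be written as a small check (a Python sketch over the hash contents shown above; in practice you would fetch `worker:heartbeat` with a Redis client):

```python
STALE_AFTER_S = 5 * 60  # alert threshold: 5 minutes

def stale_workers(heartbeats: dict[str, int], now: float) -> list[str]:
    """Return worker IDs whose last heartbeat is older than the threshold."""
    return [wid for wid, ts in heartbeats.items() if now - ts > STALE_AFTER_S]

beats = {"worker-abc123": 1701234567, "worker-def456": 1701234580}
print(stale_workers(beats, now=1701234567 + 310))  # → ['worker-abc123']
```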

## Redis Operations

### Backup

```shell
# Manual backup (SAVE blocks the server; prefer BGSAVE on a live instance)
redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb
```

### Restore

```shell
# Stop Redis
systemctl stop redis

# Restore the snapshot
cp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb

# Start Redis
systemctl start redis
```

### Memory Management

```shell
# Check memory usage
redis-cli INFO memory

# Evict old data if needed
redis-cli FLUSHDB  # DANGER: clears all data in the current database!
```

## Common Issues

### Issue: Queue Growing Unbounded

**Symptoms:**

- `ZCARD task:queue` keeps increasing
- No workers processing tasks

**Diagnosis:**

```shell
# Check worker status
systemctl status ml-worker

# Check logs
journalctl -u ml-worker -n 100
```

**Resolution:**

1. Verify workers are running.
2. Check Redis connectivity.
3. Verify the lease configuration.

### Issue: High Retry Rate

**Symptoms:**

- Many tasks in the DLQ
- High `retry_count` values on tasks

**Diagnosis:**

```shell
# Check worker logs for errors
journalctl -u ml-worker | grep "retry"

# Look for patterns (network issues, resource limits, etc.)
```

**Resolution:**

- Fix the underlying issue (network, resources, etc.)
- Adjust retry limits if failures are permanent
- Increase the task timeout if jobs are legitimately slow

### Issue: Leases Expiring Prematurely

**Symptoms:**

- Tasks retried even though the worker is healthy
- Logs show "lease expired" frequently

**Diagnosis:**

```shell
# Check worker config
grep -A3 "lease" configs/worker-config.yaml
```

```yaml
task_lease_duration: 30m  # Too short?
heartbeat_interval: 1m    # Too infrequent?
```

**Resolution:**

```yaml
# Increase lease duration for long-running jobs
task_lease_duration: 60m
heartbeat_interval: 30s  # More frequent heartbeats
```
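
A quick sanity check on the two settings (a heuristic sketch, not a documented rule: require that several heartbeats fit inside one lease window, so a single missed beat cannot look like a dead worker):

```python
def lease_config_ok(lease_duration_s: int, heartbeat_interval_s: int,
                    min_beats_per_lease: int = 3) -> bool:
    """True if at least `min_beats_per_lease` heartbeats fit in one lease."""
    return lease_duration_s >= min_beats_per_lease * heartbeat_interval_s

print(lease_config_ok(30 * 60, 60))  # 30m lease, 1m heartbeat → True
print(lease_config_ok(60, 30))       # 60s lease, 30s heartbeat → False
```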

## Performance Tuning

### Worker Concurrency

```yaml
# worker-config.yaml
max_workers: 4  # Number of parallel tasks

# Adjust based on:
# - CPU cores available
# - Memory per task
# - GPU availability
```
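
One way to turn those three factors into a number (purely illustrative; tune against measured task profiles):

```python
def suggested_max_workers(cpu_cores: int, mem_gb: float,
                          mem_per_task_gb: float, gpus: int = 0) -> int:
    """Heuristic: take the tightest of the CPU, memory, and GPU limits."""
    limits = [cpu_cores, int(mem_gb // mem_per_task_gb)]
    if gpus:
        limits.append(gpus)  # GPU-bound tasks: at most one task per GPU
    return max(1, min(limits))

# 8 cores, 32 GB RAM, ~8 GB per task → memory-bound at 4 workers
print(suggested_max_workers(cpu_cores=8, mem_gb=32, mem_per_task_gb=8))  # → 4
```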

### Redis Configuration

```conf
# /etc/redis/redis.conf

# Persistence: snapshot after 900s if >= 1 key changed, or 300s if >= 10 changed
save 900 1
save 300 10

# Memory: cap usage and fail writes rather than silently evicting task data
maxmemory 2gb
maxmemory-policy noeviction

# Performance
tcp-keepalive 300
timeout 0
```

## Alerting Rules

### Critical Alerts

1. **Worker down:** no heartbeat for more than 5 minutes
2. **Queue depth:** more than 1000 tasks
3. **DLQ growth:** more than 100 tasks/hour
4. **Redis down:** connection failed

### Warning Alerts

1. **High retry rate:** more than 10% of tasks retried
2. **Slow queue drain:** depth increasing over 1 hour
3. **Worker memory:** above 80% usage

## Health Checks

```shell
#!/bin/bash
# health-check.sh

# Check Redis connectivity
redis-cli PING > /dev/null || echo "Redis DOWN"

# Check worker heartbeat
WORKER_ID=$(cat /var/run/ml-worker.pid)
LAST_HB=$(redis-cli HGET worker:heartbeat "$WORKER_ID")
NOW=$(date +%s)
# Default to 0 so a missing heartbeat also triggers the alert
if [ $((NOW - ${LAST_HB:-0})) -gt 300 ]; then
  echo "Worker heartbeat stale"
fi

# Check queue depth
DEPTH=$(redis-cli ZCARD task:queue)
if [ "$DEPTH" -gt 1000 ]; then
  echo "Queue depth critical: $DEPTH"
fi
```

## Runbook Checklist

### Daily Operations

1. Check queue depth
2. Verify worker heartbeats
3. Review the DLQ for failure patterns
4. Check Redis memory usage

### Weekly Operations

1. Review retry rates
2. Analyze failed-task patterns
3. Back up a Redis snapshot
4. Review worker logs

### Monthly Operations

1. Performance tuning review
2. Capacity planning
3. Update documentation
4. Test disaster recovery

**For homelab setups:** most of these operations can be simplified. Focus on:

- Basic monitoring (queue depth, worker status)
- Periodic Redis backups
- Graceful shutdowns before maintenance