---
layout: page
title: "Operations Runbook"
permalink: /operations/
nav_order: 6
---

# Operations Runbook

Operational guide for troubleshooting and maintaining the ML experiment system.

## Task Queue Operations

### Monitoring Queue Health

```redis
# Check queue depth
ZCARD task:queue

# List pending tasks
ZRANGE task:queue 0 -1 WITHSCORES

# Check dead letter queue
# (KEYS walks the whole keyspace; on large instances prefer
#  SCAN 0 MATCH task:dlq:* COUNT 100)
KEYS task:dlq:*
```

### Handling Stuck Tasks

**Symptom:** Tasks stuck in "running" status

**Diagnosis:**

```bash
# Check for expired leases
redis-cli GET task:{task-id}
# Look for a LeaseExpiry timestamp that is in the past
```

**Remediation:**

Tasks with expired leases are reclaimed automatically every minute. To force immediate reclamation:

```bash
# Restart worker to trigger reclaim cycle
systemctl restart ml-worker
```

### Dead Letter Queue Management

**View failed tasks:**

```redis
KEYS task:dlq:*
```

**Inspect failed task:**

```redis
GET task:dlq:{task-id}
```

**Retry from DLQ:**

```bash
# Manual retry (requires custom script)
# 1. Get task from DLQ
# 2. Reset retry count
# 3. Re-queue task
```

### Worker Crashes

**Symptom:** Worker disappeared mid-task

**What Happens:**

1. Lease expires after 30 minutes (default)
2. Background reclaim job detects expired lease
3. Task is retried (up to 3 attempts)
4. After max retries → Dead Letter Queue

**Prevention:**

- Monitor worker heartbeats
- Set up alerts for worker down
- Use a process manager (systemd, supervisor)

## Worker Operations

### Graceful Shutdown

```bash
# Send SIGTERM for graceful shutdown
kill -TERM $(pgrep ml-worker)

# Worker will:
# 1. Stop accepting new tasks
# 2. Finish active tasks (up to 5min timeout)
# 3. Release all leases
# 4. Exit cleanly
```

### Force Shutdown

```bash
# Force kill (leases are not released; they are reclaimed
# automatically once they expire)
kill -9 $(pgrep ml-worker)
```

### Worker Heartbeat Monitoring

```redis
# Check worker heartbeats
HGETALL worker:heartbeat

# Example output:
# worker-abc123 1701234567
# worker-def456 1701234580
```

**Alert if:** The heartbeat timestamp is more than 5 minutes old

## Redis Operations

### Backup

```bash
# Manual backup (SAVE blocks Redis until the snapshot is written)
redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb
```

### Restore

```bash
# Stop Redis
systemctl stop redis

# Restore snapshot
cp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb

# Start Redis
systemctl start redis
```

### Memory Management

```redis
# Check memory usage
INFO memory

# Last resort only
FLUSHDB  # DANGER: Deletes every key in the current database, including queued tasks!
```

## Common Issues

### Issue: Queue Growing Unbounded

**Symptoms:**

- `ZCARD task:queue` keeps increasing
- No workers processing tasks

**Diagnosis:**

```bash
# Check worker status
systemctl status ml-worker

# Check logs
journalctl -u ml-worker -n 100
```

**Resolution:**

1. Verify workers are running
2. Check Redis connectivity
3. Verify lease configuration

### Issue: High Retry Rate

**Symptoms:**

- Many tasks in DLQ
- `retry_count` field high on tasks

**Diagnosis:**

```bash
# Check worker logs for errors
journalctl -u ml-worker | grep "retry"

# Look for patterns (network issues, resource limits, etc)
```

**Resolution:**

- Fix the underlying issue (network, resources, etc)
- Adjust retry limits if the failures are permanent and retrying cannot help
- Increase the task timeout if jobs are slow

### Issue: Leases Expiring Prematurely

**Symptoms:**

- Tasks retried even though the worker is healthy
- Logs show "lease expired" frequently

**Diagnosis:**

```yaml
# Check worker config:
#   grep -A3 "lease" configs/worker-config.yaml
task_lease_duration: 30m  # Too short?
heartbeat_interval: 1m    # Too infrequent?
```

**Resolution:**

```yaml
# Increase lease duration for long-running jobs
task_lease_duration: 60m
heartbeat_interval: 30s  # More frequent heartbeats
```

## Performance Tuning

### Worker Concurrency

```yaml
# worker-config.yaml
max_workers: 4  # Number of parallel tasks

# Adjust based on:
# - CPU cores available
# - Memory per task
# - GPU availability
```

### Redis Configuration

```conf
# /etc/redis/redis.conf

# Persistence
save 900 1
save 300 10

# Memory
maxmemory 2gb
maxmemory-policy noeviction

# Performance
tcp-keepalive 300
timeout 0
```

## Alerting Rules

### Critical Alerts

1. **Worker Down** (no heartbeat > 5min)
2. **Queue Depth** > 1000 tasks
3. **DLQ Growth** > 100 tasks/hour
4. **Redis Down** (connection failed)

### Warning Alerts

1. **High Retry Rate** > 10% of tasks
2. **Slow Queue Drain** (depth increasing over 1 hour)
3. **Worker Memory** > 80% usage

## Health Checks

```bash
#!/bin/bash
# health-check.sh

# Check Redis
redis-cli PING > /dev/null || echo "Redis DOWN"

# Check worker heartbeat
# NOTE: this looks up the heartbeat by the contents of the PID file. If the
# heartbeat fields are worker IDs (e.g. worker-abc123), set WORKER_ID to that.
WORKER_ID=$(cat /var/run/ml-worker.pid)
LAST_HB=$(redis-cli HGET worker:heartbeat "$WORKER_ID")
NOW=$(date +%s)
# A missing heartbeat is treated as 0, so it is reported as stale
if [ $((NOW - ${LAST_HB:-0})) -gt 300 ]; then
  echo "Worker heartbeat stale"
fi

# Check queue depth
DEPTH=$(redis-cli ZCARD task:queue)
if [ "${DEPTH:-0}" -gt 1000 ]; then
  echo "Queue depth critical: $DEPTH"
fi
```

## Runbook Checklist

### Daily Operations

- [ ] Check queue depth
- [ ] Verify worker heartbeats
- [ ] Review DLQ for patterns
- [ ] Check Redis memory usage

### Weekly Operations

- [ ] Review retry rates
- [ ] Analyze failed task patterns
- [ ] Backup Redis snapshot
- [ ] Review worker logs

### Monthly Operations

- [ ] Performance tuning review
- [ ] Capacity planning
- [ ] Update documentation
- [ ] Test disaster recovery

---

**For homelab setups:** Most of these operations can be simplified. Focus on:

- Basic monitoring (queue depth, worker status)
- Periodic Redis backups
- Graceful shutdowns for maintenance
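
For the basic monitoring suggested for homelab setups, the two threshold checks in this runbook (heartbeat older than 5 minutes, queue depth above 1000) can be kept as small pure functions that are easy to test without a live Redis. The following is a minimal sketch, not part of the system itself: the function names `heartbeat_is_stale` and `queue_depth_level` are hypothetical, and the defaults of 300 seconds and 1000 tasks are simply taken from the alerting rules above.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for a homelab check. The names are illustrative;
# the default thresholds mirror the alerting rules in this runbook.

# Succeeds (exit 0) when the heartbeat is older than max_age seconds,
# or when the timestamp is missing or not a plain integer.
heartbeat_is_stale() {
  local now="$1" last_hb="$2" max_age="${3:-300}"
  [[ "$last_hb" =~ ^[0-9]+$ ]] || return 0
  (( now - last_hb > max_age ))
}

# Prints "critical" when depth exceeds the limit, otherwise "ok".
queue_depth_level() {
  local depth="$1" limit="${2:-1000}"
  if (( depth > limit )); then
    echo "critical"
  else
    echo "ok"
  fi
}

# Example wiring (assumes redis-cli is on PATH and that worker-abc123
# is a field in the worker:heartbeat hash):
#   last=$(redis-cli HGET worker:heartbeat worker-abc123)
#   heartbeat_is_stale "$(date +%s)" "$last" && echo "Worker heartbeat stale"
#   queue_depth_level "$(redis-cli ZCARD task:queue)"
```

Treating a missing or malformed timestamp as stale is a deliberate choice in this sketch: a worker that has never written a heartbeat is reported rather than silently ignored.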