---
layout: page
title: "Operations Runbook"
permalink: /operations/
nav_order: 6
---
# Operations Runbook
Operational guide for troubleshooting and maintaining the ML experiment system.
## Task Queue Operations
### Monitoring Queue Health
```redis
# Check queue depth
ZCARD task:queue
# List pending tasks
ZRANGE task:queue 0 -1 WITHSCORES
# Check dead letter queue
KEYS task:dlq:*
```
### Handling Stuck Tasks
**Symptom:** Tasks stuck in "running" status
**Diagnosis:**
```bash
# Check for expired leases
redis-cli GET task:{task-id}
# Look for a LeaseExpiry timestamp in the past
```
**Remediation:**
Tasks with expired leases are reclaimed automatically every minute. To force immediate reclamation:
```bash
# Restart worker to trigger reclaim cycle
systemctl restart ml-worker
```
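The reclaim decision itself is simple: any task whose lease expiry is in the past is eligible to be re-queued. A minimal sketch of that check (function and variable names here are illustrative, not the worker's actual code):

```python
from datetime import datetime, timedelta, timezone

def lease_expired(lease_expiry, now=None):
    """A task whose lease expiry is in the past is eligible for reclaim."""
    now = now or datetime.now(timezone.utc)
    return lease_expiry < now

# Example: a lease granted 31 minutes ago with the default 30m duration
granted = datetime.now(timezone.utc) - timedelta(minutes=31)
expiry = granted + timedelta(minutes=30)
print(lease_expired(expiry))  # True: the reclaim job would re-queue this task
```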
### Dead Letter Queue Management
**View failed tasks:**
```redis
KEYS task:dlq:*
```
**Inspect failed task:**
```redis
GET task:dlq:{task-id}
```
**Retry from DLQ:**
```bash
# Manual retry (requires custom script)
# 1. Get task from DLQ
# 2. Reset retry count
# 3. Re-queue task
```
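The three steps above can be sketched in Python. This uses in-memory dicts as stand-ins for the `task:dlq:{task-id}` keys and the `task:queue` sorted set; a real script would perform the same operations through a Redis client (e.g. redis-py), and the task fields shown are illustrative:

```python
import json
import time

# In-memory stand-ins for the Redis keys named above.
dlq = {"task:dlq:abc123": json.dumps({"id": "abc123", "retry_count": 3, "payload": "train"})}
queue = {}  # task id -> score, standing in for the task:queue sorted set

def retry_from_dlq(task_id):
    task = json.loads(dlq.pop(f"task:dlq:{task_id}"))  # 1. Get task from DLQ
    task["retry_count"] = 0                            # 2. Reset retry count
    queue[task["id"]] = time.time()                    # 3. Re-queue (ZADD equivalent)

retry_from_dlq("abc123")
print("abc123" in queue, len(dlq))  # True 0
```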
### Worker Crashes
**Symptom:** Worker disappeared mid-task
**What Happens:**
1. Lease expires after 30 minutes (default)
2. Background reclaim job detects expired lease
3. Task is retried (up to 3 attempts)
4. After max retries → Dead Letter Queue
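The retry-or-DLQ decision in steps 3-4 can be sketched as a pure function (a simplification of the actual worker logic):

```python
MAX_RETRIES = 3  # matches the "up to 3 attempts" policy above

def next_step(retry_count):
    """After a lease expires, retry the task until attempts are exhausted."""
    return "retry" if retry_count < MAX_RETRIES else "dead_letter_queue"

print([next_step(n) for n in range(4)])
# ['retry', 'retry', 'retry', 'dead_letter_queue']
```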
**Prevention:**
- Monitor worker heartbeats
- Set up alerts for worker down
- Use process manager (systemd, supervisor)
## Worker Operations
### Graceful Shutdown
```bash
# Send SIGTERM for graceful shutdown
kill -TERM $(pgrep ml-worker)
# Worker will:
# 1. Stop accepting new tasks
# 2. Finish active tasks (up to 5min timeout)
# 3. Release all leases
# 4. Exit cleanly
```
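In code, graceful shutdown usually amounts to a SIGTERM handler that flips a flag the task loop checks before claiming new work. A minimal sketch (the handler body is illustrative, not the worker's actual implementation):

```python
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    """Stop accepting new tasks; the main loop finishes active ones, then exits."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate receiving SIGTERM in-process:
signal.raise_signal(signal.SIGTERM)
print(shutting_down)  # True
```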
### Force Shutdown
```bash
# Force kill (leases will be reclaimed automatically)
kill -9 $(pgrep ml-worker)
```
### Worker Heartbeat Monitoring
```redis
# Check worker heartbeats
HGETALL worker:heartbeat
# Example output:
# worker-abc123 1701234567
# worker-def456 1701234580
```
**Alert if:** Heartbeat timestamp > 5 minutes old
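That staleness check is easy to script from the `worker:heartbeat` hash. A sketch, assuming Unix-timestamp values as in the example output above:

```python
import time

STALE_AFTER = 300  # seconds; alert when a heartbeat is more than 5 minutes old

def stale_workers(heartbeats, now=None):
    """Return worker IDs whose last heartbeat is older than the threshold."""
    now = now or time.time()
    return [w for w, ts in heartbeats.items() if now - int(ts) > STALE_AFTER]

now = 1701234900
print(stale_workers({"worker-abc123": 1701234567, "worker-def456": 1701234880}, now))
# ['worker-abc123']  (333s old > 300s; the other is only 20s old)
```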
## Redis Operations
### Backup
```bash
# Manual backup (note: SAVE blocks Redis; use BGSAVE on a live instance)
redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb
```
### Restore
```bash
# Stop Redis
systemctl stop redis
# Restore snapshot
cp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb
# Start Redis
systemctl start redis
```
### Memory Management
```redis
# Check memory usage
INFO memory
# Last resort: clear the current database
FLUSHDB # DANGER: Deletes ALL keys, including queued tasks!
```
## Common Issues
### Issue: Queue Growing Unbounded
**Symptoms:**
- `ZCARD task:queue` keeps increasing
- No workers processing tasks
**Diagnosis:**
```bash
# Check worker status
systemctl status ml-worker
# Check logs
journalctl -u ml-worker -n 100
```
**Resolution:**
1. Verify workers are running
2. Check Redis connectivity
3. Verify lease configuration
### Issue: High Retry Rate
**Symptoms:**
- Many tasks in DLQ
- `retry_count` field high on tasks
**Diagnosis:**
```bash
# Check worker logs for errors
journalctl -u ml-worker | grep "retry"
# Look for patterns (network issues, resource limits, etc)
```
**Resolution:**
- Fix underlying issue (network, resources, etc)
- Adjust retry limits if permanent failures
- Increase task timeout if jobs are slow
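To know whether the warning threshold below (retry rate above 10%) has been crossed, compute the rate from task counters. A sketch (the counter names are illustrative; derive them from your logs or metrics):

```python
def retry_rate(retried, completed_first_try):
    """Fraction of finished tasks that needed at least one retry."""
    total = retried + completed_first_try
    return retried / total if total else 0.0

print(retry_rate(12, 88))  # 0.12 -> above the 10% warning threshold
```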
### Issue: Leases Expiring Prematurely
**Symptoms:**
- Tasks retried even though worker is healthy
- Logs show "lease expired" frequently
**Diagnosis:**
```bash
# Check worker config
grep -A3 "lease" configs/worker-config.yaml
```
Output:
```yaml
task_lease_duration: 30m # Too short?
heartbeat_interval: 1m # Too infrequent?
```
**Resolution:**
```yaml
# Increase lease duration for long-running jobs
task_lease_duration: 60m
heartbeat_interval: 30s # More frequent heartbeats
```
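A useful sanity check when tuning these two values: the lease should cover several heartbeat intervals, so a single delayed beat cannot look like an expired lease. A sketch (the factor of 5 is an assumption, not a documented requirement):

```python
def lease_config_ok(lease_s, heartbeat_s, margin=5):
    """Require the lease to span several heartbeats (margin is illustrative)."""
    return lease_s >= margin * heartbeat_s

print(lease_config_ok(30 * 60, 60))  # True: 30m lease, 1m heartbeat
print(lease_config_ok(60, 60))       # False: lease equals heartbeat interval
```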
## Performance Tuning
### Worker Concurrency
```yaml
# worker-config.yaml
max_workers: 4 # Number of parallel tasks
# Adjust based on:
# - CPU cores available
# - Memory per task
# - GPU availability
```
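The comments above suggest a simple sizing rule: take the tighter of the CPU and memory limits. A sketch of that arithmetic (the function and its parameters are illustrative, not part of the worker):

```python
import os

def suggested_max_workers(mem_gb, mem_per_task_gb, cores=None):
    """Take the tighter of the CPU and memory limits; always allow at least one."""
    cores = cores or os.cpu_count() or 1
    by_memory = int(mem_gb // mem_per_task_gb)
    return max(1, min(cores, by_memory))

print(suggested_max_workers(mem_gb=16, mem_per_task_gb=4, cores=8))  # 4
```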
### Redis Configuration
```conf
# /etc/redis/redis.conf
# Persistence
save 900 1
save 300 10
# Memory
maxmemory 2gb
maxmemory-policy noeviction
# Performance
tcp-keepalive 300
timeout 0
```
## Alerting Rules
### Critical Alerts
1. **Worker Down** (no heartbeat > 5min)
2. **Queue Depth** > 1000 tasks
3. **DLQ Growth** > 100 tasks/hour
4. **Redis Down** (connection failed)
### Warning Alerts
1. **High Retry Rate** > 10% of tasks
2. **Slow Queue Drain** (depth increasing over 1 hour)
3. **Worker Memory** > 80% usage
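The critical rules above translate directly into an evaluation function, useful as a starting point for wiring these thresholds into your alerting system (parameter names are illustrative):

```python
def critical_alerts(queue_depth, dlq_growth_per_hour, heartbeat_age_s, redis_up):
    """Evaluate the critical alert rules listed above; return the ones firing."""
    fired = []
    if heartbeat_age_s > 300:
        fired.append("worker down")
    if queue_depth > 1000:
        fired.append("queue depth")
    if dlq_growth_per_hour > 100:
        fired.append("dlq growth")
    if not redis_up:
        fired.append("redis down")
    return fired

print(critical_alerts(1500, 10, 60, True))  # ['queue depth']
```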
## Health Checks
```bash
#!/bin/bash
# health-check.sh

# Check Redis
redis-cli PING > /dev/null || echo "Redis DOWN"

# Check worker heartbeat (missing heartbeat counts as stale)
WORKER_ID=$(cat /var/run/ml-worker.pid)
LAST_HB=$(redis-cli HGET worker:heartbeat "$WORKER_ID")
NOW=$(date +%s)
if [ -z "$LAST_HB" ] || [ $((NOW - LAST_HB)) -gt 300 ]; then
  echo "Worker heartbeat stale"
fi

# Check queue depth
DEPTH=$(redis-cli ZCARD task:queue)
if [ "$DEPTH" -gt 1000 ]; then
  echo "Queue depth critical: $DEPTH"
fi
```
## Runbook Checklist
### Daily Operations
- [ ] Check queue depth
- [ ] Verify worker heartbeats
- [ ] Review DLQ for patterns
- [ ] Check Redis memory usage
### Weekly Operations
- [ ] Review retry rates
- [ ] Analyze failed task patterns
- [ ] Backup Redis snapshot
- [ ] Review worker logs
### Monthly Operations
- [ ] Performance tuning review
- [ ] Capacity planning
- [ ] Update documentation
- [ ] Test disaster recovery
---
**For homelab setups:**
Most of these operations can be simplified. Focus on:
- Basic monitoring (queue depth, worker status)
- Periodic Redis backups
- Graceful shutdowns for maintenance