---
layout: page
title: "Operations Runbook"
permalink: /operations/
nav_order: 6
---

# Operations Runbook

Operational guide for troubleshooting and maintaining the ML experiment system.

## Task Queue Operations

### Monitoring Queue Health

```redis
# Check queue depth
ZCARD task:queue

# List pending tasks
ZRANGE task:queue 0 -1 WITHSCORES

# Check dead letter queue
KEYS task:dlq:*
```
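
On a large keyspace, `KEYS` walks every key in a single blocking call. As a minimal sketch, the dead letter queue can instead be listed incrementally with `redis-cli --scan`, which iterates using `SCAN`. The `REDIS_CLI` variable and both function names are hypothetical helpers introduced here for illustration, not part of the system.

```bash
# List dead-letter keys incrementally with SCAN instead of KEYS.
# REDIS_CLI is a hypothetical override (handy for testing); defaults to redis-cli.
dlq_keys() {
    ${REDIS_CLI:-redis-cli} --scan --pattern 'task:dlq:*'
}

# Count dead-letter entries (one key per output line).
dlq_count() {
    dlq_keys | wc -l | tr -d ' '
}
```

Calling `dlq_count` from a cron job or monitoring hook gives a single number that is easy to compare against an alert threshold.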

### Handling Stuck Tasks

**Symptom:** Tasks stuck in "running" status

**Diagnosis:**
```bash
# Check for expired leases
redis-cli GET task:{task-id}
# Look for a LeaseExpiry value that is in the past
```

**Remediation:**
Tasks with expired leases are automatically reclaimed every minute. To force immediate reclamation:
```bash
# Restart worker to trigger reclaim cycle
systemctl restart ml-worker
```

### Dead Letter Queue Management

**View failed tasks:**
```redis
KEYS task:dlq:*
```

**Inspect failed task:**
```redis
GET task:dlq:{task-id}
```

**Retry from DLQ:**
```bash
# Manual retry (requires custom script)
# 1. Get task from DLQ
# 2. Reset retry count
# 3. Re-queue task
```
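
The three manual steps above could be scripted roughly as follows. This is a sketch under several assumptions that the runbook does not confirm: the task payload is JSON with a numeric `retry_count` field, the live payload is stored at `task:{task-id}`, the `task:queue` member is the task ID, and the score is the enqueue time. Check these against the actual key layout before running anything like it. Both function names are hypothetical.

```bash
# Step 2: reset the retry counter in a JSON payload read from stdin.
# Assumes the payload contains a numeric "retry_count" field.
reset_retry_count() {
    sed 's/"retry_count"[[:space:]]*:[[:space:]]*[0-9][0-9]*/"retry_count":0/'
}

# Steps 1 and 3: fetch the task from the DLQ, then re-queue it.
# ASSUMED layout: payload at task:{id}, queue member = task ID, score = enqueue time.
requeue_from_dlq() {
    task_id=$1
    payload=$(redis-cli GET "task:dlq:$task_id") || return 1
    if [ -z "$payload" ]; then
        echo "no DLQ entry for $task_id" >&2
        return 1
    fi
    # -x makes redis-cli read the last argument (the value) from stdin.
    printf '%s' "$payload" | reset_retry_count \
        | redis-cli -x SET "task:$task_id" >/dev/null || return 1
    redis-cli ZADD task:queue "$(date +%s)" "$task_id" >/dev/null || return 1
    redis-cli DEL "task:dlq:$task_id" >/dev/null
}
```

A call such as `requeue_from_dlq {task-id}` would then retry a single task. Fix the underlying failure first, otherwise the task is likely to land back in the DLQ.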

### Worker Crashes

**Symptom:** Worker disappeared mid-task

**What Happens:**
1. Lease expires after 30 minutes (default)
2. Background reclaim job detects expired lease
3. Task is retried (up to 3 attempts)
4. After max retries, the task moves to the Dead Letter Queue
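
To get a feel for how long a repeatedly crashing task can sit idle before it reaches the Dead Letter Queue, the numbers above can be combined into a rough upper bound. This sketch assumes that "3 attempts" means three attempts in total and that each crash costs at most one lease duration plus one reclaim interval (1 minute, as noted under Handling Stuck Tasks). The function name and its parameters are hypothetical.

```bash
# Upper bound, in minutes, on the waiting time accumulated by a task whose
# worker crashes on every attempt.
# Arguments (all optional): lease minutes, reclaim interval minutes, attempts.
# The defaults are the values quoted in this runbook.
max_wait_minutes_to_dlq() {
    lease_min=${1:-30}
    reclaim_min=${2:-1}
    attempts=${3:-3}
    echo $(( attempts * (lease_min + reclaim_min) ))
}

max_wait_minutes_to_dlq          # 3 * (30 + 1) = 93
max_wait_minutes_to_dlq 60 1 3   # 3 * (60 + 1) = 183
```

With the defaults, a task can spend up to about an hour and a half waiting on expired leases, which is worth keeping in mind when choosing the lease duration.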

**Prevention:**
- Monitor worker heartbeats
- Set up alerts for worker down
- Use process manager (systemd, supervisor)

## Worker Operations

### Graceful Shutdown

```bash
# Send SIGTERM for graceful shutdown
kill -TERM $(pgrep ml-worker)

# Worker will:
# 1. Stop accepting new tasks
# 2. Finish active tasks (up to 5min timeout)
# 3. Release all leases
# 4. Exit cleanly
```
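
For maintenance scripts it can help to wait until the worker has actually exited before moving on. The following is a minimal sketch: it sends SIGTERM, polls once per second, and gives up after a timeout. The default of 330 seconds is an assumption (the 5-minute task timeout plus a small margin), and `stop_worker` is a hypothetical helper name.

```bash
# Send SIGTERM to a process and wait for it to exit.
# Usage: stop_worker PID [TIMEOUT_SECONDS]
# Returns 0 once the process is gone, 1 if it is still running after the timeout.
stop_worker() {
    pid=$1
    timeout=${2:-330}   # assumed: 5 min graceful timeout + 30 s margin
    kill -TERM "$pid" 2>/dev/null || return 0   # nothing to signal
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "process $pid still running after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A caller might run `stop_worker "$(pgrep ml-worker)"` and fall back to the force shutdown only when it returns non-zero.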

### Force Shutdown

```bash
# Force kill (leases will be reclaimed automatically)
kill -9 $(pgrep ml-worker)
```

### Worker Heartbeat Monitoring

```redis
# Check worker heartbeats
HGETALL worker:heartbeat

# Example output:
# worker-abc123 1701234567
# worker-def456 1701234580
```

**Alert if:** A heartbeat timestamp is more than 5 minutes old
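
The staleness rule can be checked for all workers at once. This sketch assumes heartbeat values are Unix timestamps in seconds (as the example output suggests) and relies on `redis-cli` printing hash fields and values on alternating lines when its output is piped. The function name is hypothetical.

```bash
# Print the IDs of workers whose last heartbeat is older than MAX_AGE seconds.
# Usage: stale_workers NOW_EPOCH [MAX_AGE_SECONDS]
# Reads alternating "worker ID" / "timestamp" lines from stdin.
stale_workers() {
    now=$1
    max_age=${2:-300}   # 5 minutes, per the alert rule above
    while IFS= read -r worker && IFS= read -r ts; do
        if [ $((now - ts)) -gt "$max_age" ]; then
            echo "$worker"
        fi
    done
}
```

Typical use would be `redis-cli HGETALL worker:heartbeat | stale_workers "$(date +%s)"`, where any output means at least one worker needs attention.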

## Redis Operations

### Backup

```bash
# Manual backup (SAVE blocks Redis until the snapshot is written)
redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb
```
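
A backup that silently copies a missing or truncated snapshot is easy to overlook. The sketch below wraps the copy step with two basic checks: the source must exist and be non-empty, and the copy must be byte-identical. `backup_rdb` is a hypothetical helper; the paths are passed in rather than hard-coded.

```bash
# Copy an RDB snapshot to a dated file and verify the copy.
# Usage: backup_rdb SOURCE_RDB DEST_DIR
# Prints the path of the new backup on success.
backup_rdb() {
    src=$1
    dest_dir=$2
    dest="$dest_dir/redis-$(date +%Y%m%d).rdb"
    if [ ! -s "$src" ]; then
        echo "snapshot missing or empty: $src" >&2
        return 1
    fi
    cp "$src" "$dest" || return 1
    if ! cmp -s "$src" "$dest"; then
        echo "copy differs from source: $dest" >&2
        return 1
    fi
    echo "$dest"
}
```

After `redis-cli SAVE`, this would be called as `backup_rdb /var/lib/redis/dump.rdb /backup`.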

### Restore

```bash
# Stop Redis
systemctl stop redis

# Restore snapshot
cp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb

# Start Redis
systemctl start redis
```

### Memory Management

```redis
# Check memory usage
INFO memory

# Last resort only. DANGER: FLUSHDB deletes every key in the current database,
# including queued tasks, leases, and the DLQ. It does not selectively evict old data.
FLUSHDB
```

## Common Issues

### Issue: Queue Growing Unbounded

**Symptoms:**
- `ZCARD task:queue` keeps increasing
- No workers processing tasks

**Diagnosis:**
```bash
# Check worker status
systemctl status ml-worker

# Check logs
journalctl -u ml-worker -n 100
```

**Resolution:**
1. Verify workers are running
2. Check Redis connectivity
3. Verify lease configuration

### Issue: High Retry Rate

**Symptoms:**
- Many tasks in DLQ
- `retry_count` field high on tasks

**Diagnosis:**
```bash
# Check worker logs for errors
journalctl -u ml-worker | grep "retry"

# Look for patterns (network issues, resource limits, etc.)
```

**Resolution:**
- Fix the underlying issue (network, resources, etc.)
- Adjust retry limits if the failures are permanent
- Increase the task timeout if jobs are slow

### Issue: Leases Expiring Prematurely

**Symptoms:**
- Tasks retried even though worker is healthy
- Logs show "lease expired" frequently

**Diagnosis:**
```bash
# Check the lease settings in the worker config
grep -A3 "lease" configs/worker-config.yaml

# Values to question:
#   task_lease_duration: 30m  # Too short?
#   heartbeat_interval: 1m    # Too infrequent?
```

**Resolution:**
```yaml
# Increase lease duration for long-running jobs
task_lease_duration: 60m
heartbeat_interval: 30s  # More frequent heartbeats
```
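
One way to compare settings is to count how many heartbeat intervals fit into a lease. Assuming each heartbeat renews the lease (the runbook implies this but does not state it), that count is roughly how many consecutive heartbeats can be missed before the lease lapses. Both helper functions are hypothetical and only understand the `s`, `m`, and `h` suffixes.

```bash
# Convert a duration such as 30s, 1m, or 2h to seconds.
to_seconds() {
    case $1 in
        *s) echo "${1%s}" ;;
        *m) echo $(( ${1%m} * 60 )) ;;
        *h) echo $(( ${1%h} * 3600 )) ;;
        *)  echo "unsupported duration: $1" >&2; return 1 ;;
    esac
}

# Number of heartbeat intervals that fit into one lease.
# Usage: heartbeats_per_lease LEASE_DURATION HEARTBEAT_INTERVAL
heartbeats_per_lease() {
    lease=$(to_seconds "$1") || return 1
    beat=$(to_seconds "$2") || return 1
    echo $(( lease / beat ))
}

heartbeats_per_lease 30m 1m    # 30
heartbeats_per_lease 60m 30s   # 120
```

The suggested resolution therefore quadruples the margin, from 30 intervals per lease to 120.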

## Performance Tuning

### Worker Concurrency

```yaml
# worker-config.yaml
max_workers: 4  # Number of parallel tasks

# Adjust based on:
# - CPU cores available
# - Memory per task
# - GPU availability
```

### Redis Configuration

```conf
# /etc/redis/redis.conf

# Persistence
save 900 1
save 300 10

# Memory
maxmemory 2gb
maxmemory-policy noeviction

# Performance
tcp-keepalive 300
timeout 0
```

## Alerting Rules

### Critical Alerts

1. **Worker Down** (no heartbeat > 5min)
2. **Queue Depth** > 1000 tasks
3. **DLQ Growth** > 100 tasks/hour
4. **Redis Down** (connection failed)

### Warning Alerts

1. **High Retry Rate** > 10% of tasks
2. **Slow Queue Drain** (depth increasing over 1 hour)
3. **Worker Memory** > 80% usage

## Health Checks

```bash
#!/bin/bash
# health-check.sh
# Exits non-zero if any check fails, so it can be run from cron or a monitor.

STATUS=0

# Check Redis
if ! redis-cli PING >/dev/null 2>&1; then
    echo "Redis DOWN"
    exit 1  # The remaining checks need Redis
fi

# Check worker heartbeat
# Note: WORKER_ID must match the field this worker writes to worker:heartbeat.
WORKER_ID=$(cat /var/run/ml-worker.pid)
LAST_HB=$(redis-cli HGET worker:heartbeat "$WORKER_ID")
NOW=$(date +%s)
if [ -z "$LAST_HB" ]; then
    echo "No heartbeat recorded for $WORKER_ID"
    STATUS=1
elif [ $((NOW - LAST_HB)) -gt 300 ]; then
    echo "Worker heartbeat stale"
    STATUS=1
fi

# Check queue depth
DEPTH=$(redis-cli ZCARD task:queue)
if [ "$DEPTH" -gt 1000 ]; then
    echo "Queue depth critical: $DEPTH"
    STATUS=1
fi

exit "$STATUS"
```

## Runbook Checklist

### Daily Operations
- [ ] Check queue depth
- [ ] Verify worker heartbeats
- [ ] Review DLQ for patterns
- [ ] Check Redis memory usage

### Weekly Operations
- [ ] Review retry rates
- [ ] Analyze failed task patterns
- [ ] Backup Redis snapshot
- [ ] Review worker logs

### Monthly Operations
- [ ] Performance tuning review
- [ ] Capacity planning
- [ ] Update documentation
- [ ] Test disaster recovery

---

**For homelab setups:**
Most of these operations can be simplified. Focus on:
- Basic monitoring (queue depth, worker status)
- Periodic Redis backups
- Graceful shutdowns for maintenance