---
layout: page
title: Operations Runbook
permalink: /operations/
nav_order: 6
---
# Operations Runbook

Operational guide for troubleshooting and maintaining the ML experiment system.

## Task Queue Operations

### Monitoring Queue Health

```bash
# Check queue depth
redis-cli ZCARD task:queue

# List pending tasks with their scores
redis-cli ZRANGE task:queue 0 -1 WITHSCORES

# Check the dead letter queue
# (KEYS blocks Redis while scanning; prefer `redis-cli --scan` on large keyspaces)
redis-cli KEYS 'task:dlq:*'
```
### Handling Stuck Tasks

**Symptom:** Tasks stuck in "running" status.

**Diagnosis:**

```bash
# Check for an expired lease
redis-cli GET task:{task-id}
# Look for a LeaseExpiry timestamp in the past
```

**Remediation:** Tasks with expired leases are reclaimed automatically every minute. To force immediate reclamation:

```bash
# Restart the worker to trigger its reclaim cycle
systemctl restart ml-worker
```
### Dead Letter Queue Management

**View failed tasks:**

```bash
redis-cli KEYS 'task:dlq:*'
```

**Inspect a failed task:**

```bash
redis-cli GET task:dlq:{task-id}
```

**Retry from the DLQ** (requires a custom script):

```bash
# 1. Get the task from the DLQ
# 2. Reset its retry count
# 3. Re-queue the task
```
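The three manual steps above can be sketched as a small helper. This is a hypothetical script, not one shipped with the project: the key names follow this runbook's schema, and the `retry_count` reset assumes a JSON payload, so adapt both to your deployment.

```shell
#!/usr/bin/env bash
# retry-dlq-task.sh -- hypothetical DLQ retry helper (adapt to your deployment)
REDIS_CMD="${REDIS_CMD:-redis-cli}"   # overridable, e.g. for a remote host

retry_from_dlq() {
  local task_id="$1" payload

  # 1. Get the failed task payload from the DLQ
  payload=$("$REDIS_CMD" GET "task:dlq:${task_id}")
  [ -n "$payload" ] || { echo "no DLQ entry for ${task_id}" >&2; return 1; }

  # 2. Reset the retry count (assumes a JSON payload with a numeric
  #    retry_count field; use a real JSON tool such as jq in practice)
  payload=$(printf '%s' "$payload" | sed 's/"retry_count":[0-9]*/"retry_count":0/')

  # 3. Re-queue: restore the task record, score it by the current time,
  #    and drop the DLQ entry
  "$REDIS_CMD" SET "task:${task_id}" "$payload"
  "$REDIS_CMD" ZADD task:queue "$(date +%s)" "$task_id"
  "$REDIS_CMD" DEL "task:dlq:${task_id}"
}
```

Source the file and call `retry_from_dlq <task-id>`, or append `retry_from_dlq "$1"` to use it as a standalone script.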
### Worker Crashes

**Symptom:** A worker disappeared mid-task.

**What happens:**

- The lease expires after 30 minutes (default)
- The background reclaim job detects the expired lease
- The task is retried (up to 3 attempts)
- After max retries → Dead Letter Queue

**Prevention:**

- Monitor worker heartbeats
- Set up alerts for worker-down events
- Run workers under a process manager (systemd, supervisor)
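As an illustration of the process-manager advice, a minimal systemd unit might look like the following. The unit path, binary path, and config flag are placeholders, not project-defined values; the stop timeout is sized to cover the 5-minute graceful-drain window described under Worker Operations.

```ini
# /etc/systemd/system/ml-worker.service -- hypothetical unit; paths are placeholders
[Unit]
Description=ML experiment worker
After=network.target redis.service

[Service]
ExecStart=/usr/local/bin/ml-worker --config /etc/ml-worker/worker-config.yaml
Restart=on-failure
RestartSec=5
KillSignal=SIGTERM
# Allow the 5-minute graceful task drain plus margin before SIGKILL
TimeoutStopSec=360

[Install]
WantedBy=multi-user.target
```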
## Worker Operations

### Graceful Shutdown

```bash
# Send SIGTERM for a graceful shutdown
kill -TERM $(pgrep ml-worker)

# The worker will:
# 1. Stop accepting new tasks
# 2. Finish active tasks (up to a 5-minute timeout)
# 3. Release all leases
# 4. Exit cleanly
```

### Force Shutdown

```bash
# Force kill (leases will be reclaimed automatically)
kill -9 $(pgrep ml-worker)
```
### Worker Heartbeat Monitoring

```bash
# Check worker heartbeats
redis-cli HGETALL worker:heartbeat

# Example output:
# worker-abc123 1701234567
# worker-def456 1701234580
```

**Alert if:** a heartbeat timestamp is more than 5 minutes old.
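That alert rule can be expressed as a small shell predicate. This is a sketch, not project code; the 300-second default mirrors the 5-minute rule above.

```shell
# Succeeds (exit 0) when a heartbeat is stale.
# Args: last-heartbeat epoch seconds, optional "now", optional threshold seconds.
heartbeat_stale() {
  local last_hb="$1" now="${2:-$(date +%s)}" threshold="${3:-300}"
  [ $(( now - last_hb )) -gt "$threshold" ]
}

# Example:
# heartbeat_stale "$(redis-cli HGET worker:heartbeat worker-abc123)" \
#   && echo "worker-abc123 heartbeat stale"
```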
## Redis Operations

### Backup

```bash
# Manual backup (SAVE blocks the server; prefer BGSAVE on a busy instance)
redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb
```
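Manual backups are easy to forget; a nightly cron entry is one way to automate the same two commands. The schedule and paths are illustrative, and note that `%` must be escaped in crontab command fields.

```cron
# m h dom mon dow  command
0 2 * * * redis-cli SAVE && cp /var/lib/redis/dump.rdb /backup/redis-$(date +\%Y\%m\%d).rdb
```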
### Restore

```bash
# Stop Redis
systemctl stop redis

# Restore the snapshot
cp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb

# Start Redis
systemctl start redis
```
### Memory Management

```bash
# Check memory usage
redis-cli INFO memory

# Evict old data if needed
redis-cli FLUSHDB  # DANGER: clears every key in the current database!
```
## Common Issues

### Issue: Queue Growing Unbounded

**Symptoms:**

- `ZCARD task:queue` keeps increasing
- No workers are processing tasks

**Diagnosis:**

```bash
# Check worker status
systemctl status ml-worker

# Check logs
journalctl -u ml-worker -n 100
```

**Resolution:**

- Verify workers are running
- Check Redis connectivity
- Verify the lease configuration
### Issue: High Retry Rate

**Symptoms:**

- Many tasks in the DLQ
- High `retry_count` field on tasks

**Diagnosis:**

```bash
# Check worker logs for errors
journalctl -u ml-worker | grep "retry"
# Look for patterns (network issues, resource limits, etc.)
```

**Resolution:**

- Fix the underlying issue (network, resources, etc.)
- Adjust retry limits if failures are permanent
- Increase the task timeout if jobs are legitimately slow
### Issue: Leases Expiring Prematurely

**Symptoms:**

- Tasks retried even though the worker is healthy
- Logs show "lease expired" frequently

**Diagnosis:**

```bash
# Check the worker config
grep -A3 "lease" configs/worker-config.yaml
```

```yaml
task_lease_duration: 30m  # Too short?
heartbeat_interval: 1m    # Too infrequent?
```

**Resolution:**

```yaml
# Increase the lease duration for long-running jobs
task_lease_duration: 60m
heartbeat_interval: 30s  # More frequent heartbeats
```
## Performance Tuning

### Worker Concurrency

```yaml
# worker-config.yaml
max_workers: 4  # Number of parallel tasks

# Adjust based on:
# - CPU cores available
# - Memory per task
# - GPU availability
```
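One way to turn the CPU and memory guidance above into a starting number is the heuristic below. This is an illustration, not a project-defined formula; GPU-bound tasks need their own cap.

```shell
# Suggest max_workers = min(CPU cores, total memory / memory per task).
suggest_max_workers() {
  local cores="$1" total_mem_mb="$2" mem_per_task_mb="$3"
  local by_mem=$(( total_mem_mb / mem_per_task_mb ))
  if [ "$cores" -le "$by_mem" ]; then echo "$cores"; else echo "$by_mem"; fi
}

# Example: 8 cores, 16 GB RAM, ~4 GB per task
# suggest_max_workers 8 16384 4096   # -> 4
```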
### Redis Configuration

```conf
# /etc/redis/redis.conf

# Persistence
save 900 1
save 300 10

# Memory
maxmemory 2gb
maxmemory-policy noeviction

# Performance
tcp-keepalive 300
timeout 0
```
## Alerting Rules

### Critical Alerts

- Worker down (no heartbeat for > 5 minutes)
- Queue depth > 1000 tasks
- DLQ growth > 100 tasks/hour
- Redis down (connection failed)

### Warning Alerts

- High retry rate (> 10% of tasks)
- Slow queue drain (depth increasing over 1 hour)
- Worker memory > 80% usage
## Health Checks

```bash
#!/bin/bash
# health-check.sh

# Check Redis
redis-cli PING > /dev/null || echo "Redis DOWN"

# Check the worker heartbeat
WORKER_ID=$(cat /var/run/ml-worker.pid)
LAST_HB=$(redis-cli HGET worker:heartbeat "$WORKER_ID")
NOW=$(date +%s)
if [ $((NOW - LAST_HB)) -gt 300 ]; then
  echo "Worker heartbeat stale"
fi

# Check queue depth
DEPTH=$(redis-cli ZCARD task:queue)
if [ "$DEPTH" -gt 1000 ]; then
  echo "Queue depth critical: $DEPTH"
fi
```
## Runbook Checklist

### Daily Operations

- Check queue depth
- Verify worker heartbeats
- Review the DLQ for patterns
- Check Redis memory usage

### Weekly Operations

- Review retry rates
- Analyze failed-task patterns
- Back up a Redis snapshot
- Review worker logs

### Monthly Operations

- Performance tuning review
- Capacity planning
- Update documentation
- Test disaster recovery

**For homelab setups:** most of these operations can be simplified. Focus on:

- Basic monitoring (queue depth, worker status)
- Periodic Redis backups
- Graceful shutdowns before maintenance