
---
title: Operations Runbook
url: /operations/
weight: 6
---

# Operations Runbook

Operational guide for troubleshooting and maintaining the ML experiment system.

## Task Queue Operations

### Monitoring Queue Health

```shell
# Check queue depth
redis-cli ZCARD task:queue

# List pending tasks with their scores
redis-cli ZRANGE task:queue 0 -1 WITHSCORES

# Check the dead letter queue (prefer SCAN in production; KEYS blocks the server)
redis-cli KEYS task:dlq:*
```

### Handling Stuck Tasks

**Symptom:** Tasks stuck in "running" status.

**Diagnosis:**

```shell
# Check for expired leases
redis-cli GET task:{task-id}
# Look for a LeaseExpiry timestamp in the past
```

**Remediation:** Tasks with expired leases are automatically reclaimed every minute. To force immediate reclamation:

```shell
# Restart a worker to trigger the reclaim cycle
systemctl restart ml-worker
```

### Dead Letter Queue Management

View failed tasks:

```shell
redis-cli KEYS task:dlq:*
```

Inspect a failed task:

```shell
redis-cli GET task:dlq:{task-id}
```

Retry from the DLQ (requires a custom script):

1. Get the task payload from the DLQ.
2. Reset its retry count.
3. Re-queue the task.
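
The payload transformation for step 2 can be sketched in Python (the JSON format and the `retry_count`/`status` field names are assumptions based on the retry-rate section below; adjust to the real task schema):

```python
import json

def prepare_requeue(dlq_payload: str) -> str:
    """Reset the retry counter on a DLQ payload before re-queuing it.

    Assumes the task is stored as JSON with `retry_count` and `status`
    fields; the actual task format may differ.
    """
    task = json.loads(dlq_payload)
    task["retry_count"] = 0
    task["status"] = "pending"
    return json.dumps(task)

# Example: a payload fetched with `GET task:dlq:{task-id}`
failed = json.dumps({"id": "task-42", "retry_count": 3, "status": "failed"})
print(prepare_requeue(failed))
```

The transformed payload would then be added back to `task:queue` (`ZADD`) and the DLQ key deleted.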

### Worker Crashes

**Symptom:** Worker disappeared mid-task.

**What happens:**

1. The lease expires after 30 minutes (default).
2. The background reclaim job detects the expired lease.
3. The task is retried (up to 3 attempts).
4. After max retries, the task moves to the dead letter queue.
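
Condensed into code, the decision the reclaim job makes for each task looks roughly like this (an illustrative Python sketch; the function and constant names are not from the actual implementation):

```python
import time

LEASE_DURATION_S = 30 * 60  # default lease: 30 minutes
MAX_RETRIES = 3

def reclaim_action(lease_expiry: float, retry_count: int, now: float) -> str:
    """What the background reclaim job does with one task."""
    if now < lease_expiry:
        return "leave"        # lease still valid; worker presumed alive
    if retry_count < MAX_RETRIES:
        return "retry"        # re-queue with retry_count + 1
    return "dead-letter"      # move to task:dlq:{task-id}

now = time.time()
print(reclaim_action(now + LEASE_DURATION_S, 0, now))  # → leave
print(reclaim_action(now - 1, 2, now))                 # → retry
print(reclaim_action(now - 1, 3, now))                 # → dead-letter
```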

**Prevention:**

- Monitor worker heartbeats
- Set up alerts for worker-down conditions
- Run workers under a process manager (systemd, supervisor)

## Worker Operations

### Graceful Shutdown

```shell
# Send SIGTERM for a graceful shutdown
kill -TERM $(pgrep ml-worker)
```

On SIGTERM the worker will:

1. Stop accepting new tasks.
2. Finish active tasks (up to a 5-minute timeout).
3. Release all of its leases.
4. Exit cleanly.
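
The drain sequence can be sketched as a signal handler (illustrative Python; the real ml-worker implementation is not shown in these docs, and `claim_next_task`/`release_all_leases` are hypothetical names):

```python
import os
import signal

shutting_down = False

def handle_sigterm(signum, frame):
    """Step 1: flag the loop so no new tasks are claimed."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

# Worker loop sketch:
# while not shutting_down:
#     task = claim_next_task()   # hypothetical
#     run(task)                  # step 2: finish active work (with timeout)
# release_all_leases()           # step 3, then the process exits (step 4)

os.kill(os.getpid(), signal.SIGTERM)  # simulate systemd stopping the unit
print(shutting_down)  # → True
```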

### Force Shutdown

```shell
# Force kill (leases will be reclaimed automatically)
kill -9 $(pgrep ml-worker)
```

### Worker Heartbeat Monitoring

```shell
# Check worker heartbeats (fields are worker IDs, values are Unix timestamps)
redis-cli HGETALL worker:heartbeat

# Example output:
# worker-abc123 1701234567
# worker-def456 1701234580
```

**Alert if:** a heartbeat timestamp is more than 5 minutes old.
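
That rule can be written as a small check (a Python sketch over the hash contents shown above; in practice you would fetch `worker:heartbeat` with a Redis client):

```python
STALE_AFTER_S = 5 * 60  # alert threshold: 5 minutes

def stale_workers(heartbeats: dict[str, int], now: float) -> list[str]:
    """Return worker IDs whose last heartbeat is older than the threshold."""
    return [wid for wid, ts in heartbeats.items() if now - ts > STALE_AFTER_S]

beats = {"worker-abc123": 1701234567, "worker-def456": 1701234580}
print(stale_workers(beats, now=1701234567 + 310))  # → ['worker-abc123']
```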

## Redis Operations

### Backup

```shell
# Manual backup (SAVE blocks the server; prefer BGSAVE on a live instance)
redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb
```

### Restore

```shell
# Stop Redis
systemctl stop redis

# Restore the snapshot
cp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb

# Start Redis
systemctl start redis
```

### Memory Management

```shell
# Check memory usage
redis-cli INFO memory

# Evict old data if needed
redis-cli FLUSHDB  # DANGER: clears all data in the current database!
```

## Common Issues

### Issue: Queue Growing Unbounded

**Symptoms:**

- `ZCARD task:queue` keeps increasing
- No workers processing tasks

**Diagnosis:**

```shell
# Check worker status
systemctl status ml-worker

# Check logs
journalctl -u ml-worker -n 100
```

**Resolution:**

1. Verify workers are running.
2. Check Redis connectivity.
3. Verify the lease configuration.

### Issue: High Retry Rate

**Symptoms:**

- Many tasks in the DLQ
- High `retry_count` values on tasks

**Diagnosis:**

```shell
# Check worker logs for errors
journalctl -u ml-worker | grep "retry"

# Look for patterns (network issues, resource limits, etc.)
```

**Resolution:**

- Fix the underlying issue (network, resources, etc.)
- Adjust retry limits if failures are permanent
- Increase the task timeout if jobs are legitimately slow

### Issue: Leases Expiring Prematurely

**Symptoms:**

- Tasks retried even though the worker is healthy
- Logs show "lease expired" frequently

**Diagnosis:**

```shell
# Check worker config
grep -A3 "lease" configs/worker-config.yaml
```

```yaml
task_lease_duration: 30m  # Too short?
heartbeat_interval: 1m    # Too infrequent?
```

**Resolution:**

```yaml
# Increase lease duration for long-running jobs
task_lease_duration: 60m
heartbeat_interval: 30s  # More frequent heartbeats
```
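
A quick sanity check on the two settings (a heuristic sketch, not a documented rule: require that several heartbeats fit inside one lease window, so a single missed beat cannot look like a dead worker):

```python
def lease_config_ok(lease_duration_s: int, heartbeat_interval_s: int,
                    min_beats_per_lease: int = 3) -> bool:
    """True if at least `min_beats_per_lease` heartbeats fit in one lease."""
    return lease_duration_s >= min_beats_per_lease * heartbeat_interval_s

print(lease_config_ok(30 * 60, 60))  # 30m lease, 1m heartbeat → True
print(lease_config_ok(60, 30))       # 60s lease, 30s heartbeat → False
```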

## Performance Tuning

### Worker Concurrency

```yaml
# worker-config.yaml
max_workers: 4  # Number of parallel tasks

# Adjust based on:
# - CPU cores available
# - Memory per task
# - GPU availability
```
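
One way to turn those three factors into a number (purely illustrative; tune against measured task profiles):

```python
def suggested_max_workers(cpu_cores: int, mem_gb: float,
                          mem_per_task_gb: float, gpus: int = 0) -> int:
    """Heuristic: take the tightest of the CPU, memory, and GPU limits."""
    limits = [cpu_cores, int(mem_gb // mem_per_task_gb)]
    if gpus:
        limits.append(gpus)  # GPU-bound tasks: at most one task per GPU
    return max(1, min(limits))

# 8 cores, 32 GB RAM, ~8 GB per task → memory-bound at 4 workers
print(suggested_max_workers(cpu_cores=8, mem_gb=32, mem_per_task_gb=8))  # → 4
```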

### Redis Configuration

```conf
# /etc/redis/redis.conf

# Persistence: snapshot after 900s if >= 1 key changed, or 300s if >= 10 changed
save 900 1
save 300 10

# Memory: cap usage and fail writes rather than silently evicting task data
maxmemory 2gb
maxmemory-policy noeviction

# Performance
tcp-keepalive 300
timeout 0
```

## Alerting Rules

### Critical Alerts

1. **Worker down:** no heartbeat for more than 5 minutes
2. **Queue depth:** more than 1000 tasks
3. **DLQ growth:** more than 100 tasks/hour
4. **Redis down:** connection failed

### Warning Alerts

1. **High retry rate:** more than 10% of tasks retried
2. **Slow queue drain:** depth increasing over 1 hour
3. **Worker memory:** above 80% usage

## Health Checks

```shell
#!/bin/bash
# health-check.sh

# Check Redis connectivity
redis-cli PING > /dev/null || echo "Redis DOWN"

# Check worker heartbeat
WORKER_ID=$(cat /var/run/ml-worker.pid)
LAST_HB=$(redis-cli HGET worker:heartbeat "$WORKER_ID")
NOW=$(date +%s)
# Default to 0 so a missing heartbeat also triggers the alert
if [ $((NOW - ${LAST_HB:-0})) -gt 300 ]; then
  echo "Worker heartbeat stale"
fi

# Check queue depth
DEPTH=$(redis-cli ZCARD task:queue)
if [ "$DEPTH" -gt 1000 ]; then
  echo "Queue depth critical: $DEPTH"
fi
```

## Runbook Checklist

### Daily Operations

1. Check queue depth
2. Verify worker heartbeats
3. Review the DLQ for failure patterns
4. Check Redis memory usage

### Weekly Operations

1. Review retry rates
2. Analyze failed-task patterns
3. Back up a Redis snapshot
4. Review worker logs

### Monthly Operations

1. Performance tuning review
2. Capacity planning
3. Update documentation
4. Test disaster recovery

**For homelab setups:** most of these operations can be simplified. Focus on:

- Basic monitoring (queue depth, worker status)
- Periodic Redis backups
- Graceful shutdowns before maintenance