fetch_ml/deployments/ROLLBACK.md

# Rollback Procedure and Scope

## Overview

This document defines the rollback procedure for FetchML deployments. **Rollback is explicitly image-only** - it does NOT restore queue state, artifact storage, or the audit log chain.

## What Rollback Does

- Restores the previous container image
- Restarts the worker with the previous binary
- Preserves configuration files (unless explicitly corrupted)

## What Rollback Does NOT Do

- **Does NOT restore Redis queue state** - jobs in the queue remain as-is
- **Does NOT restore artifact storage** - artifacts created by the newer version remain
- **Does NOT modify or roll back the audit log chain** - doing so would break the chain
- **Does NOT revert database migrations** - schema changes persist

⚠️ **Critical**: The audit log chain must NEVER be rolled back. Breaking the chain would compromise the entire audit trail.

## When to Roll Back

Rollback is appropriate when:
- A deployment causes service crashes or health check failures
- Critical functionality is broken in the new version
- Security vulnerabilities are discovered in the new version

Rollback is NOT appropriate when:
- Data corruption has occurred (needs data recovery, not rollback)
- The audit log shows anomalies (investigate first, don't roll back blindly)
- Queue state is the issue (rollback won't fix this)

## Rollback Procedure

### Automated Rollback (Staging)

Staging deployments have automatic rollback on failure:

```bash
# This happens automatically in the CI pipeline
cd deployments
docker compose -f docker-compose.staging.yml down
docker compose -f docker-compose.staging.yml up -d
```

### Manual Rollback (Production)

For production, manual rollback is required:

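```bash
# Run every step from the deployments directory so the audit log and
# compose file paths stay the same throughout (assumes .prod-audit.log lives here).
cd deployments

# 1. Identify the current and previous images from the audit log
#    (last sha- token on the last and second-to-last lines; adjust if the format differs)
CURRENT_SHA=$(tail -1 .prod-audit.log | grep -o 'sha-[a-f0-9]*' | tail -1)
PREVIOUS_SHA=$(tail -2 .prod-audit.log | head -1 | grep -o 'sha-[a-f0-9]*' | tail -1)
PREVIOUS_SHA=${PREVIOUS_SHA:-previous}   # fall back to the "previous" tag if no SHA is found
REASON="health-check-failure"            # short reason recorded in the audit entry

# 2. Verify the previous image exists
docker pull "ghcr.io/jfraeysd/fetchml-worker:$PREVIOUS_SHA"

# 3. Stop current services
docker compose -f docker-compose.prod.yml down

# 4. Update compose to use previous image
# Edit docker-compose.prod.yml to reference $PREVIOUS_SHA

# 5. Start with previous image
docker compose -f docker-compose.prod.yml up -d

# 6. Verify health
curl -fsS http://localhost:9101/health

# 7. Write rollback entry to audit log
#    (${{ gitea.sha }} only expands inside CI workflows, so use shell variables here)
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) | rollback | success | from=$CURRENT_SHA | to=$PREVIOUS_SHA | actor=$(whoami) | reason=$REASON" >> .prod-audit.log
```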
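
A single `curl` right after `up -d` can fail simply because the worker is still starting. The sketch below polls the health endpoint a few times before giving up. The function name `wait_for_health` and its default attempt count and delay are illustrative assumptions, not part of `deploy.sh`.

```bash
# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: polls URL until curl succeeds or the attempts run out.
wait_for_health() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors; --max-time bounds each attempt.
    if curl -fsS --max-time 5 "$url" > /dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    # Pause between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "unhealthy after $attempts attempt(s)" >&2
  return 1
}
```

For example, `wait_for_health http://localhost:9101/health 10 3` would allow roughly 30 seconds of startup time before reporting a failure.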

### Using deploy.sh

The deploy.sh script includes a rollback function:

```bash
# Rollback to previous deployment
cd deployments
./deploy.sh prod rollback

# This will:
# - Read previous SHA from .prod-deployment.log
# - Pull the previous image
# - Restart services
# - Write audit log entry
```
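
The "read previous SHA" step can be sketched as below. This is not the actual `deploy.sh` code. It assumes one log entry per line, with the deployed image appearing as the last `sha-…` token on the line, and the function name `previous_sha` is hypothetical.

```bash
# previous_sha LOG
# Hypothetical helper: prints the last sha- token on the second-to-last line of LOG.
previous_sha() {
  local log="$1" line sha
  [ -r "$log" ] || { echo "cannot read $log" >&2; return 2; }
  # With fewer than two entries there is no earlier deployment to return to.
  if [ "$(wc -l < "$log")" -lt 2 ]; then
    echo "no previous deployment recorded in $log" >&2
    return 1
  fi
  line=$(tail -n 2 "$log" | head -n 1)
  sha=$(printf '%s\n' "$line" | grep -o 'sha-[a-f0-9]*' | tail -n 1)
  [ -n "$sha" ] || { echo "no SHA found in $log" >&2; return 1; }
  printf '%s\n' "$sha"
}
```

Failing loudly when no SHA is found avoids pulling an unintended tag during an incident.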

## Post-Rollback Actions

After rollback, you MUST:

1. **Verify health endpoints** - Ensure all services are responding
2. **Check queue state** - There may be stuck or failed jobs
3. **Review audit log** - Ensure chain is intact
4. **Notify team** - Document what happened and why
5. **Analyze failure** - Root cause analysis for the failed deployment

## Rollback Audit Log

Every rollback MUST write an entry to the audit log:

```
2024-01-15T14:30:00Z | rollback | success | from=sha-abc123 | to=sha-def456 | actor=deploy-user | reason=health-check-failure
```

This entry is REQUIRED even in emergency situations.
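
To keep entries consistent under pressure, the entry format above can be wrapped in a small helper. This is a minimal sketch: the function name `append_rollback_entry` and its argument order are assumptions, and it only covers the plain-text format shown in the example.

```bash
# append_rollback_entry LOG FROM_SHA TO_SHA REASON [STATUS]
# Hypothetical helper: appends one rollback line in the format shown above.
append_rollback_entry() {
  local log="$1" from="$2" to="$3" reason="$4" status="${5:-success}"
  local ts
  if [ -z "$log" ] || [ -z "$from" ] || [ -z "$to" ] || [ -z "$reason" ]; then
    echo "usage: append_rollback_entry LOG FROM_SHA TO_SHA REASON [STATUS]" >&2
    return 2
  fi
  # UTC timestamp with a Z suffix, matching the example entry.
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  # >> appends to the log; existing entries are never rewritten.
  printf '%s | rollback | %s | from=%s | to=%s | actor=%s | reason=%s\n' \
    "$ts" "$status" "$from" "$to" "$(whoami)" "$reason" >> "$log"
}
```

A call such as `append_rollback_entry .prod-audit.log sha-abc123 sha-def456 health-check-failure` would write a line shaped like the example entry.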

## Rollback Scope Diagram

```
┌─────────────────────────────────────────────────────────┐
│  Deployment State                                       │
├─────────────────────────────────────────────────────────┤
│  ✓ Rolled back:                                         │
│    - Container image                                    │
│    - Worker binary                                      │
│    - API server binary                                  │
│                                                         │
│  ✗ NOT rolled back:                                     │
│    - Redis queue state                                  │
│    - Artifact storage (new artifacts remain)            │
│    - Audit log chain (must never be modified)           │
│    - Database schema (migrations persist)               │
│    - MinIO snapshots (new snapshots remain)             │
└─────────────────────────────────────────────────────────┘
```

## Compliance Notes (HIPAA)

For HIPAA deployments:

1. **Audit log chain integrity** is paramount
   - The rollback entry is appended, never replaces existing entries
   - Chain validation must still succeed post-rollback

2. **Verify compliance_mode after rollback**
   ```bash
   curl -fsS http://localhost:9101/health | grep compliance_mode
   ```

3. **Document the incident**
   - Why was the deployment rolled back?
   - What was the impact on PHI handling?
   - Were there any data exposure risks?
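
One way to check the "appended, never replaces" requirement is to copy the log before the rollback and confirm afterwards that the copy is still an exact prefix of the live log. The sketch below assumes the audit log is a plain append-only file such as `.prod-audit.log`. The function name `log_is_append_only` is hypothetical, and this check does not replace the project's own chain validation.

```bash
# log_is_append_only SNAPSHOT CURRENT
# Hypothetical helper: succeeds when CURRENT begins with the exact bytes of SNAPSHOT.
log_is_append_only() {
  local snapshot="$1" current="$2" old_size new_size
  if [ ! -r "$snapshot" ] || [ ! -r "$current" ]; then
    echo "cannot read $snapshot or $current" >&2
    return 2
  fi
  # $(( )) strips any padding that some wc implementations add.
  old_size=$(( $(wc -c < "$snapshot") ))
  new_size=$(( $(wc -c < "$current") ))
  # A shorter log means existing entries were removed.
  [ "$new_size" -ge "$old_size" ] || return 1
  # Compare the first old_size bytes of the live log with the snapshot.
  head -c "$old_size" "$current" | cmp -s - "$snapshot"
}
```

In practice this could be `cp .prod-audit.log /tmp/audit.before` ahead of the rollback, then `log_is_append_only /tmp/audit.before .prod-audit.log` once the rollback entry has been written.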

## Testing Rollback

Test rollback procedures in staging regularly:

```bash
# Simulate a failed deployment
cd deployments
./deploy.sh staging up

# Trigger rollback
./deploy.sh staging rollback

# Verify services
./deploy.sh staging status
```

## See Also

- `.forgejo/workflows/deploy-staging.yml` - Automated rollback in staging
- `.forgejo/workflows/deploy-prod.yml` - Manual rollback for production
- `deployments/deploy.sh` - Rollback script implementation
- `scripts/check-audit-sink.sh` - Audit sink verification