# Rollback Procedure and Scope ## Overview This document defines the rollback procedure for FetchML deployments. **Rollback is explicitly image-only** - it does NOT restore queue state, artifact storage, or the audit log chain. ## What Rollback Does - Restores the previous container image - Restarts the worker with the previous binary - Preserves configuration files (unless explicitly corrupted) ## What Rollback Does NOT Do - **Does NOT restore Redis queue state** - jobs in the queue remain as-is - **Does NOT restore artifact storage** - artifacts created by newer version remain - **Does NOT modify or roll back the audit log chain** - doing so would break the chain - **Does NOT restore database migrations** - schema changes persist ⚠️ **Critical**: The audit log chain must NEVER be rolled back. Breaking the chain would compromise the entire audit trail. ## When to Rollback Rollback is appropriate when: - A deployment causes service crashes or health check failures - Critical functionality is broken in the new version - Security vulnerabilities are discovered in the new version Rollback is NOT appropriate when: - Data corruption has occurred (needs data recovery, not rollback) - The audit log shows anomalies (investigate first, don't rollback blindly) - Queue state is the issue (rollback won't fix this) ## Rollback Procedure ### Automated Rollback (Staging) Staging deployments have automatic rollback on failure: ```bash # This happens automatically in the CI pipeline cd deployments docker compose -f docker-compose.staging.yml down docker compose -f docker-compose.staging.yml up -d ``` ### Manual Rollback (Production) For production, manual rollback is required: ```bash # 1. Identify the previous working image PREVIOUS_SHA=$(tail -2 .prod-audit.log | head -1 | grep -o 'sha-[a-f0-9]*' || echo "previous") # 2. Verify the previous image exists docker pull ghcr.io/jfraeysd/fetchml-worker:$PREVIOUS_SHA # 3. Stop current services cd deployments docker compose -f docker-compose.prod.yml down # 4. Update compose to use previous image # Edit docker-compose.prod.yml to reference $PREVIOUS_SHA # 5. Start with previous image docker compose -f docker-compose.prod.yml up -d # 6. Verify health curl -fsS http://localhost:9101/health # 7. Write rollback entry to audit log echo "$(date -Iseconds) | rollback | success | from=${{ gitea.sha }} | to=$PREVIOUS_SHA | actor=$(whoami)" >> .prod-audit.log ``` ### Using deploy.sh The deploy.sh script includes a rollback function: ```bash # Rollback to previous deployment cd deployments ./deploy.sh prod rollback # This will: # - Read previous SHA from .prod-deployment.log # - Pull the previous image # - Restart services # - Write audit log entry ``` ## Post-Rollback Actions After rollback, you MUST: 1. **Verify health endpoints** - Ensure all services are responding 2. **Check queue state** - There may be stuck or failed jobs 3. **Review audit log** - Ensure chain is intact 4. **Notify team** - Document what happened and why 5. **Analyze failure** - Root cause analysis for the failed deployment ## Rollback Audit Log Every rollback MUST write an entry to the audit log: ``` 2024-01-15T14:30:00Z | rollback | success | from=sha-abc123 | to=sha-def456 | actor=deploy-user | reason=health-check-failure ``` This entry is REQUIRED even in emergency situations. ## Rollback Scope Diagram ``` ┌─────────────────────────────────────────────────────────┐ │ Deployment State │ ├─────────────────────────────────────────────────────────┤ │ ✓ Rolled back: │ │ - Container image │ │ - Worker binary │ │ - API server binary │ │ │ │ ✗ NOT rolled back: │ │ - Redis queue state │ │ - Artifact storage (new artifacts remain) │ │ - Audit log chain (must never be modified) │ │ - Database schema (migrations persist) │ │ - MinIO snapshots (new snapshots remain) │ └─────────────────────────────────────────────────────────┘ ``` ## Compliance Notes (HIPAA) For HIPAA deployments: 1. **Audit log chain integrity** is paramount - The rollback entry is appended, never replaces existing entries - Chain validation must still succeed post-rollback 2. **Verify compliance_mode after rollback** ```bash curl http://localhost:9101/health | grep compliance_mode ``` 3. **Document the incident** - Why was the deployment rolled back? - What was the impact on PHI handling? - Were there any data exposure risks? ## Testing Rollback Test rollback procedures in staging regularly: ```bash # Simulate a failed deployment cd deployments ./deploy.sh staging up # Trigger rollback ./deploy.sh staging rollback # Verify services ./deploy.sh staging status ``` ## See Also - `.forgejno/workflows/deploy-staging.yml` - Automated rollback in staging - `.forgejo/workflows/deploy-prod.yml` - Manual rollback for production - `deployments/deploy.sh` - Rollback script implementation - `scripts/check-audit-sink.sh` - Audit sink verification