fetch_ml/deployments/ROLLBACK.md
Jeremie Fraeys 685f79c4a7
ci(deploy): add Forgejo workflows and deployment automation
Add CI/CD pipelines for Forgejo/GitHub Actions:
- build.yml - Main build pipeline with matrix builds
- deploy-staging.yml - Automated staging deployment
- deploy-prod.yml - Production deployment with rollback support
- security-modes-test.yml - Security mode validation tests

Add deployment artifacts:
- docker-compose.staging.yml for staging environment
- ROLLBACK.md with rollback procedures and playbooks

Supports multi-environment deployment workflow with proper
gates between staging and production.
2026-02-26 12:04:23 -05:00


# Rollback Procedure and Scope
## Overview
This document defines the rollback procedure for FetchML deployments. **Rollback is explicitly image-only** - it does NOT restore queue state, artifact storage, or the audit log chain.
## What Rollback Does
- Restores the previous container image
- Restarts the worker with the previous binary
- Preserves configuration files (unless they are corrupted)
## What Rollback Does NOT Do
- **Does NOT restore Redis queue state** - jobs in the queue remain as-is
- **Does NOT restore artifact storage** - artifacts created by newer version remain
- **Does NOT modify or roll back the audit log chain** - doing so would break the chain
- **Does NOT revert database migrations** - schema changes persist
⚠️ **Critical**: The audit log chain must NEVER be rolled back. Breaking the chain would compromise the entire audit trail.
## When to Roll Back
Rollback is appropriate when:
- A deployment causes service crashes or health check failures
- Critical functionality is broken in the new version
- Security vulnerabilities are discovered in the new version
Rollback is NOT appropriate when:
- Data corruption has occurred (needs data recovery, not rollback)
- The audit log shows anomalies (investigate first, do not roll back blindly)
- Queue state is the issue (rollback won't fix this)
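The health check trigger above can be evaluated with a bounded retry loop instead of a single probe, so that one slow start does not cause a rollback. This is a minimal sketch, assuming the health endpoint used elsewhere in this document (`http://localhost:9101/health`). The function name `wait_for_health`, the `HEALTH_CMD` override, and the attempt counts are illustrative and are not part of deploy.sh.

```bash
# Sketch only: poll a health probe a bounded number of times.
# HEALTH_CMD is a hypothetical override so the loop can be exercised
# without a running service; the default probes the documented endpoint.
HEALTH_CMD=${HEALTH_CMD:-"curl -fsS -o /dev/null http://localhost:9101/health"}

# wait_for_health ATTEMPTS DELAY_SECONDS
# Returns 0 as soon as one probe succeeds, 1 if every attempt fails.
wait_for_health() {
  local attempts="$1" delay="$2" i=1
  while [ "$i" -le "$attempts" ]; do
    # Intentionally unquoted so the command string splits into words.
    if $HEALTH_CMD; then
      return 0
    fi
    # Do not sleep after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for_health 10 3 || ./deploy.sh prod rollback` would probe for roughly 30 seconds before starting a rollback.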
## Rollback Procedure
### Automated Rollback (Staging)
Staging deployments have automatic rollback on failure:
```bash
# This happens automatically in the CI pipeline
cd deployments
docker compose -f docker-compose.staging.yml down
docker compose -f docker-compose.staging.yml up -d
```
### Manual Rollback (Production)
For production, manual rollback is required:
```bash
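# 1. Work from the deployments directory so the audit log is read from
#    and written to the same path
cd deployments
# 2. Identify the current and previous images from the last two audit log entries
#    (the last sha- token on a line is the image deployed by that entry)
CURRENT_SHA=$(tail -n 1 .prod-audit.log | grep -o 'sha-[a-f0-9]*' | tail -n 1)
PREVIOUS_SHA=$(tail -n 2 .prod-audit.log | head -n 1 | grep -o 'sha-[a-f0-9]*' | tail -n 1)
# Fall back to the "previous" tag if no SHA is recorded
PREVIOUS_SHA=${PREVIOUS_SHA:-previous}
# 3. Verify the previous image exists
docker pull "ghcr.io/jfraeysd/fetchml-worker:$PREVIOUS_SHA"
# 4. Stop current services
docker compose -f docker-compose.prod.yml down
# 5. Update compose to use previous image
# Edit docker-compose.prod.yml to reference $PREVIOUS_SHA
# 6. Start with previous image
docker compose -f docker-compose.prod.yml up -d
# 7. Verify health
curl -fsS http://localhost:9101/health
# 8. Write rollback entry to audit log (set reason to the actual trigger).
#    ${{ gitea.sha }} is a workflow expression and is not valid in a manual shell session.
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) | rollback | success | from=${CURRENT_SHA:-unknown} | to=$PREVIOUS_SHA | actor=$(whoami) | reason=health-check-failure" >> .prod-audit.log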
```
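The "update compose to use previous image" step is described as a manual edit. It could be scripted along the following lines. This is a sketch under one assumption that the document does not confirm: that `docker-compose.prod.yml` pins the worker with a literal `image: ghcr.io/jfraeysd/fetchml-worker:<tag>` line rather than through an environment variable. The function name `pin_worker_image` is hypothetical.

```bash
# Sketch only: rewrite the worker image tag in a compose file.
# Assumes a literal "image: ghcr.io/jfraeysd/fetchml-worker:<tag>" line.

# pin_worker_image COMPOSE_FILE TAG
pin_worker_image() {
  local file="$1" tag="$2" image="ghcr.io/jfraeysd/fetchml-worker" tmp status
  # Refuse to touch the file if it does not reference the worker image.
  grep -q "image: *${image}:" "$file" || return 1
  tmp=$(mktemp) || return 1
  # Replace whatever tag follows the image name; other lines pass through unchanged.
  if sed "s|\(image: *${image}\):[^[:space:]]*|\1:${tag}|" "$file" > "$tmp"; then
    # Write back through cat so the original file keeps its permissions.
    cat "$tmp" > "$file"
    status=$?
  else
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}
```

A call such as `pin_worker_image docker-compose.prod.yml "$PREVIOUS_SHA"` would then replace the manual edit. Reviewing the result with `git diff` before restarting services is still advisable.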
### Using deploy.sh
The deploy.sh script includes a rollback function:
```bash
# Rollback to previous deployment
cd deployments
./deploy.sh prod rollback
# This will:
# - Read previous SHA from .prod-deployment.log
# - Pull the previous image
# - Restart services
# - Write audit log entry
```
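The "read previous SHA" step can be sketched as a small function. This is not the implementation in deploy.sh. It assumes one log entry per line in the pipe-delimited format shown under Rollback Audit Log, and it assumes that the last `sha-` token on a line names the image running after that entry (for rollback entries, the `to=` value). The function name `deployed_sha` is hypothetical.

```bash
# Sketch only: look up the image tag recorded by a given log entry.

# deployed_sha LOGFILE N
# Prints the tag recorded by the Nth entry from the end (1 = latest).
# Fails if the log is unreadable, has fewer than N entries, or the entry has no tag.
deployed_sha() {
  local log="$1" n="$2" total sha
  [ -r "$log" ] || return 1
  total=$(( $(wc -l < "$log") ))
  # Without this check, tail would silently return a more recent entry.
  [ "$n" -ge 1 ] && [ "$n" -le "$total" ] || return 1
  sha=$(tail -n "$n" "$log" | head -n 1 | grep -o 'sha-[a-f0-9][a-f0-9]*' | tail -n 1)
  [ -n "$sha" ] || return 1
  printf '%s\n' "$sha"
}
```

With this helper, `deployed_sha .prod-deployment.log 2` would return the image to roll back to, and a non-zero exit status signals that no rollback target is recorded.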
## Post-Rollback Actions
After rollback, you MUST:
1. **Verify health endpoints** - Ensure all services are responding
2. **Check queue state** - There may be stuck or failed jobs
3. **Review audit log** - Ensure chain is intact
4. **Notify team** - Document what happened and why
5. **Analyze failure** - Root cause analysis for the failed deployment
## Rollback Audit Log
Every rollback MUST write an entry to the audit log:
```
2024-01-15T14:30:00Z | rollback | success | from=sha-abc123 | to=sha-def456 | actor=deploy-user | reason=health-check-failure
```
This entry is REQUIRED even in emergency situations.
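Because the entry is mandatory, it helps to build it with a helper that rejects incomplete input. The sketch below reproduces the field layout of the example above. The function name `format_rollback_entry` and the `AUDIT_TS` override are hypothetical and exist only for illustration.

```bash
# Sketch only: build one rollback audit line in the layout shown above.

# format_rollback_entry STATUS FROM_SHA TO_SHA ACTOR REASON
format_rollback_entry() {
  local status="$1" from="$2" to="$3" actor="$4" reason="$5" ts field
  # Every field is mandatory; an incomplete entry is rejected, not written.
  for field in "$status" "$from" "$to" "$actor" "$reason"; do
    [ -n "$field" ] || return 1
  done
  # AUDIT_TS is a hypothetical override for testing; default is current UTC time.
  ts=${AUDIT_TS:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}
  printf '%s | rollback | %s | from=%s | to=%s | actor=%s | reason=%s\n' \
    "$ts" "$status" "$from" "$to" "$actor" "$reason"
}
```

A typical call appends the result to the log: `format_rollback_entry success sha-abc123 sha-def456 "$(whoami)" health-check-failure >> .prod-audit.log`.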
## Rollback Scope Diagram
```
┌─────────────────────────────────────────────────────────┐
│                    Deployment State                     │
├─────────────────────────────────────────────────────────┤
│  ✓ Rolled back:                                         │
│    - Container image                                    │
│    - Worker binary                                      │
│    - API server binary                                  │
│                                                         │
│  ✗ NOT rolled back:                                     │
│    - Redis queue state                                  │
│    - Artifact storage (new artifacts remain)            │
│    - Audit log chain (must never be modified)           │
│    - Database schema (migrations persist)               │
│    - MinIO snapshots (new snapshots remain)             │
└─────────────────────────────────────────────────────────┘
```
## Compliance Notes (HIPAA)
For HIPAA deployments:
1. **Audit log chain integrity** is paramount
- The rollback entry is appended, never replaces existing entries
- Chain validation must still succeed post-rollback
2. **Verify compliance_mode after rollback**
```bash
curl -fsS http://localhost:9101/health | grep compliance_mode
```
3. **Document the incident**
- Why was the deployment rolled back?
- What was the impact on PHI handling?
- Were there any data exposure risks?
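The append-only requirement in item 1 can be checked mechanically by comparing the log against a copy taken before the rollback. The sketch below only confirms that existing lines are byte-for-byte unchanged and that at least one line was added. It does not replace the chain validation itself, whose format this document does not describe. The function name `audit_log_appended` and the snapshot path in the usage note are hypothetical.

```bash
# Sketch only: verify that a log grew by appending, with no earlier line altered.

# audit_log_appended SNAPSHOT CURRENT
# Succeeds only if CURRENT starts with the exact contents of SNAPSHOT
# and contains at least one additional line (the rollback entry).
audit_log_appended() {
  local snapshot="$1" current="$2" before after
  [ -r "$snapshot" ] && [ -r "$current" ] || return 1
  before=$(( $(wc -l < "$snapshot") ))
  after=$(( $(wc -l < "$current") ))
  [ "$after" -gt "$before" ] || return 1
  # Compare the first $before lines of the current log with the snapshot.
  head -n "$before" "$current" | cmp -s - "$snapshot"
}
```

In practice this means running `cp .prod-audit.log /tmp/audit.before` before the rollback and `audit_log_appended /tmp/audit.before .prod-audit.log` afterwards. A non-zero exit status indicates that an existing entry changed or that no rollback entry was written.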
## Testing Rollback
Test rollback procedures in staging regularly:
```bash
# Deploy to staging (this is the deployment that will be rolled back)
cd deployments
./deploy.sh staging up
# Trigger rollback
./deploy.sh staging rollback
# Verify services
./deploy.sh staging status
```
## See Also
- `.forgejo/workflows/deploy-staging.yml` - Automated rollback in staging
- `.forgejo/workflows/deploy-prod.yml` - Manual rollback for production
- `deployments/deploy.sh` - Rollback script implementation
- `scripts/check-audit-sink.sh` - Audit sink verification