# Rollback Procedure and Scope

## Overview

This document defines the rollback procedure for FetchML deployments. **Rollback is explicitly image-only** - it does NOT restore queue state, artifact storage, or the audit log chain.

## What Rollback Does

- Restores the previous container image
- Restarts the worker with the previous binary
- Preserves configuration files (unless explicitly corrupted)

## What Rollback Does NOT Do

- **Does NOT restore Redis queue state** - jobs in the queue remain as-is
- **Does NOT restore artifact storage** - artifacts created by the newer version remain
- **Does NOT modify or roll back the audit log chain** - doing so would break the chain
- **Does NOT revert database migrations** - schema changes persist

⚠️ **Critical**: The audit log chain must NEVER be rolled back. Breaking the chain would compromise the entire audit trail.

## When to Roll Back

Rollback is appropriate when:
- A deployment causes service crashes or health check failures
- Critical functionality is broken in the new version
- Security vulnerabilities are discovered in the new version

Rollback is NOT appropriate when:
- Data corruption has occurred (this needs data recovery, not rollback)
- The audit log shows anomalies (investigate first, don't roll back blindly)
- Queue state is the issue (rollback won't fix this)

## Rollback Procedure

### Automated Rollback (Staging)

Staging deployments have automatic rollback on failure:

```bash
# This happens automatically in the CI pipeline
cd deployments
docker compose -f docker-compose.staging.yml down
docker compose -f docker-compose.staging.yml up -d
```

### Manual Rollback (Production)

Production rollbacks must be performed manually:

```bash
# Run every step from the deployments directory so that steps 1 and 7 use
# the same .prod-audit.log (assumed to live alongside the compose files)
cd deployments

# 1. Identify the current and previous images
#    (assumes the last log line records the currently deployed image)
CURRENT_SHA=$(tail -1 .prod-audit.log | grep -o 'sha-[a-f0-9]*' || echo "unknown")
PREVIOUS_SHA=$(tail -2 .prod-audit.log | head -1 | grep -o 'sha-[a-f0-9]*' || echo "previous")

# 2. Verify the previous image exists
docker pull ghcr.io/jfraeysd/fetchml-worker:$PREVIOUS_SHA

# 3. Stop current services
docker compose -f docker-compose.prod.yml down

# 4. Update compose to use previous image
# Edit docker-compose.prod.yml to reference $PREVIOUS_SHA

# 5. Start with previous image
docker compose -f docker-compose.prod.yml up -d

# 6. Verify health
curl -fsS http://localhost:9101/health

# 7. Write rollback entry to audit log
#    ($CURRENT_SHA replaces the CI-only ${{ gitea.sha }} expression, which a shell cannot expand)
REASON="health-check-failure"  # replace with the actual reason
echo "$(date -Iseconds) | rollback | success | from=$CURRENT_SHA | to=$PREVIOUS_SHA | actor=$(whoami) | reason=$REASON" >> .prod-audit.log
```
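Step 1 depends on pulling an image tag out of a single audit log line. The following is a minimal sketch of that extraction as a reusable function, assuming entries use the `sha-<hex>` tag format shown in this document. The name `last_sha_in_line` is hypothetical and is not part of `deploy.sh`.

```bash
# Hypothetical helper: print the last sha-<hex> tag found in one audit log line.
# For a rollback entry this is the `to=` image, i.e. the image left deployed.
last_sha_in_line() {
  local line="$1"
  # grep -o prints each match on its own line; tail -n 1 keeps the final one.
  printf '%s\n' "$line" | grep -o 'sha-[a-f0-9]*' | tail -n 1
}
```

Taking the last match matters when the line is itself a rollback entry, because such a line carries both a `from=` and a `to=` tag. The function prints nothing when the line contains no tag, so callers should check for an empty result before pulling an image.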

### Using deploy.sh

The `deploy.sh` script includes a rollback function:

```bash
# Rollback to previous deployment
cd deployments
./deploy.sh prod rollback

# This will:
# - Read previous SHA from .prod-deployment.log
# - Pull the previous image
# - Restart services
# - Write audit log entry
```

## Post-Rollback Actions

After rollback, you MUST:

1. **Verify health endpoints** - Ensure all services are responding
2. **Check queue state** - There may be stuck or failed jobs
3. **Review audit log** - Ensure chain is intact
4. **Notify team** - Document what happened and why
5. **Analyze failure** - Root cause analysis for the failed deployment
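For the health verification in step 1, services may need a few seconds to come up after a restart, so a single probe can fail spuriously. The following is a minimal sketch of a retry wrapper. The function name `retry` and the attempt and delay values in the usage note are assumptions, not part of `deploy.sh`.

```bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry <attempts> <delay-seconds> <command...>
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Skip the pause after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

A possible invocation is `retry 10 3 curl -fsS http://localhost:9101/health`, which probes the health endpoint up to ten times with three seconds between attempts and returns non-zero if every attempt fails.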

## Rollback Audit Log

Every rollback MUST write an entry to the audit log:

```
2024-01-15T14:30:00Z | rollback | success | from=sha-abc123 | to=sha-def456 | actor=deploy-user | reason=health-check-failure
```

This entry is REQUIRED even in emergency situations.
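Building the entry by hand under pressure invites typos, so it can help to generate it. The following is a minimal sketch that produces a line in the field order shown above. The function name `rollback_entry` and the `ACTOR` and `TIMESTAMP` override variables are hypothetical.

```bash
# Hypothetical helper: build one rollback audit entry in the documented field order.
# Arguments: status, from-image, to-image, reason.
# ACTOR and TIMESTAMP may be preset; otherwise the current user and UTC time are used.
rollback_entry() {
  local status="$1" from="$2" to="$3" reason="$4"
  local actor="${ACTOR:-$(whoami)}"
  local ts="${TIMESTAMP:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  printf '%s | rollback | %s | from=%s | to=%s | actor=%s | reason=%s\n' \
    "$ts" "$status" "$from" "$to" "$actor" "$reason"
}
```

The output is meant to be appended, never written over the log, for example `rollback_entry success sha-abc123 sha-def456 health-check-failure >> .prod-audit.log`.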

## Rollback Scope Diagram

```
┌─────────────────────────────────────────────────────────┐
│                    Deployment State                     │
├─────────────────────────────────────────────────────────┤
│  ✓ Rolled back:                                         │
│    - Container image                                    │
│    - Worker binary                                      │
│    - API server binary                                  │
│                                                         │
│  ✗ NOT rolled back:                                     │
│    - Redis queue state                                  │
│    - Artifact storage (new artifacts remain)            │
│    - Audit log chain (must never be modified)           │
│    - Database schema (migrations persist)               │
│    - MinIO snapshots (new snapshots remain)             │
└─────────────────────────────────────────────────────────┘
```

## Compliance Notes (HIPAA)

For HIPAA deployments:

1. **Audit log chain integrity** is paramount
   - The rollback entry is appended; it never replaces existing entries
   - Chain validation must still succeed post-rollback

2. **Verify `compliance_mode` after rollback**
   ```bash
   curl -fsS http://localhost:9101/health | grep compliance_mode
   ```

3. **Document the incident**
   - Why was the deployment rolled back?
   - What was the impact on PHI handling?
   - Were there any data exposure risks?
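The append-only requirement in item 1 can be checked mechanically. The following is a minimal sketch, assuming a copy of the audit log was saved before the rollback. The function name `is_append_only` and the snapshot file are hypothetical, and the check does not validate the chain itself, which depends on FetchML's own chain format.

```bash
# Hypothetical check: succeed only if every line of the pre-rollback snapshot
# is still present, unchanged and in order, at the start of the current log.
is_append_only() {
  local before="$1" after="$2"
  local n m
  # Arithmetic expansion strips the padding some wc implementations print.
  n=$(( $(wc -l < "$before") ))
  m=$(( $(wc -l < "$after") ))
  # The current log must be at least as long as the snapshot...
  [ "$m" -ge "$n" ] || return 1
  # ...and its first n lines must be byte-identical to the snapshot.
  head -n "$n" "$after" | cmp -s - "$before"
}
```

For example, `is_append_only audit-before.log .prod-audit.log` returns non-zero if any earlier entry was edited, removed, or reordered.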

## Testing Rollback

Test rollback procedures in staging regularly:

```bash
# Simulate a failed deployment
cd deployments
./deploy.sh staging up

# Trigger rollback
./deploy.sh staging rollback

# Verify services
./deploy.sh staging status
```

## See Also

- `.forgejo/workflows/deploy-staging.yml` - Automated rollback in staging
- `.forgejo/workflows/deploy-prod.yml` - Manual rollback for production
- `deployments/deploy.sh` - Rollback script implementation
- `scripts/check-audit-sink.sh` - Audit sink verification