Add CI/CD pipelines for Forgejo/GitHub Actions:
- build.yml - Main build pipeline with matrix builds
- deploy-staging.yml - Automated staging deployment
- deploy-prod.yml - Production deployment with rollback support
- security-modes-test.yml - Security mode validation tests
Add deployment artifacts:
- docker-compose.staging.yml for staging environment
- ROLLBACK.md with rollback procedures and playbooks
Supports multi-environment deployment workflow with proper gates between staging and production.
Rollback Procedure and Scope
Overview
This document defines the rollback procedure for FetchML deployments. Rollback is explicitly image-only - it does NOT restore queue state, artifact storage, or the audit log chain.
What Rollback Does
- Restores the previous container image
- Restarts the worker with the previous binary
- Preserves configuration files (unless explicitly corrupted)
What Rollback Does NOT Do
- Does NOT restore Redis queue state - jobs in the queue remain as-is
- Does NOT restore artifact storage - artifacts created by the newer version remain
- Does NOT modify or roll back the audit log chain - doing so would break the chain
- Does NOT revert database migrations - schema changes persist
⚠️ Critical: The audit log chain must NEVER be rolled back. Breaking the chain would compromise the entire audit trail.
When to Roll Back
Rollback is appropriate when:
- A deployment causes service crashes or health check failures
- Critical functionality is broken in the new version
- Security vulnerabilities are discovered in the new version
Rollback is NOT appropriate when:
- Data corruption has occurred (needs data recovery, not rollback)
- The audit log shows anomalies (investigate first, don't roll back blindly)
- Queue state is the issue (rollback won't fix this)
Rollback Procedure
Automated Rollback (Staging)
Staging deployments have automatic rollback on failure:
# This happens automatically in the CI pipeline
cd deployments
docker compose -f docker-compose.staging.yml down
docker compose -f docker-compose.staging.yml up -d
Manual Rollback (Production)
For production, manual rollback is required:
# 1. Identify the previous working image
PREVIOUS_SHA=$(tail -2 .prod-audit.log | head -1 | grep -o 'sha-[a-f0-9]*' || echo "previous")
# 2. Verify the previous image exists
docker pull ghcr.io/jfraeysd/fetchml-worker:$PREVIOUS_SHA
# 3. Stop current services
cd deployments
docker compose -f docker-compose.prod.yml down
# 4. Update compose to use previous image
# Edit docker-compose.prod.yml to reference $PREVIOUS_SHA
# 5. Start with previous image
docker compose -f docker-compose.prod.yml up -d
# 6. Verify health
curl -fsS http://localhost:9101/health
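# 7. Write rollback entry to audit log
# Set CURRENT_SHA to the tag being rolled back from. The CI expression
# ${{ gitea.sha }} is only expanded inside a workflow, not in an interactive shell.
echo "$(date -Iseconds) | rollback | success | from=$CURRENT_SHA | to=$PREVIOUS_SHA | actor=$(whoami)" >> .prod-audit.log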
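Step 1 of the manual procedure silently continues when no SHA can be found. The following is a minimal sketch of a stricter lookup, not the project's own tooling: the function name `previous_sha` is hypothetical, and it assumes (as the one-liner above does) that the second-to-last line of the log carries the tag of the last known-good image.

```bash
# Hypothetical helper: print the first sha-<hex> tag found on the
# second-to-last line of an audit log, or fail loudly if there is none.
previous_sha() {
  local log_file="$1" line sha
  [ -r "$log_file" ] || { echo "cannot read $log_file" >&2; return 1; }
  # Two entries are needed: the current deployment and the one before it.
  [ "$(wc -l < "$log_file")" -ge 2 ] || {
    echo "fewer than two entries in $log_file" >&2
    return 1
  }
  line=$(tail -n 2 "$log_file" | head -n 1)
  # Require at least one hex digit after "sha-"; keep only the first match.
  sha=$(printf '%s\n' "$line" | grep -o 'sha-[a-f0-9]\{1,\}' | head -n 1) || true
  [ -n "$sha" ] || { echo "no sha-<hex> tag on line: $line" >&2; return 1; }
  printf '%s\n' "$sha"
}
```

A possible use is `PREVIOUS_SHA=$(previous_sha .prod-audit.log)`, stopping the procedure if the command returns a non-zero status instead of pulling a guessed tag.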
Using deploy.sh
The deploy.sh script includes a rollback function:
# Rollback to previous deployment
cd deployments
./deploy.sh prod rollback
# This will:
# - Read previous SHA from .prod-deployment.log
# - Pull the previous image
# - Restart services
# - Write audit log entry
Post-Rollback Actions
After rollback, you MUST:
- Verify health endpoints - Ensure all services are responding
- Check queue state - There may be stuck or failed jobs
- Review audit log - Ensure chain is intact
- Notify team - Document what happened and why
- Analyze failure - Root cause analysis for the failed deployment
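Services may need a few seconds after restart before the health endpoint responds, so a single probe can report a false failure. The sketch below shows one way to retry the check; the helper name `retry`, the attempt count, and the delay are assumptions, while the URL is the one used in the manual procedure.

```bash
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds between attempts; succeed as soon as the command does.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "gave up after $attempts attempts: $*" >&2
  return 1
}

# Example (attempt count and delay are illustrative):
# retry 10 3 curl -fsS http://localhost:9101/health
```

If the final attempt still fails, the function returns non-zero, which is the point at which the rollback itself should be treated as failed and escalated.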
Rollback Audit Log
Every rollback MUST write an entry to the audit log:
2024-01-15T14:30:00Z | rollback | success | from=sha-abc123 | to=sha-def456 | actor=deploy-user | reason=health-check-failure
This entry is REQUIRED even in emergency situations.
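To keep emergency entries consistent with the format above, the line can be produced by a small function rather than typed by hand. This is a sketch under assumptions: `write_rollback_entry` is a hypothetical name, it uses `id -un` for the actor, and it writes a UTC timestamp to match the `Z` suffix in the example entry.

```bash
# Hypothetical helper: append one rollback entry in the format shown above.
# Arguments: <log file> <result> <from sha> <to sha> <reason>
write_rollback_entry() {
  local log_file="$1" result="$2" from_sha="$3" to_sha="$4" reason="$5"
  # Refuse to write an incomplete entry.
  if [ "$#" -ne 5 ] || [ -z "$result" ] || [ -z "$from_sha" ] ||
     [ -z "$to_sha" ] || [ -z "$reason" ]; then
    echo "usage: write_rollback_entry LOG RESULT FROM TO REASON" >&2
    return 1
  fi
  # Append only (>>): existing entries are never rewritten.
  printf '%s | rollback | %s | from=%s | to=%s | actor=%s | reason=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$result" "$from_sha" "$to_sha" \
    "$(id -un)" "$reason" >> "$log_file"
}

# Example (values are illustrative):
# write_rollback_entry .prod-audit.log success sha-abc123 sha-def456 health-check-failure
```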
Rollback Scope Diagram
┌────────────────────────────────────────────────────┐
│ Deployment State                                   │
├────────────────────────────────────────────────────┤
│ ✓ Rolled back:                                     │
│   - Container image                                │
│   - Worker binary                                  │
│   - API server binary                              │
│                                                    │
│ ✗ NOT rolled back:                                 │
│   - Redis queue state                              │
│   - Artifact storage (new artifacts remain)        │
│   - Audit log chain (must never be modified)       │
│   - Database schema (migrations persist)           │
│   - MinIO snapshots (new snapshots remain)         │
└────────────────────────────────────────────────────┘
Compliance Notes (HIPAA)
For HIPAA deployments:
- Audit log chain integrity is paramount
  - The rollback entry is appended, never replaces existing entries
  - Chain validation must still succeed post-rollback
- Verify compliance_mode after rollback
  curl http://localhost:9101/health | grep compliance_mode
- Document the incident
  - Why was the deployment rolled back?
  - What was the impact on PHI handling?
  - Were there any data exposure risks?
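The append-only requirement can be spot-checked independently of the project's chain validation. The sketch below is not that validator and knows nothing about the chain format. It only records the line count and a SHA-256 digest of the log before a rollback and confirms afterwards that those lines are unchanged. The function names are hypothetical, and `sha256sum` from GNU coreutils is assumed to be available.

```bash
# Hypothetical helper: print "<line count> <sha256 of those lines>".
snapshot_log() {
  local log_file="$1" n
  n=$(wc -l < "$log_file" | tr -d ' ')
  printf '%s %s\n' "$n" "$(head -n "$n" "$log_file" | sha256sum | cut -d' ' -f1)"
}

# Hypothetical helper: succeed only if the first <old count> lines of the
# log still hash to <old hash>, i.e. the log has only been appended to.
verify_append_only() {
  local log_file="$1" old_count="$2" old_hash="$3" now_count now_hash
  now_count=$(wc -l < "$log_file" | tr -d ' ')
  [ "$now_count" -ge "$old_count" ] || { echo "log shrank" >&2; return 1; }
  now_hash=$(head -n "$old_count" "$log_file" | sha256sum | cut -d' ' -f1)
  [ "$now_hash" = "$old_hash" ] || {
    echo "existing entries were modified" >&2
    return 1
  }
}

# Example:
# set -- $(snapshot_log .prod-audit.log)     # before rollback
# verify_append_only .prod-audit.log "$1" "$2"   # after rollback
```

A passing check here shows only that earlier entries were not edited or removed; chain validation must still be run separately.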
Testing Rollback
Test rollback procedures in staging regularly:
# Simulate a failed deployment
cd deployments
./deploy.sh staging up
# Trigger rollback
./deploy.sh staging rollback
# Verify services
./deploy.sh staging status
See Also
- .forgejo/workflows/deploy-staging.yml - Automated rollback in staging
- .forgejo/workflows/deploy-prod.yml - Manual rollback for production
- deployments/deploy.sh - Rollback script implementation
- scripts/check-audit-sink.sh - Audit sink verification