fetch_ml/deployments/ROLLBACK.md
Jeremie Fraeys 685f79c4a7
ci(deploy): add Forgejo workflows and deployment automation
Add CI/CD pipelines for Forgejo/GitHub Actions:
- build.yml - Main build pipeline with matrix builds
- deploy-staging.yml - Automated staging deployment
- deploy-prod.yml - Production deployment with rollback support
- security-modes-test.yml - Security mode validation tests

Add deployment artifacts:
- docker-compose.staging.yml for staging environment
- ROLLBACK.md with rollback procedures and playbooks

Supports multi-environment deployment workflow with proper
gates between staging and production.
2026-02-26 12:04:23 -05:00

Rollback Procedure and Scope

Overview

This document defines the rollback procedure for FetchML deployments. Rollback is explicitly image-only: it does NOT restore queue state, artifact storage, database schema, or the audit log chain.

What Rollback Does

  • Restores the previous container image
  • Restarts the worker with the previous binary
  • Preserves configuration files (unless explicitly corrupted)

What Rollback Does NOT Do

  • Does NOT restore Redis queue state - jobs in the queue remain as-is
  • Does NOT restore artifact storage - artifacts created by newer version remain
  • Does NOT modify or roll back the audit log chain - doing so would break the chain
  • Does NOT restore database migrations - schema changes persist

⚠️ Critical: The audit log chain must NEVER be rolled back. Breaking the chain would compromise the entire audit trail.

When to Roll Back

Rollback is appropriate when:

  • A deployment causes service crashes or health check failures
  • Critical functionality is broken in the new version
  • Security vulnerabilities are discovered in the new version

Rollback is NOT appropriate when:

  • Data corruption has occurred (needs data recovery, not rollback)
  • The audit log shows anomalies (investigate first, do not roll back blindly)
  • Queue state is the issue (rollback won't fix this)

Rollback Procedure

Automated Rollback (Staging)

Staging deployments have automatic rollback on failure:

# This happens automatically in the CI pipeline
cd deployments
docker compose -f docker-compose.staging.yml down
docker compose -f docker-compose.staging.yml up -d

Manual Rollback (Production)

For production, manual rollback is required:

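# Work from the deployments directory so every step reads and writes the same .prod-audit.log
cd deployments
REASON="health-check-failure"  # short reason for the rollback, recorded in the audit entry

# 1. Identify the failed image and the previous working image
#    (assumes the last log line records the failed deployment)
CURRENT_SHA=$(tail -1 .prod-audit.log | grep -o 'sha-[a-f0-9]*' || echo "unknown")
PREVIOUS_SHA=$(tail -2 .prod-audit.log | head -1 | grep -o 'sha-[a-f0-9]*' || echo "previous")

# 2. Verify the previous image exists
docker pull "ghcr.io/jfraeysd/fetchml-worker:$PREVIOUS_SHA"

# 3. Stop current services
docker compose -f docker-compose.prod.yml down

# 4. Update compose to use previous image
# Edit docker-compose.prod.yml to reference $PREVIOUS_SHA

# 5. Start with previous image
docker compose -f docker-compose.prod.yml up -d

# 6. Verify health
curl -fsS http://localhost:9101/health

# 7. Write rollback entry to audit log
#    (the ${{ gitea.sha }} workflow expression is not valid in an interactive shell, so use $CURRENT_SHA)
echo "$(date -Iseconds) | rollback | success | from=$CURRENT_SHA | to=$PREVIOUS_SHA | actor=$(whoami) | reason=$REASON" >> .prod-audit.log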
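Two parts of the manual procedure are easy to get wrong by hand: picking exactly one image tag out of a log line, and editing the compose file. The following is a minimal sketch, not code from deploy.sh. It assumes that log entries carry tags of the form `sha-<hex>` (as in the examples in this document), that the image an entry left running is the last such token on its line, and that the compose file contains an `image:` line for `fetchml-worker`. The function names `last_tag`, `previous_tag`, and `pin_image_tag` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not taken from deploy.sh.

# Print the last sha-<hex> token found on stdin (prints nothing if there is none).
# A rollback entry has both from= and to=; the last token is the to= tag.
last_tag() {
  grep -o 'sha-[a-f0-9]*' | tail -n 1
}

# Print the tag recorded by the second-to-last entry of an audit log.
# Assumption: the last entry describes the deployment being rolled back.
previous_tag() {
  local log_file="$1" lines
  lines=$(wc -l < "$log_file")
  # With fewer than two entries there is no previous deployment to return to.
  [ "$((lines))" -ge 2 ] || return 1
  tail -n 2 "$log_file" | head -n 1 | last_tag
}

# Rewrite the tag on the fetchml-worker image line of a compose file.
# Assumption: the line looks like "image: <registry>/fetchml-worker:<tag>".
# A backup of the original file is kept next to it with a .bak suffix.
pin_image_tag() {
  local compose_file="$1" tag="$2"
  sed -i.bak -E \
    "s|(image:[[:space:]]*[^[:space:]]*fetchml-worker):[^[:space:]]+|\1:${tag}|" \
    "$compose_file"
}
```

With these loaded, `pin_image_tag docker-compose.prod.yml "$(previous_tag .prod-audit.log)"` would replace the manual edit, provided the assumptions about the log and compose file hold for the real files.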

Using deploy.sh

The deploy.sh script includes a rollback function:

# Rollback to previous deployment
cd deployments
./deploy.sh prod rollback

# This will:
# - Read previous SHA from .prod-deployment.log
# - Pull the previous image
# - Restart services
# - Write audit log entry

Post-Rollback Actions

After rollback, you MUST:

  1. Verify health endpoints - Ensure all services are responding
  2. Check queue state - There may be stuck or failed jobs
  3. Review audit log - Ensure chain is intact
  4. Notify team - Document what happened and why
  5. Analyze failure - Root cause analysis for the failed deployment
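Health verification, the first item above, usually needs a retry loop, because services can take a few seconds to answer after a restart. The sketch below polls a health URL until it responds or the attempts run out. The function name `wait_healthy` and the default attempt count and delay are illustrative assumptions, not values from deploy.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a health URL until it answers or attempts run out.
# Usage: wait_healthy URL [ATTEMPTS] [DELAY_SECONDS]
wait_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}"
  local i
  for ((i = 1; i <= attempts; i++)); do
    # -f makes curl return non-zero on HTTP errors; --max-time bounds each probe.
    if curl -fsS --max-time 5 "$url" > /dev/null 2>&1; then
      echo "healthy after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "still unhealthy after ${attempts} attempts" >&2
  return 1
}
```

For example, `wait_healthy http://localhost:9101/health 10 3` probes the endpoint used in the manual procedure up to ten times, three seconds apart.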

Rollback Audit Log

Every rollback MUST write an entry to the audit log:

2024-01-15T14:30:00Z | rollback | success | from=sha-abc123 | to=sha-def456 | actor=deploy-user | reason=health-check-failure

This entry is REQUIRED even in emergency situations.
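A small helper keeps every entry in the same shape as the example above, including the `reason` field, which is easy to forget in an emergency. This is a sketch under assumptions: the function names `format_rollback_entry` and `append_rollback_entry` are hypothetical, and the field order is copied from the example entry rather than from deploy.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build one audit entry in the format shown above.
# Fields: timestamp | rollback | status | from | to | actor | reason
format_rollback_entry() {
  local status="$1" from_tag="$2" to_tag="$3" actor="$4" reason="$5"
  local ts
  # UTC timestamp in the same shape as 2024-01-15T14:30:00Z
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  printf '%s | rollback | %s | from=%s | to=%s | actor=%s | reason=%s\n' \
    "$ts" "$status" "$from_tag" "$to_tag" "$actor" "$reason"
}

# Append the entry to a log file. ">>" only adds a line at the end;
# it never rewrites earlier entries.
append_rollback_entry() {
  local log_file="$1"
  shift
  format_rollback_entry "$@" >> "$log_file"
}
```

A call such as `append_rollback_entry .prod-audit.log success sha-abc123 sha-def456 "$(whoami)" health-check-failure` writes one line in that format.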

Rollback Scope Diagram

┌─────────────────────────────────────────────────────────┐
│  Deployment State                                       │
├─────────────────────────────────────────────────────────┤
│  ✓ Rolled back:                                         │
│    - Container image                                    │
│    - Worker binary                                      │
│    - API server binary                                  │
│                                                         │
│  ✗ NOT rolled back:                                     │
│    - Redis queue state                                  │
│    - Artifact storage (new artifacts remain)            │
│    - Audit log chain (must never be modified)           │
│    - Database schema (migrations persist)               │
│    - MinIO snapshots (new snapshots remain)             │
└─────────────────────────────────────────────────────────┘

Compliance Notes (HIPAA)

For HIPAA deployments:

  1. Audit log chain integrity is paramount

    • The rollback entry is appended, never replaces existing entries
    • Chain validation must still succeed post-rollback
  2. Verify compliance_mode after rollback

    curl -fsS http://localhost:9101/health | grep compliance_mode
    
  3. Document the incident

    • Why was the deployment rolled back?
    • What was the impact on PHI handling?
    • Were there any data exposure risks?
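The requirement in item 1 that the rollback entry is appended and never replaces existing entries can be checked mechanically if a copy of the log is taken before the rollback. The sketch below only compares bytes. It does not validate any cryptographic chain, whose format is not described in this document. The function name `is_append_only` and the idea of a pre-rollback snapshot file are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if the snapshot taken before the rollback
# is still a byte-for-byte prefix of the current log (entries were only appended).
# Usage: is_append_only SNAPSHOT_FILE CURRENT_LOG
is_append_only() {
  local snapshot="$1" current="$2"
  local snap_size cur_size
  snap_size=$(wc -c < "$snapshot")
  cur_size=$(wc -c < "$current")
  # Arithmetic expansion strips the padding some wc implementations add.
  snap_size=$((snap_size))
  cur_size=$((cur_size))
  # A log that shrank cannot have been appended to.
  [ "$cur_size" -ge "$snap_size" ] || return 1
  # Compare the first snap_size bytes of the current log with the snapshot.
  head -c "$snap_size" "$current" | cmp -s - "$snapshot"
}
```

One possible use: copy `.prod-audit.log` to a temporary file before starting the rollback, then run `is_append_only` on the copy and the live log afterwards, and treat a non-zero exit status as a reason to investigate.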

Testing Rollback

Test rollback procedures in staging regularly:

# Deploy to staging (stands in for the deployment that will be rolled back)
cd deployments
./deploy.sh staging up

# Trigger rollback
./deploy.sh staging rollback

# Verify services
./deploy.sh staging status

See Also

  • .forgejo/workflows/deploy-staging.yml - Automated rollback in staging
  • .forgejo/workflows/deploy-prod.yml - Manual rollback for production
  • deployments/deploy.sh - Rollback script implementation
  • scripts/check-audit-sink.sh - Audit sink verification