ci(deploy): add Forgejo workflows and deployment automation

Add CI/CD pipelines for Forgejo/GitHub Actions:
- build.yml - Main build pipeline with matrix builds
- deploy-staging.yml - Automated staging deployment
- deploy-prod.yml - Production deployment with rollback support
- security-modes-test.yml - Security mode validation tests

Add deployment artifacts:
- docker-compose.staging.yml for staging environment
- ROLLBACK.md with rollback procedures and playbooks

Supports a multi-environment deployment workflow with
gates between staging and production.

2026-02-26 12:04:23 -05:00


Rollback Procedure and Scope

Overview

This document defines the rollback procedure for FetchML deployments. Rollback is explicitly image-only: it does NOT restore queue state, artifact storage, or the audit log chain.

What Rollback Does

Restores the previous container image
Restarts the worker with the previous binary
Preserves configuration files (unless explicitly corrupted)

What Rollback Does NOT Do

Does NOT restore Redis queue state - jobs in the queue remain as-is
Does NOT restore artifact storage - artifacts created by the newer version remain
Does NOT modify or roll back the audit log chain - doing so would break the chain
Does NOT revert database migrations - schema changes persist

⚠️ Critical: The audit log chain must NEVER be rolled back. Breaking the chain would compromise the entire audit trail.
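
One way to check that a rollback only appended to the audit log is to snapshot the log beforehand and confirm afterwards that the current file still begins with the snapshot's exact bytes. The sketch below is a plain byte-prefix comparison, not the project's chain validation (whose format is not described here); is_append_only and the snapshot path in the usage note are hypothetical names.

```shell
# Sketch: succeed only if the current log starts with the snapshot's bytes,
# i.e. existing entries were neither edited nor removed.
# is_append_only is a hypothetical helper, not part of deploy.sh.
is_append_only() {
  # $1 = snapshot taken before the rollback, $2 = current audit log
  ao_snap_size=$(( $(wc -c < "$1") ))
  ao_cur_size=$(( $(wc -c < "$2") ))
  # A shorter log means entries were removed.
  [ "$ao_cur_size" -ge "$ao_snap_size" ] || return 1
  # Compare the first ao_snap_size bytes of the current log with the snapshot.
  head -c "$ao_snap_size" "$2" | cmp -s - "$1"
}
```

For example, copy .prod-audit.log to /tmp/audit.before ahead of the rollback, then run is_append_only /tmp/audit.before .prod-audit.log afterwards; a non-zero exit status means existing entries changed.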

When to Roll Back

Rollback is appropriate when:

A deployment causes service crashes or health check failures
Critical functionality is broken in the new version
Security vulnerabilities are discovered in the new version

Rollback is NOT appropriate when:

Data corruption has occurred (needs data recovery, not rollback)
The audit log shows anomalies (investigate first, don't roll back blindly)
Queue state is the issue (rollback won't fix this)

Rollback Procedure

Automated Rollback (Staging)

Staging deployments have automatic rollback on failure:

# This happens automatically in the CI pipeline
cd deployments
docker compose -f docker-compose.staging.yml down
docker compose -f docker-compose.staging.yml up -d

Manual Rollback (Production)

For production, manual rollback is required:

# 1. Identify the previous working image
PREVIOUS_SHA=$(tail -2 .prod-audit.log | head -1 | grep -o 'sha-[a-f0-9]*' || echo "previous")

# 2. Verify the previous image exists
docker pull "ghcr.io/jfraeysd/fetchml-worker:$PREVIOUS_SHA"

# 3. Stop current services
cd deployments
docker compose -f docker-compose.prod.yml down

# 4. Update compose to use previous image
# Edit docker-compose.prod.yml to reference $PREVIOUS_SHA

# 5. Start with previous image
docker compose -f docker-compose.prod.yml up -d

# 6. Verify health
curl -fsS http://localhost:9101/health

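# 7. Write rollback entry to audit log
#    Set CURRENT_SHA to the image tag being rolled back from. The CI expression
#    ${{ gitea.sha }} is only expanded inside a workflow, not in a manual shell.
echo "$(date -Iseconds) | rollback | success | from=$CURRENT_SHA | to=$PREVIOUS_SHA | actor=$(whoami)" >> .prod-audit.log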
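
Step 1 falls back to the literal string "previous" when no sha- tag is found in the log, so it can help to validate the tag before pulling. A minimal sketch, assuming tags always have the form sha-<lowercase hex> as the grep pattern in step 1 suggests; is_sha_tag is a hypothetical helper, not part of deploy.sh:

```shell
# Hypothetical helper: accept only tags of the form sha-<hex>, the pattern
# that the grep in step 1 extracts.
is_sha_tag() {
  case "$1" in
    sha-*) ;;              # must start with the sha- prefix
    *) return 1 ;;
  esac
  st_hex="${1#sha-}"
  # Require at least one character after the prefix.
  [ -n "$st_hex" ] || return 1
  case "$st_hex" in
    *[!a-f0-9]*) return 1 ;;   # reject anything that is not lowercase hex
  esac
  return 0
}
```

It could guard step 2, for example: is_sha_tag "$PREVIOUS_SHA" || { echo "no previous sha found" >&2; exit 1; }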

Using deploy.sh

The deploy.sh script includes a rollback function:

# Rollback to previous deployment
cd deployments
./deploy.sh prod rollback

# This will:
# - Read previous SHA from .prod-deployment.log
# - Pull the previous image
# - Restart services
# - Write audit log entry

Post-Rollback Actions

After rollback, you MUST:

Verify health endpoints - Ensure all services are responding
Check queue state - There may be stuck or failed jobs
Review audit log - Ensure chain is intact
Notify team - Document what happened and why
Analyze failure - Root cause analysis for the failed deployment
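
Services can take a few seconds to come up after a restart, so a single health request may fail spuriously. The sketch below polls until the endpoint answers or the attempts run out. wait_healthy, health_probe, and the default attempt count and delay are hypothetical; only the curl -fsS call is taken from the procedure above.

```shell
# Hypothetical probe: succeeds when the URL returns a non-error HTTP status.
# Output is discarded; only curl's exit status matters.
health_probe() {
  curl -fsS "$1" >/dev/null 2>&1
}

# Poll a health URL until it responds or the attempts are used up.
# $1 = URL, $2 = attempts (default 10), $3 = seconds between attempts (default 3)
wait_healthy() {
  wh_url="$1"
  wh_attempts="${2:-10}"
  wh_delay="${3:-3}"
  wh_i=1
  while [ "$wh_i" -le "$wh_attempts" ]; do
    if health_probe "$wh_url"; then
      return 0
    fi
    wh_i=$((wh_i + 1))
    # Sleep only if another attempt follows.
    if [ "$wh_i" -le "$wh_attempts" ]; then
      sleep "$wh_delay"
    fi
  done
  return 1
}
```

For example: wait_healthy http://localhost:9101/health 10 3 || echo "service did not become healthy" >&2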

Rollback Audit Log

Every rollback MUST write an entry to the audit log:

2024-01-15T14:30:00Z | rollback | success | from=sha-abc123 | to=sha-def456 | actor=deploy-user | reason=health-check-failure

This entry is REQUIRED even in emergency situations.
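
To keep entries uniform under pressure, the line can be built by a small helper rather than typed by hand. A sketch that follows the field layout of the example entry above; rollback_entry and its argument order are hypothetical:

```shell
# Print one rollback entry in the pipe-delimited layout of the example above.
# rollback_entry is a hypothetical helper, not part of deploy.sh.
# $1 = from tag, $2 = to tag, $3 = actor, $4 = reason
rollback_entry() {
  if [ "$#" -ne 4 ]; then
    echo "usage: rollback_entry FROM TO ACTOR REASON" >&2
    return 2
  fi
  # UTC timestamp in the same form as the example (e.g. 2024-01-15T14:30:00Z).
  printf '%s | rollback | success | from=%s | to=%s | actor=%s | reason=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" "$4"
}
```

For example: rollback_entry sha-abc123 sha-def456 "$(whoami)" health-check-failure >> .prod-audit.log. Using >> appends, so existing entries are left untouched.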

Rollback Scope Diagram

┌─────────────────────────────────────────────────────────┐
│  Deployment State                                       │
├─────────────────────────────────────────────────────────┤
│  ✓ Rolled back:                                         │
│    - Container image                                    │
│    - Worker binary                                      │
│    - API server binary                                  │
│                                                         │
│  ✗ NOT rolled back:                                     │
│    - Redis queue state                                  │
│    - Artifact storage (new artifacts remain)            │
│    - Audit log chain (must never be modified)           │
│    - Database schema (migrations persist)               │
│    - MinIO snapshots (new snapshots remain)             │
└─────────────────────────────────────────────────────────┘

Compliance Notes (HIPAA)

For HIPAA deployments:

Audit log chain integrity is paramount
- The rollback entry is appended, never replaces existing entries
- Chain validation must still succeed post-rollback

Verify compliance_mode after rollback

curl -fsS http://localhost:9101/health | grep compliance_mode

Document the incident
- Why was the deployment rolled back?
- What was the impact on PHI handling?
- Were there any data exposure risks?

Testing Rollback

Test rollback procedures in staging regularly:

# Deploy to staging (this is the deployment that will be rolled back)
cd deployments
./deploy.sh staging up

# Trigger rollback
./deploy.sh staging rollback

# Verify services
./deploy.sh staging status
