
Rollback Procedures

Operational recovery procedures for common infrastructure scenarios.

Quick Reference

Scenario            Command/Method                    Location
App rollback        rollback.sh <app> [sha]           web:/opt/deploy/scripts/
Backup restore      restic restore + backup-verify    Services host
Container restart   docker compose restart            /opt/<service>/
Full host rebuild   ./setup --no-terraform            Local workstation

1. Application Deployment Rollback

Automated Rollback (Last 5 Versions Kept)

On the web host, the app_deployer role maintains the last 5 versions:

# Roll back to the previous version
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api

# Roll back to a specific version
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api <older-sha>

Manual Recovery

# Check service status
ssh web sudo systemctl status my-api

# View deployment logs
ssh web sudo cat /var/log/deploy.log

# Restart service manually
ssh web sudo systemctl restart my-api

2. Backup Restoration

Prerequisites

Ensure you have:

  • RESTIC_REPOSITORY — backup destination (e.g., s3:https://...)
  • RESTIC_PASSWORD — encryption password
  • RESTIC_AWS_ACCESS_KEY_ID / RESTIC_AWS_SECRET_ACCESS_KEY — S3 credentials
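To keep these credentials out of shell history, they can be sourced from a root-only env file before each restic invocation. A minimal sketch — the wrapper name and the /etc/restic/restic.env path are assumptions, not part of the deployed roles:

```shell
# Hypothetical wrapper: load restic credentials from a root-only env file,
# export them, then run restic. RESTIC_ENV_FILE overrides the default path.
restic_env() {
  set -a                                          # export everything sourced below
  . "${RESTIC_ENV_FILE:-/etc/restic/restic.env}"  # RESTIC_REPOSITORY, RESTIC_PASSWORD, AWS keys
  set +a
  restic "$@"
}
```

Usage: `restic_env snapshots` behaves like `restic snapshots` with the credentials in place.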

List Available Snapshots

ssh services sudo restic snapshots
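For scripting, restic can emit JSON instead of the table view; a sketch for grabbing the most recent snapshot ID for a given path, assuming jq is available on your workstation:

```shell
# Most recent snapshot ID for a path (restic --json, filtered with jq)
ssh services "sudo restic snapshots --json --path /opt/forgejo" \
  | jq -r '.[-1].short_id'
```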

Restore Procedure

# Stop the service being restored
ssh services sudo systemctl stop forgejo  # or other service

# Create backup of current state (optional safety)
ssh services sudo mv /opt/forgejo /opt/forgejo.pre-restore.$(date +%Y%m%d)

# Restore from specific snapshot
ssh services sudo restic restore <snapshot-id> --target /

# Verify restore (run built-in verification)
ssh services sudo /usr/local/sbin/backup-verify

# Restart service
ssh services sudo docker compose -f /opt/forgejo/docker-compose.yml up -d

Full System Restore

For catastrophic failure, rebuild from backups:

  1. Reprovision host (if needed): ./setup
  2. Run Ansible to restore configs: ./setup --ansible-only
  3. Restore data via restic for each service
  4. Verify: Run ansible-playbook playbooks/tests/test_config.yml
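Step 3 can be scripted. A dry-run sketch that prints one restore command per service for review before piping them to the host — the service names shown are examples, not an authoritative list:

```shell
# Dry-run helper: emit one restic restore command per service data directory.
restore_plan() {
  for svc in "$@"; do
    echo "sudo restic restore latest --target / --include /opt/${svc}"
  done
}

# Review the output, then run:  restore_plan forgejo lldap | ssh services sudo bash
```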

3. Container Stack Recovery

Individual Service Restart

# On services host
ssh services

# Navigate to service directory
cd /opt/<service>/

# Check status
sudo docker compose ps

# View logs
sudo docker compose logs -f

# Restart service
sudo docker compose restart

# Full recreate (preserves volumes)
sudo docker compose down && sudo docker compose up -d

Traefik Certificate Issues

# Trigger certificate re-request
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml restart

# Check certificate status
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml logs -f

# Force ACME re-validation (if needed)
# Note: acme.json is in /opt/traefik/letsencrypt/
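If a certificate is stuck, one recovery path is to clear the ACME store and let Traefik re-request everything on startup. A sketch, assuming the acme.json location noted above — note this re-issues every certificate, so mind Let's Encrypt rate limits:

```shell
# Stop Traefik, back up and remove the ACME store, restart to force re-issuance
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml down
ssh services "sudo cp /opt/traefik/letsencrypt/acme.json /opt/traefik/letsencrypt/acme.json.bak \
  && sudo rm /opt/traefik/letsencrypt/acme.json"
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml up -d
```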

Database Recovery (Postgres/Redis)

If using app_core role for shared databases:

# Restore from backup
ssh web sudo restic restore <snapshot-id> --target / --include /var/lib/docker/volumes/app_core*

# Or recreate from app initialization (if data is disposable)
ssh web sudo docker compose -f /opt/app_core/docker-compose.yml down -v
ssh web sudo docker compose -f /opt/app_core/docker-compose.yml up -d

4. DNS and TLS Recovery

Certificate Expiry / Renewal Issues

# Check certificate expiration (pass -servername so Traefik serves the right SNI cert)
ssh services "echo | openssl s_client -connect 127.0.0.1:443 -servername <service-domain> 2>/dev/null | openssl x509 -noout -dates"

# Force Traefik renewal (restart triggers check)
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml restart

DNS Record Recovery

If DNS records are missing or incorrect:

# View Terraform plan (safe, read-only)
./setup -- terraform plan

# Apply DNS changes only (terraform -target needs a full resource address, not just a type)
./setup -- terraform apply -target='cloudflare_record.<record-name>'

5. Full Host Rebuild

Scenario: Complete server loss

  1. Verify Terraform state (Linode instance exists):

    ./setup -- terraform show
    
  2. Reprovision if needed:

    # If instance is damaged, destroy and recreate
    ./setup -- terraform taint linode_instance.services  # or .web
    ./setup
    
  3. Ansible-only run (if instance exists):

    ./setup --ansible-only
    
  4. Restore data from backups:

    # On rebuilt host
    ssh services sudo /usr/local/sbin/backup-verify
    ssh services sudo restic restore latest --target / --include /opt/forgejo
    

6. Authelia/SSO Recovery

Locked Out of Authelia

If you cannot access Authelia (auth.jfraeys.com):

  1. SSH to services host (direct access, bypasses Authelia)

  2. Edit Authelia config (if misconfiguration):

    ssh services sudo nano /opt/authelia/configuration.yml
    ssh services sudo docker compose -f /opt/authelia/docker-compose.yml restart
    
  3. Emergency bypass (temporary disable): Edit traefik router to remove middleware
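The bypass in step 3 can be done non-interactively if the router is defined via Traefik's file provider. A heavily hedged sketch — the /opt/traefik/dynamic/routers.yml path and key layout are assumptions; check how your routers are actually declared (labels vs. file provider) before using it:

```shell
# Comment out the middlewares key via the file provider, keeping a .bak backup
ssh services "sudo sed -i.bak 's/^\( *\)middlewares:/\1# middlewares:/' /opt/traefik/dynamic/routers.yml"
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml restart
# Restore afterwards:
#   sudo mv /opt/traefik/dynamic/routers.yml.bak /opt/traefik/dynamic/routers.yml
```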

LLDAP Password Reset

# Access LLDAP directly (not through Authelia)
# Edit user password via LLDAP admin interface or CLI
ssh services sudo docker compose -f /opt/lldap/docker-compose.yml exec lldap /app/lldap_set_password

7. Forgejo Recovery

Repository Corruption

# Run the doctor maintenance checks
ssh services sudo docker compose -f /opt/forgejo/docker-compose.yml exec forgejo gitea doctor check --all

# Restore from backup
ssh services sudo restic restore <snapshot-id> --target / --include /opt/forgejo
ssh services sudo docker compose -f /opt/forgejo/docker-compose.yml restart

Runner Re-registration

# On web host - force re-register
ansible-playbook playbooks/web.yml \
  --limit web \
  --tags forgejo_runner \
  -e forgejo_runner_force_reregister=true

8. Testing After Recovery

Always run smoke tests after any recovery:

# Full test suite
ansible-playbook playbooks/tests/test_config.yml --ask-vault-pass

# Or specific host
ansible-playbook playbooks/tests/test_config.yml --limit services

Emergency Contacts / References

  • Backup location: {{ RESTIC_REPOSITORY }} (configured in vault)
  • Alert destination: notifications@jfraeys.com
  • Vault file: secrets/vault.yml (required for all recovery operations)