Rollback Procedures
Operational recovery procedures for common infrastructure scenarios.
Quick Reference
| Scenario | Command/Method | Location |
|---|---|---|
| App rollback | rollback.sh <app> [sha] | web:/opt/deploy/scripts/ |
| Backup restore | restic restore + backup-verify | Services host |
| Container restart | docker compose restart | /opt/<service>/ |
| Full host rebuild | ./setup --no-terraform | Local workstation |
1. Application Deployment Rollback
Automated Rollback (Last 5 Versions Kept)
On the web host, the app_deployer role maintains the last 5 versions:
# List available versions
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api
# Rollback to previous version
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api
# Rollback to specific version
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api <older-sha>
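How the role prunes older versions is not shown in this document. As a rough illustration only, retention of the newest N versions could be sketched as below, assuming each version lives in its own directory under a single releases directory. The function name prune_releases and that directory layout are assumptions, not the app_deployer role's actual implementation.

```shell
# Hypothetical sketch: keep the newest N release directories, delete the rest.
# Assumes one subdirectory per version with simple names (e.g. commit SHAs,
# no spaces or newlines). This is not the app_deployer role's real code.
prune_releases() {
  local releases_dir="$1"
  local keep="${2:-5}"
  local name

  # ls -1t lists newest first; tail skips the first $keep entries.
  ls -1t "$releases_dir" | tail -n "+$((keep + 1))" | while IFS= read -r name; do
    rm -rf -- "${releases_dir:?}/${name}"
  done
}
```

Sorting by modification time keeps the sketch short. A real deployer might instead track versions in a manifest so that a touched directory does not change the rollback order.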
Manual Recovery
# Check service status
ssh web sudo systemctl status my-api
# View deployment logs
ssh web sudo cat /var/log/deploy.log
# Restart service manually
ssh web sudo systemctl restart my-api
2. Backup Restoration
Prerequisites
Ensure you have:
- RESTIC_REPOSITORY: backup destination (e.g., s3:https://...)
- RESTIC_PASSWORD: encryption password
- RESTIC_AWS_ACCESS_KEY_ID / RESTIC_AWS_SECRET_ACCESS_KEY: S3 credentials
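A quick preflight check can catch a missing variable before a restore is attempted. The sketch below only tests that the variables listed above are set and non-empty in the current shell. The helper name check_restic_env is hypothetical, and where these values are actually loaded from on the host is not covered here.

```shell
# Sketch: report which restic variables from the list above are unset or empty.
# check_restic_env is a hypothetical helper name, not part of this repository.
check_restic_env() {
  local missing=""
  local var value
  for var in RESTIC_REPOSITORY RESTIC_PASSWORD \
             RESTIC_AWS_ACCESS_KEY_ID RESTIC_AWS_SECRET_ACCESS_KEY; do
    # Indirect lookup of the variable named in $var (portable across sh/bash).
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      missing="$missing $var"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing" >&2
    return 1
  fi
}
```

Running check_restic_env before restic snapshots gives a clear list of what is missing instead of a less specific restic error.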
List Available Snapshots
ssh services sudo restic snapshots
Restore Procedure
# Stop the service being restored
ssh services sudo systemctl stop forgejo # or other service
# Create backup of current state (optional safety)
ssh services sudo mv /opt/forgejo /opt/forgejo.pre-restore.$(date +%Y%m%d)
# Restore from specific snapshot
ssh services sudo restic restore <snapshot-id> --target /
# Verify restore (run built-in verification)
ssh services sudo /usr/local/sbin/backup-verify
# Restart service
ssh services sudo docker compose -f /opt/forgejo/docker-compose.yml up -d
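Because the order of these steps matters, it can help to print them for a specific service and snapshot and review them before running anything. The sketch below does only that: it prints the commands and executes nothing. The helper name restore_plan is hypothetical, and it assumes other services follow the same /opt/<service> layout as the Forgejo example.

```shell
# Sketch: print the restore steps above for a given service and snapshot ID
# so they can be reviewed before being run by hand. Nothing is executed.
# restore_plan is a hypothetical helper; paths mirror the Forgejo example.
restore_plan() {
  local service="${1:-}"
  local snapshot="${2:-}"
  local stamp
  stamp="$(date +%Y%m%d)"

  if [ -z "$service" ] || [ -z "$snapshot" ]; then
    echo "usage: restore_plan <service> <snapshot-id>" >&2
    return 2
  fi

  cat <<EOF
ssh services sudo systemctl stop ${service}
ssh services sudo mv /opt/${service} /opt/${service}.pre-restore.${stamp}
ssh services sudo restic restore ${snapshot} --target /
ssh services sudo /usr/local/sbin/backup-verify
ssh services sudo docker compose -f /opt/${service}/docker-compose.yml up -d
EOF
}
```

For example, restore_plan forgejo <snapshot-id> prints the five commands in order, with the safety copy stamped with today's date.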
Full System Restore
For catastrophic failure, rebuild from backups:
- Reprovision host (if needed): ./setup
- Run Ansible to restore configs: ./setup --ansible-only
- Restore data via restic for each service
- Verify: Run ansible-playbook playbooks/tests/test_config.yml
3. Container Stack Recovery
Individual Service Restart
# On services host
ssh services
# Navigate to service directory
cd /opt/<service>/
# Check status
sudo docker compose ps
# View logs
sudo docker compose logs -f
# Restart service
sudo docker compose restart
# Full recreate (preserves volumes)
sudo docker compose down && sudo docker compose up -d
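A restarted container may take a few seconds to start answering, so a single status check right after the restart can be misleading. One possible approach is a small retry helper that polls a check command until it succeeds. The helper name retry, the attempt count, and the delay below are illustrative assumptions, not part of this repository.

```shell
# Sketch: retry a command until it succeeds or the attempts run out.
# retry is a hypothetical helper; attempt count and delay are arbitrary.
retry() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example (the URL is a placeholder for whatever health check fits the service):
# retry 10 3 curl -fsS -o /dev/null https://<service-hostname>/
```

The command runs at most the given number of times, and the helper returns non-zero if every attempt fails, so it can gate the next recovery step.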
Traefik Certificate Issues
# Trigger certificate re-request
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml restart
# Check certificate status
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml logs -f
# Force ACME re-validation (if needed)
# Note: acme.json is in /opt/traefik/letsencrypt/
Database Recovery (Postgres/Redis)
If using app_core role for shared databases:
# Restore from backup
ssh web "sudo restic restore <snapshot-id> --target / --include '/var/lib/docker/volumes/app_core*'"
# Or recreate from app initialization (if data is disposable)
ssh web sudo docker compose -f /opt/app_core/docker-compose.yml down -v
ssh web sudo docker compose -f /opt/app_core/docker-compose.yml up -d
4. DNS and TLS Recovery
Certificate Expiry / Renewal Issues
# Check certificate expiration
ssh services "echo | openssl s_client -connect 127.0.0.1:443 2>/dev/null | openssl x509 -noout -dates"
# Force Traefik renewal (restart triggers check)
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml restart
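Reading the dates by eye works, but the same check can be turned into a pass/fail test with the -checkend option of openssl x509, which exits non-zero if the certificate expires within the given number of seconds. The sketch below reads a PEM certificate on standard input. The helper name cert_valid_for and the 14-day threshold in the example are assumptions.

```shell
# Sketch: succeed only if the PEM certificate on stdin remains valid for at
# least the given number of days. cert_valid_for is a hypothetical helper.
cert_valid_for() {
  local days="$1"
  # -checkend N exits non-zero if the certificate expires within N seconds.
  openssl x509 -noout -checkend "$((days * 86400))" >/dev/null 2>&1
}

# Example, reusing the s_client call from above (14 days is arbitrary):
# ssh services "echo | openssl s_client -connect 127.0.0.1:443 2>/dev/null" \
#   | cert_valid_for 14 || echo "certificate expires within 14 days"
```

If no certificate arrives on standard input, openssl x509 fails and the helper returns non-zero, so a broken connection is not mistaken for a healthy certificate.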
DNS Record Recovery
If DNS records are missing or incorrect:
# View Terraform plan (safe, read-only)
./setup -- terraform plan
# Apply DNS changes only
./setup -- terraform apply -target=cloudflare_record
5. Full Host Rebuild
Scenario: Complete server loss
- Verify Terraform state (Linode instance exists):
./setup -- terraform show
- Reprovision if needed:
# If instance is damaged, destroy and recreate
./setup -- terraform taint linode_instance.services # or .web
./setup
- Ansible-only run (if instance exists):
./setup --ansible-only
- Restore data from backups:
# On rebuilt host
ssh services sudo /usr/local/sbin/backup-verify
ssh services sudo restic restore latest --target / --include /opt/forgejo
6. Authelia/SSO Recovery
Locked Out of Authelia
If you cannot access Authelia (auth.jfraeys.com):
- SSH to services host (direct access, bypasses Authelia)
- Edit Authelia config (if misconfiguration):
# -t allocates a terminal, which nano needs when run over ssh
ssh -t services sudo nano /opt/authelia/configuration.yml
ssh services sudo docker compose -f /opt/authelia/docker-compose.yml restart
- Emergency bypass (temporary disable): Edit the Traefik router to remove the Authelia middleware
LLDAP Password Reset
# Access LLDAP directly (not through Authelia)
# Edit user password via LLDAP admin interface or CLI
ssh services sudo docker compose -f /opt/lldap/docker-compose.yml exec lldap /app/lldap_set_password
7. Forgejo Recovery
Repository Corruption
# Run gitea doctor (maintenance tool)
ssh services sudo docker compose -f /opt/forgejo/docker-compose.yml exec forgejo gitea doctor
# Restore from backup
ssh services sudo restic restore <snapshot-id> --target / --include /opt/forgejo
ssh services sudo docker compose -f /opt/forgejo/docker-compose.yml restart
Runner Re-registration
# On web host - force re-register
ansible-playbook playbooks/web.yml \
--limit web \
--tags forgejo_runner \
-e forgejo_runner_force_reregister=true
8. Testing After Recovery
Always run smoke tests after any recovery:
# Full test suite
ansible-playbook playbooks/tests/test_config.yml --ask-vault-pass
# Or specific host
ansible-playbook playbooks/tests/test_config.yml --limit services
Emergency Contacts / References
- Backup location: {{ RESTIC_REPOSITORY }} (configured in vault)
- Alert destination: notifications@jfraeys.com
- Vault file: secrets/vault.yml (required for all recovery operations)