# Rollback Procedures

Operational recovery procedures for common infrastructure scenarios.

## Quick Reference

| Scenario | Command/Method | Location |
|----------|---------------|----------|
| App rollback | `rollback.sh [sha]` | `web:/opt/deploy/scripts/` |
| Backup restore | `restic restore` + `backup-verify` | Services host |
| Container restart | `docker compose restart` | `/opt//` |
| Full host rebuild | `./setup --no-terraform` | Local workstation |

---

## 1. Application Deployment Rollback

### Automated Rollback (Last 5 Versions Kept)

On the **web** host, the `app_deployer` role maintains the last 5 versions:

```bash
# List available versions
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api

# Rollback to previous version
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api

# Rollback to specific version
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api
```

### Manual Recovery

```bash
# Check service status
ssh web sudo systemctl status my-api

# View deployment logs
ssh web sudo cat /var/log/deploy.log

# Restart service manually
ssh web sudo systemctl restart my-api
```

---

## 2. Backup Restoration

### Prerequisites

Ensure you have:

- `RESTIC_REPOSITORY`: backup destination (e.g., `s3:https://...`)
- `RESTIC_PASSWORD`: encryption password
- `RESTIC_AWS_ACCESS_KEY_ID` / `RESTIC_AWS_SECRET_ACCESS_KEY`: S3 credentials

### List Available Snapshots

```bash
ssh services sudo restic snapshots
```

### Restore Procedure

```bash
# Stop the service being restored
ssh services sudo systemctl stop forgejo  # or other service

# Create backup of current state (optional safety)
ssh services sudo mv /opt/forgejo /opt/forgejo.pre-restore.$(date +%Y%m%d)

# Restore from specific snapshot
ssh services sudo restic restore --target /

# Verify restore (run built-in verification)
ssh services sudo /usr/local/sbin/backup-verify

# Restart service
ssh services sudo docker compose -f /opt/forgejo/docker-compose.yml up -d
```

### Full System Restore

For catastrophic failure, rebuild from backups:

1. **Reprovision host** (if needed): `./setup`
2. **Run Ansible** to restore configs: `./setup --ansible-only`
3. **Restore data** via restic for each service
4. **Verify**: Run `ansible-playbook playbooks/tests/test_config.yml`

---

## 3. Container Stack Recovery

### Individual Service Restart

```bash
# On services host
ssh services

# Navigate to service directory
cd /opt//

# Check status
sudo docker compose ps

# View logs
sudo docker compose logs -f

# Restart service
sudo docker compose restart

# Full recreate (preserves volumes)
sudo docker compose down && sudo docker compose up -d
```

### Traefik Certificate Issues

```bash
# Trigger certificate re-request
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml restart

# Check certificate status
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml logs -f

# Force ACME re-validation (if needed)
# Note: acme.json is in /opt/traefik/letsencrypt/
```

### Database Recovery (Postgres/Redis)

If using `app_core` role for shared databases:

```bash
# Restore from backup
ssh web sudo restic restore --target / --include /var/lib/docker/volumes/app_core*

# Or recreate from app initialization (if data is disposable)
ssh web sudo docker compose -f /opt/app_core/docker-compose.yml down -v
ssh web sudo docker compose -f /opt/app_core/docker-compose.yml up -d
```

---

## 4. DNS and TLS Recovery

### Certificate Expiry / Renewal Issues

```bash
# Check certificate expiration
ssh services "echo | openssl s_client -connect 127.0.0.1:443 2>/dev/null | openssl x509 -noout -dates"

# Force Traefik renewal (restart triggers check)
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml restart
```

### DNS Record Recovery

If DNS records are missing or incorrect:

```bash
# View Terraform plan (safe, read-only)
./setup -- terraform plan

# Apply DNS changes only
./setup -- terraform apply -target=cloudflare_record
```

---

## 5. Full Host Rebuild

### Scenario: Complete server loss

1. **Verify Terraform state** (Linode instance exists):

   ```bash
   ./setup -- terraform show
   ```

2. **Reprovision if needed**:

   ```bash
   # If instance is damaged, destroy and recreate
   ./setup -- terraform taint linode_instance.services  # or .web
   ./setup
   ```

3. **Ansible-only run** (if instance exists):

   ```bash
   ./setup --ansible-only
   ```

4. **Restore data from backups**:

   ```bash
   # On rebuilt host
   ssh services sudo /usr/local/sbin/backup-verify
   ssh services sudo restic restore latest --target / --include /opt/forgejo
   ```

---

## 6. Authelia/SSO Recovery

### Locked Out of Authelia

If you cannot access Authelia (auth.jfraeys.com):

1. **SSH to services host** (direct access, bypasses Authelia)
2. **Edit Authelia config** (if misconfiguration):

   ```bash
   ssh services sudo nano /opt/authelia/configuration.yml
   ssh services sudo docker compose -f /opt/authelia/docker-compose.yml restart
   ```

3. **Emergency bypass** (temporary disable): Edit traefik router to remove middleware

### LLDAP Password Reset

```bash
# Access LLDAP directly (not through Authelia)
# Edit user password via LLDAP admin interface or CLI
ssh services sudo docker compose -f /opt/lldap/docker-compose.yml exec lldap /app/lldap_set_password
```

---

## 7. Forgejo Recovery

### Repository Corruption

```bash
# Run gitea doctor (maintenance tool)
ssh services sudo docker compose -f /opt/forgejo/docker-compose.yml exec forgejo gitea doctor

# Restore from backup
ssh services sudo restic restore --target / --include /opt/forgejo
ssh services sudo docker compose -f /opt/forgejo/docker-compose.yml restart
```

### Runner Re-registration

```bash
# On web host - force re-register
ansible-playbook playbooks/web.yml \
  --limit web \
  --tags forgejo_runner \
  -e forgejo_runner_force_reregister=true
```

---

## 8. Testing After Recovery

Always run smoke tests after any recovery:

```bash
# Full test suite
ansible-playbook playbooks/tests/test_config.yml --ask-vault-pass

# Or specific host
ansible-playbook playbooks/tests/test_config.yml --limit services
```

---

## Emergency Contacts / References

- **Backup location**: `{{ RESTIC_REPOSITORY }}` (configured in vault)
- **Alert destination**: `notifications@jfraeys.com`
- **Vault file**: `secrets/vault.yml` (required for all recovery operations)
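
### Pre-Recovery Check (Sketch)

Because the vault-backed restic settings are needed for every restore in this document, confirming they are present before starting can avoid a failed restore halfway through. The following is a minimal sketch, not part of the repository: `check_restic_env` is a hypothetical helper, and it assumes the four variables named in the backup prerequisites are exported in the shell that will run `restic`. How they are loaded on each host is not covered here.

```bash
# Hypothetical helper (an assumption, not an existing script in this repo).
# Reports each restic variable that is unset or empty.
# The function body runs in a subshell so its variables do not leak.
check_restic_env() (
  missing=0
  for var in RESTIC_REPOSITORY RESTIC_PASSWORD \
             RESTIC_AWS_ACCESS_KEY_ID RESTIC_AWS_SECRET_ACCESS_KEY; do
    # Look up the value of the variable whose name is stored in $var.
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  # Exit status 0 when everything is set, 1 when anything is missing.
  return "$missing"
)
```

One possible use is to gate a restic command on it, for example `check_restic_env && restic snapshots`, so that a missing credential is reported by name rather than surfacing as a repository error.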