# Rollback Procedures

Operational recovery procedures for common infrastructure scenarios.

## Quick Reference

| Scenario | Command/Method | Location |
|----------|----------------|----------|
| App rollback | `rollback.sh <app> [sha]` | `web:/opt/deploy/scripts/` |
| Backup restore | `restic restore` + `backup-verify` | Services host |
| Container restart | `docker compose restart` | `/opt/<service>/` |
| Full host rebuild | `./setup --no-terraform` | Local workstation |

---

## 1. Application Deployment Rollback

### Automated Rollback (Last 5 Versions Kept)

On the **web** host, the `app_deployer` role keeps the last 5 deployed versions:

```bash
# List available versions
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api

# Roll back to the previous version
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api

# Roll back to a specific version
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api <older-sha>
```
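
Under the hood, a version-keeping rollback of this kind is typically a symlink flip. A minimal sketch, assuming releases live under `/opt/deploy/releases/<app>/<sha>` with a `current` symlink (a hypothetical layout, not necessarily the deployed `rollback.sh`):

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a symlink-flip rollback; the layout under
# /opt/deploy is an assumption, not the actual deployed script.
set -euo pipefail

app="$1"; sha="${2:-}"
releases="/opt/deploy/releases/$app"

# With no SHA given, pick the release just before the newest one.
if [[ -z "$sha" ]]; then
  sha=$(ls -1t "$releases" | sed -n '2p')
fi

ln -sfn "$releases/$sha" "/opt/deploy/current/$app"  # repoint the app symlink
sudo systemctl restart "$app"                        # serve the older build
```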

### Manual Recovery

```bash
# Check service status
ssh web sudo systemctl status my-api

# View deployment logs
ssh web sudo cat /var/log/deploy.log

# Restart the service manually
ssh web sudo systemctl restart my-api
```
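
If the deploy log is empty or missing, the unit's journal usually has more detail (standard systemd tooling):

```bash
# Last 100 log lines for the service unit
ssh web sudo journalctl -u my-api -n 100 --no-pager
```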

---

## 2. Backup Restoration

### Prerequisites

Ensure you have:

- `RESTIC_REPOSITORY` — backup destination (e.g., `s3:https://...`)
- `RESTIC_PASSWORD` — encryption password
- `RESTIC_AWS_ACCESS_KEY_ID` / `RESTIC_AWS_SECRET_ACCESS_KEY` — S3 credentials
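
If you need to run restic by hand rather than through the host's wrapper scripts, the environment can be set up roughly like this (a sketch with placeholder values; the real values live in `secrets/vault.yml`, and restic itself reads the standard `AWS_*` names for S3 credentials):

```bash
# Placeholder values; the real ones are kept in secrets/vault.yml.
export RESTIC_REPOSITORY="s3:https://<endpoint>/<bucket>"   # backup destination
export RESTIC_PASSWORD="<encryption-password>"              # repo encryption key
# restic reads the standard AWS variable names for S3 credentials:
export AWS_ACCESS_KEY_ID="<access-key>"
export AWS_SECRET_ACCESS_KEY="<secret-key>"

restic snapshots   # should now work without extra flags
```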

### List Available Snapshots

```bash
ssh services sudo restic snapshots
```
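
The output looks roughly like this (illustrative values); the short ID in the first column is what the restore commands below take as `<snapshot-id>`:

```
ID        Time                 Host      Tags  Paths
----------------------------------------------------------
4f3c1a2b  2025-01-15 03:00:12  services        /opt/forgejo
9d8e7f6a  2025-01-16 03:00:09  services        /opt/forgejo
```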

### Restore Procedure

```bash
# Stop the service being restored
ssh services sudo systemctl stop forgejo  # or another service

# Preserve the current state first (optional safety net)
ssh services sudo mv /opt/forgejo /opt/forgejo.pre-restore.$(date +%Y%m%d)

# Restore from a specific snapshot
ssh services sudo restic restore <snapshot-id> --target /

# Verify the restore (built-in verification script)
ssh services sudo /usr/local/sbin/backup-verify

# Restart the service
ssh services sudo docker compose -f /opt/forgejo/docker-compose.yml up -d
```

### Full System Restore

For catastrophic failure, rebuild from backups:

1. **Reprovision the host** (if needed): `./setup`
2. **Run Ansible** to restore configs: `./setup --ansible-only`
3. **Restore data** via restic for each service (see the sketch below)
4. **Verify**: run `ansible-playbook playbooks/tests/test_config.yml`
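
For step 3, the per-service restores can be scripted; a sketch assuming each service keeps its state under `/opt/<service>` (the service list is an example, adjust to what runs on the host):

```bash
# Restore state and bring each stack back up, one service at a time.
for svc in forgejo authelia lldap; do
  ssh services sudo restic restore latest --target / --include "/opt/$svc"
  ssh services sudo docker compose -f "/opt/$svc/docker-compose.yml" up -d
done
```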

---

## 3. Container Stack Recovery

### Individual Service Restart

```bash
# On the services host
ssh services

# Navigate to the service directory
cd /opt/<service>/

# Check status
sudo docker compose ps

# View logs
sudo docker compose logs -f

# Restart the service
sudo docker compose restart

# Full recreate (preserves volumes)
sudo docker compose down && sudo docker compose up -d
```

### Traefik Certificate Issues

```bash
# Trigger certificate re-request
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml restart

# Check certificate status
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml logs -f

# Force ACME re-validation (if needed): Traefik stores ACME state in
# /opt/traefik/letsencrypt/acme.json; moving it aside and restarting makes
# Traefik request fresh certificates.
ssh services sudo mv /opt/traefik/letsencrypt/acme.json /opt/traefik/letsencrypt/acme.json.bak
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml restart
```

### Database Recovery (Postgres/Redis)

If using the `app_core` role for shared databases:

```bash
# Restore from backup (stop the stack first so the restore does not race
# running containers)
ssh web sudo docker compose -f /opt/app_core/docker-compose.yml stop
ssh web sudo restic restore <snapshot-id> --target / --include /var/lib/docker/volumes/app_core*
ssh web sudo docker compose -f /opt/app_core/docker-compose.yml up -d

# Or recreate from app initialization (only if the data is disposable:
# `down -v` deletes the volumes)
ssh web sudo docker compose -f /opt/app_core/docker-compose.yml down -v
ssh web sudo docker compose -f /opt/app_core/docker-compose.yml up -d
```
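
After either path, it is worth confirming Postgres is accepting connections (assuming the compose service is named `postgres`; adjust to the actual service name):

```bash
# Exits 0 when the server is ready to accept connections
ssh web sudo docker compose -f /opt/app_core/docker-compose.yml exec postgres pg_isready
```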

---

## 4. DNS and TLS Recovery

### Certificate Expiry / Renewal Issues

```bash
# Check certificate expiration (add -servername <host> to openssl s_client
# to inspect a specific SNI virtual host behind Traefik)
ssh services "echo | openssl s_client -connect 127.0.0.1:443 2>/dev/null | openssl x509 -noout -dates"

# Force Traefik renewal (a restart triggers the check)
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml restart
```

### DNS Record Recovery

If DNS records are missing or incorrect:

```bash
# View the Terraform plan (safe, read-only)
./setup -- terraform plan

# Apply DNS changes only; -target needs a full resource address,
# e.g. cloudflare_record.<name>
./setup -- terraform apply -target=cloudflare_record.<name>
```
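
To confirm the records propagated, query a public resolver directly (standard `dig`; the hostname is an example from this setup):

```bash
# Replace the name with the record you just applied
dig +short A auth.jfraeys.com @1.1.1.1
```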

---

## 5. Full Host Rebuild

### Scenario: Complete server loss

1. **Verify Terraform state** (the Linode instance exists):

   ```bash
   ./setup -- terraform show
   ```

2. **Reprovision if needed**:

   ```bash
   # If the instance is damaged, destroy and recreate
   # (newer Terraform deprecates taint in favor of: terraform apply -replace=<address>)
   ./setup -- terraform taint linode_instance.services  # or .web
   ./setup
   ```

3. **Ansible-only run** (if the instance exists):

   ```bash
   ./setup --ansible-only
   ```

4. **Restore data from backups**:

   ```bash
   # On the rebuilt host
   ssh services sudo /usr/local/sbin/backup-verify
   ssh services sudo restic restore latest --target / --include /opt/forgejo
   ```

---

## 6. Authelia/SSO Recovery

### Locked Out of Authelia

If you cannot access Authelia (auth.jfraeys.com):

1. **SSH to the services host** (direct access bypasses Authelia)
2. **Edit the Authelia config** (if misconfigured):

   ```bash
   ssh services sudo nano /opt/authelia/configuration.yml
   ssh services sudo docker compose -f /opt/authelia/docker-compose.yml restart
   ```

3. **Emergency bypass** (temporary): edit the Traefik router to remove the Authelia middleware
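
What that bypass can look like with Traefik's file provider (illustrative router and file names; adjust to the actual dynamic configuration):

```yaml
# Hypothetical /opt/traefik/dynamic/myservice.yml: comment out the
# Authelia middleware to bypass SSO temporarily, then restart Traefik.
http:
  routers:
    myservice:
      rule: "Host(`myservice.jfraeys.com`)"
      service: myservice
      # middlewares:
      #   - authelia@docker
```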

### LLDAP Password Reset

```bash
# Access LLDAP directly (not through Authelia) and reset the password via
# the LLDAP admin interface, or use the bundled CLI tool:
ssh services sudo docker compose -f /opt/lldap/docker-compose.yml exec lldap /app/lldap_set_password
```

---

## 7. Forgejo Recovery

### Repository Corruption

```bash
# Run the doctor maintenance tool (Forgejo keeps the gitea CLI name for
# compatibility)
ssh services sudo docker compose -f /opt/forgejo/docker-compose.yml exec forgejo gitea doctor

# Restore from backup
ssh services sudo restic restore <snapshot-id> --target / --include /opt/forgejo
ssh services sudo docker compose -f /opt/forgejo/docker-compose.yml restart
```

### Runner Re-registration

```bash
# On the web host: force re-registration
ansible-playbook playbooks/web.yml \
  --limit web \
  --tags forgejo_runner \
  -e forgejo_runner_force_reregister=true
```

---

## 8. Testing After Recovery

Always run smoke tests after any recovery:

```bash
# Full test suite
ansible-playbook playbooks/tests/test_config.yml --ask-vault-pass

# Or a specific host
ansible-playbook playbooks/tests/test_config.yml --limit services
```

---

## Emergency Contacts / References

- **Backup location**: `{{ RESTIC_REPOSITORY }}` (configured in vault)
- **Alert destination**: `notifications@jfraeys.com`
- **Vault file**: `secrets/vault.yml` (required for all recovery operations)