infra: cleanup repository and add rollback documentation
- Remove unimplemented placeholder roles (airflow, spark)
- Delete cache files (__pycache__, .DS_Store) and generated inventory
- Remove outdated INFRA_GAP_ANALYSIS.md (functionality now in README)
- Standardize DISABLED comments for monitoring stack (Prometheus, Loki, Grafana)
- Add ROLLBACK.md with comprehensive recovery procedures
- Expand vault.example.yml with all backup and alerting variables
- Update README with complete vault variables documentation
parent 6ff842aa9e
commit e2f732c0f5

8 changed files with 312 additions and 162 deletions

INFRA_GAP_ANALYSIS.md (deleted, 149 lines)

@@ -1,149 +0,0 @@
# Infra gap analysis

This document summarizes what this repo already covers well and what is typically added in small-but-established Ansible infra repos, with a focus on a dev/data-science oriented setup.

## What you already have

- **Provisioning**
  - Terraform for Linode instances and Cloudflare DNS.
  - Linode StackScripts bootstrap a baseline configuration.

- **Bootstrap hardening (StackScript)**
  - SSH hardening (non-root user, key-only, custom port support, root login disabled).
  - UFW baseline (deny incoming, allow outgoing) + open `80/443` + SSH rate limiting.
  - Fail2ban for `sshd`.
  - Docker + Compose plugin installed.

- **Runtime platform (Ansible roles + playbooks)**
  - Traefik (on both hosts) with ACME via Cloudflare DNS-01 (file-provider fallback for Docker API compatibility).
  - SSO/IdM: Authelia OIDC + LLDAP.
  - Email: Postfix relay to Postmark for transactional email.
  - Observability stack: exporters (node-exporter + cAdvisor) + Prometheus + Loki + Grafana (provisioned datasources) + Alertmanager (email alerts).
  - Forgejo + Forgejo Actions runner (with AI-scraper blocklist).
  - Watchtower for label-based container updates.

- **Testing**
  - `playbooks/test_config.yml` does real smoke testing: container health, TLS sanity, OIDC discovery, and optional object-storage credential checks.

## Main gaps vs. well-established small infra repos (prioritized)

### 1) Backups + restore drills (biggest practical gap)

A mature repo almost always includes a backup strategy and automation.

- **Backup tool**
  - Common: `restic` or `borg`.
  - Optional: provider snapshots, `rclone`, etc.

- **What to back up (likely targets for this repo)**
  - `/opt/forgejo` (repos/config/DB volume, depending on compose).
  - `/opt/grafana` (if data is persisted; provisioning covers datasources, but dashboards/users may matter).
  - `/opt/authelia` and `/opt/lldap` (identity config + data).
  - `/opt/traefik/letsencrypt/acme.json` (optional; certs can be reissued, but rate limits exist).
  - Any Postgres/Redis volumes from `app_core`.
  - Loki data is usually optional to restore unless you explicitly want log retention across rebuilds.

- **Offsite destination**
  - S3-compatible object storage works well here; you already have patterns for Linode Object Storage.

- **Restore validation**
  - Add a small restore smoke test or runbook/playbook to verify backups are usable.
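The pieces above are usually wired together with a systemd service + timer pair. A minimal sketch, assuming a `/usr/local/sbin/backup` wrapper script, an `/etc/restic/env` credentials file, and unit names of your choosing (all of these names are assumptions, not what this repo ships):

```ini
# /etc/systemd/system/restic-backup.service  (hypothetical unit)
[Unit]
Description=Restic backup

[Service]
Type=oneshot
EnvironmentFile=/etc/restic/env
ExecStart=/usr/local/sbin/backup

# /etc/systemd/system/restic-backup.timer  (hypothetical unit)
[Unit]
Description=Scheduled restic backup

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

`Persistent=true` makes a missed run fire at the next boot, which matters for hosts that are sometimes down at the scheduled time.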
### 2) Ongoing OS updates (unattended security upgrades / reboots)

StackScript does initial upgrades, but established repos usually add ongoing patching:

- `unattended-upgrades` for Debian security updates.
- A policy for reboots when required.
- Optional notification on "reboot required".
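On Debian the core of that configuration is a short apt conf fragment; the reboot settings below are the policy decision, and the values shown are illustrative, not this repo's:

```
# /etc/apt/apt.conf.d/50unattended-upgrades (illustrative values)
Unattended-Upgrade::Origins-Pattern {
        "origin=Debian,codename=${distro_codename},label=Debian-Security";
};
Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "04:00";
```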
### 3) Alerting ✅ IMPLEMENTED

Alertmanager is deployed with email notifications via Postfix:

- **Alertmanager** routes alerts to Postfix on `localhost:25`
- **Postfix** relays to Postmark for reliable delivery
- **Notification channel**: email to `notifications@jfraeys.com`
- **Base alerts configured**:
  - host down / exporter down
  - disk almost full
  - high load / memory pressure
  - container health (cAdvisor)
  - certificate expiry
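The Alertmanager side of that email path can be sketched as follows (a minimal sketch; the receiver name is arbitrary, and `require_tls: false` reflects Postfix listening on plain `localhost:25`):

```yaml
route:
  receiver: email

receivers:
  - name: email
    email_configs:
      - to: notifications@jfraeys.com
        from: notifications@jfraeys.com
        smarthost: localhost:25
        require_tls: false
```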
### 4) Firewall convergence via Ansible (not only StackScript)

You configure UFW in StackScript and add some service rules (e.g., the Loki allowlist), but many repos keep firewall rules converged in Ansible so the policy is explicit and continuously enforced.

Suggested pattern:

- A dedicated `firewall` role that:
  - sets UFW defaults
  - allows SSH (correct port) from desired sources
  - allows `80/443`
  - handles service allowlists (like Loki)
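A hedged sketch of such a role's tasks using the `community.general.ufw` module (the `ssh_port` variable and the port values are examples, not this repo's variables):

```yaml
- name: Set UFW default policies
  community.general.ufw:
    direction: "{{ item.direction }}"
    policy: "{{ item.policy }}"
  loop:
    - { direction: incoming, policy: deny }
    - { direction: outgoing, policy: allow }

- name: Allow SSH (rate limited) on the configured port
  community.general.ufw:
    rule: limit
    port: "{{ ssh_port | default('22') }}"
    proto: tcp

- name: Allow HTTP/HTTPS
  community.general.ufw:
    rule: allow
    port: "{{ item }}"
    proto: tcp
  loop: ["80", "443"]
```

Because these tasks are declarative, rerunning the playbook converges drifted rules back to the stated policy, which is the point of moving them out of the StackScript.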
### 5) Container hygiene/security baseline

Optional but common additions:

- Scheduled `docker system prune` (carefully, to avoid breaking rollbacks).
- Vulnerability scanning (Trivy/Grype) on images.
- Service-by-service hardening in compose (read-only FS, drop caps, etc.).
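Per-service compose hardening usually looks like the sketch below; whether each service tolerates a read-only root filesystem has to be checked case by case, and the service name here is hypothetical:

```yaml
services:
  some-service:            # hypothetical service name
    read_only: true
    cap_drop: [ALL]
    security_opt:
      - no-new-privileges:true
    tmpfs:
      - /tmp
    restart: unless-stopped
```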
### 6) Dev/DS-oriented additions (optional)

Depending on goals:

- MinIO (local S3) if you want on-prem-style object storage.
- A small artifact/cache service (or rely on external registries).
- JupyterHub / code-server behind SSO (only if you want interactive dev environments).

## Structural note: hardening is split across StackScript + Ansible

Your `roles/hardening` currently focuses on rsyslog + UFW log-file rotation.

Many repos keep more hardening in Ansible (sshd, fail2ban, unattended upgrades, firewall). Your approach (StackScript) is valid, but the tradeoff is that StackScript changes don't automatically converge over time unless you reprovision.

## Minimal recommended additions (small but complete)

If you want to stay small while matching typical "complete" infra repos, prioritize:

- **Backups** (restic/borg + systemd timer + offsite storage).
- **Unattended upgrades** (plus a reboot policy).
- **Alerting** (Alertmanager + a few base alerts).
- **Firewall role** (optional if you're happy with StackScript-only, but recommended for convergence).

## Open questions (to pick the right implementation)

- Backups: filesystem/volume backups via `restic`, plus an app-aware Forgejo export via `forgejo dump`.
- Alerting: webhook notifications (Slack or Discord).

## Variables introduced by the backup + alerting + email implementation

- **Backups**
  - `RESTIC_REPOSITORY`
  - `RESTIC_PASSWORD`
  - `RESTIC_AWS_ACCESS_KEY_ID`
  - `RESTIC_AWS_SECRET_ACCESS_KEY`
  - `RESTIC_AWS_DEFAULT_REGION` (optional)
  - `INFRA_BACKUP_ONCALENDAR` (optional)
  - `RESTIC_KEEP_DAILY` (optional)
  - `RESTIC_KEEP_WEEKLY` (optional)
  - `RESTIC_KEEP_MONTHLY` (optional)

- **Alerting**
  - Set **exactly one**:
    - `ALERTMANAGER_SLACK_WEBHOOK_URL`
    - `ALERTMANAGER_DISCORD_WEBHOOK_URL`
  - Optional (Slack only):
    - `ALERTMANAGER_SLACK_CHANNEL`
    - `ALERTMANAGER_SLACK_USERNAME`

- **Email (Postfix + Postmark)**
  - `POSTFIX_RELAYHOST` (default: `smtp.postmarkapp.com`)
  - `POSTFIX_RELAYHOST_PORT` (default: `2525`)
  - `POSTFIX_RELAYHOST_USERNAME` (Postmark server token)
  - `POSTFIX_RELAYHOST_PASSWORD` (Postmark server token)
  - `AUTHELIA_SMTP_SENDER` (e.g., `notifications@yourdomain.com`)
  - `AUTHELIA_SMTP_IDENTIFIER` (e.g., `yourdomain.com`)
README.md (26 lines changed)

@@ -133,6 +133,7 @@ Postmark validates these during account setup.
Add to `secrets/vault.yml`:

**Email (Postfix + Postmark):**
```yaml
POSTFIX_RELAYHOST_USERNAME: "your-postmark-server-token"
POSTFIX_RELAYHOST_PASSWORD: "your-postmark-server-token"

@@ -140,6 +141,31 @@
AUTHELIA_SMTP_SENDER: "notifications@jfraeys.com"
AUTHELIA_SMTP_IDENTIFIER: "jfraeys.com"
```
**Backups (Restic):**
```yaml
RESTIC_REPOSITORY: "s3:https://us-east-1.linodeobjects.com/mybucket/backups"
RESTIC_PASSWORD: "strong-encryption-password"
RESTIC_AWS_ACCESS_KEY_ID: "your-linode-access-key"
RESTIC_AWS_SECRET_ACCESS_KEY: "your-linode-secret-key"
# Optional:
RESTIC_AWS_DEFAULT_REGION: "us-east-1"
RESTIC_KEEP_DAILY: 7
RESTIC_KEEP_WEEKLY: 4
RESTIC_KEEP_MONTHLY: 6
INFRA_BACKUP_ONCALENDAR: "daily"  # systemd calendar spec
```

**Alerting (set exactly one):**
```yaml
# Slack option:
ALERTMANAGER_SLACK_WEBHOOK_URL: "https://hooks.slack.com/services/..."
ALERTMANAGER_SLACK_CHANNEL: "#alerts"
ALERTMANAGER_SLACK_USERNAME: "alertmanager"

# Discord option:
ALERTMANAGER_DISCORD_WEBHOOK_URL: "https://discord.com/api/webhooks/..."
```
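For context on how the webhook variables are typically consumed, a sketch of the matching Alertmanager receiver (the `{{ ... }}` delimiters assume the config file is rendered by an Ansible template; the receiver name is arbitrary):

```yaml
receivers:
  - name: webhook
    slack_configs:
      - api_url: "{{ ALERTMANAGER_SLACK_WEBHOOK_URL }}"
        channel: "{{ ALERTMANAGER_SLACK_CHANNEL }}"
        username: "{{ ALERTMANAGER_SLACK_USERNAME }}"
```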
## Secrets (Ansible Vault)

Secrets are stored in `secrets/vault.yml` (encrypted).

ROLLBACK.md (new file, 269 lines)

@@ -0,0 +1,269 @@
# Rollback Procedures

Operational recovery procedures for common infrastructure scenarios.

## Quick Reference

| Scenario | Command/Method | Location |
|----------|----------------|----------|
| App rollback | `rollback.sh <app> [sha]` | `web:/opt/deploy/scripts/` |
| Backup restore | `restic restore` + `backup-verify` | Services host |
| Container restart | `docker compose restart` | `/opt/<service>/` |
| Full host rebuild | `./setup --no-terraform` | Local workstation |

---

## 1. Application Deployment Rollback

### Automated Rollback (Last 5 Versions Kept)

On the **web** host, the `app_deployer` role keeps the last 5 versions:

```bash
# Rollback to the previous version
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api

# Rollback to a specific version (one of the last 5 kept)
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api <older-sha>
```
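The "previous version" pick in a keep-last-N scheme is usually just the second-newest release directory. The sketch below demonstrates that selection; the real script's release layout is not shown here, so the demo builds a stand-in directory with known timestamps (all names are assumptions):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Pick the second-newest entry (ls -t sorts newest first by mtime),
# i.e. the version to roll back to.
previous_release() {
  ls -1t "$1" | sed -n '2p'
}

# Stand-in releases directory with deterministic mtimes.
demo="$(mktemp -d)"
mkdir "$demo/aaa111" "$demo/bbb222" "$demo/ccc333"
touch -d '2024-01-01' "$demo/aaa111"
touch -d '2024-01-02' "$demo/bbb222"
touch -d '2024-01-03' "$demo/ccc333"

previous_release "$demo"   # → bbb222
```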
### Manual Recovery

```bash
# Check service status
ssh web sudo systemctl status my-api

# View deployment logs
ssh web sudo cat /var/log/deploy.log

# Restart service manually
ssh web sudo systemctl restart my-api
```

---

## 2. Backup Restoration

### Prerequisites

Ensure you have:

- `RESTIC_REPOSITORY` — backup destination (e.g., `s3:https://...`)
- `RESTIC_PASSWORD` — encryption password
- `RESTIC_AWS_ACCESS_KEY_ID` / `RESTIC_AWS_SECRET_ACCESS_KEY` — S3 credentials

### List Available Snapshots

```bash
ssh services sudo restic snapshots
```

### Restore Procedure

```bash
# Stop the service being restored
ssh services sudo systemctl stop forgejo  # or other service

# Keep a copy of the current state (optional safety net)
ssh services sudo mv /opt/forgejo /opt/forgejo.pre-restore.$(date +%Y%m%d)

# Restore from a specific snapshot
ssh services sudo restic restore <snapshot-id> --target /

# Verify the restore (built-in verification)
ssh services sudo /usr/local/sbin/backup-verify

# Restart the service
ssh services sudo docker compose -f /opt/forgejo/docker-compose.yml up -d
```

### Full System Restore

For catastrophic failure, rebuild from backups:

1. **Reprovision host** (if needed): `./setup`
2. **Run Ansible** to restore configs: `./setup --ansible-only`
3. **Restore data** via restic for each service
4. **Verify**: run `ansible-playbook playbooks/tests/test_config.yml`
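Step 3 repeats per service. A small sketch that prints the per-service restore commands for review before running any of them; the service list and its ordering (identity stores before the apps that depend on them) are assumptions:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed restore order: identity first, then dependent apps.
services="lldap authelia forgejo"

# Print, rather than run, one restore command per service.
restore_commands() {
  local svc
  for svc in $1; do
    echo "restic restore latest --target / --include /opt/$svc"
  done
}

restore_commands "$services"
```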
---

## 3. Container Stack Recovery

### Individual Service Restart

```bash
# On the services host
ssh services

# Navigate to the service directory
cd /opt/<service>/

# Check status
sudo docker compose ps

# View logs
sudo docker compose logs -f

# Restart the service
sudo docker compose restart

# Full recreate (preserves volumes)
sudo docker compose down && sudo docker compose up -d
```

### Traefik Certificate Issues

```bash
# Trigger a certificate re-request
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml restart

# Check certificate status
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml logs -f

# Force ACME re-validation (if needed)
# Note: acme.json is in /opt/traefik/letsencrypt/
```

### Database Recovery (Postgres/Redis)

If using the `app_core` role for shared databases:

```bash
# Restore from backup
ssh web sudo restic restore <snapshot-id> --target / --include /var/lib/docker/volumes/app_core*

# Or recreate from app initialization (if the data is disposable)
ssh web sudo docker compose -f /opt/app_core/docker-compose.yml down -v
ssh web sudo docker compose -f /opt/app_core/docker-compose.yml up -d
```

---

## 4. DNS and TLS Recovery

### Certificate Expiry / Renewal Issues

```bash
# Check certificate expiration
ssh services "echo | openssl s_client -connect 127.0.0.1:443 2>/dev/null | openssl x509 -noout -dates"

# Force Traefik renewal (a restart triggers the check)
ssh services sudo docker compose -f /opt/traefik/docker-compose.yml restart
```

### DNS Record Recovery

If DNS records are missing or incorrect:

```bash
# View the Terraform plan (safe, read-only)
./setup -- terraform plan

# Apply DNS changes only
./setup -- terraform apply -target=cloudflare_record
```

---

## 5. Full Host Rebuild

### Scenario: Complete server loss

1. **Verify Terraform state** (the Linode instance exists):

   ```bash
   ./setup -- terraform show
   ```

2. **Reprovision if needed**:

   ```bash
   # If the instance is damaged, destroy and recreate
   ./setup -- terraform taint linode_instance.services  # or .web
   ./setup
   ```

3. **Ansible-only run** (if the instance exists):

   ```bash
   ./setup --ansible-only
   ```

4. **Restore data from backups**:

   ```bash
   # On the rebuilt host
   ssh services sudo /usr/local/sbin/backup-verify
   ssh services sudo restic restore latest --target / --include /opt/forgejo
   ```

---

## 6. Authelia/SSO Recovery

### Locked Out of Authelia

If you cannot access Authelia (auth.jfraeys.com):

1. **SSH to the services host** (direct access bypasses Authelia)
2. **Edit the Authelia config** (if misconfigured):

   ```bash
   ssh services sudo nano /opt/authelia/configuration.yml
   ssh services sudo docker compose -f /opt/authelia/docker-compose.yml restart
   ```

3. **Emergency bypass** (temporary): edit the Traefik router to remove the middleware

### LLDAP Password Reset

```bash
# Access LLDAP directly (not through Authelia)
# Reset the user password via the LLDAP admin interface or CLI
ssh services sudo docker compose -f /opt/lldap/docker-compose.yml exec lldap /app/lldap_set_password
```

---

## 7. Forgejo Recovery

### Repository Corruption

```bash
# Run gitea doctor (maintenance tool)
ssh services sudo docker compose -f /opt/forgejo/docker-compose.yml exec forgejo gitea doctor

# Restore from backup
ssh services sudo restic restore <snapshot-id> --target / --include /opt/forgejo
ssh services sudo docker compose -f /opt/forgejo/docker-compose.yml restart
```

### Runner Re-registration

```bash
# On the web host - force re-registration
ansible-playbook playbooks/web.yml \
  --limit web \
  --tags forgejo_runner \
  -e forgejo_runner_force_reregister=true
```

---

## 8. Testing After Recovery

Always run smoke tests after any recovery:

```bash
# Full test suite
ansible-playbook playbooks/tests/test_config.yml --ask-vault-pass

# Or a specific host
ansible-playbook playbooks/tests/test_config.yml --limit services
```

---

## Emergency Contacts / References

- **Backup location**: `{{ RESTIC_REPOSITORY }}` (configured in vault)
- **Alert destination**: `notifications@jfraeys.com`
- **Vault file**: `secrets/vault.yml` (required for all recovery operations)
@@ -34,10 +34,13 @@
        tags: [exporters]
      - role: alertmanager
        tags: [alertmanager]
      # DISABLED: Monitoring stack (Prometheus) - uncomment to enable
      # - role: prometheus
      #   tags: [prometheus]
      # DISABLED: Monitoring stack (Loki) - uncomment to enable
      # - role: loki
      #   tags: [loki]
      # DISABLED: Monitoring stack (Grafana) - uncomment to enable
      # - role: grafana
      #   tags: [grafana]
      - role: forgejo

@@ -52,6 +55,7 @@
        tags: [backups]

  post_tasks:
    # DISABLED: Grafana post-tasks - uncomment when Grafana is enabled
    # Grafana post-tasks disabled (monitoring stack not deployed on 1GB node)
    # - name: Read Grafana Traefik router rule label
    #   shell: |

@@ -129,6 +133,7 @@
      until: forgejo_origin_tls.rc == 0
      tags: [forgejo]

    # DISABLED: Prometheus post-tasks - uncomment when Prometheus is enabled
    # Prometheus post-tasks disabled (monitoring stack not deployed on 1GB node)
    # - name: Trigger Traefik certificate request for Prometheus hostname
    #   command: curl -k -s -o /dev/null -w "%{http_code}" --resolve "{{ prometheus_hostname }}:443:127.0.0.1" "https://{{ prometheus_hostname }}/"
Airflow placeholder role (deleted)

@@ -1,4 +0,0 @@
---
- name: Airflow role placeholder
  debug:
    msg: "Airflow role is not implemented yet (deploy_airflow is optional)."

Spark placeholder role (deleted)

@@ -1,4 +0,0 @@
---
- name: Spark role placeholder
  debug:
    msg: "Spark role is not implemented yet (deploy_spark is optional)."
vault.example.yml

@@ -26,9 +26,9 @@
AUTHELIA_SMTP_SENDER:
AUTHELIA_SMTP_IDENTIFIER:
AUTHELIA_SMTP_STARTUP_CHECK_ADDRESS:
POSTFIX_RELAYHOST: "smtp.postmarkapp.com"
# POSTFIX_RELAYHOST_PORT: "2525"
# POSTFIX_RELAYHOST_USERNAME: "your-postmark-server-token"
# POSTFIX_RELAYHOST_PASSWORD: "your-postmark-server-token"
POSTFIX_RELAYHOST_PORT: "2525"
POSTFIX_RELAYHOST_USERNAME: "your-postmark-server-token"
POSTFIX_RELAYHOST_PASSWORD: "your-postmark-server-token"
FORGEJO_RUNNER_REGISTRATION_TOKEN:
FORGEJO_API_TOKEN:
FORGEJO_BASE_URL:

@@ -39,9 +39,16 @@
SERVICE_SSH_DEREGISTER_PUBLIC_KEY:
RESTIC_PASSWORD:
RESTIC_AWS_ACCESS_KEY_ID:
RESTIC_AWS_SECRET_ACCESS_KEY:
RESTIC_AWS_DEFAULT_REGION: "us-east-1"
# RESTIC_REPOSITORY: "s3:https://us-east-1.linodeobjects.com/bucket-name/infra"
RESTIC_KEEP_DAILY: 7
RESTIC_KEEP_WEEKLY: 4
RESTIC_KEEP_MONTHLY: 6
INFRA_BACKUP_ONCALENDAR: "daily"

ALERTMANAGER_SLACK_WEBHOOK_URL:
ALERTMANAGER_SLACK_CHANNEL: "#alerts"
ALERTMANAGER_SLACK_USERNAME: "alertmanager"
ALERTMANAGER_DISCORD_WEBHOOK_URL:

# Alertmanager Email Settings (uses Postfix on localhost:25 by default)
@@ -233,7 +233,7 @@ resource "cloudflare_record" "git_services_aaaa" {
  proxied = false
}

# Grafana DNS records - currently unused
# DISABLED: Grafana DNS records - uncomment to enable monitoring stack
# resource "cloudflare_record" "grafana_services_a" {
#   count   = var.enable_cloudflare_dns ? 1 : 0
#   zone_id = var.cloudflare_zone_id

@@ -254,7 +254,7 @@
#   proxied = true
# }

# Prometheus DNS records - currently unused
# DISABLED: Prometheus DNS records - uncomment to enable monitoring stack
# resource "cloudflare_record" "prometheus_services_a" {
#   count   = var.enable_cloudflare_dns ? 1 : 0
#   zone_id = var.cloudflare_zone_id