Add documentation and infrastructure gap analysis

- Update README.md with current architecture documentation
- Add INFRA_GAP_ANALYSIS.md for tracking infrastructure improvements
- Add .python-version for pyenv version management
2026-02-21 18:30:33 -05:00 · ac19b5918f
commit ac19b5918f
parent e7b9546f7f
3 changed files with 222 additions and 6 deletions
--- a/.python-version
+++ b/.python-version
@@ -0,0 +1 @@
+3.11
--- a/INFRA_GAP_ANALYSIS.md
+++ b/INFRA_GAP_ANALYSIS.md
@@ -0,0 +1,139 @@
+# Infra gap analysis
+
+This document summarizes what this repo already covers well and what is typically added in small-but-established Ansible infra repos, with a focus on a dev/data-science oriented setup.
+
+## What you already have
+
+- **Provisioning**
+  - Terraform for Linode instances and Cloudflare DNS.
+  - Linode StackScripts bootstrap a baseline configuration.
+
+- **Bootstrap hardening (StackScript)**
+  - SSH hardening (non-root user, key-only, custom port support, root login disabled).
+  - UFW baseline (deny incoming, allow outgoing) + open `80/443` + SSH rate limiting.
+  - Fail2ban for `sshd`.
+  - Docker + Compose plugin installed.
+
+- **Runtime platform (Ansible roles + playbooks)**
+  - Traefik (on both hosts) with ACME via Cloudflare DNS-01.
+  - SSO/IdM: Authelia OIDC + LLDAP.
+  - Observability stack: exporters (node-exporter + cAdvisor) + Prometheus + Loki + Grafana (provisioned datasources).
+  - Forgejo + Forgejo Actions runner.
+  - Watchtower for label-based container updates.
+
+- **Testing**
+  - `playbooks/test_config.yml` does real smoke-testing: container health, TLS sanity, OIDC discovery, and optional object storage credential checks.
+
+## Main gaps vs “well-established” small infra repos (prioritized)
+
+### 1) Backups + restore drills (biggest practical gap)
+
+A mature repo almost always includes a backup strategy and automation.
+
+- **Backup tool**
+  - Common: `restic` or `borg`.
+  - Optional: provider snapshots, `rclone`, etc.
+
+- **What to back up (likely targets for this repo)**
+  - `/opt/forgejo` (repos/config/DB volume depending on compose).
+  - `/opt/grafana` (if data is persisted; provisioning covers datasources but dashboards/users may matter).
+  - `/opt/authelia` and `/opt/lldap` (identity config + data).
+  - `/opt/traefik/letsencrypt/acme.json` (optional; certs can be reissued but rate-limits exist).
+  - Any Postgres/Redis volumes from `app_core`.
+  - Loki data is usually optional to restore unless you explicitly want log retention across rebuilds.
+
+- **Offsite destination**
+  - S3-compatible object storage works well here; you already have patterns for Linode Object Storage.
+
+- **Restore validation**
+  - Add a small restore smoke test or runbook/playbook to verify backups are usable.
+
+### 2) Ongoing OS updates (unattended security upgrades / reboots)
+
+StackScript does initial upgrades, but established repos usually add ongoing patching:
+
+- `unattended-upgrades` for Debian security updates.
+- A policy for reboots when required.
+- Optional notification on “reboot required”.
+
+### 3) Alerting (metrics exist, but alerts are missing)
+
+You have Prometheus/Grafana/Loki, but a typical baseline also includes:
+
+- **Alertmanager** (or another alert routing mechanism).
+- A notification channel (email/webhook/etc.).
+- A small base ruleset:
+  - host down / exporter down
+  - disk almost full
+  - high load / memory pressure
+  - container down (cAdvisor)
+  - cert expiry
+
+### 4) Firewall convergence via Ansible (not only StackScript)
+
+You configure UFW in StackScript and add some service rules (e.g., Loki allowlist), but many repos keep firewall rules converged in Ansible so the policy is explicit and continuously enforced.
+
+Suggested pattern:
+
+- A dedicated `firewall` role that:
+  - sets UFW defaults
+  - allows SSH (correct port) from desired sources
+  - allows `80/443`
+  - handles service allowlists (like Loki)
+
+### 5) Container hygiene/security baseline
+
+Optional but common additions:
+
+- Scheduled `docker system prune` (carefully, to avoid breaking rollbacks).
+- Vulnerability scanning (Trivy/Grype) on images.
+- Service-by-service hardening in compose (read-only FS, drop caps, etc.).
+
+### 6) Dev/data-science oriented additions (optional)
+
+Depending on goals:
+
+- MinIO (local S3) if you want on-prem-style object storage.
+- A small artifact/cache service (or rely on external registries).
+- JupyterHub / code-server behind SSO (only if you want interactive dev environments).
+
+## Structural note: “hardening” is split across StackScript + Ansible
+
+Your `roles/hardening` currently focuses on rsyslog + UFW log file rotation.
+
+Many repos keep more hardening in Ansible (sshd, fail2ban, unattended upgrades, firewall). Your approach (StackScript) is valid, but the tradeoff is StackScript changes don’t automatically converge over time unless you reprovision.
+
+## Minimal recommended additions (small but complete)
+
+If you want to stay small while matching typical “complete” infra repos, prioritize:
+
+- **Backups** (restic/borg + systemd timer + offsite storage).
+- **Unattended upgrades** (plus a reboot policy).
+- **Alerting** (Alertmanager + a few base alerts).
+- **Firewall role** (optional if you’re happy with StackScript-only, but recommended for convergence).
+
+## Open questions (to pick the right implementation)
+
+- Backups: filesystem/volume backups via `restic`, plus an app-aware Forgejo export via `forgejo dump`.
+- Alerting: webhook notifications (Slack or Discord).
+
+## Variables introduced by the backup + alerting implementation
+
+- **Backups**
+  - `RESTIC_REPOSITORY`
+  - `RESTIC_PASSWORD`
+  - `RESTIC_AWS_ACCESS_KEY_ID`
+  - `RESTIC_AWS_SECRET_ACCESS_KEY`
+  - `RESTIC_AWS_DEFAULT_REGION` (optional)
+  - `INFRA_BACKUP_ONCALENDAR` (optional)
+  - `RESTIC_KEEP_DAILY` (optional)
+  - `RESTIC_KEEP_WEEKLY` (optional)
+  - `RESTIC_KEEP_MONTHLY` (optional)
+
+- **Alerting**
+  - Set **exactly one**:
+    - `ALERTMANAGER_SLACK_WEBHOOK_URL`
+    - `ALERTMANAGER_DISCORD_WEBHOOK_URL`
+  - Optional (Slack only):
+    - `ALERTMANAGER_SLACK_CHANNEL`
+    - `ALERTMANAGER_SLACK_USERNAME`
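
The variable contract listed above (required restic settings, exactly one webhook URL, Slack-only extras) could be checked before a run with something like the following. This is a minimal sketch, not part of the commit: it assumes the values arrive as environment variables, it treats every backup variable not marked "(optional)" as required, and `check_infra_env` is a hypothetical helper name.

```python
from collections.abc import Mapping

# Names come from the variable list above. Treating these four as required
# is an assumption: they are the ones not marked "(optional)".
REQUIRED_BACKUP_VARS = (
    "RESTIC_REPOSITORY",
    "RESTIC_PASSWORD",
    "RESTIC_AWS_ACCESS_KEY_ID",
    "RESTIC_AWS_SECRET_ACCESS_KEY",
)
WEBHOOK_VARS = (
    "ALERTMANAGER_SLACK_WEBHOOK_URL",
    "ALERTMANAGER_DISCORD_WEBHOOK_URL",
)
SLACK_ONLY_VARS = ("ALERTMANAGER_SLACK_CHANNEL", "ALERTMANAGER_SLACK_USERNAME")


def check_infra_env(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the env looks valid."""
    problems = [f"missing {name}" for name in REQUIRED_BACKUP_VARS if not env.get(name)]

    # "Set exactly one" webhook URL: zero or two are both errors.
    webhooks_set = [name for name in WEBHOOK_VARS if env.get(name)]
    if len(webhooks_set) != 1:
        problems.append(
            f"set exactly one of {', '.join(WEBHOOK_VARS)} (found {len(webhooks_set)})"
        )

    # Channel/username are documented as Slack-only options.
    if "ALERTMANAGER_SLACK_WEBHOOK_URL" not in webhooks_set:
        problems += [f"{name} only applies to Slack" for name in SLACK_ONLY_VARS if env.get(name)]
    return problems


if __name__ == "__main__":
    import os

    for problem in check_infra_env(os.environ):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets a caller report every misconfiguration in a single pass.
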
--- a/README.md
+++ b/README.md
@@ -34,7 +34,7 @@ What it does:

 - Applies Terraform from `terraform/`
 - Writes `inventory/hosts.yml` and `inventory/host_vars/web.yml` (gitignored)
-- Runs `playbooks/services.yml` and `playbooks/app.yml`
+- Runs `playbooks/services.yml` and `playbooks/web.yml`

 If you want Terraform only:

@@ -79,8 +79,8 @@ Traefik uses Let’s Encrypt via Cloudflare DNS-01.

 You must provide a Cloudflare API token in your local environment when running Ansible:

-- `CF_DNS_API_TOKEN` (preferred)
-- or `TF_VAR_cloudflare_api_token`
+- `CF_DNS_API_TOKEN`
+- `CF_ZONE_API_TOKEN`

 ## SSO (Authelia OIDC)

@@ -116,7 +116,7 @@ Notes:
 ## Playbooks

 - `playbooks/services.yml`: deploy observability + forgejo on `services`
-- `playbooks/app.yml`: deploy app-side dependencies on `web`
+- `playbooks/web.yml`: deploy app-side dependencies on `web`
 - `playbooks/test_config.yml`: smoke test host config and deployed stacks
 - `playbooks/deploy.yml`: legacy/all-in-one deploy for the services host (no tags)

@@ -165,7 +165,7 @@ A Forgejo runner is deployed on the `web` host (`roles/forgejo_runner`).
 To force re-register (e.g. after deleting the runner in Forgejo UI):

 ```bash
-ansible-playbook playbooks/app.yml \
+ansible-playbook playbooks/web.yml \
  --vault-password-file secrets/.vault_pass \
  --limit web \
  --tags forgejo_runner \
@@ -217,7 +217,7 @@ ansible-playbook playbooks/services.yml --ask-vault-pass
 Web:

 ```bash
-ansible-playbook playbooks/app.yml --ask-vault-pass
+ansible-playbook playbooks/web.yml --ask-vault-pass
 ```

 ## Terraform
@@ -247,3 +247,79 @@ Web host (`web`):
 - `roles/traefik`
 - `roles/app_core` (optional shared Postgres/Redis)
 - `roles/forgejo_runner`
+- `roles/app_deployer` (CI/CD webhook and deployment automation)
+
+## App Deployment
+
+The `app_deployer` role provides automated deployment via webhooks from Forgejo or GitHub Actions.
+
+### Prerequisites
+
+1. **Generate deploy token** (run once):
+   ```bash
+   ./scripts/gen-auth-secrets.sh  # Creates VAULT_DEPLOY_TOKEN
+   # Or add to secrets/vault.yml manually
+   ```
+
+2. **Set DEPLOY_TOKEN in your app repo**:
+   - **Forgejo**: Use the helper script:
+     ```bash
+     ./scripts/set_deploy_token.py --owner <you> --repo <app-name>
+     ```
+   - **GitHub**: Set `DEPLOY_TOKEN` secret via Settings > Secrets and variables > Actions
+
+3. **Add deploy workflow to your app repo**:
+   
+   Copy the sample workflow and customize:
+   ```bash
+   cp roles/app_deployer/files/forgejo-deploy-workflow.yml .forgejo/workflows/deploy.yml
+   # For GitHub: cp to .github/workflows/deploy.yml
+   ```
+
+   Update the workflow for your build (Go, Rust, Node.js, etc.) and app name.
+
+### How It Works
+
+1. **CI builds the app** and uploads binary + checksum to `deploy@web:/opt/artifacts/`
+2. **CI triggers webhook** with `X-Deploy-Token` header
+3. **Webhook validates token** (timing-safe comparison) and runs deployment
+4. **Ansible deploys the app**:
+   - Verifies artifact checksum
+   - Creates app user and directories
+   - Sets up systemd service
+   - Keeps last 5 versions for rollback
+
+### Manual Deployment
+
+For manual deploys or rollbacks:
+
+```bash
+# Deploy a specific version
+ssh deploy@web /opt/deploy/scripts/deploy.sh my-api abc123 prod
+
+# Rollback to previous version
+ssh deploy@web /opt/deploy/scripts/rollback.sh my-api
+# Lists available versions, then:
+ssh deploy@web /opt/deploy/scripts/rollback.sh my-api <older-sha>
+```
+
+### Security Features
+
+- **Timing-safe token validation** prevents timing attacks
+- **Artifact checksums** ensure binary integrity
+- **Sudoers restricted** to the deployment script only
+- **Last 5 versions kept** for quick rollback
+- **Deploy user** runs as an unprivileged user per app
+
+### Troubleshooting
+
+```bash
+# Check webhook logs
+ssh web sudo journalctl -u webhook -f
+
+# Check deploy logs
+ssh web sudo cat /var/log/deploy.log
+
+# Verify systemd service
+ssh web sudo systemctl status my-api
+```
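
The deployment flow under "How It Works" relies on two checks: a timing-safe comparison of the `X-Deploy-Token` header and a checksum of the uploaded artifact. The sketch below illustrates both in Python. It is not the `app_deployer` implementation, which this commit does not show; the function names, the choice of SHA-256, and the hex-digest format are assumptions made for illustration.

```python
import hashlib
import hmac
from pathlib import Path


def token_is_valid(presented: str, expected: str) -> bool:
    """Compare tokens in constant time so response timing does not leak the secret."""
    return hmac.compare_digest(presented.encode(), expected.encode())


def artifact_matches(artifact: Path, expected_hex: str) -> bool:
    """Hash the artifact in chunks (SHA-256 assumed) and compare with the published digest."""
    digest = hashlib.sha256()
    with artifact.open("rb") as handle:
        # Read in 64 KiB chunks so large binaries are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(64 * 1024), b""):
            digest.update(chunk)
    expected = expected_hex.strip().lower()
    return hmac.compare_digest(digest.hexdigest().encode(), expected.encode())
```

A plain `==` on the token would return as soon as the first differing byte is found, which is the timing signal `hmac.compare_digest` is designed to avoid.
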