infra/INFRA_GAP_ANALYSIS.md
Jeremie Fraeys f87512426a
docs(gap-analysis): mark Alertmanager as implemented and add email variables
- Update Alerting section to show IMPLEMENTED status
- Add Postfix/Postmark email to Runtime platform list
- Update Forgejo description to mention AI scrapers blocklist
- Add Email variables section with Postmark configuration
- Update section title to include email implementation
2026-03-06 10:37:42 -05:00


Infra gap analysis

This document summarizes what this repo already covers well and what is typically added in small-but-established Ansible infra repos, with a focus on a dev/data-science oriented setup.

What you already have

  • Provisioning

    • Terraform for Linode instances and Cloudflare DNS.
    • Linode StackScripts bootstrap a baseline configuration.
  • Bootstrap hardening (StackScript)

    • SSH hardening (non-root user, key-only, custom port support, root login disabled).
    • UFW baseline (deny incoming, allow outgoing) + open 80/443 + SSH rate limiting.
    • Fail2ban for sshd.
    • Docker + Compose plugin installed.
  • Runtime platform (Ansible roles + playbooks)

    • Traefik (on both hosts) with ACME via Cloudflare DNS-01 (file provider fallback for Docker API compatibility).
    • SSO/IdM: Authelia OIDC + LLDAP.
    • Email: Postfix relay to Postmark for transactional email.
    • Observability stack: exporters (node-exporter + cAdvisor) + Prometheus + Loki + Grafana (provisioned datasources) + Alertmanager (email alerts).
    • Forgejo + Forgejo Actions runner (with AI scrapers blocklist).
    • Watchtower for label-based container updates.
  • Testing

    • playbooks/test_config.yml does real smoke-testing: container health, TLS sanity, OIDC discovery, and optional object storage credential checks.
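
As an illustration of what an OIDC discovery check involves, here is a minimal Python sketch. It is not the code in playbooks/test_config.yml; the function names and the `auth.example.com` issuer are hypothetical, and the field list is only a subset of what a discovery document normally carries.

```python
import json
import urllib.request

# Subset of the fields an OpenID Connect discovery document is expected to carry.
REQUIRED_FIELDS = ("issuer", "authorization_endpoint", "token_endpoint", "jwks_uri")


def discovery_url(issuer: str) -> str:
    """Return the standard discovery URL for an issuer base URL."""
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def missing_fields(document: dict) -> list[str]:
    """List the expected discovery fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not document.get(name)]


def check_issuer(issuer: str, timeout: float = 10.0) -> list[str]:
    """Fetch the discovery document and report missing fields.

    Fetching over HTTPS also exercises the TLS certificate, because
    urllib verifies it by default.
    """
    with urllib.request.urlopen(discovery_url(issuer), timeout=timeout) as response:
        return missing_fields(json.load(response))
```

Calling `check_issuer("https://auth.example.com")` against a healthy provider should return an empty list; any network or TLS failure raises an exception instead.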

Main gaps vs “well-established” small infra repos (prioritized)

1) Backups + restore drills (biggest practical gap)

A mature repo almost always includes a backup strategy and automation.

  • Backup tool

    • Common: restic or borg.
    • Optional: provider snapshots, rclone, etc.
  • What to back up (likely targets for this repo)

    • /opt/forgejo (repos/config/DB volume depending on compose).
    • /opt/grafana (if data is persisted; provisioning covers datasources but dashboards/users may matter).
    • /opt/authelia and /opt/lldap (identity config + data).
    • /opt/traefik/letsencrypt/acme.json (optional; certificates can be reissued, but Let's Encrypt rate limits apply).
    • Any Postgres/Redis volumes from app_core.
    • Loki data is usually optional to restore unless you explicitly want log retention across rebuilds.
  • Offsite destination

    • S3-compatible object storage works well here; you already have patterns for Linode Object Storage.
  • Restore validation

    • Add a small restore smoke test or runbook/playbook to verify backups are usable.
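
A backup job along these lines could be sketched in Python as follows. This is a sketch under assumptions, not this repo's implementation: the retention defaults (7/4/6), the path list, and the function names are hypothetical, and mapping `RESTIC_AWS_*` onto the `AWS_*` names that restic reads is one possible way to wire the credentials.

```python
import os
import subprocess

# Hypothetical selection of the directories listed above.
BACKUP_PATHS = ["/opt/forgejo", "/opt/grafana", "/opt/authelia", "/opt/lldap"]


def restic_env(env: dict) -> dict:
    """Copy env and expose RESTIC_AWS_* under the AWS_* names restic reads."""
    out = dict(env)
    for name in ("ACCESS_KEY_ID", "SECRET_ACCESS_KEY", "DEFAULT_REGION"):
        value = env.get("RESTIC_AWS_" + name)
        if value:
            out["AWS_" + name] = value
    return out


def backup_command(paths: list[str]) -> list[str]:
    """Build the argv for a restic backup of the given paths."""
    return ["restic", "backup", *paths]


def forget_command(env: dict) -> list[str]:
    """Build the retention command; the defaults here are assumptions."""
    return [
        "restic", "forget",
        "--keep-daily", env.get("RESTIC_KEEP_DAILY", "7"),
        "--keep-weekly", env.get("RESTIC_KEEP_WEEKLY", "4"),
        "--keep-monthly", env.get("RESTIC_KEEP_MONTHLY", "6"),
        "--prune",
    ]


def run_backup() -> None:
    """Back up, apply retention, then verify repository integrity."""
    env = restic_env(dict(os.environ))
    for command in (backup_command(BACKUP_PATHS), forget_command(env), ["restic", "check"]):
        subprocess.run(command, env=env, check=True)
```

`restic check` only verifies repository integrity; a real restore drill would still restore a snapshot to a scratch directory and inspect the result.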

2) Ongoing OS updates (unattended security upgrades / reboots)

StackScript does initial upgrades, but established repos usually add ongoing patching:

  • unattended-upgrades for Debian security updates.
  • A policy for reboots when required.
  • Optional notification on “reboot required”.
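
For the "reboot required" notification, Debian-family systems signal a pending reboot by creating `/var/run/reboot-required`, with the triggering packages listed in `/var/run/reboot-required.pkgs`. A minimal Python sketch of a check that a timer or alert hook could call (the function names are hypothetical):

```python
from pathlib import Path

DEFAULT_RUN_DIR = Path("/var/run")


def reboot_required(run_dir: Path = DEFAULT_RUN_DIR) -> bool:
    """Return True when the reboot-required flag file is present."""
    return (run_dir / "reboot-required").exists()


def pending_packages(run_dir: Path = DEFAULT_RUN_DIR) -> list[str]:
    """Return the de-duplicated, sorted packages that requested the reboot."""
    pkgs_file = run_dir / "reboot-required.pkgs"
    if not pkgs_file.exists():
        return []
    lines = pkgs_file.read_text().splitlines()
    return sorted({line.strip() for line in lines if line.strip()})
```

The `run_dir` parameter exists only so the check can be pointed at a temporary directory in tests.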

3) Alerting (IMPLEMENTED)

Alertmanager is deployed with email notifications via Postfix:

  • Alertmanager routes alerts to Postfix on localhost:25
  • Postfix relays to Postmark for reliable delivery
  • Notification channel: Email to notifications@jfraeys.com
  • Base alerts configured:
    • host down / exporter down
    • disk almost full
    • high load / memory pressure
    • container health (cAdvisor)
    • certificate expiry
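
To confirm the delivery path end to end (local Postfix on port 25, then Postmark), a test message can be sent by hand. The sketch below uses only the Python standard library; the sender address and the function names are hypothetical, and the sender would need to be one that Postmark accepts for this account.

```python
import smtplib
from email.message import EmailMessage


def build_test_alert(recipient: str, sender: str) -> EmailMessage:
    """Build a plain-text message that mimics an alert notification."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "[TEST] Alert delivery check"
    message.set_content("Manual test of the Postfix to Postmark delivery path.")
    return message


def send_via_local_postfix(message: EmailMessage, host: str = "localhost", port: int = 25) -> None:
    """Hand the message to the local Postfix instance for relaying."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.send_message(message)


if __name__ == "__main__":
    # Hypothetical sender address; replace with a verified one.
    send_via_local_postfix(
        build_test_alert("notifications@jfraeys.com", "alertmanager@example.com")
    )
```

If the message does not arrive, the Postfix mail log on the host is the first place to look, since it records whether the relay accepted the message.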

4) Firewall convergence via Ansible (not only StackScript)

You configure UFW in StackScript and add some service rules (e.g., Loki allowlist), but many repos keep firewall rules converged in Ansible so the policy is explicit and continuously enforced.

Suggested pattern:

  • A dedicated firewall role that:
    • sets UFW defaults
    • allows SSH (correct port) from desired sources
    • allows 80/443
    • handles service allowlists (like Loki)
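
The policy such a role would converge can be written down as an explicit, ordered list of `ufw` commands. The Python sketch below only builds that list; the function name is hypothetical, and the Loki port (3100, Loki's default HTTP port) and the example source range are assumptions. In an actual Ansible role the same rules would more likely be expressed as tasks using the `community.general.ufw` module.

```python
def ufw_commands(ssh_port: int, loki_sources: list[str], loki_port: int = 3100) -> list[list[str]]:
    """Return the ufw invocations for the baseline policy, in order."""
    commands = [
        ["ufw", "default", "deny", "incoming"],
        ["ufw", "default", "allow", "outgoing"],
        # "limit" allows SSH but rate-limits repeated connection attempts.
        ["ufw", "limit", f"{ssh_port}/tcp"],
        ["ufw", "allow", "80/tcp"],
        ["ufw", "allow", "443/tcp"],
    ]
    # Service allowlist: only the listed sources may reach Loki.
    for source in loki_sources:
        commands.append(
            ["ufw", "allow", "from", source, "to", "any", "port", str(loki_port), "proto", "tcp"]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical SSH port and source range, printed for review.
    for command in ufw_commands(2222, ["10.0.0.0/24"]):
        print(" ".join(command))
```

Keeping the rules as data makes the policy easy to review and diff, which is the main benefit of moving it out of a one-shot bootstrap script.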

5) Container hygiene/security baseline

Optional but common additions:

  • Scheduled docker system prune (carefully, to avoid breaking rollbacks).
  • Vulnerability scanning (Trivy/Grype) on images.
  • Service-by-service hardening in compose (read-only FS, drop caps, etc.).

6) Dev/data-science oriented additions (optional)

Depending on goals:

  • MinIO (local S3) if you want on-prem-style object storage.
  • A small artifact/cache service (or rely on external registries).
  • JupyterHub / code-server behind SSO (only if you want interactive dev environments).

Structural note: “hardening” is split across StackScript + Ansible

Your roles/hardening currently focuses on rsyslog + UFW log file rotation.

Many repos keep more hardening in Ansible (sshd, fail2ban, unattended upgrades, firewall). Your approach (StackScript) is valid, but the tradeoff is that StackScript changes don't converge automatically over time: existing hosts only pick them up when you reprovision.

If you want to stay small while matching typical “complete” infra repos, prioritize:

  • Backups (restic/borg + systemd timer + offsite storage).
  • Unattended upgrades (plus a reboot policy).
  • Alerting (Alertmanager + a few base alerts); already implemented, see section 3.
  • Firewall role (optional if you're happy with StackScript-only, but recommended for convergence).

Open questions (to pick the right implementation)

  • Backups: filesystem/volume backups via restic, plus an app-aware Forgejo export via forgejo dump.
  • Alerting: webhook notifications (Slack or Discord).

Variables introduced by the backup + alerting + email implementation

  • Backups

    • RESTIC_REPOSITORY
    • RESTIC_PASSWORD
    • RESTIC_AWS_ACCESS_KEY_ID
    • RESTIC_AWS_SECRET_ACCESS_KEY
    • RESTIC_AWS_DEFAULT_REGION (optional)
    • INFRA_BACKUP_ONCALENDAR (optional)
    • RESTIC_KEEP_DAILY (optional)
    • RESTIC_KEEP_WEEKLY (optional)
    • RESTIC_KEEP_MONTHLY (optional)
  • Alerting

    • Set exactly one:
      • ALERTMANAGER_SLACK_WEBHOOK_URL
      • ALERTMANAGER_DISCORD_WEBHOOK_URL
    • Optional (Slack only):
      • ALERTMANAGER_SLACK_CHANNEL
      • ALERTMANAGER_SLACK_USERNAME
  • Email (Postfix + Postmark)

    • POSTFIX_RELAYHOST (default: smtp.postmarkapp.com)
    • POSTFIX_RELAYHOST_PORT (default: 2525)
    • POSTFIX_RELAYHOST_USERNAME (Postmark server token)
    • POSTFIX_RELAYHOST_PASSWORD (Postmark server token)
    • AUTHELIA_SMTP_SENDER (e.g., notifications@yourdomain.com)
    • AUTHELIA_SMTP_IDENTIFIER (e.g., yourdomain.com)
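
The constraints in the list above (required backup variables, exactly one webhook URL, Slack-only options) can be checked before a deploy. A minimal Python sketch, assuming that every backup variable not marked optional is required; the function name and the error messages are hypothetical:

```python
import os

# Assumption: backup variables not marked "(optional)" above are required.
REQUIRED_BACKUP = (
    "RESTIC_REPOSITORY",
    "RESTIC_PASSWORD",
    "RESTIC_AWS_ACCESS_KEY_ID",
    "RESTIC_AWS_SECRET_ACCESS_KEY",
)
WEBHOOKS = ("ALERTMANAGER_SLACK_WEBHOOK_URL", "ALERTMANAGER_DISCORD_WEBHOOK_URL")
SLACK_ONLY = ("ALERTMANAGER_SLACK_CHANNEL", "ALERTMANAGER_SLACK_USERNAME")


def validate(env: dict) -> list[str]:
    """Return human-readable problems; an empty list means the env is consistent."""
    problems = [f"{name} is required" for name in REQUIRED_BACKUP if not env.get(name)]
    configured = [name for name in WEBHOOKS if env.get(name)]
    if len(configured) != 1:
        problems.append("set exactly one of " + ", ".join(WEBHOOKS))
    if not env.get(WEBHOOKS[0]):
        # Slack-specific options make no sense without the Slack webhook.
        problems += [f"{name} only applies to Slack" for name in SLACK_ONLY if env.get(name)]
    return problems


if __name__ == "__main__":
    for problem in validate(dict(os.environ)):
        print(problem)
```

Running the script with the deployment environment loaded prints one line per problem and nothing when the variables are consistent.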