Infra gap analysis
This document summarizes what this repo already covers well and what is typically added in small-but-established Ansible infra repos, with a focus on a dev/data-science oriented setup.
What you already have
- Provisioning
  - Terraform for Linode instances and Cloudflare DNS.
  - Linode StackScripts bootstrap a baseline configuration.
- Bootstrap hardening (StackScript)
  - SSH hardening (non-root user, key-only, custom port support, root login disabled).
  - UFW baseline (deny incoming, allow outgoing) + open `80/443` + SSH rate limiting.
  - Fail2ban for `sshd`.
  - Docker + Compose plugin installed.
- Runtime platform (Ansible roles + playbooks)
  - Traefik (on both hosts) with ACME via Cloudflare DNS-01 (file provider fallback for Docker API compatibility).
  - SSO/IdM: Authelia OIDC + LLDAP.
  - Email: Postfix relay to Postmark for transactional email.
  - Observability stack: exporters (node-exporter + cAdvisor) + Prometheus + Loki + Grafana (provisioned datasources) + Alertmanager (email alerts).
  - Forgejo + Forgejo Actions runner (with AI scrapers blocklist).
  - Watchtower for label-based container updates.
- Testing
  - `playbooks/test_config.yml` does real smoke-testing: container health, TLS sanity, OIDC discovery, and optional object storage credential checks.
Main gaps vs “well-established” small infra repos (prioritized)
1) Backups + restore drills (biggest practical gap)
A mature repo almost always includes a backup strategy and automation.
- Backup tool
  - Common: `restic` or `borg`.
  - Optional: provider snapshots, `rclone`, etc.
- What to back up (likely targets for this repo)
  - `/opt/forgejo` (repos/config/DB volume depending on compose).
  - `/opt/grafana` (if data is persisted; provisioning covers datasources but dashboards/users may matter).
  - `/opt/authelia` and `/opt/lldap` (identity config + data).
  - `/opt/traefik/letsencrypt/acme.json` (optional; certs can be reissued but rate-limits exist).
  - Any Postgres/Redis volumes from `app_core`.
  - Loki data is usually optional to restore unless you explicitly want log retention across rebuilds.
- Offsite destination
  - S3-compatible object storage works well here; you already have patterns for Linode Object Storage.
- Restore validation
  - Add a small restore smoke test or runbook/playbook to verify backups are usable.
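
A restore drill can be as small as three restic invocations: back up, verify the repository, and restore the latest snapshot into a scratch directory. The sketch below only builds those command lines and does not run them. The helper name `restore_drill_commands`, the scratch path, and the selection of paths are assumptions for illustration, not something this repo defines. restic itself reads `RESTIC_REPOSITORY` and `RESTIC_PASSWORD` from the environment.

```python
import shlex

# Paths borrowed from the "What to back up" list; adjust to what is actually persisted.
BACKUP_PATHS = ["/opt/forgejo", "/opt/grafana", "/opt/authelia", "/opt/lldap"]


def restore_drill_commands(paths, scratch_dir):
    """Return restic command lines for one backup + restore drill (hypothetical helper)."""
    return [
        ["restic", "backup", *paths],                              # take a snapshot
        ["restic", "check"],                                       # verify repository integrity
        ["restic", "restore", "latest", "--target", scratch_dir],  # restore into a scratch dir
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them, so the drill can be reviewed first.
    for cmd in restore_drill_commands(BACKUP_PATHS, "/tmp/restore-drill"):
        print(shlex.join(cmd))
```

After the restore step, a runbook would typically compare a few known files in the scratch directory against the live paths before declaring the backup usable.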
2) Ongoing OS updates (unattended security upgrades / reboots)
StackScript does initial upgrades, but established repos usually add ongoing patching:
- `unattended-upgrades` for Debian security updates.
- A policy for reboots when required.
- Optional notification on “reboot required”.
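
As one possible shape for the "reboot required" notification, the sketch below checks the marker files that Debian-family systems conventionally write when an upgrade needs a restart. The paths `/var/run/reboot-required` and `/var/run/reboot-required.pkgs` are assumed here, and `reboot_status` is a hypothetical helper. How the result is delivered (email, metric, log line) is left open.

```python
from pathlib import Path

# Conventional marker files on Debian-family hosts; treated as an assumption here.
REBOOT_MARKER = Path("/var/run/reboot-required")
REBOOT_PKGS = Path("/var/run/reboot-required.pkgs")


def reboot_status(marker=REBOOT_MARKER, pkgs=REBOOT_PKGS):
    """Return (required, packages) based on the marker files."""
    if not marker.exists():
        return False, []
    packages = []
    if pkgs.exists():
        # The package list may contain duplicates; de-duplicate and sort for stable output.
        lines = pkgs.read_text().splitlines()
        packages = sorted({line.strip() for line in lines if line.strip()})
    return True, packages


if __name__ == "__main__":
    required, packages = reboot_status()
    if required:
        print("reboot required:", ", ".join(packages) or "(packages not listed)")
    else:
        print("no reboot required")
```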
3) Alerting ✅ IMPLEMENTED
Alertmanager is deployed with email notifications via Postfix:
- Alertmanager routes alerts to Postfix on `localhost:25`.
- Postfix relays to Postmark for reliable delivery.
- Notification channel: email to `notifications@jfraeys.com`.
- Base alerts configured:
  - host down / exporter down
  - disk almost full
  - high load / memory pressure
  - container health (cAdvisor)
  - certificate expiry
4) Firewall convergence via Ansible (not only StackScript)
You configure UFW in StackScript and add some service rules (e.g., Loki allowlist), but many repos keep firewall rules converged in Ansible so the policy is explicit and continuously enforced.
Suggested pattern:
- A dedicated `firewall` role that:
  - sets UFW defaults
  - allows SSH (correct port) from desired sources
  - allows `80/443`
  - handles service allowlists (like Loki)
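
To make the target policy concrete, here is a minimal sketch that expresses it as plain `ufw` command lines. In the repo this would more naturally live in an Ansible role; Python is used only to show the rule set in one place. The helper name `ufw_commands`, the Loki port default of 3100, and the example source address are assumptions. SSH is rate-limited with `ufw limit` rather than restricted by source, which is a simplification of "from desired sources".

```python
def ufw_commands(ssh_port, loki_allowlist, loki_port=3100):
    """Build ufw invocations for the policy described above (hypothetical helper)."""
    cmds = [
        ["ufw", "default", "deny", "incoming"],
        ["ufw", "default", "allow", "outgoing"],
        ["ufw", "limit", f"{ssh_port}/tcp"],  # rate-limited SSH on the configured port
        ["ufw", "allow", "80/tcp"],
        ["ufw", "allow", "443/tcp"],
    ]
    # Service allowlist: only the listed sources may reach Loki.
    for source in loki_allowlist:
        cmds.append(
            ["ufw", "allow", "from", source, "to", "any",
             "port", str(loki_port), "proto", "tcp"]
        )
    return cmds


if __name__ == "__main__":
    # 2222 and 192.0.2.10 are placeholder values, not taken from this repo.
    for cmd in ufw_commands(2222, ["192.0.2.10"]):
        print(" ".join(cmd))
```

Keeping the rules in a single data-driven list is what makes convergence cheap: the same list can be re-applied on every run instead of only at provisioning time.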
5) Container hygiene/security baseline
Optional but common additions:
- Scheduled `docker system prune` (carefully, to avoid breaking rollbacks).
- Vulnerability scanning (Trivy/Grype) on images.
- Service-by-service hardening in compose (read-only FS, drop caps, etc.).
6) Dev/ds oriented additions (optional)
Depending on goals:
- MinIO (local S3) if you want on-prem-style object storage.
- A small artifact/cache service (or rely on external registries).
- JupyterHub / code-server behind SSO (only if you want interactive dev environments).
Structural note: “hardening” is split across StackScript + Ansible
Your roles/hardening currently focuses on rsyslog + UFW log file rotation.
Many repos keep more hardening in Ansible (sshd, fail2ban, unattended upgrades, firewall). Your approach (StackScript) is valid, but the tradeoff is StackScript changes don’t automatically converge over time unless you reprovision.
Minimal recommended additions (small but complete)
If you want to stay small while matching typical “complete” infra repos, prioritize:
- Backups (restic/borg + systemd timer + offsite storage).
- Unattended upgrades (plus a reboot policy).
- Alerting (Alertmanager + a few base alerts); now implemented, see section 3.
- Firewall role (optional if you’re happy with StackScript-only, but recommended for convergence).
Open questions (to pick the right implementation)
- Backups: filesystem/volume backups via `restic`, plus an app-aware Forgejo export via `forgejo dump`.
- Alerting: webhook notifications (Slack or Discord).
Variables introduced by the backup + alerting + email implementation
- Backups
  - `RESTIC_REPOSITORY`
  - `RESTIC_PASSWORD`
  - `RESTIC_AWS_ACCESS_KEY_ID`
  - `RESTIC_AWS_SECRET_ACCESS_KEY`
  - `RESTIC_AWS_DEFAULT_REGION` (optional)
  - `INFRA_BACKUP_ONCALENDAR` (optional)
  - `RESTIC_KEEP_DAILY` (optional)
  - `RESTIC_KEEP_WEEKLY` (optional)
  - `RESTIC_KEEP_MONTHLY` (optional)
- Alerting
  - Set exactly one:
    - `ALERTMANAGER_SLACK_WEBHOOK_URL`
    - `ALERTMANAGER_DISCORD_WEBHOOK_URL`
  - Optional (Slack only):
    - `ALERTMANAGER_SLACK_CHANNEL`
    - `ALERTMANAGER_SLACK_USERNAME`
- Email (Postfix + Postmark)
  - `POSTFIX_RELAYHOST` (default: `smtp.postmarkapp.com`)
  - `POSTFIX_RELAYHOST_PORT` (default: `2525`)
  - `POSTFIX_RELAYHOST_USERNAME` (Postmark server token)
  - `POSTFIX_RELAYHOST_PASSWORD` (Postmark server token)
  - `AUTHELIA_SMTP_SENDER` (e.g., `notifications@yourdomain.com`)
  - `AUTHELIA_SMTP_IDENTIFIER` (e.g., `yourdomain.com`)
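
A pre-flight check along these lines can catch a misconfigured environment before a playbook run. This is a sketch under assumptions: the backup variables not marked optional above are treated as required, the "set exactly one" rule is applied to the two webhook variables, and `validate_env` is a hypothetical helper rather than something the repo ships. The email variables are left out because two of them have defaults.

```python
import os

# Backup variables that the list above does not mark as optional (assumed required).
REQUIRED_BACKUP_VARS = [
    "RESTIC_REPOSITORY",
    "RESTIC_PASSWORD",
    "RESTIC_AWS_ACCESS_KEY_ID",
    "RESTIC_AWS_SECRET_ACCESS_KEY",
]
WEBHOOK_VARS = [
    "ALERTMANAGER_SLACK_WEBHOOK_URL",
    "ALERTMANAGER_DISCORD_WEBHOOK_URL",
]


def validate_env(env):
    """Return a list of problems; an empty list means the checked variables look complete."""
    problems = [f"missing {name}" for name in REQUIRED_BACKUP_VARS if not env.get(name)]
    webhooks_set = [name for name in WEBHOOK_VARS if env.get(name)]
    if len(webhooks_set) != 1:
        problems.append(
            f"set exactly one of {', '.join(WEBHOOK_VARS)} (found {len(webhooks_set)})"
        )
    return problems


if __name__ == "__main__":
    for problem in validate_env(os.environ):
        print(problem)
```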