infra/INFRA_GAP_ANALYSIS.md
Jeremie Fraeys ac19b5918f
Add documentation and infrastructure gap analysis
- Update README.md with current architecture documentation
- Add INFRA_GAP_ANALYSIS.md for tracking infrastructure improvements
- Add .python-version for pyenv version management
2026-02-21 18:30:33 -05:00


Infra gap analysis

This document summarizes what this repo already covers well and what is typically added in small-but-established Ansible infra repos, with a focus on a dev/data-science oriented setup.

What you already have

  • Provisioning

    • Terraform for Linode instances and Cloudflare DNS.
    • Linode StackScripts bootstrap a baseline configuration.
  • Bootstrap hardening (StackScript)

    • SSH hardening (non-root user, key-only, custom port support, root login disabled).
    • UFW baseline (deny incoming, allow outgoing) + open 80/443 + SSH rate limiting.
    • Fail2ban for sshd.
    • Docker + Compose plugin installed.
  • Runtime platform (Ansible roles + playbooks)

    • Traefik (on both hosts) with ACME via Cloudflare DNS-01.
    • SSO/IdM: Authelia OIDC + LLDAP.
    • Observability stack: exporters (node-exporter + cAdvisor) + Prometheus + Loki + Grafana (provisioned datasources).
    • Forgejo + Forgejo Actions runner.
    • Watchtower for label-based container updates.
  • Testing

    • playbooks/test_config.yml does real smoke-testing: container health, TLS sanity, OIDC discovery, and optional object storage credential checks.

Main gaps vs “well-established” small infra repos (prioritized)

1) Backups + restore drills (biggest practical gap)

A mature repo almost always includes a backup strategy and automation.

  • Backup tool

    • Common: restic or borg.
    • Optional: provider snapshots, rclone, etc.
  • What to back up (likely targets for this repo)

    • /opt/forgejo (repos/config/DB volume depending on compose).
    • /opt/grafana (if data is persisted; provisioning covers datasources but dashboards/users may matter).
    • /opt/authelia and /opt/lldap (identity config + data).
    • /opt/traefik/letsencrypt/acme.json (optional; certs can be reissued, but Let's Encrypt rate limits apply).
    • Any Postgres/Redis volumes from app_core.
    • Loki data is usually optional to restore unless you explicitly want log retention across rebuilds.
  • Offsite destination

    • S3-compatible object storage works well here; you already have patterns for Linode Object Storage.
  • Restore validation

    • Add a small restore smoke test or runbook/playbook to verify backups are usable.
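
The backup pieces above could be wired together roughly as follows. This is a minimal sketch, not this repo's implementation: the role path, the /etc/restic/env file, the unit names, and the retention numbers are all illustrative. It also assumes the RESTIC_* values are environment variables on the machine running Ansible, and that the RESTIC_AWS_* names map onto the AWS_* variables restic reads for S3-compatible backends.

```yaml
# Hypothetical roles/backups/tasks/main.yml; every name and path here is an assumption.
- name: Install restic
  ansible.builtin.apt:
    name: restic
    state: present
    update_cache: true

- name: Create restic config directory
  ansible.builtin.file:
    path: /etc/restic
    state: directory
    mode: "0700"

- name: Write restic credentials (root-only)
  ansible.builtin.copy:
    dest: /etc/restic/env
    mode: "0600"
    content: |
      RESTIC_REPOSITORY={{ lookup('ansible.builtin.env', 'RESTIC_REPOSITORY') }}
      RESTIC_PASSWORD={{ lookup('ansible.builtin.env', 'RESTIC_PASSWORD') }}
      AWS_ACCESS_KEY_ID={{ lookup('ansible.builtin.env', 'RESTIC_AWS_ACCESS_KEY_ID') }}
      AWS_SECRET_ACCESS_KEY={{ lookup('ansible.builtin.env', 'RESTIC_AWS_SECRET_ACCESS_KEY') }}
  no_log: true

- name: Install backup service unit
  ansible.builtin.copy:
    dest: /etc/systemd/system/infra-backup.service
    mode: "0644"
    content: |
      [Unit]
      Description=restic backup of service data

      [Service]
      Type=oneshot
      EnvironmentFile=/etc/restic/env
      # Back up the data directories, then apply retention (numbers are examples).
      ExecStart=/usr/bin/restic backup /opt/forgejo /opt/authelia /opt/lldap
      ExecStart=/usr/bin/restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune

- name: Install backup timer unit
  ansible.builtin.copy:
    dest: /etc/systemd/system/infra-backup.timer
    mode: "0644"
    content: |
      [Unit]
      Description=Run infra-backup.service on a schedule

      [Timer]
      OnCalendar=daily
      Persistent=true

      [Install]
      WantedBy=timers.target

- name: Enable and start the backup timer
  ansible.builtin.systemd:
    name: infra-backup.timer
    enabled: true
    state: started
    daemon_reload: true
```

The repository has to be initialized once with restic init before the first run. For restore validation, restic check verifies repository integrity, and restic restore latest --target <scratch dir> into a throwaway directory confirms that files actually come back. Copying live database files can capture an inconsistent state, so an app-aware export is the safer source for those.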

2) Ongoing OS updates (unattended security upgrades / reboots)

The StackScript runs initial upgrades at provision time, but established repos usually add ongoing patching:

  • unattended-upgrades for Debian security updates.
  • A policy for reboots when required.
  • Optional notification on “reboot required”.
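
A patching role along these lines would cover the three points above. It is a sketch under assumptions: the second apt.conf.d file name and the 04:00 reboot window are illustrative choices, not values from this repo.

```yaml
# Hypothetical tasks for a patching role; the 52... file name and reboot time are assumptions.
- name: Install unattended-upgrades
  ansible.builtin.apt:
    name: unattended-upgrades
    state: present
    update_cache: true

- name: Enable periodic package list updates and unattended upgrades
  ansible.builtin.copy:
    dest: /etc/apt/apt.conf.d/20auto-upgrades
    mode: "0644"
    content: |
      APT::Periodic::Update-Package-Lists "1";
      APT::Periodic::Unattended-Upgrade "1";

- name: Set the reboot policy
  ansible.builtin.copy:
    dest: /etc/apt/apt.conf.d/52unattended-upgrades-local
    mode: "0644"
    content: |
      // Reboot automatically, but only inside a quiet window.
      Unattended-Upgrade::Automatic-Reboot "true";
      Unattended-Upgrade::Automatic-Reboot-Time "04:00";
```

If automatic reboots are too aggressive for these hosts, setting Automatic-Reboot to "false" and checking for /var/run/reboot-required (the flag file Debian and Ubuntu normally create when a reboot is pending) is one way to drive a notification instead.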

3) Alerting (metrics exist, but alerts are missing)

You have Prometheus/Grafana/Loki, but a typical baseline also includes:

  • Alertmanager (or another alert routing mechanism).
  • A notification channel (email/webhook/etc.).
  • A small base ruleset:
    • host down / exporter down
    • disk almost full
    • high load / memory pressure
    • container down (cAdvisor)
    • cert expiry
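
The base ruleset above might translate into a Prometheus rule file like this one. Treat it as a starting sketch: the file name, thresholds, and durations are illustrative, the container rule assumes cAdvisor's container_last_seen metric is scraped, and the certificate rule assumes Traefik's Prometheus metrics are enabled and expose traefik_tls_certs_not_after.

```yaml
# Hypothetical rule file (e.g. rules/base.yml); thresholds and durations are examples.
groups:
  - name: base
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"

      - alert: DiskAlmostFull
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 15m
        labels:
          severity: warning

      - alert: MemoryPressure
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 10m
        labels:
          severity: warning

      - alert: HighLoad
        # 5-minute load average per CPU core.
        expr: |
          node_load5
            / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"}) > 1.5
        for: 15m
        labels:
          severity: warning

      - alert: ContainerDown
        # Assumes cAdvisor; fires when a named container has not been seen recently.
        expr: time() - container_last_seen{name!=""} > 120
        labels:
          severity: warning

      - alert: CertExpiringSoon
        # Assumes Traefik metrics are enabled; 14 days is an arbitrary margin.
        expr: traefik_tls_certs_not_after - time() < 14 * 86400
        labels:
          severity: warning
```

If this file is deployed through Ansible's template module, the {{ $labels... }} annotations would collide with Jinja2 delimiters, so copying it as a plain file is the simpler route.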

4) Firewall convergence via Ansible (not only StackScript)

You configure UFW in StackScript and add some service rules (e.g., Loki allowlist), but many repos keep firewall rules converged in Ansible so the policy is explicit and continuously enforced.

Suggested pattern:

  • A dedicated firewall role that:
    • sets UFW defaults
    • allows SSH (correct port) from desired sources
    • allows 80/443
    • handles service allowlists (like Loki)
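
A firewall role following that pattern could be as small as the sketch below. The variables ssh_port and loki_allowed_sources are hypothetical names, and 3100 is assumed to be the Loki port because it is Loki's default. The allow rules come before the default-deny and enable steps so a first run cannot lock out SSH.

```yaml
# Hypothetical roles/firewall/tasks/main.yml.
# ssh_port and loki_allowed_sources are assumed variables, not names from this repo.
- name: Rate-limit SSH on the configured port
  community.general.ufw:
    rule: limit
    port: "{{ ssh_port | default(22) }}"
    proto: tcp

- name: Allow HTTP and HTTPS
  community.general.ufw:
    rule: allow
    port: "{{ item }}"
    proto: tcp
  loop:
    - "80"
    - "443"

- name: Allow Loki only from listed sources
  community.general.ufw:
    rule: allow
    port: "3100"
    proto: tcp
    from_ip: "{{ item }}"
  loop: "{{ loki_allowed_sources | default([]) }}"

- name: Deny incoming traffic by default
  community.general.ufw:
    direction: incoming
    default: deny

- name: Allow outgoing traffic by default
  community.general.ufw:
    direction: outgoing
    default: allow

- name: Enable UFW
  community.general.ufw:
    state: enabled
```

Restricting SSH to specific sources would mean adding from_ip to the first task, at the cost of needing a fallback path if your own address changes.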

5) Container hygiene/security baseline

Optional but common additions:

  • Scheduled docker system prune (carefully, to avoid breaking rollbacks).
  • Vulnerability scanning (Trivy/Grype) on images.
  • Service-by-service hardening in compose (read-only FS, drop caps, etc.).
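
As an illustration of the compose-level hardening, a locked-down service definition might look like this. The service name and image are placeholders; which options a given service tolerates has to be checked per service, since many images need writable paths or specific capabilities.

```yaml
# Hypothetical compose service; name and image are placeholders.
services:
  example-app:
    image: example/app:1.0.0
    read_only: true            # root filesystem is read-only
    tmpfs:
      - /tmp                   # writable scratch space that is not persisted
    cap_drop:
      - ALL                    # add back individual capabilities only if needed
    security_opt:
      - no-new-privileges:true
    restart: unless-stopped
```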

6) Dev/data-science oriented additions (optional)

Depending on goals:

  • MinIO (local S3) if you want on-prem-style object storage.
  • A small artifact/cache service (or rely on external registries).
  • JupyterHub / code-server behind SSO (only if you want interactive dev environments).

Structural note: “hardening” is split across StackScript + Ansible

Your roles/hardening currently focuses on rsyslog + UFW log file rotation.

Many repos keep more hardening in Ansible (sshd, fail2ban, unattended upgrades, firewall). Your approach (StackScript) is valid, but the tradeoff is that StackScript changes don't converge on existing hosts unless you reprovision.
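
To make the convergence point concrete, here is what moving two of the StackScript's sshd settings into Ansible could look like. It is a sketch only: the host pattern is a placeholder, and the service name ssh assumes Debian or Ubuntu. Every playbook run would re-assert the settings instead of relying on what was applied at first boot.

```yaml
# Hypothetical playbook; host pattern and chosen settings are assumptions.
- hosts: all
  become: true
  tasks:
    - name: Enforce key-only SSH with root login disabled
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: "^#?\\s*{{ item.key }}\\s"
        line: "{{ item.key }} {{ item.value }}"
        validate: /usr/sbin/sshd -t -f %s   # refuse to write a broken config
      loop:
        - { key: PermitRootLogin, value: "no" }
        - { key: PasswordAuthentication, value: "no" }
      notify: Restart ssh

  handlers:
    - name: Restart ssh
      ansible.builtin.service:
        name: ssh
        state: restarted
```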

If you want to stay small while matching typical “complete” infra repos, prioritize:

  • Backups (restic/borg + systemd timer + offsite storage).
  • Unattended upgrades (plus a reboot policy).
  • Alerting (Alertmanager + a few base alerts).
  • Firewall role (optional if you're happy with StackScript-only, but recommended for convergence).

Open questions (to pick the right implementation)

  • Backups: filesystem/volume backups via restic, plus an app-aware Forgejo export via forgejo dump.
  • Alerting: webhook notifications (Slack or Discord).
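
For the webhook choice, an Alertmanager configuration for the Slack option might look roughly like this. The file path, channel, username, and timing values are placeholders, and reading the webhook URL from a file is one option among several.

```yaml
# Hypothetical alertmanager.yml for the Slack option; all values are placeholders.
route:
  receiver: slack
  group_by: [alertname, instance]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: slack
    slack_configs:
      - api_url_file: /etc/alertmanager/slack_webhook_url   # keeps the secret out of this file
        channel: "#alerts"
        username: alertmanager
        send_resolved: true
```

The Discord option would swap slack_configs for discord_configs with a webhook_url, which requires a recent enough Alertmanager release to have the Discord receiver.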

Variables introduced by the backup + alerting implementation

  • Backups

    • RESTIC_REPOSITORY
    • RESTIC_PASSWORD
    • RESTIC_AWS_ACCESS_KEY_ID
    • RESTIC_AWS_SECRET_ACCESS_KEY
    • RESTIC_AWS_DEFAULT_REGION (optional)
    • INFRA_BACKUP_ONCALENDAR (optional)
    • RESTIC_KEEP_DAILY (optional)
    • RESTIC_KEEP_WEEKLY (optional)
    • RESTIC_KEEP_MONTHLY (optional)
  • Alerting

    • Set exactly one:
      • ALERTMANAGER_SLACK_WEBHOOK_URL
      • ALERTMANAGER_DISCORD_WEBHOOK_URL
    • Optional (Slack only):
      • ALERTMANAGER_SLACK_CHANNEL
      • ALERTMANAGER_SLACK_USERNAME
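
The "set exactly one" rule is easy to get wrong silently, so a pre-flight check can help. The task below is a sketch that assumes these variables are environment variables on the machine running Ansible; if they are supplied another way (vault, group_vars), the two lookups would change accordingly.

```yaml
# Hypothetical pre-flight task; assumes the webhook variables live in the environment.
- name: Require exactly one Alertmanager webhook URL
  vars:
    slack_url: "{{ lookup('ansible.builtin.env', 'ALERTMANAGER_SLACK_WEBHOOK_URL') }}"
    discord_url: "{{ lookup('ansible.builtin.env', 'ALERTMANAGER_DISCORD_WEBHOOK_URL') }}"
  ansible.builtin.assert:
    that:
      # True only when one URL is non-empty and the other is empty.
      - (slack_url | length > 0) != (discord_url | length > 0)
    fail_msg: >-
      Set exactly one of ALERTMANAGER_SLACK_WEBHOOK_URL or
      ALERTMANAGER_DISCORD_WEBHOOK_URL.
```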