infra/README.md
Jeremie Fraeys e2f732c0f5
infra: cleanup repository and add rollback documentation
- Remove unimplemented placeholder roles (airflow, spark)
- Delete cache files (__pycache__, .DS_Store) and generated inventory
- Remove outdated INFRA_GAP_ANALYSIS.md (functionality now in README)
- Standardize DISABLED comments for monitoring stack (Prometheus, Loki, Grafana)
- Add ROLLBACK.md with comprehensive recovery procedures
- Expand vault.example.yml with all backup and alerting variables
- Update README with complete vault variables documentation
2026-03-06 14:40:56 -05:00

# infra
## Overview
This repo manages two hosts:
- `web` (`jfraeys.com`)
- `services` (`services.jfraeys.com`)
The routing convention is `<service>.jfraeys.com`, with each service name resolving to the host that runs it.
Examples:
- `git.jfraeys.com` -> services host (Forgejo)
- `auth.jfraeys.com` -> services host (Authelia)
- `app.jfraeys.com` -> services host (App)
Traefik runs on both servers and routes only the services running on that server.
## Quickstart
This repo is intended to be driven by `setup`:
```bash
./setup
```
For options:
```bash
./setup --help
```
What it does:
- Applies Terraform from `terraform/`
- Writes `inventory/hosts.yml` and `inventory/host_vars/web.yml` (gitignored)
- Runs `playbooks/services.yml` and `playbooks/web.yml`
If you want Terraform only:
```bash
./setup --no-ansible
```
If you want Ansible only (requires an existing `inventory/hosts.yml`):
```bash
./setup --ansible-only
```
## Prereqs (local)
- `terraform`
- `ansible`
- `python3` (for helper scripts)
- `pip` / `python3 -m pip`
- SSH access to the hosts
If your SSH key is passphrase-protected, load it into your agent before running Ansible non-interactively (the `--apple-use-keychain` flag is macOS-specific; on other systems use plain `ssh-add`):
```bash
ssh-add --apple-use-keychain ~/.ssh/id_ed25519
```
## DNS (Cloudflare)
Create A/CNAME records that point to the correct server IP.
**Active records:**
- `jfraeys.com` -> A record to web server IPv4
- `services.jfraeys.com` -> A record to services server IPv4
- `git.jfraeys.com` -> A/CNAME to services (Forgejo)
- `auth.jfraeys.com` -> A/CNAME to services (Authelia)
- `app.jfraeys.com` -> A/CNAME to services (App)
**Commented out (unused):**
- `grafana.jfraeys.com` -> A/CNAME to services (Grafana - currently disabled)
- `prometheus.jfraeys.com` -> A/CNAME to services (Prometheus - currently disabled)
To enable, uncomment the records in `terraform/main.tf`.
## TLS
Traefik uses Let's Encrypt via Cloudflare DNS-01.
You must provide these Cloudflare API tokens in your local environment when running Ansible:
- `CF_DNS_API_TOKEN`
- `CF_ZONE_API_TOKEN`
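A missing token may only surface late in a run, so a small preflight check can fail fast. This is a minimal sketch and not part of the repo; `missing_env` is a hypothetical helper name.

```python
import os

# Token names taken from the list above.
REQUIRED_ENV = ("CF_DNS_API_TOKEN", "CF_ZONE_API_TOKEN")


def missing_env(names=REQUIRED_ENV, environ=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]
```

Calling `missing_env()` with both tokens exported returns an empty list; anything else names the tokens still to be set.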
## SSO (Authelia OIDC)
Authelia is exposed at:
- `https://auth.jfraeys.com` (issuer)
- `https://auth.jfraeys.com/.well-known/openid-configuration` (discovery)
Grafana is configured via `roles/grafana` using the Generic OAuth provider.
Forgejo is configured via `roles/forgejo` using the Forgejo admin CLI with `--provider=openidConnect` and `--auto-discover-url`.
Note: Forgejo pages that ask for an "OpenID URI" are legacy OpenID 2.0 and are not used for OIDC.
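When OIDC login fails, the discovery document is a useful first thing to inspect. The sketch below builds the discovery URL from the issuer and checks a parsed document for the metadata fields that OpenID Connect Discovery marks as required. The function names are hypothetical and this is not how the roles themselves validate anything.

```python
import json
from urllib.request import urlopen

# Fields OpenID Connect Discovery 1.0 lists as REQUIRED
# (token_endpoint is required unless only the implicit flow is used).
REQUIRED_FIELDS = (
    "issuer",
    "authorization_endpoint",
    "token_endpoint",
    "jwks_uri",
    "response_types_supported",
    "subject_types_supported",
    "id_token_signing_alg_values_supported",
)


def discovery_url(issuer):
    """Build the discovery URL for an issuer."""
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def discovery_problems(doc, issuer):
    """Return human-readable problems found in a parsed discovery document."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in doc]
    # The issuer in the document must match the issuer the client was configured with.
    if "issuer" in doc and doc["issuer"] != issuer:
        problems.append(f"issuer mismatch: {doc['issuer']!r} != {issuer!r}")
    return problems


def fetch_discovery(issuer, timeout=10):
    """Fetch and parse the discovery document (performs a network request)."""
    with urlopen(discovery_url(issuer), timeout=timeout) as resp:
        return json.load(resp)
```

For this setup the call would be `discovery_problems(fetch_discovery("https://auth.jfraeys.com"), "https://auth.jfraeys.com")`, where an empty list means the document looks structurally sound.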
## Email (Postfix + Postmark)
Transactional email is delivered via Postfix relay to Postmark:
- **Sender**: `notifications@jfraeys.com`
- **Relay**: `smtp.postmarkapp.com:2525`
- **Auth**: Server token authentication
Services using email:
- Authelia (password resets)
- Alertmanager (monitoring alerts)
- Forgejo (CI/CD notifications)
### DNS Records for Email
Terraform manages these Cloudflare records:
| Record | Type | Purpose |
|--------|------|---------|
| `YYYYMMDDDDpm._domainkey` | TXT | DKIM signature |
| `pm-bounces` | CNAME | Return-path for bounces |
| `_dmarc` | TXT | DMARC policy |
Postmark validates these during account setup.
### Vault Variables
Add to `secrets/vault.yml`:
**Email (Postfix + Postmark):**
```yaml
POSTFIX_RELAYHOST_USERNAME: "your-postmark-server-token"
POSTFIX_RELAYHOST_PASSWORD: "your-postmark-server-token"
AUTHELIA_SMTP_SENDER: "notifications@jfraeys.com"
AUTHELIA_SMTP_IDENTIFIER: "jfraeys.com"
```
**Backups (Restic):**
```yaml
RESTIC_REPOSITORY: "s3:https://us-east-1.linodeobjects.com/mybucket/backups"
RESTIC_PASSWORD: "strong-encryption-password"
RESTIC_AWS_ACCESS_KEY_ID: "your-linode-access-key"
RESTIC_AWS_SECRET_ACCESS_KEY: "your-linode-secret-key"
# Optional:
RESTIC_AWS_DEFAULT_REGION: "us-east-1"
RESTIC_KEEP_DAILY: 7
RESTIC_KEEP_WEEKLY: 4
RESTIC_KEEP_MONTHLY: 6
INFRA_BACKUP_ONCALENDAR: "daily" # systemd calendar spec
```
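How the `backups` role consumes these variables is not shown here. As a rough sketch, assuming the credentials map onto the environment variables restic reads for S3 backends and the retention values map onto `restic forget`, the translation might look like this (`restic_env` and `forget_args` are hypothetical helpers):

```python
def restic_env(vault):
    """Map vault variables to the environment variables restic reads."""
    env = {
        "RESTIC_REPOSITORY": vault["RESTIC_REPOSITORY"],
        "RESTIC_PASSWORD": vault["RESTIC_PASSWORD"],
        "AWS_ACCESS_KEY_ID": vault["RESTIC_AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": vault["RESTIC_AWS_SECRET_ACCESS_KEY"],
    }
    region = vault.get("RESTIC_AWS_DEFAULT_REGION")
    if region:  # optional variable
        env["AWS_DEFAULT_REGION"] = region
    return env


def forget_args(vault):
    """Build a `restic forget` command line from the retention settings."""
    # Fallbacks mirror the example values above; the role's real defaults
    # are an assumption here.
    return [
        "restic", "forget",
        "--keep-daily", str(vault.get("RESTIC_KEEP_DAILY", 7)),
        "--keep-weekly", str(vault.get("RESTIC_KEEP_WEEKLY", 4)),
        "--keep-monthly", str(vault.get("RESTIC_KEEP_MONTHLY", 6)),
        "--prune",
    ]
```

With the example values, `forget_args` yields `restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune`.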
**Alerting (set exactly one):**
```yaml
# Slack option:
ALERTMANAGER_SLACK_WEBHOOK_URL: "https://hooks.slack.com/services/..."
ALERTMANAGER_SLACK_CHANNEL: "#alerts"
ALERTMANAGER_SLACK_USERNAME: "alertmanager"
# Discord option:
ALERTMANAGER_DISCORD_WEBHOOK_URL: "https://discord.com/api/webhooks/..."
```
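The "set exactly one" rule is easy to break when editing the vault by hand. A minimal check, assuming the decrypted vault has been loaded into a dict and that the webhook URL is what selects the channel (`alert_channel` is a hypothetical helper):

```python
def alert_channel(vault):
    """Return "slack" or "discord", or raise if zero or both are configured."""
    slack = bool(vault.get("ALERTMANAGER_SLACK_WEBHOOK_URL"))
    discord = bool(vault.get("ALERTMANAGER_DISCORD_WEBHOOK_URL"))
    if slack == discord:  # both set, or neither set
        raise ValueError(
            "set exactly one of ALERTMANAGER_SLACK_WEBHOOK_URL "
            "or ALERTMANAGER_DISCORD_WEBHOOK_URL"
        )
    return "slack" if slack else "discord"
```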
## Secrets (Ansible Vault)
Secrets are stored in `secrets/vault.yml` (encrypted).
Create your vault from the template:
- `secrets/vault.example.yml` -> `secrets/vault.yml`
Run playbooks with either:
- `--ask-vault-pass`
- or a local password file (not committed): `--vault-password-file .vault_pass`
Notes:
- `secrets/vault.yml` is intentionally gitignored
- `inventory/hosts.yml` and `inventory/host_vars/web.yml` are generated by `setup` and intentionally gitignored
## Playbooks
- `playbooks/services.yml`: deploy observability + forgejo on `services`
- `playbooks/web.yml`: deploy app-side dependencies on `web`
- `playbooks/test_config.yml`: smoke test host config and deployed stacks
- `playbooks/deploy.yml`: legacy/all-in-one deploy for the services host (no tags)
## Configuration split
- Vault (`secrets/vault.yml`): secrets (API tokens, passwords, access keys, and sensitive Terraform `TF_VAR_*` values)
- `.env`: non-secret configuration (still treated as sensitive), such as region/instance type and non-secret endpoints
## Linode Object Storage (demo apps)
If you already have a Linode Object Storage bucket, demo apps can use it via the S3-compatible API.
Recommended env vars (see `.env.example`):
- `S3_BUCKET`
- `S3_ENDPOINT` (example: `https://us-east-1.linodeobjects.com`)
- `S3_REGION`
Secrets (store in `secrets/vault.yml`):
- `S3_ACCESS_KEY_ID`
- `S3_SECRET_ACCESS_KEY`
Create a dedicated access key for demos and scope permissions as tightly as possible.
## Grafana provisioning
Grafana is provisioned with Prometheus and Loki datasources via the Grafana provisioning mechanism (no manual UI setup required).
**Note**: Grafana's DNS records are commented out. To reach it at `grafana.jfraeys.com`, uncomment the records in `terraform/main.tf`; otherwise access it directly via the services host IP. The Grafana role itself is optional and commented out in `services.yml` by default.
## Host vars
Set `inventory/host_vars/web.yml`:
- `public_ipv4`: public IPv4 of `jfraeys.com`
This is used to allowlist Loki (`services:3100`) to only the web host.
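Assuming the allowlist is enforced with UFW, as the Notes section describes, the rule derived from `public_ipv4` could be built like this. `loki_allow_rule` is a hypothetical helper, and the role's actual task may differ.

```python
import ipaddress


def loki_allow_rule(public_ipv4, port=3100):
    """Build a ufw command that admits only one source IP to the Loki port."""
    # IPv4Address raises ValueError on malformed input, so a typo in
    # host_vars cannot silently open the port to the wrong address.
    ip = ipaddress.IPv4Address(public_ipv4)
    return ["ufw", "allow", "from", str(ip), "to", "any", "port", str(port), "proto", "tcp"]
```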
## Forgejo Actions runner (web host)
A Forgejo runner is deployed on the `web` host (`roles/forgejo_runner`).
- Requires `FORGEJO_RUNNER_REGISTRATION_TOKEN` in `secrets/vault.yml`.
- Uses a single `self-hosted` label by default.
- The role auto re-registers the runner if labels change.
To force re-register (e.g. after deleting the runner in the Forgejo UI):
```bash
ansible-playbook playbooks/web.yml \
  --vault-password-file secrets/.vault_pass \
  --limit web \
  --tags forgejo_runner \
  -e forgejo_runner_force_reregister=true
```
### AI Scrapers Blocklist
Forgejo includes a weekly cron job (`roles/forgejo/update-ai-scrapers.sh`) that updates `robots.txt` to block AI scrapers (GPTBot, ClaudeBot, etc.).
### OIDC Configuration
Forgejo is configured with:
- Group claim mapping from Authelia (`groups`)
- Admin group: `admins`
- Auto-discovery from `https://auth.jfraeys.com/.well-known/openid-configuration`
## SSH from Actions to services
A workflow running on the `web` runner may need SSH access to the `services` host.
For this, the controller expects two separate SSH keys, each restricted to a forced command:
- `infra-register-stdin` (register)
- `infra-deregister` (deregister)
Public keys (installed on the `services` host via Ansible/vault):
- `SERVICE_SSH_REGISTER_PUBLIC_KEY`
- `SERVICE_SSH_DEREGISTER_PUBLIC_KEY`
Private keys (stored as Forgejo Actions secrets):
- `SERVICE_SSH_KEY_REGISTER`
- `SERVICE_SSH_KEY_DEREGISTER`
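In OpenSSH, a forced command is pinned in `authorized_keys` with the `command="..."` option, and `restrict` disables forwarding and PTY allocation. A sketch of how such entries could be rendered, assuming the forced-command names above are the literal commands (the role's actual template may differ, and `authorized_keys_line` is a hypothetical helper):

```python
def authorized_keys_line(public_key, forced_command):
    """Render one authorized_keys entry limited to a single forced command."""
    if '"' in forced_command or "\n" in forced_command:
        raise ValueError("forced command must not contain quotes or newlines")
    key = public_key.strip()
    if not key or "\n" in key:
        raise ValueError("expected a single-line public key")
    # `restrict` turns off port, agent, and X11 forwarding and PTY allocation.
    return f'restrict,command="{forced_command}" {key}'
```

Each key then gets its own line, for example one for `infra-register-stdin` and one for `infra-deregister`, so a leaked deregister key cannot be used to register.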
To generate or update both Actions secrets (and optionally update both public keys in the vault), install the Python dependencies first:
```bash
python3 -m pip install -r requirements.txt
```
Then run the helper script:
```bash
python3 scripts/forgejo_set_actions_secret.py \
  --repo jfraeysd/infra-controller \
  --generate-ssh-keys \
  --update-vault-both-public-keys
```
## Deploy
Services:
```bash
ansible-playbook playbooks/services.yml --ask-vault-pass
```
Web:
```bash
ansible-playbook playbooks/web.yml --ask-vault-pass
```
## Terraform
`./setup` will export `TF_VAR_*` from `secrets/vault.yml` (prompting for vault password if needed) and then run Terraform with a saved plan.
## Notes
- **Grafana/Prometheus/Loki**: Available as optional roles but not deployed by default (commented out in `services.yml`). Enable by uncommenting the role entries.
- Loki is exposed on `services:3100` but allowlisted in UFW to `web` only.
- Watchtower is enabled with label-based updates.
- **Traefik**: Uses file provider exclusively (Docker socket access removed). Services have static router definitions in `/opt/traefik/dynamic/base.yml`.
- **Postfix**: Relays through Postmark port 2525 (avoids ISP blocking on 587).
- **Hardening**: SSH config and unattended-upgrades managed via `hardening` role to prevent StackScript drift.
## Role layout
Services host (`services`):
- `roles/traefik` (file provider only - no Docker socket)
- `roles/postfix` (Postmark SMTP relay for transactional email)
- `roles/exporters` (node-exporter + cAdvisor)
- `roles/app` (active - DNS enabled)
- `roles/prometheus` (optional - commented out in services.yml)
- `roles/loki` (optional - commented out in services.yml)
- `roles/grafana` (optional - commented out in services.yml)
- `roles/forgejo`
- `roles/alertmanager` (uses localhost:25 Postfix relay)
- `roles/watchtower`
- `roles/hardening` (SSH hardening, unattended-upgrades)
- `roles/backups`
- `roles/fail2ban` (Docker-based fail2ban)
Web host (`web`):
- `roles/traefik`
- `roles/app_core` (optional shared Postgres/Redis)
- `roles/forgejo_runner`
- `roles/app_deployer` (CI/CD webhook and deployment automation)
- `roles/hardening` (SSH hardening, unattended-upgrades)
## App Deployment
The `app_deployer` role provides automated deployment via webhooks from Forgejo or GitHub Actions.
### Prerequisites
1. **Generate deploy token** (run once):
   ```bash
   ./scripts/gen-auth-secrets.sh  # Creates VAULT_DEPLOY_TOKEN
   # Or add to secrets/vault.yml manually
   ```
2. **Set DEPLOY_TOKEN in your app repo**:
   - **Forgejo**: Use the helper script:
     ```bash
     ./scripts/set_deploy_token.py --owner <you> --repo <app-name>
     ```
   - **GitHub**: Set the `DEPLOY_TOKEN` secret via Settings > Secrets and variables > Actions
3. **Add deploy workflow to your app repo**:
   Copy the sample workflow and customize:
   ```bash
   cp roles/app_deployer/files/forgejo-deploy-workflow.yml .forgejo/workflows/deploy.yml
   # For GitHub: cp to .github/workflows/deploy.yml
   ```
   Update the workflow for your build (Go, Rust, Node.js, etc.) and app name.
### How It Works
1. **CI builds the app** and uploads binary + checksum to `deploy@web:/opt/artifacts/`
2. **CI triggers webhook** with `X-Deploy-Token` header
3. **Webhook validates token** (timing-safe comparison) and runs deployment
4. **Ansible deploys the app**:
   - Verifies artifact checksum
   - Creates app user and directories
   - Sets up systemd service
   - Keeps last 5 versions for rollback
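Steps 3 and 4 rely on two checks: a timing-safe token comparison and an artifact checksum. The sketch below illustrates both with the Python standard library. It assumes SHA-256 and a bare hex digest as the checksum format, neither of which is stated above, and the function names are hypothetical rather than taken from the role.

```python
import hashlib
import hmac


def token_is_valid(presented, expected):
    """Compare tokens in constant time so timing does not reveal a partial match."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large binaries are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifact_is_intact(path, expected_hex):
    """Check an artifact against its published hex digest."""
    # The checksum is not a secret, so an ordinary comparison is sufficient here.
    return sha256_of(path) == expected_hex.strip().lower()
```

An ordinary `==` on the token would return as soon as the first differing byte is found, which is the behaviour `hmac.compare_digest` is designed to avoid.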
### Manual Deployment
For manual deploys or rollbacks:
```bash
# Deploy a specific version
ssh deploy@web /opt/deploy/scripts/deploy.sh my-api abc123 prod
# Rollback to previous version
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api
# Lists available versions, then:
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api <older-sha>
```
### Security Features
- **Timing-safe token validation** prevents timing attacks
- **Artifact checksums** ensure binary integrity
- **Sudoers restricted** to the deployment script only
- **Last 5 versions kept** for quick rollback
- **Per-app user**: each app runs as its own unprivileged user
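The "last 5 versions" retention could be implemented along these lines. This sketch assumes one directory per deployed version under a releases directory and uses modification time to order them; the actual layout under `/opt` is not documented here, and `prune_releases` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def prune_releases(releases_dir, keep=5):
    """Delete all but the `keep` most recently modified release directories."""
    releases = sorted(
        (p for p in Path(releases_dir).iterdir() if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,  # newest first
    )
    removed = releases[keep:]
    for old in removed:
        shutil.rmtree(old)
    return [p.name for p in removed]
```

Returning the removed names makes the step easy to log, which helps when a rollback target turns out to be gone.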
### Troubleshooting
```bash
# Check webhook logs
ssh web sudo journalctl -u webhook -f
# Check deploy logs
ssh web sudo cat /var/log/deploy.log
# Verify systemd service
ssh web sudo systemctl status my-api
```