# infra

## Overview

This repo manages two hosts:

- `web` (`jfraeys.com`)
- `services` (`services.jfraeys.com`)

The routing convention is `service.server.jfraeys.com`.

Examples:

- `git.jfraeys.com` -> services host (Forgejo)
- `auth.jfraeys.com` -> services host (Authelia)
- `app.jfraeys.com` -> services host (App)

Traefik runs on both servers and routes only the services running on that server.

## Quickstart

This repo is intended to be driven by `setup`:

```bash
./setup
```

For options:

```bash
./setup --help
```

What it does:

- Applies Terraform from `terraform/`
- Writes `inventory/hosts.yml` and `inventory/host_vars/web.yml` (gitignored)
- Runs `playbooks/services.yml` and `playbooks/web.yml`

If you want Terraform only:

```bash
./setup --no-ansible
```

If you want Ansible only (requires an existing `inventory/hosts.yml`):

```bash
./setup --ansible-only
```

## Prereqs (local)

- `terraform`
- `ansible`
- `python3` (for helper scripts)
- `pip` / `python3 -m pip`
- SSH access to the hosts

If your SSH key is passphrase-protected, you must load it into your agent before running Ansible non-interactively. The example below uses the macOS-specific `--apple-use-keychain` flag:

```bash
ssh-add --apple-use-keychain ~/.ssh/id_ed25519
```

## DNS (Cloudflare)

Create A/CNAME records that point to the correct server IP.

**Active records:**

- `jfraeys.com` -> A record to web server IPv4
- `services.jfraeys.com` -> A record to services server IPv4
- `git.jfraeys.com` -> A/CNAME to services (Forgejo)
- `auth.jfraeys.com` -> A/CNAME to services (Authelia)
- `app.jfraeys.com` -> A/CNAME to services (App)

**Commented out (unused):**

- `grafana.jfraeys.com` -> A/CNAME to services (Grafana - currently disabled)
- `prometheus.jfraeys.com` -> A/CNAME to services (Prometheus - currently disabled)

To enable, uncomment the records in `terraform/main.tf`.

## TLS

Traefik uses Let’s Encrypt via Cloudflare DNS-01.

You must provide the following Cloudflare API tokens in your local environment when running Ansible:

- `CF_DNS_API_TOKEN`
- `CF_ZONE_API_TOKEN`

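A preflight check can catch a missing token before a playbook run fails partway through. The sketch below is not part of this repo; the helper name `missing_tokens` is hypothetical, and only the two variable names come from the list above.

```python
import os

# Variable names from the TLS section; the helper itself is illustrative.
REQUIRED_TOKENS = ("CF_DNS_API_TOKEN", "CF_ZONE_API_TOKEN")


def missing_tokens(environ=None):
    """Return the required token names that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_TOKENS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
print(missing_tokens({"CF_DNS_API_TOKEN": "example-token"}))
```

Calling `missing_tokens()` with no argument checks the real environment, so it could be run just before `ansible-playbook`.
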
## SSO (Authelia OIDC)

Authelia is exposed at:

- `https://auth.jfraeys.com` (issuer)
- `https://auth.jfraeys.com/.well-known/openid-configuration` (discovery)

Grafana is configured via `roles/grafana` using the Generic OAuth provider.

Forgejo is configured via `roles/forgejo` using the Forgejo admin CLI with `--provider=openidConnect` and `--auto-discover-url`.

Note: Forgejo pages that ask for an "OpenID URI" are legacy OpenID 2.0 and are not used for OIDC.

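When debugging SSO, it can help to confirm that the discovery document advertises the expected issuer. The following is a minimal sketch using only the standard library; the function names and the choice of fields to check are assumptions, not something the roles in this repo do.

```python
import json
from urllib.request import urlopen

ISSUER = "https://auth.jfraeys.com"
# Fields an OIDC client typically needs; this selection is an assumption.
EXPECTED_FIELDS = ("issuer", "authorization_endpoint", "token_endpoint", "jwks_uri")


def discovery_url(issuer):
    """Build the well-known discovery URL for an issuer."""
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def check_discovery(doc, expected_issuer=ISSUER):
    """Return a list of problems found in a discovery document (empty if none)."""
    problems = [f"missing field: {name}" for name in EXPECTED_FIELDS if name not in doc]
    if doc.get("issuer") != expected_issuer:
        problems.append(f"issuer mismatch: {doc.get('issuer')!r}")
    return problems


def fetch_discovery(issuer=ISSUER, timeout=10):
    """Download and parse the discovery document (requires network access)."""
    with urlopen(discovery_url(issuer), timeout=timeout) as response:
        return json.load(response)
```

With network access, `check_discovery(fetch_discovery())` returns an empty list when the document looks consistent.
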
## Email (Postfix + Postmark)

Transactional email is delivered via Postfix relay to Postmark:

- **Sender**: `notifications@jfraeys.com`
- **Relay**: `smtp.postmarkapp.com:2525`
- **Auth**: Server token authentication

Services using email:

- Authelia (password resets)
- Alertmanager (monitoring alerts)
- Forgejo (CI/CD notifications)

### DNS Records for Email

Terraform manages these Cloudflare records:

| Record | Type | Purpose |
|--------|------|---------|
| `YYYYMMDDDDpm._domainkey` | TXT | DKIM signature |
| `pm-bounces` | CNAME | Return-path for bounces |
| `_dmarc` | TXT | DMARC policy |

Postmark validates these during account setup.

### Vault Variables

Add to `secrets/vault.yml`:

**Email (Postfix + Postmark):**

```yaml
POSTFIX_RELAYHOST_USERNAME: "your-postmark-server-token"
POSTFIX_RELAYHOST_PASSWORD: "your-postmark-server-token"
AUTHELIA_SMTP_SENDER: "notifications@jfraeys.com"
AUTHELIA_SMTP_IDENTIFIER: "jfraeys.com"
```

**Backups (Restic):**

```yaml
RESTIC_REPOSITORY: "s3:https://us-east-1.linodeobjects.com/mybucket/backups"
RESTIC_PASSWORD: "strong-encryption-password"
RESTIC_AWS_ACCESS_KEY_ID: "your-linode-access-key"
RESTIC_AWS_SECRET_ACCESS_KEY: "your-linode-secret-key"
# Optional:
RESTIC_AWS_DEFAULT_REGION: "us-east-1"
RESTIC_KEEP_DAILY: 7
RESTIC_KEEP_WEEKLY: 4
RESTIC_KEEP_MONTHLY: 6
INFRA_BACKUP_ONCALENDAR: "daily"  # systemd calendar spec
```

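To make the Restic variables concrete, here is a sketch of how they could translate into a restic invocation. How `roles/backups` actually consumes them is not shown in this README, so the mapping below is an assumption; the `restic forget` flags and the `AWS_*` environment variable names are the ones restic itself reads.

```python
import shlex


def restic_environment(vault):
    """Map vault variables to the environment variables restic reads (assumed mapping)."""
    env = {
        "RESTIC_REPOSITORY": vault["RESTIC_REPOSITORY"],
        "RESTIC_PASSWORD": vault["RESTIC_PASSWORD"],
        "AWS_ACCESS_KEY_ID": vault["RESTIC_AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": vault["RESTIC_AWS_SECRET_ACCESS_KEY"],
    }
    if "RESTIC_AWS_DEFAULT_REGION" in vault:
        env["AWS_DEFAULT_REGION"] = vault["RESTIC_AWS_DEFAULT_REGION"]
    return env


def restic_forget_command(keep_daily=7, keep_weekly=4, keep_monthly=6):
    """Build a retention command from the RESTIC_KEEP_* values."""
    return [
        "restic", "forget",
        "--keep-daily", str(keep_daily),
        "--keep-weekly", str(keep_weekly),
        "--keep-monthly", str(keep_monthly),
        "--prune",
    ]


print(shlex.join(restic_forget_command()))
```
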
**Alerting (set exactly one):**

```yaml
# Slack option:
ALERTMANAGER_SLACK_WEBHOOK_URL: "https://hooks.slack.com/services/..."
ALERTMANAGER_SLACK_CHANNEL: "#alerts"
ALERTMANAGER_SLACK_USERNAME: "alertmanager"

# Discord option:
ALERTMANAGER_DISCORD_WEBHOOK_URL: "https://discord.com/api/webhooks/..."
```

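The "set exactly one" rule for alerting can be checked mechanically. This is a hypothetical validation helper, not code from the repo; it only assumes that the vault has been loaded into a dictionary.

```python
def alert_channel(vault):
    """Return "slack" or "discord", or raise if zero or both webhooks are set."""
    slack = bool(vault.get("ALERTMANAGER_SLACK_WEBHOOK_URL"))
    discord = bool(vault.get("ALERTMANAGER_DISCORD_WEBHOOK_URL"))
    if slack == discord:  # both set, or neither set
        raise ValueError(
            "set exactly one of ALERTMANAGER_SLACK_WEBHOOK_URL "
            "and ALERTMANAGER_DISCORD_WEBHOOK_URL"
        )
    return "slack" if slack else "discord"
```
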
## Secrets (Ansible Vault)

Secrets are stored in `secrets/vault.yml` (encrypted).

Create your vault from the template:

- `secrets/vault.example.yml` -> `secrets/vault.yml`

Run playbooks with either:

- `--ask-vault-pass`
- or a local password file (not committed): `--vault-password-file .vault_pass`

Notes:

- `secrets/vault.yml` is intentionally gitignored
- `inventory/hosts.yml` and `inventory/host_vars/web.yml` are generated by `setup` and intentionally gitignored

## Playbooks

- `playbooks/services.yml`: deploy observability + forgejo on `services`
- `playbooks/web.yml`: deploy app-side dependencies on `web`
- `playbooks/test_config.yml`: smoke test host config and deployed stacks
- `playbooks/deploy.yml`: legacy/all-in-one deploy for the services host (no tags)

## Configuration split

- Vault (`secrets/vault.yml`): secrets (API tokens, passwords, access keys, and sensitive Terraform `TF_VAR_*` values)
- `.env`: non-secret configuration (still treated as sensitive), such as region/instance type and non-secret endpoints

## Linode Object Storage (demo apps)

If you already have a Linode Object Storage bucket, demo apps can use it via the S3-compatible API.

Recommended env vars (see `.env.example`):

- `S3_BUCKET`
- `S3_ENDPOINT` (example: `https://us-east-1.linodeobjects.com`)
- `S3_REGION`

Secrets (store in `secrets/vault.yml`):

- `S3_ACCESS_KEY_ID`
- `S3_SECRET_ACCESS_KEY`

Create a dedicated access key for demos and scope permissions as tightly as possible.

## Grafana provisioning

Grafana is provisioned with Prometheus and Loki datasources via the Grafana provisioning mechanism (no manual UI setup required).

**Note**: The Grafana DNS records are commented out. To reach Grafana at `grafana.jfraeys.com`, uncomment the records in `terraform/main.tf`; otherwise access it directly via the services host IP. The Grafana role itself is optional and is commented out in `services.yml` by default.

## Host vars

Set `inventory/host_vars/web.yml`:

- `public_ipv4`: public IPv4 of `jfraeys.com`

This value is used to restrict access to Loki (`services:3100`) to the web host only.

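As an illustration of how `public_ipv4` could feed the Loki allowlist, the sketch below validates the address and builds a UFW rule. The helper is hypothetical and the actual tasks in this repo are not shown here; `203.0.113.10` is a documentation-range placeholder, not the real web host address.

```python
import ipaddress

LOKI_PORT = 3100  # Loki port on the services host, as stated above


def loki_allow_rule(public_ipv4):
    """Build a `ufw allow` command restricted to a single source address."""
    # IPv4Address raises a ValueError subclass on malformed input.
    address = ipaddress.IPv4Address(public_ipv4)
    return [
        "ufw", "allow", "proto", "tcp",
        "from", str(address),
        "to", "any", "port", str(LOKI_PORT),
    ]


print(" ".join(loki_allow_rule("203.0.113.10")))
```
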
## Forgejo Actions runner (web host)

A Forgejo runner is deployed on the `web` host (`roles/forgejo_runner`).

- Requires `FORGEJO_RUNNER_REGISTRATION_TOKEN` in `secrets/vault.yml`.
- Uses a single `self-hosted` label by default.
- The role automatically re-registers the runner if its labels change.

To force re-registration (e.g. after deleting the runner in the Forgejo UI):

```bash
ansible-playbook playbooks/web.yml \
  --vault-password-file secrets/.vault_pass \
  --limit web \
  --tags forgejo_runner \
  -e forgejo_runner_force_reregister=true
```

### AI Scrapers Blocklist

Forgejo includes a weekly cron job (`roles/forgejo/update-ai-scrapers.sh`) that updates `robots.txt` to block AI scrapers (GPTBot, ClaudeBot, etc.).

### OIDC Configuration

Forgejo is configured with:

- Group claim mapping from Authelia (`groups`)
- Admin group: `admins`
- Auto-discovery from `https://auth.jfraeys.com/.well-known/openid-configuration`

## SSH from Actions to services

If a workflow running on the `web` runner needs SSH access to the `services` host, the controller expects two separate SSH keys, each restricted to a forced command:

- `infra-register-stdin` (register)
- `infra-deregister` (deregister)

Public keys (installed on the `services` host via Ansible/vault):

- `SERVICE_SSH_REGISTER_PUBLIC_KEY`
- `SERVICE_SSH_DEREGISTER_PUBLIC_KEY`

Private keys (stored as Forgejo Actions secrets):

- `SERVICE_SSH_KEY_REGISTER`
- `SERVICE_SSH_KEY_DEREGISTER`

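For readers unfamiliar with forced commands, this sketch shows what an `authorized_keys` entry for one of these keys might look like. The exact options the Ansible role writes are not documented here, so the use of OpenSSH's `restrict` option is an assumption, and the key material is a placeholder.

```python
def forced_command_entry(command, public_key):
    """Build an authorized_keys line that only permits one command."""
    if '"' in command:
        raise ValueError("command must not contain double quotes")
    # `restrict` disables forwarding and PTY allocation (OpenSSH 7.2+).
    return f'restrict,command="{command}" {public_key.strip()}'


# Placeholder key material; real keys come from the vault variables above.
print(forced_command_entry("infra-register-stdin", "ssh-ed25519 AAAAEXAMPLE ci-register"))
```

With an entry like this, whatever command the CI job requests over SSH is replaced by the forced one, which limits what a leaked private key can do.
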
To generate or update both Actions secrets (and optionally update both public keys in the vault), first install the Python dependencies:

```bash
python3 -m pip install -r requirements.txt
```

Then run the helper script:

```bash
python3 scripts/forgejo_set_actions_secret.py \
  --repo jfraeysd/infra-controller \
  --generate-ssh-keys \
  --update-vault-both-public-keys
```

## Deploy

Services:

```bash
ansible-playbook playbooks/services.yml --ask-vault-pass
```

Web:

```bash
ansible-playbook playbooks/web.yml --ask-vault-pass
```

## Terraform

`./setup` will export `TF_VAR_*` from `secrets/vault.yml` (prompting for vault password if needed) and then run Terraform with a saved plan.

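The flow described above can be sketched roughly as follows. This is not the `setup` script itself: the function name, the plan file name `tfplan`, and the example variable `TF_VAR_linode_token` are all hypothetical, and the vault is assumed to be already decrypted into a dictionary.

```python
import os

# Saved-plan workflow: plan to a file, then apply exactly that file.
PLAN_COMMAND = ["terraform", "plan", "-out=tfplan"]
APPLY_COMMAND = ["terraform", "apply", "tfplan"]


def terraform_env(vault_vars, base_env=None):
    """Copy the environment and add only the TF_VAR_* entries from the vault."""
    env = dict(os.environ if base_env is None else base_env)
    for key, value in vault_vars.items():
        if key.startswith("TF_VAR_"):
            env[key] = str(value)
    return env


# Hypothetical vault content; only the TF_VAR_ entry is exported.
example = terraform_env(
    {"TF_VAR_linode_token": "example", "RESTIC_PASSWORD": "not-exported"},
    base_env={"PATH": "/usr/bin"},
)
print(sorted(example))
```

The resulting mapping could be passed as `env=` to `subprocess.run(PLAN_COMMAND, cwd="terraform", env=...)`, followed by the apply command.
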
## Notes

- **Grafana/Prometheus/Loki**: Available as optional roles but not deployed by default (commented out in `services.yml`). Enable by uncommenting the role entries.
- Loki is exposed on `services:3100` but allowlisted in UFW to `web` only.
- Watchtower is enabled with label-based updates.
- **Traefik**: Uses file provider exclusively (Docker socket access removed). Services have static router definitions in `/opt/traefik/dynamic/base.yml`.
- **Postfix**: Relays through Postmark port 2525 (avoids ISP blocking on 587).
- **Hardening**: SSH config and unattended-upgrades managed via `hardening` role to prevent StackScript drift.

## Role layout

Services host (`services`):

- `roles/traefik` (file provider only - no Docker socket)
- `roles/postfix` (Postmark SMTP relay for transactional email)
- `roles/exporters` (node-exporter + cAdvisor)
- `roles/app` (active - DNS enabled)
- `roles/prometheus` (optional - commented out in services.yml)
- `roles/loki` (optional - commented out in services.yml)
- `roles/grafana` (optional - commented out in services.yml)
- `roles/forgejo`
- `roles/alertmanager` (uses localhost:25 Postfix relay)
- `roles/watchtower`
- `roles/hardening` (SSH hardening, unattended-upgrades)
- `roles/backups`
- `roles/fail2ban` (Docker-based fail2ban)

Web host (`web`):

- `roles/traefik`
- `roles/app_core` (optional shared Postgres/Redis)
- `roles/forgejo_runner`
- `roles/app_deployer` (CI/CD webhook and deployment automation)
- `roles/hardening` (SSH hardening, unattended-upgrades)

## App Deployment

The `app_deployer` role provides automated deployment via webhooks from Forgejo or GitHub Actions.

### Prerequisites

1. **Generate deploy token** (run once):

   ```bash
   ./scripts/gen-auth-secrets.sh  # Creates VAULT_DEPLOY_TOKEN
   # Or add to secrets/vault.yml manually
   ```

2. **Set DEPLOY_TOKEN in your app repo**:

   - **Forgejo**: Use the helper script:

     ```bash
     ./scripts/set_deploy_token.py --owner <you> --repo <app-name>
     ```

   - **GitHub**: Set `DEPLOY_TOKEN` secret via Settings > Secrets and variables > Actions

3. **Add deploy workflow to your app repo**:

   Copy the sample workflow and customize:

   ```bash
   cp roles/app_deployer/files/forgejo-deploy-workflow.yml .forgejo/workflows/deploy.yml
   # For GitHub: cp to .github/workflows/deploy.yml
   ```

   Update the workflow for your build (Go, Rust, Node.js, etc.) and app name.

### How It Works

1. **CI builds the app** and uploads binary + checksum to `deploy@web:/opt/artifacts/`
2. **CI triggers webhook** with `X-Deploy-Token` header
3. **Webhook validates token** (timing-safe comparison) and runs deployment
4. **Ansible deploys the app**:
   - Verifies artifact checksum
   - Creates app user and directories
   - Sets up systemd service
   - Keeps last 5 versions for rollback

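Steps 3 and 4 rely on two small checks, sketched here with the Python standard library. This is not the webhook's actual code; the function names are hypothetical, and SHA-256 is an assumption since the README does not name the checksum algorithm.

```python
import hashlib
import hmac


def token_is_valid(presented, expected):
    """Compare the X-Deploy-Token value in constant time."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Check an artifact against the checksum uploaded next to it."""
    return sha256_of(path) == expected_hex.strip().lower()
```

`hmac.compare_digest` takes the same time regardless of where the first differing byte is, which is what prevents an attacker from guessing the token one character at a time.
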
### Manual Deployment

For manual deploys or rollbacks:

```bash
# Deploy a specific version
ssh deploy@web /opt/deploy/scripts/deploy.sh my-api abc123 prod

# Rollback to previous version
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api
# Lists available versions, then:
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api <older-sha>
```

### Security Features

- **Timing-safe token validation** prevents timing attacks
- **Artifact checksums** ensure binary integrity
- **Sudoers restricted** to only deployment script
- **Last 5 versions kept** for quick rollback
- **Deploy user** runs as unprivileged user per app

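The "last 5 versions" retention rule can be expressed as a small pure function. This is an illustrative sketch only; how the deploy scripts track versions on disk is not described here, so the `(version, deployed_at)` pairs are a hypothetical representation.

```python
def split_releases(releases, keep=5):
    """Split (version, deployed_at) pairs into kept and prunable versions.

    The newest `keep` releases are retained for rollback; the rest can be removed.
    """
    ordered = sorted(releases, key=lambda release: release[1], reverse=True)
    kept = [version for version, _ in ordered[:keep]]
    pruned = [version for version, _ in ordered[keep:]]
    return kept, pruned


# Seven hypothetical releases with increasing deploy timestamps.
history = [(f"v{n}", n) for n in range(1, 8)]
print(split_releases(history))
```
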
### Troubleshooting

```bash
# Check webhook logs
ssh web sudo journalctl -u webhook -f

# Check deploy logs
ssh web sudo cat /var/log/deploy.log

# Verify systemd service
ssh web sudo systemctl status my-api
```