infra/README.md
Jeremie Fraeys f9a7411cfb
chore(setup): improve setup.sh UX and update README
- Add --help and ansible-only/no-terraform modes
- Add basic prereq checks and clearer error messages
- Update README with new setup options and python requirements for helper scripts
2026-01-20 17:19:06 -05:00

# infra
## Overview
This repo manages two hosts:
- `web` (`jfraeys.com`)
- `services` (`services.jfraeys.com`)
The routing convention is `<service>.jfraeys.com`, with each name resolving to the host that runs the service.
Examples:
- `grafana.jfraeys.com` -> services host
- `git.jfraeys.com` -> services host
Traefik runs on both servers and routes only the services running on that server.
## Quickstart
This repo is intended to be driven by `setup.sh`:
```bash
./setup.sh
```
For options:
```bash
./setup.sh --help
```
What it does:
- Applies Terraform from `terraform/`
- Writes `inventory/hosts.yml` and `inventory/host_vars/web.yml` (gitignored)
- Runs `playbooks/services.yml` and `playbooks/app.yml`
If you want Terraform only:
```bash
./setup.sh --no-ansible
```
If you want Ansible only (requires an existing `inventory/hosts.yml`):
```bash
./setup.sh --ansible-only
```
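As a rough guide to what each mode covers, the sketch below prints (without running) manual commands that approximately correspond to the steps listed above. It is an assumption-laden outline, not the contents of `setup.sh`: the `setup_steps` helper, the `tfplan` file name, and the explicit `-i inventory/hosts.yml` flag are all hypothetical, and inventory generation and vault handling are left out because this README does not describe how the script does them.

```bash
# Print (not run) manual commands that roughly match a setup mode.
# $1: empty (default), --no-ansible, or --ansible-only.
# Hypothetical helper; tfplan and the -i flag are assumptions.
setup_steps() {
  local mode="${1:-}"
  if [ "$mode" != "--ansible-only" ]; then
    echo "terraform -chdir=terraform plan -out=tfplan"
    echo "terraform -chdir=terraform apply tfplan"
  fi
  if [ "$mode" != "--no-ansible" ]; then
    echo "ansible-playbook -i inventory/hosts.yml playbooks/services.yml"
    echo "ansible-playbook -i inventory/hosts.yml playbooks/app.yml"
  fi
}
```

Calling `setup_steps --no-ansible` lists only the Terraform commands, and `setup_steps --ansible-only` lists only the playbook runs.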
## Prereqs (local)
- `terraform`
- `ansible`
- `python3` (for helper scripts)
- `pip` / `python3 -m pip`
- SSH access to the hosts
If your SSH key is passphrase-protected, load it into your SSH agent before running Ansible non-interactively. The `--apple-use-keychain` flag is macOS-specific; omit it on other systems:
```bash
ssh-add --apple-use-keychain ~/.ssh/id_ed25519
```
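The command-line prerequisites above can be checked before running anything. This is a minimal sketch; `missing_tools` is a hypothetical helper and is not necessarily how `setup.sh` performs its own checks.

```bash
# Print each required command that is not found on PATH, one per line.
# Prints nothing when every command is available.
# Hypothetical helper, not taken from setup.sh.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage (tool list taken from the prerequisites above):
#   missing_tools terraform ansible python3 ssh
```

An empty result means all listed tools were found.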
## DNS (Cloudflare)
Create A or CNAME records that resolve to the correct server.
Recommended:
- `jfraeys.com` -> A record to web server IPv4
- `services.jfraeys.com` -> A record to services server IPv4
- `grafana.jfraeys.com` -> A/CNAME to services
- `git.jfraeys.com` -> A/CNAME to services
## TLS
Traefik uses Let's Encrypt via the Cloudflare DNS-01 challenge.
You must provide a Cloudflare API token in your local environment when running Ansible:
- `CF_DNS_API_TOKEN` (preferred)
- or `TF_VAR_cloudflare_api_token`
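The preference order above can be expressed as a shell fallback. This is only a sketch of the stated precedence; `cloudflare_token` is a hypothetical helper, and how the playbooks actually read the token is not shown in this README.

```bash
# Print the Cloudflare API token, preferring CF_DNS_API_TOKEN and falling
# back to TF_VAR_cloudflare_api_token. Returns non-zero if neither is set.
# Hypothetical helper illustrating the documented precedence.
cloudflare_token() {
  local token="${CF_DNS_API_TOKEN:-${TF_VAR_cloudflare_api_token:-}}"
  [ -n "$token" ] || return 1
  printf '%s\n' "$token"
}

# Usage: fail early if no token is available.
#   cloudflare_token >/dev/null || echo "Cloudflare token is not set" >&2
```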
## SSO (Authelia OIDC)
Authelia is exposed at:
- `https://auth.jfraeys.com` (issuer)
- `https://auth.jfraeys.com/.well-known/openid-configuration` (discovery)
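To confirm that the discovery document advertises the expected issuer, something like the following can help. It is a sketch that assumes `curl` is installed and uses `python3` (already a prerequisite) to parse JSON; `oidc_issuer` is a hypothetical helper name.

```bash
# Read an OIDC discovery document on stdin and print its "issuer" field.
# Hypothetical helper; relies on python3 for JSON parsing.
oidc_issuer() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["issuer"])'
}

# Usage against the live endpoint (requires network access and curl):
#   curl -fsS https://auth.jfraeys.com/.well-known/openid-configuration | oidc_issuer
```

The printed value should match the issuer URL listed above.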
Grafana is configured via `roles/grafana` using the Generic OAuth provider.
Forgejo is configured via `roles/forgejo` using the Forgejo admin CLI with `--provider=openidConnect` and `--auto-discover-url`.
Note: Forgejo pages that ask for an "OpenID URI" are legacy OpenID 2.0 and are not used for OIDC.
## Secrets (Ansible Vault)
Secrets are stored in `secrets/vault.yml` (encrypted).
Create your vault from the template:
- `secrets/vault.example.yml` -> `secrets/vault.yml`
Run playbooks with either:
- `--ask-vault-pass`
- or a local password file (not committed): `--vault-password-file .vault_pass`
Notes:
- `secrets/vault.yml` is intentionally gitignored
- `inventory/hosts.yml` and `inventory/host_vars/web.yml` are generated by `setup.sh` and intentionally gitignored
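The two ways of supplying the vault password can be wrapped in a small helper that uses the password file when present and prompts otherwise. This is a sketch; `vault_args` is a hypothetical function, and the default `.vault_pass` path is taken from the option shown above.

```bash
# Print the vault argument for ansible-playbook: use the password file when
# it exists, otherwise fall back to an interactive prompt.
# $1: password file path (defaults to .vault_pass). Hypothetical helper.
vault_args() {
  local pass_file="${1:-.vault_pass}"
  if [ -f "$pass_file" ]; then
    printf '%s %s\n' "--vault-password-file" "$pass_file"
  else
    printf '%s\n' "--ask-vault-pass"
  fi
}

# Usage (word splitting of the output is intentional, so the path must not
# contain spaces):
#   ansible-playbook playbooks/services.yml $(vault_args)
```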
## Playbooks
- `playbooks/services.yml`: deploy the observability stack and Forgejo on `services`
- `playbooks/app.yml`: deploy app-side dependencies on `web`
- `playbooks/test_config.yml`: smoke test host config and deployed stacks
- `playbooks/deploy.yml`: legacy/all-in-one deploy for the services host (no tags)
## Configuration split
- Vault (`secrets/vault.yml`): secrets (API tokens, passwords, access keys, and sensitive Terraform `TF_VAR_*` values)
- `.env`: non-secret configuration (still treated as sensitive), such as region/instance type and non-secret endpoints
## Linode Object Storage (demo apps)
If you already have a Linode Object Storage bucket, demo apps can use it via the S3-compatible API.
Recommended env vars (see `.env.example`):
- `S3_BUCKET`
- `S3_ENDPOINT` (example: `https://us-east-1.linodeobjects.com`)
- `S3_REGION`
Secrets (store in `secrets/vault.yml`):
- `S3_ACCESS_KEY_ID`
- `S3_SECRET_ACCESS_KEY`
Create a dedicated access key for demos and scope permissions as tightly as possible.
## Grafana provisioning
Grafana is provisioned with Prometheus and Loki datasources via the Grafana provisioning mechanism (no manual UI setup required).
## Host vars
Set `inventory/host_vars/web.yml`:
- `public_ipv4`: public IPv4 of `jfraeys.com`
This value is used to restrict access to Loki (`services:3100`) to the web host only.
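If the file has to be written by hand rather than generated by `setup.sh`, a helper along these lines could do it. This is a sketch only: `write_web_host_vars` is a hypothetical function, the format check is shallow (it does not validate octet ranges), and the function overwrites the target file, so any other variables in it would be lost.

```bash
# Write public_ipv4 into a host_vars file after a basic dotted-quad check.
# $1: IPv4 address, $2: output path (defaults to the path named above).
# Hypothetical helper; overwrites the target file.
write_web_host_vars() {
  local ip="$1" out="${2:-inventory/host_vars/web.yml}"
  if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "not an IPv4 address: $ip" >&2
    return 1
  fi
  mkdir -p "$(dirname "$out")"
  printf 'public_ipv4: "%s"\n' "$ip" > "$out"
}

# Usage (203.0.113.10 is a documentation-range placeholder address):
#   write_web_host_vars 203.0.113.10
```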
## Forgejo Actions runner (web host)
A Forgejo runner is deployed on the `web` host (`roles/forgejo_runner`).
- Requires `FORGEJO_RUNNER_REGISTRATION_TOKEN` in `secrets/vault.yml`.
- Uses a single `self-hosted` label by default.
- The role automatically re-registers the runner when its labels change.
To force re-register (e.g. after deleting the runner in Forgejo UI):
```bash
ansible-playbook playbooks/app.yml \
--vault-password-file secrets/.vault_pass \
--limit web \
--tags forgejo_runner \
-e forgejo_runner_force_reregister=true
```
## SSH from Actions to services
If a workflow running on the `web` runner needs SSH access to the `services` host, the controller expects two separate SSH keys, each restricted to a forced command:
- `infra-register-stdin` (register)
- `infra-deregister` (deregister)
Public keys (installed on the `services` host via Ansible/vault):
- `SERVICE_SSH_REGISTER_PUBLIC_KEY`
- `SERVICE_SSH_DEREGISTER_PUBLIC_KEY`
Private keys (stored as Forgejo Actions secrets):
- `SERVICE_SSH_KEY_REGISTER`
- `SERVICE_SSH_KEY_DEREGISTER`
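For context, OpenSSH enforces a forced command through options placed in front of the public key in `authorized_keys`. The sketch below builds such an entry; `forced_command_entry` is a hypothetical helper, and it assumes `infra-register-stdin` and `infra-deregister` are the literal commands to run, which this README does not confirm. How the Ansible role actually installs the keys is not shown here.

```bash
# Print an authorized_keys entry that pins a public key to one command.
# $1: forced command, $2: public key line (type, base64 key, optional comment).
# "restrict" disables port forwarding, PTY allocation, and similar features.
# Hypothetical helper; the real entries are managed by the Ansible role.
forced_command_entry() {
  printf 'command="%s",restrict %s\n' "$1" "$2"
}

# Usage (the key material is a placeholder):
#   forced_command_entry infra-deregister "ssh-ed25519 AAAA... actions-deregister"
```

With an entry like this, the key can only trigger the named command, regardless of what the client requests.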
To generate or update both Actions secrets (and optionally update both public keys in the vault), install the Python dependencies, then run the helper script:
```bash
python3 -m pip install -r requirements.txt
```
```bash
python3 scripts/forgejo_set_actions_secret.py \
--repo jfraeysd/infra-controller \
--generate-ssh-keys \
--update-vault-both-public-keys
```
## Deploy
Services:
```bash
ansible-playbook playbooks/services.yml --ask-vault-pass
```
Web:
```bash
ansible-playbook playbooks/app.yml --ask-vault-pass
```
## Terraform
`./setup.sh` exports the `TF_VAR_*` values from `secrets/vault.yml` (prompting for the vault password if needed), then runs Terraform with a saved plan.
## Notes
- Loki is exposed on `services:3100` but allowlisted in UFW to `web` only.
- Watchtower is enabled with label-based updates.
- Airflow/Spark are intentionally optional and can be enabled later via `deploy_airflow` / `deploy_spark`.
## Role layout
Services host (`services`):
- `roles/traefik`
- `roles/exporters` (node-exporter + cAdvisor)
- `roles/prometheus`
- `roles/loki`
- `roles/grafana`
- `roles/forgejo`
- `roles/watchtower`
Web host (`web`):
- `roles/traefik`
- `roles/app_core` (optional shared Postgres/Redis)
- `roles/forgejo_runner`