infra/README.md
Jeremie Fraeys ac19b5918f
Add documentation and infrastructure gap analysis
- Update README.md with current architecture documentation
- Add INFRA_GAP_ANALYSIS.md for tracking infrastructure improvements
- Add .python-version for pyenv version management
2026-02-21 18:30:33 -05:00

# infra
## Overview
This repo manages two hosts:
- `web` (`jfraeys.com`)
- `services` (`services.jfraeys.com`)
The routing convention is `<service>.jfraeys.com`, with each hostname pointing at the server that runs that service.
Examples:
- `grafana.jfraeys.com` -> services host
- `git.jfraeys.com` -> services host
Traefik runs on both servers and routes only the services running on that server.
## Quickstart
This repo is intended to be driven by `setup.sh`:
```bash
./setup.sh
```
For options:
```bash
./setup.sh --help
```
What it does:
- Applies Terraform from `terraform/`
- Writes `inventory/hosts.yml` and `inventory/host_vars/web.yml` (gitignored)
- Runs `playbooks/services.yml` and `playbooks/web.yml`
If you want Terraform only:
```bash
./setup.sh --no-ansible
```
If you want Ansible only (requires an existing `inventory/hosts.yml`):
```bash
./setup.sh --ansible-only
```
## Prereqs (local)
- `terraform`
- `ansible`
- `python3` (for helper scripts)
- `pip` / `python3 -m pip`
- SSH access to the hosts
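
A quick preflight check can catch a missing tool before `setup.sh` gets halfway through. The sketch below is not part of this repo; `missing_tools` is a hypothetical helper name, and the tool list simply mirrors the prerequisites above.

```bash
# Hypothetical helper (not part of this repo): print each tool that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example call; prints nothing when everything is installed:
#   missing_tools terraform ansible-playbook python3 ssh
```

Empty output means every listed tool was found.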
If your SSH key is passphrase-protected, load it into your agent before running Ansible non-interactively. The `--apple-use-keychain` flag is macOS-only; on Linux, plain `ssh-add` works:
```bash
ssh-add --apple-use-keychain ~/.ssh/id_ed25519
```
## DNS (Cloudflare)
Create A or CNAME records so that each hostname resolves to the correct server.
Recommended:
- `jfraeys.com` -> A record to web server IPv4
- `services.jfraeys.com` -> A record to services server IPv4
- `grafana.jfraeys.com` -> A/CNAME to services
- `git.jfraeys.com` -> A/CNAME to services
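
To confirm the records after creating them, something like the following can compare what a hostname resolves to against the IP you expect. This is a minimal sketch, assuming `dig` is installed locally; `resolve` and `check_record` are hypothetical names, and the IP in the usage line is a documentation placeholder.

```bash
# Hypothetical helpers (not part of this repo).
# resolve: last line of dig's short answer, which is the A record even behind a CNAME.
resolve() { dig +short A "$1" | tail -n 1; }

# check_record HOST EXPECTED_IP: report whether HOST resolves to EXPECTED_IP.
check_record() {
  local host="$1" expected="$2" actual
  actual="$(resolve "$host")"
  if [ "$actual" = "$expected" ]; then
    echo "ok   $host -> $actual"
  else
    echo "FAIL $host -> ${actual:-<no answer>} (expected $expected)"
    return 1
  fi
}

# Example call (placeholder IP):
#   check_record grafana.jfraeys.com 203.0.113.10
```

If a record is proxied through Cloudflare, `dig` returns Cloudflare edge addresses rather than the server IP, so this check only makes sense for DNS-only records.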
## TLS
Traefik uses Let's Encrypt via the Cloudflare DNS-01 challenge.
You must provide these Cloudflare API tokens in your local environment when running Ansible:
- `CF_DNS_API_TOKEN`
- `CF_ZONE_API_TOKEN`
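
Because a missing token tends to surface late (when Traefik fails to obtain a certificate), it can help to fail early on the control machine. A minimal sketch, assuming the tokens are exported in the current shell; `require_env` is a hypothetical helper, not something this repo ships.

```bash
# Hypothetical helper: succeed only if every named variable is exported and non-empty.
require_env() {
  local missing=0 name
  for name in "$@"; do
    # printenv only sees exported variables, which is what child processes inherit.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call:
#   require_env CF_DNS_API_TOKEN CF_ZONE_API_TOKEN \
#     && ansible-playbook playbooks/services.yml --ask-vault-pass
```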
## SSO (Authelia OIDC)
Authelia is exposed at:
- `https://auth.jfraeys.com` (issuer)
- `https://auth.jfraeys.com/.well-known/openid-configuration` (discovery)
Grafana is configured via `roles/grafana` using the Generic OAuth provider.
Forgejo is configured via `roles/forgejo` using the Forgejo admin CLI with `--provider=openidConnect` and `--auto-discover-url`.
Note: Forgejo pages that ask for an "OpenID URI" are legacy OpenID 2.0 and are not used for OIDC.
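
When an OIDC login fails, a useful first check is that the discovery document is reachable and advertises the issuer the clients were configured with. The sketch below assumes `curl` and `python3` are available; `oidc_issuer` is a hypothetical helper name.

```bash
# Hypothetical helper: print the "issuer" field of an OIDC discovery document read on stdin.
oidc_issuer() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["issuer"])'
}

# Example call (needs network access to the Authelia host):
#   curl -fsS https://auth.jfraeys.com/.well-known/openid-configuration | oidc_issuer
```

The printed value should match `https://auth.jfraeys.com` exactly, since OIDC clients compare the issuer string literally.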
## Secrets (Ansible Vault)
Secrets are stored in `secrets/vault.yml` (encrypted).
Create your vault from the template:
- `secrets/vault.example.yml` -> `secrets/vault.yml`
Run playbooks with either:
- `--ask-vault-pass`
- or a local password file (not committed): `--vault-password-file .vault_pass`
Notes:
- `secrets/vault.yml` is intentionally gitignored
- `inventory/hosts.yml` and `inventory/host_vars/web.yml` are generated by `setup.sh` and intentionally gitignored
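
Creating the vault from the template might look like the sketch below. It assumes the template is plaintext YAML and that you encrypt the copy yourself with `ansible-vault`; `init_vault` is a hypothetical helper, not a script in this repo.

```bash
# Hypothetical helper: copy the template into place without clobbering an existing vault.
init_vault() {
  local dir="${1:-secrets}"
  if [ -e "$dir/vault.yml" ]; then
    echo "exists: $dir/vault.yml"
    return 0
  fi
  cp "$dir/vault.example.yml" "$dir/vault.yml"
  echo "created: $dir/vault.yml (plaintext until encrypted)"
}

# Then encrypt it and fill in real values:
#   ansible-vault encrypt secrets/vault.yml
#   ansible-vault edit secrets/vault.yml
```

If you use a password file, keeping it readable only by you (`chmod 600 .vault_pass`) is a sensible default.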
## Playbooks
- `playbooks/services.yml`: deploy observability + forgejo on `services`
- `playbooks/web.yml`: deploy app-side dependencies on `web`
- `playbooks/test_config.yml`: smoke test host config and deployed stacks
- `playbooks/deploy.yml`: legacy/all-in-one deploy for the services host (no tags)
## Configuration split
- Vault (`secrets/vault.yml`): secrets (API tokens, passwords, access keys, and sensitive Terraform `TF_VAR_*` values)
- `.env`: non-secret configuration (still treated as sensitive), such as region/instance type and non-secret endpoints
## Linode Object Storage (demo apps)
If you already have a Linode Object Storage bucket, demo apps can use it via the S3-compatible API.
Recommended env vars (see `.env.example`):
- `S3_BUCKET`
- `S3_ENDPOINT` (example: `https://us-east-1.linodeobjects.com`)
- `S3_REGION`
Secrets (store in `secrets/vault.yml`):
- `S3_ACCESS_KEY_ID`
- `S3_SECRET_ACCESS_KEY`
Create a dedicated access key for demos and scope permissions as tightly as possible.
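
To check that a key and endpoint work before wiring them into an app, the bucket can be listed with any S3-compatible client. The sketch below assumes the AWS CLI is installed (it is not a listed prerequisite of this repo) and that the `S3_*` variables above are set in the current shell; `s3_ls` is a hypothetical helper name.

```bash
# Hypothetical helper: list the bucket through the S3-compatible endpoint.
# The AWS CLI reads credentials from AWS_* variables, so the S3_* names are mapped here.
s3_ls() {
  AWS_ACCESS_KEY_ID="$S3_ACCESS_KEY_ID" \
  AWS_SECRET_ACCESS_KEY="$S3_SECRET_ACCESS_KEY" \
  AWS_DEFAULT_REGION="$S3_REGION" \
  aws s3 ls "s3://$S3_BUCKET" --endpoint-url "$S3_ENDPOINT"
}
```

A permission error here, rather than a connection error, usually means the endpoint is right and the key's scope is what needs adjusting.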
## Grafana provisioning
Grafana is provisioned with Prometheus and Loki datasources via the Grafana provisioning mechanism (no manual UI setup required).
## Host vars
`inventory/host_vars/web.yml` (generated by `setup.sh`) must define:
- `public_ipv4`: public IPv4 of `jfraeys.com`
This value is used to allowlist access to Loki (`services:3100`) from the web host only.
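
For orientation, the resulting firewall rule on the `services` host would have roughly the shape printed by the sketch below. This is an assumption about the effect, not a copy of the role's tasks; `loki_allow_rule` is a hypothetical helper and the IP in the usage line is a documentation placeholder.

```bash
# Hypothetical helper: print (not run) a UFW rule that admits one source IP to Loki's port.
loki_allow_rule() {
  local web_ip="$1"
  echo "ufw allow from $web_ip to any port 3100 proto tcp"
}

# Example call (placeholder IP):
#   loki_allow_rule 203.0.113.10
```

Printing the rule instead of applying it keeps the sketch safe to run anywhere; on the host itself, `sudo ufw status numbered` shows what is actually in effect.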
## Forgejo Actions runner (web host)
A Forgejo runner is deployed on the `web` host (`roles/forgejo_runner`).
- Requires `FORGEJO_RUNNER_REGISTRATION_TOKEN` in `secrets/vault.yml`.
- Uses a single `self-hosted` label by default.
- The role auto re-registers the runner if labels change.
To force re-register (e.g. after deleting the runner in Forgejo UI):
```bash
ansible-playbook playbooks/web.yml \
  --vault-password-file secrets/.vault_pass \
  --limit web \
  --tags forgejo_runner \
  -e forgejo_runner_force_reregister=true
```
## SSH from Actions to services
If a workflow running on the `web` runner needs SSH access to the `services` host, the controller expects two separate SSH keys, each restricted to a forced command:
- `infra-register-stdin` (register)
- `infra-deregister` (deregister)
Public keys (installed on the `services` host via Ansible/vault):
- `SERVICE_SSH_REGISTER_PUBLIC_KEY`
- `SERVICE_SSH_DEREGISTER_PUBLIC_KEY`
Private keys (stored as Forgejo Actions secrets):
- `SERVICE_SSH_KEY_REGISTER`
- `SERVICE_SSH_KEY_DEREGISTER`
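
For reference, a forced-command key is an ordinary public key with options prefixed to its `authorized_keys` line. The sketch below shows the general shape, assuming the two names above are the commands being forced; the real entries are written by Ansible and may use different options or full paths. `forced_command_entry` is a hypothetical helper.

```bash
# Hypothetical helper: build an authorized_keys line that pins a public key to one command.
# "restrict" (OpenSSH 7.2+) disables port/agent/X11 forwarding and PTY allocation.
forced_command_entry() {
  local command="$1" pubkey="$2"
  printf 'restrict,command="%s" %s\n' "$command" "$pubkey"
}

# Example call, after generating a key pair with:
#   ssh-keygen -t ed25519 -N "" -C infra-register -f ./register_key
#   forced_command_entry infra-register-stdin "$(cat ./register_key.pub)"
```

With such an entry, whatever command the client requests is ignored and the forced command runs instead, which is why one key per action is needed.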
To generate or update both Actions secrets (and optionally update both public keys in the vault), install the Python deps, then run the helper script:
```bash
python3 -m pip install -r requirements.txt
```
```bash
python3 scripts/forgejo_set_actions_secret.py \
  --repo jfraeysd/infra-controller \
  --generate-ssh-keys \
  --update-vault-both-public-keys
```
## Deploy
Services:
```bash
ansible-playbook playbooks/services.yml --ask-vault-pass
```
Web:
```bash
ansible-playbook playbooks/web.yml --ask-vault-pass
```
## Terraform
`./setup.sh` exports `TF_VAR_*` values from `secrets/vault.yml` (prompting for the vault password if needed), then runs Terraform with a saved plan.
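
A saved plan means `apply` executes exactly what `plan` showed, with no second evaluation in between. Done by hand, the sequence is roughly the sketch below; it assumes the `TF_VAR_*` values are already exported, and both `tf_apply_saved_plan` and the `tfplan` file name are hypothetical, not taken from `setup.sh`.

```bash
# Hypothetical helper: plan to a file, then apply that exact plan.
# -chdir makes Terraform run inside DIR, so the plan file path is relative to DIR.
tf_apply_saved_plan() {
  local dir="${1:-terraform}" plan="${2:-tfplan}"
  terraform -chdir="$dir" init -input=false &&
  terraform -chdir="$dir" plan -input=false -out="$plan" &&
  terraform -chdir="$dir" apply -input=false "$plan"
}
```

Plan files can contain sensitive values, so they are best kept out of version control.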
## Notes
- Loki is exposed on `services:3100` but allowlisted in UFW to `web` only.
- Watchtower is enabled with label-based updates.
- Airflow/Spark are intentionally optional and can be enabled later via `deploy_airflow` / `deploy_spark`.
## Role layout
Services host (`services`):
- `roles/traefik`
- `roles/exporters` (node-exporter + cAdvisor)
- `roles/prometheus`
- `roles/loki`
- `roles/grafana`
- `roles/forgejo`
- `roles/watchtower`
Web host (`web`):
- `roles/traefik`
- `roles/app_core` (optional shared Postgres/Redis)
- `roles/forgejo_runner`
- `roles/app_deployer` (CI/CD webhook and deployment automation)
## App Deployment
The `app_deployer` role provides automated deployment via webhooks from Forgejo or GitHub Actions.
### Prerequisites
1. **Generate deploy token** (run once):
   ```bash
   ./scripts/gen-auth-secrets.sh # Creates VAULT_DEPLOY_TOKEN
   # Or add to secrets/vault.yml manually
   ```
2. **Set DEPLOY_TOKEN in your app repo**:
   - **Forgejo**: Use the helper script:
     ```bash
     ./scripts/set_deploy_token.py --owner <you> --repo <app-name>
     ```
   - **GitHub**: Set the `DEPLOY_TOKEN` secret via Settings > Secrets and variables > Actions
3. **Add deploy workflow to your app repo**:
   Copy the sample workflow and customize:
   ```bash
   cp roles/app_deployer/files/forgejo-deploy-workflow.yml .forgejo/workflows/deploy.yml
   # For GitHub: cp to .github/workflows/deploy.yml
   ```
   Update the workflow for your build (Go, Rust, Node.js, etc.) and app name.
### How It Works
1. **CI builds the app** and uploads binary + checksum to `deploy@web:/opt/artifacts/`
2. **CI triggers webhook** with `X-Deploy-Token` header
3. **Webhook validates token** (timing-safe comparison) and runs deployment
4. **Ansible deploys the app**:
   - Verifies artifact checksum
   - Creates app user and directories
   - Sets up systemd service
   - Keeps last 5 versions for rollback
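
The checksum and webhook steps above can be sketched as follows. This is an illustration under stated assumptions, not the role's actual code: it assumes SHA-256 checksums in `sha256sum` format, and the webhook URL is a placeholder because this README does not state the endpoint. `verify_artifact` and `trigger_deploy` are hypothetical names.

```bash
# Hypothetical helper: succeed only if the artifact's SHA-256 matches the checksum file.
# Assumption: the checksum file is in sha256sum format ("<hex>  <filename>").
verify_artifact() {
  local artifact="$1" checksum_file="$2" expected actual
  expected="$(cut -d ' ' -f 1 "$checksum_file")"
  actual="$(sha256sum "$artifact" | cut -d ' ' -f 1)"
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Hypothetical CI step: call the webhook with the shared token in the X-Deploy-Token header.
# URL is supplied by the caller; any request body the real webhook expects is not shown.
trigger_deploy() {
  local url="$1"
  curl -fsS -X POST -H "X-Deploy-Token: ${DEPLOY_TOKEN:?DEPLOY_TOKEN is not set}" "$url"
}
```

Verifying on the host, after upload, is what protects against a corrupted or swapped artifact; a checksum computed and checked only in CI would not.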
### Manual Deployment
For manual deploys or rollbacks:
```bash
# Deploy a specific version
ssh deploy@web /opt/deploy/scripts/deploy.sh my-api abc123 prod
# Rollback to previous version
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api
# Lists available versions, then:
ssh deploy@web /opt/deploy/scripts/rollback.sh my-api <older-sha>
```
### Security Features
- **Timing-safe token validation** prevents timing attacks
- **Artifact checksums** ensure binary integrity
- **Restricted sudoers entry** allows only the deployment script to run with elevated privileges
- **Last 5 versions kept** for quick rollback
- **Unprivileged per-app user**: each app runs as its own unprivileged user
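
Keeping only the newest few versions is a small pruning step. The sketch below assumes one directory per version under a single releases directory, ordered by modification time; the actual layout used by `app_deployer` is not documented here, and `prune_releases` is a hypothetical name.

```bash
# Hypothetical helper: keep the KEEP newest entries in DIR (by mtime), remove the rest.
prune_releases() {
  local dir="$1" keep="${2:-5}" name
  # ls -1t lists newest first; tail skips the first KEEP lines and yields the older ones.
  ls -1t "$dir" | tail -n "+$((keep + 1))" | while IFS= read -r name; do
    rm -rf -- "${dir:?}/$name"
  done
}

# Example call (placeholder path):
#   prune_releases /path/to/releases 5
```

The `${dir:?}` guard aborts if the directory argument is empty, so a bad call cannot expand to a path under `/`.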
### Troubleshooting
```bash
# Check webhook logs
ssh web sudo journalctl -u webhook -f
# Check deploy logs
ssh web sudo cat /var/log/deploy.log
# Verify systemd service
ssh web sudo systemctl status my-api
```