# Monitoring Stack

## Directory Structure (Canonical)

All monitoring configuration lives under `monitoring/`.

```text
monitoring/
  prometheus/
    prometheus.yml        # Prometheus scrape configuration
  grafana/
    dashboards/           # Grafana dashboards (JSON)
    provisioning/
      datasources/        # Grafana data sources (Prometheus/Loki)
      dashboards/         # Grafana dashboard provider (points at dashboards/)
  loki-config.yml         # Loki configuration
  promtail-config.yml     # Promtail configuration
```

### What is "Grafana provisioning"?

Grafana provisioning is how Grafana auto-configures itself on startup (no clicking in the UI):

- **`grafana/provisioning/datasources/*.yml`** - Defines where Grafana reads data from (e.g. Prometheus at `http://prometheus:9090`, Loki at `http://loki:3100`).
- **`grafana/provisioning/dashboards/*.yml`** - Tells Grafana to load dashboard JSON files from `/var/lib/grafana/dashboards`.
- **`grafana/dashboards/*.json`** - The dashboards themselves.

### Source of truth

- **Dashboards**: edit/add JSON in `monitoring/grafana/dashboards/`.
- **Grafana provisioning**: edit files in `monitoring/grafana/provisioning/`.
- **Prometheus scrape config**: edit `monitoring/prometheus/prometheus.yml`.

`scripts/setup_monitoring.py` is intentionally **provisioning-only**:

- It (re)writes Grafana **datasources** and the **dashboard provider**.
- It does **not** create or overwrite any dashboard JSON files.
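Before starting the stack, it can help to confirm that the canonical layout above is in place. Below is a minimal sketch of a hypothetical `check_monitoring_layout` bash function (it is not part of this repo or of `scripts/setup_monitoring.py`). It only checks that the listed paths exist; it does not validate their contents.

```bash
# Hypothetical helper (not part of this repo): verify the canonical
# monitoring/ layout. Prints each missing path to stderr and returns 1
# if anything is missing, 0 otherwise.
check_monitoring_layout() {
  local root="${1:-monitoring}"
  local missing=0
  local path
  for path in \
    prometheus/prometheus.yml \
    grafana/dashboards \
    grafana/provisioning/datasources \
    grafana/provisioning/dashboards \
    loki-config.yml \
    promtail-config.yml
  do
    if [ ! -e "$root/$path" ]; then
      echo "missing: $root/$path" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Run it from the repository root, e.g. `check_monitoring_layout monitoring && echo "layout OK"`.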
## Quick Start

```bash
# Start deployment
make deploy-up

# Access services ("open" is macOS; use xdg-open on Linux)
open http://localhost:3000  # Grafana (admin/admin123)
open http://localhost:9090  # Prometheus
```

## Services

### Grafana (Port 3000)

**Main monitoring dashboard**

- Username: `admin`
- Password: `admin123`
- Data sources: Prometheus, provisioned as `http://prometheus:9090` (the compose service name; from the host, Prometheus is at http://localhost:9090), and Loki at `http://loki:3100`

### Prometheus (Port 9090)

**Metrics collection and storage**

### Loki (Port 3100)

**Log aggregation**

## Dashboards

Available dashboard configurations in `grafana/dashboards/`:

- `load-test-performance.json` - Load test metrics
- `websocket-performance.json` - WebSocket performance
- `system-health.json` - System health monitoring
- `rsync-performance.json` - Rsync performance metrics

### Importing Dashboards

The dashboard provider loads these files automatically on startup, so no manual import is needed in this stack. To import a dashboard by hand (for example, into a Grafana instance that is not provisioned from this repo):

1. Go to Grafana → "+" → "Import"
2. Upload a JSON file from the `grafana/dashboards/` directory
3. Select the Prometheus data source

## Configuration Files

- `prometheus/prometheus.yml` - Prometheus configuration
- `loki-config.yml` - Loki configuration
- `promtail-config.yml` - Promtail configuration
- `security_rules.yml` - Security rules

## Usage

1. Start the monitoring stack: `make deploy-up`
2. Access Grafana: http://localhost:3000 (admin/admin123)
3. Open the dashboards, which are loaded automatically from `grafana/dashboards/`
4. View metrics and test results in real time

## Health Endpoints

The API server provides health check endpoints for monitoring:

- **`/health`** - Overall service health (for Docker healthcheck)
- **`/health/live`** - Liveness probe (is the service running?)
- **`/health/ready`** - Readiness probe (can the service accept traffic?)

### Testing Health Endpoints

```bash
# -k skips TLS certificate verification for the local HTTPS endpoint

# Basic health check
curl -k https://localhost:9101/health

# Liveness check (for K8s or monitoring)
curl -k https://localhost:9101/health/live

# Readiness check (verifies dependencies)
curl -k https://localhost:9101/health/ready
```

See `health-testing.md` for detailed testing procedures.
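Because the readiness probe checks dependencies, it may report "not ready" for a while after `make deploy-up` even when the liveness probe already succeeds. Scripts that run right after startup can poll instead of checking once. This is a minimal sketch assuming bash and a hypothetical `retry_until` helper (not part of this repo); the commented curl call reuses the port and `-k` flag from the examples above.

```bash
# Hypothetical helper (not part of this repo): run a command until it
# succeeds or the attempts run out.
# Usage: retry_until <attempts> <delay-seconds> <command> [args...]
retry_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Skip the sleep after the final failed attempt.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example: wait roughly a minute for the readiness probe.
# -f makes curl exit non-zero on HTTP status >= 400, -s hides progress
# output, -k skips TLS certificate verification.
# retry_until 30 2 curl -fsk https://localhost:9101/health/ready
```

Using `curl -f` matters here: without it, curl exits 0 even when the endpoint returns an HTTP error status, so the loop would stop on the first response.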
## Prometheus Integration

Prometheus scrapes the following endpoints:

- `api-server:9101/metrics` - Application metrics (future)
- `api-server:9101/health` - Health status monitoring
- `host.docker.internal:9100/metrics` - Worker metrics (when the worker runs on the host)
- `worker:9100/metrics` - Worker metrics (when the worker runs as a container in the compose network)

## Cleanup (deprecated paths)

These legacy paths may still exist in the repo but are **not used** by the current dev compose config:

- `monitoring/dashboards/` (old dashboards location)
- `monitoring/prometheus.yml` (old Prometheus config location)
- `monitoring/grafana/provisioning/dashboards/dashboard.yml` (duplicate of `dashboards.yml`)
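To see which of these legacy paths are still present before cleaning them up, a hypothetical `find_deprecated_paths` helper like the sketch below (not part of this repo) prints any that exist. It deliberately deletes nothing, so the output can be reviewed first.

```bash
# Hypothetical helper (not part of this repo): list legacy monitoring
# paths that still exist. Prints one path per line and returns 1 if any
# were found, 0 if the tree is clean. Nothing is deleted.
find_deprecated_paths() {
  local root="${1:-monitoring}"
  local status=0
  local path
  for path in \
    dashboards \
    prometheus.yml \
    grafana/provisioning/dashboards/dashboard.yml
  do
    if [ -e "$root/$path" ]; then
      echo "$root/$path"
      status=1
    fi
  done
  return "$status"
}
```

For example, `find_deprecated_paths monitoring || echo "legacy paths found; review before deleting"`.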