# Centralized Monitoring Stack

## Quick Start

```bash
# Start everything
docker-compose up -d

# Access services
open http://localhost:3000  # Grafana (admin/admin)
open http://localhost:9090  # Prometheus
```

## Services

### Grafana (Port 3000)

**Main monitoring dashboard**

- Username: `admin`
- Password: `admin`
- Pre-configured datasources: Prometheus + Loki
- Pre-loaded ML Queue dashboard

### Prometheus (Port 9090)

**Metrics collection**

- Scrapes metrics from API server (`:9100/metrics`)
- 15s scrape interval
- Data retention: 15 days (default)

### Loki (Port 3100)

**Log aggregation**

- Collects logs from all containers
- Collects application logs from `./logs/`
- Retention: 7 days

### Promtail

**Log shipping**

- Watches Docker container logs
- Watches `./logs/*.log`
- Sends to Loki

## Viewing Data

### Metrics

1. Open Grafana: http://localhost:3000
2. Go to "ML Task Queue Monitoring" dashboard
3. See: queue depth, task duration, error rates, etc.

### Logs

1. Open Grafana → Explore
2. Select "Loki" datasource
3. Query examples:

```logql
{job="app_logs"}                        # All app logs
{job="docker",service="api-server"}     # API server logs
{job="docker"} |= "error"               # All errors
```

## Architecture

```
┌─────────────┐
│ API Server  │──┐
└─────────────┘  │
                 ├──► Prometheus ──► Grafana
┌─────────────┐  │                      ▲
│   Worker    │──┘                      │
└─────────────┘                         │
                                        │
┌─────────────┐                         │
│  App Logs   │──┐                      │
└─────────────┘  │                      │
                 ├──► Promtail ──► Loki ┘
┌─────────────┐  │
│ Docker Logs │──┘
└─────────────┘
```

## Configuration Files

- `prometheus.yml` - Metrics scraping config
- `loki-config.yml` - Log storage config
- `promtail-config.yml` - Log collection config
- `grafana/provisioning/` - Auto-configuration

## Customization

### Add More Scrapers

Edit `monitoring/prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['my-service:9100']
```

### Change Retention

**Prometheus:** Add to command in docker-compose:

```yaml
- '--storage.tsdb.retention.time=30d'
```

**Loki:** Edit `loki-config.yml`:

```yaml
limits_config:
  retention_period: 720h  # 30 days
```

## Troubleshooting

**No metrics showing:**

```bash
# Check if Prometheus can reach targets
curl http://localhost:9090/api/v1/targets

# Check if API exposes metrics
curl http://localhost:9100/metrics
```

**No logs showing:**

```bash
# Check Promtail status
docker logs ml-experiments-promtail

# Verify Loki is receiving logs
curl http://localhost:3100/ready
```

**Grafana can't connect to datasources:**

```bash
# Restart Grafana
docker-compose restart grafana
```
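
**Checking everything at once:**

The individual checks above can be combined into one pass. The following is a minimal sketch, not a script shipped with this stack: the function names `check_endpoint` and `check_stack` are hypothetical, the ports are the ones listed in this README, and the health paths (`/-/healthy`, `/ready`, `/api/health`) are assumed to be the stock endpoints of Prometheus, Loki, and Grafana rather than anything configured here.

```bash
# check_endpoint NAME URL
# Prints "OK NAME" when the URL answers with a successful HTTP status,
# otherwise prints "FAIL NAME (URL)" and returns 1.
check_endpoint() {
  name="$1"
  url="$2"
  # -f: treat HTTP errors as failures, -s: quiet,
  # --max-time: do not hang on a port nobody is listening on
  if curl -fs --max-time 5 -o /dev/null "$url"; then
    echo "OK   $name"
  else
    echo "FAIL $name ($url)"
    return 1
  fi
}

# check_stack
# Probes every service from this README; returns 1 if any probe failed.
check_stack() {
  failed=0
  check_endpoint "Prometheus"  "http://localhost:9090/-/healthy"  || failed=1
  check_endpoint "Loki"        "http://localhost:3100/ready"      || failed=1
  check_endpoint "Grafana"     "http://localhost:3000/api/health" || failed=1
  check_endpoint "API metrics" "http://localhost:9100/metrics"    || failed=1
  return "$failed"
}
```

Paste the functions into a shell (or source them from a file) and run `check_stack`. A `FAIL` line tells you which of the sections above to start with.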