fetch_ml/monitoring/README.md

Centralized Monitoring Stack

Quick Start

# Start everything
docker-compose up -d

# Access services (open is macOS-specific; on Linux, use xdg-open or a browser)
open http://localhost:3000  # Grafana (admin/admin)
open http://localhost:9090  # Prometheus

Services

Grafana (Port 3000)

Main monitoring dashboard

  • Username: admin
  • Password: admin
  • Pre-configured datasources: Prometheus + Loki
  • Pre-loaded ML Queue dashboard

Prometheus (Port 9090)

Metrics collection

  • Scrapes metrics from API server (:9100/metrics)
  • 15s scrape interval
  • Data retention: 15 days (default)

Loki (Port 3100)

Log aggregation

  • Collects logs from all containers
  • Collects application logs from ./logs/
  • Retention: 7 days

Promtail

Log shipping

  • Watches Docker container logs
  • Watches ./logs/*.log
  • Sends to Loki

Viewing Data

Metrics

  1. Open Grafana: http://localhost:3000
  2. Go to "ML Task Queue Monitoring" dashboard
  3. The panels show queue depth, task duration, error rates, and related metrics.

Logs

  1. Open Grafana → Explore
  2. Select "Loki" datasource
  3. Query examples:
    {job="app_logs"}                    # All app logs
    {job="docker",service="api-server"} # API server logs
    {job="docker"} |= "error"           # Docker log lines containing "error"
    

Architecture

┌─────────────┐
│  API Server │──┐
└─────────────┘  │
                 ├──► Prometheus ──► Grafana
┌─────────────┐  │                      ▲
│   Worker    │──┘                      │
└─────────────┘                         │
                                        │
┌─────────────┐                         │
│  App Logs   │──┐                      │
└─────────────┘  │                      │
                 ├──► Promtail ──► Loki ┘
┌─────────────┐  │
│Docker Logs  │──┘
└─────────────┘

Configuration Files

  • prometheus.yml - Metrics scraping config
  • loki-config.yml - Log storage config
  • promtail-config.yml - Log collection config
  • grafana/provisioning/ - Auto-configuration

Customization

Add More Scrape Targets

Edit monitoring/prometheus.yml and add a job under the existing scrape_configs key:

scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['my-service:9100']

Change Retention

Prometheus: Add this flag to the Prometheus service's command list in the Docker Compose file:

- '--storage.tsdb.retention.time=30d'

Loki: Edit loki-config.yml (Loki applies retention_period only when compactor retention is enabled):

limits_config:
  retention_period: 720h  # 30 days

Troubleshooting

No metrics showing:

# Check if Prometheus can reach targets
curl http://localhost:9090/api/v1/targets

# Check if API exposes metrics
curl http://localhost:9100/metrics

No logs showing:

# Check Promtail status
docker logs ml-experiments-promtail

# Verify Loki is up and ready to accept logs
curl http://localhost:3100/ready

Grafana can't connect to datasources:

# Restart Grafana
docker-compose restart grafana

Profiling Quick Start

To capture CPU profiles while exercising real workloads:

# HTTP LoadTestSuite (MediumLoad scenario)
make profile-load

# WebSocket → Redis queue → worker integration
make profile-ws-queue

Then inspect profiles with:

go tool pprof cpu_load.out   # HTTP load
go tool pprof cpu_ws.out     # WebSocket/queue/worker