Centralized Monitoring Stack

Quick Start

# Start everything
docker-compose up -d

# Access services
open http://localhost:3000  # Grafana (admin/admin)
open http://localhost:9090  # Prometheus

Services

Grafana (Port 3000)

Main monitoring dashboard

Username: admin
Password: admin
Pre-configured datasources: Prometheus + Loki
Pre-loaded ML Queue dashboard

Prometheus (Port 9090)

Metrics collection

Scrapes metrics from API server (:9100/metrics)
15s scrape interval
Data retention: 15 days (default)

Loki (Port 3100)

Log aggregation

Collects logs from all containers
Collects application logs from ./logs/
Retention: 7 days

Promtail

Log shipping

Watches Docker container logs
Watches ./logs/*.log
Sends to Loki

Viewing Data

Metrics

Open Grafana: http://localhost:3000
Go to "ML Task Queue Monitoring" dashboard
See: queue depth, task duration, error rates, etc.

Logs

Open Grafana → Explore
Select "Loki" datasource

Query examples:

{job="app_logs"}                    # All app logs
{job="docker",service="api-server"} # API server logs
{job="docker"} |= "error"          # Docker log lines containing "error" (case-sensitive)
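
The queries above share one shape: a label selector in braces, optionally followed by a |= line filter. As a minimal sketch, the helper below assembles such a query from a job, an optional service, and an optional filter string. The name logql is hypothetical and not part of this repository; it assumes the job and service labels shown above and does not escape quotes in its arguments.

```shell
# logql JOB [SERVICE] [FILTER]
# Hypothetical helper: prints a LogQL query for the given labels.
# Pass "" as SERVICE to skip the service label but still add a filter.
logql() {
  job="$1"
  service="${2:-}"
  filter="${3:-}"

  selector="{job=\"$job\""
  if [ -n "$service" ]; then
    selector="$selector,service=\"$service\""
  fi
  selector="$selector}"

  # Append a line filter only when one was requested.
  if [ -n "$filter" ]; then
    selector="$selector |= \"$filter\""
  fi
  printf '%s\n' "$selector"
}

# One way to run the result outside Grafana is Loki's query_range endpoint,
# letting curl handle the URL encoding:
#   curl -G -s http://localhost:3100/loki/api/v1/query_range \
#     --data-urlencode "query=$(logql docker api-server)"
```

Keeping the query construction separate from the HTTP call makes it easy to paste the same string into Grafana's Explore view.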

Architecture

┌─────────────┐
│  API Server │──┐
└─────────────┘  │
                 ├──► Prometheus ──► Grafana
┌─────────────┐  │                      ▲
│   Worker    │──┘                      │
└─────────────┘                         │
                                        │
┌─────────────┐                         │
│  App Logs   │──┐                      │
└─────────────┘  │                      │
                 ├──► Promtail ──► Loki ┘
┌─────────────┐  │
│Docker Logs  │──┘
└─────────────┘

Configuration Files

prometheus.yml - Metrics scraping config
loki-config.yml - Log storage config
promtail-config.yml - Log collection config
grafana/provisioning/ - Auto-configuration

Customization

Add More Scrapers

Edit monitoring/prometheus.yml:

scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['my-service:9100']

Change Retention

Prometheus: Add this flag to the Prometheus service's command list in the docker-compose file:

- '--storage.tsdb.retention.time=30d'

Loki: Edit loki-config.yml:

limits_config:
  retention_period: 720h  # 30 days
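
The Prometheus flag above is written in days (30d), while the Loki example is written in hours (720h). A small sketch of the conversion, using a hypothetical helper name that is not part of this repository:

```shell
# days_to_hours DAYS
# Hypothetical helper: converts a whole number of days into an
# hour-based duration string such as the one used in loki-config.yml.
days_to_hours() {
  printf '%sh\n' "$(( $1 * 24 ))"
}

# Example: days_to_hours 30 prints 720h
```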

Troubleshooting

No metrics showing:

# Check if Prometheus can reach targets
curl http://localhost:9090/api/v1/targets

# Check if API exposes metrics
curl http://localhost:9100/metrics
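
The targets endpoint returns JSON in which each active target carries a health field (up, down, or unknown). As a rough sketch, the hypothetical count_health helper below tallies targets in a given state using only grep. It assumes Prometheus emits compact JSON with no spaces around the colon; a JSON-aware tool would be more robust.

```shell
# count_health STATE
# Hypothetical helper: reads a /api/v1/targets response on stdin and
# prints how many targets report the given health state.
count_health() {
  # grep exits non-zero when nothing matches; "|| true" keeps the
  # pipeline usable under "set -e" so that the count is simply 0.
  { grep -o "\"health\":\"$1\"" || true; } | wc -l | tr -d ' '
}

# Example against the local stack:
#   curl -s http://localhost:9090/api/v1/targets | count_health down
```

A non-zero count for down suggests checking the target's lastError field in the same response.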

No logs showing:

# Check Promtail status
docker logs ml-experiments-promtail

# Verify Loki is up and ready (this checks readiness, not that logs are arriving)
curl http://localhost:3100/ready

Grafana can't connect to datasources:

# Restart Grafana
docker-compose restart grafana

Profiling Quick Start

To capture CPU profiles while exercising real workloads:

# HTTP LoadTestSuite (MediumLoad scenario)
make profile-load

# WebSocket → Redis queue → worker integration
make profile-ws-queue

Then inspect profiles with:

go tool pprof cpu_load.out   # HTTP load
go tool pprof cpu_ws.out     # WebSocket/queue/worker
