# Centralized Monitoring Stack
## Quick Start
```bash
# Start everything
docker-compose up -d
# Access services
open http://localhost:3000 # Grafana (admin/admin)
open http://localhost:9090 # Prometheus
```
## Services
### Grafana (Port 3000)
**Main monitoring dashboard**
- Username: `admin`
- Password: `admin`
- Pre-configured datasources: Prometheus + Loki
- Pre-loaded ML Queue dashboard
### Prometheus (Port 9090)
**Metrics collection**
- Scrapes metrics from API server (`:9100/metrics`)
- 15s scrape interval
- Data retention: 15 days (default)
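The scrape job described above presumably looks something like this in `prometheus.yml` (the job name and target hostname are assumptions; only the `:9100/metrics` endpoint and 15s interval come from this README):

```yaml
global:
  scrape_interval: 15s          # matches the 15s interval noted above

scrape_configs:
  - job_name: 'api-server'      # assumed job name
    static_configs:
      - targets: ['api-server:9100']   # assumed container hostname
```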
### Loki (Port 3100)
**Log aggregation**
- Collects logs from all containers
- Collects application logs from `./logs/`
- Retention: 7 days
### Promtail
**Log shipping**
- Watches Docker container logs
- Watches `./logs/*.log`
- Sends to Loki
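A minimal sketch of what `promtail-config.yml` might contain for these two sources (paths, mount points, and the Docker service-discovery block are assumptions; the `job` labels match the LogQL examples used in this README):

```yaml
scrape_configs:
  # Application log files from ./logs/ (assumed mounted at /var/log/app)
  - job_name: app_logs
    static_configs:
      - targets: [localhost]
        labels:
          job: app_logs
          __path__: /var/log/app/*.log

  # Docker container logs via Docker service discovery
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
```

In practice the Docker job usually also needs `relabel_configs` to turn container metadata into labels such as `service`.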
## Viewing Data
### Metrics
1. Open Grafana: http://localhost:3000
2. Go to "ML Task Queue Monitoring" dashboard
3. See: queue depth, task duration, error rates, etc.
### Logs
1. Open Grafana → Explore
2. Select "Loki" datasource
3. Query examples:
```logql
{job="app_logs"} # All app logs
{job="docker",service="api-server"} # API server logs
{job="docker"} |= "error" # All errors
```
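Beyond plain line filters, LogQL also supports parsers and metric queries. A few hedged examples, assuming the same `job` and `service` labels as above:

```logql
{job="docker"} |= "error" | json             # parse JSON log lines into labels
rate({job="docker"} |= "error" [5m])         # per-stream error rate over 5 minutes
sum by (service) (rate({job="docker"}[1m]))  # log volume per service
```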
## Architecture
```
┌─────────────┐
│ API Server  │──┐
└─────────────┘  │
                 ├──► Prometheus ──► Grafana
┌─────────────┐  │                      ▲
│   Worker    │──┘                      │
└─────────────┘                         │
┌─────────────┐                         │
│  App Logs   │──┐                      │
└─────────────┘  │                      │
                 ├──► Promtail ──► Loki ┘
┌─────────────┐  │
│ Docker Logs │──┘
└─────────────┘
```
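The stack above maps onto a `docker-compose.yml` roughly like the following sketch (service names, images, and volume paths are assumptions; only the ports and config-file names come from this README):

```yaml
services:
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  loki:
    image: grafana/loki
    ports: ["3100:3100"]
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
  promtail:
    image: grafana/promtail
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - /var/run/docker.sock:/var/run/docker.sock
      - ./logs:/var/log/app
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
```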
## Configuration Files
- `prometheus.yml` - Metrics scraping config
- `loki-config.yml` - Log storage config
- `promtail-config.yml` - Log collection config
- `grafana/provisioning/` - Auto-configuration
## Customization
### Add More Scrapers
Edit `monitoring/prometheus.yml`:
```yaml
scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['my-service:9100']
```
### Change Retention
**Prometheus:** Add the flag to the Prometheus `command:` list in `docker-compose.yml`:
```yaml
- '--storage.tsdb.retention.time=30d'
```
**Loki:** Edit `loki-config.yml`:
```yaml
limits_config:
  retention_period: 720h  # 30 days
```
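Note that depending on the Loki version, `retention_period` alone may not actually delete old data; retention is enforced by the compactor, which must be enabled explicitly. A sketch of the additional settings (the working directory path is an assumption):

```yaml
compactor:
  working_directory: /loki/compactor   # assumed path inside the container
  retention_enabled: true
```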
## Troubleshooting
**No metrics showing:**
```bash
# Check if Prometheus can reach targets
curl http://localhost:9090/api/v1/targets
# Check if API exposes metrics
curl http://localhost:9100/metrics
```
**No logs showing:**
```bash
# Check Promtail status
docker logs ml-experiments-promtail
# Verify Loki is receiving logs
curl http://localhost:3100/ready
```
**Grafana can't connect to datasources:**
```bash
# Restart Grafana
docker-compose restart grafana
```
## Profiling Quick Start
To capture CPU profiles while exercising real workloads:
```bash
# HTTP LoadTestSuite (MediumLoad scenario)
make profile-load
# WebSocket → Redis queue → worker integration
make profile-ws-queue
```
Then inspect profiles with:
```bash
go tool pprof cpu_load.out # HTTP load
go tool pprof cpu_ws.out # WebSocket/queue/worker
```