# Centralized Monitoring Stack

## Quick Start

```bash
# Start everything
docker-compose up -d

# Access services
open http://localhost:3000  # Grafana (admin/admin)
open http://localhost:9090  # Prometheus
```
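
The containers can take a few seconds to become reachable after `docker-compose up -d`. The following is a minimal sketch of a readiness poll, assuming the default ports above and each service's standard health endpoint (`/-/ready` for Prometheus, `/api/health` for Grafana, `/ready` for Loki). The `wait_for` helper is a hypothetical name, not part of this repository.

```bash
# wait_for URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until curl receives a successful response, or gives up.
wait_for() {
  local url="$1"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP 4xx/5xx, -s keeps the polling quiet
    if curl -fs -o /dev/null "$url"; then
      echo "ready: $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out: $url" >&2
  return 1
}

# Example:
#   wait_for http://localhost:9090/-/ready
#   wait_for http://localhost:3000/api/health
#   wait_for http://localhost:3100/ready
```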

## Services

### Grafana (Port 3000)
**Main monitoring dashboard**
- Username: `admin`
- Password: `admin`
- Pre-configured datasources: Prometheus + Loki
- Pre-loaded "ML Task Queue Monitoring" dashboard

### Prometheus (Port 9090)
**Metrics collection**
- Scrapes metrics from API server (`:9100/metrics`)
- 15s scrape interval
- Data retention: 15 days (default)

### Loki (Port 3100)
**Log aggregation**
- Collects logs from all containers
- Collects application logs from `./logs/`
- Retention: 7 days

### Promtail
**Log shipping**
- Watches Docker container logs
- Watches `./logs/*.log`
- Sends to Loki

## Viewing Data

### Metrics
1. Open Grafana: http://localhost:3000
2. Go to "ML Task Queue Monitoring" dashboard
3. Review the panels for queue depth, task duration, error rates, and related metrics
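
Metrics can also be queried without the UI through the Prometheus HTTP API. The following is a small sketch, assuming Prometheus listens on the default port. `prom_query` and the `PROMETHEUS_URL` override are hypothetical names. `up` is Prometheus's built-in per-target series (1 means the last scrape succeeded).

```bash
# prom_query EXPR: run an instant PromQL query and print the raw JSON response.
# PROMETHEUS_URL is an optional, hypothetical override for the base URL.
prom_query() {
  curl -fsS -G "${PROMETHEUS_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Example:
#   prom_query 'up'
```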

### Logs
1. Open Grafana → Explore
2. Select "Loki" datasource
3. Query examples:
   ```logql
   {job="app_logs"}                    # All app logs
   {job="docker",service="api-server"} # API server logs
   {job="docker"} |= "error"           # Container log lines containing "error"
   ```
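
The same LogQL queries can be run from a terminal against Loki's HTTP API. The following is a sketch, assuming Loki listens on the default port. `loki_query` and the `LOKI_URL` override are hypothetical names. Without explicit `start` and `end` parameters, `query_range` covers the last hour.

```bash
# loki_query LOGQL [LIMIT]: fetch recent log lines matching a LogQL query.
# LOKI_URL is an optional, hypothetical override for the base URL.
loki_query() {
  curl -fsS -G "${LOKI_URL:-http://localhost:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${2:-20}"
}

# Example:
#   loki_query '{job="app_logs"}'
#   loki_query '{job="docker"} |= "error"' 50
```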

## Architecture

```
┌─────────────┐
│  API Server │──┐
└─────────────┘  │
                 ├──► Prometheus ──► Grafana
┌─────────────┐  │                      ▲
│   Worker    │──┘                      │
└─────────────┘                         │
                                        │
┌─────────────┐                         │
│  App Logs   │──┐                      │
└─────────────┘  │                      │
                 ├──► Promtail ──► Loki ┘
┌─────────────┐  │
│Docker Logs  │──┘
└─────────────┘
```

## Configuration Files

- `prometheus.yml` - Metrics scraping config
- `loki-config.yml` - Log storage config
- `promtail-config.yml` - Log collection config
- `grafana/provisioning/` - Auto-configuration

## Customization

### Add More Scrapers
Edit `monitoring/prometheus.yml`:
```yaml
scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['my-service:9100']
```
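
A new job only appears once Prometheus has reloaded its configuration, for example after the container is restarted. The following is a rough sketch for confirming that the job was picked up, assuming the default port. `job_registered` and `PROMETHEUS_URL` are hypothetical names, and the plain text match relies on the compact JSON that the targets endpoint returns.

```bash
# job_registered JOB: succeed if JOB is among Prometheus's active targets.
# PROMETHEUS_URL is an optional, hypothetical override for the base URL.
job_registered() {
  curl -fsS "${PROMETHEUS_URL:-http://localhost:9090}/api/v1/targets?state=active" \
    | grep -F "\"job\":\"$1\"" >/dev/null
}

# Example:
#   job_registered my-service && echo "my-service is an active target"
```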

### Change Retention
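**Prometheus:** Add this flag to the Prometheus service's `command` list in the Compose file. If the service does not define `command` yet, also pass `--config.file=/etc/prometheus/prometheus.yml`, because a custom command replaces the image's default flags: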
```yaml
- '--storage.tsdb.retention.time=30d'
```

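**Loki:** Edit `loki-config.yml`. `retention_period` is enforced by the compactor, so it only takes effect if `compactor.retention_enabled` is `true`: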
```yaml
limits_config:
  retention_period: 720h  # 30 days
```

## Troubleshooting

**No metrics showing:**
```bash
# Check if Prometheus can reach targets
curl http://localhost:9090/api/v1/targets

# Check if API exposes metrics
curl http://localhost:9100/metrics
```
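
The targets response is a large JSON document. The following is a crude sketch for condensing it into a health tally plus any scrape errors, assuming the default port and the compact JSON that Prometheus emits. `target_health` and `PROMETHEUS_URL` are hypothetical names, and error messages containing escaped quotes will be cut short by the text match.

```bash
# target_health: count active targets per health state and list scrape errors.
# PROMETHEUS_URL is an optional, hypothetical override for the base URL.
target_health() {
  local json
  json="$(curl -fsS "${PROMETHEUS_URL:-http://localhost:9090}/api/v1/targets?state=active")" || return 1
  # Tally the "up" / "down" / "unknown" states.
  printf '%s\n' "$json" | grep -o '"health":"[a-z]*"' | sort | uniq -c
  # Print non-empty lastError fields, if there are any.
  printf '%s\n' "$json" | grep -o '"lastError":"[^"]\{1,\}"' || true
}

# Example:
#   target_health
```

A target reported as `down` together with a `lastError` such as a refused connection usually points at the exporter or the network rather than at Prometheus itself.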

**No logs showing:**
```bash
# Check Promtail status
docker logs ml-experiments-promtail

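# Check that Loki is up and ready
curl http://localhost:3100/ready

# Verify Loki has received logs (lists the label names it has indexed)
curl http://localhost:3100/loki/api/v1/labels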
```

**Grafana can't connect to datasources:**
```bash
# Restart Grafana
docker-compose restart grafana
```
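
Before restarting, it can help to confirm which datasources Grafana actually loaded. The following is a sketch using Grafana's HTTP API, assuming the default port and the default `admin`/`admin` credentials listed above. `grafana_datasources` and the `GRAFANA_*` overrides are hypothetical names.

```bash
# grafana_datasources: print the datasources Grafana has loaded (raw JSON).
# GRAFANA_URL, GRAFANA_USER and GRAFANA_PASSWORD are optional, hypothetical overrides.
grafana_datasources() {
  curl -fsS -u "${GRAFANA_USER:-admin}:${GRAFANA_PASSWORD:-admin}" \
    "${GRAFANA_URL:-http://localhost:3000}/api/datasources"
}

# Example:
#   grafana_datasources
```

If Prometheus or Loki is missing from the output, the provisioning files under `grafana/provisioning/` are the first place to look.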

## Profiling Quick Start

To capture CPU profiles while exercising real workloads:

```bash
# HTTP LoadTestSuite (MediumLoad scenario)
make profile-load

# WebSocket → Redis queue → worker integration
make profile-ws-queue
```

Then inspect profiles with:

```bash
go tool pprof cpu_load.out   # HTTP load
go tool pprof cpu_ws.out     # WebSocket/queue/worker
```
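
For a quick, non-interactive look, `go tool pprof` can also print the hottest functions directly. The following is a sketch, assuming the profile files named above exist in the current directory. `profile_top` is a hypothetical helper.

```bash
# profile_top FILE [N]: print the top N entries of a CPU profile.
profile_top() {
  go tool pprof -top -nodecount="${2:-15}" "$1"
}

# Example:
#   profile_top cpu_load.out
#   profile_top cpu_ws.out 30
#   go tool pprof -http=:8081 cpu_load.out   # browse the profile in a web UI
```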