# Centralized Monitoring Stack

## Quick Start

```bash
# Start everything
docker-compose up -d

# Access services
open http://localhost:3000  # Grafana (admin/admin)
open http://localhost:9090  # Prometheus
```
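
`docker-compose up -d` returns once the containers are started, which can be before the services accept requests. A minimal sketch of a readiness check, assuming the ports listed in this README and each service's standard health endpoint (`/api/health` for Grafana, `/-/ready` for Prometheus, `/ready` for Loki). The function names, retry count, and delay are illustrative and not part of this repository:

```bash
# Sketch: poll a health endpoint until it answers or the retries run out.
# Arguments: display name, URL, number of tries (default 30), delay in seconds (default 2).
wait_for() {
  local name="$1" url="$2" tries="${3:-30}" delay="${4:-2}"
  local i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl fail on HTTP errors such as 503 while a service is starting.
    if curl -fs -o /dev/null "$url"; then
      echo "$name is ready"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "$name did not become ready: $url" >&2
  return 1
}

# Check the three services that publish a port in this stack.
check_stack() {
  wait_for "Grafana"    "http://localhost:3000/api/health" "$@" &&
  wait_for "Prometheus" "http://localhost:9090/-/ready"    "$@" &&
  wait_for "Loki"       "http://localhost:3100/ready"      "$@"
}
```

Run `check_stack` after `docker-compose up -d`; `check_stack 10 1` tries each service up to 10 times with a 1 second pause.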

## Services

### Grafana (Port 3000)
**Main monitoring dashboard**
- Username: `admin`
- Password: `admin`
- Pre-configured datasources: Prometheus + Loki
- Pre-loaded "ML Task Queue Monitoring" dashboard

### Prometheus (Port 9090)
**Metrics collection**
- Scrapes metrics from API server (`:9100/metrics`)
- 15s scrape interval
- Data retention: 15 days (default)

### Loki (Port 3100)
**Log aggregation**
- Collects logs from all containers
- Collects application logs from `./logs/`
- Retention: 7 days

### Promtail
**Log shipping**
- Watches Docker container logs
- Watches `./logs/*.log`
- Sends to Loki

## Viewing Data

### Metrics
1. Open Grafana: http://localhost:3000
2. Go to the "ML Task Queue Monitoring" dashboard
3. The dashboard shows queue depth, task duration, error rates, and related metrics.

### Logs
1. Open Grafana → Explore
2. Select the "Loki" datasource
3. Query examples:
```logql
{job="app_logs"}                      # All app logs
{job="docker",service="api-server"}   # API server logs
{job="docker"} |= "error"             # Container log lines containing "error"
```
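
The same queries can also be run outside Grafana through Loki's HTTP API. A sketch, assuming Loki is published on `localhost:3100` as listed under Services; `loki_query` is a hypothetical helper, and without explicit `start`/`end` parameters Loki searches the last hour:

```bash
# Sketch: run a LogQL query against Loki's query_range endpoint.
# Arguments: LogQL query, maximum number of log lines (default 20).
loki_query() {
  local query="$1" limit="${2:-20}"
  # -G turns the url-encoded parameters into a GET query string.
  curl -sS -G "http://localhost:3100/loki/api/v1/query_range" \
    --data-urlencode "query=${query}" \
    --data-urlencode "limit=${limit}"
}
```

For example, `loki_query '{job="docker"} |= "error"' 5` returns up to five matching lines as JSON.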

## Architecture

```
┌─────────────┐
│ API Server  │──┐
└─────────────┘  │
                 ├──► Prometheus ──► Grafana
┌─────────────┐  │                      ▲
│   Worker    │──┘                      │
└─────────────┘                         │
                                        │
┌─────────────┐                         │
│  App Logs   │──┐                      │
└─────────────┘  │                      │
                 ├──► Promtail ──► Loki ┘
┌─────────────┐  │
│ Docker Logs │──┘
└─────────────┘
```

## Configuration Files

- `prometheus.yml` - Metrics scraping config
- `loki-config.yml` - Log storage config
- `promtail-config.yml` - Log collection config
- `grafana/provisioning/` - Auto-configuration

## Customization

### Add More Scrapers
Edit `monitoring/prometheus.yml`:
```yaml
scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['my-service:9100']
```
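
Prometheus refuses to start with an invalid configuration, so it can be worth validating the file before restarting. A hedged sketch using `promtool`, which ships in the official `prom/prometheus` image. The image name, the Compose service name `prometheus`, and the `reload_prometheus` helper are assumptions, not taken from this repository:

```bash
# Sketch: validate the Prometheus config, then restart only if it passes.
# Argument: path to the config file (default monitoring/prometheus.yml).
reload_prometheus() {
  local config="${1:-monitoring/prometheus.yml}"
  # Run promtool from the image against a read-only mount of the config.
  docker run --rm --entrypoint promtool \
    -v "$(pwd)/${config}:/tmp/prometheus.yml:ro" \
    prom/prometheus check config /tmp/prometheus.yml || return 1
  # Assumed service name; adjust to match the Compose file.
  docker-compose restart prometheus
}
```

Run `reload_prometheus` from the repository root; a failed check leaves the running Prometheus untouched.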

### Change Retention
**Prometheus:** Add this flag to the Prometheus service's `command` list in the Compose file:
```yaml
- '--storage.tsdb.retention.time=30d'
```

**Loki:** Edit `loki-config.yml`. Loki only enforces this limit when its compactor runs with `retention_enabled: true`:
```yaml
limits_config:
  retention_period: 720h  # 30 days
```

## Troubleshooting

**No metrics showing:**
```bash
# Check if Prometheus can reach targets
curl http://localhost:9090/api/v1/targets

# Check if API exposes metrics
curl http://localhost:9100/metrics
```
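
The targets endpoint returns a large JSON document. A small sketch that reduces it to a count per health state without requiring `jq`; `target_health` is a hypothetical helper, and it relies on Prometheus emitting compact JSON (`"health":"up"` with no spaces):

```bash
# Sketch: read the /api/v1/targets response on stdin and count targets by health.
target_health() {
  # Pull out every "health":"<state>" pair, keep the state, and tally it.
  grep -o '"health":"[a-z]*"' | cut -d'"' -f4 | sort | uniq -c
}
```

Use it as `curl -s http://localhost:9090/api/v1/targets | target_health`. For any target counted as `down`, the `lastError` field in the full response gives the reason.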

**No logs showing:**
```bash
# Check Promtail status
docker logs ml-experiments-promtail

# Check that Loki is up and ready
curl http://localhost:3100/ready

# Verify Loki has received logs (lists the label names it knows)
curl http://localhost:3100/loki/api/v1/labels
```

**Grafana can't connect to datasources:**
```bash
# Restart Grafana
docker-compose restart grafana
```
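
To confirm which datasources Grafana actually loaded, its HTTP API can be queried directly. A minimal sketch using the `/api/datasources` endpoint, assuming the default `admin`/`admin` credentials are still in place and that the response is compact JSON; `grafana_datasources` is a hypothetical helper:

```bash
# Sketch: print the name of each datasource Grafana knows about.
# Argument: user:password for basic auth (default admin:admin).
grafana_datasources() {
  local auth="${1:-admin:admin}"
  curl -fsS -u "$auth" "http://localhost:3000/api/datasources" |
    grep -o '"name":"[^"]*"' | cut -d'"' -f4
}
```

With the provisioning described above, `grafana_datasources` should list the Prometheus and Loki datasources; an empty result points at the provisioning files rather than at connectivity.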

## Profiling Quick Start

To capture CPU profiles while exercising real workloads:

```bash
# HTTP LoadTestSuite (MediumLoad scenario)
make profile-load

# WebSocket → Redis queue → worker integration
make profile-ws-queue
```

Then inspect profiles with:

```bash
go tool pprof cpu_load.out  # HTTP load
go tool pprof cpu_ws.out    # WebSocket/queue/worker
```
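
`go tool pprof` opens an interactive shell by default. A sketch of a non-interactive summary using the `-top` and `-nodecount` flags; `top_cpu` is a hypothetical wrapper name:

```bash
# Sketch: print the heaviest entries of a CPU profile and exit.
# Arguments: profile file, maximum number of entries (default 15).
top_cpu() {
  local profile="$1" count="${2:-15}"
  go tool pprof -top -nodecount="$count" "$profile"
}
```

For example, `top_cpu cpu_load.out 10` prints at most ten entries. For a browser view, `go tool pprof -http=:8081 cpu_load.out` serves the same profile as a web UI (the port is arbitrary).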