fetch_ml/monitoring/README.md
Jeremie Fraeys 4aecd469a1 feat: implement comprehensive monitoring and container orchestration
- Add Prometheus, Grafana, and Loki monitoring stack
- Include pre-configured dashboards for ML metrics and logs
- Add Podman container support with security policies
- Implement ML runtime environments for multiple frameworks
- Add containerized ML project templates (PyTorch, TensorFlow, etc.)
- Include secure runner with isolation and resource limits
- Add comprehensive log aggregation and alerting
2025-12-04 16:54:49 -05:00


# Centralized Monitoring Stack
## Quick Start
```bash
# Start everything
docker-compose up -d
# Access services
open http://localhost:3000 # Grafana (admin/admin)
open http://localhost:9090 # Prometheus
```
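The containers can take a few seconds to come up after `docker-compose up -d`. A minimal sketch of a readiness check follows, assuming the default ports listed in this README and each service's standard health endpoint (`/api/health` for Grafana, `/-/ready` for Prometheus, `/ready` for Loki). The function names `wait_for_url` and `wait_for_stack` are illustrative and not part of this repository.

```bash
#!/usr/bin/env bash
# Poll a URL until it answers with a 2xx status or the timeout (seconds) expires.
wait_for_url() {
  local url="$1" timeout="${2:-60}" waited=0
  until curl -fsS -o /dev/null --max-time 2 "$url" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $url" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "ready: $url"
}

# Check the three HTTP services in this stack, stopping at the first failure.
wait_for_stack() {
  wait_for_url http://localhost:3000/api/health &&  # Grafana
  wait_for_url http://localhost:9090/-/ready &&     # Prometheus
  wait_for_url http://localhost:3100/ready          # Loki
}
```

After sourcing the snippet, run `wait_for_stack` once the stack has been started. It returns non-zero if any service fails to answer within 60 seconds.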
## Services
### Grafana (Port 3000)
**Main monitoring dashboard**
- Username: `admin`
- Password: `admin`
- Pre-configured datasources: Prometheus + Loki
- Pre-loaded ML Queue dashboard
### Prometheus (Port 9090)
**Metrics collection**
- Scrapes metrics from API server (`:9100/metrics`)
- 15s scrape interval
- Data retention: 15 days (default)
### Loki (Port 3100)
**Log aggregation**
- Collects logs from all containers
- Collects application logs from `./logs/`
- Retention: 7 days
### Promtail
**Log shipping**
- Watches Docker container logs
- Watches `./logs/*.log`
- Sends to Loki
## Viewing Data
### Metrics
1. Open Grafana: http://localhost:3000
2. Go to "ML Task Queue Monitoring" dashboard
3. The dashboard shows queue depth, task duration, error rates, and related metrics.
### Logs
1. Open Grafana → Explore
2. Select "Loki" datasource
3. Query examples:
```logql
{job="app_logs"} # All app logs
{job="docker",service="api-server"} # API server logs
{job="docker"} |= "error" # Container log lines containing "error" (case-sensitive)
```
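The same LogQL queries can also be run from a terminal through Loki's HTTP API, which is handy when Grafana itself is the thing being debugged. This is a rough sketch that assumes Loki is reachable on port 3100 as described above. The `loki_query` helper and the `LOKI_URL` override are illustrative names, not part of this repository.

```bash
# Run a LogQL query against Loki's query_range endpoint and print the raw JSON.
# With no time range given, Loki falls back to its default window (the last hour).
# LOKI_URL is an assumed override; the default matches the port used in this README.
loki_query() {
  local query="$1" limit="${2:-20}"
  curl -fsS -G --max-time 10 \
    "${LOKI_URL:-http://localhost:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=${query}" \
    --data-urlencode "limit=${limit}"
}
```

For example, `loki_query '{job="docker"} |= "error"' 5` should return up to five matching log lines. `--data-urlencode` takes care of the braces and quotes in the selector.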
## Architecture
```
┌─────────────┐
│ API Server  │──┐
└─────────────┘  │
                 ├──► Prometheus ──► Grafana
┌─────────────┐  │                      ▲
│   Worker    │──┘                      │
└─────────────┘                         │
┌─────────────┐                         │
│  App Logs   │──┐                      │
└─────────────┘  │                      │
                 ├──► Promtail ──► Loki ┘
┌─────────────┐  │
│ Docker Logs │──┘
└─────────────┘
```
## Configuration Files
- `prometheus.yml` - Metrics scraping config
- `loki-config.yml` - Log storage config
- `promtail-config.yml` - Log collection config
- `grafana/provisioning/` - Auto-configuration
## Customization
### Add More Scrapers
Edit `monitoring/prometheus.yml`:
```yaml
scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['my-service:9100']
```
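A malformed `prometheus.yml` will stop Prometheus from starting, so it can help to validate the file before applying it. The sketch below uses `promtool`, which ships in the official Prometheus image. It assumes the compose service is named `prometheus` and that the config is mounted at the image's default path, `/etc/prometheus/prometheus.yml`. Both are assumptions to check against your docker-compose file, and `reload_prometheus` is a hypothetical helper name.

```bash
# Validate the edited config inside the running container, then restart to apply it.
# Assumed: service name "prometheus", config mounted at /etc/prometheus/prometheus.yml.
reload_prometheus() {
  docker-compose exec -T prometheus promtool check config /etc/prometheus/prometheus.yml &&
    docker-compose restart prometheus
}
```

The restart only runs if the check passes, so a typo in the new scrape job leaves the running instance untouched.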
### Change Retention
**Prometheus:** Add this flag to the Prometheus service's `command` list in the docker-compose file:
```yaml
- '--storage.tsdb.retention.time=30d'
```
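**Loki:** Edit `loki-config.yml`. Recent Loki versions enforce this limit only when the compactor runs with `retention_enabled: true`: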
```yaml
limits_config:
  retention_period: 720h # 30 days
```
## Troubleshooting
**No metrics showing:**
```bash
# Check if Prometheus can reach targets
curl http://localhost:9090/api/v1/targets
# Check if API exposes metrics
curl http://localhost:9100/metrics
```
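The targets response is a large JSON document, and usually the only question is whether any target is down. One possible way to summarize it with standard tools is sketched below. It relies on Prometheus emitting compact JSON (`"health":"up"` with no spaces), so `jq` would be the more robust choice where it is installed. `count_target_health` is an illustrative name.

```bash
# Count scrape targets by health state (up, down, unknown).
# Reads the /api/v1/targets JSON response on stdin.
count_target_health() {
  grep -o '"health":"[a-z]*"' | cut -d'"' -f4 | sort | uniq -c | awk '{print $2, $1}'
}
```

Pipe the API response into it with `curl -s http://localhost:9090/api/v1/targets | count_target_health`. Any `down` line means Prometheus cannot reach at least one target.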
**No logs showing:**
```bash
# Check Promtail status
docker logs ml-experiments-promtail
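# Check that Loki is up and ready
curl http://localhost:3100/ready
# Check that Loki has ingested logs (the label list should be non-empty)
curl http://localhost:3100/loki/api/v1/labels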
```
**Grafana can't connect to datasources:**
```bash
# Restart Grafana
docker-compose restart grafana
```