fetch_ml/monitoring/README.md
Jeremie Fraeys 4aecd469a1 feat: implement comprehensive monitoring and container orchestration
2025-12-04 16:54:49 -05:00

Centralized Monitoring Stack

Quick Start

# Start everything
docker-compose up -d

# Access services (`open` is macOS-only; on Linux use xdg-open or paste the URL into a browser)
open http://localhost:3000  # Grafana (admin/admin)
open http://localhost:9090  # Prometheus
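The containers may need a few seconds before they accept connections after `docker-compose up -d` returns. The following is a minimal sketch of a readiness wait, assuming curl is installed and that each service exposes its standard health endpoint (Grafana `/api/health`, Prometheus `/-/ready`, Loki `/ready`). The `wait_for` helper name and the 60-second default are illustrative and not part of this repository.

```shell
# wait_for URL [TIMEOUT_SECONDS]
# Poll URL once per second until it answers with an HTTP success status,
# or give up after the timeout (default: 60 attempts, roughly 60 seconds).
wait_for() {
  url=$1
  remaining=${2:-60}
  while [ "$remaining" -gt 0 ]; do
    # -f makes curl exit non-zero on HTTP error responses (4xx/5xx)
    if curl -fsS -o /dev/null "$url" 2>/dev/null; then
      echo "ready: $url"
      return 0
    fi
    remaining=$((remaining - 1))
    sleep 1
  done
  echo "timed out: $url" >&2
  return 1
}
```

For example: `wait_for http://localhost:3000/api/health && wait_for http://localhost:9090/-/ready && wait_for http://localhost:3100/ready`.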

Services

Grafana (Port 3000)

Main monitoring dashboard

  • Username: admin
  • Password: admin
  • Pre-configured datasources: Prometheus + Loki
  • Pre-loaded ML Queue dashboard

Prometheus (Port 9090)

Metrics collection

  • Scrapes metrics from API server (:9100/metrics)
  • 15s scrape interval
  • Data retention: 15 days (default)

Loki (Port 3100)

Log aggregation

  • Collects logs from all containers
  • Collects application logs from ./logs/
  • Retention: 7 days

Promtail

Log shipping

  • Watches Docker container logs
  • Watches ./logs/*.log
  • Sends to Loki

Viewing Data

Metrics

  1. Open Grafana: http://localhost:3000
  2. Go to "ML Task Queue Monitoring" dashboard
  3. The panels show queue depth, task duration, error rates, and related metrics.
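The same data can also be read without Grafana through the Prometheus HTTP API. A sketch, assuming curl is installed: `prom_query` and the `PROM_URL` variable are hypothetical names, and `up` is Prometheus's built-in per-target health series (this README does not list the application's own metric names, so none are assumed here).

```shell
# prom_query EXPR
# Run an instant PromQL query against the local Prometheus and print
# the raw JSON response. PROM_URL is an optional, illustrative override.
prom_query() {
  # -G sends the url-encoded data as a GET query string
  curl -fsS -G "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example, `prom_query up` returns one sample per scrape target, where a value of 1 means the last scrape succeeded.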

Logs

  1. Open Grafana → Explore
  2. Select "Loki" datasource
  3. Query examples:
    {job="app_logs"}                     # All application logs
    {job="docker",service="api-server"}  # API server container logs
    {job="docker"} |= "error"            # Container log lines containing "error" (case-sensitive)
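These selectors can also be run outside Grafana through Loki's HTTP API, which may help when scripting checks. A sketch, assuming curl is installed: `loki_query` and the `LOKI_URL` variable are hypothetical names, and because no time range is passed, Loki applies its own default window.

```shell
# loki_query LOGQL [LIMIT]
# Fetch recent log lines matching a LogQL query from the local Loki and
# print the raw JSON response. LIMIT defaults to 20 entries.
loki_query() {
  curl -fsS -G "${LOKI_URL:-http://localhost:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${2:-20}"
}
```

For example: `loki_query '{job="docker"} |= "error"' 5`.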
    

Architecture

┌─────────────┐
│  API Server │──┐
└─────────────┘  │
                 ├──► Prometheus ──► Grafana
┌─────────────┐  │                      ▲
│   Worker    │──┘                      │
└─────────────┘                         │
                                        │
┌─────────────┐                         │
│  App Logs   │──┐                      │
└─────────────┘  │                      │
                 ├──► Promtail ──► Loki ┘
┌─────────────┐  │
│Docker Logs  │──┘
└─────────────┘

Configuration Files

  • prometheus.yml - Metrics scraping config
  • loki-config.yml - Log storage config
  • promtail-config.yml - Log collection config
  • grafana/provisioning/ - Auto-configuration

Customization

Add More Scrapers

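Edit monitoring/prometheus.yml and add a job under the existing scrape_configs key. Restart Prometheus afterwards so it picks up the change: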

scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['my-service:9100']
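A syntax error in prometheus.yml can keep Prometheus from starting, so it may be worth validating the file before restarting. A sketch using `promtool check config`, run from the official `prom/prometheus` image. The image name, the in-container path, and the `check_prom_config` helper name are assumptions, since this README does not show the docker-compose file.

```shell
# check_prom_config [FILE]
# Validate a Prometheus config file with promtool inside a throwaway
# container. FILE defaults to monitoring/prometheus.yml.
check_prom_config() {
  file=${1:-monitoring/prometheus.yml}
  # docker needs an absolute host path for the bind mount
  dir=$(cd "$(dirname "$file")" && pwd) || return 1
  docker run --rm \
    -v "$dir/$(basename "$file"):/etc/prometheus/prometheus.yml:ro" \
    --entrypoint promtool \
    prom/prometheus check config /etc/prometheus/prometheus.yml
}
```

If the check passes, restart the Prometheus container so the new job is loaded.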

Change Retention

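Prometheus: add this flag to the Prometheus service's command list in the docker-compose file, then recreate the container (docker-compose up -d):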

- '--storage.tsdb.retention.time=30d'

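Loki: edit loki-config.yml. Current Loki versions enforce this limit only when compactor retention is enabled (retention_enabled: true in the compactor section):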

limits_config:
  retention_period: 720h  # 30 days
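To confirm that a new Prometheus retention flag took effect, the running flag values can be read back from the `/api/v1/status/flags` endpoint. A sketch, assuming curl is installed: `prom_retention` and `PROM_URL` are hypothetical names, and how an unset flag is reported may differ between Prometheus versions.

```shell
# prom_retention
# Print the value of --storage.tsdb.retention.time as reported by the
# running Prometheus (for example "30d").
prom_retention() {
  curl -fsS "${PROM_URL:-http://localhost:9090}/api/v1/status/flags" |
    grep -o '"storage\.tsdb\.retention\.time":"[^"]*"' |
    cut -d'"' -f4
}
```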

Troubleshooting

No metrics showing:

# Check if Prometheus can reach targets
curl http://localhost:9090/api/v1/targets

# Check if API exposes metrics
curl http://localhost:9100/metrics
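The targets response is a large JSON document, so a short summary can be easier to scan. A sketch that counts targets by health state using only grep, cut, sort, and uniq: `prom_target_health` and `PROM_URL` are hypothetical names, and the text matching assumes the compact JSON that Prometheus normally returns.

```shell
# prom_target_health
# Print how many scrape targets are in each health state
# (up, down, or unknown), one state per line.
prom_target_health() {
  curl -fsS "${PROM_URL:-http://localhost:9090}/api/v1/targets" |
    grep -o '"health":"[a-z]*"' |
    cut -d'"' -f4 |
    sort | uniq -c
}
```

Any target reported as down usually points to a wrong address in prometheus.yml or a service that is not exposing /metrics.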

No logs showing:

# Check Promtail status
docker logs ml-experiments-promtail

# Verify Loki is up (readiness only; this does not confirm that logs are arriving)
curl http://localhost:3100/ready
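A ready Loki may still be empty. One way to check that streams have actually been ingested is to ask Loki which label names it knows about. A sketch, assuming curl is installed: `loki_has_label` and `LOKI_URL` are hypothetical names, and the endpoint only covers Loki's default lookback window.

```shell
# loki_has_label NAME
# Succeed (exit 0) if Loki reports a label called NAME, which implies that
# at least one stream carrying that label has been ingested recently.
loki_has_label() {
  curl -fsS "${LOKI_URL:-http://localhost:3100}/loki/api/v1/labels" |
    # keep only the contents of the "data" array before matching
    sed -n 's/.*"data":\[\([^]]*\)].*/\1/p' |
    grep -q "\"$1\""
}
```

For example: `loki_has_label job && echo "Loki has log streams"`.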

Grafana can't connect to datasources:

# Restart Grafana so it reloads the provisioned datasources
docker-compose restart grafana
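After the restart, the datasources that Grafana actually loaded can be listed through its HTTP API. A sketch, assuming curl is installed and the default admin/admin credentials from this README are still in place. `grafana_datasources` and the `GRAFANA_URL`, `GRAFANA_USER`, and `GRAFANA_PASS` variables are hypothetical names, and the text matching assumes a compact JSON response.

```shell
# grafana_datasources
# Print the name of every datasource configured in the local Grafana,
# one per line.
grafana_datasources() {
  curl -fsS -u "${GRAFANA_USER:-admin}:${GRAFANA_PASS:-admin}" \
    "${GRAFANA_URL:-http://localhost:3000}/api/datasources" |
    grep -o '"name":"[^"]*"' |
    cut -d'"' -f4
}
```

If Prometheus or Loki is missing from the output, the files under grafana/provisioning/ are a reasonable place to look next.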