Jeremie Fraeys 4aecd469a1 feat: implement comprehensive monitoring and container orchestration

- Add Prometheus, Grafana, and Loki monitoring stack
- Include pre-configured dashboards for ML metrics and logs
- Add Podman container support with security policies
- Implement ML runtime environments for multiple frameworks
- Add containerized ML project templates (PyTorch, TensorFlow, etc.)
- Include secure runner with isolation and resource limits
- Add comprehensive log aggregation and alerting

2025-12-04 16:54:49 -05:00

3.1 KiB


Centralized Monitoring Stack

Quick Start

# Start everything
docker-compose up -d

# Access services (open is a macOS command; on Linux use xdg-open)
open http://localhost:3000  # Grafana (admin/admin)
open http://localhost:9090  # Prometheus

Services

Grafana (Port 3000)

Main monitoring dashboard

Username: admin
Password: admin
Pre-configured datasources: Prometheus + Loki
Pre-loaded ML Queue dashboard

Prometheus (Port 9090)

Metrics collection

Scrapes metrics from API server (:9100/metrics)
15s scrape interval
Data retention: 15 days (default)
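
For Prometheus to scrape a service, that service has to answer GET /metrics in the Prometheus text exposition format. The sketch below shows roughly what such an endpoint looks like using only the Python standard library. The metric name ml_queue_depth and the helper names are hypothetical and not taken from this project; a real API server would more likely use a Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(queue_depth: int) -> str:
    """Render a single gauge in the Prometheus text exposition format."""
    return (
        "# HELP ml_queue_depth Number of tasks waiting in the queue.\n"
        "# TYPE ml_queue_depth gauge\n"
        f"ml_queue_depth {queue_depth}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical placeholder; a real service would read live queue state.
    queue_depth = 0

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.queue_depth).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9100) -> None:
    """Serve /metrics on the port this README says Prometheus scrapes."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() blocks and answers scrapes on port 9100; with the 15s interval above, Prometheus would read the gauge four times a minute.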

Loki (Port 3100)

Log aggregation

Collects logs from all containers
Collects application logs from ./logs/
Retention: 7 days

Promtail

Log shipping

Watches Docker container logs
Watches ./logs/*.log
Sends to Loki
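
Because Promtail watches ./logs/*.log, any application that writes a .log file into that directory should have its output shipped to Loki. A minimal sketch of a file logger that does this is below; the function name, logger name, and line format are assumptions for illustration, not this project's actual logging setup.

```python
import logging
from pathlib import Path


def make_file_logger(log_dir: str = "./logs", name: str = "app") -> logging.Logger:
    """Create a logger that appends to <log_dir>/<name>.log.

    Call this once per logger name; calling it again adds a second handler
    and would duplicate every line.
    """
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(Path(log_dir) / f"{name}.log", encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

For example, make_file_logger().error("task failed") appends one line to ./logs/app.log, which matches the watched ./logs/*.log pattern.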

Viewing Data

Metrics

Open Grafana: http://localhost:3000
Go to "ML Task Queue Monitoring" dashboard
See: queue depth, task duration, error rates, etc.

Logs

Open Grafana → Explore
Select "Loki" datasource

Query examples:

{job="app_logs"}                    # All app logs
{job="docker",service="api-server"} # API server logs
{job="docker"} |= "error"          # Lines containing "error" (case-sensitive)
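
The same LogQL queries can be run outside Grafana through Loki's HTTP API. The sketch below builds a request for the /loki/api/v1/query_range endpoint and flattens the returned streams into plain log lines. The function names are hypothetical, and the base URL assumes Loki is reachable on localhost:3100 as described in this README.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(logql: str, limit: int = 100,
                    base: str = "http://localhost:3100") -> str:
    """Build a query_range URL; Loki applies its default time window."""
    params = urllib.parse.urlencode({"query": logql, "limit": limit})
    return f"{base}/loki/api/v1/query_range?{params}"


def extract_lines(payload: dict) -> list:
    """Flatten a streams response into a list of log lines."""
    return [
        line
        for stream in payload["data"]["result"]
        for _timestamp, line in stream["values"]
    ]


def fetch_lines(logql: str, limit: int = 100) -> list:
    """Run the query against a live Loki and return the matching lines."""
    with urllib.request.urlopen(build_query_url(logql, limit), timeout=10) as resp:
        return extract_lines(json.load(resp))
```

For example, fetch_lines('{job="docker"} |= "error"') would return recent container log lines containing "error", provided the stack is running.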

Architecture

┌─────────────┐
│  API Server │──┐
└─────────────┘  │
                 ├──► Prometheus ──► Grafana
┌─────────────┐  │                      ▲
│   Worker    │──┘                      │
└─────────────┘                         │
                                        │
┌─────────────┐                         │
│  App Logs   │──┐                      │
└─────────────┘  │                      │
                 ├──► Promtail ──► Loki ┘
┌─────────────┐  │
│Docker Logs  │──┘
└─────────────┘

Configuration Files

prometheus.yml - Metrics scraping config
loki-config.yml - Log storage config
promtail-config.yml - Log collection config
grafana/provisioning/ - Auto-configuration

Customization

Add More Scrapers

Edit monitoring/prometheus.yml:

scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['my-service:9100']

Change Retention

Prometheus: Add to the Prometheus service's command list in the docker-compose file:

- '--storage.tsdb.retention.time=30d'

Loki: Edit loki-config.yml:

limits_config:
  retention_period: 720h  # 30 days; enforced only if retention is enabled in Loki (e.g. the compactor's retention_enabled: true)
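
The Loki value above is written in hours, so a retention target in days has to be multiplied by 24 (30 days gives 720h, the current 7 days gives 168h). A tiny helper, hypothetical and only for illustration, makes the arithmetic explicit:

```python
def retention_hours(days: int) -> str:
    """Convert a retention period in days to an hours string such as '720h'."""
    if days <= 0:
        raise ValueError("retention must be a positive number of days")
    return f"{days * 24}h"
```

Prometheus accepts day units directly, as in the 30d flag above, so the conversion is only needed when writing the period in hours.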

Troubleshooting

No metrics showing:

# Check if Prometheus can reach targets
curl http://localhost:9090/api/v1/targets

# Check if API exposes metrics
curl http://localhost:9100/metrics
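
The JSON returned by /api/v1/targets can be long. The sketch below filters it down to the active targets whose health is not "up", along with the last scrape error Prometheus recorded. The function names are hypothetical; the field names (activeTargets, health, scrapeUrl, lastError) follow the Prometheus HTTP API.

```python
import json
import urllib.request


def unhealthy_targets(payload: dict) -> list:
    """Return (scrape URL, last error) for every active target that is not up."""
    return [
        (target.get("scrapeUrl", ""), target.get("lastError", ""))
        for target in payload["data"]["activeTargets"]
        if target.get("health") != "up"
    ]


def check_targets(base: str = "http://localhost:9090") -> list:
    """Fetch the target list from a live Prometheus and report failing targets."""
    with urllib.request.urlopen(f"{base}/api/v1/targets", timeout=10) as resp:
        return unhealthy_targets(json.load(resp))
```

An empty list from check_targets() means every configured target is being scraped successfully, which points the search toward the dashboard or datasource instead.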

No logs showing:

# Check Promtail status
docker logs ml-experiments-promtail

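# Check that Loki is ready (this confirms the service is up, not that logs are arriving)
curl http://localhost:3100/ready

# List the labels Loki has seen recently; an empty list means no logs have arrived
curl http://localhost:3100/loki/api/v1/labels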

Grafana can't connect to datasources:

# Restart Grafana  
docker-compose restart grafana
