Centralized Monitoring Stack
Quick Start
# Start everything
docker-compose up -d
# Access services
open http://localhost:3000 # Grafana (admin/admin)
open http://localhost:9090 # Prometheus
Services
Grafana (Port 3000)
Main monitoring dashboard
- Username: admin
- Password: admin
- Pre-configured datasources: Prometheus + Loki
- Pre-loaded ML Queue dashboard
Prometheus (Port 9090)
Metrics collection
- Scrapes metrics from API server (:9100/metrics)
- 15s scrape interval
- Data retention: 15 days (default)
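For orientation, a scrape target such as the API server's :9100/metrics endpoint answers with plain text in the Prometheus exposition format. The following is a minimal Python sketch of rendering that format. The metric names `ml_queue_depth` and `ml_tasks_failed_total` are hypothetical and are not taken from this project.

```python
def render_metrics(gauges, counters):
    """Render metrics in the Prometheus text exposition format.

    Both arguments map metric names to numeric values. The names used
    in the example below are illustrative assumptions only.
    """
    lines = []
    for kind, metrics in (("gauge", gauges), ("counter", counters)):
        for name in sorted(metrics):
            # A TYPE line tells Prometheus how to interpret the sample.
            lines.append(f"# TYPE {name} {kind}")
            lines.append(f"{name} {metrics[name]}")
    return "\n".join(lines) + "\n"
```

Calling `render_metrics({"ml_queue_depth": 3}, {"ml_tasks_failed_total": 1})` yields four lines of text, which is the kind of body Prometheus expects to fetch every 15 seconds.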
Loki (Port 3100)
Log aggregation
- Collects logs from all containers
- Collects application logs from ./logs/
- Retention: 7 days
Promtail
Log shipping
- Watches Docker container logs
- Watches ./logs/*.log
- Sends to Loki
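To make "sends to Loki" concrete, here is a rough Python sketch of the JSON body accepted by Loki's `/loki/api/v1/push` endpoint. Promtail's own wire encoding may differ, and the helper name `build_push_payload` is an assumption made for illustration.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a JSON body for Loki's POST /loki/api/v1/push endpoint.

    labels: dict of stream labels, e.g. {"job": "app_logs"}
    lines:  list of log lines to ship
    """
    if now_ns is None:
        now_ns = time.time_ns()
    # Each entry is [timestamp in nanoseconds as a string, log line].
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})
```

The labels attached here (for example `job`) are what the Grafana queries later filter on.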
Viewing Data
Metrics
- Open Grafana: http://localhost:3000
- Go to "ML Task Queue Monitoring" dashboard
- See: queue depth, task duration, error rates, etc.
Logs
- Open Grafana → Explore
- Select "Loki" datasource
- Query examples:
{job="app_logs"} # All app logs
{job="docker",service="api-server"} # API server logs
{job="docker"} |= "error" # All errors
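The same queries can also be run outside Grafana through Loki's HTTP API. A minimal Python sketch that only builds the request URL for the `/loki/api/v1/query_range` endpoint, assuming Loki is reachable on the default port 3100 from this README (the function name is hypothetical):

```python
from urllib.parse import urlencode


def loki_query_url(logql, base_url="http://localhost:3100", limit=100):
    """Return a URL for Loki's /loki/api/v1/query_range endpoint."""
    # urlencode escapes the braces, quotes and pipes in the LogQL string.
    params = urlencode({"query": logql, "limit": limit})
    return f"{base_url}/loki/api/v1/query_range?{params}"
```

For example, `loki_query_url('{job="docker"} |= "error"', limit=5)` gives a URL that can be passed to `curl`. With no start or end given, Loki applies its default time range.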
Architecture
┌─────────────┐
│ API Server  │──┐
└─────────────┘  │
                 ├──► Prometheus ──► Grafana
┌─────────────┐  │                      ▲
│   Worker    │──┘                      │
└─────────────┘                         │
                                        │
┌─────────────┐                         │
│  App Logs   │──┐                      │
└─────────────┘  │                      │
                 ├──► Promtail ──► Loki ┘
┌─────────────┐  │
│ Docker Logs │──┘
└─────────────┘
Configuration Files
- prometheus.yml - Metrics scraping config
- loki-config.yml - Log storage config
- promtail-config.yml - Log collection config
- grafana/provisioning/ - Auto-configuration
Customization
Add More Scrapers
Edit monitoring/prometheus.yml:
scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['my-service:9100']
Change Retention
Prometheus: Add this flag to the Prometheus service's command in docker-compose:
- '--storage.tsdb.retention.time=30d'
Loki: Edit loki-config.yml:
limits_config:
  retention_period: 720h # 30 days
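Loki's retention_period is written in hours here, so 30 days becomes 30 × 24 = 720h. A tiny Python helper, offered only as a convenience sketch (the function name is made up), does the conversion for other values:

```python
def days_to_hours(days):
    """Convert a retention in days to an hour-based duration string."""
    return f"{days * 24}h"
```

With it, the default 7-day retention corresponds to `days_to_hours(7)`, i.e. `168h`.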
Troubleshooting
No metrics showing:
# Check if Prometheus can reach targets
curl http://localhost:9090/api/v1/targets
# Check if API exposes metrics
curl http://localhost:9100/metrics
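The JSON returned by `/api/v1/targets` can be long, so it may help to filter it down to the failing targets. A minimal Python sketch, assuming the response shape of the Prometheus HTTP API (`data.activeTargets` entries with `health`, `scrapeUrl` and `lastError` fields); the helper names are hypothetical:

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9090"):
    """Download the targets report from a running Prometheus."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return (scrapeUrl, lastError) for every active target that is not up."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]
```

With the stack running, `unhealthy_targets(fetch_targets())` returns an empty list when every scrape target is reachable.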
No logs showing:
# Check Promtail status
docker logs ml-experiments-promtail
# Verify Loki is up and ready to accept logs
curl http://localhost:3100/ready
Grafana can't connect to datasources:
# Restart Grafana
docker-compose restart grafana
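Before digging into a single service, it can be quicker to probe all three at once. A small Python sketch, assuming the default ports from this README and the standard health endpoints of each tool (`/api/health` for Grafana, `/-/ready` for Prometheus, `/ready` for Loki); the function names are hypothetical:

```python
from urllib.error import URLError
from urllib.request import urlopen

# Health endpoints, assuming the default ports listed in this README.
ENDPOINTS = {
    "grafana": "http://localhost:3000/api/health",
    "prometheus": "http://localhost:9090/-/ready",
    "loki": "http://localhost:3100/ready",
}


def http_ok(url, timeout=3):
    """Return True if the URL answers with HTTP 200, False otherwise."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False


def check_stack(probe=http_ok):
    """Map each service name to True (healthy) or False."""
    return {name: probe(url) for name, url in ENDPOINTS.items()}
```

Running `print(check_stack())` shows which service to look at first. The `probe` parameter exists only so the check can be exercised without a live stack.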