# Quick Start

Get Fetch ML running in minutes with Docker Compose and integrated monitoring.
## Prerequisites

Container Runtimes:

- Docker Compose: for testing and development only
- Podman: for production experiment execution

Requirements:

- Go 1.25+
- Zig 0.15+
- Docker Compose (testing only)
- 4 GB+ RAM
- 2 GB+ disk space
- Git
## One-Command Setup

```bash
# Clone and start
git clone https://github.com/jfraeys/fetch_ml.git
cd fetch_ml
make dev-up

# Wait for services (~30 seconds)
sleep 30

# Verify setup
curl http://localhost:8080/health
```

Note: the development compose runs the API server over HTTP/WS for CLI compatibility. For HTTPS/WSS, terminate TLS at a reverse proxy.
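Rather than a fixed `sleep 30`, you can poll the health endpoint until it responds. A minimal sketch (the `wait_for_health` helper is ours for illustration, not part of the repo):

```sh
#!/usr/bin/env sh
# wait_for_health <url> [tries] -- hypothetical helper, not shipped with the project.
# Polls the given health endpoint once per second, giving up after `tries` attempts.
wait_for_health() {
  url="$1"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS "$url" >/dev/null 2>&1; then
      echo "ready"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out" >&2
  return 1
}
```

After `make dev-up`, `wait_for_health http://localhost:8080/health` returns as soon as the API answers instead of always waiting the full 30 seconds.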
Access Services:

- API Server (via Caddy): http://localhost:8080
- API Server (via Caddy + internal TLS): https://localhost:8443
- Grafana: http://localhost:3000 (admin/admin123)
- Prometheus: http://localhost:9090
- Loki: http://localhost:3100
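Since Caddy already fronts the API, terminating TLS as the note above suggests amounts to a short Caddy site block. A sketch only (the repo's actual Caddy config may differ; the upstream address is assumed from the service list):

```
# Sketch: serve HTTPS on 8443 with Caddy's internal CA, proxying to the API.
localhost:8443 {
    tls internal
    reverse_proxy localhost:8080
}
```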
## Development Setup

### Build Components

```bash
# Build all components
make build

# Development build
make dev
```

### Start Services

```bash
# Start development stack with monitoring
make dev-up

# Check status
make dev-status

# Stop services
make dev-down
```
### Verify Setup

```bash
# Check API health
curl -f http://localhost:8080/health

# Check monitoring services
curl -f http://localhost:3000/api/health
curl -f "http://localhost:9090/api/v1/query?query=up"
curl -f http://localhost:3100/ready

# Check Redis
docker exec ml-experiments-redis redis-cli ping
```
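The individual checks above can be wrapped in a small pass/fail report. A sketch under our own naming (the `check` helper is hypothetical; the endpoints are the ones listed above):

```sh
#!/usr/bin/env sh
# check <name> <url> -- hypothetical helper: print OK or FAIL for one endpoint.
check() {
  name="$1"
  url="$2"
  if curl -fsS "$url" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
  fi
}

check api        "http://localhost:8080/health"
check grafana    "http://localhost:3000/api/health"
check prometheus "http://localhost:9090/api/v1/query?query=up"
check loki       "http://localhost:3100/ready"
```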
## First Experiment

### 1. Set Up the CLI

```bash
# Build the CLI
cd cli && zig build --release=fast

# Initialize the CLI config
./cli/zig-out/bin/ml init
```

### 2. Queue a Job

```bash
# Simple test job
echo "test experiment" | ./cli/zig-out/bin/ml queue test-job

# Check status
./cli/zig-out/bin/ml status
```

### 3. Monitor Progress

```bash
# View in Grafana
open http://localhost:3000

# Check logs in the Grafana Log Analysis dashboard,
# or view container logs directly:
docker logs -f ml-experiments-api
```
## Key Commands

### Development Commands

```bash
make help             # Show all commands
make build            # Build all components
make dev-up           # Start dev environment
make dev-down         # Stop dev environment
make dev-status       # Check dev status
make test             # Run tests
make test-unit        # Run unit tests
make test-integration # Run integration tests
```

### CLI Commands

```bash
# Build the CLI
cd cli && zig build --release=fast

# Common operations
./cli/zig-out/bin/ml status         # Check system status
./cli/zig-out/bin/ml queue job-name # Queue a job
./cli/zig-out/bin/ml --help         # Show help
```

### Monitoring Commands

```bash
# Open monitoring services
open http://localhost:3000 # Grafana
open http://localhost:9090 # Prometheus
open http://localhost:3100 # Loki

# (Optional) Regenerate Grafana provisioning (datasources/providers)
python3 scripts/setup_monitoring.py
```
## Configuration

### Environment Setup

```bash
# Copy the example environment
cp deployments/env.dev.example .env

# Edit as needed
vim .env
```

Key Variables:

```
LOG_LEVEL=info
GRAFANA_ADMIN_PASSWORD=admin123
```

### CLI Configuration

```bash
# Set up the CLI config directory
mkdir -p ~/.ml

# Create the config file if needed
touch ~/.ml/config.toml

# Edit configuration
vim ~/.ml/config.toml
```
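The config file's schema is not shown here; see the CLI Reference for the actual keys. Purely to illustrate the TOML shape, a hypothetical `~/.ml/config.toml` might look like:

```toml
# Hypothetical example -- key names are illustrative, not the CLI's real schema.
[server]
url = "http://localhost:8080"

[logging]
level = "info"
```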
## Testing

### Quick Test

```bash
# 5-minute authentication test
make test-auth

# Clean up
make self-cleanup
```

### Full Test Suite

```bash
# Run all tests
make test

# Run with coverage
make test-coverage

# Run specific test types
make test-unit
make test-integration
make test-e2e
```

### Load Testing

```bash
# Run load tests
make load-test

# Run benchmarks
make benchmark

# Track performance
./scripts/track_performance.sh
```
## Troubleshooting

### Common Issues

Port Conflicts:

```bash
# Check port usage
lsof -i :8080
lsof -i :8443
lsof -i :3000
lsof -i :9090

# Kill conflicting processes
kill -9 <PID>
```

Build Issues:

```bash
# Fix Go modules
go mod tidy

# Fix the Zig build
cd cli && rm -rf zig-out zig-cache && zig build --release=fast
```

Container Issues:

```bash
# Check container status
docker ps --filter "name=ml-"

# View logs
docker logs ml-experiments-api
docker logs ml-experiments-grafana

# Restart services
make dev-down && make dev-up
```

Monitoring Issues:

```bash
# Re-run the monitoring setup
python3 scripts/setup_monitoring.py

# Restart Grafana
docker restart ml-experiments-grafana

# Check datasources in Grafana:
# Settings → Data Sources → Test connection
```
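The provisioning that `setup_monitoring.py` regenerates follows Grafana's standard datasource provisioning format, which is useful to know when a datasource test fails. A minimal sketch (service hostnames and the file's exact contents are assumptions about the compose stack, not copied from the repo):

```yaml
# Grafana datasource provisioning (sketch; URLs assume the dev compose network).
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```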
### Debug Mode

```bash
# Enable debug logging
export LOG_LEVEL=debug
make dev-up
```
## Next Steps

### Explore Features

- Job Management: queue and monitor ML experiments
- WebSocket Communication: real-time updates
- Multi-User Authentication: role-based access control
- Performance Monitoring: Grafana dashboards and metrics
- Log Aggregation: centralized logging with Loki

### Advanced Configuration

- Production Setup: see the Deployment Guide
- Performance Monitoring: see Performance Monitoring
- Testing Procedures: see the Testing Guide
- CLI Reference: see the CLI Reference

### Production Deployment

For production deployment:

- Review the Deployment Guide
- Set up production monitoring
- Configure security and authentication
- Set up backup procedures
## Help and Support

### Get Help

```bash
make help                   # Show all available commands
./cli/zig-out/bin/ml --help # CLI help
```

### Documentation

- Testing Guide - comprehensive testing procedures
- Deployment Guide - production deployment
- Performance Monitoring - monitoring setup
- Architecture Guide - system architecture
- Troubleshooting - common issues

### Community

- Check logs: `docker logs ml-experiments-api`
- Review documentation in `docs/src/`
- Use the `--debug` flag with CLI commands for detailed output

Ready in minutes!
## See Also

- Architecture - system architecture overview
- Scheduler Architecture - job scheduling and service management
- Jupyter Workflow - Jupyter notebook services
- vLLM Workflow - LLM inference services
- Configuration Reference - configuration options
- Security Guide - security best practices