# FetchML - Machine Learning Platform

A production-ready ML experiment platform with task queuing, monitoring, and a modern CLI/API.

## Features

- **🚀 Production Resilience** - Task leasing, smart retries, dead-letter queues
- **📊 Monitoring** - Grafana/Prometheus/Loki with auto-provisioned dashboards
- **🔐 Security** - API key auth, TLS, rate limiting, IP whitelisting
- **⚡ Performance** - Go API server + Zig CLI for speed
- **📦 Easy Deployment** - Docker Compose (dev) or systemd (prod)

## Quick Start

### Development (macOS/Linux)

```bash
# Clone and start
git clone <your-repo>
cd fetch_ml
docker-compose up -d

# Access Grafana: http://localhost:3000 (admin/admin)
```

### Production (Linux)

```bash
# Setup application
sudo ./scripts/setup-prod.sh

# Setup monitoring
sudo ./scripts/setup-monitoring-prod.sh

# Build and install
make prod
make install

# Start services
sudo systemctl start fetchml-api fetchml-worker
sudo systemctl start prometheus grafana loki promtail
```

## Architecture

```
┌──────────────┐   WebSocket   ┌──────────────┐
│  Zig CLI/TUI │◄─────────────►│  API Server  │
└──────────────┘               │    (Go)      │
                               └──────┬───────┘
                                      │
                        ┌─────────────┼─────────────┐
                        │             │             │
                   ┌────▼────┐   ┌───▼────┐   ┌───▼────┐
                   │  Redis  │   │ Worker │   │  Loki  │
                   │ (Queue) │   │  (Go)  │   │ (Logs) │
                   └─────────┘   └────────┘   └────────┘
```

## Usage

### API Server

```bash
# Development (stderr logging)
go run cmd/api-server/main.go --config configs/config-dev.yaml

# Production (file logging)
go run cmd/api-server/main.go --config configs/config-no-tls.yaml
```

### CLI

```bash
# Build
cd cli && zig build prod

# Run experiment
./cli/zig-out/bin/ml run --config config.toml

# Check status
./cli/zig-out/bin/ml status
```

### Docker

```bash
make docker-run      # Start all services
make docker-logs     # View logs
make docker-stop     # Stop services
```

## Development

### Prerequisites

- Go 1.21+
- Zig 0.11+
- Redis
- Docker (for local dev)

### Build

```bash
make build           # All components
make dev             # Fast dev build
make prod            # Optimized production build
```

### Test

```bash
make test            # All tests
make test-unit       # Unit tests only
make test-coverage   # With coverage report
```

## Configuration

### Development (`configs/config-dev.yaml`)

```yaml
logging:
  level: "info"
  file: ""  # stderr only

redis:
  url: "redis://localhost:6379"
```

### Production (`configs/config-no-tls.yaml`)

```yaml
logging:
  level: "info"
  file: "./logs/fetch_ml.log"  # file only

redis:
  url: "redis://redis:6379"
```

## Monitoring

### Grafana Dashboards (Auto-Provisioned)

- **ML Task Queue** - Queue depth, task duration, failure rates
- **Application Logs** - Log streams, error tracking, search

Access: `http://localhost:3000` (dev) or `http://YOUR_SERVER:3000` (prod)

### Metrics

- Queue depth and task processing rates
- Retry attempts by error category
- Dead letter queue size
- Lease expirations

## Documentation

- **[Getting Started](docs/getting-started.md)** - Detailed setup guide
- **[Production Deployment](docs/production-monitoring.md)** - Linux deployment
- **[WebSocket API](docs/api/)** - Protocol documentation
- **[Architecture](docs/architecture/)** - System design

## Makefile Targets

```bash
# Build
make build               # Build all components
make prod                # Production build
make clean               # Clean artifacts

# Docker
make docker-build        # Build image
make docker-run          # Start services
make docker-stop         # Stop services

# Test
make test                # All tests
make test-coverage       # With coverage

# Production (Linux only)
make setup               # Setup app
make setup-monitoring    # Setup monitoring
make install             # Install binaries
```

## Security

- **TLS/HTTPS** - Encryption in transit between clients and the API server
- **API Keys** - Stored as SHA-256 hashes
- **Rate Limiting** - Per-user quotas
- **IP Whitelist** - Network restrictions
- **Audit Logging** - All API access logged

## License

MIT - See [LICENSE](LICENSE)

## Contributing

Contributions are welcome! This is a personal homelab project, but PRs are appreciated.