chore(docs): remove legacy Jekyll docs/_pages after Hugo migration
This commit is contained in:
parent 3d58387207
commit f6e506a632

7 changed files with 0 additions and 2486 deletions

@@ -1,738 +0,0 @@
---
layout: page
title: "Homelab Architecture"
permalink: /architecture/
nav_order: 1
---

# Homelab Architecture

Simple, secure architecture for ML experiments in your homelab.

## Components Overview

```mermaid
graph TB
    subgraph "Homelab Stack"
        CLI[Zig CLI]
        API[HTTPS API]
        REDIS[Redis Cache]
        FS[Local Storage]
    end

    CLI --> API
    API --> REDIS
    API --> FS
```

## Core Services

### API Server
- **Purpose**: Secure HTTPS API for ML experiments
- **Port**: 9101 (HTTPS only)
- **Auth**: API key authentication
- **Security**: Rate limiting, IP whitelisting

### Redis
- **Purpose**: Caching and job queuing
- **Port**: 6379 (localhost only)
- **Storage**: Temporary data only
- **Persistence**: Local volume

### Zig CLI
- **Purpose**: High-performance experiment management
- **Language**: Zig for maximum speed and efficiency
- **Features**:
  - Content-addressed storage with deduplication
  - SHA256-based commit ID generation
  - WebSocket communication for real-time updates
  - Rsync-based incremental file transfers
  - Multi-threaded operations
  - Secure API key authentication
  - Auto-sync monitoring with file system watching
  - Priority-based job queuing
  - Memory-efficient operations with arena allocators
## Security Architecture

```mermaid
graph LR
    USER[User] --> AUTH[API Key Auth]
    AUTH --> RATE[Rate Limiting]
    RATE --> WHITELIST[IP Whitelist]
    WHITELIST --> API[Secure API]
    API --> AUDIT[Audit Logging]
```

### Security Layers
1. **API Key Authentication** - Hashed keys with roles
2. **Rate Limiting** - 30 requests/minute
3. **IP Whitelisting** - Local networks only
4. **Fail2Ban** - Automatic IP blocking
5. **HTTPS/TLS** - Encrypted communication
6. **Audit Logging** - Complete action tracking

## Data Flow

```mermaid
sequenceDiagram
    participant CLI
    participant API
    participant Redis
    participant Storage

    CLI->>API: HTTPS Request
    API->>API: Validate Auth
    API->>Redis: Cache/Queue
    API->>Storage: Experiment Data
    Storage->>API: Results
    API->>CLI: Response
```
## Deployment Options

### Docker Compose (Recommended)
```yaml
services:
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
    volumes: [redis_data:/data]

  api-server:
    build: .
    ports: ["9101:9101"]
    depends_on: [redis]
```

### Local Setup
```bash
./setup.sh && ./manage.sh start
```
## Network Architecture

- **Private Network**: Docker internal network
- **Localhost Access**: Redis only on localhost
- **HTTPS API**: Port 9101, TLS encrypted
- **No External Dependencies**: Everything runs locally

## Storage Architecture

```
data/
├── experiments/   # ML experiment results
├── cache/         # Temporary cache files
└── backups/       # Local backups

logs/
├── app.log        # Application logs
├── audit.log      # Security events
└── access.log     # API access logs
```
## Monitoring Architecture

Simple, lightweight monitoring:
- **Health Checks**: Service availability
- **Log Files**: Structured logging
- **Basic Metrics**: Request counts, error rates
- **Security Events**: Failed auth, rate limits

## Homelab Benefits

- ✅ **Simple Setup**: One-command installation
- ✅ **Local Only**: No external dependencies
- ✅ **Secure by Default**: HTTPS, auth, rate limiting
- ✅ **Low Resource**: Minimal CPU/memory usage
- ✅ **Easy Backup**: Local file system
- ✅ **Privacy**: Everything stays on your network
## High-Level Architecture

```mermaid
graph TB
    subgraph "Client Layer"
        CLI[CLI Tools]
        TUI[Terminal UI]
        API[REST API]
    end

    subgraph "Authentication Layer"
        Auth[Authentication Service]
        RBAC[Role-Based Access Control]
        Perm[Permission Manager]
    end

    subgraph "Core Services"
        Worker[ML Worker Service]
        DataMgr[Data Manager Service]
        Queue[Job Queue]
    end

    subgraph "Storage Layer"
        Redis[(Redis Cache)]
        DB[(SQLite/PostgreSQL)]
        Files[File Storage]
    end

    subgraph "Container Runtime"
        Podman[Podman/Docker]
        Containers[ML Containers]
    end

    CLI --> Auth
    TUI --> Auth
    API --> Auth

    Auth --> RBAC
    RBAC --> Perm

    Worker --> Queue
    Worker --> DataMgr
    Worker --> Podman

    DataMgr --> DB
    DataMgr --> Files

    Queue --> Redis

    Podman --> Containers
```
## Zig CLI Architecture

### Component Structure

```mermaid
graph TB
    subgraph "Zig CLI Components"
        Main[main.zig] --> Commands[commands/]
        Commands --> Config[config.zig]
        Commands --> Utils[utils/]
        Commands --> Net[net/]
        Commands --> Errors[errors.zig]

        subgraph "Commands"
            Init[init.zig]
            Sync[sync.zig]
            Queue[queue.zig]
            Watch[watch.zig]
            Status[status.zig]
            Monitor[monitor.zig]
            Cancel[cancel.zig]
            Prune[prune.zig]
        end

        subgraph "Utils"
            Crypto[crypto.zig]
            Storage[storage.zig]
            Rsync[rsync.zig]
        end

        subgraph "Network"
            WS[ws.zig]
        end
    end
```
### Performance Optimizations

#### Content-Addressed Storage
- **Deduplication**: Files stored by SHA256 hash
- **Space Efficiency**: Shared files across experiments
- **Fast Lookup**: Hash-based file retrieval
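The idea behind content-addressed storage is small enough to sketch in Go (the real CLI implements this in Zig; names here are illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentAddress returns the storage key for a file's bytes: its SHA-256
// hex digest. Identical content always maps to the same key, so storing
// by address deduplicates automatically.
func contentAddress(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// store is a toy in-memory content-addressed store.
type store map[string][]byte

// put saves data under its content address and reports whether it was new.
func (s store) put(data []byte) (key string, isNew bool) {
	key = contentAddress(data)
	if _, ok := s[key]; ok {
		return key, false // duplicate content: nothing stored twice
	}
	s[key] = data
	return key, true
}

func main() {
	s := store{}
	k1, new1 := s.put([]byte("model weights"))
	k2, new2 := s.put([]byte("model weights")) // same content, same key
	fmt.Println(k1 == k2, new1, new2)          // true true false
}
```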

#### Memory Management
- **Arena Allocators**: Efficient bulk allocation
- **Zero-Copy Operations**: Minimized memory copying
- **Automatic Cleanup**: Resource deallocation

#### Network Communication
- **WebSocket Protocol**: Real-time bidirectional communication
- **Connection Pooling**: Reused connections
- **Binary Messaging**: Efficient data transfer

### Security Implementation

```mermaid
graph LR
    subgraph "CLI Security"
        Config[Config File] --> Hash[SHA256 Hashing]
        Hash --> Auth[API Authentication]
        Auth --> SSH[SSH Transfer]
        SSH --> WS[WebSocket Security]
    end
```
## Core Components

### 1. Authentication & Authorization

```mermaid
graph LR
    subgraph "Auth Flow"
        Client[Client] --> APIKey[API Key]
        APIKey --> Hash[Hash Validation]
        Hash --> Roles[Role Resolution]
        Roles --> Perms[Permission Check]
        Perms --> Access[Grant/Deny Access]
    end

    subgraph "Permission Sources"
        YAML[YAML Config]
        Inline[Inline Fallback]
        Roles --> YAML
        Roles --> Inline
    end
```

**Features:**
- API key-based authentication
- Role-based access control (RBAC)
- YAML-based permission configuration
- Fallback to inline permissions
- Admin wildcard permissions
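A minimal Go sketch of the resolution order (YAML config first, inline fallback, `*` admin wildcard); role and permission names are assumptions, not the project's actual set:

```go
package main

import "fmt"

// rolePerms maps a role to its allowed actions; "*" is the admin wildcard.
type rolePerms map[string][]string

// resolve returns a role's permissions, preferring the YAML-derived config
// and falling back to an inline default when the role is absent.
func resolve(role string, fromYAML, inline rolePerms) []string {
	if perms, ok := fromYAML[role]; ok {
		return perms
	}
	return inline[role]
}

// allowed checks whether perms grant an action, honoring the "*" wildcard.
func allowed(perms []string, action string) bool {
	for _, p := range perms {
		if p == "*" || p == action {
			return true
		}
	}
	return false
}

func main() {
	yamlCfg := rolePerms{"admin": {"*"}}
	inline := rolePerms{"user": {"jobs:submit", "jobs:read"}}
	fmt.Println(allowed(resolve("admin", yamlCfg, inline), "jobs:cancel")) // true
	fmt.Println(allowed(resolve("user", yamlCfg, inline), "jobs:cancel"))  // false
}
```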

### 2. Worker Service

```mermaid
graph TB
    subgraph "Worker Architecture"
        API[HTTP API] --> Router[Request Router]
        Router --> Auth[Auth Middleware]
        Auth --> Queue[Job Queue]
        Queue --> Processor[Job Processor]
        Processor --> Runtime[Container Runtime]
        Runtime --> Storage[Result Storage]

        subgraph "Job Lifecycle"
            Submit[Submit Job] --> Queue
            Queue --> Execute[Execute]
            Execute --> Monitor[Monitor]
            Monitor --> Complete[Complete]
            Complete --> Store[Store Results]
        end
    end
```

**Responsibilities:**
- HTTP API for job submission
- Job queue management
- Container orchestration
- Result collection and storage
- Metrics and monitoring

### 3. Data Manager Service

```mermaid
graph TB
    subgraph "Data Management"
        API[Data API] --> Storage[Storage Layer]
        Storage --> Metadata[Metadata DB]
        Storage --> Files[File System]
        Storage --> Cache[Redis Cache]

        subgraph "Data Operations"
            Upload[Upload Data] --> Validate[Validate]
            Validate --> Store[Store]
            Store --> Index[Index]
            Index --> Catalog[Catalog]
        end
    end
```

**Features:**
- Data upload and validation
- Metadata management
- File system abstraction
- Caching layer
- Data catalog

### 4. Terminal UI (TUI)

```mermaid
graph TB
    subgraph "TUI Architecture"
        UI[UI Components] --> Model[Data Model]
        Model --> Update[Update Loop]
        Update --> Render[Render]

        subgraph "UI Panels"
            Jobs[Job List]
            Details[Job Details]
            Logs[Log Viewer]
            Status[Status Bar]
        end

        UI --> Jobs
        UI --> Details
        UI --> Logs
        UI --> Status
    end
```

**Components:**
- Bubble Tea framework
- Component-based architecture
- Real-time updates
- Keyboard navigation
- Theme support
## Data Flow

### Job Execution Flow

```mermaid
sequenceDiagram
    participant Client
    participant Auth
    participant Worker
    participant Queue
    participant Container
    participant Storage

    Client->>Auth: Submit job with API key
    Auth->>Client: Validate and return job ID

    Client->>Worker: Execute job request
    Worker->>Queue: Queue job
    Queue->>Worker: Job ready
    Worker->>Container: Start ML container
    Container->>Worker: Execute experiment
    Worker->>Storage: Store results
    Worker->>Client: Return results
```

### Authentication Flow

```mermaid
sequenceDiagram
    participant Client
    participant Auth
    participant PermMgr
    participant Config

    Client->>Auth: Request with API key
    Auth->>Auth: Validate key hash
    Auth->>PermMgr: Get user permissions
    PermMgr->>Config: Load YAML permissions
    Config->>PermMgr: Return permissions
    PermMgr->>Auth: Return resolved permissions
    Auth->>Client: Grant/deny access
```
## Security Architecture

### Defense in Depth

```mermaid
graph TB
    subgraph "Security Layers"
        Network[Network Security]
        Auth[Authentication]
        AuthZ[Authorization]
        Container[Container Security]
        Data[Data Protection]
        Audit[Audit Logging]
    end

    Network --> Auth
    Auth --> AuthZ
    AuthZ --> Container
    Container --> Data
    Data --> Audit
```

**Security Features:**
- API key authentication
- Role-based permissions
- Container isolation
- File system sandboxing
- Comprehensive audit logs
- Input validation and sanitization

### Container Security

```mermaid
graph TB
    subgraph "Container Isolation"
        Host[Host System]
        Podman[Podman Runtime]
        Network[Network Isolation]
        FS[File System Isolation]
        User[User Namespaces]
        ML[ML Container]

        Host --> Podman
        Podman --> Network
        Podman --> FS
        Podman --> User
        User --> ML
    end
```

**Isolation Features:**
- Rootless containers
- Network isolation
- File system sandboxing
- User namespace mapping
- Resource limits
## Configuration Architecture

### Configuration Hierarchy

```mermaid
graph TB
    subgraph "Config Sources"
        Env[Environment Variables]
        File[Config Files]
        CLI[CLI Flags]
        Defaults[Default Values]
    end

    subgraph "Config Processing"
        Merge[Config Merger]
        Validate[Schema Validator]
        Apply[Config Applier]
    end

    Env --> Merge
    File --> Merge
    CLI --> Merge
    Defaults --> Merge

    Merge --> Validate
    Validate --> Apply
```

**Configuration Priority:**
1. CLI flags (highest)
2. Environment variables
3. Configuration files
4. Default values (lowest)
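The precedence chain reduces to "first non-empty value wins", sketched here in Go (illustrative; the real merger also validates against a schema):

```go
package main

import "fmt"

// firstNonEmpty applies the documented precedence: CLI flag, then
// environment variable, then config file, then default.
func firstNonEmpty(values ...string) string {
	for _, v := range values {
		if v != "" {
			return v
		}
	}
	return ""
}

func main() {
	cliFlag := ""          // not set on the command line
	envVar := "redis:6379" // set in the environment
	fileVal := "localhost:6379"
	def := "127.0.0.1:6379"
	fmt.Println(firstNonEmpty(cliFlag, envVar, fileVal, def)) // redis:6379
}
```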

## Scalability Architecture

### Horizontal Scaling

```mermaid
graph TB
    subgraph "Scaled Architecture"
        LB[Load Balancer]
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker N]
        Redis[Redis Cluster]
        Storage[Shared Storage]

        LB --> W1
        LB --> W2
        LB --> W3

        W1 --> Redis
        W2 --> Redis
        W3 --> Redis

        W1 --> Storage
        W2 --> Storage
        W3 --> Storage
    end
```

**Scaling Features:**
- Stateless worker services
- Shared job queue (Redis)
- Distributed storage
- Load balancer ready
- Health checks and monitoring
## Technology Stack

### Backend Technologies

| Component | Technology | Purpose |
|-----------|------------|---------|
| **Language** | Go 1.25+ | Core application |
| **Web Framework** | Standard library | HTTP server |
| **Authentication** | Custom | API key + RBAC |
| **Database** | SQLite/PostgreSQL | Metadata storage |
| **Cache** | Redis | Job queue & caching |
| **Containers** | Podman/Docker | Job isolation |
| **UI Framework** | Bubble Tea | Terminal UI |

### Dependencies

```go
// Core dependencies
require (
	github.com/charmbracelet/bubbletea v1.3.10 // TUI framework
	github.com/go-redis/redis/v8 v8.11.5       // Redis client
	github.com/google/uuid v1.6.0              // UUID generation
	github.com/mattn/go-sqlite3 v1.14.32       // SQLite driver
	golang.org/x/crypto v0.45.0                // Crypto utilities
	gopkg.in/yaml.v3 v3.0.1                    // YAML parsing
)
```
## Development Architecture

### Project Structure

```
fetch_ml/
├── cmd/                  # CLI applications
│   ├── worker/           # ML worker service
│   ├── tui/              # Terminal UI
│   ├── data_manager/     # Data management
│   └── user_manager/     # User management
├── internal/             # Internal packages
│   ├── auth/             # Authentication system
│   ├── config/           # Configuration management
│   ├── container/        # Container operations
│   ├── database/         # Database operations
│   ├── logging/          # Logging utilities
│   ├── metrics/          # Metrics collection
│   └── network/          # Network utilities
├── configs/              # Configuration files
├── scripts/              # Setup and utility scripts
├── tests/                # Test suites
└── docs/                 # Documentation
```

### Package Dependencies

```mermaid
graph TB
    subgraph "Application Layer"
        Worker[cmd/worker]
        TUI[cmd/tui]
        DataMgr[cmd/data_manager]
        UserMgr[cmd/user_manager]
    end

    subgraph "Service Layer"
        Auth[internal/auth]
        Config[internal/config]
        Container[internal/container]
        Database[internal/database]
    end

    subgraph "Utility Layer"
        Logging[internal/logging]
        Metrics[internal/metrics]
        Network[internal/network]
    end

    Worker --> Auth
    Worker --> Config
    Worker --> Container
    TUI --> Auth
    DataMgr --> Database
    UserMgr --> Auth

    Auth --> Logging
    Container --> Network
    Database --> Metrics
```
## Monitoring & Observability

### Metrics Collection

```mermaid
graph TB
    subgraph "Metrics Pipeline"
        App[Application] --> Metrics[Metrics Collector]
        Metrics --> Export[Prometheus Exporter]
        Export --> Prometheus[Prometheus Server]
        Prometheus --> Grafana[Grafana Dashboard]

        subgraph "Metric Types"
            Counter[Counters]
            Gauge[Gauges]
            Histogram[Histograms]
            Timer[Timers]
        end

        App --> Counter
        App --> Gauge
        App --> Histogram
        App --> Timer
    end
```

### Logging Architecture

```mermaid
graph TB
    subgraph "Logging Pipeline"
        App[Application] --> Logger[Structured Logger]
        Logger --> File[File Output]
        Logger --> Console[Console Output]
        Logger --> Syslog[Syslog Forwarder]
        Syslog --> Aggregator[Log Aggregator]
        Aggregator --> Storage[Log Storage]
        Storage --> Viewer[Log Viewer]
    end
```
## Deployment Architecture

### Container Deployment

```mermaid
graph TB
    subgraph "Deployment Stack"
        Image[Container Image]
        Registry[Container Registry]
        Orchestrator[Docker Compose]
        Config[ConfigMaps/Secrets]
        Storage[Persistent Storage]

        Image --> Registry
        Registry --> Orchestrator
        Config --> Orchestrator
        Storage --> Orchestrator
    end
```

### Service Discovery

```mermaid
graph TB
    subgraph "Service Mesh"
        Gateway[API Gateway]
        Discovery[Service Discovery]
        Worker[Worker Service]
        Data[Data Service]
        Redis[Redis Cluster]

        Gateway --> Discovery
        Discovery --> Worker
        Discovery --> Data
        Discovery --> Redis
    end
```
## Future Architecture Considerations

### Microservices Evolution

- **API Gateway**: Centralized routing and authentication
- **Service Mesh**: Inter-service communication
- **Event Streaming**: Kafka for job events
- **Distributed Tracing**: OpenTelemetry integration
- **Multi-tenant**: Tenant isolation and quotas

### Homelab Features

- **Docker Compose**: Simple container orchestration
- **Local Development**: Easy setup and testing
- **Security**: Built-in authentication and encryption
- **Monitoring**: Basic health checks and logging

---

This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.

@@ -1,165 +0,0 @@
---
layout: page
title: "CI/CD Pipeline"
permalink: /cicd/
nav_order: 5
---

# CI/CD Pipeline

Automated testing, building, and releasing for fetch_ml.

## Workflows

### CI Workflow (`.github/workflows/ci.yml`)

Runs on every push to `main`/`develop` and all pull requests.

**Jobs:**
1. **test** - Go backend tests with Redis
2. **build** - Build all binaries (Go + Zig CLI)
3. **test-scripts** - Validate deployment scripts
4. **security-scan** - Trivy and Gosec security scans
5. **docker-build** - Build and push Docker images (main branch only)

**Test Coverage:**
- Go unit tests with race detection
- `internal/queue` package tests
- Zig CLI tests
- Integration tests
- Security audits

### Release Workflow (`.github/workflows/release.yml`)

Runs on version tags (e.g., `v1.0.0`).

**Jobs:**

1. **build-cli** (matrix build)
   - Linux x86_64 (static musl)
   - macOS x86_64
   - macOS ARM64
   - Downloads platform-specific static rsync
   - Embeds rsync for zero-dependency releases

2. **build-go-backends**
   - Cross-platform Go builds
   - api-server, worker, tui, data_manager, user_manager

3. **create-release**
   - Collects all artifacts
   - Generates SHA256 checksums
   - Creates GitHub release with notes
## Release Process

### Creating a Release

```bash
# 1. Update version
git tag v1.0.0

# 2. Push tag
git push origin v1.0.0

# 3. CI automatically builds and releases
```

### Release Artifacts

**CLI Binaries (with embedded rsync):**
- `ml-linux-x86_64.tar.gz` (~450-650KB)
- `ml-macos-x86_64.tar.gz` (~450-650KB)
- `ml-macos-arm64.tar.gz` (~450-650KB)

**Go Backends:**
- `fetch_ml_api-server.tar.gz`
- `fetch_ml_worker.tar.gz`
- `fetch_ml_tui.tar.gz`
- `fetch_ml_data_manager.tar.gz`
- `fetch_ml_user_manager.tar.gz`

**Checksums:**
- `checksums.txt` - Combined SHA256 sums
- Individual `.sha256` files per binary
## Development Workflow

### Local Testing

```bash
# Run all tests
make test

# Run specific package tests
go test ./internal/queue/...

# Build CLI
cd cli && zig build dev

# Run formatters and linters
make lint

# Security scans are handled automatically in CI by the `security-scan` job
```

#### Optional heavy end-to-end tests

Some e2e tests exercise full Docker deployments and performance scenarios and are **skipped by default** to keep local/CI runs fast. You can enable them explicitly with environment variables:

```bash
# Run Docker deployment e2e tests
FETCH_ML_E2E_DOCKER=1 go test ./tests/e2e/...

# Run performance-oriented e2e tests
FETCH_ML_E2E_PERF=1 go test ./tests/e2e/...
```

Without these variables, `TestDockerDeploymentE2E` and `TestPerformanceE2E` will `t.Skip`, while all lighter e2e tests still run.

### Pull Request Checks

All PRs must pass:
- ✅ Go tests (with Redis)
- ✅ CLI tests
- ✅ Security scans
- ✅ Code linting
- ✅ Build verification

## Configuration

### Environment Variables

```yaml
GO_VERSION: '1.25.0'
ZIG_VERSION: '0.15.2'
```

### Secrets

Required for releases:
- `GITHUB_TOKEN` - Automatic, provided by GitHub Actions

## Monitoring

### Build Status

Check workflow runs at:
```
https://github.com/jfraeys/fetch_ml/actions
```

### Artifacts

Download build artifacts from:
- Successful workflow runs (30-day retention)
- GitHub Releases (permanent)

---

For implementation details:
- [.github/workflows/ci.yml](https://github.com/jfraeys/fetch_ml/blob/main/.github/workflows/ci.yml)
- [.github/workflows/release.yml](https://github.com/jfraeys/fetch_ml/blob/main/.github/workflows/release.yml)

@@ -1,404 +0,0 @@
---
layout: page
title: "CLI Reference"
permalink: /cli-reference/
nav_order: 2
---

# Fetch ML CLI Reference

Comprehensive command-line tools for managing ML experiments in your homelab, built around a high-performance Zig CLI.

## Overview

Fetch ML provides a CLI toolkit built with performance and security in mind:

- **Zig CLI** - High-performance experiment management written in Zig
- **Go Commands** - API server, TUI, and data management utilities
- **Management Scripts** - Service orchestration and deployment
- **Setup Scripts** - One-command installation and configuration

## Zig CLI (`./cli/zig-out/bin/ml`)

High-performance command-line interface for experiment management, written in Zig for speed and efficiency.

### Available Commands

| Command | Description | Example |
|---------|-------------|---------|
| `init` | Interactive configuration setup | `ml init` |
| `sync` | Sync project to worker with deduplication | `ml sync ./project --name myjob --queue` |
| `queue` | Queue job for execution | `ml queue myjob --commit abc123 --priority 8` |
| `status` | Get system and worker status | `ml status` |
| `monitor` | Launch TUI monitoring via SSH | `ml monitor` |
| `cancel` | Cancel running job | `ml cancel job123` |
| `prune` | Clean up old experiments | `ml prune --keep 10` |
| `watch` | Auto-sync directory on changes | `ml watch ./project --queue` |
### Command Details

#### `init` - Configuration Setup
```bash
ml init
```
Creates a configuration template at `~/.ml/config.toml` with:
- Worker connection details
- API authentication
- Base paths and ports

#### `sync` - Project Synchronization
```bash
# Basic sync
ml sync ./my-project

# Sync with custom name and queue
ml sync ./my-project --name "experiment-1" --queue

# Sync with priority
ml sync ./my-project --priority 9
```

**Features:**
- Content-addressed storage for deduplication
- SHA256 commit ID generation
- Rsync-based file transfer
- Automatic queuing (with `--queue` flag)

#### `queue` - Job Management
```bash
# Queue with commit ID
ml queue my-job --commit abc123def456

# Queue with priority (1-10, default 5)
ml queue my-job --commit abc123 --priority 8
```

**Features:**
- WebSocket-based communication
- Priority queuing system
- API key authentication

#### `watch` - Auto-Sync Monitoring
```bash
# Watch directory for changes
ml watch ./project

# Watch and auto-queue on changes
ml watch ./project --name "dev-exp" --queue
```

**Features:**
- Real-time file system monitoring
- Automatic re-sync on changes
- Configurable polling interval (2 seconds)
- Commit ID comparison for efficiency

#### `prune` - Cleanup Management
```bash
# Keep last N experiments
ml prune --keep 20

# Remove experiments older than N days
ml prune --older-than 30
```

#### `monitor` - Remote Monitoring
```bash
ml monitor
```
Launches the TUI interface via SSH for real-time monitoring.

#### `cancel` - Job Cancellation
```bash
ml cancel running-job-id
```
Cancels a currently running job by ID.

### Configuration

The Zig CLI reads configuration from `~/.ml/config.toml`:

```toml
worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"
```

### Performance Features

- **Content-Addressed Storage**: Automatic deduplication of identical files
- **Incremental Sync**: Only transfers changed files
- **SHA256 Hashing**: Reliable commit ID generation
- **WebSocket Communication**: Efficient real-time messaging
- **Multi-threaded**: Concurrent operations where applicable
## Go Commands

### API Server (`./cmd/api-server/main.go`)
Main HTTPS API server for experiment management.

```bash
# Build and run
go run ./cmd/api-server/main.go

# With configuration
./bin/api-server --config configs/config-local.yaml
```

**Features:**
- HTTPS-only communication
- API key authentication
- Rate limiting and IP whitelisting
- WebSocket support for real-time updates
- Redis integration for caching

### TUI (`./cmd/tui/main.go`)
Terminal User Interface for monitoring experiments.

```bash
# Launch TUI
go run ./cmd/tui/main.go

# With custom config
./tui --config configs/config-local.yaml
```

**Features:**
- Real-time experiment monitoring
- Interactive job management
- Status visualization
- Log viewing

### Data Manager (`./cmd/data_manager/`)
Utilities for data synchronization and management.

```bash
# Sync data
./data_manager --sync ./data

# Clean old data
./data_manager --cleanup --older-than 30d
```

### Config Lint (`./cmd/configlint/main.go`)
Configuration validation and linting tool.

```bash
# Validate configuration
./configlint configs/config-local.yaml

# Check schema compliance
./configlint --schema configs/schema/config_schema.yaml
```
## Management Script (`./tools/manage.sh`)

Simple service management for your homelab.

### Commands
```bash
./tools/manage.sh start     # Start all services
./tools/manage.sh stop      # Stop all services
./tools/manage.sh status    # Check service status
./tools/manage.sh logs      # View logs
./tools/manage.sh monitor   # Basic monitoring
./tools/manage.sh security  # Security status
./tools/manage.sh cleanup   # Clean project artifacts
```

## Setup Script (`./setup.sh`)

One-command homelab setup.

### Usage
```bash
# Full setup
./setup.sh

# Setup includes:
# - SSL certificate generation
# - Configuration creation
# - Build all components
# - Start Redis
# - Setup Fail2Ban (if available)
```
## API Testing

Test the API with curl:

```bash
# Health check
curl -k -H 'X-API-Key: password' https://localhost:9101/health

# List experiments
curl -k -H 'X-API-Key: password' https://localhost:9101/experiments

# Submit experiment
curl -k -X POST -H 'X-API-Key: password' \
  -H 'Content-Type: application/json' \
  -d '{"name":"test","config":{"type":"basic"}}' \
  https://localhost:9101/experiments
```
## Zig CLI Architecture

The Zig CLI is designed for performance and reliability:

### Core Components
- **Commands** (`cli/src/commands/`): Individual command implementations
- **Config** (`cli/src/config.zig`): Configuration management
- **Network** (`cli/src/net/ws.zig`): WebSocket client implementation
- **Utils** (`cli/src/utils/`): Cryptography, storage, and rsync utilities
- **Errors** (`cli/src/errors.zig`): Centralized error handling

### Performance Optimizations
- **Content-Addressed Storage**: Deduplicates identical files across experiments
- **SHA256 Hashing**: Fast, reliable commit ID generation
- **Rsync Integration**: Efficient incremental file transfers
- **WebSocket Protocol**: Low-latency communication with worker
- **Memory Management**: Efficient allocation with Zig's allocator system

### Security Features
- **API Key Hashing**: Secure authentication token handling
- **SSH Integration**: Secure file transfers
- **Input Validation**: Comprehensive argument checking
- **Error Handling**: Secure error reporting without information leakage
## Configuration

Main configuration file: `configs/config-local.yaml`

### Key Settings

```yaml
auth:
  enabled: true
  api_keys:
    homelab_user:
      hash: "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8"
      admin: true

server:
  address: ":9101"
  tls:
    enabled: true
    cert_file: "./ssl/cert.pem"
    key_file: "./ssl/key.pem"

security:
  rate_limit:
    enabled: true
    requests_per_minute: 30
  ip_whitelist:
    - "127.0.0.1"
    - "::1"
    - "192.168.0.0/16"
    - "10.0.0.0/8"
```
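
The `hash` field stores the hex-encoded SHA-256 digest of the raw API key, not the key itself (the sample value above is in fact the digest of the string `password`). A minimal sketch of producing such a hash; the helper name `hashAPIKey` is illustrative, not the server's actual function:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashAPIKey returns the hex-encoded SHA-256 digest of a raw API key,
// the format expected in the api_keys "hash" field.
func hashAPIKey(key string) string {
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(hashAPIKey("password"))
	// 5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8
}
```

Because only the digest is stored, leaking the config file does not directly reveal the key used in the `X-API-Key` header.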

## Docker Commands

If using Docker Compose:

```bash
# Start services
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down

# Check status
docker-compose ps
```
## Troubleshooting

### Common Issues

**Zig CLI not found:**
```bash
# Build the CLI
cd cli && make build

# Check binary exists
ls -la ./cli/zig-out/bin/ml
```

**Configuration not found:**
```bash
# Create configuration
./cli/zig-out/bin/ml init

# Check config file
ls -la ~/.ml/config.toml
```

**Worker connection failed:**
```bash
# Test SSH connection
ssh -p 22 mluser@worker.local

# Check configuration
cat ~/.ml/config.toml
```

**Sync not working:**
```bash
# Check rsync availability
rsync --version

# Test manual sync
rsync -avz ./project/ mluser@worker.local:/tmp/test/
```

**WebSocket connection failed:**
```bash
# Check worker WebSocket port
telnet worker.local 9100

# Verify API key
./cli/zig-out/bin/ml status
```

**API not responding:**
```bash
./tools/manage.sh status
./tools/manage.sh logs
```

**Authentication failed:**
```bash
# Check API key in config-local.yaml
grep -A 5 "api_keys:" configs/config-local.yaml
```

**Redis connection failed:**
```bash
# Check Redis status
redis-cli ping

# Start Redis
redis-server
```

### Getting Help

```bash
# CLI help
./cli/zig-out/bin/ml help

# Management script help
./tools/manage.sh help

# Check all available commands
make help
```

---

**That's it for the CLI reference!** For complete setup instructions, see the main [README](/).
@ -1,310 +0,0 @@
---
layout: page
title: "Operations Runbook"
permalink: /operations/
nav_order: 6
---

# Operations Runbook

Operational guide for troubleshooting and maintaining the ML experiment system.
## Task Queue Operations

### Monitoring Queue Health

```redis
# Check queue depth
ZCARD task:queue

# List pending tasks
ZRANGE task:queue 0 -1 WITHSCORES

# Check dead letter queue
KEYS task:dlq:*
```

### Handling Stuck Tasks

**Symptom:** Tasks stuck in "running" status

**Diagnosis:**
```bash
# Check for expired leases
redis-cli GET task:{task-id}
# Look for a LeaseExpiry in the past
```

**Remediation:**
Tasks with expired leases are automatically reclaimed every minute. To force immediate reclamation:
```bash
# Restart worker to trigger reclaim cycle
systemctl restart ml-worker
```

### Dead Letter Queue Management

**View failed tasks:**
```redis
KEYS task:dlq:*
```

**Inspect failed task:**
```redis
GET task:dlq:{task-id}
```

**Retry from DLQ:**
```bash
# Manual retry (requires custom script)
# 1. Get task from DLQ
# 2. Reset retry count
# 3. Re-queue task
```

### Worker Crashes

**Symptom:** Worker disappeared mid-task

**What Happens:**
1. Lease expires after 30 minutes (default)
2. Background reclaim job detects expired lease
3. Task is retried (up to 3 attempts)
4. After max retries → Dead Letter Queue

**Prevention:**
- Monitor worker heartbeats
- Set up alerts for worker down
- Use a process manager (systemd, supervisor)
## Worker Operations

### Graceful Shutdown

```bash
# Send SIGTERM for graceful shutdown
kill -TERM $(pgrep ml-worker)

# Worker will:
# 1. Stop accepting new tasks
# 2. Finish active tasks (up to 5min timeout)
# 3. Release all leases
# 4. Exit cleanly
```

### Force Shutdown

```bash
# Force kill (leases will be reclaimed automatically)
kill -9 $(pgrep ml-worker)
```

### Worker Heartbeat Monitoring

```redis
# Check worker heartbeats
HGETALL worker:heartbeat

# Example output:
# worker-abc123  1701234567
# worker-def456  1701234580
```

**Alert if:** Heartbeat timestamp > 5 minutes old

## Redis Operations

### Backup

```bash
# Manual backup
redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb
```

### Restore

```bash
# Stop Redis
systemctl stop redis

# Restore snapshot
cp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb

# Start Redis
systemctl start redis
```

### Memory Management

```redis
# Check memory usage
INFO memory

# Evict old data if needed
FLUSHDB  # DANGER: Clears all data!
```
## Common Issues

### Issue: Queue Growing Unbounded

**Symptoms:**
- `ZCARD task:queue` keeps increasing
- No workers processing tasks

**Diagnosis:**
```bash
# Check worker status
systemctl status ml-worker

# Check logs
journalctl -u ml-worker -n 100
```

**Resolution:**
1. Verify workers are running
2. Check Redis connectivity
3. Verify lease configuration

### Issue: High Retry Rate

**Symptoms:**
- Many tasks in the DLQ
- `retry_count` field high on tasks

**Diagnosis:**
```bash
# Check worker logs for errors
journalctl -u ml-worker | grep "retry"

# Look for patterns (network issues, resource limits, etc.)
```

**Resolution:**
- Fix the underlying issue (network, resources, etc.)
- Adjust retry limits if failures are permanent
- Increase the task timeout if jobs are slow

### Issue: Leases Expiring Prematurely

**Symptoms:**
- Tasks retried even though the worker is healthy
- Logs show "lease expired" frequently

**Diagnosis:**
```bash
# Check worker config
grep -A3 "lease" configs/worker-config.yaml
```

```yaml
task_lease_duration: 30m   # Too short?
heartbeat_interval: 1m     # Too infrequent?
```

**Resolution:**
```yaml
# Increase lease duration for long-running jobs
task_lease_duration: 60m
heartbeat_interval: 30s   # More frequent heartbeats
```
## Performance Tuning

### Worker Concurrency

```yaml
# worker-config.yaml
max_workers: 4  # Number of parallel tasks

# Adjust based on:
# - CPU cores available
# - Memory per task
# - GPU availability
```

### Redis Configuration

```conf
# /etc/redis/redis.conf

# Persistence
save 900 1
save 300 10

# Memory
maxmemory 2gb
maxmemory-policy noeviction

# Performance
tcp-keepalive 300
timeout 0
```

## Alerting Rules

### Critical Alerts

1. **Worker Down** (no heartbeat > 5min)
2. **Queue Depth** > 1000 tasks
3. **DLQ Growth** > 100 tasks/hour
4. **Redis Down** (connection failed)

### Warning Alerts

1. **High Retry Rate** > 10% of tasks
2. **Slow Queue Drain** (depth increasing over 1 hour)
3. **Worker Memory** > 80% usage

## Health Checks

```bash
#!/bin/bash
# health-check.sh

# Check Redis
redis-cli PING || echo "Redis DOWN"

# Check worker heartbeat (treat a missing heartbeat as stale)
WORKER_ID=$(cat /var/run/ml-worker.pid)
LAST_HB=$(redis-cli HGET worker:heartbeat "$WORKER_ID")
NOW=$(date +%s)
if [ -z "$LAST_HB" ] || [ $((NOW - LAST_HB)) -gt 300 ]; then
    echo "Worker heartbeat stale"
fi

# Check queue depth
DEPTH=$(redis-cli ZCARD task:queue)
if [ "$DEPTH" -gt 1000 ]; then
    echo "Queue depth critical: $DEPTH"
fi
```
## Runbook Checklist

### Daily Operations
- [ ] Check queue depth
- [ ] Verify worker heartbeats
- [ ] Review DLQ for patterns
- [ ] Check Redis memory usage

### Weekly Operations
- [ ] Review retry rates
- [ ] Analyze failed task patterns
- [ ] Backup Redis snapshot
- [ ] Review worker logs

### Monthly Operations
- [ ] Performance tuning review
- [ ] Capacity planning
- [ ] Update documentation
- [ ] Test disaster recovery

---

**For homelab setups:**
Most of these operations can be simplified. Focus on:
- Basic monitoring (queue depth, worker status)
- Periodic Redis backups
- Graceful shutdowns for maintenance

@ -1,322 +0,0 @@
---
layout: page
title: "Task Queue Architecture"
permalink: /queue/
nav_order: 3
---

# Task Queue Architecture

The task queue system enables reliable job processing between the API server and workers using Redis.
## Overview

```mermaid
graph LR
    CLI[CLI/Client] -->|WebSocket| API[API Server]
    API -->|Enqueue| Redis[(Redis)]
    Redis -->|Dequeue| Worker[Worker]
    Worker -->|Update Status| Redis
```

## Components

### TaskQueue (`internal/queue`)

Shared package used by both the API server and the worker for job management.

#### Task Structure

```go
type Task struct {
    ID        string            // Unique task ID (UUID)
    JobName   string            // User-defined job name
    Args      string            // Job arguments
    Status    string            // queued, running, completed, failed
    Priority  int64             // Higher = executed first
    CreatedAt time.Time
    StartedAt *time.Time
    EndedAt   *time.Time
    WorkerID  string
    Error     string
    Datasets  []string
    Metadata  map[string]string // commit_id, user, etc.
}
```

#### TaskQueue Interface

```go
// Initialize queue
queue, err := queue.NewTaskQueue(queue.Config{
    RedisAddr:     "localhost:6379",
    RedisPassword: "",
    RedisDB:       0,
})

// Add task (API server)
task := &queue.Task{
    ID:       uuid.New().String(),
    JobName:  "train-model",
    Status:   "queued",
    Priority: 5,
    Metadata: map[string]string{
        "commit_id": commitID,
        "user":      username,
    },
}
err = queue.AddTask(task)

// Get next task (Worker)
task, err := queue.GetNextTask()

// Update task status
task.Status = "running"
err = queue.UpdateTask(task)
```
## Data Flow

### Job Submission Flow

```mermaid
sequenceDiagram
    participant CLI
    participant API
    participant Redis
    participant Worker

    CLI->>API: Queue Job (WebSocket)
    API->>API: Create Task (UUID)
    API->>Redis: ZADD task:queue
    API->>Redis: SET task:{id}
    API->>CLI: Success Response

    Worker->>Redis: ZPOPMAX task:queue
    Redis->>Worker: Task ID
    Worker->>Redis: GET task:{id}
    Redis->>Worker: Task Data
    Worker->>Worker: Execute Job
    Worker->>Redis: Update Status
```

### Protocol

**CLI → API** (Binary WebSocket):
```
[opcode:1][api_key_hash:64][commit_id:64][priority:1][job_name_len:1][job_name:var]
```

**API → Redis**:
- Priority queue: `ZADD task:queue {priority} {task_id}`
- Task data: `SET task:{id} {json}`
- Status: `HSET task:status:{job_name} ...`

**Worker ← Redis**:
- Poll: `ZPOPMAX task:queue 1` (highest priority first)
- Fetch: `GET task:{id}`
## Redis Data Structures

### Keys

```
task:queue               # ZSET: priority queue
task:{uuid}              # STRING: task JSON data
task:status:{job_name}   # HASH: job status
worker:heartbeat         # HASH: worker health
job:metrics:{job_name}   # HASH: job metrics
```

### Priority Queue (ZSET)

```redis
ZADD task:queue 10 "uuid-1"   # Priority 10
ZADD task:queue 5 "uuid-2"    # Priority 5
ZPOPMAX task:queue 1          # Returns uuid-1 (highest)
```
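
The ZSET semantics can be mirrored in a few lines of Go, which makes the dequeue order concrete: the member with the highest score is popped first. This in-memory analogue is for illustration only; the real queue lives in Redis:

```go
package main

import (
	"fmt"
	"sort"
)

// entry pairs a task ID with its priority score, mirroring a ZSET member.
type entry struct {
	id       string
	priority int64
}

// popMax removes and returns the highest-priority entry, the in-memory
// analogue of ZPOPMAX task:queue 1. Ties break by insertion order here;
// Redis breaks them lexicographically by member.
func popMax(q []entry) (entry, []entry) {
	sort.SliceStable(q, func(i, j int) bool { return q[i].priority > q[j].priority })
	return q[0], q[1:]
}

func main() {
	q := []entry{{"uuid-2", 5}, {"uuid-1", 10}}
	top, rest := popMax(q)
	fmt.Println(top.id, len(rest)) // uuid-1 1
}
```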

## API Server Integration

### Initialization

```go
// cmd/api-server/main.go
queueCfg := queue.Config{
    RedisAddr:     cfg.Redis.Addr,
    RedisPassword: cfg.Redis.Password,
    RedisDB:       cfg.Redis.DB,
}
taskQueue, err := queue.NewTaskQueue(queueCfg)
```

### WebSocket Handler

```go
// internal/api/ws.go (abridged)
func (h *WSHandler) handleQueueJob(conn *websocket.Conn, payload []byte) error {
    // Parse request
    apiKeyHash, commitID, priority, jobName := parsePayload(payload)

    // Create task with unique ID
    taskID := uuid.New().String()
    task := &queue.Task{
        ID:       taskID,
        JobName:  jobName,
        Status:   "queued",
        Priority: int64(priority),
        Metadata: map[string]string{
            "commit_id": commitID,
            "user":      user,
        },
    }

    // Enqueue
    if err := h.queue.AddTask(task); err != nil {
        return h.sendErrorPacket(conn, ErrorCodeDatabaseError, ...)
    }

    return h.sendSuccessPacket(conn, "Job queued")
}
```
## Worker Integration

### Task Polling

```go
// cmd/worker/worker_server.go (abridged)
func (w *Worker) Start() error {
    for {
        task, err := w.queue.WaitForNextTask(ctx, 5*time.Second)
        if task != nil {
            go w.executeTask(task)
        }
    }
}
```

### Task Execution

```go
func (w *Worker) executeTask(task *queue.Task) {
    // Update status
    task.Status = "running"
    task.StartedAt = &now
    w.queue.UpdateTaskWithMetrics(task, "start")

    // Execute
    err := w.runJob(task)

    // Finalize
    task.Status = "completed" // or "failed"
    task.EndedAt = &endTime
    task.Error = err.Error() // if err != nil
    w.queue.UpdateTaskWithMetrics(task, "final")
}
```

## Configuration

### API Server (`configs/config.yaml`)

```yaml
redis:
  addr: "localhost:6379"
  password: ""
  db: 0
```

### Worker (`configs/worker-config.yaml`)

```yaml
redis:
  addr: "localhost:6379"
  password: ""
  db: 0

metrics_flush_interval: 500ms
```
## Monitoring

### Queue Depth

```go
depth, err := queue.QueueDepth()
fmt.Printf("Pending tasks: %d\n", depth)
```

### Worker Heartbeat

```go
// Worker sends a heartbeat every 30s
err := queue.Heartbeat(workerID)
```

### Metrics

```redis
HGETALL job:metrics:{job_name}
# Returns: timestamp, tasks_start, tasks_final, etc.
```

## Error Handling

### Task Failures

```go
if err := w.runJob(task); err != nil {
    task.Status = "failed"
    task.Error = err.Error()
    w.queue.UpdateTask(task)
}
```

### Redis Connection Loss

```go
// TaskQueue automatically reconnects.
// Workers should implement retry logic:
for retries := 0; retries < 3; retries++ {
    task, err := queue.GetNextTask()
    if err == nil {
        break
    }
    time.Sleep(backoff)
}
```

## Testing

```go
// tests using miniredis
s, _ := miniredis.Run()
defer s.Close()

tq, _ := queue.NewTaskQueue(queue.Config{
    RedisAddr: s.Addr(),
})

task := &queue.Task{ID: "test-1", JobName: "test"}
tq.AddTask(task)

fetched, _ := tq.GetNextTask()
// assert fetched.ID == "test-1"
```

## Best Practices

1. **Unique Task IDs**: Always use UUIDs to avoid conflicts
2. **Metadata**: Store commit_id and user in task metadata
3. **Priority**: Higher values execute first (0-255 range)
4. **Status Updates**: Update status at each lifecycle stage
5. **Error Logging**: Store detailed errors in task.Error
6. **Heartbeats**: Workers should send heartbeats regularly
7. **Metrics**: Use UpdateTaskWithMetrics for atomic updates

---

For implementation details, see:
- [internal/queue/task.go](https://github.com/jfraeys/fetch_ml/blob/main/internal/queue/task.go)
- [internal/queue/queue.go](https://github.com/jfraeys/fetch_ml/blob/main/internal/queue/queue.go)

@ -1,95 +0,0 @@
---
layout: page
title: "Redis High Availability (Optional)"
permalink: /redis-ha/
nav_order: 7
---

# Redis High Availability

**Note:** This is optional for homelab setups. A single Redis instance is sufficient for most use cases.

## When You Need HA

Consider Redis HA if:
- You run production workloads
- Uptime > 99.9% is required
- You can't afford to lose queued tasks
- Multiple workers run across machines

## Redis Sentinel (Recommended)

### Setup

```yaml
# docker-compose.yml
version: '3.8'
services:
  redis-master:
    image: redis:7-alpine
    command: redis-server --maxmemory 2gb

  redis-replica:
    image: redis:7-alpine
    command: redis-server --slaveof redis-master 6379

  redis-sentinel-1:
    image: redis:7-alpine
    command: redis-sentinel /etc/redis/sentinel.conf
    volumes:
      - ./sentinel.conf:/etc/redis/sentinel.conf
```

**sentinel.conf:**
```conf
sentinel monitor mymaster redis-master 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 10000
```

### Application Configuration

```yaml
# worker-config.yaml
redis_addr: "redis-sentinel-1:26379,redis-sentinel-2:26379"
redis_master_name: "mymaster"
```

## Redis Cluster (Advanced)

For larger deployments with sharding needs.

```yaml
# Minimum 3 masters + 3 replicas
services:
  redis-1:
    image: redis:7-alpine
    command: redis-server --cluster-enabled yes

  redis-2:
    # ... similar config
```

## Homelab Alternative: Persistence Only

**For most homelabs, just enable persistence:**

```yaml
# docker-compose.yml
services:
  redis:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes:
      - redis_data:/data

volumes:
  redis_data:
```

This ensures tasks survive Redis restarts without full HA complexity.

---

**Recommendation:** Start simple. Add HA only if you experience actual downtime issues.

@ -1,452 +0,0 @@
---
layout: page
title: "Zig CLI Guide"
permalink: /zig-cli/
nav_order: 3
---

# Zig CLI Guide

High-performance command-line interface for ML experiment management, written in Zig for maximum speed and efficiency.

## Overview

The Zig CLI (`ml`) is the primary interface for managing ML experiments in your homelab. Built with Zig, it provides exceptional performance for file operations, network communication, and experiment management.

## Installation

### Pre-built Binaries (Recommended)

Download from [GitHub Releases](https://github.com/jfraeys/fetch_ml/releases):

```bash
# Download for your platform
curl -LO https://github.com/jfraeys/fetch_ml/releases/latest/download/ml-<platform>.tar.gz

# Extract
tar -xzf ml-<platform>.tar.gz

# Install
chmod +x ml-<platform>
sudo mv ml-<platform> /usr/local/bin/ml

# Verify
ml --help
```

**Platforms:**
- `ml-linux-x86_64.tar.gz` - Linux (fully static, zero dependencies)
- `ml-macos-x86_64.tar.gz` - macOS Intel
- `ml-macos-arm64.tar.gz` - macOS Apple Silicon

All release binaries include **embedded static rsync** for complete independence.

### Build from Source

**Development Build** (uses system rsync):
```bash
cd cli
zig build dev
./zig-out/dev/ml-dev --help
```

**Production Build** (embedded rsync):
```bash
cd cli
# For testing: uses rsync wrapper
zig build prod

# For release with static rsync:
# 1. Place static rsync binary at src/assets/rsync_release.bin
# 2. Build
zig build prod
strip zig-out/prod/ml  # Optional: reduce size

# Verify
./zig-out/prod/ml --help
ls -lh zig-out/prod/ml
```

See [cli/src/assets/README.md](https://github.com/jfraeys/fetch_ml/blob/main/cli/src/assets/README.md) for details on obtaining static rsync binaries.

### Verify Installation

```bash
ml --help
ml --version  # Shows build config
```
## Quick Start

1. **Initialize Configuration**
   ```bash
   ./cli/zig-out/bin/ml init
   ```

2. **Sync Your First Project**
   ```bash
   ./cli/zig-out/bin/ml sync ./my-project --queue
   ```

3. **Monitor Progress**
   ```bash
   ./cli/zig-out/bin/ml status
   ```

## Command Reference

### `init` - Configuration Setup

Initialize the CLI configuration file.

```bash
ml init
```

**Creates:** `~/.ml/config.toml`

**Configuration Template:**
```toml
worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"
```
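
The template above is flat `key = "value"` TOML, which is easy to consume from any tooling around the CLI. A minimal parsing sketch under that assumption (`parseConfig` is hypothetical, not part of the project, and is not a full TOML parser):

```go
package main

import (
	"fmt"
	"strings"
)

// parseConfig reads flat key = "value" lines like those in
// ~/.ml/config.toml. It skips blanks and #-comments, and does not
// handle TOML tables or inline comments.
func parseConfig(text string) map[string]string {
	cfg := make(map[string]string)
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, val, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		cfg[strings.TrimSpace(key)] = strings.Trim(strings.TrimSpace(val), `"`)
	}
	return cfg
}

func main() {
	cfg := parseConfig("worker_host = \"worker.local\"\nworker_port = 22")
	fmt.Println(cfg["worker_host"], cfg["worker_port"]) // worker.local 22
}
```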

### `sync` - Project Synchronization

Sync project files to the worker with intelligent deduplication.

```bash
# Basic sync
ml sync ./project

# Sync with custom name and auto-queue
ml sync ./project --name "experiment-1" --queue

# Sync with priority
ml sync ./project --priority 8
```

**Options:**
- `--name <name>`: Custom experiment name
- `--queue`: Automatically queue after sync
- `--priority N`: Set priority (1-10, default 5)

**Features:**
- **Content-Addressed Storage**: Automatic deduplication
- **SHA256 Commit IDs**: Reliable change detection
- **Incremental Transfer**: Only sync changed files
- **Rsync Backend**: Efficient file transfer

### `queue` - Job Management

Queue experiments for execution on the worker.

```bash
# Queue with commit ID
ml queue my-job --commit abc123def456

# Queue with priority
ml queue my-job --commit abc123 --priority 8
```

**Options:**
- `--commit <id>`: Commit ID from sync output
- `--priority N`: Execution priority (1-10)

**Features:**
- **WebSocket Communication**: Real-time job submission
- **Priority Queuing**: Higher priority jobs run first
- **API Authentication**: Secure job submission

### `watch` - Auto-Sync Monitoring

Monitor directories for changes and auto-sync.

```bash
# Watch for changes
ml watch ./project

# Watch and auto-queue on changes
ml watch ./project --name "dev-exp" --queue
```

**Options:**
- `--name <name>`: Custom experiment name
- `--queue`: Auto-queue on changes
- `--priority N`: Set priority for queued jobs

**Features:**
- **Real-time Monitoring**: 2-second polling interval
- **Change Detection**: File modification time tracking
- **Commit Comparison**: Only sync when content changes
- **Automatic Queuing**: Seamless development workflow

### `status` - System Status

Check system and worker status.

```bash
ml status
```

**Displays:**
- Worker connectivity
- Queue status
- Running jobs
- System health

### `monitor` - Remote Monitoring

Launch a TUI interface via SSH for real-time monitoring.

```bash
ml monitor
```

**Features:**
- **Real-time Updates**: Live experiment status
- **Interactive Interface**: Browse and manage experiments
- **SSH Integration**: Secure remote access

### `cancel` - Job Cancellation

Cancel running or queued jobs.

```bash
ml cancel job-id
```

**Options:**
- `job-id`: Job identifier from status output

### `prune` - Cleanup Management

Clean up old experiments to save space.

```bash
# Keep last N experiments
ml prune --keep 20

# Remove experiments older than N days
ml prune --older-than 30
```

**Options:**
- `--keep N`: Keep the N most recent experiments
- `--older-than N`: Remove experiments older than N days
## Architecture

### Core Components

```
cli/src/
├── commands/          # Command implementations
│   ├── init.zig       # Configuration setup
│   ├── sync.zig       # Project synchronization
│   ├── queue.zig      # Job management
│   ├── watch.zig      # Auto-sync monitoring
│   ├── status.zig     # System status
│   ├── monitor.zig    # Remote monitoring
│   ├── cancel.zig     # Job cancellation
│   └── prune.zig      # Cleanup operations
├── config.zig         # Configuration management
├── errors.zig         # Error handling
├── net/               # Network utilities
│   └── ws.zig         # WebSocket client
└── utils/             # Utility functions
    ├── crypto.zig     # Hashing and encryption
    ├── storage.zig    # Content-addressed storage
    └── rsync.zig      # File synchronization
```

### Performance Features

#### Content-Addressed Storage
- **Deduplication**: Identical files shared across experiments
- **Hash-based Storage**: Files stored by SHA256 hash
- **Space Efficiency**: Can reduce storage by up to 90%

#### SHA256 Commit IDs
- **Reliable Detection**: Cryptographic change detection
- **Collision Resistance**: Effectively unique identifiers
- **Fast Computation**: Optimized for large directories
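
The core of content-addressed storage is storing each blob under its own digest, so two identical files occupy one slot. A minimal sketch of that idea (the CLI implements this in Zig in `utils/storage.zig`; the Go types here are illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// store maps SHA-256 digests to file contents: identical contents
// always hash to the same address, which is the essence of
// content-addressed deduplication.
type store map[string][]byte

// put saves content under its digest and returns the address.
func (s store) put(content []byte) string {
	sum := sha256.Sum256(content)
	addr := hex.EncodeToString(sum[:])
	s[addr] = content
	return addr
}

func main() {
	s := store{}
	a := s.put([]byte("train.py contents"))
	b := s.put([]byte("train.py contents"))
	fmt.Println(a == b, len(s)) // true 1
}
```

A commit ID can then be derived by hashing the sorted list of (path, address) pairs, so any file change produces a different ID.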

#### WebSocket Protocol
- **Low Latency**: Real-time communication
- **Binary Protocol**: Efficient message format
- **Connection Pooling**: Reused connections

#### Memory Management
- **Arena Allocators**: Efficient memory allocation
- **Zero-copy Operations**: Minimized memory usage
- **Resource Cleanup**: Automatic resource management

### Security Features

#### Authentication
- **API Key Hashing**: Secure token storage
- **SHA256 Hashes**: Irreversible token protection
- **Config Validation**: Input sanitization

#### Secure Communication
- **SSH Integration**: Encrypted file transfers
- **WebSocket Security**: TLS-protected communication
- **Input Validation**: Comprehensive argument checking

#### Error Handling
- **Secure Reporting**: No sensitive information leakage
- **Graceful Degradation**: Safe error recovery
- **Audit Logging**: Operation tracking
## Advanced Usage
|
||||
|
||||
### Workflow Integration
|
||||
|
||||
#### Development Workflow
|
||||
```bash
|
||||
# 1. Initialize project
|
||||
ml sync ./project --name "dev" --queue
|
||||
|
||||
# 2. Auto-sync during development
|
||||
ml watch ./project --name "dev" --queue
|
||||
|
||||
# 3. Monitor progress
|
||||
ml status
|
||||
```
|
||||
|
||||

#### Batch Processing
```bash
# Process multiple experiments
for dir in experiments/*/; do
    ml sync "$dir" --queue
done
```

#### Priority Management
```bash
# High-priority experiment
ml sync ./urgent --priority 10 --queue

# Background processing
ml sync ./background --priority 1 --queue
```

### Configuration Management

#### Multiple Workers
```toml
# ~/.ml/config.toml
worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"
```

#### Security Settings
```bash
# Set restrictive permissions
chmod 600 ~/.ml/config.toml

# Verify configuration
ml status
```

## Troubleshooting

### Common Issues

#### Build Problems
```bash
# Check Zig installation
zig version

# Clean build
cd cli && make clean && make build
```

#### Connection Issues
```bash
# Test SSH connectivity (using values from ~/.ml/config.toml)
ssh -p "$worker_port" "$worker_user@$worker_host"

# Verify configuration
cat ~/.ml/config.toml
```

#### Sync Failures
```bash
# Check rsync
rsync --version

# Manual sync test
rsync -avz ./test/ "$worker_user@$worker_host:/tmp/"
```

#### Performance Issues
```bash
# Monitor resource usage (-d, joins multiple PIDs with commas for top)
top -p "$(pgrep -d, ml)"

# Check disk space
df -h "$worker_base"
```

### Debug Mode

Enable verbose logging:
```bash
# Environment variable
export ML_DEBUG=1
ml sync ./project

# Or use a debug build
cd cli && make debug
```

## Performance Benchmarks

### File Operations
- **Sync Speed**: 100MB/s+ (network limited)
- **Hash Computation**: 500MB/s+ (CPU limited)
- **Deduplication**: 90%+ space savings
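
You can sanity-check the hash-computation figure on your own hardware (results vary by CPU; this times plain `sha256sum`, not the CLI itself):

```bash
# Hash 200 MB of zeros and report the wall-clock time.
bytes=$((200 * 1024 * 1024))
start=$(date +%s)
hash=$(head -c "$bytes" /dev/zero | sha256sum | cut -d' ' -f1)
end=$(date +%s)
echo "hashed 200 MB in $((end - start))s"
```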

### Memory Usage
- **Base Memory**: ~10MB
- **Large Projects**: ~50MB (1GB+ projects)
- **Memory Efficiency**: Constant per-file overhead

### Network Performance
- **WebSocket Latency**: <10ms (local network)
- **Connection Setup**: <100ms
- **Throughput**: Network limited

## Contributing

### Development Setup
```bash
cd cli
zig build-exe src/main.zig
```

### Testing
```bash
# Run unit tests (zig test takes a root source file, not a directory)
cd cli && zig test src/main.zig

# Integration tests (assuming tests/main.zig as the entry point)
zig test tests/main.zig
```

### Code Style
- Follow Zig style guidelines
- Use explicit error handling
- Document public APIs
- Add comprehensive tests

---

**For more information, see the [CLI Reference](/cli-reference/) and [Architecture](/architecture/) pages.**