---
title: "Homelab Architecture"
url: "/architecture/"
weight: 1
---

# Homelab Architecture

Simple, secure architecture for ML experiments in your homelab.
## Components Overview

```mermaid
graph TB
    subgraph "Homelab Stack"
        CLI[Zig CLI]
        API["API Server (HTTPS + WebSocket)"]
        REDIS[Redis Cache]
        DB[(SQLite/PostgreSQL)]
        FS[Local Storage]
        WORKER[Worker Service]
        PODMAN[Podman/Docker]
    end

    CLI --> API
    API --> REDIS
    API --> DB
    API --> FS
    WORKER --> API
    WORKER --> REDIS
    WORKER --> FS
    WORKER --> PODMAN
```
## Core Services

### API Server
- **Purpose**: Secure HTTPS API for ML experiments
- **Port**: 9101 (HTTPS only)
- **Auth**: API key authentication
- **Security**: Rate limiting, IP whitelisting

### Redis
- **Purpose**: Caching and job queuing
- **Port**: 6379 (localhost only)
- **Storage**: Temporary data only
- **Persistence**: Local volume

### Zig CLI
- **Purpose**: High-performance experiment management
- **Language**: Zig for maximum speed and efficiency
- **Features**:
  - Content-addressed storage with deduplication
  - SHA256-based commit ID generation
  - WebSocket communication for real-time updates
  - Rsync-based incremental file transfers
  - Multi-threaded operations
  - Secure API key authentication
  - Auto-sync monitoring with file system watching
  - Priority-based job queuing
  - Memory-efficient operations with arena allocators

## Security Architecture

```mermaid
graph LR
    USER[User] --> AUTH[API Key Auth]
    AUTH --> RATE[Rate Limiting]
    RATE --> WHITELIST[IP Whitelist]
    WHITELIST --> API[Secure API]
    API --> AUDIT[Audit Logging]
```

### Security Layers
1. **API Key Authentication** - Hashed keys with roles
2. **Rate Limiting** - 30 requests/minute
3. **IP Whitelisting** - Local networks only
4. **Fail2Ban** - Automatic IP blocking
5. **HTTPS/TLS** - Encrypted communication
6. **Audit Logging** - Complete action tracking

## Data Flow

```mermaid
sequenceDiagram
    participant CLI
    participant API
    participant Redis
    participant Storage

    CLI->>API: HTTPS + WebSocket request
    API->>API: Validate Auth
    API->>Redis: Cache/Queue
    API->>Storage: Experiment Data
    Storage->>API: Results
    API->>CLI: Response
```
## Deployment Options

### Docker Compose (Recommended)
```yaml
services:
  redis:
    image: redis:7-alpine
    ports: ["127.0.0.1:6379:6379"]  # bind to localhost only
    volumes: [redis_data:/data]

  api-server:
    build: .
    ports: ["9101:9101"]
    depends_on: [redis]
```

### Local Setup
```bash
docker-compose -f deployments/docker-compose.dev.yml up -d
```
## Network Architecture

- **Private Network**: Docker internal network
- **Localhost Access**: Redis only on localhost
- **HTTPS API**: Port 9101, TLS encrypted
- **No External Dependencies**: Everything runs locally
## Storage Architecture

```
data/
├── experiments/   # Experiment definitions, run manifests, and artifacts
├── tracking/      # Tracking tool state (e.g., MLflow/TensorBoard), when enabled
├── .prewarm/      # Best-effort prewarm staging (snapshots/env/datasets), when enabled
├── cache/         # Temporary caches (best-effort)
└── backups/       # Local backups

logs/
├── app.log        # Application logs
├── audit.log      # Security events
└── access.log     # API access logs
```
## Monitoring Architecture

Simple, lightweight monitoring:
- **Health Checks**: Service availability
- **Log Files**: Structured logging
- **Prometheus Metrics**: Worker and API metrics (including prewarm hit/miss/timing)
- **Security Events**: Failed auth, rate limits
## Homelab Benefits

- ✅ **Simple Setup**: One-command installation
- ✅ **Local Only**: No external dependencies
- ✅ **Secure by Default**: HTTPS, auth, rate limiting
- ✅ **Low Resource**: Minimal CPU/memory usage
- ✅ **Easy Backup**: Local file system
- ✅ **Privacy**: Everything stays on your network
## High-Level Architecture

```mermaid
graph TB
    subgraph "Client Layer"
        CLI[CLI Tools]
        TUI[Terminal UI]
        API[WebSocket API]
    end

    subgraph "Authentication Layer"
        Auth[Authentication Service]
        RBAC[Role-Based Access Control]
        Perm[Permission Manager]
    end

    subgraph "Core Services"
        Worker[ML Worker Service]
        DataMgr[Data Manager Service]
        Queue[Job Queue]
    end

    subgraph "Storage Layer"
        Redis[(Redis Cache)]
        DB[(SQLite/PostgreSQL)]
        Files[File Storage]
    end

    subgraph "Container Runtime"
        Podman[Podman/Docker]
        Containers[ML Containers]
    end

    CLI --> Auth
    TUI --> Auth
    API --> Auth

    Auth --> RBAC
    RBAC --> Perm

    Worker --> Queue
    Worker --> DataMgr
    Worker --> Podman

    DataMgr --> DB
    DataMgr --> Files

    Queue --> Redis

    Podman --> Containers
```
## Tracking & Plugin System

fetch_ml includes an optional tracking plugin system that can provision sidecar tools and/or pass through environment variables for common research tracking stacks.

### Tracking modes

Tracking tools support the following modes:

- `sidecar`: provision a local sidecar container per task (best-effort).
- `remote`: point to an externally managed instance (no local provisioning).
- `disabled`: disable the tool entirely.

### How it works

- The worker maintains a tracking registry and provisions tools during task startup.
- Provisioned plugins return environment variables that are injected into the task container.
- Some plugins also require host paths (e.g., the TensorBoard log directory); these are mounted into the task container and sanitized to avoid leaking host paths.

### Built-in plugins

The worker ships with built-in plugins:

- `mlflow`: can run an MLflow server as a sidecar or use a remote `MLFLOW_TRACKING_URI`.
- `tensorboard`: runs a TensorBoard sidecar and mounts a per-job log directory.
- `wandb`: does not provision a sidecar; it forwards configuration via environment variables.

### Configuration

Plugins can be configured via worker configuration under `plugins`, including:

- `enabled`
- `image`
- `mode`
- per-plugin paths/settings (e.g., artifact base path, log base path)

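As a sketch, a worker configuration fragment for these plugins might look like the following. Only `enabled`, `image`, and `mode` come from the list above; the image references and any other keys are illustrative, not confirmed schema:

```yaml
plugins:
  mlflow:
    enabled: true
    mode: sidecar                         # sidecar | remote | disabled
    image: ghcr.io/mlflow/mlflow:latest   # illustrative image reference
  tensorboard:
    enabled: true
    mode: sidecar
  wandb:
    enabled: false                        # env passthrough only; no sidecar
```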
## Plugin GPU Quota System

The scheduler includes a GPU quota management system for plugin-based services (Jupyter, vLLM, etc.) that controls resource allocation across users and plugins.

### Quota Enforcement

The quota system enforces limits at multiple levels:

1. **Global GPU Limit**: Total GPUs available across all plugins
2. **Per-User GPU Limit**: Maximum GPUs a single user can allocate
3. **Per-User Service Limit**: Maximum number of service instances per user
4. **Plugin-Specific Limits**: Separate limits for each plugin type
5. **User Overrides**: Custom limits for specific users with allowed plugin restrictions

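The multi-level check can be sketched in Go. The type and field names here are illustrative, not the actual `PluginQuotaConfig`/`PluginQuotaManager` definitions: a reservation is admitted only if every level allows it, with all checks made under a single lock.

```go
package main

import (
	"fmt"
	"sync"
)

// QuotaConfig mirrors the limits described above (sketch only).
// Plugins absent from PerPluginGPU get zero quota in this sketch.
type QuotaConfig struct {
	GlobalGPUs   int
	PerUserGPUs  int
	PerUserSvcs  int
	PerPluginGPU map[string]int
}

type usage struct{ gpus, services int }

// QuotaManager tracks allocations; checks and updates share one mutex.
type QuotaManager struct {
	mu     sync.Mutex
	cfg    QuotaConfig
	global int
	byUser map[string]*usage
	byPlug map[string]int
}

func NewQuotaManager(cfg QuotaConfig) *QuotaManager {
	return &QuotaManager{cfg: cfg, byUser: map[string]*usage{}, byPlug: map[string]int{}}
}

// Reserve admits a service job only if every quota level allows it.
func (m *QuotaManager) Reserve(user, plugin string, gpus int) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	u := m.byUser[user]
	if u == nil {
		u = &usage{}
		m.byUser[user] = u
	}
	switch {
	case m.global+gpus > m.cfg.GlobalGPUs:
		return fmt.Errorf("global GPU limit exceeded")
	case u.gpus+gpus > m.cfg.PerUserGPUs:
		return fmt.Errorf("per-user GPU limit exceeded")
	case u.services+1 > m.cfg.PerUserSvcs:
		return fmt.Errorf("per-user service limit exceeded")
	case m.byPlug[plugin]+gpus > m.cfg.PerPluginGPU[plugin]:
		return fmt.Errorf("plugin %q GPU limit exceeded", plugin)
	}
	m.global += gpus
	u.gpus += gpus
	u.services++
	m.byPlug[plugin] += gpus
	return nil
}

func main() {
	m := NewQuotaManager(QuotaConfig{
		GlobalGPUs: 8, PerUserGPUs: 4, PerUserSvcs: 2,
		PerPluginGPU: map[string]int{"jupyter": 4, "vllm": 4},
	})
	fmt.Println(m.Reserve("alice", "jupyter", 2)) // <nil>: within all limits
	fmt.Println(m.Reserve("alice", "jupyter", 4)) // per-user GPU limit exceeded
}
```

A release path (job completion) would decrement the same counters under the same lock.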
### Architecture

```mermaid
graph TB
    subgraph "Plugin Quota System"
        Submit[Job Submission] --> CheckQuota{Check Quota}
        CheckQuota -->|Within Limits| Accept[Accept Job]
        CheckQuota -->|Exceeded| Reject[Reject with Error]

        Accept --> RecordUsage[Record Usage]
        RecordUsage --> Assign[Assign to Worker]

        Complete[Job Complete] --> ReleaseUsage[Release Usage]

        subgraph "Quota Manager"
            Global[Global GPU Counter]
            PerUser[Per-User Tracking]
            PerPlugin[Per-Plugin Tracking]
            Overrides[User Overrides]
        end

        CheckQuota --> Global
        CheckQuota --> PerUser
        CheckQuota --> PerPlugin
        CheckQuota --> Overrides
    end
```
### Components

- **PluginQuotaConfig**: Configuration for all quota limits and overrides
- **PluginQuotaManager**: Thread-safe manager for tracking and enforcing quotas
- **Integration Points**:
  - `SubmitJob()`: Validates quotas before accepting service jobs
  - `handleJobAccepted()`: Records usage when jobs are assigned
  - `handleJobResult()`: Releases usage when jobs complete

### Usage

Jobs must include `user_id` and `plugin_name` metadata for quota tracking:

```go
spec := scheduler.JobSpec{
    Type:     scheduler.JobTypeService,
    UserID:   "user123",
    GPUCount: 2,
    Metadata: map[string]string{
        "plugin_name": "jupyter",
    },
}
```
## Zig CLI Architecture

### Component Structure

```mermaid
graph TB
    subgraph "Zig CLI Components"
        Main[main.zig] --> Commands[commands/]
        Commands --> Config[config.zig]
        Commands --> Utils[utils/]
        Commands --> Net[net/]
        Commands --> Errors[errors.zig]

        subgraph "Commands"
            Init[init.zig]
            Sync[sync.zig]
            Queue[queue.zig]
            Watch[watch.zig]
            Status[status.zig]
            Monitor[monitor.zig]
            Cancel[cancel.zig]
            Prune[prune.zig]
        end

        subgraph "Utils"
            Crypto[crypto.zig]
            Storage[storage.zig]
            Rsync[rsync.zig]
        end

        subgraph "Network"
            WS[ws.zig]
        end
    end
```
### Performance Optimizations

#### Content-Addressed Storage
- **Deduplication**: Files stored by SHA256 hash
- **Space Efficiency**: Shared files across experiments
- **Fast Lookup**: Hash-based file retrieval

#### Memory Management
- **Arena Allocators**: Efficient bulk allocation
- **Zero-Copy Operations**: Minimized memory copying
- **Automatic Cleanup**: Resource deallocation

#### Network Communication
- **WebSocket Protocol**: Real-time bidirectional communication
- **Connection Pooling**: Reused connections
- **Binary Messaging**: Efficient data transfer
### Security Implementation

```mermaid
graph LR
    subgraph "CLI Security"
        Config[Config File] --> Hash[SHA256 Hashing]
        Hash --> Auth[API Authentication]
        Auth --> SSH[SSH Transfer]
        SSH --> WS[WebSocket Security]
    end
```
## Core Components

### 1. Authentication & Authorization

```mermaid
graph LR
    subgraph "Auth Flow"
        Client[Client] --> APIKey[API Key]
        APIKey --> Hash[Hash Validation]
        Hash --> Roles[Role Resolution]
        Roles --> Perms[Permission Check]
        Perms --> Access[Grant/Deny Access]
    end

    subgraph "Permission Sources"
        YAML[YAML Config]
        Inline[Inline Fallback]
        Roles --> YAML
        Roles --> Inline
    end
```

**Features:**
- API key-based authentication
- Role-based access control (RBAC)
- YAML-based permission configuration
- Fallback to inline permissions
- Admin wildcard permissions

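Hashed-key validation can be sketched in Go (a sketch only; the service's actual digest choice and key format may differ): the server stores only the digest, never the raw key, and compares digests in constant time.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// HashKey returns the hex SHA-256 digest of an API key. Only digests
// are stored server-side; a real deployment might add a salt or use a
// keyed scheme instead.
func HashKey(key string) string {
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}

// Verify compares in constant time to avoid leaking prefix matches.
func Verify(presented, storedHash string) bool {
	h := HashKey(presented)
	return subtle.ConstantTimeCompare([]byte(h), []byte(storedHash)) == 1
}

func main() {
	stored := HashKey("example-api-key") // key value is illustrative
	fmt.Println(Verify("example-api-key", stored)) // prints true
	fmt.Println(Verify("wrong-key", stored))       // prints false
}
```

After the digest matches, role resolution and permission checks proceed as in the diagram above.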
### 2. Worker Service

```mermaid
graph TB
    subgraph "Worker Architecture"
        API[HTTP API] --> Router[Request Router]
        Router --> Auth[Auth Middleware]
        Auth --> Queue[Job Queue]
        Queue --> Processor[Job Processor]
        Processor --> Runtime[Container Runtime]
        Runtime --> Storage[Result Storage]

        subgraph "Job Lifecycle"
            Submit[Submit Job] --> Queue
            Queue --> Execute[Execute]
            Execute --> Monitor[Monitor]
            Monitor --> Complete[Complete]
            Complete --> Store[Store Results]
        end
    end
```

**Responsibilities:**
- HTTP API for job submission
- Job queue management
- Container orchestration
- Result collection and storage
- Metrics and monitoring
### 3. Data Manager Service

```mermaid
graph TB
    subgraph "Data Management"
        API[Data API] --> Storage[Storage Layer]
        Storage --> Metadata[Metadata DB]
        Storage --> Files[File System]
        Storage --> Cache[Redis Cache]

        subgraph "Data Operations"
            Upload[Upload Data] --> Validate[Validate]
            Validate --> Store[Store]
            Store --> Index[Index]
            Index --> Catalog[Catalog]
        end
    end
```

**Features:**
- Data upload and validation
- Metadata management
- File system abstraction
- Caching layer
- Data catalog
### 4. Terminal UI (TUI)

```mermaid
graph TB
    subgraph "TUI Architecture"
        UI[UI Components] --> Model[Data Model]
        Model --> Update[Update Loop]
        Update --> Render[Render]

        subgraph "UI Panels"
            Jobs[Job List]
            Details[Job Details]
            Logs[Log Viewer]
            Status[Status Bar]
        end

        UI --> Jobs
        UI --> Details
        UI --> Logs
        UI --> Status
    end
```

**Components:**
- Bubble Tea framework
- Component-based architecture
- Real-time updates
- Keyboard navigation
- Theme support
## Data Flow

### Job Execution Flow

```mermaid
sequenceDiagram
    participant Client
    participant Auth
    participant Worker
    participant Queue
    participant Container
    participant Storage

    Client->>Auth: Submit job with API key
    Auth->>Client: Validate and return job ID

    Client->>Worker: Execute job request
    Worker->>Queue: Queue job
    Queue->>Worker: Job ready
    Worker->>Container: Start ML container
    Container->>Worker: Execute experiment
    Worker->>Storage: Store results
    Worker->>Client: Return results
```

### Authentication Flow

```mermaid
sequenceDiagram
    participant Client
    participant Auth
    participant PermMgr
    participant Config

    Client->>Auth: Request with API key
    Auth->>Auth: Validate key hash
    Auth->>PermMgr: Get user permissions
    PermMgr->>Config: Load YAML permissions
    Config->>PermMgr: Return permissions
    PermMgr->>Auth: Return resolved permissions
    Auth->>Client: Grant/deny access
```
## Security Architecture

### Defense in Depth

```mermaid
graph TB
    subgraph "Security Layers"
        Network[Network Security]
        Auth[Authentication]
        AuthZ[Authorization]
        Container[Container Security]
        Data[Data Protection]
        Audit[Audit Logging]
    end

    Network --> Auth
    Auth --> AuthZ
    AuthZ --> Container
    Container --> Data
    Data --> Audit
```

**Security Features:**
- API key authentication
- Role-based permissions
- Container isolation
- File system sandboxing
- Comprehensive audit logs
- Input validation and sanitization

### Container Security

```mermaid
graph TB
    subgraph "Container Isolation"
        Host[Host System]
        Podman[Podman Runtime]
        Network[Network Isolation]
        FS[File System Isolation]
        User[User Namespaces]
        ML[ML Container]

        Host --> Podman
        Podman --> Network
        Podman --> FS
        Podman --> User
        User --> ML
    end
```

**Isolation Features:**
- Rootless containers
- Network isolation
- File system sandboxing
- User namespace mapping
- Resource limits
## Configuration Architecture

### Configuration Hierarchy

```mermaid
graph TB
    subgraph "Config Sources"
        Env[Environment Variables]
        File[Config Files]
        CLI[CLI Flags]
        Defaults[Default Values]
    end

    subgraph "Config Processing"
        Merge[Config Merger]
        Validate[Schema Validator]
        Apply[Config Applier]
    end

    Env --> Merge
    File --> Merge
    CLI --> Merge
    Defaults --> Merge

    Merge --> Validate
    Validate --> Apply
```

**Configuration Priority:**
1. CLI flags (highest)
2. Environment variables
3. Configuration files
4. Default values (lowest)

## Scalability Architecture

### Horizontal Scaling

```mermaid
graph TB
    subgraph "Scaled Architecture"
        LB[Load Balancer]
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker N]
        Redis[Redis Cluster]
        Storage[Shared Storage]

        LB --> W1
        LB --> W2
        LB --> W3

        W1 --> Redis
        W2 --> Redis
        W3 --> Redis

        W1 --> Storage
        W2 --> Storage
        W3 --> Storage
    end
```

**Scaling Features:**
- Stateless worker services
- Shared job queue (Redis)
- Distributed storage
- Load balancer ready
- Health checks and monitoring
## Technology Stack

### Backend Technologies

| Component | Technology | Purpose |
|-----------|------------|---------|
| **Language** | Go 1.25+ | Core application |
| **Web Framework** | Standard library | HTTP server |
| **Authentication** | Custom | API key + RBAC |
| **Database** | SQLite/PostgreSQL | Metadata storage |
| **Cache** | Redis | Job queue & caching |
| **Containers** | Podman/Docker | Job isolation |
| **UI Framework** | Bubble Tea | Terminal UI |

### Dependencies

```go
// Core dependencies
require (
    github.com/charmbracelet/bubbletea v1.3.10 // TUI framework
    github.com/go-redis/redis/v8 v8.11.5       // Redis client
    github.com/google/uuid v1.6.0              // UUID generation
    github.com/mattn/go-sqlite3 v1.14.32       // SQLite driver
    golang.org/x/crypto v0.45.0                // Crypto utilities
    gopkg.in/yaml.v3 v3.0.1                    // YAML parsing
)
```
## Development Architecture

### Project Structure

```
fetch_ml/
├── cmd/                  # CLI applications
│   ├── worker/           # ML worker service
│   ├── tui/              # Terminal UI
│   ├── data_manager/     # Data management
│   └── user_manager/     # User management
├── internal/             # Internal packages
│   ├── auth/             # Authentication system
│   ├── config/           # Configuration management
│   ├── container/        # Container operations
│   ├── database/         # Database operations
│   ├── logging/          # Logging utilities
│   ├── metrics/          # Metrics collection
│   └── network/          # Network utilities
├── configs/              # Configuration files
├── scripts/              # Setup and utility scripts
├── tests/                # Test suites
└── docs/                 # Documentation
```

### Package Dependencies

```mermaid
graph TB
    subgraph "Application Layer"
        Worker[cmd/worker]
        TUI[cmd/tui]
        DataMgr[cmd/data_manager]
        UserMgr[cmd/user_manager]
    end

    subgraph "Service Layer"
        Auth[internal/auth]
        Config[internal/config]
        Container[internal/container]
        Database[internal/database]
    end

    subgraph "Utility Layer"
        Logging[internal/logging]
        Metrics[internal/metrics]
        Network[internal/network]
    end

    Worker --> Auth
    Worker --> Config
    Worker --> Container
    TUI --> Auth
    DataMgr --> Database
    UserMgr --> Auth

    Auth --> Logging
    Container --> Network
    Database --> Metrics
```
## Monitoring & Observability

### Metrics Collection

```mermaid
graph TB
    subgraph "Metrics Pipeline"
        App[Application] --> Metrics[Metrics Collector]
        Metrics --> Export[Prometheus Exporter]
        Export --> Prometheus[Prometheus Server]
        Prometheus --> Grafana[Grafana Dashboard]

        subgraph "Metric Types"
            Counter[Counters]
            Gauge[Gauges]
            Histogram[Histograms]
            Timer[Timers]
        end

        App --> Counter
        App --> Gauge
        App --> Histogram
        App --> Timer
    end
```

### Logging Architecture

```mermaid
graph TB
    subgraph "Logging Pipeline"
        App[Application] --> Logger[Structured Logger]
        Logger --> File[File Output]
        Logger --> Console[Console Output]
        Logger --> Syslog[Syslog Forwarder]
        Syslog --> Aggregator[Log Aggregator]
        Aggregator --> Storage[Log Storage]
        Storage --> Viewer[Log Viewer]
    end
```

## Deployment Architecture

### Container Deployment

```mermaid
graph TB
    subgraph "Deployment Stack"
        Image[Container Image]
        Registry[Container Registry]
        Orchestrator[Docker Compose]
        Config[ConfigMaps/Secrets]
        Storage[Persistent Storage]

        Image --> Registry
        Registry --> Orchestrator
        Config --> Orchestrator
        Storage --> Orchestrator
    end
```

### Service Discovery

```mermaid
graph TB
    subgraph "Service Mesh"
        Gateway[API Gateway]
        Discovery[Service Discovery]
        Worker[Worker Service]
        Data[Data Service]
        Redis[Redis Cluster]

        Gateway --> Discovery
        Discovery --> Worker
        Discovery --> Data
        Discovery --> Redis
    end
```
## Future Architecture Considerations

### Microservices Evolution

- **API Gateway**: Centralized routing and authentication
- **Service Mesh**: Inter-service communication
- **Event Streaming**: Kafka for job events
- **Distributed Tracing**: OpenTelemetry integration
- **Multi-tenant**: Tenant isolation and quotas

### Homelab Features

- **Docker Compose**: Simple container orchestration
- **Local Development**: Easy setup and testing
- **Security**: Built-in authentication and encryption
- **Monitoring**: Basic health checks and logging
## Roadmap (Research-First, Workstation-First)

fetch_ml is a research-first ML experiment runner with production-grade discipline.

### Guiding principles

- **Reproducibility over speed**: optimizations must never change experimental semantics.
- **Explicit over magic**: every run should be explainable from manifests, configs, and logs.
- **Best-effort optimizations**: prewarming/caching must be optional and must not be required for correctness.
- **Workstation-first**: prioritize single-node reliability, observability, and fast iteration; avoid HPC-specific complexity.

### Where we are now

- **Run provenance**: `run_manifest.json` exists and is readable via `ml info <path|id>`.
- **Validation**: `ml validate <commit_id>` and `ml validate --task <task_id>` exist; task validation includes run-manifest lifecycle/provenance checks.
- **Prewarming (best-effort)**:
  - Next-task prewarm loop stages snapshots under `base/.prewarm/snapshots/<task_id>`.
  - Best-effort dataset prefetch with a TTL cache.
  - Warmed container image infrastructure exists (images keyed by `deps_manifest_sha256`).
  - Prewarm status is surfaced in `ml status --json` under the `prewarm` field.
### Trust and usability (highest priority)

#### 1) Make `ml status` excellent (human output)

- Show a compact summary of:
  - queued/running/completed/failed counts
  - a short list of the most relevant tasks
  - **prewarm state** (worker id, target task id, phase, dataset count, age)
- Preserve `--json` output as a stable API for scripting.

#### 2) Add a dry-run preview command (`ml explain`)

- Print the resolved execution plan before running:
  - commit id, experiment manifest overall sha
  - dependency manifest name + sha
  - snapshot id + expected sha (when applicable)
  - dataset identities + checksums (when applicable)
  - requested resources (cpu/mem/gpu)
  - candidate runtime image (base vs warmed tag)
- Enforce a strict preflight by default:
  - Queue-time blocking (do not enqueue tasks that fail reproducibility requirements).
  - The strict preflight should be shared by `ml queue` and `ml explain`.
- Record the resolved plan into task metadata for traceability:
  - `repro_policy: strict`
  - `trust_level: <L0..L4>` (simple trust ladder)
  - `plan_sha256: <sha256>` (digest of the resolved execution plan)

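One way to compute a `plan_sha256` is to hash a canonical encoding of the resolved plan. This Go sketch (field names are illustrative, not the real plan schema) relies on Go's deterministic JSON key ordering for a fixed struct, so identical plans always yield identical digests; a real implementation would pin the canonical encoding explicitly.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Plan is an illustrative subset of the resolved execution plan.
type Plan struct {
	CommitID   string `json:"commit_id"`
	DepsSHA    string `json:"deps_manifest_sha256"`
	SnapshotID string `json:"snapshot_id,omitempty"`
	Image      string `json:"image"`
	GPUCount   int    `json:"gpu_count"`
}

// PlanSHA256 digests the JSON encoding of the plan. For a fixed struct
// definition, encoding/json emits keys in declaration order, so the
// digest is stable across processes.
func PlanSHA256(p Plan) string {
	b, _ := json.Marshal(p)
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	p := Plan{CommitID: "abc123", DepsSHA: "deadbeef", Image: "base:latest", GPUCount: 1}
	fmt.Println(PlanSHA256(p) == PlanSHA256(p)) // identical plans -> identical digest
}
```

Recording this digest in task metadata lets `ml validate` later confirm that what ran is what `ml explain` previewed.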
#### 3) Tighten run manifest completeness

- For `running`: require `started_at`.
- For `completed/failed`: require `started_at`, `ended_at`, and `exit_code`.
- When snapshots/datasets are used: ensure the manifest records the relevant identifiers and digests.

#### 4) Dataset identity (minimal but research-grade)

- Prefer structured `dataset_specs` (name + checksum) as the authoritative input.
- Treat a missing checksum as an error by default (strict-by-default).

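The completeness rules in #3 can be sketched as a validation function in Go (field names mirror the description above, not a confirmed `run_manifest.json` schema):

```go
package main

import "fmt"

// Manifest is a sketch of the run-manifest fields the lifecycle checks need.
type Manifest struct {
	Status    string `json:"status"`
	StartedAt string `json:"started_at,omitempty"`
	EndedAt   string `json:"ended_at,omitempty"`
	ExitCode  *int   `json:"exit_code,omitempty"` // pointer: 0 is a valid exit code
}

// CheckLifecycle enforces the per-status completeness rules.
func CheckLifecycle(m Manifest) error {
	switch m.Status {
	case "running":
		if m.StartedAt == "" {
			return fmt.Errorf("running task missing started_at")
		}
	case "completed", "failed":
		if m.StartedAt == "" || m.EndedAt == "" || m.ExitCode == nil {
			return fmt.Errorf("%s task missing started_at/ended_at/exit_code", m.Status)
		}
	}
	return nil
}

func main() {
	code := 0
	ok := Manifest{Status: "completed", StartedAt: "2025-01-01T00:00:00Z",
		EndedAt: "2025-01-01T00:10:00Z", ExitCode: &code}
	fmt.Println(CheckLifecycle(ok))                          // <nil>: complete
	fmt.Println(CheckLifecycle(Manifest{Status: "running"})) // error: missing started_at
}
```

Using a pointer for `exit_code` distinguishes "exited with 0" from "never recorded", which is exactly the gap these checks are meant to close.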
### Simple performance wins (only after trust/usability feels solid)

- Keep prewarming single-level (next task only).
- Improve observability first (status output + metrics), then expand capabilities.

### Research workflows

- `ml compare <runA> <runB>`: manifest-driven diff of provenance and key parameters.
- `ml reproduce <run-id>`: submit a new task derived from the recorded manifest inputs.
- `ml export <run-id>`: package provenance + artifacts for collaborators/reviewers.

### Infrastructure (only if needed)

- Multi-level prewarming, predictive scheduling, tmpfs caching, dashboards.
- Optional scalable storage backend for team deployments:
  - Store run manifests + artifacts in S3-compatible object storage (e.g., MinIO) for durability and multi-worker/Kubernetes setups.
  - Keep workstation-first defaults (local filesystem) for simplicity.
- Optional integrations via plugins/exporters (keep core strict and offline-capable):
  - Server-side exporters that mirror run metadata, metrics, and artifacts to external systems (e.g., MLflow Tracking, Weights & Biases).
  - Prefer lifecycle hooks that consume `run_manifest.json` / artifact manifests over plugins that influence execution semantics.
- Optional Kubernetes deployment path (for teams on scalable infra):
  - Publish versioned container images for the backend (API server; optionally worker) and provide reference manifests (Helm/Kustomize).
  - Keep the CLI as the primary UX; Kubernetes is an execution/deployment backend, not a UI.
- These are optional and should be driven by measured bottlenecks.
---

This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.

## See Also

- **[Scheduler Architecture](scheduler-architecture.md)** - Detailed scheduler design and protocols
- **[Security Guide](security.md)** - Security architecture and best practices
- **[Configuration Reference](configuration-reference.md)** - Configuration options and environment variables
- **[Deployment Guide](deployment.md)** - Production deployment architecture
- **[Performance & Monitoring](performance-monitoring.md)** - Metrics and observability
- **[Research Runner Plan](research-runner-plan.md)** - Roadmap and implementation phases
- **[Native Libraries](native-libraries.md)** - C++ performance optimizations