chore(docs): remove legacy Jekyll docs/_pages after Hugo migration
This commit is contained in:
parent 3d58387207
commit f6e506a632

7 changed files with 0 additions and 2486 deletions

@@ -1,738 +0,0 @@
---
layout: page
title: "Homelab Architecture"
permalink: /architecture/
nav_order: 1
---

# Homelab Architecture

Simple, secure architecture for ML experiments in your homelab.

## Components Overview

```mermaid
graph TB
    subgraph "Homelab Stack"
        CLI[Zig CLI]
        API[HTTPS API]
        REDIS[Redis Cache]
        FS[Local Storage]
    end

    CLI --> API
    API --> REDIS
    API --> FS
```

## Core Services

### API Server
- **Purpose**: Secure HTTPS API for ML experiments
- **Port**: 9101 (HTTPS only)
- **Auth**: API key authentication
- **Security**: Rate limiting, IP whitelisting

### Redis
- **Purpose**: Caching and job queuing
- **Port**: 6379 (localhost only)
- **Storage**: Temporary data only
- **Persistence**: Local volume

### Zig CLI
- **Purpose**: High-performance experiment management
- **Language**: Zig for maximum speed and efficiency
- **Features**:
  - Content-addressed storage with deduplication
  - SHA256-based commit ID generation
  - WebSocket communication for real-time updates
  - Rsync-based incremental file transfers
  - Multi-threaded operations
  - Secure API key authentication
  - Auto-sync monitoring with file system watching
  - Priority-based job queuing
  - Memory-efficient operations with arena allocators
## Security Architecture

```mermaid
graph LR
    USER[User] --> AUTH[API Key Auth]
    AUTH --> RATE[Rate Limiting]
    RATE --> WHITELIST[IP Whitelist]
    WHITELIST --> API[Secure API]
    API --> AUDIT[Audit Logging]
```

### Security Layers
1. **API Key Authentication** - Hashed keys with roles
2. **Rate Limiting** - 30 requests/minute
3. **IP Whitelisting** - Local networks only
4. **Fail2Ban** - Automatic IP blocking
5. **HTTPS/TLS** - Encrypted communication
6. **Audit Logging** - Complete action tracking

## Data Flow

```mermaid
sequenceDiagram
    participant CLI
    participant API
    participant Redis
    participant Storage

    CLI->>API: HTTPS Request
    API->>API: Validate Auth
    API->>Redis: Cache/Queue
    API->>Storage: Experiment Data
    Storage->>API: Results
    API->>CLI: Response
```
## Deployment Options

### Docker Compose (Recommended)
```yaml
services:
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]
    volumes: [redis_data:/data]

  api-server:
    build: .
    ports: ["9101:9101"]
    depends_on: [redis]
```

### Local Setup
```bash
./setup.sh && ./manage.sh start
```
## Network Architecture

- **Private Network**: Docker internal network
- **Localhost Access**: Redis only on localhost
- **HTTPS API**: Port 9101, TLS encrypted
- **No External Dependencies**: Everything runs locally

## Storage Architecture

```
data/
├── experiments/   # ML experiment results
├── cache/         # Temporary cache files
└── backups/       # Local backups

logs/
├── app.log        # Application logs
├── audit.log      # Security events
└── access.log     # API access logs
```
## Monitoring Architecture

Simple, lightweight monitoring:
- **Health Checks**: Service availability
- **Log Files**: Structured logging
- **Basic Metrics**: Request counts, error rates
- **Security Events**: Failed auth, rate limits

## Homelab Benefits

- ✅ **Simple Setup**: One-command installation
- ✅ **Local Only**: No external dependencies
- ✅ **Secure by Default**: HTTPS, auth, rate limiting
- ✅ **Low Resource**: Minimal CPU/memory usage
- ✅ **Easy Backup**: Local file system
- ✅ **Privacy**: Everything stays on your network
## High-Level Architecture

```mermaid
graph TB
    subgraph "Client Layer"
        CLI[CLI Tools]
        TUI[Terminal UI]
        API[REST API]
    end

    subgraph "Authentication Layer"
        Auth[Authentication Service]
        RBAC[Role-Based Access Control]
        Perm[Permission Manager]
    end

    subgraph "Core Services"
        Worker[ML Worker Service]
        DataMgr[Data Manager Service]
        Queue[Job Queue]
    end

    subgraph "Storage Layer"
        Redis[(Redis Cache)]
        DB[(SQLite/PostgreSQL)]
        Files[File Storage]
    end

    subgraph "Container Runtime"
        Podman[Podman/Docker]
        Containers[ML Containers]
    end

    CLI --> Auth
    TUI --> Auth
    API --> Auth

    Auth --> RBAC
    RBAC --> Perm

    Worker --> Queue
    Worker --> DataMgr
    Worker --> Podman

    DataMgr --> DB
    DataMgr --> Files

    Queue --> Redis

    Podman --> Containers
```
## Zig CLI Architecture

### Component Structure

```mermaid
graph TB
    subgraph "Zig CLI Components"
        Main[main.zig] --> Commands[commands/]
        Commands --> Config[config.zig]
        Commands --> Utils[utils/]
        Commands --> Net[net/]
        Commands --> Errors[errors.zig]

        subgraph "Commands"
            Init[init.zig]
            Sync[sync.zig]
            Queue[queue.zig]
            Watch[watch.zig]
            Status[status.zig]
            Monitor[monitor.zig]
            Cancel[cancel.zig]
            Prune[prune.zig]
        end

        subgraph "Utils"
            Crypto[crypto.zig]
            Storage[storage.zig]
            Rsync[rsync.zig]
        end

        subgraph "Network"
            WS[ws.zig]
        end
    end
```
### Performance Optimizations

#### Content-Addressed Storage
- **Deduplication**: Files stored by SHA256 hash
- **Space Efficiency**: Shared files across experiments
- **Fast Lookup**: Hash-based file retrieval
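The idea behind content-addressed storage is small enough to sketch in Go (the real CLI implements this in Zig; names here are illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentAddress returns the storage key for a file's bytes: its SHA-256
// hex digest. Identical content always maps to the same key, so storing
// by address deduplicates automatically.
func contentAddress(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// store is a toy in-memory content-addressed store.
type store map[string][]byte

// put saves data under its content address and reports whether it was new.
func (s store) put(data []byte) (key string, isNew bool) {
	key = contentAddress(data)
	if _, ok := s[key]; ok {
		return key, false // duplicate content: nothing stored twice
	}
	s[key] = data
	return key, true
}

func main() {
	s := store{}
	k1, new1 := s.put([]byte("model weights"))
	k2, new2 := s.put([]byte("model weights")) // same content, same key
	fmt.Println(k1 == k2, new1, new2)          // true true false
}
```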

#### Memory Management
- **Arena Allocators**: Efficient bulk allocation
- **Zero-Copy Operations**: Minimized memory copying
- **Automatic Cleanup**: Resource deallocation

#### Network Communication
- **WebSocket Protocol**: Real-time bidirectional communication
- **Connection Pooling**: Reused connections
- **Binary Messaging**: Efficient data transfer

### Security Implementation

```mermaid
graph LR
    subgraph "CLI Security"
        Config[Config File] --> Hash[SHA256 Hashing]
        Hash --> Auth[API Authentication]
        Auth --> SSH[SSH Transfer]
        SSH --> WS[WebSocket Security]
    end
```
## Core Components

### 1. Authentication & Authorization

```mermaid
graph LR
    subgraph "Auth Flow"
        Client[Client] --> APIKey[API Key]
        APIKey --> Hash[Hash Validation]
        Hash --> Roles[Role Resolution]
        Roles --> Perms[Permission Check]
        Perms --> Access[Grant/Deny Access]
    end

    subgraph "Permission Sources"
        YAML[YAML Config]
        Inline[Inline Fallback]
        Roles --> YAML
        Roles --> Inline
    end
```

**Features:**
- API key-based authentication
- Role-based access control (RBAC)
- YAML-based permission configuration
- Fallback to inline permissions
- Admin wildcard permissions
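A minimal Go sketch of the resolution order (YAML config first, inline fallback, `*` admin wildcard); role and permission names are assumptions, not the project's actual set:

```go
package main

import "fmt"

// rolePerms maps a role to its allowed actions; "*" is the admin wildcard.
type rolePerms map[string][]string

// resolve returns a role's permissions, preferring the YAML-derived config
// and falling back to an inline default when the role is absent.
func resolve(role string, fromYAML, inline rolePerms) []string {
	if perms, ok := fromYAML[role]; ok {
		return perms
	}
	return inline[role]
}

// allowed checks whether perms grant an action, honoring the "*" wildcard.
func allowed(perms []string, action string) bool {
	for _, p := range perms {
		if p == "*" || p == action {
			return true
		}
	}
	return false
}

func main() {
	yamlCfg := rolePerms{"admin": {"*"}}
	inline := rolePerms{"user": {"jobs:submit", "jobs:read"}}
	fmt.Println(allowed(resolve("admin", yamlCfg, inline), "jobs:cancel")) // true
	fmt.Println(allowed(resolve("user", yamlCfg, inline), "jobs:cancel"))  // false
}
```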

### 2. Worker Service

```mermaid
graph TB
    subgraph "Worker Architecture"
        API[HTTP API] --> Router[Request Router]
        Router --> Auth[Auth Middleware]
        Auth --> Queue[Job Queue]
        Queue --> Processor[Job Processor]
        Processor --> Runtime[Container Runtime]
        Runtime --> Storage[Result Storage]

        subgraph "Job Lifecycle"
            Submit[Submit Job] --> Queue
            Queue --> Execute[Execute]
            Execute --> Monitor[Monitor]
            Monitor --> Complete[Complete]
            Complete --> Store[Store Results]
        end
    end
```

**Responsibilities:**
- HTTP API for job submission
- Job queue management
- Container orchestration
- Result collection and storage
- Metrics and monitoring

### 3. Data Manager Service

```mermaid
graph TB
    subgraph "Data Management"
        API[Data API] --> Storage[Storage Layer]
        Storage --> Metadata[Metadata DB]
        Storage --> Files[File System]
        Storage --> Cache[Redis Cache]

        subgraph "Data Operations"
            Upload[Upload Data] --> Validate[Validate]
            Validate --> Store[Store]
            Store --> Index[Index]
            Index --> Catalog[Catalog]
        end
    end
```

**Features:**
- Data upload and validation
- Metadata management
- File system abstraction
- Caching layer
- Data catalog

### 4. Terminal UI (TUI)

```mermaid
graph TB
    subgraph "TUI Architecture"
        UI[UI Components] --> Model[Data Model]
        Model --> Update[Update Loop]
        Update --> Render[Render]

        subgraph "UI Panels"
            Jobs[Job List]
            Details[Job Details]
            Logs[Log Viewer]
            Status[Status Bar]
        end

        UI --> Jobs
        UI --> Details
        UI --> Logs
        UI --> Status
    end
```

**Components:**
- Bubble Tea framework
- Component-based architecture
- Real-time updates
- Keyboard navigation
- Theme support
## Data Flow

### Job Execution Flow

```mermaid
sequenceDiagram
    participant Client
    participant Auth
    participant Worker
    participant Queue
    participant Container
    participant Storage

    Client->>Auth: Submit job with API key
    Auth->>Client: Validate and return job ID

    Client->>Worker: Execute job request
    Worker->>Queue: Queue job
    Queue->>Worker: Job ready
    Worker->>Container: Start ML container
    Container->>Worker: Execute experiment
    Worker->>Storage: Store results
    Worker->>Client: Return results
```

### Authentication Flow

```mermaid
sequenceDiagram
    participant Client
    participant Auth
    participant PermMgr
    participant Config

    Client->>Auth: Request with API key
    Auth->>Auth: Validate key hash
    Auth->>PermMgr: Get user permissions
    PermMgr->>Config: Load YAML permissions
    Config->>PermMgr: Return permissions
    PermMgr->>Auth: Return resolved permissions
    Auth->>Client: Grant/deny access
```
## Security Architecture

### Defense in Depth

```mermaid
graph TB
    subgraph "Security Layers"
        Network[Network Security]
        Auth[Authentication]
        AuthZ[Authorization]
        Container[Container Security]
        Data[Data Protection]
        Audit[Audit Logging]
    end

    Network --> Auth
    Auth --> AuthZ
    AuthZ --> Container
    Container --> Data
    Data --> Audit
```

**Security Features:**
- API key authentication
- Role-based permissions
- Container isolation
- File system sandboxing
- Comprehensive audit logs
- Input validation and sanitization

### Container Security

```mermaid
graph TB
    subgraph "Container Isolation"
        Host[Host System]
        Podman[Podman Runtime]
        Network[Network Isolation]
        FS[File System Isolation]
        User[User Namespaces]
        ML[ML Container]

        Host --> Podman
        Podman --> Network
        Podman --> FS
        Podman --> User
        User --> ML
    end
```

**Isolation Features:**
- Rootless containers
- Network isolation
- File system sandboxing
- User namespace mapping
- Resource limits
## Configuration Architecture

### Configuration Hierarchy

```mermaid
graph TB
    subgraph "Config Sources"
        Env[Environment Variables]
        File[Config Files]
        CLI[CLI Flags]
        Defaults[Default Values]
    end

    subgraph "Config Processing"
        Merge[Config Merger]
        Validate[Schema Validator]
        Apply[Config Applier]
    end

    Env --> Merge
    File --> Merge
    CLI --> Merge
    Defaults --> Merge

    Merge --> Validate
    Validate --> Apply
```

**Configuration Priority:**
1. CLI flags (highest)
2. Environment variables
3. Configuration files
4. Default values (lowest)
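The precedence chain reduces to "first non-empty value wins", sketched here in Go (illustrative; the real merger also validates against a schema):

```go
package main

import "fmt"

// firstNonEmpty applies the documented precedence: CLI flag, then
// environment variable, then config file, then default.
func firstNonEmpty(values ...string) string {
	for _, v := range values {
		if v != "" {
			return v
		}
	}
	return ""
}

func main() {
	cliFlag := ""          // not set on the command line
	envVar := "redis:6379" // set in the environment
	fileVal := "localhost:6379"
	def := "127.0.0.1:6379"
	fmt.Println(firstNonEmpty(cliFlag, envVar, fileVal, def)) // redis:6379
}
```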

## Scalability Architecture

### Horizontal Scaling

```mermaid
graph TB
    subgraph "Scaled Architecture"
        LB[Load Balancer]
        W1[Worker 1]
        W2[Worker 2]
        W3[Worker N]
        Redis[Redis Cluster]
        Storage[Shared Storage]

        LB --> W1
        LB --> W2
        LB --> W3

        W1 --> Redis
        W2 --> Redis
        W3 --> Redis

        W1 --> Storage
        W2 --> Storage
        W3 --> Storage
    end
```

**Scaling Features:**
- Stateless worker services
- Shared job queue (Redis)
- Distributed storage
- Load balancer ready
- Health checks and monitoring
## Technology Stack

### Backend Technologies

| Component | Technology | Purpose |
|-----------|------------|---------|
| **Language** | Go 1.25+ | Core application |
| **Web Framework** | Standard library | HTTP server |
| **Authentication** | Custom | API key + RBAC |
| **Database** | SQLite/PostgreSQL | Metadata storage |
| **Cache** | Redis | Job queue & caching |
| **Containers** | Podman/Docker | Job isolation |
| **UI Framework** | Bubble Tea | Terminal UI |

### Dependencies

```go
// Core dependencies
require (
	github.com/charmbracelet/bubbletea v1.3.10 // TUI framework
	github.com/go-redis/redis/v8 v8.11.5       // Redis client
	github.com/google/uuid v1.6.0              // UUID generation
	github.com/mattn/go-sqlite3 v1.14.32       // SQLite driver
	golang.org/x/crypto v0.45.0                // Crypto utilities
	gopkg.in/yaml.v3 v3.0.1                    // YAML parsing
)
```
## Development Architecture

### Project Structure

```
fetch_ml/
├── cmd/                  # CLI applications
│   ├── worker/           # ML worker service
│   ├── tui/              # Terminal UI
│   ├── data_manager/     # Data management
│   └── user_manager/     # User management
├── internal/             # Internal packages
│   ├── auth/             # Authentication system
│   ├── config/           # Configuration management
│   ├── container/        # Container operations
│   ├── database/         # Database operations
│   ├── logging/          # Logging utilities
│   ├── metrics/          # Metrics collection
│   └── network/          # Network utilities
├── configs/              # Configuration files
├── scripts/              # Setup and utility scripts
├── tests/                # Test suites
└── docs/                 # Documentation
```

### Package Dependencies

```mermaid
graph TB
    subgraph "Application Layer"
        Worker[cmd/worker]
        TUI[cmd/tui]
        DataMgr[cmd/data_manager]
        UserMgr[cmd/user_manager]
    end

    subgraph "Service Layer"
        Auth[internal/auth]
        Config[internal/config]
        Container[internal/container]
        Database[internal/database]
    end

    subgraph "Utility Layer"
        Logging[internal/logging]
        Metrics[internal/metrics]
        Network[internal/network]
    end

    Worker --> Auth
    Worker --> Config
    Worker --> Container
    TUI --> Auth
    DataMgr --> Database
    UserMgr --> Auth

    Auth --> Logging
    Container --> Network
    Database --> Metrics
```
## Monitoring & Observability

### Metrics Collection

```mermaid
graph TB
    subgraph "Metrics Pipeline"
        App[Application] --> Metrics[Metrics Collector]
        Metrics --> Export[Prometheus Exporter]
        Export --> Prometheus[Prometheus Server]
        Prometheus --> Grafana[Grafana Dashboard]

        subgraph "Metric Types"
            Counter[Counters]
            Gauge[Gauges]
            Histogram[Histograms]
            Timer[Timers]
        end

        App --> Counter
        App --> Gauge
        App --> Histogram
        App --> Timer
    end
```

### Logging Architecture

```mermaid
graph TB
    subgraph "Logging Pipeline"
        App[Application] --> Logger[Structured Logger]
        Logger --> File[File Output]
        Logger --> Console[Console Output]
        Logger --> Syslog[Syslog Forwarder]
        Syslog --> Aggregator[Log Aggregator]
        Aggregator --> Storage[Log Storage]
        Storage --> Viewer[Log Viewer]
    end
```
## Deployment Architecture

### Container Deployment

```mermaid
graph TB
    subgraph "Deployment Stack"
        Image[Container Image]
        Registry[Container Registry]
        Orchestrator[Docker Compose]
        Config[ConfigMaps/Secrets]
        Storage[Persistent Storage]

        Image --> Registry
        Registry --> Orchestrator
        Config --> Orchestrator
        Storage --> Orchestrator
    end
```

### Service Discovery

```mermaid
graph TB
    subgraph "Service Mesh"
        Gateway[API Gateway]
        Discovery[Service Discovery]
        Worker[Worker Service]
        Data[Data Service]
        Redis[Redis Cluster]

        Gateway --> Discovery
        Discovery --> Worker
        Discovery --> Data
        Discovery --> Redis
    end
```
## Future Architecture Considerations

### Microservices Evolution

- **API Gateway**: Centralized routing and authentication
- **Service Mesh**: Inter-service communication
- **Event Streaming**: Kafka for job events
- **Distributed Tracing**: OpenTelemetry integration
- **Multi-tenant**: Tenant isolation and quotas

### Homelab Features

- **Docker Compose**: Simple container orchestration
- **Local Development**: Easy setup and testing
- **Security**: Built-in authentication and encryption
- **Monitoring**: Basic health checks and logging

---

This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.

@@ -1,165 +0,0 @@
---
layout: page
title: "CI/CD Pipeline"
permalink: /cicd/
nav_order: 5
---

# CI/CD Pipeline

Automated testing, building, and releasing for fetch_ml.

## Workflows

### CI Workflow (`.github/workflows/ci.yml`)

Runs on every push to `main`/`develop` and all pull requests.

**Jobs:**
1. **test** - Go backend tests with Redis
2. **build** - Build all binaries (Go + Zig CLI)
3. **test-scripts** - Validate deployment scripts
4. **security-scan** - Trivy and Gosec security scans
5. **docker-build** - Build and push Docker images (main branch only)

**Test Coverage:**
- Go unit tests with race detection
- `internal/queue` package tests
- Zig CLI tests
- Integration tests
- Security audits

### Release Workflow (`.github/workflows/release.yml`)

Runs on version tags (e.g., `v1.0.0`).

**Jobs:**

1. **build-cli** (matrix build)
   - Linux x86_64 (static musl)
   - macOS x86_64
   - macOS ARM64
   - Downloads platform-specific static rsync
   - Embeds rsync for zero-dependency releases

2. **build-go-backends**
   - Cross-platform Go builds
   - api-server, worker, tui, data_manager, user_manager

3. **create-release**
   - Collects all artifacts
   - Generates SHA256 checksums
   - Creates GitHub release with notes
## Release Process

### Creating a Release

```bash
# 1. Update version
git tag v1.0.0

# 2. Push tag
git push origin v1.0.0

# 3. CI automatically builds and releases
```

### Release Artifacts

**CLI Binaries (with embedded rsync):**
- `ml-linux-x86_64.tar.gz` (~450-650KB)
- `ml-macos-x86_64.tar.gz` (~450-650KB)
- `ml-macos-arm64.tar.gz` (~450-650KB)

**Go Backends:**
- `fetch_ml_api-server.tar.gz`
- `fetch_ml_worker.tar.gz`
- `fetch_ml_tui.tar.gz`
- `fetch_ml_data_manager.tar.gz`
- `fetch_ml_user_manager.tar.gz`

**Checksums:**
- `checksums.txt` - Combined SHA256 sums
- Individual `.sha256` files per binary
## Development Workflow

### Local Testing

```bash
# Run all tests
make test

# Run specific package tests
go test ./internal/queue/...

# Build CLI
cd cli && zig build dev

# Run formatters and linters
make lint

# Security scans are handled automatically in CI by the `security-scan` job
```

#### Optional heavy end-to-end tests

Some e2e tests exercise full Docker deployments and performance scenarios and are **skipped by default** to keep local/CI runs fast. You can enable them explicitly with environment variables:

```bash
# Run Docker deployment e2e tests
FETCH_ML_E2E_DOCKER=1 go test ./tests/e2e/...

# Run performance-oriented e2e tests
FETCH_ML_E2E_PERF=1 go test ./tests/e2e/...
```

Without these variables, `TestDockerDeploymentE2E` and `TestPerformanceE2E` will `t.Skip`, while all lighter e2e tests still run.

### Pull Request Checks

All PRs must pass:
- ✅ Go tests (with Redis)
- ✅ CLI tests
- ✅ Security scans
- ✅ Code linting
- ✅ Build verification

## Configuration

### Environment Variables

```yaml
GO_VERSION: '1.25.0'
ZIG_VERSION: '0.15.2'
```

### Secrets

Required for releases:
- `GITHUB_TOKEN` - Automatic, provided by GitHub Actions

## Monitoring

### Build Status

Check workflow runs at:
```
https://github.com/jfraeys/fetch_ml/actions
```

### Artifacts

Download build artifacts from:
- Successful workflow runs (30-day retention)
- GitHub Releases (permanent)

---

For implementation details:
- [.github/workflows/ci.yml](https://github.com/jfraeys/fetch_ml/blob/main/.github/workflows/ci.yml)
- [.github/workflows/release.yml](https://github.com/jfraeys/fetch_ml/blob/main/.github/workflows/release.yml)

@@ -1,404 +0,0 @@
---
layout: page
title: "CLI Reference"
permalink: /cli-reference/
nav_order: 2
---

# Fetch ML CLI Reference

Comprehensive command-line tools for managing ML experiments in your homelab, built around a high-performance Zig CLI.

## Overview

Fetch ML provides a CLI toolkit built with performance and security in mind:

- **Zig CLI** - High-performance experiment management written in Zig
- **Go Commands** - API server, TUI, and data management utilities
- **Management Scripts** - Service orchestration and deployment
- **Setup Scripts** - One-command installation and configuration

## Zig CLI (`./cli/zig-out/bin/ml`)

High-performance command-line interface for experiment management, written in Zig for speed and efficiency.

### Available Commands

| Command | Description | Example |
|---------|-------------|---------|
| `init` | Interactive configuration setup | `ml init` |
| `sync` | Sync project to worker with deduplication | `ml sync ./project --name myjob --queue` |
| `queue` | Queue job for execution | `ml queue myjob --commit abc123 --priority 8` |
| `status` | Get system and worker status | `ml status` |
| `monitor` | Launch TUI monitoring via SSH | `ml monitor` |
| `cancel` | Cancel running job | `ml cancel job123` |
| `prune` | Clean up old experiments | `ml prune --keep 10` |
| `watch` | Auto-sync directory on changes | `ml watch ./project --queue` |
### Command Details

#### `init` - Configuration Setup
```bash
ml init
```
Creates a configuration template at `~/.ml/config.toml` with:
- Worker connection details
- API authentication
- Base paths and ports

#### `sync` - Project Synchronization
```bash
# Basic sync
ml sync ./my-project

# Sync with custom name and queue
ml sync ./my-project --name "experiment-1" --queue

# Sync with priority
ml sync ./my-project --priority 9
```

**Features:**
- Content-addressed storage for deduplication
- SHA256 commit ID generation
- Rsync-based file transfer
- Automatic queuing (with `--queue` flag)

#### `queue` - Job Management
```bash
# Queue with commit ID
ml queue my-job --commit abc123def456

# Queue with priority (1-10, default 5)
ml queue my-job --commit abc123 --priority 8
```

**Features:**
- WebSocket-based communication
- Priority queuing system
- API key authentication

#### `watch` - Auto-Sync Monitoring
```bash
# Watch directory for changes
ml watch ./project

# Watch and auto-queue on changes
ml watch ./project --name "dev-exp" --queue
```

**Features:**
- Real-time file system monitoring
- Automatic re-sync on changes
- Configurable polling interval (2 seconds)
- Commit ID comparison for efficiency

#### `prune` - Cleanup Management
```bash
# Keep last N experiments
ml prune --keep 20

# Remove experiments older than N days
ml prune --older-than 30
```

#### `monitor` - Remote Monitoring
```bash
ml monitor
```
Launches the TUI interface via SSH for real-time monitoring.

#### `cancel` - Job Cancellation
```bash
ml cancel running-job-id
```
Cancels a currently running job by ID.

### Configuration

The Zig CLI reads configuration from `~/.ml/config.toml`:

```toml
worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"
```

### Performance Features

- **Content-Addressed Storage**: Automatic deduplication of identical files
- **Incremental Sync**: Only transfers changed files
- **SHA256 Hashing**: Reliable commit ID generation
- **WebSocket Communication**: Efficient real-time messaging
- **Multi-threaded**: Concurrent operations where applicable
## Go Commands

### API Server (`./cmd/api-server/main.go`)
Main HTTPS API server for experiment management.

```bash
# Build and run
go run ./cmd/api-server/main.go

# With configuration
./bin/api-server --config configs/config-local.yaml
```

**Features:**
- HTTPS-only communication
- API key authentication
- Rate limiting and IP whitelisting
- WebSocket support for real-time updates
- Redis integration for caching

### TUI (`./cmd/tui/main.go`)
Terminal User Interface for monitoring experiments.

```bash
# Launch TUI
go run ./cmd/tui/main.go

# With custom config
./tui --config configs/config-local.yaml
```

**Features:**
- Real-time experiment monitoring
- Interactive job management
- Status visualization
- Log viewing

### Data Manager (`./cmd/data_manager/`)
Utilities for data synchronization and management.

```bash
# Sync data
./data_manager --sync ./data

# Clean old data
./data_manager --cleanup --older-than 30d
```

### Config Lint (`./cmd/configlint/main.go`)
Configuration validation and linting tool.

```bash
# Validate configuration
./configlint configs/config-local.yaml

# Check schema compliance
./configlint --schema configs/schema/config_schema.yaml
```
## Management Script (`./tools/manage.sh`)

Simple service management for your homelab.

### Commands
```bash
./tools/manage.sh start     # Start all services
./tools/manage.sh stop      # Stop all services
./tools/manage.sh status    # Check service status
./tools/manage.sh logs      # View logs
./tools/manage.sh monitor   # Basic monitoring
./tools/manage.sh security  # Security status
./tools/manage.sh cleanup   # Clean project artifacts
```

## Setup Script (`./setup.sh`)

One-command homelab setup.

### Usage
```bash
# Full setup
./setup.sh

# Setup includes:
# - SSL certificate generation
# - Configuration creation
# - Build all components
# - Start Redis
# - Setup Fail2Ban (if available)
```
## API Testing

Test the API with curl:

```bash
# Health check
curl -k -H 'X-API-Key: password' https://localhost:9101/health

# List experiments
curl -k -H 'X-API-Key: password' https://localhost:9101/experiments

# Submit experiment
curl -k -X POST -H 'X-API-Key: password' \
  -H 'Content-Type: application/json' \
  -d '{"name":"test","config":{"type":"basic"}}' \
  https://localhost:9101/experiments
```
## Zig CLI Architecture

The Zig CLI is designed for performance and reliability:

### Core Components
- **Commands** (`cli/src/commands/`): Individual command implementations
- **Config** (`cli/src/config.zig`): Configuration management
- **Network** (`cli/src/net/ws.zig`): WebSocket client implementation
- **Utils** (`cli/src/utils/`): Cryptography, storage, and rsync utilities
- **Errors** (`cli/src/errors.zig`): Centralized error handling

### Performance Optimizations
- **Content-Addressed Storage**: Deduplicates identical files across experiments
- **SHA256 Hashing**: Fast, reliable commit ID generation
- **Rsync Integration**: Efficient incremental file transfers
- **WebSocket Protocol**: Low-latency communication with worker
- **Memory Management**: Efficient allocation with Zig's allocator system

### Security Features
- **API Key Hashing**: Secure authentication token handling
- **SSH Integration**: Secure file transfers
- **Input Validation**: Comprehensive argument checking
- **Error Handling**: Secure error reporting without information leakage
## Configuration

Main configuration file: `configs/config-local.yaml`

### Key Settings

```yaml
auth:
  enabled: true
  api_keys:
    homelab_user:
      hash: "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8"
      admin: true

server:
  address: ":9101"
  tls:
    enabled: true
    cert_file: "./ssl/cert.pem"
    key_file: "./ssl/key.pem"

security:
  rate_limit:
    enabled: true
    requests_per_minute: 30
  ip_whitelist:
    - "127.0.0.1"
    - "::1"
    - "192.168.0.0/16"
    - "10.0.0.0/8"
```
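
The `hash` field stores the hex-encoded SHA-256 digest of the raw API key, not the key itself (the sample value above is in fact the digest of the string `password`). A minimal sketch of producing such a hash; the helper name `hashAPIKey` is illustrative, not the server's actual function:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashAPIKey returns the hex-encoded SHA-256 digest of a raw API key,
// the format expected in the api_keys "hash" field.
func hashAPIKey(key string) string {
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(hashAPIKey("password"))
	// 5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8
}
```

Because only the digest is stored, leaking the config file does not directly reveal the key used in the `X-API-Key` header.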

## Docker Commands

If using Docker Compose:

```bash
# Start services
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down

# Check status
docker-compose ps
```
## Troubleshooting

### Common Issues

**Zig CLI not found:**
```bash
# Build the CLI
cd cli && make build

# Check binary exists
ls -la ./cli/zig-out/bin/ml
```

**Configuration not found:**
```bash
# Create configuration
./cli/zig-out/bin/ml init

# Check config file
ls -la ~/.ml/config.toml
```

**Worker connection failed:**
```bash
# Test SSH connection
ssh -p 22 mluser@worker.local

# Check configuration
cat ~/.ml/config.toml
```

**Sync not working:**
```bash
# Check rsync availability
rsync --version

# Test manual sync
rsync -avz ./project/ mluser@worker.local:/tmp/test/
```

**WebSocket connection failed:**
```bash
# Check worker WebSocket port
telnet worker.local 9100

# Verify API key
./cli/zig-out/bin/ml status
```

**API not responding:**
```bash
./tools/manage.sh status
./tools/manage.sh logs
```

**Authentication failed:**
```bash
# Check API key in config-local.yaml
grep -A 5 "api_keys:" configs/config-local.yaml
```

**Redis connection failed:**
```bash
# Check Redis status
redis-cli ping

# Start Redis
redis-server
```

### Getting Help

```bash
# CLI help
./cli/zig-out/bin/ml help

# Management script help
./tools/manage.sh help

# Check all available commands
make help
```

---

**That's it for the CLI reference!** For complete setup instructions, see the main [README](/).
@ -1,310 +0,0 @@
---
layout: page
title: "Operations Runbook"
permalink: /operations/
nav_order: 6
---

# Operations Runbook

Operational guide for troubleshooting and maintaining the ML experiment system.
## Task Queue Operations

### Monitoring Queue Health

```redis
# Check queue depth
ZCARD task:queue

# List pending tasks
ZRANGE task:queue 0 -1 WITHSCORES

# Check dead letter queue
KEYS task:dlq:*
```

### Handling Stuck Tasks

**Symptom:** Tasks stuck in "running" status

**Diagnosis:**
```bash
# Check for expired leases
redis-cli GET task:{task-id}
# Look for a LeaseExpiry in the past
```

**Remediation:**
Tasks with expired leases are automatically reclaimed every minute. To force immediate reclamation:
```bash
# Restart worker to trigger reclaim cycle
systemctl restart ml-worker
```

### Dead Letter Queue Management

**View failed tasks:**
```redis
KEYS task:dlq:*
```

**Inspect failed task:**
```redis
GET task:dlq:{task-id}
```

**Retry from DLQ:**
```bash
# Manual retry (requires custom script)
# 1. Get task from DLQ
# 2. Reset retry count
# 3. Re-queue task
```

### Worker Crashes

**Symptom:** Worker disappeared mid-task

**What Happens:**
1. Lease expires after 30 minutes (default)
2. Background reclaim job detects expired lease
3. Task is retried (up to 3 attempts)
4. After max retries → Dead Letter Queue

**Prevention:**
- Monitor worker heartbeats
- Set up alerts for worker down
- Use a process manager (systemd, supervisor)
## Worker Operations

### Graceful Shutdown

```bash
# Send SIGTERM for graceful shutdown
kill -TERM $(pgrep ml-worker)

# Worker will:
# 1. Stop accepting new tasks
# 2. Finish active tasks (up to 5min timeout)
# 3. Release all leases
# 4. Exit cleanly
```

### Force Shutdown

```bash
# Force kill (leases will be reclaimed automatically)
kill -9 $(pgrep ml-worker)
```

### Worker Heartbeat Monitoring

```redis
# Check worker heartbeats
HGETALL worker:heartbeat

# Example output:
# worker-abc123  1701234567
# worker-def456  1701234580
```

**Alert if:** Heartbeat timestamp > 5 minutes old

## Redis Operations

### Backup

```bash
# Manual backup
redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb
```

### Restore

```bash
# Stop Redis
systemctl stop redis

# Restore snapshot
cp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb

# Start Redis
systemctl start redis
```

### Memory Management

```redis
# Check memory usage
INFO memory

# Evict old data if needed
FLUSHDB  # DANGER: Clears all data!
```
## Common Issues

### Issue: Queue Growing Unbounded

**Symptoms:**
- `ZCARD task:queue` keeps increasing
- No workers processing tasks

**Diagnosis:**
```bash
# Check worker status
systemctl status ml-worker

# Check logs
journalctl -u ml-worker -n 100
```

**Resolution:**
1. Verify workers are running
2. Check Redis connectivity
3. Verify lease configuration

### Issue: High Retry Rate

**Symptoms:**
- Many tasks in the DLQ
- `retry_count` field high on tasks

**Diagnosis:**
```bash
# Check worker logs for errors
journalctl -u ml-worker | grep "retry"

# Look for patterns (network issues, resource limits, etc.)
```

**Resolution:**
- Fix the underlying issue (network, resources, etc.)
- Adjust retry limits if failures are permanent
- Increase the task timeout if jobs are slow

### Issue: Leases Expiring Prematurely

**Symptoms:**
- Tasks retried even though the worker is healthy
- Logs show "lease expired" frequently

**Diagnosis:**
```bash
# Check worker config
grep -A3 "lease" configs/worker-config.yaml
```

```yaml
task_lease_duration: 30m   # Too short?
heartbeat_interval: 1m     # Too infrequent?
```

**Resolution:**
```yaml
# Increase lease duration for long-running jobs
task_lease_duration: 60m
heartbeat_interval: 30s   # More frequent heartbeats
```
## Performance Tuning

### Worker Concurrency

```yaml
# worker-config.yaml
max_workers: 4  # Number of parallel tasks

# Adjust based on:
# - CPU cores available
# - Memory per task
# - GPU availability
```

### Redis Configuration

```conf
# /etc/redis/redis.conf

# Persistence
save 900 1
save 300 10

# Memory
maxmemory 2gb
maxmemory-policy noeviction

# Performance
tcp-keepalive 300
timeout 0
```

## Alerting Rules

### Critical Alerts

1. **Worker Down** (no heartbeat > 5min)
2. **Queue Depth** > 1000 tasks
3. **DLQ Growth** > 100 tasks/hour
4. **Redis Down** (connection failed)

### Warning Alerts

1. **High Retry Rate** > 10% of tasks
2. **Slow Queue Drain** (depth increasing over 1 hour)
3. **Worker Memory** > 80% usage

## Health Checks

```bash
#!/bin/bash
# health-check.sh

# Check Redis
redis-cli PING || echo "Redis DOWN"

# Check worker heartbeat (treat a missing heartbeat as stale)
WORKER_ID=$(cat /var/run/ml-worker.pid)
LAST_HB=$(redis-cli HGET worker:heartbeat "$WORKER_ID")
NOW=$(date +%s)
if [ -z "$LAST_HB" ] || [ $((NOW - LAST_HB)) -gt 300 ]; then
    echo "Worker heartbeat stale"
fi

# Check queue depth
DEPTH=$(redis-cli ZCARD task:queue)
if [ "$DEPTH" -gt 1000 ]; then
    echo "Queue depth critical: $DEPTH"
fi
```
## Runbook Checklist

### Daily Operations
- [ ] Check queue depth
- [ ] Verify worker heartbeats
- [ ] Review DLQ for patterns
- [ ] Check Redis memory usage

### Weekly Operations
- [ ] Review retry rates
- [ ] Analyze failed task patterns
- [ ] Backup Redis snapshot
- [ ] Review worker logs

### Monthly Operations
- [ ] Performance tuning review
- [ ] Capacity planning
- [ ] Update documentation
- [ ] Test disaster recovery

---

**For homelab setups:**
Most of these operations can be simplified. Focus on:
- Basic monitoring (queue depth, worker status)
- Periodic Redis backups
- Graceful shutdowns for maintenance

@ -1,322 +0,0 @@
---
layout: page
title: "Task Queue Architecture"
permalink: /queue/
nav_order: 3
---

# Task Queue Architecture

The task queue system enables reliable job processing between the API server and workers using Redis.
## Overview

```mermaid
graph LR
    CLI[CLI/Client] -->|WebSocket| API[API Server]
    API -->|Enqueue| Redis[(Redis)]
    Redis -->|Dequeue| Worker[Worker]
    Worker -->|Update Status| Redis
```

## Components

### TaskQueue (`internal/queue`)

Shared package used by both the API server and the worker for job management.

#### Task Structure

```go
type Task struct {
    ID        string            // Unique task ID (UUID)
    JobName   string            // User-defined job name
    Args      string            // Job arguments
    Status    string            // queued, running, completed, failed
    Priority  int64             // Higher = executed first
    CreatedAt time.Time
    StartedAt *time.Time
    EndedAt   *time.Time
    WorkerID  string
    Error     string
    Datasets  []string
    Metadata  map[string]string // commit_id, user, etc.
}
```

#### TaskQueue Interface

```go
// Initialize queue
queue, err := queue.NewTaskQueue(queue.Config{
    RedisAddr:     "localhost:6379",
    RedisPassword: "",
    RedisDB:       0,
})

// Add task (API server)
task := &queue.Task{
    ID:       uuid.New().String(),
    JobName:  "train-model",
    Status:   "queued",
    Priority: 5,
    Metadata: map[string]string{
        "commit_id": commitID,
        "user":      username,
    },
}
err = queue.AddTask(task)

// Get next task (Worker)
task, err := queue.GetNextTask()

// Update task status
task.Status = "running"
err = queue.UpdateTask(task)
```
## Data Flow

### Job Submission Flow

```mermaid
sequenceDiagram
    participant CLI
    participant API
    participant Redis
    participant Worker

    CLI->>API: Queue Job (WebSocket)
    API->>API: Create Task (UUID)
    API->>Redis: ZADD task:queue
    API->>Redis: SET task:{id}
    API->>CLI: Success Response

    Worker->>Redis: ZPOPMAX task:queue
    Redis->>Worker: Task ID
    Worker->>Redis: GET task:{id}
    Redis->>Worker: Task Data
    Worker->>Worker: Execute Job
    Worker->>Redis: Update Status
```

### Protocol

**CLI → API** (Binary WebSocket):
```
[opcode:1][api_key_hash:64][commit_id:64][priority:1][job_name_len:1][job_name:var]
```

**API → Redis**:
- Priority queue: `ZADD task:queue {priority} {task_id}`
- Task data: `SET task:{id} {json}`
- Status: `HSET task:status:{job_name} ...`

**Worker ← Redis**:
- Poll: `ZPOPMAX task:queue 1` (highest priority first)
- Fetch: `GET task:{id}`
## Redis Data Structures

### Keys

```
task:queue               # ZSET: priority queue
task:{uuid}              # STRING: task JSON data
task:status:{job_name}   # HASH: job status
worker:heartbeat         # HASH: worker health
job:metrics:{job_name}   # HASH: job metrics
```

### Priority Queue (ZSET)

```redis
ZADD task:queue 10 "uuid-1"   # Priority 10
ZADD task:queue 5 "uuid-2"    # Priority 5
ZPOPMAX task:queue 1          # Returns uuid-1 (highest)
```
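
The ZSET semantics can be mirrored in a few lines of Go, which makes the dequeue order concrete: the member with the highest score is popped first. This in-memory analogue is for illustration only; the real queue lives in Redis:

```go
package main

import (
	"fmt"
	"sort"
)

// entry pairs a task ID with its priority score, mirroring a ZSET member.
type entry struct {
	id       string
	priority int64
}

// popMax removes and returns the highest-priority entry, the in-memory
// analogue of ZPOPMAX task:queue 1. Ties break by insertion order here;
// Redis breaks them lexicographically by member.
func popMax(q []entry) (entry, []entry) {
	sort.SliceStable(q, func(i, j int) bool { return q[i].priority > q[j].priority })
	return q[0], q[1:]
}

func main() {
	q := []entry{{"uuid-2", 5}, {"uuid-1", 10}}
	top, rest := popMax(q)
	fmt.Println(top.id, len(rest)) // uuid-1 1
}
```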

## API Server Integration

### Initialization

```go
// cmd/api-server/main.go
queueCfg := queue.Config{
    RedisAddr:     cfg.Redis.Addr,
    RedisPassword: cfg.Redis.Password,
    RedisDB:       cfg.Redis.DB,
}
taskQueue, err := queue.NewTaskQueue(queueCfg)
```

### WebSocket Handler

```go
// internal/api/ws.go (abridged)
func (h *WSHandler) handleQueueJob(conn *websocket.Conn, payload []byte) error {
    // Parse request
    apiKeyHash, commitID, priority, jobName := parsePayload(payload)

    // Create task with unique ID
    taskID := uuid.New().String()
    task := &queue.Task{
        ID:       taskID,
        JobName:  jobName,
        Status:   "queued",
        Priority: int64(priority),
        Metadata: map[string]string{
            "commit_id": commitID,
            "user":      user,
        },
    }

    // Enqueue
    if err := h.queue.AddTask(task); err != nil {
        return h.sendErrorPacket(conn, ErrorCodeDatabaseError, ...)
    }

    return h.sendSuccessPacket(conn, "Job queued")
}
```
## Worker Integration

### Task Polling

```go
// cmd/worker/worker_server.go (abridged)
func (w *Worker) Start() error {
    for {
        task, err := w.queue.WaitForNextTask(ctx, 5*time.Second)
        if task != nil {
            go w.executeTask(task)
        }
    }
}
```

### Task Execution

```go
func (w *Worker) executeTask(task *queue.Task) {
    // Update status
    task.Status = "running"
    task.StartedAt = &now
    w.queue.UpdateTaskWithMetrics(task, "start")

    // Execute
    err := w.runJob(task)

    // Finalize
    task.Status = "completed" // or "failed"
    task.EndedAt = &endTime
    task.Error = err.Error() // if err != nil
    w.queue.UpdateTaskWithMetrics(task, "final")
}
```

## Configuration

### API Server (`configs/config.yaml`)

```yaml
redis:
  addr: "localhost:6379"
  password: ""
  db: 0
```

### Worker (`configs/worker-config.yaml`)

```yaml
redis:
  addr: "localhost:6379"
  password: ""
  db: 0

metrics_flush_interval: 500ms
```
## Monitoring

### Queue Depth

```go
depth, err := queue.QueueDepth()
fmt.Printf("Pending tasks: %d\n", depth)
```

### Worker Heartbeat

```go
// Worker sends a heartbeat every 30s
err := queue.Heartbeat(workerID)
```

### Metrics

```redis
HGETALL job:metrics:{job_name}
# Returns: timestamp, tasks_start, tasks_final, etc.
```

## Error Handling

### Task Failures

```go
if err := w.runJob(task); err != nil {
    task.Status = "failed"
    task.Error = err.Error()
    w.queue.UpdateTask(task)
}
```

### Redis Connection Loss

```go
// TaskQueue automatically reconnects.
// Workers should implement retry logic:
for retries := 0; retries < 3; retries++ {
    task, err := queue.GetNextTask()
    if err == nil {
        break
    }
    time.Sleep(backoff)
}
```

## Testing

```go
// tests using miniredis
s, _ := miniredis.Run()
defer s.Close()

tq, _ := queue.NewTaskQueue(queue.Config{
    RedisAddr: s.Addr(),
})

task := &queue.Task{ID: "test-1", JobName: "test"}
tq.AddTask(task)

fetched, _ := tq.GetNextTask()
// assert fetched.ID == "test-1"
```

## Best Practices

1. **Unique Task IDs**: Always use UUIDs to avoid conflicts
2. **Metadata**: Store commit_id and user in task metadata
3. **Priority**: Higher values execute first (0-255 range)
4. **Status Updates**: Update status at each lifecycle stage
5. **Error Logging**: Store detailed errors in task.Error
6. **Heartbeats**: Workers should send heartbeats regularly
7. **Metrics**: Use UpdateTaskWithMetrics for atomic updates

---

For implementation details, see:
- [internal/queue/task.go](https://github.com/jfraeys/fetch_ml/blob/main/internal/queue/task.go)
- [internal/queue/queue.go](https://github.com/jfraeys/fetch_ml/blob/main/internal/queue/queue.go)

@ -1,95 +0,0 @@
---
layout: page
title: "Redis High Availability (Optional)"
permalink: /redis-ha/
nav_order: 7
---

# Redis High Availability

**Note:** This is optional for homelab setups. A single Redis instance is sufficient for most use cases.

## When You Need HA

Consider Redis HA if:
- You run production workloads
- Uptime > 99.9% is required
- You can't afford to lose queued tasks
- Multiple workers run across machines

## Redis Sentinel (Recommended)

### Setup

```yaml
# docker-compose.yml
version: '3.8'
services:
  redis-master:
    image: redis:7-alpine
    command: redis-server --maxmemory 2gb

  redis-replica:
    image: redis:7-alpine
    command: redis-server --slaveof redis-master 6379

  redis-sentinel-1:
    image: redis:7-alpine
    command: redis-sentinel /etc/redis/sentinel.conf
    volumes:
      - ./sentinel.conf:/etc/redis/sentinel.conf
```

**sentinel.conf:**
```conf
sentinel monitor mymaster redis-master 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 10000
```

### Application Configuration

```yaml
# worker-config.yaml
redis_addr: "redis-sentinel-1:26379,redis-sentinel-2:26379"
redis_master_name: "mymaster"
```

## Redis Cluster (Advanced)

For larger deployments with sharding needs.

```yaml
# Minimum 3 masters + 3 replicas
services:
  redis-1:
    image: redis:7-alpine
    command: redis-server --cluster-enabled yes

  redis-2:
    # ... similar config
```

## Homelab Alternative: Persistence Only

**For most homelabs, just enable persistence:**

```yaml
# docker-compose.yml
services:
  redis:
    image: redis:7-alpine
    command: redis-server --appendonly yes
    volumes:
      - redis_data:/data

volumes:
  redis_data:
```

This ensures tasks survive Redis restarts without full HA complexity.

---

**Recommendation:** Start simple. Add HA only if you experience actual downtime issues.

@ -1,452 +0,0 @@
---
layout: page
title: "Zig CLI Guide"
permalink: /zig-cli/
nav_order: 3
---

# Zig CLI Guide

High-performance command-line interface for ML experiment management, written in Zig for maximum speed and efficiency.

## Overview

The Zig CLI (`ml`) is the primary interface for managing ML experiments in your homelab. Built with Zig, it provides exceptional performance for file operations, network communication, and experiment management.

## Installation

### Pre-built Binaries (Recommended)

Download from [GitHub Releases](https://github.com/jfraeys/fetch_ml/releases):

```bash
# Download for your platform
curl -LO https://github.com/jfraeys/fetch_ml/releases/latest/download/ml-<platform>.tar.gz

# Extract
tar -xzf ml-<platform>.tar.gz

# Install
chmod +x ml-<platform>
sudo mv ml-<platform> /usr/local/bin/ml

# Verify
ml --help
```

**Platforms:**
- `ml-linux-x86_64.tar.gz` - Linux (fully static, zero dependencies)
- `ml-macos-x86_64.tar.gz` - macOS Intel
- `ml-macos-arm64.tar.gz` - macOS Apple Silicon

All release binaries include **embedded static rsync** for complete independence.

### Build from Source

**Development Build** (uses system rsync):
```bash
cd cli
zig build dev
./zig-out/dev/ml-dev --help
```

**Production Build** (embedded rsync):
```bash
cd cli
# For testing: uses rsync wrapper
zig build prod

# For release with static rsync:
# 1. Place static rsync binary at src/assets/rsync_release.bin
# 2. Build
zig build prod
strip zig-out/prod/ml  # Optional: reduce size

# Verify
./zig-out/prod/ml --help
ls -lh zig-out/prod/ml
```

See [cli/src/assets/README.md](https://github.com/jfraeys/fetch_ml/blob/main/cli/src/assets/README.md) for details on obtaining static rsync binaries.

### Verify Installation

```bash
ml --help
ml --version  # Shows build config
```
## Quick Start

1. **Initialize Configuration**
   ```bash
   ./cli/zig-out/bin/ml init
   ```

2. **Sync Your First Project**
   ```bash
   ./cli/zig-out/bin/ml sync ./my-project --queue
   ```

3. **Monitor Progress**
   ```bash
   ./cli/zig-out/bin/ml status
   ```

## Command Reference

### `init` - Configuration Setup

Initialize the CLI configuration file.

```bash
ml init
```

**Creates:** `~/.ml/config.toml`

**Configuration Template:**
```toml
worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"
```
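
The template above is flat `key = "value"` TOML, which is easy to consume from any tooling around the CLI. A minimal parsing sketch under that assumption (`parseConfig` is hypothetical, not part of the project, and is not a full TOML parser):

```go
package main

import (
	"fmt"
	"strings"
)

// parseConfig reads flat key = "value" lines like those in
// ~/.ml/config.toml. It skips blanks and #-comments, and does not
// handle TOML tables or inline comments.
func parseConfig(text string) map[string]string {
	cfg := make(map[string]string)
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, val, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		cfg[strings.TrimSpace(key)] = strings.Trim(strings.TrimSpace(val), `"`)
	}
	return cfg
}

func main() {
	cfg := parseConfig("worker_host = \"worker.local\"\nworker_port = 22")
	fmt.Println(cfg["worker_host"], cfg["worker_port"]) // worker.local 22
}
```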

### `sync` - Project Synchronization

Sync project files to the worker with intelligent deduplication.

```bash
# Basic sync
ml sync ./project

# Sync with custom name and auto-queue
ml sync ./project --name "experiment-1" --queue

# Sync with priority
ml sync ./project --priority 8
```

**Options:**
- `--name <name>`: Custom experiment name
- `--queue`: Automatically queue after sync
- `--priority N`: Set priority (1-10, default 5)

**Features:**
- **Content-Addressed Storage**: Automatic deduplication
- **SHA256 Commit IDs**: Reliable change detection
- **Incremental Transfer**: Only sync changed files
- **Rsync Backend**: Efficient file transfer

### `queue` - Job Management

Queue experiments for execution on the worker.

```bash
# Queue with commit ID
ml queue my-job --commit abc123def456

# Queue with priority
ml queue my-job --commit abc123 --priority 8
```

**Options:**
- `--commit <id>`: Commit ID from sync output
- `--priority N`: Execution priority (1-10)

**Features:**
- **WebSocket Communication**: Real-time job submission
- **Priority Queuing**: Higher priority jobs run first
- **API Authentication**: Secure job submission

### `watch` - Auto-Sync Monitoring

Monitor directories for changes and auto-sync.

```bash
# Watch for changes
ml watch ./project

# Watch and auto-queue on changes
ml watch ./project --name "dev-exp" --queue
```

**Options:**
- `--name <name>`: Custom experiment name
- `--queue`: Auto-queue on changes
- `--priority N`: Set priority for queued jobs

**Features:**
- **Real-time Monitoring**: 2-second polling interval
- **Change Detection**: File modification time tracking
- **Commit Comparison**: Only sync when content changes
- **Automatic Queuing**: Seamless development workflow

### `status` - System Status

Check system and worker status.

```bash
ml status
```

**Displays:**
- Worker connectivity
- Queue status
- Running jobs
- System health

### `monitor` - Remote Monitoring

Launch a TUI interface via SSH for real-time monitoring.

```bash
ml monitor
```

**Features:**
- **Real-time Updates**: Live experiment status
- **Interactive Interface**: Browse and manage experiments
- **SSH Integration**: Secure remote access

### `cancel` - Job Cancellation

Cancel running or queued jobs.

```bash
ml cancel job-id
```

**Options:**
- `job-id`: Job identifier from status output

### `prune` - Cleanup Management

Clean up old experiments to save space.

```bash
# Keep last N experiments
ml prune --keep 20

# Remove experiments older than N days
ml prune --older-than 30
```

**Options:**
- `--keep N`: Keep the N most recent experiments
- `--older-than N`: Remove experiments older than N days
## Architecture

### Core Components

```
cli/src/
├── commands/          # Command implementations
│   ├── init.zig       # Configuration setup
│   ├── sync.zig       # Project synchronization
│   ├── queue.zig      # Job management
│   ├── watch.zig      # Auto-sync monitoring
│   ├── status.zig     # System status
│   ├── monitor.zig    # Remote monitoring
│   ├── cancel.zig     # Job cancellation
│   └── prune.zig      # Cleanup operations
├── config.zig         # Configuration management
├── errors.zig         # Error handling
├── net/               # Network utilities
│   └── ws.zig         # WebSocket client
└── utils/             # Utility functions
    ├── crypto.zig     # Hashing and encryption
    ├── storage.zig    # Content-addressed storage
    └── rsync.zig      # File synchronization
```

### Performance Features

#### Content-Addressed Storage
- **Deduplication**: Identical files shared across experiments
- **Hash-based Storage**: Files stored by SHA256 hash
- **Space Efficiency**: Can reduce storage by up to 90%

#### SHA256 Commit IDs
- **Reliable Detection**: Cryptographic change detection
- **Collision Resistance**: Effectively unique identifiers
- **Fast Computation**: Optimized for large directories
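
The core of content-addressed storage is storing each blob under its own digest, so two identical files occupy one slot. A minimal sketch of that idea (the CLI implements this in Zig in `utils/storage.zig`; the Go types here are illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// store maps SHA-256 digests to file contents: identical contents
// always hash to the same address, which is the essence of
// content-addressed deduplication.
type store map[string][]byte

// put saves content under its digest and returns the address.
func (s store) put(content []byte) string {
	sum := sha256.Sum256(content)
	addr := hex.EncodeToString(sum[:])
	s[addr] = content
	return addr
}

func main() {
	s := store{}
	a := s.put([]byte("train.py contents"))
	b := s.put([]byte("train.py contents"))
	fmt.Println(a == b, len(s)) // true 1
}
```

A commit ID can then be derived by hashing the sorted list of (path, address) pairs, so any file change produces a different ID.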

#### WebSocket Protocol
- **Low Latency**: Real-time communication
- **Binary Protocol**: Efficient message format
- **Connection Pooling**: Reused connections

#### Memory Management
- **Arena Allocators**: Efficient memory allocation
- **Zero-copy Operations**: Minimized memory usage
- **Resource Cleanup**: Automatic resource management

### Security Features

#### Authentication
- **API Key Hashing**: Secure token storage
- **SHA256 Hashes**: Irreversible token protection
- **Config Validation**: Input sanitization

#### Secure Communication
- **SSH Integration**: Encrypted file transfers
- **WebSocket Security**: TLS-protected communication
- **Input Validation**: Comprehensive argument checking

#### Error Handling
- **Secure Reporting**: No sensitive information leakage
- **Graceful Degradation**: Safe error recovery
- **Audit Logging**: Operation tracking
## Advanced Usage
|
||||
|
||||
### Workflow Integration
|
||||
|
||||
#### Development Workflow
|
||||
```bash
|
||||
# 1. Initialize project
|
||||
ml sync ./project --name "dev" --queue
|
||||
|
||||
# 2. Auto-sync during development
|
||||
ml watch ./project --name "dev" --queue
|
||||
|
||||
# 3. Monitor progress
|
||||
ml status
|
||||
```
|
||||
|
||||

#### Batch Processing
```bash
# Process multiple experiments
for dir in experiments/*/; do
    ml sync "$dir" --queue
done
```

#### Priority Management
```bash
# High-priority experiment
ml sync ./urgent --priority 10 --queue

# Background processing
ml sync ./background --priority 1 --queue
```

### Configuration Management

#### Multiple Workers
```toml
# ~/.ml/config.toml
worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"
```

#### Security Settings
```bash
# Set restrictive permissions
chmod 600 ~/.ml/config.toml

# Verify configuration
ml status
```

## Troubleshooting

### Common Issues

#### Build Problems
```bash
# Check Zig installation
zig version

# Clean build
cd cli && make clean && make build
```

#### Connection Issues
```bash
# Test SSH connectivity (using values from ~/.ml/config.toml)
ssh -p "$worker_port" "$worker_user@$worker_host"

# Verify configuration
cat ~/.ml/config.toml
```

#### Sync Failures
```bash
# Check rsync
rsync --version

# Manual sync test
rsync -avz ./test/ "$worker_user@$worker_host:/tmp/"
```

#### Performance Issues
```bash
# Monitor resource usage (-d, joins multiple PIDs with commas for top)
top -p "$(pgrep -d, ml)"

# Check disk space
df -h "$worker_base"
```

### Debug Mode

Enable verbose logging:
```bash
# Environment variable
export ML_DEBUG=1
ml sync ./project

# Or use a debug build
cd cli && make debug
```

## Performance Benchmarks

### File Operations
- **Sync Speed**: 100MB/s+ (network limited)
- **Hash Computation**: 500MB/s+ (CPU limited)
- **Deduplication**: 90%+ space savings
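
You can sanity-check the hash-computation figure on your own hardware (results vary by CPU; this times plain `sha256sum`, not the CLI itself):

```bash
# Hash 200 MB of zeros and report the wall-clock time.
bytes=$((200 * 1024 * 1024))
start=$(date +%s)
hash=$(head -c "$bytes" /dev/zero | sha256sum | cut -d' ' -f1)
end=$(date +%s)
echo "hashed 200 MB in $((end - start))s"
```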

### Memory Usage
- **Base Memory**: ~10MB
- **Large Projects**: ~50MB (1GB+ projects)
- **Memory Efficiency**: Constant per-file overhead

### Network Performance
- **WebSocket Latency**: <10ms (local network)
- **Connection Setup**: <100ms
- **Throughput**: Network limited

## Contributing

### Development Setup
```bash
cd cli
zig build-exe src/main.zig
```

### Testing
```bash
# Run unit tests (zig test takes a root source file, not a directory)
cd cli && zig test src/main.zig

# Integration tests (assuming tests/main.zig as the entry point)
zig test tests/main.zig
```

### Code Style
- Follow Zig style guidelines
- Use explicit error handling
- Document public APIs
- Add comprehensive tests

---

**For more information, see the [CLI Reference](/cli-reference/) and [Architecture](/architecture/) pages.**