chore(docs): remove legacy Jekyll docs/_pages after Hugo migration

Jeremie Fraeys 2026-01-05 12:41:09 -05:00
parent 3d58387207
commit f6e506a632
7 changed files with 0 additions and 2486 deletions


@@ -1,738 +0,0 @@
---
layout: page
title: "Homelab Architecture"
permalink: /architecture/
nav_order: 1
---
# Homelab Architecture
Simple, secure architecture for ML experiments in your homelab.
## Components Overview
```mermaid
graph TB
subgraph "Homelab Stack"
CLI[Zig CLI]
API[HTTPS API]
REDIS[Redis Cache]
FS[Local Storage]
end
CLI --> API
API --> REDIS
API --> FS
```
## Core Services
### API Server
- **Purpose**: Secure HTTPS API for ML experiments
- **Port**: 9101 (HTTPS only)
- **Auth**: API key authentication
- **Security**: Rate limiting, IP whitelisting
### Redis
- **Purpose**: Caching and job queuing
- **Port**: 6379 (localhost only)
- **Storage**: Temporary data only
- **Persistence**: Local volume
### Zig CLI
- **Purpose**: High-performance experiment management
- **Language**: Zig for maximum speed and efficiency
- **Features**:
- Content-addressed storage with deduplication
- SHA256-based commit ID generation
- WebSocket communication for real-time updates
- Rsync-based incremental file transfers
- Multi-threaded operations
- Secure API key authentication
- Auto-sync monitoring with file system watching
- Priority-based job queuing
- Memory-efficient operations with arena allocators
## Security Architecture
```mermaid
graph LR
USER[User] --> AUTH[API Key Auth]
AUTH --> RATE[Rate Limiting]
RATE --> WHITELIST[IP Whitelist]
WHITELIST --> API[Secure API]
API --> AUDIT[Audit Logging]
```
### Security Layers
1. **API Key Authentication** - Hashed keys with roles
2. **Rate Limiting** - 30 requests/minute
3. **IP Whitelisting** - Local networks only
4. **Fail2Ban** - Automatic IP blocking
5. **HTTPS/TLS** - Encrypted communication
6. **Audit Logging** - Complete action tracking
## Data Flow
```mermaid
sequenceDiagram
participant CLI
participant API
participant Redis
participant Storage
CLI->>API: HTTPS Request
API->>API: Validate Auth
API->>Redis: Cache/Queue
API->>Storage: Experiment Data
Storage->>API: Results
API->>CLI: Response
```
## Deployment Options
### Docker Compose (Recommended)
```yaml
services:
  redis:
    image: redis:7-alpine
    ports: ["127.0.0.1:6379:6379"]  # bind to localhost only
    volumes: [redis_data:/data]
  api-server:
    build: .
    ports: ["9101:9101"]
    depends_on: [redis]
```
### Local Setup
```bash
./setup.sh && ./manage.sh start
```
## Network Architecture
- **Private Network**: Docker internal network
- **Localhost Access**: Redis only on localhost
- **HTTPS API**: Port 9101, TLS encrypted
- **No External Dependencies**: Everything runs locally
## Storage Architecture
```
data/
├── experiments/ # ML experiment results
├── cache/ # Temporary cache files
└── backups/ # Local backups
logs/
├── app.log # Application logs
├── audit.log # Security events
└── access.log # API access logs
```
## Monitoring Architecture
Simple, lightweight monitoring:
- **Health Checks**: Service availability
- **Log Files**: Structured logging
- **Basic Metrics**: Request counts, error rates
- **Security Events**: Failed auth, rate limits
## Homelab Benefits
- ✅ **Simple Setup**: One-command installation
- ✅ **Local Only**: No external dependencies
- ✅ **Secure by Default**: HTTPS, auth, rate limiting
- ✅ **Low Resource**: Minimal CPU/memory usage
- ✅ **Easy Backup**: Local file system
- ✅ **Privacy**: Everything stays on your network
## High-Level Architecture
```mermaid
graph TB
subgraph "Client Layer"
CLI[CLI Tools]
TUI[Terminal UI]
API[REST API]
end
subgraph "Authentication Layer"
Auth[Authentication Service]
RBAC[Role-Based Access Control]
Perm[Permission Manager]
end
subgraph "Core Services"
Worker[ML Worker Service]
DataMgr[Data Manager Service]
Queue[Job Queue]
end
subgraph "Storage Layer"
Redis[(Redis Cache)]
DB[(SQLite/PostgreSQL)]
Files[File Storage]
end
subgraph "Container Runtime"
Podman[Podman/Docker]
Containers[ML Containers]
end
CLI --> Auth
TUI --> Auth
API --> Auth
Auth --> RBAC
RBAC --> Perm
Worker --> Queue
Worker --> DataMgr
Worker --> Podman
DataMgr --> DB
DataMgr --> Files
Queue --> Redis
Podman --> Containers
```
## Zig CLI Architecture
### Component Structure
```mermaid
graph TB
subgraph "Zig CLI Components"
Main[main.zig] --> Commands[commands/]
Commands --> Config[config.zig]
Commands --> Utils[utils/]
Commands --> Net[net/]
Commands --> Errors[errors.zig]
subgraph "Commands"
Init[init.zig]
Sync[sync.zig]
Queue[queue.zig]
Watch[watch.zig]
Status[status.zig]
Monitor[monitor.zig]
Cancel[cancel.zig]
Prune[prune.zig]
end
subgraph "Utils"
Crypto[crypto.zig]
Storage[storage.zig]
Rsync[rsync.zig]
end
subgraph "Network"
WS[ws.zig]
end
end
```
### Performance Optimizations
#### Content-Addressed Storage
- **Deduplication**: Files stored by SHA256 hash
- **Space Efficiency**: Shared files across experiments
- **Fast Lookup**: Hash-based file retrieval
#### Memory Management
- **Arena Allocators**: Efficient bulk allocation
- **Zero-Copy Operations**: Minimized memory copying
- **Automatic Cleanup**: Resource deallocation
#### Network Communication
- **WebSocket Protocol**: Real-time bidirectional communication
- **Connection Pooling**: Reused connections
- **Binary Messaging**: Efficient data transfer
### Security Implementation
```mermaid
graph LR
subgraph "CLI Security"
Config[Config File] --> Hash[SHA256 Hashing]
Hash --> Auth[API Authentication]
Auth --> SSH[SSH Transfer]
SSH --> WS[WebSocket Security]
end
```
## Core Components
### 1. Authentication & Authorization
```mermaid
graph LR
subgraph "Auth Flow"
Client[Client] --> APIKey[API Key]
APIKey --> Hash[Hash Validation]
Hash --> Roles[Role Resolution]
Roles --> Perms[Permission Check]
Perms --> Access[Grant/Deny Access]
end
subgraph "Permission Sources"
YAML[YAML Config]
Inline[Inline Fallback]
Roles --> YAML
Roles --> Inline
end
```
**Features:**
- API key-based authentication
- Role-based access control (RBAC)
- YAML-based permission configuration
- Fallback to inline permissions
- Admin wildcard permissions
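The hash-validation step in the flow above amounts to comparing a SHA-256 digest of the presented key against the stored hex digest. A minimal sketch (using a constant-time compare; this is not the project's actual implementation):

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// validKey hashes the presented API key and compares it against the
// stored hex digest in constant time, avoiding timing side channels.
func validKey(presented, storedHexHash string) bool {
	sum := sha256.Sum256([]byte(presented))
	got := hex.EncodeToString(sum[:])
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHexHash)) == 1
}

func main() {
	// SHA-256("password") — the sample hash used in the example configs.
	stored := "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8"
	fmt.Println(validKey("password", stored), validKey("wrong", stored))
}
```

Only the hash ever lives in the config file, so a leaked config does not leak usable keys.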
### 2. Worker Service
```mermaid
graph TB
subgraph "Worker Architecture"
API[HTTP API] --> Router[Request Router]
Router --> Auth[Auth Middleware]
Auth --> Queue[Job Queue]
Queue --> Processor[Job Processor]
Processor --> Runtime[Container Runtime]
Runtime --> Storage[Result Storage]
subgraph "Job Lifecycle"
Submit[Submit Job] --> Queue
Queue --> Execute[Execute]
Execute --> Monitor[Monitor]
Monitor --> Complete[Complete]
Complete --> Store[Store Results]
end
end
```
**Responsibilities:**
- HTTP API for job submission
- Job queue management
- Container orchestration
- Result collection and storage
- Metrics and monitoring
### 3. Data Manager Service
```mermaid
graph TB
subgraph "Data Management"
API[Data API] --> Storage[Storage Layer]
Storage --> Metadata[Metadata DB]
Storage --> Files[File System]
Storage --> Cache[Redis Cache]
subgraph "Data Operations"
Upload[Upload Data] --> Validate[Validate]
Validate --> Store[Store]
Store --> Index[Index]
Index --> Catalog[Catalog]
end
end
```
**Features:**
- Data upload and validation
- Metadata management
- File system abstraction
- Caching layer
- Data catalog
### 4. Terminal UI (TUI)
```mermaid
graph TB
subgraph "TUI Architecture"
UI[UI Components] --> Model[Data Model]
Model --> Update[Update Loop]
Update --> Render[Render]
subgraph "UI Panels"
Jobs[Job List]
Details[Job Details]
Logs[Log Viewer]
Status[Status Bar]
end
UI --> Jobs
UI --> Details
UI --> Logs
UI --> Status
end
```
**Components:**
- Bubble Tea framework
- Component-based architecture
- Real-time updates
- Keyboard navigation
- Theme support
## Data Flow
### Job Execution Flow
```mermaid
sequenceDiagram
participant Client
participant Auth
participant Worker
participant Queue
participant Container
participant Storage
Client->>Auth: Submit job with API key
Auth->>Client: Validate and return job ID
Client->>Worker: Execute job request
Worker->>Queue: Queue job
Queue->>Worker: Job ready
Worker->>Container: Start ML container
Container->>Worker: Execute experiment
Worker->>Storage: Store results
Worker->>Client: Return results
```
### Authentication Flow
```mermaid
sequenceDiagram
participant Client
participant Auth
participant PermMgr
participant Config
Client->>Auth: Request with API key
Auth->>Auth: Validate key hash
Auth->>PermMgr: Get user permissions
PermMgr->>Config: Load YAML permissions
Config->>PermMgr: Return permissions
PermMgr->>Auth: Return resolved permissions
Auth->>Client: Grant/deny access
```
## Security Architecture
### Defense in Depth
```mermaid
graph TB
subgraph "Security Layers"
Network[Network Security]
Auth[Authentication]
AuthZ[Authorization]
Container[Container Security]
Data[Data Protection]
Audit[Audit Logging]
end
Network --> Auth
Auth --> AuthZ
AuthZ --> Container
Container --> Data
Data --> Audit
```
**Security Features:**
- API key authentication
- Role-based permissions
- Container isolation
- File system sandboxing
- Comprehensive audit logs
- Input validation and sanitization
### Container Security
```mermaid
graph TB
subgraph "Container Isolation"
Host[Host System]
Podman[Podman Runtime]
Network[Network Isolation]
FS[File System Isolation]
User[User Namespaces]
ML[ML Container]
Host --> Podman
Podman --> Network
Podman --> FS
Podman --> User
User --> ML
end
```
**Isolation Features:**
- Rootless containers
- Network isolation
- File system sandboxing
- User namespace mapping
- Resource limits
## Configuration Architecture
### Configuration Hierarchy
```mermaid
graph TB
subgraph "Config Sources"
Env[Environment Variables]
File[Config Files]
CLI[CLI Flags]
Defaults[Default Values]
end
subgraph "Config Processing"
Merge[Config Merger]
Validate[Schema Validator]
Apply[Config Applier]
end
Env --> Merge
File --> Merge
CLI --> Merge
Defaults --> Merge
Merge --> Validate
Validate --> Apply
```
**Configuration Priority:**
1. CLI flags (highest)
2. Environment variables
3. Configuration files
4. Default values (lowest)
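The precedence rule above reduces to "later sources override earlier ones" when merging in the order defaults → file → env → flags. A sketch with string maps (the real merger works on typed config structs):

```go
package main

import "fmt"

// mergeConfig overlays sources left to right, so a key set by a later
// (higher-priority) source wins over the same key from an earlier one.
func mergeConfig(sources ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, src := range sources {
		for k, v := range src {
			out[k] = v
		}
	}
	return out
}

func main() {
	defaults := map[string]string{"port": "9101", "log": "info"}
	file := map[string]string{"log": "debug"}
	env := map[string]string{}
	flags := map[string]string{"port": "9200"}

	cfg := mergeConfig(defaults, file, env, flags) // lowest priority first
	fmt.Println(cfg["port"], cfg["log"])           // 9200 debug
}
```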
## Scalability Architecture
### Horizontal Scaling
```mermaid
graph TB
subgraph "Scaled Architecture"
LB[Load Balancer]
W1[Worker 1]
W2[Worker 2]
W3[Worker N]
Redis[Redis Cluster]
Storage[Shared Storage]
LB --> W1
LB --> W2
LB --> W3
W1 --> Redis
W2 --> Redis
W3 --> Redis
W1 --> Storage
W2 --> Storage
W3 --> Storage
end
```
**Scaling Features:**
- Stateless worker services
- Shared job queue (Redis)
- Distributed storage
- Load balancer ready
- Health checks and monitoring
## Technology Stack
### Backend Technologies
| Component | Technology | Purpose |
|-----------|------------|---------|
| **Language** | Go 1.25+ | Core application |
| **Web Framework** | Standard library | HTTP server |
| **Authentication** | Custom | API key + RBAC |
| **Database** | SQLite/PostgreSQL | Metadata storage |
| **Cache** | Redis | Job queue & caching |
| **Containers** | Podman/Docker | Job isolation |
| **UI Framework** | Bubble Tea | Terminal UI |
### Dependencies
```go
// Core dependencies
require (
	github.com/charmbracelet/bubbletea v1.3.10 // TUI framework
	github.com/go-redis/redis/v8 v8.11.5       // Redis client
	github.com/google/uuid v1.6.0              // UUID generation
	github.com/mattn/go-sqlite3 v1.14.32       // SQLite driver
	golang.org/x/crypto v0.45.0                // Crypto utilities
	gopkg.in/yaml.v3 v3.0.1                    // YAML parsing
)
```
## Development Architecture
### Project Structure
```
fetch_ml/
├── cmd/ # CLI applications
│ ├── worker/ # ML worker service
│ ├── tui/ # Terminal UI
│ ├── data_manager/ # Data management
│ └── user_manager/ # User management
├── internal/ # Internal packages
│ ├── auth/ # Authentication system
│ ├── config/ # Configuration management
│ ├── container/ # Container operations
│ ├── database/ # Database operations
│ ├── logging/ # Logging utilities
│ ├── metrics/ # Metrics collection
│ └── network/ # Network utilities
├── configs/ # Configuration files
├── scripts/ # Setup and utility scripts
├── tests/ # Test suites
└── docs/ # Documentation
```
### Package Dependencies
```mermaid
graph TB
subgraph "Application Layer"
Worker[cmd/worker]
TUI[cmd/tui]
DataMgr[cmd/data_manager]
UserMgr[cmd/user_manager]
end
subgraph "Service Layer"
Auth[internal/auth]
Config[internal/config]
Container[internal/container]
Database[internal/database]
end
subgraph "Utility Layer"
Logging[internal/logging]
Metrics[internal/metrics]
Network[internal/network]
end
Worker --> Auth
Worker --> Config
Worker --> Container
TUI --> Auth
DataMgr --> Database
UserMgr --> Auth
Auth --> Logging
Container --> Network
Database --> Metrics
```
## Monitoring & Observability
### Metrics Collection
```mermaid
graph TB
subgraph "Metrics Pipeline"
App[Application] --> Metrics[Metrics Collector]
Metrics --> Export[Prometheus Exporter]
Export --> Prometheus[Prometheus Server]
Prometheus --> Grafana[Grafana Dashboard]
subgraph "Metric Types"
Counter[Counters]
Gauge[Gauges]
Histogram[Histograms]
Timer[Timers]
end
App --> Counter
App --> Gauge
App --> Histogram
App --> Timer
end
```
### Logging Architecture
```mermaid
graph TB
subgraph "Logging Pipeline"
App[Application] --> Logger[Structured Logger]
Logger --> File[File Output]
Logger --> Console[Console Output]
Logger --> Syslog[Syslog Forwarder]
Syslog --> Aggregator[Log Aggregator]
Aggregator --> Storage[Log Storage]
Storage --> Viewer[Log Viewer]
end
```
## Deployment Architecture
### Container Deployment
```mermaid
graph TB
subgraph "Deployment Stack"
Image[Container Image]
Registry[Container Registry]
Orchestrator[Docker Compose]
Config[ConfigMaps/Secrets]
Storage[Persistent Storage]
Image --> Registry
Registry --> Orchestrator
Config --> Orchestrator
Storage --> Orchestrator
end
```
### Service Discovery
```mermaid
graph TB
subgraph "Service Mesh"
Gateway[API Gateway]
Discovery[Service Discovery]
Worker[Worker Service]
Data[Data Service]
Redis[Redis Cluster]
Gateway --> Discovery
Discovery --> Worker
Discovery --> Data
Discovery --> Redis
end
```
## Future Architecture Considerations
### Microservices Evolution
- **API Gateway**: Centralized routing and authentication
- **Service Mesh**: Inter-service communication
- **Event Streaming**: Kafka for job events
- **Distributed Tracing**: OpenTelemetry integration
- **Multi-tenant**: Tenant isolation and quotas
### Homelab Features
- **Docker Compose**: Simple container orchestration
- **Local Development**: Easy setup and testing
- **Security**: Built-in authentication and encryption
- **Monitoring**: Basic health checks and logging
---
This architecture provides a solid foundation for secure, scalable machine learning experiments while maintaining simplicity and developer productivity.


@@ -1,165 +0,0 @@
---
layout: page
title: "CI/CD Pipeline"
permalink: /cicd/
nav_order: 5
---
# CI/CD Pipeline
Automated testing, building, and releasing for fetch_ml.
## Workflows
### CI Workflow (`.github/workflows/ci.yml`)
Runs on every push to `main`/`develop` and all pull requests.
**Jobs:**
1. **test** - Go backend tests with Redis
2. **build** - Build all binaries (Go + Zig CLI)
3. **test-scripts** - Validate deployment scripts
4. **security-scan** - Trivy and Gosec security scans
5. **docker-build** - Build and push Docker images (main branch only)
**Test Coverage:**
- Go unit tests with race detection
- `internal/queue` package tests
- Zig CLI tests
- Integration tests
- Security audits
### Release Workflow (`.github/workflows/release.yml`)
Runs on version tags (e.g., `v1.0.0`).
**Jobs:**
1. **build-cli** (matrix build)
- Linux x86_64 (static musl)
- macOS x86_64
- macOS ARM64
- Downloads platform-specific static rsync
- Embeds rsync for zero-dependency releases
2. **build-go-backends**
- Cross-platform Go builds
- api-server, worker, tui, data_manager, user_manager
3. **create-release**
- Collects all artifacts
- Generates SHA256 checksums
- Creates GitHub release with notes
## Release Process
### Creating a Release
```bash
# 1. Update version
git tag v1.0.0
# 2. Push tag
git push origin v1.0.0
# 3. CI automatically builds and releases
```
### Release Artifacts
**CLI Binaries (with embedded rsync):**
- `ml-linux-x86_64.tar.gz` (~450-650KB)
- `ml-macos-x86_64.tar.gz` (~450-650KB)
- `ml-macos-arm64.tar.gz` (~450-650KB)
**Go Backends:**
- `fetch_ml_api-server.tar.gz`
- `fetch_ml_worker.tar.gz`
- `fetch_ml_tui.tar.gz`
- `fetch_ml_data_manager.tar.gz`
- `fetch_ml_user_manager.tar.gz`
**Checksums:**
- `checksums.txt` - Combined SHA256 sums
- Individual `.sha256` files per binary
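After downloading a release tarball, the checksum file can be verified with standard coreutils. The commands below simulate that check locally with a stand-in artifact (the filename is taken from the list above; on macOS substitute `shasum -a 256`):

```bash
# create a sample artifact, record its checksum, then verify it —
# the same check you would run against a downloaded release tarball
echo "demo payload" > ml-linux-x86_64.tar.gz
sha256sum ml-linux-x86_64.tar.gz > checksums.txt
sha256sum -c checksums.txt   # prints "ml-linux-x86_64.tar.gz: OK"
```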
## Development Workflow
### Local Testing
```bash
# Run all tests
make test
# Run specific package tests
go test ./internal/queue/...
# Build CLI
cd cli && zig build dev
# Run formatters and linters
make lint
# Security scans are handled automatically in CI by the `security-scan` job
```
#### Optional heavy end-to-end tests
Some e2e tests exercise full Docker deployments and performance scenarios and are
**skipped by default** to keep local/CI runs fast. You can enable them explicitly
with environment variables:
```bash
# Run Docker deployment e2e tests
FETCH_ML_E2E_DOCKER=1 go test ./tests/e2e/...
# Run performance-oriented e2e tests
FETCH_ML_E2E_PERF=1 go test ./tests/e2e/...
```
Without these variables, `TestDockerDeploymentE2E` and `TestPerformanceE2E` will
`t.Skip`, while all lighter e2e tests still run.
### Pull Request Checks
All PRs must pass:
- ✅ Go tests (with Redis)
- ✅ CLI tests
- ✅ Security scans
- ✅ Code linting
- ✅ Build verification
## Configuration
### Environment Variables
```yaml
GO_VERSION: '1.25.0'
ZIG_VERSION: '0.15.2'
```
### Secrets
Required for releases:
- `GITHUB_TOKEN` - Automatic, provided by GitHub Actions
## Monitoring
### Build Status
Check workflow runs at:
```
https://github.com/jfraeys/fetch_ml/actions
```
### Artifacts
Download build artifacts from:
- Successful workflow runs (30-day retention)
- GitHub Releases (permanent)
---
For implementation details:
- [.github/workflows/ci.yml](https://github.com/jfraeys/fetch_ml/blob/main/.github/workflows/ci.yml)
- [.github/workflows/release.yml](https://github.com/jfraeys/fetch_ml/blob/main/.github/workflows/release.yml)


@@ -1,404 +0,0 @@
---
layout: page
title: "CLI Reference"
permalink: /cli-reference/
nav_order: 2
---
# Fetch ML CLI Reference
Command-line tooling for managing ML experiments in your homelab, built around a high-performance Zig CLI.
## Overview
Fetch ML provides a comprehensive CLI toolkit built with performance and security in mind:
- **Zig CLI** - High-performance experiment management written in Zig
- **Go Commands** - API server, TUI, and data management utilities
- **Management Scripts** - Service orchestration and deployment
- **Setup Scripts** - One-command installation and configuration
## Zig CLI (`./cli/zig-out/bin/ml`)
High-performance command-line interface for experiment management, written in Zig for speed and efficiency.
### Available Commands
| Command | Description | Example |
|---------|-------------|----------|
| `init` | Interactive configuration setup | `ml init` |
| `sync` | Sync project to worker with deduplication | `ml sync ./project --name myjob --queue` |
| `queue` | Queue job for execution | `ml queue myjob --commit abc123 --priority 8` |
| `status` | Get system and worker status | `ml status` |
| `monitor` | Launch TUI monitoring via SSH | `ml monitor` |
| `cancel` | Cancel running job | `ml cancel job123` |
| `prune` | Clean up old experiments | `ml prune --keep 10` |
| `watch` | Auto-sync directory on changes | `ml watch ./project --queue` |
### Command Details
#### `init` - Configuration Setup
```bash
ml init
```
Creates a configuration template at `~/.ml/config.toml` with:
- Worker connection details
- API authentication
- Base paths and ports
#### `sync` - Project Synchronization
```bash
# Basic sync
ml sync ./my-project
# Sync with custom name and queue
ml sync ./my-project --name "experiment-1" --queue
# Sync with priority
ml sync ./my-project --priority 9
```
**Features:**
- Content-addressed storage for deduplication
- SHA256 commit ID generation
- Rsync-based file transfer
- Automatic queuing (with `--queue` flag)
#### `queue` - Job Management
```bash
# Queue with commit ID
ml queue my-job --commit abc123def456
# Queue with priority (1-10, default 5)
ml queue my-job --commit abc123 --priority 8
```
**Features:**
- WebSocket-based communication
- Priority queuing system
- API key authentication
#### `watch` - Auto-Sync Monitoring
```bash
# Watch directory for changes
ml watch ./project
# Watch and auto-queue on changes
ml watch ./project --name "dev-exp" --queue
```
**Features:**
- Real-time file system monitoring
- Automatic re-sync on changes
- Configurable polling interval (default: 2 seconds)
- Commit ID comparison for efficiency
#### `prune` - Cleanup Management
```bash
# Keep last N experiments
ml prune --keep 20
# Remove experiments older than N days
ml prune --older-than 30
```
#### `monitor` - Remote Monitoring
```bash
ml monitor
```
Launches TUI interface via SSH for real-time monitoring.
#### `cancel` - Job Cancellation
```bash
ml cancel running-job-id
```
Cancels currently running jobs by ID.
### Configuration
The Zig CLI reads configuration from `~/.ml/config.toml`:
```toml
worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"
```
### Performance Features
- **Content-Addressed Storage**: Automatic deduplication of identical files
- **Incremental Sync**: Only transfers changed files
- **SHA256 Hashing**: Reliable commit ID generation
- **WebSocket Communication**: Efficient real-time messaging
- **Multi-threaded**: Concurrent operations where applicable
## Go Commands
### API Server (`./cmd/api-server/main.go`)
Main HTTPS API server for experiment management.
```bash
# Build and run
go run ./cmd/api-server/main.go
# With configuration
./bin/api-server --config configs/config-local.yaml
```
**Features:**
- HTTPS-only communication
- API key authentication
- Rate limiting and IP whitelisting
- WebSocket support for real-time updates
- Redis integration for caching
### TUI (`./cmd/tui/main.go`)
Terminal User Interface for monitoring experiments.
```bash
# Launch TUI
go run ./cmd/tui/main.go
# With custom config
./tui --config configs/config-local.yaml
```
**Features:**
- Real-time experiment monitoring
- Interactive job management
- Status visualization
- Log viewing
### Data Manager (`./cmd/data_manager/`)
Utilities for data synchronization and management.
```bash
# Sync data
./data_manager --sync ./data
# Clean old data
./data_manager --cleanup --older-than 30d
```
### Config Lint (`./cmd/configlint/main.go`)
Configuration validation and linting tool.
```bash
# Validate configuration
./configlint configs/config-local.yaml
# Check schema compliance
./configlint --schema configs/schema/config_schema.yaml
```
## Management Script (`./tools/manage.sh`)
Simple service management for your homelab.
### Commands
```bash
./tools/manage.sh start # Start all services
./tools/manage.sh stop # Stop all services
./tools/manage.sh status # Check service status
./tools/manage.sh logs # View logs
./tools/manage.sh monitor # Basic monitoring
./tools/manage.sh security # Security status
./tools/manage.sh cleanup # Clean project artifacts
```
## Setup Script (`./setup.sh`)
One-command homelab setup.
### Usage
```bash
# Full setup
./setup.sh
# Setup includes:
# - SSL certificate generation
# - Configuration creation
# - Build all components
# - Start Redis
# - Setup Fail2Ban (if available)
```
## API Testing
Test the API with curl:
```bash
# Health check
curl -k -H 'X-API-Key: password' https://localhost:9101/health
# List experiments
curl -k -H 'X-API-Key: password' https://localhost:9101/experiments
# Submit experiment
curl -k -X POST -H 'X-API-Key: password' \
-H 'Content-Type: application/json' \
-d '{"name":"test","config":{"type":"basic"}}' \
https://localhost:9101/experiments
```
## Zig CLI Architecture
The Zig CLI is designed for performance and reliability:
### Core Components
- **Commands** (`cli/src/commands/`): Individual command implementations
- **Config** (`cli/src/config.zig`): Configuration management
- **Network** (`cli/src/net/ws.zig`): WebSocket client implementation
- **Utils** (`cli/src/utils/`): Cryptography, storage, and rsync utilities
- **Errors** (`cli/src/errors.zig`): Centralized error handling
### Performance Optimizations
- **Content-Addressed Storage**: Deduplicates identical files across experiments
- **SHA256 Hashing**: Fast, reliable commit ID generation
- **Rsync Integration**: Efficient incremental file transfers
- **WebSocket Protocol**: Low-latency communication with worker
- **Memory Management**: Efficient allocation with Zig's allocator system
### Security Features
- **API Key Hashing**: Secure authentication token handling
- **SSH Integration**: Secure file transfers
- **Input Validation**: Comprehensive argument checking
- **Error Handling**: Secure error reporting without information leakage
## Configuration
Main configuration file: `configs/config-local.yaml`
### Key Settings
```yaml
auth:
  enabled: true
  api_keys:
    homelab_user:
      hash: "5e884898da28047151d0e56f8dc6292773603d0d6aabbdd62a11ef721d1542d8"
      admin: true
server:
  address: ":9101"
  tls:
    enabled: true
    cert_file: "./ssl/cert.pem"
    key_file: "./ssl/key.pem"
security:
  rate_limit:
    enabled: true
    requests_per_minute: 30
  ip_whitelist:
    - "127.0.0.1"
    - "::1"
    - "192.168.0.0/16"
    - "10.0.0.0/8"
```
## Docker Commands
If using Docker Compose:
```bash
# Start services
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down
# Check status
docker-compose ps
```
## Troubleshooting
### Common Issues
**Zig CLI not found:**
```bash
# Build the CLI
cd cli && make build
# Check binary exists
ls -la ./cli/zig-out/bin/ml
```
**Configuration not found:**
```bash
# Create configuration
./cli/zig-out/bin/ml init
# Check config file
ls -la ~/.ml/config.toml
```
**Worker connection failed:**
```bash
# Test SSH connection
ssh -p 22 mluser@worker.local
# Check configuration
cat ~/.ml/config.toml
```
**Sync not working:**
```bash
# Check rsync availability
rsync --version
# Test manual sync
rsync -avz ./project/ mluser@worker.local:/tmp/test/
```
**WebSocket connection failed:**
```bash
# Check worker WebSocket port
telnet worker.local 9100
# Verify API key
./cli/zig-out/bin/ml status
```
**API not responding:**
```bash
./tools/manage.sh status
./tools/manage.sh logs
```
**Authentication failed:**
```bash
# Check API key in config-local.yaml
grep -A 5 "api_keys:" configs/config-local.yaml
```
**Redis connection failed:**
```bash
# Check Redis status
redis-cli ping
# Start Redis
redis-server
```
### Getting Help
```bash
# CLI help
./cli/zig-out/bin/ml help
# Management script help
./tools/manage.sh help
# Check all available commands
make help
```
---
**That's it for the CLI reference!** For complete setup instructions, see the main [README](/).


@@ -1,310 +0,0 @@
---
layout: page
title: "Operations Runbook"
permalink: /operations/
nav_order: 6
---
# Operations Runbook
Operational guide for troubleshooting and maintaining the ML experiment system.
## Task Queue Operations
### Monitoring Queue Health
```redis
# Check queue depth
ZCARD task:queue
# List pending tasks
ZRANGE task:queue 0 -1 WITHSCORES
# Check dead letter queue
KEYS task:dlq:*
```
### Handling Stuck Tasks
**Symptom:** Tasks stuck in "running" status
**Diagnosis:**
```bash
# Check for expired leases
redis-cli GET task:{task-id}
# Look for LeaseExpiry in past
```
**Remediation:**
Tasks with expired leases are reclaimed automatically once per minute. To force immediate reclamation:
```bash
# Restart worker to trigger reclaim cycle
systemctl restart ml-worker
```
### Dead Letter Queue Management
**View failed tasks:**
```redis
KEYS task:dlq:*
```
**Inspect failed task:**
```redis
GET task:dlq:{task-id}
```
**Retry from DLQ:**
```bash
# Manual retry (requires custom script)
# 1. Get task from DLQ
# 2. Reset retry count
# 3. Re-queue task
```
### Worker Crashes
**Symptom:** Worker disappeared mid-task
**What Happens:**
1. Lease expires after 30 minutes (default)
2. Background reclaim job detects expired lease
3. Task is retried (up to 3 attempts)
4. After max retries → Dead Letter Queue
**Prevention:**
- Monitor worker heartbeats
- Set up alerts for worker down
- Use process manager (systemd, supervisor)
## Worker Operations
### Graceful Shutdown
```bash
# Send SIGTERM for graceful shutdown
kill -TERM $(pgrep ml-worker)
# Worker will:
# 1. Stop accepting new tasks
# 2. Finish active tasks (up to 5min timeout)
# 3. Release all leases
# 4. Exit cleanly
```
### Force Shutdown
```bash
# Force kill (leases will be reclaimed automatically)
kill -9 $(pgrep ml-worker)
```
### Worker Heartbeat Monitoring
```redis
# Check worker heartbeats
HGETALL worker:heartbeat
# Example output:
# worker-abc123 1701234567
# worker-def456 1701234580
```
**Alert if:** Heartbeat timestamp > 5 minutes old
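That alert rule is easy to express over the `worker:heartbeat` hash: compare each worker's last heartbeat (unix seconds) against a maximum age. A sketch around the data shown above, assuming the timestamps have already been fetched from Redis:

```go
package main

import (
	"fmt"
	"sort"
)

// staleWorkers returns the IDs of workers whose last heartbeat is more
// than maxAge seconds old, sorted for stable output.
func staleWorkers(heartbeats map[string]int64, now, maxAge int64) []string {
	var stale []string
	for id, ts := range heartbeats {
		if now-ts > maxAge {
			stale = append(stale, id)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	hb := map[string]int64{
		"worker-abc123": 1701234567,
		"worker-def456": 1701230000, // well over 5 minutes behind
	}
	fmt.Println(staleWorkers(hb, 1701234600, 300))
}
```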
## Redis Operations
### Backup
```bash
# Manual backup
redis-cli SAVE
cp /var/lib/redis/dump.rdb /backup/redis-$(date +%Y%m%d).rdb
```
### Restore
```bash
# Stop Redis
systemctl stop redis
# Restore snapshot
cp /backup/redis-20231201.rdb /var/lib/redis/dump.rdb
# Start Redis
systemctl start redis
```
### Memory Management
```redis
# Check memory usage
INFO memory
# Evict old data if needed
FLUSHDB # DANGER: Clears all data!
```
## Common Issues
### Issue: Queue Growing Unbounded
**Symptoms:**
- `ZCARD task:queue` keeps increasing
- No workers processing tasks
**Diagnosis:**
```bash
# Check worker status
systemctl status ml-worker
# Check logs
journalctl -u ml-worker -n 100
```
**Resolution:**
1. Verify workers are running
2. Check Redis connectivity
3. Verify lease configuration
### Issue: High Retry Rate
**Symptoms:**
- Many tasks in DLQ
- `retry_count` field high on tasks
**Diagnosis:**
```bash
# Check worker logs for errors
journalctl -u ml-worker | grep "retry"
# Look for patterns (network issues, resource limits, etc)
```
**Resolution:**
- Fix underlying issue (network, resources, etc)
- Adjust retry limits if permanent failures
- Increase task timeout if jobs are slow
### Issue: Leases Expiring Prematurely
**Symptoms:**
- Tasks retried even though worker is healthy
- Logs show "lease expired" frequently
**Diagnosis:**
```bash
# Check the lease settings in the worker config
grep -A3 "lease" configs/worker-config.yaml
# task_lease_duration: 30m   # Too short?
# heartbeat_interval: 1m     # Too infrequent?
```
**Resolution:**
```yaml
# Increase lease duration for long-running jobs
task_lease_duration: 60m
heartbeat_interval: 30s # More frequent heartbeats
```
## Performance Tuning
### Worker Concurrency
```yaml
# worker-config.yaml
max_workers: 4 # Number of parallel tasks
# Adjust based on:
# - CPU cores available
# - Memory per task
# - GPU availability
```
### Redis Configuration
```conf
# /etc/redis/redis.conf
# Persistence
save 900 1
save 300 10
# Memory
maxmemory 2gb
maxmemory-policy noeviction
# Performance
tcp-keepalive 300
timeout 0
```
## Alerting Rules
### Critical Alerts
1. **Worker Down** (no heartbeat > 5min)
2. **Queue Depth** > 1000 tasks
3. **DLQ Growth** > 100 tasks/hour
4. **Redis Down** (connection failed)
### Warning Alerts
1. **High Retry Rate** > 10% of tasks
2. **Slow Queue Drain** (depth increasing over 1 hour)
3. **Worker Memory** > 80% usage
## Health Checks
```bash
#!/bin/bash
# health-check.sh

# Check Redis
redis-cli PING || echo "Redis DOWN"

# Check worker heartbeat (missing heartbeat also counts as stale)
WORKER_ID=$(cat /var/run/ml-worker.pid)
LAST_HB=$(redis-cli HGET worker:heartbeat "$WORKER_ID")
NOW=$(date +%s)
if [ -z "$LAST_HB" ] || [ $((NOW - LAST_HB)) -gt 300 ]; then
  echo "Worker heartbeat stale"
fi

# Check queue depth
DEPTH=$(redis-cli ZCARD task:queue)
if [ "$DEPTH" -gt 1000 ]; then
  echo "Queue depth critical: $DEPTH"
fi
```
## Runbook Checklist
### Daily Operations
- [ ] Check queue depth
- [ ] Verify worker heartbeats
- [ ] Review DLQ for patterns
- [ ] Check Redis memory usage
### Weekly Operations
- [ ] Review retry rates
- [ ] Analyze failed task patterns
- [ ] Backup Redis snapshot
- [ ] Review worker logs
### Monthly Operations
- [ ] Performance tuning review
- [ ] Capacity planning
- [ ] Update documentation
- [ ] Test disaster recovery
---
**For homelab setups:**
Most of these operations can be simplified. Focus on:
- Basic monitoring (queue depth, worker status)
- Periodic Redis backups
- Graceful shutdowns for maintenance

---
layout: page
title: "Task Queue Architecture"
permalink: /queue/
nav_order: 3
---
# Task Queue Architecture
The task queue system enables reliable job processing between the API server and workers using Redis.
## Overview
```mermaid
graph LR
CLI[CLI/Client] -->|WebSocket| API[API Server]
API -->|Enqueue| Redis[(Redis)]
Redis -->|Dequeue| Worker[Worker]
Worker -->|Update Status| Redis
```
## Components
### TaskQueue (`internal/queue`)
Shared package used by both API server and worker for job management.
#### Task Structure
```go
type Task struct {
ID string // Unique task ID (UUID)
JobName string // User-defined job name
Args string // Job arguments
Status string // queued, running, completed, failed
Priority int64 // Higher = executed first
CreatedAt time.Time
StartedAt *time.Time
EndedAt *time.Time
WorkerID string
Error string
Datasets []string
Metadata map[string]string // commit_id, user, etc
}
```
#### TaskQueue Interface
```go
// Initialize queue
queue, err := queue.NewTaskQueue(queue.Config{
RedisAddr: "localhost:6379",
RedisPassword: "",
RedisDB: 0,
})
// Add task (API server)
task := &queue.Task{
ID: uuid.New().String(),
JobName: "train-model",
Status: "queued",
Priority: 5,
Metadata: map[string]string{
"commit_id": commitID,
"user": username,
},
}
err = queue.AddTask(task)
// Get next task (Worker)
task, err := queue.GetNextTask()
// Update task status
task.Status = "running"
err = queue.UpdateTask(task)
```
## Data Flow
### Job Submission Flow
```mermaid
sequenceDiagram
participant CLI
participant API
participant Redis
participant Worker
CLI->>API: Queue Job (WebSocket)
API->>API: Create Task (UUID)
API->>Redis: ZADD task:queue
API->>Redis: SET task:{id}
API->>CLI: Success Response
Worker->>Redis: ZPOPMAX task:queue
Redis->>Worker: Task ID
Worker->>Redis: GET task:{id}
Redis->>Worker: Task Data
Worker->>Worker: Execute Job
Worker->>Redis: Update Status
```
### Protocol
**CLI → API** (Binary WebSocket):
```
[opcode:1][api_key_hash:64][commit_id:64][priority:1][job_name_len:1][job_name:var]
```
**API → Redis**:
- Priority queue: `ZADD task:queue {priority} {task_id}`
- Task data: `SET task:{id} {json}`
- Status: `HSET task:status:{job_name} ...`
**Worker ← Redis**:
- Poll: `ZPOPMAX task:queue 1` (highest priority first)
- Fetch: `GET task:{id}`
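The CLI → API frame layout above is fixed-width except for the job name, so encoding is a straight concatenation. A sketch (the function name is hypothetical; field widths follow the layout shown):

```go
package main

import "fmt"

// encodeQueueJob packs fields in the wire order:
// [opcode:1][api_key_hash:64][commit_id:64][priority:1][job_name_len:1][job_name:var]
func encodeQueueJob(opcode byte, keyHash, commitID [64]byte, priority byte, jobName string) []byte {
	buf := make([]byte, 0, 1+64+64+1+1+len(jobName))
	buf = append(buf, opcode)
	buf = append(buf, keyHash[:]...)
	buf = append(buf, commitID[:]...)
	buf = append(buf, priority, byte(len(jobName))) // name length capped at 255
	return append(buf, jobName...)
}

func main() {
	var kh, cid [64]byte // zeroed placeholders for the two hash fields
	frame := encodeQueueJob(0x01, kh, cid, 5, "train-model")
	fmt.Println(len(frame)) // 142 = 1+64+64+1+1+11
}
```

The single-byte length prefix is why job names are limited to 255 bytes in this framing.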
## Redis Data Structures
### Keys
```
task:queue # ZSET: priority queue
task:{uuid} # STRING: task JSON data
task:status:{job_name} # HASH: job status
worker:heartbeat # HASH: worker health
job:metrics:{job_name} # HASH: job metrics
```
### Priority Queue (ZSET)
```redis
ZADD task:queue 10 "uuid-1" # Priority 10
ZADD task:queue 5 "uuid-2" # Priority 5
ZPOPMAX task:queue 1 # Returns uuid-1 (highest)
```
## API Server Integration
### Initialization
```go
// cmd/api-server/main.go
queueCfg := queue.Config{
RedisAddr: cfg.Redis.Addr,
RedisPassword: cfg.Redis.Password,
RedisDB: cfg.Redis.DB,
}
taskQueue, err := queue.NewTaskQueue(queueCfg)
```
### WebSocket Handler
```go
// internal/api/ws.go
func (h *WSHandler) handleQueueJob(conn *websocket.Conn, payload []byte) error {
// Parse request
apiKeyHash, commitID, priority, jobName := parsePayload(payload)
// Create task with unique ID
taskID := uuid.New().String()
task := &queue.Task{
ID: taskID,
JobName: jobName,
Status: "queued",
Priority: int64(priority),
Metadata: map[string]string{
"commit_id": commitID,
"user": user,
},
}
// Enqueue
if err := h.queue.AddTask(task); err != nil {
return h.sendErrorPacket(conn, ErrorCodeDatabaseError, ...)
}
return h.sendSuccessPacket(conn, "Job queued")
}
```
## Worker Integration
### Task Polling
```go
// cmd/worker/worker_server.go
func (w *Worker) Start() error {
    for {
        task, err := w.queue.WaitForNextTask(ctx, 5*time.Second)
        if err != nil {
            continue // timeout or transient error: keep polling
        }
        if task != nil {
            go w.executeTask(task)
        }
    }
}
```
### Task Execution
```go
func (w *Worker) executeTask(task *queue.Task) {
    // Mark as running
    now := time.Now()
    task.Status = "running"
    task.StartedAt = &now
    w.queue.UpdateTaskWithMetrics(task, "start")

    // Execute
    err := w.runJob(task)

    // Finalize
    endTime := time.Now()
    task.EndedAt = &endTime
    if err != nil {
        task.Status = "failed"
        task.Error = err.Error()
    } else {
        task.Status = "completed"
    }
    w.queue.UpdateTaskWithMetrics(task, "final")
}
```
## Configuration
### API Server (`configs/config.yaml`)
```yaml
redis:
addr: "localhost:6379"
password: ""
db: 0
```
### Worker (`configs/worker-config.yaml`)
```yaml
redis:
addr: "localhost:6379"
password: ""
db: 0
metrics_flush_interval: 500ms
```
## Monitoring
### Queue Depth
```go
depth, err := queue.QueueDepth()
fmt.Printf("Pending tasks: %d\n", depth)
```
### Worker Heartbeat
```go
// Worker sends heartbeat every 30s
err := queue.Heartbeat(workerID)
```
### Metrics
```redis
HGETALL job:metrics:{job_name}
# Returns: timestamp, tasks_start, tasks_final, etc
```
## Error Handling
### Task Failures
```go
if err := w.runJob(task); err != nil {
task.Status = "failed"
task.Error = err.Error()
w.queue.UpdateTask(task)
}
```
### Redis Connection Loss
```go
// TaskQueue reconnects automatically; workers should still retry
// transient errors with an increasing delay.
backoff := time.Second
for retries := 0; retries < 3; retries++ {
    task, err := queue.GetNextTask()
    if err == nil {
        _ = task // process it
        break
    }
    time.Sleep(backoff)
    backoff *= 2
}
```
## Testing
```go
// tests using miniredis
s, _ := miniredis.Run()
defer s.Close()
tq, _ := queue.NewTaskQueue(queue.Config{
RedisAddr: s.Addr(),
})
task := &queue.Task{ID: "test-1", JobName: "test"}
tq.AddTask(task)
fetched, _ := tq.GetNextTask()
// assert fetched.ID == "test-1"
```
## Best Practices
1. **Unique Task IDs**: Always use UUIDs to avoid conflicts
2. **Metadata**: Store commit_id and user in task metadata
3. **Priority**: Higher values execute first (0-255 range)
4. **Status Updates**: Update status at each lifecycle stage
5. **Error Logging**: Store detailed errors in task.Error
6. **Heartbeats**: Workers should send heartbeats regularly
7. **Metrics**: Use UpdateTaskWithMetrics for atomic updates
---
For implementation details, see:
- [internal/queue/task.go](https://github.com/jfraeys/fetch_ml/blob/main/internal/queue/task.go)
- [internal/queue/queue.go](https://github.com/jfraeys/fetch_ml/blob/main/internal/queue/queue.go)

---
layout: page
title: "Redis High Availability (Optional)"
permalink: /redis-ha/
nav_order: 7
---
# Redis High Availability
**Note:** This is optional for homelab setups. Single Redis instance is sufficient for most use cases.
## When You Need HA
Consider Redis HA if:
- Running production workloads
- Uptime > 99.9% required
- Can't afford to lose queued tasks
- Multiple workers across machines
## Redis Sentinel (Recommended)
### Setup
```yaml
# docker-compose.yml
version: '3.8'
services:
redis-master:
image: redis:7-alpine
command: redis-server --maxmemory 2gb
redis-replica:
image: redis:7-alpine
    command: redis-server --replicaof redis-master 6379
redis-sentinel-1:
image: redis:7-alpine
command: redis-sentinel /etc/redis/sentinel.conf
volumes:
- ./sentinel.conf:/etc/redis/sentinel.conf
```
**sentinel.conf:**
```conf
sentinel monitor mymaster redis-master 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 10000
```
### Application Configuration
```yaml
# worker-config.yaml
redis_addr: "redis-sentinel-1:26379,redis-sentinel-2:26379"
redis_master_name: "mymaster"
```
## Redis Cluster (Advanced)
For larger deployments with sharding needs.
```yaml
# Minimum 3 masters + 3 replicas
services:
redis-1:
image: redis:7-alpine
command: redis-server --cluster-enabled yes
redis-2:
# ... similar config
```
## Homelab Alternative: Persistence Only
**For most homelabs, just enable persistence:**
```yaml
# docker-compose.yml
services:
redis:
image: redis:7-alpine
command: redis-server --appendonly yes
volumes:
- redis_data:/data
volumes:
redis_data:
```
This ensures tasks survive Redis restarts without full HA complexity.
---
**Recommendation:** Start simple. Add HA only if you experience actual downtime issues.

---
layout: page
title: "Zig CLI Guide"
permalink: /zig-cli/
nav_order: 3
---
# Zig CLI Guide
High-performance command-line interface for ML experiment management, written in Zig for maximum speed and efficiency.
## Overview
The Zig CLI (`ml`) is the primary interface for managing ML experiments in your homelab. Built with Zig, it provides exceptional performance for file operations, network communication, and experiment management.
## Installation
### Pre-built Binaries (Recommended)
Download from [GitHub Releases](https://github.com/jfraeys/fetch_ml/releases):
```bash
# Download for your platform
curl -LO https://github.com/jfraeys/fetch_ml/releases/latest/download/ml-<platform>.tar.gz
# Extract
tar -xzf ml-<platform>.tar.gz
# Install
chmod +x ml-<platform>
sudo mv ml-<platform> /usr/local/bin/ml
# Verify
ml --help
```
**Platforms:**
- `ml-linux-x86_64.tar.gz` - Linux (fully static, zero dependencies)
- `ml-macos-x86_64.tar.gz` - macOS Intel
- `ml-macos-arm64.tar.gz` - macOS Apple Silicon
All release binaries include **embedded static rsync** for complete independence.
### Build from Source
**Development Build** (uses system rsync):
```bash
cd cli
zig build dev
./zig-out/dev/ml-dev --help
```
**Production Build** (embedded rsync):
```bash
cd cli
# For testing: uses rsync wrapper
zig build prod
# For release with static rsync:
# 1. Place static rsync binary at src/assets/rsync_release.bin
# 2. Build
zig build prod
strip zig-out/prod/ml # Optional: reduce size
# Verify
./zig-out/prod/ml --help
ls -lh zig-out/prod/ml
```
See [cli/src/assets/README.md](https://github.com/jfraeys/fetch_ml/blob/main/cli/src/assets/README.md) for details on obtaining static rsync binaries.
### Verify Installation
```bash
ml --help
ml --version # Shows build config
```
## Quick Start
1. **Initialize Configuration**
```bash
ml init
```
2. **Sync Your First Project**
```bash
ml sync ./my-project --queue
```
3. **Monitor Progress**
```bash
ml status
```
## Command Reference
### `init` - Configuration Setup
Initialize the CLI configuration file.
```bash
ml init
```
**Creates:** `~/.ml/config.toml`
**Configuration Template:**
```toml
worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"
```
### `sync` - Project Synchronization
Sync project files to the worker with intelligent deduplication.
```bash
# Basic sync
ml sync ./project
# Sync with custom name and auto-queue
ml sync ./project --name "experiment-1" --queue
# Sync with priority
ml sync ./project --priority 8
```
**Options:**
- `--name <name>`: Custom experiment name
- `--queue`: Automatically queue after sync
- `--priority N`: Set priority (1-10, default 5)
**Features:**
- **Content-Addressed Storage**: Automatic deduplication
- **SHA256 Commit IDs**: Reliable change detection
- **Incremental Transfer**: Only sync changed files
- **Rsync Backend**: Efficient file transfer
### `queue` - Job Management
Queue experiments for execution on the worker.
```bash
# Queue with commit ID
ml queue my-job --commit abc123def456
# Queue with priority
ml queue my-job --commit abc123 --priority 8
```
**Options:**
- `--commit <id>`: Commit ID from sync output
- `--priority N`: Execution priority (1-10)
**Features:**
- **WebSocket Communication**: Real-time job submission
- **Priority Queuing**: Higher priority jobs run first
- **API Authentication**: Secure job submission
### `watch` - Auto-Sync Monitoring
Monitor directories for changes and auto-sync.
```bash
# Watch for changes
ml watch ./project
# Watch and auto-queue on changes
ml watch ./project --name "dev-exp" --queue
```
**Options:**
- `--name <name>`: Custom experiment name
- `--queue`: Auto-queue on changes
- `--priority N`: Set priority for queued jobs
**Features:**
- **Real-time Monitoring**: 2-second polling interval
- **Change Detection**: File modification time tracking
- **Commit Comparison**: Only sync when content changes
- **Automatic Queuing**: Seamless development workflow
### `status` - System Status
Check system and worker status.
```bash
ml status
```
**Displays:**
- Worker connectivity
- Queue status
- Running jobs
- System health
### `monitor` - Remote Monitoring
Launch TUI interface via SSH for real-time monitoring.
```bash
ml monitor
```
**Features:**
- **Real-time Updates**: Live experiment status
- **Interactive Interface**: Browse and manage experiments
- **SSH Integration**: Secure remote access
### `cancel` - Job Cancellation
Cancel running or queued jobs.
```bash
ml cancel job-id
```
**Options:**
- `job-id`: Job identifier from status output
### `prune` - Cleanup Management
Clean up old experiments to save space.
```bash
# Keep last N experiments
ml prune --keep 20
# Remove experiments older than N days
ml prune --older-than 30
```
**Options:**
- `--keep N`: Keep N most recent experiments
- `--older-than N`: Remove experiments older than N days
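The `--keep N` selection rule — sort by recency, delete everything past the first N — can be sketched as follows (the `exp` record and function name are hypothetical, for illustration only):

```go
package main

import (
	"fmt"
	"sort"
)

// exp is a hypothetical experiment record; age is days since last sync.
type exp struct {
	name string
	age  int
}

// pruneKeep returns the names to delete when keeping the N most recent.
func pruneKeep(exps []exp, keep int) []string {
	sort.Slice(exps, func(i, j int) bool { return exps[i].age < exps[j].age })
	var victims []string
	for i, e := range exps {
		if i >= keep {
			victims = append(victims, e.name)
		}
	}
	return victims
}

func main() {
	exps := []exp{{"a", 3}, {"b", 40}, {"c", 1}}
	fmt.Println(pruneKeep(exps, 2)) // [b] — the oldest beyond the 2 kept
}
```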
## Architecture
### Core Components
```
cli/src/
├── commands/ # Command implementations
│ ├── init.zig # Configuration setup
│ ├── sync.zig # Project synchronization
│ ├── queue.zig # Job management
│ ├── watch.zig # Auto-sync monitoring
│ ├── status.zig # System status
│ ├── monitor.zig # Remote monitoring
│ ├── cancel.zig # Job cancellation
│ └── prune.zig # Cleanup operations
├── config.zig # Configuration management
├── errors.zig # Error handling
├── net/ # Network utilities
│ └── ws.zig # WebSocket client
└── utils/ # Utility functions
├── crypto.zig # Hashing and encryption
├── storage.zig # Content-addressed storage
└── rsync.zig # File synchronization
```
### Performance Features
#### Content-Addressed Storage
- **Deduplication**: Identical files shared across experiments
- **Hash-based Storage**: Files stored by SHA256 hash
- **Space Efficiency**: Reduces storage by up to 90%
#### SHA256 Commit IDs
- **Reliable Detection**: Cryptographic change detection
- **Collision Resistance**: Effectively unique identifiers
- **Fast Computation**: Optimized for large directories
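The idea behind SHA256 commit IDs can be sketched in a few lines: hash each file's content, then hash the sorted list of `path:hash` lines so the result is deterministic regardless of traversal order. This is an illustrative sketch of the technique, not the CLI's exact Zig implementation:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// commitID derives a content-addressed ID for a set of files:
// any byte change in any file, or any added/removed file,
// changes the final ID.
func commitID(files map[string][]byte) string {
	lines := make([]string, 0, len(files))
	for path, data := range files {
		lines = append(lines, fmt.Sprintf("%s:%x", path, sha256.Sum256(data)))
	}
	sort.Strings(lines) // deterministic, order-independent

	h := sha256.New()
	for _, l := range lines {
		h.Write([]byte(l + "\n"))
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	a := commitID(map[string][]byte{"train.py": []byte("v1")})
	b := commitID(map[string][]byte{"train.py": []byte("v2")})
	fmt.Println(a != b, len(a)) // true 64
}
```

The per-file hashes also serve as keys for the content-addressed store, which is what makes deduplication fall out for free.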
#### WebSocket Protocol
- **Low Latency**: Real-time communication
- **Binary Protocol**: Efficient message format
- **Connection Pooling**: Reused connections
#### Memory Management
- **Arena Allocators**: Efficient memory allocation
- **Zero-copy Operations**: Minimized memory usage
- **Resource Cleanup**: Automatic resource management
### Security Features
#### Authentication
- **API Key Hashing**: Secure token storage
- **SHA256 Hashes**: Irreversible token protection
- **Config Validation**: Input sanitization
#### Secure Communication
- **SSH Integration**: Encrypted file transfers
- **WebSocket Security**: TLS-protected communication
- **Input Validation**: Comprehensive argument checking
#### Error Handling
- **Secure Reporting**: No sensitive information leakage
- **Graceful Degradation**: Safe error recovery
- **Audit Logging**: Operation tracking
## Advanced Usage
### Workflow Integration
#### Development Workflow
```bash
# 1. Initialize project
ml sync ./project --name "dev" --queue
# 2. Auto-sync during development
ml watch ./project --name "dev" --queue
# 3. Monitor progress
ml status
```
#### Batch Processing
```bash
# Process multiple experiments
for dir in experiments/*/; do
ml sync "$dir" --queue
done
```
#### Priority Management
```bash
# High priority experiment
ml sync ./urgent --priority 10 --queue
# Background processing
ml sync ./background --priority 1 --queue
```
### Configuration Management
#### Multiple Workers
The config format describes one worker at a time; one approach to switching between workers is to keep separate copies of the file and swap them in.
```toml
# ~/.ml/config.toml
worker_host = "worker.local"
worker_user = "mluser"
worker_base = "/data/ml-experiments"
worker_port = 22
api_key = "your-api-key"
```
#### Security Settings
```bash
# Set restrictive permissions
chmod 600 ~/.ml/config.toml
# Verify configuration
ml status
```
## Troubleshooting
### Common Issues
#### Build Problems
```bash
# Check Zig installation
zig version
# Clean build
cd cli && make clean && make build
```
#### Connection Issues
```bash
# Test SSH connectivity
ssh -p $worker_port $worker_user@$worker_host
# Verify configuration
cat ~/.ml/config.toml
```
#### Sync Failures
```bash
# Check rsync
rsync --version
# Manual sync test
rsync -avz ./test/ $worker_user@$worker_host:/tmp/
```
#### Performance Issues
```bash
# Monitor resource usage
top -p $(pgrep ml)
# Check disk space
df -h $worker_base
```
### Debug Mode
Enable verbose logging:
```bash
# Environment variable
export ML_DEBUG=1
ml sync ./project
# Or use debug build
cd cli && make debug
```
## Performance Benchmarks
### File Operations
- **Sync Speed**: 100MB/s+ (network limited)
- **Hash Computation**: 500MB/s+ (CPU limited)
- **Deduplication**: 90%+ space savings
### Memory Usage
- **Base Memory**: ~10MB
- **Large Projects**: ~50MB (1GB+ projects)
- **Memory Efficiency**: Constant per-file overhead
### Network Performance
- **WebSocket Latency**: <10ms (local network)
- **Connection Setup**: <100ms
- **Throughput**: Network limited
## Contributing
### Development Setup
```bash
cd cli
zig build-exe src/main.zig
```
### Testing
```bash
# Run tests
cd cli && zig test src/
# Integration tests
zig test tests/
```
### Code Style
- Follow Zig style guidelines
- Use explicit error handling
- Document public APIs
- Add comprehensive tests
---
**For more information, see the [CLI Reference](/cli-reference/) and [Architecture](/architecture/) pages.**