docs(config): reorganize configuration structure and add documentation

Restructure configuration files for better organization:
- Add scheduler configuration examples (scheduler.yaml.example)
- Reorganize worker configs into subdirectories:
  - distributed/ - Multi-node cluster configurations
  - standalone/ - Single-node deployment configs
- Add environment-specific configs:
  - dev-local.yaml, docker-dev.yaml, docker-prod.yaml
  - homelab-secure.yaml, worker-prod.toml
- Add deployment configs for different security modes:
  - docker-standard.yaml, docker-hipaa.yaml, docker-dev.yaml

Add documentation:
- configs/README.md with configuration guidelines
- configs/SECURITY.md with security configuration best practices
Jeremie Fraeys 2026-02-26 12:04:11 -05:00
parent 95adcba437
commit 86f9ae5a7e
15 changed files with 406 additions and 27 deletions

configs/README.md

@ -0,0 +1,60 @@
# fetch_ml Configuration Guide
## Quick Start
### Standalone Mode (Existing Behavior)
```bash
# Single worker, direct queue access
go run ./cmd/worker -config configs/worker/standalone/worker.yaml
```
### Distributed Mode
```bash
# Terminal 1: Start scheduler
go run ./cmd/scheduler -config configs/scheduler/scheduler.yaml
# Terminal 2: Start worker
go run ./cmd/worker -config configs/worker/distributed/worker.yaml
```
### Single-Node Mode (Zero Config)
```bash
# Both scheduler and worker in one process
go run ./cmd/fetch_ml -config configs/multi-node/single-node.yaml
```
## Config Structure
```
configs/
├── scheduler/
│   └── scheduler.yaml        # Central scheduler configuration
├── worker/
│   ├── standalone/
│   │   └── worker.yaml       # Direct queue access (Redis/SQLite)
│   └── distributed/
│       └── worker.yaml       # WebSocket to scheduler
└── multi-node/
    └── single-node.yaml      # Combined scheduler+worker
```
## Key Configuration Modes
| Mode | Use Case | Backend |
|------|----------|---------|
| `standalone` | Single machine, existing behavior | Redis/SQLite/Filesystem |
| `distributed` | Multiple workers, central scheduler | WebSocket to scheduler |
| `both` | Quick testing, single process | In-process scheduler |
## Worker Mode Selection
Set `worker.mode` to switch between implementations:
```yaml
worker:
  mode: "standalone"       # Uses Redis/SQLite queue.Backend
  # OR
  # mode: "distributed"    # Uses SchedulerBackend over WebSocket
```
The worker code is unchanged — only the backend implementation changes.
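
As a rough illustration of that design, the sketch below maps the `worker.mode` string to one of two implementations behind a shared interface. The `Backend` interface, the two struct types, and `NewBackend` are hypothetical names chosen for this example; the real `queue.Backend` and `SchedulerBackend` types in fetch_ml are not shown in this commit and may look different.

```go
package main

import "fmt"

// Backend is a hypothetical stand-in for the interface the worker talks to.
// The real queue.Backend in fetch_ml may expose different methods.
type Backend interface {
	Name() string
}

// standaloneBackend would wrap direct Redis/SQLite queue access.
type standaloneBackend struct{}

// distributedBackend would wrap the WebSocket link to the scheduler.
type distributedBackend struct{}

func (standaloneBackend) Name() string  { return "standalone" }
func (distributedBackend) Name() string { return "distributed" }

// NewBackend maps the worker.mode value to a backend implementation.
// Unknown modes are rejected instead of silently falling back to a default.
func NewBackend(mode string) (Backend, error) {
	switch mode {
	case "standalone":
		return standaloneBackend{}, nil
	case "distributed":
		return distributedBackend{}, nil
	default:
		return nil, fmt.Errorf("unknown worker.mode %q", mode)
	}
}
```

Calling `NewBackend("distributed")` returns the scheduler-backed implementation, and the rest of the worker only ever sees the `Backend` interface.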

configs/SECURITY.md

@ -0,0 +1,130 @@
# Security Guidelines for fetch_ml Distributed Mode
## Token Management
### Quick Start (Recommended)
```bash
# 1. Generate config with tokens
scheduler -init -config scheduler.yaml
# 2. Or generate a single token
scheduler -generate-token
```
### Generating Tokens
**Option 1: Initialize full config (recommended)**
```bash
# Generate config with 3 worker tokens
scheduler -init -config /etc/fetch_ml/scheduler.yaml
# Generate with more tokens
scheduler -init -config /etc/fetch_ml/scheduler.yaml -tokens 5
```
**Option 2: Generate single token**
```bash
# Generate one token
scheduler -generate-token
# Output: wkr_abc123...
```
**Option 3: Using OpenSSL**
```bash
# Prefix the 64 hex characters with wkr_ to match the token format above
echo "wkr_$(openssl rand -hex 32)"
```
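
For readers who want to see what such a token amounts to, here is a minimal Go sketch that produces the same shape: the `wkr_` prefix followed by 64 hex characters drawn from the operating system's CSPRNG. `NewWorkerToken` and `tokenPrefix` are hypothetical names; how `scheduler -generate-token` is actually implemented is not shown in this commit.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// tokenPrefix matches the wkr_ prefix shown in the example output above.
const tokenPrefix = "wkr_"

// NewWorkerToken returns tokenPrefix followed by 64 hex characters,
// that is, 32 random bytes read from the OS CSPRNG via crypto/rand.
func NewWorkerToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("read random bytes: %w", err)
	}
	return tokenPrefix + hex.EncodeToString(buf), nil
}
```

Using `crypto/rand` rather than `math/rand` matters here, since the token is the only credential a worker presents.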
### Token Storage
- **NEVER commit tokens to git** — config files with real tokens are gitignored
- Store tokens in environment variables or secure secret management
- Use `.env` files locally (already gitignored)
- Rotate tokens periodically
### Config File Security
```
configs/
├── scheduler/scheduler.yaml # ⛔ NEVER commit with real tokens
├── scheduler/scheduler.yaml.example # ✅ Safe to commit (placeholders)
└── worker/distributed/worker.yaml # ⛔ NEVER commit with real tokens
```
All `*.yaml` files in `configs/` subdirectories are gitignored by default.
### Distribution Workflow
```bash
# On scheduler host:
$ scheduler -init -config /etc/fetch_ml/scheduler.yaml
Config generated: /etc/fetch_ml/scheduler.yaml
Generated 3 worker tokens. Copy the appropriate token to each worker's config.
=== Generated Worker Tokens ===
Copy these to your worker configs:
Worker: worker-01
Token: wkr_abc123...
Worker: worker-02
Token: wkr_def456...
# On each worker host - copy the appropriate token:
$ cat > /etc/fetch_ml/worker.yaml <<EOF
scheduler:
  address: "scheduler-host:7777"
  cert: "/etc/fetch_ml/scheduler.crt"
  token: "wkr_abc123..."  # Copy from above
EOF
```
## TLS Configuration
### Self-Signed Certs (Development)
```yaml
scheduler:
  auto_generate_certs: true
  cert_file: "/etc/fetch_ml/scheduler.crt"
  key_file: "/etc/fetch_ml/scheduler.key"
```
Auto-generated certs are for development only. The scheduler prints the cert path on first run — distribute this to workers securely.
### Production TLS
Use proper certificates from your CA:
```yaml
scheduler:
  auto_generate_certs: false
  cert_file: "/etc/ssl/certs/fetch_ml.crt"
  key_file: "/etc/ssl/private/fetch_ml.key"
```
## Network Security
- Scheduler bind address defaults to `0.0.0.0:7777` — firewall appropriately
- WebSocket connections use WSS with cert pinning (no CA chain required)
- Token authentication on every WebSocket connection
- Metrics endpoint (`/metrics`) has no auth — bind to localhost or add proxy auth
## Audit Logging
Enable audit logging to track job lifecycle:
```yaml
scheduler:
  audit_log: "/var/log/fetch_ml/audit.log"
```
## Security Checklist
- [ ] Tokens generated via `scheduler -init` or `scheduler -generate-token`
- [ ] Config files with tokens NOT in git
- [ ] TLS certs distributed securely to workers
- [ ] Scheduler bind address firewalled
- [ ] Metrics endpoint protected (if exposed)
- [ ] Audit logging enabled

@ -0,0 +1,32 @@
# Scheduler Configuration Example
# Copy this to scheduler.yaml and replace placeholders with real values
# DO NOT commit the actual scheduler.yaml with real tokens
scheduler:
  bind_addr: "0.0.0.0:7777"
  # Auto-generate self-signed certs if files don't exist
  auto_generate_certs: true
  cert_file: "/etc/fetch_ml/scheduler.crt"
  key_file: "/etc/fetch_ml/scheduler.key"
  state_dir: "/var/lib/fetch_ml"
  default_batch_slots: 3
  default_service_slots: 1
  starvation_threshold_mins: 5
  priority_aging_rate: 0.1
  gang_alloc_timeout_secs: 60
  acceptance_timeout_secs: 30
  metrics_addr: "0.0.0.0:9090"
  # Generate tokens using: openssl rand -hex 32
  # Example: wkr_abc123... (64 hex chars after wkr_)
  worker_tokens:
    - id: "worker-01"
      token: "wkr_PLACEHOLDER_GENERATE_WITH_OPENSSL_RAND_HEX_32"
    - id: "worker-02"
      token: "wkr_PLACEHOLDER_GENERATE_WITH_OPENSSL_RAND_HEX_32"

@ -0,0 +1,33 @@
# Distributed Worker Configuration Example
# Copy this to worker.yaml and replace placeholders with real values
# DO NOT commit the actual worker.yaml with real tokens
node:
  role: "worker"
  id: ""  # Auto-generated UUID if empty
worker:
  mode: "distributed"
  max_workers: 3
scheduler:
  address: "192.168.1.10:7777"
  cert: "/etc/fetch_ml/scheduler.crt"
  # Copy token from scheduler config for this worker
  token: "wkr_COPY_FROM_SCHEDULER_CONFIG"
slots:
  service_slots: 1
ports:
  service_range_start: 8000
  service_range_end: 8099
gpu:
  vendor: "auto"
prewarm:
  enabled: true
log:
  level: "info"
  format: "json"

@ -0,0 +1,32 @@
# Standalone Worker Configuration Example
# Copy this to worker.yaml and adjust for your environment
node:
  role: "worker"
  id: ""
worker:
  mode: "standalone"
  max_workers: 3
queue:
  backend: "redis"
  redis_addr: "localhost:6379"
  redis_password: ""  # Set if Redis requires auth
  redis_db: 0
slots:
  service_slots: 1
ports:
  service_range_start: 8000
  service_range_end: 8099
gpu:
  vendor: "auto"
prewarm:
  enabled: true
log:
  level: "info"
  format: "json"

@ -1,27 +0,0 @@
worker_id: "test-prewarm-worker"
host: "localhost"
port: 8081
base_path: "/tmp/fetch-ml-test"
data_dir: "/tmp/fetch-ml-test/data"
max_workers: 2
local_mode: true
auto_fetch_data: true
prewarm_enabled: true
metrics:
enabled: true
listen_addr: ":9102"
train_script: "train.py"
snapshot_store:
enabled: false
endpoint: ""
secure: false
region: ""
bucket: ""
prefix: ""
access_key: ""
secret_key: ""
session_token: ""
max_retries: 3
timeout: 0s
gpu_devices: []
gpu_access: "none"

@ -0,0 +1,31 @@
# Development mode worker configuration
# Relaxed validation for fast iteration
host: localhost
port: 22
user: dev-user
base_path: /tmp/fetchml_dev
train_script: train.py
# Redis configuration
redis_url: redis://redis:6379
# Development mode - relaxed security
compliance_mode: dev
max_workers: 4
# Sandbox settings (relaxed for development)
sandbox:
  network_mode: bridge
  seccomp_profile: ""
  no_new_privileges: false
  allowed_secrets: []  # All secrets allowed in dev
# GPU configuration
gpu_vendor: none
# Artifact handling (relaxed limits)
max_artifact_files: 10000
max_artifact_total_bytes: 1073741824 # 1GB
# Provenance (disabled in dev for speed)
provenance_best_effort: false

@ -0,0 +1,53 @@
# HIPAA compliance mode worker configuration
# Strict validation, no network, PHI protection
host: localhost
port: 22
user: hipaa-worker
base_path: /var/lib/fetchml/secure
train_script: train.py
# Redis configuration (must use env var for password)
redis_url: redis://redis:6379
redis_password: ${REDIS_PASSWORD}
# HIPAA mode - strict compliance
compliance_mode: hipaa
max_workers: 1
# Sandbox settings (strict isolation required by HIPAA)
sandbox:
  # Network must be disabled for HIPAA compliance
  network_mode: none
  # Seccomp profile must be set
  seccomp_profile: default-hardened
  # No new privileges must be enforced
  no_new_privileges: true
  # Only approved secrets allowed (no PHI fields)
  allowed_secrets:
    - HF_TOKEN
    - WANDB_API_KEY
    - AWS_ACCESS_KEY_ID
    - AWS_SECRET_ACCESS_KEY
    # PHI fields are EXPLICITLY DENIED:
    # - PATIENT_ID
    # - SSN
    # - MEDICAL_RECORD_NUMBER
    # - DIAGNOSIS_CODE
    # - DOB
    # - INSURANCE_ID
# GPU configuration
gpu_vendor: none
# Artifact handling (strict limits for HIPAA)
max_artifact_files: 100
max_artifact_total_bytes: 104857600 # 100MB
# Provenance (strictly required for HIPAA)
provenance_best_effort: false
# SSH key must use environment variable
ssh_key: ${SSH_KEY_PATH}
# Config hash computation enabled (required for audit)
# This is automatically computed by Validate()

@ -0,0 +1,35 @@
# Standard security mode worker configuration
# Normal sandbox, network isolation
host: localhost
port: 22
user: worker-user
base_path: /var/lib/fetchml
train_script: train.py
# Redis configuration
redis_url: redis://redis:6379
# Standard mode - normal security
compliance_mode: standard
max_workers: 2
# Sandbox settings (standard isolation)
sandbox:
  network_mode: none
  seccomp_profile: default
  no_new_privileges: true
  allowed_secrets:
    - HF_TOKEN
    - WANDB_API_KEY
    - AWS_ACCESS_KEY_ID
    - AWS_SECRET_ACCESS_KEY
# GPU configuration
gpu_vendor: none
# Artifact handling (reasonable limits)
max_artifact_files: 1000
max_artifact_total_bytes: 536870912 # 512MB
# Provenance (enabled)
provenance_best_effort: true