# Configuration Reference

## Overview

This document provides a comprehensive reference for all configuration options in the FetchML project.

## Environment Configurations

### Local Development

**File:** `configs/api/dev.yaml`
```yaml
base_path: "./data/dev/experiments"
data_dir: "./data/dev/active"

auth:
  enabled: false

server:
  address: "0.0.0.0:9101"
  tls:
    enabled: false
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"

security:
  production_mode: false
  allowed_origins:
    - "http://localhost:3000"
  api_key_rotation_days: 90
  audit_logging:
    enabled: true
    log_path: "./data/dev/logs/fetchml-audit.log"
  rate_limit:
    enabled: false
    requests_per_minute: 60
    burst_size: 10
  ip_whitelist: []

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  addr: "redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "./data/dev/fetchml.sqlite"

logging:
  level: "info"
  file: "./data/dev/logs/fetchml.log"
  audit_log: "./data/dev/logs/fetchml-audit.log"

resources:
  max_workers: 1
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"
```
### Multi-User Setup

**File:** `configs/api/multi-user.yaml`
```yaml
base_path: "/app/data/experiments"
data_dir: "/data/active"

auth:
  enabled: true
  api_keys:
    admin_user:
      hash: "CHANGE_ME_SHA256_ADMIN_USER_KEY"
      admin: true
      roles: ["user", "admin"]
      permissions:
        "*": true
    researcher1:
      hash: "CHANGE_ME_SHA256_RESEARCHER1_KEY"
      admin: false
      roles: ["user", "researcher"]
      permissions:
        "jobs:read": true
        "jobs:create": true
        "jobs:update": true
        "jobs:delete": false
    analyst1:
      hash: "CHANGE_ME_SHA256_ANALYST1_KEY"
      admin: false
      roles: ["user", "analyst"]
      permissions:
        "jobs:read": true
        "jobs:create": false
        "jobs:update": false
        "jobs:delete": false

server:
  address: ":9101"
  tls:
    enabled: false

security:
  production_mode: false
  allowed_origins: []
  rate_limit:
    enabled: true
    requests_per_minute: 60
    burst_size: 20
  ip_whitelist: []

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  url: "redis://redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "/app/data/experiments/fetch_ml.sqlite"

logging:
  level: "info"
  file: "/logs/app.log"
  audit_log: ""

resources:
  max_workers: 3
  desired_rps_per_worker: 3
  podman_cpus: "2"
  podman_memory: "4Gi"
```
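The `CHANGE_ME_SHA256_*` placeholders are SHA-256 digests of the raw API keys, not the keys themselves: clients authenticate with the raw key, and only its digest is stored in the config. Assuming standard coreutils are available, a digest can be generated like this (the key shown is a placeholder):

```shell
# Hash an API key for the `hash` field (placeholder key shown).
# `printf` avoids the trailing newline that `echo` would add to the hash input.
printf '%s' 'my-raw-api-key' | sha256sum | awk '{print $1}'
```

Paste the resulting hex digest into the corresponding `hash` field.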
### Production

**File:** `configs/api/prod.yaml`
```yaml
base_path: "/app/data/prod/experiments"
data_dir: "/app/data/prod/active"

auth:
  enabled: true
  api_keys:
    admin:
      hash: "replace-with-sha256-of-your-api-key"
      admin: true
      roles:
        - admin
      permissions:
        "*": true

server:
  address: ":9101"
  tls:
    enabled: true
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"

security:
  production_mode: false
  allowed_origins: []
  rate_limit:
    enabled: true
    requests_per_minute: 60
    burst_size: 10
  ip_whitelist: []

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  addr: "redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "/app/data/prod/fetch_ml.sqlite"

logging:
  level: "info"
  file: "/app/data/prod/logs/fetch_ml.log"
  audit_log: "/app/data/prod/logs/audit.log"

resources:
  max_workers: 2
  desired_rps_per_worker: 5
  podman_cpus: "2"
  podman_memory: "4Gi"
```
### Homelab Secure

**File:** `configs/api/homelab-secure.yaml`

Secure configuration for homelab deployments with production-grade security settings:
```yaml
base_path: "/data/experiments"
data_dir: "/data/active"

auth:
  enabled: true
  api_keys:
    homelab_admin:
      hash: "CHANGE_ME_SHA256_HOMELAB_ADMIN_KEY"
      admin: true
      roles:
        - admin
      permissions:
        "*": true
    homelab_user:
      hash: "CHANGE_ME_SHA256_HOMELAB_USER_KEY"
      admin: false
      roles:
        - researcher
      permissions:
        experiments: true
        datasets: true
        jupyter: true

server:
  address: ":9101"
  tls:
    enabled: false
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"

security:
  production_mode: true
  allowed_origins:
    - "https://ml-experiments.example.com"
  rate_limit:
    enabled: true
    requests_per_minute: 60
    burst_size: 10
  ip_whitelist:
    - "127.0.0.1"
    - "192.168.0.0/16"

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  url: "redis://:CHANGE_ME_REDIS_PASSWORD@redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "/data/experiments/fetch_ml.sqlite"

logging:
  level: "info"
  file: "/logs/fetch_ml.log"
  audit_log: ""

resources:
  max_workers: 1
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"
```
## Worker Configurations

### Local Development Worker

**File:** `configs/workers/dev-local.yaml`
```yaml
worker_id: "local-worker"
base_path: "data/dev/experiments"
train_script: "train.py"
redis_url: "redis://localhost:6379/0"
local_mode: true
prewarm_enabled: false
max_workers: 2
poll_interval_seconds: 2
auto_fetch_data: false
data_manager_path: "./data_manager"
dataset_cache_ttl: "30m"
data_dir: "data/dev/active"

snapshot_store:
  enabled: false

podman_image: "python:3.9-slim"
container_workspace: "/workspace"
container_results: "/results"
gpu_devices: []
gpu_vendor: "apple"
gpu_visible_devices: []

# Apple M-series GPU configuration
apple_gpu:
  enabled: true
  metal_device: "/dev/metal"
  mps_runtime: "/dev/mps"

resources:
  max_workers: 2
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"

metrics:
  enabled: false

queue:
  type: "native"
  native:
    data_dir: "data/dev/queue"

task_lease_duration: "30m"
heartbeat_interval: "1m"
max_retries: 3
graceful_timeout: "5m"
```
### Homelab Secure Worker

**File:** `configs/workers/homelab-secure.yaml`

Secure worker configuration with snapshot store and Redis authentication:
```yaml
worker_id: "homelab-worker"
base_path: "/tmp/fetchml-jobs"
train_script: "train.py"
redis_url: "redis://:${REDIS_PASSWORD}@redis:6379/0"
local_mode: true
max_workers: 1
poll_interval_seconds: 2
auto_fetch_data: false
data_manager_path: "./data_manager"
dataset_cache_ttl: "30m"
data_dir: "/data/active"

snapshot_store:
  enabled: true
  endpoint: "minio:9000"
  secure: false
  bucket: "fetchml-snapshots"
  prefix: "snapshots"
  timeout: "5m"
  max_retries: 3

podman_image: "python:3.9-slim"
container_workspace: "/workspace"
container_results: "/results"
gpu_devices: []

resources:
  max_workers: 1
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"

metrics:
  enabled: true
  listen_addr: ":9100"
  metrics_flush_interval: "500ms"

task_lease_duration: "30m"
heartbeat_interval: "1m"
max_retries: 3
graceful_timeout: "5m"
```
### Docker Development Worker

**File:** `configs/workers/docker.yaml`
```yaml
worker_id: "docker-worker"
base_path: "/tmp/fetchml-jobs"
train_script: "train.py"
redis_addr: "redis:6379"
redis_password: ""
redis_db: 0
local_mode: true
max_workers: 1
poll_interval_seconds: 5
podman_image: "python:3.9-slim"
container_workspace: "/workspace"
container_results: "/results"
gpu_devices: []
gpu_vendor: "none"
gpu_visible_devices: []

metrics:
  enabled: true
  listen_addr: ":9100"
  metrics_flush_interval: "500ms"
```
### Legacy TOML Worker (Deprecated)

**File:** `configs/workers/worker-prod.toml`
```toml
worker_id = "worker-prod-01"
base_path = "/data/ml-experiments"
max_workers = 4
redis_addr = "localhost:6379"
redis_password = "CHANGE_ME_REDIS_PASSWORD"
redis_db = 0
host = "localhost"
user = "ml-user"
port = 22
ssh_key = "~/.ssh/id_rsa"
podman_image = "ml-training:latest"
gpu_vendor = "none"
gpu_visible_devices = []
gpu_devices = []
container_workspace = "/workspace"
container_results = "/results"
train_script = "train.py"

[resources]
max_workers = 4
desired_rps_per_worker = 2
podman_cpus = "4"
podman_memory = "16g"

[metrics]
enabled = true
listen_addr = ":9100"
```
## Security Hardening

### Seccomp Profiles

FetchML includes a hardened seccomp profile for container sandboxing at `configs/seccomp/default-hardened.json`.

**Features:**

- **Default-deny policy:** `SCMP_ACT_ERRNO` blocks all syscalls by default
- **Allowlist approach:** only explicitly permitted syscalls are allowed
- **Multi-architecture support:** x86_64, x86, aarch64
- **Blocked dangerous syscalls:** `ptrace`, `mount`, `umount2`, `reboot`, `kexec_load`, `open_by_handle_at`, `perf_event_open`
**Usage with Docker/Podman:**

```shell
# Docker with seccomp
docker run --security-opt seccomp=configs/seccomp/default-hardened.json \
  -v /data:/data:ro \
  my-image:latest

# Podman with seccomp
podman run --security-opt seccomp=configs/seccomp/default-hardened.json \
  --read-only \
  --no-new-privileges \
  my-image:latest
```
**Key allowed syscalls:**

- File operations: `open`, `openat`, `read`, `write`, `close`
- Memory: `mmap`, `munmap`, `mprotect`, `brk`
- Process: `clone`, `fork`, `execve`, `exit`, `wait4`
- Network: `socket`, `bind`, `listen`, `accept`, `connect`, `sendto`, `recvfrom`
- Signals: `rt_sigaction`, `rt_sigprocmask`, `kill`, `tkill`
- Time: `clock_gettime`, `gettimeofday`, `nanosleep`
- I/O: `epoll_create`, `epoll_ctl`, `epoll_wait`, `poll`, `select`
**Customization:**

Copy the default profile and modify it for your needs:

```shell
cp configs/seccomp/default-hardened.json configs/seccomp/custom-profile.json
# Edit to add/remove syscalls
```
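When editing a copy, it helps to know the overall shape of an OCI-style seccomp profile, which is the format Docker and Podman consume. The fragment below is illustrative only; the syscall names are examples, not a copy of `default-hardened.json`:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_AARCH64"],
  "syscalls": [
    {
      "names": ["openat", "read", "write", "close", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

Adding a syscall to the allowlist means appending its name to a `names` array in an entry whose `action` is `SCMP_ACT_ALLOW`.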
**Testing seccomp:**

```shell
# Test with a simple container
docker run --rm --security-opt seccomp=configs/seccomp/default-hardened.json \
  alpine:latest echo "Seccomp test passed"
```
## CLI Configuration

### User Config File

**Location:** `~/.ml/config.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = "<your-api-key>"

[cli]
default_timeout = 30
verbose = false
```
### Multi-User CLI Configs

**Admin config:** `~/.ml/config-admin.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = "<admin-api-key>"
```

**Researcher config:** `~/.ml/config-researcher.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = "<researcher-api-key>"
```

**Analyst config:** `~/.ml/config-analyst.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = "<analyst-api-key>"
```
## Configuration Options

### Authentication

| Option | Type | Default | Description |
|---|---|---|---|
| `auth.enabled` | bool | `false` | Enable authentication |
| `auth.api_keys` | map | `{}` | API key configurations |
| `auth.api_keys.[user].hash` | string | - | SHA-256 hash of the API key |
| `auth.api_keys.[user].admin` | bool | `false` | Admin privileges |
| `auth.api_keys.[user].roles` | array | `[]` | User roles |
| `auth.api_keys.[user].permissions` | map | `{}` | User permissions |
### Server

| Option | Type | Default | Description |
|---|---|---|---|
| `server.address` | string | `":9101"` | Server bind address |
| `server.tls.enabled` | bool | `false` | Enable TLS |
| `server.tls.cert_file` | string | - | TLS certificate file |
| `server.tls.key_file` | string | - | TLS private key file |
### Security

| Option | Type | Default | Description |
|---|---|---|---|
| `security.production_mode` | bool | `false` | Enable production hardening |
| `security.allowed_origins` | array | `[]` | Allowed CORS origins |
| `security.api_key_rotation_days` | int | `90` | Days until API key rotation is required |
| `security.audit_logging.enabled` | bool | `false` | Enable audit logging |
| `security.audit_logging.log_path` | string | - | Audit log file path |
| `security.rate_limit.enabled` | bool | `true` | Enable rate limiting |
| `security.rate_limit.requests_per_minute` | int | `60` | Requests-per-minute limit |
| `security.rate_limit.burst_size` | int | `10` | Burst request allowance |
| `security.ip_whitelist` | array | `[]` | Allowed IP addresses/CIDR ranges |
| `security.failed_login_lockout.enabled` | bool | `false` | Enable login lockout |
| `security.failed_login_lockout.max_attempts` | int | `5` | Max failed attempts before lockout |
| `security.failed_login_lockout.lockout_duration` | string | `"15m"` | Lockout duration (e.g., `"15m"`) |
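The interaction between `requests_per_minute` and `burst_size` follows the usual token-bucket pattern: the bucket holds at most `burst_size` tokens and refills at the per-minute rate. The sketch below illustrates those semantics under that assumption; it is not the server's actual limiter code:

```go
package main

import "fmt"

// burstAllowed counts how many of a run of back-to-back requests are
// admitted when the bucket starts full with burstSize tokens and no
// refill happens during the instantaneous burst. Hypothetical
// illustration of typical token-bucket semantics.
func burstAllowed(burstSize, requests int) int {
	tokens := burstSize // bucket starts full
	allowed := 0
	for i := 0; i < requests; i++ {
		if tokens > 0 {
			tokens--
			allowed++
		}
	}
	return allowed
}

func main() {
	// With burst_size: 10, only the first 10 of 15 simultaneous
	// requests succeed; the rest must wait on the per-minute refill.
	fmt.Println(burstAllowed(10, 15)) // 10
}
```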
### Monitoring

| Option | Type | Default | Description |
|---|---|---|---|
| `monitoring.prometheus.enabled` | bool | `true` | Enable Prometheus metrics |
| `monitoring.prometheus.port` | int | `9101` | Prometheus metrics port |
| `monitoring.prometheus.path` | string | `"/metrics"` | Metrics endpoint path |
| `monitoring.health_checks.enabled` | bool | `true` | Enable health checks |
| `monitoring.health_checks.interval` | string | `"30s"` | Health check interval |
### Database

| Option | Type | Default | Description |
|---|---|---|---|
| `database.type` | string | `"sqlite"` | Database type (sqlite, postgres, mysql) |
| `database.connection` | string | - | Connection string or path |
| `database.host` | string | - | Database host (for postgres/mysql) |
| `database.port` | int | - | Database port (for postgres/mysql) |
| `database.username` | string | - | Database username |
| `database.password` | string | - | Database password |
| `database.database` | string | - | Database name |
### Queue

| Option | Type | Default | Description |
|---|---|---|---|
| `queue.type` | string | `"native"` | Queue backend type (native, redis, sqlite, filesystem) |
| `queue.native.data_dir` | string | - | Data directory for the native queue |
| `queue.sqlite_path` | string | - | SQLite database path for the queue |
| `queue.filesystem_path` | string | - | Filesystem queue path |
| `queue.fallback_to_filesystem` | bool | `false` | Fall back to the filesystem queue on Redis failure |
### Resources

| Option | Type | Default | Description |
|---|---|---|---|
| `resources.max_workers` | int | `1` | Maximum concurrent workers |
| `resources.desired_rps_per_worker` | int | `2` | Desired requests per second per worker |
| `resources.requests_per_sec` | int | - | Global request rate limit |
| `resources.request_burst` | int | - | Request burst allowance |
| `resources.podman_cpus` | string | `"2"` | CPU limit for Podman containers |
| `resources.podman_memory` | string | `"4Gi"` | Memory limit for Podman containers |
### Plugin GPU Quotas

Control GPU allocation for plugin-based services (Jupyter, vLLM, etc.).

| Option | Type | Default | Description |
|---|---|---|---|
| `scheduler.plugin_quota.enabled` | bool | `false` | Enable plugin GPU quota enforcement |
| `scheduler.plugin_quota.total_gpus` | int | `0` | Global GPU limit across all plugins (0 = unlimited) |
| `scheduler.plugin_quota.per_user_gpus` | int | `0` | Default per-user GPU limit (0 = unlimited) |
| `scheduler.plugin_quota.per_user_services` | int | `0` | Default per-user service count limit (0 = unlimited) |
| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_gpus` | int | `0` | Plugin-specific GPU limit |
| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_services` | int | `0` | Plugin-specific service count limit |
| `scheduler.plugin_quota.user_overrides.{user}.max_gpus` | int | `0` | Per-user GPU override |
| `scheduler.plugin_quota.user_overrides.{user}.max_services` | int | `0` | Per-user service limit override |
| `scheduler.plugin_quota.user_overrides.{user}.allowed_plugins` | array | `[]` | Plugins the user is allowed to use (empty = all) |
Example configuration:

```yaml
scheduler:
  plugin_quota:
    enabled: true
    total_gpus: 16
    per_user_gpus: 4
    per_user_services: 2
    per_plugin_limits:
      vllm:
        max_gpus: 8
        max_services: 4
      jupyter:
        max_gpus: 4
        max_services: 10
    user_overrides:
      admin:
        max_gpus: 8
        max_services: 5
        allowed_plugins: ["jupyter", "vllm"]
```
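The per-user GPU check implied by these settings can be sketched as follows. The function and variable names mirror the config keys but are hypothetical; this is an illustration of the "0 = unlimited" convention, not the scheduler's actual code:

```go
package main

import "fmt"

// admit reports whether a GPU request fits under the per-user quota:
// when the quota is disabled or the limit is 0 (unlimited), everything
// is admitted; otherwise the user's in-use count plus the request must
// stay within per_user_gpus.
func admit(enabled bool, perUserGPUs, inUse, requested int) bool {
	if !enabled || perUserGPUs == 0 {
		return true // quota disabled or unlimited
	}
	return inUse+requested <= perUserGPUs
}

func main() {
	// per_user_gpus: 4, as in the example configuration above
	fmt.Println(admit(true, 4, 3, 1))  // true: 3+1 fits within 4
	fmt.Println(admit(true, 4, 3, 2))  // false: would exceed the limit
	fmt.Println(admit(true, 0, 99, 1)) // true: 0 means unlimited
}
```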
### Redis

| Option | Type | Default | Description |
|---|---|---|---|
| `redis.url` | string | `"redis://localhost:6379"` | Redis connection URL |
| `redis.addr` | string | - | Redis host:port shorthand |
| `redis.password` | string | - | Redis password |
| `redis.db` | int | `0` | Redis database number |
| `redis.max_connections` | int | `10` | Max Redis connections |
### Logging

| Option | Type | Default | Description |
|---|---|---|---|
| `logging.level` | string | `"info"` | Log level |
| `logging.file` | string | - | Log file path |
| `logging.audit_log` | string | - | Audit log path |
## Permission System

### Permission Keys

| Permission | Description |
|---|---|
| `jobs:read` | Read job information |
| `jobs:create` | Create new jobs |
| `jobs:update` | Update existing jobs |
| `jobs:delete` | Delete jobs |
| `*` | All permissions (admin only) |
### Role-Based Permissions

| Role | Default Permissions |
|---|---|
| admin | All permissions |
| researcher | `jobs:read`, `jobs:create`, `jobs:update` |
| analyst | `jobs:read` |
| user | No default permissions |
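The lookup order implied by these tables can be sketched as: a `"*"` grant matches everything, otherwise the exact `resource:action` key must be present and true. This is an illustration of the model, not the server's actual implementation:

```go
package main

import "fmt"

// hasPermission checks a permission map the way the tables above
// suggest: the wildcard "*" grants everything; otherwise the exact
// "resource:action" key must be explicitly set to true.
func hasPermission(perms map[string]bool, key string) bool {
	if perms["*"] {
		return true
	}
	return perms[key]
}

func main() {
	researcher := map[string]bool{
		"jobs:read": true, "jobs:create": true,
		"jobs:update": true, "jobs:delete": false,
	}
	admin := map[string]bool{"*": true}

	fmt.Println(hasPermission(researcher, "jobs:delete")) // false
	fmt.Println(hasPermission(admin, "jobs:delete"))      // true
}
```

Note that an absent key and an explicit `false` behave the same way: access is denied unless a grant is present.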
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `FETCHML_CONFIG` | - | Path to config file |
| `FETCHML_LOG_LEVEL` | `"info"` | Override log level |
| `CLI_CONFIG` | - | Path to CLI config file |
| `FETCH_ML_GPU_TYPE` | - | Override GPU vendor detection (`nvidia`, `amd`, `apple`, `none`); takes precedence over the config file |
| `FETCH_ML_GPU_COUNT` | - | Override GPU count detection; used with the auto-detected or configured vendor |
| `FETCH_ML_TOTAL_CPU` | - | Override total CPU count detection; sets the number of CPU cores available |
| `FETCH_ML_GPU_SLOTS_PER_GPU` | `1` | Override GPU slots per GPU; controls how many concurrent tasks can share a single GPU |
When environment variable overrides are active, they are logged to stderr at worker startup for debugging.

**Note:** When `gpu_vendor: amd` is configured, the system uses the NVIDIA detector implementation (aliased) due to similar device exposure patterns. The `configured_vendor` field will show `"amd"` while the actual detection uses NVIDIA-compatible methods.
## Troubleshooting

### Common Configuration Issues

1. **Authentication failures**
   - Check that API key hashes are correct SHA-256 digests
   - Verify YAML syntax
   - Ensure `auth.enabled: true`
2. **Connection issues**
   - Verify the server address and ports
   - Check firewall settings
   - Validate network connectivity
3. **Permission issues**
   - Check user roles and permissions
   - Verify the permission key format
   - Ensure admin users have `"*": true`
### Configuration Validation

```shell
# Validate server configuration
go run cmd/api-server/main.go --config configs/api/dev.yaml --validate

# Test CLI configuration
./cli/zig-out/bin/ml status --debug
```
## See Also

- Architecture - System architecture overview
- Scheduler Architecture - Scheduler configuration details
- Environment Variables - Additional environment variable documentation
- Security Guide - Security-related configuration
- Deployment Guide - Production configuration guidance
- Jupyter Workflow - Jupyter service configuration
- vLLM Workflow - vLLM service configuration