
# Configuration Reference

## Overview

This document provides a comprehensive reference for all configuration options in the FetchML project.

## Environment Configurations

### Local Development

File: `configs/api/dev.yaml`

```yaml
base_path: "./data/dev/experiments"
data_dir: "./data/dev/active"

auth:
  enabled: false

server:
  address: "0.0.0.0:9101"
  tls:
    enabled: false
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"

security:
  production_mode: false
  allowed_origins:
    - "http://localhost:3000"
  api_key_rotation_days: 90
  audit_logging:
    enabled: true
    log_path: "./data/dev/logs/fetchml-audit.log"
  rate_limit:
    enabled: false
    requests_per_minute: 60
    burst_size: 10
  ip_whitelist: []

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  addr: "redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "./data/dev/fetchml.sqlite"

logging:
  level: "info"
  file: "./data/dev/logs/fetchml.log"
  audit_log: "./data/dev/logs/fetchml-audit.log"

resources:
  max_workers: 1
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"
```

### Multi-User Setup

File: `configs/api/multi-user.yaml`

```yaml
base_path: "/app/data/experiments"
data_dir: "/data/active"

auth:
  enabled: true
  api_keys:
    admin_user:
      hash: "CHANGE_ME_SHA256_ADMIN_USER_KEY"
      admin: true
      roles: ["user", "admin"]
      permissions:
        "*": true
    researcher1:
      hash: "CHANGE_ME_SHA256_RESEARCHER1_KEY"
      admin: false
      roles: ["user", "researcher"]
      permissions:
        "jobs:read": true
        "jobs:create": true
        "jobs:update": true
        "jobs:delete": false
    analyst1:
      hash: "CHANGE_ME_SHA256_ANALYST1_KEY"
      admin: false
      roles: ["user", "analyst"]
      permissions:
        "jobs:read": true
        "jobs:create": false
        "jobs:update": false
        "jobs:delete": false

server:
  address: ":9101"
  tls:
    enabled: false

security:
  production_mode: false
  allowed_origins: []
  rate_limit:
    enabled: true
    requests_per_minute: 60
    burst_size: 20
  ip_whitelist: []

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  url: "redis://redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "/app/data/experiments/fetch_ml.sqlite"

logging:
  level: "info"
  file: "/logs/app.log"
  audit_log: ""

resources:
  max_workers: 3
  desired_rps_per_worker: 3
  podman_cpus: "2"
  podman_memory: "4Gi"
```

### Production

File: `configs/api/prod.yaml`

```yaml
base_path: "/app/data/prod/experiments"
data_dir: "/app/data/prod/active"

auth:
  enabled: true
  api_keys:
    admin:
      hash: "replace-with-sha256-of-your-api-key"
      admin: true
      roles:
        - admin
      permissions:
        "*": true

server:
  address: ":9101"
  tls:
    enabled: true
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"

security:
  production_mode: false
  allowed_origins: []
  rate_limit:
    enabled: true
    requests_per_minute: 60
    burst_size: 10
  ip_whitelist: []

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  addr: "redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "/app/data/prod/fetch_ml.sqlite"

logging:
  level: "info"
  file: "/app/data/prod/logs/fetch_ml.log"
  audit_log: "/app/data/prod/logs/audit.log"

resources:
  max_workers: 2
  desired_rps_per_worker: 5
  podman_cpus: "2"
  podman_memory: "4Gi"
```

### Homelab Secure

File: `configs/api/homelab-secure.yaml`

A secure configuration for homelab deployments with production-grade security settings:

```yaml
base_path: "/data/experiments"
data_dir: "/data/active"

auth:
  enabled: true
  api_keys:
    homelab_admin:
      hash: "CHANGE_ME_SHA256_HOMELAB_ADMIN_KEY"
      admin: true
      roles:
        - admin
      permissions:
        "*": true
    homelab_user:
      hash: "CHANGE_ME_SHA256_HOMELAB_USER_KEY"
      admin: false
      roles:
        - researcher
      permissions:
        experiments: true
        datasets: true
        jupyter: true

server:
  address: ":9101"
  tls:
    enabled: false
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"

security:
  production_mode: true
  allowed_origins:
    - "https://ml-experiments.example.com"
  rate_limit:
    enabled: true
    requests_per_minute: 60
    burst_size: 10
  ip_whitelist:
    - "127.0.0.1"
    - "192.168.0.0/16"

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  url: "redis://:CHANGE_ME_REDIS_PASSWORD@redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "/data/experiments/fetch_ml.sqlite"

logging:
  level: "info"
  file: "/logs/fetch_ml.log"
  audit_log: ""

resources:
  max_workers: 1
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"
```

## Worker Configurations

### Local Development Worker

File: `configs/workers/dev-local.yaml`

```yaml
worker_id: "local-worker"
base_path: "data/dev/experiments"
train_script: "train.py"

redis_url: "redis://localhost:6379/0"

local_mode: true

prewarm_enabled: false

max_workers: 2
poll_interval_seconds: 2

auto_fetch_data: false

data_manager_path: "./data_manager"
dataset_cache_ttl: "30m"

data_dir: "data/dev/active"

snapshot_store:
  enabled: false

podman_image: "python:3.9-slim"
container_workspace: "/workspace"
container_results: "/results"
gpu_devices: []
gpu_vendor: "apple"
gpu_visible_devices: []

# Apple M-series GPU configuration
apple_gpu:
  enabled: true
  metal_device: "/dev/metal"
  mps_runtime: "/dev/mps"

resources:
  max_workers: 2
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"

metrics:
  enabled: false

queue:
  type: "native"
  native:
    data_dir: "data/dev/queue"

task_lease_duration: "30m"
heartbeat_interval: "1m"
max_retries: 3
graceful_timeout: "5m"
```

### Homelab Secure Worker

File: `configs/workers/homelab-secure.yaml`

A secure worker configuration with snapshot store and Redis authentication:

```yaml
worker_id: "homelab-worker"
base_path: "/tmp/fetchml-jobs"
train_script: "train.py"

redis_url: "redis://:${REDIS_PASSWORD}@redis:6379/0"

local_mode: true

max_workers: 1
poll_interval_seconds: 2

auto_fetch_data: false

data_manager_path: "./data_manager"
dataset_cache_ttl: "30m"

data_dir: "/data/active"

snapshot_store:
  enabled: true
  endpoint: "minio:9000"
  secure: false
  bucket: "fetchml-snapshots"
  prefix: "snapshots"
  timeout: "5m"
  max_retries: 3

podman_image: "python:3.9-slim"
container_workspace: "/workspace"
container_results: "/results"
gpu_devices: []

resources:
  max_workers: 1
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"

metrics:
  enabled: true
  listen_addr: ":9100"
metrics_flush_interval: "500ms"

task_lease_duration: "30m"
heartbeat_interval: "1m"
max_retries: 3
graceful_timeout: "5m"
```

### Docker Development Worker

File: `configs/workers/docker.yaml`

```yaml
worker_id: "docker-worker"
base_path: "/tmp/fetchml-jobs"
train_script: "train.py"

redis_addr: "redis:6379"
redis_password: ""
redis_db: 0

local_mode: true

max_workers: 1
poll_interval_seconds: 5

podman_image: "python:3.9-slim"
container_workspace: "/workspace"
container_results: "/results"
gpu_devices: []
gpu_vendor: "none"
gpu_visible_devices: []

metrics:
  enabled: true
  listen_addr: ":9100"
metrics_flush_interval: "500ms"
```

### Legacy TOML Worker (Deprecated)

File: `configs/workers/worker-prod.toml`

```toml
worker_id = "worker-prod-01"
base_path = "/data/ml-experiments"
max_workers = 4

redis_addr = "localhost:6379"
redis_password = "CHANGE_ME_REDIS_PASSWORD"
redis_db = 0

host = "localhost"
user = "ml-user"
port = 22
ssh_key = "~/.ssh/id_rsa"

podman_image = "ml-training:latest"
gpu_vendor = "none"
gpu_visible_devices = []
gpu_devices = []
container_workspace = "/workspace"
container_results = "/results"
train_script = "train.py"

[resources]
max_workers = 4
desired_rps_per_worker = 2
podman_cpus = "4"
podman_memory = "16g"

[metrics]
enabled = true
listen_addr = ":9100"
```

## Security Hardening

### Seccomp Profiles

FetchML includes a hardened seccomp profile for container sandboxing at `configs/seccomp/default-hardened.json`.

Features:

- **Default-deny policy:** `SCMP_ACT_ERRNO` blocks all syscalls by default
- **Allowlist approach:** only explicitly permitted syscalls are allowed
- **Multi-architecture support:** x86_64, x86, aarch64
- **Blocked dangerous syscalls:** `ptrace`, `mount`, `umount2`, `reboot`, `kexec_load`, `open_by_handle_at`, `perf_event_open`

Usage with Docker/Podman:

```bash
# Docker with seccomp
docker run --security-opt seccomp=configs/seccomp/default-hardened.json \
  -v /data:/data:ro \
  my-image:latest

# Podman with seccomp
podman run --security-opt seccomp=configs/seccomp/default-hardened.json \
  --read-only \
  --no-new-privileges \
  my-image:latest
```

Key Allowed Syscalls:

- File operations: `open`, `openat`, `read`, `write`, `close`
- Memory: `mmap`, `munmap`, `mprotect`, `brk`
- Process: `clone`, `fork`, `execve`, `exit`, `wait4`
- Network: `socket`, `bind`, `listen`, `accept`, `connect`, `sendto`, `recvfrom`
- Signals: `rt_sigaction`, `rt_sigprocmask`, `kill`, `tkill`
- Time: `clock_gettime`, `gettimeofday`, `nanosleep`
- I/O: `epoll_create`, `epoll_ctl`, `epoll_wait`, `poll`, `select`

Customization:

Copy the default profile and modify it for your needs:

```bash
cp configs/seccomp/default-hardened.json configs/seccomp/custom-profile.json
# Edit to add/remove syscalls
```
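The hardened profile follows the standard OCI runtime seccomp schema, so a custom profile can also be written from scratch. The sketch below is illustrative only; its four-syscall allowlist is far too small for a real workload:

```bash
# Minimal OCI-style seccomp profile: deny everything by default,
# allow a tiny set of syscalls. Illustrative subset only.
cat > custom-profile.json <<'EOF'
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
  "syscalls": [
    { "names": ["read", "write", "close", "exit_group"],
      "action": "SCMP_ACT_ALLOW" }
  ]
}
EOF
# Sanity-check the JSON before handing it to the runtime
python3 -m json.tool custom-profile.json > /dev/null && echo "profile is valid JSON"
```

Pass the resulting file to `--security-opt seccomp=` exactly as shown for the default profile above.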

Testing Seccomp:

```bash
# Test with a simple container
docker run --rm --security-opt seccomp=configs/seccomp/default-hardened.json \
  alpine:latest echo "Seccomp test passed"
```

## CLI Configuration

### User Config File

Location: `~/.ml/config.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = "<your-api-key>"

[cli]
default_timeout = 30
verbose = false
```

### Multi-User CLI Configs

Admin config: `~/.ml/config-admin.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = "<admin-api-key>"
```

Researcher config: `~/.ml/config-researcher.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = "<researcher-api-key>"
```

Analyst config: `~/.ml/config-analyst.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = "<analyst-api-key>"
```

## Configuration Options

### Authentication

| Option | Type | Default | Description |
|---|---|---|---|
| `auth.enabled` | bool | `false` | Enable authentication |
| `auth.api_keys` | map | `{}` | API key configurations |
| `auth.api_keys.[user].hash` | string | - | SHA-256 hash of the API key |
| `auth.api_keys.[user].admin` | bool | `false` | Admin privileges |
| `auth.api_keys.[user].roles` | array | `[]` | User roles |
| `auth.api_keys.[user].permissions` | map | `{}` | User permissions |
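The `hash` values above are SHA-256 digests of the raw API keys. One way to produce a digest from the shell (the key string here is a placeholder):

```bash
# Hash a raw API key for use in auth.api_keys.<user>.hash.
# 'my-secret-api-key' is a placeholder; substitute your real key.
printf '%s' 'my-secret-api-key' | sha256sum | awk '{print $1}'
```

`printf '%s'` is used instead of `echo` to avoid a trailing newline, which would change the digest.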

### Server

| Option | Type | Default | Description |
|---|---|---|---|
| `server.address` | string | `":9101"` | Server bind address |
| `server.tls.enabled` | bool | `false` | Enable TLS |
| `server.tls.cert_file` | string | - | TLS certificate file |
| `server.tls.key_file` | string | - | TLS private key file |
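For local TLS testing, a self-signed pair can be generated with OpenSSL and placed wherever `cert_file` and `key_file` point (the output filenames below are placeholders; do not use self-signed certificates in production):

```bash
# Generate a self-signed certificate/key pair for dev TLS (valid 365 days).
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -subj "/CN=localhost" \
  -keyout key.pem -out cert.pem
```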

### Security

| Option | Type | Default | Description |
|---|---|---|---|
| `security.production_mode` | bool | `false` | Enable production hardening |
| `security.allowed_origins` | array | `[]` | Allowed CORS origins |
| `security.api_key_rotation_days` | int | `90` | Days until API key rotation is required |
| `security.audit_logging.enabled` | bool | `false` | Enable audit logging |
| `security.audit_logging.log_path` | string | - | Audit log file path |
| `security.rate_limit.enabled` | bool | `true` | Enable rate limiting |
| `security.rate_limit.requests_per_minute` | int | `60` | Requests-per-minute limit |
| `security.rate_limit.burst_size` | int | `10` | Burst request allowance |
| `security.ip_whitelist` | array | `[]` | Allowed IP addresses/CIDR ranges |
| `security.failed_login_lockout.enabled` | bool | `false` | Enable login lockout |
| `security.failed_login_lockout.max_attempts` | int | `5` | Max failed attempts before lockout |
| `security.failed_login_lockout.lockout_duration` | string | `"15m"` | Lockout duration (e.g., `"15m"`) |

### Monitoring

| Option | Type | Default | Description |
|---|---|---|---|
| `monitoring.prometheus.enabled` | bool | `true` | Enable Prometheus metrics |
| `monitoring.prometheus.port` | int | `9101` | Prometheus metrics port |
| `monitoring.prometheus.path` | string | `"/metrics"` | Metrics endpoint path |
| `monitoring.health_checks.enabled` | bool | `true` | Enable health checks |
| `monitoring.health_checks.interval` | string | `"30s"` | Health check interval |

### Database

| Option | Type | Default | Description |
|---|---|---|---|
| `database.type` | string | `"sqlite"` | Database type (`sqlite`, `postgres`, `mysql`) |
| `database.connection` | string | - | Connection string or path |
| `database.host` | string | - | Database host (postgres/mysql) |
| `database.port` | int | - | Database port (postgres/mysql) |
| `database.username` | string | - | Database username |
| `database.password` | string | - | Database password |
| `database.database` | string | - | Database name |

### Queue

| Option | Type | Default | Description |
|---|---|---|---|
| `queue.type` | string | `"native"` | Queue backend type (`native`, `redis`, `sqlite`, `filesystem`) |
| `queue.native.data_dir` | string | - | Data directory for the native queue |
| `queue.sqlite_path` | string | - | SQLite database path for the queue |
| `queue.filesystem_path` | string | - | Filesystem queue path |
| `queue.fallback_to_filesystem` | bool | `false` | Fall back to the filesystem queue on Redis failure |
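As a sketch of how these options combine (the key layout is assumed from the table above), a Redis-backed queue that falls back to a filesystem queue on Redis failure might look like:

```yaml
queue:
  type: "redis"
  filesystem_path: "/data/queue"
  fallback_to_filesystem: true
```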

### Resources

| Option | Type | Default | Description |
|---|---|---|---|
| `resources.max_workers` | int | `1` | Maximum concurrent workers |
| `resources.desired_rps_per_worker` | int | `2` | Desired requests per second per worker |
| `resources.requests_per_sec` | int | - | Global request rate limit |
| `resources.request_burst` | int | - | Request burst allowance |
| `resources.podman_cpus` | string | `"2"` | CPU limit for Podman containers |
| `resources.podman_memory` | string | `"4Gi"` | Memory limit for Podman containers |

### Plugin GPU Quotas

Control GPU allocation for plugin-based services (Jupyter, vLLM, etc.).

| Option | Type | Default | Description |
|---|---|---|---|
| `scheduler.plugin_quota.enabled` | bool | `false` | Enable plugin GPU quota enforcement |
| `scheduler.plugin_quota.total_gpus` | int | `0` | Global GPU limit across all plugins (0 = unlimited) |
| `scheduler.plugin_quota.per_user_gpus` | int | `0` | Default per-user GPU limit (0 = unlimited) |
| `scheduler.plugin_quota.per_user_services` | int | `0` | Default per-user service count limit (0 = unlimited) |
| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_gpus` | int | `0` | Plugin-specific GPU limit |
| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_services` | int | `0` | Plugin-specific service count limit |
| `scheduler.plugin_quota.user_overrides.{user}.max_gpus` | int | `0` | Per-user GPU override |
| `scheduler.plugin_quota.user_overrides.{user}.max_services` | int | `0` | Per-user service limit override |
| `scheduler.plugin_quota.user_overrides.{user}.allowed_plugins` | array | `[]` | Plugins the user may use (empty = all) |

Example configuration:

```yaml
scheduler:
  plugin_quota:
    enabled: true
    total_gpus: 16
    per_user_gpus: 4
    per_user_services: 2
    per_plugin_limits:
      vllm:
        max_gpus: 8
        max_services: 4
      jupyter:
        max_gpus: 4
        max_services: 10
    user_overrides:
      admin:
        max_gpus: 8
        max_services: 5
        allowed_plugins: ["jupyter", "vllm"]
```

### Redis

| Option | Type | Default | Description |
|---|---|---|---|
| `redis.url` | string | `"redis://localhost:6379"` | Redis connection URL |
| `redis.addr` | string | - | Redis `host:port` shorthand |
| `redis.password` | string | - | Redis password |
| `redis.db` | int | `0` | Redis database number |
| `redis.max_connections` | int | `10` | Max Redis connections |

### Logging

| Option | Type | Default | Description |
|---|---|---|---|
| `logging.level` | string | `"info"` | Log level |
| `logging.file` | string | - | Log file path |
| `logging.audit_log` | string | - | Audit log path |

## Permission System

### Permission Keys

| Permission | Description |
|---|---|
| `jobs:read` | Read job information |
| `jobs:create` | Create new jobs |
| `jobs:update` | Update existing jobs |
| `jobs:delete` | Delete jobs |
| `*` | All permissions (admin only) |

### Role-Based Permissions

| Role | Default Permissions |
|---|---|
| `admin` | All permissions |
| `researcher` | `jobs:read`, `jobs:create`, `jobs:update` |
| `analyst` | `jobs:read` |
| `user` | No default permissions |

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `FETCHML_CONFIG` | - | Path to config file |
| `FETCHML_LOG_LEVEL` | `"info"` | Override log level |
| `CLI_CONFIG` | - | Path to CLI config file |
| `FETCH_ML_GPU_TYPE` | - | Override GPU vendor detection (`nvidia`, `amd`, `apple`, `none`); takes precedence over the config file |
| `FETCH_ML_GPU_COUNT` | - | Override GPU count detection; used with the auto-detected or configured vendor |
| `FETCH_ML_TOTAL_CPU` | - | Override total CPU count detection; sets the number of CPU cores available |
| `FETCH_ML_GPU_SLOTS_PER_GPU` | `1` | Override GPU slots per GPU; controls how many concurrent tasks can share a single GPU |
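For example, assuming the slot semantics described in the table (total concurrent GPU task slots = GPU count * slots per GPU), the overrides below would expose 8 slots:

```bash
# Override GPU detection: report 2 NVIDIA GPUs with 4 task slots each.
export FETCH_ML_GPU_TYPE=nvidia
export FETCH_ML_GPU_COUNT=2
export FETCH_ML_GPU_SLOTS_PER_GPU=4
# Effective concurrent GPU task slots (assumed: count * slots_per_gpu):
echo $(( FETCH_ML_GPU_COUNT * FETCH_ML_GPU_SLOTS_PER_GPU ))   # prints 8
```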

When environment variable overrides are active, they are logged to stderr at worker startup for debugging.

**Note:** When `gpu_vendor: amd` is configured, the system uses the NVIDIA detector implementation (aliased) due to similar device exposure patterns. The `configured_vendor` field will show `"amd"` while the actual detection uses NVIDIA-compatible methods.

## Troubleshooting

### Common Configuration Issues

1. **Authentication failures**
   - Check that API key hashes are valid SHA-256 digests
   - Verify YAML syntax
   - Ensure `auth.enabled: true`
2. **Connection issues**
   - Verify the server address and ports
   - Check firewall settings
   - Validate network connectivity
3. **Permission issues**
   - Check user roles and permissions
   - Verify the permission key format
   - Ensure admin users have `"*": true`

### Configuration Validation

```bash
# Validate server configuration
go run cmd/api-server/main.go --config configs/api/dev.yaml --validate

# Test CLI configuration
./cli/zig-out/bin/ml status --debug
```
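If your build lacks the `--validate` flag, a grep-level sanity check catches missing top-level sections before the server starts (the required-key list here is an assumption based on the examples in this document, not an exhaustive schema):

```bash
# Write a throwaway config and check that expected top-level keys exist.
cat > /tmp/check.yaml <<'EOF'
base_path: "./data/dev/experiments"
server:
  address: ":9101"
security:
  production_mode: false
logging:
  level: "info"
EOF
for key in base_path server security logging; do
  grep -q "^${key}:" /tmp/check.yaml && echo "ok: ${key}"
done
```

This only checks key presence; it does not validate YAML structure or value types.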

## See Also