# Configuration Reference

## Overview

This document provides a comprehensive reference for all configuration options in the FetchML project.

## Environment Configurations

### Local Development

**File:** `configs/api/dev.yaml`

```yaml
base_path: "./data/dev/experiments"
data_dir: "./data/dev/active"

auth:
  enabled: false

server:
  address: "0.0.0.0:9101"
  tls:
    enabled: false
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"

security:
  production_mode: false
  allowed_origins:
    - "http://localhost:3000"
  api_key_rotation_days: 90
  audit_logging:
    enabled: true
    log_path: "./data/dev/logs/fetchml-audit.log"
  rate_limit:
    enabled: false
    requests_per_minute: 60
    burst_size: 10
  ip_whitelist: []

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  addr: "redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "./data/dev/fetchml.sqlite"

logging:
  level: "info"
  file: "./data/dev/logs/fetchml.log"
  audit_log: "./data/dev/logs/fetchml-audit.log"

resources:
  max_workers: 1
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"
```

### Multi-User Setup

**File:** `configs/api/multi-user.yaml`

```yaml
base_path: "/app/data/experiments"
data_dir: "/data/active"

auth:
  enabled: true
  api_keys:
    admin_user:
      hash: "CHANGE_ME_SHA256_ADMIN_USER_KEY"
      admin: true
      roles: ["user", "admin"]
      permissions:
        "*": true
    researcher1:
      hash: "CHANGE_ME_SHA256_RESEARCHER1_KEY"
      admin: false
      roles: ["user", "researcher"]
      permissions:
        "jobs:read": true
        "jobs:create": true
        "jobs:update": true
        "jobs:delete": false
    analyst1:
      hash: "CHANGE_ME_SHA256_ANALYST1_KEY"
      admin: false
      roles: ["user", "analyst"]
      permissions:
        "jobs:read": true
        "jobs:create": false
        "jobs:update": false
        "jobs:delete": false

server:
  address: ":9101"
  tls:
    enabled: false

security:
  production_mode: false
  allowed_origins: []
  rate_limit:
    enabled: true
    requests_per_minute: 60
    burst_size: 20
  ip_whitelist: []

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  url: "redis://redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "/app/data/experiments/fetch_ml.sqlite"

logging:
  level: "info"
  file: "/logs/app.log"
  audit_log: ""

resources:
  max_workers: 3
  desired_rps_per_worker: 3
  podman_cpus: "2"
  podman_memory: "4Gi"
```

### Production

**File:** `configs/api/prod.yaml`

```yaml
base_path: "/app/data/prod/experiments"
data_dir: "/app/data/prod/active"

auth:
  enabled: true
  api_keys:
    admin:
      hash: "replace-with-sha256-of-your-api-key"
      admin: true
      roles:
        - admin
      permissions:
        "*": true

server:
  address: ":9101"
  tls:
    enabled: true
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"

security:
  production_mode: false
  allowed_origins: []
  rate_limit:
    enabled: true
    requests_per_minute: 60
    burst_size: 10
  ip_whitelist: []

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  addr: "redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "/app/data/prod/fetch_ml.sqlite"

logging:
  level: "info"
  file: "/app/data/prod/logs/fetch_ml.log"
  audit_log: "/app/data/prod/logs/audit.log"

resources:
  max_workers: 2
  desired_rps_per_worker: 5
  podman_cpus: "2"
  podman_memory: "4Gi"
```

### Homelab Secure

**File:** `configs/api/homelab-secure.yaml`

Secure configuration for homelab deployments with production-grade security settings:

```yaml
base_path: "/data/experiments"
data_dir: "/data/active"

auth:
  enabled: true
  api_keys:
    homelab_admin:
      hash: "CHANGE_ME_SHA256_HOMELAB_ADMIN_KEY"
      admin: true
      roles:
        - admin
      permissions:
        "*": true
    homelab_user:
      hash: "CHANGE_ME_SHA256_HOMELAB_USER_KEY"
      admin: false
      roles:
        - researcher
      permissions:
        experiments: true
        datasets: true
        jupyter: true

server:
  address: ":9101"
  tls:
    enabled: false
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"

security:
  production_mode: true
  allowed_origins:
    - "https://ml-experiments.example.com"
  rate_limit:
    enabled: true
    requests_per_minute: 60
    burst_size: 10
  ip_whitelist:
    - "127.0.0.1"
    - "192.168.0.0/16"

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  url: "redis://:CHANGE_ME_REDIS_PASSWORD@redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "/data/experiments/fetch_ml.sqlite"

logging:
  level: "info"
  file: "/logs/fetch_ml.log"
  audit_log: ""

resources:
  max_workers: 1
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"
```

## Worker Configurations

### Local Development Worker

**File:** `configs/workers/dev-local.yaml`

```yaml
worker_id: "local-worker"
base_path: "data/dev/experiments"
train_script: "train.py"

redis_url: "redis://localhost:6379/0"

local_mode: true

prewarm_enabled: false

max_workers: 2
poll_interval_seconds: 2

auto_fetch_data: false

data_manager_path: "./data_manager"
dataset_cache_ttl: "30m"

data_dir: "data/dev/active"

snapshot_store:
  enabled: false

podman_image: "python:3.9-slim"
container_workspace: "/workspace"
container_results: "/results"
gpu_devices: []
gpu_vendor: "apple"
gpu_visible_devices: []

# Apple M-series GPU configuration
apple_gpu:
  enabled: true
  metal_device: "/dev/metal"
  mps_runtime: "/dev/mps"

resources:
  max_workers: 2
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"

metrics:
  enabled: false

queue:
  type: "native"
  native:
    data_dir: "data/dev/queue"

task_lease_duration: "30m"
heartbeat_interval: "1m"
max_retries: 3
graceful_timeout: "5m"
```

### Homelab Secure Worker

**File:** `configs/workers/homelab-secure.yaml`

Secure worker configuration with snapshot store and Redis authentication:

```yaml
worker_id: "homelab-worker"
base_path: "/tmp/fetchml-jobs"
train_script: "train.py"

redis_url: "redis://:${REDIS_PASSWORD}@redis:6379/0"

local_mode: true

max_workers: 1
poll_interval_seconds: 2

auto_fetch_data: false

data_manager_path: "./data_manager"
dataset_cache_ttl: "30m"

data_dir: "/data/active"

snapshot_store:
  enabled: true
  endpoint: "minio:9000"
  secure: false
  bucket: "fetchml-snapshots"
  prefix: "snapshots"
  timeout: "5m"
  max_retries: 3

podman_image: "python:3.9-slim"
container_workspace: "/workspace"
container_results: "/results"
gpu_devices: []

resources:
  max_workers: 1
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"

metrics:
  enabled: true
  listen_addr: ":9100"
metrics_flush_interval: "500ms"

task_lease_duration: "30m"
heartbeat_interval: "1m"
max_retries: 3
graceful_timeout: "5m"
```

### Docker Development Worker

**File:** `configs/workers/docker.yaml`

```yaml
worker_id: "docker-worker"
base_path: "/tmp/fetchml-jobs"
train_script: "train.py"

redis_addr: "redis:6379"
redis_password: ""
redis_db: 0

local_mode: true

max_workers: 1
poll_interval_seconds: 5

podman_image: "python:3.9-slim"
container_workspace: "/workspace"
container_results: "/results"
gpu_devices: []
gpu_vendor: "none"
gpu_visible_devices: []

metrics:
  enabled: true
  listen_addr: ":9100"
metrics_flush_interval: "500ms"
```

### Legacy TOML Worker (Deprecated)

**File:** `configs/workers/worker-prod.toml`

```toml
worker_id = "worker-prod-01"
base_path = "/data/ml-experiments"
max_workers = 4

redis_addr = "localhost:6379"
redis_password = "CHANGE_ME_REDIS_PASSWORD"
redis_db = 0

host = "localhost"
user = "ml-user"
port = 22
ssh_key = "~/.ssh/id_rsa"

podman_image = "ml-training:latest"
gpu_vendor = "none"
gpu_visible_devices = []
gpu_devices = []
container_workspace = "/workspace"
container_results = "/results"
train_script = "train.py"

[resources]
max_workers = 4
desired_rps_per_worker = 2
podman_cpus = "4"
podman_memory = "16g"

[metrics]
enabled = true
listen_addr = ":9100"
```

## Security Hardening

### Seccomp Profiles

FetchML includes a hardened seccomp profile for container sandboxing at `configs/seccomp/default-hardened.json`.

**Features:**

- **Default-deny policy**: `SCMP_ACT_ERRNO` blocks all syscalls by default
- **Allowlist approach**: Only explicitly permitted syscalls are allowed
- **Multi-architecture support**: x86_64, x86, aarch64
- **Blocked dangerous syscalls**: ptrace, mount, umount2, reboot, kexec_load, open_by_handle_at, perf_event_open

**Usage with Docker/Podman:**

```bash
# Docker with seccomp
docker run --security-opt seccomp=configs/seccomp/default-hardened.json \
  -v /data:/data:ro \
  my-image:latest

# Podman with seccomp, a read-only root filesystem, and no privilege escalation
podman run --security-opt seccomp=configs/seccomp/default-hardened.json \
  --read-only \
  --security-opt no-new-privileges \
  my-image:latest
```

**Key Allowed Syscalls:**

- File operations: `open`, `openat`, `read`, `write`, `close`
- Memory: `mmap`, `munmap`, `mprotect`, `brk`
- Process: `clone`, `fork`, `execve`, `exit`, `wait4`
- Network: `socket`, `bind`, `listen`, `accept`, `connect`, `sendto`, `recvfrom`
- Signals: `rt_sigaction`, `rt_sigprocmask`, `kill`, `tkill`
- Time: `clock_gettime`, `gettimeofday`, `nanosleep`
- I/O: `epoll_create`, `epoll_ctl`, `epoll_wait`, `poll`, `select`

**Customization:**

Copy the default profile and modify it for your needs:

```bash
cp configs/seccomp/default-hardened.json configs/seccomp/custom-profile.json
# Edit to add/remove syscalls
```

**Testing Seccomp:**

```bash
# Test with a simple container
docker run --rm --security-opt seccomp=configs/seccomp/default-hardened.json \
  alpine:latest echo "Seccomp test passed"
```

## CLI Configuration

### User Config File

**Location:** `~/.ml/config.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = ""

[cli]
default_timeout = 30
verbose = false
```

### Multi-User CLI Configs

**Admin Config:** `~/.ml/config-admin.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = ""
```

**Researcher Config:** `~/.ml/config-researcher.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = ""
```

**Analyst Config:** `~/.ml/config-analyst.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = ""
```

## Configuration Options

### Authentication

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `auth.enabled` | bool | false | Enable authentication |
| `auth.apikeys` | map | {} | API key configurations |
| `auth.apikeys.[user].hash` | string | - | SHA256 hash of API key |
| `auth.apikeys.[user].admin` | bool | false | Admin privileges |
| `auth.apikeys.[user].roles` | array | [] | User roles |
| `auth.apikeys.[user].permissions` | map | {} | User permissions |

### Server

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `server.address` | string | ":9101" | Server bind address |
| `server.tls.enabled` | bool | false | Enable TLS |
| `server.tls.cert_file` | string | - | TLS certificate file |
| `server.tls.key_file` | string | - | TLS private key file |

### Security

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `security.production_mode` | bool | false | Enable production hardening |
| `security.allowed_origins` | array | [] | Allowed CORS origins |
| `security.api_key_rotation_days` | int | 90 | Days until API key rotation required |
| `security.audit_logging.enabled` | bool | false | Enable audit logging |
| `security.audit_logging.log_path` | string | - | Audit log file path |
| `security.rate_limit.enabled` | bool | true | Enable rate limiting |
| `security.rate_limit.requests_per_minute` | int | 60 | Requests per minute limit |
| `security.rate_limit.burst_size` | int | 10 | Burst request allowance |
| `security.ip_whitelist` | array | [] | Allowed IP addresses/CIDR ranges |
| `security.failed_login_lockout.enabled` | bool | false | Enable login lockout |
| `security.failed_login_lockout.max_attempts` | int | 5 | Max failed attempts before lockout |
| `security.failed_login_lockout.lockout_duration` | string | "15m" | Lockout duration (e.g., "15m") |

### Monitoring

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `monitoring.prometheus.enabled` | bool | true | Enable Prometheus metrics |
| `monitoring.prometheus.port` | int | 9101 | Prometheus metrics port |
| `monitoring.prometheus.path` | string | "/metrics" | Metrics endpoint path |
| `monitoring.health_checks.enabled` | bool | true | Enable health checks |
| `monitoring.health_checks.interval` | string | "30s" | Health check interval |

### Database

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `database.type` | string | "sqlite" | Database type (sqlite, postgres, mysql) |
| `database.connection` | string | - | Connection string or path |
| `database.host` | string | - | Database host (for postgres/mysql) |
| `database.port` | int | - | Database port (for postgres/mysql) |
| `database.username` | string | - | Database username |
| `database.password` | string | - | Database password |
| `database.database` | string | - | Database name |

### Queue

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `queue.type` | string | "native" | Queue backend type (native, redis, sqlite, filesystem) |
| `queue.native.data_dir` | string | - | Data directory for native queue |
| `queue.sqlite_path` | string | - | SQLite database path for queue |
| `queue.filesystem_path` | string | - | Filesystem queue path |
| `queue.fallback_to_filesystem` | bool | false | Fall back to filesystem on Redis failure |

### Resources

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `resources.max_workers` | int | 1 | Maximum concurrent workers |
| `resources.desired_rps_per_worker` | int | 2 | Desired requests per second per worker |
| `resources.requests_per_sec` | int | - | Global request rate limit |
| `resources.request_burst` | int | - | Request burst allowance |
| `resources.podman_cpus` | string | "2" | CPU limit for Podman containers |
| `resources.podman_memory` | string | "4Gi" | Memory limit for Podman containers |

### Plugin GPU Quotas

Control GPU allocation for plugin-based services (Jupyter, vLLM, etc.).

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `scheduler.plugin_quota.enabled` | bool | false | Enable plugin GPU quota enforcement |
| `scheduler.plugin_quota.total_gpus` | int | 0 | Global GPU limit across all plugins (0 = unlimited) |
| `scheduler.plugin_quota.per_user_gpus` | int | 0 | Default per-user GPU limit (0 = unlimited) |
| `scheduler.plugin_quota.per_user_services` | int | 0 | Default per-user service count limit (0 = unlimited) |
| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_gpus` | int | 0 | Plugin-specific GPU limit |
| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_services` | int | 0 | Plugin-specific service count limit |
| `scheduler.plugin_quota.user_overrides.{user}.max_gpus` | int | 0 | Per-user GPU override |
| `scheduler.plugin_quota.user_overrides.{user}.max_services` | int | 0 | Per-user service limit override |
| `scheduler.plugin_quota.user_overrides.{user}.allowed_plugins` | array | [] | Plugins the user is allowed to use (empty = all) |

**Example configuration:**

```yaml
scheduler:
  plugin_quota:
    enabled: true
    total_gpus: 16
    per_user_gpus: 4
    per_user_services: 2
    per_plugin_limits:
      vllm:
        max_gpus: 8
        max_services: 4
      jupyter:
        max_gpus: 4
        max_services: 10
    user_overrides:
      admin:
        max_gpus: 8
        max_services: 5
        allowed_plugins: ["jupyter", "vllm"]
```

### Redis

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `redis.url` | string | "redis://localhost:6379" | Redis connection URL |
| `redis.addr` | string | - | Redis host:port shorthand |
| `redis.password` | string | - | Redis password |
| `redis.db` | int | 0 | Redis database number |
| `redis.max_connections` | int | 10 | Max Redis connections |

### Logging

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `logging.level` | string | "info" | Log level |
| `logging.file` | string | - | Log file path |
| `logging.audit_file` | string | - | Audit log path |

## Permission System

### Permission Keys

| Permission | Description |
|------------|-------------|
| `jobs:read` | Read job information |
| `jobs:create` | Create new jobs |
| `jobs:update` | Update existing jobs |
| `jobs:delete` | Delete jobs |
| `*` | All permissions (admin only) |

### Role-Based Permissions

| Role | Default Permissions |
|------|-------------------|
| admin | All permissions |
| researcher | jobs:read, jobs:create, jobs:update |
| analyst | jobs:read |
| user | No default permissions |

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `FETCHML_CONFIG` | - | Path to config file |
| `FETCHML_LOG_LEVEL` | "info" | Override log level |
| `CLI_CONFIG` | - | Path to CLI config file |
| `FETCH_ML_GPU_TYPE` | - | Override GPU vendor detection (nvidia, amd, apple, none). Takes precedence over config file. |
| `FETCH_ML_GPU_COUNT` | - | Override GPU count detection. Used with auto-detected or configured vendor. |
| `FETCH_ML_TOTAL_CPU` | - | Override total CPU count detection. Sets the number of CPU cores available. |
| `FETCH_ML_GPU_SLOTS_PER_GPU` | 1 | Override GPU slots per GPU. Controls how many concurrent tasks can share a single GPU. |

When environment variable overrides are active, they are logged to stderr at worker startup for debugging.

Note: When `gpu_vendor: amd` is configured, the system uses the NVIDIA detector implementation (aliased) due to similar device exposure patterns. The `configured_vendor` field will show "amd" while the actual detection uses NVIDIA-compatible methods.

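The precedence described above can be pictured with a small sketch. This is not FetchML's implementation (the worker is not written in Python); `resolve_gpu_settings`, its return keys other than `configured_vendor`, and the `DETECTOR_ALIASES` table are hypothetical names used only to illustrate how an environment override, the config-file vendor, and the AMD alias might combine. Treating an AMD value supplied through the environment the same way as one from the config file is also an assumption.

```python
import os
from typing import Mapping, Optional

# Assumption: "amd" is served by the NVIDIA-compatible detector, per the note above.
DETECTOR_ALIASES = {"amd": "nvidia"}


def resolve_gpu_settings(config_vendor: str,
                         env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: merge the config-file vendor with environment overrides."""
    env = os.environ if env is None else env

    # FETCH_ML_GPU_TYPE takes precedence over gpu_vendor from the config file.
    override = env.get("FETCH_ML_GPU_TYPE", "").strip().lower()
    vendor = override or config_vendor

    gpu_count = env.get("FETCH_ML_GPU_COUNT")           # unset means auto-detect
    slots = env.get("FETCH_ML_GPU_SLOTS_PER_GPU", "1")  # documented default is 1

    return {
        "configured_vendor": vendor,
        "detector": DETECTOR_ALIASES.get(vendor, vendor),
        "gpu_count": int(gpu_count) if gpu_count is not None else None,
        "slots_per_gpu": int(slots),
    }
```

For example, calling `resolve_gpu_settings("apple", env={"FETCH_ML_GPU_TYPE": "amd"})` reports `configured_vendor` as `"amd"` while selecting the `"nvidia"` detector, which mirrors the aliasing described in the note.
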
## Troubleshooting

### Common Configuration Issues

1. **Authentication Failures**
   - Check that API key hashes are valid SHA256 digests
   - Verify YAML syntax
   - Ensure `auth.enabled: true`

2. **Connection Issues**
   - Verify server address and ports
   - Check firewall settings
   - Validate network connectivity

3. **Permission Issues**
   - Check user roles and permissions
   - Verify permission key format
   - Ensure admin users have `"*": true`

### Configuration Validation

```bash
# Validate server configuration
go run cmd/api-server/main.go --config configs/api/dev.yaml --validate

# Test CLI configuration
./cli/zig-out/bin/ml status --debug
```

---

## See Also

- **[Architecture](architecture.md)** - System architecture overview
- **[Scheduler Architecture](scheduler-architecture.md)** - Scheduler configuration details
- **[Environment Variables](environment-variables.md)** - Additional environment variable documentation
- **[Security Guide](security.md)** - Security-related configuration
- **[Deployment Guide](deployment.md)** - Production configuration guidance
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter service configuration
- **[vLLM Workflow](vllm-workflow.md)** - vLLM service configuration