# Configuration Reference

## Overview

This document provides a comprehensive reference for all configuration options in the FetchML project.
## Environment Configurations

### Local Development

**File:** `configs/api/dev.yaml`

```yaml
base_path: "./data/dev/experiments"
data_dir: "./data/dev/active"

auth:
  enabled: false

server:
  address: "0.0.0.0:9101"
  tls:
    enabled: false
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"

security:
  production_mode: false
  allowed_origins:
    - "http://localhost:3000"
  api_key_rotation_days: 90
  audit_logging:
    enabled: true
    log_path: "./data/dev/logs/fetchml-audit.log"
  rate_limit:
    enabled: false
    requests_per_minute: 60
    burst_size: 10
  ip_whitelist: []

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  addr: "redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "./data/dev/fetchml.sqlite"

logging:
  level: "info"
  file: "./data/dev/logs/fetchml.log"
  audit_log: "./data/dev/logs/fetchml-audit.log"

resources:
  max_workers: 1
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"
```
### Multi-User Setup

**File:** `configs/api/multi-user.yaml`

```yaml
base_path: "/app/data/experiments"
data_dir: "/data/active"

auth:
  enabled: true
  api_keys:
    admin_user:
      hash: "CHANGE_ME_SHA256_ADMIN_USER_KEY"
      admin: true
      roles: ["user", "admin"]
      permissions:
        "*": true
    researcher1:
      hash: "CHANGE_ME_SHA256_RESEARCHER1_KEY"
      admin: false
      roles: ["user", "researcher"]
      permissions:
        "jobs:read": true
        "jobs:create": true
        "jobs:update": true
        "jobs:delete": false
    analyst1:
      hash: "CHANGE_ME_SHA256_ANALYST1_KEY"
      admin: false
      roles: ["user", "analyst"]
      permissions:
        "jobs:read": true
        "jobs:create": false
        "jobs:update": false
        "jobs:delete": false

server:
  address: ":9101"
  tls:
    enabled: false

security:
  production_mode: false
  allowed_origins: []
  rate_limit:
    enabled: true
    requests_per_minute: 60
    burst_size: 20
  ip_whitelist: []

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  url: "redis://redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "/app/data/experiments/fetch_ml.sqlite"

logging:
  level: "info"
  file: "/logs/app.log"
  audit_log: ""

resources:
  max_workers: 3
  desired_rps_per_worker: 3
  podman_cpus: "2"
  podman_memory: "4Gi"
```
### Production

**File:** `configs/api/prod.yaml`

```yaml
base_path: "/app/data/prod/experiments"
data_dir: "/app/data/prod/active"

auth:
  enabled: true
  api_keys:
    admin:
      hash: "replace-with-sha256-of-your-api-key"
      admin: true
      roles:
        - admin
      permissions:
        "*": true

server:
  address: ":9101"
  tls:
    enabled: true
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"

security:
  production_mode: false
  allowed_origins: []
  rate_limit:
    enabled: true
    requests_per_minute: 60
    burst_size: 10
  ip_whitelist: []

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  addr: "redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "/app/data/prod/fetch_ml.sqlite"

logging:
  level: "info"
  file: "/app/data/prod/logs/fetch_ml.log"
  audit_log: "/app/data/prod/logs/audit.log"

resources:
  max_workers: 2
  desired_rps_per_worker: 5
  podman_cpus: "2"
  podman_memory: "4Gi"
```
### Homelab Secure

**File:** `configs/api/homelab-secure.yaml`

Secure configuration for homelab deployments with production-grade security settings:

```yaml
base_path: "/data/experiments"
data_dir: "/data/active"

auth:
  enabled: true
  api_keys:
    homelab_admin:
      hash: "CHANGE_ME_SHA256_HOMELAB_ADMIN_KEY"
      admin: true
      roles:
        - admin
      permissions:
        "*": true
    homelab_user:
      hash: "CHANGE_ME_SHA256_HOMELAB_USER_KEY"
      admin: false
      roles:
        - researcher
      permissions:
        experiments: true
        datasets: true
        jupyter: true

server:
  address: ":9101"
  tls:
    enabled: false
    cert_file: "/app/ssl/cert.pem"
    key_file: "/app/ssl/key.pem"

security:
  production_mode: true
  allowed_origins:
    - "https://ml-experiments.example.com"
  rate_limit:
    enabled: true
    requests_per_minute: 60
    burst_size: 10
  ip_whitelist:
    - "127.0.0.1"
    - "192.168.0.0/16"

monitoring:
  prometheus:
    enabled: true
    port: 9101
    path: "/metrics"
  health_checks:
    enabled: true
    interval: "30s"

redis:
  url: "redis://:CHANGE_ME_REDIS_PASSWORD@redis:6379"
  password: ""
  db: 0

database:
  type: "sqlite"
  connection: "/data/experiments/fetch_ml.sqlite"

logging:
  level: "info"
  file: "/logs/fetch_ml.log"
  audit_log: ""

resources:
  max_workers: 1
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"
```
## Worker Configurations

### Local Development Worker

**File:** `configs/workers/dev-local.yaml`

```yaml
worker_id: "local-worker"
base_path: "data/dev/experiments"
train_script: "train.py"

redis_url: "redis://localhost:6379/0"

local_mode: true

prewarm_enabled: false

max_workers: 2
poll_interval_seconds: 2

auto_fetch_data: false

data_manager_path: "./data_manager"
dataset_cache_ttl: "30m"

data_dir: "data/dev/active"

snapshot_store:
  enabled: false

podman_image: "python:3.9-slim"
container_workspace: "/workspace"
container_results: "/results"
gpu_devices: []
gpu_vendor: "apple"
gpu_visible_devices: []

# Apple M-series GPU configuration
apple_gpu:
  enabled: true
  metal_device: "/dev/metal"
  mps_runtime: "/dev/mps"

resources:
  max_workers: 2
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"

metrics:
  enabled: false

queue:
  type: "native"
  native:
    data_dir: "data/dev/queue"

task_lease_duration: "30m"
heartbeat_interval: "1m"
max_retries: 3
graceful_timeout: "5m"
```
### Homelab Secure Worker

**File:** `configs/workers/homelab-secure.yaml`

Secure worker configuration with snapshot store and Redis authentication:

```yaml
worker_id: "homelab-worker"
base_path: "/tmp/fetchml-jobs"
train_script: "train.py"

redis_url: "redis://:${REDIS_PASSWORD}@redis:6379/0"

local_mode: true

max_workers: 1
poll_interval_seconds: 2

auto_fetch_data: false

data_manager_path: "./data_manager"
dataset_cache_ttl: "30m"

data_dir: "/data/active"

snapshot_store:
  enabled: true
  endpoint: "minio:9000"
  secure: false
  bucket: "fetchml-snapshots"
  prefix: "snapshots"
  timeout: "5m"
  max_retries: 3

podman_image: "python:3.9-slim"
container_workspace: "/workspace"
container_results: "/results"
gpu_devices: []

resources:
  max_workers: 1
  desired_rps_per_worker: 2
  podman_cpus: "2"
  podman_memory: "4Gi"

metrics:
  enabled: true
  listen_addr: ":9100"
  metrics_flush_interval: "500ms"

task_lease_duration: "30m"
heartbeat_interval: "1m"
max_retries: 3
graceful_timeout: "5m"
```
### Docker Development Worker

**File:** `configs/workers/docker.yaml`

```yaml
worker_id: "docker-worker"
base_path: "/tmp/fetchml-jobs"
train_script: "train.py"

redis_addr: "redis:6379"
redis_password: ""
redis_db: 0

local_mode: true

max_workers: 1
poll_interval_seconds: 5

podman_image: "python:3.9-slim"
container_workspace: "/workspace"
container_results: "/results"
gpu_devices: []
gpu_vendor: "none"
gpu_visible_devices: []

metrics:
  enabled: true
  listen_addr: ":9100"
  metrics_flush_interval: "500ms"
```
### Legacy TOML Worker (Deprecated)

**File:** `configs/workers/worker-prod.toml`

```toml
worker_id = "worker-prod-01"
base_path = "/data/ml-experiments"
max_workers = 4

redis_addr = "localhost:6379"
redis_password = "CHANGE_ME_REDIS_PASSWORD"
redis_db = 0

host = "localhost"
user = "ml-user"
port = 22
ssh_key = "~/.ssh/id_rsa"

podman_image = "ml-training:latest"
gpu_vendor = "none"
gpu_visible_devices = []
gpu_devices = []
container_workspace = "/workspace"
container_results = "/results"
train_script = "train.py"

[resources]
max_workers = 4
desired_rps_per_worker = 2
podman_cpus = "4"
podman_memory = "16g"

[metrics]
enabled = true
listen_addr = ":9100"
```
## Security Hardening

### Seccomp Profiles

FetchML includes a hardened seccomp profile for container sandboxing at `configs/seccomp/default-hardened.json`.

**Features:**

- **Default-deny policy**: `SCMP_ACT_ERRNO` blocks all syscalls by default
- **Allowlist approach**: only explicitly permitted syscalls are allowed
- **Multi-architecture support**: x86_64, x86, aarch64
- **Blocked dangerous syscalls**: `ptrace`, `mount`, `umount2`, `reboot`, `kexec_load`, `open_by_handle_at`, `perf_event_open`

**Usage with Docker/Podman:**

```bash
# Docker with seccomp
docker run --security-opt seccomp=configs/seccomp/default-hardened.json \
  -v /data:/data:ro \
  my-image:latest

# Podman with seccomp
podman run --security-opt seccomp=configs/seccomp/default-hardened.json \
  --read-only \
  --security-opt no-new-privileges \
  my-image:latest
```

**Key Allowed Syscalls:**

- File operations: `open`, `openat`, `read`, `write`, `close`
- Memory: `mmap`, `munmap`, `mprotect`, `brk`
- Process: `clone`, `fork`, `execve`, `exit`, `wait4`
- Network: `socket`, `bind`, `listen`, `accept`, `connect`, `sendto`, `recvfrom`
- Signals: `rt_sigaction`, `rt_sigprocmask`, `kill`, `tkill`
- Time: `clock_gettime`, `gettimeofday`, `nanosleep`
- I/O: `epoll_create`, `epoll_ctl`, `epoll_wait`, `poll`, `select`

**Customization:**

Copy the default profile and modify it for your needs:

```bash
cp configs/seccomp/default-hardened.json configs/seccomp/custom-profile.json
# Edit to add/remove syscalls
```

**Testing Seccomp:**

```bash
# Test with a simple container
docker run --rm --security-opt seccomp=configs/seccomp/default-hardened.json \
  alpine:latest echo "Seccomp test passed"
```
## CLI Configuration

### User Config File

**Location:** `~/.ml/config.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = "<your-api-key>"

[cli]
default_timeout = 30
verbose = false
```

### Multi-User CLI Configs

**Admin Config:** `~/.ml/config-admin.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = "<admin-api-key>"
```

**Researcher Config:** `~/.ml/config-researcher.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = "<researcher-api-key>"
```

**Analyst Config:** `~/.ml/config-analyst.toml`

```toml
[server]
worker_host = "localhost"
worker_user = "appuser"
worker_base = "/app"
worker_port = 22

[auth]
api_key = "<analyst-api-key>"
```
## Configuration Options

### Authentication

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `auth.enabled` | bool | false | Enable authentication |
| `auth.api_keys` | map | {} | API key configurations |
| `auth.api_keys.{user}.hash` | string | - | SHA-256 hash of the API key |
| `auth.api_keys.{user}.admin` | bool | false | Admin privileges |
| `auth.api_keys.{user}.roles` | array | [] | User roles |
| `auth.api_keys.{user}.permissions` | map | {} | User permissions |
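The config files store only key hashes, never raw keys. Assuming plain, unsalted SHA-256 (which is what the `CHANGE_ME_SHA256_*` placeholders above suggest), a value for the `hash` field can be generated with standard tools; the key below is a placeholder:

```shell
# Hash a raw API key for an auth.api_keys entry.
# Placeholder key; assumes plain unsalted SHA-256.
API_KEY="replace-with-your-real-key"
HASH=$(printf '%s' "$API_KEY" | sha256sum | awk '{print $1}')
echo "$HASH"
```

Note the `printf '%s'` instead of `echo`: a trailing newline in the input would produce a different digest.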
### Server

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `server.address` | string | ":9101" | Server bind address |
| `server.tls.enabled` | bool | false | Enable TLS |
| `server.tls.cert_file` | string | - | TLS certificate file |
| `server.tls.key_file` | string | - | TLS private key file |
### Security

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `security.production_mode` | bool | false | Enable production hardening |
| `security.allowed_origins` | array | [] | Allowed CORS origins |
| `security.api_key_rotation_days` | int | 90 | Days until API key rotation is required |
| `security.audit_logging.enabled` | bool | false | Enable audit logging |
| `security.audit_logging.log_path` | string | - | Audit log file path |
| `security.rate_limit.enabled` | bool | true | Enable rate limiting |
| `security.rate_limit.requests_per_minute` | int | 60 | Requests-per-minute limit |
| `security.rate_limit.burst_size` | int | 10 | Burst request allowance |
| `security.ip_whitelist` | array | [] | Allowed IP addresses/CIDR ranges |
| `security.failed_login_lockout.enabled` | bool | false | Enable login lockout |
| `security.failed_login_lockout.max_attempts` | int | 5 | Max failed attempts before lockout |
| `security.failed_login_lockout.lockout_duration` | string | "15m" | Lockout duration (e.g., "15m") |
### Monitoring

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `monitoring.prometheus.enabled` | bool | true | Enable Prometheus metrics |
| `monitoring.prometheus.port` | int | 9101 | Prometheus metrics port |
| `monitoring.prometheus.path` | string | "/metrics" | Metrics endpoint path |
| `monitoring.health_checks.enabled` | bool | true | Enable health checks |
| `monitoring.health_checks.interval` | string | "30s" | Health check interval |
### Database

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `database.type` | string | "sqlite" | Database type (sqlite, postgres, mysql) |
| `database.connection` | string | - | Connection string or path |
| `database.host` | string | - | Database host (for postgres/mysql) |
| `database.port` | int | - | Database port (for postgres/mysql) |
| `database.username` | string | - | Database username |
| `database.password` | string | - | Database password |
| `database.database` | string | - | Database name |
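All of the shipped configs use SQLite, so for the networked backends here is a hypothetical PostgreSQL block, assuming the options in the table map directly to YAML keys (the hostname and credentials are illustrative placeholders):

```yaml
database:
  type: "postgres"
  host: "db.internal"    # hypothetical host
  port: 5432
  username: "fetchml"
  password: "CHANGE_ME"
  database: "fetchml"
```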
### Queue

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `queue.type` | string | "native" | Queue backend type (native, redis, sqlite, filesystem) |
| `queue.native.data_dir` | string | - | Data directory for the native queue |
| `queue.sqlite_path` | string | - | SQLite database path for the queue |
| `queue.filesystem_path` | string | - | Filesystem queue path |
| `queue.fallback_to_filesystem` | bool | false | Fall back to the filesystem queue on Redis failure |
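For a Redis-backed queue, `fallback_to_filesystem` lets workers keep draining jobs when Redis is unreachable. A sketch, assuming these keys sit under a top-level `queue` block as in the worker config above and that the fallback reuses `filesystem_path`:

```yaml
queue:
  type: "redis"
  filesystem_path: "/data/queue-fallback"   # used only while Redis is down
  fallback_to_filesystem: true
```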
### Resources

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `resources.max_workers` | int | 1 | Maximum concurrent workers |
| `resources.desired_rps_per_worker` | int | 2 | Desired requests per second per worker |
| `resources.requests_per_sec` | int | - | Global request rate limit |
| `resources.request_burst` | int | - | Request burst allowance |
| `resources.podman_cpus` | string | "2" | CPU limit for Podman containers |
| `resources.podman_memory` | string | "4Gi" | Memory limit for Podman containers |
### Plugin GPU Quotas

Control GPU allocation for plugin-based services (Jupyter, vLLM, etc.).

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `scheduler.plugin_quota.enabled` | bool | false | Enable plugin GPU quota enforcement |
| `scheduler.plugin_quota.total_gpus` | int | 0 | Global GPU limit across all plugins (0 = unlimited) |
| `scheduler.plugin_quota.per_user_gpus` | int | 0 | Default per-user GPU limit (0 = unlimited) |
| `scheduler.plugin_quota.per_user_services` | int | 0 | Default per-user service count limit (0 = unlimited) |
| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_gpus` | int | 0 | Plugin-specific GPU limit |
| `scheduler.plugin_quota.per_plugin_limits.{plugin}.max_services` | int | 0 | Plugin-specific service count limit |
| `scheduler.plugin_quota.user_overrides.{user}.max_gpus` | int | 0 | Per-user GPU override |
| `scheduler.plugin_quota.user_overrides.{user}.max_services` | int | 0 | Per-user service limit override |
| `scheduler.plugin_quota.user_overrides.{user}.allowed_plugins` | array | [] | Plugins the user may use (empty = all) |

**Example configuration:**

```yaml
scheduler:
  plugin_quota:
    enabled: true
    total_gpus: 16
    per_user_gpus: 4
    per_user_services: 2
    per_plugin_limits:
      vllm:
        max_gpus: 8
        max_services: 4
      jupyter:
        max_gpus: 4
        max_services: 10
    user_overrides:
      admin:
        max_gpus: 8
        max_services: 5
        allowed_plugins: ["jupyter", "vllm"]
```
### Redis

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `redis.url` | string | "redis://localhost:6379" | Redis connection URL |
| `redis.addr` | string | - | Redis host:port shorthand |
| `redis.password` | string | - | Redis password |
| `redis.db` | int | 0 | Redis database number |
| `redis.max_connections` | int | 10 | Max Redis connections |
### Logging

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `logging.level` | string | "info" | Log level |
| `logging.file` | string | - | Log file path |
| `logging.audit_log` | string | - | Audit log path |
## Permission System

### Permission Keys

| Permission | Description |
|------------|-------------|
| `jobs:read` | Read job information |
| `jobs:create` | Create new jobs |
| `jobs:update` | Update existing jobs |
| `jobs:delete` | Delete jobs |
| `*` | All permissions (admin only) |

### Role-Based Permissions

| Role | Default Permissions |
|------|---------------------|
| admin | All permissions |
| researcher | `jobs:read`, `jobs:create`, `jobs:update` |
| analyst | `jobs:read` |
| user | No default permissions |
## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `FETCHML_CONFIG` | - | Path to config file |
| `FETCHML_LOG_LEVEL` | "info" | Override log level |
| `CLI_CONFIG` | - | Path to CLI config file |
| `FETCH_ML_GPU_TYPE` | - | Override GPU vendor detection (nvidia, amd, apple, none); takes precedence over the config file |
| `FETCH_ML_GPU_COUNT` | - | Override GPU count detection; used with the auto-detected or configured vendor |
| `FETCH_ML_TOTAL_CPU` | - | Override total CPU count detection; sets the number of CPU cores available |
| `FETCH_ML_GPU_SLOTS_PER_GPU` | 1 | Override GPU slots per GPU; controls how many concurrent tasks can share a single GPU |

When environment variable overrides are active, they are logged to stderr at worker startup for debugging.

Note: when `gpu_vendor: amd` is configured, the system uses the NVIDIA detector implementation (aliased) because the devices are exposed in similar ways. The `configured_vendor` field will report "amd" while the actual detection uses NVIDIA-compatible methods.
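For example, to force a worker to treat the host as a 2-GPU NVIDIA machine with 16 cores regardless of auto-detection, export the overrides before starting the worker (the values here are illustrative):

```shell
# Override hardware detection; the worker logs active overrides to stderr at startup
export FETCH_ML_GPU_TYPE=nvidia       # takes precedence over gpu_vendor in the config file
export FETCH_ML_GPU_COUNT=2
export FETCH_ML_TOTAL_CPU=16
export FETCH_ML_GPU_SLOTS_PER_GPU=1   # one task per GPU at a time
```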
## Troubleshooting

### Common Configuration Issues

1. **Authentication Failures**
   - Check that API key hashes are valid SHA-256 digests
   - Verify YAML syntax
   - Ensure `auth.enabled: true`

2. **Connection Issues**
   - Verify server address and ports
   - Check firewall settings
   - Validate network connectivity

3. **Permission Issues**
   - Check user roles and permissions
   - Verify permission key format
   - Ensure admin users have `"*": true`

### Configuration Validation

```bash
# Validate server configuration
go run cmd/api-server/main.go --config configs/api/dev.yaml --validate

# Test CLI configuration
./cli/zig-out/bin/ml status --debug
```
---

## See Also

- **[Architecture](architecture.md)** - System architecture overview
- **[Scheduler Architecture](scheduler-architecture.md)** - Scheduler configuration details
- **[Environment Variables](environment-variables.md)** - Additional environment variable documentation
- **[Security Guide](security.md)** - Security-related configuration
- **[Deployment Guide](deployment.md)** - Production configuration guidance
- **[Jupyter Workflow](jupyter-workflow.md)** - Jupyter service configuration
- **[vLLM Workflow](vllm-workflow.md)** - vLLM service configuration