fetch_ml/docs/src/security.md
Jeremie Fraeys b00439b86e
docs(security): document comprehensive security hardening
Updates documentation with new security features and hardening guide:

**CHANGELOG.md:**
- Added detailed security hardening section (2026-02-23)
- Documents all phases: file ingestion, sandbox, secrets, audit logging, tests
- Lists specific files changed and security controls implemented

**docs/src/security.md:**
- Added Overview section with defense-in-depth layers
- Added Comprehensive Security Hardening section with:
  - File ingestion security with code examples
  - Sandbox hardening with complete YAML config
  - Secrets management with env expansion syntax
  - HIPAA audit logging with tamper-evident chain hashing
2026-02-23 18:03:25 -05:00

Security Guide

This document outlines security features, best practices, and hardening procedures for FetchML.

Overview

FetchML implements defense-in-depth security with multiple layers of protection:

  1. File Ingestion Security - Path traversal prevention, file type validation
  2. Sandbox Hardening - Container isolation with seccomp, capability dropping
  3. Secrets Management - Environment-based credential injection with plaintext detection
  4. Audit Logging - Tamper-evident logging for compliance (HIPAA)
  5. Authentication - API key-based access control with RBAC

Security Features

Authentication & Authorization

  • API Keys: SHA256-hashed with role-based access control (RBAC)
  • Permissions: Granular read/write/delete permissions per user
  • IP Whitelisting: Network-level access control
  • Rate Limiting: Per-user request quotas

Communication Security

  • TLS/HTTPS: End-to-end encryption for API traffic
  • WebSocket Auth: API key required before upgrade
  • Redis Auth: Password-protected task queue
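
The "API key required before upgrade" rule can be pictured as HTTP middleware that rejects the handshake before any WebSocket upgrade happens. The following is a minimal sketch, not FetchML's actual handler: the `X-API-Key` header name, `requireAPIKey`, `hashKey`, and `statusFor` are all hypothetical, and the upgrade itself is replaced by a placeholder handler.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// hashKey returns the hex-encoded SHA-256 digest of an API key.
func hashKey(key string) string {
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}

// requireAPIKey (hypothetical) rejects requests whose X-API-Key header does
// not hash to a known value, so the wrapped handler never sees them.
func requireAPIKey(validHashes map[string]bool, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !validHashes[hashKey(r.Header.Get("X-API-Key"))] {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends a WebSocket-style handshake with the given key through the
// middleware and returns the HTTP status code that comes back.
func statusFor(key string) int {
	valid := map[string]bool{hashKey("demo-key"): true}
	upgrade := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Placeholder: a real server would perform the WebSocket upgrade here.
		w.WriteHeader(http.StatusOK)
	})
	req := httptest.NewRequest(http.MethodGet, "/ws", nil)
	req.Header.Set("Upgrade", "websocket")
	if key != "" {
		req.Header.Set("X-API-Key", key)
	}
	rec := httptest.NewRecorder()
	requireAPIKey(valid, upgrade).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor("demo-key"), statusFor("wrong-key"), statusFor(""))
}
```

Only the request carrying the known key reaches the placeholder upgrade handler; the other two are answered with 401 by the middleware.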

Data Privacy

  • Log Sanitization: Automatically redacts API keys, passwords, tokens
  • Experiment Isolation: User-specific experiment directories
  • No Anonymous Access: All services require authentication
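
As an illustration of log sanitization, the sketch below redacts values that follow a sensitive key name. It is an assumption about how such a filter could look, not FetchML's sanitizer: the pattern, the `[REDACTED]` marker, and the `sanitize` function are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// sensitive matches "key=value" or "key: value" pairs whose key looks like a
// credential. The value is either a quoted string or a run of non-space text.
var sensitive = regexp.MustCompile(`(?i)\b(api[_-]?key|password|token)(\s*[=:]\s*)("[^"]*"|\S+)`)

// sanitize keeps the key and separator but replaces the value.
func sanitize(line string) string {
	return sensitive.ReplaceAllString(line, "${1}${2}[REDACTED]")
}

func main() {
	fmt.Println(sanitize(`login failed password=hunter2 user=bob`))
	fmt.Println(sanitize(`config loaded api_key: "abc 123"`))
}
```

A pattern-based filter like this only catches secrets that appear next to a recognizable key name, so it complements rather than replaces keeping secrets out of log calls in the first place.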

Network Security

  • Internal Networks: Backend services (Redis, Loki) not exposed publicly
  • Firewall Rules: Restrictive port access
  • Container Isolation: Services run in separate containers/pods

Comprehensive Security Hardening (2026-02)

File Ingestion Security

All file operations are protected against path traversal attacks:

// All paths are validated with symlink resolution
validator := fileutil.NewSecurePathValidator(basePath)
cleanPath, err := validator.ValidatePath(userInput)
if err != nil {
    return fmt.Errorf("path validation failed: %w", err)
}
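
For readers who want to see what such a check involves, here is a minimal sketch of symlink-aware boundary enforcement using only the standard library. It is not the `fileutil` implementation: `validatePath` and `demo` are hypothetical names, and the sketch assumes the target already exists on disk (`filepath.EvalSymlinks` fails otherwise).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// validatePath (hypothetical) resolves symlinks in base and in the requested
// path, then rejects any result that falls outside the resolved base.
func validatePath(base, userInput string) (string, error) {
	realBase, err := filepath.EvalSymlinks(base)
	if err != nil {
		return "", fmt.Errorf("resolve base: %w", err)
	}
	// Join cleans ".." segments lexically; symlinks still need resolving.
	resolved, err := filepath.EvalSymlinks(filepath.Join(realBase, userInput))
	if err != nil {
		return "", fmt.Errorf("resolve path: %w", err)
	}
	rel, err := filepath.Rel(realBase, resolved)
	if err != nil {
		return "", fmt.Errorf("relate path: %w", err)
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("path %q escapes base directory", userInput)
	}
	return resolved, nil
}

// demo builds a temporary base directory containing one regular file and one
// symlink that points outside it, then validates both.
func demo() (insideErr, escapeErr error) {
	base, err := os.MkdirTemp("", "fetchml-base-")
	if err != nil {
		return err, nil
	}
	defer os.RemoveAll(base)
	outside, err := os.MkdirTemp("", "fetchml-outside-")
	if err != nil {
		return err, nil
	}
	defer os.RemoveAll(outside)
	if err := os.WriteFile(filepath.Join(base, "model.gguf"), []byte("x"), 0o600); err != nil {
		return err, nil
	}
	if err := os.Symlink(outside, filepath.Join(base, "link")); err != nil {
		return err, nil
	}
	_, insideErr = validatePath(base, "model.gguf")
	_, escapeErr = validatePath(base, "link")
	return insideErr, escapeErr
}

func main() {
	insideErr, escapeErr := demo()
	fmt.Println("inside:", insideErr)
	fmt.Println("symlink escape:", escapeErr)
}
```

The regular file validates cleanly, while the symlink is rejected because its resolved target lies outside the base directory even though its name does not contain "..".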

Features:

  • Symlink resolution and canonicalization
  • Path boundary enforcement (cannot escape base directory)
  • Magic bytes validation for ML artifacts (safetensors, GGUF, HDF5)
  • Dangerous extension blocking (.pt, .pkl, .exe, .sh)
  • Upload limits (size, rate, frequency)
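
The extension and magic-byte checks listed above can be sketched as follows. This is an illustration under stated assumptions, not FetchML's validator: `detectArtifact` and `blockedExt` are hypothetical names. GGUF and HDF5 files start with fixed signatures; safetensors has no fixed signature, so the sketch falls back to a heuristic (an 8-byte little-endian header length followed by a JSON object) with an arbitrary sanity bound on the header size.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"path/filepath"
	"strings"
)

// blockedExt mirrors the dangerous extensions named in the list above.
var blockedExt = map[string]bool{".pt": true, ".pkl": true, ".exe": true, ".sh": true}

// detectArtifact (hypothetical) rejects blocked extensions, then classifies
// the file from its leading bytes.
func detectArtifact(name string, head []byte) (string, error) {
	ext := strings.ToLower(filepath.Ext(name))
	if blockedExt[ext] {
		return "", fmt.Errorf("extension %q is blocked", ext)
	}
	switch {
	case bytes.HasPrefix(head, []byte("GGUF")):
		return "gguf", nil
	case bytes.HasPrefix(head, []byte("\x89HDF\r\n\x1a\n")):
		return "hdf5", nil
	case len(head) >= 9:
		// Heuristic for safetensors: header length, then the start of JSON.
		// The 100 MiB cap is an assumed sanity bound for this sketch.
		n := binary.LittleEndian.Uint64(head[:8])
		if n > 0 && n < 100<<20 && head[8] == '{' {
			return "safetensors", nil
		}
	}
	return "", fmt.Errorf("unrecognized file signature")
}

func main() {
	kind, err := detectArtifact("model.gguf", []byte("GGUF\x03\x00\x00\x00"))
	fmt.Println(kind, err)
	_, err = detectArtifact("weights.pkl", []byte("GGUF"))
	fmt.Println(err)
}
```

Checking the extension first means a pickle file is refused even if its leading bytes were crafted to look like an allowed format.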

Sandbox Hardening

Containers run with hardened security defaults:

# configs/worker/homelab-sandbox.yaml
sandbox:
  network_mode: "none"           # No network access by default
  read_only_root: true          # Read-only filesystem
  no_new_privileges: true       # Prevent privilege escalation
  drop_all_caps: true           # Drop all capabilities
  allowed_caps: []              # Add specific CAP_* capabilities only if required
  user_ns: true                 # User namespace isolation
  run_as_uid: 1000               # Run as non-root user
  run_as_gid: 1000
  seccomp_profile: "default-hardened"  # Restricted syscall profile
  max_runtime_hours: 24
  max_upload_size_bytes: 10737418240   # 10 GiB
  max_upload_rate_bps: 104857600       # 100 MiB/s
  max_uploads_per_minute: 10
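
To make the effect of these settings concrete, the sketch below translates a subset of them into the equivalent `docker run` / `podman run` flags. How FetchML actually applies the configuration is not described here, so treat the `Sandbox` struct, `runFlags`, `hardened`, and the seccomp path as assumptions; `user_ns` and the runtime and upload limits are left out because they do not map to a single flag.

```go
package main

import (
	"fmt"
	"strings"
)

// Sandbox holds a subset of the YAML fields above (hypothetical struct).
type Sandbox struct {
	NetworkMode     string
	ReadOnlyRoot    bool
	NoNewPrivileges bool
	DropAllCaps     bool
	AllowedCaps     []string
	RunAsUID        int
	RunAsGID        int
	SeccompProfile  string // path to a seccomp JSON profile
}

// runFlags maps the sandbox settings to container runtime flags.
func runFlags(s Sandbox) []string {
	var flags []string
	if s.NetworkMode != "" {
		flags = append(flags, "--network="+s.NetworkMode)
	}
	if s.ReadOnlyRoot {
		flags = append(flags, "--read-only")
	}
	if s.NoNewPrivileges {
		flags = append(flags, "--security-opt=no-new-privileges")
	}
	if s.DropAllCaps {
		flags = append(flags, "--cap-drop=ALL")
	}
	for _, c := range s.AllowedCaps {
		flags = append(flags, "--cap-add="+c)
	}
	flags = append(flags, fmt.Sprintf("--user=%d:%d", s.RunAsUID, s.RunAsGID))
	if s.SeccompProfile != "" {
		flags = append(flags, "--security-opt=seccomp="+s.SeccompProfile)
	}
	return flags
}

// hardened returns the defaults from the YAML example above.
func hardened() Sandbox {
	return Sandbox{
		NetworkMode:     "none",
		ReadOnlyRoot:    true,
		NoNewPrivileges: true,
		DropAllCaps:     true,
		RunAsUID:        1000,
		RunAsGID:        1000,
		SeccompProfile:  "configs/seccomp/default-hardened.json",
	}
}

// flagString joins the flags for display.
func flagString(s Sandbox) string {
	return strings.Join(runFlags(s), " ")
}

func main() {
	fmt.Println(flagString(hardened()))
}
```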

Seccomp Profile (configs/seccomp/default-hardened.json):

  • Blocks: ptrace, mount, umount2, reboot, kexec_load
  • Blocks: open_by_handle_at, perf_event_open
  • Default action: SCMP_ACT_ERRNO (deny by default)
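
A profile with these properties can be checked mechanically before it is deployed. The sketch below parses a profile in the Docker/OCI seccomp JSON layout (`defaultAction`, `syscalls[].names`, `syscalls[].action`) and confirms that the default action denies and that none of the syscalls listed above has been allowed. `checkProfile` and `mustBlock` are hypothetical names, and the contents of the shipped profile are not assumed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// profile models the parts of a seccomp JSON profile this check needs.
type profile struct {
	DefaultAction string `json:"defaultAction"`
	Syscalls      []struct {
		Names  []string `json:"names"`
		Action string   `json:"action"`
	} `json:"syscalls"`
}

// mustBlock lists the syscalls this guide says are blocked.
var mustBlock = []string{
	"ptrace", "mount", "umount2", "reboot", "kexec_load",
	"open_by_handle_at", "perf_event_open",
}

// checkProfile (hypothetical) verifies deny-by-default and that no blocked
// syscall appears in an allow rule.
func checkProfile(raw []byte) error {
	var p profile
	if err := json.Unmarshal(raw, &p); err != nil {
		return fmt.Errorf("parse profile: %w", err)
	}
	if p.DefaultAction != "SCMP_ACT_ERRNO" {
		return fmt.Errorf("default action is %q, want SCMP_ACT_ERRNO", p.DefaultAction)
	}
	allowed := map[string]bool{}
	for _, rule := range p.Syscalls {
		if rule.Action == "SCMP_ACT_ALLOW" {
			for _, name := range rule.Names {
				allowed[name] = true
			}
		}
	}
	for _, name := range mustBlock {
		if allowed[name] {
			return fmt.Errorf("syscall %q must not be allowed", name)
		}
	}
	return nil
}

func main() {
	good := []byte(`{"defaultAction":"SCMP_ACT_ERRNO","syscalls":[{"names":["read","write"],"action":"SCMP_ACT_ALLOW"}]}`)
	bad := []byte(`{"defaultAction":"SCMP_ACT_ERRNO","syscalls":[{"names":["ptrace"],"action":"SCMP_ACT_ALLOW"}]}`)
	fmt.Println(checkProfile(good))
	fmt.Println(checkProfile(bad))
}
```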

Secrets Management

Environment Variable Expansion:

# config.yaml - use ${VAR} syntax for secrets
redis_password: "${REDIS_PASSWORD}"
snapshot_store:
  access_key: "${AWS_ACCESS_KEY_ID}"
  secret_key: "${AWS_SECRET_ACCESS_KEY}"

Plaintext Detection: The system detects and rejects plaintext secrets using:

  • Shannon entropy calculation (values above 4 bits per character are treated as likely secrets)
  • Pattern matching: AWS keys (AKIA, ASIA), GitHub tokens (ghp_), etc.
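
The two heuristics can be sketched in a few lines. This is an illustration, not FetchML's detector: `shannonEntropy`, `looksLikePlaintextSecret`, and the `knownPrefixes` list (limited to the prefixes named above) are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// shannonEntropy returns the entropy of s in bits per character, computed
// from the frequency of each distinct character.
func shannonEntropy(s string) float64 {
	counts := map[rune]int{}
	total := 0
	for _, r := range s {
		counts[r]++
		total++
	}
	var h float64
	for _, c := range counts {
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h
}

// knownPrefixes holds the token prefixes mentioned in the list above.
var knownPrefixes = []string{"AKIA", "ASIA", "ghp_"}

// looksLikePlaintextSecret flags values with a known prefix or with entropy
// above the 4 bits/char threshold.
func looksLikePlaintextSecret(v string) bool {
	for _, p := range knownPrefixes {
		if strings.HasPrefix(v, p) {
			return true
		}
	}
	return shannonEntropy(v) > 4.0
}

func main() {
	fmt.Println(shannonEntropy("abcd"))                    // four equally likely characters
	fmt.Println(looksLikePlaintextSecret("${REDIS_PASSWORD}"))
	fmt.Println(looksLikePlaintextSecret("AKIAIOSFODNN7EXAMPLE"))
}
```

Because entropy above 4 bits per character requires more than 16 distinct characters, short or repetitive passwords pass this test, which is one reason to pair it with pattern matching and with the requirement to use `${VAR}` references.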

Loading Process:

  1. Config loaded from YAML
  2. Environment variables expanded (${VAR} → value)
  3. Plaintext secrets detected and rejected
  4. Validation fails if secrets don't use env reference syntax
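
One way to picture the loading steps is a resolver that accepts a secret field only when its raw value is a `${VAR}` reference and then expands it from the environment. The sketch below checks the reference syntax on the raw value before expanding, which is an assumption about how the checks can be arranged; `resolveSecret`, `demo`, and the `FETCHML_DEMO_PASSWORD` variable are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// envRef matches a value that consists solely of a ${VAR} reference.
var envRef = regexp.MustCompile(`^\$\{[A-Za-z_][A-Za-z0-9_]*\}$`)

// resolveSecret (hypothetical) rejects literal values and expands references.
func resolveSecret(field, raw string) (string, error) {
	if !envRef.MatchString(raw) {
		return "", fmt.Errorf("%s: secrets must use ${VAR} syntax", field)
	}
	name := raw[2 : len(raw)-1] // strip "${" and "}"
	val, ok := os.LookupEnv(name)
	if !ok || val == "" {
		return "", fmt.Errorf("%s: environment variable %s is not set", field, name)
	}
	return val, nil
}

// demo resolves one reference and one literal value for comparison.
func demo() (val string, refErr, literalErr error) {
	os.Setenv("FETCHML_DEMO_PASSWORD", "value-from-env")
	defer os.Unsetenv("FETCHML_DEMO_PASSWORD")
	val, refErr = resolveSecret("redis_password", "${FETCHML_DEMO_PASSWORD}")
	_, literalErr = resolveSecret("redis_password", "hunter2")
	return val, refErr, literalErr
}

func main() {
	val, refErr, literalErr := demo()
	fmt.Println(val, refErr)
	fmt.Println(literalErr)
}
```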

HIPAA-Compliant Audit Logging

Tamper-Evident Logging:

// Each event includes chain hash for integrity
audit.Log(audit.Event{
    EventType: audit.EventFileRead,
    UserID:    "user1",
    Resource:  "/data/file.txt",
})

Event Types:

  • file_read - File access logged
  • file_write - File modification logged
  • file_delete - File deletion logged
  • auth_success / auth_failure - Authentication events
  • job_queued / job_started / job_completed - Job lifecycle

Chain Hashing:

  • Each event includes SHA-256 hash of previous event
  • Modification of any log entry breaks the chain
  • VerifyChain() function detects tampering
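
The chaining idea can be shown with a self-contained sketch. It does not reproduce the `audit` package: the `Event` fields, `hashEvent`, `appendEvent`, and `verifyChain` here are hypothetical, and the real `VerifyChain()` signature may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Event is a simplified audit record carrying its own hash and the hash of
// the previous record.
type Event struct {
	EventType string
	UserID    string
	Resource  string
	PrevHash  string
	Hash      string
}

// hashEvent hashes the previous hash together with the event fields. Each
// field is length-prefixed so that field boundaries cannot be shifted.
func hashEvent(e Event) string {
	h := sha256.New()
	for _, f := range []string{e.PrevHash, e.EventType, e.UserID, e.Resource} {
		fmt.Fprintf(h, "%d:%s|", len(f), f)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// appendEvent links e to the last entry and appends it to the chain.
func appendEvent(chain []Event, e Event) []Event {
	if n := len(chain); n > 0 {
		e.PrevHash = chain[n-1].Hash
	}
	e.Hash = hashEvent(e)
	return append(chain, e)
}

// verifyChain returns the index of the first inconsistent entry, or -1 if
// the whole chain is intact.
func verifyChain(chain []Event) int {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || e.Hash != hashEvent(e) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var chain []Event
	chain = appendEvent(chain, Event{EventType: "file_read", UserID: "user1", Resource: "/data/file.txt"})
	chain = appendEvent(chain, Event{EventType: "file_write", UserID: "user1", Resource: "/data/file.txt"})
	fmt.Println("intact:", verifyChain(chain))

	chain[0].Resource = "/data/other.txt" // simulate tampering
	fmt.Println("after tampering:", verifyChain(chain))
}
```

An attacker who can rewrite every later hash can still forge a consistent chain, so the latest hash is usually anchored somewhere the attacker cannot write.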

Security Checklist

Initial Setup

  1. Generate Strong Passwords
# Grafana admin password
openssl rand -base64 32 > .grafana-password

# Redis password (printed to stdout; store it as REDIS_PASSWORD)
openssl rand -base64 32
  2. Configure Environment Variables
cp .env.example .env
# Edit .env and set:
# - GRAFANA_ADMIN_PASSWORD
# - REDIS_PASSWORD
  3. Enable TLS (Production only)
# configs/api/prod.yaml
server:
  tls:
    enabled: true
    cert_file: "/secrets/cert.pem"
    key_file: "/secrets/key.pem"
  4. Configure Firewall
# Allow only necessary ports
sudo ufw allow 22/tcp    # SSH
sudo ufw allow 443/tcp   # HTTPS
sudo ufw allow 80/tcp    # HTTP (redirect to HTTPS)
sudo ufw enable

Production Hardening

  1. Restrict IP Access
# configs/api/prod.yaml
auth:
  ip_whitelist:
    - "10.0.0.0/8"
    - "192.168.0.0/16"
    - "127.0.0.1"
  2. Enable Audit Logging
logging:
  level: "info"
  audit: true
  file: "/var/log/fetch_ml/audit.log"
  3. Harden Redis
# Set a password at runtime (also add requirepass to redis.conf so it persists)
redis-cli CONFIG SET requirepass "your-strong-password"
# rename-command cannot be changed with CONFIG SET; disable the destructive
# commands in redis.conf and restart Redis:
#   rename-command FLUSHDB ""
#   rename-command FLUSHALL ""
  4. Secure Grafana
# Change default admin password
docker-compose exec grafana grafana-cli admin reset-admin-password new-strong-password
  5. Regular Updates
# Update system packages
sudo apt update && sudo apt upgrade -y

# Update containers
docker-compose pull
docker-compose up -d   # (testing only)
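
The `ip_whitelist` entries in step 1 mix CIDR ranges with a bare address. The sketch below shows one way such a list could be parsed and matched with the standard `net/netip` package; it is an assumption about the matching logic, and `parseAllowlist` and `allowed` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/netip"
)

// parseAllowlist accepts CIDR ranges ("10.0.0.0/8") and single addresses
// ("127.0.0.1"), converting the latter to a full-length prefix.
func parseAllowlist(entries []string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, e := range entries {
		if p, err := netip.ParsePrefix(e); err == nil {
			out = append(out, p.Masked())
			continue
		}
		addr, err := netip.ParseAddr(e)
		if err != nil {
			return nil, fmt.Errorf("invalid allowlist entry %q: %w", e, err)
		}
		out = append(out, netip.PrefixFrom(addr, addr.BitLen()))
	}
	return out, nil
}

// allowed reports whether remote falls inside any allowlisted prefix.
// Unparseable addresses are denied.
func allowed(list []netip.Prefix, remote string) bool {
	addr, err := netip.ParseAddr(remote)
	if err != nil {
		return false
	}
	addr = addr.Unmap() // treat IPv4-mapped IPv6 addresses as IPv4
	for _, p := range list {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	list, err := parseAllowlist([]string{"10.0.0.0/8", "192.168.0.0/16", "127.0.0.1"})
	if err != nil {
		panic(err)
	}
	for _, ip := range []string{"10.1.2.3", "127.0.0.1", "8.8.8.8"} {
		fmt.Println(ip, allowed(list, ip))
	}
}
```

When the API server sits behind a reverse proxy, the address to check is the client address forwarded by the proxy, not the proxy's own.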

Password Management

Generate Secure Passwords

# Method 1: OpenSSL
openssl rand -base64 32

# Method 2: pwgen (if installed)
pwgen -s 32 1

# Method 3: /dev/urandom
head -c 32 /dev/urandom | base64

Store Passwords Securely

Development: Use .env file (gitignored)

echo "REDIS_PASSWORD=$(openssl rand -base64 32)" >> .env
echo "GRAFANA_ADMIN_PASSWORD=$(openssl rand -base64 32)" >> .env

Production: Use systemd environment files

sudo mkdir -p /etc/fetch_ml/secrets
sudo chmod 700 /etc/fetch_ml/secrets
echo "REDIS_PASSWORD=..." | sudo tee /etc/fetch_ml/secrets/redis.env > /dev/null
sudo chmod 600 /etc/fetch_ml/secrets/redis.env

API Key Management

Generate API Keys

# Generate random API key
openssl rand -hex 32

# Hash for storage
echo -n "your-api-key" | sha256sum
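
The same generate-and-hash steps can be done in code. The sketch below is an illustration rather than FetchML's implementation: `newAPIKey`, `hashAPIKey`, and `keyMatches` are hypothetical helpers that produce the same hex formats as the shell commands above.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newAPIKey returns 32 random bytes as 64 hex characters, like
// `openssl rand -hex 32`.
func newAPIKey() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// hashAPIKey returns the hex SHA-256 digest that would be stored in config,
// matching the digest printed by `sha256sum`.
func hashAPIKey(key string) string {
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}

// keyMatches compares the digest of a presented key with a stored digest in
// constant time.
func keyMatches(presented, storedHash string) bool {
	got := hashAPIKey(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	key, err := newAPIKey()
	if err != nil {
		panic(err)
	}
	stored := hashAPIKey(key)
	fmt.Println("key length:", len(key))
	fmt.Println("matches:", keyMatches(key, stored))
	fmt.Println("wrong key matches:", keyMatches("guess", stored))
}
```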

Rotate API Keys

  1. Generate new API key
  2. Update your chosen API server config (for example a private copy of configs/api/homelab-secure.yaml) with the new hash
  3. Distribute new key to users
  4. Remove old key after grace period

Revoke API Keys

Remove user entry from your API server config file:

auth:
  api_keys:
    # user_to_revoke:  # Comment out or delete

Secret Flow (What lives where)

  • API server config (configs/api/*.yaml)

    • Stores SHA256 hashes of API keys (never raw keys).
    • The repo-shipped configs intentionally contain CHANGE_ME_... placeholders.
    • For real deployments, make a private copy (e.g. /etc/fetch_ml/config.yaml) and fill in real hashes.
  • Docker Compose .env / secret files

    • Used for values that should not be committed (e.g. REDIS_PASSWORD, Grafana admin password).
    • deployments/docker-compose.homelab-secure.yml requires REDIS_PASSWORD to be set explicitly.
  • TLS certs

    • Provided as mounted files (e.g. /app/ssl/cert.pem, /app/ssl/key.pem).

Network Security

Production Network Topology

Internet
    ↓
[Firewall] (ports 3000, 9102)
    ↓
[Reverse Proxy] (nginx/Apache) - TLS termination
    ↓
┌─────────────────────┐
│   Application Pod   │
│                     │
│  ┌──────────────┐   │
│  │ API Server   │   │  ← Public (via reverse proxy)
│  └──────────────┘   │
│                     │
│  ┌──────────────┐   │
│  │   Redis      │   │  ← Internal only
│  └──────────────┘   │
│                     │
│  ┌──────────────┐   │
│  │   Grafana    │   │  ← Public (via reverse proxy)
│  └──────────────┘   │
│                     │
│  ┌──────────────┐   │
│  │ Prometheus   │   │  ← Internal only
│  └──────────────┘   │
│                     │
│  ┌──────────────┐   │
│  │    Loki      │   │  ← Internal only
│  └──────────────┘   │
└─────────────────────┘

# Allow only necessary inbound connections
sudo firewall-cmd --permanent --zone=public --add-rich-rule='
  rule family="ipv4"
  source address="YOUR_NETWORK"
  port port="3000" protocol="tcp" accept'

sudo firewall-cmd --permanent --zone=public --add-rich-rule='
  rule family="ipv4"
  source address="YOUR_NETWORK"
  port port="9102" protocol="tcp" accept'

# Block all other traffic (--set-default-zone changes both the runtime and
# the permanent configuration, so --permanent is not needed)
sudo firewall-cmd --set-default-zone=drop
sudo firewall-cmd --reload

Incident Response

Suspected Breach

  1. Immediate Actions

    • Rotate all API keys
    • Stop affected services
    • Review audit logs
  2. Investigation

    # Check recent logins
    sudo journalctl -u fetchml-api --since "1 hour ago"

    # Review failed auth attempts
    grep "authentication failed" /var/log/fetch_ml/*.log

    # Check active connections
    ss -tnp | grep :9102

  3. Recovery

    • Rotate all passwords and API keys
    • Update firewall rules
    • Patch vulnerabilities
    • Resume services

Security Monitoring

# Monitor failed authentication
tail -f /var/log/fetch_ml/api.log | grep "auth.*failed"

# Monitor unusual activity
journalctl -u fetchml-api -f | grep -E "(ERROR|WARN)"

# Check open ports
nmap -p- localhost

Security Best Practices

  1. Principle of Least Privilege: Grant minimum necessary permissions
  2. Defense in Depth: Multiple security layers (firewall + auth + TLS)
  3. Regular Updates: Keep all components patched
  4. Audit Regularly: Review logs and access patterns
  5. Secure Secrets: Never commit passwords/keys to git
  6. Network Segmentation: Isolate services with internal networks
  7. Monitor Everything: Enable comprehensive logging and alerting
  8. Test Security: Regular penetration testing and vulnerability scans

Compliance

Data Privacy

  • Logs are sanitized (no passwords/API keys)
  • Experiment data is user-isolated
  • No telemetry or external data sharing

Audit Trail

All API access is logged with:

  • Timestamp
  • User/API key
  • Action performed
  • Source IP
  • Result (success/failure)

Getting Help

  • Security Issues: Report privately via email
  • Questions: See documentation or create issue
  • Updates: Monitor releases for security patches