
# Security Guide

This document outlines security features, best practices, and hardening procedures for FetchML.

## Overview

FetchML implements defense-in-depth security with multiple layers of protection:

1. **File Ingestion Security** - Path traversal prevention, file type validation
2. **Sandbox Hardening** - Container isolation with seccomp, capability dropping
3. **Secrets Management** - Environment-based credential injection with plaintext detection
4. **Audit Logging** - Tamper-evident logging for compliance (HIPAA)
5. **Authentication** - API key-based access control with RBAC

---

## Security Features

### Authentication & Authorization
- **API Keys**: SHA256-hashed with role-based access control (RBAC)
- **Permissions**: Granular read/write/delete permissions per user
- **IP Whitelisting**: Network-level access control
- **Rate Limiting**: Per-user request quotas

### Communication Security
- **TLS/HTTPS**: End-to-end encryption for API traffic
- **WebSocket Auth**: API key required before upgrade
- **Redis Auth**: Password-protected task queue

### Data Privacy
- **Log Sanitization**: Automatically redacts API keys, passwords, tokens
- **Experiment Isolation**: User-specific experiment directories
- **No Anonymous Access**: All services require authentication
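
As an illustration of log sanitization, the sketch below redacts the value of any `key=value` or `key: value` pair whose key looks sensitive. The key names and the `[REDACTED]` marker are assumptions made for this example; FetchML's actual patterns may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// secretPattern matches key=value or key: value pairs whose key looks sensitive.
// The key names listed here are illustrative assumptions, not FetchML's actual list.
var secretPattern = regexp.MustCompile(`(?i)\b(api[_-]?key|password|token)(\s*[=:]\s*)("[^"]*"|\S+)`)

// Redact replaces the value of each sensitive pair with a fixed marker,
// keeping the key and separator so the log line stays readable.
func Redact(line string) string {
	return secretPattern.ReplaceAllString(line, "${1}${2}[REDACTED]")
}

func main() {
	fmt.Println(Redact(`login user=alice password=hunter2 api_key: abc123`))
	// prints: login user=alice password=[REDACTED] api_key: [REDACTED]
}
```

Redacting at the logging layer, rather than at each call site, means a forgotten call site cannot leak a credential.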

### Network Security
- **Internal Networks**: Backend services (Redis, Loki) not exposed publicly
- **Firewall Rules**: Restrictive port access
- **Container Isolation**: Services run in separate containers/pods

---

## Comprehensive Security Hardening (2026-02)

### File Ingestion Security

All file operations are protected against path traversal attacks:

```go
// All paths are validated with symlink resolution
validator := fileutil.NewSecurePathValidator(basePath)
cleanPath, err := validator.ValidatePath(userInput)
if err != nil {
    return fmt.Errorf("path validation failed: %w", err)
}
```

**Features:**
- Symlink resolution and canonicalization
- Path boundary enforcement (cannot escape base directory)
- Magic bytes validation for ML artifacts (safetensors, GGUF, HDF5)
- Dangerous extension blocking (.pt, .pkl, .exe, .sh)
- Upload limits (size, rate, frequency)

### Sandbox Hardening

Containers run with hardened security defaults:

```yaml
# configs/worker/homelab-sandbox.yaml
sandbox:
  network_mode: "none"           # No network access by default
  read_only_root: true          # Read-only filesystem
  no_new_privileges: true       # Prevent privilege escalation
  drop_all_caps: true           # Drop all capabilities
  allowed_caps: []              # Add specific CAP_* entries only if required
  user_ns: true                 # User namespace isolation
  run_as_uid: 1000               # Run as non-root user
  run_as_gid: 1000
  seccomp_profile: "default-hardened"  # Restricted syscall profile
  max_runtime_hours: 24
  max_upload_size_bytes: 10737418240   # 10GB
  max_upload_rate_bps: 104857600       # 100MB/s
  max_uploads_per_minute: 10
```

**Seccomp Profile** (`configs/seccomp/default-hardened.json`):
- Blocks: `ptrace`, `mount`, `umount2`, `reboot`, `kexec_load`
- Blocks: `open_by_handle_at`, `perf_event_open`
- Default action: `SCMP_ACT_ERRNO` (deny by default)

### Secrets Management

**Environment Variable Expansion:**
```yaml
# config.yaml - use ${VAR} syntax for secrets
redis_password: "${REDIS_PASSWORD}"
snapshot_store:
  access_key: "${AWS_ACCESS_KEY_ID}"
  secret_key: "${AWS_SECRET_ACCESS_KEY}"
```

**Plaintext Detection:**
The system detects and rejects plaintext secrets using:
- Shannon entropy calculation (more than 4 bits per character suggests a secret)
- Pattern matching: AWS keys (`AKIA`, `ASIA`), GitHub tokens (`ghp_`), etc.
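
The two heuristics can be sketched as follows. The function names are hypothetical, and only the prefixes named above are checked; the real detector presumably covers more patterns.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// ShannonEntropy returns the entropy of s in bits per character,
// computed from the frequency of each rune in the string.
func ShannonEntropy(s string) float64 {
	runes := []rune(s)
	if len(runes) == 0 {
		return 0
	}
	counts := make(map[rune]int)
	for _, r := range runes {
		counts[r]++
	}
	var h float64
	n := float64(len(runes))
	for _, c := range counts {
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

// LooksLikeSecret applies both heuristics: known token prefixes first,
// then entropy above 4 bits per character.
func LooksLikeSecret(value string) bool {
	for _, prefix := range []string{"AKIA", "ASIA", "ghp_"} {
		if strings.HasPrefix(value, prefix) {
			return true
		}
	}
	return ShannonEntropy(value) > 4.0
}

func main() {
	for _, v := range []string{"localhost", "AKIAIOSFODNN7EXAMPLE", "${REDIS_PASSWORD}"} {
		fmt.Printf("%-24s entropy=%.2f secret=%v\n", v, ShannonEntropy(v), LooksLikeSecret(v))
	}
}
```

A string needs more than 16 distinct characters to exceed 4 bits per character, so short or hex-only secrets are caught by the prefix patterns rather than by the entropy check.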

**Loading Process:**
1. Config loaded from YAML
2. Environment variables expanded (`${VAR}` → value)
3. Plaintext secrets detected and rejected
4. Validation fails if secrets don't use env reference syntax

### HIPAA-Compliant Audit Logging

**Tamper-Evident Logging:**
```go
// Each event includes chain hash for integrity
audit.Log(audit.Event{
    EventType: audit.EventFileRead,
    UserID:    "user1",
    Resource:  "/data/file.txt",
})
```

**Event Types:**
- `file_read` - File access logged
- `file_write` - File modification logged
- `file_delete` - File deletion logged
- `auth_success` / `auth_failure` - Authentication events
- `job_queued` / `job_started` / `job_completed` - Job lifecycle

**Chain Hashing:**
- Each event includes SHA-256 hash of previous event
- Modification of any log entry breaks the chain
- `VerifyChain()` function detects tampering
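
To show why editing one entry breaks the chain, here is a simplified sketch. The `Entry` fields, the `Append` helper, and the `VerifyChain` signature are assumptions for illustration and do not reproduce the `audit` package.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Entry is a simplified audit record; the field names are assumptions.
type Entry struct {
	EventType string
	UserID    string
	Resource  string
	PrevHash  string // hash of the previous entry, "" for the first
	Hash      string // hash over this entry's fields plus PrevHash
}

// hashEntry computes the SHA-256 digest that links e to its predecessor.
// A real implementation would use an unambiguous field encoding.
func hashEntry(e Entry) string {
	sum := sha256.Sum256([]byte(e.PrevHash + "|" + e.EventType + "|" + e.UserID + "|" + e.Resource))
	return hex.EncodeToString(sum[:])
}

// Append adds an entry to the log, chaining it to the last one.
func Append(log []Entry, e Entry) []Entry {
	e.PrevHash = ""
	if len(log) > 0 {
		e.PrevHash = log[len(log)-1].Hash
	}
	e.Hash = hashEntry(e)
	return append(log, e)
}

// VerifyChain returns the index of the first inconsistent entry, or -1.
func VerifyChain(log []Entry) int {
	prev := ""
	for i, e := range log {
		if e.PrevHash != prev || e.Hash != hashEntry(e) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	var log []Entry
	log = Append(log, Entry{EventType: "auth_success", UserID: "user1"})
	log = Append(log, Entry{EventType: "file_read", UserID: "user1", Resource: "/data/file.txt"})
	fmt.Println("intact:", VerifyChain(log)) // prints: intact: -1

	log[0].UserID = "attacker" // tamper with the first entry
	fmt.Println("tampered at:", VerifyChain(log)) // prints: tampered at: 0
}
```

Because each hash covers the previous hash, an attacker who rewrites one entry must also recompute every later entry, which is detectable if the latest hash is stored or published elsewhere.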

---

## Security Checklist

### Initial Setup

1. **Generate Strong Passwords**
  ```bash
  # Grafana admin password
  openssl rand -base64 32 > .grafana-password

  # Redis password
  openssl rand -base64 32
  ```

2. **Configure Environment Variables**
  ```bash
  cp .env.example .env
  # Edit .env and set:
  # - GRAFANA_ADMIN_PASSWORD
  # - REDIS_PASSWORD
  ```

3. **Enable TLS** (Production only)
  ```yaml
  # configs/api/prod.yaml
  server:
    tls:
      enabled: true
      cert_file: "/secrets/cert.pem"
      key_file: "/secrets/key.pem"
  ```

4. **Configure Firewall**
  ```bash
  # Allow only necessary ports
  sudo ufw allow 22/tcp    # SSH
  sudo ufw allow 443/tcp   # HTTPS
  sudo ufw allow 80/tcp    # HTTP (redirect to HTTPS)
  sudo ufw enable
  ```

### Production Hardening

5. **Restrict IP Access**
  ```yaml
  # configs/api/prod.yaml
  auth:
    ip_whitelist:
      - "10.0.0.0/8"
      - "192.168.0.0/16"
      - "127.0.0.1"
  ```

6. **Enable Audit Logging**
  ```yaml
  logging:
    level: "info"
    audit: true
    file: "/var/log/fetch_ml/audit.log"
  ```

7. **Harden Redis**
  ```bash
  # Redis security
  redis-cli CONFIG SET requirepass "your-strong-password"
  # rename-command cannot be changed with CONFIG SET; add these lines to
  # redis.conf and restart Redis to disable the commands:
  #   rename-command FLUSHDB ""
  #   rename-command FLUSHALL ""
  ```

8. **Secure Grafana**
  ```bash
  # Change default admin password
  docker-compose exec grafana grafana-cli admin reset-admin-password new-strong-password
  ```

9. **Regular Updates**
  ```bash
  # Update system packages
  sudo apt update && sudo apt upgrade -y

  # Update containers
  docker-compose pull
  docker-compose up -d  # testing only
  ```
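
The `ip_whitelist` entries from step 5 mix CIDR ranges with a bare address (`127.0.0.1`). A minimal sketch of how such a list could be matched against a client address is shown below; `ParseAllowlist` and `Allowed` are hypothetical names, and treating a bare address as a single-host range is an assumption.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// ParseAllowlist converts entries such as "10.0.0.0/8" or "127.0.0.1"
// into prefixes. A bare address is treated as a single-host prefix.
func ParseAllowlist(entries []string) ([]netip.Prefix, error) {
	prefixes := make([]netip.Prefix, 0, len(entries))
	for _, entry := range entries {
		if strings.Contains(entry, "/") {
			p, err := netip.ParsePrefix(entry)
			if err != nil {
				return nil, err
			}
			prefixes = append(prefixes, p.Masked())
			continue
		}
		addr, err := netip.ParseAddr(entry)
		if err != nil {
			return nil, err
		}
		prefixes = append(prefixes, netip.PrefixFrom(addr, addr.BitLen()))
	}
	return prefixes, nil
}

// Allowed reports whether ip falls inside any allowlisted prefix.
func Allowed(prefixes []netip.Prefix, ip string) bool {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false // unparseable addresses are denied
	}
	addr = addr.Unmap() // treat ::ffff:a.b.c.d as IPv4
	for _, p := range prefixes {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	prefixes, err := ParseAllowlist([]string{"10.0.0.0/8", "192.168.0.0/16", "127.0.0.1"})
	if err != nil {
		panic(err)
	}
	for _, ip := range []string{"10.1.2.3", "127.0.0.1", "203.0.113.7"} {
		fmt.Println(ip, Allowed(prefixes, ip))
	}
	// prints:
	// 10.1.2.3 true
	// 127.0.0.1 true
	// 203.0.113.7 false
}
```

When the API server sits behind a reverse proxy, the address to check is the client address forwarded by the proxy, not the proxy's own.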

## Password Management

### Generate Secure Passwords

```bash
# Method 1: OpenSSL
openssl rand -base64 32

# Method 2: pwgen (if installed)
pwgen -s 32 1

# Method 3: /dev/urandom
head -c 32 /dev/urandom | base64
```

### Store Passwords Securely

**Development**: Use `.env` file (gitignored)
```bash
echo "REDIS_PASSWORD=$(openssl rand -base64 32)" >> .env
echo "GRAFANA_ADMIN_PASSWORD=$(openssl rand -base64 32)" >> .env
```

**Production**: Use systemd environment files
```bash
sudo mkdir -p /etc/fetch_ml/secrets
sudo chmod 700 /etc/fetch_ml/secrets
echo "REDIS_PASSWORD=..." | sudo tee /etc/fetch_ml/secrets/redis.env
sudo chmod 600 /etc/fetch_ml/secrets/redis.env
```

## API Key Management

### Generate API Keys

```bash
# Generate random API key
openssl rand -hex 32

# Hash for storage
echo -n "your-api-key" | sha256sum
```
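
The same two steps, plus verification of a presented key against its stored hash, can be sketched in Go. The function names are hypothetical, and the constant-time comparison is a common precaution rather than a documented FetchML behavior.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// NewAPIKey returns 32 random bytes as a 64-character hex string,
// the equivalent of `openssl rand -hex 32`.
func NewAPIKey() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// HashAPIKey returns the hex SHA-256 digest of key, matching the digest
// printed by `echo -n KEY | sha256sum`.
func HashAPIKey(key string) string {
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}

// KeyMatches compares a presented key against a stored hash in constant time.
func KeyMatches(presented, storedHash string) bool {
	return subtle.ConstantTimeCompare([]byte(HashAPIKey(presented)), []byte(storedHash)) == 1
}

func main() {
	key, err := NewAPIKey()
	if err != nil {
		panic(err)
	}
	stored := HashAPIKey(key) // only this value belongs in the config file
	fmt.Println("hash length:", len(stored))
	fmt.Println("valid key accepted:", KeyMatches(key, stored))
	fmt.Println("wrong key accepted:", KeyMatches("wrong", stored))
}
```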

### Rotate API Keys

1. Generate new API key
2. Update your chosen API server config (for example a private copy of `configs/api/homelab-secure.yaml`) with the new hash
3. Distribute new key to users
4. Remove old key after grace period

### Revoke API Keys

Remove user entry from your API server config file:
```yaml
auth:
  api_keys:
    # user_to_revoke:  # Comment out or delete
```

## Secret Flow (What lives where)

- **API server config (`configs/api/*.yaml`)**
  - Stores **SHA256 hashes** of API keys (never raw keys).
  - The repo-shipped configs intentionally contain `CHANGE_ME_...` placeholders.
  - For real deployments, make a private copy (e.g. `/etc/fetch_ml/config.yaml`) and fill in real hashes.

- **Docker Compose `.env` / secret files**
  - Used for values that should not be committed (e.g. `REDIS_PASSWORD`, Grafana admin password).
  - `deployments/docker-compose.homelab-secure.yml` requires `REDIS_PASSWORD` to be set explicitly.

- **TLS certs**
  - Provided as mounted files (e.g. `/app/ssl/cert.pem`, `/app/ssl/key.pem`).

## Network Security

### Production Network Topology

```
Internet
    ↓
[Firewall] (ports 3000, 9102)
    ↓
[Reverse Proxy] (nginx/Apache) - TLS termination
    ↓
┌─────────────────────┐
│   Application Pod   │
│                     │
│  ┌──────────────┐   │
│  │ API Server   │   │  ← Public (via reverse proxy)
│  └──────────────┘   │
│                     │
│  ┌──────────────┐   │
│  │   Redis      │   │  ← Internal only
│  └──────────────┘   │
│                     │
│  ┌──────────────┐   │
│  │   Grafana    │   │  ← Public (via reverse proxy)
│  └──────────────┘   │
│                     │
│  ┌──────────────┐   │
│  │ Prometheus   │   │  ← Internal only
│  └──────────────┘   │
│                     │
│  ┌──────────────┐   │
│  │    Loki      │   │  ← Internal only
│  └──────────────┘   │
└─────────────────────┘
```

### Recommended Firewall Rules

```bash
# Allow only necessary inbound connections
sudo firewall-cmd --permanent --zone=public --add-rich-rule='
  rule family="ipv4"
  source address="YOUR_NETWORK"
  port port="3000" protocol="tcp" accept'

sudo firewall-cmd --permanent --zone=public --add-rich-rule='
  rule family="ipv4"
  source address="YOUR_NETWORK"
  port port="9102" protocol="tcp" accept'

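# Drop all other traffic. --set-default-zone applies to both the runtime and
# permanent configuration, so --permanent is not needed here.
# Note: interfaces that follow the default zone move to "drop"; keep the
# interface bound to "public" (or add the rules above to "drop") so the
# allow rules still apply.
sudo firewall-cmd --set-default-zone=drop
sudo firewall-cmd --reload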
```

## Incident Response

### Suspected Breach

1. **Immediate Actions**
   - Rotate all API keys
   - Stop affected services
   - Review audit logs

2. **Investigation**
   ```bash
   # Check recent logins
   sudo journalctl -u fetchml-api --since "1 hour ago"

   # Review failed auth attempts
   grep "authentication failed" /var/log/fetch_ml/*.log

   # Check active connections
   ss -tnp | grep :9102
   ```

3. **Recovery**
   - Rotate all passwords and API keys
   - Update firewall rules
   - Patch vulnerabilities
   - Resume services

### Security Monitoring

```bash
# Monitor failed authentication
tail -f /var/log/fetch_ml/api.log | grep "auth.*failed"

# Monitor unusual activity
journalctl -u fetchml-api -f | grep -E "(ERROR|WARN)"

# Check open ports
nmap -p- localhost
```

## Security Best Practices

1. **Principle of Least Privilege**: Grant minimum necessary permissions
2. **Defense in Depth**: Multiple security layers (firewall + auth + TLS)
3. **Regular Updates**: Keep all components patched
4. **Audit Regularly**: Review logs and access patterns
5. **Secure Secrets**: Never commit passwords/keys to git
6. **Network Segmentation**: Isolate services with internal networks
7. **Monitor Everything**: Enable comprehensive logging and alerting
8. **Test Security**: Regular penetration testing and vulnerability scans

## Compliance

### Data Privacy
- Logs are sanitized (no passwords/API keys)
- Experiment data is user-isolated
- No telemetry or external data sharing

### Audit Trail
All API access is logged with:
- Timestamp
- User/API key
- Action performed
- Source IP
- Result (success/failure)

## Getting Help

- **Security Issues**: Report privately via email
- **Questions**: See documentation or create issue
- **Updates**: Monitor releases for security patches