fetch_ml/docs/src/security.md
Jeremie Fraeys b00439b86e
docs(security): document comprehensive security hardening
Updates documentation with new security features and hardening guide:

**CHANGELOG.md:**
- Added detailed security hardening section (2026-02-23)
- Documents all phases: file ingestion, sandbox, secrets, audit logging, tests
- Lists specific files changed and security controls implemented

**docs/src/security.md:**
- Added Overview section with defense-in-depth layers
- Added Comprehensive Security Hardening section with:
  - File ingestion security with code examples
  - Sandbox hardening with complete YAML config
  - Secrets management with env expansion syntax
  - HIPAA audit logging with tamper-evident chain hashing
2026-02-23 18:03:25 -05:00


# Security Guide
This document outlines security features, best practices, and hardening procedures for FetchML.
## Overview
FetchML implements defense-in-depth security with multiple layers of protection:
1. **File Ingestion Security** - Path traversal prevention, file type validation
2. **Sandbox Hardening** - Container isolation with seccomp, capability dropping
3. **Secrets Management** - Environment-based credential injection with plaintext detection
4. **Audit Logging** - Tamper-evident logging for compliance (HIPAA)
5. **Authentication** - API key-based access control with RBAC
---
## Security Features
### Authentication & Authorization
- **API Keys**: SHA256-hashed with role-based access control (RBAC)
- **Permissions**: Granular read/write/delete permissions per user
- **IP Whitelisting**: Network-level access control
- **Rate Limiting**: Per-user request quotas
### Communication Security
- **TLS/HTTPS**: Encrypts API traffic in transit
- **WebSocket Auth**: API key required before upgrade
- **Redis Auth**: Password-protected task queue
### Data Privacy
- **Log Sanitization**: Automatically redacts API keys, passwords, tokens
- **Experiment Isolation**: User-specific experiment directories
- **No Anonymous Access**: All services require authentication
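As an illustration of the log sanitization idea, here is a minimal Go sketch that redacts the values of secret-looking fields before a line is written. The field names, the `sanitize` helper, and the `[REDACTED]` marker are assumptions made for this example and may differ from FetchML's actual sanitizer.

```go
package main

import (
	"fmt"
	"regexp"
)

// secretField matches key/value pairs such as "password=...", "token: ...",
// or "api_key=...". The field names are illustrative assumptions.
var secretField = regexp.MustCompile(`(?i)\b(api[_-]?key|password|token)(\s*[=:]\s*)\S+`)

// sanitize (hypothetical helper) replaces the value of each matched field
// with a fixed marker, keeping the field name so the line stays readable.
func sanitize(line string) string {
	return secretField.ReplaceAllString(line, "${1}${2}[REDACTED]")
}

func main() {
	fmt.Println(sanitize("login failed user=alice password=hunter2"))
	fmt.Println(sanitize("header X-API-Key: 0123abcd"))
}
```

Keeping the field name in the output makes it possible to see that a secret was present without exposing its value.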
### Network Security
- **Internal Networks**: Backend services (Redis, Loki) not exposed publicly
- **Firewall Rules**: Restrictive port access
- **Container Isolation**: Services run in separate containers/pods
---
## Comprehensive Security Hardening (2026-02)
### File Ingestion Security
All file operations are protected against path traversal attacks:
```go
// All paths are validated with symlink resolution
validator := fileutil.NewSecurePathValidator(basePath)
cleanPath, err := validator.ValidatePath(userInput)
if err != nil {
	return fmt.Errorf("path validation failed: %w", err)
}
```
**Features:**
- Symlink resolution and canonicalization
- Path boundary enforcement (cannot escape base directory)
- Magic bytes validation for ML artifacts (safetensors, GGUF, HDF5)
- Dangerous extension blocking (.pt, .pkl, .exe, .sh)
- Upload limits (size, rate, frequency)
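The magic bytes check listed above can be sketched as follows. This is a minimal example, not FetchML's implementation: `detectFormat` and the signature table are hypothetical. It covers only GGUF (files begin with the ASCII bytes `GGUF`) and HDF5 (an 8-byte signature, typically at offset 0); safetensors has no fixed magic number and would need its length-prefixed JSON header parsed instead.

```go
package main

import (
	"bytes"
	"fmt"
)

// magicSignatures maps a format name to the bytes expected at the start
// of the file. The table is an illustrative assumption.
var magicSignatures = map[string][]byte{
	"gguf": []byte("GGUF"),
	"hdf5": {0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n'},
}

// detectFormat (hypothetical helper) returns the name of the format whose
// signature prefixes head, or "" when no known signature matches.
func detectFormat(head []byte) string {
	for name, sig := range magicSignatures {
		if bytes.HasPrefix(head, sig) {
			return name
		}
	}
	return ""
}

func main() {
	fmt.Println(detectFormat([]byte("GGUF\x03\x00\x00\x00")))
	fmt.Println(detectFormat([]byte("#!/bin/sh\n")) == "")
}
```

Checking content rather than the file extension means a renamed shell script or pickle file is still rejected.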
### Sandbox Hardening
Containers run with hardened security defaults:
```yaml
# configs/worker/homelab-sandbox.yaml
sandbox:
  network_mode: "none"                  # No network access by default
  read_only_root: true                  # Read-only filesystem
  no_new_privileges: true               # Prevent privilege escalation
  drop_all_caps: true                   # Drop all capabilities
  allowed_caps: []                      # Add individual CAP_* entries only if required
  user_ns: true                         # User namespace isolation
  run_as_uid: 1000                      # Run as non-root user
  run_as_gid: 1000
  seccomp_profile: "default-hardened"   # Restricted syscall profile
  max_runtime_hours: 24
  max_upload_size_bytes: 10737418240    # 10 GiB
  max_upload_rate_bps: 104857600        # 100 MiB/s (bytes per second)
  max_uploads_per_minute: 10
```
**Seccomp Profile** (`configs/seccomp/default-hardened.json`):
- Blocks: `ptrace`, `mount`, `umount2`, `reboot`, `kexec_load`
- Blocks: `open_by_handle_at`, `perf_event_open`
- Default action: `SCMP_ACT_ERRNO` (deny by default)
### Secrets Management
**Environment Variable Expansion:**
```yaml
# config.yaml - use ${VAR} syntax for secrets
redis_password: "${REDIS_PASSWORD}"
snapshot_store:
  access_key: "${AWS_ACCESS_KEY_ID}"
  secret_key: "${AWS_SECRET_ACCESS_KEY}"
```
**Plaintext Detection:**
The system detects and rejects plaintext secrets using:
- Shannon entropy calculation (>4 bits/char indicates secret)
- Pattern matching: AWS keys (`AKIA`, `ASIA`), GitHub tokens (`ghp_`), etc.
**Loading Process:**
1. Config loaded from YAML
2. Environment variables expanded (`${VAR}` → value)
3. Plaintext secrets detected and rejected
4. Validation fails if secrets don't use env reference syntax
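The detection heuristics can be sketched in Go as below. This is a minimal example under stated assumptions: the helper names are hypothetical, the prefix list contains only the patterns named above, and the check is applied to the raw config value (before expansion) so that `${VAR}` references pass. FetchML's actual detector may differ.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// shannonEntropy returns the entropy of s in bits per character.
func shannonEntropy(s string) float64 {
	counts := make(map[rune]int)
	total := 0
	for _, r := range s {
		counts[r]++
		total++
	}
	var h float64
	for _, c := range counts {
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h
}

// secretPrefixes lists the well-known token prefixes mentioned above.
var secretPrefixes = []string{"AKIA", "ASIA", "ghp_"}

// isEnvReference reports whether v has the ${VAR} form.
func isEnvReference(v string) bool {
	return len(v) > 3 && strings.HasPrefix(v, "${") && strings.HasSuffix(v, "}")
}

// looksLikePlaintextSecret (hypothetical helper) flags values that carry a
// known secret prefix or exceed 4 bits of entropy per character.
func looksLikePlaintextSecret(v string) bool {
	if isEnvReference(v) {
		return false
	}
	for _, p := range secretPrefixes {
		if strings.HasPrefix(v, p) {
			return true
		}
	}
	return shannonEntropy(v) > 4.0
}

func main() {
	fmt.Println(looksLikePlaintextSecret("${REDIS_PASSWORD}"))
	fmt.Println(looksLikePlaintextSecret("AKIAIOSFODNN7EXAMPLE"))
}
```

An entropy threshold alone produces false positives and negatives (a short password has low entropy, a long hostname may not), which is why it is combined with prefix matching here.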
### HIPAA-Compliant Audit Logging
**Tamper-Evident Logging:**
```go
// Each event includes chain hash for integrity
audit.Log(audit.Event{
	EventType: audit.EventFileRead,
	UserID:    "user1",
	Resource:  "/data/file.txt",
})
```
**Event Types:**
- `file_read` - File access logged
- `file_write` - File modification logged
- `file_delete` - File deletion logged
- `auth_success` / `auth_failure` - Authentication events
- `job_queued` / `job_started` / `job_completed` - Job lifecycle
**Chain Hashing:**
- Each event includes SHA-256 hash of previous event
- Modification of any log entry breaks the chain
- `VerifyChain()` function detects tampering
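To show why editing one entry breaks the chain, here is a minimal, self-contained Go sketch of chain hashing. The `Event` fields, the `|` separator, and the lowercase `appendEvent` / `verifyChain` helpers are assumptions for illustration only and are not the `audit` package's actual types or its `VerifyChain()` implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Event is a simplified audit record (hypothetical layout).
type Event struct {
	EventType string
	UserID    string
	Resource  string
	PrevHash  string // hash of the previous event, "" for the first one
	Hash      string // hash over PrevHash and this event's fields
}

// hashEvent hashes the previous hash together with the event fields.
// A real implementation would use an unambiguous encoding, not "|".
func hashEvent(e Event) string {
	sum := sha256.Sum256([]byte(e.PrevHash + "|" + e.EventType + "|" + e.UserID + "|" + e.Resource))
	return hex.EncodeToString(sum[:])
}

// appendEvent links e to the last event in chain and appends it.
func appendEvent(chain []Event, e Event) []Event {
	if len(chain) > 0 {
		e.PrevHash = chain[len(chain)-1].Hash
	}
	e.Hash = hashEvent(e)
	return append(chain, e)
}

// verifyChain returns the index of the first inconsistent event, or -1
// when every link and every hash is intact.
func verifyChain(chain []Event) int {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || e.Hash != hashEvent(e) {
			return i
		}
		prev = e.Hash
	}
	return -1
}

func main() {
	chain := appendEvent(nil, Event{EventType: "file_read", UserID: "user1", Resource: "/data/file.txt"})
	chain = appendEvent(chain, Event{EventType: "file_write", UserID: "user1", Resource: "/data/file.txt"})
	fmt.Println(verifyChain(chain)) // intact chain
	chain[0].UserID = "mallory"
	fmt.Println(verifyChain(chain)) // tampered entry is reported
}
```

Note that a hash chain only makes tampering evident; an attacker who can rewrite the whole log can also recompute every hash, so the latest hash should be stored or anchored somewhere the log writer cannot modify.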
---
## Security Checklist
### Initial Setup
1. **Generate Strong Passwords**
```bash
# Grafana admin password
openssl rand -base64 32 > .grafana-password
# Redis password
openssl rand -base64 32
```
2. **Configure Environment Variables**
```bash
cp .env.example .env
# Edit .env and set:
# - REDIS_PASSWORD
# - GRAFANA_ADMIN_PASSWORD
```
3. **Enable TLS** (Production only)
```yaml
# configs/api/prod.yaml
server:
  tls:
    enabled: true
    cert_file: "/secrets/cert.pem"
    key_file: "/secrets/key.pem"
```
4. **Configure Firewall**
```bash
# Allow only necessary ports
sudo ufw allow 22/tcp # SSH
sudo ufw allow 443/tcp # HTTPS
sudo ufw allow 80/tcp # HTTP (redirect to HTTPS)
sudo ufw enable
```
### Production Hardening
5. **Restrict IP Access**
```yaml
# configs/api/prod.yaml
auth:
  ip_whitelist:
    - "10.0.0.0/8"
    - "192.168.0.0/16"
    - "127.0.0.1"
```
6. **Enable Audit Logging**
```yaml
logging:
  level: "info"
  audit: true
  file: "/var/log/fetch_ml/audit.log"
```
7. **Harden Redis**
```bash
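# Redis security
# Set the password at runtime (also set `requirepass` in redis.conf so it survives a restart)
redis-cli CONFIG SET requirepass "your-strong-password"
# rename-command cannot be changed with CONFIG SET; add these lines to redis.conf and restart Redis:
#   rename-command FLUSHDB ""
#   rename-command FLUSHALL ""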
```
8. **Secure Grafana**
```bash
# Change default admin password
docker-compose exec grafana grafana-cli admin reset-admin-password new-strong-password
```
9. **Regular Updates**
```bash
# Update system packages
sudo apt update && sudo apt upgrade -y
# Update containers
docker-compose pull
docker-compose up -d  # testing only
```
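The `ip_whitelist` in step 5 mixes CIDR ranges with a bare address (`127.0.0.1`). The following Go sketch shows one way such a list could be parsed and matched against a client address with the standard `net/netip` package. `parseWhitelist` and `allowed` are hypothetical names, not FetchML's actual API.

```go
package main

import (
	"fmt"
	"net/netip"
)

// parseWhitelist accepts CIDR ranges ("10.0.0.0/8") and bare addresses
// ("127.0.0.1"); a bare address is treated as a single-host prefix.
func parseWhitelist(entries []string) ([]netip.Prefix, error) {
	prefixes := make([]netip.Prefix, 0, len(entries))
	for _, e := range entries {
		if p, err := netip.ParsePrefix(e); err == nil {
			prefixes = append(prefixes, p.Masked())
			continue
		}
		addr, err := netip.ParseAddr(e)
		if err != nil {
			return nil, fmt.Errorf("invalid whitelist entry %q: %w", e, err)
		}
		prefixes = append(prefixes, netip.PrefixFrom(addr, addr.BitLen()))
	}
	return prefixes, nil
}

// allowed reports whether remote falls inside any whitelisted prefix.
// Unparseable addresses are rejected.
func allowed(prefixes []netip.Prefix, remote string) bool {
	addr, err := netip.ParseAddr(remote)
	if err != nil {
		return false
	}
	addr = addr.Unmap() // treat ::ffff:a.b.c.d as plain IPv4
	for _, p := range prefixes {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	wl, err := parseWhitelist([]string{"10.0.0.0/8", "192.168.0.0/16", "127.0.0.1"})
	if err != nil {
		panic(err)
	}
	fmt.Println(allowed(wl, "10.1.2.3"), allowed(wl, "8.8.8.8"))
}
```

When the API server sits behind a reverse proxy, the address to check is the client address forwarded by the proxy, not the proxy's own address, and that header should only be trusted from the proxy itself.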
## Password Management
### Generate Secure Passwords
```bash
# Method 1: OpenSSL
openssl rand -base64 32
# Method 2: pwgen (if installed)
pwgen -s 32 1
# Method 3: /dev/urandom
head -c 32 /dev/urandom | base64
```
### Store Passwords Securely
**Development**: Use `.env` file (gitignored)
```bash
echo "REDIS_PASSWORD=$(openssl rand -base64 32)" >> .env
echo "GRAFANA_ADMIN_PASSWORD=$(openssl rand -base64 32)" >> .env
```
**Production**: Use systemd environment files
```bash
sudo mkdir -p /etc/fetch_ml/secrets
sudo chmod 700 /etc/fetch_ml/secrets
echo "REDIS_PASSWORD=..." | sudo tee /etc/fetch_ml/secrets/redis.env
sudo chmod 600 /etc/fetch_ml/secrets/redis.env
```
## API Key Management
### Generate API Keys
```bash
# Generate random API key
openssl rand -hex 32
# Hash for storage
echo -n "your-api-key" | sha256sum
```
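On the server side, a presented key can be checked against the stored hash roughly as follows. This is a hedged Go sketch: `hashAPIKey` and `keyMatches` are hypothetical helpers, and the hex encoding is assumed to match the `sha256sum` output shown above.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashAPIKey returns the lowercase hex SHA-256 digest of key, the same
// value that `echo -n "key" | sha256sum` prints.
func hashAPIKey(key string) string {
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}

// keyMatches hashes the presented key and compares it with the stored
// hash in constant time, so the comparison does not leak how many
// leading characters matched.
func keyMatches(presented, storedHash string) bool {
	got := hashAPIKey(presented)
	return subtle.ConstantTimeCompare([]byte(got), []byte(storedHash)) == 1
}

func main() {
	stored := hashAPIKey("your-api-key")
	fmt.Println(keyMatches("your-api-key", stored), keyMatches("wrong-key", stored))
}
```

Because only the hash is stored, a leaked config file does not reveal usable keys, provided the keys themselves are long random values such as those produced by `openssl rand -hex 32`.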
### Rotate API Keys
1. Generate new API key
2. Update your chosen API server config (for example a private copy of `configs/api/homelab-secure.yaml`) with the new hash
3. Distribute new key to users
4. Remove old key after grace period
### Revoke API Keys
Remove user entry from your API server config file:
```yaml
auth:
  api_keys:
    # user_to_revoke: # Comment out or delete
```
## Secret Flow (What lives where)
- **API server config (`configs/api/*.yaml`)**
  - Stores **SHA256 hashes** of API keys (never raw keys).
  - The repo-shipped configs intentionally contain `CHANGE_ME_...` placeholders.
  - For real deployments, make a private copy (e.g. `/etc/fetch_ml/config.yaml`) and fill in real hashes.
- **Docker Compose `.env` / secret files**
  - Used for values that should not be committed (e.g. `REDIS_PASSWORD`, Grafana admin password).
  - `deployments/docker-compose.homelab-secure.yml` requires `REDIS_PASSWORD` to be set explicitly.
- **TLS certs**
  - Provided as mounted files (e.g. `/app/ssl/cert.pem`, `/app/ssl/key.pem`).
## Network Security
### Production Network Topology
```
Internet
    ↓
[Firewall] (ports 3000, 9102)
    ↓
[Reverse Proxy] (nginx/Apache) - TLS termination
    ↓
┌─────────────────────┐
│ Application Pod │
│ │
│ ┌──────────────┐ │
│ │ API Server │ │ ← Public (via reverse proxy)
│ └──────────────┘ │
│ │
│ ┌──────────────┐ │
│ │ Redis │ │ ← Internal only
│ └──────────────┘ │
│ │
│ ┌──────────────┐ │
│ │ Grafana │ │ ← Public (via reverse proxy)
│ └──────────────┘ │
│ │
│ ┌──────────────┐ │
│ │ Prometheus │ │ ← Internal only
│ └──────────────┘ │
│ │
│ ┌──────────────┐ │
│ │ Loki │ │ ← Internal only
│ └──────────────┘ │
└─────────────────────┘
```
### Recommended Firewall Rules
```bash
# Allow only necessary inbound connections.
# The rules are added to the "drop" zone because it becomes the default zone below;
# rules in "public" would not apply to interfaces that use the default zone.
sudo firewall-cmd --permanent --zone=drop --add-rich-rule='
rule family="ipv4"
source address="YOUR_NETWORK"
port port="3000" protocol="tcp" accept'
sudo firewall-cmd --permanent --zone=drop --add-rich-rule='
rule family="ipv4"
source address="YOUR_NETWORK"
port port="9102" protocol="tcp" accept'
sudo firewall-cmd --reload
# Block all other traffic (this change is applied to both runtime and permanent config)
sudo firewall-cmd --set-default-zone=drop
```
## Incident Response
### Suspected Breach
1. **Immediate Actions**
   - Rotate all API keys
   - Stop affected services
   - Review audit logs
2. **Investigation**
```bash
# Check recent logins
sudo journalctl -u fetchml-api --since "1 hour ago"
# Review failed auth attempts
grep "authentication failed" /var/log/fetch_ml/*.log
# Check active connections
ss -tnp | grep :9102
```
3. **Recovery**
   - Rotate all passwords and API keys
   - Update firewall rules
   - Patch vulnerabilities
   - Resume services
### Security Monitoring
```bash
# Monitor failed authentication
tail -f /var/log/fetch_ml/api.log | grep "auth.*failed"
# Monitor unusual activity
journalctl -u fetchml-api -f | grep -E "(ERROR|WARN)"
# Check open ports
nmap -p- localhost
```
## Security Best Practices
1. **Principle of Least Privilege**: Grant minimum necessary permissions
2. **Defense in Depth**: Multiple security layers (firewall + auth + TLS)
3. **Regular Updates**: Keep all components patched
4. **Audit Regularly**: Review logs and access patterns
5. **Secure Secrets**: Never commit passwords/keys to git
6. **Network Segmentation**: Isolate services with internal networks
7. **Monitor Everything**: Enable comprehensive logging and alerting
8. **Test Security**: Regular penetration testing and vulnerability scans
## Compliance
### Data Privacy
- Logs are sanitized (no passwords/API keys)
- Experiment data is user-isolated
- No telemetry or external data sharing
### Audit Trail
All API access is logged with:
- Timestamp
- User/API key
- Action performed
- Source IP
- Result (success/failure)
## Getting Help
- **Security Issues**: Report privately via email
- **Questions**: See documentation or create issue
- **Updates**: Monitor releases for security patches