# Privacy & Security

FetchML includes privacy-conscious features for research environments handling sensitive data.

---

## Privacy Levels

Control experiment visibility with four privacy levels.

### Available Levels

| Level | Visibility | Use Case |
|-------|-----------|----------|
| `private` | Owner only (default) | Sensitive/unpublished research |
| `team` | Same team members | Collaborative team projects |
| `public` | All authenticated users | Open research, shared datasets |
| `anonymized` | All users with PII stripped | Public release, papers |

### Setting Privacy

```bash
# Make experiment private (default)
ml privacy set run_abc --level private

# Share with team
ml privacy set run_abc --level team --team vision-research

# Make public within organization
ml privacy set run_abc --level public

# Prepare for anonymized export
ml privacy set run_abc --level anonymized
```

### Privacy in Manifest

Privacy settings are stored in the experiment manifest:

```json
{
  "privacy": {
    "level": "team",
    "team": "vision-research",
    "owner": "researcher@lab.edu"
  }
}
```

---

## PII Detection

Automatically detect potentially identifying information in experiment metadata.

### What Gets Detected

- **Email addresses** - `user@example.com`
- **IP addresses** - `192.168.1.1`, `10.0.0.5`
- **Phone numbers** - Basic pattern matching
- **SSN patterns** - `123-45-6789`

### Using Privacy Scan

When adding annotations with sensitive context:

```bash
# Scan for PII before storing
ml annotate run_abc \
  --note "Contact at user@example.com for questions" \
  --privacy-scan

# Output:
# Warning: Potential PII detected:
#   - email: 'user@example.com'
# Use --force to store anyway, or edit your note.
```

### Override Warnings

If PII is intentional and acceptable:

```bash
ml annotate run_abc \
  --note "Contact at user@example.com" \
  --privacy-scan \
  --force
```

### Redacting PII

For anonymized exports, PII is automatically redacted:

```bash
ml export run_abc --anonymize
```

Redacted content becomes `[EMAIL-1]`, `[IP-1]`, etc.

---

## Anonymized Export

Export experiments for external sharing without leaking sensitive information.

### Basic Anonymization

```bash
ml export run_abc --bundle run_abc.tar.gz --anonymize
```

### Anonymization Levels

**Metadata-only** (default):

- Strips internal paths: `/nas/private/data` → `/datasets/data`
- Replaces internal IPs: `10.0.0.5` → `[INTERNAL-1]`
- Hashes email addresses: `user@lab.edu` → `[RESEARCHER-A]`
- Keeps experiment structure and metrics

**Full**:

- Everything in metadata-only, plus:
  - Removes logs entirely
  - Removes annotations
  - Redacts all PII from notes

```bash
# Full anonymization
ml export run_abc --anonymize --anonymize-level full
```

### What Gets Anonymized

| Original | Anonymized | Notes |
|----------|------------|-------|
| `/home/user/data` | `/workspace/data` | Paths generalized |
| `/nas/private/lab` | `/datasets/lab` | Internal mounts hidden |
| `user@lab.edu` | `[RESEARCHER-A]` | Consistent per user |
| `10.0.0.5` | `[INTERNAL-1]` | IP ranges replaced |
| `john@example.com` | `[EMAIL-1]` | PII redacted |

### Export Verification

Review what's in the export:

```bash
# Export and list contents
ml export run_abc --anonymize -o /tmp/run_abc.tar.gz
tar tzf /tmp/run_abc.tar.gz | head -20
```

---

## Dataset Identity & Checksums

Verify dataset integrity with SHA256 checksums.
### Computing Checksums

Datasets are automatically checksummed when registered:

```bash
ml dataset register /path/to/dataset --name my-dataset
# Computes SHA256 of all files in dataset
```

### Verifying Datasets

```bash
# Verify dataset integrity
ml dataset verify /path/to/my-dataset

# Output:
# ✓ Dataset checksum verified
#   Expected: sha256:abc123...
#   Actual:   sha256:abc123...
```

### Checksum in Manifest

```json
{
  "datasets": [{
    "name": "imagenet-train",
    "checksum": "sha256:def456...",
    "sample_count": 1281167
  }]
}
```

---

## Security Best Practices

### 1. Default to Private

Keep experiments private until ready to share:

```bash
# Private by default
ml queue train.py --hypothesis "..."

# Later, when ready to share
ml privacy set run_abc --level team --team my-team
```

### 2. Scan Before Sharing

Always use `--privacy-scan` when adding notes that might contain PII:

```bash
ml annotate run_abc --note "..." --privacy-scan
```

### 3. Anonymize for External Release

Before exporting for papers or public release:

```bash
ml export run_abc --anonymize --anonymize-level full
```

### 4. Verify Dataset Integrity

Regularly verify datasets, especially shared ones:

```bash
ml dataset verify /path/to/shared/dataset
```

### 5. Use Team Privacy for Collaboration

Share with specific teams rather than making public:

```bash
ml privacy set run_abc --level team --team ml-group
```

---

## Compliance Considerations

### GDPR / Research Ethics

| Requirement | FetchML Support | Status |
|-------------|-----------------|--------|
| Right to access | `ml export` creates data bundles | ✅ |
| Right to erasure | Delete command (future) | ⏳ |
| Data minimization | Narrative fields collect only necessary data | ✅ |
| PII detection | `ml annotate --privacy-scan` | ✅ |
| Anonymization | `ml export --anonymize` | ✅ |

### Handling Sensitive Data

For experiments with sensitive data:

1. **Keep private**: Use `--level private`
2. **PII scan all annotations**: Always use `--privacy-scan`
3. **Anonymize before export**: Use `--anonymize-level full`
4. **Verify team membership**: Confirm membership before sharing at `--level team`

---

## Configuration

### Worker Privacy Settings

Configure privacy defaults in worker config:

```yaml
privacy:
  default_level: private
  enforce_teams: true
  audit_access: true
```

### API Server Privacy

Enable privacy enforcement:

```yaml
security:
  privacy:
    enabled: true
    default_level: private
    audit_access: true
```

---

## Troubleshooting

### PII Scan False Positives

Some valid text may trigger PII warnings:

```bash
# Example: "batch@32" looks like an email address
ml annotate run_abc --note "Use batch@32 for training" --privacy-scan
# Warning triggers; use --force if intended
```

### Privacy Changes Not Applied

- Verify you own the experiment
- Check server supports privacy enforcement
- Try with explicit base path: `--base /path/to/experiments`

### Export Not Anonymized

- Ensure `--anonymize` flag is set
- Check `--anonymize-level` is correct (metadata-only vs full)
- Verify manifest contains privacy data

---

## See Also

- `docs/src/research-features.md` - Research workflow features
- `docs/src/deployment.md` - Production deployment with privacy
- `docs/src/quick-start.md` - Getting started guide
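The false-positive behavior described under Troubleshooting comes from loose pattern matching. The patterns below are illustrative, not FetchML's actual detector; they show why `batch@32` can be flagged by a scanner that matches any `@`-joined token, while a stricter pattern requiring a dotted domain would not flag it:

```python
import re

# Deliberately loose pattern: any non-space run joined by "@",
# similar in spirit to scanners that flag "batch@32".
LOOSE_EMAIL = re.compile(r"\S+@\S+")

# Stricter pattern: requires a dotted domain, which "batch@32" lacks.
STRICT_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

note = "Use batch@32 for training; contact user@example.com"
print(LOOSE_EMAIL.findall(note))   # flags both tokens
print(STRICT_EMAIL.findall(note))  # flags only the real address
```

A loose pattern trades precision for recall, which is the safer default for a privacy scanner; hence the `--force` escape hatch for intentional cases.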