
# Privacy & Security

FetchML includes privacy-conscious features for research environments handling sensitive data.


## Privacy Levels

Control experiment visibility with four privacy levels.

### Available Levels

| Level | Visibility | Use Case |
|-------|------------|----------|
| `private` | Owner only (default) | Sensitive/unpublished research |
| `team` | Same team members | Collaborative team projects |
| `public` | All authenticated users | Open research, shared datasets |
| `anonymized` | All users, with PII stripped | Public release, papers |
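The table above amounts to a small access-control rule set. As a rough sketch of the decision logic (the function and its signature are hypothetical, not FetchML's actual API):

```python
# Hypothetical sketch of the visibility rules in the table above;
# FetchML's real access-control code is not shown in these docs.
def can_view(level, owner, viewer, team=None, viewer_team=None,
             authenticated=False):
    if viewer == owner:
        return True  # owners always see their own experiments
    if level == "private":
        return False
    if level == "team":
        return team is not None and viewer_team == team
    if level == "public":
        return authenticated  # all authenticated users
    if level == "anonymized":
        return True  # all users, with PII stripped on read
    raise ValueError(f"unknown privacy level: {level!r}")
```

Note that `anonymized` is *more* visible than `public`: the safety comes from stripping PII, not from restricting the audience.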

### Setting Privacy

```bash
# Make experiment private (default)
ml privacy set run_abc --level private

# Share with team
ml privacy set run_abc --level team --team vision-research

# Make public within organization
ml privacy set run_abc --level public

# Prepare for anonymized export
ml privacy set run_abc --level anonymized
```

### Privacy in Manifest

Privacy settings are stored in the experiment manifest:

```json
{
  "privacy": {
    "level": "team",
    "team": "vision-research",
    "owner": "researcher@lab.edu"
  }
}
```

## PII Detection

Automatically detect potentially identifying information in experiment metadata.

### What Gets Detected

- **Email addresses** - `user@example.com`
- **IP addresses** - `192.168.1.1`, `10.0.0.5`
- **Phone numbers** - basic pattern matching
- **SSN patterns** - `123-45-6789`
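The detectors above are described as basic pattern matching. A minimal sketch of what such a scanner could look like (the patterns are illustrative, not FetchML's actual ones):

```python
import re

# Illustrative patterns only; FetchML's real detectors may be stricter.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scan_pii(text):
    """Return a list of (kind, matched_text) pairs found in text."""
    hits = []
    for kind, pattern in PII_PATTERNS.items():
        for match in pattern.findall(text):
            hits.append((kind, match))
    return hits
```

Simple patterns like these inevitably trade precision for recall, which is why the CLI warns rather than blocks (see the false-positives note under Troubleshooting).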

### Using Privacy Scan

When adding annotations with sensitive context:

```bash
# Scan for PII before storing
ml annotate run_abc \
  --note "Contact at user@example.com for questions" \
  --privacy-scan

# Output:
# Warning: Potential PII detected:
#   - email: 'user@example.com'
# Use --force to store anyway, or edit your note.
```

### Override Warnings

If PII is intentional and acceptable:

```bash
ml annotate run_abc \
  --note "Contact at user@example.com" \
  --privacy-scan \
  --force
```

### Redacting PII

For anonymized exports, PII is automatically redacted:

```bash
ml export run_abc --anonymize
```

Redacted content becomes `[EMAIL-1]`, `[IP-1]`, etc.
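A sketch of how numbered placeholders like `[EMAIL-1]` could be produced, with repeated values redacted consistently (patterns and function are illustrative, not FetchML's code):

```python
import re

# Illustrative redaction sketch; the placeholder format follows these
# docs ([EMAIL-1], [IP-1], ...), but the patterns are assumptions.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact(text):
    """Replace each distinct PII value with a numbered placeholder."""
    seen = {}    # value -> placeholder, so repeats redact identically
    counts = {}  # kind -> running counter
    for kind, pattern in PATTERNS.items():
        def sub(match, kind=kind):
            value = match.group(0)
            if value not in seen:
                counts[kind] = counts.get(kind, 0) + 1
                seen[value] = f"[{kind}-{counts[kind]}]"
            return seen[value]
        text = pattern.sub(sub, text)
    return text
```

Mapping the same value to the same placeholder preserves cross-references in notes (e.g. two mentions of one contact) without revealing the value itself.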


## Anonymized Export

Export experiments for external sharing without leaking sensitive information.

### Basic Anonymization

```bash
ml export run_abc --bundle run_abc.tar.gz --anonymize
```

### Anonymization Levels

**Metadata-only** (default):

- Strips internal paths: `/nas/private/data` → `/datasets/data`
- Replaces internal IPs: `10.0.0.5` → `[INTERNAL-1]`
- Hashes email addresses: `user@lab.edu` → `[RESEARCHER-A]`
- Keeps experiment structure and metrics

**Full**:

- Everything in metadata-only, plus:
- Removes logs entirely
- Removes annotations
- Redacts all PII from notes

```bash
# Full anonymization
ml export run_abc --anonymize --anonymize-level full
```

### What Gets Anonymized

| Original | Anonymized | Notes |
|----------|------------|-------|
| `/home/user/data` | `/workspace/data` | Paths generalized |
| `/nas/private/lab` | `/datasets/lab` | Internal mounts hidden |
| `user@lab.edu` | `[RESEARCHER-A]` | Consistent per user |
| `10.0.0.5` | `[INTERNAL-1]` | IP ranges replaced |
| `john@example.com` | `[EMAIL-1]` | PII redacted |
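The "consistent per user" behavior in the table can be pictured as a stable mapping from each distinct email to a letter-suffixed pseudonym; reusing one mapping across all files in an export keeps the pseudonyms consistent. A sketch (function and mapping are illustrative, not FetchML's implementation):

```python
import re
import string

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def pseudonymize_researchers(text, mapping=None):
    """Map each distinct email to [RESEARCHER-A], [RESEARCHER-B], ...

    Pass the same `mapping` dict for every file in an export so a
    given researcher receives the same pseudonym everywhere.
    """
    mapping = {} if mapping is None else mapping
    def sub(match):
        email = match.group(0)
        if email not in mapping:
            letter = string.ascii_uppercase[len(mapping) % 26]
            mapping[email] = f"[RESEARCHER-{letter}]"
        return mapping[email]
    return EMAIL_RE.sub(sub, text)
```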

### Export Verification

Review what's in the export:

```bash
# Export and list contents
ml export run_abc --anonymize -o /tmp/run_abc.tar.gz
tar tzf /tmp/run_abc.tar.gz | head -20
```

## Dataset Identity & Checksums

Verify dataset integrity with SHA256 checksums.

### Computing Checksums

Datasets are automatically checksummed when registered:

```bash
ml dataset register /path/to/dataset --name my-dataset
# Computes SHA256 of all files in dataset
```
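One common way to turn "SHA256 of all files" into a single dataset-level digest is to hash each file and fold the per-file digests, in sorted path order, into one outer SHA256. Whether FetchML uses exactly this scheme is an assumption; this is a sketch of the idea:

```python
import hashlib
from pathlib import Path

def dataset_checksum(root):
    """Fold per-file SHA256 digests (sorted by relative path) into one
    dataset-level digest. A sketch only; FetchML's scheme may differ."""
    outer = hashlib.sha256()
    root = Path(root)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        file_hash = hashlib.sha256(path.read_bytes()).hexdigest()
        # Include the relative path so renames also change the checksum.
        outer.update(f"{path.relative_to(root)}:{file_hash}\n".encode())
    return "sha256:" + outer.hexdigest()
```

Sorting before hashing matters: filesystem iteration order is not deterministic across machines, and the digest must be.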

### Verifying Datasets

```bash
# Verify dataset integrity
ml dataset verify /path/to/my-dataset

# Output:
# ✓ Dataset checksum verified
#   Expected: sha256:abc123...
#   Actual:   sha256:abc123...
```

### Checksum in Manifest

```json
{
  "datasets": [{
    "name": "imagenet-train",
    "checksum": "sha256:def456...",
    "sample_count": 1281167
  }]
}
```

## Security Best Practices

### 1. Default to Private

Keep experiments private until ready to share:

```bash
# Private by default
ml queue train.py --hypothesis "..."

# Later, when ready to share
ml privacy set run_abc --level team --team my-team
```

### 2. Scan Before Sharing

Always use `--privacy-scan` when adding notes that might contain PII:

```bash
ml annotate run_abc --note "..." --privacy-scan
```

### 3. Anonymize for External Release

Before exporting for papers or public release:

```bash
ml export run_abc --anonymize --anonymize-level full
```

### 4. Verify Dataset Integrity

Regularly verify datasets, especially shared ones:

```bash
ml dataset verify /path/to/shared/dataset
```

### 5. Use Team Privacy for Collaboration

Share with specific teams rather than making experiments public:

```bash
ml privacy set run_abc --level team --team ml-group
```

## Compliance Considerations

### GDPR / Research Ethics

| Requirement | FetchML Support |
|-------------|-----------------|
| Right to access | `ml export` creates data bundles |
| Right to erasure | Delete command (future) |
| Data minimization | Narrative fields collect only necessary data |
| PII detection | `ml annotate --privacy-scan` |
| Anonymization | `ml export --anonymize` |

### Handling Sensitive Data

For experiments with sensitive data:

1. **Keep private**: use `--level private`
2. **PII scan all annotations**: always use `--privacy-scan`
3. **Anonymize before export**: use `--anonymize-level full`
4. **Verify team membership**: before sharing at `--level team`

## Configuration

### Worker Privacy Settings

Configure privacy defaults in the worker config:

```yaml
privacy:
  default_level: private
  enforce_teams: true
  audit_access: true
```

### API Server Privacy

Enable privacy enforcement:

```yaml
security:
  privacy:
    enabled: true
    default_level: private
    audit_access: true
```

## Troubleshooting

### PII Scan False Positives

Some valid text may trigger PII warnings:

```bash
# Example: "batch@32" looks like an email address
ml annotate run_abc --note "Use batch@32 for training" --privacy-scan
# Warning triggers; use --force if the text is intended
```
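Why `batch@32` can trip a scanner: a loose pattern of the form *anything@anything* matches it, while a stricter pattern requiring a dotted domain does not. Both regexes below are illustrative, not FetchML's actual detectors:

```python
import re

# Illustrative patterns, not FetchML's actual detectors.
loose = re.compile(r"\S+@\S+")                        # flags batch@32
strict = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")  # needs a dotted domain

note = "Use batch@32 for training"
assert loose.search(note)       # the loose pattern warns on batch@32
assert not strict.search(note)  # a stricter pattern would not
```

Loose patterns over-warn but miss less; since the scan only warns (and `--force` overrides), erring toward false positives is the safer trade-off.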

### Privacy Changes Not Applied

- Verify that you own the experiment
- Check that the server supports privacy enforcement
- Try an explicit base path: `--base /path/to/experiments`

### Export Not Anonymized

- Ensure the `--anonymize` flag is set
- Check that `--anonymize-level` is correct (`metadata-only` vs `full`)
- Verify that the manifest contains privacy data

## See Also

- `docs/src/research-features.md` - Research workflow features
- `docs/src/deployment.md` - Production deployment with privacy
- `docs/src/quick-start.md` - Getting started guide