# Privacy & Security

FetchML includes privacy-conscious features for research environments handling sensitive data.
## Privacy Levels

Control experiment visibility with four privacy levels.

### Available Levels

| Level | Visibility | Use Case |
|---|---|---|
| `private` | Owner only (default) | Sensitive/unpublished research |
| `team` | Same team members | Collaborative team projects |
| `public` | All authenticated users | Open research, shared datasets |
| `anonymized` | All users with PII stripped | Public release, papers |
### Setting Privacy

```bash
# Make experiment private (default)
ml privacy set run_abc --level private

# Share with team
ml privacy set run_abc --level team --team vision-research

# Make public within organization
ml privacy set run_abc --level public

# Prepare for anonymized export
ml privacy set run_abc --level anonymized
```
### Privacy in Manifest

Privacy settings are stored in the experiment manifest:

```json
{
  "privacy": {
    "level": "team",
    "team": "vision-research",
    "owner": "researcher@lab.edu"
  }
}
```
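A privacy-enforcing server could evaluate these manifest fields roughly as follows. This is a minimal sketch of the access rules described above; the `can_view` helper and its arguments are hypothetical, not FetchML's actual API.

```python
# Sketch: decide visibility from the manifest's "privacy" block.
# Hypothetical helper for illustration only.
def can_view(privacy: dict, user: str, user_teams: set[str]) -> bool:
    level = privacy.get("level", "private")
    if user == privacy.get("owner"):
        return True                      # owners always see their own runs
    if level in ("public", "anonymized"):
        return True                      # visible to all authenticated users
    if level == "team":
        return privacy.get("team") in user_teams
    return False                         # "private": owner only

privacy = {"level": "team", "team": "vision-research", "owner": "researcher@lab.edu"}
print(can_view(privacy, "alice@lab.edu", {"vision-research"}))  # True
print(can_view(privacy, "bob@lab.edu", set()))                  # False
```

Note that `anonymized` grants the same visibility as `public` here; the difference is in what content survives export, not who may list the run.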
## PII Detection

Automatically detect potentially identifying information in experiment metadata.

### What Gets Detected

- **Email addresses** - `user@example.com`
- **IP addresses** - `192.168.1.1`, `10.0.0.5`
- **Phone numbers** - basic pattern matching
- **SSN patterns** - `123-45-6789`
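Detection of this kind is typically regex-based. The sketch below shows rough approximations of the patterns listed above; FetchML's actual scanner may use stricter rules and validation, and loose patterns like these are exactly why false positives (see Troubleshooting) can occur.

```python
import re

# Approximate patterns for the PII categories listed above.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_pii(text: str) -> list[tuple[str, str]]:
    """Return (kind, match) pairs for every potential PII hit."""
    hits = []
    for kind, pattern in PII_PATTERNS.items():
        hits.extend((kind, m) for m in pattern.findall(text))
    return hits

print(scan_pii("Contact at user@example.com from 10.0.0.5"))
# [('email', 'user@example.com'), ('ip', '10.0.0.5')]
```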
### Using Privacy Scan

When adding annotations with sensitive context:

```bash
# Scan for PII before storing
ml annotate run_abc \
  --note "Contact at user@example.com for questions" \
  --privacy-scan

# Output:
# Warning: Potential PII detected:
#   - email: 'user@example.com'
# Use --force to store anyway, or edit your note.
```
### Override Warnings

If the PII is intentional and acceptable:

```bash
ml annotate run_abc \
  --note "Contact at user@example.com" \
  --privacy-scan \
  --force
```
### Redacting PII

For anonymized exports, PII is automatically redacted:

```bash
ml export run_abc --anonymize
```

Redacted content becomes `[EMAIL-1]`, `[IP-1]`, etc.
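Numbered placeholders like `[EMAIL-1]` keep redaction consistent: the same value always maps to the same label within a document. A minimal sketch of that idea (the `redact_emails` helper is illustrative, not FetchML's implementation):

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact_emails(text: str) -> str:
    """Replace each distinct email with a stable numbered placeholder."""
    seen: dict[str, str] = {}
    def repl(m: re.Match) -> str:
        return seen.setdefault(m.group(0), f"[EMAIL-{len(seen) + 1}]")
    return EMAIL.sub(repl, text)

note = "Ask user@example.com; CC admin@lab.edu and user@example.com"
print(redact_emails(note))
# Ask [EMAIL-1]; CC [EMAIL-2] and [EMAIL-1]
```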
## Anonymized Export

Export experiments for external sharing without leaking sensitive information.

### Basic Anonymization

```bash
ml export run_abc --bundle run_abc.tar.gz --anonymize
```
### Anonymization Levels

**Metadata-only** (default):

- Strips internal paths: `/nas/private/data` → `/datasets/data`
- Replaces internal IPs: `10.0.0.5` → `[INTERNAL-1]`
- Hashes email addresses: `user@lab.edu` → `[RESEARCHER-A]`
- Keeps experiment structure and metrics

**Full** - everything in metadata-only, plus:

- Removes logs entirely
- Removes annotations
- Redacts all PII from notes

```bash
# Full anonymization
ml export run_abc --anonymize --anonymize-level full
```
### What Gets Anonymized

| Original | Anonymized | Notes |
|---|---|---|
| `/home/user/data` | `/workspace/data` | Paths generalized |
| `/nas/private/lab` | `/datasets/lab` | Internal mounts hidden |
| `user@lab.edu` | `[RESEARCHER-A]` | Consistent per user |
| `10.0.0.5` | `[INTERNAL-1]` | IP ranges replaced |
| `john@example.com` | `[EMAIL-1]` | PII redacted |
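The substitutions in this table can be sketched in a few lines. Both helpers below (`anonymize_path` and `researcher_alias`) are hypothetical illustrations of the rules shown, not FetchML's actual anonymizer:

```python
import string

def anonymize_path(path: str) -> str:
    """Generalize internal filesystem paths per the table above."""
    if path.startswith("/nas/private/"):
        return path.replace("/nas/private/", "/datasets/", 1)
    if path.startswith("/home/"):
        # /home/<user>/... -> /workspace/...
        parts = path.split("/", 3)
        return "/workspace/" + (parts[3] if len(parts) > 3 else "")
    return path

def researcher_alias(email: str, seen: dict[str, str]) -> str:
    """Same email always maps to the same [RESEARCHER-X] label."""
    if email not in seen:
        seen[email] = f"[RESEARCHER-{string.ascii_uppercase[len(seen)]}]"
    return seen[email]

print(anonymize_path("/home/user/data"))   # /workspace/data
print(anonymize_path("/nas/private/lab"))  # /datasets/lab

seen = {}
print(researcher_alias("user@lab.edu", seen))  # [RESEARCHER-A]
print(researcher_alias("pi@lab.edu", seen))    # [RESEARCHER-B]
```

The key property is determinism within an export: a given researcher or path maps to the same alias everywhere, so the anonymized bundle stays internally consistent.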
### Export Verification

Review what's in the export:

```bash
# Export and list contents
ml export run_abc --anonymize -o /tmp/run_abc.tar.gz
tar tzf /tmp/run_abc.tar.gz | head -20
```
## Dataset Identity & Checksums

Verify dataset integrity with SHA256 checksums.

### Computing Checksums

Datasets are automatically checksummed when registered:

```bash
ml dataset register /path/to/dataset --name my-dataset
# Computes SHA256 of all files in dataset
```

### Verifying Datasets

```bash
# Verify dataset integrity
ml dataset verify /path/to/my-dataset

# Output:
# ✓ Dataset checksum verified
#   Expected: sha256:abc123...
#   Actual:   sha256:abc123...
```
### Checksum in Manifest

```json
{
  "datasets": [{
    "name": "imagenet-train",
    "checksum": "sha256:def456...",
    "sample_count": 1281167
  }]
}
```
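Hashing "all files in a dataset" into one SHA256 digest requires a deterministic file order, or the checksum would change between runs. A minimal sketch of the idea, assuming sorted relative paths are folded into a single digest (FetchML's exact scheme may differ):

```python
import hashlib
from pathlib import Path

def dataset_checksum(root: str) -> str:
    """Hash every file under `root` in a deterministic (sorted) order,
    mixing in relative paths so renames also change the digest."""
    digest = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(root).as_posix().encode())
            digest.update(path.read_bytes())
    return "sha256:" + digest.hexdigest()

def verify_dataset(root: str, expected: str) -> bool:
    """Recompute and compare, as `ml dataset verify` conceptually does."""
    return dataset_checksum(root) == expected
```

Because both contents and relative paths feed the digest, modifying, adding, removing, or renaming any file changes the checksum.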
## Security Best Practices

### 1. Default to Private

Keep experiments private until they are ready to share:

```bash
# Private by default
ml queue train.py --hypothesis "..."

# Later, when ready to share
ml privacy set run_abc --level team --team my-team
```

### 2. Scan Before Sharing

Always use `--privacy-scan` when adding notes that might contain PII:

```bash
ml annotate run_abc --note "..." --privacy-scan
```

### 3. Anonymize for External Release

Before exporting for papers or public release:

```bash
ml export run_abc --anonymize --anonymize-level full
```

### 4. Verify Dataset Integrity

Regularly verify datasets, especially shared ones:

```bash
ml dataset verify /path/to/shared/dataset
```

### 5. Use Team Privacy for Collaboration

Share with specific teams rather than making experiments public:

```bash
ml privacy set run_abc --level team --team ml-group
```
## Compliance Considerations

### GDPR / Research Ethics

| Requirement | FetchML Support | Status |
|---|---|---|
| Right to access | `ml export` creates data bundles | ✅ |
| Right to erasure | Delete command (future) | ⏳ |
| Data minimization | Narrative fields collect only necessary data | ✅ |
| PII detection | `ml annotate --privacy-scan` | ✅ |
| Anonymization | `ml export --anonymize` | ✅ |
### Handling Sensitive Data

For experiments with sensitive data:

- **Keep private**: use `--level private`
- **PII-scan all annotations**: always use `--privacy-scan`
- **Anonymize before export**: use `--anonymize-level full`
- **Verify team membership**: before sharing at `--level team`
## Configuration

### Worker Privacy Settings

Configure privacy defaults in the worker config:

```yaml
privacy:
  default_level: private
  enforce_teams: true
  audit_access: true
```

### API Server Privacy

Enable privacy enforcement:

```yaml
security:
  privacy:
    enabled: true
    default_level: private
    audit_access: true
```
## Troubleshooting

### PII Scan False Positives

Some valid text may trigger PII warnings:

```bash
# Example: "batch@32" looks like an email address
ml annotate run_abc --note "Use batch@32 for training" --privacy-scan
# Warning triggers; use --force if the note is intended
```
### Privacy Changes Not Applied

- Verify you own the experiment
- Check that the server supports privacy enforcement
- Try with an explicit base path: `--base /path/to/experiments`
### Export Not Anonymized

- Ensure the `--anonymize` flag is set
- Check that `--anonymize-level` is correct (metadata-only vs. full)
- Verify the manifest contains privacy data
## See Also

- `docs/src/research-features.md` - Research workflow features
- `docs/src/deployment.md` - Production deployment with privacy
- `docs/src/quick-start.md` - Getting started guide