
# Privacy & Security

FetchML includes privacy-conscious features for research environments handling sensitive data.


## Privacy Levels

Control experiment visibility with four privacy levels.

### Available Levels

| Level | Visibility | Use Case |
|-------|------------|----------|
| `private` | Owner only (default) | Sensitive/unpublished research |
| `team` | Same team members | Collaborative team projects |
| `public` | All authenticated users | Open research, shared datasets |
| `anonymized` | All users, with PII stripped | Public release, papers |
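The table above amounts to a small access-control rule set. As a rough sketch of the decision logic (the function and its signature are hypothetical, not FetchML's actual API):

```python
# Hypothetical sketch of the visibility rules in the table above;
# FetchML's real access-control code is not shown in these docs.
def can_view(level, owner, viewer, team=None, viewer_team=None,
             authenticated=False):
    if viewer == owner:
        return True  # owners always see their own experiments
    if level == "private":
        return False
    if level == "team":
        return team is not None and viewer_team == team
    if level == "public":
        return authenticated  # all authenticated users
    if level == "anonymized":
        return True  # all users, with PII stripped on read
    raise ValueError(f"unknown privacy level: {level!r}")
```

Note that `anonymized` is *more* visible than `public`: the safety comes from stripping PII, not from restricting the audience.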

### Setting Privacy

```bash
# Make experiment private (default)
ml privacy set run_abc --level private

# Share with team
ml privacy set run_abc --level team --team vision-research

# Make public within organization
ml privacy set run_abc --level public

# Prepare for anonymized export
ml privacy set run_abc --level anonymized
```

### Privacy in Manifest

Privacy settings are stored in the experiment manifest:

```json
{
  "privacy": {
    "level": "team",
    "team": "vision-research",
    "owner": "researcher@lab.edu"
  }
}
```

## PII Detection

Automatically detect potentially identifying information in experiment metadata.

### What Gets Detected

- **Email addresses** - `user@example.com`
- **IP addresses** - `192.168.1.1`, `10.0.0.5`
- **Phone numbers** - basic pattern matching
- **SSN patterns** - `123-45-6789`
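The detectors above are described as basic pattern matching. A minimal sketch of what such a scanner could look like (the patterns are illustrative, not FetchML's actual ones):

```python
import re

# Illustrative patterns only; FetchML's real detectors may be stricter.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scan_pii(text):
    """Return a list of (kind, matched_text) pairs found in text."""
    hits = []
    for kind, pattern in PII_PATTERNS.items():
        for match in pattern.findall(text):
            hits.append((kind, match))
    return hits
```

Simple patterns like these inevitably trade precision for recall, which is why the CLI warns rather than blocks (see the false-positives note under Troubleshooting).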

### Using Privacy Scan

When adding annotations with sensitive context:

```bash
# Scan for PII before storing
ml annotate run_abc \
  --note "Contact at user@example.com for questions" \
  --privacy-scan

# Output:
# Warning: Potential PII detected:
#   - email: 'user@example.com'
# Use --force to store anyway, or edit your note.
```

### Override Warnings

If PII is intentional and acceptable:

```bash
ml annotate run_abc \
  --note "Contact at user@example.com" \
  --privacy-scan \
  --force
```

### Redacting PII

For anonymized exports, PII is automatically redacted:

```bash
ml export run_abc --anonymize
```

Redacted content becomes `[EMAIL-1]`, `[IP-1]`, etc.
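A sketch of how numbered placeholders like `[EMAIL-1]` could be produced, with repeated values redacted consistently (patterns and function are illustrative, not FetchML's code):

```python
import re

# Illustrative redaction sketch; the placeholder format follows these
# docs ([EMAIL-1], [IP-1], ...), but the patterns are assumptions.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact(text):
    """Replace each distinct PII value with a numbered placeholder."""
    seen = {}    # value -> placeholder, so repeats redact identically
    counts = {}  # kind -> running counter
    for kind, pattern in PATTERNS.items():
        def sub(match, kind=kind):
            value = match.group(0)
            if value not in seen:
                counts[kind] = counts.get(kind, 0) + 1
                seen[value] = f"[{kind}-{counts[kind]}]"
            return seen[value]
        text = pattern.sub(sub, text)
    return text
```

Mapping the same value to the same placeholder preserves cross-references in notes (e.g. two mentions of one contact) without revealing the value itself.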


## Anonymized Export

Export experiments for external sharing without leaking sensitive information.

### Basic Anonymization

```bash
ml export run_abc --bundle run_abc.tar.gz --anonymize
```

### Anonymization Levels

**Metadata-only** (default):

- Strips internal paths: `/nas/private/data` → `/datasets/data`
- Replaces internal IPs: `10.0.0.5` → `[INTERNAL-1]`
- Hashes email addresses: `user@lab.edu` → `[RESEARCHER-A]`
- Keeps experiment structure and metrics

**Full**:

- Everything in metadata-only, plus:
- Removes logs entirely
- Removes annotations
- Redacts all PII from notes

```bash
# Full anonymization
ml export run_abc --anonymize --anonymize-level full
```

### What Gets Anonymized

| Original | Anonymized | Notes |
|----------|------------|-------|
| `/home/user/data` | `/workspace/data` | Paths generalized |
| `/nas/private/lab` | `/datasets/lab` | Internal mounts hidden |
| `user@lab.edu` | `[RESEARCHER-A]` | Consistent per user |
| `10.0.0.5` | `[INTERNAL-1]` | IP ranges replaced |
| `john@example.com` | `[EMAIL-1]` | PII redacted |
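The "consistent per user" behavior in the table can be pictured as a stable mapping from each distinct email to a letter-suffixed pseudonym; reusing one mapping across all files in an export keeps the pseudonyms consistent. A sketch (function and mapping are illustrative, not FetchML's implementation):

```python
import re
import string

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def pseudonymize_researchers(text, mapping=None):
    """Map each distinct email to [RESEARCHER-A], [RESEARCHER-B], ...

    Pass the same `mapping` dict for every file in an export so a
    given researcher receives the same pseudonym everywhere.
    """
    mapping = {} if mapping is None else mapping
    def sub(match):
        email = match.group(0)
        if email not in mapping:
            letter = string.ascii_uppercase[len(mapping) % 26]
            mapping[email] = f"[RESEARCHER-{letter}]"
        return mapping[email]
    return EMAIL_RE.sub(sub, text)
```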

### Export Verification

Review what's in the export:

```bash
# Export and list contents
ml export run_abc --anonymize -o /tmp/run_abc.tar.gz
tar tzf /tmp/run_abc.tar.gz | head -20
```

## Dataset Identity & Checksums

Verify dataset integrity with SHA256 checksums.

### Computing Checksums

Datasets are automatically checksummed when registered:

```bash
ml dataset register /path/to/dataset --name my-dataset
# Computes SHA256 of all files in dataset
```
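One common way to turn "SHA256 of all files" into a single dataset-level digest is to hash each file and fold the per-file digests, in sorted path order, into one outer SHA256. Whether FetchML uses exactly this scheme is an assumption; this is a sketch of the idea:

```python
import hashlib
from pathlib import Path

def dataset_checksum(root):
    """Fold per-file SHA256 digests (sorted by relative path) into one
    dataset-level digest. A sketch only; FetchML's scheme may differ."""
    outer = hashlib.sha256()
    root = Path(root)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        file_hash = hashlib.sha256(path.read_bytes()).hexdigest()
        # Include the relative path so renames also change the checksum.
        outer.update(f"{path.relative_to(root)}:{file_hash}\n".encode())
    return "sha256:" + outer.hexdigest()
```

Sorting before hashing matters: filesystem iteration order is not deterministic across machines, and the digest must be.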

### Verifying Datasets

```bash
# Verify dataset integrity
ml dataset verify /path/to/my-dataset

# Output:
# ✓ Dataset checksum verified
#   Expected: sha256:abc123...
#   Actual:   sha256:abc123...
```

### Checksum in Manifest

```json
{
  "datasets": [{
    "name": "imagenet-train",
    "checksum": "sha256:def456...",
    "sample_count": 1281167
  }]
}
```

## Security Best Practices

### 1. Default to Private

Keep experiments private until ready to share:

```bash
# Private by default
ml queue train.py --hypothesis "..."

# Later, when ready to share
ml privacy set run_abc --level team --team my-team
```

### 2. Scan Before Sharing

Always use `--privacy-scan` when adding notes that might contain PII:

```bash
ml annotate run_abc --note "..." --privacy-scan
```

### 3. Anonymize for External Release

Before exporting for papers or public release:

```bash
ml export run_abc --anonymize --anonymize-level full
```

### 4. Verify Dataset Integrity

Regularly verify datasets, especially shared ones:

```bash
ml dataset verify /path/to/shared/dataset
```

### 5. Use Team Privacy for Collaboration

Share with specific teams rather than making experiments public:

```bash
ml privacy set run_abc --level team --team ml-group
```

## Compliance Considerations

### GDPR / Research Ethics

| Requirement | FetchML Support |
|-------------|-----------------|
| Right to access | `ml export` creates data bundles |
| Right to erasure | Delete command (future) |
| Data minimization | Narrative fields collect only necessary data |
| PII detection | `ml annotate --privacy-scan` |
| Anonymization | `ml export --anonymize` |

### Handling Sensitive Data

For experiments with sensitive data:

1. **Keep private**: use `--level private`
2. **PII scan all annotations**: always use `--privacy-scan`
3. **Anonymize before export**: use `--anonymize-level full`
4. **Verify team membership**: before sharing at `--level team`

## Configuration

### Worker Privacy Settings

Configure privacy defaults in the worker config:

```yaml
privacy:
  default_level: private
  enforce_teams: true
  audit_access: true
```

### API Server Privacy

Enable privacy enforcement:

```yaml
security:
  privacy:
    enabled: true
    default_level: private
    audit_access: true
```

## Troubleshooting

### PII Scan False Positives

Some valid text may trigger PII warnings:

```bash
# Example: "batch@32" looks like an email address
ml annotate run_abc --note "Use batch@32 for training" --privacy-scan
# Warning triggers; use --force if the text is intended
```
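Why `batch@32` can trip a scanner: a loose pattern of the form *anything@anything* matches it, while a stricter pattern requiring a dotted domain does not. Both regexes below are illustrative, not FetchML's actual detectors:

```python
import re

# Illustrative patterns, not FetchML's actual detectors.
loose = re.compile(r"\S+@\S+")                        # flags batch@32
strict = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")  # needs a dotted domain

note = "Use batch@32 for training"
assert loose.search(note)       # the loose pattern warns on batch@32
assert not strict.search(note)  # a stricter pattern would not
```

Loose patterns over-warn but miss less; since the scan only warns (and `--force` overrides), erring toward false positives is the safer trade-off.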

### Privacy Changes Not Applied

- Verify that you own the experiment
- Check that the server supports privacy enforcement
- Try an explicit base path: `--base /path/to/experiments`

### Export Not Anonymized

- Ensure the `--anonymize` flag is set
- Check that `--anonymize-level` is correct (`metadata-only` vs `full`)
- Verify that the manifest contains privacy data

## See Also

- `docs/src/research-features.md` - Research workflow features
- `docs/src/deployment.md` - Production deployment with privacy
- `docs/src/quick-start.md` - Getting started guide