docs: expand Research Trustworthiness section with detailed design rationale

Add comprehensive explanation of the reproducibility problem and fix:
- Document readdir filesystem-dependent ordering issue
- Explain std::sort fix for lexicographic ordering
- Clarify recursive traversal with cycle detection
- Document hidden file and special file exclusions
- Warn researchers about silent omissions and empty hash edge cases

This addresses the core concern that researchers need to understand
the hash is computed over sorted paths to trust cross-machine verification.
Jeremie Fraeys 2026-02-21 13:38:25 -05:00
parent 7efe8bbfbf
commit 472590f831

@@ -34,11 +34,23 @@ This directory contains selective C++ optimizations for the highest-impact perfo
### Research Trustworthiness

**Critical Design Decisions for Research Use:**

For dataset hashing to be trustworthy in research, it must be **reproducible**. The original `collect_files` used `readdir()`, which returns entries in filesystem-dependent order: inode order on ext4, creation order on others, and essentially random on network filesystems. Researchers hashing the same dataset on different machines would therefore get different combined hashes for identical content, breaking cross-collaborator verification.

**The fix:** `std::sort(paths.begin(), paths.end())` after collection ensures lexicographic ordering. The hash is computed over the sorted file paths, making it reproducible across machines and time.
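The fix can be illustrated with a small sketch; the helper name `stable_order`, the filtering rule, and the sample filenames below are invented for illustration, not taken from the codebase:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: given entry names in whatever order readdir()
// happened to produce them, drop hidden names and sort them so the
// hash input no longer depends on the filesystem.
std::vector<std::string> stable_order(std::vector<std::string> names) {
    names.erase(std::remove_if(names.begin(), names.end(),
                               [](const std::string& n) {
                                   return n.empty() || n[0] == '.';
                               }),
                names.end());
    std::sort(names.begin(), names.end());  // byte-wise lexicographic order
    return names;
}
```

Note that lexicographic order is byte-wise, so `img_10.png` sorts before `img_2.png`; that is fine for reproducibility, which only requires a stable order, not a numeric one.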

**Behavior Summary:**

- **Deterministic ordering**: Files sorted lexicographically before hashing
- **Recursive traversal**: Nested directories fully hashed (max depth 32, with cycle detection)
- **Reproducible**: Same dataset produces identical hash across machines and filesystems
- **Documented exclusions**: Hidden files (`.name`) and special files excluded
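The traversal code itself is not shown here, but depth-capped cycle detection is commonly implemented by remembering the (device, inode) pair of each directory entered; the `CycleGuard` type below is a hypothetical sketch under that assumption, not the project's actual code:

```cpp
#include <cstdint>
#include <set>
#include <utility>
#include <sys/stat.h>

// Hypothetical sketch: a symlink loop eventually re-enters a directory
// whose (device, inode) pair was already seen, so the walk stops there.
// The depth cap (32 in this codebase) is a second safety net.
struct CycleGuard {
    std::set<std::pair<uint64_t, uint64_t>> seen;
    static const int kMaxDepth = 32;
    // Returns true if the directory may be entered at this depth.
    bool enter(const struct stat& st, int depth) {
        if (depth > kMaxDepth) return false;
        return seen.insert({static_cast<uint64_t>(st.st_dev),
                            static_cast<uint64_t>(st.st_ino)}).second;
    }
};
```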

**Documented Exclusions (intentional, not bugs):**

- **Hidden files** (names starting with `.`) are excluded; if your dataset has `.data` files or dotfiles that are part of the data, they will be silently skipped
- **Special files** (symlinks, devices, sockets) are excluded; only regular files (`S_ISREG`) are hashed
- **Non-regular entries** in subdirectories are silently skipped

These exclusions were conscious design choices for security (symlink attack prevention) and predictability. However, researchers must be aware that a dataset directory containing only hidden or non-regular files produces an empty hash, not an error. Verify that your dataset structure matches expectations.

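The exclusion rules can be condensed into a single predicate; `is_hashable` is an invented name for illustration, and the real code may interleave these checks with the traversal:

```cpp
#include <string>
#include <sys/stat.h>

// Hypothetical predicate mirroring the documented exclusions: only
// regular files with non-hidden names are hashed. With lstat(), a
// symlink reports type S_IFLNK, so S_ISREG() excludes it as intended.
bool is_hashable(const std::string& name, mode_t mode) {
    if (name.empty() || name[0] == '.') return false;  // hidden file
    return S_ISREG(mode);                              // regular files only
}
```

Because these checks silently return false rather than raising an error, a directory in which every entry fails the predicate yields an empty path list, which is exactly why the documentation tells researchers to verify the dataset structure up front.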
## Build Requirements