diff --git a/native/README.md b/native/README.md
index cd06502..dd0f872 100644
--- a/native/README.md
+++ b/native/README.md
@@ -34,11 +34,23 @@ This directory contains selective C++ optimizations for the highest-impact perfo
 ### Research Trustworthiness
 
-**dataset_hash guarantees:**
+**Critical Design Decisions for Research Use:**
+
+For dataset hashing to be trustworthy in research, it must be **reproducible**. The original `collect_files` used `readdir()`, which returns files in filesystem-dependent order: inode order on ext4, creation order on others, essentially random on network filesystems. Researchers hashing the same dataset on different machines would therefore get different combined hashes for identical content, breaking cross-collaborator verification.
+
+**The fix:** `std::sort(paths.begin(), paths.end())` after collection ensures lexicographic ordering. The hash is computed over the sorted file paths, making it reproducible across machines and over time.
+
+**Behavior Summary:**
 - **Deterministic ordering**: Files sorted lexicographically before hashing
-- **Recursive traversal**: Nested directories fully hashed (max depth 32)
-- **Reproducible**: Same dataset produces identical hash across machines
-- **Documented exclusions**: Hidden files (`.name`) and special files excluded
+- **Recursive traversal**: Nested directories fully hashed (max depth 32, with cycle detection)
+- **Reproducible**: Same dataset produces identical hash across machines and filesystems
+
+**Documented Exclusions (intentional, not bugs):**
+- **Hidden files** (names starting with `.`) are excluded. If your dataset contains `.data` files or other dotfiles that are part of the data, they will be silently skipped.
+- **Special files** (symlinks, devices, sockets) are excluded; only regular files (`S_ISREG`) are hashed.
+- **Non-regular entries** in subdirectories are silently skipped.
+
+These exclusions are conscious design choices made for security (symlink attack prevention) and predictability.
+
+However, researchers must be aware: a dataset directory containing only hidden files or non-regular files will produce an empty hash, not an error. Verify that your dataset structure matches expectations.
 
 ## Build Requirements
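The collection behavior described in the diff can be sketched as follows. This is a hypothetical illustration, not the project's actual implementation: the `collect_files` signature and internals are assumed. It uses POSIX `opendir`/`readdir` and `lstat`, skips hidden and non-regular entries, caps recursion depth at 32, and applies the `std::sort` reproducibility fix once at the top level. Cycle detection (mentioned in the diff) is omitted here for brevity.

```cpp
#include <algorithm>
#include <string>
#include <vector>
#include <dirent.h>
#include <sys/stat.h>

// Recursively gather regular, non-hidden files under `dir`.
// `depth` enforces the max-depth-32 guard described above.
static void collect_files_recursive(const std::string& dir,
                                    std::vector<std::string>& paths,
                                    int depth) {
    if (depth > 32) return;                       // max-depth guard
    DIR* d = opendir(dir.c_str());
    if (!d) return;
    for (dirent* e; (e = readdir(d)) != nullptr; ) {
        std::string name = e->d_name;
        if (name.empty() || name[0] == '.') continue;  // hidden, ".", ".."
        std::string full = dir + "/" + name;
        struct stat st;
        // lstat (not stat): symlinks are seen as links and never followed,
        // which is the symlink-attack prevention the diff describes.
        if (lstat(full.c_str(), &st) != 0) continue;
        if (S_ISDIR(st.st_mode)) {
            collect_files_recursive(full, paths, depth + 1);
        } else if (S_ISREG(st.st_mode)) {
            paths.push_back(full);                // regular files only
        }                                         // devices/sockets/links skipped
    }
    closedir(d);
}

// Top-level entry point: collect, then sort. readdir() order is
// filesystem-dependent, so sorting here is what makes the hash input
// order (and thus the combined hash) reproducible across machines.
static std::vector<std::string> collect_files(const std::string& root) {
    std::vector<std::string> paths;
    collect_files_recursive(root, paths, 0);
    std::sort(paths.begin(), paths.end());        // the reproducibility fix
    return paths;
}
```

Note the `lstat` versus `stat` choice: with `stat`, a symlink pointing at a directory would be reported as a directory and traversed, allowing a crafted dataset to pull outside files into the hash; with `lstat`, it is classified as a link and skipped.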