Native C++ Libraries
High-performance C++ libraries for critical system components.
Overview
This directory contains selective C++ optimizations for the highest-impact performance bottlenecks. Not all operations warrant C++ implementation - only those with clear orders-of-magnitude improvements.
Current Libraries
queue_index (Priority Queue Index)
- Purpose: High-performance task queue with binary heap
- Performance: 21,000x faster than JSON-based Go implementation
- Memory: 99% allocation reduction
- Security: CVE-2024-45339, CVE-2025-47290, CVE-2025-0838 mitigations applied
- Status: ✅ Production ready
dataset_hash (SHA256 Hashing)
- Purpose: SIMD-accelerated file hashing (ARMv8 crypto / Intel SHA-NI)
- Performance: 78% syscall reduction, batch-first API
- Memory: 99% less memory than Go implementation
- Research: Deterministic sorted hashing, recursive directory traversal
- Status: ✅ Production ready for research use
Security
CVE Mitigations Applied
| CVE | Description | Mitigation |
|---|---|---|
| CVE-2024-45339 | Symlink attack on temp files | O_EXCL flag with retry-on-EEXIST |
| CVE-2025-47290 | TOCTOU race in file open | openat_nofollow() via path_sanitizer |
| CVE-2025-0838 | Integer overflow in batch ops | MAX_BATCH_SIZE = 10000 limit |
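As an illustration of the last row, a batch-size guard could look like the sketch below. Only the `MAX_BATCH_SIZE = 10000` limit comes from this README; the helper name `checked_batch_bytes` and its parameters are hypothetical and are not the library's actual API.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <optional>

// Limit taken from the mitigation table above.
constexpr std::size_t MAX_BATCH_SIZE = 10000;

// Hypothetical helper: returns the number of bytes needed for `count` items of
// `item_size` bytes each, or std::nullopt if the batch is too large or the
// multiplication would overflow size_t.
std::optional<std::size_t> checked_batch_bytes(std::size_t count,
                                               std::size_t item_size) {
    if (count > MAX_BATCH_SIZE) {
        return std::nullopt;  // reject oversized batches before any arithmetic
    }
    if (item_size != 0 &&
        count > std::numeric_limits<std::size_t>::max() / item_size) {
        return std::nullopt;  // count * item_size would wrap around
    }
    return count * item_size;
}
```

Checking the count before multiplying means the size computation can never be reached with an attacker-controlled batch length above the limit.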
Research Trustworthiness
Critical Design Decisions for Research Use:
For dataset hashing to be trustworthy in research, it must be reproducible. The original collect_files used readdir(), which returns entries in an unspecified, filesystem-dependent order: directory-hash order on ext4 (with dir_index), roughly creation order on some other filesystems, and effectively arbitrary order on network filesystems. As a result, researchers hashing the same dataset on different machines could get different combined hashes for identical content, breaking cross-collaborator verification.
The fix: std::sort(paths.begin(), paths.end()) after collection ensures lexicographic ordering. The hash is computed over sorted file paths, making it reproducible across machines and time.
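The effect of the sort can be shown with a small sketch. It is not the dataset_hash implementation: 64-bit FNV-1a stands in for SHA256 so the example stays self-contained, only path strings are hashed (not file contents), and the names `fnv1a`, `combine`, and `combine_sorted` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for SHA256: 64-bit FNV-1a. NOT the hash used by dataset_hash.
std::uint64_t fnv1a(const std::string& data,
                    std::uint64_t h = 14695981039346656037ULL) {
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Hypothetical combiner: folds every path into one digest, in the order given.
std::uint64_t combine(const std::vector<std::string>& paths) {
    std::uint64_t h = 14695981039346656037ULL;
    for (const auto& p : paths) {
        h = fnv1a(p, h);
        h = fnv1a(std::string(1, '\0'), h);  // separator between entries
    }
    return h;
}

// Sorting first makes the digest independent of enumeration order.
std::uint64_t combine_sorted(std::vector<std::string> paths) {
    std::sort(paths.begin(), paths.end());  // byte-wise lexicographic order
    return combine(paths);
}
```

Because std::sort on std::string compares bytes and ignores the locale, two machines that enumerate the same file names in different orders feed the combiner the same sequence.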
Behavior Summary:
- Deterministic ordering: Files sorted lexicographically before hashing
- Recursive traversal: Nested directories fully hashed (max depth 32 with cycle detection)
- Reproducible: Same dataset produces identical hash across machines and filesystems
Documented Exclusions (intentional, not bugs):
- Hidden files (names starting with `.`) are excluded: if your dataset has `.data` files or dotfiles that are part of the data, they will be silently skipped
- Special files (symlinks, devices, sockets) are excluded: only regular files (`S_ISREG`) are hashed
- Non-regular entries in subdirectories are silently skipped
These exclusions were conscious design choices for security (symlink attack prevention) and predictability. However, researchers must be aware: a dataset directory with only hidden files or non-regular files will produce an empty hash, not an error. Verify your dataset structure matches expectations.
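The rules above can be approximated with the following sketch, which researchers could adapt as a pre-flight check. It uses std::filesystem rather than the library's readdir()-based code, and several details are assumptions: the helper names, skipping hidden directories as well as hidden files, and throwing when the depth limit is exceeded or no hashable file is found. The library's cycle detection is not reproduced; this sketch relies only on not following symlinks plus the depth limit.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <stdexcept>
#include <string>
#include <vector>

namespace fs = std::filesystem;

constexpr int kMaxDepth = 32;  // depth limit quoted in the behavior summary

// Hidden means the entry name starts with a dot.
bool is_hidden(const std::string& name) {
    return !name.empty() && name[0] == '.';
}

// Hypothetical helper: collect regular, non-hidden files under `dir`.
void collect(const fs::path& dir, int depth, std::vector<std::string>& out) {
    assert(depth >= 0);
    if (depth > kMaxDepth) {
        throw std::runtime_error("maximum directory depth exceeded");
    }
    for (const auto& entry : fs::directory_iterator(dir)) {
        if (is_hidden(entry.path().filename().string())) continue;
        const fs::file_status st = entry.symlink_status();  // never follows symlinks
        if (fs::is_directory(st)) {
            collect(entry.path(), depth + 1, out);
        } else if (fs::is_regular_file(st)) {
            out.push_back(entry.path().string());
        }
        // Symlinks, devices, sockets, and FIFOs fall through and are skipped.
    }
}

// Returns sorted paths and throws instead of silently yielding an empty set.
std::vector<std::string> collect_or_fail(const fs::path& root) {
    std::vector<std::string> paths;
    collect(root, 0, paths);
    if (paths.empty()) {
        throw std::runtime_error("no hashable files under " + root.string());
    }
    std::sort(paths.begin(), paths.end());
    return paths;
}
```

Running a check like this before hashing turns the "empty hash" edge case into an explicit error on the caller's side.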
Build Requirements
- CMake 3.20+
- C++20 compiler (GCC 11+, Clang 14+, or MSVC 2022+)
- Go 1.25+ (for CGo integration)
Quick Start
# Build all native libraries
make native-build
# Run with native libraries enabled
FETCHML_NATIVE_LIBS=1 go run ./...
# Run benchmarks
FETCHML_NATIVE_LIBS=1 go test -bench=. ./tests/benchmarks/
Test Coverage
make native-test
8/8 tests passing:
- `storage_smoke` - Basic storage operations
- `dataset_hash_smoke` - Hashing correctness
- `storage_init_new_dir` - Directory creation
- `parallel_hash_large_dir` - 300+ file handling
- `queue_index_compact` - Compaction operations
- `sha256_arm_kat` - ARMv8 SHA256 verification
- `storage_symlink_resistance` - CVE-2024-45339 verification
- `queue_index_batch_limit` - CVE-2025-0838 verification
Build Options
# Debug build with AddressSanitizer
cd native/build && cmake .. -DCMAKE_BUILD_TYPE=Debug -DENABLE_ASAN=ON
# Release build (optimized)
cd native/build && cmake .. -DCMAKE_BUILD_TYPE=Release
# Build specific library
cd native/build && make queue_index
Architecture
Design Principles
- Selective optimization: Only 2 libraries out of 80+ profiled functions
- Batch-first APIs: Minimize CGo overhead (~100ns/call)
- Zero-allocation hot paths: Arena allocators, no malloc in critical sections
- C ABI for CGo: Simple C structs, no C++ exceptions across boundary
- Cross-platform: Runtime SIMD detection (ARMv8 / x86_64 SHA-NI)
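To make the "zero-allocation hot paths" principle concrete, here is a minimal bump-arena sketch. It is an illustration under assumptions, not the allocator used by these libraries: the class name `Arena` and its interface are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bump arena: one up-front allocation, then pointer bumps only.
class Arena {
public:
    explicit Arena(std::size_t capacity) : buf_(capacity), used_(0) {}

    // Returns `size` bytes aligned to `align` (a power of two),
    // or nullptr if the arena is full.
    void* alloc(std::size_t size,
                std::size_t align = alignof(std::max_align_t)) {
        assert(align != 0 && (align & (align - 1)) == 0);
        const auto base = reinterpret_cast<std::uintptr_t>(buf_.data());
        const std::uintptr_t cur = base + used_;
        const std::uintptr_t aligned =
            (cur + (align - 1)) & ~(static_cast<std::uintptr_t>(align) - 1);
        const std::size_t start = aligned - base;
        if (start > buf_.size() || size > buf_.size() - start) return nullptr;
        used_ = start + size;
        return buf_.data() + start;
    }

    void reset() { used_ = 0; }  // reuse the same buffer for the next batch
    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> buf_;
    std::size_t used_;
};
```

The only heap allocation happens in the constructor; inside the hot path each `alloc` is a few arithmetic operations, and `reset` recycles the buffer between batches.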
CGo Integration
// #cgo LDFLAGS: -L${SRCDIR}/../../native/build -lqueue_index
// #include "../../native/queue_index/queue_index.h"
import "C"
Error Handling
- C functions return `-1` for errors, positive values for success
- Use `qi_last_error()` / `fh_last_error()` for error messages
- Go code checks `rc < 0`, not `rc != 0`
When to Add New C++ Libraries
DO implement when:
- Profile shows >90% syscall overhead
- Batch operations amortize CGo cost
- SIMD can provide 3x+ speedup
- Memory pressure is critical
DON'T implement when:
- Speedup <2x (CGo overhead negates gains)
- Single-file operations (per-call overhead too high)
- Team <3 backend engineers (maintenance burden)
- Complex error handling required
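The interaction between per-call overhead and batching in these rules can be estimated with a simple model. The ~100 ns per CGo call comes from the Design Principles section; the function `effective_speedup` and the per-item costs used in the note below are hypothetical and should be replaced with measured values.

```cpp
#include <cassert>

// ~100 ns per CGo call, as quoted under Design Principles.
constexpr double kCgoCallNs = 100.0;

// Hypothetical model: speedup of a batched native call over pure Go.
//   go_ns_per_item     : measured Go cost per item
//   native_ns_per_item : measured C++ cost per item
//   batch              : items handled per CGo call
constexpr double effective_speedup(double go_ns_per_item,
                                   double native_ns_per_item,
                                   double batch) {
    return go_ns_per_item / (native_ns_per_item + kCgoCallNs / batch);
}
```

With illustrative costs of 200 ns per item in Go and 50 ns in C++, one item per call gives about 1.3x (below the 2x threshold), while a batch of 100 gives about 3.9x, which is why single-file operations rarely justify a native implementation.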
History
Implemented:
- ✅ queue_index: Binary priority queue replacing JSON filesystem queue
- ✅ dataset_hash: SIMD SHA256 for artifact verification
Deferred:
- ⏸️ task_json_codec: 2-3x speedup not worth maintenance (small team)
- ⏸️ artifact_scanner: Go filepath.Walk faster for typical workloads
- ⏸️ streaming_io: Complexity exceeds benefit without io_uring
Maintenance
Build verification:
make native-build
FETCHML_NATIVE_LIBS=1 make test
Adding new library:
- Create subdirectory with CMakeLists.txt
- Implement C ABI in `.h`/`.cpp` files
- Add to root CMakeLists.txt
- Create Go bridge in `internal/`
- Add benchmarks in `tests/benchmarks/`
- Document in this README
Troubleshooting
Library not found:
- Ensure `native/build/lib*.dylib` (macOS) or `.so` (Linux) exists
- Check `LD_LIBRARY_PATH` or `DYLD_LIBRARY_PATH`
CGo undefined symbols:
- Verify C function names match exactly (no name mangling)
- Check `#include` paths are correct
- Rebuild: `make native-clean && make native-build`
Performance regression:
- Verify `FETCHML_NATIVE_LIBS=1` is set
- Check benchmark: `go test -bench=BenchmarkQueue -v`
- Profile with: `go test -bench=. -cpuprofile=cpu.prof`