Security Fixes: - CVE-2024-45339: Add O_EXCL flag to temp file creation in storage_write_entries() Prevents symlink attacks on predictable .tmp file paths - CVE-2025-47290: Use openat_nofollow() in storage_open() Closes TOCTOU race condition via path_sanitizer infrastructure - CVE-2025-0838: Add MAX_BATCH_SIZE=10000 to add_tasks() Prevents integer overflow in batch operations Research Trustworthiness (dataset_hash): - Deterministic file ordering: std::sort after collect_files() - Recursive directory traversal: depth-limited with cycle detection - Documented exclusions: hidden files and special files noted in API Bug Fixes: - R1: storage_init path validation for non-existent directories - R2: safe_strncpy return value check before strcat - R3: parallel_hash 256-file cap replaced with std::vector - R4: wire qi_compact_index/qi_rebuild_index stubs - R5: CompletionLatch race condition fix (hold mutex during decrement) - R6: ARMv8 SHA256 transform fix (save abcd_pre before vsha256hq_u32) - R7: fuzz_index_storage header format fix - R8: enforce null termination in add_tasks/update_tasks - R9: use 64 bytes (not 65) in combined hash to exclude null terminator - R10: status field persistence in save() New Tests: - test_recursive_dataset.cpp: Verify deterministic recursive hashing - test_storage_symlink_resistance.cpp: Verify CVE-2024-45339 fix - test_queue_index_batch_limit.cpp: Verify CVE-2025-0838 fix - test_sha256_arm_kat.cpp: ARMv8 known-answer tests - test_storage_init_new_dir.cpp: F1 verification - test_parallel_hash_large_dir.cpp: F3 verification - test_queue_index_compact.cpp: F4 verification All 8 native tests passing. Library ready for research lab deployment.
5.1 KiB
5.1 KiB
Native C++ Libraries
High-performance C++ libraries for critical system components.
Overview
This directory contains selective C++ optimizations for the highest-impact performance bottlenecks. Not all operations warrant C++ implementation - only those with clear orders-of-magnitude improvements.
Current Libraries
queue_index (Priority Queue Index)
- Purpose: High-performance task queue with binary heap
- Performance: 21,000x faster than JSON-based Go implementation
- Memory: 99% allocation reduction
- Security: CVE-2024-45339, CVE-2025-47290, CVE-2025-0838 mitigations applied
- Status: ✅ Production ready
dataset_hash (SHA256 Hashing)
- Purpose: SIMD-accelerated file hashing (ARMv8 crypto / Intel SHA-NI)
- Performance: 78% syscall reduction, batch-first API
- Memory: 99% less memory than Go implementation
- Research: Deterministic sorted hashing, recursive directory traversal
- Status: ✅ Production ready for research use
Security
CVE Mitigations Applied
| CVE | Description | Mitigation |
|---|---|---|
| CVE-2024-45339 | Symlink attack on temp files | O_EXCL flag with retry-on-EEXIST |
| CVE-2025-47290 | TOCTOU race in file open | openat_nofollow() via path_sanitizer |
| CVE-2025-0838 | Integer overflow in batch ops | MAX_BATCH_SIZE = 10000 limit |
Research Trustworthiness
dataset_hash guarantees:
- Deterministic ordering: Files sorted lexicographically before hashing
- Recursive traversal: Nested directories fully hashed (max depth 32)
- Reproducible: Same dataset produces identical hash across machines
- Documented exclusions: Hidden files (
.name) and special files excluded
Build Requirements
- CMake 3.20+
- C++20 compiler (GCC 11+, Clang 14+, or MSVC 2022+)
- Go 1.25+ (for CGo integration)
Quick Start
# Build all native libraries
make native-build
# Run with native libraries enabled
FETCHML_NATIVE_LIBS=1 go run ./...
# Run benchmarks
FETCHML_NATIVE_LIBS=1 go test -bench=. ./tests/benchmarks/
Test Coverage
make native-test
8/8 tests passing:
storage_smoke- Basic storage operationsdataset_hash_smoke- Hashing correctnessstorage_init_new_dir- Directory creationparallel_hash_large_dir- 300+ file handlingqueue_index_compact- Compaction operationssha256_arm_kat- ARMv8 SHA256 verificationstorage_symlink_resistance- CVE-2024-45339 verificationqueue_index_batch_limit- CVE-2025-0838 verification
Build Options
# Debug build with AddressSanitizer
cd native/build && cmake .. -DCMAKE_BUILD_TYPE=Debug -DENABLE_ASAN=ON
# Release build (optimized)
cd native/build && cmake .. -DCMAKE_BUILD_TYPE=Release
# Build specific library
cd native/build && make queue_index
Architecture
Design Principles
- Selective optimization: Only 2 libraries out of 80+ profiled functions
- Batch-first APIs: Minimize CGo overhead (~100ns/call)
- Zero-allocation hot paths: Arena allocators, no malloc in critical sections
- C ABI for CGo: Simple C structs, no C++ exceptions across boundary
- Cross-platform: Runtime SIMD detection (ARMv8 / x86_64 SHA-NI)
CGo Integration
// #cgo LDFLAGS: -L${SRCDIR}/../../native/build -lqueue_index
// #include "../../native/queue_index/queue_index.h"
import "C"
Error Handling
- C functions return
-1for errors, positive values for success - Use
qi_last_error()/fh_last_error()for error messages - Go code checks
rc < 0notrc != 0
When to Add New C++ Libraries
DO implement when:
- Profile shows >90% syscall overhead
- Batch operations amortize CGo cost
- SIMD can provide 3x+ speedup
- Memory pressure is critical
DON'T implement when:
- Speedup <2x (CGo overhead negates gains)
- Single-file operations (per-call overhead too high)
- Team <3 backend engineers (maintenance burden)
- Complex error handling required
History
Implemented:
- ✅ queue_index: Binary priority queue replacing JSON filesystem queue
- ✅ dataset_hash: SIMD SHA256 for artifact verification
Deferred:
- ⏸️ task_json_codec: 2-3x speedup not worth maintenance (small team)
- ⏸️ artifact_scanner: Go filepath.Walk faster for typical workloads
- ⏸️ streaming_io: Complexity exceeds benefit without io_uring
Maintenance
Build verification:
make native-build
FETCHML_NATIVE_LIBS=1 make test
Adding new library:
- Create subdirectory with CMakeLists.txt
- Implement C ABI in
.h/.cppfiles - Add to root CMakeLists.txt
- Create Go bridge in
internal/ - Add benchmarks in
tests/benchmarks/ - Document in this README
Troubleshooting
Library not found:
- Ensure
native/build/lib*.dylib(macOS) or.so(Linux) exists - Check
LD_LIBRARY_PATHorDYLD_LIBRARY_PATH
CGo undefined symbols:
- Verify C function names match exactly (no name mangling)
- Check
#includepaths are correct - Rebuild:
make native-clean && make native-build
Performance regression:
- Verify
FETCHML_NATIVE_LIBS=1is set - Check benchmark:
go test -bench=BenchmarkQueue -v - Profile with:
go test -bench=. -cpuprofile=cpu.prof