fetch_ml/native/README.md
Jeremie Fraeys 7efe8bbfbf
native: security hardening, research trustworthiness, and CVE mitigations
Security Fixes:
- CVE-2024-45339: Add O_EXCL flag to temp file creation in storage_write_entries()
  Prevents symlink attacks on predictable .tmp file paths
- CVE-2025-47290: Use openat_nofollow() in storage_open()
  Closes TOCTOU race condition via path_sanitizer infrastructure
- CVE-2025-0838: Add MAX_BATCH_SIZE=10000 to add_tasks()
  Prevents integer overflow in batch operations

Research Trustworthiness (dataset_hash):
- Deterministic file ordering: std::sort after collect_files()
- Recursive directory traversal: depth-limited with cycle detection
- Documented exclusions: hidden files and special files noted in API

Bug Fixes:
- R1: storage_init path validation for non-existent directories
- R2: safe_strncpy return value check before strcat
- R3: parallel_hash 256-file cap replaced with std::vector
- R4: wire qi_compact_index/qi_rebuild_index stubs
- R5: CompletionLatch race condition fix (hold mutex during decrement)
- R6: ARMv8 SHA256 transform fix (save abcd_pre before vsha256hq_u32)
- R7: fuzz_index_storage header format fix
- R8: enforce null termination in add_tasks/update_tasks
- R9: use 64 bytes (not 65) in combined hash to exclude null terminator
- R10: status field persistence in save()

New Tests:
- test_recursive_dataset.cpp: Verify deterministic recursive hashing
- test_storage_symlink_resistance.cpp: Verify CVE-2024-45339 fix
- test_queue_index_batch_limit.cpp: Verify CVE-2025-0838 fix
- test_sha256_arm_kat.cpp: ARMv8 known-answer tests
- test_storage_init_new_dir.cpp: R1 verification
- test_parallel_hash_large_dir.cpp: R3 verification
- test_queue_index_compact.cpp: R4 verification

All 8 native tests passing. Library ready for research lab deployment.
2026-02-21 13:33:45 -05:00


# Native C++ Libraries
High-performance C++ libraries for critical system components.
## Overview
This directory contains selective C++ optimizations for the highest-impact performance bottlenecks. Not every operation warrants a C++ implementation; only those with clear orders-of-magnitude improvements do.
## Current Libraries
### queue_index (Priority Queue Index)
- **Purpose**: High-performance task queue with binary heap
- **Performance**: 21,000x faster than JSON-based Go implementation
- **Memory**: 99% allocation reduction
- **Security**: CVE-2024-45339, CVE-2025-47290, CVE-2025-0838 mitigations applied
- **Status**: ✅ Production ready
### dataset_hash (SHA256 Hashing)
- **Purpose**: SIMD-accelerated file hashing (ARMv8 crypto / Intel SHA-NI)
- **Performance**: 78% syscall reduction, batch-first API
- **Memory**: 99% less memory than Go implementation
- **Research**: Deterministic sorted hashing, recursive directory traversal
- **Status**: ✅ Production ready for research use
## Security
### CVE Mitigations Applied
| CVE | Description | Mitigation |
|-----|-------------|------------|
| CVE-2024-45339 | Symlink attack on temp files | `O_EXCL` flag with retry-on-EEXIST |
| CVE-2025-47290 | TOCTOU race in file open | `openat_nofollow()` via path_sanitizer |
| CVE-2025-0838 | Integer overflow in batch ops | `MAX_BATCH_SIZE = 10000` limit |
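
The mitigations in the table can be sketched with plain POSIX calls. This is a minimal illustration, not the library's code: `create_temp_exclusive`, `open_nofollow`, `batch_size_ok`, the 16-attempt retry cap, and the temp-file naming scheme are hypothetical. Only `O_EXCL`, the retry on `EEXIST`, and the 10000 limit come from the table. The `open_nofollow` helper here only refuses a symlink in the final path component; the real `openat_nofollow()` in path_sanitizer may do more.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

#include <fcntl.h>
#include <unistd.h>

// Mirrors the documented MAX_BATCH_SIZE value; the constant name is hypothetical.
constexpr std::size_t kMaxBatchSize = 10000;

// Bounding the count up front keeps later size arithmetic
// (count * sizeof(entry)) far away from overflow.
inline bool batch_size_ok(std::size_t count) {
    return count <= kMaxBatchSize;
}

// Create a temp file relative to dir_fd. O_CREAT | O_EXCL fails with EEXIST
// if the name already exists, including when it is a pre-planted symlink,
// so an attacker-controlled link is never followed. On EEXIST we retry
// with a different suffix. Returns an open fd, or -1 with errno set.
inline int create_temp_exclusive(int dir_fd, const std::string& base,
                                 std::string& out_name) {
    for (int attempt = 0; attempt < 16; ++attempt) {
        std::string candidate = base + "." + std::to_string(getpid()) + "." +
                                std::to_string(attempt) + ".tmp";
        int fd = openat(dir_fd, candidate.c_str(),
                        O_WRONLY | O_CREAT | O_EXCL | O_NOFOLLOW | O_CLOEXEC,
                        0600);
        if (fd >= 0) {
            out_name = candidate;
            return fd;
        }
        if (errno != EEXIST) return -1;  // real failure, do not retry
    }
    return -1;  // every candidate name was taken; errno is EEXIST
}

// Open an existing entry relative to dir_fd. O_NOFOLLOW makes the call fail
// if the final path component is a symlink, so there is no window between
// "check the path" and "open the path".
inline int open_nofollow(int dir_fd, const char* name) {
    return openat(dir_fd, name, O_RDONLY | O_NOFOLLOW | O_CLOEXEC);
}
```

A caller would write to the returned descriptor, `fsync` it, and then `rename` the temp file over the final name; passing a directory descriptor instead of `AT_FDCWD` keeps every lookup anchored to an already-validated directory.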
### Research Trustworthiness
**dataset_hash guarantees:**
- **Deterministic ordering**: Files sorted lexicographically before hashing
- **Recursive traversal**: Nested directories fully hashed (max depth 32)
- **Reproducible**: Same dataset produces identical hash across machines
- **Documented exclusions**: Hidden files (`.name`) and special files excluded
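
The guarantees above can be illustrated with `std::filesystem`. This is a sketch under stated assumptions, not the dataset_hash implementation: `sorted_dataset_files` and `dataset_digest` are hypothetical names, symlinks are simply skipped as a stand-in for real cycle detection, and a 64-bit FNV-1a fold replaces SHA-256 so the example stays self-contained. The depth limit of 32 and the hidden-file exclusion come from the list above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

constexpr int kMaxDepth = 32;  // documented traversal limit

// Collect regular files below `dir` as paths relative to `root`.
// Hidden entries (names starting with '.') and special files are excluded.
// Symlinks are skipped entirely (assumption: simplest way to avoid cycles).
inline void collect_files(const fs::path& root, const fs::path& dir, int depth,
                          std::vector<std::string>& out) {
    if (depth > kMaxDepth) return;
    for (const auto& entry : fs::directory_iterator(dir)) {
        const std::string name = entry.path().filename().string();
        if (name.empty() || name[0] == '.') continue;
        if (entry.is_symlink()) continue;
        if (entry.is_directory()) {
            collect_files(root, entry.path(), depth + 1, out);
        } else if (entry.is_regular_file()) {
            out.push_back(entry.path().lexically_relative(root).generic_string());
        }
    }
}

// Directory iteration order is unspecified, so sort before hashing.
inline std::vector<std::string> sorted_dataset_files(const fs::path& root) {
    std::vector<std::string> files;
    collect_files(root, root, 0, files);
    std::sort(files.begin(), files.end());
    return files;
}

// Fold relative paths and file contents, in sorted order, into one digest.
// FNV-1a is a placeholder for SHA-256 and is NOT cryptographically secure.
inline std::uint64_t dataset_digest(const fs::path& root) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV-1a offset basis
    auto mix = [&h](unsigned char byte) {
        h ^= byte;
        h *= 1099511628211ULL;  // FNV-1a prime
    };
    for (const std::string& rel : sorted_dataset_files(root)) {
        for (unsigned char c : rel) mix(c);
        mix(0);  // separator between path and contents
        std::ifstream in(root / rel, std::ios::binary);
        char ch;
        while (in.get(ch)) mix(static_cast<unsigned char>(ch));
    }
    return h;
}
```

Hashing relative paths rather than absolute ones is what lets two copies of the same dataset in different locations produce the same digest.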
## Build Requirements
- CMake 3.20+
- C++20 compiler (GCC 11+, Clang 14+, or MSVC 2022+)
- Go 1.25+ (for CGo integration)
## Quick Start
```bash
# Build all native libraries
make native-build
# Run with native libraries enabled
FETCHML_NATIVE_LIBS=1 go run ./...
# Run benchmarks
FETCHML_NATIVE_LIBS=1 go test -bench=. ./tests/benchmarks/
```
## Test Coverage
```bash
make native-test
```
**8/8 tests passing:**
- `storage_smoke` - Basic storage operations
- `dataset_hash_smoke` - Hashing correctness
- `storage_init_new_dir` - Directory creation
- `parallel_hash_large_dir` - 300+ file handling
- `queue_index_compact` - Compaction operations
- `sha256_arm_kat` - ARMv8 SHA256 verification
- `storage_symlink_resistance` - CVE-2024-45339 verification
- `queue_index_batch_limit` - CVE-2025-0838 verification
## Build Options
```bash
# Run from the repository root; -B creates native/build if it is missing.
# Debug build with AddressSanitizer
cmake -S native -B native/build -DCMAKE_BUILD_TYPE=Debug -DENABLE_ASAN=ON
# Release build (optimized)
cmake -S native -B native/build -DCMAKE_BUILD_TYPE=Release
# Build a specific library
cmake --build native/build --target queue_index
```
## Architecture
### Design Principles
1. **Selective optimization**: Of 80+ profiled functions, only 2 were moved into native libraries
2. **Batch-first APIs**: Minimize CGo overhead (~100ns/call)
3. **Zero-allocation hot paths**: Arena allocators, no malloc in critical sections
4. **C ABI for CGo**: Simple C structs, no C++ exceptions across boundary
5. **Cross-platform**: Runtime SIMD detection (ARMv8 / x86_64 SHA-NI)
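
To make the "zero-allocation hot paths" principle concrete, here is a minimal bump-arena sketch. It assumes nothing about the allocators the libraries actually use; `BumpArena` and its methods are hypothetical names. The idea is that one buffer is reserved up front, each allocation only advances an offset, and the whole arena is recycled with a single `reset()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-capacity bump allocator: the only heap allocation happens in the
// constructor, so allocate() never calls malloc on the hot path.
class BumpArena {
public:
    explicit BumpArena(std::size_t capacity) : buffer_(capacity), offset_(0) {}

    // Returns an aligned pointer into the buffer, or nullptr when the arena
    // is exhausted (it deliberately does not fall back to the heap).
    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        assert(align != 0 && (align & (align - 1)) == 0);  // power of two
        const auto base = reinterpret_cast<std::uintptr_t>(buffer_.data());
        const std::uintptr_t mask = static_cast<std::uintptr_t>(align) - 1;
        const std::uintptr_t aligned = (base + offset_ + mask) & ~mask;
        const std::size_t start = static_cast<std::size_t>(aligned - base);
        if (start > buffer_.size() || size > buffer_.size() - start) {
            return nullptr;
        }
        offset_ = start + size;
        return buffer_.data() + start;
    }

    // Releases every allocation at once; previously returned pointers
    // must not be used afterwards.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    std::vector<unsigned char> buffer_;
    std::size_t offset_;
};
```

A typical pattern is one arena per batch: allocate scratch space while processing the batch, then call `reset()` before the next one.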
### CGo Integration
```go
// #cgo LDFLAGS: -L${SRCDIR}/../../native/build -lqueue_index
// #include "../../native/queue_index/queue_index.h"
import "C"
```
### Error Handling
- C functions return `-1` on error and a non-negative value on success
- Use `qi_last_error()` / `fh_last_error()` to retrieve the error message
- Go code checks `rc < 0`, not `rc != 0`, because a successful call can return a non-zero value
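
The convention above, together with the "no C++ exceptions across boundary" rule, could look roughly like this on the C++ side. This is a sketch only: `demo_add_tasks`, `demo_last_error`, and `add_tasks_impl` are hypothetical stand-ins and do not reflect the real `qi_*` / `fh_*` signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <string>

namespace {

// One message per thread, similar in spirit to errno.
thread_local std::string g_last_error;

// Internal C++ code is free to throw.
int add_tasks_impl(const int* ids, std::size_t count) {
    if (ids == nullptr) throw std::invalid_argument("ids is null");
    if (count > 10000) throw std::length_error("batch exceeds MAX_BATCH_SIZE");
    return static_cast<int>(count);  // number of tasks accepted
}

}  // namespace

extern "C" {

// C ABI entry point: every exception is converted into a -1 return code
// plus a stored message, so nothing propagates into the Go runtime.
int demo_add_tasks(const int* ids, std::size_t count) {
    try {
        g_last_error.clear();
        return add_tasks_impl(ids, count);
    } catch (const std::exception& e) {
        g_last_error = e.what();
    } catch (...) {
        g_last_error = "unknown error";
    }
    return -1;
}

// The returned pointer stays valid until the next call on the same thread.
const char* demo_last_error(void) {
    return g_last_error.c_str();
}

}  // extern "C"
```

Because the message is thread-local, a Go caller would need to read it on the same OS thread that made the failing call, for example by fetching it inside the same CGo wrapper function.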
## When to Add New C++ Libraries
**DO implement when:**
- Profile shows >90% syscall overhead
- Batch operations amortize CGo cost
- SIMD can provide 3x+ speedup
- Memory pressure is critical

**DON'T implement when:**
- Speedup <2x (CGo overhead negates gains)
- Single-file operations (per-call overhead too high)
- Team <3 backend engineers (maintenance burden)
- Complex error handling required
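
The batching and speedup criteria above come down to simple arithmetic: a fixed per-call CGo cost is spread across every item in a batch. The helper below is a back-of-the-envelope sketch; `effective_speedup` is a hypothetical name, the default of 100 ns is the approximate per-call overhead quoted under Design Principles, and the per-item timings in the usage note are made-up inputs, not measurements.

```cpp
#include <cassert>

// Estimated speedup of a native batch call over a pure-Go loop.
//   go_ns_per_item:     measured Go cost per item
//   native_ns_per_item: measured native cost per item, excluding CGo
//   batch_size:         items processed per CGo call
//   cgo_overhead_ns:    fixed cost of one CGo call (assumed ~100 ns)
inline double effective_speedup(double go_ns_per_item,
                                double native_ns_per_item,
                                double batch_size,
                                double cgo_overhead_ns = 100.0) {
    const double amortized_overhead = cgo_overhead_ns / batch_size;
    return go_ns_per_item / (native_ns_per_item + amortized_overhead);
}
```

With hypothetical costs of 300 ns per item in Go and 100 ns per item natively, `effective_speedup(300, 100, 1)` is 1.5, below the 2x threshold, while `effective_speedup(300, 100, 100)` is about 2.97, close to the raw 3x ratio. This is why single-item calls are discouraged and batch-first APIs are preferred.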
## History
**Implemented:**
- queue_index: Binary priority queue replacing JSON filesystem queue
- dataset_hash: SIMD SHA256 for artifact verification

**Deferred:**
- task_json_codec: 2-3x speedup not worth maintenance (small team)
- artifact_scanner: Go filepath.Walk faster for typical workloads
- streaming_io: Complexity exceeds benefit without io_uring
## Maintenance
**Build verification:**
```bash
make native-build
FETCHML_NATIVE_LIBS=1 make test
```
**Adding new library:**
1. Create subdirectory with CMakeLists.txt
2. Implement C ABI in `.h` / `.cpp` files
3. Add to root CMakeLists.txt
4. Create Go bridge in `internal/`
5. Add benchmarks in `tests/benchmarks/`
6. Document in this README
## Troubleshooting
**Library not found:**
- Ensure `native/build/lib*.dylib` (macOS) or `native/build/lib*.so` (Linux) exists
- Check `LD_LIBRARY_PATH` (Linux) or `DYLD_LIBRARY_PATH` (macOS)

**CGo undefined symbols:**
- Verify C function names match exactly (exported functions must be declared `extern "C"` to avoid name mangling)
- Check `#include` paths are correct
- Rebuild: `make native-clean && make native-build`

**Performance regression:**
- Verify `FETCHML_NATIVE_LIBS=1` is set
- Check benchmark: `go test -bench=BenchmarkQueue -v`
- Profile with: `go test -bench=. -cpuprofile=cpu.prof`