fetch_ml/native
Jeremie Fraeys 7efe8bbfbf
native: security hardening, research trustworthiness, and CVE mitigations
Security Fixes:
- CVE-2024-45339: Add O_EXCL flag to temp file creation in storage_write_entries()
  Prevents symlink attacks on predictable .tmp file paths
- CVE-2025-47290: Use openat_nofollow() in storage_open()
  Closes TOCTOU race condition via path_sanitizer infrastructure
- CVE-2025-0838: Add MAX_BATCH_SIZE=10000 to add_tasks()
  Prevents integer overflow in batch operations

Research Trustworthiness (dataset_hash):
- Deterministic file ordering: std::sort after collect_files()
- Recursive directory traversal: depth-limited with cycle detection
- Documented exclusions: hidden files and special files noted in API

Bug Fixes:
- R1: storage_init path validation for non-existent directories
- R2: safe_strncpy return value check before strcat
- R3: parallel_hash 256-file cap replaced with std::vector
- R4: wire qi_compact_index/qi_rebuild_index stubs
- R5: CompletionLatch race condition fix (hold mutex during decrement)
- R6: ARMv8 SHA256 transform fix (save abcd_pre before vsha256hq_u32)
- R7: fuzz_index_storage header format fix
- R8: enforce null termination in add_tasks/update_tasks
- R9: use 64 bytes (not 65) in combined hash to exclude null terminator
- R10: status field persistence in save()

New Tests:
- test_recursive_dataset.cpp: Verify deterministic recursive hashing
- test_storage_symlink_resistance.cpp: Verify CVE-2024-45339 fix
- test_queue_index_batch_limit.cpp: Verify CVE-2025-0838 fix
- test_sha256_arm_kat.cpp: ARMv8 known-answer tests
- test_storage_init_new_dir.cpp: F1 verification
- test_parallel_hash_large_dir.cpp: F3 verification
- test_queue_index_compact.cpp: F4 verification

All 8 native tests passing. Library ready for research lab deployment.
2026-02-21 13:33:45 -05:00
..
common native: security hardening, research trustworthiness, and CVE mitigations 2026-02-21 13:33:45 -05:00
dataset_hash native: security hardening, research trustworthiness, and CVE mitigations 2026-02-21 13:33:45 -05:00
queue_index native: security hardening, research trustworthiness, and CVE mitigations 2026-02-21 13:33:45 -05:00
tests native: security hardening, research trustworthiness, and CVE mitigations 2026-02-21 13:33:45 -05:00
CMakeLists.txt native: security hardening, research trustworthiness, and CVE mitigations 2026-02-21 13:33:45 -05:00
README.md native: security hardening, research trustworthiness, and CVE mitigations 2026-02-21 13:33:45 -05:00

Native C++ Libraries

High-performance C++ libraries for critical system components.

Overview

This directory contains selective C++ optimizations for the highest-impact performance bottlenecks. Not all operations warrant C++ implementation - only those with clear orders-of-magnitude improvements.

Current Libraries

queue_index (Priority Queue Index)

  • Purpose: High-performance task queue with binary heap
  • Performance: 21,000x faster than JSON-based Go implementation
  • Memory: 99% allocation reduction
  • Security: CVE-2024-45339, CVE-2025-47290, CVE-2025-0838 mitigations applied
  • Status: Production ready

dataset_hash (SHA256 Hashing)

  • Purpose: SIMD-accelerated file hashing (ARMv8 crypto / Intel SHA-NI)
  • Performance: 78% syscall reduction, batch-first API
  • Memory: 99% less memory than Go implementation
  • Research: Deterministic sorted hashing, recursive directory traversal
  • Status: Production ready for research use

Security

CVE Mitigations Applied

CVE Description Mitigation
CVE-2024-45339 Symlink attack on temp files O_EXCL flag with retry-on-EEXIST
CVE-2025-47290 TOCTOU race in file open openat_nofollow() via path_sanitizer
CVE-2025-0838 Integer overflow in batch ops MAX_BATCH_SIZE = 10000 limit

Research Trustworthiness

dataset_hash guarantees:

  • Deterministic ordering: Files sorted lexicographically before hashing
  • Recursive traversal: Nested directories fully hashed (max depth 32)
  • Reproducible: Same dataset produces identical hash across machines
  • Documented exclusions: Hidden files (.name) and special files excluded

Build Requirements

  • CMake 3.20+
  • C++20 compiler (GCC 11+, Clang 14+, or MSVC 2022+)
  • Go 1.25+ (for CGo integration)

Quick Start

# Build all native libraries
make native-build

# Run with native libraries enabled
FETCHML_NATIVE_LIBS=1 go run ./...

# Run benchmarks
FETCHML_NATIVE_LIBS=1 go test -bench=. ./tests/benchmarks/

Test Coverage

make native-test

8/8 tests passing:

  • storage_smoke - Basic storage operations
  • dataset_hash_smoke - Hashing correctness
  • storage_init_new_dir - Directory creation
  • parallel_hash_large_dir - 300+ file handling
  • queue_index_compact - Compaction operations
  • sha256_arm_kat - ARMv8 SHA256 verification
  • storage_symlink_resistance - CVE-2024-45339 verification
  • queue_index_batch_limit - CVE-2025-0838 verification

Build Options

# Debug build with AddressSanitizer
cd native/build && cmake .. -DCMAKE_BUILD_TYPE=Debug -DENABLE_ASAN=ON

# Release build (optimized)
cd native/build && cmake .. -DCMAKE_BUILD_TYPE=Release

# Build specific library
cd native/build && make queue_index

Architecture

Design Principles

  1. Selective optimization: Only 2 libraries out of 80+ profiled functions
  2. Batch-first APIs: Minimize CGo overhead (~100ns/call)
  3. Zero-allocation hot paths: Arena allocators, no malloc in critical sections
  4. C ABI for CGo: Simple C structs, no C++ exceptions across boundary
  5. Cross-platform: Runtime SIMD detection (ARMv8 / x86_64 SHA-NI)

CGo Integration

// #cgo LDFLAGS: -L${SRCDIR}/../../native/build -lqueue_index
// #include "../../native/queue_index/queue_index.h"
import "C"

Error Handling

  • C functions return -1 for errors, positive values for success
  • Use qi_last_error() / fh_last_error() for error messages
  • Go code checks rc < 0 not rc != 0

When to Add New C++ Libraries

DO implement when:

  • Profile shows >90% syscall overhead
  • Batch operations amortize CGo cost
  • SIMD can provide 3x+ speedup
  • Memory pressure is critical

DON'T implement when:

  • Speedup <2x (CGo overhead negates gains)
  • Single-file operations (per-call overhead too high)
  • Team <3 backend engineers (maintenance burden)
  • Complex error handling required

History

Implemented:

  • queue_index: Binary priority queue replacing JSON filesystem queue
  • dataset_hash: SIMD SHA256 for artifact verification

Deferred:

  • ⏸️ task_json_codec: 2-3x speedup not worth maintenance (small team)
  • ⏸️ artifact_scanner: Go filepath.Walk faster for typical workloads
  • ⏸️ streaming_io: Complexity exceeds benefit without io_uring

Maintenance

Build verification:

make native-build
FETCHML_NATIVE_LIBS=1 make test

Adding new library:

  1. Create subdirectory with CMakeLists.txt
  2. Implement C ABI in .h / .cpp files
  3. Add to root CMakeLists.txt
  4. Create Go bridge in internal/
  5. Add benchmarks in tests/benchmarks/
  6. Document in this README

Troubleshooting

Library not found:

  • Ensure native/build/lib*.dylib (macOS) or .so (Linux) exists
  • Check LD_LIBRARY_PATH or DYLD_LIBRARY_PATH

CGo undefined symbols:

  • Verify C function names match exactly (no name mangling)
  • Check #include paths are correct
  • Rebuild: make native-clean && make native-build

Performance regression:

  • Verify FETCHML_NATIVE_LIBS=1 is set
  • Check benchmark: go test -bench=BenchmarkQueue -v
  • Profile with: go test -bench=. -cpuprofile=cpu.prof