
Native C++ Libraries

High-performance C++ libraries for critical system components.

Overview

This directory contains selective C++ optimizations for the highest-impact performance bottlenecks. Not every operation warrants a C++ implementation; only those with clear orders-of-magnitude improvements are ported.

Current Libraries

queue_index (Priority Queue Index)

  • Purpose: High-performance task queue with binary heap
  • Performance: 21,000x faster than JSON-based Go implementation
  • Memory: 99% allocation reduction
  • Security: CVE-2024-45339, CVE-2025-47290, CVE-2025-0838 mitigations applied
  • Status: Production ready

dataset_hash (SHA256 Hashing)

  • Purpose: SIMD-accelerated file hashing (ARMv8 crypto / Intel SHA-NI)
  • Performance: 78% syscall reduction, batch-first API
  • Memory: 99% less memory than Go implementation
  • Research: Deterministic sorted hashing, recursive directory traversal
  • Status: Production ready for research use

Security

CVE Mitigations Applied

| CVE | Description | Mitigation |
|-----|-------------|------------|
| CVE-2024-45339 | Symlink attack on temp files | O_EXCL flag with retry-on-EEXIST |
| CVE-2025-47290 | TOCTOU race in file open | openat_nofollow() via path_sanitizer |
| CVE-2025-0838 | Integer overflow in batch ops | MAX_BATCH_SIZE = 10000 limit |
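The O_EXCL mitigation can be illustrated with a short sketch (a hypothetical helper, not the library's actual code): the temp file is opened with O_CREAT|O_EXCL, so a pre-planted symlink or stale file at the path makes open() fail with EEXIST, and the caller retries under a different name instead of writing through the link.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Hypothetical sketch of the O_EXCL mitigation for CVE-2024-45339.
// O_EXCL guarantees the file is freshly created: if an attacker planted
// a symlink (or any file) at the path, open() fails with EEXIST and we
// retry with the next suffix rather than writing through the link.
int open_temp_excl(const std::string& dir, std::string& out_path) {
    for (int attempt = 0; attempt < 16; ++attempt) {
        out_path = dir + "/tmp." + std::to_string(attempt);
        int fd = ::open(out_path.c_str(),
                        O_WRONLY | O_CREAT | O_EXCL | O_NOFOLLOW, 0600);
        if (fd >= 0) return fd;          // freshly created, safe to write
        if (errno != EEXIST) return -1;  // unrelated error: give up
        // EEXIST: something already sits at this path; try the next name
    }
    return -1;  // all candidate names taken
}
```

O_NOFOLLOW is added for defense in depth: even on the first attempt, a symlink at the final path component causes the open to fail rather than be followed.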

Research Trustworthiness

Critical Design Decisions for Research Use:

For dataset hashing to be trustworthy in research, it must be reproducible. The original collect_files used readdir() which returns files in filesystem-dependent order — inode order on ext4, creation order on others, essentially random on network filesystems. This meant researchers hashing the same dataset on different machines would get different combined hashes for identical content, breaking cross-collaborator verification.

The fix: std::sort(paths.begin(), paths.end()) after collection ensures lexicographic ordering. The hash is computed over sorted file paths, making it reproducible across machines and time.
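A minimal sketch of the idea (the helper name is illustrative, not the actual collect_files code): whatever order readdir() happened to yield entries in, sorting the collected paths before hashing makes the hashing order, and therefore the combined hash, machine-independent.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Minimal sketch of the determinism fix: two machines may enumerate the
// same dataset in different filesystem-dependent orders, but after a
// lexicographic sort both produce the identical sequence of paths to hash.
std::vector<std::string> normalize_order(std::vector<std::string> paths) {
    std::sort(paths.begin(), paths.end());
    return paths;
}
```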

Behavior Summary:

  • Deterministic ordering: Files sorted lexicographically before hashing
  • Recursive traversal: Nested directories fully hashed (max depth 32 with cycle detection)
  • Reproducible: Same dataset produces identical hash across machines and filesystems

Documented Exclusions (intentional, not bugs):

  • Hidden files (names starting with .) are excluded — if your dataset has .data files or dotfiles that are part of the data, they will be silently skipped
  • Special files (symlinks, devices, sockets) are excluded — only regular files (S_ISREG) are hashed
  • Non-regular entries in subdirectories are silently skipped

These exclusions were conscious design choices for security (symlink attack prevention) and predictability. However, researchers must be aware: a dataset directory with only hidden files or non-regular files will produce an empty hash, not an error. Verify your dataset structure matches expectations.

Build Requirements

  • CMake 3.20+
  • C++20 compiler (GCC 11+, Clang 14+, or MSVC 2022+)
  • Go 1.25+ (for CGo integration)

Quick Start

# Build all native libraries
make native-build

# Build with native libraries enabled (-tags native_libs)
go build -tags native_libs ./...

# Run benchmarks
go test -tags native_libs -bench=. ./tests/benchmarks/

Test Coverage

make native-test

8/8 tests passing:

  • storage_smoke - Basic storage operations
  • dataset_hash_smoke - Hashing correctness
  • storage_init_new_dir - Directory creation
  • parallel_hash_large_dir - 300+ file handling
  • queue_index_compact - Compaction operations
  • sha256_arm_kat - ARMv8 SHA256 verification
  • storage_symlink_resistance - CVE-2024-45339 verification
  • queue_index_batch_limit - CVE-2025-0838 verification

Build Options

# Debug build with AddressSanitizer
cd native/build && cmake .. -DCMAKE_BUILD_TYPE=Debug -DENABLE_ASAN=ON

# Release build (optimized)
cd native/build && cmake .. -DCMAKE_BUILD_TYPE=Release

# Build specific library
cd native/build && make queue_index

Architecture

Design Principles

  1. Selective optimization: only 2 libraries emerged from 80+ profiled functions
  2. Batch-first APIs: Minimize CGo overhead (~100ns/call)
  3. Zero-allocation hot paths: Arena allocators, no malloc in critical sections
  4. C ABI for CGo: Simple C structs, no C++ exceptions across boundary
  5. Cross-platform: Runtime SIMD detection (ARMv8 / x86_64 SHA-NI)
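Principles 2-4 can be sketched together in a toy entry point (the demo_* name and body are illustrative, not the real queue_index API): one call handles a whole batch to amortize the ~100ns CGo crossing, the batch size is capped in the spirit of the CVE-2025-0838 mitigation, and no C++ exception is allowed to escape the C boundary.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>

// Toy batch-first C ABI entry point (illustrative, not the real library).
// Returns the number of items accepted, or -1 on error, matching the
// "negative means failure" convention used across the native libraries.
extern "C" int64_t demo_push_batch(const int32_t* priorities, size_t n) {
    if (priorities == nullptr || n == 0) return -1;
    if (n > 10000) return -1;  // MAX_BATCH_SIZE-style overflow guard
    try {
        int64_t accepted = 0;
        for (size_t i = 0; i < n; ++i) {
            (void)priorities[i];  // a real implementation would heap-push here
            ++accepted;
        }
        return accepted;  // one CGo crossing covered the whole batch
    } catch (const std::exception&) {
        return -1;  // never let a C++ exception cross the C ABI
    }
}
```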

CGo Integration

// #cgo LDFLAGS: -L${SRCDIR}/../../native/build -lqueue_index
// #include "../../native/queue_index/queue_index.h"
import "C"

Error Handling

  • C functions return -1 for errors, positive values for success
  • Use qi_last_error() / fh_last_error() for error messages
  • Go code checks rc < 0, not rc != 0

When to Add New C++ Libraries

DO implement when:

  • Profile shows >90% syscall overhead
  • Batch operations amortize CGo cost
  • SIMD can provide 3x+ speedup
  • Memory pressure is critical

DON'T implement when:

  • Speedup <2x (CGo overhead negates gains)
  • Single-file operations (per-call overhead too high)
  • Team <3 backend engineers (maintenance burden)
  • Complex error handling required

History

Implemented:

  • queue_index: Binary priority queue replacing JSON filesystem queue
  • dataset_hash: SIMD SHA256 for artifact verification

Deferred:

  • ⏸️ task_json_codec: 2-3x speedup not worth maintenance (small team)
  • ⏸️ artifact_scanner: Go filepath.Walk faster for typical workloads
  • ⏸️ streaming_io: Complexity exceeds benefit without io_uring

Maintenance

Build verification:

make native-build

# Test with native libs (using build tag)
go test -tags native_libs ./tests/...

# Or use the test script with Redis
docker-compose -f deployments/docker-compose.dev.yml up -d redis
sleep 2
go test -tags native_libs ./tests/...

Adding new library:

  1. Create subdirectory with CMakeLists.txt
  2. Implement C ABI in .h / .cpp files
  3. Add to root CMakeLists.txt
  4. Create Go bridge in internal/
  5. Add benchmarks in tests/benchmarks/
  6. Document in this README

Troubleshooting

Library not found:

  • Ensure native/build/lib*.dylib (macOS) or .so (Linux) exists
  • Check LD_LIBRARY_PATH or DYLD_LIBRARY_PATH

CGo undefined symbols:

  • Verify C function names match exactly (no name mangling)
  • Check #include paths are correct
  • Rebuild: make native-clean && make native-build

Performance regression:

  • Verify code is built with -tags native_libs
  • Check benchmark: go test -tags native_libs -bench=BenchmarkQueue -v
  • Profile with: go test -tags native_libs -bench=. -cpuprofile=cpu.prof