# ADR-006: Use Runtime SIMD Detection for Cross-Platform Correctness

## Status

Proposed

## Context

The C++ native libraries use SIMD instructions for performance (SHA-NI on Intel, ARMv8 crypto extensions on Apple Silicon). However, macOS universal binaries contain both an x86_64 and an arm64 slice, and not all CPUs of a given architecture support the same extensions. Compile-time detection (e.g., `#ifdef __AVX__`) is insufficient because:

1. Each slice of a universal binary must be compiled for the lowest common denominator of its architecture
2. Runtime CPU detection is required for correct operation on heterogeneous hardware
3. Silent failures or illegal-instruction crashes occur if an extension is assumed but the CPU lacks it

## Decision

Use **runtime CPU feature detection** with compile-time guards. Function pointers are resolved at library initialization based on detected CPU capabilities.

**Implementation Pattern:**

```cpp
// sha256_simd.cpp
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__)
#include <cpuid.h>  // __get_cpuid_count, bit_SHA
#endif

enum class Sha256Impl {
    GENERIC,      // Pure C++ (fallback)
    SHA_NI,       // Intel SHA-NI (x86_64)
    ARMV8_CRYPTO  // ARMv8 crypto extensions
};

// Block functions, each defined in its own separately compiled object
void sha256_block_generic(uint32_t* state, const uint8_t* data, size_t blocks);
void sha256_block_sha_ni(uint32_t* state, const uint8_t* data, size_t blocks);
void sha256_block_armv8(uint32_t* state, const uint8_t* data, size_t blocks);

// Runtime detection (called once at init)
Sha256Impl detect_best_impl() {
#if defined(__aarch64__)
    // Every Apple Silicon CPU implements the ARMv8 crypto extensions.
    // Other aarch64 targets would need a runtime check here.
    return Sha256Impl::ARMV8_CRYPTO;
#elif defined(__x86_64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // Leaf 7 needs subleaf 0 in ECX, hence the _count variant; it returns 0
    // when the CPU does not report leaf 7, in which case we fall back.
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && (ebx & bit_SHA)) {
        return Sha256Impl::SHA_NI;
    }
    return Sha256Impl::GENERIC;
#else
    return Sha256Impl::GENERIC;
#endif
}

// Function pointer set at library init
void (*sha256_block_fn)(uint32_t* state, const uint8_t* data, size_t blocks) = nullptr;

extern "C" void fh_init_impl() {
    switch (detect_best_impl()) {
#if defined(__x86_64__)  // the SHA-NI object exists only in the x86_64 slice
        case Sha256Impl::SHA_NI:
            sha256_block_fn = sha256_block_sha_ni;
            break;
#endif
#if defined(__aarch64__)  // the crypto object exists only in the arm64 slice
        case Sha256Impl::ARMV8_CRYPTO:
            sha256_block_fn = sha256_block_armv8;
            break;
#endif
        default:
            sha256_block_fn = sha256_block_generic;
            break;
    }
}
```

## Consequences

### Positive

- Single binary works on all supported platforms
- Graceful degradation to generic implementation
- No runtime crashes from illegal instructions
- Apple Silicon and Intel Macs supported equally

### Negative

- Slight initialization overhead (one-time, negligible)
- Function pointer indirection in hot path
- More complex build (multiple SIMD variants compiled)
- Larger binary (generic + SIMD code paths)

## Options Considered

### Option 1: Compile-Time Detection Only

**Pros:** Simple, no runtime overhead

**Cons:** Universal binary fails on one architecture or the other

**Verdict:** Rejected - macOS requires universal binaries

### Option 2: Separate Binaries per Architecture

**Pros:** Maximum performance, simple code

**Cons:** Complex distribution, user must choose correct binary

**Verdict:** Rejected - poor user experience

### Option 3: Runtime Detection with Function Pointers (Chosen)

**Pros:** Single binary, correct on all platforms

**Cons:** Slight indirection overhead

**Verdict:** Accepted - correctness over micro-optimization

## Enforcement

### Build Requirements

- Compile the x86_64 slice with `-march=x86-64-v2` (baseline) + separate SHA-NI object
- Compile the arm64 slice with `-march=armv8-a` (baseline) + separate crypto object
- Linker combines all variants into single binary

### CI Strategy

```yaml
# Build and test on both architectures.
# Steps are schematic; substitute the project's real build and test commands.
jobs:
  test-intel:
    runs-on: macos-13      # Intel (x86_64); use whichever Intel runner label is available
    steps:
      - build
      - test -bench=. ./tests/benchmarks/
  test-arm:
    runs-on: macos-latest  # Apple Silicon (arm64)
    steps:
      - build
      - test -bench=. ./tests/benchmarks/
```

### Correctness Verification

- Same hashes produced on Intel and ARM
- Benchmarks verify SIMD path is faster than generic
- Unit tests run on both architectures in CI

## Rationale

Runtime detection is mandatory for macOS universal binaries. The function pointer indirection cost (~1-2 cycles) is negligible compared to the SIMD speedup (2-4x for SHA-256). The generic fallback ensures the code never crashes, even on unexpected hardware.

The key discipline: **never assume CPU features**. Detection happens once at init, then the optimal path is used for all subsequent operations.
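
The "detect once, then dispatch" discipline and the "same result on every path" check can be sketched in a few lines. This is a minimal illustration, not the library's code: `checksum_generic`, `checksum_unrolled`, `cpu_has_sha_extensions`, `resolve_checksum`, and `dispatch_matches_generic` are hypothetical names, and a trivial byte-sum stands in for the SHA-256 block function so the example stays self-contained. It assumes a GCC- or Clang-compatible compiler for `<cpuid.h>`.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__)
#include <cpuid.h>  // __get_cpuid_count (GCC/Clang)
#endif

// Hypothetical kernel signature; a real SHA-256 block function also takes state.
using ChecksumFn = std::uint32_t (*)(const std::uint8_t* data, std::size_t len);

// Portable fallback: safe on every CPU.
std::uint32_t checksum_generic(const std::uint8_t* data, std::size_t len) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < len; ++i) sum += data[i];
    return sum;
}

// Stand-in for an accelerated variant. It uses no special instructions here;
// it only plays the role of the "fast path" in the dispatch.
std::uint32_t checksum_unrolled(const std::uint8_t* data, std::size_t len) {
    std::uint32_t a = 0, b = 0;
    std::size_t i = 0;
    for (; i + 1 < len; i += 2) {
        a += data[i];
        b += data[i + 1];
    }
    if (i < len) a += data[i];  // odd trailing byte
    return a + b;
}

// Runtime feature probe. Anything not positively detected falls back.
bool cpu_has_sha_extensions() {
#if defined(__x86_64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
    return (ebx & (1u << 29)) != 0;  // CPUID.(EAX=7,ECX=0):EBX bit 29 = SHA
#elif defined(__aarch64__) && defined(__APPLE__)
    return true;  // assumption carried over from the ADR: Apple Silicon has the crypto extensions
#else
    return false;
#endif
}

// Detection runs exactly once: C++11 guarantees thread-safe initialization
// of function-local statics, so no explicit init call is needed.
ChecksumFn resolve_checksum() {
    static const ChecksumFn fn =
        cpu_has_sha_extensions() ? checksum_unrolled : checksum_generic;
    return fn;
}

// Correctness check in the spirit of "same hashes on every path".
bool dispatch_matches_generic(const std::uint8_t* data, std::size_t len) {
    return resolve_checksum()(data, len) == checksum_generic(data, len);
}
```

Resolving through a function-local static is one alternative to an explicit `fh_init_impl()` call: callers cannot forget to initialize, at the cost of a guard check on each call to `resolve_checksum()`. Caching the returned pointer outside the hot loop avoids that check.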