fetch_ml/docs/src/adr/ADR-006-runtime-simd-detection.md
Jeremie Fraeys d673bce216
docs: fix mermaid graphs and update outdated content
- Fix mermaid graph syntax errors (escape parentheses in node labels)
- Move mermaid-init.js to Hugo static directory for correct MIME type
- Update Future Extensions section in cli-tui-ux-contract-v1.md to match current roadmap
- Add ADR-004 through ADR-007 documenting C++ native optimization strategy
2026-02-16 20:37:38 -05:00

ADR-006: Use Runtime SIMD Detection for Cross-Platform Correctness

Status

Proposed

Context

The C++ native libraries use hardware-accelerated instructions for performance (SHA-NI on x86_64, ARMv8 crypto extensions on Apple Silicon). However, macOS universal binaries contain both an x86_64 and an arm64 slice, and not every CPU within an architecture supports the same extensions. Compile-time detection (e.g., #ifdef __AVX__) is insufficient because:

  1. Each slice of a universal binary must target the baseline CPU of its architecture, so optional extensions cannot be assumed at build time
  2. Runtime CPU detection is required for correct operation on heterogeneous hardware
  3. Executing an extension the CPU lacks causes an illegal-instruction crash, and compile-time macros only reflect compiler flags, not the machine the code runs on
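
To make the limitation of compile-time checks concrete, here is a minimal sketch. The function name compiled_in_features is hypothetical and not part of the library. It reports which extensions the compiler was permitted to use for this translation unit, which depends only on build flags such as -msha or -march and says nothing about the CPU the binary later runs on.

```cpp
#include <string>

// Hypothetical helper: lists the ISA extensions the *compiler* was allowed
// to use for this translation unit. The answer is fixed at build time.
std::string compiled_in_features() {
    std::string out;
#if defined(__x86_64__)
    out += "x86_64";
  #if defined(__SHA__)   // set by -msha (GCC/Clang)
    out += " +sha";
  #endif
  #if defined(__AVX__)   // set by -mavx or a -march that implies it
    out += " +avx";
  #endif
#elif defined(__aarch64__)
    out += "arm64";
  #if defined(__ARM_FEATURE_CRYPTO) || defined(__ARM_FEATURE_SHA2)
    out += " +sha2";     // ACLE feature macros, set by -march=armv8-a+crypto
  #endif
#else
    out += "other";
#endif
    return out;
}
```

A binary built with -msha reports "+sha" on every machine, including ones that would fault on the first SHA-NI instruction, which is why the decision below relies on a runtime query instead.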

Decision

Use runtime CPU feature detection with compile-time guards. Function pointers are resolved at library initialization based on detected CPU capabilities.

Implementation Pattern:

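// sha256_simd.cpp
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__)
#include <cpuid.h>  // __get_cpuid_count, bit_SHA (GCC/Clang)
#endif

enum class Sha256Impl {
    GENERIC,      // Pure C++ (fallback)
    SHA_NI,       // SHA-NI (x86_64)
    ARMV8_CRYPTO  // ARMv8 crypto extensions
};

// Per-variant block functions, each defined in its own translation unit
void sha256_block_generic(uint32_t* state, const uint8_t* data, size_t blocks);
#if defined(__x86_64__)
void sha256_block_sha_ni(uint32_t* state, const uint8_t* data, size_t blocks);
#elif defined(__aarch64__)
void sha256_block_armv8(uint32_t* state, const uint8_t* data, size_t blocks);
#endif

// Runtime detection (called once at init)
Sha256Impl detect_best_impl() {
#if defined(__aarch64__) && defined(__APPLE__)
    // Apple Silicon - always has ARMv8 crypto
    return Sha256Impl::ARMV8_CRYPTO;
#elif defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    // Leaf 7 needs subleaf 0 in ECX; the call returns 0 if leaf 7 is unsupported
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && (ebx & bit_SHA)) {
        return Sha256Impl::SHA_NI;  // SHA-NI bit set
    }
    return Sha256Impl::GENERIC;
#else
    // Includes non-Apple ARM64, where the crypto extensions are optional and
    // need an OS-specific query before they can be used
    return Sha256Impl::GENERIC;
#endif
}

// Function pointer set at library init; defaults to the generic path so a
// call made before fh_init_impl() is still safe
void (*sha256_block_fn)(uint32_t* state, const uint8_t* data, size_t blocks) = sha256_block_generic;

extern "C" void fh_init_impl() {
    switch (detect_best_impl()) {
#if defined(__x86_64__)
        case Sha256Impl::SHA_NI: sha256_block_fn = sha256_block_sha_ni; break;
#elif defined(__aarch64__)
        case Sha256Impl::ARMV8_CRYPTO: sha256_block_fn = sha256_block_armv8; break;
#endif
        default: sha256_block_fn = sha256_block_generic; break;
    }
}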
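
The pattern above depends on the init function being called before the first hash. One possible variation, sketched here under assumptions, resolves the pointer lazily on first use. The names select_impl, resolved_block_fn, sha256_blocks and the counting stub are hypothetical stand-ins, not the library's API. The sketch relies on the C++11 guarantee that a function-local static is initialized exactly once, even when several threads race to it.

```cpp
#include <cstddef>
#include <cstdint>

using Sha256BlockFn = void (*)(uint32_t* state, const uint8_t* data, size_t blocks);

// Hypothetical stand-in for a real implementation: it only counts calls.
static int g_stub_calls = 0;
static void block_stub(uint32_t*, const uint8_t*, size_t) { ++g_stub_calls; }

// Hypothetical selector. A real one would map the detected CPU capability
// to the matching block function.
static int g_selections = 0;
static Sha256BlockFn select_impl() {
    ++g_selections;
    return block_stub;
}

// Resolves the implementation on first use. Initialization of a
// function-local static runs once and is thread-safe since C++11.
static Sha256BlockFn resolved_block_fn() {
    static const Sha256BlockFn fn = select_impl();
    return fn;
}

// Entry point that can never call through an unset pointer.
void sha256_blocks(uint32_t* state, const uint8_t* data, size_t blocks) {
    resolved_block_fn()(state, data, blocks);
}
```

The trade-off is a guard check on every call in exchange for removing the ordering requirement. An explicit init function avoids that check but must be documented as mandatory.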

Consequences

Positive

  • Single binary works on all supported platforms
  • Graceful degradation to generic implementation
  • No runtime crashes from illegal instructions
  • Apple Silicon and Intel Macs supported equally

Negative

  • Slight initialization overhead (one-time, negligible)
  • Function pointer indirection in hot path
  • More complex build (multiple SIMD variants compiled)
  • Larger binary (generic + SIMD code paths)
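
The indirection cost can be amortized because the block function takes a block count. As an illustration only, the hypothetical helper absorb_full_blocks below passes every complete 64-byte block of a buffer through the pointer in a single call instead of one call per block. The counting stub stands in for a real implementation.

```cpp
#include <cstddef>
#include <cstdint>

using Sha256BlockFn = void (*)(uint32_t* state, const uint8_t* data, size_t blocks);

// Hypothetical stub that records how it was called.
static int g_indirect_calls = 0;
static size_t g_blocks_seen = 0;
static void counting_stub(uint32_t*, const uint8_t*, size_t blocks) {
    ++g_indirect_calls;
    g_blocks_seen += blocks;
}

// Feeds all complete 64-byte blocks of `data` through `fn` with one indirect
// call. Returns the number of trailing bytes the caller still has to buffer.
size_t absorb_full_blocks(Sha256BlockFn fn, uint32_t* state,
                          const uint8_t* data, size_t len) {
    const size_t blocks = len / 64;
    if (blocks > 0) {
        fn(state, data, blocks);
    }
    return len % 64;
}
```

For a 1000-byte buffer this makes one indirect call covering 15 blocks and leaves 40 bytes for the next update.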

Options Considered

Option 1: Compile-Time Detection Only

Pros: Simple, no runtime overhead
Cons: Either crashes on CPUs that lack the assumed extension or leaves the SIMD path unused on CPUs that have it
Verdict: Rejected - macOS requires universal binaries

Option 2: Separate Binaries per Architecture

Pros: Maximum performance, simple code
Cons: Complex distribution, user must choose correct binary
Verdict: Rejected - poor user experience

Option 3: Runtime Detection with Function Pointers (Chosen)

Pros: Single binary, correct on all platforms
Cons: Slight indirection overhead
Verdict: Accepted - correctness over micro-optimization

Enforcement

Build Requirements

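  • Compile x86_64 code with -march=x86-64-v2 (baseline); build the SHA-NI implementation as a separate object with -msha added
  • Compile arm64 code with -march=armv8-a (baseline); build the crypto implementation as a separate object with -march=armv8-a+crypto
  • Linker combines all variants into single binary; baseline code reaches the variant objects only through the dispatch pointer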

CI Strategy

# Build and test on both architectures
# Runner labels assume GitHub-hosted runners; adjust for your CI provider
jobs:
  test-intel:
    runs-on: macos-15-intel  # Intel (x86_64)
    steps:
      - build
      - test -bench=. ./tests/benchmarks/

  test-arm:
    runs-on: macos-latest  # Apple Silicon (arm64)
    steps:
      - build
      - test -bench=. ./tests/benchmarks/

Correctness Verification

  • Same hashes produced on Intel and ARM
  • Benchmarks verify SIMD path is faster than generic
  • Unit tests run on both architectures in CI
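
One way to check that every implementation produces the same hashes is to run each block function against a published test vector. The sketch below is an assumption about how such a check could look, not the project's test suite. sha256_block_reference and matches_known_vector are hypothetical names. The reference routine follows FIPS 180-4, and the expected digest is the standard one for the message "abc".

```cpp
#include <cstddef>
#include <cstdint>

using Sha256BlockFn = void (*)(uint32_t* state, const uint8_t* data, size_t blocks);

static inline uint32_t rotr(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

static const uint32_t K[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2};

// Portable reference compression function, processing 64-byte blocks in order.
void sha256_block_reference(uint32_t* state, const uint8_t* data, size_t blocks) {
    for (; blocks > 0; --blocks, data += 64) {
        uint32_t w[64];
        for (int i = 0; i < 16; ++i) {  // load big-endian message words
            w[i] = (uint32_t(data[4 * i]) << 24) | (uint32_t(data[4 * i + 1]) << 16) |
                   (uint32_t(data[4 * i + 2]) << 8) | uint32_t(data[4 * i + 3]);
        }
        for (int i = 16; i < 64; ++i) {  // message schedule
            uint32_t s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3);
            uint32_t s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10);
            w[i] = w[i - 16] + s0 + w[i - 7] + s1;
        }
        uint32_t a = state[0], b = state[1], c = state[2], d = state[3];
        uint32_t e = state[4], f = state[5], g = state[6], h = state[7];
        for (int i = 0; i < 64; ++i) {  // 64 compression rounds
            uint32_t t1 = h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)) +
                          ((e & f) ^ (~e & g)) + K[i] + w[i];
            uint32_t t2 = (rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) +
                          ((a & b) ^ (a & c) ^ (b & c));
            h = g; g = f; f = e; e = d + t1;
            d = c; c = b; b = a; a = t1 + t2;
        }
        state[0] += a; state[1] += b; state[2] += c; state[3] += d;
        state[4] += e; state[5] += f; state[6] += g; state[7] += h;
    }
}

// Returns true if `fn` reproduces the FIPS 180-4 digest of "abc".
bool matches_known_vector(Sha256BlockFn fn) {
    uint8_t block[64] = {'a', 'b', 'c', 0x80};  // message, padding bit, zeros
    block[63] = 24;                             // message length in bits
    uint32_t state[8] = {0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
                         0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};
    fn(state, block, 1);
    static const uint32_t expected[8] = {0xba7816bf, 0x8f01cfea, 0x414140de, 0x5dae2223,
                                         0xb00361a3, 0x96177a9c, 0xb410ff61, 0xf20015ad};
    for (int i = 0; i < 8; ++i) {
        if (state[i] != expected[i]) return false;
    }
    return true;
}
```

In CI the same check would be applied to each accelerated block function available on the runner, so an Intel job and an ARM job both compare against the identical constant rather than against each other.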

Rationale

Runtime detection is mandatory for macOS universal binaries. The function pointer indirection cost (~1-2 cycles) is negligible compared to the SIMD speedup (2-4x for SHA-256). The generic fallback ensures the code never crashes, even on unexpected hardware.

The key discipline: never assume CPU features. Detection happens once at init, then the optimal path is used for all subsequent operations.