# ADR-006: Use Runtime SIMD Detection for Cross-Platform Correctness

## Status
Proposed

## Context
The C++ native libraries use SIMD instructions for performance (SHA-NI on x86_64, ARMv8 crypto extensions on Apple Silicon). A macOS universal binary carries both an x86_64 slice and an arm64 slice, and not every CPU that can run a given slice supports the same extensions. Compile-time detection (e.g., `#ifdef __AVX__`) is insufficient because:

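1. Each slice of a universal binary must compile for the lowest common denominator of its architecture
2. A compile-time macro reflects the compiler flags, not the CPU the binary later runs on, so heterogeneous hardware requires a runtime check
3. Illegal-instruction crashes (or silent failures) occur if a SIMD extension is assumed but absent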
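
The gap between what the compiler was told and what the CPU reports can be made concrete. The following is a minimal sketch, assuming GCC or Clang (`__SHA__` is the macro those compilers define under `-msha`); the helper names `compiled_with_sha`, `cpu_reports_sha`, and `would_crash_without_runtime_check` are illustrative and not part of the library:

```cpp
#if defined(__x86_64__)
#include <cpuid.h>  // __get_cpuid_count, bit_SHA (GCC/Clang only)
#endif

// True only if this translation unit was built with -msha
// (or a -march value that implies it). Fixed at compile time.
constexpr bool compiled_with_sha() {
#if defined(__SHA__)
    return true;
#else
    return false;
#endif
}

// True only if the CPU executing this code reports the SHA extensions.
bool cpu_reports_sha() {
#if defined(__x86_64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // Leaf 7, subleaf 0; returns 0 if the leaf is not supported.
    return __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && (ebx & bit_SHA);
#else
    return false;  // not an x86_64 build, so SHA-NI does not apply
#endif
}

// The dangerous combination: the compiler was allowed to emit SHA-NI
// everywhere, but the CPU running the binary does not implement it.
bool would_crash_without_runtime_check() {
    return compiled_with_sha() && !cpu_reports_sha();
}
```

The two answers come from different moments in time (build vs. execution), which is why a preprocessor check alone cannot protect a binary that ships to mixed hardware.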

## Decision
Use **runtime CPU feature detection** with compile-time guards. Function pointers are resolved at library initialization based on detected CPU capabilities.

**Implementation Pattern:**
```cpp
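// sha256_simd.cpp
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__)
#include <cpuid.h>  // __get_cpuid_count, bit_SHA (GCC/Clang)
#endif
// Intrinsics headers (e.g. <arm_neon.h>) belong in the per-architecture
// translation units, not in this dispatch file.

enum class Sha256Impl {
    GENERIC,      // Pure C++ (fallback)
    SHA_NI,       // Intel SHA-NI (x86_64)
    ARMV8_CRYPTO  // ARMv8 crypto extensions
};

// Block functions, each defined in its own translation unit
void sha256_block_generic(uint32_t* state, const uint8_t* data, size_t blocks);
#if defined(__x86_64__)
void sha256_block_sha_ni(uint32_t* state, const uint8_t* data, size_t blocks);
#elif defined(__aarch64__)
void sha256_block_armv8(uint32_t* state, const uint8_t* data, size_t blocks);
#endif

// Runtime detection (called once at init)
Sha256Impl detect_best_impl() {
#if defined(__aarch64__) && defined(__APPLE__)
    // Apple Silicon - always has ARMv8 crypto.
    // Other arm64 platforms would need an OS-level query here instead.
    return Sha256Impl::ARMV8_CRYPTO;
#elif defined(__x86_64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // Leaf 7 requires subleaf 0 in ECX; __get_cpuid_count sets it and
    // returns 0 when the leaf is unsupported.
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && (ebx & bit_SHA)) {
        return Sha256Impl::SHA_NI;  // SHA-NI bit set
    }
    return Sha256Impl::GENERIC;
#else
    return Sha256Impl::GENERIC;
#endif
}

// Function pointer set at library init; defaults to the generic path so a
// call made before init is slow rather than a null dereference
void (*sha256_block_fn)(uint32_t* state, const uint8_t* data, size_t blocks) = sha256_block_generic;

extern "C" void fh_init_impl() {
    switch (detect_best_impl()) {
#if defined(__x86_64__)
        case Sha256Impl::SHA_NI: sha256_block_fn = sha256_block_sha_ni; break;
#elif defined(__aarch64__)
        case Sha256Impl::ARMV8_CRYPTO: sha256_block_fn = sha256_block_armv8; break;
#endif
        default: sha256_block_fn = sha256_block_generic; break;
    }
}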
```
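
An alternative to an explicit init call is to resolve the pointer lazily on first use, which removes any dependence on callers invoking `fh_init_impl()` first and stays safe when several threads make the first call concurrently. A minimal sketch, assuming C++11 thread-safe initialization of function-local statics; `block_generic_stub`, `resolve_block_fn`, `sha256_blocks`, and `g_resolve_calls` are hypothetical stand-ins for illustration, not the library's real entry points:

```cpp
#include <cstddef>
#include <cstdint>

using BlockFn = void (*)(uint32_t* state, const uint8_t* data, size_t blocks);

// Stand-in for the real generic implementation (NOT SHA-256):
// it only folds the input bytes into state[0].
static void block_generic_stub(uint32_t* state, const uint8_t* data, size_t blocks) {
    for (size_t i = 0; i < blocks * 64; ++i) state[0] += data[i];
}

static int g_resolve_calls = 0;  // demonstration only: counts resolutions

// Stand-in for detect_best_impl() plus the switch; always picks the stub here.
static BlockFn resolve_block_fn() {
    ++g_resolve_calls;
    return block_generic_stub;
}

// Wrapper that callers use instead of touching the pointer directly.
void sha256_blocks(uint32_t* state, const uint8_t* data, size_t blocks) {
    // Initialized exactly once, on the first call; C++11 guarantees that
    // concurrent first calls wait for the initialization to finish.
    static const BlockFn fn = resolve_block_fn();
    fn(state, data, blocks);
}
```

The trade-off is a small guard check on every call in exchange for not needing a separate init step; passing many blocks per call keeps that cost amortized.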

## Consequences

### Positive
- Single binary works on all supported platforms
- Graceful degradation to generic implementation
- No illegal-instruction crashes, provided SIMD code stays confined to the separately compiled variants
- Apple Silicon and Intel Macs supported equally

### Negative
- Slight initialization overhead (one-time, negligible)
- Function pointer indirection in hot path
- More complex build (multiple SIMD variants compiled)
- Larger binary (generic + SIMD code paths)

## Options Considered

### Option 1: Compile-Time Detection Only
**Pros:** Simple, no runtime overhead  
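**Cons:** Each slice is locked to one feature set: enabling SHA-NI crashes x86_64 CPUs that lack it, disabling it forfeits the speedup everywhere  
**Verdict:** Rejected - cannot be both fast and safe across the CPUs a universal binary targets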

### Option 2: Separate Binaries per Architecture
**Pros:** Maximum performance, simple code  
**Cons:** Complex distribution, user must choose correct binary  
**Verdict:** Rejected - poor user experience

### Option 3: Runtime Detection with Function Pointers (Chosen)
**Pros:** Single binary, correct on all platforms  
**Cons:** Slight indirection overhead  
**Verdict:** Accepted - correctness over micro-optimization

## Enforcement

### Build Requirements
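- x86_64: compile baseline code with `-march=x86-64-v2`, plus a separate SHA-NI object built with the SHA extension enabled (e.g. `-msha`)
- arm64: compile baseline code with `-march=armv8-a`, plus a separate crypto object built with the crypto extension enabled (e.g. `-march=armv8-a+crypto`)
- SIMD intrinsics appear only in the separate objects, so baseline code never contains instructions the CPU may lack
- Linker combines all variants into a single binary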
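
Where a separate object per variant is inconvenient, GCC and Clang also allow a single function to opt into extra instructions through a `target` attribute. This is an alternative to the separate-object approach this ADR requires, shown only to illustrate the isolation principle. A minimal sketch, assuming GCC or Clang on x86_64; all function names are hypothetical, and the example computes just one lane of the SHA-256 message-schedule step (`W0 + sigma0(W1)`) rather than a full block function:

```cpp
#include <cstdint>

#if defined(__x86_64__) && defined(__GNUC__)
#include <cpuid.h>
#include <immintrin.h>
#define HAVE_X86_SHA_PATH 1
#endif

// Scalar sigma0 from the SHA-256 message schedule.
static inline uint32_t rotr32(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }
static inline uint32_t sigma0(uint32_t x) { return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3); }

// Baseline path: compiled with the default flags, safe on every CPU.
uint32_t msg1_generic(uint32_t w0, uint32_t w1) { return w0 + sigma0(w1); }

#if defined(HAVE_X86_SHA_PATH)
// Only this function is allowed to contain SHA-NI instructions.
__attribute__((target("sha,sse4.1")))
uint32_t msg1_sha_ni(uint32_t w0, uint32_t w1) {
    __m128i a = _mm_set_epi32(0, 0, static_cast<int>(w1), static_cast<int>(w0));
    __m128i r = _mm_sha256msg1_epu32(a, _mm_setzero_si128());
    return static_cast<uint32_t>(_mm_cvtsi128_si32(r));  // lane 0 = w0 + sigma0(w1)
}

static bool cpu_has_sha() {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    return __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && (ebx & bit_SHA);
}
#endif

// Dispatcher: the SHA-NI function is reached only after the runtime check.
uint32_t msg1(uint32_t w0, uint32_t w1) {
#if defined(HAVE_X86_SHA_PATH)
    if (cpu_has_sha()) return msg1_sha_ni(w0, w1);
#endif
    return msg1_generic(w0, w1);
}
```

Either way the rule is the same: code built with extra instruction sets must be reachable only through a path that has already checked the CPU.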

### CI Strategy
```yaml
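# Build and test on both architectures.
# Runner labels are assumptions: the two jobs must use different labels,
# since one label cannot be both Intel and Apple Silicon. Check the
# current runner documentation for the exact names.
jobs:
  test-intel:
    runs-on: macos-13  # Intel (x86_64)
    steps:
      - build
      - test -bench=. ./tests/benchmarks/

  test-arm:
    runs-on: macos-14  # Apple Silicon (arm64)
    steps:
      - build
      - test -bench=. ./tests/benchmarks/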
```

### Correctness Verification
- Same hashes produced on Intel and ARM
- Benchmarks verify SIMD path is faster than generic
- Unit tests run on both architectures in CI
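
The first check can also run inside a single machine by comparing every available variant against the generic path on identical inputs. A minimal sketch of such a harness, assuming the block-function signature used in this ADR; `paths_agree` and the `toy_*` functions are hypothetical, and the toy functions are stand-ins that do not compute SHA-256:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <random>
#include <vector>

using BlockFn = void (*)(uint32_t* state, const uint8_t* data, size_t blocks);

// Returns true if `candidate` leaves the state bit-identical to `reference`
// for `trials` pseudo-random inputs. The fixed seed keeps failures reproducible.
bool paths_agree(BlockFn reference, BlockFn candidate, int trials, unsigned seed = 1) {
    std::mt19937 rng(seed);
    for (int t = 0; t < trials; ++t) {
        size_t blocks = 1 + rng() % 4;          // 1 to 4 blocks of 64 bytes
        std::vector<uint8_t> data(blocks * 64);
        for (auto& b : data) b = static_cast<uint8_t>(rng());

        uint32_t s_ref[8], s_cand[8];
        for (int i = 0; i < 8; ++i) s_ref[i] = s_cand[i] = static_cast<uint32_t>(rng());

        reference(s_ref, data.data(), blocks);
        candidate(s_cand, data.data(), blocks);
        if (std::memcmp(s_ref, s_cand, sizeof s_ref) != 0) return false;
    }
    return true;
}

// Toy reference: adds each byte into one of eight state words.
void toy_reference(uint32_t* state, const uint8_t* data, size_t blocks) {
    for (size_t i = 0; i < blocks * 64; ++i) state[i % 8] += data[i];
}

// Same result computed in a different order, as an optimized path would.
void toy_optimized(uint32_t* state, const uint8_t* data, size_t blocks) {
    for (size_t w = 0; w < 8; ++w)
        for (size_t i = w; i < blocks * 64; i += 8) state[w] += data[i];
}

// Deliberately wrong variant, to show that the harness catches a mismatch.
void toy_broken(uint32_t* state, const uint8_t* data, size_t blocks) {
    toy_reference(state, data, blocks);
    state[7] ^= 1u;
}
```

In the real test suite the reference would be `sha256_block_generic` and the candidate whichever SIMD variant the CI machine supports, alongside fixed known-answer vectors.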

## Rationale

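Runtime detection is required for a macOS universal binary that uses optional CPU extensions safely. The function pointer indirection cost (roughly 1-2 cycles per call) is negligible compared to the SIMD speedup (roughly 2-4x for SHA-256), especially since each call processes whole blocks. The generic fallback keeps the library working, without illegal-instruction crashes, on hardware that lacks either extension.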

The key discipline: **never assume CPU features**. Detection happens once at init, then the optimal path is used for all subsequent operations.