ADR-006: Use Runtime SIMD Detection for Cross-Platform Correctness
Status
Proposed
Context
The C++ native libraries use SIMD instructions for performance (SHA-NI on Intel, ARMv8 crypto extensions on Apple Silicon). However, macOS universal binaries contain both x86_64 and arm64 code, and not all CPUs of a given architecture support the same extensions. Compile-time detection (e.g., #ifdef __AVX__) is insufficient because:
- Each architecture slice of a universal binary must be compiled for the lowest common denominator of that architecture
- Runtime CPU detection is required for correct operation on heterogeneous hardware
- Silent failures or illegal-instruction crashes occur if a SIMD extension is assumed but absent
Decision
Use runtime CPU feature detection with compile-time guards. Function pointers are resolved at library initialization based on detected CPU capabilities.
Implementation Pattern:
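// sha256_simd.cpp
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__)
#include <cpuid.h>  // __get_cpuid_count, bit_SHA (x86 only)
#endif
// <arm_neon.h> is only needed in the ARMv8 translation unit, not here

enum class Sha256Impl {
    GENERIC,      // Pure C++ (fallback)
    SHA_NI,       // Intel SHA-NI (x86_64)
    ARMV8_CRYPTO  // ARMv8 crypto extensions
};

// Block functions, each defined in its own separately compiled object
void sha256_block_generic(uint32_t* state, const uint8_t* data, size_t blocks);
void sha256_block_sha_ni(uint32_t* state, const uint8_t* data, size_t blocks);
void sha256_block_armv8(uint32_t* state, const uint8_t* data, size_t blocks);

// Runtime detection (called once at init)
Sha256Impl detect_best_impl() {
#if defined(__aarch64__)
    // All Apple Silicon CPUs implement the ARMv8 SHA-2 instructions;
    // a non-Apple AArch64 target would need its own runtime check here
    return Sha256Impl::ARMV8_CRYPTO;
#elif defined(__x86_64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // Leaf 7 needs subleaf 0 in ECX; the call returns 0 if leaf 7 is unsupported
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && (ebx & bit_SHA)) {
        return Sha256Impl::SHA_NI;  // SHA-NI bit set
    }
    return Sha256Impl::GENERIC;
#else
    return Sha256Impl::GENERIC;
#endif
}

// Function pointer set at library init
void (*sha256_block_fn)(uint32_t* state, const uint8_t* data, size_t blocks) = nullptr;

extern "C" void fh_init_impl() {
    // Each slice references only the variants that exist for its architecture
    switch (detect_best_impl()) {
#if defined(__x86_64__)
        case Sha256Impl::SHA_NI: sha256_block_fn = sha256_block_sha_ni; break;
#endif
#if defined(__aarch64__)
        case Sha256Impl::ARMV8_CRYPTO: sha256_block_fn = sha256_block_armv8; break;
#endif
        default: sha256_block_fn = sha256_block_generic; break;
    }
}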
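In the pattern above, sha256_block_fn is nullptr until fh_init_impl() has run, so a caller that forgets to initialize would dereference a null pointer. The following is a minimal sketch of one way to close that gap, assuming a C++11 function-local static is acceptable in the library. The entry point name fh_sha256_blocks and the helper resolve_block_fn are hypothetical, and the generic block function is replaced by a counting stub so the example is self-contained.

```cpp
#include <cstddef>
#include <cstdint>

using Sha256BlockFn = void (*)(uint32_t* state, const uint8_t* data, size_t blocks);

// Stand-in for the real generic compression function (hypothetical body):
// it only counts blocks so the dispatch can be observed.
static size_t g_blocks_seen = 0;
static void sha256_block_generic_stub(uint32_t*, const uint8_t*, size_t blocks) {
    g_blocks_seen += blocks;
}

// Hypothetical resolver. In the real library this would switch on the
// result of detect_best_impl(); here it always picks the stub.
static Sha256BlockFn resolve_block_fn() {
    return sha256_block_generic_stub;
}

// Hypothetical public entry point: safe to call before any explicit init.
void fh_sha256_blocks(uint32_t* state, const uint8_t* data, size_t blocks) {
    // C++11 guarantees a function-local static is initialized exactly once,
    // even when several threads reach this line concurrently.
    static const Sha256BlockFn fn = resolve_block_fn();
    fn(state, data, blocks);
}
```

With this shape the detection still happens once, but the first call pays for it instead of a separate init step. The trade-off is a guard check on every call, which the explicit fh_init_impl() approach avoids.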
Consequences
Positive
- Single binary works on all supported platforms
- Graceful degradation to generic implementation
- No runtime crashes from illegal instructions
- Apple Silicon and Intel Macs supported equally
Negative
- Slight initialization overhead (one-time, negligible)
- Function pointer indirection in hot path
- More complex build (multiple SIMD variants compiled)
- Larger binary (generic + SIMD code paths)
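The indirection cost listed above is paid once per call rather than once per block, because the block function takes a blocks count. The sketch below shows how a caller could batch every full block into a single indirect call. The Sha256Ctx layout and the sha256_update signature are assumptions made for illustration, and counting_stub is a placeholder that does no hashing.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

using Sha256BlockFn = void (*)(uint32_t* state, const uint8_t* data, size_t blocks);

struct Sha256Ctx {          // hypothetical context layout
    uint32_t state[8];
    uint8_t  buffer[64];    // holds a partial block between calls
    size_t   buffered;      // bytes currently in buffer (0..63)
};

// Placeholder block function: records how it was called, hashes nothing.
static size_t g_calls = 0, g_blocks = 0;
static void counting_stub(uint32_t*, const uint8_t*, size_t blocks) {
    ++g_calls;
    g_blocks += blocks;
}

void sha256_update(Sha256Ctx& ctx, Sha256BlockFn block_fn,
                   const uint8_t* data, size_t len) {
    // Top up a partially filled buffer first.
    if (ctx.buffered > 0) {
        size_t take = 64 - ctx.buffered;
        if (take > len) take = len;
        std::memcpy(ctx.buffer + ctx.buffered, data, take);
        ctx.buffered += take;
        data += take;
        len -= take;
        if (ctx.buffered < 64) return;  // still no complete block
        block_fn(ctx.state, ctx.buffer, 1);
        ctx.buffered = 0;
    }
    // One indirect call covers every remaining full block.
    size_t full = len / 64;
    if (full > 0) {
        block_fn(ctx.state, data, full);
        data += full * 64;
        len -= full * 64;
    }
    // Keep the tail (fewer than 64 bytes) for the next call.
    std::memcpy(ctx.buffer, data, len);
    ctx.buffered = len;
}
```

For large inputs the pointer is dereferenced once per update call, so the per-block overhead of the indirection shrinks as the input grows.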
Options Considered
Option 1: Compile-Time Detection Only
Pros: Simple, no runtime overhead
Cons: A slice compiled with an extension enabled crashes on CPUs of that architecture that lack it
Verdict: Rejected - macOS requires universal binaries
Option 2: Separate Binaries per Architecture
Pros: Maximum performance, simple code
Cons: Complex distribution, user must choose correct binary
Verdict: Rejected - poor user experience
Option 3: Runtime Detection with Function Pointers (Chosen)
Pros: Single binary, correct on all platforms
Cons: Slight indirection overhead
Verdict: Accepted - correctness over micro-optimization
Enforcement
Build Requirements
- Compile with -march=x86-64-v2 (baseline) + separate SHA-NI object
- Compile with -march=armv8-a (baseline) + separate crypto object
- Linker combines all variants into single binary
CI Strategy
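# Build and test on both architectures
# Each job needs a different runner label; one label cannot cover both architectures.
# The labels below assume GitHub-hosted runners; steps are abbreviated.
jobs:
  test-intel:
    runs-on: macos-13  # Intel (x86_64)
    steps:
      - build
      - test -bench=. ./tests/benchmarks/
  test-arm:
    runs-on: macos-14  # Apple Silicon (arm64)
    steps:
      - build
      - test -bench=. ./tests/benchmarks/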
Correctness Verification
- Same hashes produced on Intel and ARM
- Benchmarks verify SIMD path is faster than generic
- Unit tests run on both architectures in CI
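The "same hashes" requirement can be checked on a single machine by running the dispatched implementation against the generic one on identical input. Below is a minimal sketch of such a differential test, assuming both functions share the block-function signature used in this ADR. The harness name impls_agree is hypothetical, and the three toy functions are placeholders that do not compute SHA-256; they exist only so the harness can be exercised without the real implementations.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using Sha256BlockFn = void (*)(uint32_t* state, const uint8_t* data, size_t blocks);

// Returns true when both implementations leave identical state for the same
// deterministic pseudo-random input, for every block count up to max_blocks.
bool impls_agree(Sha256BlockFn reference, Sha256BlockFn candidate, size_t max_blocks) {
    uint32_t seed = 0x12345678u;
    for (size_t blocks = 1; blocks <= max_blocks; ++blocks) {
        std::vector<uint8_t> data(blocks * 64);
        for (auto& byte : data) {
            seed = seed * 1664525u + 1013904223u;  // LCG: same bytes on every platform
            byte = static_cast<uint8_t>(seed >> 24);
        }
        uint32_t a[8], b[8];
        for (int i = 0; i < 8; ++i) a[i] = b[i] = 0x01010101u * (i + 1);
        reference(a, data.data(), blocks);
        candidate(b, data.data(), blocks);
        if (std::memcmp(a, b, sizeof a) != 0) return false;
    }
    return true;
}

// Placeholder "implementations" (NOT SHA-256): two equivalent loops and one
// deliberately broken variant.
static void toy_bytewise(uint32_t* state, const uint8_t* data, size_t blocks) {
    for (size_t i = 0; i < blocks * 64; ++i) state[i % 8] += data[i];
}
static void toy_blockwise(uint32_t* state, const uint8_t* data, size_t blocks) {
    for (size_t b = 0; b < blocks; ++b)
        for (size_t i = 0; i < 64; ++i) state[i % 8] += data[b * 64 + i];
}
static void toy_broken(uint32_t* state, const uint8_t* data, size_t blocks) {
    toy_bytewise(state, data, blocks);
    state[0] ^= 1u;  // deliberate divergence
}
```

In the real test suite the reference would be sha256_block_generic and the candidate whichever variant fh_init_impl() selected, so the same test covers SHA-NI on Intel runners and the ARMv8 path on Apple Silicon runners.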
Rationale
Runtime detection is mandatory for macOS universal binaries. The function pointer indirection cost (~1-2 cycles) is negligible compared to the SIMD speedup (2-4x for SHA-256). The generic fallback ensures the code never crashes, even on unexpected hardware.
The key discipline: never assume CPU features. Detection happens once at init, then the optimal path is used for all subsequent operations.