fetch_ml/tests/fixtures/consistency
Jeremie Fraeys a239f3a14f
test(consistency): add dataset hash consistency test suite
Add cross-implementation consistency tests for dataset hash functionality:

## Test Fixtures
- Single file, nested directories, and multiple file test cases
- Expected hashes in JSON format for validation

## Test Infrastructure
- harness.go: Common test utilities and reference implementation runner
- dataset_hash_test.go: Consistency test cases comparing implementations
- cmd/update.go: Tool to regenerate expected hashes from reference

## Purpose
Ensures the hash implementations (Go, C++, Zig) produce identical results
on all supported platforms.
2026-03-05 14:41:14 -05:00

Consistency Test Fixtures

This directory contains canonical test fixtures for cross-implementation consistency testing.

Each implementation (native C++, Go, Zig) must produce identical outputs for these fixtures.

Algorithm Specification

Dataset Hash Algorithm v1

  1. Recursively collect all regular files (not symlinks, not directories)
  2. Skip hidden files (names starting with '.')
  3. Sort file paths lexicographically (full relative paths)
  4. For each file:
    • Compute SHA256 of file contents
    • Convert to lowercase hex (64 chars)
  5. Combine: SHA256(concatenation of all file hashes in sorted order)
  6. Return lowercase hex (64 chars)

Empty directory: returns the SHA256 of the empty string, e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
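
The empty-directory value follows from step 5: with no files there are no file hashes, so the concatenation is the empty string. A small Go check of that constant, using the hypothetical names EmptyDatasetHash and sha256Hex, might look like this:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// EmptyDatasetHash is the value this README lists for an empty directory.
const EmptyDatasetHash = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

// sha256Hex returns the lowercase hex SHA256 digest of data.
func sha256Hex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}
```

Comparing sha256Hex(nil) with EmptyDatasetHash gives a quick sanity test for a new implementation before it touches the file system at all.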

Directory Structure

dataset_hash/
├── 01_empty_dir/           # Empty directory
├── 02_single_file/         # One file with "hello world"
├── 03_nested/              # Nested directories
├── 04_special_chars/       # Files with spaces and unicode
└── expected_hashes.json    # All expected outputs

Adding New Fixtures

  1. Create a directory with an input/ subdirectory
  2. Add files to input/
  3. Compute the expected hash using the reference implementation (cmd/update.go regenerates expected hashes from it)
  4. Add entry to expected_hashes.json
  5. Document any special considerations in README.md