fetch_ml/tests/fixtures/consistency/README.md
Jeremie Fraeys a239f3a14f
test(consistency): add dataset hash consistency test suite
Add cross-implementation consistency tests for dataset hash functionality:

## Test Fixtures
- Single file, nested directories, and multiple file test cases
- Expected hashes in JSON format for validation

## Test Infrastructure
- harness.go: Common test utilities and reference implementation runner
- dataset_hash_test.go: Consistency test cases comparing implementations
- cmd/update.go: Tool to regenerate expected hashes from reference

## Purpose
Ensures that the hash implementations (Go, C++, Zig) produce identical
results on all supported platforms.
2026-03-05 14:41:14 -05:00


# Consistency Test Fixtures
This directory contains canonical test fixtures for cross-implementation consistency testing.
Each implementation (native C++, Go, Zig) must produce identical outputs for these fixtures.
## Algorithm Specification
### Dataset Hash Algorithm v1
1. Recursively collect all regular files (not symlinks, not directories)
2. Skip hidden files (names starting with '.')
3. Sort file paths lexicographically (full relative paths)
4. For each file:
   - Compute SHA256 of file contents
   - Convert to lowercase hex (64 chars)
5. Combine: SHA256(concatenation of all file hashes in sorted order)
6. Return lowercase hex (64 chars)

**Empty directory**: returns the SHA256 of the empty string, because there are no file hashes to concatenate:
`e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855`
### Directory Structure
```
dataset_hash/
├── 01_empty_dir/ # Empty directory
├── 02_single_file/ # One file with "hello world"
├── 03_nested/ # Nested directories
├── 04_special_chars/ # Files with spaces and unicode
└── expected_hashes.json # All expected outputs
```
## Adding New Fixtures
1. Create a new fixture directory containing an `input/` subdirectory
2. Add the fixture files to `input/`
3. Compute the expected hash using the reference implementation
4. Add an entry for the fixture to `expected_hashes.json`
5. Document any special considerations in `README.md`