Benchmark fixtures¶

The benchmark fixture library (benchmarks/fixtures/) provides six deterministic, seed-controlled generators shaped like real enterprise tables. Each injects only defects within FreshData's declared repair scope and documents them so the harness can score repair fidelity against known ground truth.

Every fixture module exposes:

generate(n_rows, seed=42, defect_rate=None) -> pd.DataFrame — deterministic; the same (n_rows, seed, defect_rate) always yields identical data. With defect_rate=None each defect family is injected at its documented base rate; with a float in [0, 1] every family is injected at that uniform rate (the knob the trust-monotonicity metric sweeps).
GOLD_LABELS: dict[str, dict] — per column: role, expected_dtype, fill_action, preserve, and an optional reference set.
DEFECT_MANIFEST: list[dict] — one record per injected defect family.
ID_COLUMNS, TARGET_COLUMN, TEXT_COLUMNS, SCALE_VARIANTS, N_COLS.

The gold fixture is the exception: its generate returns a GoldBundle (dirty and clean frames plus preservation, repair, and false-repair masks) instead of a bare DataFrame.
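The determinism contract described above can be illustrated with a toy generator. This is a minimal sketch, not one of the shipped fixtures: `toy_generate`, `BASE_RATES`, and the column names are hypothetical, and it returns a list of row dicts rather than a pd.DataFrame so that it stays dependency-free. It assumes all randomness comes from a single generator seeded with `seed`.

```python
import random

# Hypothetical base rates per defect family (illustration only).
BASE_RATES = {"missing_numeric": 0.06, "sentinel_category": 0.10}


def toy_generate(n_rows, seed=42, defect_rate=None):
    """Toy stand-in for a fixture's generate(); returns a list of row dicts."""
    rng = random.Random(seed)  # the only source of randomness
    if defect_rate is None:
        # Each family keeps its documented base rate.
        rates = dict(BASE_RATES)
    else:
        if not 0.0 <= defect_rate <= 1.0:
            raise ValueError("defect_rate must be in [0, 1]")
        # Every family is injected at the same uniform rate.
        rates = {family: defect_rate for family in BASE_RATES}

    rows = []
    for i in range(n_rows):
        value = round(rng.lognormvariate(3.0, 1.0), 2)  # skewed numeric column
        segment = rng.choice(["smb", "mid", "ent"])
        if rng.random() < rates["missing_numeric"]:
            value = None
        if rng.random() < rates["sentinel_category"]:
            segment = "N/A"
        rows.append({"row_id": i, "value": value, "segment": segment})
    return rows


same_a = toy_generate(1_000, seed=7)
same_b = toy_generate(1_000, seed=7)
other = toy_generate(1_000, seed=8)
all_defective = toy_generate(100, seed=7, defect_rate=1.0)
no_defects = toy_generate(100, seed=7, defect_rate=0.0)
```

Calling the toy twice with the same arguments yields equal lists, while changing the seed changes the data. Because the toy draws the same random numbers regardless of the rate, a sweep over `defect_rate` changes only which cells are corrupted, not the underlying values; whether the real fixtures share that property is not stated in this page.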

Fixture schemas¶

fixture	cols	id column(s)	protected (never mutated)	scale variants
crm	40	`customer_id`	id, `full_name`/`email`/`phone` (text)	10k / 100k / 1M / 5M
finance	60	`transaction_id`, `account_code`	ids, `gl_account`/`cost_center` (text)	10k / 500k / 5M / 25M
event_log	25	`event_id`, `entity_key`	ids	10k / 1M / 10M / 50M
wide_schema	100–5000	`row_uuid`	id, `y_target` (target), `text_*`	1k / 10k / 100k rows × 100/500/1k/5k cols
provenance	18	`record_id`, `invoice_number`	ids, `vendor_name` (text)	1k / 10k / 100k
gold	7	`gid`	`gid`, `gtarget` (target), `gtext` (text)	10k

Defect families (in scope only)¶

Sentinel normalization — recognised tokens (N/A, null, --, n.a.) → missing. 999 is excluded: FreshData treats it as a number, so it is not an in-scope sentinel.
Dtype coercion — numeric strings (incl. $ and thousands commas) → float; ISO date strings → datetime64. Foreign-currency words (EUR 500) and accounting negatives ((1234.56)) are out of generic scope (finance domain pack territory) and intentionally not relied on.
Median fill — low-missingness numeric columns; skewed distributions make the engine's robust median the deterministic choice.
Categorical normalization — sentinels → mode (when a dominant mode ≥ 50% exists) or the canonical Missing/Unknown marker.
Duplicate removal — exact full-row duplicates.
Reference-set violations / CDC / provenance — values outside an approved set, late arrivals, replay, out-of-order versions, low parser confidence: these are flagged, not silently rewritten.
Preservation checks — null ids, missing targets, missing free text: must be flagged and never filled or mutated.
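The first families in the list above can be sketched as post-conditions in plain Python. This is an illustration of the documented behaviour, not FreshData's implementation: the function names are hypothetical, matching sentinels case-insensitively is an assumption, the marker spelling "Unknown" is an assumption, and the 50% mode share is assumed to be measured over non-missing values.

```python
from collections import Counter

SENTINELS = {"n/a", "null", "--", "n.a."}  # 999 is deliberately absent
MISSING_MARKER = "Unknown"  # assumed spelling of the canonical marker


def normalize_sentinel(value):
    """Map a recognised sentinel token to None; leave everything else untouched."""
    # Assumption: matching ignores case and surrounding whitespace.
    if isinstance(value, str) and value.strip().lower() in SENTINELS:
        return None
    return value


def coerce_numeric(value):
    """Parse numeric strings such as "$1,234.50" into floats."""
    value = normalize_sentinel(value)
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    text = value.strip().replace("$", "").replace(",", "")
    try:
        return float(text)
    except ValueError:
        # Out of generic scope ("EUR 500", "(1234.56)"): this sketch
        # returns the input unchanged rather than guessing.
        return value


def fill_categorical(values):
    """Replace sentinels with the mode if it is dominant, else with the marker."""
    cleaned = [normalize_sentinel(v) for v in values]
    present = [v for v in cleaned if v is not None]
    fill = MISSING_MARKER
    if present:
        mode, count = Counter(present).most_common(1)[0]
        if count / len(present) >= 0.5:  # dominant mode threshold
            fill = mode
    return [fill if v is None else v for v in cleaned]
```

For example, `fill_categorical(["US", "US", "CA", "null"])` fills the sentinel with "US" because that mode covers two of the three present values, whereas a column with no dominant mode receives the marker.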

DEFECT_MANIFEST structure¶

Each entry (built from fixtures._common.Defect) is a JSON-friendly dict:

{
  "id": "crm-ltv-missing",          # stable family id
  "column": "lifetime_value",        # column, "a|b", "prefix_*", or "*" (row-level)
  "defect_type": "missing_numeric",  # human-readable defect kind
  "rate": 0.06,                       # base injection rate (defect_rate=None)
  "in_scope_repair": "median_fill",   # the repair the harness expects
  "preservation": false,              # true => correct behaviour is "do not change"
  "note": ""                          # optional scope caveat
}

The harness expands column patterns ("score_*", "a|b", "*") and maps in_scope_repair to an observable post-condition for the family-level repair fidelity metric. The test_fixtures.py suite asserts the actual injected counts match each manifest rate within ±10%.
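Pattern expansion and the rate check can be sketched as follows. Both helpers are hypothetical and are not the harness code: `expand_columns` assumes shell-style wildcards and treats "*" as touching every column, and `rate_within_tolerance` assumes the ±10% is relative to the expected count (the page does not say whether the tolerance is relative or absolute).

```python
from fnmatch import fnmatchcase


def expand_columns(pattern, columns):
    """Expand a manifest "column" pattern against a list of column names."""
    if pattern == "*":
        # Row-level defect: this sketch maps it to every column.
        return list(columns)
    matched = []
    for part in pattern.split("|"):  # "a|b" lists alternatives
        for name in columns:
            if fnmatchcase(name, part) and name not in matched:
                matched.append(name)
    return matched


def rate_within_tolerance(observed_count, n_rows, rate, rel_tol=0.10):
    """True if the observed defect count is within rel_tol of rate * n_rows."""
    expected = rate * n_rows
    return abs(observed_count - expected) <= rel_tol * expected


columns = ["id", "score_a", "score_b", "a", "b"]
score_cols = expand_columns("score_*", columns)
pair_cols = expand_columns("a|b", columns)
row_level = expand_columns("*", columns)
```

With a manifest rate of 0.06 and 10,000 rows the expected count is 600, so under the relative-tolerance assumption an observed count of 630 passes and 700 fails.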

GOLD_LABELS structure¶

{
  "lifetime_value": {
    "role": "numeric",            # id|target|text|numeric|categorical|datetime|bool
    "expected_dtype": "float64",  # dtype after cleaning
    "fill_action": "median_fill", # the expected action
    "preserve": false,            # true for id/target/text invariants
    "reference": ["US", "CA"]     # optional approved set (categoricals)
  }
}

For wide_schema, labels depend on the column count, so use wide_schema.gold_labels(n_cols); the module-level GOLD_LABELS is the 100-column default.
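A label dictionary of this shape is easy to sanity-check before a fixture is registered. The validator below is a hypothetical helper, not part of the fixture library; it encodes only what this page states (the required keys, the role vocabulary, and that id, target, and text columns carry preserve set to true).

```python
ROLES = {"id", "target", "text", "numeric", "categorical", "datetime", "bool"}
REQUIRED_KEYS = {"role", "expected_dtype", "fill_action", "preserve"}
PROTECTED_ROLES = {"id", "target", "text"}


def validate_gold_labels(labels):
    """Return a list of problems; an empty list means the labels look consistent."""
    problems = []
    for column, label in labels.items():
        missing = REQUIRED_KEYS - set(label)
        if missing:
            problems.append(f"{column}: missing keys {sorted(missing)}")
            continue
        if label["role"] not in ROLES:
            problems.append(f"{column}: unknown role {label['role']!r}")
        if label["role"] in PROTECTED_ROLES and not label["preserve"]:
            problems.append(f"{column}: protected role must set preserve")
    return problems


# Hypothetical labels for illustration; fill_action values other than
# "median_fill" are placeholders, not documented action names.
example_labels = {
    "customer_id": {"role": "id", "expected_dtype": "object",
                    "fill_action": "none", "preserve": True},
    "lifetime_value": {"role": "numeric", "expected_dtype": "float64",
                       "fill_action": "median_fill", "preserve": False},
}
issues = validate_gold_labels(example_labels)
```

Returning a list of messages rather than raising on the first problem lets a test report every inconsistent column in one run.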

The gold fixture¶

gold.generate(n_rows=10_000, seed=42, defect_rate=None) returns a GoldBundle:

field	shape	meaning
`dirty_df`	(n + dupes, 7)	input with injected defects
`clean_df`	(n, 7)	analytically derived expected output (deduped, repaired)
`preservation_mask`	(n, 7)	True where a protected cell must be byte-identical
`repair_mask`	(n, 7)	True where a cell must change to a known value
`false_repair_traps`	(n, 7)	True where changing the cell is a failure

clean_df is derived from first principles (an independent oracle), never by running the library under test, so the repair-fidelity and false-repair metrics are scored against ground truth that does not depend on the code being measured.
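How the three masks could feed cell-level scores is sketched below with plain nested lists. The function and the metric definitions are assumptions for illustration, not the harness's actual formulas. The sketch also assumes the engine output and the dirty input have already been deduplicated and row-aligned with clean_df, and it uses None for missing cells (with pandas NaN values the equality checks would need pd.isna handling).

```python
def score_against_gold(output, dirty, clean, preservation_mask, repair_mask, trap_mask):
    """Count cell-level outcomes; all arguments are equally shaped lists of rows."""
    repaired = repair_total = preservation_violations = false_repairs = 0
    for r, row in enumerate(output):
        for c, cell in enumerate(row):
            if repair_mask[r][c]:
                repair_total += 1
                if cell == clean[r][c]:
                    repaired += 1  # changed to the known-correct value
            if preservation_mask[r][c] and cell != dirty[r][c]:
                preservation_violations += 1  # protected cell was mutated
            if trap_mask[r][c] and cell != dirty[r][c]:
                false_repairs += 1  # a change here counts as a failure
    return {
        "repair_fidelity": repaired / repair_total if repair_total else 1.0,
        "preservation_violations": preservation_violations,
        "false_repairs": false_repairs,
    }


# Columns: id, amount, note (toy data, three rows).
dirty = [[1, "N/A", "ok"], [None, 5.0, None], [3, 999, "x"]]
clean = [[1, 502.0, "ok"], [None, 5.0, None], [3, 999, "x"]]
preservation_mask = [[True, False, True]] * 3
repair_mask = [[False, True, False], [False, False, False], [False, False, False]]
trap_mask = [[False, False, False], [True, False, True], [False, True, False]]

# A hypothetical engine output: fills the sentinel correctly, but also
# rewrites a missing note and nulls out the in-range value 999.
output = [[1, 502.0, "ok"], [None, 5.0, ""], [3, None, "x"]]
result = score_against_gold(output, dirty, clean, preservation_mask, repair_mask, trap_mask)
```

On this toy data the one required repair is correct, the rewritten note counts as both a preservation violation and a false repair, and the nulled 999 counts as a second false repair.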

Contributing a new fixture¶

Add benchmarks/fixtures/<name>.py using the _common.py helpers; expose generate, GOLD_LABELS, DEFECT_MANIFEST, and the role/scale constants.
Inject only in-scope defects (HARD CONSTRAINT 5). Verify each one against real fd.clean behaviour: encode what the library actually does rather than what you assume it does.
Register it in fixtures/__init__.py and add a default size in bench.py's DEFAULT_SIZES.
Add a determinism + manifest-count case to tests/benchmark/test_fixtures.py.