Benchmark fixtures
The benchmark fixture library (benchmarks/fixtures/) provides six
deterministic, seed-controlled generators shaped like real enterprise tables.
Each injects only defects within FreshData's declared repair scope and documents
them so the harness can score repair fidelity against known ground truth.
Every fixture module exposes:
- `generate(n_rows, seed=42, defect_rate=None) -> pd.DataFrame`: deterministic; the same `(n_rows, seed, defect_rate)` always yields identical data. With `defect_rate=None` each defect family is injected at its documented base rate; with a float in `[0, 1]` every family is injected at that uniform rate (the knob the trust-monotonicity metric sweeps).
- `GOLD_LABELS: dict[str, dict]`: per-column labels with `role`, `expected_dtype`, `fill_action`, `preserve`, and an optional `reference` set.
- `DEFECT_MANIFEST: list[dict]`: one record per injected defect family.
- `ID_COLUMNS`, `TARGET_COLUMN`, `TEXT_COLUMNS`, `SCALE_VARIANTS`, `N_COLS`.
The gold fixture is the exception: its generate returns a GoldBundle (dirty and
clean frames plus preservation, repair, and false-repair masks) instead of a bare DataFrame.
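The `generate` contract described above can be illustrated with a toy generator. This is a minimal sketch, not code from the fixture library: the column names, the `BASE_RATES` table, and the choice of missing values as the only defect are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical base rates for a toy fixture; real fixtures document theirs
# in DEFECT_MANIFEST.
BASE_RATES = {"lifetime_value": 0.06, "region": 0.03}


def generate(n_rows, seed=42, defect_rate=None):
    """Toy stand-in for a fixture's generate(); not FreshData code."""
    rng = np.random.default_rng(seed)  # every random draw flows from the seed
    df = pd.DataFrame(
        {
            "customer_id": np.arange(n_rows),
            "lifetime_value": rng.lognormal(mean=5.0, sigma=1.0, size=n_rows),
            "region": rng.choice(["US", "CA"], size=n_rows).astype(object),
        }
    )
    for column, base_rate in BASE_RATES.items():
        # None -> this family's base rate; float -> one uniform rate for all.
        rate = base_rate if defect_rate is None else defect_rate
        hit = rng.random(n_rows) < rate
        df.loc[hit, column] = np.nan  # the id column is never touched
    return df
```

Because all randomness comes from a generator seeded once per call, two calls with the same arguments return equal frames, and `defect_rate=0.0` or `defect_rate=1.0` gives the two extremes of a sweep.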
Fixture schemas
| fixture | cols | id column(s) | protected (never mutated) | scale variants |
|---|---|---|---|---|
| crm | 40 | customer_id | id, full_name/email/phone (text) | 10k / 100k / 1M / 5M |
| finance | 60 | transaction_id, account_code | ids, gl_account/cost_center (text) | 10k / 500k / 5M / 25M |
| event_log | 25 | event_id, entity_key | ids | 10k / 1M / 10M / 50M |
| wide_schema | 100–5000 | row_uuid | id, y_target (target), text_* | 1k / 10k / 100k rows × 100/500/1k/5k cols |
| provenance | 18 | record_id, invoice_number | ids, vendor_name (text) | 1k / 10k / 100k |
| gold | 7 | gid | gid, gtarget (target), gtext (text) | 10k |
Defect families (in scope only)
- Sentinel normalization: recognised tokens (`N/A`, `null`, `--`, `n.a.`) → missing. `999` is excluded: FreshData treats it as a number, so it is not an in-scope sentinel.
- Dtype coercion: numeric strings (incl. `$` and thousands commas) → float; ISO date strings → datetime64. Foreign-currency words (`EUR 500`) and accounting negatives (`(1234.56)`) are out of generic scope (finance domain pack territory) and intentionally not relied on.
- Median fill: low-missingness numeric columns; skewed distributions make the engine's robust median the deterministic choice.
- Categorical normalization: sentinels → mode (when a dominant mode ≥ 50% exists) or the canonical `Missing`/`Unknown` marker.
- Duplicate removal: exact full-row duplicates.
- Reference-set violations / CDC / provenance: values outside an approved set, late arrivals, replay, out-of-order versions, low parser confidence. These are flagged, not silently rewritten.
- Preservation checks: null ids, missing targets, missing free text. These must be flagged and never filled or mutated.
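To make the scope boundary of the first two families concrete, here is a hedged sketch of a per-cell normalizer. It is an illustration only and not FreshData's implementation: the function name `normalize_cell`, the regular expression, and the case-insensitive sentinel matching are assumptions.

```python
import re

# Sentinel tokens named in the list above; case-insensitive matching is an
# assumption of this sketch.
SENTINELS = {"n/a", "null", "--", "n.a."}

# Plain or comma-grouped digits, optional leading "-" and "$", optional decimals.
_NUMERIC = re.compile(r"-?\$?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?")


def normalize_cell(value):
    """Hypothetical helper: apply only the in-scope repairs to one cell."""
    if not isinstance(value, str):
        return value  # numbers such as 999 pass through untouched
    text = value.strip()
    if text.lower() in SENTINELS:
        return None  # sentinel -> missing
    if _NUMERIC.fullmatch(text):
        return float(text.replace("$", "").replace(",", ""))
    return value  # "EUR 500", "(1234.56)" and other text are left as-is
```

Out-of-scope inputs fall through unchanged, which mirrors the rule that fixtures do not rely on foreign-currency words or accounting negatives being repaired.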
DEFECT_MANIFEST structure
Each entry (built from fixtures._common.Defect) is a JSON-friendly dict:
{
    "id": "crm-ltv-missing",           # stable family id
    "column": "lifetime_value",        # column, "a|b", "prefix_*", or "*" (row-level)
    "defect_type": "missing_numeric",  # human-readable defect kind
    "rate": 0.06,                      # base injection rate (defect_rate=None)
    "in_scope_repair": "median_fill",  # the repair the harness expects
    "preservation": False,             # True => correct behaviour is "do not change"
    "note": ""                         # optional scope caveat
}
The harness expands column patterns ("score_*", "a|b", "*") and maps
in_scope_repair to an observable post-condition for the family-level repair
fidelity metric. The test_fixtures.py suite asserts the actual injected counts
match each manifest rate within ±10%.
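The pattern expansion mentioned above could look roughly like the following. This is a sketch under stated assumptions, not the harness's code: `expand_columns` is a hypothetical name, glob semantics come from the standard-library `fnmatch` module, and `"*"` is treated as matching every column.

```python
from fnmatch import fnmatchcase


def expand_columns(pattern, columns):
    """Hypothetical helper: resolve a manifest "column" pattern to real names.

    "a|b" is split into alternatives, and each alternative is matched as a
    case-sensitive glob, so "score_*" and "*" work without special cases.
    """
    alternatives = pattern.split("|")
    return [
        column
        for column in columns
        if any(fnmatchcase(column, alt) for alt in alternatives)
    ]
```

The result keeps the order of `columns`, so expanded families stay aligned with the frame's column order.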
GOLD_LABELS structure
{
    "lifetime_value": {
        "role": "numeric",             # id|target|text|numeric|categorical|datetime|bool
        "expected_dtype": "float64",   # dtype after cleaning
        "fill_action": "median_fill",  # the expected action
        "preserve": False,             # True for id/target/text invariants
        "reference": ["US", "CA"]      # optional approved set (categoricals)
    }
}
For wide_schema, labels depend on the column count, so use
wide_schema.gold_labels(n_cols); the module-level GOLD_LABELS is the
100-column default.
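A label dict of this shape is easy to sanity-check before a benchmark run. The helpers below are a minimal sketch: `validate_gold_labels` and `protected_columns` are hypothetical names, and the example values for the id column (including `fill_action` set to `None`) are placeholders, since the source only shows a numeric column.

```python
ALLOWED_ROLES = {"id", "target", "text", "numeric", "categorical", "datetime", "bool"}
REQUIRED_KEYS = {"role", "expected_dtype", "fill_action", "preserve"}


def validate_gold_labels(labels):
    """Hypothetical check: return a list of problems (empty means OK)."""
    problems = []
    for column, label in labels.items():
        missing = REQUIRED_KEYS - set(label)
        if missing:
            problems.append(f"{column}: missing keys {sorted(missing)}")
        if label.get("role") not in ALLOWED_ROLES:
            problems.append(f"{column}: unknown role {label.get('role')!r}")
    return problems


def protected_columns(labels):
    """Columns whose cells must never be filled or mutated."""
    return [column for column, label in labels.items() if label.get("preserve")]


# Placeholder labels for illustration only.
EXAMPLE_LABELS = {
    "customer_id": {"role": "id", "expected_dtype": "int64",
                    "fill_action": None, "preserve": True},
    "lifetime_value": {"role": "numeric", "expected_dtype": "float64",
                       "fill_action": "median_fill", "preserve": False},
}
```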
The gold fixture
gold.generate(n_rows=10_000, seed=42, defect_rate=None) returns a GoldBundle:
| field | shape | meaning |
|---|---|---|
| `dirty_df` | (n + dupes, 7) | input with injected defects |
| `clean_df` | (n, 7) | analytically derived expected output (deduped, repaired) |
| `preservation_mask` | (n, 7) | True where a protected cell must be byte-identical |
| `repair_mask` | (n, 7) | True where a cell must change to a known value |
| `false_repair_traps` | (n, 7) | True where changing the cell is a failure |
clean_df is derived from first principles (independent oracle), never by
running the library under test, so the repair-fidelity and false-repair metrics
are grounded in known-correct ground truth.
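One way the masks might be used for scoring is sketched below. It is not the harness's metric code: `masked_match_rate` is a hypothetical helper, and it assumes the cleaned output is already deduplicated and row-aligned with `clean_df`, and that protected cells carry the same value in `clean_df` as in the input. Scoring against `false_repair_traps` is not covered here.

```python
import numpy as np
import pandas as pd


def masked_match_rate(output, expected, mask):
    """Fraction of mask-selected cells where output equals expected."""
    # Two missing values in the same cell count as a match (NaN != NaN otherwise).
    same = output.eq(expected) | (output.isna() & expected.isna())
    selected = same.to_numpy()[mask.to_numpy()]
    return float(selected.mean()) if selected.size else 1.0


# Tiny made-up frames standing in for clean_df and a library's output.
expected = pd.DataFrame({"gid": [1.0, np.nan, 3.0], "gnum": [10.0, 20.0, 30.0]})
output = pd.DataFrame({"gid": [1.0, np.nan, 3.0], "gnum": [10.0, 99.0, 30.0]})
preservation_mask = pd.DataFrame({"gid": [True, True, True],
                                  "gnum": [False, False, False]})
repair_mask = pd.DataFrame({"gid": [False, False, False],
                            "gnum": [False, True, True]})

preservation_score = masked_match_rate(output, expected, preservation_mask)
repair_score = masked_match_rate(output, expected, repair_mask)
```

In this toy case the null id is left alone, so preservation is perfect, while one of the two cells that needed repair holds the wrong value.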
Contributing a new fixture
- Add `benchmarks/fixtures/<name>.py` using the `_common.py` helpers; expose `generate`, `GOLD_LABELS`, `DEFECT_MANIFEST`, and the role/scale constants.
- Inject only in-scope defects (HARD CONSTRAINT 5). Verify against real `fd.clean` behaviour: encode what the library actually does, don't assume.
- Register it in `fixtures/__init__.py` and add a default size in `bench.py`'s `DEFAULT_SIZES`.
- Add a determinism + manifest-count case to `tests/benchmark/test_fixtures.py`.
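The determinism and manifest-count case from the last step might be structured like this. It is a sketch, not the contents of `test_fixtures.py`: `check_fixture`, `toy_generate`, and `TOY_MANIFEST` are hypothetical, only missing-value defects on a single named column are counted, and the ±10% tolerance is interpreted as relative to the expected count, which is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical one-family manifest for a toy fixture.
TOY_MANIFEST = [{"id": "toy-value-missing", "column": "value", "rate": 0.06}]


def toy_generate(n_rows, seed=42, defect_rate=None):
    """Toy fixture that blanks an exact number of cells in "value"."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({"row_id": np.arange(n_rows), "value": rng.normal(size=n_rows)})
    rate = TOY_MANIFEST[0]["rate"] if defect_rate is None else defect_rate
    hit = rng.choice(n_rows, size=round(rate * n_rows), replace=False)
    df.loc[hit, "value"] = np.nan
    return df


def check_fixture(generate, manifest, n_rows=10_000, seed=42, rel_tol=0.10):
    """Assert determinism, then compare injected counts with manifest rates."""
    first = generate(n_rows, seed=seed)
    second = generate(n_rows, seed=seed)
    assert first.equals(second), "same arguments must yield identical data"
    for entry in manifest:
        observed = int(first[entry["column"]].isna().sum())
        expected = entry["rate"] * n_rows
        assert abs(observed - expected) <= rel_tol * expected, entry["id"]
```

Calling `check_fixture(toy_generate, TOY_MANIFEST)` passes silently; a manifest whose rate disagrees with what the generator injects raises an `AssertionError` naming the family id.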