Benchmarks

freshdata is built on vectorized pandas/NumPy with one-pass engine caching (correlation matrix, column contexts). No C extension is required.

Typical throughput

Upper bounds measured on a modern laptop (baselines are stored in tests/fixtures/perf/baselines.json):

Dataset size	Balanced	Aggressive
500 rows	< 0.5 s	< 1 s
3,000 rows	< 2.5 s	< 6 s
29k rows (full AQI)	< 5 s	KNN gated

The bottleneck in aggressive mode is KNN imputation on large frames, which is why KNN is gated to aggressive mode only.

MissForest-style imputation is benchmarked separately because it is opt-in, scikit-learn-backed, and intentionally slower than the default engine. To compare median/mode, aggressive KNN, and MissForest on mixed-type synthetic data, install freshdata-cleaner[ml] and run python benchmarks/bench_missforest.py.

The Benchmark Release harness

benchmarks/bench.py is a reproducible, schema-stable harness that measures nine standardized metrics against an enterprise fixture library, with pandas and pyjanitor baselines. It calls FreshData exactly as a user would; it never modifies library internals.

pip install -e ".[dev]" jsonschema

python benchmarks/bench.py fixtures     # (optional) write fixture CSVs to disk
python benchmarks/bench.py run          # all fixtures, 10k rows, write results/<run_id>/
python benchmarks/bench.py report       # render report.md + report.json for the latest run
python benchmarks/bench.py compare --fixture crm --size 100000   # FreshData vs baselines
python benchmarks/bench.py single --fixture crm --size 100000 --metric time

Makefile shortcuts: make benchmark, make benchmark-ci, make benchmark-report, make benchmark-fixtures, make benchmark-test. Results land in results/<run_id>/<fixture>/<size>.json and follow the schema defined in benchmarks/results_schema.py, so runs diff cleanly across versions.
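Because every run writes the same layout, two runs can be compared field by field. The sketch below is a minimal illustration and not part of the harness: it flattens two result documents and lists the numeric fields that moved by more than a relative tolerance. The field names in the sample payloads (wall_clock, p50_s, p95_s, peak_mb) and the numbers are hypothetical; the real field names are whatever benchmarks/results_schema.py defines.

```python
import json
from pathlib import Path


def flatten(node, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} for easy comparison."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def is_number(value):
    """True for int/float, but not for bool (which is an int subclass)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def diff_results(old, new, rel_tol=0.05):
    """Return {path: (old, new)} for shared numeric fields that moved by more than rel_tol."""
    old_flat, new_flat = flatten(old), flatten(new)
    changed = {}
    for path in sorted(old_flat.keys() & new_flat.keys()):
        a, b = old_flat[path], new_flat[path]
        if is_number(a) and is_number(b) and abs(b - a) > rel_tol * max(abs(a), abs(b)):
            changed[path] = (a, b)
    return changed


def load_result(path):
    """Read one results/<run_id>/<fixture>/<size>.json file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical payloads with made-up numbers; real files follow results_schema.py.
old_run = {"wall_clock": {"p50_s": 1.00, "p95_s": 2.00}, "peak_mb": 10.0}
new_run = {"wall_clock": {"p50_s": 1.30, "p95_s": 2.02}, "peak_mb": 10.0}
print(diff_results(old_run, new_run))
```

With real files, pass load_result(...) for each run instead of the inline dictionaries. A relative tolerance is used here so that ordinary timing noise does not show up as a regression; the 5% default is an arbitrary choice.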

The legacy quick-bench (python benchmarks/bench_quick.py --fixtures --compare) is preserved for ad-hoc throughput checks on the tests/fixtures/ corpus.

Calibration metrics (CleanBench Phase 3)

The CleanBench mini-suite (benchmarks/cleanbench/, run in CI by tests/test_cleanbench.py) checks that the semantic layer's confidence scores are honest, that is, well calibrated against ground truth. It enforces the following gates alongside the Phase-2 safety gates:

Metric	Definition	Phase-3 gate	Long-run target
protected-column violation rate	any diff in protected columns	= 0, absolute	= 0
false modification rate	already-correct cells changed anyway	≤ 0.1%	trend to 0
expected calibration error (ECE)	equal-width-bin gap between confidence and accuracy over semantic proposals	≤ 0.05	≤ 0.03
precision @ confidence ≥ 0.95	share of high-confidence proposals that match ground truth	≥ 0.98	≥ 0.99
coverage @ precision 0.98	largest share of proposals acceptable at that precision (abstention quality)	reported	grows per phase
ambiguous auto-applies	embedding merges applied without margin/allowed-values evidence	= 0	= 0

Confidence/outcome pairs are extracted with cleanbench.confidence_outcomes(report, truth, corrupted): each pair records a semantic proposal's calibrated confidence and whether its repair matches the fixture's ground truth. The Phase-3 fixture runs the full embedding path with a deterministic stub encoder, so the gates hold with no model files and no network.
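The three calibration metrics in the table can be computed from such pairs with a few lines of plain Python. This is a sketch under stated assumptions, not the CleanBench implementation: it assumes the pairs arrive as (confidence, correct) tuples, uses 10 equal-width bins for ECE, and interprets coverage as the largest top-k share (in descending confidence order) whose precision stays at or above the target. The actual return type of cleanbench.confidence_outcomes and the bin count used by the suite may differ.

```python
def expected_calibration_error(pairs, n_bins=10):
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for confidence, correct in pairs:
        index = min(int(confidence * n_bins), n_bins - 1)  # confidence 1.0 joins the last bin
        bins[index].append((confidence, correct))
    total = len(pairs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def precision_at_confidence(pairs, threshold=0.95):
    """Share of proposals at or above the threshold that match ground truth."""
    kept = [bool(ok) for confidence, ok in pairs if confidence >= threshold]
    return sum(kept) / len(kept) if kept else float("nan")


def coverage_at_precision(pairs, target=0.98):
    """Largest top-k share (by descending confidence) whose precision is >= target."""
    ranked = sorted(pairs, key=lambda pair: pair[0], reverse=True)
    best = hits = 0
    for k, (_, ok) in enumerate(ranked, start=1):
        hits += bool(ok)
        if hits / k >= target:
            best = k
    return best / len(ranked) if ranked else 0.0


# Made-up (confidence, correct) pairs, for illustration only.
pairs = [(0.99, True), (0.97, True), (0.96, True), (0.92, True), (0.65, False), (0.55, True)]
print(expected_calibration_error(pairs))
print(precision_at_confidence(pairs))
print(coverage_at_precision(pairs))
```

On this toy input the four high-confidence proposals are all correct, so precision at 0.95 is 1.0 and coverage at precision 0.98 is 4 of 6; the two mid-confidence proposals dominate the ECE.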

Strategic-report scaling benchmarks

benchmarks/bench_report.py covers five reproducible scaling cases, each measured for strategy="balanced" vs strategy="aggressive" where applicable. It generates its own synthetic fixtures and writes benchmarks/results/report_bench.json.

pip install -e ".[bench]"          # pyarrow + psutil

python benchmarks/bench_report.py --all                 # everything
python benchmarks/bench_report.py csv_ingest --mb 100   # ~100 MB CSV ingest + clean
python benchmarks/bench_report.py profile --rows 1000000     # 1M-row mixed-schema profile
python benchmarks/bench_report.py nullfill --rows 10000000   # 10M-row null-fill / flag pass
python benchmarks/bench_report.py import_time           # cold `import freshdata`
python benchmarks/bench_report.py memory --rows 1000000      # peak-RSS of a full clean
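For orientation, the import-time case can be approximated by hand. The sketch below is an assumption about how such a measurement might look, not the code in bench_report.py: it times an import inside a fresh interpreter per sample so that nothing is cached in sys.modules. The stdlib module json is a stand-in so the example runs anywhere; substitute freshdata once it is installed. "Cold" here means a new process, not a cold disk cache.

```python
import statistics
import subprocess
import sys

# One-liner executed in the child process: time a single import statement.
_TIMER = (
    "import time; start = time.perf_counter(); "
    "import {module}; print(time.perf_counter() - start)"
)


def cold_import_seconds(module, repeats=5):
    """Median wall-clock seconds to import `module` in a fresh interpreter."""
    samples = []
    for _ in range(repeats):
        # A new process per sample guarantees the module is not already imported.
        result = subprocess.run(
            [sys.executable, "-c", _TIMER.format(module=module)],
            capture_output=True,
            text=True,
            check=True,
        )
        samples.append(float(result.stdout.strip()))
    return statistics.median(samples)


# "json" is a stand-in; pass "freshdata" to time the library itself.
print(f"{cold_import_seconds('json'):.4f} s")
```

The median is reported rather than the mean because the first sample often pays for filesystem caching and skews the average.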

Results: not yet measured

These numbers are environment-specific and are not committed. Run the commands above and read benchmarks/results/report_bench.json. Fill in the table with results from the hardware and software you actually ran on; do not copy numbers from elsewhere.

Hardware: (CPU, cores, RAM — fill in)
Software: (OS, Python version, pandas/pyarrow/polars/duckdb versions — fill in)

Benchmark	Balanced	Aggressive
100 MB CSV ingest + clean	not yet measured	not yet measured
1M-row profile	not yet measured	n/a (read-only)
10M-row null-fill / flag	not yet measured	not yet measured
Import time (`import freshdata`)	not yet measured	n/a
Peak memory @ 1M rows	not yet measured	not yet measured

What each metric measures (and why)

#	Metric	Definition	Why it matters
1	Wall-clock	p50/p95 seconds for `fd.clean(df, return_report=True)` over 5 repeats, I/O excluded.	Raw throughput on real-shaped data.
2	Peak memory (MB)	Peak/delta `tracemalloc` allocation during the run.	Footprint, especially the wide-schema report-generation stress case.
3	Repair fidelity (%)	Cell-level match to the gold oracle; family-level post-conditions (`DEFECT_MANIFEST`) for named fixtures.	Does FreshData actually fix what it claims to?
4	False-repair rate (%)	% of cells that must not change (id/target/free-text traps) that changed anyway.	The core safety contract. Must be 0.
5	Preservation rate (%)	% of protected values identical before/after.	The id/target/text invariant under load. Must be 100 on non-null ids.
6	Authored-code reduction	FreshData call lines vs the pandas/pyjanitor baseline lines for the same defect set.	Expressiveness (reported separately from timing).
7	Diagnosis speed	Latency of `report.summary()` / `.to_frame()` / `.to_dict()` (+ enterprise `quality.to_markdown()` / `lineage.emit()`).	Time from report to explanation.
8	Trust-score usefulness	Strict monotonic decrease of the 0–100 trust score as defect rate rises (0→60%).	The trust score must be a meaningful signal.
9	Export completeness	All report fields populated; every repaired action carries before/after, risk, confidence (rationale for engine decisions).	A repair you cannot explain or export is not auditable.
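Metrics 1 and 2 can be sketched with the standard library alone. The code below is an illustrative assumption about how such a measurement could be structured, not the harness's own code: it times a callable over five repeats, reports nearest-rank p50/p95, and records the tracemalloc peak in a separate untimed run because tracing slows execution. The workload function is a hypothetical stand-in for the `fd.clean(df, return_report=True)` call.

```python
import math
import time
import tracemalloc


def percentile(samples, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def measure(func, *args, repeats=5):
    """Time `func` over several repeats, then trace peak allocation in one extra run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)

    # Tracing adds overhead, so memory is measured outside the timed repeats.
    tracemalloc.start()
    func(*args)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "p50_s": percentile(timings, 50),
        "p95_s": percentile(timings, 95),
        "peak_mb": peak_bytes / 1e6,
    }


def workload(n):
    """Stand-in for the cleaning call; allocates a list of n squares."""
    return [i * i for i in range(n)]


print(measure(workload, 200_000))
```

With only five samples, the nearest-rank p95 is simply the slowest repeat, so p50 and p95 will sit close together whenever the run is stable.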

Current results: 10k-row scale, balanced mode

The CI benchmark workflow refreshes this table; the values below come from a reference local run (freshdata 1.0.0). Re-render with python benchmarks/bench.py run && python benchmarks/bench.py report.

fixture	n_rows	n_cols	p50 s	p95 s	peak MB	repair %	preserve %	trust	monotonic	export %
crm	10,200	40	0.93	0.93	8.2	100.0	100.0	93.8	✅	100.0
finance	10,200	60	1.16	1.16	10.0	100.0	100.0	99.5	✅	100.0
event_log	10,000	25	0.35	0.35	6.4	100.0	100.0	99.7	✅	100.0
wide_schema	10,000	100	2.28	2.29	14.6	100.0	100.0	96.0	✅	100.0
provenance	10,000	18	0.29	0.30	7.5	100.0	100.0	99.8	✅	100.0
gold	10,200	7	0.14	0.14	2.7	100.0	100.0	98.3	✅	100.0

Authored-code reduction (Metric 6): FreshData = 3 lines; pandas baseline = 26 lines (88.5% reduction); pyjanitor baseline = 20 lines (85.0% reduction).

The false-repair rate (not shown in the table) is 0.0% and the preservation rate is 100.0% on every fixture, so the id/target/free-text invariant holds under load, not just on toy examples.
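To make the two safety numbers concrete, here is a minimal sketch of how a false-repair rate and a preservation rate could be computed. It is not the harness's implementation and rests on several assumptions: tables are represented as plain dicts of column lists rather than DataFrames, rows are aligned before and after cleaning, two missing values (NaN) count as identical, and preservation is treated as the complement of the change rate over the same columns. The sample data is made up.

```python
def _same(a, b):
    """Cell equality that treats two NaNs (missing values) as identical."""
    both_nan = isinstance(a, float) and isinstance(b, float) and a != a and b != b
    return both_nan or a == b


def false_repair_rate(before, after, must_not_change):
    """Percent of cells in the must-not-change columns that differ after cleaning."""
    total = changed = 0
    for column in must_not_change:
        for old, new in zip(before[column], after[column]):
            total += 1
            changed += not _same(old, new)
    return 100.0 * changed / total if total else 0.0


def preservation_rate(before, after, protected):
    """Percent of protected cells that are identical before and after."""
    return 100.0 - false_repair_rate(before, after, protected)


# Made-up example: "id" and "note" must not change, "price" may be imputed.
before = {"id": [1, 2, 3], "note": ["a", "b", float("nan")], "price": [10.0, None, 12.0]}
after = {"id": [1, 2, 3], "note": ["a", "b", float("nan")], "price": [10.0, 11.0, 12.0]}

print(false_repair_rate(before, after, ["id", "note"]))
print(preservation_rate(before, after, ["id", "note"]))
```

The imputed price cell does not count against either metric because price is not in the protected set; only edits to id or note would.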

Competitor differentiation

See benchmarks/competitor_analysis.md: a curated, factual table covering pandas, pyjanitor, Great Expectations, Soda, dbt, AWS Glue Data Quality, Google Dataplex, OpenRefine, Dedupe, ydata-profiling, sweetviz and cleanlab. Capability claims come from each tool's public docs; FreshData advantages appear only where a harness metric confirms them.

Adding a fixture or baseline

See docs/fixtures.md for fixture schemas and the DEFECT_MANIFEST structure, and benchmarks/README.md for the baseline contract (run(df) + AUTHORED_LINES, lazy imports, graceful skip).
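As a rough picture of what a baseline module might look like, the sketch below shows one possible shape for the contract named above. Only run(df) and AUTHORED_LINES come from this page; everything else is an assumption for illustration, including returning None to signal a skip, the hand-counted value of AUTHORED_LINES, and the choice of pyjanitor's clean_names and remove_empty as the cleaning steps. Consult benchmarks/README.md for the authoritative contract.

```python
"""Hypothetical pyjanitor baseline; details beyond run() and AUTHORED_LINES are illustrative."""

# Lines a user would author by hand for this defect set (counted manually here).
AUTHORED_LINES = 2


def run(df):
    """Clean `df` with pyjanitor, or return None if pyjanitor is not installed."""
    try:
        # Lazy import: loading this module never requires the optional dependency.
        import janitor  # noqa: F401  (registers DataFrame methods on import)
    except ImportError:
        return None  # assumed skip convention; the real contract may differ

    out = df.clean_names()     # normalize column names to snake_case
    return out.remove_empty()  # drop rows and columns that are entirely null
```

Keeping the import inside run means a missing optional tool costs one skipped baseline rather than a failed benchmark run.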

Reproduce the legacy corpus

python benchmarks/bench_quick.py --fixtures --compare   # tests/fixtures corpus, side by side
python benchmarks/bench_quick.py --online --compare     # cached online datasets

CI runs every fixture in tests/fixtures/ under the conservative, balanced, and aggressive strategies, and also covers 50 curated real public datasets. Reproduce the quality/efficiency matrix on your own data with:

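import pandas as pd
import freshdata as fd
df = pd.read_csv("your_data.csv")  # placeholder path; load your own dataset here
print(fd.compare_clean(df))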