FreshCore architecture
FreshCore is FreshData's optional native backend for cleaning-first workloads. It is designed to plug into the existing execution layer rather than replace the public API or the pandas reference implementation.
Integration model
- Users opt in with fd.clean(df, engine="freshcore").
- CleanConfig still controls what cleaning is requested.
- EngineConfig still controls how execution is routed.
- The default fd.clean(df) path remains unchanged.
- Unsupported work delegates to the pandas reference pipeline and is recorded in CleanReport.fallback_events.
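The routing rules above can be illustrated with a small pure-Python sketch. This is not FreshData's implementation: the Report class, the route function, the NATIVE_OPS set, and the shape of each fallback event are hypothetical stand-ins chosen only to show how unsupported work could be delegated and recorded.

```python
from dataclasses import dataclass, field

# Hypothetical set of operations the native engine can run (illustrative only).
NATIVE_OPS = {"trim_whitespace", "normalize_column_names", "impute_median"}


@dataclass
class Report:
    """Stand-in for the audit report; only the fallback list is modeled."""
    engine: str
    fallback_events: list = field(default_factory=list)


def route(requested_ops, engine="pandas"):
    """Decide, per operation, whether it runs natively or is delegated."""
    report = Report(engine=engine)
    native, delegated = [], []
    for op in requested_ops:
        if engine == "freshcore" and op in NATIVE_OPS:
            native.append(op)
            continue
        delegated.append(op)
        if engine == "freshcore":
            # Record that this work left the native path (assumed event shape).
            report.fallback_events.append(
                {"operation": op, "reason": "unsupported by native engine"}
            )
    return native, delegated, report


native, delegated, report = route(
    ["trim_whitespace", "semantic_cleaning"], engine="freshcore"
)
default_native, default_delegated, default_report = route(["trim_whitespace"])
```

With the default engine, nothing is routed natively and no fallback events are recorded, which mirrors the statement that the default path remains unchanged.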
The Python adapter lives at freshdata.execution.backends._freshcore. It
materializes pandas-compatible inputs, sends compact columns and plan parameters
to the Rust module, then maps native output back into a pandas DataFrame and the
standard CleanReport audit contract.
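The adapter's round trip can be sketched as follows. This is a simplified model under stated assumptions: a plain dict of lists stands in for the pandas DataFrame so the example runs without pandas, fake_native_clean stands in for the call into the Rust module, and the payload layout, the plan key trim_whitespace, and all function names are hypothetical.

```python
def to_payload(table, plan):
    """Flatten a column-oriented table into compact columns plus plan parameters."""
    columns = [{"name": name, "values": list(values)} for name, values in table.items()]
    return {"columns": columns, "plan": dict(plan)}


def fake_native_clean(payload):
    """Stand-in for the native call: trims strings when the plan asks for it."""
    trim = payload["plan"].get("trim_whitespace", False)
    out = []
    for col in payload["columns"]:
        values = [v.strip() if trim and isinstance(v, str) else v for v in col["values"]]
        out.append({"name": col["name"], "values": values})
    return {"columns": out}


def from_native(result):
    """Map native output back into a column-oriented table."""
    return {col["name"]: col["values"] for col in result["columns"]}


table = {"city": [" Oslo ", None, "Bergen"], "n": [1, 2, 3]}
payload = to_payload(table, {"trim_whitespace": True})
cleaned = from_native(fake_native_clean(payload))
```

The input table is copied rather than mutated, so the caller's data is left untouched while the cleaned result is rebuilt from the native output.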
Native engine
The Rust crate is in crates/freshcore and exposes the freshdata_freshcore
PyO3 module. Its internal model is deliberately smaller than a general-purpose
DataFrame:
- typed nullable arrays (Float, Bool, Utf8)
- separate validity/null representation through Option<T>
- a compact physical cleaning plan
- native kernels for strings, missing values, casts, duplicates, outliers, and simple profiles
- operation-level timing records for benchmarks
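To make the data model concrete, here is a Python analogue of a typed nullable column, where None plays the role that Option<T> plays in the Rust crate. The Column class, the sentinels_to_missing kernel, and the default sentinel set are assumptions made for illustration; the actual Rust types and kernels may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

Value = Union[float, bool, str]


@dataclass
class Column:
    name: str
    dtype: str                     # "Float", "Bool", or "Utf8"
    values: List[Optional[Value]]  # None marks a missing entry

    def null_count(self) -> int:
        return sum(v is None for v in self.values)


def sentinels_to_missing(col, sentinels=frozenset({"", "NA", "n/a"})):
    """Utf8 kernel: trim whitespace, then turn sentinel strings into missing values."""
    if col.dtype != "Utf8":
        return col
    cleaned = []
    for v in col.values:
        if v is None:
            cleaned.append(None)
            continue
        v = v.strip()
        cleaned.append(None if v in sentinels else v)
    return Column(col.name, col.dtype, cleaned)


col = sentinels_to_missing(Column("city", "Utf8", [" Oslo ", "NA", None, " "]))
```

Keeping the model this small means each kernel only has to handle a few value types and one notion of "missing".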
FreshCore does not depend on pandas, Polars, DuckDB, Spark, Dask, Modin, Vaex, or any existing DataFrame execution engine.
V1 support boundary
FreshCore v1 runs native kernels for conservative, deterministic cleaning:
- column name normalization
- whitespace trimming
- optional text case normalization via string_case=None|"lower"|"upper"
- sentinel-to-missing normalization
- empty row/column drops
- full-row duplicate detection with duplicate_keep="first" or "last"
- boolean and numeric string casts where safe
- mean/median/mode imputation
- IQR/z-score outlier clipping or flagging
- simple per-column profile metadata
FreshCore falls back for semantic cleaning, context/policy protection, cleaning memory, domain packs, contracts, non-default indexes, duplicate subsets, aggregate/drop duplicate modes, model-based outliers, constant-column dropping, memory downcasting, and the balanced/aggressive decision engine.
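As an illustration of one kernel from the supported list, the sketch below implements IQR-based clipping and flagging over a nullable numeric column using only the standard library. The function names, the multiplier k=1.5, and the inclusive quantile method are assumptions; the source does not state which quantile definition or threshold FreshCore uses.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Compute [Q1 - k*IQR, Q3 + k*IQR], ignoring missing values."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def clip_outliers(values, k=1.5):
    """Clamp values to the IQR bounds; missing values stay missing."""
    lo, hi = iqr_bounds(values, k)
    return [None if v is None else min(max(v, lo), hi) for v in values]


def flag_outliers(values, k=1.5):
    """Return True where a present value lies outside the IQR bounds."""
    lo, hi = iqr_bounds(values, k)
    return [v is not None and not (lo <= v <= hi) for v in values]


data = [1.0, 2.0, None, 3.0, 4.0, 100.0]
clipped = clip_outliers(data)
flags = flag_outliers(data)
```

For this input the quartiles are 2.0 and 4.0, so the bounds are -1.0 and 7.0, and only the value 100.0 is clipped or flagged.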
Benchmarking
Build the native module before benchmarking:
pip install -e ".[dev,freshcore]"
maturin develop --manifest-path crates/freshcore/Cargo.toml --features extension-module
python benchmarks/bench_freshcore.py --rows 10000 100000 1000000 --workload full
The benchmark compares:
- a handwritten pandas baseline
- FreshData's pandas reference path
- FreshCore through engine="freshcore"
The output includes timing, peak memory, fallback events, parity shape checks, and FreshCore stage timings when the native module is installed.
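The kind of measurement described above can be approximated with a minimal harness. This is a sketch, not the contents of benchmarks/bench_freshcore.py: the measure and shape helpers and the two toy cleaning functions are hypothetical. Note that tracemalloc only tracks allocations made through Python's allocator, so it would not see memory allocated inside a native module.

```python
import time
import tracemalloc


def measure(fn, *args):
    """Run fn once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def shape(rows):
    """Return (row_count, column_count) for a list of rows."""
    return (len(rows), len(rows[0]) if rows else 0)


def baseline(rows):
    """Toy handwritten baseline: trim every string cell."""
    return [[c.strip() if isinstance(c, str) else c for c in row] for row in rows]


def candidate(rows):
    """Toy alternative path that should produce the same shape."""
    trim = lambda c: c.strip() if isinstance(c, str) else c
    return [list(map(trim, row)) for row in rows]


rows = [[" a ", 1], ["b ", 2], [None, 3]]
base_out, base_secs, base_peak = measure(baseline, rows)
cand_out, cand_secs, cand_peak = measure(candidate, rows)
parity_ok = shape(base_out) == shape(cand_out)
```

A shape check like parity_ok only confirms that both paths return the same number of rows and columns; it does not verify that the cell values match.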