FreshCore architecture

FreshCore is FreshData's optional native backend for cleaning-first workloads. It is designed to plug into the existing execution layer rather than replace the public API or the pandas reference implementation.

Integration model

  • Users opt in with fd.clean(df, engine="freshcore").
  • CleanConfig still controls what cleaning is requested.
  • EngineConfig still controls how execution is routed.
  • The default fd.clean(df) path remains unchanged.
  • Unsupported work delegates to the pandas reference pipeline and is recorded in CleanReport.fallback_events.
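The routing rules above can be sketched in a few lines. This is only an illustration of the opt-in and fallback pattern, not FreshData's implementation: `clean`, `Report`, `NATIVE_OPS`, and the operation names are hypothetical stand-ins, and the real `fd.clean` works on DataFrames rather than lists of operation names.

```python
from dataclasses import dataclass, field

# Hypothetical set of operations the native engine can run (assumption).
NATIVE_OPS = {"trim_whitespace", "drop_empty_rows"}


@dataclass
class Report:
    """Stand-in for an audit report that records fallbacks."""
    engine: str
    fallback_events: list = field(default_factory=list)


def clean(ops, engine="pandas"):
    """Route each requested operation and record any delegation."""
    report = Report(engine=engine)
    executed = []
    for op in ops:
        if engine == "freshcore" and op in NATIVE_OPS:
            executed.append((op, "freshcore"))
            continue
        if engine == "freshcore":
            # Unsupported work is delegated and recorded, never dropped.
            report.fallback_events.append(
                {"operation": op, "reason": "not supported natively"}
            )
        executed.append((op, "pandas"))
    return executed, report


requested = ["trim_whitespace", "semantic_cleaning"]
executed, report = clean(requested, engine="freshcore")
default_executed, default_report = clean(requested)  # default path untouched
```

In this sketch the default call never consults the native engine, and an opt-in call still produces a result for every requested operation; the only visible difference is the entry in `fallback_events`.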

The Python adapter lives at freshdata.execution.backends._freshcore. It materializes pandas-compatible inputs, sends compact columns and plan parameters to the Rust module, then maps native output back into a pandas DataFrame and the standard CleanReport audit contract.

Native engine

The Rust crate is in crates/freshcore and exposes the freshdata_freshcore PyO3 module. Its internal model is deliberately smaller than a general-purpose DataFrame:

  • typed nullable arrays (Float, Bool, Utf8)
  • explicit validity/null representation through Option<T> rather than in-band sentinel values
  • a compact physical cleaning plan
  • native kernels for strings, missing values, casts, duplicates, outliers, and simple profiles
  • operation-level timing records for benchmarks
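To make the data model concrete, here is a rough Python analogue of a typed nullable column, with `None` playing the role of Rust's `Option::None`. The `Column` class and `build_column` helper are hypothetical names invented for this sketch; the actual Rust types in crates/freshcore may be laid out differently.

```python
from dataclasses import dataclass
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Column(Generic[T]):
    """One typed nullable array; None marks a missing value."""
    name: str
    dtype: str  # "Float", "Bool", or "Utf8"
    values: List[Optional[T]]

    def null_count(self) -> int:
        return sum(v is None for v in self.values)

    def validity(self) -> List[bool]:
        """True where a value is present."""
        return [v is not None for v in self.values]


def build_column(name, raw):
    """Pick one of the three dtypes based on the non-missing values."""
    present = [v for v in raw if v is not None]
    # bool is a subclass of int in Python, so test for Bool first.
    if present and all(isinstance(v, bool) for v in present):
        return Column(name, "Bool", list(raw))
    if present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        return Column(name, "Float", [None if v is None else float(v) for v in raw])
    # Everything else, including all-missing input, is stored as text.
    return Column(name, "Utf8", [None if v is None else str(v) for v in raw])


price = build_column("price", [1, None, 2.5])
flag = build_column("flag", [True, None, False])
city = build_column("city", ["Oslo", None])
```

Keeping only three dtypes means every kernel has a small, fixed set of cases to handle, which is the trade-off the list above describes.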

FreshCore does not depend on pandas, Polars, DuckDB, Spark, Dask, Modin, Vaex, or any existing DataFrame execution engine.

V1 support boundary

FreshCore v1 runs native kernels for conservative, deterministic cleaning:

  • column name normalization
  • whitespace trimming
  • optional text case normalization via string_case=None|"lower"|"upper"
  • sentinel-to-missing normalization
  • empty row/column drops
  • full-row duplicate detection with duplicate_keep="first" or "last"
  • casts of boolean-like and numeric strings to Bool and Float where safe
  • mean/median/mode imputation
  • IQR/z-score outlier clipping or flagging
  • simple per-column profile metadata
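Three of the kernels listed above are simple enough to sketch in pure Python. The code below illustrates the behavior only and is not the Rust implementation: the `SENTINELS` set, the function names, the `k=1.5` fence multiplier, and the choice of inclusive (linear-interpolation) quartiles are all assumptions made for this example.

```python
import statistics

# Assumed sentinel spellings; the real list is not specified here.
SENTINELS = {"", "n/a", "na", "null", "none", "-"}


def normalize_cell(value):
    """Trim whitespace, then map sentinel strings to None (missing)."""
    if value is None:
        return None
    text = value.strip()
    return None if text.lower() in SENTINELS else text


def duplicate_mask(rows, keep="first"):
    """Flag full-row duplicates; the kept occurrence stays False."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    n = len(rows)
    order = range(n) if keep == "first" else range(n - 1, -1, -1)
    seen = set()
    mask = [False] * n
    for i in order:
        key = tuple(rows[i])
        if key in seen:
            mask[i] = True
        else:
            seen.add(key)
    return mask


def iqr_clip(values, k=1.5):
    """Clip non-missing values to [Q1 - k*IQR, Q3 + k*IQR]."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [None if v is None else min(max(v, low), high) for v in values]


rows = [("a", 1), ("b", 2), ("a", 1)]
first_mask = duplicate_mask(rows, keep="first")
last_mask = duplicate_mask(rows, keep="last")
clipped = iqr_clip([1, 2, None, 3, 4, 100])
```

Each function is deterministic and depends only on its input column or rows, which is what makes this class of cleaning a conservative candidate for a native kernel.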

FreshCore falls back to the pandas reference pipeline for:

  • semantic cleaning
  • context/policy protection
  • cleaning memory
  • domain packs
  • contracts
  • non-default indexes
  • duplicate subsets
  • aggregate/drop duplicate modes
  • model-based outliers
  • constant-column dropping
  • memory downcasting
  • the balanced/aggressive decision engine

Benchmarking

Build the native module before benchmarking:

pip install -e ".[dev,freshcore]"
maturin develop --manifest-path crates/freshcore/Cargo.toml --features extension-module
python benchmarks/bench_freshcore.py --rows 10000 100000 1000000 --workload full

The benchmark compares:

  • a handwritten pandas baseline
  • FreshData's pandas reference path
  • FreshCore through engine="freshcore"

The output includes timing, peak memory, fallback events, and shape parity checks. When the native module is installed, it also includes FreshCore stage timings.
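For readers who want to see what such a comparison measures, the following is a minimal harness in the same spirit. It is not benchmarks/bench_freshcore.py: the `measure`, `baseline`, and `candidate` names are hypothetical, and the two toy functions stand in for the real pipelines.

```python
import time
import tracemalloc


def measure(fn, *args):
    """Return (result, seconds, peak_bytes) for one call of fn."""
    # Tracing adds overhead, so timings taken this way are only comparable
    # with each other. tracemalloc sees Python-level allocations only;
    # memory allocated inside a native extension is not counted.
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


def baseline(rows):
    """Toy reference path: trim every cell with a nested comprehension."""
    return [[cell.strip() for cell in row] for row in rows]


def candidate(rows):
    """Toy alternative path that must produce the same output."""
    return [list(map(str.strip, row)) for row in rows]


rows = [[" a ", "b "] for _ in range(1000)]
ref, ref_seconds, ref_peak = measure(baseline, rows)
out, out_seconds, out_peak = measure(candidate, rows)

# Shape parity: same number of rows and columns as the reference output.
shape_ok = (len(out), len(out[0])) == (len(ref), len(ref[0]))
```

A shape check like `shape_ok` is a cheap first guard; comparing `out == ref` is the stricter value-level parity test.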