Skip to content

The cleaning engine

freshdata cleans in two layers: deterministic representation repair, then a role-aware decision engine. Every action is recorded with a rationale, a risk level, and a confidence score.

Layer 1 — representation repair (always on)

order step what it does
1 column_names snake_case names, deduplicate collisions ("a", "a""a", "a_2")
2 strip_whitespace trim surrounding whitespace in text cells (internal spacing kept)
3 normalize_sentinels "N/A", "null", "-", "", "#REF!", … → missing
4 drop_empty_columns / drop_empty_rows remove all-missing columns and rows
5 fix_dtypes text → numeric ("$1,234.56" works) / datetime / boolean, validated
6 drop_duplicates resolve duplicate rows (first / last / drop / aggregate)

Layer 2 — the decision engine

With strategy="balanced" (the default), the engine infers each column's role — id, target/label, datetime, free text, categorical, numeric — and applies explicit threshold rules.

Missing values

missing ratio numeric categorical datetime
≤ 5% (low) mean if ~normal & no outliers, else median mode if clear majority, else "Unknown" ffill/bfill if time-ordered
5–30% (medium) median (KNN only in aggressive mode) mode if dominant, else "Missing" ffill/bfill if time-ordered
> 30% (high) preserved + warning (balanced); dropped in aggressive unless informative same same

Role gates run first

Targets are never modified, IDs are never imputed, and free text is never force-filled — those columns are preserved with the reason written into the report, so a remaining NaN is never silent. A <col>_was_missing indicator column is added when missingness itself correlates with other features. On frames under 30 rows, the engine preserves and recommends manual review instead of guessing on noisy ratios.

Outliers

Detection methods: IQR fences (default), z-score, "auto" (z-score for ~normal columns, IQR for skewed), or "isolation_forest" (scikit-learn, ≥ 100 rows, falls back to IQR).

The default outlier_action="auto" is context-aware: it flags (adds a boolean <col>_outlier column) under balanced mode and caps (winsorizes to the fences) under aggressive mode, and flags heavy-tailed columns (>15% outlying) rather than rewriting real data. Setting an explicit "cap", "remove", or "flag" is a directive applied to every eligible numeric column — heavy-tailed columns too, with a warning. Outliers in ID/target columns, preserve_columns, and domain-sensitive columns (AQI, pollutants, fraud / risk names) are always preserved — there the extremes usually are the signal.

Duplicates

Exact duplicates are removed by default (count and percentage reported). Time-indexed frames never lose rows unless allow_timeseries_duplicates=True. A duplicate ratio above duplicate_threshold (10%) raises a quality warning. With duplicate_subset, duplicate_keep="aggregate" collapses each group (numeric mean, first non-missing otherwise).

MissForest-style imputation

The default engine keeps fast, conservative missing-value rules. For nonlinear mixed tabular data, you can opt into MissForest-style random-forest imputation with the optional ML extra:

pip install "freshdata-cleaner[ml]"
cleaned, report = fd.clean(
    df,
    impute_method="missforest",
    target_column="churn",
    id_columns=("customer_id",),
    return_report=True,
)

Column-level overrides can mix MissForest with simple strategies:

cleaned, report = fd.clean(
    df,
    impute_strategy={
        "age": "missforest",
        "income": "median",
        "segment": "missforest",
    },
    return_report=True,
)

MissForest starts with safe median/mode fills, then iteratively trains random forest regressors for numeric targets and classifiers for categorical/boolean targets. Each action reports the model type, imputed count, iteration count, convergence status, confidence, risk, optional OOB score, indicator status, and any fallback reason.

Use it when missing values depend on nonlinear relationships across several columns. Avoid it for tiny datasets, mostly missing columns, high-cardinality categorical fields, free text, IDs, targets, and latency-sensitive workloads. Those cases fall back to safe simple fills or preservation with an audit note.

Tuning

fd.clean(
    df,
    strategy="balanced",             # "aggressive" | "conservative"
    missing_threshold_low=0.05,      # band edges for the missing-value rules
    missing_threshold_medium=0.30,
    missing_threshold_high=0.60,
    duplicate_threshold=0.10,
    outlier_method="iqr",            # "zscore" | "auto" | "isolation_forest"
    outlier_action="auto",           # context-aware; "cap" | "remove" | "flag" | None
    target_column="churn",
    preserve_columns=("notes",),
    id_columns=("ref",),
    impute_method="missforest",      # optional; requires freshdata-cleaner[ml]
    missforest_max_iter=5,
    missforest_n_estimators=100,
    return_report=True,
)

Explicit choices always override the engine. Every option lives on one frozen dataclass — CleanConfig — and unknown names fail immediately with a "did you mean" suggestion.

What freshdata will not do

  • Touch a target/label column, impute an identifier, or force-fill free text.
  • Remove outliers blindly — fraud/anomaly-style columns keep their extremes.
  • Guess fuzzy entity resolution in clean() — that is opt-in via the enterprise layer's clustering.
  • Parse ambiguous European decimal commas ("1.234,56") — too risky to guess.
  • Mutate your DataFrame (unless you pass preserve_original=False).