Semantic cleaning layer

FreshData's core repairs representation (whitespace, sentinels, dtypes, duplicates) and runs a statistical engine for missing values and outliers. The Semantic Cleaning Layer adds a complementary stage that repairs meaning: values that are syntactically fine but semantically wrong.

Examples it handles, deterministically and offline:

Issue type	Example	Becomes	Where
`spelled_number`	`"twenty"`	`20`	numeric-like columns
`boolean_synonym`	`"yes"`, `"y"`	`True`	boolean-like columns
`currency_string`	`"$1,200.50"`	`1200.5`	money-like columns
`unit_suffix`	`"10 kg"`	`10`	unit-consistent columns
`category_synonym`	`"M"`, `"Male "`	`"male"`	low-cardinality categoricals
`date_phrase`	`"2026-07-01"`, `"today"`	`Timestamp`	date-like columns
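
To make the table concrete, here is a minimal sketch of three value-level repairs written in plain Python. It is not FreshData's implementation: the function names, the synonym table, and the set of recognized currency symbols and units are all assumptions chosen for illustration.

```python
import re

# Hypothetical vocabularies; the library's real tables are not shown in this document.
BOOLEAN_SYNONYMS = {"yes": True, "y": True, "true": True,
                    "no": False, "n": False, "false": False}
CURRENCY_RE = re.compile(r"^[$€£]\s?(\d{1,3}(?:,\d{3})*(?:\.\d+)?|\d+(?:\.\d+)?)$")
UNIT_RE = re.compile(r"^(\d+(?:\.\d+)?)\s*(kg|g|lb)$")


def repair_boolean(value):
    """Map a boolean synonym to True/False; return None when unrecognized."""
    return BOOLEAN_SYNONYMS.get(value.strip().lower())


def repair_currency(value):
    """Parse a currency string with well-formed thousands separators into a float."""
    match = CURRENCY_RE.match(value.strip())
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def repair_unit_suffix(value, unit):
    """Strip a unit suffix, but only when it matches the column's expected unit."""
    match = UNIT_RE.match(value.strip().lower())
    if match is None or match.group(2) != unit:
        return None
    number = float(match.group(1))
    return int(number) if number.is_integer() else number
```

Each helper returns None rather than guessing when the input does not match, which mirrors the layer's preference for leaving a value alone over repairing it wrongly.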

How it works

Every change flows through a fixed, auditable pipeline:

1. profile semantic issues (on distinct values only, never row by row);
2. generate repair proposals via deterministic experts;
3. score each proposal (explainable confidence + risk);
4. validate against policy, config, and column protection;
5. auto-apply only when safe;
6. otherwise record a suggestion for review;
7. preserve every decision in the CleanReport.
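
The flow above can be sketched in a few lines of plain Python. This is a toy model, not the library's code: `Proposal`, `spelled_number_expert`, and `run_pipeline` are hypothetical names, and the fixed confidence of 0.99 is invented for the example. The point it illustrates is that experts run once per distinct value and the accepted repairs are then mapped back onto every row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    raw: str
    proposed: object
    confidence: float
    risk: str  # "low" or "high" in this sketch


def spelled_number_expert(value):
    """Toy deterministic expert: proposes an integer for a few spelled-out numbers."""
    words = {"ten": 10, "twenty": 20, "thirty": 30}
    key = value.strip().lower()
    if key in words:
        return Proposal(raw=value, proposed=words[key], confidence=0.99, risk="low")
    return None


def run_pipeline(values, expert, auto_threshold=0.95):
    # Step 1: profile distinct string values only, so each is examined once.
    distinct = sorted({v for v in values if isinstance(v, str)})
    # Steps 2-3: each expert call yields a scored proposal or nothing.
    proposals = [p for p in (expert(v) for v in distinct) if p is not None]
    # Steps 4-6: apply only confident, non-high-risk proposals; suggest the rest.
    applied, suggested = {}, []
    for p in proposals:
        if p.confidence >= auto_threshold and p.risk != "high":
            applied[p.raw] = p.proposed
        else:
            suggested.append(p)
    # Map accepted repairs back onto all rows.
    cleaned = [applied.get(v, v) if isinstance(v, str) else v for v in values]
    # Step 7: keep an audit trail of every decision.
    audit = [("applied", raw, new) for raw, new in applied.items()]
    audit += [("suggested", p.raw, p.proposed) for p in suggested]
    return cleaned, audit
```

Because the work is done per distinct value, the cost of proposing and scoring depends on column cardinality rather than row count.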

An LLM never mutates the DataFrame. Everything is fully deterministic and offline (no network, no API keys). Retrieval-backed semantic memory replay is implemented as a local, server-free extension; the interfaces still leave a clean extension point for an optional future private-LLM candidate generator.

Enabling it

The layer is off by default. Opt in with semantic_mode:

Mode	Behavior
`None`/`"off"`	disabled (default)
`"assist"`	detect and report proposals only; never mutates
`"review"`	apply zero-risk deterministic repairs; suggest the rest
`"auto"`	apply proposals that pass policy, have confidence ≥ `semantic_auto_threshold`, and are not high-risk

import freshdata as fd

# assist: suggestions only, no mutation
cleaned, report = fd.clean(df, semantic_mode="assist", return_report=True)

# auto: safe repairs applied, with column hints
cleaned, report = fd.clean(
    df,
    semantic_mode="auto",
    semantic_context={
        "columns": {
            "age": {"semantic_type": "number"},
            "customer_id": {"semantic_type": "identifier", "mutable": False},
        }
    },
    return_report=True,
)
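
One way to read the mode table is as a small decision function. The sketch below is an interpretation, not FreshData's code: `decide` is a hypothetical helper, the risk labels "zero", "low", and "high" are assumed, and treating a policy failure as a suggestion (rather than a skip) is also an assumption.

```python
def decide(mode, confidence, risk, passes_policy, auto_threshold=0.95):
    """Return "skipped", "suggested", or "applied" for a single proposal."""
    if mode in (None, "off"):
        return "skipped"        # layer disabled
    if mode == "assist":
        return "suggested"      # report only, never mutate
    if not passes_policy:
        return "suggested"      # assumption: policy failures are downgraded
    if mode == "review":
        # Only zero-risk repairs are applied; everything else is suggested.
        return "applied" if risk == "zero" else "suggested"
    if mode == "auto":
        safe = confidence >= auto_threshold and risk != "high"
        return "applied" if safe else "suggested"
    raise ValueError(f"unknown semantic_mode: {mode!r}")
```

Note that each mode is strictly more permissive than the one before it, so moving from "assist" to "review" to "auto" only ever turns suggestions into applied repairs.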

Date phrase repair

DatePhraseExpert normalizes obvious date phrases and strings in date-like columns (matched by role, semantic_type, a date-like column name, or a strong share of parseable values): ISO/slash-ordered dates (2026-07-01, 2026/07/01), month-name dates (July 1 2026, 1 July 2026), and numeric dates where day/month order is unambiguous.

"today"/"yesterday"/"tomorrow" are only resolved when you supply an explicit reference_date — the real wall-clock date is never used silently:

cleaned, report = fd.clean(
    df,
    semantic_mode="auto",
    semantic_context={
        "reference_date": "2026-07-01",
        "columns": {"signup_date": {"semantic_type": "date"}},
    },
    return_report=True,
)
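
The rule that relative phrases need an explicit reference date can be sketched with the standard library alone. `resolve_relative` is a hypothetical helper written for this page, not part of FreshData; it shows the idea of refusing to fall back on the system clock.

```python
from datetime import date, timedelta

RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def resolve_relative(phrase, reference_date=None):
    """Resolve a relative date phrase against an explicit ISO reference date.

    Returns None when the phrase is unknown or no reference date was given;
    date.today() is deliberately never consulted.
    """
    key = phrase.strip().lower()
    if key not in RELATIVE_OFFSETS or reference_date is None:
        return None
    base = date.fromisoformat(reference_date)
    return base + timedelta(days=RELATIVE_OFFSETS[key])
```

Keeping the clock out of the function means the same input and the same reference date always give the same result, whenever the job is rerun.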

Numeric dates in which both leading components are 12 or less (for example "01/02/2026") are ambiguous and are never auto-applied. They are recorded as high-risk suggestions with human_review=True unless you state the convention explicitly with dayfirst:

semantic_context={"columns": {"event_date": {"semantic_type": "date", "dayfirst": True}}}
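
The ambiguity test itself is simple to express. The following sketch assumes slash-separated dates with a four-digit year and uses a hypothetical `parse_numeric_date` helper; FreshData's own parser may accept more formats.

```python
import re
from datetime import date

NUMERIC_DATE_RE = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$")


def parse_numeric_date(text, dayfirst=None):
    """Return (parsed_date, ambiguous).

    parsed_date is None when the text is not a valid date or when the
    day/month order cannot be decided without a dayfirst hint.
    """
    match = NUMERIC_DATE_RE.match(text.strip())
    if match is None:
        return None, False
    first, second, year = (int(g) for g in match.groups())
    if first > 12 and second > 12:
        return None, False               # neither component can be a month
    if first > 12:
        day, month = first, second       # only day-first is possible
    elif second > 12:
        day, month = second, first       # only month-first is possible
    elif dayfirst is None:
        return None, True                # both orders are valid: ambiguous
    else:
        day, month = (first, second) if dayfirst else (second, first)
    try:
        return date(year, month, day), False
    except ValueError:                   # e.g. 31 February
        return None, False
```

With no hint, "01/02/2026" comes back flagged as ambiguous; with dayfirst=True it resolves to 1 February 2026.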

Applied date repairs convert the column to pandas datetime64 dtype only when every value in the column resolved successfully — a column with any leftover unconverted string stays as-is, so the conversion never silently drops information.
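
The all-or-nothing rule can be illustrated without pandas. In this sketch, `convert_if_complete` is a hypothetical stand-in that handles only ISO-formatted strings and treats None as a missing value; the real layer works on pandas columns and datetime64.

```python
from datetime import date


def convert_if_complete(values):
    """Convert ISO date strings to date objects only if every non-null value parses.

    Returns (result, converted). On any failure the original values are
    returned untouched so that nothing is lost.
    """
    converted = []
    for value in values:
        if value is None:
            converted.append(None)
            continue
        try:
            converted.append(date.fromisoformat(value))
        except (TypeError, ValueError):
            return list(values), False
    return converted, True
```

A single unparseable entry such as "soon" is enough to leave the whole column as strings, which keeps that entry visible for review instead of coercing it to a missing value.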

Semantic memory replay

Accepted semantic repairs can be learned into a CleaningMemory (fd.learn_cleaning_memory) and replayed as candidate proposals on similar future data:

cleaned, report = fd.clean(df, semantic_mode="auto", return_report=True)

memory = fd.learn_cleaning_memory(df, decisions=report, dataset_id="crm_signups")

next_cleaned, next_report = fd.clean(
    next_df,
    semantic_mode="auto",
    memory=memory,
    return_report=True,
)

Memory is local and server-free: learned repairs live in memory.value_patterns["semantic_repairs"] and are stored the same way as the rest of CleaningMemory's JSON/SQLite data.
Memory is evidence, not authority: every retrieved repair still passes through the same policy gate as a deterministic proposal, so it can never override target_column/id_columns/preserve_columns protection, the confidence floor, or the ambiguous-date rules above.
Replayed repairs are fully audited: memory_influenced=True and model_id="semantic:<issue_type>:memory" mark which decisions came from memory, and Action.metadata carries the raw/proposed value and evidence either way.
Retrieval matches on the exact normalized value first, falling back to a lightweight, no-dependency similarity check (difflib) for minor value drift; low-similarity or conflicting repairs are never auto-applied.
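
The exact-then-similar lookup can be sketched with difflib. The helper names, the flat dict used as the store, and the 0.85 cutoff are assumptions made for this example; the document does not state the similarity threshold FreshData uses.

```python
import difflib


def normalize(value):
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(value.strip().lower().split())


def retrieve_repair(value, learned, cutoff=0.85):
    """Look up a learned repair for a value.

    learned maps normalized raw values to repaired values.
    Returns (repair, kind) where kind is "exact", "similar",
    "conflict", or "none".
    """
    key = normalize(value)
    if key in learned:
        return learned[key], "exact"
    close = difflib.get_close_matches(key, list(learned), n=2, cutoff=cutoff)
    if not close:
        return None, "none"
    # If the nearest neighbours disagree, refuse to pick one.
    if len({learned[c] for c in close}) > 1:
        return None, "conflict"
    return learned[close[0]], "similar"
```

A caller would then treat "similar" results as lower-confidence candidates and still pass every result through the policy gate, consistent with memory being evidence rather than authority.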

Configuration

`CleanConfig` field	Default	Meaning
`semantic_mode`	`None`	`None`/`"off"`/`"assist"`/`"review"`/`"auto"`
`semantic_auto_threshold`	`0.95`	min confidence to auto-apply
`semantic_review_threshold`	`0.70`	below this, never apply (skip)
`semantic_max_distinct_values`	`500`	skip very high-cardinality columns
`semantic_sample_size`	`10_000`	rows sampled when profiling distinct values
`semantic_backends`	`("deterministic",)`	candidate backends (extension point)
`semantic_context`	`None`	per-column hints (semantic_type/unit/allowed_values/mutable/dayfirst) plus a top-level `reference_date` for relative date phrases
`semantic_privacy_policy`	`"local_only"`	privacy posture for future external inference
`semantic_budget`	`None`	budget hints (extension point)
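
As a rough illustration of how `semantic_sample_size` and `semantic_max_distinct_values` might interact, the sketch below samples a column and checks its distinct count against the cap. `should_profile` is a hypothetical helper, and the seeded sampling is an assumption made to keep the example repeatable; the document does not describe how FreshData samples.

```python
import random


def should_profile(values, max_distinct=500, sample_size=10_000, seed=0):
    """Return True when the sampled column is low-cardinality enough to profile."""
    values = list(values)
    if len(values) > sample_size:
        # A fixed seed keeps the decision the same on every run.
        values = random.Random(seed).sample(values, sample_size)
    distinct = {v for v in values if v is not None}
    return len(distinct) <= max_distinct
```

The defaults in the sketch match the table above, so a free-text column with thousands of distinct values would be skipped while a status column with a handful of values would be profiled.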

Safety guarantees

Columns listed in id_columns, target_column, and preserve_columns are protected. Identifier-like columns are vetoed unless semantic_context marks them mutable.
Ambiguous repairs are suggestions, not silent mutations (status="suggested", human_review=True).
Codes are never mangled — zero-padded numbers ("007"), mixed alphanumerics ("105A"), and punctuated handles ("A-100", "D@vid") are left alone.
Deterministic and repeatable — the same input always produces the same report.
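
The "codes are never mangled" guarantee implies some check that separates identifiers from repairable values. A heuristic of that kind might look like the sketch below; `looks_like_code` and its three rules are assumptions for illustration, not the library's actual guard.

```python
import re


def looks_like_code(value):
    """Heuristic: True for values that look like identifiers rather than data."""
    v = value.strip()
    if re.fullmatch(r"0\d+", v):
        return True                      # zero-padded number such as "007"
    has_digit = any(c.isdigit() for c in v)
    has_alpha = any(c.isalpha() for c in v)
    if has_digit and has_alpha and " " not in v:
        return True                      # mixed alphanumerics: "105A", "A-100"
    if has_alpha and " " not in v and re.search(r"[^\w\s]", v):
        return True                      # punctuated handle such as "D@vid"
    return False
```

A heuristic this simple errs toward protecting values: an unspaced "10kg" would also be treated as a code and left alone, which is the safer of the two possible mistakes.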

Auditing

Each decision appears in the report with step="semantic":

for a in report:
    if a.step == "semantic":
        print(a.status, a.column, a.confidence, a.risk, a.model_id, a.description)
        print(a.memory_influenced, a.metadata)  # metadata carries raw/proposed value + evidence
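
For larger reports, a per-column tally is often easier to scan than individual lines. The sketch below relies only on the `step`, `column`, and `status` attributes shown above; `summarize_semantic` is a hypothetical helper, and the demo records are stand-ins built with SimpleNamespace rather than real report entries.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_semantic(actions):
    """Count semantic decisions by (column, status)."""
    return Counter(
        (a.column, a.status) for a in actions if a.step == "semantic"
    )


# Stand-in records for illustration; a real CleanReport would be passed instead.
demo = [
    SimpleNamespace(step="semantic", column="age", status="applied"),
    SimpleNamespace(step="semantic", column="age", status="suggested"),
    SimpleNamespace(step="semantic", column="age", status="suggested"),
    SimpleNamespace(step="dtype", column="age", status="applied"),
]
summary = summarize_semantic(demo)
```

Since the loop above iterates the report directly, the same helper should accept it as-is, assuming each entry exposes those three attributes.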

Preview proposal counts before running, via suggest_plan:

plan = fd.suggest_plan(df, semantic_mode="assist")
plan.to_frame()[["column", "semantic_proposals"]]

Backends

Semantic cleaning runs on the in-memory pandas path. When you select a native engine (Polars, DuckDB, Spark), FreshData routes the clean through pandas and records the fallback in report.fallback_events rather than silently skipping the semantic stage.