Decision-preserving workflow
freshdata is the explainable cleaning layer: clean once, explain always, remember next time. These features turn a one-off clean into a reusable, auditable workflow from notebook exploration to production governance.
Cleaning memory
Record human-approved decisions and replay them on similar future data.
import freshdata as fd
cleaned, report = fd.clean(df, return_report=True)
memory = fd.learn_cleaning_memory(df, decisions=report, dataset_id="crm_contacts")
memory.to_json("crm_memory.json") # or "crm_memory.sqlite" — no server
# next week, on new data of the same shape:
memory = fd.load_cleaning_memory("crm_memory.json")
cleaned, report = fd.clean(df_next, memory=memory, return_report=True)
Memory stores the dataset signature, column roles, accepted/rejected decisions,
thresholds, value-pattern decisions and stakeholder exceptions, with timestamp
and version metadata. On replay freshdata checks the new data still matches what
it learned; if it drifted too much, replay is blocked and the report explains
why rather than applying stale decisions. Replayed actions are marked
status="approved" / memory_influenced=True so the audit trail is honest.
memory.diff(other) and memory.summary() round out the API.
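The replay guard described above can be pictured with a small stdlib-only sketch. It is not freshdata's implementation: `MemorySketch`, `signature_overlap`, `replay`, and the `min_overlap` threshold are hypothetical names chosen for illustration, and the "signature" here is assumed to be just a column-to-dtype mapping.

```python
from dataclasses import dataclass, field


@dataclass
class MemorySketch:
    """Hypothetical stand-in for a cleaning memory; not freshdata's class."""
    signature: dict  # column name -> dtype name seen when learning
    decisions: list = field(default_factory=list)  # approved actions to replay


def signature_overlap(learned: dict, current: dict) -> float:
    """Fraction of learned columns still present with the same dtype."""
    if not learned:
        return 1.0
    same = sum(1 for col, dtype in learned.items() if current.get(col) == dtype)
    return same / len(learned)


def replay(memory: MemorySketch, current_signature: dict, min_overlap: float = 0.8) -> dict:
    """Replay approved decisions, or block and explain when the data drifted."""
    overlap = signature_overlap(memory.signature, current_signature)
    if overlap < min_overlap:
        # Drifted too far: apply nothing and say why.
        return {
            "status": "blocked",
            "reason": f"signature overlap {overlap:.2f} is below {min_overlap:.2f}",
            "actions": [],
        }
    # Mark every replayed action so the audit trail shows where it came from.
    actions = [
        {**decision, "status": "approved", "memory_influenced": True}
        for decision in memory.decisions
    ]
    return {"status": "replayed", "reason": None, "actions": actions}


memory = MemorySketch(
    signature={"email": "object", "phone": "object", "signup_date": "datetime64[ns]"},
    decisions=[{"column": "phone", "action": "normalize_e164"}],
)

same_shape = replay(
    memory,
    {"email": "object", "phone": "object", "signup_date": "datetime64[ns]"},
)
drifted = replay(memory, {"email": "object", "phone": "int64"})

print(same_shape["status"], drifted["status"], drifted["reason"])
```

In the drifted case only one of the three learned columns still matches, so the sketch returns no actions and a reason string instead of applying stale decisions.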
Compare to a baseline (drift)
Compare the current data's quality against an earlier snapshot in one call.
diff = fd.compare_to_baseline(
    current_df,
    baseline=last_week_df,  # a raw DataFrame, or a prebuilt DatasetBaseline
    key="customer_id",
    event_time="updated_at",
)
diff.what_likely_matters() # business-language highlights
diff.show() # interactive drift report
Reports schema diff (added / removed / dtype-changed columns), completeness and
duplicate deltas, category churn, distribution shift, and — with key= —
key-level record counts (added / removed / changed). .to_frame(), .to_dict()
and .to_html() are all available.
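To make the key-level counts and the completeness delta concrete, here is a minimal sketch over plain lists of dicts. The helper names `key_level_diff` and `completeness` are hypothetical and say nothing about how freshdata computes these figures internally.

```python
def key_level_diff(baseline_rows, current_rows, key):
    """List keys that were added, removed, or changed between two snapshots."""
    baseline = {row[key]: row for row in baseline_rows}
    current = {row[key]: row for row in current_rows}
    shared = current.keys() & baseline.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "changed": sorted(k for k in shared if current[k] != baseline[k]),
    }


def completeness(rows, column):
    """Share of rows with a non-missing value in the given column."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)


last_week = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": "b@example.com"},
    {"customer_id": 3, "email": "c@example.com"},
]
this_week = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},  # value went missing
    {"customer_id": 4, "email": "d@example.com"},  # new record
]

records = key_level_diff(last_week, this_week, key="customer_id")
email_delta = completeness(this_week, "email") - completeness(last_week, "email")

print(records)
print(f"email completeness delta: {email_delta:+.3f}")
```

Without a key, only aggregate deltas such as completeness can be computed; the key is what lets a comparison say which records were added, removed, or changed.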
Quality-debt ledger
A middle ground between hard failure and ignoring issues.
cleaned, gate = fd.evaluate_quality_debt(
    df,
    debt_policy="warn_then_fail",  # "warn" | "fail" | "warn_then_fail"
    ledger="quality_debt.sqlite",
)
print(gate.status) # "pass" | "warn" | "fail"
Scores nine debt dimensions (missingness, duplicates, schema drift, type instability, outlier spikes, PII risk, category churn, failed repairs, human-review backlog), persists the history to SQLite, and escalates warn→fail when an issue repeats or worsens across runs.
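The warn-then-fail escalation can be sketched with the standard-library `sqlite3` module. Everything specific here is an assumption made for illustration: the `debt` table layout, the `warn_at` threshold, and the rule that a dimension fails when it was already over the threshold on the previous run. The real ledger schema and scoring are not shown in this page.

```python
import sqlite3


def evaluate(conn, run_id, scores, warn_at=0.1):
    """Return "pass", "warn", or "fail" for one run and record its scores."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS debt (run_id INTEGER, dimension TEXT, score REAL)"
    )
    status = "pass"
    for dimension, score in scores.items():
        previous = conn.execute(
            "SELECT score FROM debt WHERE dimension = ? ORDER BY run_id DESC LIMIT 1",
            (dimension,),
        ).fetchone()
        if score >= warn_at:
            # First time over the threshold warns; still over it on the
            # next run (repeated or worse) escalates to a failure.
            repeated = previous is not None and previous[0] >= warn_at
            if repeated:
                status = "fail"
            elif status == "pass":
                status = "warn"
        conn.execute("INSERT INTO debt VALUES (?, ?, ?)", (run_id, dimension, score))
    conn.commit()
    return status


conn = sqlite3.connect(":memory:")  # a file path would persist the history
run_1 = evaluate(conn, 1, {"missingness": 0.02, "duplicates": 0.00})
run_2 = evaluate(conn, 2, {"missingness": 0.15, "duplicates": 0.00})
run_3 = evaluate(conn, 3, {"missingness": 0.20, "duplicates": 0.00})

print(run_1, run_2, run_3)
```

Keeping the history in the ledger is what makes the middle ground possible: a single bad run produces a warning, while the same issue on consecutive runs blocks the pipeline.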
Dirty-join assistant
Reviewable fuzzy joins for messy keys — never a silent low-confidence join.
matches = fd.suggest_join_keys(
    left_df, right_df,
    on=["company_name", "address"],
    exact_within=["country"],  # blocking key: must match exactly
)
matches.to_frame() # candidates with confidence + per-field scores
Suggests exact keys, ranks fuzzy candidates with confidence and per-field explanations, groups candidates by block, and flags ambiguous matches in a separate section for human review.
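The idea of blocking plus per-field scoring can be illustrated with `difflib.SequenceMatcher` from the standard library. This is a sketch under stated assumptions, not the assistant's actual matcher: `similarity`, `suggest_matches`, the `accept_at` threshold, the `margin` used to flag ambiguity, and the plain average used as confidence are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def suggest_matches(left, right, on, exact_within, accept_at=0.85, margin=0.05):
    """Best candidate per left row, with per-field scores and an ambiguity flag."""
    results = []
    for i, left_row in enumerate(left):
        candidates = []
        for j, right_row in enumerate(right):
            # Blocking: skip pairs that disagree on any exact-match column.
            if any(left_row[col] != right_row[col] for col in exact_within):
                continue
            field_scores = {col: similarity(left_row[col], right_row[col]) for col in on}
            confidence = sum(field_scores.values()) / len(field_scores)
            candidates.append(
                {"left": i, "right": j, "confidence": confidence, "field_scores": field_scores}
            )
        if not candidates:
            continue
        candidates.sort(key=lambda c: c["confidence"], reverse=True)
        best = candidates[0]
        runner_up = candidates[1]["confidence"] if len(candidates) > 1 else 0.0
        # Low confidence, or a close second place, goes to human review.
        best["ambiguous"] = (
            best["confidence"] < accept_at or best["confidence"] - runner_up < margin
        )
        results.append(best)
    return results


left_rows = [{"company_name": "Acme Corp", "address": "12 High St", "country": "GB"}]
right_rows = [
    {"company_name": "ACME Corp.", "address": "12 High Street", "country": "GB"},
    {"company_name": "Acme Corp", "address": "12 High St", "country": "US"},
]

suggestions = suggest_matches(
    left_rows, right_rows, on=["company_name", "address"], exact_within=["country"]
)
print(suggestions)
```

The second right-hand row matches the text exactly but sits in a different country block, so it is never scored. That is the point of a blocking key: it keeps look-alike records in other blocks from being joined.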
Text & encoding lint
lint = fd.lint_text_encoding(
    df, columns=["name", "address", "city"], locale_hints=["en_IN", "ar_AE"]
)
lint.to_frame()
Detects mixed scripts, mojibake-like artifacts, NFC/NFD inconsistency, RTL/LTR risk, locale-ambiguous dates/numbers, replacement characters and stray control/zero-width whitespace. Diagnostic only — each issue carries a severity, examples, a suggested repair, and whether that repair is safe to automate. Nothing is modified.
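A few of these checks can be approximated with the standard-library `unicodedata` module. The sketch below is diagnostic only, like the lint itself, and covers only a subset of the checks (no mojibake or locale detection). The function name `lint_value`, the issue labels, and the short script list are hypothetical.

```python
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}
SCRIPTS = {"LATIN", "CYRILLIC", "GREEK", "ARABIC"}  # small illustrative subset


def lint_value(text):
    """Return a list of issue labels for one string; the string is not modified."""
    issues = []
    if "\ufffd" in text:
        issues.append("replacement_character")
    if any(ch in ZERO_WIDTH for ch in text):
        issues.append("zero_width_character")
    if any(unicodedata.category(ch) == "Cc" and ch not in "\t\n\r" for ch in text):
        issues.append("control_character")
    if text != unicodedata.normalize("NFC", text):
        issues.append("not_nfc_normalized")
    # The first word of a letter's Unicode name is a rough proxy for its script.
    seen = {unicodedata.name(ch, "").split(" ")[0] for ch in text if ch.isalpha()}
    if len(seen & SCRIPTS) > 1:
        issues.append("mixed_scripts")
    return issues


samples = {
    "clean": "Delhi",
    "decomposed": "Jose\u0301",      # "e" followed by a combining acute accent
    "zero_width": "Acme\u200b Ltd",  # hidden zero-width space
    "homoglyph": "p\u0430ypal",      # Cyrillic "a" inside Latin text
    "lossy": "Caf\ufffd",            # replacement character from a bad decode
}
findings = {label: lint_value(value) for label, value in samples.items()}

for label, issues in findings.items():
    print(label, issues)
```

Returning labels rather than repaired strings keeps the decision with the reviewer. NFC normalization, for example, is usually safe to automate, while a mixed-script value may be a legitimate name.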
Stakeholder-safe summaries
summary = fd.stakeholder_summary(report, audience="business", format="markdown")
print(summary.to_markdown()) # or summary.to_html()
Example output:
Customer email completeness fell from 98.7% to 92.1%. Five columns changed meaningfully. Phone numbers were normalized automatically. Suspicious revenue outliers were preserved for review.