Feature overview
Core
| Feature | Description |
|---|---|
| Automated cleaning | fd.clean(df) handles missing values, outliers, duplicates, dtype repair, and column names in one call. |
| Decision engine | Per-column actions chosen from inferred role + explicit threshold rules. |
| Explainable reports | Every action carries a rationale, risk level, and confidence score. |
| Profiling | fd.profile(df) — read-only data-quality insight using the same inference as clean. |
| Plans & comparisons | fd.suggest_plan, fd.compare_plans, fd.compare_clean, fd.explain_clean. |
| Safe defaults | Targets, IDs, and free-text columns are protected from leakage and corruption. |
| Typed & tested | py.typed, 800+ tests, 95%+ coverage, mypy-clean. |
| pandas-first | Pure pandas + NumPy core; no heavy dependencies required. |
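The decision-engine and explainable-report rows can be pictured with a small sketch. This is not freshdata's implementation: the role names, the thresholds, and the `Decision` and `decide` names are hypothetical, chosen only to show how a per-column action might carry a rationale, a risk level, and a confidence score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    action: str
    rationale: str
    risk: str          # "low", "medium", or "high"
    confidence: float  # 0.0 to 1.0


def decide(role: str, missing_ratio: float) -> Decision:
    """Pick an action for one column from its role and missing-value ratio.

    All thresholds here are illustrative assumptions, not freshdata defaults.
    """
    # Protected roles are never modified (mirrors the "safe defaults" idea).
    if role in {"target", "id", "free_text"}:
        return Decision("skip", f"{role} columns are protected", "low", 1.0)
    if missing_ratio == 0.0:
        return Decision("keep", "no missing values", "low", 1.0)
    if missing_ratio > 0.6:
        return Decision(
            "drop_column", f"{missing_ratio:.0%} missing exceeds 60%", "high", 0.8
        )
    if role == "numeric":
        return Decision(
            "impute_median", f"{missing_ratio:.0%} missing, numeric role", "medium", 0.9
        )
    return Decision(
        "impute_mode", f"{missing_ratio:.0%} missing, {role} role", "medium", 0.7
    )


for role, ratio in [("id", 0.1), ("numeric", 0.05), ("categorical", 0.75)]:
    d = decide(role, ratio)
    print(f"{role:<12} {d.action:<14} {d.risk:<7} {d.rationale}")
```

Keeping the rules as explicit threshold checks, rather than a learned model, is what makes each rationale string straightforward to generate alongside the action.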
The enterprise layer
freshdata.enterprise adds opt-in governance and data-quality capabilities. It
accepts and returns either pandas or Polars DataFrames, running Polars-native
fast paths when available and falling back to vectorized pandas otherwise.
Optional dependencies are imported lazily, so a plain import freshdata is
unaffected.
| Capability | API |
|---|---|
| Full enterprise pipeline | clean_enterprise(df, *, enterprise=…) → EnterpriseResult |
| Data Trust Score (0–100) | compute_trust_score(df) → completeness / validity / uniqueness / consistency |
| Fuzzy value clustering | merge_clusters(df, cols) / cluster_column(df, col) |
| PII masking | mask_dataframe(df, rules) — hash / redact / partial / regex-scrub / drop |
| Semantic validation | run_semantic_validation(df, configs) — reference / regex / API checks |
| Lineage | LineageTracker / schema_of — OpenLineage-compatible metadata |
| Label-noise (ML) | detect_label_issues / detect_outliers — optional Cleanlab wrappers |
| Batch CLI | freshdata clean \| trust \| profile, with quality-gate exit codes |
from freshdata.enterprise import clean_enterprise, EnterpriseConfig, ClusterConfig
ec = EnterpriseConfig(
    enable_clustering=True,
    clustering=ClusterConfig(columns=("vendor",)),
    fail_under_trust=80,
)
result = clean_enterprise(df, enterprise=ec)
print(result.quality.to_markdown())
assert result.passed_gate
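To make the gate in the example above more concrete, here is a minimal sketch of how a 0 to 100 score could be built from per-dimension ratios and compared against a threshold. The equal weighting, the simple definitions of completeness and uniqueness, and the names `completeness`, `uniqueness`, `trust_score`, and `passes_gate` are all assumptions for illustration; compute_trust_score may define and weight its dimensions differently, and validity and consistency are omitted here for brevity.

```python
from statistics import mean


def completeness(rows: list[dict]) -> float:
    """Share of cells that are not None."""
    cells = [value for row in rows for value in row.values()]
    return sum(v is not None for v in cells) / len(cells) if cells else 1.0


def uniqueness(rows: list[dict]) -> float:
    """Share of rows that are distinct."""
    if not rows:
        return 1.0
    distinct = {tuple(sorted(row.items())) for row in rows}
    return len(distinct) / len(rows)


def trust_score(dimensions: dict[str, float]) -> float:
    """Equal-weight average of dimension ratios, scaled to 0-100 (assumed weighting)."""
    return round(100 * mean(dimensions.values()), 1)


def passes_gate(score: float, fail_under: float) -> bool:
    """True when the score meets or exceeds the threshold."""
    return score >= fail_under


rows = [
    {"vendor": "Acme", "amount": 10},
    {"vendor": "Acme", "amount": 10},    # duplicate row lowers uniqueness
    {"vendor": None, "amount": 25},      # missing value lowers completeness
    {"vendor": "Globex", "amount": 40},
]
dims = {"completeness": completeness(rows), "uniqueness": uniqueness(rows)}
score = trust_score(dims)
print(dims, score, passes_gate(score, fail_under=80))
```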
Compliance reports
The freshdata.compliance subpackage turns a CleanReport into a regulatory
audit artifact, mapping freshdata's transformations onto named control
frameworks — 21 CFR Part 11, GDPR (Art. 30/17), ALCOA+, SOX-404, and HIPAA Safe
Harbor. The generators are purely additive and report-only. See the
compliance reports guide.
Orchestration integrations
Run freshdata's clean + trust gate inside Dagster, Airflow, or dbt, and warn on, fail, or skip a pipeline run when data quality is low. See the orchestration integrations guide.
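The warn / fail / skip behaviour can be sketched independently of any orchestrator. The function below is a hypothetical illustration: `gate_outcome`, `QualityGateError`, the policy strings, and the returned outcome values are assumed names, not part of freshdata's Dagster, Airflow, or dbt integrations.

```python
import warnings


class QualityGateError(RuntimeError):
    """Raised when the policy is "fail" and the score is below the threshold."""


def gate_outcome(score: float, threshold: float, policy: str = "fail") -> str:
    """Return "proceed" or "skip", emit a warning, or raise, depending on policy."""
    if policy not in {"warn", "fail", "skip"}:
        raise ValueError(f"unknown policy: {policy!r}")
    if score >= threshold:
        return "proceed"
    message = f"trust score {score:.1f} is below threshold {threshold:.1f}"
    if policy == "warn":
        warnings.warn(message, stacklevel=2)
        return "proceed"  # report the problem but keep the pipeline running
    if policy == "skip":
        return "skip"     # caller should not run downstream steps
    raise QualityGateError(message)


print(gate_outcome(91.0, 80.0))
print(gate_outcome(72.5, 80.0, policy="skip"))
```

An orchestrator adapter would then translate the outcome into its own mechanism. In Airflow, for instance, a "skip" outcome would typically be turned into an AirflowSkipException, while a raised error fails the task.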
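Polars support
import polars as pl
import freshdata as fd
pl_df = pl.DataFrame({"vendor": ["Acme", None, "Acme"]})  # illustrative input; any Polars DataFrame works
cleaned = fd.clean(pl_df)  # returns a pl.DataFrame when the input is Polars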
Install with pip install "freshdata-cleaner[polars]".