API reference¶
Auto-generated from the source docstrings. Everything below is available as a
top-level attribute of freshdata (e.g. import freshdata as fd; fd.clean(...)).
Cleaning¶
freshdata.clean ¶
clean(df: DataFrame, config: CleanConfig | Mapping[str, object] | None = None, *, return_report: bool = False, source_provenance: dict[str, object] | None = None, provenance_confidence_threshold: float = 0.7, contract: object | None = None, on_unexpected: str = 'warn', on_missing: str = 'fail', domain: str | None = None, column_map: dict[str, str] | None = None, gtfs_file: str | None = None, fhir_resource: str | None = None, media_type: str | None = None, finance_mode: str | None = None, audit_include_phi: bool = False, domain_kwargs: dict[str, object] | None = None, engine: str = 'pandas', output_format: str = 'pandas', engine_config: EngineConfig | None = None, memory: object | None = None, context: str | None = None, policy: object | None = None, strict: bool = False, **options: object) -> pd.DataFrame | tuple[pd.DataFrame, CleanReport]
Clean a DataFrame and return a new, repaired one.
Two layers run in order. Representation repair always happens first:
- column_names: snake_case column names and deduplicate collisions.
- strip_whitespace: trim surrounding whitespace in text cells.
- normalize_sentinels: turn "N/A", "null", "-", "" … into missing.
- drop_empty_columns / drop_empty_rows: remove columns and rows whose cells are all missing.
- fix_dtypes: text that is really numeric / datetime / boolean gets the right dtype (validated; numeric_threshold of values must parse).
- drop_duplicates: resolve duplicate rows (duplicate_keep chooses first/last/drop/aggregate; time-indexed frames are protected).
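To make the first three repairs concrete, here is a minimal standard-library sketch of snake_casing column names with collision suffixes and of mapping sentinel strings to missing. The helper names (`snake_case`, `dedupe`, `normalize_cell`), the sentinel set, and the `_2` suffix scheme are illustrative assumptions, not freshdata's implementation.

```python
import re

# Assumed sentinel set for illustration; freshdata's real list may differ.
SENTINELS = {"n/a", "null", "-", ""}

def snake_case(name):
    """Lowercase a header and join its word parts with underscores."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())  # split camelCase
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)                   # punctuation/space -> _
    return name.strip("_").lower()

def dedupe(names):
    """Suffix repeated names with _2, _3, ... (hypothetical scheme)."""
    seen = {}
    out = []
    for n in names:
        seen[n] = seen.get(n, 0) + 1
        out.append(n if seen[n] == 1 else f"{n}_{seen[n]}")
    return out

def normalize_cell(value):
    """Strip whitespace and turn sentinel strings into None (missing)."""
    if isinstance(value, str):
        value = value.strip()
        if value.lower() in SENTINELS:
            return None
    return value

columns = dedupe([snake_case(c) for c in ["Customer ID", "customerId", " Total$ "]])
cells = [normalize_cell(v) for v in ["  Alice ", "N/A", "-", 42]]
print(columns)  # ['customer_id', 'customer_id_2', 'total']
print(cells)    # ['Alice', None, None, 42]
```

Non-string cells pass through untouched, which mirrors the idea that representation repair only rewrites how a value is spelled, not what it means.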
Then, with strategy="balanced" (the default), the decision engine
profiles every column — missing ratio, dtype, skewness, cardinality,
inferred role (id / target / datetime / text / categorical), whether
missingness looks informative — and applies threshold rules for missing
values and outliers. Nothing is done silently: every action (including
deliberately preserving a column) is logged with a rationale, a risk
level, and a confidence score. strategy="conservative" disables the
engine; imputation and outlier handling are then opt-in via impute= /
outliers=.
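As a rough illustration of a threshold rule on the missing ratio, the sketch below classifies a column into a band using the default `missing_threshold_low/_medium/_high` values from the `CleanConfig` signature on this page. The band names, the inclusive boundaries, and the `Thresholds` class are assumptions; this page does not say which action each band triggers, so the sketch stops at classification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    # Defaults copied from the CleanConfig signature on this page.
    low: float = 0.05
    medium: float = 0.3
    high: float = 0.6

def missing_ratio(values):
    """Fraction of cells that are missing (None in this sketch)."""
    return sum(v is None for v in values) / len(values) if values else 0.0

def missing_band(ratio, t=Thresholds()):
    """Classify a ratio into a band; band names are hypothetical."""
    if ratio <= t.low:
        return "low"
    if ratio <= t.medium:
        return "medium"
    if ratio <= t.high:
        return "high"
    return "extreme"

age = [31, None, 45, 28, None, 52, 39, 41, 36, 30]  # 2 of 10 cells missing
band = missing_band(missing_ratio(age))
print(band)  # medium
```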
Parameters¶
df:
The DataFrame to clean.
config:
A prebuilt :class:~freshdata.CleanConfig to start from.
return_report:
If True, return (cleaned_df, CleanReport). The report carries
per-action rationale/risk/confidence, missing counts before/after,
warnings, and recommendations for manual review.
domain:
Optional domain validator pack (e.g. "finance"). When set, generic
cleaning runs first (defaulting to strategy="conservative" so the
statistical engine never silently alters ledgers/IDs unless you pass an
explicit strategy), then the pack validates in layers and repairs
separately; findings and a domain_trust_score are folded into the
report. Unknown names raise :class:~freshdata.domains.UnknownDomainError.
column_map:
Optional {actual_column: canonical_field} overrides for the domain
pack's column detection. Requires domain to be set.
gtfs_file:
File selector for a single-frame feed-domain run, such as "stops.txt"
with domain="transport". Full feeds can instead be passed as a dict.
fhir_resource:
FHIR resource selector for domain="healthcare" ("Patient",
"Observation", "Encounter"); auto-detected from columns if omitted.
media_type:
Sub-schema selector for domain="media" ("content" / "release");
auto-detected from columns if omitted.
audit_include_phi:
For PHI-aware packs (healthcare, education), include raw PHI values in the
audit trail instead of masking them as [PHI]. Defaults to False.
domain_kwargs:
Optional pack-specific constructor arguments. These are forwarded for
both single-frame and feed-domain runs.
**options:
Any :class:~freshdata.CleanConfig field as a keyword override — e.g.
strategy ("balanced" default / "aggressive" / "conservative"),
missing_threshold_low/_medium/_high, duplicate_threshold,
outlier_method, outlier_action, preserve_original, verbose,
preserve_columns, target_column, duplicate_keep, impute,
outliers. Unknown names raise :class:TypeError.
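The keyword-override behaviour can be pictured with a frozen dataclass and `dataclasses.replace`, which already raises `TypeError` for a field name it does not know. `MiniConfig` and `with_overrides` are hypothetical stand-ins carrying three of the documented fields; how freshdata actually merges options is not shown on this page.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class MiniConfig:
    """Stand-in for CleanConfig with three of its documented fields."""
    strategy: str = "balanced"
    outlier_action: Optional[str] = "auto"
    verbose: bool = True

def with_overrides(config, **options):
    """Return a copy with overrides applied; unknown names raise TypeError."""
    return replace(config, **options)

base = MiniConfig()
quiet = with_overrides(base, outlier_action="flag", verbose=False)

try:
    with_overrides(base, outlier_acton="flag")  # typo in the option name
    rejected = False
except TypeError:
    rejected = True
```

Failing loudly on a misspelled option is the useful part: a silently ignored `outlier_acton` would leave the default in force without any signal.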
Examples¶
import freshdata as fd
cleaned = fd.clean(df)
cleaned, rep = fd.clean(df, return_report=True)
print(rep.summary())
fd.clean(df, outlier_action="flag", target_column="churn",
         preserve_columns=("notes",), verbose=False)
ledger = fd.clean(df, domain="finance")  # validate + repair
ledger, rep = fd.clean(df, domain="finance", return_report=True)
rep.domain_trust_score  # 0–1
Natural-language cleaning rules compile deterministically into a
:class:~freshdata.ContextPolicy that governs the run (protected columns,
id columns, per-column semantic hints)::
cleaned = fd.clean(df, context="CustomerID is unique. Never modify revenue.")
policy = fd.compile_context("...", df=df)  # inspect/review first
cleaned = fd.clean(df, policy=policy)  # skip parsing, use as-is
Source code in src/freshdata/api.py
freshdata.clean_csv ¶
clean_csv(path: str | Path, config: CleanConfig | Mapping[str, object] | None = None, *, output_path: str | Path | None = None, return_report: bool = False, read_csv_kwargs: dict[str, object] | None = None, to_csv_kwargs: dict[str, object] | None = None, context: str | None = None, policy: object | None = None, strict: bool = False, **options: object) -> pd.DataFrame | tuple[pd.DataFrame, CleanReport]
Read a CSV file, clean it, and optionally write the result to disk.
Parameters¶
path:
Path to the input CSV file.
output_path:
Optional path to write the cleaned CSV.
return_report:
If True, return (cleaned_df, CleanReport).
read_csv_kwargs:
Optional keyword arguments forwarded to pandas.read_csv.
to_csv_kwargs:
Optional keyword arguments forwarded to DataFrame.to_csv.
index defaults to False unless explicitly overridden.
context / policy / strict:
Natural-language rules or a pre-compiled
:class:~freshdata.ContextPolicy, forwarded to :func:freshdata.clean.
**options:
Any :class:~freshdata.CleanConfig field accepted by
:func:freshdata.clean.
Examples¶
import freshdata as fd
cleaned = fd.clean_csv("input.csv")
fd.clean_csv("input.csv", output_path="cleaned.csv")
cleaned, report = fd.clean_csv("input.csv", return_report=True)
fd.clean_csv("input.csv", context="Emails must be valid.")
Source code in src/freshdata/api.py
freshdata.Cleaner ¶
A configured, reusable cleaning pipeline.
Useful when the same settings are applied to many frames (e.g. every file in a directory), or when you want the report after the fact::
cleaner = fd.Cleaner(impute="median", drop_constant_columns=True)
for path in paths:
    cleaned = cleaner.clean(pd.read_csv(path))
    print(cleaner.report_.summary())
Attributes¶
config:
The immutable :class:~freshdata.CleanConfig in effect.
report_:
The :class:~freshdata.CleanReport from the most recent
:meth:clean call (None before the first call).
Source code in src/freshdata/cleaner.py
clean ¶
clean(df: DataFrame, *, report: bool = False, memory: object | None = None) -> pd.DataFrame | tuple[pd.DataFrame, CleanReport]
Clean df and return the result (the input is left unchanged
unless preserve_original=False was configured).
With report=True, returns (cleaned_df, CleanReport) instead.
The latest report is always available as :attr:report_. memory
(a :class:~freshdata.CleaningMemory) lets the semantic stage replay
compatible learned repairs; see :func:freshdata.clean's memory=.
Source code in src/freshdata/cleaner.py
Profiling & inspection¶
freshdata.profile ¶
Read-only data profiling: what is in this frame, and what would clean() do?
:func:build_profile reuses the exact inference code from the cleaning steps,
so every "would convert to …" suggestion is a faithful preview, not a guess.
Profiling never modifies the input.
ColumnProfile
dataclass
¶
ColumnProfile(name: str, dtype: str, non_null: int, missing: int, missing_pct: float, unique: int | None, sample_values: list[Any], suggested_dtype: str | None, issues: list[str] = list())
Statistics and detected issues for one column.
Profile
dataclass
¶
Profile(n_rows: int, n_cols: int, memory: int, duplicate_rows: int | None, missing_cells: int, missing_pct: float, columns: list[ColumnProfile], materialization: dict[str, Any] | None = None)
Bases: HtmlReprMixin
A whole-table profile. Render with print(profile), export with
:meth:to_frame or :meth:to_dict, or display the interactive quality
cockpit with :meth:show / _repr_html_.
to_frame ¶
One row per column — convenient to sort/filter in a notebook.
Source code in src/freshdata/profile.py
build_profile ¶
build_profile(df: DataFrame, config: CleanConfig, *, sample: int | None = None, max_columns: int | None = None, lazy: bool = False) -> Profile
Profile df without modifying it.
Wide-schema / large-frame perf controls (all optional, behaviour-preserving when unset):
- sample: profile a deterministic (random_state=0) row sample of this size when the frame is larger; per-column statistics become estimates.
- max_columns: profile only the first max_columns columns; the rest are recorded as omitted.
- lazy: skip the expensive full-frame duplicate-row scan (duplicate_rows is left None).
When any control changes the work set, the returned :class:Profile
describes the profiled subset and carries the totals in
profile.materialization.
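The interplay of `sample`, `max_columns`, and the recorded totals can be sketched without pandas. The function below picks a seeded row sample and the first few columns, then reports what was left out. `select_work_set` and the keys of its info dict are hypothetical; the real contents of `profile.materialization` are not specified on this page.

```python
import random

def select_work_set(rows, columns, sample=None, max_columns=None, seed=0):
    """Pick the rows and columns to profile and record what was left out."""
    kept_cols = columns if max_columns is None else columns[:max_columns]
    if sample is not None and len(rows) > sample:
        # A fixed seed makes the sample reproducible across runs.
        idx = sorted(random.Random(seed).sample(range(len(rows)), sample))
    else:
        idx = list(range(len(rows)))
    subset = [{c: rows[i][c] for c in kept_cols} for i in idx]
    info = {  # hypothetical keys; the real materialization dict may differ
        "total_rows": len(rows),
        "profiled_rows": len(idx),
        "omitted_columns": columns[len(kept_cols):],
    }
    return subset, info

rows = [{"id": i, "score": i % 7, "note": "x"} for i in range(1000)]
cols = ["id", "score", "note"]
first, info = select_work_set(rows, cols, sample=100, max_columns=2)
second, _ = select_work_set(rows, cols, sample=100, max_columns=2)
print(first == second, info["omitted_columns"])  # True ['note']
```

Because the seed is fixed, two profiles of the same frame see the same rows, so differences between runs reflect the data rather than the draw.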
Source code in src/freshdata/profile.py
freshdata.infer_roles ¶
infer_roles(df: DataFrame, *, strategy: str = 'balanced', config: CleanConfig | None = None, **options: object) -> pd.DataFrame
Infer column roles and primary missing models without mutating data.
Also reports a per-column semantic_type (email/phone/url/... —
see :data:freshdata.semantic.semantic_types.SEMANTIC_TYPES) with a
confidence and a compact evidence string. Semantic-type inference is
deterministic and model-free; an explicit semantic_context hint always
wins. These three columns are additive — the original output is unchanged.
Source code in src/freshdata/api.py
freshdata.explain_clean ¶
explain_clean(df: DataFrame, *, strategy: str = 'balanced', config: CleanConfig | None = None, **options: object) -> ExplainReport
Run clean() and return a structured before/after explanation.
Source code in src/freshdata/explain.py
Context policies¶
freshdata.compile_context ¶
compile_context(text: str, df: DataFrame | None = None, *, columns: Sequence[str] | None = None, config: CleanConfig | None = None, strict: bool = False) -> Any
Compile natural-language cleaning rules into a :class:~freshdata.ContextPolicy.
Deterministic, model-free, and offline: the same text always compiles to the
same policy. With a frame (or explicit columns) column phrases are
resolved against the effective post-normalization schema; without one the
policy compiles schema-free and resolves when it meets a frame. The result
is inspectable (policy.summary()), reviewable (policy.to_json()),
and reusable (fd.clean(df, policy=policy)).
Examples¶
import freshdata as fd
policy = fd.compile_context(
    "CustomerID is unique. Never modify revenue values.", df=df)
print(policy.summary())
policy.to_json("policy.json")
cleaned = fd.clean(df, policy=policy)
Source code in src/freshdata/api.py
freshdata.validate ¶
validate(df: DataFrame, *, context: str | None = None, policy: object | None = None, config: CleanConfig | None = None, strict: bool = False, **options: object) -> Any
Check df against a context policy without mutating anything.
Returns a :class:~freshdata.FindingList (a plain list of
:class:~freshdata.QualityFinding with .errors / .warnings
shortcuts) covering unresolved references, compile issues, protected
columns, and unique / allowed-values / range violations.
Examples¶
findings = fd.validate(df, context="CustomerID is unique.")
assert not findings.errors
Source code in src/freshdata/api.py
freshdata.ContextPolicy
dataclass
¶
ContextPolicy(policy_version: str = POLICY_VERSION, dataset_domain: str | None = None, constraints: tuple[ColumnConstraint, ...] = (), unresolved: tuple[UnresolvedRef, ...] = (), issues: tuple[PolicyIssue, ...] = (), source_text_sha256: str | None = None, strict: bool = False)
The compiled, inspectable contract between the user's prose and the engine.
Immutable and JSON-round-trippable (:meth:to_json / :meth:from_json), so a
policy can be reviewed in a pull request like code and passed back verbatim via
fd.clean(df, policy=...).
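What "immutable and JSON-round-trippable" buys in practice can be shown with a much-reduced stand-in. `MiniPolicy` below is hypothetical: it is filled in by hand (no rule text is parsed), carries only two constraint kinds, and borrows just the `source_text_sha256` field name from the signature above.

```python
import hashlib
import json
from dataclasses import FrozenInstanceError, asdict, dataclass

@dataclass(frozen=True)
class MiniPolicy:
    """Hypothetical, much-reduced stand-in for ContextPolicy."""
    protected: tuple = ()
    unique: tuple = ()
    source_text_sha256: str = ""

    def to_json(self):
        # sort_keys keeps the serialized form stable for review diffs.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        data = json.loads(text)
        # JSON has no tuples, so lists are converted back on load.
        return cls(tuple(data["protected"]), tuple(data["unique"]),
                   data["source_text_sha256"])

rules = "CustomerID is unique. Never modify revenue."
policy = MiniPolicy(
    protected=("revenue",),
    unique=("customer_id",),
    source_text_sha256=hashlib.sha256(rules.encode("utf-8")).hexdigest(),
)
restored = MiniPolicy.from_json(policy.to_json())

try:
    policy.protected = ()  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Storing a hash of the source text lets a reviewer confirm that a checked-in policy still corresponds to the prose it was compiled from.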
protected_columns
property
¶
Resolved columns under a protected constraint, in policy order.
constraints_for ¶
All constraints that apply to column (post-normalization name).
Source code in src/freshdata/context/types.py
is_protected ¶
thresholds ¶
Effective (auto, review, floor) thresholds for column and kind.
impute_missing constraints raise the auto threshold for
kind="impute" (or "missing"); everything else falls back to the
global config thresholds.
Source code in src/freshdata/context/types.py
to_json ¶
Serialize to JSON; optionally also write it to path.
Source code in src/freshdata/context/types.py
from_json
classmethod
¶
Load a policy from a JSON string or a path to a .json file.
Source code in src/freshdata/context/types.py
summary ¶
Human-readable one-screen description of the compiled policy.
Source code in src/freshdata/context/types.py
lower ¶
Return a new :class:CleanConfig with this policy folded in.
Pure: cfg is never mutated. Protected columns are appended to
preserve_columns, unique columns to id_columns, per-column
semantic hints (semantic_type, region, allowed_values, range,
impute confidence, mutability) are merged into semantic_context,
and the policy itself is attached as cfg.policy (with context
cleared so the text is never recompiled downstream).
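A pure "fold into config" step along these lines might look like the sketch below. `MiniConfig` and `MiniPolicy` are hypothetical stand-ins with only the fields needed here, and skipping names that are already present is an assumption about how appending behaves.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class MiniConfig:
    preserve_columns: tuple = ()
    id_columns: tuple = ()
    context: object = None
    policy: object = None

@dataclass(frozen=True)
class MiniPolicy:
    protected: tuple = ()
    unique: tuple = ()

    def lower(self, cfg):
        """Return a new config with this policy folded in; cfg is untouched."""
        def merged(existing, extra):
            # Append only names not already present, keeping order.
            return existing + tuple(c for c in extra if c not in existing)
        return replace(
            cfg,
            preserve_columns=merged(cfg.preserve_columns, self.protected),
            id_columns=merged(cfg.id_columns, self.unique),
            policy=self,
            context=None,  # cleared so the text is not recompiled later
        )

cfg = MiniConfig(preserve_columns=("notes",), context="Never modify revenue.")
policy = MiniPolicy(protected=("revenue", "notes"), unique=("customer_id",))
lowered = policy.lower(cfg)
print(lowered.preserve_columns)  # ('notes', 'revenue')
```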
Source code in src/freshdata/context/types.py
Planning & comparison¶
freshdata.suggest_plan ¶
suggest_plan(df: DataFrame, *, config: CleanConfig | None = None, contract: object | None = None, on_unexpected: str = 'warn', on_missing: str = 'fail', context: str | None = None, policy: object | None = None, strict: bool = False, **options: object) -> CleanPlan
Preview engine model choices without mutating df.
With contract= (a :class:~freshdata.DataContract or its mapping),
attaches a baseline-free schema diff at plan.schema_diff (a
DriftReport) explaining incoming structural drift before any repair;
on_unexpected (fail|warn|preserve) and on_missing
(fail|warn|ignore) grade undeclared and missing columns. See
:func:freshdata.diff_schema.
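The grading of undeclared and missing columns can be sketched as a comparison of two column lists. `grade_columns` and the tuple layout it returns are hypothetical, and treating `preserve` and `ignore` as "emit no finding" is an assumption; the real `DriftReport` is richer than this.

```python
def grade_columns(actual, declared, on_unexpected="warn", on_missing="fail"):
    """Grade undeclared and missing columns as (severity, column, kind) tuples."""
    if on_unexpected not in {"fail", "warn", "preserve"}:
        raise ValueError(f"bad on_unexpected: {on_unexpected!r}")
    if on_missing not in {"fail", "warn", "ignore"}:
        raise ValueError(f"bad on_missing: {on_missing!r}")
    findings = []
    for col in actual:
        # Assumption: "preserve" keeps the column without raising a finding.
        if col not in declared and on_unexpected != "preserve":
            findings.append((on_unexpected, col, "unexpected"))
    for col in declared:
        if col not in actual and on_missing != "ignore":
            findings.append((on_missing, col, "missing"))
    return findings

findings = grade_columns(["id", "amount", "memo"], ["id", "amount", "posted_at"])
print(findings)
# [('warn', 'memo', 'unexpected'), ('fail', 'posted_at', 'missing')]
```

No baseline frame is involved: the declared contract alone is enough to explain the drift in the incoming columns.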
context= (natural-language rules) or policy= (a pre-compiled
:class:~freshdata.ContextPolicy) fold a deterministic context policy into
the planned config first — protected columns, id columns, and per-column
semantic hints then shape the plan exactly as they would shape
:func:freshdata.clean. strict=True raises
:class:~freshdata.PolicyError on unresolved or unparsed context.
The returned plan's config carries the compiled policy at plan.config.policy.
Source code in src/freshdata/plan.py
freshdata.compare_plans ¶
compare_plans(df: DataFrame, *, strategies: tuple[str, ...] = ('conservative', 'balanced', 'aggressive'), config: CleanConfig | None = None, include_metrics: bool = False, **options: object) -> pd.DataFrame
Side-by-side primary models for each strategy.
With include_metrics=True, adds actual clean outcomes (missing_after,
duration_seconds, …) from :func:compare_clean.
Source code in src/freshdata/plan.py
freshdata.compare_clean ¶
compare_clean(df: DataFrame, *, strategies: tuple[str, ...] = ('conservative', 'balanced', 'aggressive'), config: CleanConfig | None = None, **options: object) -> pd.DataFrame
Run clean under each strategy and compare quality + efficiency metrics.
Source code in src/freshdata/plan.py
Configuration¶
freshdata.CleanConfig
dataclass
¶
CleanConfig(column_names: bool = True, drop_empty_rows: bool = True, drop_empty_columns: bool = True, drop_constant_columns: bool = False, strip_whitespace: bool = True, normalize_sentinels: bool = True, extra_sentinels: tuple[str, ...] = (), string_case: str | None = None, fix_dtypes: bool = True, numeric_threshold: float = 0.95, datetime_threshold: float = 0.95, preserve_leading_zeros: bool = True, dayfirst: bool | str = 'auto', decimal: str = '.', thousands: str = ',', drop_duplicates: bool = True, duplicate_subset: tuple[str, ...] | None = None, strategy: str = 'balanced', missing_threshold_low: float = 0.05, missing_threshold_medium: float = 0.3, missing_threshold_high: float = 0.6, duplicate_threshold: float = 0.1, outlier_action: str | None = 'auto', preserve_original: bool = True, verbose: bool = True, preserve_columns: tuple[str, ...] = (), target_column: str | None = None, id_columns: tuple[str, ...] = (), duplicate_keep: str = 'first', allow_timeseries_duplicates: bool = False, advanced_imputation: bool | str = 'auto', missing_indicators: bool | str = 'auto', impute: str | None = None, impute_strategy: Mapping[str, str] | None = None, missforest_max_iter: int = 5, missforest_n_estimators: int = 100, missforest_random_state: int = 42, missforest_min_rows_for_model: int = 50, missforest_add_indicators: bool | str = 'auto', outliers: str | None = None, outlier_method: str = 'iqr', outlier_factor: float | None = None, optimize_memory: bool = False, category_threshold: float = 0.5, reset_index: bool = False, sample_size: int = 10000, random_state: int = 0, semantic_mode: str | None = None, semantic_auto_threshold: float = 0.95, semantic_review_threshold: float = 0.7, semantic_max_distinct_values: int = 500, semantic_sample_size: int = 10000, semantic_backends: tuple[str, ...] 
= ('deterministic',), semantic_context: object | None = None, semantic_privacy_policy: str = 'local_only', semantic_budget: dict[str, object] | None = None, semantic_embedding_cache_size: int = 65536, context: str | None = None, policy: object | None = None, strict: bool = False)
Options controlling what :func:freshdata.clean does.
Two layers of cleaning are controlled here:
- Representation repair (whitespace, sentinel strings, wrong dtypes, exact duplicate rows, structurally empty rows/columns) — always safe, on by default.
- The decision engine (strategy="balanced", the default) profiles every column and applies accuracy-first rules for missing values and outliers. Use strategy="aggressive" for zero-NaN scrubbing (KNN, column drops, capping). Set strategy="conservative" to disable the engine and only repair representation; statistical changes are then opt-in via impute / outliers.
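The "wrong dtypes" repair is gated by `numeric_threshold` (0.95 in the signature above): a text column is only converted when enough of its values parse. The sketch below shows one way such a gate could work. `looks_numeric`, the exclusion of missing cells from the ratio, and the comma handling are assumptions, not freshdata's inference code.

```python
def _parses(text):
    """True if the string reads as a number once thousands commas are removed."""
    try:
        float(text.replace(",", ""))  # "," is the default thousands separator
        return True
    except ValueError:
        return False

def looks_numeric(values, threshold=0.95):
    """True when at least `threshold` of the non-missing values parse as numbers."""
    present = [v for v in values if v is not None]
    if not present:
        return False
    return sum(_parses(v) for v in present) / len(present) >= threshold

prices = ["1,200", "15.5", "7", None] + ["3"] * 17  # every present value parses
mixed = ["12", "abc", "7", "n/a"]                   # only half parse
print(looks_numeric(prices), looks_numeric(mixed))  # True False
```

A threshold below 1.0 tolerates a few stray tokens in an otherwise numeric column, while a half-text column like `mixed` stays as text.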
Reports & results¶
freshdata.CleanReport
dataclass
¶
CleanReport(actions: list[Action] = list(), rows_before: int = 0, rows_after: int = 0, cols_before: int = 0, cols_after: int = 0, memory_before: int = 0, memory_after: int = 0, duration_seconds: float = 0.0, missing_before: int = 0, missing_after: int = 0, duplicates_removed: int = 0, outliers_handled: int = 0, columns_dropped: list[str] = list(), columns_imputed: list[str] = list(), columns_preserved: list[str] = list(), warnings: list[str] = list(), recommendations: list[str] = list(), domain: str | None = None, domain_trust_score: float | None = None, domain_findings: list[dict[str, Any]] = list(), domain_repairs: list[dict[str, Any]] = list(), streaming: dict[str, Any] | None = None, backend: str | None = None, materialized: bool = True, fallback_events: list[dict[str, Any]] = list(), backend_differences: list[dict[str, Any]] = list(), stage_timings: list[dict[str, Any]] = list(), source_provenance: dict[str, Any] | None = None, contract_violations: dict[str, Any] | None = None, decisions_hash: str | None = None, undo_log: dict[str, Any] | None = None)
Bases: HtmlReprMixin
Everything one :func:freshdata.clean run did, in order.
Iterable and sized: len(report) is the number of actions, and
for action in report walks them in execution order. bool(report)
is True iff anything was changed.
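The container behaviour described here maps onto three special methods. The sketch below is a hypothetical miniature, not the real class, and it assumes that "anything was changed" means at least one action with a non-zero `count`; without an explicit `__bool__`, Python would fall back to `__len__` and treat a report of informational notes as truthy.

```python
from dataclasses import dataclass, field

@dataclass
class MiniAction:
    step: str
    count: int = 0  # 0 marks an informational note, as in Action.count

@dataclass
class MiniReport:
    actions: list = field(default_factory=list)

    def __len__(self):
        return len(self.actions)

    def __iter__(self):
        return iter(self.actions)  # list order is execution order

    def __bool__(self):
        # Assumption: "changed" means some action touched at least one cell or row.
        return any(a.count > 0 for a in self.actions)

report = MiniReport([MiniAction("column_names"),
                     MiniAction("strip_whitespace", count=12)])
steps = [a.step for a in report]
notes_only = MiniReport([MiniAction("preserve")])
print(len(report), steps, bool(report), bool(notes_only))
# 2 ['column_names', 'strip_whitespace'] True False
```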
Beyond the action log, the report carries a cleaning summary (missing cells before/after, duplicates removed, outliers handled, columns dropped/imputed/preserved), engine warnings for risky columns, and recommendations for manual review.
record_fallback ¶
Record that backend delegated step to the pandas reference.
Source code in src/freshdata/report.py
record_backend_difference ¶
record_backend_difference(backend: str, step: str, detail: str, *, column: str | None = None) -> None
Record a semantics difference between a native backend and pandas.
Source code in src/freshdata/report.py
record_stage_timing ¶
Record a backend-provided stage runtime in seconds.
add ¶
add(step: str, description: str, *, column: str | None = None, count: int = 0, rationale: str = '', risk: str = 'low', confidence: float = 1.0, model_id: str = '', status: str = 'automatic', reversible: bool | None = None, memory_influenced: bool = False, human_review: bool = False, metadata: dict[str, Any] | None = None) -> None
Record one action (internal; called by the pipeline).
Source code in src/freshdata/report.py
add_warning ¶
add_recommendation ¶
to_dict ¶
Return a JSON-friendly dictionary representation of the report.
This format is ideal for writing to logs, persisting audit snapshots, or returning a stable object from service endpoints.
Examples¶
report = CleanReport(rows_before=10, rows_after=8, cols_before=4, cols_after=3)
payload = report.to_dict()
'actions' in payload  # True
payload['rows_before'], payload['rows_after']  # (10, 8)
Source code in src/freshdata/report.py
revert ¶
Undo reversible plan actions on df, returning a new frame.
Requires this report to come from
freshdata.apply_plan(..., keep_undo=True); with action_ids=None
every recorded reversible action is undone. Cells whose rows no longer
exist in df are skipped silently (they cannot be restored).
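The undo semantics (restore old cell values, optionally for selected actions only, and skip rows that have since disappeared) can be sketched over plain dictionaries. `revert_cells` and the undo-log entry layout are hypothetical; the structure of freshdata's `undo_log` is not documented on this page.

```python
import copy

def revert_cells(rows, undo_log, action_ids=None):
    """Return a new row mapping with logged cell changes undone."""
    restored = copy.deepcopy(rows)  # never mutate the caller's data
    for entry in reversed(undo_log):  # undo the most recent change first
        if action_ids is not None and entry["action_id"] not in action_ids:
            continue
        row = restored.get(entry["row"])
        if row is None:
            continue  # row no longer exists; skipped silently
        row[entry["column"]] = entry["old"]
    return restored

cleaned = {0: {"age": 40.0}, 2: {"age": 35.0}}  # row 1 was dropped after imputation
undo_log = [
    {"action_id": "a1", "row": 0, "column": "age", "old": None},
    {"action_id": "a1", "row": 1, "column": "age", "old": None},
    {"action_id": "a2", "row": 2, "column": "age", "old": 350.0},
]
everything = revert_cells(cleaned, undo_log)
only_a2 = revert_cells(cleaned, undo_log, action_ids={"a2"})
print(everything)  # {0: {'age': None}, 2: {'age': 350.0}}
print(only_a2)     # {0: {'age': 40.0}, 2: {'age': 350.0}}
```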
Source code in src/freshdata/report.py
to_findings ¶
Project this report into normalized :class:~freshdata.QualityFinding objects.
Surfaces violated domain rules (enriched with any repair that was applied)
and medium/high-risk engine actions, so the result can be exported to dbt
tests, a Great Expectations suite, or an exception table. CleanReport
keeps its own shape; this is a pure read-only projection.
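The engine-action half of this projection amounts to a risk filter plus a reshape. The sketch below uses plain dictionaries; `project_findings`, the output keys, and the sample rationales are hypothetical, and the real `QualityFinding` fields are not listed on this page.

```python
def project_findings(actions):
    """Keep only medium/high-risk actions, reshaped as flat finding dicts."""
    return [
        {"column": a["column"], "rule": a["step"],
         "severity": a["risk"], "message": a["rationale"]}
        for a in actions
        if a["risk"] in {"medium", "high"}
    ]

# Made-up sample actions for illustration only.
actions = [
    {"step": "strip_whitespace", "column": "name", "risk": "low", "rationale": ""},
    {"step": "missing", "column": "income", "risk": "medium",
     "rationale": "30% missing; median imputed"},
    {"step": "outliers", "column": "amount", "risk": "high",
     "rationale": "values capped at the IQR fence"},
]
findings = project_findings(actions)
print([f["column"] for f in findings])  # ['income', 'amount']
```

The input list is only read, never modified, which matches the read-only nature of the projection.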
Examples¶
CleanReport().to_findings()  # []
Source code in src/freshdata/report.py
to_frame ¶
Return one action per row as a pandas.DataFrame.
This representation works best when you want to inspect the report in notebooks, ad hoc dashboards, or quick filtering workflows.
Examples¶
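from freshdata import CleanReport, Action
# description has no default in the Action signature, so it is passed explicitly
report = CleanReport(actions=[Action(step='coerce', column='age', description='coerced to integer')])
frame = report.to_frame()
frame.loc[0, 'step']  # 'coerce'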
Source code in src/freshdata/report.py
summary ¶
Render a concise text summary for terminal or notebook output.
This method is the quickest way to create a human-readable snapshot of what happened during a clean run.
Examples¶
report = CleanReport(rows_before=4, rows_after=4, cols_before=2, cols_after=2)
text = report.summary()
text.startswith('freshdata clean report')  # True
'rows:' in text  # True
Source code in src/freshdata/report.py
brief ¶
Compact summary for verbose=True console output.
Source code in src/freshdata/report.py
freshdata.Action
dataclass
¶
Action(step: str, column: str | None, description: str, count: int = 0, rationale: str = '', risk: str = 'low', confidence: float = 1.0, model_id: str = '', status: str = 'automatic', reversible: bool | None = None, memory_influenced: bool = False, human_review: bool = False, metadata: dict[str, Any] = dict())
One transformation (or deliberate non-transformation) of the data.
Attributes¶
step:
Machine-readable step name, e.g. "fix_dtypes" or "missing".
column:
Column the action applied to, or None for table-level actions.
description:
Human-readable summary of what happened.
count:
Number of cells or rows affected (0 for informational notes).
rationale:
Why the decision engine chose this action ("" for non-engine steps).
risk:
"low", "medium", or "high" — how likely the action is to need review.
confidence:
Engine confidence in the decision, in [0, 1] (1.0 for non-engine steps,
which are deterministic representation repairs).
freshdata.CleanPlan
dataclass
¶
CleanPlan(config: CleanConfig, column_plans: dict[str, ColumnPlan] = dict(), schema_diff: Any = None, repair_plan: Any = None)
Bases: HtmlReprMixin
Recommended cleaning configuration and per-column model choices.
summary ¶
Human-readable primary model per column.
Source code in src/freshdata/plan.py
alternatives ¶
One row per (column, model, rank) for notebook review.
Source code in src/freshdata/plan.py
to_frame ¶
One row per column with primary missing/outlier choices.
Source code in src/freshdata/plan.py
freshdata.ColumnPlan
dataclass
¶
ColumnPlan(column: str, missing: ModelChoice | None = None, missing_alternatives: tuple[ModelChoice, ...] = (), outlier: ModelChoice | None = None, outlier_action: str | None = None, n_outliers: int = 0, semantic_proposals: int = 0)
Primary and alternative models for one column.
freshdata.Profile
dataclass
¶
Profile(n_rows: int, n_cols: int, memory: int, duplicate_rows: int | None, missing_cells: int, missing_pct: float, columns: list[ColumnProfile], materialization: dict[str, Any] | None = None)
Bases: HtmlReprMixin
A whole-table profile. Render with print(profile), export with
:meth:to_frame or :meth:to_dict, or display the interactive quality
cockpit with :meth:show / _repr_html_.
to_frame ¶
One row per column — convenient to sort/filter in a notebook.
Source code in src/freshdata/profile.py
freshdata.ColumnProfile
dataclass
¶
ColumnProfile(name: str, dtype: str, non_null: int, missing: int, missing_pct: float, unique: int | None, sample_values: list[Any], suggested_dtype: str | None, issues: list[str] = list())
Statistics and detected issues for one column.
freshdata.ExplainReport
dataclass
¶
ExplainReport(strategy: str, rows_before: int, rows_after: int, cols_before: int, cols_after: int, before_stats: dict[str, dict[str, Any]], after_stats: dict[str, dict[str, Any]], cell_changes: dict[str, int], actions_by_step: dict[str, list[dict[str, Any]]], narratives: list[str], report: CleanReport, roles: DataFrame)
Bases: HtmlReprMixin
Structured explanation of a clean() run.
to_frame ¶
One row per column: before/after dtype and changed-cell count.
Source code in src/freshdata/explain.py
Enterprise layer¶
The freshdata.enterprise subpackage is documented in the
feature overview. Import its symbols lazily:
Compliance¶
The freshdata.compliance subpackage maps a CleanReport onto regulatory control
frameworks; it is documented in the compliance reports guide.
Import its symbols lazily:
Integrations¶
The freshdata.integrations subpackage runs the clean + trust gate inside Dagster,
Airflow, and dbt; it is documented in the
orchestration integrations guide. The framework-agnostic core is
always importable: