Context policies: cleaning rules in English

Write your expectations about a dataset as plain English, and FreshData compiles them — deterministically, offline, with no model and no new dependency — into a typed, reviewable ContextPolicy that governs the clean:

import freshdata as fd

clean_df = fd.clean(
    df,
    context="""
    This is an ecommerce customer dataset.
    CustomerID is unique.
    Emails must be valid.
    Phone numbers are Indian.
    Missing Age should be estimated only if confidence >95%.
    Never modify revenue values.
    """,
)

The point is not the prose — it is the artifact. The compiled policy is a JSON document you can print, diff, commit next to your pipeline, and review in a pull request like code. fd.clean(..., context=...) is exactly equivalent to compiling the policy yourself and passing it back:

# ctx_text holds the same rules text that was passed as context= above
policy = fd.compile_context(ctx_text, df=df)
print(policy.summary())
policy.to_json("policy.json")            # review it, commit it
clean_df = fd.clean(df, policy=policy)   # skip parsing, use it verbatim

What the compiler understands (tier 0)

The parser is a fixed, deterministic lexicon of ~12 intent families. The same text always compiles to the same policy — there is no LLM, no embedding, no network call, and no randomness anywhere in this path.

You write	Compiles to
`This is an ecommerce customer dataset.`	dataset domain metadata
`CustomerID is unique.`	`unique` — validation + id-column protection
`Emails must be valid.`	`valid_format(email)` semantic hint
`Phone numbers are Indian.`	`locale_format(phone, region=IN)` hint
`Missing Age should be estimated only if confidence >95%.`	per-column imputation confidence gate (0.95)
`Never modify revenue values.` / `Do not touch revenue.`	`protected` — hard, never modified
`Allowed status values are active, inactive, pending.`	`allowed_values` validation
`Age must be between 18 and 100.`	`range` validation
`Deduplicate by email and phone.`	duplicate-detection key
`Drop rows where age is missing.`	recorded `custom` rule (executed in a later phase)
`Rename cust to customer_id.` / `Replace 'M' with 'Male' in gender.`	recorded `custom` rules

Confidence phrases are normalized: >95%, confidence > 95%, confidence above 0.95, and only if confidence >= 95 percent all compile to 0.95.
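
As a rough illustration of how such phrases can collapse to one number, here is a minimal sketch. The function name `normalize_confidence` and the regular expression are assumptions made for this example, not FreshData's actual lexicon. The idea: pull the number out of the phrase, then divide by 100 when it is written as a percentage (a `%` sign, the word `percent`, or a bare value above 1).

```python
import re
from typing import Optional

# Hypothetical pattern: a comparison word or symbol, a number, and an
# optional percent marker. This is not FreshData's real lexicon.
_CONFIDENCE = re.compile(
    r"(?:>=|>|above|over|at least)\s*(\d+(?:\.\d+)?)\s*(%|percent)?",
    re.IGNORECASE,
)


def normalize_confidence(phrase: str) -> Optional[float]:
    """Return the threshold as a fraction in [0, 1], or None if none is found."""
    match = _CONFIDENCE.search(phrase)
    if match is None:
        return None
    value = float(match.group(1))
    # An explicit percent marker, or a bare value above 1, means a percentage.
    if match.group(2) or value > 1:
        value /= 100
    return value if 0 <= value <= 1 else None


for phrase in (">95%", "confidence above 0.95", "only if confidence >= 95 percent"):
    print(phrase, "->", normalize_confidence(phrase))
```

Returning `None` for anything the pattern cannot read mirrors the refusal to half-apply a rule: a threshold that cannot be parsed is reported rather than defaulted.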

Column references resolve against your real schema

You write CustomerID; your frame has cust_id. The resolver walks a fixed ladder — exact match, snake_case normalization (the same renaming column_names=True applies), a curated alias lexicon, token containment, and finally difflib similarity (threshold 0.85):

"CustomerID"    → cust_id           (alias)       0.90
"Emails"        → email_addr        (alias)       0.90
"Phone numbers" → mobile            (alias)       0.90
"Age"           → age               (normalized)  1.00
"revenue"       → monthly_revenue   (alias)       0.90

Nothing is ever guessed. Two candidates that score too close together, or a reference that scores below the threshold, come back in policy.unresolved with their ranked candidates so you can disambiguate.
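
The ladder can be pictured with a small standalone sketch. Everything here is illustrative: the `ALIASES` table, the `snake` and `resolve` helpers, and the `margin` used to decide when two fuzzy scores are "too close" are assumptions, and only the 0.85 similarity threshold comes from the text above. It is not FreshData's resolver, and its alias table is far smaller than a curated lexicon would be.

```python
import difflib
import re

# Hypothetical alias lexicon, keyed by the normalized reference.
ALIASES = {
    "customer_id": {"cust_id"},
    "emails": {"email_addr"},
}


def snake(name):
    """Normalize 'CustomerID' or 'Phone numbers' to snake_case."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    return re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_").lower()


def resolve(reference, columns, threshold=0.85, margin=0.05):
    """Return (column, rung, candidates); column is None when unresolved."""
    if reference in columns:
        return reference, "exact", []
    key = snake(reference)
    by_snake = {snake(col): col for col in columns}
    if key in by_snake:
        return by_snake[key], "normalized", []
    tokens = set(key.split("_"))
    rungs = (
        ("alias", [c for c in columns if snake(c) in ALIASES.get(key, ())]),
        ("token", [c for c in columns if tokens <= set(snake(c).split("_"))]),
    )
    for rung, hits in rungs:
        if len(hits) == 1:
            return hits[0], rung, []
        if hits:  # several equally good hits: refuse to pick one
            return None, "unresolved", hits
    scored = sorted(
        ((difflib.SequenceMatcher(None, key, snake(c)).ratio(), c) for c in columns),
        reverse=True,
    )
    ranked = [c for _, c in scored]
    if scored and scored[0][0] >= threshold:
        runner_up = scored[1][0] if len(scored) > 1 else 0.0
        if scored[0][0] - runner_up >= margin:
            return ranked[0], "fuzzy", []
    return None, "unresolved", ranked


columns = ["cust_id", "email_addr", "age", "monthly_revenue"]
for ref in ("CustomerID", "Age", "revenue", "zipcode"):
    print(ref, "->", resolve(ref, columns)[:2])
```

Each rung either returns a single winner or hands back the candidates; no rung breaks a tie on its own, which is the property the paragraph above describes.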

With the optional [semantic] extra installed, its model pulled, and "embedding" listed in semantic_backends, a final rescue rung runs on references the deterministic ladder gave up on: cosine similarity between the phrase and the column names. It follows the same discipline: a close runner-up stays unresolved, it can never override an exact, alias, or fuzzy match, and an embedding-resolved constraint records its evidence (cosine, ranked candidates, model id) in params["resolution_evidence"]. Without the extra, the ladder is byte-identical to the deterministic behavior above. See Optional semantic models.

Sentences the lexicon cannot parse are surfaced in policy.issues (kind unparsed_sentence) and are never silently dropped.

Strict mode

By default, unresolved references and unparsed sentences become report warnings and the rest of the policy still applies. With strict=True they raise fd.PolicyError before any data is touched:

fd.clean(df, context=ctx, strict=True)       # raises PolicyError on any gap
fd.compile_context(ctx, df=df, strict=True)  # same, at compile time
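
The difference between the two modes can be sketched without the library. The `PolicyError` class below is a local stand-in for fd.PolicyError, and `report_gaps` is a hypothetical helper invented for this example. It only shows the control flow described above: gaps become warnings by default, or a single exception raised before any cleaning step when strict is set.

```python
import warnings


class PolicyError(Exception):
    """Local stand-in for fd.PolicyError in this sketch."""


def report_gaps(unresolved, unparsed, strict=False):
    """Warn about every gap, or raise before cleaning when strict is set."""
    gaps = [f"unresolved reference: {ref}" for ref in unresolved]
    gaps += [f"unparsed sentence: {text}" for text in unparsed]
    if strict and gaps:
        # Fail here, before any data has been touched.
        raise PolicyError("; ".join(gaps))
    for gap in gaps:
        warnings.warn(gap)
    return gaps


try:
    report_gaps(["Zip code"], ["Make the data nicer."], strict=True)
except PolicyError as exc:
    print("refused to clean:", exc)
```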

Checking without cleaning: `fd.validate`

findings = fd.validate(df, context=ctx)   # never mutates df
assert not findings.errors

fd.validate returns a FindingList of QualityFinding objects covering unresolved references, compile issues, protected columns, and (where checkable today) unique, allowed_values, and range violations.
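
To make the three data checks concrete, here is a small read-only sketch over plain row dictionaries. The `Finding` class and the `check_rows` helper are hypothetical stand-ins written for this example; they are not the FindingList or QualityFinding types, and fd.validate may check these rules quite differently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical stand-in for a QualityFinding."""
    rule: str
    column: str
    values: tuple


def check_rows(rows, unique=(), allowed=None, ranges=None):
    """Collect violations from a list of row dicts without modifying them."""
    findings = []
    for col in unique:
        counts = Counter(r[col] for r in rows if r.get(col) is not None)
        dupes = tuple(sorted(v for v, n in counts.items() if n > 1))
        if dupes:
            findings.append(Finding("unique", col, dupes))
    for col, ok in (allowed or {}).items():
        bad = {r[col] for r in rows if r.get(col) is not None and r[col] not in ok}
        if bad:
            findings.append(Finding("allowed_values", col, tuple(sorted(bad))))
    for col, (low, high) in (ranges or {}).items():
        bad = {r[col] for r in rows if r.get(col) is not None and not low <= r[col] <= high}
        if bad:
            findings.append(Finding("range", col, tuple(sorted(bad))))
    return findings


rows = [
    {"cust_id": 1, "status": "active", "age": 30},
    {"cust_id": 1, "status": "paused", "age": 17},
]
for finding in check_rows(
    rows,
    unique=["cust_id"],
    allowed={"status": {"active", "inactive", "pending"}},
    ranges={"age": (18, 100)},
):
    print(finding)
```

Missing values are skipped rather than reported, on the assumption that missingness is a separate concern from these three rules.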

Everywhere the policy threads

fd.clean(df, context=..., policy=..., strict=...)
fd.suggest_plan(df, context=...)          # the plan's config carries the policy
fd.clean_csv("in.csv", context=...)
fd.Cleaner(context=...)                   # reusable, compiled per frame

context= and policy= are mutually exclusive; passing both raises. On the CLI:

freshdata clean input.csv -o output.csv --context-file rules.txt
freshdata clean input.csv -o output.csv --context-file rules.txt --strict
freshdata policy compile rules.txt --schema input.csv --output policy.json

policy compile prints the human-readable summary; --strict exits non-zero on unresolved or unparsed lines (handy as a CI gate for the committed rules.txt).

How a policy lowers into the engine

policy.lower(config) is a pure function producing a new CleanConfig:

protected columns → preserve_columns + semantic_context mutable=False
unique columns → id_columns (imputation/outlier veto for free) + validation metadata
valid_format / locale_format → per-column semantic_type (+ region) hints
imputation confidence gates → per-column impute_min_confidence metadata
allowed_values / range → per-column validation metadata
dedup keys → duplicate_subset

The lowered config is what actually runs — the existing pipeline, unchanged. Every compile is audited: the report gains a context action carrying the full policy dict, plus a warning per unresolved/unparsed item.
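
What "pure function producing a new config" means can be shown with frozen dataclasses. The `Config` and `Policy` classes below are simplified, hypothetical stand-ins for CleanConfig and ContextPolicy. Only the field names preserve_columns, id_columns, and duplicate_subset are taken from the mapping above, and the merge rules are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Config:
    """Hypothetical, much reduced stand-in for CleanConfig."""
    preserve_columns: tuple = ()
    id_columns: tuple = ()
    duplicate_subset: tuple = ()


@dataclass(frozen=True)
class Policy:
    """Hypothetical, much reduced stand-in for ContextPolicy."""
    protected: tuple = ()
    unique: tuple = ()
    dedup_keys: tuple = ()

    def lower(self, config: Config) -> Config:
        # replace() builds a new Config; the input object is never mutated.
        return replace(
            config,
            preserve_columns=config.preserve_columns + self.protected,
            id_columns=config.id_columns + self.unique,
            duplicate_subset=self.dedup_keys or config.duplicate_subset,
        )


base = Config(preserve_columns=("notes",))
policy = Policy(protected=("revenue",), unique=("cust_id",), dedup_keys=("email", "phone"))
lowered = policy.lower(base)
print(lowered)
print(base)  # unchanged
```

Because the input config is frozen and untouched, the same policy can be lowered onto several configs, and the original can still be inspected for comparison.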

Guarantees and limitations

context=None is a no-op. Not passing a context changes nothing, byte for byte — the compiler is never even imported.
Deterministic tier-0 only. This is a fixed pattern lexicon, not natural language understanding. Phrasings outside it are surfaced as unparsed, and that is by design: a rule the compiler cannot prove it understood is a rule it refuses to half-apply.
No model, no network, no new dependency. Pure Python + stdlib.
Protection wins ties. A column that is both protected and asked to be repaired compiles with the repair demoted to validation and an explicit protection_conflict issue (an error under strict).
Phase-2 enforcement. valid_format(email) and locale_format(phone, region=IN) now drive deterministic value experts (email normalization, Indian phone canonicalization to +91XXXXXXXXXX), allowed_values drives the reference-list expert, imputation confidence gates ("only if confidence >95%") are enforced inside the statistical engine, and protected columns are physically verified byte-identical before any result is returned (fd.ProtectedColumnError on violation). See repair-plans.md for the reviewable plan/apply workflow. drop_if/rename/map rules are still compiled and carried in the policy but not yet executed.
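
The protected-column check in the last item can be approximated with a before/after fingerprint. This is only a sketch of the idea: `ProtectedColumnError` here is a local stand-in for fd.ProtectedColumnError, `fingerprint` and `verify_protected` are hypothetical helpers, and hashing the `repr` of each value is a simplification of a true byte-level comparison.

```python
import hashlib


class ProtectedColumnError(Exception):
    """Local stand-in for fd.ProtectedColumnError in this sketch."""


def fingerprint(values):
    """Hash a column's values in order; repr() stands in for raw bytes."""
    digest = hashlib.sha256()
    for value in values:
        digest.update(repr(value).encode("utf-8"))
        digest.update(b"\x00")  # separator so ['ab'] and ['a', 'b'] differ
    return digest.hexdigest()


def verify_protected(before, after, protected):
    """Raise if any protected column differs between the two column dicts."""
    for col in protected:
        if fingerprint(before[col]) != fingerprint(after[col]):
            raise ProtectedColumnError(f"protected column changed: {col}")


before = {"revenue": [10.0, 20.5], "age": [30, None]}
after = {"revenue": [10.0, 20.5], "age": [30, 31]}
verify_protected(before, after, ["revenue"])  # passes: only age was imputed
```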

See the notebook notebooks/06_context_cleaning.ipynb for the full ecommerce walkthrough.