Repair plans: suggest, review, apply, undo

Phase 2 turns freshdata's semantic proposals into an executable, reviewable workflow. Instead of trusting fd.clean end-to-end, you can look at exactly what it wants to do, approve or reject each action, execute only what you approved, and revert individual repairs afterwards — with a deterministic audit hash for change control.

Everything below is deterministic, offline, and model-free, except the optional model-assisted actions described in the Phase 3 section at the end.

The ecommerce example, end to end

import pandas as pd
import freshdata as fd

df = pd.DataFrame({
    "cust_id": ["C001", "C002", "C003", "C003"],
    "full_name": ["Asha Rao", "Ravi Kumar", "Neha Iyer", "Neha Iyer"],
    "email_addr": [" Asha@GMAIL.COM ", "ravi @ example.com", "neha@@shop.in", "neha@@shop.in"],
    "mobile": ["98765 43210", "+91 91234 56780", "09123456789", "09123456789"],
    "age": [25, None, None, 31],
    "monthly_revenue": ["1000", "2000", "3000", "3000"],
    "status": [" Active ", "INACTIVE", "pend-ing", "pend-ing"],
})

context = """
This is an ecommerce customer dataset.
CustomerID is unique.
Emails must be valid.
Phone numbers are Indian.
Allowed status values are active, inactive, pending.
Missing Age should be estimated only if confidence >95%.
Never modify revenue values.
"""

clean_df, report = fd.clean(df, context=context, semantic_mode="auto",
                            return_report=True)

What happens:

| column | behaviour |
|---|---|
| `email_addr` | `Asha@GMAIL.COM` → `Asha@gmail.com`, `ravi @ example.com` → `ravi@example.com`, `neha@@shop.in` → `neha@shop.in` (unambiguous mechanical repairs) |
| `mobile` | every safe form → canonical `+91XXXXXXXXXX`; `12345`-style values are flagged, never rewritten |
| `status` | `Active` → `active`, `INACTIVE` → `inactive`, `pend-ing` → `pending` (exact after separator/case normalization) |
| `age` | stays missing: every engine fill confidence (≤ 0.9) is below the required 0.95; the held-back fill is recorded as a suggested action |
| `monthly_revenue` | byte-identical, physically verified; a violation would raise `fd.ProtectedColumnError` |
| `cust_id` | duplicate `C003` is validated and reported (`fd.validate`), not silently dropped |
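
To make the `status` and `mobile` rows concrete, here is a minimal standard-library sketch of what "exact after separator/case normalization" and "safe form → canonical" could look like. It is not freshdata's implementation: the function names, the separator set, and the rule that Indian mobile numbers have ten digits starting with 6 to 9 are assumptions made for illustration.

```python
import re

# Allowed values taken from the context text in the example above.
ALLOWED_STATUS = {"active", "inactive", "pending"}


def normalize_status(raw):
    """Return the allowed value that matches exactly once whitespace,
    hyphens, and underscores are removed and case is folded; else None.

    Hypothetical helper: there is no fuzzy matching here, so a typo
    such as 'actve' is left for review instead of being repaired.
    """
    key = re.sub(r"[\s\-_]+", "", raw).lower()
    return key if key in ALLOWED_STATUS else None


def canonical_indian_mobile(raw):
    """Return +91XXXXXXXXXX for unambiguous inputs.

    None means "flag, do not rewrite". Hypothetical helper.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 12 and digits.startswith("91"):
        digits = digits[2:]          # drop the country code
    elif len(digits) == 11 and digits.startswith("0"):
        digits = digits[1:]          # drop the trunk prefix
    if len(digits) == 10 and digits[0] in "6789":
        return "+91" + digits
    return None


print(normalize_status("pend-ing"))               # pending
print(canonical_indian_mobile("+91 91234 56780"))  # +919123456780
print(canonical_indian_mobile("12345"))            # None
```

Returning `None` instead of a best guess mirrors the table's split between repairing safe forms and only flagging everything else.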

Every decision lands in the report with confidence, risk, rationale, status, reversibility, and metadata (issue_type, raw_value, proposed_value, expert, evidence, and region for phone repairs).
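
As a rough mental model of such a report entry, the sketch below defines a record with the fields listed above and a gate like the one that keeps `age` missing. The class name, the status strings, and the example values are hypothetical; only the field names and the 0.95 requirement come from this page.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    """Hypothetical stand-in for one report entry (not a freshdata class)."""
    column: str
    issue_type: str
    raw_value: object
    proposed_value: object
    confidence: float
    risk: str
    rationale: str
    status: str = "pending"
    reversible: bool = True
    metadata: dict = field(default_factory=dict)


def gate_fill(decision, required_confidence):
    """Apply a fill only when confidence exceeds the required level;
    otherwise keep it as a suggestion. Status names are assumptions."""
    if decision.confidence > required_confidence:
        decision.status = "applied"
    else:
        decision.status = "suggested"
    return decision


# Illustrative values only: a fill proposed at confidence 0.9 is held back.
age_fill = Decision(
    column="age", issue_type="missing_value", raw_value=None,
    proposed_value=28, confidence=0.9, risk="medium",
    rationale="engine estimate below the context threshold",
)
print(gate_fill(age_fill, required_confidence=0.95).status)  # suggested
```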

Suggest → review → apply

plan = fd.suggest_plan(df, context=context, semantic_mode="auto")
rp = plan.repair_plan            # fd.RepairPlan
print(rp.summary())
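# Illustrative output; ids and values depend on the input
# (the 'actve' row does not come from the sample frame above):
#   a1: [auto/approved] email_addr email_format ' Asha@GMAIL.COM ' -> 'Asha@gmail.com' ...
#   a7: [suggest/pending] status reference_value 'actve' -> 'active' ...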

rp.approve("a7")                 # by action id
rp.approve("email_addr")         # by column
rp.reject("phone_format", reason="numbers come from the CRM")   # by kind
rp.override("a7", {"proposed_value": "inactive"})
rp.approve_all(max_risk="low")   # everything low-risk, not blocked/rejected

clean_df, report = fd.apply_plan(df, rp, keep_undo=True)

apply_plan executes exactly the approved actions:

- nothing is re-profiled or re-decided;
- rejected actions never run; blocked actions (protected columns, identifier vetoes) never run even if approved;
- the physical protected-column guard runs before the result is returned;
- your input frame is never mutated.
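
The guarantees listed above can be sketched in a few lines. This is a toy model, not freshdata's code: the frame is a plain dict of column lists, actions are dicts with hypothetical keys (`id`, `column`, `rows`, `proposed_value`, `status`, `blocked`), and the guard raises a generic `RuntimeError` where the library would raise `fd.ProtectedColumnError`.

```python
import copy


def apply_actions(frame, actions, protected=()):
    """Run only approved, non-blocked actions on a copy of the frame."""
    result = copy.deepcopy(frame)          # the input is never mutated
    executed = []
    for action in actions:
        # Rejected and pending actions never run; blocked ones never
        # run even if someone approved them.
        if action["status"] != "approved" or action.get("blocked", False):
            continue
        for row in action["rows"]:
            result[action["column"]][row] = action["proposed_value"]
        executed.append(action["id"])
    # Physical guard: compare protected columns before returning.
    for column in protected:
        if result[column] != frame[column]:
            raise RuntimeError(f"protected column {column!r} was modified")
    return result, executed


frame = {"status": [" Active ", "INACTIVE"], "monthly_revenue": ["1000", "2000"]}
actions = [
    {"id": "a1", "column": "status", "rows": [0],
     "proposed_value": "active", "status": "approved"},
    {"id": "a2", "column": "status", "rows": [1],
     "proposed_value": "inactive", "status": "rejected"},
    {"id": "a3", "column": "monthly_revenue", "rows": [0],
     "proposed_value": "1000.0", "status": "approved", "blocked": True},
]
result, executed = apply_actions(frame, actions, protected=("monthly_revenue",))
print(executed)           # ['a1']
print(result["status"])   # ['active', 'INACTIVE']
```

Checking the protected columns after the loop, instead of trusting the `blocked` flag alone, is what makes the guard "physical": it catches a write no matter how it got through.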

In semantic_mode="auto", low-risk high-confidence actions arrive already approved (they are what fd.clean would have applied). In "review" mode everything non-trivial stays pending.
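
A simplified sketch of how a mode could set the initial approval state, and of what an `approve_all(max_risk=...)`-style call does with it, follows. The risk scale (`low`, `medium`, `high`), the 0.95 auto-approval threshold, and the dict layout are assumptions, and "non-trivial" is simplified to "everything" in review mode.

```python
RISK_ORDER = ["low", "medium", "high"]   # assumed ordering of risk levels


def initial_status(mode, risk, confidence, auto_threshold=0.95):
    """Pre-approve only low-risk, high-confidence actions in auto mode."""
    if mode == "auto" and risk == "low" and confidence >= auto_threshold:
        return "approved"
    return "pending"


def approve_all(actions, max_risk="low"):
    """Approve every action at or below max_risk, leaving blocked and
    rejected actions untouched."""
    ceiling = RISK_ORDER.index(max_risk)
    for action in actions:
        if action["status"] in ("blocked", "rejected"):
            continue
        if RISK_ORDER.index(action["risk"]) <= ceiling:
            action["status"] = "approved"
    return actions


print(initial_status("auto", "low", 0.99))     # approved
print(initial_status("review", "low", 0.99))   # pending
```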

Drift refusal

A plan remembers the frame it was built for (FrameSignature: row count, column names+dtypes, content sample). Applying it to different data refuses by default:

fd.apply_plan(other_df, rp)                  # raises fd.PlanDriftError
fd.apply_plan(other_df, rp, allow_drift=True)  # applies; stale actions skipped + recorded
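
One way to picture a frame signature is a hash over the three ingredients named above. The sketch below is an assumption-laden illustration, not the actual `FrameSignature`: it takes columns as `(name, dtype)` pairs and rows as tuples, samples only the first few rows, and uses a hypothetical `DriftError` in place of `fd.PlanDriftError`.

```python
import hashlib
import json


class DriftError(Exception):
    """Hypothetical stand-in for fd.PlanDriftError."""


def frame_signature(columns, rows, sample_size=5):
    """Hash row count, column names and dtypes, and a content sample."""
    payload = json.dumps(
        {
            "n_rows": len(rows),
            "columns": [list(c) for c in columns],
            "sample": [list(r) for r in rows[:sample_size]],
        },
        sort_keys=True,
        default=str,   # fall back to str() for non-JSON values
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def check_drift(expected, columns, rows, allow_drift=False):
    """Return True if the frame matches; refuse on mismatch unless allowed."""
    matches = frame_signature(columns, rows) == expected
    if not matches and not allow_drift:
        raise DriftError("frame does not match the plan's signature")
    return matches


cols = [("cust_id", "object"), ("age", "float64")]
rows = [("C001", 25.0), ("C002", None)]
sig = frame_signature(cols, rows)
print(check_drift(sig, cols, rows))                                      # True
print(check_drift(sig, cols, rows + [("C003", 31.0)], allow_drift=True))  # False
```

Because only a sample of the content is hashed, a signature like this is a cheap tripwire rather than a full content check; an edit outside the sampled rows with an unchanged row count would not be detected by this sketch.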

Undo

clean_df, report = fd.apply_plan(df, rp, keep_undo=True)
restored = report.revert(clean_df, action_ids=["a1"])   # undo one action
restored_all = report.revert(clean_df)                   # undo everything reversible

Old values are stored compactly (cell positions + the one raw value per action). The log is capped by undo_cell_limit (default 100 000 cells); actions that don't fit are honestly marked reversible=False. Row drops and aggregations are not cell-reversible and are never claimed to be.
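
The compact storage described above can be sketched as follows. This is an illustration under assumptions, not the library's data structure: the log is a dict keyed by action id, positions are `(column, row)` pairs, the frame is a dict of column lists, and the cap is counted in cells across the whole log.

```python
def record_undo(log, action_id, positions, raw_value, cell_limit=100_000):
    """Store positions plus the single raw value for one action.

    Returns False when the action does not fit under the cell cap, in
    which case the caller would mark it as not reversible.
    """
    used = sum(len(entry["positions"]) for entry in log.values())
    if used + len(positions) > cell_limit:
        return False
    log[action_id] = {"positions": list(positions), "raw_value": raw_value}
    return True


def revert(frame, log, action_ids=None):
    """Restore raw values for the given actions (all logged ones if None)
    on a copy of the frame."""
    restored = {column: list(values) for column, values in frame.items()}
    for action_id in (list(log) if action_ids is None else action_ids):
        if action_id not in log:
            continue                      # nothing recorded: not reversible
        entry = log[action_id]
        for column, row in entry["positions"]:
            restored[column][row] = entry["raw_value"]
    return restored


cleaned = {"status": ["active", "pending", "pending"]}
log = {}
record_undo(log, "a5", [("status", 1), ("status", 2)], "pend-ing")
print(revert(cleaned, log, action_ids=["a5"])["status"])
# ['active', 'pend-ing', 'pend-ing']
```

Storing one raw value per action works because each action here rewrites cells that shared the same original value, so only their positions differ.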

Audit hash

report.decisions_hash   # sha256 over action ids/kinds/columns/params,
                        # approvals/rejections, policy hash, thresholds

The hash is stable across runs for the same reviewed plan and changes when any decision changes — suitable for change-control records. It also appears in report.to_dict() / the CLI's --report JSON.
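
A common way to get this kind of stability is to hash a canonical serialization of the decisions. The sketch below assumes a plain-dict action layout and canonical JSON (sorted keys, fixed separators, actions sorted by id); how freshdata actually serializes its inputs is not specified on this page.

```python
import hashlib
import json


def decisions_hash(actions, policy_hash, thresholds):
    """SHA-256 over a canonical JSON view of the reviewed plan."""
    payload = {
        "actions": sorted(
            (
                {
                    "id": a["id"],
                    "kind": a["kind"],
                    "column": a["column"],
                    "params": a.get("params", {}),
                    "status": a["status"],   # approved / rejected / ...
                }
                for a in actions
            ),
            key=lambda a: a["id"],
        ),
        "policy_hash": policy_hash,
        "thresholds": thresholds,
    }
    canonical = json.dumps(payload, sort_keys=True,
                           separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


plan = [
    {"id": "a1", "kind": "email_format", "column": "email_addr", "status": "approved"},
    {"id": "a7", "kind": "reference_value", "column": "status", "status": "pending"},
]
before = decisions_hash(plan, policy_hash="p0", thresholds={"age_fill": 0.95})
plan[1]["status"] = "approved"
after = decisions_hash(plan, policy_hash="p0", thresholds={"age_fill": 0.95})
print(before != after)   # True
```

Sorting keys and actions removes incidental ordering from the input, which is what lets the same reviewed plan produce the same digest on every run.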

Plans as files (CLI)

freshdata plan in.csv --context-file rules.txt --out plan.json
# ... review/edit plan.json, or pass --approve-all low ...
freshdata apply-plan in.csv --plan plan.json -o out.csv --report audit.json

freshdata clean also gained --semantic-mode {off,assist,review,auto}.

Plans serialize losslessly: RepairPlan.to_json() / RepairPlan.from_json() round-trip actions, approval state, the compiled policy, and the frame signature.
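
What a lossless round-trip means can be shown with a toy plan. `MiniAction` and `MiniPlan` below are hypothetical classes written for this example; they are not `fd.RepairPlan` and do not reflect its JSON schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MiniAction:
    id: str
    kind: str
    column: str
    status: str = "pending"
    params: dict = field(default_factory=dict)


@dataclass
class MiniPlan:
    actions: list
    policy: dict
    frame_signature: str

    def to_json(self):
        # asdict() recurses into the nested MiniAction dataclasses.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        data = json.loads(text)
        data["actions"] = [MiniAction(**a) for a in data["actions"]]
        return cls(**data)


plan = MiniPlan(
    actions=[MiniAction("a1", "email_format", "email_addr", status="approved")],
    policy={"protected": ["monthly_revenue"]},
    frame_signature="example-signature",
)
print(MiniPlan.from_json(plan.to_json()) == plan)   # True
```

Equality after the round-trip covers the actions, their approval state, the policy, and the signature together, which is the property a reviewed plan file needs before it is handed to a separate apply step.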

Model-assisted actions (Phase 3)

With the optional embedding backend enabled, plans can carry proposals the deterministic experts abstained on. They are ordinary PlannedActions — same gate, same approval flow, same guard — with extra audit fields in params: backend: "embedding", the raw (pre-calibration) score, the calibration table version, and a stable features_hash. Calibrated confidence is what confidence already shows; deterministic actions are untouched by calibration. See Optional semantic models.

What this is not

See limitations.md. Without the optional embedding backend there is no model, no embeddings, and no network: only the deterministic tier-0 context language from Phase 1 is used, and ambiguous repairs are suggested or flagged, never auto-applied.