Optional semantic models
FreshData's default install is — and will remain — model-free. Everything
fd.clean does out of the box is deterministic Python: no model weights, no
downloads, no network, no LLM. This page covers the optional local model
path added in Phase 3: a small ONNX sentence encoder that upgrades three
things when (and only when) you opt in:
- Embedding backend: extra repair proposals for category/reference values the deterministic experts abstain on ("aktyve" → "active" when the allowed values say so).
- Resolver rescue rung: context phrases like "Phone numbers" can match a mob_no column that the alias lexicon and fuzzy matching missed.
- Semantic-type evidence: extra signal for fd.infer_roles.
Every model output is evidence, not authority: proposals still pass the same policy gate, protected columns keep their byte-identity guarantee, and ambiguous matches are suggested or skipped, never auto-applied.
Install
pip install "freshdata-cleaner[semantic]"
The extra adds the inference runtime only. Model weights are never bundled
in the wheel and are never downloaded automatically: not by fd.clean, not
by import freshdata, not by anything except the explicit pull below.
Models
import freshdata as fd
fd.models.list_available() # registry metadata (ids, sizes, licenses)
fd.models.status() # what is installed and verified, offline
fd.models.pull("fd-col-encoder-v1") # explicit download (the only network path)
fd.models.path("fd-col-encoder-v1") # local artifact path
| model id | what it is | used by |
|---|---|---|
| fd-col-encoder-v1 | small sentence encoder, ONNX int8 | embedding backend, resolver rung |
| fd-intent-v1 | context-sentence intent classifier | registered for forward compatibility; unused in Phase 3 |
| calib-v1 | isotonic confidence-calibration tables (plain JSON) | proposal scoring; a packaged default ships in the wheel |
Models live in ~/.freshdata/models/<model-id>/; override the directory with
FRESHDATA_MODEL_DIR.
Not yet published: official artifacts are not hosted yet, so fd.models.pull(...) currently raises ModelNotPublishedError with instructions. You can point FRESHDATA_MODEL_URL_BASE at a mirror that hosts the files, or use the air-gapped path below. The registry pins sha256 checksums as artifacts are published; a pinned checksum that does not match is refused at download and at load time.
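The pin-and-refuse rule can be illustrated with the standard library alone. This is a minimal sketch, not FreshData's implementation: `sha256_of`, `check_artifact`, and the returned labels are hypothetical names chosen for the example, and the real loader raises ModelChecksumError rather than ValueError.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large model file is never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_artifact(path: Path, pinned_sha256: Optional[str]) -> str:
    """Return "verified" or "unverified"; raise when a pinned hash mismatches.

    Hypothetical helper: names and labels are illustrative only.
    """
    if pinned_sha256 is None:
        return "unverified"  # nothing pinned yet, so nothing to compare against
    actual = sha256_of(path)
    if actual != pinned_sha256.lower():
        raise ValueError(f"checksum mismatch for {path.name}: got {actual}")
    return "verified"
```

Running the same check at download and again at load time is what lets a file that was swapped on disk after a successful pull still be refused.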
Air-gapped installs
Copy the model files into the model directory by hand:
$FRESHDATA_MODEL_DIR/fd-col-encoder-v1/model.onnx
$FRESHDATA_MODEL_DIR/fd-col-encoder-v1/tokenizer.json
fd.models.status() detects them. Manually placed files with no pinned hash
load as unverified (visible in status()); a pinned mismatch refuses to
load.
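The manual copy can also be scripted. The sketch below assumes the two files have already been transferred to some local folder; `model_root` and `place_model_files` are hypothetical helpers, and only the directory layout and the FRESHDATA_MODEL_DIR override come from this page.

```python
import os
import shutil
from pathlib import Path

MODEL_ID = "fd-col-encoder-v1"
REQUIRED_FILES = ("model.onnx", "tokenizer.json")


def model_root() -> Path:
    """Use FRESHDATA_MODEL_DIR if set, else the documented default location."""
    override = os.environ.get("FRESHDATA_MODEL_DIR")
    return Path(override) if override else Path.home() / ".freshdata" / "models"


def place_model_files(source_dir: Path, model_id: str = MODEL_ID) -> Path:
    """Copy both artifact files from source_dir into <root>/<model_id>/."""
    missing = [name for name in REQUIRED_FILES if not (source_dir / name).is_file()]
    if missing:
        # Fail before copying anything so the target never holds a partial set.
        raise FileNotFoundError(f"missing in {source_dir}: {', '.join(missing)}")
    target = model_root() / model_id
    target.mkdir(parents=True, exist_ok=True)
    for name in REQUIRED_FILES:
        shutil.copy2(source_dir / name, target / name)
    return target
```

After the copy, fd.models.status() is the way to confirm that the files were detected and whether they count as verified.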
Using the embedding backend
clean_df, report = fd.clean(
    df,
    context="""
    Allowed status values are active, inactive, pending.
    Never modify revenue values.
    """,
    semantic_mode="auto",
    semantic_backends=("deterministic", "memory", "embedding"),
    return_report=True,
)
Or from the CLI:
freshdata models status
freshdata models pull fd-col-encoder-v1
freshdata clean in.csv -o out.csv \
    --context-file rules.txt \
    --semantic-mode auto \
    --semantic-backends deterministic,memory,embedding \
    --report report.json
Backends run in trust order. Deterministic experts always go first;
memory replays learned repairs when a CleaningMemory is supplied; the
embedding backend sees only the residue — distinct values nothing earlier
proposed a repair for. It never sees full rows: only distinct values of
eligible columns, and never values from protected, id, target, free-text, or
high-cardinality columns.
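The notion of "residue" can be made concrete with plain dictionaries. The sketch below is illustrative only: the role labels, the `embedding_residue` helper, and the dict-based inputs are assumptions made for the example, not FreshData's internal data structures.

```python
from typing import Dict, Iterable, List, Set

# Column roles the embedding backend never reads (labels are illustrative).
EXCLUDED_ROLES = {"protected", "id", "target", "free_text", "high_cardinality"}


def embedding_residue(
    columns: Dict[str, Iterable[str]],
    roles: Dict[str, str],
    already_proposed: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Per eligible column, return the distinct values with no earlier proposal."""
    residue: Dict[str, List[str]] = {}
    for name, values in columns.items():
        if roles.get(name) in EXCLUDED_ROLES:
            continue  # ineligible columns are skipped before any value is read
        handled = already_proposed.get(name, set())
        distinct = sorted(set(values) - handled)  # distinct values, never rows
        if distinct:
            residue[name] = distinct
    return residue
```

Because the function works on sets of distinct values per column, row order and row co-occurrence never reach the model, which is the structural point the paragraph above makes.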
What it proposes, and how carefully
- With allowed values (from context or semantic_context): a residual value whose highest cosine similarity to an allowed value clears a threshold and leads the runner-up by a margin becomes a reference_value proposal. At very high calibrated confidence the normal gate may auto-apply it in semantic_mode="auto", under the same rules deterministic proposals live by.
- Without allowed values (pure similarity clustering on low-cardinality categorical columns): proposals are category_synonym, risk medium or higher, and the packaged calibration caps their confidence below the default auto threshold, so they are suggest-only by default, on purpose.
- Ambiguous matches (a close second candidate, e.g. "nactive" sitting one edit from both active and inactive, or Austria/Australia clusters) produce no proposal at all. Absence is the strongest guarantee: nothing downstream can accidentally apply what was never proposed.
Every model-assisted action records its provenance in Action.metadata:
backend, model id and sha, raw score, calibrated confidence, calibration
version, a features hash, and the top/second candidates with the margin.
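The threshold-and-margin rule described for allowed values, together with the "no proposal when ambiguous" rule, can be sketched in a few lines. The threshold and margin defaults below are placeholders picked for the example, not FreshData's tuned values, and `propose_reference_value` is a hypothetical name.

```python
import math
from typing import Dict, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def propose_reference_value(
    value_vec: Sequence[float],
    allowed_vecs: Dict[str, Sequence[float]],
    threshold: float = 0.80,  # assumed value for illustration
    margin: float = 0.10,     # assumed value for illustration
) -> Optional[str]:
    """Return the single clear winner, or None when the match is weak or ambiguous."""
    scored = sorted(
        ((cosine(value_vec, vec), name) for name, vec in allowed_vecs.items()),
        reverse=True,
    )
    if not scored:
        return None
    top_score, top_name = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else -1.0
    if top_score < threshold or top_score - runner_up < margin:
        return None  # no proposal exists, so nothing downstream can apply it
    return top_name
```

Returning None instead of a low-confidence proposal mirrors the design choice in the text: an ambiguous match never enters the pipeline at all.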
Confidence calibration
Expert scores are treated as raw scores. A per-(backend, issue-family) isotonic table maps them to calibrated probabilities before the gate reads them:
- the packaged default (calib-default-1) is identity for deterministic and memory proposals (so calibration can never change a model-free install's decisions) and conservative for embedding proposals;
- a pulled calib-v1 model overrides the packaged default;
- with no table at all, raw scores pass through and actions record calibration_version="uncalibrated" plus a report warning;
- embedding proposals are never pinned at 1.0, table or no table.
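One way to picture the lookup is a monotone table of (raw score, calibrated probability) knots with linear interpolation between them. The sketch below assumes that representation; the actual JSON layout of calib-v1 is not described here, and the 0.999 ceiling for embedding proposals is an invented placeholder.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# (raw score, calibrated probability) knots, sorted by raw score.
Table = List[Tuple[float, float]]


def interpolate(table: Table, raw: float) -> float:
    """Piecewise-linear lookup in a monotone table, clamped at both ends."""
    xs = [x for x, _ in table]
    if raw <= xs[0]:
        return table[0][1]
    if raw >= xs[-1]:
        return table[-1][1]
    hi = bisect_right(xs, raw)
    (x0, y0), (x1, y1) = table[hi - 1], table[hi]
    return y0 + (y1 - y0) * (raw - x0) / (x1 - x0)


def calibrate(
    raw: float,
    backend: str,
    family: str,
    tables: Dict[Tuple[str, str], Table],
    version: str,
    embedding_cap: float = 0.999,  # assumed ceiling, for illustration only
) -> Tuple[float, str]:
    """Return (confidence, calibration_version) for one proposal."""
    table = tables.get((backend, family))
    if table is None:
        confidence, used = raw, "uncalibrated"  # raw score passes through
    else:
        confidence, used = interpolate(table, raw), version
    if backend == "embedding":
        confidence = min(confidence, embedding_cap)  # never pinned at 1.0
    return confidence, used
```

An identity table is simply `[(0.0, 0.0), (1.0, 1.0)]`, which is why a deterministic score comes out unchanged under the packaged default.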
CleanBench gates calibration on every run of the Phase-3 mini-suite:
protected-column violations = 0, false-modification rate ≤ 0.1%, expected
calibration error ≤ 0.05, precision at confidence ≥ 0.95 of at least 0.98,
and zero auto-applied ambiguous merges. (Long-run targets: ECE ≤ 0.03,
P@0.95 ≥ 0.99 — see docs/benchmarks.md.)
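For readers unfamiliar with the two calibration metrics, the sketch below computes them in their most common textbook form: equal-width binning for expected calibration error, and plain precision over proposals at or above a confidence floor. How CleanBench bins or weights its samples is not stated on this page, so treat this as a generic definition rather than the benchmark's code.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 joins the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def precision_at(
    confidences: Sequence[float], correct: Sequence[bool], floor: float = 0.95
) -> float:
    """Share of correct proposals among those with confidence >= floor."""
    kept = [ok for conf, ok in zip(confidences, correct) if conf >= floor]
    # With nothing above the floor there is nothing to get wrong; report 1.0.
    return sum(kept) / len(kept) if kept else 1.0
```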
Budget
Model work is bounded by semantic_budget:
fd.clean(df, semantic_mode="auto",
         semantic_backends=("deterministic", "embedding"),
         semantic_budget={"max_columns": 50, "max_distinct_values": 5000,
                          "max_model_calls": 100, "max_seconds": 10})
An exhausted budget stops the embedding backend cleanly (proposals already made stand; nothing is partially applied) and records a report fallback event. Deterministic and memory backends are never metered.
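The "stop cleanly" behavior amounts to checking the ceilings between batches rather than in the middle of one. The following sketch meters only two of the four documented ceilings; the `Budget` class, `run_metered`, and the shape of the fallback event are hypothetical, chosen to show the control flow.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Budget:
    """Hypothetical meter for max_model_calls and max_seconds."""
    max_model_calls: int = 100
    max_seconds: float = 10.0
    calls: int = 0
    started: float = field(default_factory=time.monotonic)

    def exhausted(self) -> Optional[str]:
        """Name the ceiling that was hit, or None while work may continue."""
        if self.calls >= self.max_model_calls:
            return "max_model_calls"
        if time.monotonic() - self.started >= self.max_seconds:
            return "max_seconds"
        return None


def run_metered(
    batches: Iterable[List[str]],
    encode: Callable[[List[str]], List[str]],
    budget: Budget,
) -> Tuple[List[str], List[dict]]:
    """Process whole batches until a ceiling is hit; earlier results stand."""
    proposals: List[str] = []
    fallback_events: List[dict] = []
    for batch in batches:
        reason = budget.exhausted()
        if reason is not None:
            # Checked before the batch starts, so no batch is half-processed.
            fallback_events.append({"backend": "embedding", "ceiling": reason})
            break
        budget.calls += 1
        proposals.extend(encode(batch))
    return proposals, fallback_events
```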
Graceful degradation
Requesting "embedding" without the extra or the model never crashes:
- the backend self-disables;
- report.fallback_events records why;
- one report warning says the backend was skipped and how to enable it;
- deterministic (and memory) behavior is exactly what it would have been.
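A self-disabling backend typically probes for its optional dependency instead of importing it unconditionally. The sketch below shows that pattern with the standard library; the runtime module name `onnxruntime` is an assumption (this page does not name the package the extra installs), and `resolve_backends` and its event dictionaries are hypothetical.

```python
import importlib.util
from typing import List, Sequence, Tuple


def resolve_backends(
    requested: Sequence[str],
    runtime_module: str = "onnxruntime",  # assumption: runtime package name
    model_installed: bool = False,
) -> Tuple[List[str], List[dict]]:
    """Drop "embedding" when it cannot run and record why; never raise."""
    active: List[str] = []
    fallback_events: List[dict] = []
    for name in requested:
        if name != "embedding":
            active.append(name)  # other backends are unaffected
            continue
        if importlib.util.find_spec(runtime_module) is None:
            fallback_events.append(
                {"backend": name, "reason": "optional dependency missing"}
            )
        elif not model_installed:
            fallback_events.append({"backend": name, "reason": "model not installed"})
        else:
            active.append(name)
    return active, fallback_events
```

Probing with `find_spec` avoids importing the runtime at all when it is absent, so the model-free path pays no import cost and cannot fail on a broken optional install.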
Troubleshooting
| symptom | cause | fix |
|---|---|---|
| Semantic backend 'embedding' skipped: optional dependency missing | [semantic] extra not installed | pip install "freshdata-cleaner[semantic]" |
| ... model 'fd-col-encoder-v1' is not installed | weights never pulled (they never download automatically) | fd.models.pull("fd-col-encoder-v1") or place files in FRESHDATA_MODEL_DIR |
| ModelNotPublishedError from pull | no official artifact hosting yet | set FRESHDATA_MODEL_URL_BASE to a mirror, or use the air-gapped path |
| ModelChecksumError | file does not match the pinned sha256 | re-pull with --force / replace the file; FreshData refuses to load mismatches |
| budget exhausted (...) fallback event | semantic_budget ceiling hit | raise the ceiling or accept the (clean, recorded) early stop |
| calibration_version="uncalibrated" in metadata | no calibration table found | reinstall (restores the packaged default) or fd.models.pull("calib-v1") |
Testing without models
CI and local tests run the full embedding path with no model files: set
FRESHDATA_STUB_ENCODER=1 for a deterministic hash-based stub encoder, or
inject an encoder via freshdata.models.runtime.set_encoder_factory in
tests. Both are testing seams, not public API.
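To give a sense of what a deterministic hash-based stub can look like, here is a minimal sketch. It is not the stub that FRESHDATA_STUB_ENCODER=1 enables; `stub_encode`, the 16-dimension default, and the lowercase/strip normalization are all assumptions made for the example.

```python
import hashlib
import math
from typing import List


def stub_encode(text: str, dim: int = 16) -> List[float]:
    """Map text to a unit-length vector derived only from its SHA-256 digest."""
    digest = hashlib.sha256(text.strip().lower().encode("utf-8")).digest()
    # Center byte values around zero; the 32-byte digest repeats if dim > 32.
    raw = [(digest[i % len(digest)] - 127.5) / 127.5 for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw))  # never zero: no byte maps to 0.0
    return [x / norm for x in raw]
```

A stub like this is reproducible across machines and runs, which is what tests need, but its vectors carry no semantic similarity: it exercises the plumbing (budgets, thresholds, provenance), not the quality of the matches.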
What Phase 3 deliberately does not do
No LLM at runtime. No cloud inference. No per-cell model calls (distinct values only, structurally). No generative repairs. No automatic downloads. No model weights in the wheel. Protected columns remain byte-identical under any backend — enforced physically by the executor guard, not by convention.