Format parsers, reference data & finance tick mode¶

This page covers three additions that turn freshdata from a DataFrame cleaner into a front-to-back pipeline: format parsers (raw message → DataFrame), the reference layer (one cached way to load code sets), and finance tick mode (market/trade-data validation).

Format parsers¶

A parser performs structural parsing only — it reads a wire/file format and returns one or more pandas DataFrames plus an auditable ParseResult. Cleaning and domain-validation are a separate step (fd.clean), so parsing and rules stay decoupled.

Format	`format=`	Frames	Suggested domain
FHIR R4 JSON	`fhir`	`patient`, `observation`, `encounter`, `condition`, `medication_request`	`healthcare`
HL7 v2 ER7	`hl7v2`	`patient`, `encounter`, `order`, `observation`	`healthcare`
GPX	`gpx`	`waypoints`, `route_points`, `track_points`	`transport`
SDMX-ML	`sdmx`	`observations`	— (audit-only)
UN/EDIFACT	`edifact`	`segments`	—

FHIR R4 JSON¶

fd.parse_domain(source, format="fhir") accepts a Bundle, a single resource, a list of resources, a JSON string, or a file path, and flattens five resource types into frames whose columns line up with the healthcare validators:

result = fd.parse_domain(bundle_json, format="fhir")
conditions = fd.clean_domain_file("bundle.json", format="fhir",
                                  domain="healthcare", frame="condition")

The healthcare pack validates all five — Patient, Observation, Encounter, Condition, MedicationRequest — auto-detecting the resource from the frame's columns. Observation code systems carry their URIs (LOINC/SNOMED/ICD-10); Observation units are checked against UCUM, Condition codes against a documented ICD-10 sample, and Condition/MedicationRequest status/intent against the FHIR R4 value sets. patient_id is PHI (masked unless audit_include_phi=True) and resource IDs are never imputed. Unsupported resource types are counted in warnings, never dropped silently. The HL7 v2 parser covers MSH/PID/PV1/OBR/OBX (OBR adds an order frame and links each OBX to its order).

import freshdata as fd

result = fd.parse_domain(hl7_message, format="hl7v2")
result.frames["observation"]          # one row per OBX
result.warnings                       # audit notes for anything skipped
result.to_dict()                      # JSON-friendly summary

# Parse a file, then clean a chosen frame with a domain in one call:
patients = fd.clean_domain_file(
    "admit.hl7", format="hl7v2", domain="healthcare",
    frame="patient", fhir_resource="Patient",
)

fd.parse_domain accepts a path, raw text/bytes, or a file-like object. Malformed input is recorded in ParseResult.warnings rather than raising, so a partial message is still usable.

Honest scope

Parsers are structural readers for the common parts of each format — HL7 MSH/PID/PV1/OBX, GPX waypoints/routes/tracks, SDMX generic & structure-specific data, EDIFACT segments/elements with UNA delimiters and the release character. They are not full conformance engines. SDMX is audit-only: an unrecognized layout yields a warning and an empty frame. HL7 observation code systems are mapped to URIs (LOINC http://loinc.org, SNOMED http://snomed.info/sct, ICD-10).

Third-party parsers can register through the freshdata.parsers entry-point group, or at runtime with freshdata.parsers.register("myfmt", MyParser).

Reference-data layer¶

freshdata.domains.reference gives every pack and parser one cached, normalizer-aware way to load a bundled reference code set, instead of each caller re-implementing case folding and synonym coercion.

from freshdata.domains.reference import load_reference, available_references

available_references()                      # ['iso3166', 'iso4217', 'ucum_common', ...]
cur = load_reference("iso4217", normalizer="upper")
cur.contains("usd")                          # True (normalized)
cur.invalid_mask(series)                     # boolean mask of unknown codes
cur.coerce(series)                           # map synonyms/case to canonical

Bundled sets include ISO-4217 (currencies), ISO-3166 (countries), UN/CEFACT units, UCUM (case-sensitive — load with normalizer="exact"), and a UN/LOCODE sample. Every set carries a _meta block (version / source / disclaimer): these are documented common subsets, not exhaustive code systems.

Finance tick mode¶

Pass finance_mode="tick" to validate market tick / trade data instead of the default double-entry ledger model:

out, report = fd.clean(ticks_df, domain="finance", finance_mode="tick", return_report=True)

The tick model is one row per (symbol, timestamp, price, size) tick, with optional exchange, bid, ask, and currency. Rules (FIN-T-001…010) cover:

ISO-8601, non-future timestamp; symbol present (an ID — never imputed);
price and size strictly positive;
currency ∈ ISO-4217 (validated through the reference layer, case-insensitively);
bid <= ask (no crossed quotes);
duplicate-tick detection on (symbol, timestamp, price, size);
BCBS-239 / SOX-style completeness: the currency/exchange control dimensions needed to roll up control totals are flagged when missing (audit-only, info severity).

Symbol and exchange are IDs and are never imputed; bad ticks are flagged for audit, never silently dropped. The default finance_mode="ledger" is unchanged.