Skip to content

Examples

Every script in the examples/ directory is self-contained and runnable (python examples/<name>.py). The notebooks/ directory has narrated Jupyter walkthroughs.

Example What it shows
01_missing_values.py Smart, role-aware missing-value imputation
02_outliers.py Outlier detection and flagging vs removal
03_normalization.py Feature normalization for ML
04_profiling.py Read-only data profiling and EDA
05_ml_pipeline.py End-to-end ML preprocessing with scikit-learn
06_large_dataset.py Cleaning a large synthetic dataset, with timing
07_pandas_integration.py Dropping freshdata into an existing pandas workflow
08_csv_automation.py Batch CSV cleaning automation with audit logs

Missing-value cleaning

import pandas as pd
import freshdata as fd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [34, None, 41, None, 29],
    "segment": ["A", "B", None, "A", None],
})

cleaned, report = fd.clean(df, id_columns=("customer_id",), return_report=True)
print(report.summary())

ML preprocessing pipeline

import pandas as pd
import freshdata as fd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

raw = pd.read_csv("customers.csv")
clean_df, report = fd.clean(raw, target_column="churn", return_report=True)
assert not report.warnings

X = pd.get_dummies(clean_df.drop(columns="churn"))
y = clean_df["churn"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))

CSV automation

from pathlib import Path
import pandas as pd
import freshdata as fd

cleaner = fd.Cleaner(strategy="balanced")
for path in Path("inbox").glob("*.csv"):
    out = cleaner.clean(pd.read_csv(path))
    out.to_csv(Path("clean") / path.name, index=False)
    print(path.name, "→", cleaner.report_.summary().splitlines()[0])