Examples
Every script in the `examples/` directory is self-contained and runnable (`python examples/<name>.py`). The `notebooks/` directory has narrated Jupyter walkthroughs.

| Example | What it shows |
|---|---|
| `01_missing_values.py` | Smart, role-aware missing-value imputation |
| `02_outliers.py` | Outlier detection and flagging vs removal |
| `03_normalization.py` | Feature normalization for ML |
| `04_profiling.py` | Read-only data profiling and EDA |
| `05_ml_pipeline.py` | End-to-end ML preprocessing with scikit-learn |
| `06_large_dataset.py` | Cleaning a large synthetic dataset, with timing |
| `07_pandas_integration.py` | Dropping freshdata into an existing pandas workflow |
| `08_csv_automation.py` | Batch CSV cleaning automation with audit logs |
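The difference between flagging and removing outliers (the topic of `02_outliers.py`) can be illustrated without freshdata at all. The following is a minimal pandas-only sketch that assumes the common 1.5 × IQR rule; the helper `iqr_outlier_mask` and the `amount_is_outlier` column are hypothetical names chosen for illustration, and this is not necessarily how freshdata detects outliers.

```python
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


df = pd.DataFrame({"amount": [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]})
mask = iqr_outlier_mask(df["amount"])

# Flagging keeps every row and records the decision in a new column.
flagged = df.assign(amount_is_outlier=mask)

# Removal drops the flagged rows, so the dataset shrinks.
removed = df.loc[~mask].reset_index(drop=True)

print(flagged)
print(removed)
```

Flagging is the more conservative choice when downstream code may want to treat extreme values differently rather than lose them.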
Missing-value cleaning
import pandas as pd
import freshdata as fd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [34, None, 41, None, 29],
    "segment": ["A", "B", None, "A", None],
})

# Declare customer_id as an identifier column and request the report as well.
cleaned, report = fd.clean(df, id_columns=("customer_id",), return_report=True)
print(report.summary())
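To give a feel for what "role-aware" imputation could mean, here is a rough pandas-only sketch on the same data. It assumes a simple policy (identifier columns are left alone, numeric columns get the median, other columns get the most frequent value); the function `impute_by_role` is hypothetical and is not freshdata's implementation, whose actual rules may differ.

```python
import pandas as pd


def impute_by_role(df: pd.DataFrame, id_columns=()) -> pd.DataFrame:
    """Fill missing values per column, skipping identifier columns."""
    out = df.copy()
    for col in out.columns:
        if col in id_columns or not out[col].isna().any():
            continue  # identifiers are never imputed; complete columns need nothing
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])
    return out


df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [34, None, 41, None, 29],
    "segment": ["A", "B", None, "A", None],
})
filled = impute_by_role(df, id_columns=("customer_id",))
print(filled)
```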
ML preprocessing pipeline
import pandas as pd
import freshdata as fd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

raw = pd.read_csv("customers.csv")
clean_df, report = fd.clean(raw, target_column="churn", return_report=True)
assert not report.warnings  # stop early if the report carries any warnings

# One-hot encode categorical features; keep the target separate.
X = pd.get_dummies(clean_df.drop(columns="churn"))
y = clean_df["churn"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))
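One practical detail of the `pd.get_dummies` step: a later batch of data may contain categories the training frame never saw, or lack some it did, so its dummy columns will not line up with the model's features. A possible way to handle this, sketched with made-up `plan` and `tenure` columns that are not part of the example above, is to reindex the new batch to the training columns.

```python
import pandas as pd

# Hypothetical training frame and a later batch with an unseen category.
train = pd.DataFrame({"plan": ["basic", "pro", "basic"], "tenure": [3, 12, 7]})
new_batch = pd.DataFrame({"plan": ["pro", "enterprise"], "tenure": [5, 20]})

X_train = pd.get_dummies(train)

# Reindex to the training columns: dummy columns for categories unseen in
# training are dropped, and categories absent from the new batch are added
# as all-False columns.
X_new = pd.get_dummies(new_batch).reindex(columns=X_train.columns, fill_value=False)

print(list(X_train.columns))
print(X_new)
```

Silently dropping an unseen category is a design choice; in some pipelines it may be preferable to raise an error instead.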
CSV automation
from pathlib import Path

import pandas as pd
import freshdata as fd

cleaner = fd.Cleaner(strategy="balanced")
out_dir = Path("clean")
out_dir.mkdir(exist_ok=True)  # to_csv does not create missing directories

for path in Path("inbox").glob("*.csv"):
    out = cleaner.clean(pd.read_csv(path))
    out.to_csv(out_dir / path.name, index=False)
    # Print the first line of the report summary for this file.
    print(path.name, "→", cleaner.report_.summary().splitlines()[0])
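The loop above prints a one-line summary per file. If a persistent audit trail is wanted, one option is to append a JSON line per processed file. The sketch below is an assumption about how that might look, not the contents of `08_csv_automation.py`: `clean_folder` and the log fields are hypothetical, and `DataFrame.dropna()` stands in for the freshdata cleaning step so the example runs with pandas alone.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def clean_folder(inbox: Path, outbox: Path, log_path: Path) -> list[dict]:
    """Clean every CSV in inbox, write it to outbox, append one JSON line per file."""
    outbox.mkdir(parents=True, exist_ok=True)
    entries = []
    with log_path.open("a", encoding="utf-8") as log:
        for path in sorted(inbox.glob("*.csv")):
            raw = pd.read_csv(path)
            cleaned = raw.dropna()  # stand-in for the real cleaning step
            cleaned.to_csv(outbox / path.name, index=False)
            entry = {"file": path.name, "rows_in": len(raw), "rows_out": len(cleaned)}
            log.write(json.dumps(entry) + "\n")
            entries.append(entry)
    return entries


# Demo in a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    inbox = root / "inbox"
    inbox.mkdir()
    sample = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", "z"]})
    sample.to_csv(inbox / "sample.csv", index=False)

    entries = clean_folder(inbox, root / "clean", root / "audit.jsonl")
    log_lines = (root / "audit.jsonl").read_text(encoding="utf-8").splitlines()
    print(entries)
```

Opening the log in append mode means repeated runs extend the same file, which keeps a history across batches.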