What it is: A null/NaN value is a placeholder for data that was never recorded. Most ML algorithms throw errors or produce wrong results when they encounter NaN.
python
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

# ── DETECT ───────────────────────────────────────────
df.isnull().sum()                            # count per column
(df.isnull().mean() * 100).round(2)          # % missing per column
df.columns[df.isnull().any()]                # which columns have nulls

# ── DROP ─────────────────────────────────────────────
df.dropna()                                  # drop rows with ANY null
df.dropna(subset=['age', 'salary'])          # drop only if null in these cols
df.dropna(axis=1, thresh=int(0.6*len(df)))   # drop cols missing >40%

# ── FILL ─────────────────────────────────────────────
df['age'].fillna(df['age'].median())         # fill numeric with median
df['dept'].fillna('Unknown')                 # fill categorical with constant
df.ffill()                                   # time series: forward fill
df.bfill()                                   # time series: backward fill
⚠ Never fill the target variable (y). Drop those rows instead. Never call fit_transform on test data — only transform().
02
Missing Values — Imputation
MCAR
Missing Completely At Random
Randomly missing. No pattern. Safe to drop rows. Rare in practice.
MAR
Missing At Random
Missing depends on other observed columns. Impute using other features.
MNAR
Missing Not At Random
Missingness depends on the missing value itself. Add indicator flag + impute.
python — all imputation strategies
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer   # must import first
from sklearn.impute import IterativeImputer

# Strategy 1 — Mean / Median / Mode (SimpleImputer)
imp_median = SimpleImputer(strategy='median')                # numeric
imp_mode = SimpleImputer(strategy='most_frequent')           # categorical
imp_const = SimpleImputer(strategy='constant', fill_value='Unknown')

X_train = imp_median.fit_transform(X_train)                  # fit ONLY on train
X_test = imp_median.transform(X_test)                        # transform only on test

# Strategy 2 — KNN Imputer (borrows from similar rows)
knn_imp = KNNImputer(n_neighbors=5)
X_imputed = knn_imp.fit_transform(X_train)
# Strategy 3 — Iterative / MICE (most accurate, most expensive)
iter_imp = IterativeImputer(max_iter=10, random_state=42)
X_imputed = iter_imp.fit_transform(X_train)
# Strategy 4 — MNAR: add missingness indicator flag
df['salary_was_missing'] = df['salary'].isna().astype(int)
df['salary'] = df['salary'].fillna(df['salary'].median())
03
Outlier Detection & Treatment
What it is: Outliers are values far from the rest of the data. They can be genuine (a billionaire in a salary dataset) or errors (age = -5). Either way they distort model learning.
python — IQR method
# IQR — works for any distribution (most robust)
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['salary'] < lower) | (df['salary'] > upper)]
df_clean = df[(df['salary'] >= lower) & (df['salary'] <= upper)]
df['salary'] = df['salary'].clip(lower, upper) # or cap (Winsorize)
python — Z-score method
from scipy import stats
# Z-score — good for normally distributed data
z = np.abs(stats.zscore(df['salary']))
df_clean = df[z < 3]                         # keep within 3 standard deviations

# Multi-column Z-score
numeric_cols = df.select_dtypes('number').columns
z_all = df[numeric_cols].apply(stats.zscore)
df_clean = df[(np.abs(z_all) < 3).all(axis=1)]
💡 Use IQR for single columns, Isolation Forest when outliers appear across multiple correlated features simultaneously.
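A minimal multivariate sketch, assuming scikit-learn's IsolationForest on the numeric columns (the contamination value is an illustrative assumption, not a rule):

python — Isolation Forest (multivariate outliers)
from sklearn.ensemble import IsolationForest

numeric_cols = df.select_dtypes('number').columns
iso = IsolationForest(contamination=0.05, random_state=42)   # assume ~5% outliers
labels = iso.fit_predict(df[numeric_cols])                   # -1 = outlier, 1 = inlier
df_clean = df[labels == 1]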
04
Duplicates
Why it matters: Duplicate rows cause the model to learn some samples with double weight, biasing predictions. During cross-validation, duplicates in both train and test folds cause data leakage.
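A short pandas sketch of the usual checks and fixes (the 'customer_id' key column is hypothetical):

python — duplicates
df.duplicated().sum()                                           # count fully duplicated rows
df = df.drop_duplicates()                                       # exact duplicates: keep first
df = df.drop_duplicates(subset=['customer_id'], keep='last')    # duplicates on a key column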
Common issues: Ages stored as strings, dates in mixed formats ("2021-03-15" vs "March 2021"), booleans stored as "yes"/"True"/"1". Wrong types make math operations fail silently.
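For the type issues above, a minimal pandas sketch (column names are illustrative):

python — type conversion
df['age'] = pd.to_numeric(df['age'], errors='coerce')                    # bad strings become NaN
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')   # unparseable values become NaT
df['is_active'] = df['is_active'].map({'yes': 1, 'True': 1, '1': 1,
                                       'no': 0, 'False': 0, '0': 0})     # messy booleans to 0/1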
🚨 handle_unknown='ignore' is critical. Without it, your model crashes the moment it sees a new category in production.
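A minimal encoding sketch, assuming scikit-learn's OneHotEncoder and a categorical 'dept' column (the column name is illustrative):

python — one-hot encoding
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore')        # unseen category -> all-zero row
dept_train = ohe.fit_transform(X_train[['dept']])   # fit ONLY on train
dept_test = ohe.transform(X_test[['dept']])         # never fit on test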
07
Misspelling & Text Cleaning
Why it matters: "Marketing", "marketing", "MARKETING", "Marketting" are treated as 4 different categories by any encoder. One-hot encoding would create 4 columns instead of 1.
Golden rule: Always split BEFORE any cleaning, scaling, or imputation. Fit transformers on training data only. Transform both. This prevents data leakage — the #1 mistake in ML.
🚨 Data Leakage: If you scale/impute BEFORE splitting, your test set has "seen" the training distribution. Model appears to generalize but it is cheating. Always split first.
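A minimal split-first sketch, assuming the label lives in a 'target' column (the name is illustrative):

python — split before preprocessing
from sklearn.model_selection import train_test_split

X = df.drop(columns=['target'])
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
# fit imputers / scalers / encoders on X_train only, then transform X_test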
09
Scaling & Standardization
Why it matters: A feature ranging 0–1,000,000 dominates a feature ranging 0–1 in distance-based and gradient-based models. Scaling puts all features on equal footing. Tree models (Random Forest, XGBoost) do NOT need scaling.
💡 Tree models (Decision Tree, Random Forest, XGBoost, LightGBM) are invariant to scaling — don't waste time scaling for them. Must scale for: Linear Regression, Logistic Regression, SVM, KNN, Neural Networks, PCA.
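A minimal scaling sketch, assuming numeric train/test matrices that are already split:

python — scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

scaler = StandardScaler()                        # mean 0, std 1 (default choice)
# scaler = MinMaxScaler()                        # squeeze into [0, 1]
# scaler = RobustScaler()                        # median / IQR, resists outliers
X_train_scaled = scaler.fit_transform(X_train)   # fit ONLY on train
X_test_scaled = scaler.transform(X_test)         # transform test with train statistics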
10
EDA — Exploratory Data Analysis
What it is: Visually and statistically examining data BEFORE modelling to discover patterns, anomalies, relationships, and guide preprocessing decisions. Coined by John Tukey (1977): "Let the data speak."
Pillar 1
Univariate
One variable at a time. Distribution, shape, skew. Histogram, boxplot.
Pillar 2
Bivariate
Two variables. Correlation, trend, group differences. Scatter, grouped bar.
Pillar 3
Multivariate
Many variables. Interactions, clusters. Heatmap, pair plot, PCA.
Pillar 4
Temporal
Time-ordered data. Trends, seasonality. Line plot, rolling average.
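A minimal pass touching each of the four pillars, assuming pandas, matplotlib, and seaborn (column names are illustrative):

python — quick EDA pass
import matplotlib.pyplot as plt
import seaborn as sns

df.describe()                                                 # univariate: summary stats
df['salary'].hist(bins=30)                                    # univariate: distribution / skew
sns.boxplot(x='dept', y='salary', data=df)                    # bivariate: group differences
sns.heatmap(df.select_dtypes('number').corr(), annot=True)    # multivariate: correlations
df.set_index('date')['sales'].rolling(30).mean().plot()       # temporal: trend
plt.show()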
EDA Finding
→ Action to Take
Feature severely right-skewed
Apply np.log1p() transform
Two features strongly correlated
Drop one — multicollinearity hurts linear models
Target class imbalanced (60/40+)
Use stratify=y in split + class_weight='balanced'
Feature has zero correlation with target
Candidate for removal — it's noise
Outliers present in feature
Use RobustScaler or clip with IQR bounds
Categorical with 100s of values
Use Target Encoding or group rare values into "Other"
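A couple of the actions above as short pandas sketches (thresholds and column names are illustrative):

python — common EDA-driven fixes
df['salary_log'] = np.log1p(df['salary'])                         # tame right skew
counts = df['city'].value_counts()
rare = counts[counts < 20].index                                  # categories seen fewer than 20 times
df['city'] = df['city'].where(~df['city'].isin(rare), 'Other')    # group rare values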
11
Overfitting & Underfitting
Problem
Underfitting
Model too simple to learn the pattern.
Train error: HIGH · Test error: HIGH · Gap: small
Analogy: never studied for the exam.
Goal
Just Right
Model learned the signal, not the noise.
Train error: LOW · Test error: LOW · Gap: small
Analogy: understood the concepts.
Problem
Overfitting
Model memorized training data including noise.
Train error: VERY LOW · Test error: HIGH · Gap: large
Analogy: memorized exam answers word-for-word.
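One way to read these gaps in practice, sketched with a scikit-learn classifier (the model choice and variable names are placeholders):

python — measuring the train/test gap
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f'train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc - test_acc:.3f}')
# low scores + small gap -> underfitting; large gap -> overfitting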
Fix Underfitting
Use a more powerful model (Tree → Random Forest → Neural Net)
Add more / better features (polynomial, interactions)
Spearman correlation: rank-based. Works for ordinal data and non-normal distributions. Robust to outliers. Use when the relationship is curved but consistently directional (monotonic).
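A minimal pandas sketch contrasting rank-based (Spearman) with linear (Pearson) correlation (column names are illustrative):

python — rank vs linear correlation
df['experience'].corr(df['salary'], method='pearson')    # strength of linear relationship
df['experience'].corr(df['salary'], method='spearman')   # monotonic / rank-based relationship
df.select_dtypes('number').corr(method='spearman')       # full rank-correlation matrix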
⚠ Correlation ≠ Causation. Ice cream sales and drowning rates are highly correlated (both peak in summer) — but ice cream does not cause drowning. Always think about confounders.