Machine Learning
Study Cheatsheet

Data Cleaning · EDA · Overfitting · Bias-Variance · Metrics · Code Examples
Python 3.x pandas scikit-learn numpy scipy torch
01

Null Values

What it is: A null/NaN value is a missing placeholder — the data simply was never recorded. Most ML algorithms throw errors or produce wrong results when they encounter NaN.
python
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

# ── DETECT ───────────────────────────────────────────
df.isnull().sum()                   # count per column
(df.isnull().mean() * 100).round(2) # % missing per column
df.columns[df.isnull().any()]       # which columns have nulls

# ── DROP ─────────────────────────────────────────────
df = df.dropna()                            # drop rows with ANY null
df = df.dropna(subset=['age', 'salary'])    # drop only if null in these cols
df = df.dropna(axis=1, thresh=int(0.6 * len(df)))  # drop cols missing >40%

# ── FILL ─────────────────────────────────────────────
df['age']  = df['age'].fillna(df['age'].median())  # fill numeric with median
df['dept'] = df['dept'].fillna('Unknown')          # fill categorical with constant
df = df.ffill()                                    # time series: forward fill
df = df.bfill()                                    # time series: backward fill
Never fill the target variable (y). Drop those rows instead. Never call fit_transform on test data — only transform().
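The first rule as a one-line sketch (assuming the label column is named 'target'; adjust to your dataset):
python — handling a missing target (sketch)
# Hypothetical column name 'target': never impute the label, drop the rows instead
df = df.dropna(subset=['target'])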
02

Missing Values — Imputation

MCAR · Missing Completely At Random: missingness is random, with no pattern. Safe to drop rows. Rare in practice.
MAR · Missing At Random: missingness depends on other observed columns. Impute using other features.
MNAR · Missing Not At Random: missingness depends on the missing value itself. Add an indicator flag + impute.
python — all imputation strategies
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # must import first
from sklearn.impute import IterativeImputer

# Strategy 1 — Mean / Median / Mode (SimpleImputer)
imp_median = SimpleImputer(strategy='median')     # numeric
imp_mode   = SimpleImputer(strategy='most_frequent') # categorical
imp_const  = SimpleImputer(strategy='constant', fill_value='Unknown')

X_train = imp_median.fit_transform(X_train) # fit ONLY on train
X_test  = imp_median.transform(X_test)      # transform only on test

# Strategy 2 — KNN Imputer (borrows from similar rows)
knn_imp = KNNImputer(n_neighbors=5)
X_imputed = knn_imp.fit_transform(X_train)

# Strategy 3 — Iterative / MICE (most accurate, most expensive)
iter_imp = IterativeImputer(max_iter=10, random_state=42)
X_imputed = iter_imp.fit_transform(X_train)

# Strategy 4 — MNAR: add missingness indicator flag
df['salary_was_missing'] = df['salary'].isna().astype(int)
df['salary'] = df['salary'].fillna(df['salary'].median())
03

Outlier Detection & Treatment

What it is: Outliers are values far from the rest of the data. They can be genuine (a billionaire in a salary dataset) or errors (age = -5). Either way they distort model learning.
python — IQR method
# IQR — works for any distribution (most robust)
Q1    = df['salary'].quantile(0.25)
Q3    = df['salary'].quantile(0.75)
IQR   = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers  = df[(df['salary'] < lower) | (df['salary'] > upper)]
df_clean  = df[(df['salary'] >= lower) & (df['salary'] <= upper)]
df['salary'] = df['salary'].clip(lower, upper) # or cap (Winsorize)
python — Z-score method
from scipy import stats

# Z-score — good for normally distributed data
z = np.abs(stats.zscore(df['salary']))
df_clean = df[z < 3]  # keep within 3 standard deviations

# Multi-column Z-score
numeric_cols = df.select_dtypes('number').columns
z_all        = df[numeric_cols].apply(stats.zscore)
df_clean     = df[(np.abs(z_all) < 3).all(axis=1)]
python — Isolation Forest (multivariate)
from sklearn.ensemble import IsolationForest

iso    = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(df[numeric_cols])  # numeric_cols from above; -1 = outlier, 1 = inlier
df_clean = df[labels == 1]
💡Use IQR for single columns, Isolation Forest when outliers appear across multiple correlated features simultaneously.
04

Duplicates

Why it matters: Duplicate rows cause the model to learn some samples with double weight, biasing predictions. During cross-validation, duplicates in both train and test folds cause data leakage.
python
# ── DETECT ───────────────────────────────────────────
df.duplicated().sum()                         # count exact duplicates
df[df.duplicated()]                            # view them
df.duplicated(subset=['email']).sum()         # by specific columns

# ── REMOVE ───────────────────────────────────────────
df = df.drop_duplicates()                       # keep first occurrence
df = df.drop_duplicates(subset=['email', 'name'], keep='first')

# ── NEAR-DUPLICATES (fuzzy matching) ─────────────────
from thefuzz import fuzz
score = fuzz.ratio('Alice Smith', 'Alce Smth')  # 88 — likely duplicate
05

Wrong Format & Data Types

Common issues: ages stored as strings, dates in mixed formats ("2021-03-15" vs "March 2021"), booleans stored as "yes"/"True"/"1". Wrong types either make math operations fail outright or let them succeed with the wrong result ("1" + "2" gives "12", not 3).
python — type fixing
# ── NUMERIC ──────────────────────────────────────────
df['age'] = pd.to_numeric(df['age'], errors='coerce')  # bad → NaN

# ── DATETIME ─────────────────────────────────────────
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['year']       = df['date'].dt.year
df['month']      = df['date'].dt.month
df['dayofweek']  = df['date'].dt.dayofweek
df['is_weekend'] = df['dayofweek'] >= 5

# ── BOOLEAN UNIFICATION ──────────────────────────────
df['active'] = df['active'].map({
    'yes': True, 'True': True,  '1': True,  1: True,
    'no':  False, 'False': False, '0': False, 0: False
})

# ── MEMORY OPTIMIZATION ──────────────────────────────
df['dept'] = df['dept'].astype('category')  # less memory for categoricals
df['age']  = df['age'].astype('int32')      # int64 → int32 saves 50% RAM (column must be NaN-free)
06

Encoding Categorical Data

Method | When to use | Watch out
One-Hot | Nominal categories (no order) — color, city, dept | Use drop='first' to avoid the dummy variable trap
Ordinal | Ordered categories — low/medium/high | Never use on nominal data — implies false order
Label | Tree models only — just assigns integers | Implies false order for linear/SVM models
Target | High cardinality — city with 500 values | Must fit inside CV loop to prevent leakage
python — all encoding methods
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder

# One-Hot Encoding (nominal — no order)
enc = OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False)
X_enc = enc.fit_transform(df[['dept', 'city']])

# pandas shortcut
df = pd.get_dummies(df, columns=['dept'], drop_first=True)

# Ordinal Encoding (ordered — must specify order!)
enc_ord = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df['risk_enc'] = enc_ord.fit_transform(df[['risk']])

# Manual ordinal mapping (clearest approach)
df['risk_enc'] = df['risk'].map({'low': 0, 'medium': 1, 'high': 2})

# Label Encoding (tree models only)
le = LabelEncoder()
df['dept_enc'] = le.fit_transform(df['dept'])
🚨handle_unknown='ignore' is critical. Without it, your model crashes the moment it sees a new category in production.
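Target Encoding appears in the table above but not in the code. A minimal sketch, assuming a high-cardinality 'city' column and a numeric or binary target; the statistics come from the training fold only, per the leakage warning in the table:
python — target encoding (sketch; train statistics only)
# Assumed names: X_train/X_test DataFrames with a 'city' column, y_train aligned by index
global_mean = y_train.mean()                      # fallback for unseen cities
city_means  = y_train.groupby(X_train['city']).mean()

X_train['city_enc'] = X_train['city'].map(city_means)
X_test['city_enc']  = X_test['city'].map(city_means).fillna(global_mean)
# For full rigor, recompute city_means out-of-fold inside the CV loop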
07

Misspelling & Text Cleaning

Why it matters: "Marketing", "marketing", "MARKETING", "Marketting" are treated as 4 different categories by any encoder. One-hot encoding would create 4 columns instead of 1.
python — string normalization
# ── BASIC NORMALIZATION ───────────────────────────────
df['dept'] = (
    df['dept']
    .str.lower()           # "MARKETING" → "marketing"
    .str.strip()           # "marketing " → "marketing"
    .str.replace(r'\s+', ' ', regex=True)  # collapse multiple spaces
)

# ── FIX KNOWN VARIANTS ───────────────────────────────
corrections = {
    'mktg': 'marketing', 'mkt': 'marketing',
    'eng':  'engineering', 'engg': 'engineering',
    'hr':   'human resources'
}
df['dept'] = df['dept'].replace(corrections)

# ── REMOVE SPECIAL CHARACTERS ────────────────────────
df['name'] = df['name'].str.replace(r'[^a-zA-Z\s]', '', regex=True)

# ── FUZZY MATCHING — fix misspellings automatically ──
from thefuzz import process

valid_depts = ['marketing', 'engineering', 'finance', 'hr']

def fix_spelling(val, choices, threshold=80):
    match, score = process.extractOne(val, choices)
    return match if score >= threshold else 'unknown'

df['dept_clean'] = df['dept'].apply(lambda x: fix_spelling(x, valid_depts))
# "marketting" → "marketing" (score 92)
# "finace"     → "finance"   (score 91)
08

Train / Test Split

Golden rule: Always split BEFORE fitting any imputer, scaler, or encoder. Fit transformers on training data only, then transform both sets. This prevents data leakage — the #1 mistake in ML.
python — all split strategies
from sklearn.model_selection import (
    train_test_split, StratifiedKFold,
    KFold, cross_val_score
)

# ── BASIC SPLIT (80/20) ───────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% test, 80% train
    random_state=42,    # reproducibility
    stratify=y          # preserve class ratio (use for classification!)
)

# ── THREE-WAY SPLIT (train / val / test) ─────────────
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y)

X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15, random_state=42, stratify=y_trainval)
# 0.15 of the remaining 85% → roughly 72/13/15; use test_size=0.15/0.85 for exact 70/15/15

# ── K-FOLD CROSS VALIDATION ───────────────────────────
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(f"CV: {scores.mean():.3f} ± {scores.std():.3f}")

# ── CORRECT PIPELINE (no leakage) ────────────────────
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
    ('model',   LogisticRegression())   # swap in any estimator
])
pipe.fit(X_train, y_train)   # fits imputer+scaler on train only
pipe.score(X_test, y_test)  # transforms test correctly
🚨Data Leakage: If you scale/impute BEFORE splitting, your test set has "seen" the training distribution. Model appears to generalize but it is cheating. Always split first.
09

Scaling & Standardization

Why it matters: A feature ranging 0–1,000,000 dominates a feature ranging 0–1 in distance-based and gradient-based models. Scaling puts all features on equal footing. Tree models (Random Forest, XGBoost) do NOT need scaling.
Scaler | Formula | Output | Best for | Outlier sensitive?
StandardScaler | (x − μ) / σ | ~[-3, 3] | Linear models, SVM, PCA, neural nets | Yes
MinMaxScaler | (x − min) / (max − min) | [0, 1] | KNN, neural nets with sigmoid | Yes
RobustScaler | (x − median) / IQR | unbounded | Data with outliers you want to keep | No
PowerTransformer | Yeo-Johnson | ~N(0, 1) | Skewed distributions → make Gaussian | No
python — all scalers
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler,
    RobustScaler, PowerTransformer
)

# StandardScaler — mean=0, std=1
scaler     = StandardScaler()
X_train_s  = scaler.fit_transform(X_train)  # fit + transform
X_test_s   = scaler.transform(X_test)       # transform ONLY
X_original = scaler.inverse_transform(X_train_s)  # reverse it

# MinMaxScaler — squish to [0,1]
mm        = MinMaxScaler(feature_range=(0, 1))
X_norm    = mm.fit_transform(X_train)

# RobustScaler — uses median+IQR, ignores outliers
robust    = RobustScaler()
X_robust  = robust.fit_transform(X_train)

# Log transform — for severely right-skewed data (salaries, prices)
df['log_salary'] = np.log1p(df['salary'])  # log1p = log(1+x), safe for 0

# PowerTransformer — makes distribution Gaussian
pt        = PowerTransformer(method='yeo-johnson')
X_normal  = pt.fit_transform(X_train[['salary', 'price']])

# Check skew before/after
print(df['salary'].skew())      # before: 4.8 (very skewed)
print(pd.Series(X_normal[:,0]).skew())  # after:  ~0.1
💡Tree models (Decision Tree, Random Forest, XGBoost, LightGBM) are invariant to scaling — don't waste time scaling for them. Must scale for: Linear Regression, Logistic Regression, SVM, KNN, Neural Networks, PCA.
10

EDA — Exploratory Data Analysis

What it is: Visually and statistically examining data BEFORE modelling to discover patterns, anomalies, relationships, and guide preprocessing decisions. Coined by John Tukey (1977): "Let the data speak."
Pillar 1 · Univariate: one variable at a time. Distribution, shape, skew. Histogram, boxplot.
Pillar 2 · Bivariate: two variables. Correlation, trend, group differences. Scatter, grouped bar.
Pillar 3 · Multivariate: many variables. Interactions, clusters. Heatmap, pair plot, PCA.
Pillar 4 · Temporal: time-ordered data. Trends, seasonality. Line plot, rolling average.
EDA Finding | Action to Take
Feature severely right-skewed | Apply np.log1p() transform
Two features strongly correlated | Drop one — multicollinearity hurts linear models
Target class imbalanced (60/40+) | Use stratify=y in split + class_weight='balanced'
Feature has zero correlation with target | Candidate for removal — it's noise
Outliers present in feature | Use RobustScaler or clip with IQR bounds
Categorical with 100s of values | Use Target Encoding or group rare values into "Other"
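No code appears in this section, so here is a minimal sketch covering all four pillars. The column names ('salary', 'age', 'dept', 'date') are hypothetical, reused from earlier sections; plotting assumes matplotlib:
python — quick EDA pass (sketch; hypothetical columns)
import matplotlib.pyplot as plt

# Pillar 1: univariate distribution, shape, skew
print(df.describe())
df['salary'].hist(bins=50)
print(df['salary'].skew())            # > 1 suggests strong right skew

# Pillar 2: bivariate relationships and group differences
df.plot.scatter(x='age', y='salary')
print(df.groupby('dept')['salary'].mean())

# Pillar 3: multivariate correlation structure
corr = df.select_dtypes('number').corr()
plt.matshow(corr); plt.colorbar()

# Pillar 4: temporal trend via rolling average
df.set_index('date')['salary'].rolling(30).mean().plot()
plt.show()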
11

Overfitting & Underfitting

Underfitting (problem): model too simple to learn the pattern.
Train error: HIGH · Test error: HIGH · Gap: small
Analogy: never studied for the exam.

Just Right (goal): model learned the signal, not the noise.
Train error: LOW · Test error: LOW · Gap: small
Analogy: understood the concepts.

Overfitting (problem): model memorized the training data, including the noise.
Train error: VERY LOW · Test error: HIGH · Gap: large
Analogy: memorized exam answers word-for-word.

Fix Underfitting
Use a more powerful model (Tree → Random Forest → Neural Net)
Add more / better features (polynomial, interactions)
Reduce regularization (increase C, decrease alpha)
Train for more epochs (neural networks)
Fix Overfitting
Get more training data
Add regularization — L1, L2, Dropout
Use ensemble methods (Random Forest, Boosting)
Early stopping — stop before memorization begins
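A quick way to tell which side you are on, as a sketch (assumes a fitted sklearn model and an existing train/test split):
python — diagnosing fit from the train/test gap (sketch)
train_acc = model.score(X_train, y_train)
test_acc  = model.score(X_test, y_test)
print(f"train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc - test_acc:.3f}")

# Both low, small gap   -> underfitting: add capacity or features
# Train ~1.0, large gap -> overfitting: regularize or get more data
# Both high, small gap  -> just right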
12

Bias & Variance

THE FUNDAMENTAL EQUATION
Total Error = Bias² + Variance + Irreducible Noise
Bias² = (avg prediction − true value)²
Variance = avg((prediction − avg prediction)²)
Goal = minimize Bias² + Variance simultaneously
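The decomposition can be verified numerically: refit the same model on many resampled training sets, then measure how the average prediction misses the truth (Bias²) versus how individual predictions scatter around their own average (Variance). A sketch on synthetic data, where the true function and noise level are known by construction:
python — bias/variance decomposition on synthetic data (sketch)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng    = np.random.default_rng(42)
true_f = np.sin                               # known ground truth
x_test = np.linspace(0, 6, 100)

preds = []
for _ in range(200):                          # 200 independent training sets
    x = rng.uniform(0, 6, 80)
    y = true_f(x) + rng.normal(0, 0.3, 80)    # irreducible noise, sigma = 0.3
    tree = DecisionTreeRegressor(max_depth=3).fit(x.reshape(-1, 1), y)
    preds.append(tree.predict(x_test.reshape(-1, 1)))

preds    = np.array(preds)
avg_pred = preds.mean(axis=0)
bias_sq  = np.mean((avg_pred - true_f(x_test)) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"Bias² ≈ {bias_sq:.3f}, Variance ≈ {variance:.3f}, Noise = 0.3² = 0.09")
# Raising max_depth lowers Bias² and raises Variance: the tradeoff in action.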

Bias — "Systematic Wrong"

Also called: Underfitting
Train error: High
Test error: High
Gap: Small (both bad)
Cause: Too simple, too few features
Dart analogy: Clustered together, far from bullseye
Real analogy: GPS always points 2 km north of destination
Fix: More complex model, more features, less regularization

Variance — "Sensitive to Data"

Also called: Overfitting
Train error: Very low (~0)
Test error: High
Gap: Large (train ≫ test)
Cause: Too complex, too little data
Dart analogy: Centered on bullseye but wildly spread
Real analogy: Student memorized answers, not concepts
Fix: More data, regularization, ensembles, dropout
13

Generalization Techniques

Regularization
L1 / L2 / Dropout
L1 (Lasso): penalty = λ×Σ|w| → drives weights to exactly 0 (sparse model, feature selection)

L2 (Ridge): penalty = λ×Σw² → shrinks all weights (smooth, keeps all features)

Dropout: randomly zeros p% of neurons each step → prevents co-adaptation
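As a sketch on a linear model (sklearn, where alpha plays the role of λ):
python — L1 vs L2 in practice (sketch)
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=0.1).fit(X_train, y_train)    # L1: drives some weights to exactly 0
ridge = Ridge(alpha=1.0).fit(X_train, y_train)    # L2: shrinks all weights smoothly

print((lasso.coef_ == 0).sum(), "features zeroed out by L1")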
Cross-Validation
K-Fold / Stratified
Train on k-1 folds, test on 1 fold. Repeat k times. Report mean ± std.

K-Fold: general purpose
StratifiedKFold: preserves class ratio — ALWAYS use for classification

Reliable because every sample is tested exactly once.
Ensemble Methods
Bagging / Boosting
Bagging (Random Forest): average many independent trees → reduces variance

Boosting (XGBoost/GBM): sequential models each fixing previous errors → reduces bias + variance

Stacking: meta-model learns how to combine base models
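A sketch of all three flavors with sklearn estimators (defaults kept for brevity):
python — bagging, boosting, stacking (sketch)
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
)
from sklearn.linear_model import LogisticRegression

bagging  = RandomForestClassifier(n_estimators=200, random_state=42)
boosting = GradientBoostingClassifier(random_state=42)
stacking = StackingClassifier(
    estimators=[('rf', bagging), ('gb', boosting)],
    final_estimator=LogisticRegression()          # meta-model combines base models
)
stacking.fit(X_train, y_train)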
Early Stopping
Stop Before Memorization
Monitor validation loss during training. Stop when it stops improving.

patience=10: wait 10 epochs for improvement before stopping

Always restore best weights — the checkpoint before degradation started.
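A sketch using sklearn's gradient boosting, which has early stopping built in (n_iter_no_change is the patience; a neural-net version would monitor validation loss in the training loop the same way):
python — early stopping (sketch)
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(
    n_estimators=1000,          # upper bound; training stops well before this
    validation_fraction=0.1,    # internal hold-out set
    n_iter_no_change=10,        # patience: 10 rounds without improvement
    random_state=42
).fit(X_train, y_train)

print(gbm.n_estimators_, "trees actually trained")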
Symptom | Technique to Apply | Fixes
Train ≫ test accuracy | L1/L2 regularization, Dropout, more data | Overfitting
Both train & test low | More complex model, more features, less regularization | Underfitting
Limited training data | Data augmentation, transfer learning, SMOTE | Overfitting
Val loss rising during training | Early stopping, reduce learning rate | Overfitting
Need best accuracy | Ensemble — Random Forest, GBM, stacking | Both
14

Performance & Accuracy Metrics

Confusion Matrix — The Foundation
                 | Predicted Positive  | Predicted Negative
Actual Positive  | TP (True Positive)  | FN (False Negative)
Actual Negative  | FP (False Positive) | TN (True Negative)
TP: Correctly predicted positive. The wins.
FN: Missed a positive. Cancer patient sent home. Dangerous.
FP: False alarm. Healthy labeled sick. Annoying but safer.
TN: Correctly predicted negative. The quiet wins.
ALL FORMULAS
Accuracy = (TP + TN) / (TP + FP + TN + FN) → overall correct rate
Precision = TP / (TP + FP) → quality of positive predictions
Recall = TP / (TP + FN) → coverage of actual positives
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Specificity = TN / (TN + FP) → true negative rate
MAE = mean(|y_true − y_pred|) → avg absolute error (regression)
RMSE = √mean((y_true − y_pred)²) → penalizes large errors more
R² = 1 − SS_res / SS_tot → % variance explained
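Every formula above is a one-liner in sklearn. A sketch, assuming y_pred from a fitted classifier, y_proba from predict_proba, and y_true_reg/y_pred_reg from a regression task:
python — computing the metrics (sketch)
import numpy as np
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, mean_absolute_error, mean_squared_error, r2_score
)

# Classification (binary labels 0/1)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(accuracy_score(y_test, y_pred), precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred), f1_score(y_test, y_pred))
print(roc_auc_score(y_test, y_proba))   # y_proba = model.predict_proba(X_test)[:, 1]

# Regression
mae  = mean_absolute_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))
r2   = r2_score(y_true_reg, y_pred_reg)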
Metric | Use when | Avoid when
Accuracy | Balanced classes, equal error costs | Imbalanced data — 99% accuracy can mean nothing
Precision | FP is costly — spam filters, recommendations | Missing positives is dangerous
Recall | FN is costly — cancer, fraud, safety systems | FP are very costly (flags everything)
F1 Score | Imbalanced classes, both FP and FN matter | When one error is clearly worse than the other
ROC-AUC | Comparing models, threshold-independent | Severe imbalance — use PR-AUC instead
RMSE | Large errors especially bad (regression) | Outlier predictions that shouldn't dominate
R² | Explaining % of variance, comparing models | Non-linear data — it can still look good when it isn't
Correlation

Pearson r · Linear relationship
Range −1 to +1. Measures the strength of a linear relationship. Sensitive to outliers.
|r| < 0.1 negligible · 0.3 weak · 0.5 moderate · 0.7 strong · >0.9 very strong

Spearman ρ · Monotonic relationship
Rank-based. Works for ordinal data and non-normal distributions. Robust to outliers. Use when the relationship is curved but consistently directional.