Machine Learning
Study Cheatsheet

Data Cleaning · EDA · Overfitting · Bias-Variance · Metrics · Code Examples
Python 3.x pandas scikit-learn numpy scipy torch
01

Null Values

What it is: A null/NaN value is a missing placeholder — the data simply was never recorded. Most ML algorithms throw errors or produce wrong results when they encounter NaN.
python
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

# ── DETECT ───────────────────────────────────────────
df.isnull().sum()                   # count per column
(df.isnull().mean() * 100).round(2) # % missing per column
df.columns[df.isnull().any()]       # which columns have nulls

# ── DROP ─────────────────────────────────────────────
df = df.dropna()                            # drop rows with ANY null
df = df.dropna(subset=['age', 'salary'])    # drop only if null in these cols
df = df.dropna(axis=1, thresh=int(0.6 * len(df)))  # drop cols missing >40%

# ── FILL ─────────────────────────────────────────────
df['age']  = df['age'].fillna(df['age'].median())  # fill numeric with median
df['dept'] = df['dept'].fillna('Unknown')          # fill categorical with constant
df = df.ffill()                                    # time series: forward fill
df = df.bfill()                                    # time series: backward fill
Never fill the target variable (y). Drop those rows instead. Never call fit_transform on test data — only transform().
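The first rule as a one-line sketch (assuming the label column is named 'target'; adjust to your dataset):
python — handling a missing target (sketch)
# Hypothetical column name 'target': never impute the label, drop the rows instead
df = df.dropna(subset=['target'])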
02

Missing Values — Imputation

MCAR · Missing Completely At Random: missingness is random, with no pattern. Safe to drop rows. Rare in practice.
MAR · Missing At Random: missingness depends on other observed columns. Impute using other features.
MNAR · Missing Not At Random: missingness depends on the missing value itself. Add an indicator flag + impute.
python — all imputation strategies
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # must import first
from sklearn.impute import IterativeImputer

# Strategy 1 — Mean / Median / Mode (SimpleImputer)
imp_median = SimpleImputer(strategy='median')     # numeric
imp_mode   = SimpleImputer(strategy='most_frequent') # categorical
imp_const  = SimpleImputer(strategy='constant', fill_value='Unknown')

X_train = imp_median.fit_transform(X_train) # fit ONLY on train
X_test  = imp_median.transform(X_test)      # transform only on test

# Strategy 2 — KNN Imputer (borrows from similar rows)
knn_imp = KNNImputer(n_neighbors=5)
X_imputed = knn_imp.fit_transform(X_train)

# Strategy 3 — Iterative / MICE (most accurate, most expensive)
iter_imp = IterativeImputer(max_iter=10, random_state=42)
X_imputed = iter_imp.fit_transform(X_train)

# Strategy 4 — MNAR: add missingness indicator flag
df['salary_was_missing'] = df['salary'].isna().astype(int)
df['salary'] = df['salary'].fillna(df['salary'].median())
03

Outlier Detection & Treatment

What it is: Outliers are values far from the rest of the data. They can be genuine (a billionaire in a salary dataset) or errors (age = -5). Either way they distort model learning.
python — IQR method
# IQR — works for any distribution (most robust)
Q1    = df['salary'].quantile(0.25)
Q3    = df['salary'].quantile(0.75)
IQR   = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers  = df[(df['salary'] < lower) | (df['salary'] > upper)]
df_clean  = df[(df['salary'] >= lower) & (df['salary'] <= upper)]
df['salary'] = df['salary'].clip(lower, upper) # or cap (Winsorize)
python — Z-score method
from scipy import stats

# Z-score — good for normally distributed data
z = np.abs(stats.zscore(df['salary']))
df_clean = df[z < 3]  # keep within 3 standard deviations

# Multi-column Z-score
numeric_cols = df.select_dtypes('number').columns
z_all        = df[numeric_cols].apply(stats.zscore)
df_clean     = df[(np.abs(z_all) < 3).all(axis=1)]
python — Isolation Forest (multivariate)
from sklearn.ensemble import IsolationForest

iso    = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(df[numeric_cols])  # numeric_cols from above; -1 = outlier, 1 = inlier
df_clean = df[labels == 1]
💡Use IQR for single columns, Isolation Forest when outliers appear across multiple correlated features simultaneously.
04

Duplicates

Why it matters: Duplicate rows cause the model to learn some samples with double weight, biasing predictions. During cross-validation, duplicates in both train and test folds cause data leakage.
python
# ── DETECT ───────────────────────────────────────────
df.duplicated().sum()                         # count exact duplicates
df[df.duplicated()]                            # view them
df.duplicated(subset=['email']).sum()         # by specific columns

# ── REMOVE ───────────────────────────────────────────
df = df.drop_duplicates()                       # keep first occurrence
df = df.drop_duplicates(subset=['email', 'name'], keep='first')

# ── NEAR-DUPLICATES (fuzzy matching) ─────────────────
from thefuzz import fuzz
score = fuzz.ratio('Alice Smith', 'Alce Smth')  # 88 — likely duplicate
05

Wrong Format & Data Types

Common issues: ages stored as strings, dates in mixed formats ("2021-03-15" vs "March 2021"), booleans stored as "yes"/"True"/"1". Wrong types either make math operations fail outright or let them succeed with the wrong result ("1" + "2" gives "12", not 3).
python — type fixing
# ── NUMERIC ──────────────────────────────────────────
df['age'] = pd.to_numeric(df['age'], errors='coerce')  # bad → NaN

# ── DATETIME ─────────────────────────────────────────
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['year']       = df['date'].dt.year
df['month']      = df['date'].dt.month
df['dayofweek']  = df['date'].dt.dayofweek
df['is_weekend'] = df['dayofweek'] >= 5

# ── BOOLEAN UNIFICATION ──────────────────────────────
df['active'] = df['active'].map({
    'yes': True, 'True': True,  '1': True,  1: True,
    'no':  False, 'False': False, '0': False, 0: False
})

# ── MEMORY OPTIMIZATION ──────────────────────────────
df['dept'] = df['dept'].astype('category')  # less memory for categoricals
df['age']  = df['age'].astype('int32')      # int64 → int32 saves 50% RAM (column must be NaN-free)
06

Encoding Categorical Data

Method | When to use | Watch out
One-Hot | Nominal categories (no order) — color, city, dept | Use drop='first' to avoid the dummy variable trap
Ordinal | Ordered categories — low/medium/high | Never use on nominal data — implies false order
Label | Tree models only — just assigns integers | Implies false order for linear/SVM models
Target | High cardinality — city with 500 values | Must fit inside CV loop to prevent leakage
python — all encoding methods
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder

# One-Hot Encoding (nominal — no order)
enc = OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False)
X_enc = enc.fit_transform(df[['dept', 'city']])

# pandas shortcut
df = pd.get_dummies(df, columns=['dept'], drop_first=True)

# Ordinal Encoding (ordered — must specify order!)
enc_ord = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df['risk_enc'] = enc_ord.fit_transform(df[['risk']])

# Manual ordinal mapping (clearest approach)
df['risk_enc'] = df['risk'].map({'low': 0, 'medium': 1, 'high': 2})

# Label Encoding (tree models only)
le = LabelEncoder()
df['dept_enc'] = le.fit_transform(df['dept'])
🚨handle_unknown='ignore' is critical. Without it, your model crashes the moment it sees a new category in production.
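Target Encoding appears in the table above but not in the code. A minimal sketch, assuming a high-cardinality 'city' column and a numeric or binary target; the statistics come from the training fold only, per the leakage warning in the table:
python — target encoding (sketch; train statistics only)
# Assumed names: X_train/X_test DataFrames with a 'city' column, y_train aligned by index
global_mean = y_train.mean()                      # fallback for unseen cities
city_means  = y_train.groupby(X_train['city']).mean()

X_train['city_enc'] = X_train['city'].map(city_means)
X_test['city_enc']  = X_test['city'].map(city_means).fillna(global_mean)
# For full rigor, recompute city_means out-of-fold inside the CV loop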
07

Misspelling & Text Cleaning

Why it matters: "Marketing", "marketing", "MARKETING", "Marketting" are treated as 4 different categories by any encoder. One-hot encoding would create 4 columns instead of 1.
python — string normalization
# ── BASIC NORMALIZATION ───────────────────────────────
df['dept'] = (
    df['dept']
    .str.lower()           # "MARKETING" → "marketing"
    .str.strip()           # "marketing " → "marketing"
    .str.replace(r'\s+', ' ', regex=True)  # collapse multiple spaces
)

# ── FIX KNOWN VARIANTS ───────────────────────────────
corrections = {
    'mktg': 'marketing', 'mkt': 'marketing',
    'eng':  'engineering', 'engg': 'engineering',
    'hr':   'human resources'
}
df['dept'] = df['dept'].replace(corrections)

# ── REMOVE SPECIAL CHARACTERS ────────────────────────
df['name'] = df['name'].str.replace(r'[^a-zA-Z\s]', '', regex=True)

# ── FUZZY MATCHING — fix misspellings automatically ──
from thefuzz import process

valid_depts = ['marketing', 'engineering', 'finance', 'hr']

def fix_spelling(val, choices, threshold=80):
    match, score = process.extractOne(val, choices)
    return match if score >= threshold else 'unknown'

df['dept_clean'] = df['dept'].apply(lambda x: fix_spelling(x, valid_depts))
# "marketting" → "marketing" (score 92)
# "finace"     → "finance"   (score 91)
08

Train / Test Split

Golden rule: Always split BEFORE fitting any imputer, scaler, or encoder. Fit transformers on training data only, then transform both sets. This prevents data leakage — the #1 mistake in ML.
python — all split strategies
from sklearn.model_selection import (
    train_test_split, StratifiedKFold,
    KFold, cross_val_score
)

# ── BASIC SPLIT (80/20) ───────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% test, 80% train
    random_state=42,    # reproducibility
    stratify=y          # preserve class ratio (use for classification!)
)

# ── THREE-WAY SPLIT (train / val / test) ─────────────
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y)

X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15, random_state=42, stratify=y_trainval)
# 0.15 of the remaining 85% → roughly 72/13/15; use test_size=0.15/0.85 for exact 70/15/15

# ── K-FOLD CROSS VALIDATION ───────────────────────────
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(f"CV: {scores.mean():.3f} ± {scores.std():.3f}")

# ── CORRECT PIPELINE (no leakage) ────────────────────
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler()),
    ('model',   LogisticRegression())   # swap in any estimator
])
pipe.fit(X_train, y_train)   # fits imputer+scaler on train only
pipe.score(X_test, y_test)  # transforms test correctly
🚨Data Leakage: If you scale/impute BEFORE splitting, your test set has "seen" the training distribution. Model appears to generalize but it is cheating. Always split first.
09

Scaling & Standardization

Why it matters: A feature ranging 0–1,000,000 dominates a feature ranging 0–1 in distance-based and gradient-based models. Scaling puts all features on equal footing. Tree models (Random Forest, XGBoost) do NOT need scaling.
Scaler | Formula | Output | Best for | Outlier sensitive?
StandardScaler | (x − μ) / σ | ~[-3, 3] | Linear models, SVM, PCA, neural nets | Yes
MinMaxScaler | (x − min) / (max − min) | [0, 1] | KNN, neural nets with sigmoid | Yes
RobustScaler | (x − median) / IQR | unbounded | Data with outliers you want to keep | No
PowerTransformer | Yeo-Johnson | ~N(0, 1) | Skewed distributions → make Gaussian | No
python — all scalers
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler,
    RobustScaler, PowerTransformer
)

# StandardScaler — mean=0, std=1
scaler     = StandardScaler()
X_train_s  = scaler.fit_transform(X_train)  # fit + transform
X_test_s   = scaler.transform(X_test)       # transform ONLY
X_original = scaler.inverse_transform(X_train_s)  # reverse it

# MinMaxScaler — squish to [0,1]
mm        = MinMaxScaler(feature_range=(0, 1))
X_norm    = mm.fit_transform(X_train)

# RobustScaler — uses median+IQR, ignores outliers
robust    = RobustScaler()
X_robust  = robust.fit_transform(X_train)

# Log transform — for severely right-skewed data (salaries, prices)
df['log_salary'] = np.log1p(df['salary'])  # log1p = log(1+x), safe for 0

# PowerTransformer — makes distribution Gaussian
pt        = PowerTransformer(method='yeo-johnson')
X_normal  = pt.fit_transform(X_train[['salary', 'price']])

# Check skew before/after
print(df['salary'].skew())      # before: 4.8 (very skewed)
print(pd.Series(X_normal[:,0]).skew())  # after:  ~0.1
💡Tree models (Decision Tree, Random Forest, XGBoost, LightGBM) are invariant to scaling — don't waste time scaling for them. Must scale for: Linear Regression, Logistic Regression, SVM, KNN, Neural Networks, PCA.
10

EDA — Exploratory Data Analysis

What it is: Visually and statistically examining data BEFORE modelling to discover patterns, anomalies, relationships, and guide preprocessing decisions. Coined by John Tukey (1977): "Let the data speak."
Pillar 1 · Univariate: one variable at a time. Distribution, shape, skew. Histogram, boxplot.
Pillar 2 · Bivariate: two variables. Correlation, trend, group differences. Scatter, grouped bar.
Pillar 3 · Multivariate: many variables. Interactions, clusters. Heatmap, pair plot, PCA.
Pillar 4 · Temporal: time-ordered data. Trends, seasonality. Line plot, rolling average.
EDA Finding | Action to Take
Feature severely right-skewed | Apply np.log1p() transform
Two features strongly correlated | Drop one — multicollinearity hurts linear models
Target class imbalanced (60/40+) | Use stratify=y in split + class_weight='balanced'
Feature has zero correlation with target | Candidate for removal — it's noise
Outliers present in feature | Use RobustScaler or clip with IQR bounds
Categorical with 100s of values | Use Target Encoding or group rare values into "Other"
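No code appears in this section, so here is a minimal sketch covering all four pillars. The column names ('salary', 'age', 'dept', 'date') are hypothetical, reused from earlier sections; plotting assumes matplotlib:
python — quick EDA pass (sketch; hypothetical columns)
import matplotlib.pyplot as plt

# Pillar 1: univariate distribution, shape, skew
print(df.describe())
df['salary'].hist(bins=50)
print(df['salary'].skew())            # > 1 suggests strong right skew

# Pillar 2: bivariate relationships and group differences
df.plot.scatter(x='age', y='salary')
print(df.groupby('dept')['salary'].mean())

# Pillar 3: multivariate correlation structure
corr = df.select_dtypes('number').corr()
plt.matshow(corr); plt.colorbar()

# Pillar 4: temporal trend via rolling average
df.set_index('date')['salary'].rolling(30).mean().plot()
plt.show()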
11

Overfitting & Underfitting

Underfitting (problem): model too simple to learn the pattern.
Train error: HIGH · Test error: HIGH · Gap: small
Analogy: never studied for the exam.

Just Right (goal): model learned the signal, not the noise.
Train error: LOW · Test error: LOW · Gap: small
Analogy: understood the concepts.

Overfitting (problem): model memorized the training data, including the noise.
Train error: VERY LOW · Test error: HIGH · Gap: large
Analogy: memorized exam answers word-for-word.

Fix Underfitting
Use a more powerful model (Tree → Random Forest → Neural Net)
Add more / better features (polynomial, interactions)
Reduce regularization (increase C, decrease alpha)
Train for more epochs (neural networks)
Fix Overfitting
Get more training data
Add regularization — L1, L2, Dropout
Use ensemble methods (Random Forest, Boosting)
Early stopping — stop before memorization begins
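A quick way to tell which side you are on, as a sketch (assumes a fitted sklearn model and an existing train/test split):
python — diagnosing fit from the train/test gap (sketch)
train_acc = model.score(X_train, y_train)
test_acc  = model.score(X_test, y_test)
print(f"train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc - test_acc:.3f}")

# Both low, small gap   -> underfitting: add capacity or features
# Train ~1.0, large gap -> overfitting: regularize or get more data
# Both high, small gap  -> just right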
12

Bias & Variance

THE FUNDAMENTAL EQUATION
Total Error = Bias² + Variance + Irreducible Noise
Bias² = (avg prediction − true value)²
Variance = avg((prediction − avg prediction)²)
Goal = minimize Bias² + Variance simultaneously
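The decomposition can be verified numerically: refit the same model on many resampled training sets, then measure how the average prediction misses the truth (Bias²) versus how individual predictions scatter around their own average (Variance). A sketch on synthetic data, where the true function and noise level are known by construction:
python — bias/variance decomposition on synthetic data (sketch)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng    = np.random.default_rng(42)
true_f = np.sin                               # known ground truth
x_test = np.linspace(0, 6, 100)

preds = []
for _ in range(200):                          # 200 independent training sets
    x = rng.uniform(0, 6, 80)
    y = true_f(x) + rng.normal(0, 0.3, 80)    # irreducible noise, sigma = 0.3
    tree = DecisionTreeRegressor(max_depth=3).fit(x.reshape(-1, 1), y)
    preds.append(tree.predict(x_test.reshape(-1, 1)))

preds    = np.array(preds)
avg_pred = preds.mean(axis=0)
bias_sq  = np.mean((avg_pred - true_f(x_test)) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"Bias² ≈ {bias_sq:.3f}, Variance ≈ {variance:.3f}, Noise = 0.3² = 0.09")
# Raising max_depth lowers Bias² and raises Variance: the tradeoff in action.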

Bias — "Systematic Wrong"

Also called: Underfitting
Train error: High
Test error: High
Gap: Small (both bad)
Cause: Too simple, too few features
Dart analogy: Clustered together, far from bullseye
Real analogy: GPS always points 2 km north of destination
Fix: More complex model, more features, less regularization

Variance — "Sensitive to Data"

Also called: Overfitting
Train error: Very low (~0)
Test error: High
Gap: Large (train ≫ test)
Cause: Too complex, too little data
Dart analogy: Centered on bullseye but wildly spread
Real analogy: Student memorized answers, not concepts
Fix: More data, regularization, ensembles, dropout
13

Generalization Techniques

Regularization
L1 / L2 / Dropout
L1 (Lasso): penalty = λ×Σ|w| → drives weights to exactly 0 (sparse model, feature selection)

L2 (Ridge): penalty = λ×Σw² → shrinks all weights (smooth, keeps all features)

Dropout: randomly zeros p% of neurons each step → prevents co-adaptation
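As a sketch on a linear model (sklearn, where alpha plays the role of λ):
python — L1 vs L2 in practice (sketch)
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=0.1).fit(X_train, y_train)    # L1: drives some weights to exactly 0
ridge = Ridge(alpha=1.0).fit(X_train, y_train)    # L2: shrinks all weights smoothly

print((lasso.coef_ == 0).sum(), "features zeroed out by L1")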
Cross-Validation
K-Fold / Stratified
Train on k-1 folds, test on 1 fold. Repeat k times. Report mean ± std.

K-Fold: general purpose
StratifiedKFold: preserves class ratio — ALWAYS use for classification

Reliable because every sample is tested exactly once.
Ensemble Methods
Bagging / Boosting
Bagging (Random Forest): average many independent trees → reduces variance

Boosting (XGBoost/GBM): sequential models each fixing previous errors → reduces bias + variance

Stacking: meta-model learns how to combine base models
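A sketch of all three flavors with sklearn estimators (defaults kept for brevity):
python — bagging, boosting, stacking (sketch)
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
)
from sklearn.linear_model import LogisticRegression

bagging  = RandomForestClassifier(n_estimators=200, random_state=42)
boosting = GradientBoostingClassifier(random_state=42)
stacking = StackingClassifier(
    estimators=[('rf', bagging), ('gb', boosting)],
    final_estimator=LogisticRegression()          # meta-model combines base models
)
stacking.fit(X_train, y_train)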
Early Stopping
Stop Before Memorization
Monitor validation loss during training. Stop when it stops improving.

patience=10: wait 10 epochs for improvement before stopping

Always restore best weights — the checkpoint before degradation started.
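A sketch using sklearn's gradient boosting, which has early stopping built in (n_iter_no_change is the patience; a neural-net version would monitor validation loss in the training loop the same way):
python — early stopping (sketch)
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(
    n_estimators=1000,          # upper bound; training stops well before this
    validation_fraction=0.1,    # internal hold-out set
    n_iter_no_change=10,        # patience: 10 rounds without improvement
    random_state=42
).fit(X_train, y_train)

print(gbm.n_estimators_, "trees actually trained")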
Symptom | Technique to Apply | Fixes
Train ≫ test accuracy | L1/L2 regularization, Dropout, more data | Overfitting
Both train & test low | More complex model, more features, less regularization | Underfitting
Limited training data | Data augmentation, transfer learning, SMOTE | Overfitting
Val loss rising during training | Early stopping, reduce learning rate | Overfitting
Need best accuracy | Ensemble — Random Forest, GBM, stacking | Both
14

Performance & Accuracy Metrics

Confusion Matrix — The Foundation
                 | Predicted Positive  | Predicted Negative
Actual Positive  | TP (True Positive)  | FN (False Negative)
Actual Negative  | FP (False Positive) | TN (True Negative)
TP: Correctly predicted positive. The wins.
FN: Missed a positive. Cancer patient sent home. Dangerous.
FP: False alarm. Healthy labeled sick. Annoying but safer.
TN: Correctly predicted negative. The quiet wins.
ALL FORMULAS
Accuracy = (TP + TN) / (TP + FP + TN + FN) → overall correct rate
Precision = TP / (TP + FP) → quality of positive predictions
Recall = TP / (TP + FN) → coverage of actual positives
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Specificity = TN / (TN + FP) → true negative rate
MAE = mean(|y_true − y_pred|) → avg absolute error (regression)
RMSE = √mean((y_true − y_pred)²) → penalizes large errors more
R² = 1 − SS_res / SS_tot → % variance explained
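Every formula above is a one-liner in sklearn. A sketch, assuming y_pred from a fitted classifier, y_proba from predict_proba, and y_true_reg/y_pred_reg from a regression task:
python — computing the metrics (sketch)
import numpy as np
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, mean_absolute_error, mean_squared_error, r2_score
)

# Classification (binary labels 0/1)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(accuracy_score(y_test, y_pred), precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred), f1_score(y_test, y_pred))
print(roc_auc_score(y_test, y_proba))   # y_proba = model.predict_proba(X_test)[:, 1]

# Regression
mae  = mean_absolute_error(y_true_reg, y_pred_reg)
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))
r2   = r2_score(y_true_reg, y_pred_reg)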
Metric | Use when | Avoid when
Accuracy | Balanced classes, equal error costs | Imbalanced data — 99% accuracy can mean nothing
Precision | FP is costly — spam filters, recommendations | Missing positives is dangerous
Recall | FN is costly — cancer, fraud, safety systems | FP are very costly (flags everything)
F1 Score | Imbalanced classes, both FP and FN matter | When one error is clearly worse than the other
ROC-AUC | Comparing models, threshold-independent | Severe imbalance — use PR-AUC instead
RMSE | Large errors especially bad (regression) | Outlier predictions that shouldn't dominate
R² | Explaining % of variance, comparing models | Non-linear data — it can still look good when it isn't
Correlation

Pearson r · Linear relationship
Range −1 to +1. Measures the strength of a linear relationship. Sensitive to outliers.
|r| < 0.1 negligible · 0.3 weak · 0.5 moderate · 0.7 strong · >0.9 very strong

Spearman ρ · Monotonic relationship
Rank-based. Works for ordinal data and non-normal distributions. Robust to outliers. Use when the relationship is curved but consistently directional.