
Machine Learning Algorithms

A comprehensive guide to fundamental machine learning algorithms, their applications, strengths, weaknesses, and when to use each approach.

Quick Algorithm Selection Guide

By Problem Type

  • Regression: Linear Regression, Random Forest, SVM, Neural Networks
  • Classification: Logistic Regression, Decision Trees, Random Forest, SVM, k-NN, Naive Bayes
  • Clustering: k-Means, Hierarchical Clustering, DBSCAN
  • Dimensionality Reduction: PCA, t-SNE, UMAP
  • Anomaly Detection: Isolation Forest, One-Class SVM, Local Outlier Factor

By Data Size

  • Small datasets (< 1K samples): k-NN, Naive Bayes, Decision Trees, AdaBoost
  • Medium datasets (1K-100K): SVM, Random Forest, XGBoost, LightGBM, CatBoost
  • Large datasets (> 100K): Linear models, Neural Networks, LightGBM, SGD variants

By Interpretability

  • High: Linear Regression, Decision Trees, Naive Bayes
  • Medium: Random Forest, k-NN, AdaBoost
  • Low: SVM (with RBF kernel), Neural Networks, XGBoost, LightGBM, CatBoost

Supervised Learning Algorithms

Linear Regression

Purpose: Predict continuous target variable using linear relationship

Mathematical Formula:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
Cost Function: J(θ) = 1/(2m) Σ(hθ(x⁽ⁱ⁾) - y⁽ⁱ⁾)²

Key Parameters: - fit_intercept: Whether to calculate the intercept (default: True) - normalize: Deprecated and since removed from scikit-learn; scale features with StandardScaler instead - Note: plain LinearRegression has no solver parameter; the listed options ('auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga') are solvers for the regularized Ridge estimator

When to Use: - ✅ Linear relationship between features and target - ✅ Need interpretable results - ✅ Fast training and prediction required - ✅ Baseline model for comparison

Strengths: - Fast training and prediction - No hyperparameter tuning needed - Provides interpretable feature weights (coefficients) - Works well when the target depends approximately linearly on the features - Memory efficient

Weaknesses: - Assumes a linear relationship - Sensitive to outliers - Benefits from feature scaling (and scaling is required for regularized variants such as Ridge/Lasso) - Poor performance with non-linear patterns

Preprocessing: - Scale features (StandardScaler, MinMaxScaler) - Handle outliers - Feature engineering for non-linear relationships
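A minimal scikit-learn sketch (synthetic data from make_regression is assumed purely for illustration):

# Linear Regression example (illustrative; assumes synthetic data)
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler().fit(X_train)                 # scale features as noted above
model = LinearRegression().fit(scaler.transform(X_train), y_train)

print("Coefficients:", model.coef_)                    # interpretable feature weights
print("Test R^2:", model.score(scaler.transform(X_test), y_test))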


Logistic Regression

Purpose: Binary or multi-class classification using logistic function

Mathematical Formula:

Sigmoid: σ(z) = 1 / (1 + e^(-z))
Probability: P(y=1|x) = σ(θᵀx)
Cost Function: J(θ) = -1/m Σ[y⁽ⁱ⁾log(hθ(x⁽ⁱ⁾)) + (1-y⁽ⁱ⁾)log(1-hθ(x⁽ⁱ⁾))]

Key Parameters: - C: Inverse of regularization strength (default: 1.0) - penalty: Regularization type ('l1', 'l2', 'elasticnet', 'none') - solver: Algorithm ('liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga') - max_iter: Maximum iterations (default: 100)

When to Use: - ✅ Binary or multi-class classification - ✅ Need probability estimates - ✅ Linear decision boundary is appropriate - ✅ Fast training required

Strengths: - Outputs probability estimates - Less prone to overfitting than complex models - Fast training - Interpretable coefficients

Weaknesses: - Assumes linear relationship between features and log-odds - Sensitive to outliers - Requires feature scaling - Can struggle with complex relationships
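A minimal sketch with scikit-learn (synthetic data assumed; the pipeline handles the feature scaling the model is sensitive to):

# Logistic Regression example (illustrative; assumes synthetic data)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print("Probability estimates:", clf.predict_proba(X_test[:3]))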


Decision Trees

Purpose: Classification or regression using tree-like model of decisions

Key Concepts:

Information Gain = Entropy(parent) - Weighted_Avg(Entropy(children))
Gini Impurity = 1 - Σ(pᵢ)²
MSE (regression) = 1/n Σ(yᵢ - ŷ)²

Key Parameters: - criterion: Split quality measure ('gini', 'entropy', 'log_loss' for classification; 'squared_error', 'absolute_error' for regression) - max_depth: Maximum tree depth (default: None) - min_samples_split: Minimum samples to split node (default: 2) - min_samples_leaf: Minimum samples in leaf node (default: 1) - max_features: Number of features for best split (default: None)

When to Use: - ✅ Need interpretable model - ✅ Non-linear relationships exist - ✅ Mixed data types (numerical and categorical) - ✅ Feature interactions are important

Strengths: - Highly interpretable - Handles both numerical and categorical data - No need for feature scaling - Automatically handles feature interactions - Fast prediction

Weaknesses: - Prone to overfitting - Unstable (small data changes → different trees) - Biased toward features with more levels - Cannot capture linear relationships efficiently

Hyperparameter Tuning:

# Prevent overfitting
max_depth = [3, 5, 7, 10, None]
min_samples_split = [2, 5, 10, 20]
min_samples_leaf = [1, 2, 4, 8]
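One way to put these grids to work, sketched with GridSearchCV (the data here is synthetic and the ranges are the ones above, not tuned recommendations):

# Decision tree grid search (illustrative)
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 4, 8],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)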

Random Forest

Purpose: Ensemble of decision trees for improved accuracy and reduced overfitting

Key Concepts:

Final Prediction = Average/Majority Vote of all trees
Out-of-Bag Error: Error rate using samples not used in training each tree
Feature Importance: Average decrease in impurity when feature is used for splits

Key Parameters: - n_estimators: Number of trees (default: 100) - max_depth: Maximum depth of trees (default: None) - min_samples_split: Minimum samples to split (default: 2) - min_samples_leaf: Minimum samples in leaf (default: 1) - max_features: Features to consider for splits (default: 'sqrt') - bootstrap: Whether to bootstrap samples (default: True)

When to Use: - ✅ Need robust, accurate model - ✅ Have mixed data types - ✅ Want feature importance rankings - ✅ Can tolerate longer training time

Strengths: - Reduces overfitting compared to single trees - Provides feature importance - Handles missing values - Works well out-of-the-box - Robust to outliers

Weaknesses: - Less interpretable than single tree - Can overfit with very noisy data - Biased toward categorical features with many categories - Memory intensive

Hyperparameter Tuning:

# Key parameters to tune
n_estimators = [100, 200, 300, 500]
max_depth = [3, 5, 7, 10, None]
min_samples_split = [2, 5, 10]
max_features = ['sqrt', 'log2', None]
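A minimal sketch showing training and the feature importance ranking mentioned above (synthetic data assumed):

# Random Forest example (illustrative; assumes synthetic data)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=300, max_features='sqrt', random_state=42)
rf.fit(X_train, y_train)

print("Test accuracy:", rf.score(X_test, y_test))
print("Most important features (indices):", rf.feature_importances_.argsort()[::-1][:5])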

Gradient Boosting

Purpose: Sequential ensemble method that builds models iteratively, each correcting errors of previous models

Key Concepts:

Sequential Learning: F_m(x) = F_(m-1)(x) + h_m(x)
Gradient Descent: h_m = argmin Σ L(y_i, F_(m-1)(x_i) + h(x_i))
Loss Function: Optimize differentiable loss functions
Weak Learners: Typically shallow decision trees (stumps)
Shrinkage: F_m(x) = F_(m-1)(x) + ν * h_m(x) where ν is learning rate

Mathematical Framework:

1. Initialize: F_0(x) = argmin_γ Σ L(y_i, γ)
2. For m = 1 to M:
   a. Compute residuals: r_im = -∂L(y_i, F_(m-1)(x_i))/∂F_(m-1)(x_i)
   b. Fit weak learner: h_m(x) to predict residuals r_im
   c. Find optimal step size: γ_m = argmin_γ Σ L(y_i, F_(m-1)(x_i) + γh_m(x_i))
   d. Update: F_m(x) = F_(m-1)(x) + γ_m * h_m(x)
3. Output: F_M(x)

Popular Algorithms:

XGBoost (eXtreme Gradient Boosting):
  • Strengths: High performance, handles missing values, built-in regularization, parallel processing
  • Best for: Structured/tabular data, competitions, high accuracy requirements
  • Key Parameters:
    - n_estimators: Number of boosting rounds (100-1000)
    - max_depth: Maximum tree depth (3-10)
    - learning_rate: Step size shrinkage (0.01-0.3)
    - subsample: Fraction of samples (0.8-1.0)
    - colsample_bytree: Fraction of features (0.8-1.0)
    - reg_alpha: L1 regularization (0-10)
    - reg_lambda: L2 regularization (1-10)

# XGBoost Example
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Basic usage
xgb_model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42
)
xgb_model.fit(X_train, y_train)

# Hyperparameter tuning
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 0.9, 1.0]
}
grid_search = GridSearchCV(xgb.XGBClassifier(random_state=42), param_grid, cv=3)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)

LightGBM (Light Gradient Boosting Machine):
  • Strengths: Fast training, low memory usage, high accuracy, handles categorical features
  • Best for: Large datasets, fast training requirements, limited memory
  • Key Parameters:
    - num_leaves: Maximum leaves in one tree (31-300)
    - learning_rate: Shrinkage rate (0.01-0.3)
    - feature_fraction: Fraction of features (0.8-1.0)
    - bagging_fraction: Fraction of data (0.8-1.0)
    - min_data_in_leaf: Minimum samples in leaf (20-100)
    - lambda_l1, lambda_l2: Regularization terms

# LightGBM Example
import lightgbm as lgb

# Basic usage
lgb_model = lgb.LGBMClassifier(
    num_leaves=31,
    learning_rate=0.1,
    n_estimators=100,
    random_state=42
)
lgb_model.fit(X_train, y_train)

# Advanced configuration
lgb_train = lgb.Dataset(X_train, y_train)
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.1,
    'feature_fraction': 0.9
}
model = lgb.train(params, lgb_train, num_boost_round=100)

CatBoost (Categorical Boosting):
  • Strengths: Handles categorical features automatically, robust to overfitting, good default parameters
  • Best for: Datasets with many categorical features, minimal preprocessing
  • Key Parameters:
    - iterations: Number of boosting iterations (100-1000)
    - learning_rate: Learning rate (0.01-0.3)
    - depth: Tree depth (4-10)
    - l2_leaf_reg: L2 regularization (1-10)
    - border_count: Number of splits for numerical features (32-255)

# CatBoost Example
from catboost import CatBoostClassifier

# Basic usage (handles categorical features automatically)
cat_model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    verbose=False
)
cat_model.fit(X_train, y_train, cat_features=['category_col1', 'category_col2'])

# With categorical feature indices
cat_features = [0, 1, 5]  # Column indices of categorical features
cat_model.fit(X_train, y_train, cat_features=cat_features)

AdaBoost (Adaptive Boosting):
  • Strengths: Simple algorithm, good for binary classification, less prone to overfitting
  • Best for: Small datasets, when interpretability is important
  • Key Parameters:
    - n_estimators: Number of weak learners (50-500)
    - learning_rate: Weight applied to each classifier (0.1-2.0)
    - algorithm: 'SAMME' for multi-class (the older 'SAMME.R' option is deprecated in recent scikit-learn releases)

# AdaBoost Example
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Note: The 'estimator' parameter was named 'base_estimator' in scikit-learn < 1.2
ada_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    learning_rate=1.0,
    random_state=42
)
ada_model.fit(X_train, y_train)

When to Use Gradient Boosting: - ✅ Tabular/structured data with mixed feature types - ✅ High accuracy is priority over interpretability
- ✅ Medium to large datasets (1K+ samples) - ✅ Complex non-linear relationships exist - ✅ Have time for hyperparameter tuning - ✅ Competition or benchmark scenarios

Advantages: - High predictive accuracy - Handles mixed data types well - Built-in feature selection - Robust to outliers and missing values - No need for feature scaling - Provides feature importance - Can optimize various loss functions

Disadvantages: - Prone to overfitting (especially with small datasets) - Computationally expensive - Many hyperparameters to tune - Sequential training (harder to parallelize base algorithm) - Less interpretable than single trees - Sensitive to noisy data

Hyperparameter Tuning Strategy:

# 1. Start with learning rate and number of estimators
learning_rates = [0.01, 0.1, 0.2]
n_estimators = [100, 300, 500]

# 2. Tune tree-specific parameters
max_depths = [3, 5, 7, 10]
min_samples_splits = [2, 5, 10]

# 3. Add regularization
reg_alphas = [0, 0.1, 1, 10]  # L1
reg_lambdas = [1, 5, 10]      # L2

# 4. Fine-tune sampling parameters
subsamples = [0.8, 0.9, 1.0]
colsample_bytrees = [0.8, 0.9, 1.0]

Common Loss Functions: - Regression: Squared error, absolute error, Huber, Quantile - Classification: Logistic loss, exponential loss, Hinge loss - Ranking: Pairwise ranking, LambdaRank, NDCG

Best Practices:
  1. Start simple: Begin with default parameters and a small learning rate
  2. Cross-validation: Use CV for reliable performance estimates
  3. Early stopping: Monitor validation loss to prevent overfitting (see the sketch below)
  4. Feature engineering: Create meaningful features before boosting
  5. Regularization: Use L1/L2 regularization and sampling techniques
  6. Ensemble stacking: Combine multiple boosting models
  7. Monitor overfitting: Track training vs validation metrics
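A minimal early-stopping sketch using scikit-learn's GradientBoostingClassifier; XGBoost and LightGBM expose equivalent options, though argument names vary by version:

# Early stopping example (illustrative; synthetic data assumed)
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

gb = GradientBoostingClassifier(
    n_estimators=1000,          # upper bound on boosting rounds
    learning_rate=0.05,
    validation_fraction=0.1,    # held-out split used to monitor loss
    n_iter_no_change=10,        # stop when validation loss stops improving
    random_state=42
)
gb.fit(X, y)
print("Boosting rounds actually used:", gb.n_estimators_)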

Performance Comparison:

| Algorithm | Training Speed | Memory Usage | Categorical Handling | Accuracy | Interpretability |
|-----------|----------------|--------------|----------------------|----------|------------------|
| XGBoost   | ⚡⚡ | ⚡⚡ | ⚡ | ⭐⭐⭐ | ⭐ |
| LightGBM  | ⚡⚡⚡ | ⚡⚡⚡ | ⚡⚡ | ⭐⭐⭐ | ⭐ |
| CatBoost  | ⚡⚡ | ⚡⚡ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐ |
| AdaBoost  | ⚡ | ⚡⚡⚡ | ⚡ | ⭐⭐ | ⭐⭐ |


Support Vector Machine (SVM)

Purpose: Classification or regression by finding optimal hyperplane

Mathematical Formula:

Objective: minimize 1/2||w||² subject to yᵢ(wᵀxᵢ + b) ≥ 1
Kernel Trick: K(x, x') maps input space to higher dimensional space
RBF Kernel: K(x, x') = exp(-γ||x - x'||²)

Key Parameters: - C: Regularization parameter (default: 1.0) - kernel: Kernel type ('linear', 'poly', 'rbf', 'sigmoid') - gamma: Kernel coefficient for 'rbf', 'poly', 'sigmoid' ('scale', 'auto', or float) - degree: Degree for polynomial kernel (default: 3)

When to Use: - ✅ High-dimensional data - ✅ Clear margin of separation - ✅ Memory efficient solution needed - ✅ Versatile (different kernels)

Strengths: - Effective in high dimensions - Memory efficient - Versatile (different kernel functions) - Works well with clear separation margin

Weaknesses: - Slow training and high memory use on large datasets - Sensitive to feature scaling - No direct probabilistic output (probability estimates require extra calibration, e.g. probability=True in scikit-learn) - Choice of kernel and parameters is crucial

Hyperparameter Tuning:

# Grid search parameters
C = [0.1, 1, 10, 100]
gamma = ['scale', 'auto', 0.001, 0.01, 0.1, 1]
kernel = ['linear', 'rbf', 'poly']
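A minimal sketch combining the scaling requirement with the grid above (synthetic data assumed; the svc__ prefixes address the pipeline step):

# SVM example with scaling and grid search (illustrative)
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
param_grid = {'svc__C': [0.1, 1, 10, 100], 'svc__gamma': ['scale', 0.01, 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)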

k-Nearest Neighbors (k-NN)

Purpose: Classification or regression based on k closest training examples

Mathematical Formula:

Distance Metrics:
- Euclidean: d(x,y) = √Σ(xᵢ - yᵢ)²
- Manhattan: d(x,y) = Σ|xᵢ - yᵢ|
- Minkowski: d(x,y) = (Σ|xᵢ - yᵢ|ᵖ)^(1/p)

Classification: Majority vote of k neighbors
Regression: Average of k neighbors

Key Parameters: - n_neighbors: Number of neighbors (default: 5) - weights: Weight function ('uniform', 'distance', or callable) - algorithm: Algorithm to compute neighbors ('auto', 'ball_tree', 'kd_tree', 'brute') - metric: Distance metric ('euclidean', 'manhattan', 'minkowski') - p: Power parameter for Minkowski metric (default: 2)

When to Use: - ✅ Simple, non-parametric approach needed - ✅ Local patterns in data are important - ✅ Irregular decision boundaries - ✅ Small to medium datasets

Strengths: - Simple to understand and implement - No assumptions about data distribution - Works well with small datasets - Naturally handles multi-class problems

Weaknesses: - Computationally expensive for large datasets - Sensitive to irrelevant features (curse of dimensionality) - Sensitive to local structure of data - Memory intensive

Hyperparameter Tuning:

# Key parameters
n_neighbors = [3, 5, 7, 9, 11, 15]
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']
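A minimal sketch; because k-NN is distance-based, features are scaled inside a pipeline (synthetic data assumed):

# k-NN example (illustrative; assumes synthetic data)
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=800, n_features=8, random_state=42)

for k in [3, 5, 7, 11]:
    knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier(n_neighbors=k, weights='distance'))
    scores = cross_val_score(knn, X, y, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")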

Naive Bayes

Purpose: Classification based on Bayes' theorem with independence assumption

Mathematical Formula:

Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
Naive Bayes: P(y|x₁,...,xₙ) = P(y) * ∏P(xᵢ|y) / P(x₁,...,xₙ)

Variants:
- Gaussian: P(xᵢ|y) = 1/√(2πσᵧ²) * exp(-(xᵢ-μᵧ)²/2σᵧ²)
- Multinomial: P(xᵢ|y) = (count(xᵢ,y) + α) / (count(y) + α*n)
- Bernoulli: P(xᵢ|y) = P(xᵢ=1|y)^xᵢ * (1-P(xᵢ=1|y))^(1-xᵢ)

Types & Parameters: - GaussianNB: For continuous features - MultinomialNB: For discrete counts (alpha for smoothing) - BernoulliNB: For binary features (alpha for smoothing)

When to Use: - ✅ Text classification - ✅ Small datasets - ✅ Need fast, simple baseline - ✅ Features are relatively independent

Strengths: - Fast training and prediction - Works well with small datasets - Handles multi-class naturally - Good baseline for text classification - Not sensitive to irrelevant features

Weaknesses: - Strong independence assumption rarely holds in practice - Can be outperformed by more sophisticated methods - Requires smoothing to handle zero probabilities - Probability estimates are often poorly calibrated (class predictions may still be good)
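A minimal text-classification sketch with MultinomialNB (the tiny corpus and labels are made up purely for illustration):

# Naive Bayes text classification example (toy data)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["cheap pills buy now", "meeting at noon tomorrow",
         "win money fast", "project status update attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (hypothetical labels)

clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(texts, labels)
print(clf.predict(["buy cheap now", "status of the meeting tomorrow"]))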


Neural Networks

Purpose: Complex pattern recognition using interconnected nodes

Mathematical Formula:

Forward Pass: a⁽ˡ⁺¹⁾ = σ(W⁽ˡ⁾a⁽ˡ⁾ + b⁽ˡ⁾)
Loss Function: L = 1/m Σ loss(ŷ⁽ⁱ⁾, y⁽ⁱ⁾)
Backpropagation: ∂L/∂W = ∂L/∂a * ∂a/∂z * ∂z/∂W

Common Activation Functions:
- Sigmoid: σ(x) = 1/(1+e⁻ˣ)
- ReLU: f(x) = max(0,x)
- Tanh: tanh(x) = (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ)

Key Parameters: - hidden_layer_sizes: Tuple of hidden layer sizes (default: (100,)) - activation: Activation function ('identity', 'logistic', 'tanh', 'relu') - solver: Weight optimization solver ('lbfgs', 'sgd', 'adam') - alpha: L2 penalty parameter (default: 0.0001) - learning_rate: Learning rate schedule ('constant', 'invscaling', 'adaptive') - max_iter: Maximum iterations (default: 200)

When to Use: - ✅ Complex, non-linear relationships - ✅ Large datasets available - ✅ High-dimensional problems - ✅ Can afford longer training time

Strengths: - Can model complex non-linear relationships - Universal function approximators - Flexible architecture - Good performance with large datasets

Weaknesses: - Requires large datasets - Prone to overfitting - Many hyperparameters to tune - Black box (low interpretability) - Computationally intensive
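A minimal sketch with scikit-learn's MLPClassifier (a small synthetic problem; in practice neural networks favor much larger datasets):

# Neural network (MLP) example (illustrative; assumes synthetic data)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), activation='relu',
                  alpha=1e-4, max_iter=500, random_state=42)
)
mlp.fit(X_train, y_train)
print("Test accuracy:", mlp.score(X_test, y_test))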


Unsupervised Learning Algorithms

k-Means Clustering

Purpose: Partition data into k clusters based on feature similarity

Mathematical Formula:

Objective: minimize Σᵢ₌₁ᵏ Σₓ∈Cᵢ ||x - μᵢ||²
Where μᵢ is the centroid of cluster Cᵢ

Algorithm:
1. Initialize k centroids randomly
2. Assign points to nearest centroid
3. Update centroids to cluster mean
4. Repeat until convergence

Key Parameters: - n_clusters: Number of clusters (default: 8) - init: Initialization method ('k-means++', 'random') - n_init: Number of random initializations (default: 10) - max_iter: Maximum iterations (default: 300) - tol: Tolerance for convergence (default: 1e-4)

When to Use: - ✅ Know approximate number of clusters - ✅ Clusters are spherical and similar sized - ✅ Need fast clustering algorithm - ✅ Continuous features

Strengths: - Simple and fast - Works well with spherical clusters - Scales well to large datasets - Guaranteed convergence

Weaknesses: - Must specify number of clusters - Assumes spherical clusters - Sensitive to initialization and outliers - Struggles with clusters of different sizes/densities

Choosing k:

# Elbow method: plot WCSS vs k
# Silhouette analysis: measure cluster cohesion
# Gap statistic: compare to random data
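A minimal elbow-method sketch, printing the within-cluster sum of squares (inertia) for a range of k (synthetic blobs assumed):

# Elbow method example (illustrative)
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"k={k}: inertia={km.inertia_:.1f}")
# Look for the 'elbow' where inertia stops dropping sharply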

Hierarchical Clustering

Purpose: Create tree of clusters showing nested grouping of data

Types: - Agglomerative: Bottom-up (merge clusters) - Divisive: Top-down (split clusters)

Linkage Criteria:

Single: min(distance(a,b)) where a∈A, b∈B
Complete: max(distance(a,b)) where a∈A, b∈B
Average: mean(distance(a,b)) where a∈A, b∈B
Ward: minimizes within-cluster variance

Key Parameters: - n_clusters: Number of clusters to find (default: 2) - linkage: Linkage criterion ('ward', 'complete', 'average', 'single') - metric: Distance metric ('euclidean', 'manhattan', 'cosine'); named affinity in older scikit-learn versions, and 'ward' linkage requires 'euclidean'

When to Use: - ✅ Don't know number of clusters beforehand - ✅ Want to see cluster hierarchy - ✅ Small to medium datasets - ✅ Need deterministic results

Strengths: - No need to specify number of clusters initially - Deterministic results - Creates hierarchy of clusters - Works with any distance metric

Weaknesses: - O(n³) time complexity - Sensitive to noise and outliers - Difficult to handle large datasets - Cannot undo previous steps
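A minimal agglomerative-clustering sketch with Ward linkage (synthetic blobs assumed; scipy.cluster.hierarchy.dendrogram can visualize the full tree):

# Hierarchical (agglomerative) clustering example (illustrative)
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)
print("Cluster sizes:", [int((labels == c).sum()) for c in set(labels)])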


DBSCAN (Density-Based Spatial Clustering)

Purpose: Group together points that are closely packed, marking outliers

Key Concepts:

Core Point: Point with at least MinPts neighbors within ε distance
Border Point: Not core but within ε distance of core point
Noise Point: Neither core nor border point

Algorithm:
1. For each point, find neighbors within ε
2. If point has ≥ MinPts neighbors, mark as core
3. Form clusters by connecting core points
4. Add border points to nearby clusters

Key Parameters: - eps: Maximum distance between two samples to be neighbors - min_samples: Minimum number of samples in neighborhood for core point - metric: Distance metric ('euclidean', 'manhattan', 'cosine')

When to Use: - ✅ Clusters have varying shapes and sizes - ✅ Data contains noise/outliers - ✅ Don't know number of clusters - ✅ Density-based clusters expected

Strengths: - Finds arbitrarily shaped clusters - Automatically determines number of clusters - Robust to outliers - Identifies outliers explicitly

Weaknesses: - Sensitive to hyperparameters (eps, min_samples) - Struggles with varying densities - Memory intensive for large datasets - Difficult to use with high-dimensional data
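A minimal sketch; labels of -1 mark noise points (toy two-moons data assumed, scaled because eps is a distance):

# DBSCAN example (illustrative)
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=400, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("Clusters found:", n_clusters, "| noise points:", int((db.labels_ == -1).sum()))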


Principal Component Analysis (PCA)

Purpose: Reduce dimensionality while preserving maximum variance

Mathematical Formula:

Covariance Matrix: C = 1/(n-1) * XᵀX (with X mean-centered)
Eigendecomposition: C = PΛPᵀ
Principal Components: PC = X * P
Explained Variance Ratio: λᵢ / Σλⱼ

Key Parameters: - n_components: Number of components to keep (int, float, 'mle', or None) - whiten: Whether to whiten the components (default: False) - svd_solver: SVD solver ('auto', 'full', 'arpack', 'randomized')

When to Use: - ✅ High-dimensional data - ✅ Need dimensionality reduction - ✅ Want to remove correlated features - ✅ Visualization of high-dim data

Strengths: - Reduces overfitting - Removes correlated features - Fast and simple - Interpretable components

Weaknesses: - Linear transformation only - Components may not be interpretable - Sensitive to feature scaling - May lose important information

Choosing Components:

# Cumulative explained variance ≥ 95%
# Scree plot: elbow in eigenvalue plot
# Cross-validation performance
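A minimal sketch keeping enough components to explain ~95% of the variance (passing a float as n_components does this directly in scikit-learn):

# PCA example (illustrative; digits dataset assumed)
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)    # PCA is sensitive to feature scale

pca = PCA(n_components=0.95).fit(X)      # keep components up to 95% explained variance
print("Components kept:", pca.n_components_)
print("Cumulative explained variance:", np.cumsum(pca.explained_variance_ratio_)[-1])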

t-SNE (t-Distributed Stochastic Neighbor Embedding)

Purpose: Nonlinear dimensionality reduction for visualization

Key Concepts:

High-dimensional similarity: pⱼ|ᵢ = exp(-||xᵢ-xⱼ||²/2σᵢ²) / Σₖ exp(-||xᵢ-xₖ||²/2σᵢ²)
Low-dimensional similarity: qᵢⱼ = (1+||yᵢ-yⱼ||²)⁻¹ / Σₖₗ(1+||yₖ-yₗ||²)⁻¹
Cost function: KL(P||Q) = Σᵢⱼ pᵢⱼ log(pᵢⱼ/qᵢⱼ)

Key Parameters: - n_components: Dimension of embedded space (default: 2) - perplexity: Number of nearest neighbors (default: 30) - learning_rate: Learning rate (default: 200) - n_iter: Maximum iterations (default: 1000)

When to Use: - ✅ Visualization of high-dimensional data - ✅ Exploring cluster structure - ✅ Non-linear relationships exist - ✅ Small to medium datasets

Strengths: - Excellent for visualization - Preserves local structure - Reveals cluster structure - Non-linear dimensionality reduction

Weaknesses: - Computationally expensive - Non-deterministic results - Hyperparameter sensitive - Not suitable for new data projection
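A minimal visualization sketch (digits dataset assumed; results vary run to run unless random_state is fixed):

# t-SNE example (illustrative)
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
print(embedding.shape)  # (n_samples, 2); plot and color by y to inspect cluster structure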


Algorithm Selection Framework

Data Characteristics Decision Tree

Sample Size?
├── Small (< 1K)
│   ├── Classification → Naive Bayes, k-NN, Decision Tree
│   └── Regression → Linear Regression, k-NN
├── Medium (1K-100K)
│   ├── Linear relationship → Linear/Logistic Regression
│   ├── Non-linear → Random Forest, SVM
│   └── Complex patterns → Neural Networks
└── Large (> 100K)
    ├── Speed priority → Linear models, SGD
    ├── Accuracy priority → Random Forest, Gradient Boosting
    └── Very complex → Neural Networks

Interpretability needed?
├── Yes → Linear Regression, Decision Trees, Naive Bayes
└── No → Random Forest, SVM, Neural Networks

Training Speed priority?
├── Yes → Naive Bayes, Linear Regression, k-NN
└── No → SVM, Random Forest, Neural Networks

Performance Characteristics

| Algorithm | Training Speed | Prediction Speed | Memory Usage | Interpretability |
|-----------|----------------|------------------|--------------|------------------|
| Linear Regression | ⚡⚡⚡ | ⚡⚡⚡ | ⚡⚡⚡ | ⭐⭐⭐ |
| Logistic Regression | ⚡⚡⚡ | ⚡⚡⚡ | ⚡⚡⚡ | ⭐⭐⭐ |
| Decision Tree | ⚡⚡ | ⚡⚡⚡ | ⚡⚡ | ⭐⭐⭐ |
| Random Forest | ⚡⚡ | | | ⭐⭐ |
| XGBoost | ⚡⚡ | ⚡⚡ | | |
| LightGBM | ⚡⚡ | ⚡⚡⚡ | ⚡⚡⚡ | |
| CatBoost | ⚡⚡ | ⚡⚡ | | |
| AdaBoost | ⚡⚡ | ⚡⚡ | | ⭐⭐ |
| SVM | ⚡⚡ | ⚡⚡ | | |
| k-NN | ⚡⚡⚡ | | | ⭐⭐ |
| Naive Bayes | ⚡⚡⚡ | ⚡⚡⚡ | ⚡⚡⚡ | ⭐⭐ |
| Neural Networks | ⚡⚡ | | | |

Evaluation Metrics

Classification Metrics

Accuracy: (TP + TN) / (TP + TN + FP + FN) - Use when: Balanced classes, all errors equally important

Precision: TP / (TP + FP) - Use when: False positives are costly (spam detection)

Recall (Sensitivity): TP / (TP + FN) - Use when: False negatives are costly (disease detection)

F1-Score: 2 * (Precision * Recall) / (Precision + Recall) - Use when: Balance between precision and recall needed

ROC-AUC: Area under Receiver Operating Characteristic curve - Use when: Comparing models across different thresholds
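A quick sketch computing these with scikit-learn (the toy labels and scores below are made up for illustration):

# Classification metrics example (toy predictions)
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]  # predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))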

Regression Metrics

Mean Absolute Error (MAE): 1/n Σ|yᵢ - ŷᵢ| - Robust to outliers, interpretable in original units

Mean Squared Error (MSE): 1/n Σ(yᵢ - ŷᵢ)² - Penalizes large errors more, sensitive to outliers

Root Mean Squared Error (RMSE): √(1/n Σ(yᵢ - ŷᵢ)²) - Same units as target, interpretable

R² Score: 1 - SS_res/SS_tot - Proportion of variance explained; at most 1, and can be negative for models worse than predicting the mean

Clustering Metrics

Silhouette Score: Measures cluster cohesion and separation - Range: [-1, 1], higher is better

Adjusted Rand Index: Measures similarity to true clustering - Range: [-1, 1], 1 = perfect match

Inertia: Within-cluster sum of squares (k-means objective) - Lower is better, use with elbow method

Common Preprocessing Steps

Feature Scaling

# Standardization (mean=0, std=1)
StandardScaler()  # For normal distribution

# Normalization (min=0, max=1)
MinMaxScaler()    # For uniform distribution

# Robust scaling (median=0, IQR=1)
RobustScaler()    # For data with outliers

Feature Engineering

  • Polynomial Features: Create interaction terms
  • One-Hot Encoding: Convert categorical to binary
  • Target Encoding: Use target statistics for categories
  • Binning: Convert continuous to categorical
  • Date Features: Extract year, month, day, weekday
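A small sketch of the first two techniques with scikit-learn (the toy DataFrame is made up for illustration):

# Feature engineering example (toy data)
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

df = pd.DataFrame({"age": [22, 35, 58], "city": ["NY", "SF", "NY"]})

poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(df[["age"]]))                 # age, age^2

ohe = OneHotEncoder()
print(ohe.fit_transform(df[["city"]]).toarray())       # one binary column per category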

Missing Data

  • Drop: Remove rows/columns with missing values
  • Mean/Median/Mode: Fill with central tendency
  • Forward/Backward Fill: Use adjacent values
  • Interpolation: Estimate based on trends
  • Model-based: Predict missing values
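A minimal imputation sketch (toy array with NaNs assumed; scikit-learn also offers KNNImputer and IterativeImputer for model-based filling):

# Missing data imputation example (toy data)
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

imputer = SimpleImputer(strategy="median")   # also: 'mean', 'most_frequent', 'constant'
print(imputer.fit_transform(X))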

Overfitting vs Underfitting

Overfitting (High Variance)

Symptoms: - High training accuracy, low validation accuracy - Complex model performs worse on new data

Solutions: - More training data - Regularization (L1/L2) - Feature selection - Cross-validation - Early stopping - Ensemble methods

Underfitting (High Bias)

Symptoms: - Low training and validation accuracy - Model too simple for the problem

Solutions: - More complex model - Add features - Reduce regularization - Increase model capacity - Feature engineering

Model Selection Best Practices

  1. Start Simple: Begin with baseline models (linear models, Naive Bayes)
  2. Cross-Validation: Use k-fold CV for reliable performance estimates
  3. Feature Engineering: Often more important than algorithm choice
  4. Ensemble Methods: Combine multiple models for better performance
  5. Hyperparameter Tuning: Use grid search or randomized search
  6. Monitor Overfitting: Track train vs validation performance
  7. Business Metrics: Optimize for what matters to the business
  8. Interpretability Trade-off: Balance accuracy with explainability

Quick Reference Table

| Problem Type | First Try | If More Accuracy Needed | If Interpretability Needed |
|--------------|-----------|-------------------------|----------------------------|
| Binary Classification | Logistic Regression | XGBoost, LightGBM, CatBoost | Decision Tree, Naive Bayes |
| Multi-class Classification | Logistic Regression | XGBoost, LightGBM, CatBoost | Decision Tree |
| Regression | Linear Regression | XGBoost, LightGBM, CatBoost | Linear Regression |
| Clustering | k-Means | DBSCAN, Hierarchical | k-Means with visualization |
| Dimensionality Reduction | PCA | t-SNE, UMAP | PCA |
| Anomaly Detection | Isolation Forest | One-Class SVM | Statistical methods |

This cheat sheet provides a foundation for understanding and applying machine learning algorithms. Always consider your specific problem context, data characteristics, and business requirements when selecting algorithms.