DMCT-NOTES/unit 2/05_Imbalanced_Data.md
Akshat Mehta 8f8e35ae95 unit 2 added
2025-11-24 15:26:41 +05:30

Handling Imbalanced Data

What is Imbalanced Data?

Data is imbalanced when one class has many more examples than the other.

  • Example: 960 patients have diabetes, 40 do not.
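  • Problem: A model that always predicts the majority class gets 960 of 1,000 cases right (96% accuracy) yet never identifies a single minority-class example. This is the Accuracy Paradox: high accuracy, but the model is useless for the minority class.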
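
The arithmetic behind the Accuracy Paradox can be checked with a few lines of Python. This is a minimal sketch using the patient counts from the example; the label encoding (1 = has diabetes, 0 = does not) and the "always predict the majority" model are assumptions made for illustration.

```python
# Labels from the example: 1 = has diabetes (majority), 0 = does not (minority).
y_true = [1] * 960 + [0] * 40

# A hypothetical "model" that always predicts the majority class.
y_pred = [1] * len(y_true)

# Accuracy: fraction of all predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Minority recall: fraction of minority examples the model actually found.
minority_found = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
minority_recall = minority_found / y_true.count(0)

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.96
print(f"minority recall = {minority_recall:.2f}")  # minority recall = 0.00
```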

Techniques to Handle Imbalance

1. Resampling

  • Up-sampling (Over-sampling): Randomly duplicate examples from the minority class.
    • Pros: No information loss.
    • Cons: Can lead to overfitting, because the model sees the same minority examples repeated many times.
  • Down-sampling (Under-sampling): Randomly remove examples from the majority class.
    • Pros: Balances the dataset and makes it smaller, so training is faster.
    • Cons: We lose potentially useful data.
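
Both resampling strategies can be sketched with the standard library alone. The function names `random_oversample` and `random_undersample` are hypothetical helpers written for these notes, not part of any library, and the sketch assumes the majority class is the larger one and that the goal is a 1:1 class ratio.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of majority rows, the same size as the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

# Toy data: each row is just an ID string so duplicates are easy to spot.
majority = [f"maj_{i}" for i in range(960)]
minority = [f"min_{i}" for i in range(40)]

over_maj, over_min = random_oversample(majority, minority)
under_maj, under_min = random_undersample(majority, minority)

print(len(over_maj), len(over_min))    # 960 960
print(len(under_maj), len(under_min))  # 40 40
print(len(set(over_min)))              # 40: up-sampling added no new distinct rows
```

The last line illustrates the trade-off: up-sampling keeps every original row but adds only repeats, while down-sampling discards 920 of the 960 majority rows.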

2. SMOTE (Synthetic Minority Oversampling Technique)

Instead of just copying data, SMOTE creates new, synthetic examples.

  • How it works:
    1. Pick a minority instance.
    2. Find its k nearest neighbors among the other minority instances (the most similar ones).
    3. Create a new point at a random position on the line segment between the instance and one randomly chosen neighbor.
  • Benefit: Increases the minority class size with new, plausible examples rather than exact copies.
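
The three steps can be sketched in plain Python as follows. This is a simplified illustration, not a reference implementation: `smote_sketch` is a hypothetical name, the points are assumed to be numeric tuples, distance is assumed to be Euclidean, and the default `k=5` is an assumed value.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Step 1: pick a minority instance at random.
        x = rng.choice(minority)
        # Step 2: find its k nearest minority neighbors (excluding itself).
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Step 3: place a new point at a random position on the segment
        # between x and one randomly chosen neighbor.
        nb = rng.choice(neighbors)
        gap = rng.random()  # in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

# Toy minority class with two features per point.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 1.8), (1.8, 1.9)]
new_points = smote_sketch(minority, n_new=10, k=3)

print(len(new_points))  # 10
for point in new_points[:3]:
    print(point)
```

Because every synthetic point lies between two existing minority points, it stays inside the region the minority class already occupies. In practice a maintained implementation, such as the `SMOTE` class in the imbalanced-learn package, would normally be preferred over a hand-written version.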

3. Other Methods

  • Change the performance metric (use F1-score or AUC instead of Accuracy).
  • Use algorithms that often cope better with imbalance (such as tree-based models).
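
To see why the choice of metric matters, the sketch below computes accuracy and F1-score by hand for two sets of predictions on the 960/40 example. The helper names and the second model's predictions are made up for illustration, and the minority class (label 0) is assumed to be the "positive" class of interest.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score_for(y_true, y_pred, positive):
    """F1 = harmonic mean of precision and recall for the chosen positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

y_true = [1] * 960 + [0] * 40

# Model A (hypothetical): always predicts the majority class.
pred_a = [1] * 1000
# Model B (hypothetical): finds 30 of the 40 minority cases,
# but wrongly flags 20 majority cases as minority.
pred_b = [1] * 940 + [0] * 20 + [0] * 30 + [1] * 10

print(f"A: accuracy={accuracy(y_true, pred_a):.2f}  F1={f1_score_for(y_true, pred_a, 0):.2f}")
# A: accuracy=0.96  F1=0.00
print(f"B: accuracy={accuracy(y_true, pred_b):.2f}  F1={f1_score_for(y_true, pred_b, 0):.2f}")
# B: accuracy=0.97  F1=0.67
```

Accuracy barely distinguishes the two models (0.96 vs 0.97), while the minority-class F1-score separates them clearly (0.00 vs 0.67).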