akshat/DMCT-NOTES

Akshat Mehta 8f8e35ae95 unit 2 added

2025-11-24 15:26:41 +05:30

Handling Imbalanced Data

What is Imbalanced Data?

Data is imbalanced when one class has many more examples than the other.

Example: 960 patients have diabetes, 40 do not.
Problem: A model that always predicts the majority class scores 96% accuracy on this data (the Accuracy Paradox), yet it never identifies a single minority-class example.
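
The arithmetic behind the paradox can be checked with a few lines of Python. This is a minimal sketch using the 960/40 split from the example; the label encoding (1 = has diabetes, 0 = does not) and the variable names are assumptions made for illustration.

```python
# Labels follow the example above: 1 = has diabetes (majority), 0 = does not (minority).
y_true = [1] * 960 + [0] * 40

# A "model" that ignores its input and always predicts the majority class.
y_pred = [1] * len(y_true)

# Accuracy: fraction of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: fraction of minority-class patients the model identifies.
minority_total = sum(t == 0 for t in y_true)
minority_found = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
minority_recall = minority_found / minority_total

print(f"accuracy = {accuracy:.2f}, minority recall = {minority_recall:.2f}")
# accuracy = 0.96, minority recall = 0.00
```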

Techniques to Handle Imbalance

1. Resampling

Up-sampling (Over-sampling): Randomly duplicate examples from the minority class.
- Pros: No information loss.
- Cons: Can lead to overfitting, because the model sees exact copies of the same minority examples.
Down-sampling (Under-sampling): Randomly remove examples from the majority class.
- Pros: Balances the dataset and shrinks the training set, so training is faster.
- Cons: We lose potentially useful data.
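
Both resampling strategies can be sketched with the standard library alone. The function names `random_oversample` and `random_undersample` and the placeholder records are hypothetical, chosen only for this illustration; the sketch assumes the majority list is the larger one and that the goal is a 1:1 class ratio.

```python
import random

def random_oversample(majority, minority, rng):
    """Duplicate randomly chosen minority examples until both classes are the same size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def random_undersample(majority, minority, rng):
    """Keep a random subset of the majority class that matches the minority class size."""
    return rng.sample(majority, len(minority)), minority

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical placeholder records standing in for real patient rows.
majority = [("patient", i) for i in range(960)]
minority = [("patient", 960 + i) for i in range(40)]

over_maj, over_min = random_oversample(majority, minority, rng)
under_maj, under_min = random_undersample(majority, minority, rng)

print(len(over_maj), len(over_min))    # 960 960
print(len(under_maj), len(under_min))  # 40 40
```

After up-sampling, the minority class holds 960 rows but still only 40 distinct ones, which is where the overfitting risk comes from. After down-sampling, 920 majority rows are discarded.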

2. SMOTE (Synthetic Minority Oversampling Technique)

Instead of just copying data, SMOTE creates new, synthetic examples.

How it works:
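1. Pick a minority instance.
2. Find its k nearest neighbors among the other minority instances (the most similar ones).
3. Pick one of those neighbors at random and create a new point at a random position on the line segment between the instance and that neighbor.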
Benefit: Increases minority class size without duplicating existing examples.
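
The three steps above can be sketched as follows. This is a simplified illustration, not a reference implementation: the function name `smote_sketch`, the default `k`, the use of Euclidean distance, and the toy 2-D points are all assumptions made for this example, and it needs at least two minority points to work.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        # Step 1: pick a minority instance.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Step 2: find its k nearest neighbors among the other minority
        # instances (Euclidean distance is assumed here).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Step 3: choose one neighbor and place a new point at a random
        # position on the segment between the instance and that neighbor.
        nb = rng.choice(neighbors)
        gap = rng.random()  # in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Hypothetical toy minority class with two numeric features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8), (1.3, 1.0)]
new_points = smote_sketch(minority, n_new=4, k=3, rng=random.Random(0))

for point in new_points:
    print(point)
```

Because every synthetic point lies between two real minority points, the new data stays inside the region the minority class already occupies.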

3. Other Methods

Change the performance metric (use F1-score or AUC instead of Accuracy).
Use algorithms that tend to be more robust to imbalance (such as tree-based models).
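
To see why the choice of metric matters, the sketch below compares accuracy with the F1-score of the minority class for two predictors on the 960/40 example. The helper name `f1_for_class`, the label encoding (1 = majority, 0 = minority), and the second predictor's outputs are hypothetical, invented only to illustrate the calculation. Returning 0.0 when there are no true positives is a common convention, assumed here.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1-score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # convention: no true positives means an F1 of zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1] * 960 + [0] * 40

# Predictor A: always predicts the majority class.
pred_a = [1] * 1000

# Predictor B (hypothetical): mislabels 60 majority cases as minority,
# but finds 30 of the 40 minority cases.
pred_b = [1] * 900 + [0] * 60 + [0] * 30 + [1] * 10

for name, pred in [("A", pred_a), ("B", pred_b)]:
    acc = accuracy(y_true, pred)
    f1 = f1_for_class(y_true, pred, positive=0)
    print(f"{name}: accuracy = {acc:.2f}, minority F1 = {f1:.2f}")
# A: accuracy = 0.96, minority F1 = 0.00
# B: accuracy = 0.93, minority F1 = 0.46
```

Predictor B has lower accuracy than A but a much higher minority F1, so accuracy alone would rank the two the wrong way round for this task.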