DMCT-NOTES/unit 2/05_Imbalanced_Data.md
Akshat Mehta 8f8e35ae95 unit 2 added
2025-11-24 15:26:41 +05:30


# Handling Imbalanced Data
## What is Imbalanced Data?
Data is **imbalanced** when one class has many more examples than the other.
- **Example**: 960 patients have diabetes, 40 do not.
- **Problem**: A model that always predicts the majority class scores 96% accuracy on this data (the Accuracy Paradox), yet it never identifies a single minority-class example.
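
The Accuracy Paradox can be checked with a few lines of arithmetic. The sketch below reuses the 960/40 split from the example; the label encoding (1 for the majority class, 0 for the minority class) and the "always predict the majority" baseline are illustrative assumptions, not part of any particular library.

```python
# Labels from the example: 1 = has diabetes (majority), 0 = does not (minority).
# The 1/0 encoding is an assumption made for this sketch.
y_true = [1] * 960 + [0] * 40

# A baseline "model" that ignores its input and always predicts the majority class.
y_pred = [1] * len(y_true)

# Accuracy: fraction of all predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: fraction of true minority cases the model actually found.
minority_total = sum(t == 0 for t in y_true)
minority_found = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
minority_recall = minority_found / minority_total

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.96
print(f"minority recall = {minority_recall:.2f}")  # minority recall = 0.00
```

The accuracy looks strong only because 96% of the labels belong to one class; the recall of 0 on the minority class shows what the accuracy figure hides.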
## Techniques to Handle Imbalance
### 1. Resampling
- **Up-sampling (Over-sampling)**: Randomly duplicate examples from the **minority** class.
  - *Pros*: No information loss, since every original example is kept.
  - *Cons*: Can lead to overfitting, because the model sees the same minority examples repeatedly.
- **Down-sampling (Under-sampling)**: Randomly remove examples from the **majority** class.
  - *Pros*: Balances the dataset and makes it smaller, so training is faster.
  - *Cons*: We lose potentially useful data.
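
Both resampling strategies can be sketched with the standard library alone. The helper names `random_oversample` and `random_undersample` are hypothetical, the examples are plain IDs rather than real feature rows, and the sketch assumes the majority class is the larger of the two lists.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate random minority examples until both classes have the same size."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return majority, minority + extra

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class, the same size as the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

# Toy data: each example is just an ID; real rows would be feature vectors.
majority = list(range(960))
minority = list(range(960, 1000))

over_maj, over_min = random_oversample(majority, minority)
under_maj, under_min = random_undersample(majority, minority)

print(len(over_maj), len(over_min))    # 960 960
print(len(under_maj), len(under_min))  # 40 40
```

Over-sampling ends with 1,920 rows, of which only 40 minority rows are distinct; under-sampling ends with 80 rows and discards 920 majority rows. In practice, resampling is usually applied to the training split only, so that the evaluation data keeps the original class ratio.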
### 2. SMOTE (Synthetic Minority Oversampling Technique)
Instead of just copying data, SMOTE creates **new, synthetic** examples.
- **How it works**:
  1. Pick a minority instance.
  2. Find its k nearest neighbors among the other **minority** instances (the most similar ones).
  3. Create a new point at a random position on the line segment between the instance and one of those neighbors.
- **Benefit**: Increases minority class size without just duplicating exact copies.
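
The three steps can be sketched in plain Python. This is a simplified illustration of the interpolation idea, not a reference implementation of SMOTE: the function name `smote_sketch`, the Euclidean distance, and the default `k=5` are assumptions for the example, and it expects at least two minority points.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Step 1: pick a minority instance at random.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Step 2: find its k nearest neighbors among the other minority instances.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Step 3: place a new point at a random position between x and one neighbor.
        nb = rng.choice(neighbors)
        gap = rng.random()  # random fraction in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

# Made-up 2-D minority points, for illustration only.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points))  # 6
```

Because each synthetic point lies between two existing minority points, it stays inside the region the minority class already occupies instead of repeating an exact copy.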
### 3. Other Methods
- Change the performance metric: use precision, recall, F1-score, or AUC instead of accuracy.
- Use algorithms that tend to cope better with imbalance (such as tree-based models).
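
To see why F1-score is more informative than accuracy here, the sketch below computes precision, recall, and F1 with the minority class treated as the positive class. The helper `precision_recall_f1` is a hypothetical name, and the predictions of the second "model" are made-up numbers chosen for illustration, not real results.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Return (precision, recall, F1) for the class given as `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 960 majority examples (label 1) followed by 40 minority examples (label 0).
y_true = [1] * 960 + [0] * 40

# Baseline: always predict the majority class.
baseline = [1] * 1000

# Hypothetical model: finds 30 of the 40 minority cases and raises 10 false alarms.
model = [1] * 950 + [0] * 10 + [0] * 30 + [1] * 10

print(precision_recall_f1(y_true, baseline, positive=0))  # (0.0, 0.0, 0.0)
print(precision_recall_f1(y_true, model, positive=0))     # (0.75, 0.75, 0.75)
```

The baseline has 96% accuracy but an F1 of 0 on the minority class, while the hypothetical model has 98% accuracy and an F1 of 0.75: a gap of two accuracy points hides a large difference in how well the minority class is detected.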