Handling Imbalanced Data
What is Imbalanced Data?
Data is imbalanced when one class has many more examples than the other.
- Example: 960 patients have diabetes, 40 do not.
- Problem: A model can simply predict the majority class every time and still reach high accuracy (96% in the example above), yet it never identifies a single minority-class example. This is known as the accuracy paradox.
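To make the paradox concrete, here is a minimal Python sketch using the counts from the example above. The `always_majority` baseline and the label strings are hypothetical, chosen only for illustration.

```python
# Labels from the example: 960 patients with diabetes, 40 without.
labels = ["diabetes"] * 960 + ["no_diabetes"] * 40

def always_majority(_patient):
    """Hypothetical baseline: ignores its input and predicts the majority class."""
    return "diabetes"

predictions = [always_majority(y) for y in labels]

# Accuracy: fraction of all predictions that are correct.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Minority recall: fraction of "no_diabetes" patients the model actually found.
minority_total = sum(y == "no_diabetes" for y in labels)
minority_found = sum(p == y == "no_diabetes" for p, y in zip(predictions, labels))
minority_recall = minority_found / minority_total

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.96
print(f"minority recall = {minority_recall:.2f}")  # minority recall = 0.00
```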
Techniques to Handle Imbalance
1. Resampling
- Up-sampling (Over-sampling): Randomly duplicate examples from the minority class.
- Pros: No information loss.
- Cons: Can lead to overfitting, because the model sees exact copies of the same minority examples many times.
- Down-sampling (Under-sampling): Randomly remove examples from the majority class.
- Pros: Balances the dataset and shrinks it, so training is faster.
- Cons: We lose potentially useful data.
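Both resampling strategies can be sketched with the standard library alone. This is a rough illustration rather than a production implementation: the helper names `up_sample` and `down_sample`, the `(label, id)` tuples, and the fixed seed are all assumptions made for the example.

```python
import random

def up_sample(minority, target_size, rng):
    """Randomly duplicate minority examples (with replacement) up to target_size."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def down_sample(majority, target_size, rng):
    """Randomly keep target_size majority examples (without replacement)."""
    return rng.sample(majority, target_size)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
majority = [("diabetes", i) for i in range(960)]
minority = [("no_diabetes", i) for i in range(40)]

up = up_sample(minority, len(majority), rng)
down = down_sample(majority, len(minority), rng)

print(len(majority), len(up))    # 960 960: balanced by growing the minority class
print(len(down), len(minority))  # 40 40: balanced by shrinking the majority class
print(len(set(up)))              # 40: up-sampling adds copies, not new information
```

The last line shows the trade-off: the up-sampled set has 960 rows but still only 40 distinct examples, while the down-sampled set has thrown away 920 majority rows.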
2. SMOTE (Synthetic Minority Oversampling Technique)
Instead of just copying data, SMOTE creates new, synthetic examples.
- How it works:
- Pick a minority instance.
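- Find its k nearest neighbors among the other minority instances (the most similar ones).
- Pick one of those neighbors at random and create a new point at a random position on the line segment between the instance and that neighbor.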
- Benefit: Increases the size of the minority class without duplicating existing examples.
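The steps above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the reference SMOTE implementation: the function name `smote_sketch`, its parameters, and the sample points are hypothetical, and it assumes purely numeric features and at least two minority instances.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic points from a list of numeric feature tuples."""
    if rng is None:
        rng = random.Random()
    synthetic = []
    for _ in range(n_new):
        # 1. Pick a minority instance.
        x = rng.choice(minority)
        # 2. Find its k nearest neighbors among the other minority instances.
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # 3. Create a new point on the segment between x and one random neighbor.
        nb = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Hypothetical two-feature minority examples.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5), (1.8, 2.4), (1.4, 2.1)]
new_points = smote_sketch(minority, n_new=4, k=3, rng=random.Random(0))
print(new_points)
```

Because every synthetic point is an interpolation between two real minority points, it always falls inside the region the minority class already occupies.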
3. Other Methods
- Change the performance metric (use F1-score or AUC instead of Accuracy).
- Use algorithms that tend to cope better with imbalance, such as tree-based models, ideally combined with class weighting.
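As a rough illustration of why the metric matters, the sketch below computes precision, recall, and F1-score by hand, treating the minority class as the positive class (an assumption of this example). The predictions of the second model are made up for illustration and are not results from any real classifier. AUC is not covered here because it needs predicted scores rather than hard labels.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

y_true = ["diabetes"] * 960 + ["no_diabetes"] * 40

# Baseline that always predicts the majority class.
pred_majority = ["diabetes"] * 1000

# Hypothetical model: 20 false alarms, and 30 of the 40 minority cases found.
pred_model = (["no_diabetes"] * 20 + ["diabetes"] * 940
              + ["no_diabetes"] * 30 + ["diabetes"] * 10)

for name, pred in [("majority", pred_majority), ("model", pred_model)]:
    acc = sum(t == p for t, p in zip(y_true, pred)) / len(y_true)
    _, _, f1 = precision_recall_f1(y_true, pred, positive="no_diabetes")
    print(f"{name}: accuracy = {acc:.2f}, F1 = {f1:.2f}")
# majority: accuracy = 0.96, F1 = 0.00
# model: accuracy = 0.97, F1 = 0.67
```

Accuracy barely separates the two (0.96 versus 0.97), while F1 on the minority class makes the difference obvious.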