unit 2 added
28
unit 2/05_Imbalanced_Data.md
Normal file
@@ -0,0 +1,28 @@
# Handling Imbalanced Data
## What is Imbalanced Data?
Data is **imbalanced** when one class has far more examples than another.
- **Example**: Out of 1,000 patients, 960 have diabetes and 40 do not.
- **Problem**: A model can predict the majority class for every patient and still reach 96% accuracy (the **Accuracy Paradox**), yet it never identifies a single minority-class patient.
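
The arithmetic behind the Accuracy Paradox can be checked with a few lines of Python. This is a minimal sketch using the 960/40 split from the example; the label encoding (1 = diabetes, 0 = no diabetes) and the variable names are assumptions made for illustration.

```python
# Labels from the example: 1 = has diabetes (majority), 0 = does not (minority).
y_true = [1] * 960 + [0] * 40

# A "model" that ignores its input and always predicts the majority class.
y_pred = [1] * len(y_true)

# Accuracy: fraction of all predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: fraction of the 40 minority patients the model finds.
minority_total = sum(t == 0 for t in y_true)
minority_found = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
minority_recall = minority_found / minority_total

print(f"accuracy        = {accuracy:.2f}")         # 0.96
print(f"minority recall = {minority_recall:.2f}")  # 0.00
```

The accuracy looks strong only because 96% of the labels belong to the majority class.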
## Techniques to Handle Imbalance
### 1. Resampling
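- **Up-sampling (Over-sampling)**: Randomly duplicate examples from the **minority** class until the classes are closer to balanced.
  - *Pros*: No information is lost, because every original example is kept.
  - *Cons*: Can lead to overfitting, because the model sees exact copies of the same minority examples.
- **Down-sampling (Under-sampling)**: Randomly remove examples from the **majority** class.
  - *Pros*: Balances the dataset and shrinks the training set, so training is faster.
  - *Cons*: Discards potentially useful majority-class data.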
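
Both resampling strategies can be sketched with the standard library alone. The helper names `random_oversample` and `random_undersample` are hypothetical, and the sketch assumes a binary problem in which `minority_label` really is the smaller class and the goal is a 1:1 balance.

```python
import random
from collections import Counter


def random_oversample(X, y, minority_label, seed=0):
    """Duplicate random minority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    # Draw the missing rows from the minority class WITH replacement.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    keep = majority + minority + extra
    return [X[i] for i in keep], [y[i] for i in keep]


def random_undersample(X, y, minority_label, seed=0):
    """Drop random majority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    # Keep a random subset of the majority class, sampled WITHOUT replacement.
    kept_majority = rng.sample(majority, k=len(minority))
    keep = kept_majority + minority
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 960 majority rows (label 1) and 40 minority rows (label 0).
X = [[i] for i in range(1000)]
y = [1] * 960 + [0] * 40

X_over, y_over = random_oversample(X, y, minority_label=0)
X_under, y_under = random_undersample(X, y, minority_label=0)

print(Counter(y_over))   # 960 rows per class
print(Counter(y_under))  # 40 rows per class
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the real class distribution.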
### 2. SMOTE (Synthetic Minority Oversampling Technique)
Instead of copying existing examples, SMOTE creates **new, synthetic** minority examples.
- **How it works**:
  1. Pick a minority instance.
  2. Find its k nearest neighbors among the other minority instances (the most similar ones).
  3. Create a new point at a random position on the line segment between the instance and one of those neighbors.
- **Benefit**: Grows the minority class without adding exact duplicates, which can reduce the overfitting risk of plain up-sampling.
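
The three steps can be sketched in plain Python as follows. This is an illustrative simplification, not a reference implementation: the function name `smote_sketch`, the Euclidean distance, the default `k=5`, and picking the starting instance at random are all assumptions, and the input must hold at least two minority points.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Step 1: pick a minority instance.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Step 2: find its k nearest minority neighbors, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Step 3: interpolate between x and one randomly chosen neighbor.
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # uniform in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbor)])
    return synthetic


# Four hypothetical minority points with two features each.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6, k=2)

for point in new_points:
    print([round(v, 3) for v in point])
```

Each synthetic point lies between two real minority points, so it stays inside the region the minority class already occupies. In practice a maintained implementation such as the `SMOTE` class in the imbalanced-learn package is usually preferable to hand-written code.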
### 3. Other Methods
- Change the performance metric: use F1-score or AUC (area under the ROC curve) instead of accuracy, because accuracy is dominated by the majority class.
- Use algorithms that often cope better with imbalance, such as tree-based models.
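
To see why the choice of metric matters, the sketch below compares accuracy with the F1-score on the 960/40 split used earlier in these notes. The two sets of predictions are hypothetical, the minority class (label 0) is treated as the positive class, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# 960 majority patients (label 1) followed by 40 minority patients (label 0).
y_true = [1] * 960 + [0] * 40

# Hypothetical model A: always predicts the majority class.
pred_a = [1] * 1000

# Hypothetical model B: raises 50 false alarms but finds 30 of the 40 minority cases.
pred_b = [1] * 910 + [0] * 50 + [0] * 30 + [1] * 10

for name, pred in [("A", pred_a), ("B", pred_b)]:
    p, r, f1 = precision_recall_f1(y_true, pred, positive=0)
    print(f"model {name}: accuracy={accuracy(y_true, pred):.2f} "
          f"precision={p:.3f} recall={r:.2f} f1={f1:.2f}")
```

Model B has lower accuracy than model A (0.94 versus 0.96) but a much higher F1-score (0.50 versus 0.00), which reflects that it actually finds minority cases.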