# Handling Imbalanced Data

## What is Imbalanced Data?
Data is **imbalanced** when one class has far more examples than the other.
- **Example**: 960 patients have diabetes, 40 do not.
- **Problem**: A model that always predicts the majority class scores 96% accuracy on this data (the **accuracy paradox**), yet it never identifies a single minority-class example.

## Techniques to Handle Imbalance

### 1. Resampling
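- **Up-sampling (Over-sampling)**: Randomly duplicate examples from the **minority** class until the classes are closer in size.
  - *Pros*: No information is lost, because every original example is kept.
  - *Cons*: Can lead to overfitting, because the model sees the same minority examples repeatedly.
- **Down-sampling (Under-sampling)**: Randomly remove examples from the **majority** class.
  - *Pros*: Balances the dataset and makes it smaller, so training is faster.
  - *Cons*: We lose potentially useful data.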
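
Both resampling strategies can be sketched with the standard library alone. The helper names `upsample` and `downsample` and the placeholder patient records are hypothetical, chosen only for this illustration.

```python
import random

def upsample(minority, target_size, rng):
    """Randomly duplicate minority examples until there are target_size of them."""
    # choices() samples WITH replacement, so the extras are exact copies.
    extra = rng.choices(minority, k=target_size - len(minority))
    return minority + extra

def downsample(majority, target_size, rng):
    """Randomly keep target_size majority examples and discard the rest."""
    # sample() draws WITHOUT replacement, so no example is kept twice.
    return rng.sample(majority, target_size)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Placeholder records: (patient_id, label), 960 majority and 40 minority.
majority = [(i, 1) for i in range(960)]
minority = [(i, 0) for i in range(960, 1000)]

up = upsample(minority, len(majority), rng)
down = downsample(majority, len(minority), rng)

print(len(majority), len(up))     # 960 960
print(len(down), len(minority))   # 40 40
```

Up-sampling keeps all 1,000 original records and adds 920 copies; down-sampling discards 920 majority records, which is the information loss noted above.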

### 2. SMOTE (Synthetic Minority Oversampling Technique)
Instead of just copying data, SMOTE creates **new, synthetic** examples.
- **How it works**:
  1. Pick a minority instance.
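  2. Find its k nearest neighbors among the other minority instances (the most similar ones).
  3. Create a new point at a random position on the line segment between the instance and one randomly chosen neighbor.
- **Benefit**: Increases the minority class size without adding exact duplicates, which can reduce the overfitting risk of plain up-sampling.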
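
The three steps can be sketched in plain Python as follows. This is a simplified illustration, not a reference implementation: the function name `smote_sketch`, the default `k`, the Euclidean distance, and the toy 2-D points are all assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic points from a list of numeric minority tuples."""
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        # Step 1: pick a minority instance.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Step 2: find its k nearest minority neighbors (Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Step 3: interpolate at a random position between x and one neighbor.
        nb = rng.choice(neighbors)
        gap = rng.random()  # in [0, 1): 0 gives x itself, values near 1 approach nb
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

# Toy 2-D minority points (made up for this sketch).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7), (1.4, 1.0)]
new_points = smote_sketch(minority, n_new=4, k=3, rng=random.Random(42))
print(len(new_points))  # 4
```

Because every synthetic point lies between two real minority points, the new examples stay inside the region the minority class already occupies.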

### 3. Other Methods
- Change the performance metric: use F1-score or AUC instead of accuracy, since they are less dominated by the majority class.
- Use algorithms that tend to cope better with imbalance (such as tree-based models).
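
To see why the choice of metric matters, the sketch below compares accuracy with F1-score for two sets of predictions on the 960/40 example. The prediction counts for "model B" are invented for illustration, and the minority class (label 0) is treated as the positive class, which is an assumption.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive):
    """F1 = harmonic mean of precision and recall for the given positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

y_true = [1] * 960 + [0] * 40

# Model A: always predicts the majority class.
pred_a = [1] * 1000

# Model B (hypothetical): mislabels 50 majority examples but finds 30 of 40 minority ones.
pred_b = [1] * 910 + [0] * 50 + [0] * 30 + [1] * 10

print(f"A: accuracy={accuracy(y_true, pred_a):.2f}, F1={f1_score(y_true, pred_a, 0):.2f}")
# A: accuracy=0.96, F1=0.00
print(f"B: accuracy={accuracy(y_true, pred_b):.2f}, F1={f1_score(y_true, pred_b, 0):.2f}")
# B: accuracy=0.94, F1=0.50
```

Accuracy ranks model A higher, while F1 on the minority class shows that only model B finds any minority examples at all.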