# Handling Imbalanced Data

## What is Imbalanced Data?

Data is **imbalanced** when one class has far more examples than the other.

- **Example**: Out of 1,000 patients, 960 have diabetes and only 40 do not.
- **Problem**: A model can simply guess the majority class and still score 96% accuracy (the **Accuracy Paradox**), yet it never identifies the minority class.

## Techniques to Handle Imbalance

### 1. Resampling

- **Up-sampling (over-sampling)**: Randomly duplicate examples from the **minority** class.
  - *Pros*: No information loss.
  - *Cons*: Can lead to overfitting, since the model sees exact copies of the same minority examples.
- **Down-sampling (under-sampling)**: Randomly remove examples from the **majority** class.
  - *Pros*: Balances the dataset and speeds up training.
  - *Cons*: Potentially useful data from the majority class is discarded.

(A code sketch of both variants appears at the end of this note.)

### 2. SMOTE (Synthetic Minority Oversampling Technique)

Instead of just copying existing rows, SMOTE creates **new, synthetic** examples.

- **How it works**:
  1. Pick a minority-class instance.
  2. Find its nearest neighbors (similar minority instances).
  3. Create a new point along the line segment between the instance and one of those neighbors.
- **Benefit**: Increases the minority class size without duplicating exact copies, which reduces the overfitting risk of plain up-sampling. (See the SMOTE sketch below.)

### 3. Other Methods

- Change the performance metric: use F1-score or AUC instead of accuracy.
- Use algorithms that tend to handle imbalance well, such as tree-based models.
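
To make the resampling idea concrete, here is a minimal sketch using scikit-learn's `resample` utility. The toy `DataFrame`, its column names, and the class split are illustrative assumptions, not data from the note.

```python
# Random up-sampling and down-sampling with scikit-learn's resample utility.
import pandas as pd
from sklearn.utils import resample

# Toy dataset: class 1 is the minority, class 0 the majority (values are made up).
df = pd.DataFrame({
    "glucose": [85, 90, 150, 160, 95, 100, 155, 88, 92, 165],
    "label":   [0,  0,  1,   1,   0,  0,   1,   0,  0,  1],
})

minority = df[df["label"] == 1]
majority = df[df["label"] == 0]

# Up-sampling: draw minority rows WITH replacement until the classes match.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced_up = pd.concat([majority, minority_up])

# Down-sampling: keep only a random subset of majority rows, WITHOUT replacement.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
balanced_down = pd.concat([majority_down, minority])

print(balanced_up["label"].value_counts())
print(balanced_down["label"].value_counts())
```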
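
SMOTE is easiest to try through the `imbalanced-learn` package (a separate install from scikit-learn); the sketch below assumes it is available and uses a synthetic dataset with roughly the 960-vs-40 split from the example above.

```python
# SMOTE with imbalanced-learn (assumes: pip install imbalanced-learn).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic binary dataset with an approximate 96% / 4% class split.
X, y = make_classification(n_samples=1000, weights=[0.96, 0.04], random_state=42)
print("Before SMOTE:", Counter(y))

# SMOTE picks a minority point, finds its k nearest minority neighbors,
# and interpolates new synthetic points along the lines between them.
smote = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print("After SMOTE:", Counter(y_resampled))
```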
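
Finally, a short sketch of why changing the metric matters: on the 960-vs-40 split, a model that always predicts the majority class looks excellent by accuracy but collapses under F1 and ROC AUC. The "always predict the majority" baseline here is purely illustrative.

```python
# Accuracy vs. F1 and ROC AUC on a 960-vs-40 class split.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# 960 majority-class labels (0) and 40 minority-class labels (1).
y_true = np.array([0] * 960 + [1] * 40)

# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

print("Accuracy:", accuracy_score(y_true, y_pred))                    # 0.96 - looks great
print("F1 (minority):", f1_score(y_true, y_pred, zero_division=0))    # 0.0  - reveals the failure
print("ROC AUC:", roc_auc_score(y_true, y_pred))                      # 0.5  - no better than chance
```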