unit 2 added

2025-11-24 15:26:41 +05:30
commit 8f8e35ae95
9 changed files with 326 additions and 0 deletions
--- a/2/05_Imbalanced_Data.md
+++ b/2/05_Imbalanced_Data.md
@@ -0,0 +1,28 @@
+# Handling Imbalanced Data
+
+## What is Imbalanced Data?
+Data is **imbalanced** when one class has many more examples than the other.
+- **Example**: 960 patients have diabetes, 40 do not.
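+- **Problem**: A model that always predicts the majority class ("has diabetes") is 96% accurate here, yet it never identifies a single minority-class patient. This is the **Accuracy Paradox**: high accuracy can hide a model that fails completely on the minority class.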
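+
+To make the problem concrete, here is a minimal sketch using the 960/40 split from the example. The label encoding (1 = diabetes, 0 = no diabetes) and the always-majority "model" are assumptions made for illustration only.
+
+```python
+# Labels from the example: 1 = diabetes (majority), 0 = no diabetes (minority).
+y_true = [1] * 960 + [0] * 40
+
+# A "model" that ignores its input and always predicts the majority class.
+y_pred = [1] * len(y_true)
+
+accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
+
+# Recall on the minority class: fraction of class-0 patients correctly identified.
+minority_total = sum(t == 0 for t in y_true)
+minority_found = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
+minority_recall = minority_found / minority_total
+
+print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.96
+print(f"minority recall = {minority_recall:.2f}")  # minority recall = 0.00
+```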
+
+## Techniques to Handle Imbalance
+
+### 1. Resampling
+- **Up-sampling (Over-sampling)**: Randomly duplicate examples from the **minority** class.
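+  - *Pros*: No information loss, since every original example is kept.
+  - *Cons*: Can lead to overfitting, because the model sees exact copies of the same minority examples many times.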
+- **Down-sampling (Under-sampling)**: Randomly remove examples from the **majority** class.
+  - *Pros*: Balances the dataset and shrinks it, so training is faster.
+  - *Cons*: Discards majority examples that may carry useful information.
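+
+Both resampling strategies can be sketched in a few lines of plain Python. The function names `random_oversample` and `random_undersample` and the toy rows below are hypothetical, chosen only for this illustration; this is a sketch of the idea, not a library API.
+
+```python
+import random
+
+def random_oversample(majority, minority, seed=0):
+    """Duplicate randomly chosen minority examples until both classes have equal size."""
+    rng = random.Random(seed)
+    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
+    return majority, minority + extra
+
+def random_undersample(majority, minority, seed=0):
+    """Keep a random subset of the majority class, the same size as the minority class."""
+    rng = random.Random(seed)
+    return rng.sample(majority, len(minority)), minority
+
+# Toy data: 960 majority rows and 40 minority rows (IDs stand in for real rows).
+majority = [("maj", i) for i in range(960)]
+minority = [("min", i) for i in range(40)]
+
+over_maj, over_min = random_oversample(majority, minority)
+under_maj, under_min = random_undersample(majority, minority)
+
+print(len(over_maj), len(over_min))    # 960 960
+print(len(under_maj), len(under_min))  # 40 40
+```
+
+Resampling should normally be applied to the training split only, so that the test set keeps the real class distribution.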
+
+### 2. SMOTE (Synthetic Minority Oversampling Technique)
+Instead of just copying data, SMOTE creates **new, synthetic** examples.
+- **How it works**:
+  1. Pick a minority instance.
+  2. Find its k nearest neighbors among the other **minority** instances.
+  3. Create a new point at a random position on the line segment between the instance and one randomly chosen neighbor.
+- **Benefit**: Increases minority class size without just duplicating exact copies.
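+
+The three steps can be sketched as follows. This is a simplified, SMOTE-style illustration in plain Python, not a reference implementation: the function name `smote_like`, the default `k=3`, and the sample points are all assumptions.
+
+```python
+import math
+import random
+
+def smote_like(minority, n_new, k=3, seed=0):
+    """Create n_new synthetic points by interpolating between minority neighbours."""
+    rng = random.Random(seed)
+    synthetic = []
+    for _ in range(n_new):
+        # 1. Pick a minority instance.
+        x = rng.choice(minority)
+        # 2. Find its k nearest neighbours among the other minority points.
+        others = [p for p in minority if p is not x]
+        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
+        # 3. Create a new point on the segment between x and one neighbour.
+        nb = rng.choice(neighbours)
+        gap = rng.random()  # position along the segment, in [0, 1)
+        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
+    return synthetic
+
+# Hypothetical 2-D minority points (e.g. two scaled features per patient).
+minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8)]
+new_points = smote_like(minority, n_new=4)
+print(new_points)
+```
+
+Because each new point is a weighted average of two real minority points, it always lies between them rather than on top of either one.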
+
+### 3. Other Methods
+- Change the performance metric (use F1-score or AUC instead of Accuracy).
+- Use algorithms that tend to cope better with imbalance (such as tree-based models), and still evaluate them with a metric other than accuracy.
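+
+To illustrate why the choice of metric matters, the sketch below computes precision, recall, and F1-score for the minority class next to plain accuracy. The helper `precision_recall_f1` and the prediction counts are hypothetical, made up for this example.
+
+```python
+def precision_recall_f1(y_true, y_pred, positive):
+    """Precision, recall and F1 for the class labelled `positive`."""
+    pairs = list(zip(y_true, y_pred))
+    tp = sum(t == positive and p == positive for t, p in pairs)
+    fp = sum(t != positive and p == positive for t, p in pairs)
+    fn = sum(t == positive and p != positive for t, p in pairs)
+    precision = tp / (tp + fp) if tp + fp else 0.0
+    recall = tp / (tp + fn) if tp + fn else 0.0
+    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
+    return precision, recall, f1
+
+# Hypothetical predictions: 40 minority (0) and 960 majority (1) examples.
+# The model finds 10 of the 40 minority cases and raises 10 false alarms.
+y_true = [0] * 40 + [1] * 960
+y_pred = [0] * 10 + [1] * 30 + [0] * 10 + [1] * 950
+
+accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
+precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=0)
+
+# accuracy=0.96 precision=0.50 recall=0.25 f1=0.33
+print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
+```
+
+Accuracy still looks excellent at 0.96, while the minority-class F1-score of about 0.33 shows that most minority cases are being missed.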