# Machine Learning Notes
Welcome to your simplified Machine Learning notes! These notes are designed to be easy to understand.
## Table of Contents
1. [[01_Introduction_to_ML|Introduction to Machine Learning]]
- Supervised Learning
- Regression vs Classification
2. [[02_Data_Science_Process|Standard Process for Data Science (CRISP-DM)]]
- The 6 phases of a project
3. [[03_Logistic_Regression|Logistic Regression]]
- Odds and Probability
- Sigmoid Function
4. [[04_Model_Evaluation|Model Evaluation Metrics]]
- Confusion Matrix, Accuracy, Precision, Recall
- ROC and AUC
5. [[05_Imbalanced_Data|Handling Imbalanced Data]]
- SMOTE and Resampling
6. [[06_KNN_Algorithm|K-Nearest Neighbors (KNN)]]
- Distance Measures
- How KNN works
7. [[07_Naive_Bayes|Naive Bayes Classifier]]
- Bayes Theorem
- Spam Filter Example
8. [[08_Decision_Tree|Decision Tree Algorithm]]
- Nodes and Splitting
- Gini and Entropy

# Introduction to Machine Learning
## Supervised Learning
**Supervised learning** is like teaching a computer with examples. You give the computer inputs (predictors) and the correct answers (targets). The computer learns a "map" or rule to connect the inputs to the outputs.
- **Goal**: Find a model that maps input variables to a target variable.
- **Example**: Detecting phishing emails.
  - You show the computer emails containing phrases like "You have won a million".
- You tell the computer these are "Spam".
- The computer learns to flag similar new emails as Spam.
### Types of Supervised Learning
There are two main types of problems:
1. **Regression**: Predicting a number (e.g., predicting house prices).
2. **Classification**: Predicting a category or label (e.g., Spam vs Not Spam).
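To make the split concrete, here is a minimal sketch using scikit-learn; the toy numbers are invented for illustration:

```python
# Regression predicts a number; classification predicts a label (toy data).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a house price (a number) from its size.
sizes = np.array([[500], [1000], [1500], [2000]])
prices = np.array([50_000, 100_000, 150_000, 200_000])
reg = LinearRegression().fit(sizes, prices)
print(reg.predict([[1200]]))        # -> a number, e.g. ~120000

# Classification: predict Spam (1) or Not Spam (0) from a word count.
counts = np.array([[0], [1], [5], [8]])   # count of suspicious words
labels = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(counts, labels)
print(clf.predict([[6]]))           # -> a label, e.g. [1] (Spam)
```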
---
## Classification
In classification, the target variable is a **category** (also called a class label).
**Example**:
- Labels: Cold, Warm, Hot.
- The model maps an instance to one of these labels.
### Types of Classification
#### 1. Binary Classification
There are only **two** possible classes.
- **Examples**:
- Email: Spam or Not Spam.
- Loan: Approve or Reject.
- Medical: Disease or No Disease.
- Exam: Pass or Fail.
#### 2. Multiclass Classification
There are **more than two** classes.
- **Examples**:
- Digit Recognition: 0, 1, 2, ..., 9 (10 classes).
- Fruit: Apple, Banana, Mango, Orange.
- Movie Genre: Action, Comedy, Drama, Horror.
- Sentiment: Very Negative, Negative, Neutral, Positive, Very Positive.

# Standard Process for Data Science (CRISP-DM)
**CRISP-DM** stands for **Cr**oss **I**ndustry **S**tandard **P**rocess for **D**ata **M**ining. It is a standard way to do data mining projects.
It has **6 Phases**:
## 1. Business Understanding
**Goal**: Define what problem we are trying to solve.
- **Example**: An online retailer wants to classify items as "High Demand" or "Low Demand".
- **Questions**: Is item type related to demand? Can we predict demand accurately?
## 2. Data Understanding
**Goal**: Get to know the data.
- **Example**: Looking at the inventory data (orders, item type).
- **Insight**: Knowing if items are perishable (like milk) or non-perishable helps understand stock needs.
## 3. Data Preparation
**Goal**: Clean and format the data for the model.
- **Steps**:
- Handle missing values.
- Convert categories to numbers (dummy encoding).
- Check for connections (correlation) between variables.
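These steps are easy to sketch with pandas; the column names and values below are made up for illustration:

```python
# A minimal data-preparation sketch (invented inventory-style data).
import pandas as pd

df = pd.DataFrame({
    "orders": [120, 80, None, 200],
    "item_type": ["perishable", "non-perishable", "perishable", "perishable"],
})

df["orders"] = df["orders"].fillna(df["orders"].median())  # handle missing values
df = pd.get_dummies(df, columns=["item_type"])             # categories -> numbers
print(df.corr())                                           # correlations between variables
```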
## 4. Modeling
**Goal**: Build the machine learning model.
- We try to find a function that connects inputs (like number of orders) to the output (demand).
- We might try different models to find the best one.
## 5. Evaluation
**Goal**: Check how good the model is.
- We test the model on **unseen data** (data it hasn't seen before).
- We compare the **predicted** values with the **actual** values.
## 6. Deployment
**Goal**: Use the model in the real world.
- If the model is good, we put it to work.
- **Example**: Create an app where the retailer enters item details and gets a demand prediction.

# Logistic Regression
Logistic Regression is used for **classification** problems (predicting categories), even though it has "Regression" in its name.
## Odds vs Probability
### Probability
- The chance of an event happening out of **all** possibilities.
- **Formula**: `Probability = (Events in favour) / (Total observations)`
- Range: 0 to 1.
### Odds
- The ratio of events **happening** to events **not happening**.
- **Formula**: `Odds = (Events in favour) / (Events NOT in favour)`
### Log of Odds (Logit)
- We use the **Log of Odds** because raw odds are asymmetric: odds against an event are squeezed into the range 0 to 1, while odds in favour can grow without bound.
- Taking the log makes the scale symmetric around 0, so both directions are treated with consistent magnitude.
- This is the function used in Logistic Regression.
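A quick worked example of that symmetry, with values chosen for illustration:

```python
# Probability -> odds -> log-odds for an event and its complement.
import math

p = 0.8
odds = p / (1 - p)            # 0.8 / 0.2 = 4.0 (4-to-1 in favour)
print(math.log(odds))         # log-odds ~ +1.386

p = 0.2                       # the complementary case
print(math.log(p / (1 - p)))  # log-odds ~ -1.386 (same size, opposite sign)
```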
## The Sigmoid Function
Linear regression fits a straight line, which doesn't work well for classification (where we want output between 0 and 1).
Logistic regression uses an **S-shaped curve** called the **Sigmoid Function**.
- **Formula**: `S(z) = 1 / (1 + e^-z)`
- **Output**: Always between **0 and 1**.
- **Usage**:
- If output > Threshold (e.g., 0.5) -> Classify as **Positive** (e.g., Presence of fish).
- If output < Threshold -> Classify as **Negative** (e.g., Absence of fish).
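A minimal sketch of the sigmoid and the thresholding step, assuming a 0.5 threshold:

```python
import numpy as np

def sigmoid(z):
    """S(z) = 1 / (1 + e^-z); output is always in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, 0.0, 2.5])
probs = sigmoid(z)                  # ~[0.018, 0.5, 0.924]
labels = (probs > 0.5).astype(int)  # above the threshold -> Positive
print(probs, labels)                # -> [0 0 1]
```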
## Assumptions of Logistic Regression
1. **Independence of Errors**: Observations are independent of each other (e.g., no duplicated or repeated measurements of the same subject).
2. **Linearity in the Logit**: Relationship between independent variables and log-odds is linear.
3. **Absence of Multicollinearity**: Independent variables should not be highly correlated with each other.
4. **No Strong Outliers**: Extreme values should not heavily influence the model.

# Model Evaluation Metrics
How do we know if our classification model is good? We use several metrics.
## Confusion Matrix
A table that compares **Predicted** values with **Actual** values.
| | Predicted Negative | Predicted Positive |
|---|---|---|
| **Actual Negative** | True Negative (TN) | False Positive (FP) |
| **Actual Positive** | False Negative (FN) | True Positive (TP) |
- **TP**: Correctly predicted positive.
- **TN**: Correctly predicted negative.
- **FP**: Incorrectly predicted positive (Type I Error).
- **FN**: Incorrectly predicted negative (Type II Error).
## Key Metrics
### 1. Accuracy
- Fraction of **all** correct predictions.
- **Formula**: `(TP + TN) / Total`
- **Problem**: Not reliable if data is imbalanced (Accuracy Paradox).
### 2. Precision
- Out of all predicted positives, how many were actually positive?
- **Formula**: `TP / (TP + FP)`
- Higher is better.
### 3. Recall (Sensitivity / TPR)
- Out of all **actual** positives, how many did we find?
- **Formula**: `TP / (TP + FN)`
- Higher is better.
### 4. Specificity
- Out of all **actual** negatives, how many did we correctly identify?
- **Formula**: `TN / (TN + FP)`
### 5. F1 Score
- The harmonic mean of Precision and Recall.
- Good for balancing precision and recall, especially with uneven classes.
- **Formula**: `2 * (Precision * Recall) / (Precision + Recall)`
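All five metrics are one-liners once you have the four counts; the counts below are invented for illustration:

```python
# Metrics computed from confusion-matrix counts (toy numbers).
TP, TN, FP, FN = 40, 45, 5, 10

accuracy    = (TP + TN) / (TP + TN + FP + FN)       # 0.85
precision   = TP / (TP + FP)                        # ~0.889
recall      = TP / (TP + FN)                        # 0.80
specificity = TN / (TN + FP)                        # 0.90
f1 = 2 * precision * recall / (precision + recall)  # ~0.842
print(accuracy, precision, recall, specificity, f1)
```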
## ROC and AUC
### ROC Curve (Receiver Operating Characteristic)
- A plot of **TPR (Recall)** vs **FPR (False Positive Rate)**.
- Shows how the model performs at different thresholds.
### AUC (Area Under the Curve)
- Measures the entire area underneath the ROC curve.
- **Range**: 0 to 1.
- **Interpretation**: Higher AUC means the model is better at distinguishing between classes.
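A short sketch with scikit-learn, using invented labels and scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true   = [0, 0, 1, 1, 0, 1]                # actual classes
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # model's predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # points on the ROC curve
print(roc_auc_score(y_true, y_scores))              # area under that curve
```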

# Handling Imbalanced Data
## What is Imbalanced Data?
Data is **imbalanced** when one class has many more examples than the other.
- **Example**: 960 patients have diabetes, 40 do not.
- **Problem**: A model might just guess the majority class and get high accuracy (Accuracy Paradox), but it fails to find the minority class.
## Techniques to Handle Imbalance
### 1. Resampling
- **Up-sampling (Over-sampling)**: Randomly duplicate examples from the **minority** class.
- *Pros*: No information loss.
- *Cons*: Can lead to overfitting.
- **Down-sampling (Under-sampling)**: Randomly remove examples from the **majority** class.
- *Pros*: Balances the dataset.
- *Cons*: We lose potentially useful data.
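A minimal sketch of random up-sampling with scikit-learn's `resample` helper (down-sampling is the same call on the majority class with `replace=False` and a smaller `n_samples`); the rows are invented:

```python
from sklearn.utils import resample

minority_rows = [[1], [2], [3]]   # pretend these are the minority-class rows
upsampled = resample(
    minority_rows,
    replace=True,       # sample with replacement -> duplicates allowed
    n_samples=10,       # grow the minority class to 10 rows
    random_state=42,    # reproducible
)
print(len(upsampled))   # 10
```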
### 2. SMOTE (Synthetic Minority Oversampling Technique)
Instead of just copying data, SMOTE creates **new, synthetic** examples.
- **How it works**:
1. Pick a minority instance.
2. Find its nearest neighbors (similar instances).
3. Create a new point along the line between the instance and a neighbor.
- **Benefit**: Increases minority class size without just duplicating exact copies.
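A sketch using the third-party imbalanced-learn package (`pip install imbalanced-learn`), with a toy dataset generated on the spot:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A toy imbalanced dataset: roughly 95% majority / 5% minority.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)
print(Counter(y))                                 # heavily imbalanced

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))                             # balanced with synthetic points
```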
### 3. Other Methods
- Change the performance metric (use F1-score or AUC instead of Accuracy).
- Use algorithms that handle imbalance well (like Tree-based models).

# K-Nearest Neighbors (KNN)
**KNN** is a simple algorithm used for classification and regression.
## How it Works
1. Store all training data.
2. When a new data point comes in, find the **K** closest points (neighbors) to it.
3. **Vote**: Assign the class that is most common among those K neighbors.
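In scikit-learn those three steps look like this; the points are invented toy data:

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_train = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3
knn.fit(X_train, y_train)                  # "training" just stores the data
print(knn.predict([[2, 2]]))               # the 3 nearest neighbours vote -> ['A']
```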
## Key Characteristics
- **Instance-based Learning**: Uses training instances directly to predict.
- **Lazy Learning**: It doesn't "learn" a model during training. It waits until a prediction is needed.
- **Non-Parametric**: It makes no assumptions about the underlying data distribution.
## Choosing K
- **Small K**: Noisy and prone to overfitting (sensitive to outliers).
- **Large K**: Oversmooths the decision boundary and can miss local patterns (underfitting).
- **Tip**: Choose an **odd number** for K to avoid ties in voting.
## Distance Measures
How do we measure "closeness"?
### 1. Euclidean Distance
- The straight-line distance between two points.
- Used for numeric data.
- Formula: `sqrt((x2-x1)^2 + (y2-y1)^2)`
### 2. Manhattan Distance
- The distance if you can only move along a grid (like city blocks).
- Formula: `|x2-x1| + |y2-y1|`
### 3. Minkowski Distance
- A generalized form of Euclidean and Manhattan.
- Formula: `(|x2-x1|^p + |y2-y1|^p)^(1/p)` (p=1 gives Manhattan, p=2 gives Euclidean).
### 4. Chebyshev Distance
- The greatest difference along any coordinate dimension.
- Formula: `max(|x2-x1|, |y2-y1|)`
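All four measures written out with NumPy for a pair of 2-D points:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
diff = np.abs(a - b)                    # |x2-x1|, |y2-y1| = [3, 4]

euclidean = np.sqrt(np.sum(diff ** 2))  # 5.0 (straight line)
manhattan = np.sum(diff)                # 7.0 (city blocks)
p = 3
minkowski = np.sum(diff ** p) ** (1/p)  # ~4.50 (p=1 or p=2 recovers the above)
chebyshev = np.max(diff)                # 4.0 (largest coordinate gap)
print(euclidean, manhattan, minkowski, chebyshev)
```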
## Data Scaling
Since KNN uses distance, it is very sensitive to the scale of data.
- **Example**: If one feature ranges 0-1 and another 0-1000, the second one will dominate the distance.
- **Solution**: **Normalize** or **Standardize** data so all features contribute equally.
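A sketch of scaling before KNN with a scikit-learn pipeline; the feature ranges mimic the 0-1 vs 0-1000 example above:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The second feature spans 0-1000; unscaled, it would dominate the distance.
X = [[0.2, 150], [0.9, 900], [0.1, 100], [0.8, 950]]
y = [0, 1, 0, 1]

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[0.85, 120]]))
```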

# Naive Bayes Classifier
**Naive Bayes** is a classification algorithm based on **Bayes' Theorem**.
## Why "Naive"?
It is called "Naive" because it makes a simple assumption:
- **Assumption**: All features (predictors) are **independent** of each other.
- **Reality**: This is rarely true in real life, but the model still works surprisingly well.
## Bayes' Theorem
It calculates the probability of an event based on prior knowledge.
**Formula**:
`P(A|B) = (P(B|A) * P(A)) / P(B)`
- **P(A|B)**: **Posterior Probability** (Probability of class A given predictor B).
- **P(B|A)**: **Likelihood** (Probability of predictor B given class A).
- **P(A)**: **Prior Probability** (Probability of class A being true overall).
- **P(B)**: **Evidence** (Probability of predictor B occurring).
## Example: Spam Filtering
We want to label an email as **Spam** or **Ham** (Not Spam).
1. **Prior**: How common is spam overall? (e.g., 15% of emails are spam).
2. **Likelihood**: If an email is spam, how likely is it to contain the word "Money"?
3. **Evidence**: How common is the word "Money" in all emails?
4. **Posterior**: Given the email has "Money", what is the probability it is Spam?
We calculate this for all words and pick the class with the highest probability.
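Plugging made-up numbers into the four steps (the 15% prior comes from the example above; the other two values are assumed):

```python
p_spam       = 0.15  # Prior: 15% of emails are spam
p_money_spam = 0.40  # Likelihood: 40% of spam contains "Money" (assumed)
p_money      = 0.10  # Evidence: 10% of all emails contain "Money" (assumed)

# Posterior: P(Spam | "Money") = P("Money" | Spam) * P(Spam) / P("Money")
p_spam_money = p_money_spam * p_spam / p_money
print(p_spam_money)  # 0.6 -> a 60% chance the email is spam
```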

# Decision Tree Algorithm
A **Decision Tree** is like a flowchart used for making decisions. It splits data into smaller groups based on rules.
## Structure
- **Root Node**: The starting point. It represents the entire dataset.
- **Decision Nodes**: Points where the data is split based on a question (e.g., "Is Petal Length < 2.45?").
- **Leaf Nodes (Terminal Nodes)**: The final output (class label) where no more splits happen.
## How it Splits Data
The tree wants to make the groups as "pure" as possible (containing only one class).
### Splitting Criteria
1. **Gini Impurity** (Default):
- Measures how mixed the classes are.
- **0** = Pure (all same class).
   - **0.5** = Maximally impure for two classes (an even mix).
- The tree tries to **minimize** Gini.
2. **Entropy**:
- Measures disorder or randomness.
- **0** = Pure.
   - **1** = Maximally disordered for two evenly mixed classes.
- The tree tries to **reduce** Entropy (maximize Information Gain).
3. **Information Gain**:
- The difference in Entropy before and after a split.
- We choose the split that gives the **highest** Information Gain.
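Both impurity measures are short functions of the class counts in a node:

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy: -sum of p * log2(p) over the classes present."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(gini([5, 5]), entropy([5, 5]))    # 0.5, 1.0 -> maximally mixed (2 classes)
print(gini([10, 0]), entropy([10, 0]))  # 0.0, 0.0 -> pure node
```

Information Gain is then entropy(parent) minus the weighted average entropy of the children, so with these helpers it is one extra line.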
## Parameters to Control the Tree
- **max_depth**: How deep the tree can grow. (Too deep = Overfitting).
- **min_samples_split**: Minimum samples needed to split a node.
- **max_features**: Number of features to consider for each split.
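These are constructor arguments in scikit-learn's `DecisionTreeClassifier`; a quick sketch on the built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    criterion="gini",     # or "entropy"
    max_depth=3,          # cap depth to limit overfitting
    min_samples_split=4,  # need at least 4 samples to split a node
    max_features=2,       # consider 2 of the 4 iris features per split
    random_state=42,
)
tree.fit(X, y)
print(tree.score(X, y))   # accuracy on the training data
```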