# Machine Learning Notes
Welcome to your simplified Machine Learning notes! These notes are designed to be easy to understand.
## Table of Contents
1. [[01_Introduction_to_ML|Introduction to Machine Learning]]
- Supervised Learning
- Regression vs Classification
2. [[02_Data_Science_Process|Standard Process for Data Science (CRISP-DM)]]
- The 6 phases of a project
3. [[03_Logistic_Regression|Logistic Regression]]
- Odds and Probability
- Sigmoid Function
4. [[04_Model_Evaluation|Model Evaluation Metrics]]
- Confusion Matrix, Accuracy, Precision, Recall
- ROC and AUC
5. [[05_Imbalanced_Data|Handling Imbalanced Data]]
- SMOTE and Resampling
6. [[06_KNN_Algorithm|K-Nearest Neighbors (KNN)]]
- Distance Measures
- How KNN works
7. [[07_Naive_Bayes|Naive Bayes Classifier]]
- Bayes Theorem
- Spam Filter Example
8. [[08_Decision_Tree|Decision Tree Algorithm]]
- Nodes and Splitting
- Gini and Entropy

# Introduction to Machine Learning
## Supervised Learning
**Supervised learning** is like teaching a computer with examples. You give the computer inputs (predictors) and the correct answers (targets). The computer learns a "map" or rule to connect the inputs to the outputs.
- **Goal**: Find a model that maps input variables to a target variable.
- **Example**: Detecting phishing emails.
  - You show the computer emails containing phrases like "You have won a million".
- You tell the computer these are "Spam".
- The computer learns to flag similar new emails as Spam.
### Types of Supervised Learning
There are two main types of problems:
1. **Regression**: Predicting a number (e.g., predicting house prices).
2. **Classification**: Predicting a category or label (e.g., Spam vs Not Spam).
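To make the split concrete, here is a minimal sketch using scikit-learn; the toy numbers are invented for illustration:

```python
# Regression predicts a number; classification predicts a label (toy data).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a house price (a number) from its size.
sizes = np.array([[500], [1000], [1500], [2000]])
prices = np.array([50_000, 100_000, 150_000, 200_000])
reg = LinearRegression().fit(sizes, prices)
print(reg.predict([[1200]]))        # -> a number, e.g. ~120000

# Classification: predict Spam (1) or Not Spam (0) from a word count.
counts = np.array([[0], [1], [5], [8]])   # count of suspicious words
labels = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(counts, labels)
print(clf.predict([[6]]))           # -> a label, e.g. [1] (Spam)
```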
---
## Classification
In classification, the target variable is a **category** (also called a class label).
**Example**:
- Labels: Cold, Warm, Hot.
- The model maps an instance to one of these labels.
### Types of Classification
#### 1. Binary Classification
There are only **two** possible classes.
- **Examples**:
- Email: Spam or Not Spam.
- Loan: Approve or Reject.
- Medical: Disease or No Disease.
- Exam: Pass or Fail.
#### 2. Multiclass Classification
There are **more than two** classes.
- **Examples**:
- Digit Recognition: 0, 1, 2, ..., 9 (10 classes).
- Fruit: Apple, Banana, Mango, Orange.
- Movie Genre: Action, Comedy, Drama, Horror.
- Sentiment: Very Negative, Negative, Neutral, Positive, Very Positive.

# Standard Process for Data Science (CRISP-DM)
**CRISP-DM** stands for **Cr**oss **I**ndustry **S**tandard **P**rocess for **D**ata **M**ining. It is a standard way to do data mining projects.
It has **6 Phases**:
## 1. Business Understanding
**Goal**: Define what problem we are trying to solve.
- **Example**: An online retailer wants to classify items as "High Demand" or "Low Demand".
- **Questions**: Is item type related to demand? Can we predict demand accurately?
## 2. Data Understanding
**Goal**: Get to know the data.
- **Example**: Looking at the inventory data (orders, item type).
- **Insight**: Knowing if items are perishable (like milk) or non-perishable helps understand stock needs.
## 3. Data Preparation
**Goal**: Clean and format the data for the model.
- **Steps**:
- Handle missing values.
- Convert categories to numbers (dummy encoding).
- Check for connections (correlation) between variables.
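These steps are easy to sketch with pandas; the column names and values below are made up for illustration:

```python
# A minimal data-preparation sketch (invented inventory-style data).
import pandas as pd

df = pd.DataFrame({
    "orders": [120, 80, None, 200],
    "item_type": ["perishable", "non-perishable", "perishable", "perishable"],
})

df["orders"] = df["orders"].fillna(df["orders"].median())  # handle missing values
df = pd.get_dummies(df, columns=["item_type"])             # categories -> numbers
print(df.corr())                                           # correlations between variables
```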
## 4. Modeling
**Goal**: Build the machine learning model.
- We try to find a function that connects inputs (like number of orders) to the output (demand).
- We might try different models to find the best one.
## 5. Evaluation
**Goal**: Check how good the model is.
- We test the model on **unseen data** (data it hasn't seen before).
- We compare the **predicted** values with the **actual** values.
## 6. Deployment
**Goal**: Use the model in the real world.
- If the model is good, we put it to work.
- **Example**: Create an app where the retailer enters item details and gets a demand prediction.

# Logistic Regression
Logistic Regression is used for **classification** problems (predicting categories), even though it has "Regression" in its name.
## Odds vs Probability
### Probability
- The chance of an event happening out of **all** possibilities.
- **Formula**: `Probability = (Events in favour) / (Total observations)`
- Range: 0 to 1.
### Odds
- The ratio of events **happening** to events **not happening**.
- **Formula**: `Odds = (Events in favour) / (Events NOT in favour)`
### Log of Odds (Logit)
- We use the **Log of Odds** because raw odds are asymmetric: odds against an event are squeezed into the range 0 to 1, while odds in favour can grow without bound.
- Taking the log makes the scale symmetric around 0, so both directions are treated with consistent magnitude.
- This is the function used in Logistic Regression.
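A quick worked example of that symmetry, with values chosen for illustration:

```python
# Probability -> odds -> log-odds for an event and its complement.
import math

p = 0.8
odds = p / (1 - p)            # 0.8 / 0.2 = 4.0 (4-to-1 in favour)
print(math.log(odds))         # log-odds ~ +1.386

p = 0.2                       # the complementary case
print(math.log(p / (1 - p)))  # log-odds ~ -1.386 (same size, opposite sign)
```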
## The Sigmoid Function
Linear regression fits a straight line, which doesn't work well for classification (where we want output between 0 and 1).
Logistic regression uses an **S-shaped curve** called the **Sigmoid Function**.
- **Formula**: `S(z) = 1 / (1 + e^-z)`
- **Output**: Always between **0 and 1**.
- **Usage**:
- If output > Threshold (e.g., 0.5) -> Classify as **Positive** (e.g., Presence of fish).
- If output < Threshold -> Classify as **Negative** (e.g., Absence of fish).
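A minimal sketch of the sigmoid and the thresholding step, assuming a 0.5 threshold:

```python
import numpy as np

def sigmoid(z):
    """S(z) = 1 / (1 + e^-z); output is always in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, 0.0, 2.5])
probs = sigmoid(z)                  # ~[0.018, 0.5, 0.924]
labels = (probs > 0.5).astype(int)  # above the threshold -> Positive
print(probs, labels)                # -> [0 0 1]
```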
## Assumptions of Logistic Regression
1. **Independence of Errors**: Observations are independent of each other (e.g., no duplicated or repeated measurements of the same subject).
2. **Linearity in the Logit**: Relationship between independent variables and log-odds is linear.
3. **Absence of Multicollinearity**: Independent variables should not be highly correlated with each other.
4. **No Strong Outliers**: Extreme values should not heavily influence the model.

# Model Evaluation Metrics
How do we know if our classification model is good? We use several metrics.
## Confusion Matrix
A table that compares **Predicted** values with **Actual** values.
| | Predicted Negative | Predicted Positive |
|---|---|---|
| **Actual Negative** | True Negative (TN) | False Positive (FP) |
| **Actual Positive** | False Negative (FN) | True Positive (TP) |
- **TP**: Correctly predicted positive.
- **TN**: Correctly predicted negative.
- **FP**: Incorrectly predicted positive (Type I Error).
- **FN**: Incorrectly predicted negative (Type II Error).
## Key Metrics
### 1. Accuracy
- Fraction of **all** correct predictions.
- **Formula**: `(TP + TN) / Total`
- **Problem**: Not reliable if data is imbalanced (Accuracy Paradox).
### 2. Precision
- Out of all predicted positives, how many were actually positive?
- **Formula**: `TP / (TP + FP)`
- Higher is better.
### 3. Recall (Sensitivity / TPR)
- Out of all **actual** positives, how many did we find?
- **Formula**: `TP / (TP + FN)`
- Higher is better.
### 4. Specificity
- Out of all **actual** negatives, how many did we correctly identify?
- **Formula**: `TN / (TN + FP)`
### 5. F1 Score
- The harmonic mean of Precision and Recall.
- Good for balancing precision and recall, especially with uneven classes.
- **Formula**: `2 * (Precision * Recall) / (Precision + Recall)`
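All five metrics are one-liners once you have the four counts; the counts below are invented for illustration:

```python
# Metrics computed from confusion-matrix counts (toy numbers).
TP, TN, FP, FN = 40, 45, 5, 10

accuracy    = (TP + TN) / (TP + TN + FP + FN)       # 0.85
precision   = TP / (TP + FP)                        # ~0.889
recall      = TP / (TP + FN)                        # 0.80
specificity = TN / (TN + FP)                        # 0.90
f1 = 2 * precision * recall / (precision + recall)  # ~0.842
print(accuracy, precision, recall, specificity, f1)
```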
## ROC and AUC
### ROC Curve (Receiver Operating Characteristic)
- A plot of **TPR (Recall)** vs **FPR (False Positive Rate)**.
- Shows how the model performs at different thresholds.
### AUC (Area Under the Curve)
- Measures the entire area underneath the ROC curve.
- **Range**: 0 to 1.
- **Interpretation**: Higher AUC means the model is better at distinguishing between classes.
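A short sketch with scikit-learn, using invented labels and scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true   = [0, 0, 1, 1, 0, 1]                # actual classes
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # model's predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # points on the ROC curve
print(roc_auc_score(y_true, y_scores))              # area under that curve
```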

# Handling Imbalanced Data
## What is Imbalanced Data?
Data is **imbalanced** when one class has many more examples than the other.
- **Example**: 960 patients have diabetes, 40 do not.
- **Problem**: A model might just guess the majority class and get high accuracy (Accuracy Paradox), but it fails to find the minority class.
## Techniques to Handle Imbalance
### 1. Resampling
- **Up-sampling (Over-sampling)**: Randomly duplicate examples from the **minority** class.
- *Pros*: No information loss.
- *Cons*: Can lead to overfitting.
- **Down-sampling (Under-sampling)**: Randomly remove examples from the **majority** class.
- *Pros*: Balances the dataset.
- *Cons*: We lose potentially useful data.
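A minimal sketch of random up-sampling with scikit-learn's `resample` helper (down-sampling is the same call on the majority class with `replace=False` and a smaller `n_samples`); the rows are invented:

```python
from sklearn.utils import resample

minority_rows = [[1], [2], [3]]   # pretend these are the minority-class rows
upsampled = resample(
    minority_rows,
    replace=True,       # sample with replacement -> duplicates allowed
    n_samples=10,       # grow the minority class to 10 rows
    random_state=42,    # reproducible
)
print(len(upsampled))   # 10
```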
### 2. SMOTE (Synthetic Minority Oversampling Technique)
Instead of just copying data, SMOTE creates **new, synthetic** examples.
- **How it works**:
1. Pick a minority instance.
2. Find its nearest neighbors (similar instances).
3. Create a new point along the line between the instance and a neighbor.
- **Benefit**: Increases minority class size without just duplicating exact copies.
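A sketch using the third-party imbalanced-learn package (`pip install imbalanced-learn`), with a toy dataset generated on the spot:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A toy imbalanced dataset: roughly 95% majority / 5% minority.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)
print(Counter(y))                                 # heavily imbalanced

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))                             # balanced with synthetic points
```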
### 3. Other Methods
- Change the performance metric (use F1-score or AUC instead of Accuracy).
- Use algorithms that handle imbalance well (like Tree-based models).

# K-Nearest Neighbors (KNN)
**KNN** is a simple algorithm used for classification and regression.
## How it Works
1. Store all training data.
2. When a new data point comes in, find the **K** closest points (neighbors) to it.
3. **Vote**: Assign the class that is most common among those K neighbors.
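In scikit-learn those three steps look like this; the points are invented toy data:

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_train = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3
knn.fit(X_train, y_train)                  # "training" just stores the data
print(knn.predict([[2, 2]]))               # the 3 nearest neighbours vote -> ['A']
```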
## Key Characteristics
- **Instance-based Learning**: Uses training instances directly to predict.
- **Lazy Learning**: It doesn't "learn" a model during training. It waits until a prediction is needed.
- **Non-Parametric**: It makes no assumptions about the underlying data distribution.
## Choosing K
- **Small K**: Noisy and prone to overfitting (sensitive to outliers).
- **Large K**: Oversmooths the decision boundary and can miss local patterns (underfitting).
- **Tip**: Choose an **odd number** for K to avoid ties in voting.
## Distance Measures
How do we measure "closeness"?
### 1. Euclidean Distance
- The straight-line distance between two points.
- Used for numeric data.
- Formula: `sqrt((x2-x1)^2 + (y2-y1)^2)`
### 2. Manhattan Distance
- The distance if you can only move along a grid (like city blocks).
- Formula: `|x2-x1| + |y2-y1|`
### 3. Minkowski Distance
- A generalized form of Euclidean and Manhattan.
- Formula: `(|x2-x1|^p + |y2-y1|^p)^(1/p)` (p=1 gives Manhattan, p=2 gives Euclidean).
### 4. Chebyshev Distance
- The greatest difference along any coordinate dimension.
- Formula: `max(|x2-x1|, |y2-y1|)`
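All four measures written out with NumPy for a pair of 2-D points:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
diff = np.abs(a - b)                    # |x2-x1|, |y2-y1| = [3, 4]

euclidean = np.sqrt(np.sum(diff ** 2))  # 5.0 (straight line)
manhattan = np.sum(diff)                # 7.0 (city blocks)
p = 3
minkowski = np.sum(diff ** p) ** (1/p)  # ~4.50 (p=1 or p=2 recovers the above)
chebyshev = np.max(diff)                # 4.0 (largest coordinate gap)
print(euclidean, manhattan, minkowski, chebyshev)
```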
## Data Scaling
Since KNN uses distance, it is very sensitive to the scale of data.
- **Example**: If one feature ranges 0-1 and another 0-1000, the second one will dominate the distance.
- **Solution**: **Normalize** or **Standardize** data so all features contribute equally.
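A sketch of scaling before KNN with a scikit-learn pipeline; the feature ranges mimic the 0-1 vs 0-1000 example above:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The second feature spans 0-1000; unscaled, it would dominate the distance.
X = [[0.2, 150], [0.9, 900], [0.1, 100], [0.8, 950]]
y = [0, 1, 0, 1]

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)
print(model.predict([[0.85, 120]]))
```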

# Naive Bayes Classifier
**Naive Bayes** is a classification algorithm based on **Bayes' Theorem**.
## Why "Naive"?
It is called "Naive" because it makes a simple assumption:
- **Assumption**: All features (predictors) are **independent** of each other.
- **Reality**: This is rarely true in real life, but the model still works surprisingly well.
## Bayes' Theorem
It calculates the probability of an event based on prior knowledge.
**Formula**:
`P(A|B) = (P(B|A) * P(A)) / P(B)`
- **P(A|B)**: **Posterior Probability** (Probability of class A given predictor B).
- **P(B|A)**: **Likelihood** (Probability of predictor B given class A).
- **P(A)**: **Prior Probability** (Probability of class A being true overall).
- **P(B)**: **Evidence** (Probability of predictor B occurring).
## Example: Spam Filtering
We want to label an email as **Spam** or **Ham** (Not Spam).
1. **Prior**: How common is spam overall? (e.g., 15% of emails are spam).
2. **Likelihood**: If an email is spam, how likely is it to contain the word "Money"?
3. **Evidence**: How common is the word "Money" in all emails?
4. **Posterior**: Given the email has "Money", what is the probability it is Spam?
We calculate this for all words and pick the class with the highest probability.
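Plugging made-up numbers into the four steps (the 15% prior comes from the example above; the other two values are assumed):

```python
p_spam       = 0.15  # Prior: 15% of emails are spam
p_money_spam = 0.40  # Likelihood: 40% of spam contains "Money" (assumed)
p_money      = 0.10  # Evidence: 10% of all emails contain "Money" (assumed)

# Posterior: P(Spam | "Money") = P("Money" | Spam) * P(Spam) / P("Money")
p_spam_money = p_money_spam * p_spam / p_money
print(p_spam_money)  # 0.6 -> a 60% chance the email is spam
```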

# Decision Tree Algorithm
A **Decision Tree** is like a flowchart used for making decisions. It splits data into smaller groups based on rules.
## Structure
- **Root Node**: The starting point. It represents the entire dataset.
- **Decision Nodes**: Points where the data is split based on a question (e.g., "Is Petal Length < 2.45?").
- **Leaf Nodes (Terminal Nodes)**: The final output (class label) where no more splits happen.
## How it Splits Data
The tree wants to make the groups as "pure" as possible (containing only one class).
### Splitting Criteria
1. **Gini Impurity** (Default):
- Measures how mixed the classes are.
- **0** = Pure (all same class).
   - **0.5** = Maximally impure for two classes (an even mix).
- The tree tries to **minimize** Gini.
2. **Entropy**:
- Measures disorder or randomness.
- **0** = Pure.
   - **1** = Maximally disordered for two evenly mixed classes.
- The tree tries to **reduce** Entropy (maximize Information Gain).
3. **Information Gain**:
- The difference in Entropy before and after a split.
- We choose the split that gives the **highest** Information Gain.
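Both impurity measures are short functions of the class counts in a node:

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy: -sum of p * log2(p) over the classes present."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(gini([5, 5]), entropy([5, 5]))    # 0.5, 1.0 -> maximally mixed (2 classes)
print(gini([10, 0]), entropy([10, 0]))  # 0.0, 0.0 -> pure node
```

Information Gain is then entropy(parent) minus the weighted average entropy of the children, so with these helpers it is one extra line.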
## Parameters to Control the Tree
- **max_depth**: How deep the tree can grow. (Too deep = Overfitting).
- **min_samples_split**: Minimum samples needed to split a node.
- **max_features**: Number of features to consider for each split.
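These are constructor arguments in scikit-learn's `DecisionTreeClassifier`; a quick sketch on the built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    criterion="gini",     # or "entropy"
    max_depth=3,          # cap depth to limit overfitting
    min_samples_split=4,  # need at least 4 samples to split a node
    max_features=2,       # consider 2 of the 4 iris features per split
    random_state=42,
)
tree.fit(X, y)
print(tree.score(X, y))   # accuracy on the training data
```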