unit 2 added
28
unit 2/00_Index.md
Normal file
@@ -0,0 +1,28 @@
# Machine Learning Notes

Welcome to your simplified Machine Learning notes! These notes are designed to be easy to understand.

## Table of Contents

1. [[01_Introduction_to_ML|Introduction to Machine Learning]]
    - Supervised Learning
    - Regression vs Classification
2. [[02_Data_Science_Process|Standard Process for Data Science (CRISP-DM)]]
    - The 6 phases of a project
3. [[03_Logistic_Regression|Logistic Regression]]
    - Odds and Probability
    - Sigmoid Function
4. [[04_Model_Evaluation|Model Evaluation Metrics]]
    - Confusion Matrix, Accuracy, Precision, Recall
    - ROC and AUC
5. [[05_Imbalanced_Data|Handling Imbalanced Data]]
    - SMOTE and Resampling
6. [[06_KNN_Algorithm|K-Nearest Neighbors (KNN)]]
    - Distance Measures
    - How KNN works
7. [[07_Naive_Bayes|Naive Bayes Classifier]]
    - Bayes Theorem
    - Spam Filter Example
8. [[08_Decision_Tree|Decision Tree Algorithm]]
    - Nodes and Splitting
    - Gini and Entropy
42
unit 2/01_Introduction_to_ML.md
Normal file
@@ -0,0 +1,42 @@
# Introduction to Machine Learning

## Supervised Learning

**Supervised learning** is like teaching a computer with examples. You give the computer inputs (predictors) and the correct answers (targets). The computer learns a "map" or rule to connect the inputs to the outputs.

- **Goal**: Find a model that maps input variables to a target variable.
- **Example**: Detecting phishing emails.
    - You show the computer emails with phrases like "You have won a million dollars".
    - You tell the computer these are "Spam".
    - The computer learns to flag similar new emails as Spam.

### Types of Supervised Learning

There are two main types of problems (sketched below):

1. **Regression**: Predicting a number (e.g., predicting house prices).
2. **Classification**: Predicting a category or label (e.g., Spam vs Not Spam).
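A minimal sketch contrasting the two problem types, assuming scikit-learn is installed; the data is made up purely for illustration.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a number (house price in $1000s from size in m^2).
X_houses = [[50], [80], [120]]
y_prices = [150, 240, 360]
reg = LinearRegression().fit(X_houses, y_prices)
print(reg.predict([[100]]))  # ~300

# Classification: predict a label (1 = Spam, 0 = Not Spam) from a word count.
X_emails = [[0], [1], [5], [8]]  # count of suspicious words per email
y_labels = [0, 0, 1, 1]
clf = LogisticRegression().fit(X_emails, y_labels)
print(clf.predict([[4]]))  # e.g. [1] -> Spam
```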
---

## Classification

In classification, the target variable is a **category** (also called a class label).

**Example**:

- Labels: Cold, Warm, Hot.
- The model maps an instance to one of these labels.

### Types of Classification

#### 1. Binary Classification

There are only **two** possible classes.

- **Examples**:
    - Email: Spam or Not Spam.
    - Loan: Approve or Reject.
    - Medical: Disease or No Disease.
    - Exam: Pass or Fail.

#### 2. Multiclass Classification

There are **more than two** classes.

- **Examples**:
    - Digit Recognition: 0, 1, 2, ..., 9 (10 classes).
    - Fruit: Apple, Banana, Mango, Orange.
    - Movie Genre: Action, Comedy, Drama, Horror.
    - Sentiment: Very Negative, Negative, Neutral, Positive, Very Positive.
37
unit 2/02_Data_Science_Process.md
Normal file
@@ -0,0 +1,37 @@
# Standard Process for Data Science (CRISP-DM)

**CRISP-DM** stands for **Cr**oss **I**ndustry **S**tandard **P**rocess for **D**ata **M**ining. It is a standard way to run data mining projects.

It has **6 phases**:

## 1. Business Understanding

**Goal**: Define what problem we are trying to solve.

- **Example**: An online retailer wants to classify items as "High Demand" or "Low Demand".
- **Questions**: Is item type related to demand? Can we predict demand accurately?

## 2. Data Understanding

**Goal**: Get to know the data.

- **Example**: Looking at the inventory data (orders, item type).
- **Insight**: Knowing whether items are perishable (like milk) or non-perishable helps understand stock needs.

## 3. Data Preparation

**Goal**: Clean and format the data for the model.

- **Steps** (see the sketch after this list):
    - Handle missing values.
    - Convert categories to numbers (dummy encoding).
    - Check for connections (correlation) between variables.
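A small sketch of those three steps with pandas, using hypothetical toy inventory data.

```python
import pandas as pd

df = pd.DataFrame({
    "item_type": ["perishable", "non-perishable", "perishable", None],
    "orders": [120, 45, None, 80],
})

df["orders"] = df["orders"].fillna(df["orders"].median())  # handle missing values
df["item_type"] = df["item_type"].fillna("unknown")
df = pd.get_dummies(df, columns=["item_type"])             # dummy encoding
print(df.corr())                                           # check correlations
```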
## 4. Modeling

**Goal**: Build the machine learning model.

- We try to find a function that connects inputs (like number of orders) to the output (demand).
- We might try different models to find the best one.

## 5. Evaluation

**Goal**: Check how good the model is.

- We test the model on **unseen data** (data it hasn't seen before).
- We compare the **predicted** values with the **actual** values.

## 6. Deployment

**Goal**: Use the model in the real world.

- If the model is good, we put it to work.
- **Example**: Create an app where the retailer enters item details and gets a demand prediction.
35
unit 2/03_Logistic_Regression.md
Normal file
@@ -0,0 +1,35 @@
# Logistic Regression

Logistic Regression is used for **classification** problems (predicting categories), even though it has "Regression" in its name.

## Odds vs Probability

### Probability

- The chance of an event happening out of **all** possibilities.
- **Formula**: `Probability = (Events in favour) / (Total observations)`
- Range: 0 to 1.

### Odds

- The ratio of events **happening** to events **not happening**.
- **Formula**: `Odds = (Events in favour) / (Events NOT in favour)`
- Range: 0 to infinity.

### Log of Odds (Logit)

- Odds are asymmetric: an event and its complement have reciprocal odds (e.g., 4 vs 0.25), so magnitudes vary widely.
- Taking the **log of odds** makes the scale symmetric around 0 and stretches it from minus infinity to plus infinity, which a linear function can model.
- This logit is the function used in Logistic Regression (a worked example follows).
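A worked numeric example of the three quantities above, with made-up numbers: 8 of 10 observations are in favour.

```python
import math

in_favour, total = 8, 10
probability = in_favour / total         # 0.8
odds = in_favour / (total - in_favour)  # 8 / 2 = 4.0
log_odds = math.log(odds)               # ~1.386

print(probability, odds, log_odds)
# Note the asymmetry of odds: the complementary event has odds 2/8 = 0.25,
# but its log-odds is -1.386, i.e. symmetric around 0 on the log scale.
```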
## The Sigmoid Function

Linear regression fits a straight line, which doesn't work well for classification (where we want output between 0 and 1).

Logistic regression uses an **S-shaped curve** called the **Sigmoid Function**.

- **Formula**: `S(z) = 1 / (1 + e^-z)`
- **Output**: Always between **0 and 1**.
- **Usage** (see the sketch below):
    - If output > Threshold (e.g., 0.5) -> Classify as **Positive** (e.g., presence of fish).
    - If output < Threshold -> Classify as **Negative** (e.g., absence of fish).
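A minimal pure-Python sketch of the sigmoid and the thresholding rule, using the 0.5 threshold from the example above.

```python
import math

def sigmoid(z):
    """Map any real number z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-4, 0, 4):
    p = sigmoid(z)
    label = "Positive" if p > 0.5 else "Negative"
    print(f"z={z:+d}  S(z)={p:.3f}  ->  {label}")
# z=-4 gives 0.018 (Negative); z=0 gives exactly 0.500 (not above the
# threshold, so Negative); z=+4 gives 0.982 (Positive).
```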
## Assumptions of Logistic Regression

1. **Independence of Errors**: Observations are independent of one another (e.g., no repeated measurements of the same subject).
2. **Linearity in the Logit**: The relationship between each independent variable and the log-odds is linear.
3. **Absence of Multicollinearity**: Independent variables should not be highly correlated with each other.
4. **No Strong Outliers**: Extreme values should not heavily influence the model.
53
unit 2/04_Model_Evaluation.md
Normal file
@@ -0,0 +1,53 @@
# Model Evaluation Metrics

How do we know if our classification model is good? We use several metrics.

## Confusion Matrix

A table that compares **Predicted** values with **Actual** values.

| | Predicted Negative | Predicted Positive |
|---|---|---|
| **Actual Negative** | True Negative (TN) | False Positive (FP) |
| **Actual Positive** | False Negative (FN) | True Positive (TP) |

- **TP**: Correctly predicted positive.
- **TN**: Correctly predicted negative.
- **FP**: Incorrectly predicted positive (Type I Error).
- **FN**: Incorrectly predicted negative (Type II Error).

## Key Metrics

### 1. Accuracy

- Fraction of **all** correct predictions.
- **Formula**: `(TP + TN) / Total`
- **Problem**: Not reliable if data is imbalanced (Accuracy Paradox).

### 2. Precision

- Out of all predicted positives, how many were actually positive?
- **Formula**: `TP / (TP + FP)`
- Higher is better.

### 3. Recall (Sensitivity / TPR)

- Out of all **actual** positives, how many did we find?
- **Formula**: `TP / (TP + FN)`
- Higher is better.

### 4. Specificity

- Out of all **actual** negatives, how many did we correctly identify?
- **Formula**: `TN / (TN + FP)`

### 5. F1 Score

- The harmonic mean of Precision and Recall.
- Good for balancing precision and recall, especially with uneven classes.
- **Formula**: `2 * (Precision * Recall) / (Precision + Recall)` (all five metrics are computed in the sketch below)
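A sketch computing every metric above from raw counts; the TP/TN/FP/FN values are made up, so plug in your own.

```python
TP, TN, FP, FN = 40, 45, 5, 10
total = TP + TN + FP + FN

accuracy    = (TP + TN) / total
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)          # sensitivity / TPR
specificity = TN / (TN + FP)
f1          = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} specificity={specificity:.2f} f1={f1:.2f}")
# accuracy=0.85 precision=0.89 recall=0.80 specificity=0.90 f1=0.84
```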
## ROC and AUC

### ROC Curve (Receiver Operating Characteristic)

- A plot of **TPR (Recall)** vs **FPR (False Positive Rate)**.
- Shows how the model performs at different thresholds.

### AUC (Area Under the Curve)

- Measures the entire area underneath the ROC curve.
- **Range**: 0 to 1.
- **Interpretation**: Higher AUC means the model is better at distinguishing between classes (see the sketch below).
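A minimal sketch of ROC/AUC, assuming scikit-learn is installed; in practice `y_score` would come from `model.predict_proba`, here it is toy data.

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]               # actual labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one point per threshold
print("AUC =", roc_auc_score(y_true, y_score))     # closer to 1.0 is better
```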
28
unit 2/05_Imbalanced_Data.md
Normal file
@@ -0,0 +1,28 @@
# Handling Imbalanced Data

## What is Imbalanced Data?

Data is **imbalanced** when one class has many more examples than the other.

- **Example**: 960 patients have diabetes, 40 do not.
- **Problem**: A model might just guess the majority class and get high accuracy (Accuracy Paradox), but it fails to find the minority class.

## Techniques to Handle Imbalance

### 1. Resampling

- **Up-sampling (Over-sampling)**: Randomly duplicate examples from the **minority** class.
    - *Pros*: No information loss.
    - *Cons*: Can lead to overfitting.
- **Down-sampling (Under-sampling)**: Randomly remove examples from the **majority** class.
    - *Pros*: Balances the dataset.
    - *Cons*: We lose potentially useful data. (A sketch of up-sampling follows.)
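A sketch of random up-sampling with scikit-learn's `resample` helper (a real utility in `sklearn.utils`); the toy lists stand in for rows of a real dataset.

```python
from sklearn.utils import resample

majority = ["maj"] * 9  # 9 majority-class examples
minority = ["min"] * 1  # 1 minority-class example

# Up-sampling: draw from the minority class WITH replacement until balanced.
minority_up = resample(minority, replace=True, n_samples=len(majority),
                       random_state=42)
balanced = majority + minority_up
print(len(balanced), balanced.count("min"))  # 18 9
```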
### 2. SMOTE (Synthetic Minority Oversampling Technique)

Instead of just copying data, SMOTE creates **new, synthetic** examples.

- **How it works**:
    1. Pick a minority instance.
    2. Find its nearest neighbors (similar instances).
    3. Create a new point along the line between the instance and a neighbor.
- **Benefit**: Increases minority class size without just duplicating exact copies (see the sketch below).
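A sketch using the `imbalanced-learn` package's SMOTE implementation (assumes `pip install imbalanced-learn scikit-learn`); the data is synthetic.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy dataset: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # classes now equal in size
```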
### 3. Other Methods

- Change the performance metric (use F1-score or AUC instead of Accuracy).
- Use algorithms that handle imbalance well (like tree-based models).
41
unit 2/06_KNN_Algorithm.md
Normal file
@@ -0,0 +1,41 @@
# K-Nearest Neighbors (KNN)

**KNN** is a simple algorithm used for classification and regression.

## How it Works

1. Store all training data.
2. When a new data point comes in, find the **K** closest points (neighbors) to it.
3. **Vote**: Assign the class that is most common among those K neighbors (sketched below).
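A from-scratch sketch of the three steps, in pure Python with toy 2-D points.

```python
from collections import Counter
import math

train = [((1, 1), "A"), ((2, 1), "A"), ((8, 8), "B"), ((9, 7), "B")]  # step 1

def knn_predict(point, k=3):
    # Step 2: sort training points by Euclidean distance, keep the k closest.
    neighbors = sorted(train, key=lambda t: math.dist(t[0], point))[:k]
    # Step 3: majority vote among the neighbors' labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict((2, 2)))  # "A": its 3 nearest neighbors are mostly class A
```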
## Key Characteristics

- **Instance-based Learning**: Uses the training instances directly to predict.
- **Lazy Learning**: It doesn't "learn" a model during training. It waits until a prediction is needed.
- **Non-Parametric**: It makes no assumptions about the underlying data distribution.

## Choosing K

- **Small K**: Can be noisy and overfit (sensitive to outliers).
- **Large K**: Can over-smooth the decision boundary and miss local patterns (underfit).
- **Tip**: For binary classification, choose an **odd number** for K to avoid ties in voting.
## Distance Measures

How do we measure "closeness"? (See the sketch after this list.)

### 1. Euclidean Distance

- The straight-line distance between two points.
- Used for numeric data.
- Formula: `sqrt((x2-x1)^2 + (y2-y1)^2)`

### 2. Manhattan Distance

- The distance if you can only move along a grid (like city blocks).
- Formula: `|x2-x1| + |y2-y1|`

### 3. Minkowski Distance

- A generalized form with a power parameter `p`: `p = 2` gives Euclidean, `p = 1` gives Manhattan.
- Formula: `(|x2-x1|^p + |y2-y1|^p)^(1/p)`

### 4. Chebyshev Distance

- The greatest difference along any coordinate dimension.
- Formula: `max(|x2-x1|, |y2-y1|)`
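A pure-Python sketch of all four measures for two toy points, using Minkowski as the general form.

```python
def minkowski(p, q, power):
    """Minkowski distance: power=1 is Manhattan, power=2 is Euclidean."""
    return sum(abs(a - b) ** power for a, b in zip(p, q)) ** (1 / power)

p, q = (1, 2), (4, 6)                               # coordinate gaps: 3 and 4
euclidean = minkowski(p, q, 2)                      # 5.0 (straight line)
manhattan = minkowski(p, q, 1)                      # 7.0 (city blocks)
chebyshev = max(abs(a - b) for a, b in zip(p, q))   # 4   (largest single gap)
print(euclidean, manhattan, chebyshev)
```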
## Data Scaling

Since KNN uses distance, it is very sensitive to the scale of the data.

- **Example**: If one feature ranges 0-1 and another 0-1000, the second one will dominate the distance.
- **Solution**: **Normalize** or **Standardize** the data so all features contribute equally (see the sketch below).
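A sketch of both options with scikit-learn, on toy data where feature 1 ranges 0-1 and feature 2 ranges 0-1000.

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = [[0.2, 900.0], [0.9, 100.0], [0.5, 500.0]]

print(MinMaxScaler().fit_transform(X))    # normalize: every feature in [0, 1]
print(StandardScaler().fit_transform(X))  # standardize: mean 0, std 1
```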
29
unit 2/07_Naive_Bayes.md
Normal file
@@ -0,0 +1,29 @@
# Naive Bayes Classifier

**Naive Bayes** is a classification algorithm based on **Bayes' Theorem**.

## Why "Naive"?

It is called "Naive" because it makes a simple assumption:

- **Assumption**: All features (predictors) are **independent** of each other.
- **Reality**: This is rarely true in real life, but the model still works surprisingly well.

## Bayes' Theorem

It calculates the probability of an event based on prior knowledge.

**Formula**:

`P(A|B) = (P(B|A) * P(A)) / P(B)`

- **P(A|B)**: **Posterior Probability** (probability of class A given predictor B).
- **P(B|A)**: **Likelihood** (probability of predictor B given class A).
- **P(A)**: **Prior Probability** (probability of class A being true overall).
- **P(B)**: **Evidence** (probability of predictor B occurring).
## Example: Spam Filtering

We want to label an email as **Spam** or **Ham** (Not Spam).

1. **Prior**: How common is spam overall? (e.g., 15% of emails are spam).
2. **Likelihood**: If an email is spam, how likely is it to contain the word "Money"?
3. **Evidence**: How common is the word "Money" in all emails?
4. **Posterior**: Given the email contains "Money", what is the probability it is Spam?

We calculate this for all words and pick the class with the highest probability (see the numeric sketch below).
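The spam example as plain arithmetic; the likelihood and evidence values are made up, the 15% prior matches the example above.

```python
p_spam = 0.15             # prior: 15% of all email is spam
p_money_given_spam = 0.40  # likelihood: "Money" appears in 40% of spam
p_money = 0.08            # evidence: "Money" appears in 8% of all email

# Bayes' theorem: P(Spam | "Money") = P("Money" | Spam) * P(Spam) / P("Money")
p_spam_given_money = p_money_given_spam * p_spam / p_money
print(p_spam_given_money)  # 0.75 -> an email containing "Money" is likely spam
```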
33
unit 2/08_Decision_Tree.md
Normal file
@@ -0,0 +1,33 @@
# Decision Tree Algorithm

A **Decision Tree** is like a flowchart used for making decisions. It splits data into smaller groups based on rules.

## Structure

- **Root Node**: The starting point. It represents the entire dataset.
- **Decision Nodes**: Points where the data is split based on a question (e.g., "Is Petal Length < 2.45?").
- **Leaf Nodes (Terminal Nodes)**: The final output (class label) where no more splits happen.

## How it Splits Data

The tree wants to make the groups as "pure" as possible (containing only one class).
### Splitting Criteria

(Both impurity measures are computed in the sketch after this list.)

1. **Gini Impurity** (Default):
    - Measures how mixed the classes are.
    - **0** = Pure (all one class).
    - **0.5** = Maximally mixed (for two classes).
    - The tree tries to **minimize** Gini.
2. **Entropy**:
    - Measures disorder or randomness.
    - **0** = Pure.
    - **1** = Maximally disordered (for two equally likely classes).
    - The tree tries to **reduce** Entropy (maximize Information Gain).
3. **Information Gain**:
    - The difference in Entropy before and after a split.
    - We choose the split that gives the **highest** Information Gain.
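A pure-Python sketch computing both impurity measures for a node, given the class counts in that node.

```python
import math

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    total = sum(counts)
    probs = [c / total for c in counts if c]   # skip empty classes
    return sum(-p * math.log2(p) for p in probs)

print(gini([10, 0]), entropy([10, 0]))  # 0.0 0.0 -> pure node
print(gini([5, 5]),  entropy([5, 5]))   # 0.5 1.0 -> maximally mixed (2 classes)
```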
## Parameters to Control the Tree

- **max_depth**: How deep the tree can grow (too deep = overfitting).
- **min_samples_split**: Minimum samples needed to split a node.
- **max_features**: Number of features to consider at each split (all three are shown in the sketch below).
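A sketch of those parameters with scikit-learn's decision tree (assumes scikit-learn is installed); the iris dataset matches the petal-length question above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    criterion="gini",     # or "entropy"
    max_depth=3,          # cap depth to limit overfitting
    min_samples_split=5,  # need at least 5 samples to split a node
    max_features=None,    # consider all features at each split
    random_state=0,
).fit(X, y)
print(tree.score(X, y))   # training accuracy, e.g. ~0.97
```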