addition of unit 1 3 4 5

Akshat Mehta
2025-11-24 16:55:19 +05:30
parent 8f8e35ae95
commit f8aea15aaa
24 changed files with 596 additions and 0 deletions

unit 4/00_Index.md Normal file

@@ -0,0 +1,21 @@
# Unit 4: Classification and Prediction
Welcome to your simplified notes for Unit 4.
## Table of Contents
1. [[01_Classification_Basics|Classification Basics]]
- Classification vs Prediction
- Training vs Testing
2. [[02_Decision_Trees|Decision Tree Induction]]
- How Trees work
- Attribute Selection (Info Gain, Gini Index)
- Pruning
3. [[03_Bayesian_Classification|Bayesian Classification]]
- Bayes' Theorem
- Naive Bayes Classifier
4. [[04_KNN_Algorithm|K-Nearest Neighbors (KNN)]]
- Lazy Learning
- Distance Measures
5. [[05_Rule_Based_Classification|Rule-Based Classification]]
- IF-THEN Rules


@@ -0,0 +1,22 @@
# Classification Basics
## What is Classification?
**Classification** is the process of predicting the **class label** of a data item.
- **Goal**: To assign a category to a new item based on past data.
- **Example**:
- Input: A bank loan application.
- Output Class: "Safe" or "Risky".
## Classification vs Prediction
- **Classification**: Predicts a **category** (Discrete value).
- *Example*: Yes/No, Red/Blue/Green.
- **Prediction (Regression)**: Predicts a **number** (Continuous value).
- *Example*: Predicting the price of a house ($500k, $505k...).
## The Process
1. **Training Phase (Learning)**:
- The algorithm learns from a "Training Set" where the correct answers (labels) are known.
- It builds a **Model** (e.g., a Decision Tree).
2. **Testing Phase (Classification)**:
- The model is tested on new, unseen data ("Test Set").
- We check the **Accuracy**: the percentage of correct predictions on the test set.
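A minimal sketch of this two-phase process, assuming scikit-learn and its bundled Iris dataset (neither is mentioned in the notes; they are used here only for illustration):

```python
# Minimal train/test sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Training Phase: learn a model from labelled data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Testing Phase: predict on unseen data and measure accuracy.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```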


@@ -0,0 +1,30 @@
# Decision Tree Induction
A **Decision Tree** is a flowchart-like structure used for classification.
## Structure
- **Root Node**: The top question (e.g., "Is it raining?").
- **Branch**: The answer (e.g., "Yes" or "No").
- **Leaf Node**: The final decision/class (e.g., "Play Football" or "Stay Inside").
## How to Build a Tree?
We need to decide which attribute to split on first. We use **Attribute Selection Measures**:
### 1. Information Gain (Used in ID3 Algorithm)
- Measures how much "uncertainty" (Entropy) is reduced by splitting on an attribute.
- We choose the attribute with the **Highest Information Gain**.
- **Entropy**: A measure of randomness.
- High Entropy = Messy/Mixed data (50% Yes, 50% No).
- Low Entropy = Pure data (100% Yes).
### 2. Gain Ratio (Used in C4.5)
- An improvement over Information Gain. It handles attributes with many values (like "Date") better.
### 3. Gini Index (Used in CART)
- Measures "Impurity". We want to minimize the Gini Index.
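A small, self-contained sketch (not from the notes; the helper names are my own) of how Entropy, the Gini Index, and Information Gain can be computed for a list of class labels:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: 0 for pure data, 1 for a 50/50 binary mix."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 0 for pure data, higher for mixed data."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    """Reduction in entropy after splitting the parent into the given child groups."""
    total = len(parent_labels)
    weighted = sum(len(g) / total * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted

# Example: splitting a perfectly mixed node into two pure children.
parent = ["Yes", "Yes", "No", "No"]
print(entropy(parent))                                            # 1.0 (maximally mixed)
print(gini(parent))                                               # 0.5
print(information_gain(parent, [["Yes", "Yes"], ["No", "No"]]))   # 1.0 (perfect split)
```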
## Tree Pruning
Trees can become too complex and memorize the training data (**Overfitting**).
- **Pruning**: Cutting off weak branches to make the tree simpler and better at generalizing.
- **Pre-pruning**: Stop building early.
- **Post-pruning**: Build the full tree, then cut branches.
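A hedged sketch of both pruning styles, assuming scikit-learn: `max_depth` acts as pre-pruning, while `ccp_alpha` turns on cost-complexity post-pruning.

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growing early by limiting depth / leaf size.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

# Post-pruning: grow the full tree, then cut weak branches via
# cost-complexity pruning (larger ccp_alpha => more branches removed).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)
```

In practice, how much to prune is usually chosen by comparing accuracy on a held-out validation set.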


@@ -0,0 +1,20 @@
# Bayesian Classification
**Bayesian Classifiers** are based on probability (Bayes' Theorem). They predict the likelihood that a tuple belongs to a class.
## Bayes' Theorem
$$ P(H|X) = \frac{P(X|H) \cdot P(H)}{P(X)} $$
- **P(H|X)**: Posterior Probability (Probability of Hypothesis H given Evidence X).
- **P(H)**: Prior Probability (Probability of H being true generally).
- **P(X|H)**: Likelihood (Probability of seeing Evidence X if H is true).
- **P(X)**: Evidence (Probability of X occurring).
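A tiny worked computation of the theorem; the probabilities below are invented purely for illustration:

```python
# Worked Bayes' Theorem example with made-up numbers:
# H = "loan is Risky", X = "applicant income is Low".
p_h = 0.3          # P(H): prior probability of a risky loan
p_x_given_h = 0.8  # P(X|H): likelihood of low income among risky loans
p_x = 0.4          # P(X): overall probability of low income

p_h_given_x = (p_x_given_h * p_h) / p_x
print(p_h_given_x)  # 0.6 -> posterior probability the loan is Risky given low income
```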
## Naive Bayes Classifier
- **"Naive"**: It assumes that all attributes are **independent** of each other.
- *Example*: It assumes "Income" and "Age" don't affect each other, which simplifies the math.
- **Pros**: Very fast and effective for large datasets (like spam filtering).
- **Cons**: The independence assumption is often not true in real life.
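A compact, illustrative sketch of the counting behind a categorical Naive Bayes classifier; the toy dataset and attribute names are invented, and Laplace smoothing is left out to keep it short:

```python
from collections import Counter, defaultdict

# Invented toy data: (Age, Student) -> Buys_Computer
data = [
    ("Youth", "Yes", "Yes"),
    ("Youth", "No", "No"),
    ("Senior", "No", "No"),
    ("Senior", "Yes", "Yes"),
    ("Youth", "Yes", "Yes"),
]

class_counts = Counter(row[-1] for row in data)
# attr_counts[(attribute_index, value, class)] = count
attr_counts = defaultdict(int)
for age, student, label in data:
    attr_counts[(0, age, label)] += 1
    attr_counts[(1, student, label)] += 1

def score(x, label):
    """P(label) * product of P(attribute_value | label), assuming independence."""
    total = sum(class_counts.values())
    p = class_counts[label] / total
    for i, value in enumerate(x):
        p *= attr_counts[(i, value, label)] / class_counts[label]
    return p

new_item = ("Youth", "Yes")
prediction = max(class_counts, key=lambda label: score(new_item, label))
print(prediction)  # "Yes" for this toy data
```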
## Bayesian Belief Networks (BBN)
- Unlike Naive Bayes, BBNs **allow** dependencies between variables.
- They use a graph structure (DAG) to show which variables affect others.


@@ -0,0 +1,25 @@
# K-Nearest Neighbors (KNN)
**KNN** is a simple, "Lazy" learning algorithm.
## How it Works
1. Store all training data.
2. When a new item arrives, find the **K** closest items (neighbors) to it.
3. Check the class of those neighbors.
4. Assign the most common class to the new item.
## Key Concepts
- **Lazy Learner**: It doesn't build a model during training. It waits until it needs to classify.
- **Distance Measure**: How do we measure "closeness"?
- **Euclidean Distance**: Straight line distance (most common).
- **Manhattan Distance**: Grid-like distance.
- **Choosing K**:
- If K is too small (e.g., K=1), it's sensitive to noise.
- If K is too large, it might include points from other classes.
- Usually, K is an odd number (like 3 or 5) to avoid ties between two classes.
## Example
- New Point: Green Circle.
- K = 3.
- Neighbors: 2 Red Triangles, 1 Blue Square.
- Result: Green Circle is classified as **Red Triangle**.
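A minimal KNN sketch with Euclidean distance and a majority vote; the coordinates are invented so that the toy example above (2 Red Triangles vs 1 Blue Square) works out:

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

# Invented 2-D training points: (x, y, class)
training = [
    (1.0, 1.0, "Red Triangle"),
    (1.5, 1.2, "Red Triangle"),
    (2.5, 2.0, "Blue Square"),
    (4.0, 4.0, "Blue Square"),
    (5.0, 5.0, "Blue Square"),
]

def knn_classify(point, k=3):
    # Steps 1-2: find the K training points closest to the new point.
    neighbors = sorted(training, key=lambda row: dist(point, row[:2]))[:k]
    # Steps 3-4: majority vote among the neighbors' classes.
    votes = Counter(label for _, _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((1.2, 1.1)))  # "Red Triangle" (2 triangles vs 1 square among its 3 neighbors)
```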


@@ -0,0 +1,18 @@
# Rule-Based Classification
**Rule-Based Classifiers** use a set of **IF-THEN** rules to classify data.
## Structure
- **Rule**: `IF (Condition) THEN (Class)`
- *Example*:
- `IF (Age = Youth) AND (Student = Yes) THEN (Buys_Computer = Yes)`
## Extracting Rules from Decision Trees
- We can easily turn a decision tree into rules.
- Each path from the **Root** to a **Leaf** becomes one rule.
- The conditions along the path become the `IF` part (joined by AND).
- The leaf node becomes the `THEN` part.
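A short, illustrative sketch of representing and firing IF-THEN rules in code; the second rule and any attribute names beyond the example above are invented:

```python
# Each rule: (IF-condition function, THEN-class)
rules = [
    (lambda x: x["Age"] == "Youth" and x["Student"] == "Yes", "Buys_Computer = Yes"),
    (lambda x: x["Age"] == "Senior" and x["Credit"] == "Fair", "Buys_Computer = No"),
]

def classify(item, default="Unknown"):
    """Fire the first rule whose IF-condition matches; otherwise fall back to a default."""
    for condition, label in rules:
        if condition(item):
            return label
    return default

print(classify({"Age": "Youth", "Student": "Yes", "Credit": "Fair"}))
# -> "Buys_Computer = Yes"
```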
## Advantages
- Easy for humans to understand.
- Can be created directly or from other models (like trees).