addition of unit 1 3 4 5

Akshat Mehta
2025-11-24 16:55:19 +05:30
parent 8f8e35ae95
commit f8aea15aaa
24 changed files with 596 additions and 0 deletions

unit 3/00_Index.md Normal file

@@ -0,0 +1,18 @@
# Unit 3: Association Rule Mining
Welcome to your simplified notes for Unit 3.
## Table of Contents
1. [[01_Association_Rule_Mining|Introduction to Association Rules]]
- What is it? (Market Basket Analysis)
- Key Terms: Support, Confidence, Frequent Itemsets
2. [[02_Apriori_Algorithm|The Apriori Algorithm]]
- How it works (Join & Prune)
- Example Calculation
3. [[03_FP_Growth_Algorithm|FP-Growth Algorithm]]
- FP-Tree Structure
- Why it is faster than Apriori
4. [[04_Advanced_Pattern_Mining|Advanced Pattern Mining]]
- Closed and Maximal Patterns
- Vertical Data Format (Eclat)

unit 3/01_Association_Rule_Mining.md Normal file

@@ -0,0 +1,27 @@
# Introduction to Association Rules
**Association Rule Mining** is a technique to find relationships between items in a large dataset.
- **Classic Example**: "Market Basket Analysis" - finding what products customers buy together.
- *Example*: "If a customer buys **Bread**, they are 80% likely to buy **Butter**."
## Key Concepts
### 1. Itemset
- A collection of one or more items.
- *Example*: `{Milk, Bread, Diapers}`
### 2. Support (Frequency)
- How often an itemset appears in the database.
- **Formula**:
$$ \text{Support}(A) = \frac{\text{Transactions containing } A}{\text{Total Transactions}} $$
- *Example*: If Milk appears in 4 out of 5 transactions, Support = 80%.
### 3. Confidence (Reliability)
- How likely item B is purchased when item A is purchased.
- **Formula**:
$$ \text{Confidence}(A \to B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)} $$
- *Example*: If Milk and Bread appear together in 3 transactions, and Milk appears in 4:
- Confidence(Milk -> Bread) = 3/4 = 75%.
### 4. Frequent Itemset
- An itemset that meets a minimum **Support Threshold** (e.g., must appear at least 3 times).
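To make these definitions concrete, here is a minimal Python sketch of Support and Confidence. The `transactions` data and function names are illustrative, not from any library:

```python
# Toy database: each transaction is a set of items (illustrative data).
transactions = [
    {"Milk", "Bread"},
    {"Milk", "Diapers"},
    {"Milk", "Bread", "Diapers"},
    {"Milk", "Bread"},
    {"Eggs"},
]

def support(itemset, transactions):
    """Support(A): fraction of transactions containing every item of A."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b, transactions):
    """Confidence(A -> B) = Support(A u B) / Support(A)."""
    return support(a | b, transactions) / support(a, transactions)

print(support({"Milk"}, transactions))                # 4/5 = 0.8
print(confidence({"Milk"}, {"Bread"}, transactions))  # (3/5) / (4/5) = 0.75
```

The printed values match the examples above: Milk appears in 4 of 5 transactions (80% support), and Milk and Bread appear together in 3 of the 4 transactions containing Milk (75% confidence).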

unit 3/02_Apriori_Algorithm.md Normal file

@@ -0,0 +1,45 @@
# The Apriori Algorithm
**Apriori** is a classic algorithm to find frequent itemsets.
## Key Principle (Apriori Property)
> "All non-empty subsets of a frequent itemset must also be frequent."
- *Meaning*: If `{Beer, Diapers}` is frequent, then `{Beer}` must be frequent and `{Diapers}` must be frequent.
- **Reverse**: If `{Beer}` is NOT frequent, then `{Beer, Diapers}` cannot be frequent. (This helps us ignore/prune many combinations).
## How it Works
1. **Scan 1**: Count all single items. Remove those below minimum support.
2. **Join**: Combine remaining items to make pairs (Size 2).
3. **Prune**: Remove pairs that contain infrequent items.
4. **Scan 2**: Count the pairs. Remove those below support.
5. **Repeat**: Make Size 3 itemsets, Size 4, etc., until no more can be found.
## Example
**Transactions**:
1. {Bread, Milk}
2. {Bread, Diapers, Beer, Eggs}
3. {Milk, Diapers, Beer, Cola}
4. {Bread, Milk, Diapers, Beer}
5. {Bread, Milk, Diapers, Cola}
**Minimum Support = 3**
1. **Count Items**:
- Bread: 4 (Keep)
- Milk: 4 (Keep)
- Diapers: 4 (Keep)
- Beer: 3 (Keep)
- Cola: 2 (Drop)
- Eggs: 1 (Drop)
2. **Make Pairs**:
- {Bread, Milk}, {Bread, Diapers}, {Bread, Beer}...
3. **Count Pairs**:
- {Bread, Milk}: 3 (Keep)
- {Bread, Diapers}: 3 (Keep)
- {Milk, Diapers}: 3 (Keep)
- {Diapers, Beer}: 3 (Keep)
- The other pairs are dropped: {Bread, Beer}: 2, {Milk, Beer}: 2.
4. **Result**: The frequent pairs are {Bread, Milk}, {Bread, Diapers}, {Milk, Diapers}, and {Diapers, Beer}. No size-3 candidate reaches support 3 (e.g. {Bread, Milk, Diapers} appears in only 2 transactions), so the algorithm stops here. A runnable sketch of the whole procedure follows.
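Below is a minimal Python sketch of the Join/Prune/Scan loop, run on the five transactions above. Names like `support_count` and `level` are illustrative, not from any library:

```python
from itertools import combinations

# The five transactions from the example above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
MIN_SUPPORT = 3  # absolute count, as in the example

def support_count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Scan 1: frequent single items (Cola and Eggs fall below support 3).
items = {item for t in transactions for item in t}
level = [frozenset({i}) for i in items if support_count(frozenset({i})) >= MIN_SUPPORT]

k = 2
while level:
    print(f"Frequent {k - 1}-itemsets:", [sorted(s) for s in level])
    # Join: merge frequent (k-1)-itemsets into candidates of size k.
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    # Prune: drop candidates with an infrequent (k-1)-subset (Apriori property).
    candidates = {c for c in candidates
                  if all(frozenset(s) in level for s in combinations(c, k - 1))}
    # Scan: keep only candidates that meet minimum support.
    level = [c for c in candidates if support_count(c) >= MIN_SUPPORT]
    k += 1
```

Running this prints the four frequent items, then the four frequent pairs, and then terminates because no triple survives the support check.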

unit 3/03_FP_Growth_Algorithm.md Normal file

@@ -0,0 +1,27 @@
# FP-Growth Algorithm
**FP-Growth (Frequent Pattern Growth)** is an algorithm that finds the same frequent itemsets as Apriori, but much more efficiently.
## Why is it better?
- **Apriori** scans the database many times (once for size 1, once for size 2, etc.). This is slow.
- **FP-Growth** scans the database **only twice**.
1. First scan: Count frequencies.
2. Second scan: Build the **FP-Tree**.
## The FP-Tree
An **FP-Tree** (Frequent Pattern Tree) is a compressed tree structure that stores all the transaction information.
- More frequent items are near the root (top).
- Less frequent items are leaves (bottom).
## Steps
1. **Count Frequencies**: Find support for all items. Drop infrequent ones.
2. **Sort**: Sort items in each transaction by frequency (Highest to Lowest).
3. **Build Tree**: Insert transactions into the tree. Shared items share the same path.
4. **Mine Tree**:
- Start from the bottom (least frequent item).
- Build a "Conditional Pattern Base" (all prefix paths in the tree that end at that item).
- Construct a "Conditional FP-Tree".
- Recursively find frequent patterns.
## Divide and Conquer
FP-Growth uses a **Divide and Conquer** strategy. It breaks the problem into smaller sub-problems (conditional trees) rather than generating millions of candidate sets like Apriori.
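Here is a minimal Python sketch of steps 1 to 3 (count, sort, build), reusing the transactions from the Apriori note. The `Node` class and all names are illustrative, and the mining step (step 4) is omitted for brevity:

```python
from collections import Counter

transactions = [
    ["Bread", "Milk"],
    ["Bread", "Diapers", "Beer", "Eggs"],
    ["Milk", "Diapers", "Beer", "Cola"],
    ["Bread", "Milk", "Diapers", "Beer"],
    ["Bread", "Milk", "Diapers", "Cola"],
]
MIN_SUPPORT = 3

class Node:
    """One FP-Tree node: an item, its count, and links to parent/children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}  # item -> Node

# Scan 1: count item frequencies and drop infrequent items (Cola, Eggs).
freq = Counter(item for t in transactions for item in t)
freq = {i: n for i, n in freq.items() if n >= MIN_SUPPORT}

# Scan 2: sort each transaction by descending frequency, then insert.
root = Node(None, None)
for t in transactions:
    kept = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
    node = root
    for item in kept:
        # Shared prefixes reuse the same path; only the counts are bumped.
        node = node.children.setdefault(item, Node(item, node))
        node.count += 1

def show(node, depth=0):
    """Print the tree: frequent items sit near the root, as described above."""
    if node.item is not None:
        print("  " * depth + f"{node.item}:{node.count}")
    for child in node.children.values():
        show(child, depth + 1)

show(root)
```

Because the five transactions share prefixes like Bread and Diapers, the tree stores them once with counts instead of repeating them per transaction; that compression is what makes the two-scan approach work.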

unit 3/04_Advanced_Pattern_Mining.md Normal file

@@ -0,0 +1,34 @@
# Advanced Pattern Mining
Beyond basic frequent itemsets, we have more efficient ways to represent patterns.
## 1. Closed and Maximal Patterns
Finding *all* frequent itemsets can produce too many results. We can summarize them.
### Closed Frequent Itemset
- An itemset is **Closed** if none of its supersets have the **same support**.
- *Example*:
- {Milk}: Support 4
- {Milk, Bread}: Support 4
- Here, {Milk} is NOT closed because adding Bread didn't change the support count. {Milk, Bread} captures the same information.
- **Benefit**: Lossless compression (we don't lose any support info).
### Maximal Frequent Itemset
- An itemset is **Maximal** if none of its supersets are **frequent**.
- *Example*:
- {Milk, Bread} is frequent.
- {Milk, Bread, Diapers} is NOT frequent.
- Then {Milk, Bread} is a Maximal Frequent Itemset.
- **Benefit**: Smallest set of patterns, but we lose the exact support counts of subsets.
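A small Python sketch of both checks, given a precomputed support table. The itemsets and counts below are illustrative, chosen to match the examples above:

```python
# Toy support table (absolute counts; illustrative data).
supports = {
    frozenset({"Milk"}): 4,
    frozenset({"Bread"}): 4,
    frozenset({"Milk", "Bread"}): 4,
    frozenset({"Milk", "Bread", "Diapers"}): 2,  # below min support -> not frequent
}
MIN_SUPPORT = 3
frequent = {s: n for s, n in supports.items() if n >= MIN_SUPPORT}

def is_closed(itemset):
    """Closed: no frequent superset has the same support."""
    return not any(itemset < other and frequent[other] == frequent[itemset]
                   for other in frequent)

def is_maximal(itemset):
    """Maximal: no superset at all is frequent."""
    return not any(itemset < other for other in frequent)

for s in frequent:
    print(set(s), "closed:", is_closed(s), "maximal:", is_maximal(s))
```

As in the examples above, {Milk} comes out not closed (its superset {Milk, Bread} has the same support of 4), while {Milk, Bread} is both closed and maximal.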
## 2. Vertical Data Format (Eclat Algorithm)
- **Horizontal Format** (Standard):
- T1: {A, B, C}
- T2: {A, B}
- **Vertical Format**:
- Item A: {T1, T2}
- Item B: {T1, T2}
- Item C: {T1}
- **How it works**: Instead of re-scanning transactions to count, we simply intersect the Transaction ID (TID) lists.
- Support({A, B}) = |{T1, T2} ∩ {T1, T2}| = 2.
- **Benefit**: Very fast for calculating support.
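A minimal Python sketch of the intersection idea, using the toy TID-lists above (the `tidlists` mapping and `support` function are illustrative):

```python
# Vertical format: map each item to the set of transaction IDs containing it.
tidlists = {
    "A": {"T1", "T2"},
    "B": {"T1", "T2"},
    "C": {"T1"},
}

def support(*items):
    """Support of an itemset = size of the intersection of its TID-lists."""
    return len(set.intersection(*(tidlists[i] for i in items)))

print(support("A", "B"))  # |{T1, T2} ∩ {T1, T2}| = 2
print(support("A", "C"))  # |{T1, T2} ∩ {T1}| = 1
```

Note that the TID-list of a larger itemset is itself just an intersection, so Eclat can grow itemsets by intersecting lists repeatedly without ever re-reading the original transactions.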