addition of unit 1 3 4 5
unit 1/00_Index.md
@@ -0,0 +1,24 @@

# Unit 1: Introduction to Data Mining

Welcome to your simplified notes for Unit 1.

## Table of Contents

1. [[01_Introduction_to_Data_Mining|Introduction & DIKW Pyramid]]
   - What is Data Mining?
   - The DIKW Pyramid (Data, Information, Knowledge, Wisdom)
2. [[02_Data_Mining_Process|The Data Mining Process]]
   - Steps from Goal Definition to Deployment
   - Issues in Data Mining (Privacy, Scalability)
3. [[03_Data_Mining_Techniques|Techniques & Functionalities]]
   - Predictive vs Descriptive Mining
   - Classification, Regression, Clustering, Association Rules
4. [[04_Data_Preprocessing|Data Preprocessing]]
   - Why do we need it?
   - Cleaning, Integration, Reduction, Transformation
5. [[05_Data_Processing_Methods|Data Processing Methods]]
   - Manual vs Electronic
   - Batch, Real-time, Online Processing
6. [[06_Data_Discretization|Data Discretization]]
   - Binning, Histograms
   - Concept Hierarchy
unit 1/01_Introduction_to_Data_Mining.md
@@ -0,0 +1,27 @@

# Introduction to Data Mining

## What is Data Mining?

**Data Mining** is the process of digging through large amounts of raw data to find useful patterns, trends, and knowledge.

- **Analogy**: Like mining gold from rocks. The rocks are the "raw data," and the gold is the "knowledge."

### Key Definitions

- **Data**: Raw facts and figures (e.g., sales logs, sensor readings).
- **Mining**: Extracting something valuable.

## The DIKW Pyramid

The **DIKW** model shows how we move from raw data to wisdom.

1. **Data (D)**: Raw, unprocessed facts.
   - *Example*: Numbers like 42, 35, 50.
2. **Information (I)**: Data that is organized and has meaning.
   - *Example*: "These are the ages of employees."
3. **Knowledge (K)**: Understanding gained from analysis.
   - *Example*: "The team has a mix of young and experienced people."
4. **Wisdom (W)**: Applying knowledge to make good decisions.
   - *Example*: "Let's create a mentorship program to share skills."

## Major Issues in Data Mining

1. **Privacy and Security**: Mining can reveal sensitive personal info. We must protect it.
2. **Scalability**: Can the system handle huge amounts of data (Big Data)?
3. **Data Quality**: If data is dirty or missing, the results will be wrong ("Garbage In, Garbage Out").
4. **Ethical Use**: Ensuring data isn't used for discrimination or bias.
unit 1/02_Data_Mining_Process.md
@@ -0,0 +1,24 @@

# The Data Mining Process

How do we actually do data mining? It follows a standard process (often similar to CRISP-DM).

## Steps in the Process

1. **Define the Goal**: What do you want to achieve? (e.g., increase sales, detect fraud).
2. **Gather Data**: Collect data from databases, logs, etc.
3. **Cleanse Data**: Fix errors, remove duplicates, and handle missing values.
4. **Interrogate Data**: Explore the data (charts, graphs) to find initial patterns.
5. **Build a Model**: Use algorithms (like decision trees or regression) to find the solution.
6. **Validate Results**: Check if the model is accurate.
7. **Implement**: Use the insights in the real world.

## Data Mining Functionalities

Tasks are generally divided into two types:

### 1. Descriptive Mining

- Describes what is in the data.
- Finds patterns and relationships.
- *Examples*: Clustering, Association Rules.

### 2. Predictive Mining

- Predicts future or unknown values.
- *Examples*: Classification, Regression, Prediction.
unit 1/03_Data_Mining_Techniques.md
@@ -0,0 +1,28 @@

# Data Mining Techniques

There are several key techniques used to mine data.

## 1. Classification (Predictive)

- **Goal**: Assign items to predefined categories (classes).
- **Supervised Learning**: We know the categories beforehand.
- **Example**: Is this email **Spam** or **Not Spam**?

## 2. Regression (Predictive)

- **Goal**: Predict a continuous **number**.
- **Example**: Predicting the **price** of a house based on its size and location.

## 3. Clustering (Descriptive)

- **Goal**: Group similar items together.
- **Unsupervised Learning**: We don't know the groups beforehand.
- **Example**: Grouping customers into segments (e.g., "High Spenders", "Budget Shoppers").

## 4. Association Rules (Descriptive)

- **Goal**: Find relationships between items.
- **Market Basket Analysis**: "People who buy Bread often also buy Butter."
- **Key Terms**:
  - **Support**: How often items appear together.
  - **Confidence**: How likely item B is purchased if item A is purchased.

## 5. Outlier Detection

- **Goal**: Find unusual data points that don't fit the pattern.
- **Example**: Detecting credit card fraud (a huge transaction in a usually quiet account).
unit 1/04_Data_Preprocessing.md
@@ -0,0 +1,31 @@

# Data Preprocessing

**Data Preprocessing** is the most important step before mining. Real-world data is often dirty, incomplete, and inconsistent.

## Why Preprocess?

- **Accuracy**: Bad data leads to bad results.
- **Completeness**: Missing data can break algorithms.
- **Consistency**: Different formats (e.g., "USA" vs "U.S.A.") confuse the system.

## Major Steps

### 1. Data Cleaning

- **Fill Missing Values**: Use the average (mean) or a specific value.
- **Remove Noisy Data**: Smooth out errors (binning, regression).
- **Remove Outliers**: Delete data that doesn't make sense.

### 2. Data Integration

- Combining data from multiple sources (databases, files).
- **Challenge**: Handling different names for the same thing (e.g., "CustID" vs "CustomerID").

### 3. Data Reduction

- Reducing the size of the data while keeping the important parts.
- **Dimensionality Reduction**: Removing unimportant attributes.
- **Numerosity Reduction**: Replacing raw data with smaller representations (like histograms).

### 4. Data Transformation

- Converting data into a format suitable for mining.
- **Normalization**: Scaling data to a small range (e.g., 0 to 1).
  - *Min-Max Normalization*
  - *Z-Score Normalization*
- **Discretization**: Converting continuous numbers into intervals (e.g., Age 0-10, 11-20).
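The two normalization methods above are easy to express in code. Below is a minimal Python sketch (the `ages` list and the function names are invented for illustration, not part of the notes):

```python
import statistics

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_normalize(values):
    """Scale values so they have mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

ages = [18, 25, 32, 47, 60]          # hypothetical raw attribute values
print(min_max_normalize(ages))       # every result falls between 0 and 1
print(z_score_normalize(ages))       # results are centred around 0
```

Min-Max keeps the original shape of the data but is sensitive to extreme values; Z-Score is often the safer choice when outliers are present.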
unit 1/05_Data_Processing_Methods.md
@@ -0,0 +1,23 @@

# Data Processing Methods

How is data actually processed by computers?

## 1. Batch Processing

- Data is collected over time and processed **all at once** (in a batch).
- **Example**: Payroll systems (calculating salaries at the end of the month).
- **Pros**: Efficient for large volumes.
- **Cons**: Not immediate.

## 2. Real-time Processing

- Data is processed **immediately** as it comes in.
- **Example**: ATM withdrawals. You need to know your balance *right now*.
- **Pros**: Instant results.
- **Cons**: Complex and expensive.

## 3. Online Processing

- Similar to real-time, often used for internet applications.
- **Example**: Barcode scanning at a store checkout. The price is fetched instantly.

## 4. Distributed Processing

- Breaking a task into pieces and running them on **multiple computers** at the same time.
- **Example**: Google Search. Many servers work together to find your result.
unit 1/06_Data_Discretization.md
@@ -0,0 +1,27 @@

# Data Discretization

**Data Discretization** is the process of converting a large number of continuous values into a smaller number of finite intervals (bins).

## Why use it?

- Makes data easier to understand.
- Many algorithms work better with categories than infinite numbers.

## Techniques

### 1. Binning

- Sorting data and dividing it into "bins".
- **Example**: Grouping ages into [0-10], [11-20], etc.
- Helps smooth out noise.
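As a quick illustration of equal-width binning, here is a small Python sketch (the `ages` data and the bin width of 10 are made-up example values):

```python
def equal_width_bins(values, width):
    """Assign each value to an interval [k*width, (k+1)*width - 1]."""
    bins = {}
    for v in sorted(values):
        start = (v // width) * width
        label = f"[{start}-{start + width - 1}]"
        bins.setdefault(label, []).append(v)
    return bins

ages = [3, 7, 12, 15, 18, 24, 31, 33]
print(equal_width_bins(ages, 10))
# {'[0-9]': [3, 7], '[10-19]': [12, 15, 18], '[20-29]': [24], '[30-39]': [31, 33]}
```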
### 2. Histogram Analysis

- Using a bar chart (histogram) to see the distribution and decide where to split the data.

### 3. Cluster Analysis

- Using clustering (like K-Means) to group similar values, then using those groups as the intervals.

## Concept Hierarchy

- Organizing data from **low-level** concepts to **high-level** concepts.
- **Example (Location)**:
  - Street -> City -> State -> Country.
- **Top-down Mapping**: General to Specific.
- **Bottom-up Mapping**: Specific to General.
unit 3/00_Index.md
@@ -0,0 +1,18 @@

# Unit 3: Association Rule Mining

Welcome to your simplified notes for Unit 3.

## Table of Contents

1. [[01_Association_Rule_Mining|Introduction to Association Rules]]
   - What is it? (Market Basket Analysis)
   - Key Terms: Support, Confidence, Frequent Itemsets
2. [[02_Apriori_Algorithm|The Apriori Algorithm]]
   - How it works (Join & Prune)
   - Example Calculation
3. [[03_FP_Growth_Algorithm|FP-Growth Algorithm]]
   - FP-Tree Structure
   - Why it is faster than Apriori
4. [[04_Advanced_Pattern_Mining|Advanced Pattern Mining]]
   - Closed and Maximal Patterns
   - Vertical Data Format (Eclat)
unit 3/01_Association_Rule_Mining.md
@@ -0,0 +1,27 @@

# Introduction to Association Rules

**Association Rule Mining** is a technique to find relationships between items in a large dataset.

- **Classic Example**: "Market Basket Analysis" - finding what products customers buy together.
  - *Example*: "If a customer buys **Bread**, they are 80% likely to buy **Butter**."

## Key Concepts

### 1. Itemset

- A collection of one or more items.
- *Example*: `{Milk, Bread, Diapers}`

### 2. Support (Frequency)

- How often an itemset appears in the database.
- **Formula**:

$$ \text{Support}(A) = \frac{\text{Transactions containing } A}{\text{Total Transactions}} $$

- *Example*: If Milk appears in 4 out of 5 transactions, Support = 80%.

### 3. Confidence (Reliability)

- How likely item B is purchased when item A is purchased.
- **Formula**:

$$ \text{Confidence}(A \to B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)} $$

- *Example*: If Milk and Bread appear together in 3 transactions, and Milk appears in 4:
  - Confidence(Milk -> Bread) = 3/4 = 75%.

### 4. Frequent Itemset

- An itemset that meets a minimum **Support Threshold** (e.g., must appear at least 3 times).
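The two formulas above are easy to verify in code. Here is a minimal Python sketch (the five transactions are made up to mirror the Milk/Bread numbers used in the examples):

```python
transactions = [
    {"Milk", "Bread"},
    {"Milk", "Diapers"},
    {"Milk", "Bread", "Beer"},
    {"Milk", "Bread", "Cola"},
    {"Eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(antecedent, consequent):
    """Support of (A and B together) divided by support of A."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Milk"}))                # 4/5 = 0.8
print(confidence({"Milk"}, {"Bread"}))  # 3/4 = 0.75
```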
unit 3/02_Apriori_Algorithm.md
@@ -0,0 +1,45 @@

# The Apriori Algorithm

**Apriori** is a classic algorithm to find frequent itemsets.

## Key Principle (Apriori Property)

> "All non-empty subsets of a frequent itemset must also be frequent."

- *Meaning*: If `{Beer, Diapers}` is frequent, then `{Beer}` must be frequent and `{Diapers}` must be frequent.
- **Reverse**: If `{Beer}` is NOT frequent, then `{Beer, Diapers}` cannot be frequent. (This helps us ignore/prune many combinations).

## How it Works

1. **Scan 1**: Count all single items. Remove those below minimum support.
2. **Join**: Combine remaining items to make pairs (Size 2).
3. **Prune**: Remove pairs that contain infrequent items.
4. **Scan 2**: Count the pairs. Remove those below support.
5. **Repeat**: Make Size 3 itemsets, Size 4, etc., until no more can be found.

## Example

**Transactions**:

1. {Bread, Milk}
2. {Bread, Diapers, Beer, Eggs}
3. {Milk, Diapers, Beer, Cola}
4. {Bread, Milk, Diapers, Beer}
5. {Bread, Milk, Diapers, Cola}

**Minimum Support = 3**

1. **Count Items**:
   - Bread: 4 (Keep)
   - Milk: 4 (Keep)
   - Diapers: 4 (Keep)
   - Beer: 3 (Keep)
   - Cola: 2 (Drop)
   - Eggs: 1 (Drop)
2. **Make Pairs**:
   - {Bread, Milk}, {Bread, Diapers}, {Bread, Beer}...
3. **Count Pairs**:
   - {Bread, Milk}: 3 (Keep)
   - {Bread, Diapers}: 3 (Keep)
   - {Milk, Diapers}: 3 (Keep)
   - {Diapers, Beer}: 3 (Keep)
   - ...others dropped...
4. **Result**: The frequent itemsets are {Bread, Milk}, {Bread, Diapers}, {Milk, Diapers}, and {Diapers, Beer}. No size-3 itemset reaches the threshold, so the algorithm stops.
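The worked example above can be reproduced with a short script. Below is a minimal, illustrative Python version of the count/join/prune loop (a straightforward sketch, not an optimized implementation):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
MIN_SUPPORT = 3  # absolute count, as in the example

def count_support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Scan 1: frequent single items
items = {item for t in transactions for item in t}
frequent = [{frozenset([i]) for i in items if count_support({i}) >= MIN_SUPPORT}]

# Repeat: join frequent (k-1)-itemsets into k-itemsets, then count and prune
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    frequent.append({c for c in candidates if count_support(c) >= MIN_SUPPORT})
    k += 1

for level in frequent:
    for itemset in level:
        print(set(itemset), count_support(itemset))
```

Running this prints the same frequent items and pairs as the hand calculation; the size-3 candidates all fall below the threshold, so the loop stops.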
unit 3/03_FP_Growth_Algorithm.md
@@ -0,0 +1,27 @@

# FP-Growth Algorithm

**FP-Growth (Frequent Pattern Growth)** is a faster and more efficient algorithm than Apriori.

## Why is it better?

- **Apriori** scans the database many times (once for size 1, once for size 2, etc.). This is slow.
- **FP-Growth** scans the database **only twice**.
  1. First scan: Count frequencies.
  2. Second scan: Build the **FP-Tree**.

## The FP-Tree

An **FP-Tree** (Frequent Pattern Tree) is a compressed tree structure that stores all the transaction information.

- More frequent items are near the root (top).
- Less frequent items are leaves (bottom).

## Steps

1. **Count Frequencies**: Find support for all items. Drop infrequent ones.
2. **Sort**: Sort items in each transaction by frequency (Highest to Lowest).
3. **Build Tree**: Insert transactions into the tree. Shared items share the same path.
4. **Mine Tree**:
   - Start from the bottom (least frequent item).
   - Build a "Conditional Pattern Base" (all paths leading to that item).
   - Construct a "Conditional FP-Tree".
   - Recursively find frequent patterns.

## Divide and Conquer

FP-Growth uses a **Divide and Conquer** strategy. It breaks the problem into smaller sub-problems (conditional trees) rather than generating millions of candidate sets like Apriori.
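To make steps 1-3 concrete, here is a small, illustrative Python sketch that performs the two scans and builds a bare-bones FP-Tree (the transactions, the `Node` class, and the minimum support are invented for the example; the mining step is omitted):

```python
from collections import Counter

transactions = [
    ["Bread", "Milk"],
    ["Bread", "Diapers", "Beer"],
    ["Milk", "Diapers", "Beer"],
    ["Bread", "Milk", "Diapers", "Beer"],
]
MIN_SUPPORT = 2

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

# Scan 1: count item frequencies and drop infrequent items
freq = Counter(item for t in transactions for item in t)
freq = {i: c for i, c in freq.items() if c >= MIN_SUPPORT}

# Scan 2: sort each transaction by frequency and insert it into the tree
root = Node(None)
for t in transactions:
    kept = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
    node = root
    for item in kept:                      # shared prefixes share the same path
        node = node.children.setdefault(item, Node(item))
        node.count += 1

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)   # prints the compressed tree: frequent items sit near the root
```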
unit 3/04_Advanced_Pattern_Mining.md
@@ -0,0 +1,34 @@

# Advanced Pattern Mining

Beyond basic frequent itemsets, we have more efficient ways to represent patterns.

## 1. Closed and Maximal Patterns

Finding *all* frequent itemsets can produce too many results. We can summarize them.

### Closed Frequent Itemset

- An itemset is **Closed** if none of its supersets have the **same support**.
- *Example*:
  - {Milk}: Support 4
  - {Milk, Bread}: Support 4
  - Here, {Milk} is NOT closed because adding Bread didn't change the support count. {Milk, Bread} captures the same information.
- **Benefit**: Lossless compression (we don't lose any support info).

### Maximal Frequent Itemset

- An itemset is **Maximal** if none of its supersets are **frequent**.
- *Example*:
  - {Milk, Bread} is frequent.
  - {Milk, Bread, Diapers} is NOT frequent.
  - Then {Milk, Bread} is a Maximal Frequent Itemset.
- **Benefit**: Smallest set of patterns, but we lose the exact support counts of subsets.

## 2. Vertical Data Format (Eclat Algorithm)

- **Horizontal Format** (Standard):
  - T1: {A, B, C}
  - T2: {A, B}
- **Vertical Format**:
  - Item A: {T1, T2}
  - Item B: {T1, T2}
  - Item C: {T1}
- **How it works**: Instead of counting, we just intersect the Transaction ID lists.
  - Support({A, B}) = |{T1, T2} ∩ {T1, T2}| = 2.
- **Benefit**: Very fast for calculating support.
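A minimal Python sketch of the vertical-format idea (the tiny TID lists mirror the example above; this is illustrative, not a full Eclat implementation):

```python
# Vertical format: each item maps to the set of transaction IDs that contain it
tid_lists = {
    "A": {"T1", "T2"},
    "B": {"T1", "T2"},
    "C": {"T1"},
}

def support(*items):
    """Support = size of the intersection of the items' TID lists."""
    tids = set.intersection(*(tid_lists[i] for i in items))
    return len(tids)

print(support("A", "B"))   # 2 -> {A, B} appears in T1 and T2
print(support("A", "C"))   # 1 -> {A, C} appears only in T1
```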
unit 4/00_Index.md
@@ -0,0 +1,21 @@

# Unit 4: Classification and Prediction

Welcome to your simplified notes for Unit 4.

## Table of Contents

1. [[01_Classification_Basics|Classification Basics]]
   - Classification vs Prediction
   - Training vs Testing
2. [[02_Decision_Trees|Decision Tree Induction]]
   - How Trees work
   - Attribute Selection (Info Gain, Gini Index)
   - Pruning
3. [[03_Bayesian_Classification|Bayesian Classification]]
   - Bayes' Theorem
   - Naive Bayes Classifier
4. [[04_KNN_Algorithm|K-Nearest Neighbors (KNN)]]
   - Lazy Learning
   - Distance Measures
5. [[05_Rule_Based_Classification|Rule-Based Classification]]
   - IF-THEN Rules
unit 4/01_Classification_Basics.md
@@ -0,0 +1,22 @@

# Classification Basics

## What is Classification?

**Classification** is the process of predicting the **class label** of a data item.

- **Goal**: To assign a category to a new item based on past data.
- **Example**:
  - Input: A bank loan application.
  - Output Class: "Safe" or "Risky".

## Classification vs Prediction

- **Classification**: Predicts a **category** (Discrete value).
  - *Example*: Yes/No, Red/Blue/Green.
- **Prediction (Regression)**: Predicts a **number** (Continuous value).
  - *Example*: Predicting the price of a house ($500k, $505k...).

## The Process

1. **Training Phase (Learning)**:
   - The algorithm learns from a "Training Set" where the correct answers (labels) are known.
   - It builds a **Model** (e.g., a Decision Tree).
2. **Testing Phase (Classification)**:
   - The model is tested on new, unseen data ("Test Set").
   - We check the **Accuracy**: Percentage of correct predictions.
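As a tiny illustration of the testing phase, here is a Python sketch that compares a model's predictions with the true labels of a made-up test set and reports accuracy:

```python
# Hypothetical test-set labels and a model's predictions for them
true_labels = ["Safe", "Risky", "Safe", "Safe", "Risky"]
predictions = ["Safe", "Safe",  "Safe", "Safe", "Risky"]

correct = sum(1 for t, p in zip(true_labels, predictions) if t == p)
accuracy = correct / len(true_labels)
print(f"Accuracy: {accuracy:.0%}")   # 4 of 5 correct -> 80%
```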
unit 4/02_Decision_Trees.md
@@ -0,0 +1,30 @@

# Decision Tree Induction

A **Decision Tree** is a flowchart-like structure used for classification.

## Structure

- **Root Node**: The top question (e.g., "Is it raining?").
- **Branch**: The answer (e.g., "Yes" or "No").
- **Leaf Node**: The final decision/class (e.g., "Play Football" or "Stay Inside").

## How to Build a Tree?

We need to decide which attribute to split on first. We use **Attribute Selection Measures**:

### 1. Information Gain (Used in ID3 Algorithm)

- Measures how much "uncertainty" (Entropy) is reduced by splitting on an attribute.
- We choose the attribute with the **Highest Information Gain**.
- **Entropy**: A measure of randomness.
  - High Entropy = Messy/Mixed data (50% Yes, 50% No).
  - Low Entropy = Pure data (100% Yes).
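A short Python sketch of the entropy idea described above (the label lists are made-up examples; the information gain of a split is the parent entropy minus the size-weighted entropy of the child partitions):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits: 0 for pure data, 1 for a 50/50 two-class mix."""
    total = len(labels)
    probs = [c / total for c in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in probs)

def information_gain(parent, children):
    """Parent entropy minus the weighted entropy of the child partitions."""
    total = len(parent)
    weighted = sum(len(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - weighted

mixed = ["Yes", "No", "Yes", "No"]      # 50/50 mix -> entropy 1.0
pure  = ["Yes", "Yes", "Yes", "Yes"]    # pure data -> entropy 0.0
print(entropy(mixed), entropy(pure))

# Splitting the mixed data into two pure halves removes all uncertainty
print(information_gain(mixed, [["Yes", "Yes"], ["No", "No"]]))   # 1.0
```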
### 2. Gain Ratio (Used in C4.5)

- An improvement over Information Gain. It handles attributes with many values (like "Date") better.

### 3. Gini Index (Used in CART)

- Measures "Impurity". We want to minimize the Gini Index.

## Tree Pruning

Trees can become too complex and memorize the training data (**Overfitting**).

- **Pruning**: Cutting off weak branches to make the tree simpler and better at generalizing.
- **Pre-pruning**: Stop building early.
- **Post-pruning**: Build the full tree, then cut branches.
unit 4/03_Bayesian_Classification.md
@@ -0,0 +1,20 @@

# Bayesian Classification

**Bayesian Classifiers** are based on probability (Bayes' Theorem). They predict the likelihood that a tuple belongs to a class.

## Bayes' Theorem

$$ P(H|X) = \frac{P(X|H) \cdot P(H)}{P(X)} $$

- **P(H|X)**: Posterior Probability (Probability of Hypothesis H given Evidence X).
- **P(H)**: Prior Probability (Probability of H being true generally).
- **P(X|H)**: Likelihood (Probability of seeing Evidence X if H is true).
- **P(X)**: Evidence (Probability of X occurring).

## Naive Bayes Classifier

- **"Naive"**: It assumes that all attributes are **independent** of each other.
  - *Example*: It assumes "Income" and "Age" don't affect each other, which simplifies the math.
- **Pros**: Very fast and effective for large datasets (like spam filtering).
- **Cons**: The independence assumption is often not true in real life.
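To illustrate the "count and multiply" flavour of Naive Bayes, here is a minimal Python sketch for categorical attributes (the toy training data is invented for the example; no smoothing or other refinements):

```python
from collections import Counter, defaultdict

# Toy training data: (attribute dict, class label)
training = [
    ({"age": "youth",  "income": "high"}, "no"),
    ({"age": "youth",  "income": "low"},  "yes"),
    ({"age": "senior", "income": "low"},  "yes"),
    ({"age": "senior", "income": "high"}, "no"),
    ({"age": "youth",  "income": "low"},  "yes"),
]

class_counts = Counter(label for _, label in training)
# attr_counts[class][attribute][value] = how often that value occurs in that class
attr_counts = defaultdict(lambda: defaultdict(Counter))
for attrs, label in training:
    for attr, value in attrs.items():
        attr_counts[label][attr][value] += 1

def predict(attrs):
    scores = {}
    for label, n in class_counts.items():
        score = n / len(training)                           # prior P(H)
        for attr, value in attrs.items():                   # naive independence assumption
            score *= attr_counts[label][attr][value] / n    # likelihood P(X|H)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict({"age": "youth", "income": "low"}))   # -> "yes"
```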
## Bayesian Belief Networks (BBN)

- Unlike Naive Bayes, BBNs **allow** dependencies between variables.
- They use a graph structure (DAG) to show which variables affect others.
unit 4/04_KNN_Algorithm.md
@@ -0,0 +1,25 @@

# K-Nearest Neighbors (KNN)

**KNN** is a simple, "Lazy" learning algorithm.

## How it Works

1. Store all training data.
2. When a new item arrives, find the **K** closest items (neighbors) to it.
3. Check the class of those neighbors.
4. Assign the most common class to the new item.

## Key Concepts

- **Lazy Learner**: It doesn't build a model during training. It waits until it needs to classify.
- **Distance Measure**: How do we measure "closeness"?
  - **Euclidean Distance**: Straight line distance (most common).
  - **Manhattan Distance**: Grid-like distance.
- **Choosing K**:
  - If K is too small (e.g., K=1), it's sensitive to noise.
  - If K is too large, it might include points from other classes.
  - Usually, K is an odd number (like 3, 5) to avoid ties.

## Example

- New Point: Green Circle.
- K = 3.
- Neighbors: 2 Red Triangles, 1 Blue Square.
- Result: Green Circle is classified as **Red Triangle**.
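A compact Python sketch of the four steps above, using Euclidean distance and a majority vote (the 2-D points and labels are invented to mirror the triangle/square example):

```python
import math
from collections import Counter

# (x, y) coordinates with known class labels -- the stored training data
training = [
    ((1.0, 1.0), "Red Triangle"),
    ((1.5, 2.0), "Red Triangle"),
    ((5.0, 5.0), "Blue Square"),
    ((6.0, 5.5), "Blue Square"),
]

def euclidean(a, b):
    return math.dist(a, b)   # straight-line distance

def knn_classify(point, k=3):
    # Find the k nearest stored points and take the most common class among them
    neighbours = sorted(training, key=lambda item: euclidean(point, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_classify((2.0, 2.0)))   # 2 triangles vs 1 square -> "Red Triangle"
```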
unit 4/05_Rule_Based_Classification.md
@@ -0,0 +1,18 @@

# Rule-Based Classification

**Rule-Based Classifiers** use a set of **IF-THEN** rules to classify data.

## Structure

- **Rule**: `IF (Condition) THEN (Class)`
- *Example*:
  - `IF (Age = Youth) AND (Student = Yes) THEN (Buys_Computer = Yes)`

## Extracting Rules from Decision Trees

- We can easily turn a decision tree into rules.
- Each path from the **Root** to a **Leaf** becomes one rule.
- The conditions along the path become the `IF` part (joined by AND).
- The leaf node becomes the `THEN` part.

## Advantages

- Easy for humans to understand.
- Can be created directly or from other models (like trees).
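IF-THEN rules map naturally onto simple data structures. A minimal Python sketch (the rule list and the sample record are made up to match the Buys_Computer example):

```python
# Each rule: (conditions that must all hold, class to assign)
rules = [
    ({"Age": "Youth", "Student": "Yes"}, "Buys_Computer = Yes"),
    ({"Age": "Youth", "Student": "No"},  "Buys_Computer = No"),
    ({"Age": "Senior"},                  "Buys_Computer = Yes"),
]

def classify(record, default="Unknown"):
    for conditions, label in rules:
        # IF every condition matches the record THEN return the rule's class
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label
    return default

print(classify({"Age": "Youth", "Student": "Yes"}))   # Buys_Computer = Yes
```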
unit 5/00_Index.md
@@ -0,0 +1,19 @@

# Unit 5: Advanced Data Mining Techniques

Welcome to your simplified notes for Unit 5.

## Table of Contents

1. [[01_Ubiquitous_Data_Mining|Ubiquitous & Invisible Data Mining]]
   - Mining everywhere (IoT, Mobile)
   - Invisible Mining (Background processes)
2. [[02_Web_Mining|Web Mining]]
   - Content, Structure, and Usage Mining
3. [[03_Spatial_and_Temporal_Mining|Spatial & Temporal Mining]]
   - Mining location data (Maps/GIS)
   - Mining time-based data (Trends)
4. [[04_Other_Mining_Types|Other Mining Types]]
   - Text, Visual, Audio, and Process Mining
5. [[05_Applications_and_Impact|Applications & Social Impact]]
   - Real-world uses (Healthcare, Retail)
   - Privacy and Ethical concerns
unit 5/01_Ubiquitous_Data_Mining.md
@@ -0,0 +1,30 @@

# Ubiquitous and Invisible Data Mining

## Ubiquitous Data Mining (UDM)

**"Ubiquitous"** means existing everywhere.

- **Definition**: Mining data from everyday objects and devices (Smartphones, IoT, Wearables) in real-time.
- **Goal**: To provide insights anytime, anywhere, without you asking for it.
- **Characteristics**:
  - **Mobile**: Uses GPS and sensors.
  - **Context-Aware**: Knows where you are and what time it is.
  - **Real-Time**: Processes data instantly.

### Examples

- **Smartphones**: Google Maps predicting traffic.
- **Wearables**: Smartwatches tracking your heart rate.
- **Smart Homes**: Alexa learning your voice commands.

## Invisible Data Mining

- **Definition**: Mining that happens **silently** in the background. You don't see it happening.
- **Why "Invisible"?**: It is embedded in apps and systems. You only see the result (like a recommendation).
- **Examples**:
  - **Amazon**: "People who bought this also bought..."
  - **Google Search**: Auto-completing your sentence.
  - **Banks**: Detecting fraud without you knowing.

### Difference

| Feature | Ubiquitous Mining | Invisible Mining |
|---|---|---|
| **Focus** | Mining **everywhere** (IoT, Mobile) | Mining **hidden** from user |
| **Awareness** | You might know it's happening (e.g., wearing a watch) | You usually don't know |
| **Key Tech** | Sensors, Mobile Devices | Software Algorithms, Background Processes |
unit 5/02_Web_Mining.md
@@ -0,0 +1,20 @@

# Web Mining

**Web Mining** is using data mining techniques to discover useful information from the World Wide Web.

## Types of Web Mining

### 1. Web Content Mining

- **What**: Mining the **actual content** of web pages.
- **Data**: Text, images, audio, video.
- **Example**: Analyzing reviews on Amazon to see if people like a product (Sentiment Analysis).

### 2. Web Structure Mining

- **What**: Mining the **links** (hyperlinks) between pages.
- **Goal**: To find important pages (Authorities) and pages that link to many others (Hubs).
- **Example**: Google's **PageRank** algorithm uses this to rank search results.

### 3. Web Usage Mining

- **What**: Mining **user activity** logs.
- **Data**: Server logs, browser history, clicks.
- **Example**: Analyzing which pages users visit most often and where they leave the site.
unit 5/03_Spatial_and_Temporal_Mining.md
@@ -0,0 +1,17 @@

# Spatial and Temporal Data Mining

## Spatial Data Mining

- **Spatial Data**: Data related to **location** or geography (Maps, GPS).
- **Goal**: Finding patterns in space.
- **Tools**: GIS (Geographic Information Systems).
- **Examples**:
  - Finding the best location for a new store.
  - Tracking the spread of a disease on a map.

## Temporal Data Mining

- **Temporal Data**: Data related to **time**.
- **Goal**: Finding patterns that change over time (Trends).
- **Tasks**:
  - **Trend Analysis**: Is the stock market going up or down?
  - **Sequence Analysis**: "If event A happens, does event B follow?"
- **Example**: Analyzing weather patterns over 10 years to predict climate change.
unit 5/04_Other_Mining_Types.md
@@ -0,0 +1,17 @@

# Other Types of Data Mining

## 1. Text Mining

- **Data**: Unstructured text (Emails, Tweets, Documents).
- **Technique**: Natural Language Processing (NLP).
- **Goal**: To understand meaning, sentiment, and topics.
- **Example**: Classifying customer feedback as "Angry" or "Happy".

## 2. Visual and Audio Mining

- **Data**: Images, Videos, Sound.
- **Goal**: To find patterns in visual or audio data.
- **Example**: Face recognition in photos, or detecting keywords in a voice recording.

## 3. Process Mining

- **Data**: Event logs from business systems (ERP, CRM).
- **Goal**: To see how a business process *actually* works vs how it *should* work.
- **Example**: Finding out why it takes 5 days to approve a loan instead of 2.
unit 5/05_Applications_and_Impact.md
@@ -0,0 +1,22 @@

# Applications and Social Impact

## Applications of Data Mining

Data mining is used everywhere!

1. **Healthcare**: Predicting diseases, finding side effects of drugs.
2. **Retail (Market Basket Analysis)**: Placing Bread near Butter to increase sales.
3. **Finance**: Detecting credit card fraud, approving loans.
4. **Education**: Tracking student performance to help them improve.
5. **Crime**: Identifying crime hotspots and predicting criminal behavior.

## Social Impact and Issues

### Positive Impact

- **Convenience**: Personalized recommendations save time.
- **Safety**: Fraud detection and medical diagnosis save money and lives.

### Negative Impact (Ethical Issues)

1. **Privacy Invasion**: Companies know too much about you.
2. **Discrimination**: Profiling can lead to unfair treatment (e.g., denying loans based on where you live).
3. **Security**: Large databases can be hacked (Data Breaches).
4. **Manipulation**: Targeted ads can influence your behavior or political views.