addition of unit 1 3 4 5

Akshat Mehta
2025-11-24 16:55:19 +05:30
parent 8f8e35ae95
commit f8aea15aaa
24 changed files with 596 additions and 0 deletions

unit 1/00_Index.md Normal file
@@ -0,0 +1,24 @@
# Unit 1: Introduction to Data Mining
Welcome to your simplified notes for Unit 1.
## Table of Contents
1. [[01_Introduction_to_Data_Mining|Introduction & DIKW Pyramid]]
   - What is Data Mining?
   - The DIKW Pyramid (Data, Information, Knowledge, Wisdom)
   - Major Issues in Data Mining (Privacy, Scalability, Data Quality, Ethics)
2. [[02_Data_Mining_Process|The Data Mining Process]]
   - Steps from Goal Definition to Deployment
   - Descriptive vs Predictive Mining
3. [[03_Data_Mining_Techniques|Techniques & Functionalities]]
   - Classification, Regression, Clustering, Association Rules
   - Outlier Detection
4. [[04_Data_Preprocessing|Data Preprocessing]]
   - Why do we need it?
   - Cleaning, Integration, Reduction, Transformation
5. [[05_Data_Processing_Methods|Data Processing Methods]]
   - Batch, Real-time, Online Processing
   - Distributed Processing
6. [[06_Data_Discretization|Data Discretization]]
   - Binning, Histograms, Cluster Analysis
   - Concept Hierarchy

@@ -0,0 +1,27 @@
# Introduction to Data Mining
## What is Data Mining?
**Data Mining** is the process of digging through large amounts of raw data to find useful patterns, trends, and knowledge.
- **Analogy**: Like mining gold from rocks. The rocks are the "raw data," and the gold is the "knowledge."
### Key Definitions
- **Data**: Raw facts and figures (e.g., sales logs, sensor readings).
- **Mining**: Extracting something valuable.
## The DIKW Pyramid
The **DIKW** model shows how we move from raw data to wisdom.
1. **Data (D)**: Raw, unprocessed facts.
   - *Example*: Numbers like 42, 35, 50.
2. **Information (I)**: Data that is organized and has meaning.
   - *Example*: "These are the ages of employees."
3. **Knowledge (K)**: Understanding gained from analysis.
   - *Example*: "The team has a mix of young and experienced people."
4. **Wisdom (W)**: Applying knowledge to make good decisions.
   - *Example*: "Let's create a mentorship program to share skills."
## Major Issues in Data Mining
1. **Privacy and Security**: Mining can reveal sensitive personal info. We must protect it.
2. **Scalability**: Can the system handle huge amounts of data (Big Data)?
3. **Data Quality**: If data is dirty or missing, the results will be unreliable ("Garbage In, Garbage Out").
4. **Ethical Use**: Ensuring mined patterns are not used to discriminate, and that biased data does not lead to biased decisions.

@@ -0,0 +1,24 @@
# The Data Mining Process
How do we actually do data mining? It follows a standard process (often similar to CRISP-DM).
## Steps in the Process
1. **Define the Goal**: What do you want to achieve? (e.g., Increase sales, detect fraud).
2. **Gather Data**: Collect data from databases, logs, etc.
3. **Cleanse Data**: Fix errors, remove duplicates, and handle missing values.
4. **Interrogate Data**: Explore the data (charts, graphs) to find initial patterns.
5. **Build a Model**: Use algorithms (like decision trees or regression) to learn patterns that address the goal.
6. **Validate Results**: Check the model's accuracy on data it was not built from.
7. **Implement**: Use the insights in the real world.
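
The steps above can be sketched end to end on a toy problem. This is only an illustration, not a real fraud system: the transaction amounts, the `flag` helper, and the midpoint-threshold "model" are all assumptions made up for this example.

```python
from statistics import mean

# 1. Define the goal: flag suspicious transactions by amount (hypothetical task).

# 2. Gather data: (amount, is_fraud) pairs; None marks a missing amount.
raw = [(20, 0), (35, 0), (35, 0), (None, 0), (900, 1),
       (50, 0), (1200, 1), (40, 0), (1000, 1)]

# 3. Cleanse: drop rows with missing values, then drop exact duplicates
#    (dict.fromkeys keeps the first occurrence and preserves order).
clean = list(dict.fromkeys(row for row in raw if None not in row))

# Hold back the last two rows so the model can be validated later.
train, test = clean[:-2], clean[-2:]

# 4. Interrogate: compare the average amount of each class.
legit_mean = mean(a for a, fraud in train if not fraud)
fraud_mean = mean(a for a, fraud in train if fraud)

# 5. Build a model: a single threshold halfway between the two averages.
threshold = (legit_mean + fraud_mean) / 2


def flag(amount):
    """Return 1 if the amount looks fraudulent under the toy threshold model."""
    return int(amount > threshold)


# 6. Validate: accuracy on the held-back rows.
accuracy = sum(flag(a) == label for a, label in test) / len(test)

# 7. Implement: use the model on a new transaction.
print(threshold, accuracy, flag(750))  # 542.5 1.0 1
```

A real project would use far more data and a proper algorithm at step 5, but the order of the steps stays the same.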
## Data Mining Functionalities
Tasks are generally divided into two types:
### 1. Descriptive Mining
- Describes what is in the data.
- Finds patterns and relationships.
- *Examples*: Clustering, Association Rules.
### 2. Predictive Mining
- Predicts future or unknown values.
- *Examples*: Classification, Regression, Prediction.

@@ -0,0 +1,28 @@
# Data Mining Techniques
There are several key techniques used to mine data.
## 1. Classification (Predictive)
- **Goal**: Assign items to predefined categories (classes).
- **Supervised Learning**: We know the categories beforehand.
- **Example**: Is this email **Spam** or **Not Spam**?
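
A minimal sketch of classification, assuming a 1-nearest-neighbour rule (just one of many possible classifiers). The two features (number of links, number of ALL-CAPS words) and every value below are made up for illustration.

```python
import math

# Labelled training emails: (links, all_caps_words) -> known class.
training = [
    ((0, 0), "Not Spam"),
    ((1, 0), "Not Spam"),
    ((0, 1), "Not Spam"),
    ((7, 5), "Spam"),
    ((9, 8), "Spam"),
    ((6, 9), "Spam"),
]


def classify(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    _, label = min(training, key=lambda item: math.dist(item[0], features))
    return label


print(classify((8, 6)))  # Spam
print(classify((1, 1)))  # Not Spam
```

The classes ("Spam", "Not Spam") are fixed before training, which is what makes this supervised.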
## 2. Regression (Predictive)
- **Goal**: Predict a continuous **number**.
- **Example**: Predicting the **price** of a house based on its size and location.
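
A minimal sketch of regression, assuming a straight-line (least-squares) fit on size alone; location is left out to keep it short. The sizes and prices are made-up numbers, and `fit_line` is a helper written for this note.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


sizes = [50, 80, 100, 120]     # house size in square metres (made up)
prices = [150, 240, 300, 360]  # price in thousands (made up)

slope, intercept = fit_line(sizes, prices)
predicted = slope * 90 + intercept  # predicted price for a 90 m^2 house
print(slope, intercept, predicted)  # 3.0 0.0 270.0
```

The output is a continuous number rather than a category, which is the difference from classification.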
## 3. Clustering (Descriptive)
- **Goal**: Group similar items together.
- **Unsupervised Learning**: We don't know the groups beforehand.
- **Example**: Grouping customers into segments (e.g., "High Spenders", "Budget Shoppers").
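
A minimal sketch of clustering, assuming a tiny one-dimensional k-means with two starting centers picked by hand. The spending figures and the `kmeans_1d` helper are made up for this note.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center, then recompute."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its old center.
        centers = [sum(c) / len(c) if c else old
                   for c, old in zip(clusters, centers)]
    return centers, clusters


monthly_spend = [20, 25, 30, 200, 220, 250]
centers, clusters = kmeans_1d(monthly_spend, centers=[20, 250])
print(clusters)  # [[20, 25, 30], [200, 220, 250]]
```

No labels were given to the algorithm; names like "Budget Shoppers" and "High Spenders" are attached by a person afterwards.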
## 4. Association Rules (Descriptive)
- **Goal**: Find relationships between items.
- **Market Basket Analysis**: "People who buy Bread often also buy Butter."
- **Key Terms** (for a rule A -> B):
  - **Support**: The fraction of all transactions that contain both A and B.
  - **Confidence**: The fraction of transactions containing A that also contain B.
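
Support and confidence can be computed by counting. The sketch below uses five made-up shopping baskets; the helper names `count`, `support`, and `confidence` are chosen for this note.

```python
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(set(itemset) <= basket for basket in transactions)


def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(set(antecedent) | set(consequent)) / count(antecedent)


print(support({"bread", "butter"}))       # 0.6  (3 of 5 baskets)
print(confidence({"bread"}, {"butter"}))  # 0.75 (3 of the 4 bread baskets)
```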
## 5. Outlier Detection
- **Goal**: Find unusual data points that don't fit the pattern.
- **Example**: Detecting credit card fraud (a huge transaction in a usually quiet account).
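
A minimal sketch of outlier detection, assuming the common "1.5 x IQR" rule (one of several possible rules). The transaction amounts are made up, and `iqr_outliers` is a helper written for this note.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


amounts = [20, 35, 40, 25, 30, 45, 5000]
print(iqr_outliers(amounts))  # [5000]
```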

@@ -0,0 +1,31 @@
# Data Preprocessing
**Data Preprocessing** is an essential step before mining, because real-world data is often dirty, incomplete, and inconsistent.
## Why Preprocess?
- **Accuracy**: Bad data leads to bad results.
- **Completeness**: Missing data can break algorithms.
- **Consistency**: Different formats (e.g., "USA" vs "U.S.A.") confuse the system.
## Major Steps
### 1. Data Cleaning
- **Fill Missing Values**: Use the average (mean) or a specific value.
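- **Smooth Noisy Data**: Reduce random errors in values (binning, regression).
- **Handle Outliers**: Find values that don't fit the rest of the data. Remove them only if they are errors; a genuine outlier (like a fraudulent transaction) may be exactly what you are looking for.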
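
A minimal sketch of filling missing values with the mean. The ages are made up, and `None` is assumed to mark a missing entry.

```python
from statistics import mean

ages = [25, None, 35, 30, None]

# Compute the mean from the values that are present.
known = [a for a in ages if a is not None]
fill_value = mean(known)

# Replace each missing entry with that mean.
filled = [fill_value if a is None else a for a in ages]
print(filled)  # [25, 30, 35, 30, 30]
```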
### 2. Data Integration
- Combining data from multiple sources (databases, files).
- **Challenge**: Handling different names for the same thing (e.g., "CustID" vs "CustomerID").
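
A minimal sketch of integration, assuming two hypothetical sources that name the same key differently. The records, the common name `customer_id`, and the `rename_key` helper are all made up for illustration.

```python
crm_rows = [{"CustID": 1, "name": "Asha"}, {"CustID": 2, "name": "Ravi"}]
billing_rows = [{"CustomerID": 1, "total": 250}, {"CustomerID": 3, "total": 90}]


def rename_key(row, old, new):
    """Return a copy of row with the key old renamed to new."""
    return {(new if key == old else key): value for key, value in row.items()}


# Step 1: bring both sources onto one shared key name.
unified = (
    [rename_key(r, "CustID", "customer_id") for r in crm_rows]
    + [rename_key(r, "CustomerID", "customer_id") for r in billing_rows]
)

# Step 2: merge rows that describe the same customer.
merged = {}
for row in unified:
    merged.setdefault(row["customer_id"], {}).update(row)

integrated = [merged[key] for key in sorted(merged)]
print(integrated[0])  # {'customer_id': 1, 'name': 'Asha', 'total': 250}
```

Customers 2 and 3 appear in only one source, so their merged records are missing a field. That is a typical reason cleaning and integration go hand in hand.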
### 3. Data Reduction
- Reducing the size of the data while keeping the important parts.
- **Dimensionality Reduction**: Removing unimportant attributes.
- **Numerosity Reduction**: Replacing raw data with smaller representations (like histograms).
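
A minimal sketch of numerosity reduction with a histogram: twenty raw prices are replaced by three bucket counts. The prices and the bucket width of 10 are assumptions for illustration.

```python
from collections import Counter

prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10,
          12, 14, 14, 15, 15, 18, 18, 20, 20, 21]
width = 10

# Key each bucket by its lower edge: 0 covers 0-9, 10 covers 10-19, and so on.
histogram = Counter((p // width) * width for p in prices)
print(sorted(histogram.items()))  # [(0, 7), (10, 10), (20, 3)]
```

The exact values are lost, but the overall shape of the data is kept in far less space.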
### 4. Data Transformation
- Converting data into a format suitable for mining.
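- **Normalization**: Rescaling values to a common scale so that attributes with large numbers don't dominate.
  - *Min-Max Normalization*: Maps values to a fixed range (e.g., 0 to 1).
  - *Z-Score Normalization*: Rescales values to have mean 0 and standard deviation 1.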
- **Discretization**: Converting continuous numbers into intervals (e.g., Age 0-10, 11-20).
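
A minimal sketch of the two normalization methods. Min-max uses `(v - min) / (max - min)`, and z-score uses `(v - mean) / standard deviation`. The income values are made up; the code assumes the values are not all identical (otherwise it would divide by zero) and uses the population standard deviation.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def z_score(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


incomes = [20, 40, 60, 80, 100]
print(min_max(incomes))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 2) for z in z_score(incomes)])  # [-1.41, -0.71, 0.0, 0.71, 1.41]
```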

@@ -0,0 +1,23 @@
# Data Processing Methods
How is data actually processed by computers?
## 1. Batch Processing
- Data is collected over time and processed **all at once** (in a batch).
- **Example**: Payroll systems (calculating salaries at the end of the month).
- **Pros**: Efficient for large volumes.
- **Cons**: Not immediate.
## 2. Real-time Processing
- Data is processed **immediately** as it comes in.
- **Example**: ATM withdrawals. You need to know your balance *right now*.
- **Pros**: Instant results.
- **Cons**: Complex and expensive.
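
The difference between batch and real-time processing can be sketched on the same list of account transactions. The amounts and the `RunningBalance` class are made up for illustration.

```python
transactions = [100, -30, -20, 50]


def process_batch(items):
    """Batch: wait until everything is collected, then process it in one pass."""
    return sum(items)


class RunningBalance:
    """Real-time: update the result the moment each transaction arrives."""

    def __init__(self):
        self.balance = 0

    def on_event(self, amount):
        self.balance += amount
        return self.balance


account = RunningBalance()
balances = [account.on_event(t) for t in transactions]

print(process_batch(transactions))  # 100 (known only at the end)
print(balances)                     # [100, 70, 50, 100] (known after every step)
```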
## 3. Online Processing
- Each transaction is processed as soon as it is entered, while the user waits. Similar to real-time, and common in internet applications.
- **Example**: Barcode scanning at a store checkout. The price is fetched instantly.
## 4. Distributed Processing
- Breaking a task into pieces and running them on **multiple computers** at the same time.
- **Example**: Google Search. Many servers work together to find your result.
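
A minimal sketch of the split-and-combine idea behind distributed processing. Threads on one machine stand in for separate computers here, which is a simplification; the documents and the `count_matches` helper are made up.

```python
from concurrent.futures import ThreadPoolExecutor

documents = ["data mining", "gold mining", "big data",
             "clean data", "decision trees", "data quality"]


def count_matches(chunk, keyword="data"):
    """Count the documents in one chunk that contain the keyword."""
    return sum(keyword in doc for doc in chunk)


# Split the work into three pieces, one per worker.
chunks = [documents[i::3] for i in range(3)]

# Each worker handles its own piece at the same time.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_matches, chunks))

# Combine the partial answers into the final result.
total = sum(partial_counts)
print(partial_counts, total)  # [2, 0, 2] 4
```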

@@ -0,0 +1,27 @@
# Data Discretization
**Data Discretization** is the process of converting continuous values into a small, finite number of intervals (bins).
## Why use it?
- Makes data easier to understand.
- Some algorithms (such as association rule mining) work on categories rather than continuous numbers.
## Techniques
### 1. Binning
- Sorting data and dividing it into "bins", either of equal width (same value range) or equal frequency (same number of items).
- **Example**: Grouping ages into [0-10], [11-20], etc.
- Helps smooth out noise.
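
A minimal sketch of binning and of smoothing by bin means. The price list is a made-up example, the helper names are chosen for this note, and `equal_frequency_bins` assumes the number of values divides evenly by the number of bins.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        index = min(int((v - lo) / width), k - 1)  # the maximum goes in the last bin
        bins[index].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Split the sorted values into k bins holding the same number of items."""
    ordered = sorted(values)
    size = len(ordered) // k
    return [ordered[i * size:(i + 1) * size] for i in range(k)]


def smooth_by_means(bins):
    """Replace every value in a (non-empty) bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(prices, 3))      # [[4, 8], [15, 21, 21], [24, 25, 28, 34]]
print(equal_frequency_bins(prices, 3))  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(equal_frequency_bins(prices, 3)))
# [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```

Smoothing by means is how binning reduces noise: small fluctuations inside a bin disappear.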
### 2. Histogram Analysis
- Using a bar chart (histogram) to see the distribution and decide where to split the data.
### 3. Cluster Analysis
- Using clustering (like K-Means) to group similar values, then using those groups as the intervals.
## Concept Hierarchy
- Organizing data from **low-level** concepts to **high-level** concepts.
- **Example (Location)**:
  - Street -> City -> State -> Country.
- **Top-down Mapping**: General to Specific (e.g., Country -> State -> City).
- **Bottom-up Mapping**: Specific to General (e.g., City -> State -> Country).
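
A minimal sketch of a location concept hierarchy as lookup tables, with a bottom-up "roll up" from street level. The tables hold just two sample entries each, and the `roll_up` helper is written for this note.

```python
# Each table maps a value to its parent one level up the hierarchy.
street_to_city = {"MG Road": "Bengaluru", "Marine Drive": "Mumbai"}
city_to_state = {"Bengaluru": "Karnataka", "Mumbai": "Maharashtra"}
state_to_country = {"Karnataka": "India", "Maharashtra": "India"}

levels = [street_to_city, city_to_state, state_to_country]


def roll_up(street, steps):
    """Generalize a street by climbing the given number of levels (1 to 3)."""
    value = street
    for mapping in levels[:steps]:
        value = mapping[value]
    return value


print(roll_up("MG Road", 1))       # Bengaluru
print(roll_up("MG Road", 2))       # Karnataka
print(roll_up("Marine Drive", 3))  # India
```

Rolling up replaces many distinct low-level values with a few high-level ones, which is why concept hierarchies count as a form of discretization.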