addition of unit 1 3 4 5

2025-11-24 16:55:19 +05:30
parent 8f8e35ae95
commit f8aea15aaa
24 changed files with 596 additions and 0 deletions
--- a/1/00_Index.md
+++ b/1/00_Index.md
@@ -0,0 +1,24 @@
+# Unit 1: Introduction to Data Mining
+
+Welcome to your simplified notes for Unit 1.
+
+## Table of Contents
+
+1. [[01_Introduction_to_Data_Mining|Introduction & DIKW Pyramid]]
+   - What is Data Mining? The DIKW Pyramid (Data, Information, Knowledge, Wisdom)
+   - Major Issues in Data Mining (Privacy, Scalability, Data Quality, Ethics)
+2. [[02_Data_Mining_Process|The Data Mining Process]]
+   - Steps from Goal Definition to Implementation
+   - Descriptive vs Predictive Mining
+3. [[03_Data_Mining_Techniques|Techniques & Functionalities]]
+   - Classification, Regression, Clustering, Association Rules
+   - Outlier Detection
+4. [[04_Data_Preprocessing|Data Preprocessing]]
+   - Why do we need it?
+   - Cleaning, Integration, Reduction, Transformation
+5. [[05_Data_Processing_Methods|Data Processing Methods]]
+   - Batch, Real-time, Online Processing
+   - Distributed Processing
+6. [[06_Data_Discretization|Data Discretization]]
+   - Binning, Histograms
+   - Concept Hierarchy
--- a/1/01_Introduction_to_Data_Mining.md
+++ b/1/01_Introduction_to_Data_Mining.md
@@ -0,0 +1,27 @@
+# Introduction to Data Mining
+
+## What is Data Mining?
+**Data Mining** is the process of digging through large amounts of raw data to find useful patterns, trends, and knowledge.
+- **Analogy**: Like mining gold from rocks. The rocks are the "raw data," and the gold is the "knowledge."
+
+### Key Definitions
+- **Data**: Raw facts and figures (e.g., sales logs, sensor readings).
+- **Mining**: Extracting something valuable.
+
+## The DIKW Pyramid
+The **DIKW** model shows how we move from raw data to wisdom.
+
+1. **Data (D)**: Raw, unprocessed facts.
+   - *Example*: Numbers like 42, 35, 50.
+2. **Information (I)**: Data that is organized and has meaning.
+   - *Example*: "These are the ages of employees."
+3. **Knowledge (K)**: Understanding gained from analysis.
+   - *Example*: "The team has a mix of young and experienced people."
+4. **Wisdom (W)**: Applying knowledge to make good decisions.
+   - *Example*: "Let's create a mentorship program to share skills."
+
+## Major Issues in Data Mining
+1. **Privacy and Security**: Mining can reveal sensitive personal info. We must protect it.
+2. **Scalability**: Can the system handle huge amounts of data (Big Data)?
+3. **Data Quality**: If the data is dirty or has missing values, the results will be unreliable ("Garbage In, Garbage Out").
+4. **Ethical Use**: Ensuring that mined patterns are not used to discriminate and do not reinforce bias already present in the data.
--- a/1/02_Data_Mining_Process.md
+++ b/1/02_Data_Mining_Process.md
@@ -0,0 +1,24 @@
+# The Data Mining Process
+
+How do we actually do data mining? It follows a standard sequence of steps, similar to CRISP-DM (Cross-Industry Standard Process for Data Mining).
+
+## Steps in the Process
+1. **Define the Goal**: What do you want to achieve? (e.g., Increase sales, detect fraud).
+2. **Gather Data**: Collect data from databases, logs, etc.
+3. **Cleanse Data**: Fix errors, remove duplicates, and handle missing values.
+4. **Interrogate Data**: Explore the data (charts, graphs) to find initial patterns.
+5. **Build a Model**: Use algorithms (like decision trees or regression) to learn patterns that address the goal.
+6. **Validate Results**: Check how accurate the model is on data it did not see while it was being built.
+7. **Implement**: Use the insights in the real world.
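+
+The steps above can be sketched on a toy fraud-detection goal. This is a minimal illustration, not a real workflow: the records, the train/validation split, and the one-rule threshold model are all assumptions made up for the example.
+
+```python
+from statistics import mean
+
+# Step 1 - goal: flag fraudulent transactions.
+# Step 2 - gather: hypothetical records (amount, known fraud label).
+raw = [
+    {"amount": 20.0, "fraud": False},
+    {"amount": 25.0, "fraud": False},
+    {"amount": 25.0, "fraud": False},   # duplicate row
+    {"amount": None, "fraud": False},   # missing value
+    {"amount": 900.0, "fraud": True},
+    {"amount": 30.0, "fraud": False},
+    {"amount": 1200.0, "fraud": True},
+    {"amount": 22.0, "fraud": False},
+]
+
+# Step 3 - cleanse: drop rows with missing amounts and exact duplicates.
+seen, clean = set(), []
+for row in raw:
+    key = (row["amount"], row["fraud"])
+    if row["amount"] is None or key in seen:
+        continue
+    seen.add(key)
+    clean.append(row)
+
+# Keep the last two clean rows aside for validation (step 6).
+train, held_out = clean[:4], clean[4:]
+
+# Step 4 - interrogate: compare the average amount of each class.
+fraud_mean = mean(r["amount"] for r in train if r["fraud"])
+normal_mean = mean(r["amount"] for r in train if not r["fraud"])
+
+# Step 5 - build a model: one rule, with the threshold halfway between the means.
+threshold = (fraud_mean + normal_mean) / 2
+
+def predict(amount):
+    return amount > threshold
+
+# Step 6 - validate: accuracy on the rows the model has not seen.
+correct = sum(predict(r["amount"]) == r["fraud"] for r in held_out)
+accuracy = correct / len(held_out)
+print(f"threshold={threshold}, accuracy={accuracy}")
+```
+
+Step 7 would mean applying `predict` to new transactions as they arrive.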
+
+## Data Mining Functionalities
+Tasks are generally divided into two types:
+
+### 1. Descriptive Mining
+- Describes what is in the data.
+- Finds patterns and relationships.
+- *Examples*: Clustering, Association Rules.
+
+### 2. Predictive Mining
+- Predicts future or unknown values.
+- *Examples*: Classification, Regression.
--- a/1/03_Data_Mining_Techniques.md
+++ b/1/03_Data_Mining_Techniques.md
@@ -0,0 +1,28 @@
+# Data Mining Techniques
+
+There are several key techniques used to mine data.
+
+## 1. Classification (Predictive)
+- **Goal**: Assign items to predefined categories (classes).
+- **Supervised Learning**: The categories are known beforehand, and the model learns from examples that already carry a class label.
+- **Example**: Is this email **Spam** or **Not Spam**?
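+
+As a rough sketch of supervised classification, the snippet below labels an email by copying the class of the most similar labelled example (a 1-nearest-neighbour rule). The two features (link count and exclamation-mark count) and all data values are hypothetical; this is one simple classifier among many, not the method the notes prescribe.
+
+```python
+import math
+
+# Hypothetical labelled training data: (links, exclamation marks) -> class.
+training = [
+    ((8, 5), "Spam"),
+    ((6, 7), "Spam"),
+    ((0, 1), "Not Spam"),
+    ((1, 0), "Not Spam"),
+]
+
+def classify(email):
+    """Return the label of the closest training example (1-nearest neighbour)."""
+    _, label = min(training, key=lambda item: math.dist(item[0], email))
+    return label
+
+print(classify((7, 6)))  # lies near the spam examples
+print(classify((0, 0)))  # lies near the non-spam examples
+```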
+
+## 2. Regression (Predictive)
+- **Goal**: Predict a continuous **number**.
+- **Example**: Predicting the **price** of a house based on its size and location.
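+
+A minimal sketch of regression, assuming a single input (house size) and a straight-line relationship; location is left out to keep it short. The sizes and prices are made-up numbers.
+
+```python
+from statistics import mean
+
+# Hypothetical data: size in square metres, price in thousands.
+sizes = [50, 70, 90, 110]
+prices = [150, 200, 250, 300]
+
+# Least-squares fit of: price = slope * size + intercept.
+mean_x, mean_y = mean(sizes), mean(prices)
+slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / sum(
+    (x - mean_x) ** 2 for x in sizes
+)
+intercept = mean_y - slope * mean_x
+
+def predict_price(size):
+    """Predict a continuous value (the price) for an unseen size."""
+    return slope * size + intercept
+
+print(predict_price(100))
+```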
+
+## 3. Clustering (Descriptive)
+- **Goal**: Group similar items together.
+- **Unsupervised Learning**: We don't know the groups beforehand.
+- **Example**: Grouping customers into segments (e.g., "High Spenders", "Budget Shoppers").
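+
+The idea can be sketched with k-means on a single attribute. The spend values, the choice of two clusters, and the starting centroids (smallest and largest value) are assumptions for illustration.
+
+```python
+from statistics import mean
+
+# Hypothetical monthly spend per customer; no group labels are given.
+spend = [12, 15, 18, 20, 210, 230, 250]
+
+def k_means_1d(values, centroids, iterations=10):
+    """Plain k-means on one attribute, starting from the given centroids."""
+    clusters = []
+    for _ in range(iterations):
+        clusters = [[] for _ in centroids]
+        for v in values:
+            # Assign each value to its nearest centroid.
+            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
+            clusters[nearest].append(v)
+        # Move each centroid to the mean of its cluster (unchanged if empty).
+        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
+    return clusters
+
+budget_shoppers, high_spenders = k_means_1d(spend, centroids=[min(spend), max(spend)])
+print(budget_shoppers, high_spenders)
+```
+
+The segment names are attached by a person after inspecting the groups; the algorithm only returns the groups themselves.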
+
+## 4. Association Rules (Descriptive)
+- **Goal**: Find relationships between items.
+- **Market Basket Analysis**: "People who buy Bread often also buy Butter."
+- **Key Terms**:
+  - **Support**: The fraction of all transactions in which the items appear together.
+  - **Confidence**: Of the transactions that contain item A, the fraction that also contain item B.
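+
+Both measures can be computed directly from a list of baskets. The sketch below uses four hypothetical transactions; the function names are illustrative.
+
+```python
+# Hypothetical shopping baskets.
+transactions = [
+    {"Bread", "Butter", "Milk"},
+    {"Bread", "Butter"},
+    {"Bread", "Jam"},
+    {"Milk", "Eggs"},
+]
+
+def support(items):
+    """Fraction of transactions that contain every item in `items`."""
+    return sum(items <= basket for basket in transactions) / len(transactions)
+
+def confidence(a, b):
+    """Of the transactions containing `a`, the fraction that also contain `b`."""
+    return support(a | b) / support(a)
+
+print(support({"Bread", "Butter"}))       # 2 of the 4 baskets
+print(confidence({"Bread"}, {"Butter"}))  # 2 of the 3 baskets with Bread
+```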
+
+## 5. Outlier Detection
+- **Goal**: Find unusual data points that don't fit the pattern.
+- **Example**: Detecting credit card fraud (a huge transaction in a usually quiet account).
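+
+One simple way to flag such points is the z-score: how many standard deviations a value lies from the mean. The sketch below assumes a cutoff of 2 and made-up amounts; other cutoffs and other methods (for example, ones based on the interquartile range) are equally valid choices.
+
+```python
+from statistics import mean, pstdev
+
+# Hypothetical transaction amounts for one account.
+amounts = [40, 55, 60, 45, 50, 5000]
+
+def find_outliers(values, threshold=2.0):
+    """Return the values whose z-score magnitude exceeds `threshold`."""
+    mu, sigma = mean(values), pstdev(values)
+    if sigma == 0:  # all values identical, so nothing stands out
+        return []
+    return [v for v in values if abs(v - mu) / sigma > threshold]
+
+print(find_outliers(amounts))
+```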
--- a/1/04_Data_Preprocessing.md
+++ b/1/04_Data_Preprocessing.md
@@ -0,0 +1,31 @@
+# Data Preprocessing
+
+**Data Preprocessing** prepares raw data for mining and is one of the most important steps in the whole process. Real-world data is often dirty, incomplete, and inconsistent.
+
+## Why Preprocess?
+- **Accuracy**: Bad data leads to bad results.
+- **Completeness**: Missing data can break algorithms.
+- **Consistency**: Different formats (e.g., "USA" vs "U.S.A.") confuse the system.
+
+## Major Steps
+
+### 1. Data Cleaning
+- **Fill Missing Values**: Use the attribute's average (mean) or a fixed default value.
+- **Smooth Noisy Data**: Reduce random errors using techniques such as binning or regression.
+- **Handle Outliers**: Find values that don't make sense and remove or correct them if they are errors.
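+
+A minimal cleaning sketch, assuming hypothetical `(customer id, age)` records in which `None` marks a missing age. It removes exact duplicate rows and fills the missing age with the mean of the known ages.
+
+```python
+from statistics import mean
+
+# Hypothetical records: (customer id, age); None marks a missing age.
+records = [(1, 25), (2, None), (3, 35), (3, 35), (4, 30)]
+
+# Remove exact duplicates while keeping the original order.
+unique = list(dict.fromkeys(records))
+
+# Fill missing ages with the mean of the known ages.
+known_ages = [age for _, age in unique if age is not None]
+fill_value = mean(known_ages)
+cleaned = [(cid, age if age is not None else fill_value) for cid, age in unique]
+
+print(cleaned)
+```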
+
+### 2. Data Integration
+- Combining data from multiple sources (databases, files).
+- **Challenge**: Handling different names for the same thing (e.g., "CustID" vs "CustomerID").
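+
+The naming problem can be sketched as a rename step followed by a join on the shared key. The two sources, their fields, and the unified name `customer_id` are hypothetical.
+
+```python
+# Two hypothetical sources that name the customer key differently.
+sales = [{"CustID": 1, "total": 120.0}, {"CustID": 2, "total": 80.0}]
+profiles = [{"CustomerID": 1, "city": "Pune"}, {"CustomerID": 2, "city": "Delhi"}]
+
+# Map each source-specific column name to one shared name.
+RENAME = {"CustID": "customer_id", "CustomerID": "customer_id"}
+
+def standardise(row):
+    """Return a copy of the row with unified column names."""
+    return {RENAME.get(key, key): value for key, value in row.items()}
+
+# Join the two sources on the shared key.
+merged = {}
+for row in map(standardise, sales + profiles):
+    merged.setdefault(row["customer_id"], {}).update(row)
+
+integrated = list(merged.values())
+print(integrated)
+```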
+
+### 3. Data Reduction
+- Reducing the size of the data while keeping the important parts.
+- **Dimensionality Reduction**: Reducing the number of attributes, for example by removing unimportant or redundant ones.
+- **Numerosity Reduction**: Replacing raw data with smaller representations (like histograms).
+
+### 4. Data Transformation
+- Converting data into a format suitable for mining.
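+- **Normalization**: Rescaling values so that attributes with different units become comparable.
+  - *Min-Max Normalization*: maps values to a fixed range (e.g., 0 to 1).
+  - *Z-Score Normalization*: rescales values to have mean 0 and standard deviation 1.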
+- **Discretization**: Converting continuous numbers into intervals (e.g., Age 0-10, 11-20).
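+
+A short sketch of the two normalization methods, using made-up values. Min-max computes `(x - min) / (max - min)` and z-score computes `(x - mean) / standard deviation`; the population standard deviation is assumed here.
+
+```python
+from statistics import mean, pstdev
+
+values = [10, 20, 30, 40, 50]  # hypothetical attribute values
+
+def min_max(xs, new_min=0.0, new_max=1.0):
+    """Rescale values linearly into the range [new_min, new_max]."""
+    lo, hi = min(xs), max(xs)
+    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in xs]
+
+def z_score(xs):
+    """Rescale values to mean 0 and standard deviation 1."""
+    mu, sigma = mean(xs), pstdev(xs)
+    return [(x - mu) / sigma for x in xs]
+
+print(min_max(values))
+print(z_score(values))
+```
+
+Both functions assume the values are not all identical; otherwise the denominator would be zero.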
--- a/1/05_Data_Processing_Methods.md
+++ b/1/05_Data_Processing_Methods.md
@@ -0,0 +1,23 @@
+# Data Processing Methods
+
+How is data actually processed by computers?
+
+## 1. Batch Processing
+- Data is collected over time and processed **all at once** (in a batch).
+- **Example**: Payroll systems (calculating salaries at the end of the month).
+- **Pros**: Efficient for large volumes.
+- **Cons**: Not immediate.
+
+## 2. Real-time Processing
+- Data is processed **immediately** as it comes in.
+- **Example**: ATM withdrawals. You need to know your balance *right now*.
+- **Pros**: Instant results.
+- **Cons**: Complex and expensive.
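+
+The difference between batch and real-time processing can be sketched on the same list of account transactions. The amounts and function names are hypothetical.
+
+```python
+# Hypothetical account transactions (positive = deposit, negative = withdrawal).
+transactions = [100, -30, 50, -20]
+
+def batch_balance(collected):
+    """Batch: wait until all transactions are collected, then process once."""
+    return sum(collected)
+
+def realtime_balances(stream):
+    """Real-time: update and report the balance after every transaction."""
+    balance = 0
+    for amount in stream:
+        balance += amount
+        yield balance
+
+print(batch_balance(transactions))            # one result at the end
+print(list(realtime_balances(transactions)))  # one result per transaction
+```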
+
+## 3. Online Processing
+- Similar to real-time: each transaction is handled as soon as it is entered, usually by an interactive system connected to a central database or network.
+- **Example**: Barcode scanning at a store checkout. The price is fetched instantly.
+
+## 4. Distributed Processing
+- Breaking a task into pieces and running them on **multiple computers** at the same time.
+- **Example**: Google Search. Many servers work together to find your result.
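+
+The split-then-combine idea can be sketched on one machine, with worker threads standing in for separate computers. The document shards and the search term are hypothetical, and a real distributed system would also have to handle networking and failures.
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+
+# Hypothetical document shards, standing in for data held on separate machines.
+shards = [
+    ["data mining basics", "gold mining"],
+    ["mining association rules"],
+    ["cooking recipes", "data cleaning"],
+]
+
+def search_shard(shard, term="mining"):
+    """Work done independently on one shard."""
+    return [doc for doc in shard if term in doc]
+
+# Run the per-shard searches concurrently, then combine the partial results.
+with ThreadPoolExecutor(max_workers=3) as pool:
+    partial_results = list(pool.map(search_shard, shards))
+
+results = [doc for part in partial_results for doc in part]
+print(results)
+```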
--- a/1/06_Data_Discretization.md
+++ b/1/06_Data_Discretization.md
@@ -0,0 +1,27 @@
+# Data Discretization
+
+**Data Discretization** is the process of converting continuous values into a small, finite set of intervals (bins).
+
+## Why use it?
+- Makes data easier to understand.
+- Many algorithms work better with a few categories than with continuous values that can take any number in a range.
+
+## Techniques
+
+### 1. Binning
+- Sorting data and dividing it into "bins", typically of equal width (same range per bin) or equal frequency (same number of values per bin).
+- **Example**: Grouping ages into [0-10], [11-20], etc.
+- Helps smooth out noise.
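+
+A minimal sketch of equal-width binning followed by smoothing by bin means. The ages are made up, and the bins are assumed to be [0-9], [10-19], and so on, so that every value falls into exactly one bin.
+
+```python
+from statistics import mean
+
+ages = [3, 7, 12, 15, 18, 24, 27, 29, 35]  # hypothetical values
+
+def equal_width_bins(values, width=10):
+    """Group sorted values into bins [0-9], [10-19], ... of the given width."""
+    bins = {}
+    for v in sorted(values):
+        start = (v // width) * width
+        bins.setdefault((start, start + width - 1), []).append(v)
+    return bins
+
+def smooth_by_bin_means(bins):
+    """Replace every value with the mean of its bin to smooth out noise."""
+    return {
+        interval: [mean(members)] * len(members)
+        for interval, members in bins.items()
+    }
+
+bins = equal_width_bins(ages)
+print(bins)
+print(smooth_by_bin_means(bins))
+```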
+
+### 2. Histogram Analysis
+- Using a bar chart (histogram) to see the distribution and decide where to split the data.
+
+### 3. Cluster Analysis
+- Using clustering (like K-Means) to group similar values, then using those groups as the intervals.
+
+## Concept Hierarchy
+- Organizing data from **low-level** concepts to **high-level** concepts.
+- **Example (Location)**:
+  - Street -> City -> State -> Country.
+- **Top-down Mapping**: Moving from general to specific (e.g., Country to City).
+- **Bottom-up Mapping**: Moving from specific to general (e.g., Street to City).
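+
+A concept hierarchy can be sketched as a lookup table from each value to its parent concept. The place names and the function names `roll_up` and `drill_down` are illustrative assumptions.
+
+```python
+# Hypothetical location hierarchy: each value maps to its parent concept.
+PARENT = {
+    "MG Road": "Pune",
+    "FC Road": "Pune",
+    "Pune": "Maharashtra",
+    "Maharashtra": "India",
+}
+
+def roll_up(value, levels=1):
+    """Bottom-up mapping: replace a value with its ancestor `levels` steps up."""
+    for _ in range(levels):
+        value = PARENT.get(value, value)  # stay put at the top of the hierarchy
+    return value
+
+def drill_down(value):
+    """Top-down mapping: list the more specific concepts directly below a value."""
+    return sorted(child for child, parent in PARENT.items() if parent == value)
+
+print(roll_up("MG Road"))            # street -> city
+print(roll_up("MG Road", levels=3))  # street -> country
+print(drill_down("Pune"))            # city -> streets
+```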