addition of unit 1 3 4 5
24
unit 1/00_Index.md
Normal file
@@ -0,0 +1,24 @@
# Unit 1: Introduction to Data Mining

Welcome to your simplified notes for Unit 1.

## Table of Contents

1. [[01_Introduction_to_Data_Mining|Introduction & DIKW Pyramid]]
   - What is Data Mining?
   - The DIKW Pyramid (Data, Information, Knowledge, Wisdom)
2. [[02_Data_Mining_Process|The Data Mining Process]]
   - Steps from Goal Definition to Deployment
   - Issues in Data Mining (Privacy, Scalability)
3. [[03_Data_Mining_Techniques|Techniques & Functionalities]]
   - Predictive vs Descriptive Mining
   - Classification, Regression, Clustering, Association Rules
4. [[04_Data_Preprocessing|Data Preprocessing]]
   - Why do we need it?
   - Cleaning, Integration, Reduction, Transformation
5. [[05_Data_Processing_Methods|Data Processing Methods]]
   - Manual vs Electronic
   - Batch, Real-time, Online Processing
6. [[06_Data_Discretization|Data Discretization]]
   - Binning, Histograms
   - Concept Hierarchy
27
unit 1/01_Introduction_to_Data_Mining.md
Normal file
@@ -0,0 +1,27 @@
# Introduction to Data Mining

## What is Data Mining?
**Data Mining** is the process of digging through large amounts of raw data to find useful patterns, trends, and knowledge.
- **Analogy**: Like mining gold from rocks. The rocks are the "raw data," and the gold is the "knowledge."

### Key Definitions
- **Data**: Raw facts and figures (e.g., sales logs, sensor readings).
- **Mining**: Extracting something valuable.

## The DIKW Pyramid
The **DIKW** model shows how we move from raw data to wisdom.

1. **Data (D)**: Raw, unprocessed facts.
   - *Example*: Numbers like 42, 35, 50.
2. **Information (I)**: Data that is organized and has meaning.
   - *Example*: "These are the ages of employees."
3. **Knowledge (K)**: Understanding gained from analysis.
   - *Example*: "The team has a mix of young and experienced people."
4. **Wisdom (W)**: Applying knowledge to make good decisions.
   - *Example*: "Let's create a mentorship program to share skills."
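
The four levels above can be traced in a few lines of Python. This is only an illustrative sketch: the numbers come from the example, while the variable names and the 10-year age-gap threshold are assumptions made up for the demonstration.

```python
from statistics import mean

# Data: raw numbers with no context (taken from the example above).
data = [42, 35, 50]

# Information: the same numbers, labelled with what they measure.
information = {"attribute": "employee age (years)", "values": data}

# Knowledge: a summary obtained by analysing the information.
knowledge = {
    "youngest": min(data),
    "oldest": max(data),
    "average": round(mean(data), 1),
}

# Wisdom: a decision made by applying the knowledge.
# The 10-year threshold is a hypothetical cut-off chosen for illustration.
age_gap = knowledge["oldest"] - knowledge["youngest"]
decision = "start a mentorship program" if age_gap >= 10 else "no action needed"

print(information)
print(knowledge)
print(decision)
```

Each step adds context or judgement to the previous one; nothing new is measured after the first line.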

## Major Issues in Data Mining
1. **Privacy and Security**: Mining can reveal sensitive personal information, so that information must be protected.
2. **Scalability**: Can the system handle huge amounts of data (Big Data)?
3. **Data Quality**: If the data is dirty or has missing values, the results will be unreliable ("Garbage In, Garbage Out").
4. **Ethical Use**: Ensuring that data and mining results are not used to discriminate or to reinforce bias.
24
unit 1/02_Data_Mining_Process.md
Normal file
@@ -0,0 +1,24 @@
# The Data Mining Process

How do we actually do data mining? It usually follows a standard process similar to CRISP-DM (Cross-Industry Standard Process for Data Mining).

## Steps in the Process
1. **Define the Goal**: What do you want to achieve? (e.g., increase sales, detect fraud)
2. **Gather Data**: Collect data from databases, logs, etc.
3. **Cleanse Data**: Fix errors, remove duplicates, and handle missing values.
4. **Interrogate Data**: Explore the data (charts, graphs) to find initial patterns.
5. **Build a Model**: Use algorithms (like decision trees or regression) to find the solution.
6. **Validate Results**: Check whether the model is accurate.
7. **Implement**: Use the insights in the real world.

## Data Mining Functionalities
Tasks are generally divided into two types:

### 1. Descriptive Mining
- Describes what is in the data.
- Finds patterns and relationships.
- *Examples*: Clustering, Association Rules.

### 2. Predictive Mining
- Predicts future or unknown values.
- *Examples*: Classification, Regression, Prediction.
28
unit 1/03_Data_Mining_Techniques.md
Normal file
@@ -0,0 +1,28 @@
# Data Mining Techniques

There are several key techniques used to mine data.

## 1. Classification (Predictive)
- **Goal**: Assign items to predefined categories (classes).
- **Supervised Learning**: The categories are known beforehand, and the model learns from examples that are already labelled.
- **Example**: Is this email **Spam** or **Not Spam**?
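
As a rough illustration of supervised learning, the sketch below "trains" on a handful of labelled messages by counting words per class, then labels a new message by word overlap. The training messages and the `classify` helper are hypothetical, and this toy scoring rule is far simpler than the classifiers used in practice.

```python
from collections import Counter

# Hypothetical labelled training data: (message, class) pairs.
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting at noon tomorrow", "not spam"),
    ("project report attached", "not spam"),
]

# "Training": count how often each word appears in each class.
word_counts = {"spam": Counter(), "not spam": Counter()}
for message, label in training:
    word_counts[label].update(message.split())


def classify(message):
    """Return the class whose training words overlap most with the message.

    Ties (including a message with no known words) fall back to the
    first class in word_counts.
    """
    scores = {
        label: sum(counts[word] for word in message.split())
        for label, counts in word_counts.items()
    }
    return max(scores, key=scores.get)


print(classify("claim your free prize"))
print(classify("report for the meeting"))
```

The key point is that the classes ("spam", "not spam") exist before any prediction is made; the model only decides which one fits best.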

## 2. Regression (Predictive)
- **Goal**: Predict a continuous **number**.
- **Example**: Predicting the **price** of a house based on its size and location.
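
A minimal sketch of regression, assuming a single input (size) and a straight-line relationship fitted by ordinary least squares. The house data is invented so that price is exactly 3 times size, which makes the fitted line easy to check by hand; location is left out to keep the example short.

```python
# Hypothetical data: house size (square metres) and price (thousands).
sizes = [50, 80, 100, 120]
prices = [150, 240, 300, 360]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sizes, prices)


def predict(size):
    """Predict a price for a size that was not in the training data."""
    return slope * size + intercept


print(slope, intercept)
print(predict(90))
```

Unlike classification, the output is a number on a continuous scale rather than one of a fixed set of labels.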

## 3. Clustering (Descriptive)
- **Goal**: Group similar items together.
- **Unsupervised Learning**: The groups are not known beforehand; the algorithm discovers them from the data.
- **Example**: Grouping customers into segments (e.g., "High Spenders", "Budget Shoppers").
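
The sketch below groups customers by monthly spend using a bare-bones one-dimensional k-means with two clusters. The spend figures, the function name `k_means_1d`, and the choice to start from the smallest and largest values are assumptions for illustration; the code also assumes the data contains at least two distinct values.

```python
def k_means_1d(values, iterations=10):
    """Split values into two clusters with a basic 1-D k-means (k = 2).

    Assumes at least two distinct values, so neither cluster ends up empty.
    """
    centroids = [min(values), max(values)]  # simple deterministic start
    clusters = [[], []]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster with the nearest centroid.
        clusters = [[], []]
        for v in values:
            nearest = 0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) for c in clusters]
    return clusters, centroids


# Hypothetical monthly spend per customer.
spend = [20, 25, 30, 200, 220, 250]
clusters, centroids = k_means_1d(spend)

print("Budget Shoppers:", clusters[0])
print("High Spenders:", clusters[1])
```

No labels were supplied; the names "Budget Shoppers" and "High Spenders" are attached by a person after looking at the groups the algorithm found.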

## 4. Association Rules (Descriptive)
- **Goal**: Find relationships between items.
- **Market Basket Analysis**: "People who buy Bread often also buy Butter."
- **Key Terms**:
  - **Support**: The fraction of all transactions in which the items appear together.
  - **Confidence**: Among the transactions that contain item A, the fraction that also contain item B.
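
Support and confidence for the rule "Bread -> Butter" can be computed directly from a list of baskets. The five transactions below are made up for illustration, and `support` and `confidence` are hypothetical helper names.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for basket in transactions if itemset <= basket)


def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Fraction of transactions with the antecedent that also have the consequent."""
    return count(antecedent | consequent) / count(antecedent)


print(support({"bread", "butter"}))       # 3 of 5 baskets
print(confidence({"bread"}, {"butter"}))  # 3 of the 4 bread baskets
```

Here bread and butter appear together in 3 of 5 baskets (support 0.6), and 3 of the 4 baskets containing bread also contain butter (confidence 0.75).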

## 5. Outlier Detection
- **Goal**: Find unusual data points that don't fit the pattern.
- **Example**: Detecting credit card fraud (a huge transaction in a usually quiet account).
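
One simple way to flag such points is the modified z-score, which measures distance from the median in units of the median absolute deviation (MAD). This is only one of several possible rules; the transaction amounts, the function name `find_outliers`, and the commonly quoted cut-off of 3.5 are assumptions for this sketch.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # At least half the values equal the median; this rule cannot rank them.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical transaction amounts on a usually quiet account.
amounts = [20, 25, 22, 30, 18, 5000]
print(find_outliers(amounts))
```

The median is used instead of the mean because a single huge value drags the mean (and the standard deviation) towards itself, which can hide the very point being searched for.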
|
||||
31
unit 1/04_Data_Preprocessing.md
Normal file
@@ -0,0 +1,31 @@
# Data Preprocessing

**Data Preprocessing** is an essential step before mining, because real-world data is often dirty, incomplete, and inconsistent.

## Why Preprocess?
- **Accuracy**: Bad data leads to bad results.
- **Completeness**: Missing data can break algorithms.
- **Consistency**: Different formats (e.g., "USA" vs "U.S.A.") confuse the system.

## Major Steps

### 1. Data Cleaning
- **Fill Missing Values**: Replace them with the attribute's average (mean) or a specific default value.
- **Smooth Noisy Data**: Reduce random errors using techniques such as binning or regression.
- **Remove Outliers**: Delete values that don't make sense.
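
Filling missing values with the mean can be sketched as follows. The `ages` list is hypothetical, and `None` is assumed to mark a missing entry.

```python
from statistics import mean

# Hypothetical attribute with two missing entries (marked as None).
ages = [20, None, 30, 40, None]

# Compute the mean from the values that are present.
known = [a for a in ages if a is not None]
fill_value = mean(known)

# Replace each missing entry with that mean.
cleaned = [a if a is not None else fill_value for a in ages]
print(cleaned)
```

Mean imputation keeps every record but shrinks the spread of the attribute, so it is usually reasonable only when a small share of the values is missing.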

### 2. Data Integration
- Combining data from multiple sources (databases, files).
- **Challenge**: Handling different names for the same thing (e.g., "CustID" vs "CustomerID").
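
The naming problem above can be handled by mapping each source's column name to one agreed name before merging. The two sources, their rows, and the target name `customer_id` are all hypothetical.

```python
# Two hypothetical sources that name the same key differently.
crm_rows = [{"CustID": 1, "name": "Asha"}, {"CustID": 2, "name": "Ben"}]
billing_rows = [{"CustomerID": 1, "total": 120.0}, {"CustomerID": 2, "total": 75.5}]

# Map every known alias to a single standard column name.
COLUMN_ALIASES = {"CustID": "customer_id", "CustomerID": "customer_id"}


def standardise(row):
    """Rename aliased columns; leave all other columns unchanged."""
    return {COLUMN_ALIASES.get(key, key): value for key, value in row.items()}


# Merge rows from both sources into one record per customer.
merged = {}
for row in map(standardise, crm_rows + billing_rows):
    merged.setdefault(row["customer_id"], {}).update(row)

print(merged[1])
print(merged[2])
```

Once both sources share the same key name, records that describe the same customer can be combined without guesswork.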

### 3. Data Reduction
- Reducing the size of the data while keeping the important parts.
- **Dimensionality Reduction**: Removing unimportant attributes.
- **Numerosity Reduction**: Replacing raw data with smaller representations (like histograms).

### 4. Data Transformation
- Converting data into a format suitable for mining.
- **Normalization**: Rescaling data so that attributes with different units or ranges become comparable.
  - *Min-Max Normalization*: Rescales values linearly to a fixed range (e.g., 0 to 1).
  - *Z-Score Normalization*: Rescales values so that they have a mean of 0 and a standard deviation of 1.
- **Discretization**: Converting continuous numbers into intervals (e.g., Age 0-10, 11-20).
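
The two normalization methods listed above can be sketched in plain Python. The `incomes` values and the function names are hypothetical; the code assumes the values are not all identical, and it uses the population standard deviation (some textbooks use the sample version instead).

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def z_score(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical attribute values (e.g., income in thousands).
incomes = [20, 40, 60, 80, 100]

print(min_max(incomes))
print(z_score(incomes))
```

Min-max output always stays inside the chosen range, while z-scores are unbounded and simply say how many standard deviations a value lies from the mean.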
|
||||
23
unit 1/05_Data_Processing_Methods.md
Normal file
@@ -0,0 +1,23 @@
# Data Processing Methods

How is data actually processed by computers?

## 1. Batch Processing
- Data is collected over time and processed **all at once** (in a batch).
- **Example**: Payroll systems (calculating salaries at the end of the month).
- **Pros**: Efficient for large volumes.
- **Cons**: Not immediate.

## 2. Real-time Processing
- Data is processed **immediately** as it comes in.
- **Example**: ATM withdrawals. You need to know your balance *right now*.
- **Pros**: Instant results.
- **Cons**: Complex and expensive.
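
The difference between the two styles shows up in when results become available. In this sketch, the event list and both function names are hypothetical; the batch version returns one result after seeing everything, while the real-time version yields an updated balance after each event.

```python
# Hypothetical account events: (account, amount) pairs in arrival order.
events = [("alice", 100), ("bob", 50), ("alice", -30)]


def process_batch(collected_events):
    """Batch: process the whole collection in one pass, then report once."""
    balances = {}
    for account, amount in collected_events:
        balances[account] = balances.get(account, 0) + amount
    return balances


def process_stream(incoming_events):
    """Real-time: update and report the balance as each event arrives."""
    balances = {}
    for account, amount in incoming_events:
        balances[account] = balances.get(account, 0) + amount
        yield account, balances[account]


print(process_batch(events))
for account, balance in process_stream(events):
    print(account, balance)
```

Both approaches end with the same balances; only the real-time version can tell you the balance in between events.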

## 3. Online Processing
- Similar to real-time processing, and often used for internet applications.
- **Example**: Barcode scanning at a store checkout. The price is fetched instantly.

## 4. Distributed Processing
- Breaking a task into pieces and running them on **multiple computers** at the same time.
- **Example**: Google Search. Many servers work together to find your result.
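
The split-work-then-combine idea behind distributed processing can be imitated on one machine. In this sketch, worker threads stand in for separate computers (an assumption made purely so the example is runnable), and the documents, keyword, and function name are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical document collection to search.
documents = [
    "data mining basics",
    "cooking tips",
    "mining gold",
    "data cleaning",
    "travel guide",
    "text mining",
]
keyword = "mining"


def count_matches(chunk, word):
    """Count the documents in one chunk that contain the word."""
    return sum(1 for doc in chunk if word in doc)


# Split the collection into three pieces, one per worker.
chunks = [documents[i::3] for i in range(3)]

# Each worker handles its own piece; threads stand in for separate computers.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_matches, chunks, [keyword] * len(chunks)))

# Combine the partial results into the final answer.
total = sum(partial_counts)
print(partial_counts, total)
```

A real distributed system also has to move data between machines and cope with failures, which this sketch ignores.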
|
||||
27
unit 1/06_Data_Discretization.md
Normal file
@@ -0,0 +1,27 @@
# Data Discretization

**Data Discretization** is the process of converting a large number of continuous values into a smaller number of finite intervals (bins).

## Why use it?
- Makes data easier to understand.
- Many algorithms work better with a small set of categories than with continuous values.

## Techniques

### 1. Binning
- Sorting data and dividing it into "bins".
- **Example**: Grouping ages into [0-10], [11-20], etc.
- Helps smooth out noise.
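
A minimal sketch of equal-width binning, followed by smoothing each bin to its mean. The `ages` list and function names are hypothetical. Note that the bins here are half-open intervals such as 0 to 9 and 10 to 19, which differs slightly from the [0-10], [11-20] labels used in the example above.

```python
def equal_width_bins(values, width):
    """Group sorted values into bins keyed by each bin's lower edge."""
    bins = {}
    for v in sorted(values):
        lower_edge = (v // width) * width
        bins.setdefault(lower_edge, []).append(v)
    return bins


def smooth_by_bin_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return {
        lower_edge: [sum(members) / len(members)] * len(members)
        for lower_edge, members in bins.items()
    }


# Hypothetical ages to discretize into 10-year bins.
ages = [3, 7, 12, 15, 18, 24, 29]
bins = equal_width_bins(ages, width=10)

print(bins)
print(smooth_by_bin_means(bins))
```

Replacing values with their bin mean is what "smoothing out noise" refers to: small fluctuations inside a bin disappear.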

### 2. Histogram Analysis
- Using a bar chart (histogram) to see the distribution and decide where to split the data.

### 3. Cluster Analysis
- Using clustering (like K-Means) to group similar values, then using those groups as the intervals.

## Concept Hierarchy
- Organizing data from **low-level** concepts to **high-level** concepts.
- **Example (Location)**:
  - Street -> City -> State -> Country.
- **Top-down Mapping**: General to Specific.
- **Bottom-up Mapping**: Specific to General.
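
A concept hierarchy can be stored as a simple child-to-parent lookup, and bottom-up mapping then means following that lookup upwards. The place names, the `PARENT` table, and the `roll_up` helper are illustrative assumptions.

```python
# Hypothetical hierarchy: each value maps to its parent one level up
# (street -> city -> state -> country).
PARENT = {
    "MG Road": "Bengaluru",
    "Brigade Road": "Bengaluru",
    "Marine Drive": "Mumbai",
    "Bengaluru": "Karnataka",
    "Mumbai": "Maharashtra",
    "Karnataka": "India",
    "Maharashtra": "India",
}


def roll_up(value, levels=1):
    """Move a value up the hierarchy by the given number of levels.

    Values with no parent (the top level) are returned unchanged.
    """
    for _ in range(levels):
        value = PARENT.get(value, value)
    return value


print(roll_up("MG Road"))            # street -> city
print(roll_up("MG Road", levels=3))  # street -> country
```

Rolling up replaces many specific values with fewer general ones, which is why concept hierarchies count as a form of discretization for categorical data.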