# Data Preprocessing

**Data Preprocessing** is an essential step before mining. Real-world data is often dirty: incomplete, noisy, and inconsistent, and mining such data produces unreliable results.
## Why Preprocess?

- **Accuracy**: Bad data leads to bad results ("garbage in, garbage out").
- **Completeness**: Missing values can break algorithms or bias their output.
- **Consistency**: Different formats for the same value (e.g., "USA" vs "U.S.A.") confuse mining algorithms.
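A minimal Python sketch of the consistency problem above: mapping variant spellings to one canonical label. The lookup table and variants here are illustrative, not from any real dataset.

```python
# Hypothetical lookup table of known variants -> canonical label.
CANONICAL = {
    "usa": "USA",
    "u.s.a.": "USA",
    "united states": "USA",
}

def canonicalize(value: str) -> str:
    """Map a raw label to its canonical form; pass unknown labels through."""
    return CANONICAL.get(value.strip().lower(), value.strip())

records = ["USA", "U.S.A.", "united states", "India"]
print([canonicalize(r) for r in records])  # ['USA', 'USA', 'USA', 'India']
```

In practice the lookup table would be built from a profiling pass over the data; the point is that inconsistent labels must be unified before any counting or grouping is meaningful.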
## Major Steps

### 1. Data Cleaning

- **Fill Missing Values**: Use a measure of central tendency (mean, median) or a constant.
- **Smooth Noisy Data**: Reduce random error with techniques such as binning or regression.
- **Remove Outliers**: Detect and drop values that deviate sharply from the rest of the data.
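The first two cleaning steps can be sketched in plain Python. The readings, the mean-fill choice, and the bin count are made up for illustration; binning here smooths each value by replacing it with the mean of its equal-width bin.

```python
from statistics import mean

# Hypothetical sensor readings; None marks missing values.
readings = [4.0, None, 6.0, 5.0, None, 21.0, 5.5]

# 1. Fill missing values with the mean of the observed values.
observed = [x for x in readings if x is not None]
fill = mean(observed)
filled = [x if x is not None else fill for x in readings]

# 2. Smooth noise by equal-width binning: replace each value
#    with the mean of the bin it falls into.
def smooth_by_bin_means(values, n_bins=3):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    idx = [min(int((v - lo) / width), n_bins - 1) for v in values]
    bins = {}
    for i, v in zip(idx, values):
        bins.setdefault(i, []).append(v)
    means = {i: mean(b) for i, b in bins.items()}
    return [means[i] for i in idx]

smoothed = smooth_by_bin_means(filled)
```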
### 2. Data Integration

- Combining data from multiple sources (databases, files).
- **Challenge**: Handling different names for the same attribute (e.g., "CustID" vs "CustomerID").
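A toy sketch of that challenge: two sources name the same key differently, so the keys are unified before the records are combined. The field names and records are made up for illustration.

```python
# Two hypothetical sources for the same customer table.
source_a = [{"CustID": 1, "city": "Pune"}]
source_b = [{"CustomerID": 2, "city": "Delhi"}]

def unify(record, key_map):
    """Rename keys according to key_map, leaving other keys as-is."""
    return {key_map.get(k, k): v for k, v in record.items()}

# Map both source-specific names onto one shared name.
KEY_MAP = {"CustID": "customer_id", "CustomerID": "customer_id"}
combined = [unify(r, KEY_MAP) for r in source_a + source_b]
```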
### 3. Data Reduction

- Reducing the volume of the data while preserving the information needed for mining.
- **Dimensionality Reduction**: Removing attributes that add little information.
- **Numerosity Reduction**: Replacing raw data with smaller representations (like histograms).
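The histogram idea can be sketched as follows: ten raw values are replaced by three (range, count) pairs. The data and bin count are illustrative.

```python
from collections import Counter

# Hypothetical raw values to be summarized.
data = [1, 2, 2, 3, 8, 9, 9, 10, 15, 16]

def histogram(values, n_bins=3):
    """Summarize values as an equal-width histogram: bin range -> count."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = Counter(
        min(int((v - lo) / width), n_bins - 1) for v in values
    )
    # Store only each bin's boundaries and count instead of the raw values.
    return {
        (round(lo + i * width, 2), round(lo + (i + 1) * width, 2)): counts[i]
        for i in sorted(counts)
    }
```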
### 4. Data Transformation

- Converting data into a format suitable for mining.
- **Normalization**: Scaling data to a small range (e.g., 0 to 1).
  - *Min-Max Normalization*: v' = (v - min) / (max - min)
  - *Z-Score Normalization*: v' = (v - mean) / standard deviation
- **Discretization**: Converting continuous numbers into intervals (e.g., Age 0-10, 11-20).
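All three transformations above can be sketched in a few lines of Python. The sample ages are made up, and the discretization uses 10-year intervals for illustration; `pstdev` is the population standard deviation from the standard library.

```python
from statistics import mean, pstdev

ages = [12, 25, 33, 47, 58]  # hypothetical sample

# Min-max normalization: scale values into [0, 1].
lo, hi = min(ages), max(ages)
minmax = [(a - lo) / (hi - lo) for a in ages]

# Z-score normalization: zero mean, unit standard deviation.
mu, sigma = mean(ages), pstdev(ages)
zscores = [(a - mu) / sigma for a in ages]

# Discretization: map each age into a 10-year interval label.
labels = [f"{(a // 10) * 10}-{(a // 10) * 10 + 9}" for a in ages]
print(labels)  # ['10-19', '20-29', '30-39', '40-49', '50-59']
```

Min-max preserves the shape of the data but is sensitive to outliers (one extreme value compresses everything else), while z-scores are the usual choice when the range is unknown in advance.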