# Data Preprocessing

**Data Preprocessing** is an essential step before mining. Real-world data is often dirty: incomplete, noisy, and inconsistent, and mining such data produces unreliable results.
## Why Preprocess?

- **Accuracy**: Bad data leads to bad results ("garbage in, garbage out").
- **Completeness**: Missing values can break algorithms or bias their output.
- **Consistency**: Different formats for the same value (e.g., "USA" vs "U.S.A.") confuse mining algorithms.
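A minimal Python sketch of the consistency problem above: mapping variant spellings to one canonical label. The lookup table and variants here are illustrative, not from any real dataset.

```python
# Hypothetical lookup table of known variants -> canonical label.
CANONICAL = {
    "usa": "USA",
    "u.s.a.": "USA",
    "united states": "USA",
}

def canonicalize(value: str) -> str:
    """Map a raw label to its canonical form; pass unknown labels through."""
    return CANONICAL.get(value.strip().lower(), value.strip())

records = ["USA", "U.S.A.", "united states", "India"]
print([canonicalize(r) for r in records])  # ['USA', 'USA', 'USA', 'India']
```

In practice the lookup table would be built from a profiling pass over the data; the point is that inconsistent labels must be unified before any counting or grouping is meaningful.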
## Major Steps

### 1. Data Cleaning

- **Fill Missing Values**: Use a measure of central tendency (mean, median) or a constant.
- **Smooth Noisy Data**: Reduce random error with techniques such as binning or regression.
- **Remove Outliers**: Detect and drop values that deviate sharply from the rest of the data.
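The first two cleaning steps can be sketched in plain Python. The readings, the mean-fill choice, and the bin count are made up for illustration; binning here smooths each value by replacing it with the mean of its equal-width bin.

```python
from statistics import mean

# Hypothetical sensor readings; None marks missing values.
readings = [4.0, None, 6.0, 5.0, None, 21.0, 5.5]

# 1. Fill missing values with the mean of the observed values.
observed = [x for x in readings if x is not None]
fill = mean(observed)
filled = [x if x is not None else fill for x in readings]

# 2. Smooth noise by equal-width binning: replace each value
#    with the mean of the bin it falls into.
def smooth_by_bin_means(values, n_bins=3):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    idx = [min(int((v - lo) / width), n_bins - 1) for v in values]
    bins = {}
    for i, v in zip(idx, values):
        bins.setdefault(i, []).append(v)
    means = {i: mean(b) for i, b in bins.items()}
    return [means[i] for i in idx]

smoothed = smooth_by_bin_means(filled)
```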
### 2. Data Integration

- Combining data from multiple sources (databases, files).
- **Challenge**: Handling different names for the same attribute (e.g., "CustID" vs "CustomerID").
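A toy sketch of that challenge: two sources name the same key differently, so the keys are unified before the records are combined. The field names and records are made up for illustration.

```python
# Two hypothetical sources for the same customer table.
source_a = [{"CustID": 1, "city": "Pune"}]
source_b = [{"CustomerID": 2, "city": "Delhi"}]

def unify(record, key_map):
    """Rename keys according to key_map, leaving other keys as-is."""
    return {key_map.get(k, k): v for k, v in record.items()}

# Map both source-specific names onto one shared name.
KEY_MAP = {"CustID": "customer_id", "CustomerID": "customer_id"}
combined = [unify(r, KEY_MAP) for r in source_a + source_b]
```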
### 3. Data Reduction

- Reducing the volume of the data while preserving the information needed for mining.
- **Dimensionality Reduction**: Removing attributes that add little information.
- **Numerosity Reduction**: Replacing raw data with smaller representations (like histograms).
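The histogram idea can be sketched as follows: ten raw values are replaced by three (range, count) pairs. The data and bin count are illustrative.

```python
from collections import Counter

# Hypothetical raw values to be summarized.
data = [1, 2, 2, 3, 8, 9, 9, 10, 15, 16]

def histogram(values, n_bins=3):
    """Summarize values as an equal-width histogram: bin range -> count."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = Counter(
        min(int((v - lo) / width), n_bins - 1) for v in values
    )
    # Store only each bin's boundaries and count instead of the raw values.
    return {
        (round(lo + i * width, 2), round(lo + (i + 1) * width, 2)): counts[i]
        for i in sorted(counts)
    }
```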
### 4. Data Transformation

- Converting data into a format suitable for mining.
- **Normalization**: Scaling data to a small range (e.g., 0 to 1).
  - *Min-Max Normalization*: v' = (v - min) / (max - min)
  - *Z-Score Normalization*: v' = (v - mean) / standard deviation
- **Discretization**: Converting continuous numbers into intervals (e.g., Age 0-10, 11-20).
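All three transformations above can be sketched in a few lines of Python. The sample ages are made up, and the discretization uses 10-year intervals for illustration; `pstdev` is the population standard deviation from the standard library.

```python
from statistics import mean, pstdev

ages = [12, 25, 33, 47, 58]  # hypothetical sample

# Min-max normalization: scale values into [0, 1].
lo, hi = min(ages), max(ages)
minmax = [(a - lo) / (hi - lo) for a in ages]

# Z-score normalization: zero mean, unit standard deviation.
mu, sigma = mean(ages), pstdev(ages)
zscores = [(a - mu) / sigma for a in ages]

# Discretization: map each age into a 10-year interval label.
labels = [f"{(a // 10) * 10}-{(a // 10) * 10 + 9}" for a in ages]
print(labels)  # ['10-19', '20-29', '30-39', '40-49', '50-59']
```

Min-max preserves the shape of the data but is sensitive to outliers (one extreme value compresses everything else), while z-scores are the usual choice when the range is unknown in advance.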