Data Preprocessing
Data preprocessing prepares raw data for mining and strongly affects the quality of the results. Real-world data is often dirty (noisy), incomplete, and inconsistent.
Why Preprocess?
- Accuracy: Bad data leads to bad results.
- Completeness: Missing data can break algorithms.
- Consistency: Different formats for the same value (e.g., "USA" vs "U.S.A.") get treated as separate values.
Major Steps
1. Data Cleaning
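- Fill Missing Values: Replace them with the attribute's average (mean) or with a fixed value.
- Smooth Noisy Data: Reduce random errors with techniques such as binning or regression.
- Handle Outliers: Find values that lie far from the rest of the data and remove them when they are errors.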
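The three cleaning steps above can be sketched in plain Python. This is only an illustration: the function names are made up for this example, missing values are assumed to be stored as None, bins are assumed to be equal-frequency, and the 1.5 x IQR outlier rule is one common heuristic, not something these notes prescribe.

```python
from statistics import mean, quantiles


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and
    replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


def drop_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed heuristic)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


print(fill_missing_with_mean([10, None, 30]))
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3))
print(drop_outliers_iqr([10, 12, 11, 13, 12, 11, 95]))
```

Whether an outlier should be deleted, corrected, or kept depends on the data; the filter above simply drops it.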
2. Data Integration
- Combining data from multiple sources (databases, files).
- Challenge: Handling different names for the same thing (e.g., "CustID" vs "CustomerID").
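As a rough sketch of the naming challenge, the example below maps "CustID" and "CustomerID" onto one shared key before joining two sources. The sample records, the shared name customer_id, and both helper functions are hypothetical and exist only for illustration.

```python
def rename_keys(records, mapping):
    """Return copies of the records with source-specific field
    names replaced by the shared names in mapping."""
    return [{mapping.get(k, k): v for k, v in r.items()} for r in records]


def integrate(left, right, key):
    """Inner-join two lists of dict records on a shared key."""
    right_by_key = {r[key]: r for r in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged


# Hypothetical data from two sources that name the same attribute differently.
crm_rows = [{"CustID": 1, "name": "Ana"}, {"CustID": 2, "name": "Ben"}]
billing_rows = [{"CustomerID": 1, "balance": 40.0},
                {"CustomerID": 3, "balance": 12.5}]

crm_unified = rename_keys(crm_rows, {"CustID": "customer_id"})
billing_unified = rename_keys(billing_rows, {"CustomerID": "customer_id"})
combined = integrate(crm_unified, billing_unified, "customer_id")
print(combined)
```

Only customer 1 appears in both sources, so the inner join keeps a single combined record.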
3. Data Reduction
- Reducing the size of the data while keeping the important parts.
- Dimensionality Reduction: Removing irrelevant or redundant attributes.
- Numerosity Reduction: Replacing raw data with smaller representations (like histograms).
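Both kinds of reduction can be illustrated with a small sketch. It assumes that "unimportant" means an attribute whose value barely changes (a variance filter, which is only one possible criterion) and that the histogram uses equal-width buckets. The function names and sample rows are hypothetical.

```python
from statistics import pvariance


def drop_low_variance(rows, threshold=0.0):
    """Dimensionality reduction: keep only attributes whose
    population variance is above threshold."""
    columns = list(rows[0].keys())
    keep = [c for c in columns
            if pvariance([r[c] for r in rows]) > threshold]
    return [{c: r[c] for c in keep} for r in rows]


def equal_width_histogram(values, num_bins):
    """Numerosity reduction: summarize values as
    (low, high, count) buckets of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    if width == 0:
        # All values are identical, so one bucket describes them.
        return [(low, high, len(values))]
    counts = [0] * num_bins
    for v in values:
        # The maximum value falls into the last bucket.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i])
            for i in range(num_bins)]


rows = [{"age": 20, "country_code": 1},
        {"age": 35, "country_code": 1},
        {"age": 50, "country_code": 1}]
print(drop_low_variance(rows))
print(equal_width_histogram([1, 2, 3, 11, 12, 20], num_bins=2))
```

The histogram stores two buckets in place of six raw values, which is the trade the notes describe: less data, with the overall shape preserved.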
4. Data Transformation
- Converting data into a format suitable for mining.
- Normalization: Scaling data to a small range (e.g., 0 to 1).
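  - Min-Max Normalization: v' = (v - min) / (max - min), which maps the values onto 0 to 1.
  - Z-Score Normalization: v' = (v - mean) / standard deviation, which gives the values mean 0 and standard deviation 1.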
- Discretization: Converting continuous numbers into intervals (e.g., Age 0-10, 11-20).
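The transformations above might look like the following in plain Python. This is a minimal sketch under a few assumptions: the values are not all identical (otherwise the divisions fail), z-scores use the population standard deviation, and ages are whole numbers grouped as in the example (0-10, then 11-20, 21-30, and so on). The function names are illustrative.

```python
from statistics import mean, pstdev


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so the smallest maps to new_min
    and the largest maps to new_max."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min)
            for v in values]


def z_score_normalize(values):
    """Express each value as its distance from the mean,
    measured in (population) standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def age_interval(age):
    """Map a whole-number age to its interval label:
    0-10, 11-20, 21-30, ..."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 10:
        return "0-10"
    lower = ((age - 1) // 10) * 10 + 1
    return f"{lower}-{lower + 9}"


print(min_max_normalize([10, 20, 30]))
print(z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9]))
print([age_interval(a) for a in (0, 10, 11, 20, 21)])
```

Min-max scaling is sensitive to extreme values because they define the range, while z-scores are often preferred when the true minimum and maximum are unknown.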