# Data Preprocessing

**Data preprocessing** is a crucial step before mining. Real-world data is often dirty, incomplete, and inconsistent.

## Why Preprocess?

- **Accuracy**: Bad data leads to bad results.
- **Completeness**: Missing values can break algorithms.
- **Consistency**: Different representations of the same value (e.g., "USA" vs "U.S.A.") confuse analysis.

## Major Steps

### 1. Data Cleaning

- **Fill Missing Values**: Use the mean, the median, or a specified constant.
- **Smooth Noisy Data**: Reduce random error with techniques such as binning or regression.
- **Remove Outliers**: Detect and discard values that are clearly implausible.

### 2. Data Integration

- Combining data from multiple sources (databases, files).
- **Challenge**: Handling different names for the same attribute (e.g., "CustID" vs "CustomerID").

### 3. Data Reduction

- Producing a smaller representation of the data that preserves its analytical value.
- **Dimensionality Reduction**: Removing irrelevant or redundant attributes.
- **Numerosity Reduction**: Replacing raw data with more compact representations (e.g., histograms, sampling).

### 4. Data Transformation

- Converting data into a form suitable for mining.
- **Normalization**: Scaling values into a small range (e.g., 0 to 1).
  - *Min-Max Normalization*
  - *Z-Score Normalization*
- **Discretization**: Converting continuous values into intervals (e.g., Age 0-10, 11-20).
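## Worked Sketches

The missing-value step under Data Cleaning can be sketched as mean imputation. This is a minimal illustration, not the only strategy (the median or a constant also work); the function name is illustrative:

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    avg = mean(observed)
    return [avg if v is None else v for v in values]

ages = [25, None, 31, 40, None]
print(fill_missing_with_mean(ages))  # → [25, 32.0, 31, 40, 32.0]
```

The mean here is 32.0, so both missing ages are filled with that value. Mean imputation is simple but sensitive to outliers, which is one reason the median is sometimes preferred.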
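The naming challenge under Data Integration ("CustID" vs "CustomerID") is usually handled by mapping every source's column names onto one canonical schema before merging. A hypothetical sketch, with an illustrative mapping table:

```python
# Hypothetical schema-alignment table: each known source column name
# maps to one canonical name used after integration.
CANONICAL = {"CustID": "customer_id", "CustomerID": "customer_id", "Name": "name"}

def align(record):
    """Rename a record's keys to canonical names; unknown keys pass through."""
    return {CANONICAL.get(k, k): v for k, v in record.items()}

source_a = {"CustID": 1, "Name": "Ada"}
source_b = {"CustomerID": 1, "Name": "Ada"}
print(align(source_a) == align(source_b))  # → True
```

After alignment, records from both sources share the same keys and can be merged or deduplicated safely.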
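The two normalization methods named under Data Transformation can be sketched directly. Min-max rescales each value linearly into a target range; z-score centers on the mean and scales by the standard deviation (the population standard deviation is assumed here):

```python
from statistics import mean, pstdev

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_normalize(values):
    """Center on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

data = [10, 20, 30, 40, 50]
print(min_max_normalize(data))  # → [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_score_normalize(data))  # mean value maps to 0.0
```

Min-max preserves the shape of the distribution but is sensitive to extreme values, since the minimum and maximum define the scale; z-score is more robust when outliers are present.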
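Discretization into the age intervals mentioned above (0-10, 11-20, ...) can be sketched as equal-width binning; the function name and bin width are illustrative:

```python
def discretize_age(age, width=10):
    """Map a continuous age to an equal-width interval label, e.g. "0-10", "11-20"."""
    if age <= width:
        return f"0-{width}"
    lo = ((age - 1) // width) * width + 1  # lower bound of the bin containing age
    return f"{lo}-{lo + width - 1}"

print([discretize_age(a) for a in [7, 10, 11, 20, 34]])
# → ['0-10', '0-10', '11-20', '11-20', '31-40']
```

Equal-width binning is the simplest scheme; equal-frequency binning (each interval holds roughly the same number of records) is a common alternative when the data is skewed.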