DMCT-NOTES/unit 1/04_Data_Preprocessing.md
2025-11-24 16:55:19 +05:30

Data Preprocessing

Data preprocessing prepares raw data for mining. It matters because real-world data is often noisy, incomplete, and inconsistent.

Why Preprocess?

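  • Accuracy: Errors and noise in the input produce unreliable mining results.
  • Completeness: Missing values can cause algorithms to fail or can bias their output.
  • Consistency: The same value written in different formats (e.g., "USA" vs "U.S.A.") is treated as two distinct values.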

Major Steps

1. Data Cleaning

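  • Fill Missing Values: Replace them with the attribute mean (or the median, for skewed data) or with a fixed placeholder value.
  • Smooth Noisy Data: Reduce random errors with techniques such as binning or regression.
  • Handle Outliers: Detect values that deviate sharply from the rest of the data, then remove or correct those that turn out to be errors.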
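
The three cleaning tasks above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the function names and sample values are made up for the example, and the 1.5 × IQR rule used to flag outliers is one common convention that these notes do not prescribe.

```python
from statistics import mean, quantiles

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and
    replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        current_bin = ordered[start:start + bin_size]
        smoothed.extend([mean(current_bin)] * len(current_bin))
    return smoothed

def drop_iqr_outliers(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical sample data for illustration only.
ages = [25, None, 31, 28, None, 36]
filled_ages = fill_missing_with_mean(ages)      # [25, 30, 31, 28, 30, 36]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
# [9, 9, 9, 22, 22, 22, 29, 29, 29]

readings = [10, 12, 11, 13, 12, 95]
clean_readings = drop_iqr_outliers(readings)    # 95 is dropped
```

Filtering out a flagged value is only safe once it is known to be an error; a genuine extreme value may be exactly what the mining task is looking for.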

2. Data Integration

  • Combining data from multiple sources (databases, files).
  • Challenge: Handling different names for the same thing (e.g., "CustID" vs "CustomerID").
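
The naming challenge above can be handled by mapping each source-specific column name to one canonical name before joining. The sketch below assumes two small in-memory sources; the record contents, the `customer_id` canonical name, and the helper functions are all hypothetical.

```python
# Hypothetical records from two sources that name the key differently.
crm_rows = [
    {"CustID": 1, "name": "Asha"},
    {"CustID": 2, "name": "Ben"},
    {"CustID": 3, "name": "Chen"},
]
billing_rows = [
    {"CustomerID": 1, "total": 250.0},
    {"CustomerID": 2, "total": 99.5},
]

# Map every known alias to a single canonical column name.
COLUMN_ALIASES = {"CustID": "customer_id", "CustomerID": "customer_id"}

def rename_columns(row, aliases):
    """Return a copy of row with aliased column names replaced."""
    return {aliases.get(key, key): value for key, value in row.items()}

def integrate(left_rows, right_rows, key):
    """Inner-join two lists of records on a shared key."""
    right_by_key = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            merged.append({**row, **match})
    return merged

crm = [rename_columns(r, COLUMN_ALIASES) for r in crm_rows]
billing = [rename_columns(r, COLUMN_ALIASES) for r in billing_rows]
customers = integrate(crm, billing, key="customer_id")
# Customer 3 has no billing record, so the inner join leaves it out.
```

An inner join is used here only to keep the example short. Whether unmatched records should be kept (an outer join) depends on the mining task.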

3. Data Reduction

  • Reducing the size of the data while keeping the important parts.
  • Dimensionality Reduction: Removing irrelevant or redundant attributes.
  • Numerosity Reduction: Replacing raw data with smaller representations (e.g., histograms).
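
Both kinds of reduction can be illustrated with a short sketch. Dropping attributes whose variance is zero is just one simple, assumed criterion for "unimportant"; the notes do not name a specific method. The histogram uses equal-width bins. All names and sample values are hypothetical.

```python
from statistics import pvariance

def drop_low_variance_attributes(rows, threshold=0.0):
    """Dimensionality reduction: keep only numeric attributes whose
    population variance is greater than threshold."""
    keep = [name for name in rows[0]
            if pvariance([row[name] for row in rows]) > threshold]
    return [{name: row[name] for name in keep} for row in rows]

def equal_width_histogram(values, bin_width):
    """Numerosity reduction: summarise values as {bin_start: count}."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical records: country_code is constant, so it carries no information.
rows = [
    {"age": 23, "country_code": 91, "income": 30},
    {"age": 35, "country_code": 91, "income": 52},
    {"age": 41, "country_code": 91, "income": 47},
]
reduced_rows = drop_low_variance_attributes(rows)

prices = [1, 1, 5, 5, 5, 8, 10, 10, 12, 14, 14, 15, 18, 20, 21, 25]
histogram = equal_width_histogram(prices, bin_width=10)
# {0: 6, 10: 7, 20: 3}: sixteen values stored as three counts
```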

4. Data Transformation

  • Converting data into a format suitable for mining.
  • Normalization: Scaling data to a small range (e.g., 0 to 1).
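    • Min-Max Normalization: Rescales values linearly into a target range such as 0 to 1.
    • Z-Score Normalization: Rescales values so that they have mean 0 and standard deviation 1.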
  • Discretization: Converting continuous numbers into intervals (e.g., Age 0-10, 11-20).
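
The two normalization methods and the age discretization can be sketched as follows. The formulas in the docstrings are the standard definitions. The use of the population standard deviation, the function names, and the sample values are assumptions made for the example. The age bins follow the intervals given above (0-10, 11-20, and so on) and assume whole-number ages.

```python
from statistics import mean, pstdev

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """v' = (v - min) / (max - min) * (new_max - new_min) + new_min"""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization needs two distinct values")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score_normalize(values):
    """v' = (v - mean) / std, using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-score normalization needs non-constant values")
    return [(v - mu) / sigma for v in values]

def discretize_age(age, width=10):
    """Map a non-negative integer age to an interval label:
    0-10, 11-20, 21-30, ..."""
    if age <= width:
        return f"0-{width}"
    lower = ((age - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"

# Hypothetical sample data for illustration only.
incomes = [20, 40, 60, 100]
scaled = min_max_normalize(incomes)          # [0.0, 0.25, 0.5, 1.0]

scores = [2, 4, 4, 4, 5, 5, 7, 9]            # mean 5, std 2
standardized = z_score_normalize(scores)     # [-1.5, -0.5, ..., 1.0, 2.0]

labels = [discretize_age(a) for a in [0, 10, 11, 20, 21, 37]]
# ['0-10', '0-10', '11-20', '11-20', '21-30', '31-40']
```

Min-max normalization is sensitive to extreme values, because a single outlier sets `min` or `max` and compresses everything else. Z-score normalization is usually the safer choice when the true minimum and maximum are unknown.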