# K-Nearest Neighbors (KNN)

**KNN** is a simple algorithm used for classification and regression.

## How it Works

1. Store all training data.
2. When a new data point comes in, find the **K** closest points (neighbors) to it.
3. **Vote**: Assign the class that is most common among those K neighbors (for regression, average the neighbors' values instead).

## Key Characteristics

- **Instance-based Learning**: Uses training instances directly to predict.
- **Lazy Learning**: It doesn't "learn" a model during training; it waits until a prediction is needed.
- **Non-Parametric**: It makes no assumptions about the underlying data distribution.

## Choosing K

- **Small K**: Can be noisy and overfit (sensitive to outliers).
- **Large K**: Can over-smooth and miss local patterns (underfit).
- **Tip**: Choose an **odd number** for K to avoid ties in voting (this guarantees no ties only for binary classification).

## Distance Measures

How do we measure "closeness"?

### 1. Euclidean Distance

- The straight-line distance between two points.
- Used for numeric data.
- Formula: `sqrt((x2-x1)^2 + (y2-y1)^2)`

### 2. Manhattan Distance

- The distance if you can only move along a grid (like city blocks).
- Formula: `|x2-x1| + |y2-y1|`

### 3. Minkowski Distance

- A generalized form of Euclidean and Manhattan.
- Formula: `(|x2-x1|^p + |y2-y1|^p)^(1/p)` (with `p = 1` this is Manhattan, with `p = 2` Euclidean).

### 4. Chebyshev Distance

- The greatest difference along any coordinate dimension.
- Formula: `max(|x2-x1|, |y2-y1|)`

## Data Scaling

Since KNN uses distance, it is very sensitive to the scale of the data.

- **Example**: If one feature ranges 0-1 and another 0-1000, the second will dominate the distance.
- **Solution**: **Normalize** or **standardize** the data so all features contribute equally.
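
## Example: KNN from Scratch

The steps and distance measures above can be made concrete with a short, self-contained sketch in pure Python. Everything here (the function names such as `knn_predict`, the toy dataset, the choice of `k`) is illustrative and not taken from any library.

```python
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two equal-length numeric vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute differences per dimension.
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p=2):
    # Generalized distance: p=1 gives Manhattan, p=2 gives Euclidean.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    # Greatest difference along any single dimension.
    return max(abs(x - y) for x, y in zip(a, b))

def knn_predict(X_train, y_train, query, k=3, distance=euclidean):
    # 1. Compute the distance from the query to every training point.
    distances = [(distance(x, query), label) for x, label in zip(X_train, y_train)]
    # 2. Take the K closest neighbors.
    neighbors = sorted(distances, key=lambda d: d[0])[:k]
    # 3. Vote: return the most common class among those neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Usage with a tiny made-up dataset (two features, two classes).
X_train = [[1.0, 1.1], [1.2, 0.9], [6.0, 6.2], [5.8, 6.1]]
y_train = ["A", "A", "B", "B"]
print(knn_predict(X_train, y_train, [1.1, 1.0], k=3))  # -> "A"
```

Note that all the work happens at prediction time (the "lazy learning" point above): training is just storing `X_train` and `y_train`.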
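
## Example: Min-Max Scaling

Min-max normalization is one common way to rescale features before computing distances. This is a minimal sketch under the same assumptions as above: `min_max_normalize` and the toy data are made-up names, and in practice a library scaler (e.g. scikit-learn's `MinMaxScaler`) would typically be used instead.

```python
def min_max_normalize(X):
    # X is a list of rows; returns a copy with every feature rescaled to [0, 1],
    # so a feature with a large raw range (e.g. 0-1000) no longer dominates the distance.
    cols = list(zip(*X))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in X
    ]

# One feature in 0-1, the other in 0-1000: before scaling, the distance is driven
# almost entirely by the second feature; after scaling, both contribute equally.
X = [[0.2, 150.0], [0.9, 900.0], [0.5, 480.0]]
print(min_max_normalize(X))
```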