# K-Nearest Neighbors (KNN)
**KNN** is a simple supervised learning algorithm used for both classification and regression: it predicts the label of a new point from the labels of its closest training examples.
## How it Works
1. Store all training data.
2. When a new data point comes in, find the **K** closest points (neighbors) to it.
3. **Vote**: Assign the class that is most common among those K neighbors (for regression, average their values); see the sketch below.
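A minimal NumPy sketch of these three steps, assuming numeric features, Euclidean distance, and a simple majority vote (the function name, toy arrays, and `k` value are illustrative, not a reference implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by a majority vote among its k nearest training points."""
    # Step 1: the "model" is just the stored training data (X_train, y_train).
    # Step 2: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]  # indices of the k closest points
    # Step 3: majority vote among the neighbors' labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative toy data: two features, two classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # expected: 0
```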
## Key Characteristics
- **Instance-based Learning**: Uses training instances directly to predict.
- **Lazy Learning**: It doesn't "learn" a model during training. It waits until a prediction is needed.
- **Non-Parametric**: It makes no assumptions about the underlying data distribution (no fixed functional form is learned).
## Choosing K
- **Small K**: Can be noisy and overfit (sensitive to outliers).
- **Large K**: Smooths the decision boundary and can underfit, missing local patterns.
- **Tip**: Choose an **odd number** for K to avoid ties in binary voting; in practice, K is often selected by cross-validation (see the sketch below).
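A hedged sketch of picking K by cross-validation, assuming scikit-learn is available; the iris dataset and the candidate range of odd values are placeholders for your own data and search range:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

# Evaluate odd K values with 5-fold cross-validation and keep the best one.
scores = {}
for k in range(1, 22, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```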
## Distance Measures
How do we measure "closeness"?
### 1. Euclidean Distance
- The straight-line distance between two points.
- Used for numeric data.
- Formula: `sqrt((x2-x1)^2 + (y2-y1)^2)`
### 2. Manhattan Distance
- The distance if you can only move along a grid (like city blocks).
- Formula: `|x2-x1| + |y2-y1|`
### 3. Minkowski Distance
- A generalized form of Euclidean and Manhattan.
- Formula: `(|x2-x1|^p + |y2-y1|^p)^(1/p)`; `p = 1` gives Manhattan, `p = 2` gives Euclidean.
### 4. Chebyshev Distance
- The greatest difference along any single coordinate dimension.
- Formula: `max(|x2-x1|, |y2-y1|)`
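For concreteness, the four measures above can be written as small NumPy functions (a sketch; `a` and `b` are assumed to be numeric feature vectors of the same length):

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))          # straight-line distance

def manhattan(a, b):
    return np.sum(np.abs(a - b))                  # city-block distance

def minkowski(a, b, p=3):
    return np.sum(np.abs(a - b) ** p) ** (1 / p)  # p=1 -> Manhattan, p=2 -> Euclidean

def chebyshev(a, b):
    return np.max(np.abs(a - b))                  # largest per-coordinate difference

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b), manhattan(a, b), chebyshev(a, b))  # 5.0 7.0 4.0
```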
## Data Scaling
Since KNN uses distance, it is very sensitive to the scale of data.
- **Example**: If one feature ranges 0-1 and another 0-1000, the second one will dominate the distance.
- **Solution**: **Normalize** or **Standardize** the data so all features contribute equally (see the sketch below).
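A sketch of both options using scikit-learn's scalers; the toy matrix is illustrative, and the same transforms can be done by hand with min/max or mean/std:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy matrix: the second feature's large range would dominate raw distances.
X = np.array([[0.2, 100.0],
              [0.5, 900.0],
              [0.9, 400.0]])

X_norm = MinMaxScaler().fit_transform(X)   # normalize: each feature rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)  # standardize: zero mean, unit variance

# In practice, fit the scaler on the training split only, then transform both splits.
print(X_norm)
print(X_std)
```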