K-Nearest Neighbors (KNN)
KNN is a simple algorithm used for classification and regression.
How it Works
- Store all training data.
- When a new data point comes in, find the K closest points (neighbors) to it.
- Vote: Assign the class that is most common among those K neighbors; for regression, average the neighbors' values instead (see the sketch below).
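A minimal sketch of these three steps in Python, using Euclidean distance and a simple majority vote (the toy arrays and the function name knn_predict are illustrative, not from any particular library):

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Distance from the new point to every stored training point (Euclidean)
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the K closest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy training set: two features, two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # -> 0
print(knn_predict(X_train, y_train, np.array([5.1, 5.1])))  # -> 1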
Key Characteristics
- Instance-based Learning: Uses training instances directly to predict.
- Lazy Learning: It doesn't "learn" a model during training. It waits until a prediction is needed.
- Non-Parametric: It makes no assumptions about the underlying distribution or functional form of the data.
Choosing K
- Small K: The decision boundary is noisy and tends to overfit (sensitive to outliers).
- Large K: The decision boundary is overly smooth and tends to underfit (misses local patterns).
- Tip: For binary classification, choose an odd K to avoid ties in voting. In practice, K is usually tuned by validating several candidate values, as sketched below.
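One common way to tune K is cross-validation. A sketch using scikit-learn (assumed available; the Iris dataset and the range of candidate K values are only placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd K values and keep the one with the best cross-validated accuracy
best_k, best_score = None, 0.0
for k in range(1, 22, 2):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(f"Best K = {best_k}, cross-validated accuracy = {best_score:.3f}")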
Distance Measures
How do we measure "closeness"?
1. Euclidean Distance
- The straight-line distance between two points.
- Used for numeric data.
- Formula:
sqrt((x2-x1)^2 + (y2-y1)^2)
2. Manhattan Distance
- The distance if you can only move along a grid (like city blocks).
- Formula:
|x2-x1| + |y2-y1|
3. Minkowski Distance
- A generalized form of Euclidean and Manhattan with an order parameter p.
- Formula:
(|x2-x1|^p + |y2-y1|^p)^(1/p)
- p = 1 gives Manhattan distance; p = 2 gives Euclidean distance.
4. Chebyshev Distance
- The greatest difference along any single coordinate dimension.
- Formula:
max(|x2-x1|, |y2-y1|)
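The four measures side by side on a pair of 2-D points (the values are chosen only for illustration):

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(((a - b) ** 2).sum())          # straight-line distance: 5.0
manhattan = np.abs(a - b).sum()                    # city-block distance: 3 + 4 = 7.0
minkowski = (np.abs(a - b) ** 3).sum() ** (1 / 3)  # p = 3; p=1 is Manhattan, p=2 is Euclidean
chebyshev = np.abs(a - b).max()                    # largest per-coordinate difference: 4.0

print(euclidean, manhattan, minkowski, chebyshev)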
Data Scaling
Since KNN uses distance, it is very sensitive to the scale of data.
- Example: If one feature ranges 0-1 and another 0-1000, the second one will dominate the distance.
- Solution: Normalize or Standardize data so all features contribute equally.
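A minimal sketch of both options with NumPy (the feature values are made up to mirror the example above):

import numpy as np

# Two features on very different scales: the second would dominate raw distances
X = np.array([[0.2, 300.0],
              [0.8, 950.0],
              [0.5, 120.0]])

# Normalization (min-max): rescale each feature to the [0, 1] range
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization (z-score): zero mean and unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm)
print(X_std)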