
K-Nearest Neighbors (KNN)

KNN is a simple supervised learning algorithm used for both classification and regression.

How it Works

  1. Store all training data.
  2. When a new data point comes in, find the K closest points (neighbors) to it.
  3. Vote: Assign the class that is most common among those K neighbors (for regression, average the neighbors' values instead).
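
As a concrete illustration of the three steps, here is a minimal classification sketch in plain Python; the tiny 2-D dataset, the `knn_predict` and `euclidean` helper names, and k = 3 are all made up for the example:

```python
from collections import Counter
import math

def euclidean(a, b):
    # straight-line distance between two points of equal dimension
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # step 1: the training data is simply stored (lazy learning)
    # step 2: find the k training points closest to the query
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: euclidean(pair[0], query))[:k]
    # step 3: vote -- the most common label among the k neighbors wins
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

# toy 2-D dataset, made up for illustration
X = [(1, 1), (1, 2), (2, 1), (6, 5), (7, 7), (6, 6)]
y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X, y, query=(2, 2), k=3))  # -> "A"
```

Storing the whole training set and scanning it at query time is exactly what makes KNN "lazy"; real libraries speed the neighbor search up with structures such as k-d trees.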

Key Characteristics

  • Instance-based Learning: Predictions come directly from the stored training instances rather than from a fitted model.
  • Lazy Learning: It doesn't "learn" a model during training; all the work is deferred until a prediction is needed.
  • Non-Parametric: It makes no assumption about the underlying distribution of the data.

Choosing K

  • Small K: Predictions are noisy and prone to overfitting (sensitive to outliers).
  • Large K: Predictions are overly smoothed (biased) and can miss local patterns.
  • Tip: For binary classification, choose an odd K to avoid ties in voting.
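
In practice, a reasonable way to choose K is to try several values and keep the one that scores best on held-out data. A minimal sketch, assuming scikit-learn is installed and using its bundled iris dataset purely as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# try odd values of K and keep the best cross-validated accuracy
scores = {}
for k in range(1, 20, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```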

Distance Measures

How do we measure "closeness"?

1. Euclidean Distance

  • The straight-line distance between two points.
  • Used for numeric data.
  • Formula (two dimensions): sqrt((x2-x1)^2 + (y2-y1)^2)

2. Manhattan Distance

  • The distance if you can only move along a grid (like city blocks).
  • Formula: |x2-x1| + |y2-y1|

3. Minkowski Distance

  • A generalized form of Euclidean and Manhattan.
  • Formula: (|x2-x1|^p + |y2-y1|^p)^(1/p), where p = 1 gives Manhattan and p = 2 gives Euclidean.

4. Chebyshev Distance

  • The greatest difference along any single coordinate dimension.
  • Formula: max(|x2-x1|, |y2-y1|)
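
The four measures differ only in how the coordinate differences are combined. A minimal sketch of each as a plain Python function over two equal-length points (the example points are made up):

```python
def manhattan(a, b):
    # sum of absolute coordinate differences (city-block distance)
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # straight-line distance
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def minkowski(a, b, p):
    # generalized form: p = 1 gives Manhattan, p = 2 gives Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    # largest single-coordinate difference (the limit of Minkowski as p grows)
    return max(abs(x - y) for x, y in zip(a, b))

p1, p2 = (1, 2), (4, 6)
print(manhattan(p1, p2))     # 7
print(euclidean(p1, p2))     # 5.0
print(minkowski(p1, p2, 3))  # ~4.498
print(chebyshev(p1, p2))     # 4
```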

Data Scaling

Since KNN uses distance, it is very sensitive to the scale of data.

  • Example: If one feature ranges 0-1 and another 0-1000, the second one will dominate the distance.
  • Solution: Normalize (e.g. min-max scale to [0, 1]) or standardize (zero mean, unit variance) the data so all features contribute equally.
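
To make the example above concrete, here is a minimal sketch, assuming scikit-learn is available, that min-max scales two features with very different ranges so both end up in [0, 1]; the sample values are made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# feature 1 ranges roughly 0-1, feature 2 roughly 0-1000
X = np.array([[0.2, 150.0],
              [0.9, 900.0],
              [0.5, 300.0]])

# without scaling, distances between rows are dominated by the second column
scaler = MinMaxScaler()              # rescales each feature to [0, 1]
print(scaler.fit_transform(X))

# StandardScaler (zero mean, unit variance) is the common alternative
print(StandardScaler().fit_transform(X))
```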