K-Nearest Neighbors (KNN)
KNN is a simple algorithm used for classification and regression.
How it Works
- Store all training data.
- When a new data point comes in, find the K closest points (neighbors) to it.
- Vote: Assign the class that is most common among those K neighbors; for regression, average the neighbors' values instead (see the sketch below).
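A minimal sketch of these three steps in Python, using Euclidean distance and a simple majority vote (the toy arrays and the function name knn_predict are illustrative, not from any particular library):

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Distance from the new point to every stored training point (Euclidean)
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the K closest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy training set: two features, two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # -> 0
print(knn_predict(X_train, y_train, np.array([5.1, 5.1])))  # -> 1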
Key Characteristics
- Instance-based Learning: Uses training instances directly to predict.
- Lazy Learning: It doesn't "learn" a model during training. It waits until a prediction is needed.
- Non-Parametric: It makes no assumptions about the underlying distribution or functional form of the data.
Choosing K
- Small K: The decision boundary is noisy and tends to overfit (sensitive to outliers).
- Large K: The decision boundary is overly smooth and tends to underfit (misses local patterns).
- Tip: For binary classification, choose an odd K to avoid ties in voting. In practice, K is usually tuned by validating several candidate values, as sketched below.
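One common way to tune K is cross-validation. A sketch using scikit-learn (assumed available; the Iris dataset and the range of candidate K values are only placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd K values and keep the one with the best cross-validated accuracy
best_k, best_score = None, 0.0
for k in range(1, 22, 2):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(f"Best K = {best_k}, cross-validated accuracy = {best_score:.3f}")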
Distance Measures
How do we measure "closeness"?
1. Euclidean Distance
- The straight-line distance between two points.
- Used for numeric data.
- Formula:
sqrt((x2-x1)^2 + (y2-y1)^2)
2. Manhattan Distance
- The distance if you can only move along a grid (like city blocks).
- Formula:
|x2-x1| + |y2-y1|
3. Minkowski Distance
- A generalized form of Euclidean and Manhattan with an order parameter p.
- Formula:
(|x2-x1|^p + |y2-y1|^p)^(1/p)
- p = 1 gives Manhattan distance; p = 2 gives Euclidean distance.
4. Chebyshev Distance
- The greatest difference along any single coordinate dimension.
- Formula:
max(|x2-x1|, |y2-y1|)
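The four measures side by side on a pair of 2-D points (the values are chosen only for illustration):

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(((a - b) ** 2).sum())          # straight-line distance: 5.0
manhattan = np.abs(a - b).sum()                    # city-block distance: 3 + 4 = 7.0
minkowski = (np.abs(a - b) ** 3).sum() ** (1 / 3)  # p = 3; p=1 is Manhattan, p=2 is Euclidean
chebyshev = np.abs(a - b).max()                    # largest per-coordinate difference: 4.0

print(euclidean, manhattan, minkowski, chebyshev)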
Data Scaling
Since KNN uses distance, it is very sensitive to the scale of data.
- Example: If one feature ranges 0-1 and another 0-1000, the second one will dominate the distance.
- Solution: Normalize or Standardize data so all features contribute equally.
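A minimal sketch of both options with NumPy (the feature values are made up to mirror the example above):

import numpy as np

# Two features on very different scales: the second would dominate raw distances
X = np.array([[0.2, 300.0],
              [0.8, 950.0],
              [0.5, 120.0]])

# Normalization (min-max): rescale each feature to the [0, 1] range
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization (z-score): zero mean and unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm)
print(X_std)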