# K-Nearest Neighbors (KNN)
**KNN** is a simple supervised learning algorithm used for both classification and regression: it predicts the label of a new point from the labels of its closest training examples.
## How it Works
1. Store all training data.
2. When a new data point comes in, find the **K** closest points (neighbors) to it.
3. **Vote**: Assign the class that is most common among those K neighbors (for regression, average their values); see the sketch below.
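A minimal NumPy sketch of these three steps, assuming numeric features, Euclidean distance, and a simple majority vote (the function name, toy arrays, and `k` value are illustrative, not a reference implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by a majority vote among its k nearest training points."""
    # Step 1: the "model" is just the stored training data (X_train, y_train).
    # Step 2: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]  # indices of the k closest points
    # Step 3: majority vote among the neighbors' labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative toy data: two features, two classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # expected: 0
```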
## Key Characteristics
- **Instance-based Learning**: Uses training instances directly to predict.
- **Lazy Learning**: It doesn't "learn" a model during training. It waits until a prediction is needed.
- **Non-Parametric**: It makes no assumptions about the underlying data distribution (no fixed functional form is learned).
## Choosing K
- **Small K**: Can be noisy and overfit (sensitive to outliers).
- **Large K**: Smooths the decision boundary and can underfit, missing local patterns.
- **Tip**: Choose an **odd number** for K to avoid ties in binary voting; in practice, K is often selected by cross-validation (see the sketch below).
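A hedged sketch of picking K by cross-validation, assuming scikit-learn is available; the iris dataset and the candidate range of odd values are placeholders for your own data and search range:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

# Evaluate odd K values with 5-fold cross-validation and keep the best one.
scores = {}
for k in range(1, 22, 2):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```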
## Distance Measures
How do we measure "closeness"?
### 1. Euclidean Distance
- The straight-line distance between two points.
- Used for numeric data.
- Formula: `sqrt((x2-x1)^2 + (y2-y1)^2)`
### 2. Manhattan Distance
- The distance if you can only move along a grid (like city blocks).
- Formula: `|x2-x1| + |y2-y1|`
### 3. Minkowski Distance
- A generalized form of Euclidean and Manhattan.
- Formula: `(|x2-x1|^p + |y2-y1|^p)^(1/p)`; `p = 1` gives Manhattan, `p = 2` gives Euclidean.
### 4. Chebyshev Distance
- The greatest difference along any single coordinate dimension.
- Formula: `max(|x2-x1|, |y2-y1|)`
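For concreteness, the four measures above can be written as small NumPy functions (a sketch; `a` and `b` are assumed to be numeric feature vectors of the same length):

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))          # straight-line distance

def manhattan(a, b):
    return np.sum(np.abs(a - b))                  # city-block distance

def minkowski(a, b, p=3):
    return np.sum(np.abs(a - b) ** p) ** (1 / p)  # p=1 -> Manhattan, p=2 -> Euclidean

def chebyshev(a, b):
    return np.max(np.abs(a - b))                  # largest per-coordinate difference

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b), manhattan(a, b), chebyshev(a, b))  # 5.0 7.0 4.0
```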
## Data Scaling
Since KNN uses distance, it is very sensitive to the scale of data.
- **Example**: If one feature ranges 0-1 and another 0-1000, the second one will dominate the distance.
- **Solution**: **Normalize** or **Standardize** the data so all features contribute equally (see the sketch below).
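A sketch of both options using scikit-learn's scalers; the toy matrix is illustrative, and the same transforms can be done by hand with min/max or mean/std:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy matrix: the second feature's large range would dominate raw distances.
X = np.array([[0.2, 100.0],
              [0.5, 900.0],
              [0.9, 400.0]])

X_norm = MinMaxScaler().fit_transform(X)   # normalize: each feature rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)  # standardize: zero mean, unit variance

# In practice, fit the scaler on the training split only, then transform both splits.
print(X_norm)
print(X_std)
```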