DMCT-NOTES/unit 2/03_Logistic_Regression.md
Akshat Mehta 8f8e35ae95 unit 2 added
2025-11-24 15:26:41 +05:30

# Logistic Regression
Logistic Regression is used for **classification** problems (predicting categories), even though it has "Regression" in its name.
## Odds vs Probability
### Probability
- The chance of an event happening out of **all** possibilities.
- **Formula**: `Probability = (Events in favour) / (Total observations)`
- Range: 0 to 1.
### Odds
- The ratio of events **happening** to events **not happening**.
- **Formula**: `Odds = (Events in favour) / (Events NOT in favour)`
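The two formulas above are linked by `odds = p / (1 - p)`. A minimal sketch with invented numbers (8 of 10 hypothetical ponds contain fish) shows both side by side:

```python
# Probability vs odds for the same event (toy numbers, for illustration only).
favour = 8    # events in favour
total = 10    # total observations

probability = favour / total       # events in favour / all observations
odds = favour / (total - favour)   # events in favour / events NOT in favour

print(probability)  # 0.8
print(odds)         # 4.0

# Consistency check: odds = p / (1 - p)
assert abs(odds - probability / (1 - probability)) < 1e-9
```

Note the different ranges: probability stays in [0, 1], while odds can grow without bound.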
### Log of Odds (Logit)
- We use the **Log of Odds** because raw odds are asymmetric: odds against an event lie between 0 and 1, while odds in favour run from 1 to infinity.
- Taking the log maps odds onto the whole real line and makes both directions symmetric around 0 (e.g., odds of 4 and 1/4 become +1.39 and -1.39).
- This **logit** function is what links probability to the linear model in Logistic Regression.
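A short sketch of the logit, showing the symmetry around 0 described above (the probabilities 0.9 and 0.1 are arbitrary illustration values):

```python
import math

def logit(p):
    """Log of odds for a probability p in (0, 1)."""
    return math.log(p / (1 - p))

# Odds for p = 0.9 are 9.0; for p = 0.1 they are ~0.111 — very different magnitudes.
# On the log scale the two become symmetric around 0:
print(logit(0.9))  # ≈ +2.197
print(logit(0.1))  # ≈ -2.197
```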
## The Sigmoid Function
Linear regression fits a straight line, which doesn't work well for classification (where we want output between 0 and 1).
Logistic regression uses an **S-shaped curve** called the **Sigmoid Function**.
- **Formula**: `S(z) = 1 / (1 + e^-z)`
- **Output**: Always between **0 and 1**.
- **Usage**:
  - If output > Threshold (e.g., 0.5) -> Classify as **Positive** (e.g., presence of fish).
  - If output <= Threshold -> Classify as **Negative** (e.g., absence of fish).
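The sigmoid formula and the thresholding rule above can be sketched directly (the function and label names here are illustrative, not from a library):

```python
import math

def sigmoid(z):
    """S(z) = 1 / (1 + e^-z): squashes any real z into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def classify(z, threshold=0.5):
    """Apply the decision rule: above the threshold -> Positive, else Negative."""
    return "Positive" if sigmoid(z) > threshold else "Negative"

print(sigmoid(0))      # 0.5 — the midpoint of the S-curve
print(classify(2.0))   # Positive (sigmoid(2) ≈ 0.88)
print(classify(-2.0))  # Negative (sigmoid(-2) ≈ 0.12)
```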
## Assumptions of Logistic Regression
1. **Independence of Errors**: Observations are independent of one another (e.g., no repeated measurements of the same subject).
2. **Linearity in the Logit**: Relationship between independent variables and log-odds is linear.
3. **Absence of Multicollinearity**: Independent variables should not be highly correlated with each other.
4. **No Strong Outliers**: Extreme values should not heavily influence the model.
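Putting the pieces together, a minimal end-to-end sketch using scikit-learn (assumed available); the one-feature dataset below is invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one feature (say, water depth) and a binary label
# (1 = fish present, 0 = fish absent). Values chosen to be cleanly separable.
X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(class 0), P(class 1)]; predict applies the
# default 0.5 threshold on the sigmoid output.
print(model.predict_proba([[2.0]]))  # mostly class 0 (absence)
print(model.predict([[9.0]]))        # class 1 (presence)
```

Under the hood the model fits a line in the log-odds (logit) and passes it through the sigmoid, exactly as in the sections above.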