DataScienceToday is a new online platform and a go-to source for data-science-related content for its influential audience around the world. You can reach us via email:

contact@datasciencetoday.net

datasciencetoday.net@gmail.com

Made by Bouchra ZEITANE and Zaineb MOUNIR,

under the supervision of Pr. Habib BENLAHMAR.

## 1- Definition:

- kNN is a simple classification technique.
- It labels a sample based on its neighbors.
- kNN assumption (the "duck test"):
  - Samples with similar input values should be labeled with the same target label, so the classification of a sample depends on the target labels of the neighboring points.

- How kNN works:
  - Use the labels of the neighboring samples to determine the label for a new point.

- What is k:
  - The value of k determines the number of closest neighbors to consider.
  - Majority vote is commonly used, so the label held by the majority of the neighbors becomes the label of the new sample.
  - Tie-breaking rule, e.g.: the label of the closest neighbor is used, or the label is chosen randomly among the tied neighbors.
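The voting procedure above can be sketched in a few lines of Python. This is an illustration rather than code from the article; the function name and the toy data are made up for the example, and ties in distance are broken by index order:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Euclidean distance from the new point to every training sample
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.1, 0.9]), k=3))  # → 0 (two of the three nearest neighbors are class 0)
```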

- Distance measure:
  - A measure is needed to determine "closeness" (the distance between samples).
  - Distance measures that can be used:
    - Euclidean distance
    - Manhattan distance
    - Hamming distance
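As an illustration (not from the article), the three measures can be written directly in Python, assuming the samples are numeric feature vectors (Hamming distance is typically used for categorical or binary features):

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # Sum of absolute coordinate differences ("city block" distance)
    return np.sum(np.abs(a - b))

def hamming(a, b):
    # Number of positions where the two vectors differ
    return np.sum(a != b)

a = np.array([0.0, 3.0])
b = np.array([4.0, 0.0])
print(euclidean(a, b))  # → 5.0
print(manhattan(a, b))  # → 7.0
```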

## 2- Advantages and disadvantages:

a- Advantages:

- No separate training phase.
- No separate step where a model is constructed and its parameters are adjusted.
- Can generate complex decision boundaries.
- Effective when the training data is large.

b- Disadvantages:

- Can be slow:
  - The distance between the new sample and all training samples must be computed to classify it.

- The values of its parameters (such as k) must be determined.
- Issues that affect the performance of kNN:
  - The choice of k:
    - If k is too small, the result can be sensitive to noise points.
    - If k is too large, the neighborhood may include too many points from other classes.

  - The approach to combining the class labels:
    - Take a majority vote: this can be a problem if the nearest neighbors vary widely in their distances and the closer neighbors more reliably indicate the class of the object.
    - Weight each neighbor's vote by its distance (usually much less sensitive to the choice of k).
  - The choice of the distance measure: some distance measures can be affected by the high dimensionality of the data, and attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes.

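Distance-weighted voting can be sketched as follows. This is an illustrative implementation, not code from the article; it assumes an inverse-squared-distance weight, which is one common choice:

```python
import numpy as np
from collections import defaultdict

def knn_weighted_predict(X_train, y_train, x_new, k=3):
    """Distance-weighted kNN: each of the k nearest neighbors votes with weight 1/d^2."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        # Closer neighbors get a larger weight; the epsilon avoids division by zero
        votes[y_train[i]] += 1.0 / (dists[i] ** 2 + 1e-12)
    return max(votes, key=votes.get)

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y = np.array([0, 0, 1, 1])
print(knn_weighted_predict(X, y, np.array([1.1, 0.9]), k=3))  # → 0
```

With this weighting, a distant neighbor contributes almost nothing to the vote, which is why the result is usually much less sensitive to the exact value of k.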

Efficient nearest-neighbor search techniques, which are particularly applicable for low-dimensional data, can help reduce this computational cost without affecting classification accuracy.

kNN is particularly well suited for multi-modal classes, as well as for applications in which an object can have many class labels.

**Author: LASSRI Safae**

**PhD at the Faculty of Sciences Ben M'Sik.**