G4G0-2/AI & Data Mining/Week 4/Lecture 7 - Nearest Neighbor.md

- Instance Based
- Solution to new problem is solution to closest example
- Must be able to measure distance between pair of examples
- Normally euclidean distance

# Normalisation of Numeric Attributes

- Attributes measured on different scales
	- Larger scales have higher impacts
	- Must normalise (transform to scale [0, 1])

# $a_i = \frac{v_i - minv_i}{maxv_i - minv_i}$

Where:
- $a_i$ is normalised value for attribute $i$
- $v_i$ is the current value for attribute $i$
- $maxv_i$ is largest value of attribute $i$
- $minv_i$ is smallest value of attribute $i$

## Example

 # $maxv_{humidity} = 96$
# $minv_{humidity} = 65$
# $v_{humidity} = 80.5$

# $a_i = \frac{80.5-65}{96-55} = \frac{15.5}{31} = 0.5$

## Example (Transport Dataset)

# $maxv_{doors} = 5$
# $minv_{doors} = 2$
# $v_{doors} = 3$
# $a_i = \frac{3-2}{5-2} = \frac{1}{3}$

# Nearest Neighbor Applied (Transport Dataset)

- Last row is new vehicle to be classified
- N denotes normalised
- Right most column shows euclidean distances between each vehicle and new vehicle
- New vehicle is closest to the 1st example, a taxi, NN predicts taxi
![](Pasted%20image%2020241010133818.png)
# $vmin_{doors} = 2$
# $vmax_{doors} = 5$
# $vmin_{seats} = 7$
# $vmax_{seats} = 65$

# Missing Values

## Missing Nominal Values

- Assume missing feature is maximally different from any other value
- Distance is:
	- 0 if identical and not missing
	- 1 if otherwise

## Missing Numeric Values

- 1 if both missing
- Assume maximum distance if one missing. Largest of:
	- (normalised) size of known value or
	- 1 - (normalised) size of known value

## Example (Weather Data)

- Humidity of one example = 76
- Normalised = 0.36
- One missing
- Max distance = 1 - 0.36 = 0.64

## Example (Transport Data)

- Number of seats of one example = 16
- Normalised = 9/58
- One missing
- 1 - 9/58  = 49/58

## Normalised Transport Data with Missing Values

- Last row to be classified
- N denotes normalised
- Right most column is euclidean values
![](Pasted%20image%2020241010135130.png)

# Definitions of Proximity

## Euclidean Distance

# $\sqrt{(a_1-a_1')^2) + (a_2-a_2')^2 + ... + (a_n-a_n')^2}$

Where $a$ and $a'$ are two examples with $n$ attributes and $a'$ is the value of attribute $i$ for $a$

## Manhattan Distance

# $|a_1-a_1'|+|a_2-a_2'|+...+|a_n-a_n'|$

Vertical bar means absolute value
Negative becomes positive

Another distance measure could be cube root of sum of cubes.
Higher the power, greater influence of large differences
Euclidean distance is generally a good compromise

# Problems with Nearest Neighbor

- Slow since every example must be compared with new
- Assumes all attributes are equal
	- Only use important attributes to compute distance
	- Weight attributes according to importance
- Does not detect noise
	- Use k-NN, get k closest examples and take majority vote on solutions
![](Pasted%20image%2020241011131542.png)