113 lines
2.8 KiB
Markdown
Executable File
113 lines
2.8 KiB
Markdown
Executable File
- Instance Based
|
|
- Solution to new problem is solution to closest example
|
|
- Must be able to measure distance between pair of examples
|
|
- Normally euclidean distance
|
|
|
|
# Normalisation of Numeric Attributes
|
|
|
|
- Attributes measured on different scales
|
|
- Larger scales have higher impacts
|
|
- Must normalise (transform to scale [0, 1])
|
|
|
|
# $a_i = \frac{v_i - minv_i}{maxv_i - minv_i}$
|
|
|
|
Where:
|
|
- $a_i$ is normalised value for attribute $i$
|
|
- $v_i$ is the current value for attribute $i$
|
|
- $maxv_i$ is largest value of attribute $i$
|
|
- $minv_i$ is smallest value of attribute $i$
|
|
|
|
## Example
|
|
|
|
# $maxv_{humidity} = 96$
|
|
# $minv_{humidity} = 65$
|
|
# $v_{humidity} = 80.5$
|
|
|
|
# $a_i = \frac{80.5-65}{96-55} = \frac{15.5}{31} = 0.5$
|
|
|
|
## Example (Transport Dataset)
|
|
|
|
# $maxv_{doors} = 5$
|
|
# $minv_{doors} = 2$
|
|
# $v_{doors} = 3$
|
|
# $a_i = \frac{3-2}{5-2} = \frac{1}{3}$
|
|
|
|
# Nearest Neighbor Applied (Transport Dataset)
|
|
|
|
- Last row is new vehicle to be classified
|
|
- N denotes normalised
|
|
- Right most column shows euclidean distances between each vehicle and new vehicle
|
|
- New vehicle is closest to the 1st example, a taxi, NN predicts taxi
|
|

|
|
# $vmin_{doors} = 2$
|
|
# $vmax_{doors} = 5$
|
|
# $vmin_{seats} = 7$
|
|
# $vmax_{seats} = 65$
|
|
|
|
# Missing Values
|
|
|
|
## Missing Nominal Values
|
|
|
|
- Assume missing feature is maximally different from any other value
|
|
- Distance is:
|
|
- 0 if identical and not missing
|
|
- 1 if otherwise
|
|
|
|
## Missing Numeric Values
|
|
|
|
- 1 if both missing
|
|
- Assume maximum distance if one missing. Largest of:
|
|
- (normalised) size of known value or
|
|
- 1 - (normalised) size of known value
|
|
|
|
## Example (Weather Data)
|
|
|
|
- Humidity of one example = 76
|
|
- Normalised = 0.36
|
|
- One missing
|
|
- Max distance = 1 - 0.36 = 0.64
|
|
|
|
## Example (Transport Data)
|
|
|
|
- Number of seats of one example = 16
|
|
- Normalised = 9/58
|
|
- One missing
|
|
- 1 - 9/58 = 49/58
|
|
|
|
## Normalised Transport Data with Missing Values
|
|
|
|
- Last row to be classified
|
|
- N denotes normalised
|
|
- Right most column is euclidean values
|
|

|
|
|
|
# Definitions of Proximity
|
|
|
|
## Euclidean Distance
|
|
|
|
# $\sqrt{(a_1-a_1')^2) + (a_2-a_2')^2 + ... + (a_n-a_n')^2}$
|
|
|
|
Where $a$ and $a'$ are two examples with $n$ attributes and $a'$ is the value of attribute $i$ for $a$
|
|
|
|
## Manhattan Distance
|
|
|
|
# $|a_1-a_1'|+|a_2-a_2'|+...+|a_n-a_n'|$
|
|
|
|
Vertical bar means absolute value
|
|
Negative becomes positive
|
|
|
|
Another distance measure could be cube root of sum of cubes.
|
|
Higher the power, greater influence of large differences
|
|
Euclidean distance is generally a good compromise
|
|
|
|
# Problems with Nearest Neighbor
|
|
|
|
- Slow since every example must be compared with new
|
|
- Assumes all attributes are equal
|
|
- Only use important attributes to compute distance
|
|
- Weight attributes according to importance
|
|
- Does not detect noise
|
|
- Use k-NN, get k closest examples and take majority vote on solutions
|
|

|
|
|