2.8 KiB
2.8 KiB
- Instance Based
- Solution to new problem is solution to closest example
- Must be able to measure distance between pair of examples
- Normally euclidean distance
Normalisation of Numeric Attributes
- Attributes measured on different scales
- Larger scales have higher impacts
- Must normalise (transform to scale [0, 1])
a_i = \frac{v_i - minv_i}{maxv_i - minv_i}
Where:
a_i
is normalised value for attributei
v_i
is the current value for attributei
maxv_i
is largest value of attributei
minv_i
is smallest value of attributei
Example
maxv_{humidity} = 96
minv_{humidity} = 65
v_{humidity} = 80.5
a_i = \frac{80.5-65}{96-55} = \frac{15.5}{31} = 0.5
Example (Transport Dataset)
maxv_{doors} = 5
minv_{doors} = 2
v_{doors} = 3
a_i = \frac{3-2}{5-2} = \frac{1}{3}
Nearest Neighbor Applied (Transport Dataset)
- Last row is new vehicle to be classified
- N denotes normalised
- Right most column shows euclidean distances between each vehicle and new vehicle
- New vehicle is closest to the 1st example, a taxi, NN predicts taxi
vmin_{doors} = 2
vmax_{doors} = 5
vmin_{seats} = 7
vmax_{seats} = 65
Missing Values
Missing Nominal Values
- Assume missing feature is maximally different from any other value
- Distance is:
- 0 if identical and not missing
- 1 if otherwise
Missing Numeric Values
- 1 if both missing
- Assume maximum distance if one missing. Largest of:
- (normalised) size of known value or
- 1 - (normalised) size of known value
Example (Weather Data)
- Humidity of one example = 76
- Normalised = 0.36
- One missing
- Max distance = 1 - 0.36 = 0.64
Example (Transport Data)
- Number of seats of one example = 16
- Normalised = 9/58
- One missing
- 1 - 9/58 = 49/58
Normalised Transport Data with Missing Values
Definitions of Proximity
Euclidean Distance
\sqrt{(a_1-a_1')^2) + (a_2-a_2')^2 + ... + (a_n-a_n')^2}
Where a
and a'
are two examples with n
attributes and a'
is the value of attribute i
for a
Manhattan Distance
|a_1-a_1'|+|a_2-a_2'|+...+|a_n-a_n'|
Vertical bar means absolute value Negative becomes positive
Another distance measure could be cube root of sum of cubes. Higher the power, greater influence of large differences Euclidean distance is generally a good compromise