boris/G4G0-2

boris 124e0b67ef vault backup: 2024-10-16 09:12:37

2024-10-16 09:12:37 +01:00

Instance Based
Solution to new problem is solution to closest example
Must be able to measure distance between pair of examples
Normally Euclidean distance

Normalisation of Numeric Attributes

Attributes measured on different scales
- Larger scales have higher impacts
- Must normalise (transform to scale [0, 1])

`a_i = \frac{v_i - minv_i}{maxv_i - minv_i}`

Where:

a_i is normalised value for attribute i
v_i is the current value for attribute i
maxv_i is largest value of attribute i
minv_i is smallest value of attribute i

Example

`maxv_{humidity} = 96`

`minv_{humidity} = 65`

`v_{humidity} = 80.5`

`a_i = \frac{80.5-65}{96-65} = \frac{15.5}{31} = 0.5`

Example (Transport Dataset)

`maxv_{doors} = 5`

`minv_{doors} = 2`

`v_{doors} = 3`

`a_i = \frac{3-2}{5-2} = \frac{1}{3}`
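
The normalisation formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name `normalise` and the handling of a constant attribute (where max equals min) are assumptions.

```python
def normalise(value, min_value, max_value):
    """Min-max scale a value into [0, 1] using the attribute's observed range."""
    if max_value == min_value:
        # Assumption: a constant attribute carries no information, so map it to 0.
        return 0.0
    return (value - min_value) / (max_value - min_value)


# The two worked examples from the notes.
humidity = normalise(80.5, 65, 96)  # (80.5 - 65) / (96 - 65)
doors = normalise(3, 2, 5)          # (3 - 2) / (5 - 2)
print(humidity, doors)
```

The minimum and maximum should come from the training examples, so a new example outside that range can fall slightly outside [0, 1].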

Nearest Neighbor Applied (Transport Dataset)

Last row is new vehicle to be classified
N denotes normalised
Rightmost column shows Euclidean distances between each vehicle and new vehicle
New vehicle is closest to the 1st example, a taxi, so NN predicts taxi

`minv_{doors} = 2`

`maxv_{doors} = 5`

`minv_{seats} = 7`

`maxv_{seats} = 65`

Missing Values

Missing Nominal Values

Assume missing feature is maximally different from any other value
Distance is:
- 0 if identical and not missing
- 1 if otherwise
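
As a rough sketch of the nominal rule above, assuming a missing value is represented by `None` (a convention chosen here, not one stated in the notes):

```python
def nominal_distance(x, y):
    """Distance between two nominal values; None marks a missing value."""
    if x is None or y is None:
        # A missing value is treated as maximally different from anything,
        # including another missing value.
        return 1.0
    return 0.0 if x == y else 1.0


print(nominal_distance("sunny", "sunny"))  # identical and present
print(nominal_distance("sunny", "rainy"))  # different
print(nominal_distance("sunny", None))     # one missing
```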

Missing Numeric Values

1 if both missing
Assume maximum distance if one missing. Largest of:
- (normalised) size of known value or
- 1 - (normalised) size of known value

Example (Weather Data)

Humidity of one example = 76
Normalised = (76 - 65) / (96 - 65) = 11/31 ≈ 0.35
One missing
Max distance = max(0.35, 1 - 0.35) = 0.65

Example (Transport Data)

Number of seats of one example = 16
Normalised = 9/58
One missing
1 - 9/58 = 49/58
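
The numeric rule can be sketched the same way. The function name `numeric_distance` and the use of `None` for a missing value are assumptions for illustration; inputs are expected to be normalised to [0, 1] already.

```python
def numeric_distance(x, y):
    """Distance between two normalised numeric values; None marks a missing value."""
    if x is None and y is None:
        return 1.0
    if x is None or y is None:
        known = y if x is None else x
        # Assume the missing value sits as far from the known one as possible:
        # at 0 if the known value is large, at 1 if it is small.
        return max(known, 1.0 - known)
    return abs(x - y)


# Transport example: 16 seats normalises to 9/58, the other value is missing.
print(numeric_distance(9 / 58, None))  # equals 49/58 up to floating-point rounding
```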

Normalised Transport Data with Missing Values

Last row to be classified
N denotes normalised
Rightmost column is Euclidean distances

Definitions of Proximity

Euclidean Distance

`\sqrt{(a_1-a_1')^2 + (a_2-a_2')^2 + ... + (a_n-a_n')^2}`

Where a and a' are two examples with n attributes, and a_i and a_i' are their values for attribute i

Manhattan Distance

`|a_1-a_1'|+|a_2-a_2'|+...+|a_n-a_n'|`

Vertical bars mean absolute value
Negative becomes positive

Another distance measure could be cube root of sum of cubed absolute differences
Higher the power, greater the influence of large differences
Euclidean distance is generally a good compromise
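
All three measures are instances of one family, usually called the Minkowski distance, which differs only in the power p. The sketch below is illustrative; the function name `minkowski_distance` is an assumption, and both examples are taken to be equal-length sequences of normalised numbers.

```python
def minkowski_distance(a, b, p=2):
    """p=1 gives Manhattan, p=2 gives Euclidean, p=3 the cube-root variant."""
    if len(a) != len(b):
        raise ValueError("examples must have the same number of attributes")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a = [0.0, 0.0]
b = [3.0, 4.0]
manhattan = minkowski_distance(a, b, p=1)  # |3| + |4|
euclidean = minkowski_distance(a, b, p=2)  # sqrt(9 + 16)
cubic = minkowski_distance(a, b, p=3)      # cube root of 27 + 64
print(manhattan, euclidean, cubic)
```

As p grows, the result moves towards the single largest attribute difference, which is the sense in which higher powers give large differences more influence.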

Problems with Nearest Neighbor

Slow since every example must be compared with the new one
Assumes all attributes are equally important
- Only use important attributes to compute distance
- Weight attributes according to importance
Does not detect noise
- Use k-NN, get k closest examples and take majority vote on solutions
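
A minimal k-NN sketch, assuming features are already normalised and complete (no missing values). The function name `knn_predict` and the toy data are made up for illustration and are not the transport dataset from the notes.

```python
import math
from collections import Counter


def knn_predict(examples, query, k=3):
    """examples: list of (feature_tuple, label); returns the majority label of the k nearest."""
    # Compare the query with every stored example (this is the slow part).
    ranked = sorted(examples, key=lambda ex: math.dist(ex[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical data: the third example is a deliberately noisy label.
training = [
    ((0.0, 0.0), "taxi"),
    ((0.1, 0.1), "taxi"),
    ((0.2, 0.0), "bus"),
    ((0.9, 1.0), "bus"),
    ((1.0, 0.9), "bus"),
]
query = (0.18, 0.02)
print(knn_predict(training, query, k=1))  # follows the noisy neighbour
print(knn_predict(training, query, k=3))  # majority vote outvotes it
```

With k = 1 the single noisy neighbour decides the answer; with k = 3 the two nearby taxis outvote it. An odd k avoids ties in two-class problems.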
