Files
G4G0-2/AI & Data Mining/Week 4/Lecture 7 - Nearest Neighbor.md
2025-03-16 18:59:42 +00:00

2.8 KiB
Executable File

  • Instance Based
  • Solution to new problem is solution to closest example
  • Must be able to measure distance between pair of examples
  • Normally euclidean distance

Normalisation of Numeric Attributes

  • Attributes measured on different scales
    • Larger scales have higher impacts
    • Must normalise (transform to scale [0, 1])

a_i = \frac{v_i - minv_i}{maxv_i - minv_i}

Where:

  • a_i is normalised value for attribute i
  • v_i is the current value for attribute i
  • maxv_i is largest value of attribute i
  • minv_i is smallest value of attribute i

Example

maxv_{humidity} = 96

minv_{humidity} = 65

v_{humidity} = 80.5

a_i = \frac{80.5-65}{96-55} = \frac{15.5}{31} = 0.5

Example (Transport Dataset)

maxv_{doors} = 5

minv_{doors} = 2

v_{doors} = 3

a_i = \frac{3-2}{5-2} = \frac{1}{3}

Nearest Neighbor Applied (Transport Dataset)

  • Last row is new vehicle to be classified
  • N denotes normalised
  • Right most column shows euclidean distances between each vehicle and new vehicle
  • New vehicle is closest to the 1st example, a taxi, NN predicts taxi

vmin_{doors} = 2

vmax_{doors} = 5

vmin_{seats} = 7

vmax_{seats} = 65

Missing Values

Missing Nominal Values

  • Assume missing feature is maximally different from any other value
  • Distance is:
    • 0 if identical and not missing
    • 1 if otherwise

Missing Numeric Values

  • 1 if both missing
  • Assume maximum distance if one missing. Largest of:
    • (normalised) size of known value or
    • 1 - (normalised) size of known value

Example (Weather Data)

  • Humidity of one example = 76
  • Normalised = 0.36
  • One missing
  • Max distance = 1 - 0.36 = 0.64

Example (Transport Data)

  • Number of seats of one example = 16
  • Normalised = 9/58
  • One missing
  • 1 - 9/58 = 49/58

Normalised Transport Data with Missing Values

  • Last row to be classified
  • N denotes normalised
  • Right most column is euclidean values

Definitions of Proximity

Euclidean Distance

\sqrt{(a_1-a_1')^2) + (a_2-a_2')^2 + … + (a_n-a_n')^2}

Where a and a' are two examples with n attributes and a' is the value of attribute i for a

Manhattan Distance

|a_1-a_1'|+|a_2-a_2'|+…+|a_n-a_n'|

Vertical bar means absolute value Negative becomes positive

Another distance measure could be cube root of sum of cubes. Higher the power, greater influence of large differences Euclidean distance is generally a good compromise

Problems with Nearest Neighbor

  • Slow since every example must be compared with new
  • Assumes all attributes are equal
    • Only use important attributes to compute distance
    • Weight attributes according to importance
  • Does not detect noise
    • Use k-NN, get k closest examples and take majority vote on solutions