vault backup: 2024-10-16 09:12:37

2024-10-16 09:12:37 +01:00
parent bad31f35c5
commit 124e0b67ef
190 changed files with 192115 additions and 0 deletions
--- a/Neighbor.md
+++ b/Neighbor.md
@@ -0,0 +1,112 @@
+- Instance Based
+- Solution to new problem is solution to closest example
+- Must be able to measure distance between pair of examples
+- Normally Euclidean distance
+
+# Normalisation of Numeric Attributes
+
+- Attributes measured on different scales
+	- Larger scales have higher impacts
+	- Must normalise (transform to scale [0, 1])
+
+# $a_i = \frac{v_i - minv_i}{maxv_i - minv_i}$
+
+Where:
+- $a_i$ is normalised value for attribute $i$
+- $v_i$ is the current value for attribute $i$
+- $maxv_i$ is largest value of attribute $i$
+- $minv_i$ is smallest value of attribute $i$
+
+## Example
+
+# $maxv_{humidity} = 96$
+# $minv_{humidity} = 65$
+# $v_{humidity} = 80.5$
+
+# $a_i = \frac{80.5-65}{96-65} = \frac{15.5}{31} = 0.5$
+
+## Example (Transport Dataset)
+
+# $maxv_{doors} = 5$
+# $minv_{doors} = 2$
+# $v_{doors} = 3$
+# $a_i = \frac{3-2}{5-2} = \frac{1}{3}$
+
+# Nearest Neighbor Applied (Transport Dataset)
+
+- Last row is new vehicle to be classified
+- N denotes normalised
+- Rightmost column shows Euclidean distances between each vehicle and new vehicle
+- New vehicle is closest to the 1st example, a taxi, so NN predicts taxi
+![](Pasted%20image%2020241010133818.png)
+# $minv_{doors} = 2$
+# $maxv_{doors} = 5$
+# $minv_{seats} = 7$
+# $maxv_{seats} = 65$
+
+# Missing Values
+
+## Missing Nominal Values 
+
+- Assume missing feature is maximally different from any other value
+- Distance is:
+	- 0 if identical and not missing
+	- 1 otherwise (different, or either value missing)
+
+## Missing Numeric Values
+
+- Distance is 1 if both values are missing
+- If only one is missing, assume the maximum possible distance. Largest of:
+	- (normalised) size of known value or
+	- 1 - (normalised) size of known value
+
+## Example (Weather Data)
+
+- Humidity of one example = 76
+- Normalised = (76 - 65)/(96 - 65) = 11/31 ≈ 0.35
+- Humidity of the other example is missing
+- Max distance = largest of 0.35 and 1 - 0.35 = 0.65
+
+## Example (Transport Data)
+
+- Number of seats of one example = 16
+- Normalised = (16 - 7)/(65 - 7) = 9/58
+- Number of seats of the other example is missing
+- Max distance = largest of 9/58 and 1 - 9/58 = 49/58
+
+## Normalised Transport Data with Missing Values
+
+- Last row to be classified
+- N denotes normalised
+- Rightmost column shows Euclidean distances
+![](Pasted%20image%2020241010135130.png)
+
+# Definitions of Proximity
+
+## Euclidean Distance
+
+# $\sqrt{(a_1-a_1')^2 + (a_2-a_2')^2 + ... + (a_n-a_n')^2}$
+
+Where $a$ and $a'$ are two examples with $n$ attributes, $a_i$ is the value of attribute $i$ for $a$ and $a_i'$ is the value of attribute $i$ for $a'$
+
+## Manhattan Distance
+
+# $|a_1-a_1'|+|a_2-a_2'|+...+|a_n-a_n'|$
+
+Vertical bar means absolute value
+Negative becomes positive
+
+Another distance measure could be the cube root of the sum of cubed absolute differences.
+The higher the power, the greater the influence of large differences.
+Euclidean distance is generally a good compromise.
+
+# Problems with Nearest Neighbor
+
+- Slow since every stored example must be compared with the new example
+- Assumes all attributes are equally important
+	- Only use important attributes to compute distance
+	- Weight attributes according to importance
+- Does not detect noise
+	- Use k-NN: find the k closest examples and take a majority vote on their solutions
+![](Pasted%20image%2020241011131542.png)
+
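The steps described in Neighbor.md (min-max normalisation, the missing-value distance rules, Euclidean and Manhattan distance, and the k-NN majority vote) can be sketched in a few lines of Python. This is a minimal illustration and not code from the vault. The function names are made up for this sketch, and the training rows and the "bus" and "minibus" labels are hypothetical, since the real transport table only exists as an image. Only the door and seat ranges (2 to 5 and 7 to 65) come from the notes.

```python
import math
from collections import Counter


def normalise(v, min_v, max_v):
    """Min-max scale v to [0, 1]; a missing value (None) stays None."""
    if v is None:
        return None
    return (v - min_v) / (max_v - min_v)


def numeric_diff(a, b):
    """Difference between two normalised numeric values, using the
    missing-value rules from the notes."""
    if a is None and b is None:
        return 1.0                      # both missing
    if a is None or b is None:
        known = b if a is None else a
        return max(known, 1.0 - known)  # assume the maximum possible distance
    return abs(a - b)


def nominal_diff(a, b):
    """0 if identical and not missing, 1 otherwise."""
    if a is None or b is None:
        return 1.0
    return 0.0 if a == b else 1.0


def euclidean(x, y):
    return math.sqrt(sum(numeric_diff(a, b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(numeric_diff(a, b) for a, b in zip(x, y))


def knn_predict(train, query, k=1, distance=euclidean):
    """train is a list of (features, label) pairs. Returns the majority
    label among the k examples closest to query."""
    nearest = sorted(train, key=lambda ex: distance(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Attribute ranges from the notes: doors in [2, 5], seats in [7, 65].
RANGES = [(2, 5), (7, 65)]


def scale(row):
    return [normalise(v, lo, hi) for v, (lo, hi) in zip(row, RANGES)]


# Hypothetical training rows (doors, seats); not the real transport dataset.
train = [
    (scale([4, 7]), "taxi"),
    (scale([5, 7]), "taxi"),
    (scale([3, 16]), "minibus"),
    (scale([2, 65]), "bus"),
]
query = scale([4, 8])

print(knn_predict(train, query, k=1))  # taxi
print(knn_predict(train, query, k=3))  # taxi
```

With `k=1` this is plain nearest neighbour. A larger `k` gives the majority vote that the notes suggest for coping with noise, and passing `distance=manhattan` swaps the proximity measure. `nominal_diff` is included only to show the nominal rule; the two distance functions here assume numeric attributes.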