# Assessment

## T1

- Exam (50%)

## T2

- Coursework (50%)

# Resources

Data Mining: Practical Machine Learning Tools and Techniques (Witten, Frank, Hall & Pal), 4th Edition, 2016

Scientific calculator

# Data Vs Information

- Too much data
- Valuable resource
- Raw data is less important; we need to develop techniques to extract the information in it
- Data: recorded facts
- Information: patterns underlying data

# Philosophy

## Cow Culling

- Cows described by 700 attributes
- Problem: selecting which cows to cull
- Data: historical records and farmer decisions
- Machine learning used to ascertain which factors the farmers took into account, rather than to automate the decision-making process

# Definition of Data Mining

- The extraction of:
	- Implicit,
	- Previously unknown,
	- Potentially useful information
- Programs that detect patterns and regularities are needed
- Strong patterns => good predictions
- Issues:
	- Most patterns are not interesting
	- Patterns may be inexact
	- Data may be garbled or missing

# Machine Learning Techniques

- Algorithms for acquiring structural descriptions from examples
- Structural descriptions represent patterns explicitly
- Predict the outcome in a new situation
- Understand and explain how a prediction was derived
- Methods originate from AI, statistics and research on databases

# Can Machines Learn?

- By the dictionary definition ("to obtain knowledge by study, experience or being taught"), arguably yes - but this ability is very difficult to measure
- Does learning imply intention?

# Terminology

- Concept - the thing to be learned
- Example / Instance - an individual, independent example of a concept
- Attributes / Features - the measured aspects of an example / instance
- Concept description (pattern, model, hypothesis) - the output of a data mining algorithm

# Famous Small Datasets

- Will be used in the module
- Unrealistically simple

## Weather Dataset - Nominal

Concept: conditions which are suitable for playing a game.
Reference: Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.

### Attributes

3\*3\*2\*2 = 36 possible combinations of values.

Outlook

- sunny, overcast, rainy

Temperature
- hot, mild, cool

Humidity
- high, normal

Windy
- true, false

Play
- Class
- yes, no

### Dataset

![[weather_nominal.png]]

![[weather_nominal_rules.png]]

Rules are ordered: higher in the list = higher priority.

## Weather Dataset - Mixed

![[weather_mixed.png]]

![[weather_mixed_rules.png]]

## Contact Lenses Dataset

Describes conditions under which an optician might want to prescribe soft, hard or no contact lenses.
Grossly over-simplified.
Reference: Cendrowska, J. (1987). PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4), 349-370.

### Attributes

3\*2\*2\*2 = 24 possibilities
Dataset is exhaustive, which is unusual.

Age

- young, pre-presbyopic, presbyopic

Spectacle Prescription
- myope (short-sighted), hypermetrope (long-sighted)

Astigmatism
- yes, no

Tear Production Rate
- reduced, normal

Recommended Lenses
- Class
- hard, soft, none

### Dataset

![[contact_lenses.png]]

## Iris Dataset

Used in many statistical experiments.
Contains four numeric attributes for three different types of iris.
Introduced in 1936 by Sir Ronald Fisher.

### Dataset

![[iris.png]]

# Styles of Learning

- Classification Learning: Predicting a **nominal** class
- Numeric Prediction (Regression): Predicting a **numeric** quantity
- Clustering: Grouping similar examples into clusters
- Association Learning: Detecting associations between attributes

## Classification Learning

- Nominal
- Supervised
- Provided with the actual value of the class
- Measure success on fresh data for which class labels are known (test data)

## Numeric Prediction (Regression)

- Numeric
- Supervised
- Success measured on test data

![[regression.png]]
Example uses a linear regression function to provide an estimated performance value based on attributes.
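
A minimal sketch of what such a function computes; the attribute values, weights and bias below are made up for illustration, not taken from the lecture's example.

```python
# Numeric prediction with a linear regression function:
# predicted value = bias + weighted sum of attribute values.

def predict(values, weights, bias):
    return bias + sum(w * v for w, v in zip(weights, values))

# e.g. two hypothetical attributes with hand-picked coefficients
print(predict([3.0, 7.5], weights=[2.0, -0.5], bias=10.0))  # 12.25
```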

## Clustering

- Finding similar groups
- Unsupervised
- Class of example is unknown
- Success measured **subjectively**

## Association Learning

- Applied if no class is specified and any kind of structure is interesting
- Differences to Classification Learning:
	- Predicts any attribute's value, not just the class
	- Can predict more than one attribute's value at a time
	- Far more association rules than classification rules

## Classification Vs Association Rules

Classification Rule:

- Predicts the value of a given attribute (the class of the example)
- ``If outlook = sunny and humidity = high, then play = no``

Association Rule:

- Predicts the value of an arbitrary attribute / combination

```
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high
```
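
A quick sketch (not from the lecture) of how an association rule can be checked against the nominal weather examples: count how often the antecedent holds (coverage) and how often the consequent then also holds. The dataset literal is the standard nominal weather data.

```python
# Evaluate "if humidity = normal and windy = false then play = yes".
# Tuples: (outlook, temperature, humidity, windy, play)
data = [
    ("sunny", "hot", "high", "false", "no"),      ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),  ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),  ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),  ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),   ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),("rainy", "mild", "high", "true", "no"),
]

antecedent = lambda r: r[2] == "normal" and r[3] == "false"  # humidity, windy
consequent = lambda r: r[4] == "yes"                         # play

covered = [r for r in data if antecedent(r)]
correct = [r for r in covered if consequent(r)]
print(len(covered), len(correct))  # 4 4 -> the rule is exact on this data
```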

# Data Mining and Ethics

- Ethical issues arise in practical applications
- Data mining is often used to discriminate
- The ethical situation depends on the application
- Attributes may contain problematic information
- Does ownership of data bestow the right to use it in ways other than those stated when it was originally collected?
- Who is permitted to access the data?
- For what purpose was the data collected?
- What conclusions can sensibly be drawn?
- Caveats must be attached to results
- Purely statistical arguments are never sufficient

---
**AI & Data Mining/Week 1/Lecture 2 - Input and Output.md**

# Attributes

- Each example described by a fixed, pre-defined set of features (attributes)
- The relevant attributes may vary between examples
	- ex. transportation vehicles
	- no. of wheels not applicable to ships
	- no. of masts not applicable to cars
	- Possible solution: "irrelevant value" flag
- Attributes may be dependent on other attributes

# Taxonomy of Data Types

![[data_types.png]]

# Nominal Attributes

- Distinct symbols
- Serve as labels or names
- No relation implied among nominal values
- Only equality tests can be performed
- ex. outlook = sunny

# Sources of Missing Values

- Malfunctioning / misconfigured equipment
- Changes in design
- Collation of different datasets
- Data not collected for mining
	- Errors and omissions don't affect the original purpose of the data
	- ex. banks do not need to know age in banking datasets, so DOB may contain missing values
- A missing value may itself have significance
	- ex. a diagnosis may be deducible from which tests a doctor chose to run, not just their outcomes
	- Most DM algorithms assume this is not the case, hence "missing" may need to be coded as an additional nominal value

# Inaccurate Values

- Typographical errors in nominal attributes
- Typographical and measurement errors in numeric attributes
- Deliberate errors
	- ex. incorrect ZIP codes, unsanitised inputs
- Duplicate examples

# Weka and ARFF

## Weather Dataset in ARFF

![[weather_arff.png]]

### Getting to Know the Data

- First task: get to know the data
- Simple visualisations are useful:
	- Nominal: bar graphs
	- Numeric: histograms
- 2D and 3D plots show dependencies
- Need to consult experts
- Too much data? Take a sample.

# Concept Descriptions

- Output of a DM algorithm
- Many ways of representing it:
	- Decision trees
	- Rules
	- Linear regression functions

## Decision Trees

- Divide-and-conquer approach
- Trees are drawn upside down
	- Node at the top is the root
	- Edges are branches
	- Rectangles represent leaves
	- Leaves assign classifications
	- Nodes involve testing an attribute

### Decision Tree with Nominal Attributes

![[decision_tree_nominal.png]]

- Number of branches usually equals the number of values
- Attribute not tested more than once.
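
A minimal sketch of this tree for the weather data - an assumed nested-dict representation, not the lecture's code: internal nodes map an attribute to its branches, and leaves are plain class labels.

```python
# Weather decision tree: outlook at the root, humidity and windy below.
tree = {
    "outlook": {
        "sunny": {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy": {"windy": {"true": "no", "false": "yes"}},
    }
}

def classify(node, instance):
    # Walk branches until a leaf (a plain class label) is reached.
    while isinstance(node, dict):
        attribute = next(iter(node))  # the attribute tested at this node
        node = node[attribute][instance[attribute]]
    return node

print(classify(tree, {"outlook": "sunny", "humidity": "high", "windy": "false"}))  # -> no
```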

### Decision Tree with Numeric Attributes

![[decision_tree_numeric.png]]

- Test whether the value is greater or less than a constant
- An attribute may be tested multiple times

### Decision Trees with Missing Values

- Not clear which branch should be taken when a node tests an attribute whose value is missing
- Does the absence of a value have significance?
	- Yes => treat "missing" as a separate value during training
	- No => treat it in a special way during testing
		- ex. assign the sample to the most popular branch

# Classification Rules

- Popular alternative to decision trees
- Antecedent (pre-condition) - a series of tests
	- Tests usually logically ANDed together
- Consequent (conclusion) - usually a class
- Individual rules are often logically ORed together

## If-Then Rules for Contact Lenses

![[contact_lenses_rules.png]]

# Nuggets

- Are rules independent "nuggets" of knowledge?
- Problem: this view ignores the process of executing rules
	- Ordered set (decision list)
		- Order is important for interpretation
	- Unordered set
		- Rules may overlap and lead to different conclusions for the same example
		- Needs conflict resolution

## Executing Rules

- What if $\geq$ 2 rules conflict?
	- Give no conclusion?
	- Go with the rule that covers the largest no. of training samples?
- What if no rule applies to a test example?
	- Give no conclusion?
	- Go with the class that is most frequent?

## Special Case: Boolean Classes

- Assumption: if an example does not belong to class "yes", it belongs to "no"
- Solution: only learn rules for class "yes", and use a default rule for "no"

![[boolean_rules.png]]
- Order is important, no conflicts.
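
A sketch of executing such an ordered rule set (decision list); the rules below are illustrative ones consistent with the weather data, not the lecture's exact set.

```python
# Rules for class "yes" first, default rule for "no" last.
rules = [
    (lambda x: x["outlook"] == "overcast", "yes"),
    (lambda x: x["humidity"] == "normal" and x["windy"] == "false", "yes"),
    (lambda x: True, "no"),  # default rule: everything else is "no"
]

def classify(instance):
    # The first matching rule wins, so order resolves any overlap.
    for antecedent, consequent in rules:
        if antecedent(instance):
            return consequent

print(classify({"outlook": "sunny", "humidity": "high", "windy": "true"}))  # -> no
```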

---
**AI & Data Mining/Week 3/Lecture 5 - Naive Bayes.md**

# Statistical Modelling

- Using statistical modelling for classification
- Bayesian techniques were adopted by the machine learning community in the 90s
- Opposite of 1R: uses all attributes
- Assumptions:
	- Attributes are equally important
	- Attributes are statistically independent
- The independence assumption is never correct
	- But the scheme works well in practice

# Weather Dataset

![[weather_nominal.png]]
![[weather_counts.png]]

# Bayes' Rule of Conditional Probability

- Probability of event H given evidence E:

# $Pr[H|E] = \frac{Pr[E|H]\times Pr[H]}{Pr[E]}$

- H may be ex. play = yes
- E may be the particular weather for a new day
- A priori probability of H: $Pr[H]$
	- Probability before seeing the evidence
- A posteriori probability of H: $Pr[H|E]$
	- Probability after seeing the evidence

## Naive Bayes for Classification

- Classification learning: what is the probability of the class given an instance?
	- Evidence $E$ = instance
	- Event $H$ = class for the given instance
- Naive assumption: the evidence splits into attributes that are independent

# $Pr[H|E] = \frac{Pr[E_1|H] \times Pr[E_2|H] \times \dots \times Pr[E_n|H] \times Pr[H]}{Pr[E]}$
- Denominator cancels out during conversion into probability by normalisation
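
A worked sketch of this computation for the weather data, assuming the standard counts (9 "yes" days, 5 "no" days) and a new day with outlook = sunny, temperature = cool, humidity = high, windy = true:

```python
# Per-class likelihoods: product of conditional probabilities times the prior.
likelihoods = {
    "yes": (2/9) * (3/9) * (3/9) * (3/9) * (9/14),
    "no":  (3/5) * (1/5) * (4/5) * (3/5) * (5/14),
}

# Normalisation makes Pr[E] cancel: divide by the sum over both classes.
total = sum(likelihoods.values())
posteriors = {c: v / total for c, v in likelihoods.items()}
print(posteriors)  # roughly {'yes': 0.205, 'no': 0.795}
```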

### Weather Data Example

![[naive_bayes_example.png]]

# Laplace Estimator

- Remedy to the zero-frequency problem: add 1 to the count for every attribute value-class combination (Laplace estimator)
- Result: probabilities will never be 0 (also stabilises probability estimates)
- This simple remedy is often used in practice when the zero-frequency problem arises.
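
A minimal sketch of the smoothed estimate (helper name is mine, not Weka's):

```python
def laplace_prob(value_count: int, class_count: int, n_values: int) -> float:
    # Add 1 to the count of each of the n_values possible attribute values.
    return (value_count + 1) / (class_count + n_values)

# outlook = overcast never occurs with play = no (0 of the 5 "no" days):
print(laplace_prob(0, 5, 3))  # 1/8 = 0.125 instead of 0
```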

## Example

![[laplace_example.png]]

# Modified Probability Estimates

- Consider attribute *outlook* for class *yes*

# $\frac{2+\frac{1}{3}\mu}{9+\mu}$
Sunny

# $\frac{4+\frac{1}{3}\mu}{9+\mu}$
Overcast

# $\frac{3+\frac{1}{3}\mu}{9+\mu}$
Rainy

- Each value is treated the same way
- Prior to seeing the training set, assume each value is equally likely, ex. the prior probability is $\frac{1}{3}$
- When we decided to add 1 to the counts, we implicitly set $\mu$ to 3
- However, there is no particular reason to add 1 to the count; we could increment by 0.1 instead, setting $\mu$ to 0.3
- A large value of $\mu$ says the prior probabilities are very important compared to the evidence in the training set
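
A sketch of the generalised estimate, where $\mu$ weights a prior probability against the observed counts (function name is mine):

```python
def m_estimate(value_count: float, class_count: float, prior: float, mu: float) -> float:
    return (value_count + mu * prior) / (class_count + mu)

# mu = 3 with the uniform prior 1/3 reproduces the Laplace estimator:
print(m_estimate(2, 9, 1/3, 3))    # (2 + 1) / (9 + 3) = 0.25  (sunny | yes)
print(m_estimate(2, 9, 1/3, 0.3))  # gentler smoothing with mu = 0.3
```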

## Fully Bayesian Formulation

# $\frac{2+\mu p_1}{9+\mu}$
Sunny

# $\frac{4+\mu p_2}{9+\mu}$
Overcast

# $\frac{3+\mu p_3}{9+\mu}$
Rainy

- Where $p_1 + p_2 + p_3 = 1$
- $p_1, p_2, p_3$ are the prior probabilities of outlook being sunny, overcast or rainy before seeing the training set. In practice, however, it is not clear how these prior probabilities should be assigned.
- With the uniform prior $p_i = \frac{1}{3}$, these reduce to the modified estimates above.

---
**AI & Data Mining/Week 3/Tutorial 3.md**

| Temperature | Skin   | Blood Pressure | Blocked Nose | Diagnosis |
| ----------- | ------ | -------------- | ------------ | --------- |
| Low         | Pale   | Normal         | True         | N         |
| Moderate    | Pale   | Normal         | True         | B         |
| High        | Normal | High           | False        | N         |
| Moderate    | Pale   | Normal         | False        | B         |
| High        | Red    | High           | False        | N         |
| High        | Red    | High           | True         | N         |
| Moderate    | Red    | High           | False        | B         |
| Low         | Normal | High           | False        | B         |
| Low         | Pale   | Normal         | False        | B         |
| Low         | Normal | Normal         | False        | B         |
| High        | Normal | Normal         | True         | B         |
| Moderate    | Normal | High           | True         | B         |
| Moderate    | Red    | Normal         | False        | B         |
| Low         | Normal | High           | True         | N         |

Counts per attribute value and class:

| | Temperature | | | Skin | | | Pressure | | | Blocked | | Diag | |
| -------- | ----------- | --- | ------ | ---- | --- | ------ | -------- | --- | ----- | ------- | --- | ---- | ---- |
| | N | B | | N | B | | N | B | | N | B | N | B |
| Low | 2 | 3 | Pale | 1 | 3 | Normal | 1 | 6 | True | 3 | 3 | 5 | 9 |
| Moderate | 0 | 5 | Normal | 2 | 4 | High | 4 | 3 | False | 2 | 6 | | |
| High | 3 | 1 | Red | 2 | 2 | | | | | | | | |

Relative frequencies:

| | Temperature | | | Skin | | | Pressure | | | Blocked | | Diag | |
| -------- | ----------- | --- | ------ | ---- | --- | ------ | -------- | --- | ----- | ------- | --- | ---- | ---- |
| | N | B | | N | B | | N | B | | N | B | N | B |
| Low | 2/5 | 3/9 | Pale | 1/5 | 3/9 | Normal | 1/5 | 6/9 | True | 3/5 | 3/9 | 5/14 | 9/14 |
| Moderate | 0/5 | 5/9 | Normal | 2/5 | 4/9 | High | 4/5 | 3/9 | False | 2/5 | 6/9 | | |
| High | 3/5 | 1/9 | Red | 2/5 | 2/9 | | | | | | | | |

# Problem 1

Evidence: Temperature = Low, Skin = Normal, Blood Pressure = High, Blocked Nose = True.

# $Pr[Diagnosis=N|E] = \frac{2}{5} \times \frac{2}{5} \times \frac{4}{5} \times \frac{3}{5} \times \frac{5}{14} = 0.027428571$
# $Pr[Diagnosis = B|E] = \frac{3}{9} \times \frac{4}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{9}{14} = 0.010582011$

# $p(B) = \frac{0.0106}{0.0106+0.0274} = 0.2789$

# $p(N) = \frac{0.0274}{0.0106+0.0274} = 0.7211$

Diagnosis N is much more likely than Diagnosis B

# Problem 2

Evidence: Temperature = Low, Skin missing, Blood Pressure = Normal, Blocked Nose = True (the missing attribute is simply omitted from the products).

# $Pr[Diagnosis = N|E] = \frac{2}{5} \times \frac{1}{5} \times \frac{3}{5} \times \frac{5}{14} = 0.0171$
# $Pr[Diagnosis = B|E] = \frac{3}{9} \times \frac{6}{9} \times \frac{3}{9} \times \frac{9}{14} = 0.0476$
# $p(N) = \frac{0.0171}{0.0171+0.0476} = 0.2643$
# $p(B) = \frac{0.0476}{0.0476+0.0171} = 0.7357$

Diagnosis B is much more likely than Diagnosis N

# Problem 3

Evidence: Temperature = Moderate, Skin = Normal, Blood Pressure = High, Blocked Nose = True.

# $Pr[Diagnosis = N|E] = \frac{0}{5} \times \frac{2}{5} \times \frac{4}{5} \times \frac{3}{5} \times \frac{5}{14} = 0$
# $Pr[Diagnosis = B|E] = \frac{5}{9} \times \frac{4}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{9}{14} = 0.018$
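
With a zero count for Moderate given N, $Pr[N|E]$ collapses to 0 and the classifier is certain of B - exactly the zero-frequency problem the Laplace estimator from Lecture 5 addresses. A minimal sketch reproducing Problem 1's arithmetic from the relative-frequency table:

```python
# Evidence: Low, Normal skin, High blood pressure, blocked nose.
pr_n = (2/5) * (2/5) * (4/5) * (3/5) * (5/14)  # ~0.02743
pr_b = (3/9) * (4/9) * (3/9) * (3/9) * (9/14)  # ~0.01058

total = pr_n + pr_b  # Pr[E] cancels: normalise over the two classes
print(round(pr_n / total, 4), round(pr_b / total, 4))  # 0.7216 0.2784
```

(The 0.7211 / 0.2789 above come from normalising the rounded values 0.0274 and 0.0106.)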

---
**AI & Data Mining/Week 3/Workshop 3.md**

# Weather Dataset

## Dataset

```
% This is a comment about the data set.
% This data describes examples of whether to play
% a game or not depending on weather conditions.
@relation letsPlay
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
```

## Output

```
=== Run information ===

Scheme: weka.classifiers.bayes.NaiveBayes
Relation: letsPlay
Instances: 14
Attributes: 5
  outlook
  temperature
  humidity
  windy
  play
Test mode: evaluate on training data

=== Classifier model (full training set) ===

Naive Bayes Classifier

            Class
Attribute     yes     no
           (0.63) (0.38)
===============================
outlook
  sunny       3.0    4.0
  overcast    5.0    1.0
  rainy       4.0    3.0
  [total]    12.0    8.0

temperature
  mean       72.9697 74.8364
  std. dev.   5.2304   7.384
  weight sum       9       5
  precision   1.9091  1.9091

humidity
  mean       78.8395 86.1111
  std. dev.   9.8023  9.2424
  weight sum       9       5
  precision   3.4444  3.4444

windy
  TRUE        4.0    4.0
  FALSE       7.0    3.0
  [total]    11.0    7.0

Time taken to build model: 0 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0.01 seconds

=== Summary ===

Correctly Classified Instances   13    92.8571 %
Incorrectly Classified Instances  1     7.1429 %
Kappa statistic                   0.8372
Mean absolute error               0.2798
Root mean squared error           0.3315
Relative absolute error          60.2576 %
Root relative squared error      69.1352 %
Total Number of Instances        14
```

# Medical Dataset

## Dataset

```
@relation medical
@attribute Temperature {Low,Moderate,High}
@attribute Skin {Pale,Normal,Red}
@attribute BloodPressure {Normal,High}
@attribute BlockedNose {True,False}
@attribute Diagnosis {N,B}

@data
Low, Pale, Normal, True, N
Moderate, Pale, Normal, True, B
High, Normal, High, False, N
Moderate, Pale, Normal, False, B
High, Red, High, False, N
High, Red, High, True, N
Moderate, Red, High, False, B
Low, Normal, High, False, B
Low, Pale, Normal, False, B
Low, Normal, Normal, False, B
High, Normal, Normal, True, B
Moderate, Normal, High, True, B
Moderate, Red, Normal, False, B
Low, Normal, High, True, N
```

## Output

```
=== Run information ===

Scheme: weka.classifiers.bayes.NaiveBayes
Relation: diagnosis
Instances: 14
Attributes: 5
  Temperature
  Skin
  BloodPressure
  BlockedNose
  Diagnosis
Test mode: evaluate on training data

=== Classifier model (full training set) ===

Naive Bayes Classifier

            Class
Attribute       N      B
           (0.38) (0.63)
==============================
Temperature
  Low         3.0    4.0
  Moderate    1.0    6.0
  High        4.0    2.0
  [total]     8.0   12.0

Skin
  Pale        2.0    4.0
  Normal      3.0    5.0
  Red         3.0    3.0
  [total]     8.0   12.0

BloodPressure
  Normal      2.0    7.0
  High        5.0    4.0
  [total]     7.0   11.0

BlockedNose
  True        4.0    4.0
  False       3.0    7.0
  [total]     7.0   11.0

Time taken to build model: 0 seconds

=== Evaluation on training set ===

Time taken to test model on training data: 0 seconds

=== Summary ===

Correctly Classified Instances   12    85.7143 %
Incorrectly Classified Instances  2    14.2857 %
Kappa statistic                   0.6889
Mean absolute error               0.2635
Root mean squared error           0.3272
Relative absolute error          56.7565 %
Root relative squared error      68.2385 %
Total Number of Instances        14
```

# Using Test Data

## Test Data

```
@relation medical
@attribute Temperature {Low,Moderate,High}
@attribute Skin {Pale,Normal,Red}
@attribute BloodPressure {Normal,High}
@attribute BlockedNose {True,False}
@attribute Diagnosis {N,B}
@data
Low,Normal,High,True,N
Low,?,Normal,True,B
Moderate,Normal,High,True,B
```

## Output

```
=== Run information ===

Scheme: weka.classifiers.bayes.NaiveBayes
Relation: medical
Instances: 14
Attributes: 5
  Temperature
  Skin
  BloodPressure
  BlockedNose
  Diagnosis
Test mode: user supplied test set: size unknown (reading incrementally)

=== Classifier model (full training set) ===

Naive Bayes Classifier

            Class
Attribute       N      B
           (0.38) (0.63)
==============================
Temperature
  Low         3.0    4.0
  Moderate    1.0    6.0
  High        4.0    2.0
  [total]     8.0   12.0

Skin
  Pale        2.0    4.0
  Normal      3.0    5.0
  Red         3.0    3.0
  [total]     8.0   12.0

BloodPressure
  Normal      2.0    7.0
  High        5.0    4.0
  [total]     7.0   11.0

BlockedNose
  True        4.0    4.0
  False       3.0    7.0
  [total]     7.0   11.0

Time taken to build model: 0 seconds

=== Predictions on test set ===

inst#  actual  predicted  error  prediction
    1     1:N        1:N         0.652
    2     2:B        2:B         0.677
    3     2:B        2:B         0.706

=== Evaluation on test set ===

Time taken to test model on supplied test set: 0 seconds

=== Summary ===

Correctly Classified Instances    3   100      %
Incorrectly Classified Instances  0     0      %
Kappa statistic                   1
Mean absolute error               0.3215
Root mean squared error           0.3223
Relative absolute error          70.1487 %
Root relative squared error      68.0965 %
Total Number of Instances         3
```

---
**AI & Data Mining/Week 4/Lecture 7 - Nearest Neighbor.md**

- Instance based
- Solution to a new problem is the solution to the closest example
- Must be able to measure the distance between a pair of examples
- Normally Euclidean distance

# Normalisation of Numeric Attributes

- Attributes are measured on different scales
- Attributes on larger scales would otherwise have a higher impact
- Must normalise (transform to the scale [0, 1])

# $a_i = \frac{v_i - minv_i}{maxv_i - minv_i}$

Where:
- $a_i$ is the normalised value for attribute $i$
- $v_i$ is the current value for attribute $i$
- $maxv_i$ is the largest value of attribute $i$
- $minv_i$ is smallest value of attribute $i$
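
A minimal sketch of this formula, checked against the two examples that follow:

```python
def normalise(v: float, v_min: float, v_max: float) -> float:
    # Maps v_min -> 0 and v_max -> 1.
    return (v - v_min) / (v_max - v_min)

print(normalise(80.5, 65, 96))  # humidity example: 15.5/31 = 0.5
print(normalise(3, 2, 5))       # doors example: 1/3
```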

## Example

# $maxv_{humidity} = 96$
# $minv_{humidity} = 65$
# $v_{humidity} = 80.5$

# $a_{humidity} = \frac{80.5-65}{96-65} = \frac{15.5}{31} = 0.5$

## Example (Transport Dataset)

# $maxv_{doors} = 5$
# $minv_{doors} = 2$
# $v_{doors} = 3$
# $a_{doors} = \frac{3-2}{5-2} = \frac{1}{3}$

# Nearest Neighbor Applied (Transport Dataset)

- Last row is the new vehicle to be classified
- N denotes normalised
- Right-most column shows the euclidean distance between each vehicle and the new vehicle
- The new vehicle is closest to the 1st example, a taxi, so NN predicts taxi

![[transport_nn.png]]
# $minv_{doors} = 2$
# $maxv_{doors} = 5$
# $minv_{seats} = 7$
# $maxv_{seats} = 65$

# Missing Values

## Missing Nominal Values

- Assume a missing value is maximally different from any other value
- Distance is:
	- 0 if both values are identical and not missing
	- 1 otherwise

## Missing Numeric Values

- 1 if both are missing
- Assume maximum distance if one is missing. The largest of:
	- (normalised) size of known value or
- 1 - (normalised) size of known value
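
A sketch of these conventions for a single attribute, assuming numeric values are already normalised to [0, 1] and using None to mark a missing value (helper names are mine):

```python
def nominal_distance(a, b):
    if a is None or b is None:
        return 1                      # missing: assume maximally different
    return 0 if a == b else 1

def numeric_distance(a, b):
    if a is None and b is None:
        return 1                      # both missing: maximum distance
    if a is None or b is None:
        known = b if a is None else a
        return max(known, 1 - known)  # worst case against the known value
    return abs(a - b)

print(numeric_distance(0.36, None))   # humidity example below: 0.64
```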

## Example (Weather Data)

- Humidity of one example = 76
- Normalised = 0.36
- The other example's value is missing
- Max distance = 1 - 0.36 = 0.64

## Example (Transport Data)

- Number of seats of one example = 16
- Normalised = 9/58
- The other example's value is missing
- Max distance = 1 - 9/58 = 49/58

## Normalised Transport Data with Missing Values

- Last row is to be classified
- N denotes normalised
- Right-most column shows the euclidean distances

![[transport_nn_missing.png]]

# Definitions of Proximity

## Euclidean Distance

# $\sqrt{(a_1-a_1')^2 + (a_2-a_2')^2 + \dots + (a_n-a_n')^2}$

Where $a$ and $a'$ are two examples with $n$ attributes, and $a_i$ is the value of attribute $i$ for example $a$.

## Manhattan Distance

# $|a_1-a_1'|+|a_2-a_2'|+\dots+|a_n-a_n'|$

Vertical bars denote absolute value: negative differences become positive.

Another distance measure could be the cube root of the sum of cubes.
The higher the power, the greater the influence of large differences.
Euclidean distance is generally a good compromise
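
A minimal sketch of the two measures for already-normalised numeric vectors:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

a, b = [0.0, 0.5, 1.0], [1.0, 0.5, 0.0]
print(euclidean(a, b))  # sqrt(2) ~ 1.414
print(manhattan(a, b))  # 2.0
```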

# Problems with Nearest Neighbor

- Slow, since every example must be compared with the new one
- Assumes all attributes are equally important
- Remedy: only use important attributes to compute the distance
- Remedy: weight attributes according to importance
- Does not detect noise
- Use k-NN, get k closest examples and take majority vote on solutions
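
A sketch of k-NN with a majority vote over the k closest examples; the tiny training set below is made up for illustration, and vectors are assumed already normalised.

```python
from collections import Counter

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(train, query, k=3):
    # train is a list of (vector, class); vote among the k closest examples.
    nearest = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [([0.1, 0.2], "yes"), ([0.15, 0.25], "yes"),
         ([0.9, 0.8], "no"), ([0.95, 0.9], "no")]
print(knn_predict(train, [0.2, 0.2], k=3))  # -> yes (2 of 3 neighbours)
```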

![[knn.png]]

---
**AI & Data Mining/Week 4/Tutorial 4 - Nearest Neighbor.md**

![[weather_mixed.png]]

## Normalisation Equation
# $a_i = \frac{v_i - minv_i}{maxv_i - minv_i}$

## Euclidean Distance Equation
# $\sqrt{(a_1-a_1')^2 + (a_2-a_2')^2 + \dots + (a_n-a_n')^2}$

# $maxv_{temp} = 85$
# $minv_{temp} = 64$

# $a_{temp} = \frac{v_{temp} - 64}{21}$

# $maxv_{humidity} = 96$
# $minv_{humidity} = 65$

# $a_{humidity} = \frac{v_{humidity} - 65}{31}$

| outlook  | temp | NT   | humidity | NH   | windy | play | Euclidean Distance to a' Calculation               | Euclidean Distance |
| -------- | ---- | ---- | -------- | ---- | ----- | ---- | -------------------------------------------------- | ------------------ |
| sunny    | 85   | 1    | 85       | 0.65 | F     | N    | $\sqrt{(85-72)^2 + (85-76)^2 + (2-2)^2 + (0-1)^2}$ | 15.84              |
| sunny    | 80   | 0.76 | 90       | 0.81 | T     | N    | $\sqrt{(80-72)^2 + (90-76)^2 + (2-2)^2 + (1-1)^2}$ | 16.12              |
| overcast | 83   | 0.90 | 86       | 0.68 | F     | Y    | $\sqrt{(83-72)^2 + (86-76)^2 + (1-2)^2 + (0-1)^2}$ | 14.93              |
| rainy    | 70   | 0.29 | 96       | 1    | F     | Y    | $\sqrt{(70-72)^2 + (96-76)^2 + (0-2)^2 + (0-1)^2}$ | 20.22              |
| rainy    | 68   | 0.19 | 80       | 0.48 | F     | Y    | $\sqrt{(68-72)^2 + (80-76)^2 + (0-2)^2 + (0-1)^2}$ | 6.08               |
| rainy    | 65   | 0.05 | 70       | 0.16 | T     | N    | $\sqrt{(65-72)^2 + (70-76)^2 + (0-2)^2 + (1-1)^2}$ | 9.43               |
| overcast | 64   | 0    | 65       | 0    | T     | Y    | $\sqrt{(64-72)^2 + (65-76)^2 + (1-2)^2 + (1-1)^2}$ | 13.64              |
| sunny    | 72   | 0.38 | 95       | 0.97 | F     | N    | $\sqrt{(72-72)^2 + (95-76)^2 + (2-2)^2 + (0-1)^2}$ | 19.03              |
| sunny    | 69   | 0.24 | 70       | 0.16 | F     | Y    | $\sqrt{(69-72)^2 + (70-76)^2 + (2-2)^2 + (0-1)^2}$ | 6.78               |
| rainy    | 75   | 0.52 | 80       | 0.48 | F     | Y    | $\sqrt{(75-72)^2 + (80-76)^2 + (0-2)^2 + (0-1)^2}$ | 5.48               |
| sunny    | 75   | 0.52 | 70       | 0.16 | T     | Y    | $\sqrt{(75-72)^2 + (70-76)^2 + (2-2)^2 + (1-1)^2}$ | 6.71               |
| overcast | 72   | 0.38 | 90       | 0.81 | T     | Y    | $\sqrt{(72-72)^2 + (90-76)^2 + (1-2)^2 + (1-1)^2}$ | 14.04              |
| overcast | 81   | 0.81 | 75       | 0.32 | F     | Y    | $\sqrt{(81-72)^2 + (75-76)^2 + (1-2)^2 + (0-1)^2}$ | 9.17               |
| rainy    | 71   | 0.33 | 91       | 0.84 | T     | N    | $\sqrt{(71-72)^2 + (91-76)^2 + (0-2)^2 + (1-1)^2}$ | 15.17              |
| sunny    | 72   | 0.38 | 76       | 0.35 | T     | ??   |                                                    |                    |

The smallest distance is 5.48 (rainy, 75, 80, F, play = yes), so 1-NN predicts play = yes for the last row. (The distances here are computed on the raw values, with outlook coded sunny = 2, overcast = 1, rainy = 0 and windy T = 1, F = 0.)

---
**AI & Data Mining/Week 4/Workshop 4 - Nearest Neighbor.md**

```
=== Run information ===

Scheme: weka.classifiers.lazy.IBk -K 3 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\""
Relation: letsPlay
Instances: 14
Attributes: 5
  outlook
  temperature
  humidity
  windy
  play
Test mode: user supplied test set: size unknown (reading incrementally)

=== Classifier model (full training set) ===

IB1 instance-based classifier
using 3 nearest neighbour(s) for classification

Time taken to build model: 0 seconds

=== Predictions on test set ===

inst#  actual  predicted  error  prediction
    1   1:yes      1:yes         0.659
    2   1:yes      1:yes         0.659

=== Evaluation on test set ===

Time taken to test model on supplied test set: 0 seconds

=== Summary ===

Correctly Classified Instances    2   100      %
Incorrectly Classified Instances  0     0      %
Kappa statistic                   1
Mean absolute error               0.3409
Root mean squared error           0.3409
Relative absolute error          90.9091 %
Root relative squared error      90.9091 %
Total Number of Instances         2
```