vault backup: 2024-10-16 09:12:37

This commit is contained in:
boris
2024-10-16 09:12:37 +01:00
parent bad31f35c5
commit 124e0b67ef
190 changed files with 192115 additions and 0 deletions

# Statistical Modelling
- Using statistical modelling for classification
- Bayesian techniques adopted by machine learning community in the 90s
- Unlike 1R (which uses a single attribute), uses all attributes
- Assumes:
- Attributes are equally important
- Attributes are statistically independent
- The independence assumption is almost never correct
- Yet the method works well in practice
# Weather Dataset
![](Pasted%20image%2020241003132609.png)
![](Pasted%20image%2020241003132636.png)
# Bayes' Rule of Conditional Probability
- Probability of event H given evidence E:
# $Pr[H|E] = \frac{Pr[E|H]\times Pr[H]}{Pr[E]}$
- H may be, e.g., Play = Yes
- E may be the particular weather conditions for a new day
- A priori probability of H: $Pr[H]$
- Probability before evidence
- A posteriori probability of H: $Pr[H|E]$
- Probability after evidence
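A minimal numeric sketch of the rule (the values of $Pr[E|H]$ and $Pr[E]$ below are assumed purely for illustration; only the 9-out-of-14 prior comes from the weather data):

```python
# Bayes' rule: Pr[H|E] = Pr[E|H] * Pr[H] / Pr[E]
pr_h = 9 / 14         # a priori: Play = Yes on 9 of 14 days
pr_e_given_h = 0.2    # Pr[E|H], assumed for illustration
pr_e = 0.25           # Pr[E], assumed for illustration

pr_h_given_e = pr_e_given_h * pr_h / pr_e   # a posteriori probability
print(round(pr_h_given_e, 4))
```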
## Naive Bayes for Classification
- Classification Learning: what is the probability of class given instance?
- Evidence $E$ = instance
- Event $H$ = class for given instance
- Naive assumption: evidence splits into attributes that are independent
# $Pr[H|E] = \frac{Pr[E_1|H] \times Pr[E_2|H] \times \dots \times Pr[E_n|H] \times Pr[H]}{Pr[E]}$
- The denominator $Pr[E]$ is the same for every class, so it cancels out when the scores are normalised into probabilities
### Weather Data Example
![](Pasted%20image%2020241003133919.png)
# Laplace Estimator
- Remedy to the zero-frequency problem (a single zero count makes the whole product zero): add 1 to the count for every attribute value-class combination (Laplace estimator)
- Result: probabilities will never be 0 (also stabilises probability estimates)
- This simple remedy is often used in practice when the zero-frequency problem arises
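A minimal sketch of the estimator, using the *outlook* counts for class *yes* that appear in the estimates below (2 sunny, 4 overcast, 3 rainy, out of 9):

```python
# Laplace estimator: add 1 to every value count so no estimate is ever 0.
counts = {"sunny": 2, "overcast": 4, "rainy": 3}  # outlook counts, class yes
n = sum(counts.values())                          # 9 yes instances
k = len(counts)                                   # 3 possible outlook values

smoothed = {value: (c + 1) / (n + k) for value, c in counts.items()}
print(smoothed)  # sunny: 3/12, overcast: 5/12, rainy: 4/12
```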
## Example
![](Pasted%20image%2020241003134100.png)
# Modified Probability Estimates
- Consider attribute *outlook* for class *yes*
# $\frac{2+\frac{1}{3}\mu}{9+\mu}$
Sunny
# $\frac{4+\frac{1}{3}\mu}{9+\mu}$
Overcast
# $\frac{3+\frac{1}{3}\mu}{9+\mu}$
Rainy
- Each value treated the same way
- Prior to seeing the training set, assume each value is equally likely, i.e. a prior probability of $\frac{1}{3}$
- By deciding to add 1 to each count, we implicitly set $\mu$ to 3
- However, there is no particular reason to add 1 to the count; we could increment by 0.1 instead, setting $\mu$ to 0.3
- A large value of $\mu$ indicates prior probabilities are very important compared to evidence in training set.
## Fully Bayesian Formulation
# $\frac{2+\mu p_1}{9+\mu}$
Sunny
# $\frac{4+\mu p_2}{9+\mu}$
Overcast
# $\frac{3+\mu p_3}{9+\mu}$
Rainy
- Where $p_1 + p_2 + p_3 = 1$
- $p_1, p_2, p_3$ are prior probabilities of outlook being sunny, overcast or rainy before seeing the training set. However, in practice it is not clear how these prior probabilities should be assigned.
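A sketch of the modified estimate; with $\mu = 3$ and uniform priors $p_i = \frac{1}{3}$ it reduces to the Laplace estimator above:

```python
# Modified estimate: (count + mu * p) / (n + mu), where p is the value's prior.
def modified_estimate(count, n, mu, p):
    return (count + mu * p) / (n + mu)

counts = {"sunny": 2, "overcast": 4, "rainy": 3}  # outlook counts for class yes
for value, c in counts.items():
    # mu = 3 with uniform priors gives exactly Laplace's add-one estimates
    print(value, modified_estimate(c, n=9, mu=3, p=1 / 3))
```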

| Temperature | Skin | Blood Pressure | Blocked Nose | Diagnosis |
| ----------- | ------ | -------------- | ------------ | --------- |
| Low | Pale | Normal | True | N |
| Moderate | Pale | Normal | True | B |
| High | Normal | High | False | N |
| Moderate | Pale | Normal | False | B |
| High | Red | High | False | N |
| High | Red | High | True | N |
| Moderate | Red | High | False | B |
| Low | Normal | High | False | B |
| Low | Pale | Normal | False | B |
| Low | Normal | Normal | False | B |
| High | Normal | Normal | True | B |
| Moderate | Normal | High | True | B |
| Moderate | Red | Normal | False | B |
| Low | Normal | High | True | N |
| | Temperature | | | Skin | | | Pressure | | | Blocked | | Diag | |
| -------- | ----------- | --- | ------ | ---- | --- | ------ | -------- | --- | ----- | ------- | --- | ---- | ---- |
| | N | B | | N | B | | N | B | | N | B | N | B |
| Low | 2 | 3 | Pale | 1 | 3 | Normal | 1 | 6 | True | 3 | 3 | 5 | 9 |
| Moderate | 0 | 5 | Normal | 2 | 4 | High | 4 | 3 | False | 2 | 6 | | |
| High | 3 | 1 | Red | 2 | 2 | | | | | | | | |

| | Temperature | | | Skin | | | Pressure | | | Blocked | | Diag | |
| -------- | ----------- | --- | ------ | ---- | --- | ------ | -------- | --- | ----- | ------- | --- | ---- | ---- |
| | N | B | | N | B | | N | B | | N | B | N | B |
| Low | 2/5 | 3/9 | Pale | 1/5 | 3/9 | Normal | 1/5 | 6/9 | True | 3/5 | 3/9 | 5/14 | 9/14 |
| Moderate | 0/5 | 5/9 | Normal | 2/5 | 4/9 | High | 4/5 | 3/9 | False | 2/5 | 6/9 | | |
| High | 3/5 | 1/9 | Red | 2/5 | 2/9 | | | | | | | | |
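The count table can be rebuilt directly from the 14 rows above; a minimal sketch:

```python
from collections import Counter

# The 14 instances: (Temperature, Skin, BloodPressure, BlockedNose, Diagnosis)
rows = [
    ("Low", "Pale", "Normal", "True", "N"), ("Moderate", "Pale", "Normal", "True", "B"),
    ("High", "Normal", "High", "False", "N"), ("Moderate", "Pale", "Normal", "False", "B"),
    ("High", "Red", "High", "False", "N"), ("High", "Red", "High", "True", "N"),
    ("Moderate", "Red", "High", "False", "B"), ("Low", "Normal", "High", "False", "B"),
    ("Low", "Pale", "Normal", "False", "B"), ("Low", "Normal", "Normal", "False", "B"),
    ("High", "Normal", "Normal", "True", "B"), ("Moderate", "Normal", "High", "True", "B"),
    ("Moderate", "Red", "Normal", "False", "B"), ("Low", "Normal", "High", "True", "N"),
]

# Count every (attribute index, value, class) combination.
counts = Counter()
for *attrs, diag in rows:
    for i, v in enumerate(attrs):
        counts[(i, v, diag)] += 1

print(counts[(0, "Low", "N")])   # Temperature = Low with Diagnosis = N: 2
print(counts[(2, "High", "N")])  # BloodPressure = High with Diagnosis = N: 4
```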
# Problem 1
Classify the instance (Temperature = Low, Skin = Normal, BloodPressure = High, BlockedNose = True):
# $Pr[Diagnosis=N|E] = \frac{2}{5} \times \frac{2}{5} \times \frac{4}{5} \times \frac{3}{5} \times \frac{5}{14} = 0.027428571$
# $Pr[Diagnosis = B|E] = \frac{3}{9} \times \frac{4}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{9}{14} = 0.010582011$
# $p(B) = \frac{0.0106}{0.0106+0.0274} = 0.2789$
# $p(N) = \frac{0.0274}{0.0106+0.0274} = 0.7211$
Diagnosis N is much more likely than Diagnosis B
# Problem 2
Classify the instance (Temperature = Low, Skin = ?, BloodPressure = Normal, BlockedNose = True); the missing *Skin* value is simply omitted from the products:
# $Pr[Diagnosis = N|E] = \frac{2}{5} \times \frac{1}{5} \times \frac{3}{5} \times \frac{5}{14} = 0.0171$
# $Pr[Diagnosis = B|E] = \frac{3}{9} \times \frac{6}{9} \times \frac{3}{9} \times \frac{9}{14} = 0.0476$
# $p(N) = \frac{0.0171}{0.0171+0.0476} = 0.2643$
# $p(B) = \frac{0.0476}{0.0171+0.0476} = 0.7357$
Diagnosis B is much more likely than Diagnosis N
# Problem 3
Classify the instance (Temperature = Moderate, Skin = Normal, BloodPressure = High, BlockedNose = True):
# $Pr[Diagnosis = N|E] = \frac{0}{5} \times \frac{2}{5} \times \frac{4}{5} \times \frac{3}{5} \times \frac{5}{14} = 0$
# $Pr[Diagnosis = B|E] = \frac{5}{9} \times \frac{4}{9} \times \frac{3}{9} \times \frac{3}{9} \times \frac{9}{14} = 0.018$
Because *Moderate* never occurs with Diagnosis N, the N score is 0 regardless of the other evidence: this is the zero-frequency problem that the Laplace estimator remedies.
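The arithmetic can be checked numerically; a sketch for Problem 1, taking the instance (Low, Normal, High, True) implied by the factors:

```python
# Unnormalised scores: product of conditional probabilities times the prior.
score_n = (2/5) * (2/5) * (4/5) * (3/5) * (5/14)  # Diagnosis = N
score_b = (3/9) * (4/9) * (3/9) * (3/9) * (9/14)  # Diagnosis = B

# Pr[E] cancels out when normalising the two scores into probabilities.
p_n = score_n / (score_n + score_b)
p_b = score_b / (score_n + score_b)
print(round(score_n, 9), round(score_b, 9))  # 0.027428571 0.010582011
print(round(p_n, 4), round(p_b, 4))
```

Note the exact normalised values come out as 0.7216 and 0.2784; the 0.7211/0.2789 above reflect rounding the scores to four decimal places before normalising.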

# Weather Dataset
## Dataset
```
% This is a comment about the data set.
% This data describes examples of whether to play
% a game or not depending on weather conditions.
@relation letsPlay
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
```
## Output
```
=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: letsPlay
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Naive Bayes Classifier
Class
Attribute yes no
(0.63) (0.38)
===============================
outlook
sunny 3.0 4.0
overcast 5.0 1.0
rainy 4.0 3.0
[total] 12.0 8.0
temperature
mean 72.9697 74.8364
std. dev. 5.2304 7.384
weight sum 9 5
precision 1.9091 1.9091
humidity
mean 78.8395 86.1111
std. dev. 9.8023 9.2424
weight sum 9 5
precision 3.4444 3.4444
windy
TRUE 4.0 4.0
FALSE 7.0 3.0
[total] 11.0 7.0
Time taken to build model: 0 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0.01 seconds
=== Summary ===
Correctly Classified Instances 13 92.8571 %
Incorrectly Classified Instances 1 7.1429 %
Kappa statistic 0.8372
Mean absolute error 0.2798
Root mean squared error 0.3315
Relative absolute error 60.2576 %
Root relative squared error 69.1352 %
Total Number of Instances 14
```
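For the numeric attributes (`temperature`, `humidity`) a per-class Gaussian is fitted rather than value counts. A sketch of how a temperature likelihood would be read off the model above, using the reported mean and standard deviation for class *yes* (the query value 66 is just an example, and Weka's internal precision adjustment is ignored here):

```python
import math

def gaussian_density(x, mean, std):
    """Normal density, used for numeric attributes in naive Bayes."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Mean and std. dev. of temperature given play = yes, from the output above.
f = gaussian_density(66, 72.9697, 5.2304)
print(f)  # a density, not a probability; it can exceed 1 for small std
```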
# Medical Dataset
## Dataset
```
@relation medical
@attribute Temperature {Low,Moderate,High}
@attribute Skin {Pale,Normal,Red}
@attribute BloodPressure {Normal,High}
@attribute BlockedNose {True,False}
@attribute Diagnosis {N,B}
@data
Low, Pale, Normal, True, N
Moderate, Pale, Normal, True, B
High, Normal, High, False, N
Moderate, Pale, Normal, False, B
High, Red, High, False, N
High, Red, High, True, N
Moderate, Red, High, False, B
Low, Normal, High, False, B
Low, Pale, Normal, False, B
Low, Normal, Normal, False, B
High, Normal, Normal, True, B
Moderate, Normal, High, True, B
Moderate, Red, Normal, False, B
Low, Normal, High, True, N
```
## Output
```
=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: diagnosis
Instances: 14
Attributes: 5
Temperature
Skin
BloodPressure
BlockedNose
Diagnosis
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Naive Bayes Classifier
Class
Attribute N B
(0.38) (0.63)
==============================
Temperature
Low 3.0 4.0
Moderate 1.0 6.0
High 4.0 2.0
[total] 8.0 12.0
Skin
Pale 2.0 4.0
Normal 3.0 5.0
Red 3.0 3.0
[total] 8.0 12.0
BloodPressure
Normal 2.0 7.0
High 5.0 4.0
[total] 7.0 11.0
BlockedNose
True 4.0 4.0
False 3.0 7.0
[total] 7.0 11.0
Time taken to build model: 0 seconds
=== Evaluation on training set ===
Time taken to test model on training data: 0 seconds
=== Summary ===
Correctly Classified Instances 12 85.7143 %
Incorrectly Classified Instances 2 14.2857 %
Kappa statistic 0.6889
Mean absolute error 0.2635
Root mean squared error 0.3272
Relative absolute error 56.7565 %
Root relative squared error 68.2385 %
Total Number of Instances 14
```
# Using Test Data
## Test Data
```
@relation medical
@attribute Temperature {Low,Moderate,High}
@attribute Skin {Pale,Normal,Red}
@attribute BloodPressure {Normal,High}
@attribute BlockedNose {True,False}
@attribute Diagnosis {N,B}
@data
Low,Normal,High,True,N
Low,?,Normal,True,B
Moderate,Normal,High,True,B
```
## Output
```
=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayes
Relation: medical
Instances: 14
Attributes: 5
Temperature
Skin
BloodPressure
BlockedNose
Diagnosis
Test mode: user supplied test set: size unknown (reading incrementally)
=== Classifier model (full training set) ===
Naive Bayes Classifier
Class
Attribute N B
(0.38) (0.63)
==============================
Temperature
Low 3.0 4.0
Moderate 1.0 6.0
High 4.0 2.0
[total] 8.0 12.0
Skin
Pale 2.0 4.0
Normal 3.0 5.0
Red 3.0 3.0
[total] 8.0 12.0
BloodPressure
Normal 2.0 7.0
High 5.0 4.0
[total] 7.0 11.0
BlockedNose
True 4.0 4.0
False 3.0 7.0
[total] 7.0 11.0
Time taken to build model: 0 seconds
=== Predictions on test set ===
inst# actual predicted error prediction
1 1:N 1:N 0.652
2 2:B 2:B 0.677
3 2:B 2:B 0.706
=== Evaluation on test set ===
Time taken to test model on supplied test set: 0 seconds
=== Summary ===
Correctly Classified Instances 3 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0.3215
Root mean squared error 0.3223
Relative absolute error 70.1487 %
Root relative squared error 68.0965 %
Total Number of Instances 3
```
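The `prediction` column can be reproduced by hand from the classifier model printout: the model's counts already include Laplace's +1, and the class priors 0.38/0.63 are the smoothed $\frac{5+1}{14+2}$ and $\frac{9+1}{14+2}$. A sketch for test instance 1, (Low, Normal, High, True), assuming the smoothed counts are applied directly:

```python
# Posteriors from the Laplace-smoothed counts in the model printout above.
prior_n, prior_b = 6 / 16, 10 / 16   # (5+1)/(14+2) and (9+1)/(14+2)

# Instance 1: Temperature=Low, Skin=Normal, BloodPressure=High, BlockedNose=True
score_n = prior_n * (3/8) * (3/8) * (5/7) * (4/7)
score_b = prior_b * (4/12) * (5/12) * (4/11) * (4/11)

p_n = score_n / (score_n + score_b)
print(round(p_n, 3))  # 0.652 -- the prediction reported for instance 1
```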