# Logarithms

$\log_2 X$ is used when generating decision trees.

- The power to which 2 must be raised to get $X$
- When used here, $X$ is a probability between 0 and 1
- The log of a probability is therefore always negative (zero when $X = 1$)

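A quick check of these bullets, as a minimal sketch using only Python's standard `math` module:

```python
import math

# log2(X) is the power to which 2 must be raised to get X
print(math.log2(8))  # 3.0, since 2**3 = 8

# For probabilities (0 < X <= 1) the log is negative, or zero at X = 1
for p in [1.0, 0.5, 0.25, 0.1]:
    print(p, math.log2(p))  # 0.0, -1.0, -2.0, ~-3.32
```
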
# Decision Tree for Contact Lenses
![[Pasted image 20241030105357.png]]

- Drawn upside down (root at the top)
- Ellipse at top = root
- Edges = branches
- Rectangles = leaves
- Leaves assign a classification

## Strategy
- Grow the tree from the root
- Top-down
- The tree becomes more specific as it grows, described as general-to-specific
- Divide and conquer
- Stop if all examples have the same class
- How is the attribute for the root node selected?
- Consider how to generate a decision tree for the weather dataset, which has nominal values only (a skeleton of this strategy is sketched below)

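A minimal sketch of this top-down, divide-and-conquer strategy, assuming examples are dicts with a `"class"` key; the names `grow_tree` and `best_attribute` are illustrative, not from the lecture:

```python
from collections import Counter

def best_attribute(examples, attributes):
    # Placeholder: real ID3 picks the attribute with the highest
    # information gain (developed in the sections below).
    return attributes[0]

def grow_tree(examples, attributes):
    """Top-down, general-to-specific, divide-and-conquer tree growth."""
    classes = [ex["class"] for ex in examples]
    if len(set(classes)) == 1:      # all examples have the same class
        return classes[0]           # -> leaf assigning that class
    if not attributes:              # nothing left to split on
        return Counter(classes).most_common(1)[0][0]
    best = best_attribute(examples, attributes)
    tree = {best: {}}               # root/branch structure as a nested dict
    rest = [a for a in attributes if a != best]
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = grow_tree(subset, rest)   # one branch per value
    return tree
```
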
## Weather Dataset

### Criterion for Attribute Selection
- Which attribute is best?
- The one that produces the smallest tree
- Heuristic: choose the attribute that produces the purest nodes
- Information gain is a popular criterion for measuring impurity
- Information gain increases with the average purity of the subsets
- Choose the attribute that gives the greatest information gain

# Information

- The expected amount of information needed to specify whether a new example should be classified as yes or no, given that it has reached that node.

## Computing Information

- Information is measured in bits
- Given a probability distribution, the information required to predict an event is the distribution's entropy
- Entropy gives the required information in bits

# $I(p_1,p_2,\dots,p_n) = -p_1\log_2 p_1 - p_2\log_2 p_2 - \dots - p_n\log_2 p_n$

where $n$ is the number of classes, and $p_1 + p_2 + \dots + p_n = 1$

The minus signs are included so that the output is positive, since each $\log_2 p_i$ is negative or zero.

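A small sketch of this entropy formula in Python, taking a list of class counts and treating $0\log_2 0$ as 0:

```python
import math

def info(counts):
    """Entropy I(p1, ..., pn) in bits, from a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)   # skip zeros: 0*log2(0) -> 0

print(round(info([9, 5]), 3))  # 0.940 -> the full weather dataset (9 yes, 5 no)
```
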
### Expected Information for Outlook
- Outlook = Sunny
# $info([2,3]) = I(\frac{2}{5},\frac{3}{5}) = -\frac{2}{5}\log_2(\frac{2}{5}) - \frac{3}{5}\log_2(\frac{3}{5}) = 0.971\ \text{bits}$

- Outlook = Overcast
# $info([4,0]) = I(\frac{4}{4},\frac{0}{4}) = -1\log_2(1) - 0\log_2(0) = 0\ \text{bits}$

- Outlook = Rainy
# $info([3,2]) = I(\frac{3}{5},\frac{2}{5}) = -\frac{3}{5}\log_2(\frac{3}{5}) - \frac{2}{5}\log_2(\frac{2}{5}) = 0.971\ \text{bits}$

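The same three values, checked with the `info` sketch above (redefined here so the snippet runs on its own):

```python
import math

def info(counts):  # entropy in bits, as sketched earlier
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(info([2, 3]), 3))  # 0.971 (Sunny)
print(round(info([4, 0]), 3))  # 0 bits (Overcast is a pure node; may print -0.0)
print(round(info([3, 2]), 3))  # 0.971 (Rainy)
```
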
### Computing Information Gain
Information gain = information before splitting - information after splitting

The information after splitting is the weighted average of the subset entropies:

E(Outlook) = $\frac{5}{14} \times 0.971 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.971 = 0.693$ bits

gain(Outlook) = info([9,5]) - E(Outlook) = 0.940 - 0.693 = 0.247 bits

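The same calculation in code, with the subset counts for Sunny, Overcast and Rainy read off the weather table (`info` redefined again so the snippet is standalone):

```python
import math

def info(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

subsets = [[2, 3], [4, 0], [3, 2]]        # Sunny, Overcast, Rainy
n = sum(sum(s) for s in subsets)          # 14 examples in total
e_outlook = sum(sum(s) / n * info(s) for s in subsets)
gain_outlook = info([9, 5]) - e_outlook
print(e_outlook, gain_outlook)            # ~0.693 bits, ~0.247 bits
```
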
#### Information Gain for Attributes of Weather Data
gain(Outlook) = 0.247
gain(Temperature) = 0.029
gain(Humidity) = 0.152
gain(Windy) = 0.048

Outlook is selected for the root because it gains the most information; a standalone sketch reproducing all four gains follows.
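The 14-row table below is the standard nominal weather (play / don't play) dataset, transcribed here on the assumption that it matches the image above:

```python
import math
from collections import Counter

# (Outlook, Temperature, Humidity, Windy) -> Play
data = [
    ("Sunny", "Hot", "High", False, "No"),       ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),   ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),   ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),   ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Rainy", "Mild", "High", True, "No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Windy"]

def entropy(rows):
    """Entropy (bits) of the class labels in the last column."""
    total = len(rows)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(r[-1] for r in rows).values())

def gain(rows, i):
    """Information gain from splitting rows on attribute column i."""
    after = 0.0
    for v in {r[i] for r in rows}:
        subset = [r for r in rows if r[i] == v]
        after += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - after

for i, name in enumerate(attributes):
    print(f"gain({name}) = {gain(data, i):.3f}")
# gain(Outlook) = 0.247, gain(Temperature) = 0.029,
# gain(Humidity) = 0.152, gain(Windy) = 0.048 -> Outlook becomes the root
```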