vault backup: 2024-10-16 09:12:37

2024-10-16 09:12:37 +01:00
parent bad31f35c5
commit 124e0b67ef
190 changed files with 192115 additions and 0 deletions
--- a/Mining/Week
+++ b/Mining/Week
@@ -0,0 +1,214 @@
+# Assessment
+
+## T1
+
+- Exam (50%)
+
+## T2
+
+- Coursework (50%)
+
+# Resources
+
+Data Mining: Practical Machine Learning Tools and Techniques (Witten, Frank, Hall & Pal) 4th Edition 2016
+
+Scientific Calculator
+
+# Data Vs Information
+
+- Too much data
+- Valuable resource
+- Raw data is less important in itself; techniques are needed to extract information from it
+	- Data: recorded facts
+	- Information: patterns underlying data
+
+# Philosophy
+
+## Cow Culling
+
+- Each cow is described by 700 features
+- The problem is selecting which cows to cull
+- The data consists of historical records and the farmers' decisions
+- Machine learning is used to ascertain which factors the farmers take into account, rather than to automate the decision-making process
+
+# Definition of Data Mining
+
+- The extraction of information from data that is:
+	- Implicit,
+	- Previously unknown,
+	- Potentially useful
+- Programs that detect patterns and regularities are needed
+- Strong patterns => good predictions
+	- Issues:
+		- Most patterns not interesting
+		- Patterns may be inexact
+		- Data may be garbled or missing
+
+# Machine Learning Techniques
+
+- Algorithms for acquiring structural descriptions from examples
+- Structural descriptions represent patterns explicitly and can be used to:
+	- Predict the outcome in a new situation
+	- Understand and explain how the prediction is derived
+- Methods originate from AI, statistics and database research.
+
+# Can Machines Learn?
+
+- By definition, sort of: learning is the ability to obtain knowledge by study, experience or being taught, but this is very difficult to measure.
+- Does learning imply intention?
+
+# Terminology
+
+- Concept - The thing to be learned
+- Example / Instance - An individual, independent example of a concept
+- Attribute / Feature - A measured aspect of an example / instance
+- Concept description (pattern, model, hypothesis) - The output of a data mining algorithm
+
+# Famous Small Datasets
+
+- Will be used in module
+- Unrealistically simple
+
+## Weather Dataset - Nominal
+
+Concept: conditions which are suitable for a game.
+Reference: Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
+
+### Attributes
+
+3\*3\*2\*2 = 36 possible combinations of values.
+
+- Outlook
+	- sunny, overcast, rainy
+- Temperature
+	- hot, mild, cool
+- Humidity
+	- high, normal
+- Windy
+	- yes, no
+- Play (class)
+	- yes, no
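+
+As a minimal sketch of where the 36 comes from, the attribute values listed above can be enumerated with `itertools.product`. The name `ATTRIBUTES` is only an illustrative choice, and the class attribute (Play) is left out because it is the thing being predicted.
+
+```python
+from itertools import product
+
+# Value sets for the four non-class attributes of the nominal weather data.
+ATTRIBUTES = {
+    "outlook": ["sunny", "overcast", "rainy"],
+    "temperature": ["hot", "mild", "cool"],
+    "humidity": ["high", "normal"],
+    "windy": ["yes", "no"],
+}
+
+# Every possible instance is one choice of value per attribute.
+combinations = [
+    dict(zip(ATTRIBUTES, values)) for values in product(*ATTRIBUTES.values())
+]
+
+print(len(combinations))  # 3 * 3 * 2 * 2 = 36
+print(combinations[0])
+```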
+
+### Dataset
+
+![](Pasted%20image%2020240919134249.png)
+![](Pasted%20image%2020240919134304.png)
+
+The rules are ordered: a rule higher in the list takes priority over the rules below it.
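+
+A rough sketch of how an ordered rule list might be applied: rules are tried from the top and the first match wins. The rules below are assumed to follow the decision list given for this dataset in the Witten et al. textbook and may not match the image exactly; `RULES` and `classify` are illustrative names.
+
+```python
+# Each rule is (conditions, predicted class); the empty dict is the default rule.
+# Assumption: these rules follow the textbook's decision list for the weather
+# data and may differ from the ones shown in the image.
+RULES = [
+    ({"outlook": "sunny", "humidity": "high"}, "no"),
+    ({"outlook": "rainy", "windy": "yes"}, "no"),
+    ({"outlook": "overcast"}, "yes"),
+    ({"humidity": "normal"}, "yes"),
+    ({}, "yes"),  # none of the above
+]
+
+
+def classify(instance):
+    """Return the class of the first rule whose conditions all match."""
+    for conditions, play in RULES:
+        if all(instance.get(attr) == value for attr, value in conditions.items()):
+            return play
+    return None  # only reached if the list has no default rule
+
+
+day = {"outlook": "rainy", "temperature": "cool", "humidity": "normal", "windy": "yes"}
+print(classify(day))  # the rainy-and-windy rule fires before the humidity rule
+```
+
+The example day also satisfies the `humidity = normal` rule, which would predict `yes`; the earlier rule takes priority, so the answer is `no`.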
+
+## Weather Dataset - Mixed
+
+![](Pasted%20image%2020240919134526.png)
+![](Pasted%20image%2020240919134535.png)
+
+## Contact Lenses Dataset
+
+Describes conditions under which an optician might want to prescribe soft, hard or no contact lenses.
+Grossly over-simplified.
+Reference: Cendrowska, J. (1987). PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4), 349–370.
+
+### Attributes
+
+3\*2\*2\*2 = 24 possible combinations of values.
+The dataset is exhaustive (it contains all 24 combinations), which is unusual.
+
+- Age
+	- young, pre-presbyopic, presbyopic
+- Spectacle Prescription
+	- myope (short), hypermetrope (long)
+- Astigmatism
+	- yes, no
+- Tear Production Rate
+	- reduced, normal
+- Recommended Lenses (class)
+	- hard, soft, none
+
+### Dataset
+
+![](Pasted%20image%2020240919134848.png)
+
+## Iris Dataset
+
+Used in many statistical experiments
+Contains numeric attributes of 3 different types of iris.
+Introduced in 1936 by Ronald Fisher, using measurements collected by Edgar Anderson
+
+### Dataset
+
+![](Pasted%20image%2020240920130950.png)
+
+# Styles of Learning
+
+- Classification Learning: Predicting a **nominal** class
+- Numeric Prediction (Regression): Predicting a **numeric** quantity
+- Clustering: Grouping similar examples into clusters
+- Association Learning: Detecting associations between attributes
+
+## Classification Learning
+
+- Nominal
+- Supervised
+	- Provided with actual value of the class
+- Measure success on fresh data for which class labels are known (test data)
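+
+A minimal sketch of measuring success on test data: predict each test instance, compare with the known class, and report the fraction correct. The `predict` function is a hypothetical hand-written classifier and the test instances are made up for illustration, not taken from the datasets above.
+
+```python
+def predict(instance):
+    """Hypothetical classifier: a single hand-written rule on outlook."""
+    return "no" if instance["outlook"] == "sunny" else "yes"
+
+
+# Made-up test set: (instance, actual class) pairs that were not used
+# to build the classifier.
+test_data = [
+    ({"outlook": "sunny", "humidity": "high"}, "no"),
+    ({"outlook": "sunny", "humidity": "normal"}, "yes"),
+    ({"outlook": "overcast", "humidity": "high"}, "yes"),
+    ({"outlook": "rainy", "humidity": "normal"}, "yes"),
+]
+
+correct = sum(predict(instance) == actual for instance, actual in test_data)
+accuracy = correct / len(test_data)
+print(f"accuracy = {accuracy:.2f}")  # 3 of the 4 test instances are correct
+```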
+
+## Numeric Prediction (Regression)
+
+- Numeric
+- Supervised
+- Success is measured on test data
+
+![](Pasted%20image%2020240920131244.png)
+
+The example uses a linear regression function to estimate a performance value from the attributes.
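+
+The function in the image is not reproduced here. As a sketch of the general idea, assuming a single numeric attribute and made-up training values, a least-squares line can be fitted and then used to estimate the numeric class of a new example; `fit_line` is an illustrative name.
+
+```python
+def fit_line(xs, ys):
+    """Least-squares fit of y = intercept + slope * x for one numeric attribute."""
+    n = len(xs)
+    mean_x = sum(xs) / n
+    mean_y = sum(ys) / n
+    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
+        (x - mean_x) ** 2 for x in xs
+    )
+    intercept = mean_y - slope * mean_x
+    return intercept, slope
+
+
+# Made-up training data: one attribute value and the numeric class per example.
+attribute = [1.0, 2.0, 3.0, 4.0]
+performance = [3.0, 5.0, 7.0, 9.0]
+
+intercept, slope = fit_line(attribute, performance)
+print(intercept, slope)         # the data lie exactly on y = 1 + 2x
+print(intercept + slope * 5.0)  # estimate for an unseen example
+```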
+
+## Clustering
+
+- Finding similar groups
+- Unsupervised
+	- Class of example is unknown
+- Success measured **subjectively**
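+
+The notes do not name a clustering algorithm. As one possible illustration, the sketch below uses plain k-means on single numbers; the values and starting centres are made up, and no class labels are involved.
+
+```python
+def k_means_1d(values, centres, iterations=10):
+    """Plain k-means on single numbers; returns the final centres and clusters."""
+    clusters = [[] for _ in centres]
+    for _ in range(iterations):
+        clusters = [[] for _ in centres]
+        for v in values:
+            # Assign each value to its nearest centre.
+            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
+            clusters[nearest].append(v)
+        # Move each centre to the mean of its cluster (unchanged if empty).
+        centres = [
+            sum(c) / len(c) if c else centre for c, centre in zip(clusters, centres)
+        ]
+    return centres, clusters
+
+
+# Made-up measurements with no class labels attached.
+lengths = [1.0, 1.2, 1.1, 4.8, 5.0, 5.2]
+centres, clusters = k_means_1d(lengths, centres=[0.0, 6.0])
+print(clusters)
+print(centres)
+```
+
+Whether two groups is the "right" answer here is a judgement call, which is why success is measured subjectively.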
+
+## Association Learning
+
+- Applied if no class is specified and any kind of structure is interesting
+- Differences from Classification Learning:
+	- Can predict any attribute's value, not just the class
+	- Can predict more than one attribute's value at a time
+	- Hence there are far more association rules than classification rules
+
+## Classification Vs Association Rules
+
+Classification Rule:
+
+- Predicts value of a given attribute (class of example)
+- ``If outlook = sunny and humidity = high, then play = no``
+
+Association Rule:
+
+- Predicts value of arbitrary attribute / combination
+
+```
+If temperature = cool then humidity = normal
+If humidity = normal and windy = false then play = yes
+If outlook = sunny and play = no then humidity = high
+If windy = false and play = no then outlook = sunny and humidity = high
+```
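+
+One way to see the difference is to write both kinds of rule in the same form, as an (antecedent, consequent) pair: a classification rule always has the class in its consequent, while an association rule can have any attribute there. This is only a sketch; the names `applies` and `holds` and the instance are made up for illustration.
+
+```python
+# A rule is a pair (antecedent, consequent), each a dict of attribute -> value.
+classification_rule = ({"outlook": "sunny", "humidity": "high"}, {"play": "no"})
+association_rule = ({"outlook": "sunny", "play": "no"}, {"humidity": "high"})
+
+
+def applies(rule, instance):
+    """True if the instance satisfies every condition in the antecedent."""
+    antecedent, _ = rule
+    return all(instance.get(a) == v for a, v in antecedent.items())
+
+
+def holds(rule, instance):
+    """True if the rule applies and its consequent is also satisfied."""
+    _, consequent = rule
+    return applies(rule, instance) and all(
+        instance.get(a) == v for a, v in consequent.items()
+    )
+
+
+# Made-up instance; the class (play) is treated as just another attribute.
+instance = {"outlook": "sunny", "temperature": "hot", "humidity": "high", "play": "no"}
+
+print(holds(classification_rule, instance))  # consequent is the class attribute
+print(holds(association_rule, instance))     # consequent is an ordinary attribute
+```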
+
+# Data Mining and Ethics
+
+- Ethical issues arise in practical applications
+- Data mining is often used to discriminate
+- The ethical situation depends on the application
+- Attributes may contain problematic information
+- Does ownership of data bestow the right to use it in ways other than those stated when it was originally collected?
+- Who is permitted to access the data?
+- For what purpose was the data collected?
+- What conclusions can sensibly be drawn?
+- Caveats must be attached to results
+- Purely statistical arguments are never sufficient