vault backup: 2024-10-16 09:12:37

boris
2024-10-16 09:12:37 +01:00
parent bad31f35c5
commit 124e0b67ef
190 changed files with 192115 additions and 0 deletions

@@ -0,0 +1,214 @@
# Assessment
## T1
- Exam (50%)
## T2
- Coursework (50%)
# Resources
Data Mining: Practical Machine Learning Tools and Techniques (Witten, Frank, Hall & Pal) 4th Edition 2016
Scientific Calculator
# Data Vs Information
- Too much data
- Valuable resource
- Raw data matters less than the information in it, so techniques are needed to extract that information
- Data: recorded facts
- Information: patterns underlying data
# Philosophy
## Cow Culling
- Each cow is described by 700 features
- Problem: selecting which cows to cull
- Data is historical records and farmers' decisions
- Machine learning is used to ascertain which factors farmers take into account, rather than to automate the decision-making process.
# Definition of Data Mining
- The extraction of implicit, previously unknown, and potentially useful information from data
- Programs that detect patterns and regularities are needed
- Strong patterns => good predictions
- Issues:
    - Most patterns not interesting
    - Patterns may be inexact
    - Data may be garbled or missing
# Machine Learning Techniques
- Algorithms for acquiring structural descriptions from examples
- Structural descriptions represent patterns explicitly
- They can be used to predict the outcome in a new situation
- They can also be used to understand and explain how a prediction is derived
- Methods originate from AI, statistics and research on databases.
# Can Machines Learn?
- By definition, sort of: learning is the ability to obtain knowledge by study, experience, or being taught, which is very difficult to measure.
- Does learning imply intention?
# Terminology
- Concept - Thing to be learned
- Example / Instance - An individual, independent example of a concept
- Attributes / Features - Measured aspects of an example / instance
- Concept description (pattern, model, hypothesis) - Output of a data mining algorithm
# Famous Small Datasets
- Will be used in module
- Unrealistically simple
## Weather Dataset - Nominal
Concept: conditions which are suitable for a game.
Reference: Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
### Attributes
3\*3\*2\*2 = 36 possible combinations of values.
Outlook
- sunny, overcast, rainy
Temperature
- hot, mild, cool
Humidity
- high, normal
Windy
- true, false
Play
- Class
- yes, no
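
The count of 36 comes from multiplying the number of values of each non-class attribute. A minimal sketch of that arithmetic, using the value lists from these notes (the dictionary name is just for illustration):

```python
from math import prod

# Attribute values as listed in these notes; the class attribute (play) is excluded.
# Windy is written as true/false to match the rules quoted later in the notes.
weather_attributes = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": ["true", "false"],
}

# Each attribute is chosen independently, so the counts multiply.
n_combinations = prod(len(values) for values in weather_attributes.values())
print(n_combinations)  # 3 * 3 * 2 * 2 = 36
```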
### Dataset
![](Pasted%20image%2020240919134249.png)
![](Pasted%20image%2020240919134304.png)
Rules are ordered: a rule higher in the list takes priority over the rules below it.
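
A rough sketch of how an ordered rule list behaves: the first rule whose conditions all hold decides the class. The rules below are an assumption (the decision list usually quoted for this dataset in the Witten et al. textbook); the pasted image may show a different set, and `classify` is a made-up helper name.

```python
# Ordered rules (a decision list): each entry is (conditions, predicted class).
# Assumption: textbook rule set for the nominal weather data, not read from the image.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": "true"}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal"}, "yes"),
    ({}, "yes"),  # default rule: an empty condition matches every instance
]


def classify(instance, rules=RULES):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return label
    return None  # only reached if the rule list has no default rule


day = {"outlook": "rainy", "temperature": "cool", "humidity": "normal", "windy": "true"}
print(classify(day))               # "no": rule 2 fires before rule 4 is reached
print(classify(day, RULES[::-1]))  # "yes": reversed, the default rule fires first
```

The same instance gets a different class once the order changes, which is why the rules cannot be read independently of one another.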
## Weather Dataset - Mixed
![](Pasted%20image%2020240919134526.png)
![](Pasted%20image%2020240919134535.png)
## Contact Lenses Dataset
Describes conditions under which an optician might want to prescribe soft, hard or no contact lenses.
Grossly over-simplified.
Reference: Cendrowska, J. (1987). PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4), 349-370.
### Attributes
3\*2\*2\*2 = 24 possibilities
Dataset is exhaustive, which is unusual.
Age
- young, pre-presbyopic, presbyopic
Spectacle Prescription
- myope (short-sighted), hypermetrope (long-sighted)
Astigmatism
- yes, no
Tear Production Rate
- reduced, normal
Recommended Lenses
- class
- hard, soft, none
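
"Exhaustive" here means every one of the 24 attribute-value combinations appears in the dataset. A small sketch of how that could be checked; `is_exhaustive` is a hypothetical helper, and the rows passed to it would be the attribute values of each instance with the class left out:

```python
from itertools import product

# Attribute values as listed in these notes; the class (recommended lenses) is excluded.
lens_attributes = {
    "age": ["young", "pre-presbyopic", "presbyopic"],
    "spectacle_prescription": ["myope", "hypermetrope"],
    "astigmatism": ["yes", "no"],
    "tear_production_rate": ["reduced", "normal"],
}

# Every possible instance: one value per attribute, in all combinations.
all_instances = set(product(*lens_attributes.values()))


def is_exhaustive(rows):
    """True if rows (tuples of attribute values) include every possible combination."""
    return all_instances <= set(rows)


print(len(all_instances))  # 3 * 2 * 2 * 2 = 24
```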
### Dataset
![](Pasted%20image%2020240919134848.png)
## Iris Dataset
Used in many statistical experiments
Contains numeric attributes of 3 different types of iris.
Created in 1936 by Sir Ronald Fisher
### Dataset
![](Pasted%20image%2020240920130950.png)
# Styles of Learning
- Classification Learning: Predicting a **nominal** class
- Numeric Prediction (Regression): Predicting a **numeric** quantity
- Clustering: Grouping similar examples into clusters
- Association Learning: Detecting associations between attributes
## Classification Learning
- Nominal
- Supervised
- Provided with actual value of the class
- Measure success on fresh data for which class labels are known (test data)
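
A minimal sketch of measuring success on test data. The "classifier" is deliberately trivial (always predict the most common training class) and the labels are invented for illustration; the point is only that accuracy is computed on instances that were not used for training and whose true class is known.

```python
from collections import Counter


def train_majority(train_labels):
    """'Learn' the simplest possible classifier: always predict the most common class."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of test instances whose predicted class equals the known class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical class labels, invented for illustration only.
train_labels = ["yes", "yes", "no", "yes", "no", "yes"]
test_labels = ["yes", "no", "yes", "yes"]  # fresh data with known labels

model = train_majority(train_labels)
predictions = [model] * len(test_labels)
print(model, accuracy(predictions, test_labels))  # yes 0.75
```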
## Numeric Prediction (Regression)
- Numeric
- Supervised
- Success measured on test data
![](Pasted%20image%2020240920131244.png)
Example uses a linear regression function to provide an estimated performance value based on attributes.
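
A sketch of the idea with a single numeric attribute (the example in the image uses several). The function fits a straight line by ordinary least squares; the data points are hypothetical and chosen to lie exactly on y = 2x + 1 so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one numeric attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept  # estimate for an unseen attribute value
print(slope, intercept, predicted)
```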
## Clustering
- Finding similar groups
- Unsupervised
- Class of example is unknown
- Success measured **subjectively**
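
One way to see "finding similar groups" in practice is k-means, a common clustering method that these notes do not prescribe; it is used here purely as an illustration. The sketch works on one-dimensional points, and both the points and the starting centroids are made up.

```python
def k_means_1d(points, centroids, iterations=10):
    """Tiny k-means for 1-D data: assign points to the nearest centroid, then re-average."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for each point.
        labels = [
            min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    return labels, centroids


# Hypothetical points forming two obvious groups; no class labels are given.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels, centroids = k_means_1d(points, centroids=[0.0, 6.0])
print(labels)
```

No class labels are involved, so there is nothing to score the result against; whether two groups is the "right" answer is a judgment call.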
## Association Learning
- Applied if no class specified, and any kind of structure is interesting
- Differences from classification learning:
    - Can predict any attribute's value, not just the class
    - Can predict more than one attribute's value at a time
- Far more association rules than classification rules.
## Classification Vs Association Rules
Classification Rule:
- Predicts value of a given attribute (class of example)
- ``If outlook = sunny and humidity = high, then play = no``
Association Rule:
- Predicts value of arbitrary attribute / combination
```
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high
```
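
A sketch for checking rules like these against the data. The 14 rows are the nominal weather data as commonly published (Quinlan, 1986), assumed to match the pasted image above; `evaluate` is a hypothetical helper that counts how many instances a rule's "if" part covers and how many of those also satisfy its "then" part.

```python
FIELDS = ("outlook", "temperature", "humidity", "windy", "play")

# Assumption: standard published version of the dataset, not read from the image.
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
DATA = [dict(zip(FIELDS, row)) for row in ROWS]


def evaluate(antecedent, consequent, data=DATA):
    """Return (covered, correct) counts for the rule 'if antecedent then consequent'."""
    covered = [d for d in data if all(d[a] == v for a, v in antecedent.items())]
    correct = [d for d in covered if all(d[a] == v for a, v in consequent.items())]
    return len(covered), len(correct)


print(evaluate({"temperature": "cool"}, {"humidity": "normal"}))
print(evaluate({"windy": "false", "play": "no"},
               {"outlook": "sunny", "humidity": "high"}))
```

Any attribute can appear on the right-hand side, and the last rule predicts two attributes at once, which is the difference from a classification rule.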
# Data Mining and Ethics
- Ethical Issues arise in practical applications
- Data mining often used to discriminate
- Ethical situation depends on application
- Attributes may contain problematic information
- Does ownership of data bestow the right to use it in ways other than those stated when it was originally collected?
- Who is permitted to access the data?
- For what purpose was the data collected?
- What conclusions can sensibly be drawn?
- Caveats must be attached to results
- Purely statistical arguments never sufficient