# Assessment
## T1
- Exam (50%)
## T2
- Coursework (50%)
# Resources
Data Mining: Practical Machine Learning Tools and Techniques (Witten, Frank, Hall & Pal) 4th Edition 2016
Scientific Calculator
# Data Vs Information
- Too much data
- Valuable resource
- Raw data less important, need to develop techniques to extract information
- Data: recorded facts
- Information: patterns underlying data
# Philosophy
## Cow Culling
- Each cow is described by 700 features
- Problem: selecting which cows to cull
- Data: historical records and farmers' decisions
- Machine learning was used to ascertain which factors farmers take into account, rather than to automate the decision-making process.
# Definition of Data Mining
- The extraction of information from data that is:
  - Implicit,
  - Previously unknown,
  - Potentially useful
- Programs that detect patterns and regularities are needed
- Strong patterns => good predictions
- Issues:
  - Most patterns are not interesting
  - Patterns may be inexact
  - Data may be garbled or missing
# Machine Learning Techniques
- Algorithms for acquiring structural descriptions from examples
- Structural descriptions represent patterns explicitly, and can be used to:
  - Predict the outcome in a new situation
  - Understand and explain how a prediction is derived
- Methods originate from AI, statistics and research on databases.
# Can Machines Learn?
- By the dictionary definition, sort of: learning is the ability to obtain knowledge by study, experience or being taught, which is very difficult to measure.
- Does learning imply intention?
# Terminology
- Concept - Thing to be learned
- Example / Instance - An individual, independent example of a concept
- Attributes / Features - Measured aspects of an example / instance
- Concept description (pattern, model, hypothesis) - Output of a data mining algorithm.
# Famous Small Datasets
- Will be used in module
- Unrealistically simple
## Weather Dataset - Nominal
Concept: conditions which are suitable for a game.
Reference: Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
### Attributes
3\*3\*2\*2 = 36 possible combinations of values.
Outlook
- sunny, overcast, rainy
Temperature
- hot, mild, cool
Humidity
- high, normal
Windy
- yes, no
Play (class)
- yes, no
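
The count of 36 can be checked by enumerating every combination of attribute values. This is a minimal sketch: the dictionary name `attributes` is illustrative, and the class attribute (Play) is left out because it is the value being predicted.

```python
from itertools import product

# Attribute values as listed in the notes; the class attribute (Play) is excluded.
attributes = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": ["yes", "no"],
}

# The Cartesian product yields one tuple per possible combination of values.
combinations = list(product(*attributes.values()))
print(len(combinations))  # 3 * 3 * 2 * 2 = 36
```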
### Dataset
![](Pasted%20image%2020240919134249.png)
![](Pasted%20image%2020240919134304.png)
The rules are ordered: a rule higher in the list takes priority over those below it.
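
An ordered rule set like this (a decision list) can be sketched as a sequence of condition/outcome pairs that is scanned from the top, returning the first match. The pasted images are not reproduced here, so the rules below are an assumption: they are the rule set usually given for this dataset in Witten et al. The function name `classify` and the use of booleans for `windy` are illustrative choices.

```python
# Each rule is (condition, predicted class); earlier rules take priority.
RULES = [
    (lambda x: x["outlook"] == "sunny" and x["humidity"] == "high", "no"),
    (lambda x: x["outlook"] == "rainy" and x["windy"], "no"),
    (lambda x: x["outlook"] == "overcast", "yes"),
    (lambda x: x["humidity"] == "normal", "yes"),
]
DEFAULT = "yes"  # used when none of the rules above fires


def classify(instance):
    """Return the class of the first rule whose condition matches."""
    for condition, outcome in RULES:
        if condition(instance):
            return outcome
    return DEFAULT


example = {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": False}
print(classify(example))  # no
```

Because the list is ordered, a rule such as "humidity = normal then play = yes" is only reached when no earlier rule applies; read in isolation it would misclassify some instances.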
## Weather Dataset - Mixed
![](Pasted%20image%2020240919134526.png)
![](Pasted%20image%2020240919134535.png)
## Contact Lenses Dataset
Describes conditions under which an optician might want to prescribe soft, hard or no contact lenses.
Grossly over-simplified.
Reference: Cendrowska, J. (1987). PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4), 349-370.
### Attributes
3\*2\*2\*2 = 24 possibilities
Dataset is exhaustive, which is unusual.
Age
- young, pre-presbyopic, presbyopic
Spectacle Prescription
- myope (short), hypermetrope (long)
Astigmatism
- yes, no
Tear Production Rate
- reduced, normal
Recommended Lenses (class)
- hard, soft, none
### Dataset
![](Pasted%20image%2020240919134848.png)
## Iris Dataset
Used in many statistical experiments
Contains numeric attributes of 3 different types of iris.
Published in 1936 by Sir Ronald Fisher
### Dataset
![](Pasted%20image%2020240920130950.png)
# Styles of Learning
- Classification Learning: Predicting a **nominal** class
- Numeric Prediction (Regression): Predicting a **numeric** quantity
- Clustering: Grouping similar examples into clusters
- Association Learning: Detecting associations between attributes
## Classification Learning
- Class is nominal
- Supervised: the scheme is provided with the actual class value of each training example
- Success is measured on fresh data for which the class labels are known (test data)
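
The train/test idea can be sketched with a very small learner. The code below is an illustration rather than a method from the lecture: it predicts the majority class for each value of a single attribute (outlook), a simplification in the spirit of 1R. The (outlook, play) pairs are assumed to match the standard 14-row weather table, and the 10/4 split is arbitrary.

```python
from collections import Counter, defaultdict

# (outlook, play) pairs, assumed to follow the standard weather table row order.
data = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]
train, test = data[:10], data[10:]  # arbitrary split: the test rows are never used for learning


def fit(examples):
    """Learn the majority class for each outlook value from labelled examples."""
    counts = defaultdict(Counter)
    for outlook, play in examples:
        counts[outlook][play] += 1
    return {outlook: c.most_common(1)[0][0] for outlook, c in counts.items()}


model = fit(train)
# Success is measured only on the held-out examples, whose labels are known.
correct = sum(model[outlook] == play for outlook, play in test)
accuracy = correct / len(test)
print(model, accuracy)
```

With so few test examples the accuracy estimate is very coarse; the point is only that it is computed on data the learner has not seen.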
## Numeric Prediction (Regression)
- Class is numeric
- Supervised
- Success is measured on test data
![](Pasted%20image%2020240920131244.png)
The example uses a linear regression function to estimate a performance value from the attributes.
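
A linear regression function of this kind can be sketched for the simplest case of a single attribute, using the closed-form least-squares solution. The data points below are made up for illustration and are not taken from the figure; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one numeric attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: one attribute value and a numeric class per example.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # numeric prediction for a new example
print(slope, intercept, prediction)
```

With several attributes the same idea applies, with one weight per attribute plus a constant term.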
## Clustering
- Finding similar groups
- Unsupervised
- Class of example is unknown
- Success measured **subjectively**
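
As one concrete example of grouping similar instances without class labels, here is a minimal sketch of k-means. The notes do not name a clustering algorithm, so this choice is an assumption, and the points and starting centroids are invented for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from the point to each centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Unlabelled 2-D instances (hypothetical) and fixed starting centroids.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters)
```

No class label is used anywhere, which is why judging whether the resulting groups are "right" is largely subjective.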
## Association Learning
- Can be applied if no class is specified and any kind of structure is interesting
- Differences from classification learning:
  - Can predict any attribute's value, not just the class
  - Can predict more than one attribute's value at a time
- Hence there are far more association rules than classification rules
## Classification Vs Association Rules
Classification Rule:
- Predicts value of a given attribute (class of example)
- ``If outlook = sunny and humidity = high, then play = no``
Association Rule:
- Predicts the value of an arbitrary attribute (or combination of attributes)
```
If temperature = cool, then humidity = normal
If humidity = normal and windy = false, then play = yes
If outlook = sunny and play = no, then humidity = high
If windy = false and play = no, then outlook = sunny and humidity = high
```
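
Association rules like these are usually judged by how many instances they predict correctly (coverage, also called support) and by the fraction of matching instances they get right (accuracy, also called confidence). The sketch below computes both. The table is an assumption: it is the standard 14-row nominal weather dataset, typed in here because the images are not reproduced, and `evaluate` is a hypothetical helper.

```python
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
data = [dict(zip(COLUMNS, row)) for row in ROWS]


def evaluate(antecedent, consequent):
    """Return (coverage, confidence) of the rule 'if antecedent then consequent'."""
    matches = [r for r in data if all(r[k] == v for k, v in antecedent.items())]
    correct = [r for r in matches if all(r[k] == v for k, v in consequent.items())]
    confidence = len(correct) / len(matches) if matches else 0.0
    return len(correct), confidence


print(evaluate({"temperature": "cool"}, {"humidity": "normal"}))
print(evaluate({"windy": "false", "play": "no"}, {"outlook": "sunny", "humidity": "high"}))
```

Any attribute can appear on the right-hand side, which is why the number of candidate association rules is far larger than the number of classification rules.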
# Data Mining and Ethics
- Ethical Issues arise in practical applications
- Data mining often used to discriminate
- Ethical situation depends on application
- Attributes may contain problematic information
- Does ownership of data bestow right to use it in other ways than those purported when it was originally collected?
- Who is permitted to access the data?
- For what purpose was the data collected?
- What conclusions can sensibly be drawn?
- Caveats must be attached to results
- Purely statistical arguments never sufficient