# Assessment

## T1

- Exam (50%)

## T2

- Coursework (50%)

# Resources

Data Mining: Practical Machine Learning Tools and Techniques (Witten, Frank, Hall & Pal), 4th Edition, 2016

Scientific Calculator

# Data Vs Information

- Too much data
- Valuable resource
- Raw data is less important than the information in it; techniques are needed to extract that information
- Data: recorded facts
- Information: patterns underlying data

# Philosophy

## Cow Culling

- Each cow is described by 700 features
- Problem: selecting which cows to cull
- Data: historical records and farmers' decisions
- Machine learning is used to ascertain which factors farmers take into account, rather than to automate the decision-making process.

# Definition of Data Mining

- The extraction of:
  - Implicit,
  - Previously unknown,
  - Potentially useful information from data
- Programs that detect patterns and regularities are needed
- Strong patterns => good predictions
- Issues:
  - Most patterns not interesting
  - Patterns may be inexact
  - Data may be garbled or missing

# Machine Learning Techniques

- Algorithms for acquiring structural descriptions from examples
- Structural descriptions represent patterns explicitly. They can be used to:
  - Predict the outcome in a new situation
  - Understand and explain how a prediction is derived
- Methods originate from AI, statistics and research on databases.

# Can Machines Learn?

- By the dictionary definition, sort of: learning is the ability to obtain knowledge by study, experience or being taught, but that is very difficult to measure.
- Does learning imply intention?

# Terminology

- Concept - Thing to be learned
- Example / Instance - Individual, independent example of a concept
- Attributes / Features - Measured aspects of an example / instance
- Concept description (pattern, model, hypothesis) - Output of a data mining algorithm

# Famous Small Datasets

- Will be used in module
- Unrealistically simple

## Weather Dataset - Nominal

Concept: conditions which are suitable for a game.

Reference: Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.

### Attributes

3\*3\*2\*2 = 36 possible combinations of values.

Outlook

- sunny, overcast, rainy

Temperature

- hot, mild, cool

Humidity

- high, normal

Windy

- yes, no

Play

- Class
- yes, no

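
The count of 36 can be checked by enumerating every combination of the four non-class attribute values. This is a minimal sketch; the names `ATTRIBUTES` and `all_combinations` are made up for illustration and do not come from the module.

```python
from itertools import product

# Attribute values as listed in these notes (the class attribute "play" is excluded).
ATTRIBUTES = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": ["yes", "no"],
}


def all_combinations(attributes):
    """Return every possible instance as a dict of attribute name -> value."""
    names = list(attributes)
    return [dict(zip(names, values)) for values in product(*attributes.values())]


combinations = all_combinations(ATTRIBUTES)
print(len(combinations))  # 36, i.e. 3 * 3 * 2 * 2
```
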
### Dataset

Rules are ordered: a rule higher in the list takes priority over the rules below it.

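
One way to picture rule priority is a list that is scanned from the top, where the first matching rule decides the class. The sketch below is an assumption about how such an ordered list could be coded, not the module's implementation. The first rule is the classification rule quoted in these notes; the second rule and the default class are hypothetical and exist only to show the ordering.

```python
# Each rule is (conditions, predicted class). Rules are tried from top to bottom.
RULES = [
    # Classification rule quoted in these notes.
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    # Hypothetical lower-priority rule, included only to show the ordering.
    ({"outlook": "sunny"}, "yes"),
]
DEFAULT_CLASS = "yes"  # hypothetical fallback when no rule matches


def classify(instance, rules=RULES, default=DEFAULT_CLASS):
    """Return the prediction of the first rule whose conditions all hold."""
    for conditions, prediction in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return prediction
    return default


# Both rules match this instance, but the higher one wins.
print(classify({"outlook": "sunny", "humidity": "high"}))    # no
print(classify({"outlook": "sunny", "humidity": "normal"}))  # yes
```
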
### Weather Dataset - Mixed

## Contact Lenses Dataset

Describes conditions under which an optician might want to prescribe soft, hard or no contact lenses.

Grossly over-simplified.

Reference: Cendrowska, J. (1987). PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4), 349–370.

### Attributes

3\*2\*2\*2 = 24 possibilities

Dataset is exhaustive, which is unusual.

Age

- young, pre-presbyopic, presbyopic

Spectacle Prescription

- myope (short), hypermetrope (long)

Astigmatism

- yes, no

Tear Production Rate

- reduced, normal

Recommended Lenses

- class
- hard, soft, none

### Dataset

## Iris Dataset

Used in many statistical experiments.

Contains numeric attributes of 3 different types of iris.

Introduced in 1936 by Sir Ronald Fisher.

### Dataset

# Styles of Learning

- Classification Learning: Predicting a **nominal** class
- Numeric Prediction (Regression): Predicting a **numeric** quantity
- Clustering: Grouping similar examples into clusters
- Association Learning: Detecting associations between attributes

## Classification Learning

- Nominal
- Supervised
  - Provided with actual value of the class
- Measure success on fresh data for which class labels are known (test data)

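
Measuring success on test data can be sketched as the fraction of test instances whose predicted class equals the known class. The classifier and the three test instances below are hypothetical placeholders, not data from the module.

```python
def accuracy(classifier, test_instances, class_attribute="play"):
    """Fraction of test instances whose predicted class matches the known class."""
    if not test_instances:
        raise ValueError("test set must not be empty")
    correct = sum(
        1 for instance in test_instances
        if classifier(instance) == instance[class_attribute]
    )
    return correct / len(test_instances)


def toy_classifier(instance):
    """Hypothetical classifier: predict 'no' when humidity is high, else 'yes'."""
    return "no" if instance["humidity"] == "high" else "yes"


# Hypothetical test instances with known class labels.
test_data = [
    {"humidity": "high", "play": "no"},
    {"humidity": "normal", "play": "yes"},
    {"humidity": "high", "play": "yes"},  # misclassified by toy_classifier
]

print(accuracy(toy_classifier, test_data))  # 2 correct out of 3
```
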
## Numeric Prediction (Regression)

- Numeric
- Supervised
- Success measured on test data

Example uses a linear regression function to provide an estimated performance value based on attributes.

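
As a rough illustration of a linear regression function, the sketch below fits a straight line to a single numeric attribute by ordinary least squares. It assumes one attribute only, and the numbers are made up; the example in the notes uses several attributes and its data is not reproduced here.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one numeric attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Estimate the numeric class value for a new attribute value x."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical training data: attribute values and the numeric quantity to predict.
attribute_values = [1.0, 2.0, 3.0, 4.0]
performance = [3.0, 5.0, 7.0, 9.0]

model = fit_line(attribute_values, performance)
print(model)                # (1.0, 2.0)
print(predict(model, 5.0))  # 11.0
```
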
## Clustering

- Finding similar groups
- Unsupervised
  - Class of example is unknown
- Success measured **subjectively**

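
The notes do not name a clustering algorithm. As one possible illustration, the sketch below uses k-means on one-dimensional data, with made-up points and starting centres. No class labels are involved: examples are grouped purely by how close they are to each centre.

```python
def k_means_1d(points, centres, max_iterations=100):
    """Group 1-D points around the nearest centre until the centres stop moving."""
    centres = list(centres)
    clusters = [[] for _ in centres]
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest centre.
        clusters = [[] for _ in centres]
        for point in points:
            nearest = min(range(len(centres)), key=lambda i: abs(point - centres[i]))
            clusters[nearest].append(point)
        # Update step: move each centre to the mean of its cluster.
        new_centres = [
            sum(cluster) / len(cluster) if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters


# Hypothetical unlabeled examples and initial centres.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centres, clusters = k_means_1d(points, centres=[1.0, 10.0])
print(centres)   # [1.5, 9.5]
print(clusters)  # [[1.0, 1.5, 2.0], [9.0, 9.5, 10.0]]
```
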
## Association Learning

- Applied if no class specified, and any kind of structure is interesting
- Difference to Classification Learning:
  - Predicts any attribute's value, not just class.
  - More than one attribute's value at a time
  - Far more association rules than classification rules.

## Classification Vs Association Rules

Classification Rule:

- Predicts value of a given attribute (class of example)
- ``If outlook = sunny and humidity = high, then play = no``

Association Rule:

- Predicts value of arbitrary attribute / combination

```
If temperature = cool, then humidity = normal
If humidity = normal and windy = false, then play = yes
If outlook = sunny and play = no, then humidity = high
If windy = false and play = no, then outlook = sunny and humidity = high
```

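
Because an association rule can predict any attribute or combination of attributes, both sides of the rule can be written as attribute-value mappings. The sketch below counts how many instances a rule applies to and for how many of those it holds. The five instances are hypothetical rows made up for illustration, not the weather dataset, and the function names are assumptions.

```python
def matches(instance, conditions):
    """True if the instance has every attribute value listed in conditions."""
    return all(instance.get(attr) == value for attr, value in conditions.items())


def rule_counts(instances, antecedent, consequent):
    """Return (instances the rule applies to, instances where it also holds)."""
    applies = [inst for inst in instances if matches(inst, antecedent)]
    holds = [inst for inst in applies if matches(inst, consequent)]
    return len(applies), len(holds)


# Hypothetical instances, for illustration only.
instances = [
    {"humidity": "normal", "windy": "false", "play": "yes"},
    {"humidity": "normal", "windy": "false", "play": "yes"},
    {"humidity": "normal", "windy": "false", "play": "no"},
    {"humidity": "high", "windy": "false", "play": "no"},
    {"humidity": "normal", "windy": "true", "play": "no"},
]

# If humidity = normal and windy = false, then play = yes
print(rule_counts(
    instances,
    antecedent={"humidity": "normal", "windy": "false"},
    consequent={"play": "yes"},
))  # (3, 2)
```

A rule that holds for only some of the instances it applies to is an example of an inexact pattern.
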
# Data Mining and Ethics

- Ethical issues arise in practical applications
- Data mining is often used to discriminate
- Ethical situation depends on application
- Attributes may contain problematic information
- Does ownership of data bestow the right to use it for purposes other than those stated when it was originally collected?
- Who is permitted to access the data?
- For what purpose was the data collected?
- What conclusions can sensibly be drawn?
- Caveats must be attached to results
- Purely statistical arguments are never sufficient