Assessment
T1
- Exam (50%)
T2
- Coursework (50%)
Resources
Data Mining: Practical Machine Learning Tools and Techniques (Witten, Frank, Hall & Pal) 4th Edition 2016
Scientific Calculator
Data Vs Information
- Too much data
- Valuable resource
- Raw data is less important than the information it contains; techniques are needed to extract that information
- Data: recorded facts
- Information: patterns underlying data
Philosophy
Cow Culling
- Each cow is described by 700 features
- Problem: selecting which cows to cull
- Data consists of historical records and farmers' decisions
- Machine learning is used to ascertain which factors farmers take into account, rather than to automate the decision-making process.
Definition of Data Mining
- The extraction of:
- Implicit,
- Previously unknown,
- Potentially useful information from data
- Programs that detect patterns and regularities are needed
- Strong patterns => good predictions
- Issues:
- Most patterns not interesting
- Patterns may be inexact
- Data may be garbled or missing
Machine Learning Techniques
- Algorithms for acquiring structural descriptions from examples
- Structural descriptions represent patterns explicitly. They can be used to:
- Predict the outcome in a new situation
- Understand and explain how the prediction is derived.
- Methods originate from AI, statistics and research on databases.
Can Machines Learn?
- By definition, sort of. Learning is the ability to obtain knowledge by study, experience or being taught, which is very difficult to measure.
- Does learning imply intention?
Terminology
- Concept - Thing to be learned
- Example / Instance - Individual, independent examples of a concept
- Attributes / Features - Measuring aspects of an example / instance
- Concept description (pattern, model, hypothesis) - Output of a data mining algorithm.
Famous Small Datasets
- Will be used in module
- Unrealistically simple
Weather Dataset - Nominal
Concept: conditions which are suitable for a game.
Reference: Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
Attributes
3*3*2*2 = 36 possible combinations of values.
Outlook
- sunny, overcast, rainy
Temperature
- hot, mild, cool
Humidity
- high, normal
Windy
- yes, no
Play
- Class
- yes, no
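The count of 36 can be checked by enumerating every combination of the four non-class attribute values. This is a minimal sketch; the dictionary name `attributes` is an assumption made for illustration, and the values are those listed in these notes.

```python
from itertools import product

# Attribute values as listed in the notes (the class attribute "play" is excluded).
attributes = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": ["yes", "no"],
}

# Cartesian product of the value lists: one tuple per possible instance.
combinations = list(product(*attributes.values()))
print(len(combinations))  # 3 * 3 * 2 * 2 = 36
```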
Dataset
Rules are ordered: a rule higher in the list takes priority over the rules below it.
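One way to picture an ordered rule set is as a list that is scanned from the top, where the first matching rule decides the class. The sketch below is illustrative only: the first rule appears elsewhere in these notes, while the second rule, the default rule, and the names `RULES` and `classify` are hypothetical.

```python
# Each rule is (conditions, predicted class). Earlier rules take priority.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),  # rule quoted in the notes
    ({"outlook": "overcast"}, "yes"),                  # hypothetical rule
    ({}, "yes"),                                       # hypothetical default: matches anything
]

def classify(instance, rules=RULES):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return label
    return None  # no rule fired


print(classify({"outlook": "sunny", "humidity": "high"}))
```

Because the scan stops at the first match, reordering the list can change the prediction for the same instance.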
Weather Dataset - Mixed
Contact Lenses Dataset
Describes conditions under which an optician might want to prescribe soft, hard or no contact lenses. Grossly over-simplified.
Reference: Cendrowska, J. (1987). PRISM: an algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4), 349–370.
Attributes
3*2*2*2 = 24 possibilities. The dataset is exhaustive, which is unusual.
Age
- young, pre-presbyopic, presbyopic
Spectacle Prescription
- myope (short), hypermetrope (long)
Astigmatism
- yes, no
Tear Production Rate
- reduced, normal
Recommended Lenses
- class
- hard, soft, none
Dataset
Iris Dataset
Used in many statistical experiments. Contains numeric attributes of 3 different types of iris. Published in 1936 by Sir Ronald Fisher.
Dataset
Styles of Learning
- Classification Learning: Predicting a nominal class
- Numeric Prediction (Regression): Predicting a numeric quantity
- Clustering: Grouping similar examples into clusters
- Association Learning: Detecting associations between attributes
Classification Learning
- Nominal
- Supervised
- Provided with actual value of the class
- Measure success on fresh data for which class labels are known (test data)
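Measuring success on test data amounts to comparing predictions with known class labels and reporting the fraction that agree. The sketch below assumes a toy classifier built from a single weather rule; the test instances and the names `predict`, `test_data` and `accuracy` are hypothetical.

```python
def predict(instance):
    """Toy classifier: one weather rule, otherwise predict 'yes'."""
    if instance["outlook"] == "sunny" and instance["humidity"] == "high":
        return "no"
    return "yes"

# Hypothetical held-out test instances: (attributes, known class label).
test_data = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "sunny", "humidity": "normal"}, "yes"),
    ({"outlook": "rainy", "humidity": "high"}, "no"),
    ({"outlook": "overcast", "humidity": "high"}, "yes"),
]

def accuracy(classifier, labelled):
    """Fraction of labelled instances the classifier gets right."""
    correct = sum(1 for attrs, label in labelled if classifier(attrs) == label)
    return correct / len(labelled)


print(accuracy(predict, test_data))  # 3 of the 4 instances are classified correctly
```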
Numeric Prediction (Regression)
- Numeric
- Supervised
- Success measured on test data
Example uses a linear regression function to provide an estimated performance value based on attributes.
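As a rough illustration of numeric prediction, the sketch below fits a straight line to one attribute by ordinary least squares. The data points are made up for the example and are not from any dataset in these notes; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # estimate for an unseen attribute value
print(slope, intercept, prediction)
```

With several attributes the same idea applies, but the fit produces one weight per attribute plus a constant.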
Clustering
- Finding similar groups
- Unsupervised
- Class of example is unknown
- Success measured subjectively
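To make the idea of grouping similar examples concrete, here is a minimal sketch using k-means on single numbers. The notes do not name a specific clustering algorithm; k-means is chosen here only as a familiar illustration, and the values and starting centres are made up.

```python
def kmeans_1d(values, centres, iterations=10):
    """Plain k-means on numbers; `centres` holds the initial guesses."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centre.
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Update step: move each centre to the mean of its cluster
        # (an empty cluster keeps its old centre).
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]  # made-up measurements
centres, clusters = kmeans_1d(values, centres=[0.0, 6.0])
print(centres, clusters)
```

No class labels are used anywhere, which is what makes the method unsupervised; whether the resulting groups are meaningful still has to be judged by a person.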
Association Learning
- Applied if no class specified, and any kind of structure is interesting
- Differences from Classification Learning:
- Can predict any attribute's value, not just the class.
- Can predict more than one attribute's value at a time
- Far more association rules than classification rules.
Classification Vs Association Rules
Classification Rule:
- Predicts value of a given attribute (class of example)
If outlook = sunny and humidity = high, then play = no
Association Rule:
- Predicts value of arbitrary attribute / combination
If humidity = normal and windy = false, then play = yes
If outlook = sunny and play = no, then humidity = high
If windy = false and play = no, then outlook = sunny and humidity = high
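Both kinds of rule can be represented the same way, as a pair of attribute-value conditions; the only difference is what is allowed on the right-hand side. The sketch below encodes two of the rules above and measures how often each holds on a few instances. The instances are hypothetical, as are the names `matches` and `rule_accuracy`.

```python
def matches(conditions, instance):
    """True if every attribute=value condition holds for the instance."""
    return all(instance.get(attr) == value for attr, value in conditions.items())

# (antecedent, consequent) pairs taken from the rules above.
classification_rule = ({"outlook": "sunny", "humidity": "high"}, {"play": "no"})
association_rule = ({"windy": "false", "play": "no"},
                    {"outlook": "sunny", "humidity": "high"})

def rule_accuracy(rule, instances):
    """Of the instances the antecedent covers, the fraction where the consequent holds."""
    antecedent, consequent = rule
    covered = [i for i in instances if matches(antecedent, i)]
    if not covered:
        return None  # the rule does not apply to any instance
    return sum(matches(consequent, i) for i in covered) / len(covered)

# Hypothetical instances, not the real weather data.
instances = [
    {"outlook": "sunny", "humidity": "high", "windy": "false", "play": "no"},
    {"outlook": "sunny", "humidity": "high", "windy": "true", "play": "no"},
    {"outlook": "rainy", "humidity": "high", "windy": "false", "play": "no"},
    {"outlook": "overcast", "humidity": "normal", "windy": "false", "play": "yes"},
]

print(rule_accuracy(classification_rule, instances))
print(rule_accuracy(association_rule, instances))
```

The classification rule always has the class attribute alone on the right, whereas the association rule has `play` on the left and two other attributes on the right.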
Data Mining and Ethics
- Ethical Issues arise in practical applications
- Data mining often used to discriminate
- Ethical situation depends on application
- Attributes may contain problematic information
- Does ownership of data bestow the right to use it in ways other than those stated when it was originally collected?
- Who is permitted to access the data?
- For what purpose was the data collected?
- What conclusions can sensibly be drawn?
- Caveats must be attached to results
- Purely statistical arguments are never sufficient