# Assessment
## T1 - Exam (50%)
## T2 - Coursework (50%)

# Resources
Data Mining: Practical Machine Learning Tools and Techniques (Witten, Frank, Hall & Pal) 4th Edition 2016

Scientific Calculator

# Data Vs Information
- Too much data
- Valuable resource
- Raw data less important, need to develop techniques to extract information
- Data: recorded facts
- Information: patterns underlying data

# Philosophy
## Cow Culling
- Each cow described by 700 features
- Problem is selecting which cows to cull
- Data is historical records and farmer decisions
- Machine learning used to ascertain which factors farmers take into account, rather than to automate the decision-making process.

# Definition of Data Mining
- The extraction of:
    - Implicit,
    - Previously unknown,
    - Potentially useful information from data
- Programs that detect patterns and regularities are needed
- Strong patterns => good predictions
- Issues:
    - Most patterns not interesting
    - Patterns may be inexact
    - Data may be garbled or missing

# Machine Learning Techniques
- Algorithms for acquiring structural descriptions from examples
- Structural descriptions represent patterns explicitly.
- Predict outcome in new situation
- Understand and explain how prediction is derived.
- Methods originate from AI, statistics and research on databases.

# Can Machines Learn?
- By definition, sort of. The ability to obtain knowledge by study, experience or being taught is very difficult to measure.
- Does learning imply intention?

# Terminology
- Concept
    - Thing to be learned
- Example / Instance
    - Individual, independent examples of a concept
- Attributes / Features
    - Measure aspects of an example / instance
- Concept description (pattern, model, hypothesis)
    - Output of data mining algorithms.

# Famous Small Datasets
- Will be used in module
- Unrealistically simple

## Weather Dataset - Nominal
Concept: conditions which are suitable for a game.

Reference: Quinlan, J.R. (1986). Induction of decision trees.
Machine Learning, 1(1), 81-106.

### Attributes
3\*3\*2\*2 = 36 possible combinations of values.

- Outlook - sunny, overcast, rainy
- Temperature - hot, mild, cool
- Humidity - high, normal
- Windy - yes, no
- Play - Class - yes, no

### Dataset
![](Pasted%20image%2020240919134249.png)
![](Pasted%20image%2020240919134304.png)

Rules are ordered: higher in the list = higher priority.

### Weather Dataset - Mixed
![](Pasted%20image%2020240919134526.png)
![](Pasted%20image%2020240919134535.png)

## Contact Lenses Dataset
Describes conditions under which an optician might want to prescribe soft, hard or no contact lenses. Grossly over-simplified.

Reference: Cendrowska, J. (1987). PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4), 349–370.

### Attributes
3\*2\*2\*2 = 24 possibilities

Dataset is exhaustive (it contains every possible combination of attribute values), which is unusual.

- Age - young, pre-presbyopic, presbyopic
- Spectacle Prescription - myope (short), hypermetrope (long)
- Astigmatism - yes, no
- Tear Production Rate - reduced, normal
- Recommended Lenses - class - hard, soft, none

### Dataset
![](Pasted%20image%2020240919134848.png)

## Iris Dataset
Used in many statistical experiments.

Contains numeric attributes of 3 different types of iris.

Created in 1936 by Sir Ronald Fisher.

### Dataset
![](Pasted%20image%2020240920130950.png)

# Styles of Learning
- Classification Learning: Predicting a **nominal** class
- Numeric Prediction (Regression): Predicting a **numeric** quantity
- Clustering: Grouping similar examples into clusters
- Association Learning: Detecting associations between attributes

## Classification Learning
- Nominal
- Supervised
    - Provided with actual value of the class
- Measure success on fresh data for which class labels are known (test data)

## Numeric Prediction (Regression)
- Numeric
- Supervised
- Test Data

![](Pasted%20image%2020240920131244.png)

Example uses a linear regression function to provide an estimated performance value based on attributes.
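As a rough illustration of numeric prediction, the sketch below fits a least-squares line to a single numeric attribute and then measures error on separate test data. It is a minimal sketch, not the function from the example above: the helper names (`fit_line`, `predict`, `mean_absolute_error`) and all data values are hypothetical, and only one attribute is used to keep the arithmetic easy to follow.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w0 + w1 * x for one numeric attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    w0 = mean_y - w1 * mean_x
    return w0, w1


def predict(w0, w1, x):
    """Numeric prediction for a new attribute value."""
    return w0 + w1 * x


def mean_absolute_error(w0, w1, xs, ys):
    """Average absolute difference between predictions and known values."""
    return sum(abs(predict(w0, w1, x) - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data (supervised: the numeric target is known).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

# Hypothetical test data, kept separate from training.
test_x = [5.0, 6.0]
test_y = [11.5, 12.5]

w0, w1 = fit_line(train_x, train_y)
mae = mean_absolute_error(w0, w1, test_x, test_y)
print(f"y = {w0:.2f} + {w1:.2f} * x, test MAE = {mae:.2f}")
```

With several attributes the same idea applies, with one weight per attribute plus a constant term.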
## Clustering
- Finding similar groups
- Unsupervised
    - Class of example is unknown
- Success measured **subjectively**

## Association Learning
- Applied if no class specified, and any kind of structure is interesting
- Difference to Classification Learning:
    - Predicts any attribute's value, not just class.
    - More than one attribute's value at a time
    - Far more association rules than classification rules.

## Classification Vs Association Rules
Classification Rule:
- Predicts value of a given attribute (class of example)
- ``If outlook = sunny and humidity = high, then play = no``

Association Rule:
- Predicts value of arbitrary attribute / combination

```
If temperature = cool, then humidity = normal
If humidity = normal and windy = false, then play = yes
If outlook = sunny and play = no, then humidity = high
If windy = false and play = no, then outlook = sunny and humidity = high
```

# Data Mining and Ethics
- Ethical issues arise in practical applications
- Data mining often used to discriminate
- Ethical situation depends on application
- Attributes may contain problematic information
- Does ownership of data bestow the right to use it in ways other than those stated when it was originally collected?
- Who is permitted to access the data?
- For what purpose was the data collected?
- What conclusions can sensibly be drawn?
- Caveats must be attached to results
- Purely statistical arguments never sufficient