G4G0-2/AI & Data Mining/Week 1/Lecture 1 - Introduction to Data Mining.md
2025-01-30 09:27:31 +00:00


Assessment

T1

  • Exam (50%)

T2

  • Coursework (50%)

Resources

Data Mining: Practical Machine Learning Tools and Techniques (Witten, Frank, Hall & Pal) 4th Edition 2016

Scientific Calculator

Data Vs Information

  • Too much data
  • Valuable resource
  • Raw data is less important in itself; techniques are needed to extract information from it
    • Data: recorded facts
    • Information: patterns underlying data

Philosophy

Cow Culling

  • Each cow is described by 700 features
  • The problem is selecting which cows to cull
  • The data is historical records and farmers' decisions
  • Machine learning was used to ascertain which factors the farmers took into account, rather than to automate the decision-making process.

Definition of Data Mining

  • The extraction of:
    • Implicit,
    • Previously unknown,
    • Potentially useful information from data
  • Programs that detect patterns and regularities are needed
  • Strong patterns => good predictions
    • Issues:
      • Most patterns not interesting
      • Patterns may be inexact
      • Data may be garbled or missing

Machine Learning Techniques

  • Algorithms for acquiring structural descriptions from examples
  • Structural descriptions represent patterns, explicitly.
    • Predict outcome in new situation
    • Understand and explain how prediction derived.
  • Methods originate from AI, statistics and research on databases.

Can Machines Learn?

  • By the dictionary definition, sort of. Learning is obtaining knowledge by study, experience or being taught, which is very difficult to measure.
  • Does learning imply intention?

Terminology

  • Concept - Thing to be learned
  • Example / Instance - Individual, independent examples of a concept
  • Attributes / Features - Measured aspects of an example / instance
  • Concept description (pattern, model, hypothesis) - Output of a data mining algorithm.

Famous Small Datasets

  • Will be used in module
  • Unrealistically simple

Weather Dataset - Nominal

Concept: conditions which are suitable for a game.

Reference: Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.

Attributes

3*3*2*2 = 36 possible combinations of values.

Outlook

  • sunny, overcast, rainy

Temperature

  • hot, mild, cool

Humidity

  • high, normal

Windy

  • true, false

Play

  • Class
  • yes, no
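The count of 36 can be checked by enumerating every combination of the four non-class attributes. This is a minimal sketch; the dictionary name `attributes` is illustrative, and Windy is written as true/false to match the rules that appear later in these notes.

```python
from itertools import product

# Attribute values for the nominal weather data (Play is the class, so it is left out).
attributes = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": ["true", "false"],
}

# Cartesian product of all value lists: one tuple per possible instance.
combinations = list(product(*attributes.values()))

print(len(combinations))  # 36
```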

Dataset

Rules are ordered: a rule higher in the list takes priority over the rules below it.
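Rule ordering can be sketched as a decision list, where the first rule that matches decides the class. The rule set below is an assumption made for illustration (only the sunny / high humidity rule appears in these notes), and `classify` and its `default` argument are hypothetical names.

```python
def classify(instance, rules, default):
    """Return the prediction of the first rule whose conditions all hold."""
    for conditions, prediction in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return prediction
    return default  # no rule fired


# Illustrative ordered rules: earlier entries take priority.
rules = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": "true"}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal"}, "yes"),
]

# This instance matches both the second rule (no) and the fourth rule (yes).
instance = {"outlook": "rainy", "temperature": "cool",
            "humidity": "normal", "windy": "true"}

print(classify(instance, rules, default="yes"))                  # no
print(classify(instance, list(reversed(rules)), default="yes"))  # yes
```

Reversing the list changes the answer for the same instance, which is why the order of the rules has to be kept with the rules themselves.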

Weather Dataset - Mixed

Contact Lenses Dataset

Describes conditions under which an optician might want to prescribe soft, hard or no contact lenses. Grossly over-simplified.

Reference: Cendrowska, J. (1987). PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4), 349-370.

Attributes

3*2*2*2 = 24 possibilities. The dataset is exhaustive, which is unusual.

Age

  • young, pre-presbyopic, presbyopic

Spectacle Prescription

  • myope (short), hypermetrope (long)

Astigmatism

  • yes, no

Tear Production Rate

  • reduced, normal

Recommended Lenses

  • class
  • hard, soft, none

Dataset

Iris Dataset

Used in many statistical experiments. Contains numeric attributes for 3 different types of iris. Introduced in 1936 by Sir Ronald Fisher.

Dataset

Styles of Learning

  • Classification Learning: Predicting a nominal class
  • Numeric Prediction (Regression): Predicting a numeric quantity
  • Clustering: Grouping similar examples into clusters
  • Association Learning: Detecting associations between attributes

Classification Learning

  • Nominal
  • Supervised
    • Provided with actual value of the class
  • Measure success on fresh data for which class labels are known (test data)
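Measuring success on test data can be sketched as the fraction of held-out instances whose known class label is predicted correctly. The classifier `predict_play` (the single rule from these notes with a "yes" fallback) and the four test instances are hypothetical and exist only to make the example runnable.

```python
def accuracy(predict, test_set):
    """Fraction of labelled test instances whose class is predicted correctly."""
    correct = sum(1 for features, label in test_set if predict(features) == label)
    return correct / len(test_set)


def predict_play(features):
    # Hypothetical classifier: one rule, falling back to "yes" otherwise.
    if features["outlook"] == "sunny" and features["humidity"] == "high":
        return "no"
    return "yes"


# Hypothetical fresh instances with known class labels (not from the lecture).
test_set = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "overcast", "humidity": "high"}, "yes"),
    ({"outlook": "rainy", "humidity": "normal"}, "no"),
    ({"outlook": "sunny", "humidity": "normal"}, "yes"),
]

print(accuracy(predict_play, test_set))  # 0.75
```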

Numeric Prediction (Regression)

  • Numeric
  • Supervised
  • Test Data

The lecture example uses a linear regression function to estimate a performance value from the attributes.
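A linear regression function of this kind can be sketched with ordinary least squares on a single numeric attribute. The lecture's own example and data are not reproduced in these notes, so the numbers and the names `fit_line`, `xs` and `ys` below are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one numeric attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: one attribute value and one numeric target per instance.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # numeric prediction for an unseen instance

print(slope, intercept, prediction)  # 2.0 1.0 11.0
```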

Clustering

  • Finding similar groups
  • Unsupervised
    • Class of example is unknown
  • Success measured subjectively
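Grouping without class labels can be sketched with k-means on one numeric attribute. k-means is not named in these notes and is used here only as one common clustering method; the points and starting centroids are hypothetical.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical unlabelled values of a single numeric attribute.
points = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
centroids, clusters = k_means_1d(points, centroids=[1.0, 10.0])

print(centroids)  # [1.5, 9.0]
print(clusters)   # [[1.0, 1.5, 2.0], [8.0, 9.0, 10.0]]
```

No class label is used anywhere, so whether two groups is the "right" answer is a judgement about the data rather than something the code can check.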

Association Learning

  • Applied if no class specified, and any kind of structure is interesting
  • Differences from Classification Learning:
    • Predicts any attribute's value, not just class.
    • More than one attribute's value at a time
    • Far more association rules than classification rules.

Classification Vs Association Rules

Classification Rule:

  • Predicts value of a given attribute (class of example)
  • If outlook = sunny and humidity = high, then play = no

Association Rule:

  • Predicts value of arbitrary attribute / combination
  • If humidity = normal and windy = false, then play = yes
  • If outlook = sunny and play = no, then humidity = high
  • If windy = false and play = no, then outlook = sunny and humidity = high
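The association rules above can be checked against data by counting how many instances each rule covers and how often its consequent holds. The sketch assumes the standard 14-instance nominal weather data from the module textbook, since the table is not reproduced in these notes. The function `rule_stats` and the terms support and confidence are not taken from the lecture.

```python
FIELDS = ("outlook", "temperature", "humidity", "windy", "play")

# Assumed: the standard 14-instance weather data (not reproduced in these notes).
DATA = """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
"""

rows = [dict(zip(FIELDS, line.split())) for line in DATA.strip().splitlines()]


def rule_stats(antecedent, consequent, data):
    """Return (support, confidence) for the rule 'if antecedent then consequent'."""
    covered = [r for r in data if all(r[a] == v for a, v in antecedent.items())]
    correct = [r for r in covered if all(r[a] == v for a, v in consequent.items())]
    support = len(correct)  # instances the rule predicts correctly
    confidence = len(correct) / len(covered) if covered else 0.0
    return support, confidence


print(rule_stats({"humidity": "normal", "windy": "false"}, {"play": "yes"}, rows))
print(rule_stats({"outlook": "sunny", "play": "no"}, {"humidity": "high"}, rows))
print(rule_stats({"windy": "false", "play": "no"},
                 {"outlook": "sunny", "humidity": "high"}, rows))
```

Under that assumed data, all three rules hold for every instance they cover, with 4, 3 and 2 supporting instances respectively. Because any attribute or combination of attributes can appear on the right-hand side, far more candidate rules exist than in classification.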

Data Mining and Ethics

  • Ethical Issues arise in practical applications
  • Data mining often used to discriminate
  • Ethical situation depends on application
  • Attributes may contain problematic information
  • Does ownership of data bestow right to use it in other ways than those purported when it was originally collected?
  • Who is permitted to access the data?
  • For what purpose was the data collected?
  • What conclusions can sensibly be drawn?
  • Caveats must be attached to results
  • Purely statistical arguments never sufficient