G4G0-2/AI & Data Mining/Week 1/Lecture 2 - Input and Output.md
2024-10-16 09:12:37 +01:00

Attributes

  • Each example described by fixed pre-defined set of features (attributes)
  • In practice, the number of applicable attributes may vary between examples
    • ex. Transportation Vehicles
      • no. wheels not applicable to ships
      • no. masts not applicable to cars
    • Possible solution: "irrelevant value" flag
  • Attributes may be dependent on other attributes
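
The "irrelevant value" flag can be sketched as follows. This is a minimal illustration, not a prescribed encoding: the `NOT_APPLICABLE` sentinel, the `applicable` helper, and the wheel and mast counts are all assumptions made for the example.

```python
# Sentinel marking an attribute that does not apply to an example.
# (Hypothetical choice; it is distinct from a value that is merely unknown.)
NOT_APPLICABLE = "n/a"

# Every example is described by the same fixed attribute set.
ATTRIBUTES = ("wheels", "masts")

# Illustrative values only.
vehicles = {
    "car": {"wheels": 4, "masts": NOT_APPLICABLE},
    "ship": {"wheels": NOT_APPLICABLE, "masts": 3},
}


def applicable(example):
    """Return the names of the attributes that apply to this example."""
    return [name for name in ATTRIBUTES if example[name] != NOT_APPLICABLE]


print(applicable(vehicles["car"]))   # attributes that apply to a car
print(applicable(vehicles["ship"]))  # attributes that apply to a ship
```

Keeping the attribute set fixed and flagging the inapplicable entries lets every example share one schema, which is what most learning algorithms expect.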

Taxonomy of Data Types

Nominal Attributes

  • Distinct symbols
    • Serve as labels or names
  • No relation implied among nominal values
  • Only equality tests can be performed
    • ex. outlook = sunny

Sources of Missing Values

  • Malfunctioning / Misconfigured Equipment
  • Changes in design
  • Collation of different datasets
  • Data not collected for mining
    • Errors and omissions don't affect the original purpose of the data
    • ex. Banks do not need to know age in banking datasets, so DOB may contain missing values
  • Missing value may have significance
    • ex. a medical diagnosis may be inferable from which tests a doctor decides to order, rather than from their outcomes
    • Most DM algorithms assume this is not the case, hence "missing" may need to be coded as an additional nominal value
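
Coding "missing" as an additional nominal value might look like the sketch below. The `MISSING` label, the `encode_missing` helper, and the test-result column are hypothetical names and data chosen for illustration.

```python
MISSING = "missing"  # hypothetical label for the extra nominal value


def encode_missing(values):
    """Replace None with an explicit 'missing' category."""
    return [MISSING if v is None else v for v in values]


# Hypothetical column: None means the test was never ordered.
test_result = ["positive", None, "negative", None]
encoded = encode_missing(test_result)
print(encoded)
```

After encoding, a learner can treat the absence of a value like any other nominal value and test for it with equality.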

Inaccurate Values

  • Typographical errors in nominal attributes
  • Typographical and measurement errors in numeric attributes
  • Deliberate errors
    • ex. Incorrect ZIP codes, unsanitised inputs
  • Duplicate examples
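
Two of these checks are easy to automate. The sketch below assumes nominal columns are plain lists and examples are tuples; the helper names and the sample data are made up for illustration.

```python
from collections import Counter


def unexpected_values(column, allowed):
    """Nominal values outside the declared set (likely typos)."""
    return sorted(set(column) - set(allowed))


def duplicate_examples(rows):
    """Rows that occur more than once; rows must be hashable tuples."""
    counts = Counter(rows)
    return [row for row, n in counts.items() if n > 1]


# Illustrative data containing one typo and one duplicate.
outlook = ["sunny", "suny", "rainy", "overcast"]
rows = [("sunny", 85), ("rainy", 70), ("sunny", 85)]

print(unexpected_values(outlook, {"sunny", "overcast", "rainy"}))
print(duplicate_examples(rows))
```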

Weka and ARFF

Weather Dataset in ARFF
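
An ARFF file has a header (`@relation`, then one `@attribute` line per attribute) followed by an `@data` section with one comma-separated example per line. Lines starting with `%` are comments. The sketch below embeds a short excerpt in that layout and reads it with a deliberately minimal parser. The `parse_arff` function is a hypothetical helper, not part of Weka, and the three data rows are illustrative rather than copied from the lecture.

```python
ARFF_TEXT = """\
% Illustrative excerpt in ARFF layout (rows are assumed).
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
"""


def parse_arff(text):
    """Parse a simple ARFF string into (relation, attributes, rows).

    Handles only unquoted names, nominal/numeric attributes and dense data.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # blank line or comment
            continue
        if in_data:
            rows.append(line.split(","))
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, spec = rest.strip().partition(" ")
            spec = spec.strip()
            if spec.startswith("{"):
                # Nominal attribute: the braces list the allowed values.
                spec = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, attributes[0], rows[0])
```

Nominal attributes declare their allowed values in braces, while numeric attributes use the `numeric` keyword.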

Getting to Know the Data

  • First task, get to know data
  • Simple visualisations useful:
    • Nominal: bar graph
    • Numeric: histograms
  • 2D and 3D plots show dependencies
  • Need to consult experts
  • Too much data? Take sample.
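
The quantities behind these visualisations can be computed before any plotting. The following is a rough sketch using only the standard library; the function names, the bin width, and the sample values are assumptions for illustration.

```python
import random
from collections import Counter


def bar_counts(values):
    """Frequency of each nominal value (the heights of a bar graph)."""
    return Counter(values)


def histogram(values, bin_width):
    """Count numeric values per bin, keyed by each bin's lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)


def take_sample(rows, k, seed=0):
    """Random sample without replacement; the seed makes it repeatable."""
    return random.Random(seed).sample(rows, k)


# Illustrative columns.
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy"]
humidity = [85, 90, 86, 96, 80, 70]

print(bar_counts(outlook))
print(sorted(histogram(humidity, 10).items()))
print(take_sample(humidity, 2))
```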

Concept Descriptions

  • Output of DM algorithm
  • Many ways of representing:
    • Decision Trees
    • Rules
    • Linear Regression Functions

Decision Trees

  • Divide-and-Conquer approach
  • Trees drawn upside down
    • Node at top is root
  • Edges are branches
  • Rectangles represent leaves
  • Leaves assign classification
  • Nodes involve testing attribute

Decision Tree with Nominal Attributes

  • Number of branches usually equals the number of values the attribute can take
  • Attribute is not tested more than once along any path

Decision Tree with Numeric Attributes

  • Test whether value is greater or less than a constant
  • Attribute may be tested multiple times along a path
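
A small tree mixing both kinds of test can be written as nested conditionals. The tree below is a hand-written illustration: its structure, the threshold of 75, and the attribute names are assumptions, not the output of any particular learner.

```python
def classify(example):
    """Walk a small hand-written tree from the root down to a leaf."""
    outlook = example["outlook"]  # nominal test: one branch per value
    if outlook == "sunny":
        # Numeric test: compare the value against a constant.
        return "no" if example["humidity"] > 75 else "yes"
    if outlook == "overcast":
        return "yes"  # leaf: assigns the classification directly
    if outlook == "rainy":
        return "no" if example["windy"] else "yes"
    raise ValueError(f"unknown outlook value: {outlook!r}")


print(classify({"outlook": "sunny", "humidity": 85, "windy": False}))
print(classify({"outlook": "rainy", "humidity": 80, "windy": False}))
```

Each `if` on `outlook` corresponds to one branch out of the root node, and each `return` corresponds to a leaf.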

Decision Trees with Missing Values

  • Not clear which branch should be taken when node tests attribute with missing value
  • Does absence of a value have significance?
    • Yes => Treat as separate value during training
    • No => Treat in special way during testing
      • Assign sample to most popular branch

Classification Rules

  • Popular alternative to decision tree
  • Antecedent (pre-condition) - series of tests
  • Tests usually logically ANDed together
  • Consequent (conclusion) - usually a class
  • Individual rules often logically ORed together
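
One possible representation, offered as a sketch: each rule is a pair of an antecedent (a dictionary of attribute-value tests that must all hold) and a consequent class. The rules and attribute values here are invented for illustration.

```python
# Each rule: (antecedent, consequent). The antecedent's tests are ANDed.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "overcast"}, "yes"),
]


def covers(antecedent, example):
    """True if every test in the antecedent holds for the example."""
    return all(example.get(attr) == value for attr, value in antecedent.items())


def matching_classes(rules, example):
    """Classes of every rule that fires (the rules are alternatives: OR)."""
    return [cls for antecedent, cls in rules if covers(antecedent, example)]


print(matching_classes(RULES, {"outlook": "sunny", "humidity": "high"}))
print(matching_classes(RULES, {"outlook": "rainy", "humidity": "high"}))
```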

If-Then Rules for Contact Lenses

Nuggets

  • Are rules independent pieces of knowledge ("nuggets")?
  • Problem: this view ignores how the rules are executed
    • Ordered set (decision list)
      • Order important for interpretation
    • Unordered set
      • Rules may overlap and lead to different conclusions for the same example
      • Needs conflict resolution
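
The effect of ordering can be seen in a minimal decision-list sketch, where the first rule that fires decides the class. The two rules and the `default` argument are assumptions made for the example.

```python
def covers(antecedent, example):
    """True if every test in the antecedent holds for the example."""
    return all(example.get(attr) == value for attr, value in antecedent.items())


def decision_list(rules, example, default):
    """Ordered execution: the first rule that fires decides the class."""
    for antecedent, cls in rules:
        if covers(antecedent, example):
            return cls
    return default


# Two overlapping, illustrative rules with different conclusions.
rules = [
    ({"outlook": "sunny", "windy": True}, "no"),
    ({"outlook": "sunny"}, "yes"),
]
example = {"outlook": "sunny", "windy": True}

print(decision_list(rules, example, default="yes"))
print(decision_list(list(reversed(rules)), example, default="yes"))
```

Both rules cover the example, so reversing the list changes the answer. A rule taken out of a decision list cannot be read in isolation, because it implicitly assumes the earlier rules did not fire.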

Executing Rules

  • What if two or more rules conflict?
    • Give no conclusion?
    • Go with the rule that covers largest no. training samples?
  • What if no rule applies to a test example?
    • Give no conclusion?
    • Go with class that is most frequent?
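
Two of the options above can be combined in one sketch: when several rules fire, pick the one that covered the most training examples, and when none fires, fall back to the most frequent class. The rule tuples, coverage counts, and default class are hypothetical.

```python
def covers(antecedent, example):
    """True if every test in the antecedent holds for the example."""
    return all(example.get(attr) == value for attr, value in antecedent.items())


def classify(rules, example, default_class):
    """Unordered execution with simple fallbacks.

    rules: list of (antecedent, class, coverage), where coverage is the
    number of training examples the rule covered.
    """
    fired = [rule for rule in rules if covers(rule[0], example)]
    if not fired:
        return default_class  # no rule applies: use the most frequent class
    # Conflict (or single match): trust the rule with the largest coverage.
    return max(fired, key=lambda rule: rule[2])[1]


# Illustrative rules with made-up coverage counts.
rules = [
    ({"outlook": "sunny"}, "no", 3),
    ({"windy": False}, "yes", 6),
]

print(classify(rules, {"outlook": "sunny", "windy": False}, "yes"))
print(classify(rules, {"outlook": "rainy", "windy": True}, "yes"))
```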

Special Case: Boolean Classes

  • Assumption: if example does not belong to class "yes", belongs to "no"
  • Solution: only learn rules for class "yes", use default rule for "no"
  • Order is not important: every learned rule predicts "yes", so there are no conflicts.
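
A minimal sketch of this scheme, assuming rules are stored as antecedents only because the class is always "yes". The two rules are invented for illustration.

```python
def covers(antecedent, example):
    """True if every test in the antecedent holds for the example."""
    return all(example.get(attr) == value for attr, value in antecedent.items())


# Only rules for class "yes" are learned, so no consequent is stored.
YES_RULES = [
    {"outlook": "overcast"},
    {"outlook": "sunny", "humidity": "normal"},
]


def classify(example, yes_rules=YES_RULES):
    """'yes' if any rule fires, otherwise the default class 'no'."""
    return "yes" if any(covers(rule, example) for rule in yes_rules) else "no"


print(classify({"outlook": "overcast", "humidity": "high"}))
print(classify({"outlook": "sunny", "humidity": "high"}))
```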