# Attributes
- Each example is described by a fixed, pre-defined set of features (attributes)
- In practice, the number of attributes may vary
    - ex. Transportation vehicles
        - no. of wheels not applicable to ships
        - no. of masts not applicable to cars
    - Possible solution: "irrelevant value" flag
- Attributes may be dependent on other attributes

# Taxonomy of Data Types
![](Pasted%20image%2020240920132209.png)

# Nominal Attributes
- Distinct symbols
- Serve as labels or names
- No relation implied among nominal values
- Only equality tests can be performed
    - ex. outlook = sunny

# Sources of Missing Values
- Malfunctioning / misconfigured equipment
- Changes in design
- Collation of different datasets
- Data not collected for mining
    - Errors and omissions don't affect the original purpose of the data
    - ex. Banks do not need to know a customer's age, so the DOB field may contain missing values
- Missing value may have significance
    - ex. A medical diagnosis can sometimes be made from which tests a doctor decides to order, rather than from their outcomes
    - Most DM algorithms assume this is not the case, so "missing" may need to be coded as an additional nominal value

# Inaccurate Values
- Typographical errors in nominal attributes
- Typographical and measurement errors in numeric attributes
- Deliberate errors
    - ex. Incorrect ZIP codes, unsanitised inputs
- Duplicate examples

# Weka and ARFF
## Weather Dataset in ARFF
![](Pasted%20image%2020240920132732.png)

### Getting to Know the Data
- The first task is to get to know the data
- Simple visualisations are useful:
    - Nominal: bar graphs
    - Numeric: histograms
- 2D and 3D plots show dependencies
- Need to consult domain experts
- Too much data? Take a sample.
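
The first-look steps above can be sketched in a few lines of Python. This is a minimal sketch, not a Weka workflow: the rows are made-up values in the style of a weather dataset, and the helper names (`nominal_counts`, `numeric_histogram`, `take_sample`) are hypothetical.

```python
import random
from collections import Counter

# Toy rows (illustrative values, not taken from the notes): (outlook, temperature)
rows = [
    ("sunny", 85), ("sunny", 80), ("overcast", 83), ("rainy", 70),
    ("rainy", 68), ("rainy", 65), ("overcast", 64), ("sunny", 72),
]

def nominal_counts(values):
    """Count each distinct symbol; the counts are the bar heights of a bar graph."""
    return Counter(values)

def numeric_histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    return Counter((v // bin_width) * bin_width for v in values)

def take_sample(data, k, seed=0):
    """Random sample without replacement, seeded so the result is repeatable."""
    return random.Random(seed).sample(data, k)

outlook_counts = nominal_counts(r[0] for r in rows)
temp_hist = numeric_histogram((r[1] for r in rows), bin_width=10)
sample = take_sample(rows, k=4)

# Text-only "plots": one '#' per example
for label, n in sorted(outlook_counts.items()):
    print(f"{label:<9}{'#' * n}")
for lower, n in sorted(temp_hist.items()):
    print(f"{lower}-{lower + 9}  {'#' * n}")
```

The bin width and sample size are arbitrary choices here; in practice they would be tuned to the attribute's range and the size of the dataset.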
# Concept Descriptions
- Output of a DM algorithm
- Many ways of representing:
    - Decision trees
    - Rules
    - Linear regression functions

## Decision Trees
- Divide-and-conquer approach
- Trees are drawn upside down
    - Node at the top is the root
    - Edges are branches
    - Rectangles represent leaves
- Leaves assign a classification
- Nodes involve testing an attribute

### Decision Tree with Nominal Attributes
![](Pasted%20image%2020240920133218.png)
- Number of branches usually equals the number of values of the attribute
- An attribute is not tested more than once on any path

### Decision Tree with Numeric Attributes
![](Pasted%20image%2020240920133316.png)
- Test whether the value is greater or less than a constant
- An attribute may be tested multiple times

### Decision Trees with Missing Values
- Not clear which branch should be taken when a node tests an attribute with a missing value
- Does the absence of a value have significance?
    - Yes => Treat "missing" as a separate value during training
    - No => Treat in a special way during testing
        - Assign the sample to the most popular branch

# Classification Rules
- Popular alternative to decision trees
- Antecedent (pre-condition): a series of tests
    - Tests are usually logically ANDed together
- Consequent (conclusion): usually a class
- Individual rules are often logically ORed together

## If-Then Rules for Contact Lenses
![](Pasted%20image%2020240920133706.png)

# Nuggets
- Are rules independent nuggets of knowledge?
- Problem: this ignores the process of executing the rules
- Ordered set (decision list)
    - Order is important for interpretation
- Unordered set
    - Rules may overlap and lead to different conclusions for the same example
    - Needs conflict resolution

## Executing Rules
- What if two or more rules conflict?
    - Give no conclusion?
    - Go with the rule that covers the largest no. of training samples?
- What if no rule applies to a test example?
    - Give no conclusion?
    - Go with the class that is most frequent in the training data?
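
One way to make these choices concrete is sketched below for an unordered rule set. It picks one option for each question: on conflict, the rule with the largest training coverage wins; if no rule fires, the most frequent training class is returned. The rules, coverage counts, and class counts are hypothetical and for illustration only; returning `None` instead would correspond to "give no conclusion".

```python
from collections import Counter

# Each rule: (antecedent tests ANDed together, consequent class, training coverage).
# Rules and coverage counts are hypothetical.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no", 3),
    ({"outlook": "overcast"}, "yes", 4),
    ({"windy": "false"}, "yes", 2),
]

# Hypothetical class labels of the training set, used only for the fallback.
TRAINING_CLASSES = ["yes"] * 9 + ["no"] * 5

def fires(antecedent, example):
    """A rule fires when every test in its antecedent holds (logical AND)."""
    return all(example.get(attr) == value for attr, value in antecedent.items())

def classify(example, rules=RULES, training_classes=TRAINING_CLASSES):
    matching = [rule for rule in rules if fires(rule[0], example)]
    if not matching:
        # No rule applies: fall back to the most frequent training class.
        return Counter(training_classes).most_common(1)[0][0]
    # One or more rules apply: the rule covering the most training samples wins.
    _, consequent, _ = max(matching, key=lambda rule: rule[2])
    return consequent

# Rules 1 and 3 both fire and disagree; rule 1 has the larger coverage.
conflict_result = classify({"outlook": "sunny", "humidity": "high", "windy": "false"})
# No rule fires, so the fallback class is used.
fallback_result = classify({"outlook": "rainy", "humidity": "normal", "windy": "true"})
print(conflict_result, fallback_result)
```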
## Special Case: Boolean Classes
- Assumption: if an example does not belong to class "yes", it belongs to class "no"
- Solution: only learn rules for class "yes", and use a default rule for "no"

![](Pasted%20image%2020240920134203.png)
- Order of the "yes" rules is not important and no conflicts arise, since the default rule applies only when no other rule fires.
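
The default-rule idea can be sketched as follows, assuming rules are stored as dictionaries of attribute tests. The two rules shown are hypothetical and are not the ones in the figure.

```python
# Hypothetical rules for class "yes" only; each dict holds tests that are ANDed together.
YES_RULES = [
    {"outlook": "overcast"},
    {"outlook": "sunny", "humidity": "normal"},
]

def classify(example, yes_rules=YES_RULES):
    """Return "yes" if any learned rule fires, otherwise fall through to the default."""
    for antecedent in yes_rules:
        if all(example.get(attr) == value for attr, value in antecedent.items()):
            return "yes"
    return "no"  # default rule: anything not covered by a "yes" rule

print(classify({"outlook": "overcast", "humidity": "high"}))
print(classify({"outlook": "rainy", "humidity": "high"}))
```

Because every learned rule predicts the same class, two rules that fire on the same example cannot disagree.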