Attributes
- Each example is described by a fixed, pre-defined set of features (attributes)
- Number of attributes may vary
- ex. Transportation Vehicles
- no. wheels not applicable to ships
- no. masts not applicable to cars
- Possible solution: "irrelevant value" flag
- Attributes may be dependent on other attributes
Taxonomy of Data Types
Nominal Attributes
- Distinct symbols
- Serve as labels or names
- No relation implied among nominal values
- Only equality tests can be performed
- ex. outlook = sunny
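A minimal sketch of the "only equality tests" point, assuming the nominal attribute is modelled as a Python `Enum`. The values `overcast` and `rainy` are assumptions added for illustration; only `sunny` appears in these notes.

```python
from enum import Enum


class Outlook(Enum):
    """Nominal attribute: the values are labels with no implied order."""
    SUNNY = "sunny"
    OVERCAST = "overcast"  # assumed value, for illustration
    RAINY = "rainy"        # assumed value, for illustration


today = Outlook.SUNNY

# Equality is the only meaningful test on a nominal value.
is_sunny = today == Outlook.SUNNY

# Ordering is undefined: plain Enum members do not support "<".
try:
    today < Outlook.RAINY
    ordering_allowed = True
except TypeError:
    ordering_allowed = False
```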
Sources of Missing Values
- Malfunctioning / Misconfigured Equipment
- Changes in design
- Collation of different datasets
- Data not collected for mining
- Errors and omissions don't affect the original purpose of the data
- ex. Banks do not need to know a customer's age, so the date of birth (DOB) field may contain missing values
- Missing value may have significance
- ex. a medical diagnosis can sometimes be made from which tests a doctor decides to order, rather than from their outcomes.
- Most DM algorithms assume a missing value has no significance, hence "missing" may need to be coded as an additional nominal value.
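Coding "missing" as an additional nominal value could look roughly like the sketch below. The label `MISSING`, the helper `encode_missing` and the `blood_test` attribute are hypothetical names chosen for illustration.

```python
MISSING = "missing"  # hypothetical label for the extra nominal value


def encode_missing(records, attribute):
    """Return copies of the records with None replaced by the MISSING label."""
    encoded = []
    for record in records:
        copy = dict(record)
        if copy.get(attribute) is None:
            copy[attribute] = MISSING
        encoded.append(copy)
    return encoded


patients = [
    {"blood_test": "normal"},
    {"blood_test": None},       # the test was never ordered
    {"blood_test": "abnormal"},
]

encoded = encode_missing(patients, "blood_test")
values = sorted({p["blood_test"] for p in encoded})
```

After encoding, a learner sees three distinct values for the attribute and can treat "test not ordered" as information in its own right.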
Inaccurate Values
- Typographical errors in nominal attributes
- Typographical and measurement errors in numeric attributes
- Deliberate errors
- ex. Incorrect ZIP codes, unsanitised inputs
- Duplicate examples
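Two of the problems listed above, typographical errors in nominal attributes and duplicate examples, can be checked for mechanically. This is a rough sketch under the assumption that each example is a tuple and that the set of legal nominal values is known in advance; the helper names and the sample rows are made up for illustration.

```python
from collections import Counter


def find_duplicates(rows):
    """Return each row that occurs more than once."""
    counts = Counter(tuple(row) for row in rows)
    return [row for row, n in counts.items() if n > 1]


def unexpected_values(rows, column, allowed):
    """Return nominal values in `column` that are not in the allowed set."""
    return sorted({row[column] for row in rows} - set(allowed))


rows = [
    ("sunny", 85),
    ("suny", 80),    # typographical error in a nominal attribute
    ("rainy", 70),
    ("sunny", 85),   # duplicate example
]

duplicates = find_duplicates(rows)
typos = unexpected_values(rows, 0, {"sunny", "overcast", "rainy"})
```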
Weka and ARFF
Weather Dataset in ARFF
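For orientation, an ARFF file has a header (`@relation`, then one `@attribute` line per attribute) followed by `@data` and comma-separated rows. The sketch below embeds a short ARFF fragment in the style of the weather data and reads it with a hand-written parser. The rows are illustrative only, and `parse_arff` is a hypothetical helper that handles just this simple subset of the format (no quoting, no `?` missing values); it is not Weka's own loader.

```python
ARFF_TEXT = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
"""


def parse_arff(text):
    """Parse a small, simple ARFF string (comment lines start with %)."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            # "@attribute <name> <type or {nominal values}>"
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                is_numeric = kind.lower() in ("numeric", "real", "integer")
                row[name] = float(value) if is_numeric else value
            rows.append(row)
    return attributes, rows


attributes, rows = parse_arff(ARFF_TEXT)
```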
Getting to Know the Data
- First task, get to know data
- Simple visualisations useful:
- Nominal: bar graph
- Numeric: histograms
- 2D and 3D plots show dependencies
- Need to consult experts
- Too much data? Take sample.
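The numbers behind these simple visualisations can be computed with the standard library alone. A minimal sketch, assuming the data fits in memory; the helper names, the bin width and the sample values are assumptions for illustration.

```python
import random
from collections import Counter


def nominal_counts(values):
    """Bar-graph heights: how often each nominal value occurs."""
    return Counter(values)


def histogram(values, bin_width):
    """Histogram of numeric values, keyed by the lower edge of each bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


def take_sample(rows, k, seed=0):
    """Random sample without replacement, for when there is too much data."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy"]
temperature = [64, 65, 68, 70, 71, 72, 75, 80, 83, 85]

bar_heights = nominal_counts(outlook)
temp_hist = histogram(temperature, 10)
subset = take_sample(temperature, 3)
```

The resulting counts can be passed to any plotting library, or simply printed and inspected.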
Concept Descriptions
- Output of DM algorithm
- Many ways of representing:
- Decision Trees
- Rules
- Linear Regression Functions
Decision Trees
- Divide-and-Conquer approach
- Trees drawn upside down
- Node at top is root
- Edges are branches
- Rectangles represent leaves
- Leaves assign classification
- Nodes involve testing attribute
Decision Tree with Nominal Attributes
- Number of branches is usually equal to the number of possible values
- Attribute is not tested more than once along any path from root to leaf.
Decision Tree with Numeric Attributes
- Test whether value is greater or less than constant
- Attribute may be tested multiple times
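One way to picture both kinds of test is a tree stored as nested dictionaries, where a nominal node has one branch per value and a numeric node compares against a constant. The tree shape, the threshold of 75 and the key names below are assumptions made for this sketch, not a tree learned from real data.

```python
def classify(node, example):
    """Walk from the root to a leaf and return the leaf's class."""
    while isinstance(node, dict):
        value = example[node["attribute"]]
        if "threshold" in node:
            # Numeric test: compare the value with a constant.
            branch = "<=" if value <= node["threshold"] else ">"
        else:
            # Nominal test: one branch per attribute value.
            branch = value
        node = node["branches"][branch]
    return node  # a leaf is just a class label


tree = {
    "attribute": "outlook",            # root: nominal test
    "branches": {
        "overcast": "yes",             # leaf
        "sunny": {
            "attribute": "humidity",   # numeric test against a constant
            "threshold": 75,
            "branches": {"<=": "yes", ">": "no"},
        },
        "rainy": {
            "attribute": "windy",      # nominal (boolean) test
            "branches": {True: "no", False: "yes"},
        },
    },
}

prediction = classify(tree, {"outlook": "sunny", "humidity": 85, "windy": False})
```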
Decision Trees with Missing Values
- Not clear which branch should be taken when node tests attribute with missing value
- Does absence of a value have significance?
- Yes => Treat as separate value during training
- No => Treat in special way during testing
- Assign sample to most popular branch
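The two options above can be sketched as a single branch-selection function. It assumes each node records how many training examples went down each branch; `choose_branch`, the `"missing"` branch label and the counts are hypothetical.

```python
def choose_branch(branch_counts, value, missing_is_significant=False):
    """Pick the branch to follow at a node for a possibly missing value.

    branch_counts maps each branch to the number of training examples
    that followed it (an assumed bookkeeping detail).
    """
    if value is not None:
        return value
    if missing_is_significant:
        # Absence carries meaning: follow a dedicated "missing" branch.
        return "missing"
    # Absence carries no meaning: follow the most popular branch.
    return max(branch_counts, key=branch_counts.get)


counts = {"sunny": 5, "overcast": 4, "rainy": 3}

known = choose_branch(counts, "rainy")
fallback = choose_branch(counts, None)
separate = choose_branch(counts, None, missing_is_significant=True)
```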
Classification Rules
- Popular alternative to decision tree
- Antecedent (pre-condition) - series of tests
- Tests usually logically ANDed together
- Consequent (conclusion) - usually a class
- Individual rules often logically ORed together
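The AND/OR structure can be made concrete with a small sketch: each rule's antecedent is a set of tests that must all hold, and a class is concluded if any rule for that class fires. The rule contents and the `if`/`then` key names are assumptions for illustration.

```python
def rule_fires(rule, example):
    """Antecedent: every test must hold (logical AND)."""
    return all(example.get(attr) == val for attr, val in rule["if"].items())


def any_rule_concludes(rules, example, target):
    """Rules for the same class are alternatives (logical OR)."""
    return any(r["then"] == target and rule_fires(r, example) for r in rules)


rules = [
    {"if": {"outlook": "sunny", "humidity": "high"}, "then": "no"},
    {"if": {"outlook": "rainy", "windy": True}, "then": "no"},
    {"if": {"outlook": "overcast"}, "then": "yes"},
]

example = {"outlook": "rainy", "humidity": "normal", "windy": True}
play_is_no = any_rule_concludes(rules, example, "no")
play_is_yes = any_rule_concludes(rules, example, "yes")
```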
If-Then Rules for Contact Lenses
Nuggets
- Are rules independent pieces of knowledge?
- Problem: this view ignores the process of executing rules
- Ordered set (decision list)
- Order important for interpretation
- Unordered set
- Rules may overlap and lead to different conclusions for the same example
- Needs conflict resolution
Executing Rules
- What if two or more rules conflict?
- Give no conclusion?
- Go with the rule that covers largest no. training samples?
- What if no rule applies to a test example?
- Give no conclusion?
- Go with class that is most frequent?
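The difference between an ordered decision list and an unordered rule set, together with the two fallback questions above, can be sketched as follows. The `coverage` field (number of training samples a rule covers), the rule contents and the function names are assumptions for illustration; a `default` of `None` stands for "give no conclusion".

```python
def matches(rule, example):
    """True if every test in the rule's antecedent holds."""
    return all(example.get(a) == v for a, v in rule["if"].items())


def classify_ordered(rules, example, default=None):
    """Decision list: the first rule that fires wins."""
    for rule in rules:
        if matches(rule, example):
            return rule["then"]
    return default  # no rule applies


def classify_unordered(rules, example, default=None):
    """Unordered set: resolve conflicts by largest training coverage."""
    fired = [r for r in rules if matches(r, example)]
    if not fired:
        return default  # no rule applies
    return max(fired, key=lambda r: r["coverage"])["then"]


rules = [
    {"if": {"outlook": "sunny"}, "then": "no", "coverage": 3},
    {"if": {"windy": False}, "then": "yes", "coverage": 6},
]

conflict = {"outlook": "sunny", "windy": False}   # both rules fire
uncovered = {"outlook": "rainy", "windy": True}   # no rule fires

ordered_result = classify_ordered(rules, conflict)
unordered_result = classify_unordered(rules, conflict)
no_conclusion = classify_unordered(rules, uncovered)
most_frequent = classify_unordered(rules, uncovered, default="yes")
```

The same example gets different classes under the two schemes, which is why the order of a decision list matters for interpretation and why an unordered set needs an explicit conflict-resolution policy.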