# Attributes

- Each example described by a fixed, pre-defined set of features (attributes)
- In practice, number of applicable attributes may vary
  - ex. Transportation vehicles
    - no. wheels not applicable to ships
    - no. masts not applicable to cars
  - Possible solution: "irrelevant value" flag
- Attributes may be dependent on other attributes

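A minimal sketch of the "irrelevant value" flag, assuming examples are stored as dictionaries over a fixed attribute set. The sentinel `IRRELEVANT`, the attribute names and the helper `make_example` are all hypothetical, chosen only to mirror the vehicle example above.

```python
# Hypothetical sentinel marking an attribute that does not apply to an example.
IRRELEVANT = "irrelevant"

# Fixed, pre-defined attribute set shared by every example (illustrative names).
ATTRIBUTES = ("wheels", "masts")


def make_example(**known):
    """Build an example with every attribute present; unset ones get the flag."""
    return {name: known.get(name, IRRELEVANT) for name in ATTRIBUTES}


car = make_example(wheels=4)   # masts do not apply to a car
ship = make_example(masts=3)   # wheels do not apply to a ship

print(car)
print(ship)
```

Every example keeps the same set of keys, so a learner that expects a fixed number of attributes can still consume both vehicles.
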
# Taxonomy of Data Types

# Nominal Attributes

- Distinct symbols
- Serve as labels or names
- No relation implied among nominal values
- Only equality tests can be performed
  - ex. outlook = sunny

# Sources of Missing Values

- Malfunctioning / misconfigured equipment
- Changes in design
- Collation of different datasets
- Data not collected for mining
  - Errors and omissions don't affect the original purpose of the data
  - ex. Banks do not need to know age in banking datasets, so DOB may contain missing values
- Missing value may have significance in itself
  - ex. A medical diagnosis can sometimes be made from which tests a doctor decides to run, rather than from their outcomes
  - Most DM algos assume this is not the case, hence "missing" may need to be coded as an additional nominal value

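Coding "missing" as an additional nominal value can be sketched as below. This assumes missing entries are represented as `None`; the label `MISSING`, the helper `encode_missing` and the patient records are made up for illustration.

```python
# Hypothetical label used as the extra nominal value.
MISSING = "missing"


def encode_missing(records, attribute):
    """Return copies of the records with None in `attribute` replaced by MISSING."""
    return [
        {**r, attribute: MISSING if r.get(attribute) is None else r[attribute]}
        for r in records
    ]


# Made-up records: the second patient was never given the test.
patients = [
    {"blood_test": "normal"},
    {"blood_test": None},
    {"blood_test": "abnormal"},
]

encoded = encode_missing(patients, "blood_test")
print([p["blood_test"] for p in encoded])
```

After encoding, a learner can test `blood_test = missing` like any other nominal value, so the fact that a test was not ordered can itself contribute to the prediction.
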
# Inaccurate Values

- Typographical errors in nominal attributes
- Typographical and measurement errors in numeric attributes
- Deliberate errors
  - ex. Incorrect ZIP codes, unsanitised inputs
- Duplicate examples

# Weka and ARFF

## Weather Dataset in ARFF

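As a rough illustration of the ARFF layout, the sketch below embeds a small weather-style file in a Python string and reads it back. It assumes the usual structure of a header (`@relation`, `@attribute`) followed by `@data` rows; the three data rows are an abbreviated illustration, and `parse_arff` is a hypothetical helper that handles only this simple subset. It is not Weka's loader.

```python
ARFF_TEXT = """\
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,65,70,TRUE,no
"""


def parse_arff(text):
    """Tiny reader: handles only @relation, @attribute and comma-separated rows."""
    relation, attributes, rows, in_data = None, [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: the braces list the allowed symbols.
                values = [v.strip() for v in spec.strip("{}").split(",")]
                attributes.append((name, values))
            else:
                # Numeric (or other) attribute: keep the type name as a string.
                attributes.append((name, spec.lower()))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, len(attributes), len(rows))
```

Nominal attributes declare their allowed values in braces, while numeric ones only name their type; the data rows list values in the same order as the attribute declarations.
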
### Getting to Know the Data

- First task: get to know the data
- Simple visualisations useful:
  - Nominal: bar graphs
  - Numeric: histograms
  - 2D and 3D plots show dependencies
- Need to consult domain experts
- Too much data to inspect? Take a sample.

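The first-look steps above can be sketched with the standard library alone: value counts for a nominal column, binned counts for a numeric column, and a random sample. The two columns are made-up values, and the helper `histogram` and its bin width are assumptions for illustration.

```python
import random
from collections import Counter

# Made-up columns for illustration.
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy"]  # nominal
humidity = [85, 90, 86, 96, 80, 70]                                  # numeric


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    return Counter((v // bin_width) * bin_width for v in values)


bars = Counter(outlook)          # bar graph data: one count per symbol
hist = histogram(humidity, 10)   # histogram data: one count per bin

for label, count in bars.items():
    print(f"{label:<10}{'#' * count}")
for edge in sorted(hist):
    print(f"{edge}-{edge + 9:<6}{'#' * hist[edge]}")

# Too much data: inspect a random sample instead (seeded for repeatability).
rng = random.Random(0)
sample = rng.sample(humidity, 3)
print(sample)
```

A plotting library would normally draw these, but the underlying counts are the same ones a bar graph or histogram displays.
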
# Concept Descriptions

- Output of DM algorithm
- Many ways of representing:
  - Decision Trees
  - Rules
  - Linear Regression Functions

## Decision Trees

- Divide-and-conquer approach
- Trees drawn upside down
  - Node at top is root
  - Edges are branches
  - Rectangles represent leaves
- Leaves assign classification
- Nodes involve testing an attribute

### Decision Tree with Nominal Attributes

- Number of branches usually equal to number of values the attribute can take
- Attribute not tested more than once along any path

### Decision Tree with Numeric Attributes

- Test whether value is greater or less than a constant
- Attribute may be tested multiple times along a path

### Decision Trees with Missing Values

- Not clear which branch should be taken when a node tests an attribute whose value is missing
- Does absence of a value have significance?
  - Yes => Treat "missing" as a separate value during training
  - No => Treat in a special way during testing
    - ex. Assign sample to most popular branch

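The three kinds of node test can be sketched together in one small classifier. The tree below is a hypothetical example over weather-style attributes, stored as nested dictionaries; the threshold, the per-branch training counts and the function `classify` are assumptions for illustration, and "most popular branch" is taken to mean the branch that received the most training examples.

```python
# Hypothetical tree: a nominal test at the root, a numeric test and a nominal
# test below it. "counts" holds illustrative training counts per branch.
TREE = {
    "attribute": "outlook",                 # nominal: one branch per value
    "counts": {"sunny": 5, "overcast": 4, "rainy": 5},
    "branches": {
        "sunny": {
            "attribute": "humidity",        # numeric: compare with a constant
            "threshold": 75,
            "counts": {"<=": 2, ">": 3},
            "branches": {"<=": {"leaf": "yes"}, ">": {"leaf": "no"}},
        },
        "overcast": {"leaf": "yes"},
        "rainy": {
            "attribute": "windy",
            "counts": {"TRUE": 2, "FALSE": 3},
            "branches": {"TRUE": {"leaf": "no"}, "FALSE": {"leaf": "yes"}},
        },
    },
}


def classify(node, example):
    """Walk from the root to a leaf and return the leaf's class."""
    while "leaf" not in node:
        value = example.get(node["attribute"])
        if value is None:
            # Missing value: follow the most popular branch.
            key = max(node["counts"], key=node["counts"].get)
        elif "threshold" in node:
            key = "<=" if value <= node["threshold"] else ">"
        else:
            key = value
        node = node["branches"][key]
    return node["leaf"]


print(classify(TREE, {"outlook": "sunny", "humidity": 90}))
print(classify(TREE, {"outlook": "rainy"}))  # windy is missing
```

Treating "missing" as a separate value would instead add an extra branch to the node during training, so no special case would be needed at test time.
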
# Classification Rules

- Popular alternative to decision trees
- Antecedent (pre-condition): series of tests
  - Tests usually logically ANDed together
- Consequent (conclusion): usually a class
- Individual rules often logically ORed together

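A minimal sketch of this structure, assuming each test is a predicate on an example stored as a dictionary. The `Rule` class, the helper `fired_conclusions` and the two weather-style rules are hypothetical and serve only to show the AND within a rule and the OR across rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    tests: List[Callable[[Dict], bool]]  # antecedent: a series of tests
    conclusion: str                      # consequent: usually a class

    def matches(self, example: Dict) -> bool:
        # The tests of one rule are ANDed together.
        return all(test(example) for test in self.tests)


# Hypothetical rules for illustration.
rules = [
    Rule([lambda e: e["outlook"] == "sunny", lambda e: e["humidity"] > 75], "no"),
    Rule([lambda e: e["outlook"] == "overcast"], "yes"),
]


def fired_conclusions(rules: List[Rule], example: Dict) -> List[str]:
    """Rules are ORed: any matching rule contributes its conclusion."""
    return [rule.conclusion for rule in rules if rule.matches(example)]


print(fired_conclusions(rules, {"outlook": "sunny", "humidity": 90}))
```

Returning a list rather than a single class leaves open the cases where several rules fire or none does.
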
## If-Then Rules for Contact Lenses

# Nuggets

- Are rules independent pieces ("nuggets") of knowledge?
  - Problem: Ignores process of executing rules
- Ordered set (decision list)
  - Order important for interpretation
- Unordered set
  - Rules may overlap and lead to different conclusions for the same example
  - Needs conflict resolution

## Executing Rules

- What if $\geq 2$ rules conflict?
  - Give no conclusion?
  - Go with the rule that covers largest no. training samples?
- What if no rule applies to test example?
  - Give no conclusion?
  - Go with class that is most frequent in the training data?

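The options above can be sketched for both an ordered and an unordered rule set. Each rule here is a hypothetical `(test, conclusion, coverage)` triple, where `coverage` is a made-up number of training samples the rule covers; the function names and the `default` parameter are assumptions for illustration.

```python
# Hypothetical rules: (test, conclusion, number of training samples covered).
RULES = [
    (lambda e: e["outlook"] == "sunny", "no", 3),
    (lambda e: e["windy"] is False, "yes", 6),
]


def classify_ordered(rules, example, default=None):
    """Decision list: the first rule that fires decides."""
    for test, conclusion, _ in rules:
        if test(example):
            return conclusion
    return default  # no rule applies


def classify_unordered(rules, example, default=None):
    """Unordered set: resolve conflicts by preferring the largest coverage."""
    fired = [(coverage, conclusion)
             for test, conclusion, coverage in rules if test(example)]
    if not fired:
        return default  # no rule applies
    return max(fired, key=lambda pair: pair[0])[1]


conflict = {"outlook": "sunny", "windy": False}   # both rules fire
uncovered = {"outlook": "rainy", "windy": True}   # no rule fires

print(classify_ordered(RULES, conflict), classify_unordered(RULES, conflict))
print(classify_unordered(RULES, uncovered, default="yes"))
```

Passing `default=None` corresponds to giving no conclusion, while passing the most frequent training class corresponds to the second option for uncovered examples.
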
## Special Case: Boolean Classes

- Assumption: if example does not belong to class "yes", it belongs to class "no"
- Solution: only learn rules for class "yes", use default rule for "no"

- Order of rules is not important, no conflicts: every learned rule concludes "yes".

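A minimal sketch of the Boolean special case, assuming rules are learned only for class "yes" and everything else falls through to a default "no". The attributes `x`, `y`, `z`, `w` and both rules are hypothetical.

```python
# Hypothetical rules, all concluding "yes".
YES_RULES = [
    lambda e: e["x"] == 1 and e["y"] == 1,
    lambda e: e["z"] == 1 and e["w"] == 1,
]


def classify(example, yes_rules=YES_RULES):
    """Return "yes" if any rule fires; otherwise apply the default rule "no"."""
    return "yes" if any(rule(example) for rule in yes_rules) else "no"


a = {"x": 1, "y": 1, "z": 0, "w": 0}
b = {"x": 0, "y": 1, "z": 0, "w": 1}

print(classify(a), classify(b))
# Reversing the rule list gives the same answers, since all rules agree on "yes".
print(classify(a, YES_RULES[::-1]), classify(b, YES_RULES[::-1]))
```

Because the rule set is a plain disjunction of "yes" conditions, the result depends only on whether some rule fires, not on which one is tried first.
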