G4G0-2/AI & Data Mining/Week 1/Lecture 2 - Input and Output.md
2025-01-30 09:27:31 +00:00

# Attributes
- Each example is described by a fixed, pre-defined set of features (attributes)
- In practice, the number of attributes may vary
    - ex. Transportation Vehicles
        - no. wheels not applicable to ships
        - no. masts not applicable to cars
    - Possible solution: "irrelevant value" flag
- Attributes may depend on other attributes (e.g. whether one applies can depend on another's value)
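
The "irrelevant value" flag can be sketched as below. This is only an illustration: the sentinel `IRRELEVANT`, the attribute names, and the two example vehicles are assumptions made for the sketch, not part of the lecture.

```python
# Hypothetical sentinel standing in for the "irrelevant value" flag.
IRRELEVANT = "irrelevant"

# Every example carries the same fixed attribute set; attributes that
# do not apply are flagged instead of being dropped.
vehicles = [
    {"kind": "car", "wheels": 4, "masts": IRRELEVANT},
    {"kind": "ship", "wheels": IRRELEVANT, "masts": 2},
]


def applicable_attributes(example):
    """Return only the attributes that actually apply to this example."""
    return {name: value for name, value in example.items() if value != IRRELEVANT}


for vehicle in vehicles:
    print(applicable_attributes(vehicle))
```

Keeping the flag inside the record means every example still has the same fixed set of attributes, which is what most learning schemes expect.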
# Taxonomy of Data Types
![](Pasted%20image%2020240920132209.png)
# Nominal Attributes
- Distinct symbols
- Serve as labels or names
- No relation implied among nominal values
- Only equality tests can be performed
- ex. outlook = sunny
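
One way to see that only equality tests make sense for nominal values is to model them as a plain Python `Enum`. The class name `Outlook` and the values other than `sunny` are assumptions for this sketch.

```python
from enum import Enum


class Outlook(Enum):
    # Nominal values are just distinct labels with no order or distance.
    SUNNY = "sunny"
    OVERCAST = "overcast"
    RAINY = "rainy"


def is_sunny(value):
    """Equality is the only meaningful test on a nominal value."""
    return value == Outlook.SUNNY


def supports_ordering(a, b):
    """Report whether `a < b` is defined for the two values."""
    try:
        a < b
    except TypeError:
        return False
    return True


print(is_sunny(Outlook.SUNNY))
print(supports_ordering(Outlook.SUNNY, Outlook.RAINY))
```

A plain `Enum` deliberately defines no ordering, so a comparison such as `Outlook.SUNNY < Outlook.RAINY` raises `TypeError`, which mirrors the "no relation implied" property.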
# Sources of Missing Values
- Malfunctioning / Misconfigured Equipment
- Changes in design
- Collation of different datasets
- Data not collected for mining
    - Errors and omissions don't affect the original purpose of the data
    - ex. Banks do not need to know customers' ages, so DOB may contain missing values
- Missing value may have significance
    - ex. a medical diagnosis can sometimes be made from which tests the doctor decided to run, rather than from their outcomes
    - Most DM algos assume this is not the case, hence "missing" may need to be coded as an additional nominal value.
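
Coding "missing" as an additional nominal value could look roughly like this. The label `MISSING`, the helper `code_missing`, and the patient records are hypothetical and exist only for illustration.

```python
# Hypothetical label used as the extra nominal value.
MISSING = "missing"


def code_missing(records, attribute):
    """Return copies of the records with absent values replaced by MISSING."""
    coded = []
    for record in records:
        value = record.get(attribute)  # None if the key is absent
        coded.append({**record, attribute: MISSING if value is None else value})
    return coded


# Made-up records: a test that was not run may itself be informative.
patients = [
    {"id": 1, "blood_test": "positive"},
    {"id": 2, "blood_test": None},
    {"id": 3},
]

coded = code_missing(patients, "blood_test")
print([record["blood_test"] for record in coded])
```

After this step a learner can test `blood_test = missing` like any other nominal value.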
# Inaccurate Values
- Typographical errors in nominal attributes
- Typographical and measurement errors in numeric attributes
- Deliberate errors
    - ex. Incorrect ZIP codes, unsanitised inputs
- Duplicate examples
# Weka and ARFF
## Weather Dataset in ARFF
![](Pasted%20image%2020240920132732.png)
### Getting to Know the Data
- First task, get to know data
- Simple visualisations useful:
    - Nominal: bar graph
    - Numeric: histograms
- 2D and 3D plots show dependencies
- Need to consult experts
- Too much data? Take sample.
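
A first look at the data can be sketched with the standard library alone: counts per nominal value (the numbers behind a bar graph), binned counts for a numeric attribute (the numbers behind a histogram), and a random sample for large datasets. The helper names and the small value lists are assumptions made for this example.

```python
import random
from collections import Counter


def nominal_counts(values):
    """Count occurrences of each nominal value (input for a bar graph)."""
    return Counter(values)


def histogram(values, bin_width):
    """Count numeric values per bin; keys are the lower edge of each bin."""
    bins = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(bins.items()))


def take_sample(rows, k, seed=0):
    """Draw up to k rows without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


# Made-up attribute columns, for illustration only.
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy"]
temperature = [85, 80, 83, 70, 68, 65]

print(nominal_counts(outlook))
print(histogram(temperature, 10))
print(take_sample(temperature, 3))
```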
# Concept Descriptions
- Output of DM algorithm
- Many ways of representing:
    - Decision Trees
    - Rules
    - Linear Regression Functions
## Decision Trees
- Divide-and-Conquer approach
- Trees drawn upside down
- Node at top is root
- Edges are branches
- Rectangles represent leaves
- Leaves assign classification
- Nodes involve testing attribute
### Decision Tree with Nominal Attributes
![](Pasted%20image%2020240920133218.png)
- Number of branches usually equal to number of values
- Attribute not tested more than once.
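
A tree over nominal attributes can be sketched as nested dictionaries, with one branch per attribute value and plain strings as leaves. The tree below is a hypothetical example and is not claimed to match the figure; the dictionary layout and the `classify` helper are assumptions of this sketch.

```python
# Internal nodes test one nominal attribute and have one branch per value.
# Leaves are plain strings holding the class.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": {
            "attribute": "windy",
            "branches": {"true": "no", "false": "yes"},
        },
    },
}


def classify(node, example):
    """Walk from the root to a leaf, following the branch for each tested value."""
    while isinstance(node, dict):
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node


print(classify(tree, {"outlook": "sunny", "humidity": "high", "windy": "false"}))
```

Because each test splits on every value of the attribute, testing the same nominal attribute again further down a path would give no new information.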
### Decision Tree with Numeric Attributes
![](Pasted%20image%2020240920133316.png)
- Test whether value is greater or less than constant
- Attribute may be tested multiple times
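
For numeric attributes, a node can be sketched as a comparison against a constant, and the same attribute can reappear lower down with a different constant. The class `NumericTest`, the thresholds, and the class labels are hypothetical; sending values equal to the threshold down the "at or above" branch is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class NumericTest:
    attribute: str
    threshold: float
    below: Union["NumericTest", str]        # taken when value < threshold
    at_or_above: Union["NumericTest", str]  # taken when value >= threshold


# "temperature" is tested twice along the left path.
tree = NumericTest(
    "temperature", 75,
    below=NumericTest("temperature", 65, below="cold", at_or_above="mild"),
    at_or_above="hot",
)


def classify(node, example):
    """Follow threshold tests until a leaf (a plain string) is reached."""
    while isinstance(node, NumericTest):
        if example[node.attribute] < node.threshold:
            node = node.below
        else:
            node = node.at_or_above
    return node


print(classify(tree, {"temperature": 70}))
```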
### Decision Trees with Missing Values
- Not clear which branch should be taken when node tests attribute with missing value
- Does absence of a value have significance?
    - Yes => Treat as separate value during training
    - No => Treat in special way during testing
        - Assign sample to most popular branch
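
The "most popular branch" strategy can be sketched by storing, at each node, how many training examples went down each branch. The node layout, the `training_counts` field, and the counts themselves are assumptions invented for this example.

```python
# A single node; the counts record how many training examples took each branch.
node = {
    "attribute": "outlook",
    "branches": {"sunny": "no", "overcast": "yes", "rainy": "yes"},
    "training_counts": {"sunny": 5, "overcast": 4, "rainy": 3},  # hypothetical
}


def follow_branch(node, example):
    """Pick a branch, falling back to the most popular one if the value is missing."""
    value = example.get(node["attribute"])
    if value is None:
        counts = node["training_counts"]
        value = max(counts, key=counts.get)
    return node["branches"][value]


print(follow_branch(node, {"outlook": "rainy"}))
print(follow_branch(node, {"outlook": None}))
```

If the absence of a value is itself meaningful, the alternative from the notes applies instead: add a separate branch for the missing value during training.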
# Classification Rules
- Popular alternative to decision tree
- Antecedent (pre-condition) - series of tests
- Tests usually logically ANDed together
- Consequent (conclusion) - usually a class
- Individual rules often logically ORed together
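
The AND/OR structure can be sketched as follows: a rule's antecedent holds only if all of its tests hold, and a class is predicted if any rule for that class fires. The rule format (dictionaries with `"if"` and `"then"` keys) and the two rules are hypothetical.

```python
# Hypothetical rules: each antecedent is a set of attribute = value tests.
rules = [
    {"if": {"outlook": "sunny", "humidity": "high"}, "then": "no"},
    {"if": {"outlook": "rainy", "windy": "true"}, "then": "no"},
]


def antecedent_holds(antecedent, example):
    """All tests in the antecedent must hold (logical AND)."""
    return all(example.get(attr) == value for attr, value in antecedent.items())


def class_predicted(rules, target, example):
    """The class is predicted if any rule concluding it fires (logical OR)."""
    return any(
        rule["then"] == target and antecedent_holds(rule["if"], example)
        for rule in rules
    )


example = {"outlook": "rainy", "humidity": "normal", "windy": "true"}
print(class_predicted(rules, "no", example))
```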
## If-Then Rules for Contact Lenses
![](Pasted%20image%2020240920133706.png)
# Nuggets
- Are rules independent "nuggets" of knowledge?
    - Problem: this ignores the process of executing rules
- Ordered set (decision list)
    - Order important for interpretation
- Unordered set
    - Rules may overlap and lead to different conclusions for the same example
    - Needs conflict resolution
## Executing Rules
- What if $\geq$ 2 rules conflict?
    - Give no conclusion?
    - Go with the rule that covers the largest number of training samples?
- What if no rule applies to a test example?
    - Give no conclusion?
    - Go with the class that is most frequent?
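
One possible combination of these choices, for an unordered rule set, is sketched below: conflicts are resolved in favour of the rule covering the most training samples, and the most frequent class is used when no rule applies. The rule format, the `coverage` numbers, and the training labels are all assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical unordered rules; "coverage" is the number of training
# samples each rule covered.
rules = [
    {"if": {"outlook": "sunny"}, "then": "no", "coverage": 3},
    {"if": {"humidity": "normal"}, "then": "yes", "coverage": 7},
]

# Hypothetical training labels, used only to find the most frequent class.
training_labels = ["yes"] * 9 + ["no"] * 5


def classify(rules, example, training_labels):
    """Apply all rules; resolve conflicts by coverage, fall back to the majority class."""
    fired = [
        rule for rule in rules
        if all(example.get(attr) == value for attr, value in rule["if"].items())
    ]
    if not fired:
        # No rule applies: go with the most frequent class.
        return Counter(training_labels).most_common(1)[0][0]
    # One or more rules fire: trust the one with the largest coverage.
    return max(fired, key=lambda rule: rule["coverage"])["then"]


print(classify(rules, {"outlook": "sunny", "humidity": "normal"}, training_labels))
print(classify(rules, {"outlook": "overcast", "humidity": "high"}, training_labels))
```

Returning no conclusion in either case is the other option listed above; that would mean returning a value such as `None` instead.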
## Special Case: Boolean Classes
- Assumption: if example does not belong to class "yes", belongs to "no"
- Solution: only learn rules for class "yes", use default rule for "no"
![](Pasted%20image%2020240920134203.png)
- Order is important, no conflicts.
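
The boolean special case can be sketched as a list of rules that all conclude "yes", followed by a default that concludes "no". The rules and the function name are hypothetical and are not claimed to match the figure.

```python
# Hypothetical antecedents; every rule here concludes "yes".
yes_rules = [
    {"outlook": "overcast"},
    {"humidity": "normal", "windy": "false"},
]


def classify_yes_no(yes_rules, example):
    """Return "yes" if any rule fires; otherwise the default rule gives "no"."""
    for antecedent in yes_rules:
        if all(example.get(attr) == value for attr, value in antecedent.items()):
            return "yes"
    # Default rule: it must be tried last, after every "yes" rule has failed.
    return "no"


print(classify_yes_no(yes_rules, {"outlook": "overcast", "humidity": "high", "windy": "true"}))
print(classify_yes_no(yes_rules, {"outlook": "sunny", "humidity": "high", "windy": "true"}))
```

Since every learned rule predicts the same class, two rules firing together cannot disagree; the only ordering that matters in this sketch is that the default comes last.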