G4G0-2/AI & Data Mining/Week 1/Lecture 2 - Input and Output.md

# Attributes

- Each example is described by a fixed, pre-defined set of features (attributes)
- In practice, the number of applicable attributes may vary
	- ex. Transportation Vehicles
		- no. wheels not applicable to ships
		- no. masts not applicable to cars
	- Possible solution: "irrelevant value" flag
- Attributes may depend on other attributes (e.g. whether one applies can depend on the value of another)
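
One way to picture the "irrelevant value" flag is sketched below. The sentinel `NOT_APPLICABLE`, the `vehicles` records and the `applicable` helper are all made up for illustration; the notes do not prescribe a particular encoding.

```python
# Sentinel marking an attribute that does not apply to an example
# (assumed encoding; distinct from a value that is merely unknown).
NOT_APPLICABLE = "n/a"

# Hypothetical vehicle examples sharing one fixed attribute set.
vehicles = [
    {"type": "car", "wheels": 4, "masts": NOT_APPLICABLE},
    {"type": "ship", "wheels": NOT_APPLICABLE, "masts": 2},
]

def applicable(example, attribute):
    """Return True if the attribute carries a real value for this example."""
    return example[attribute] != NOT_APPLICABLE

for v in vehicles:
    print(v["type"], [a for a in ("wheels", "masts") if applicable(v, a)])
```

Every example keeps the same attribute set, and the flag records which attributes are meaningful for it.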

# Taxonomy of Data Types

![](Pasted%20image%2020240920132209.png)

# Nominal Attributes

- Distinct symbols
	- Serve as labels or names
- No relation implied among nominal values
- Only equality tests can be performed
	- ex. outlook = sunny

# Sources of Missing Values

- Malfunctioning / Misconfigured Equipment
- Changes in design
- Collation of different datasets
- Data not collected for mining
	- Errors and omissions don't affect the data's original purpose
	- ex. A bank does not need a customer's age for its own operations, so the DOB field may contain missing values
- Missing value may have significance
	- ex. in medical data, which tests a doctor chose to order can itself inform a diagnosis, independent of the test outcomes.
	- Most DM algos assume this is not the case, hence "missing" may need to be coded as an additional nominal value.
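
A minimal sketch of coding "missing" as an additional nominal value. The `patients` records, the `MISSING` label and the `encode_missing` helper are hypothetical names chosen for this example.

```python
MISSING = "missing"  # extra nominal value standing in for an absent entry

def encode_missing(records, attribute):
    """Return copies of the records with None replaced by the MISSING label."""
    return [
        {**r, attribute: MISSING if r.get(attribute) is None else r[attribute]}
        for r in records
    ]

# Hypothetical patient records; None means the test was never ordered.
patients = [
    {"id": 1, "blood_test": "normal"},
    {"id": 2, "blood_test": None},
    {"id": 3, "blood_test": "abnormal"},
]

encoded = encode_missing(patients, "blood_test")
print([p["blood_test"] for p in encoded])
```

After encoding, a learner that only handles nominal values can treat "test not ordered" like any other value of the attribute.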

# Inaccurate Values

- Typographical errors in nominal attributes
- Typographical and measurement errors in numeric attributes
- Deliberate errors
	- ex. Incorrect ZIP codes, unsanitised inputs
- Duplicate examples
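
Two of these problems can be checked mechanically. The sketch below assumes a known set of legal nominal values (`ALLOWED_OUTLOOK`) and a small made-up list of rows; both are illustrative only.

```python
from collections import Counter

ALLOWED_OUTLOOK = {"sunny", "overcast", "rainy"}  # assumed legal values

# Hypothetical (outlook, temperature) rows with one typo and one duplicate.
rows = [
    ("sunny", 85),
    ("suny", 80),
    ("rainy", 70),
    ("sunny", 85),
]

# Nominal values outside the allowed set are likely typographical errors.
suspect = [r for r in rows if r[0] not in ALLOWED_OUTLOOK]

# Rows that appear more than once are duplicate examples.
duplicates = [r for r, n in Counter(rows).items() if n > 1]

print("suspect:", suspect)
print("duplicates:", duplicates)
```

Measurement errors in numeric attributes and deliberate errors usually need domain knowledge rather than a simple lookup.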

# Weka and ARFF

## Weather Dataset in ARFF

![](Pasted%20image%2020240920132732.png)

### Getting to Know the Data

- First task, get to know data
- Simple visualisations useful:
	- Nominal: bar graph
	- Numeric: histograms
- 2D and 3D plots show dependencies
- Need to consult experts
- Too much data? Take sample.
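
A rough first look at the data needs nothing beyond the standard library. The values below are illustrative weather-style data typed in by hand (an assumption, not read from the ARFF file), and the text "plots" stand in for real charts.

```python
import random
from collections import Counter

# Illustrative weather-style values (assumed for this sketch).
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]

# Nominal attribute: frequency per value, drawn as a text bar graph.
counts = Counter(outlook)
for value, n in counts.most_common():
    print(f"{value:<10}{'#' * n}")

# Numeric attribute: histogram with bins of width 10.
bins = Counter((t // 10) * 10 for t in temperature)
for start in sorted(bins):
    print(f"{start}-{start + 9}  {'#' * bins[start]}")

# Too much data: inspect a random sample instead (seed fixed for repeatability).
random.seed(0)
sample = random.sample(temperature, k=5)
print(sample)
```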

# Concept Descriptions

- Output of DM algorithm
- Many ways of representing:
	- Decision Trees
	- Rules
	- Linear Regression Functions

## Decision Trees

- Divide-and-Conquer approach
- Trees drawn upside down
	- Node at top is root
- Edges are branches
- Rectangles represent leaves
- Leaves assign classification
- Nodes involve testing attribute

### Decision Tree with Nominal Attributes

![](Pasted%20image%2020240920133218.png)

- Number of branches usually equal to number of values
- Attribute not tested more than once along a path from root to leaf.

### Decision Tree with Numeric Attributes

![](Pasted%20image%2020240920133316.png)

- Test whether value is greater or less than constant
- Attribute may be tested multiple times
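
A decision tree can be read as nested tests. The sketch below hand-codes a small weather-style tree mixing a nominal test (one branch per value) with a numeric test (comparison against a constant). The tree shape and the threshold 75 are assumptions for illustration and are not taken from the figures above.

```python
def classify(example):
    """Walk a small hand-written tree from the root down to a leaf."""
    outlook = example["outlook"]  # root: nominal test, one branch per value
    if outlook == "overcast":
        return "yes"  # leaf
    if outlook == "sunny":
        # Numeric test: compare against a constant (75 is an assumed threshold).
        return "no" if example["humidity"] > 75 else "yes"
    if outlook == "rainy":
        return "no" if example["windy"] else "yes"
    raise ValueError(f"unknown outlook value: {outlook!r}")

print(classify({"outlook": "sunny", "humidity": 70, "windy": False}))
```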

### Decision Trees with Missing Values

- Not clear which branch should be taken when node tests attribute with missing value
- Does absence of a value have significance?
	- Yes => Treat as separate value during training
	- No => Treat in special way during testing
		- Assign sample to most popular branch
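
The "most popular branch" strategy could look roughly like this. The helper names and the tiny training set are hypothetical; `None` stands for a missing value.

```python
from collections import Counter

def most_popular_branch(training_examples, attribute):
    """Return the attribute value seen most often in the training data."""
    counts = Counter(
        e[attribute] for e in training_examples if e[attribute] is not None
    )
    return counts.most_common(1)[0][0]

def branch_for(example, attribute, training_examples):
    """Pick the branch to follow, falling back to the most popular one."""
    value = example.get(attribute)
    if value is not None:
        return value
    return most_popular_branch(training_examples, attribute)

# Hypothetical training examples reaching a node that tests "outlook".
training = [
    {"outlook": "sunny"},
    {"outlook": "sunny"},
    {"outlook": "rainy"},
    {"outlook": "overcast"},
]

print(branch_for({"outlook": None}, "outlook", training))
```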

# Classification Rules

- Popular alternative to decision tree
- Antecedent (pre-condition) - series of tests
- Tests usually logically ANDed together
- Consequent (conclusion) - usually a class
- Individual rules often logically ORed together
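
The AND/OR structure can be sketched with a rule stored as an (antecedent, consequent) pair. The rules and attribute names below are invented for illustration.

```python
# Each rule: (antecedent, consequent). The antecedent maps attribute -> required value.
rules = [
    ({"outlook": "overcast"}, "yes"),
    ({"outlook": "sunny", "humidity": "normal"}, "yes"),
]

def matches(antecedent, example):
    """All tests in the antecedent must hold (logical AND)."""
    return all(example.get(attr) == value for attr, value in antecedent.items())

def concludes(rules, example, cls):
    """The class is concluded if any rule for it fires (logical OR)."""
    return any(
        matches(antecedent, example)
        for antecedent, consequent in rules
        if consequent == cls
    )

print(concludes(rules, {"outlook": "sunny", "humidity": "normal"}, "yes"))
```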

## If-Then Rules for Contact Lenses

![](Pasted%20image%2020240920133706.png)

# Nuggets

- Are rules independent "nuggets" of knowledge?
- Problem: this view ignores how the rules are executed
	- Ordered set (decision list)
		- Order important for interpretation
	- Unordered set
		- Rules may overlap and lead to different conclusions for the same example
		- Needs conflict resolution

## Executing Rules

- What if $\geq$ 2 rules conflict?
	- Give no conclusion?
	- Go with the rule that covers largest no. of training samples?
- What if no rule applies to a test example?
	- Give no conclusion?
	- Go with class that is most frequent?
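
The two execution styles and the fallback options can be compared in a few lines. This is only a sketch: the rules, their coverage counts and the function names are hypothetical, and coverage-based tie-breaking is just one of the conflict resolution options listed above.

```python
def matches(antecedent, example):
    """All tests in the antecedent must hold."""
    return all(example.get(attr) == value for attr, value in antecedent.items())

def classify_ordered(rules, example, default=None):
    """Decision list: the first rule that fires decides the class."""
    for antecedent, cls, _coverage in rules:
        if matches(antecedent, example):
            return cls
    return default  # no rule applies

def classify_unordered(rules, example, default=None):
    """Unordered set: among firing rules, prefer the one covering most training samples."""
    fired = [(coverage, cls) for antecedent, cls, coverage in rules
             if matches(antecedent, example)]
    if not fired:
        return default  # no rule applies
    return max(fired, key=lambda pair: pair[0])[1]

# Hypothetical rules: (antecedent, class, no. of training samples covered).
rules = [
    ({"outlook": "sunny"}, "no", 3),
    ({"humidity": "normal"}, "yes", 6),
]

example = {"outlook": "sunny", "humidity": "normal"}  # both rules fire
print(classify_ordered(rules, example), classify_unordered(rules, example))
```

Passing `default=None` corresponds to giving no conclusion, while passing the most frequent class corresponds to the other fallback.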

## Special Case: Boolean Classes

- Assumption: if example does not belong to class "yes", belongs to "no"
- Solution: only learn rules for class "yes", use default rule for "no"
![](Pasted%20image%2020240920134203.png)
- Order of the "yes" rules is not important; no conflicts arise, since every learned rule predicts the same class.
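
A minimal sketch of the default-rule idea, assuming made-up attributes `x`, `y`, `z`, `w` and two hypothetical rules for class "yes":

```python
def matches(antecedent, example):
    """All tests in the antecedent must hold."""
    return all(example.get(attr) == value for attr, value in antecedent.items())

# Hypothetical antecedents; every one of them concludes class "yes".
yes_rules = [
    {"x": 1, "y": 1},
    {"z": 1, "w": 1},
]

def classify(example, rules=yes_rules):
    """Return "yes" if any learned rule fires, otherwise the default "no"."""
    return "yes" if any(matches(r, example) for r in rules) else "no"

print(classify({"x": 1, "y": 1, "z": 0, "w": 0}))
print(classify({"x": 0, "y": 1, "z": 1, "w": 0}))
```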