vault backup: 2024-10-16 09:12:37

# Assessment
## T1
- Exam (50%)
## T2
- Coursework (50%)
# Resources
Data Mining: Practical Machine Learning Tools and Techniques (Witten, Frank, Hall & Pal) 4th Edition 2016
Scientific Calculator
# Data Vs Information
- Too much data
- Valuable resource
- Raw data less important, need to develop techniques to extract information
- Data: recorded facts
- Information: patterns underlying data
# Philosophy
## Cow Culling
- Each cow described by ~700 attributes
- Problem: selecting which cows to cull
- Data: historical records and farmers' decisions
- Machine learning used to ascertain which factors farmers take into account, rather than to automate the decision-making process.
# Definition of Data Mining
- The extraction of:
- Implicit,
- Previously unknown,
- Potentially useful information from the data
- Programs that detect patterns and regularities are needed
- Strong patterns => good predictions
- Issues:
- Most patterns not interesting
- Patterns may be inexact
- Data may be garbled or missing
# Machine Learning Techniques
- Algorithms for acquiring structural descriptions from examples
- Structural descriptions represent patterns explicitly.
- Predict outcome in new situation
- Understand and explain how prediction derived.
- Methods originate from AI, statistics and research on databases.
# Can Machines Learn?
- By definition, arguably: the ability to obtain knowledge by study, experience or being taught is very difficult to measure.
- Does learning imply intention?
# Terminology
- Concept - Thing to be learned
- Example / Instance - An individual, independent example of a concept
- Attributes / Features - Measuring aspects of an example / instance
- Concept description (pattern, model, hypothesis) - Output for data mining algorithms.
# Famous Small Datasets
- Will be used in module
- Unrealistically simple
## Weather Dataset - Nominal
Concept: conditions which are suitable for a game.
Reference: Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
### Attributes
3\*3\*2\*2 = 36 possible combinations of values.
Outlook
- sunny, overcast, rainy
Temperature
- hot, mild, cool
Humidity
- high, normal
Windy
- yes, no
Play
- Class
- yes, no
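The 36-combination count above can be checked with a short sketch. The attribute domains are encoded as a hypothetical Python dict (the attribute and value names come from the slides; the representation is an assumption):

```python
from itertools import product

# Attribute domains of the nominal weather dataset (values from the notes)
domains = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": ["yes", "no"],
}

# Every possible instance is one combination of attribute values
combinations = list(product(*domains.values()))
print(len(combinations))  # 3*3*2*2 = 36
```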
### Dataset
![](Pasted%20image%2020240919134249.png)
![](Pasted%20image%2020240919134304.png)
Rules ordered, higher = higher priority
### Weather Dataset - Mixed
![](Pasted%20image%2020240919134526.png)
![](Pasted%20image%2020240919134535.png)
## Contact Lenses Dataset
Describes conditions under which an optician might want to prescribe soft, hard or no contact lenses.
Grossly over-simplified.
Reference: Cendrowska, J. (1987). PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4), 349-370.
### Attributes
3\*2\*2\*2 = 24 possibilities
Dataset is exhaustive, which is unusual.
Age
- young, pre-presbyopic, presbyopic
Spectacle Prescription
- myope (short), hypermetrope (long)
Astigmatism
- yes, no
Tear Production Rate
- reduced, normal
Recommended Lenses
- class
- hard, soft, none
### Dataset
![](Pasted%20image%2020240919134848.png)
## Iris Dataset
Used in many statistical experiments
Contains numeric attributes of 3 different types of iris.
Created in 1936 by Sir Ronald Fisher
### Dataset
![](Pasted%20image%2020240920130950.png)
# Styles of Learning
- Classification Learning: Predicting a **nominal** class
- Numeric Prediction (Regression): Predicting a **numeric** quantity
- Clustering: Grouping similar examples into clusters
- Association Learning: Detecting associations between attributes
## Classification Learning
- Nominal
- Supervised
- Provided with actual value of the class
- Measure success on fresh data for which class labels are known (test data)
## Numeric Prediction (Regression)
- Numeric
- Supervised
- Test Data
![](Pasted%20image%2020240920131244.png)
Example uses a linear regression function to provide an estimated performance value based on attributes.
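A minimal sketch of fitting such a function for a single numeric attribute using least squares. The `(x, y)` pairs below are made-up illustrative data, not the dataset from the slide:

```python
# One-attribute linear regression via least squares, pure Python.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # attribute values (illustrative)
ys = [2.1, 3.9, 6.2, 8.0, 9.8]   # target values (illustrative)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(x):
    """Estimated target for a new attribute value."""
    return intercept + slope * x
```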
## Clustering
- Finding similar groups
- Unsupervised
- Class of example is unknown
- Success measured **subjectively**
## Association Learning
- Applied if no class specified, and any kind of structure is interesting
- Difference to Classification Learning:
- Predicts any attribute's value, not just class.
- More than one attribute's value at a time
- Far more association rules than classification rules.
## Classification Vs Association Rules
Classification Rule:
- Predicts value of a given attribute (class of example)
- ``If outlook = sunny and humidity = high, then play = no``
Association Rule:
- Predicts value of arbitrary attribute / combination
```
If temperature = cool, then humidity = normal
If humidity = normal and windy = false, then play = yes
If outlook = sunny and play = no, then humidity = high
If windy = false and play = no, then outlook = sunny and humidity = high
```
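Either kind of rule can be evaluated against data by counting how often its antecedent holds (coverage) and how often the full rule holds (accuracy). A sketch, using four invented weather examples rather than the real dataset:

```python
# Evaluate a rule "if antecedent then consequent" over a list of examples.
examples = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "no", "play": "no"},
    {"outlook": "rainy", "temperature": "cool", "humidity": "normal", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "temperature": "mild", "humidity": "high", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "windy": "yes", "play": "yes"},
]

def rule_stats(examples, antecedent, consequent):
    # Coverage: examples matching the antecedent; accuracy: also the consequent
    covered = [e for e in examples if all(e[a] == v for a, v in antecedent.items())]
    correct = [e for e in covered if all(e[a] == v for a, v in consequent.items())]
    return len(covered), len(correct)

# "If outlook = sunny and play = no, then humidity = high"
coverage, accuracy = rule_stats(examples,
                                {"outlook": "sunny", "play": "no"},
                                {"humidity": "high"})
print(coverage, accuracy)  # 2 2
```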
# Data Mining and Ethics
- Ethical Issues arise in practical applications
- Data mining often used to discriminate
- Ethical situation depends on application
- Attributes may contain problematic information
- Does ownership of data bestow right to use it in other ways than those purported when it was originally collected?
- Who is permitted to access the data?
- For what purpose was the data collected?
- What conclusions can sensibly be drawn?
- Caveats must be attached to results
- Purely statistical arguments never sufficient

# Attributes
- Each example described by fixed pre-defined set of features (attributes)
- Number of attributes may vary
- ex. Transportation Vehicles
- no. wheels not applicable to ships
- no. masts not applicable to cars
- Possible solution: "irrelevant value" flag
- Attributes may be dependent on other attributes
# Taxonomy of Data Types
![](Pasted%20image%2020240920132209.png)
# Nominal Attributes
- Distinct symbols
- Serve as labels or names
- No relation implied among nominal values
- Only equality tests can be performed
- ex. outlook = sunny
# Sources of Missing Values
- Malfunctioning / Misconfigured Equipment
- Changes in design
- Collation of different datasets
- Data not collected for mining
- Errors and omissions that don't affect the purpose of the data
- ex. Banks do not need to know age in banking datasets, DOB may contain missing values
- Missing value may have significance
- ex. a medical diagnosis may be inferable from which tests a doctor decided to run, not just from their outcomes
- Most DM algorithms assume this is not the case, hence "missing" may need to be coded as an additional nominal value.
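Coding a missing value as an extra nominal value can be sketched as below; the record fields are illustrative, not from a real banking dataset:

```python
# Recode missing values (None) as an explicit "missing" nominal value,
# so a learner can treat the absence of a value as potentially significant.
records = [
    {"age": "young", "dob": None},
    {"age": "presbyopic", "dob": "1960-05-01"},
]

def code_missing(record, flag="missing"):
    return {k: (flag if v is None else v) for k, v in record.items()}

coded = [code_missing(r) for r in records]
print(coded[0]["dob"])  # missing
```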
# Inaccurate Values
- Typographical errors in nominal attributes
- Typographical and measurement errors in numeric attributes
- Deliberate errors
- ex. Incorrect ZIP codes, unsanitised inputs
- Duplicate examples
# Weka and ARFF
## Weather Dataset in ARFF
![](Pasted%20image%2020240920132732.png)
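The ARFF layout shown in the slide can be sketched as a string: `@relation` names the dataset, one `@attribute` line per column lists its nominal values, and `@data` introduces comma-separated rows. The two data rows here are invented for illustration:

```python
# Weather dataset header in Weka's ARFF format (two illustrative rows).
arff = """\
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {yes, no}
@attribute play {yes, no}

@data
sunny,hot,high,no,no
overcast,cool,normal,yes,yes
"""
print(arff.splitlines()[0])  # @relation weather
```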
### Getting to Know the Data
- First task, get to know data
- Simple visualisations useful:
- Nominal: bar graph
- Numeric: histograms
- 2D and 3D plots show dependencies
- Need to consult experts
- Too much data? Take sample.
# Concept Descriptions
- Output of DM algorithm
- Many ways of representing:
- Decision Trees
- Rules
- Linear Regression Functions
## Decision Trees
- Divide-and-Conquer approach
- Trees drawn upside down
- Node at top is root
- Edges are branches
- Rectangles represent leaves
- Leaves assign classification
- Nodes involve testing attribute
### Decision Tree with Nominal Attributes
![](Pasted%20image%2020240920133218.png)
- Number of branches usually equal to number of values
- Attribute not tested more than once.
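A nominal tree of this shape can be sketched as nested tuples, assuming the slide shows the standard weather tree from Witten et al. (outlook at the root, humidity and windy below):

```python
# Inner nodes are (attribute, {value: subtree}); leaves are class labels.
tree = ("outlook", {
    "sunny":    ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy":    ("windy", {"yes": "no", "no": "yes"}),
})

def classify(node, example):
    # Descend from the root, following the branch for the tested attribute,
    # until a leaf (a plain class label) is reached.
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[example[attribute]]
    return node

print(classify(tree, {"outlook": "sunny", "humidity": "high"}))  # no
```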
### Decision Tree with Numeric Attributes
![](Pasted%20image%2020240920133316.png)
- Test whether value is greater or less than constant
- Attribute may be tested multiple times
### Decision Trees with Missing Values
- Not clear which branch should be taken when node tests attribute with missing value
- Does absence of a value have significance?
- Yes => Treat as separate value during training
- No => Treat in special way during testing
- Assign sample to most popular branch
# Classification Rules
- Popular alternative to decision tree
- Antecedent (pre-condition) - series of tests
- Tests usually logically ANDed together
- Consequent (conclusion) - usually a class
- Individual rules often logically ORed together
## If-Then Rules for Contact Lenses
![](Pasted%20image%2020240920133706.png)
# Nuggets
- Are rules independent pieces of knowledge ("nuggets")?
- Problem: this view ignores the process of executing the rules
- Ordered set (decision list)
- Order important for interpretation
- Unordered set
- Rules may overlap and lead to different conclusions for the same example
- Needs conflict resolution
## Executing Rules
- What if $\geq$ 2 rules conflict?
- Give no conclusion?
- Go with the rule that covers largest no. training samples?
- What if no rule applies to a test example?
- Give no conclusion?
- Go with class that is most frequent?
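Execution of an ordered rule set (decision list) can be sketched as below: the first matching rule fires, and a default class handles examples no rule covers. The two rules are illustrative:

```python
# Each rule is (antecedent dict, class); order encodes priority.
rules = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": "yes"}, "no"),
]

def decide(example, rules, default="yes"):
    for antecedent, cls in rules:
        # Fire the first rule whose tests all hold for this example
        if all(example.get(a) == v for a, v in antecedent.items()):
            return cls
    return default  # no rule applied: fall back to the default class

print(decide({"outlook": "overcast"}, rules))  # yes (default)
```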
## Special Case: Boolean Classes
- Assumption: if example does not belong to class "yes", belongs to "no"
- Solution: only learn rules for class "yes", use default rule for "no"
![](Pasted%20image%2020240920134203.png)
- Order is important, no conflicts.