vault backup: 2024-10-16 09:12:37

# Assessment
## T1
- Exam (50%)
## T2
- Coursework (50%)
# Resources
Data Mining: Practical Machine Learning Tools and Techniques (Witten, Frank, Hall & Pal) 4th Edition 2016
Scientific Calculator
# Data Vs Information
- Too much data
- Valuable resource
- Raw data less important, need to develop techniques to extract information
- Data: recorded facts
- Information: patterns underlying data
# Philosophy
## Cow Culling
- Each cow described by ~700 attributes
- Problem: selecting which cows to cull
- Data: historical records and farmers' decisions
- Machine learning used to ascertain which factors farmers take into account, rather than to automate the decision-making process.
# Definition of Data Mining
- The extraction of:
- Implicit,
- Previously unknown,
- Potentially useful information from the data
- Programs that detect patterns and regularities are needed
- Strong patterns => good predictions
- Issues:
- Most patterns not interesting
- Patterns may be inexact
- Data may be garbled or missing
# Machine Learning Techniques
- Algorithms for acquiring structural descriptions from examples
- Structural descriptions represent patterns explicitly.
- Predict outcome in new situation
- Understand and explain how prediction derived.
- Methods originate from AI, statistics and research on databases.
# Can Machines Learn?
- By definition, arguably: the ability to obtain knowledge by study, experience or being taught is very difficult to measure.
- Does learning imply intention?
# Terminology
- Concept - Thing to be learned
- Example / Instance - An individual, independent example of a concept
- Attributes / Features - Measuring aspects of an example / instance
- Concept description (pattern, model, hypothesis) - Output for data mining algorithms.
# Famous Small Datasets
- Will be used in module
- Unrealistically simple
## Weather Dataset - Nominal
Concept: conditions which are suitable for a game.
Reference: Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.
### Attributes
3\*3\*2\*2 = 36 possible combinations of values.
Outlook
- sunny, overcast, rainy
Temperature
- hot, mild, cool
Humidity
- high, normal
Windy
- yes, no
Play
- Class
- yes, no
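The 36-combination count above can be checked with a short sketch. The attribute domains are encoded as a hypothetical Python dict (the attribute and value names come from the slides; the representation is an assumption):

```python
from itertools import product

# Attribute domains of the nominal weather dataset (values from the notes)
domains = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": ["yes", "no"],
}

# Every possible instance is one combination of attribute values
combinations = list(product(*domains.values()))
print(len(combinations))  # 3*3*2*2 = 36
```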
### Dataset
![](Pasted%20image%2020240919134249.png)
![](Pasted%20image%2020240919134304.png)
Rules ordered, higher = higher priority
### Weather Dataset - Mixed
![](Pasted%20image%2020240919134526.png)
![](Pasted%20image%2020240919134535.png)
## Contact Lenses Dataset
Describes conditions under which an optician might want to prescribe soft, hard or no contact lenses.
Grossly over-simplified.
Reference: Cendrowska, J. (1987). PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4), 349-370.
### Attributes
3\*2\*2\*2 = 24 possibilities
Dataset is exhaustive, which is unusual.
Age
- young, pre-presbyopic, presbyopic
Spectacle Prescription
- myope (short), hypermetrope (long)
Astigmatism
- yes, no
Tear Production Rate
- reduced, normal
Recommended Lenses
- class
- hard, soft, none
### Dataset
![](Pasted%20image%2020240919134848.png)
## Iris Dataset
Used in many statistical experiments
Contains numeric attributes of 3 different types of iris.
Created in 1936 by Sir Ronald Fisher
### Dataset
![](Pasted%20image%2020240920130950.png)
# Styles of Learning
- Classification Learning: Predicting a **nominal** class
- Numeric Prediction (Regression): Predicting a **numeric** quantity
- Clustering: Grouping similar examples into clusters
- Association Learning: Detecting associations between attributes
## Classification Learning
- Nominal
- Supervised
- Provided with actual value of the class
- Measure success on fresh data for which class labels are known (test data)
## Numeric Prediction (Regression)
- Numeric
- Supervised
- Test Data
![](Pasted%20image%2020240920131244.png)
Example uses a linear regression function to provide an estimated performance value based on attributes.
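A minimal sketch of fitting such a function for a single numeric attribute using least squares. The `(x, y)` pairs below are made-up illustrative data, not the dataset from the slide:

```python
# One-attribute linear regression via least squares, pure Python.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # attribute values (illustrative)
ys = [2.1, 3.9, 6.2, 8.0, 9.8]   # target values (illustrative)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(x):
    """Estimated target for a new attribute value."""
    return intercept + slope * x
```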
## Clustering
- Finding similar groups
- Unsupervised
- Class of example is unknown
- Success measured **subjectively**
## Association Learning
- Applied if no class specified, and any kind of structure is interesting
- Difference to Classification Learning:
- Predicts any attribute's value, not just class.
- More than one attribute's value at a time
- Far more association rules than classification rules.
## Classification Vs Association Rules
Classification Rule:
- Predicts value of a given attribute (class of example)
- ``If outlook = sunny and humidity = high, then play = no``
Association Rule:
- Predicts value of arbitrary attribute / combination
```
If temperature = cool, then humidity = normal
If humidity = normal and windy = false, then play = yes
If outlook = sunny and play = no, then humidity = high
If windy = false and play = no, then outlook = sunny and humidity = high
```
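Either kind of rule can be evaluated against data by counting how often its antecedent holds (coverage) and how often the full rule holds (accuracy). A sketch, using four invented weather examples rather than the real dataset:

```python
# Evaluate a rule "if antecedent then consequent" over a list of examples.
examples = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "no", "play": "no"},
    {"outlook": "rainy", "temperature": "cool", "humidity": "normal", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "temperature": "mild", "humidity": "high", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "windy": "yes", "play": "yes"},
]

def rule_stats(examples, antecedent, consequent):
    # Coverage: examples matching the antecedent; accuracy: also the consequent
    covered = [e for e in examples if all(e[a] == v for a, v in antecedent.items())]
    correct = [e for e in covered if all(e[a] == v for a, v in consequent.items())]
    return len(covered), len(correct)

# "If outlook = sunny and play = no, then humidity = high"
coverage, accuracy = rule_stats(examples,
                                {"outlook": "sunny", "play": "no"},
                                {"humidity": "high"})
print(coverage, accuracy)  # 2 2
```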
# Data Mining and Ethics
- Ethical Issues arise in practical applications
- Data mining often used to discriminate
- Ethical situation depends on application
- Attributes may contain problematic information
- Does ownership of data bestow right to use it in other ways than those purported when it was originally collected?
- Who is permitted to access the data?
- For what purpose was the data collected?
- What conclusions can sensibly be drawn?
- Caveats must be attached to results
- Purely statistical arguments never sufficient

# Attributes
- Each example described by fixed pre-defined set of features (attributes)
- Number of attributes may vary
- ex. Transportation Vehicles
- no. wheels not applicable to ships
- no. masts not applicable to cars
- Possible solution: "irrelevant value" flag
- Attributes may be dependent on other attributes
# Taxonomy of Data Types
![](Pasted%20image%2020240920132209.png)
# Nominal Attributes
- Distinct symbols
- Serve as labels or names
- No relation implied among nominal values
- Only equality tests can be performed
- ex. outlook = sunny
# Sources of Missing Values
- Malfunctioning / Misconfigured Equipment
- Changes in design
- Collation of different datasets
- Data not collected for mining
- Errors and omissions that don't affect the purpose of the data
- ex. Banks do not need to know age in banking datasets, DOB may contain missing values
- Missing value may have significance
- ex. a medical diagnosis may be inferable from which tests a doctor decided to run, not just from their outcomes
- Most DM algorithms assume this is not the case, hence "missing" may need to be coded as an additional nominal value.
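Coding a missing value as an extra nominal value can be sketched as below; the record fields are illustrative, not from a real banking dataset:

```python
# Recode missing values (None) as an explicit "missing" nominal value,
# so a learner can treat the absence of a value as potentially significant.
records = [
    {"age": "young", "dob": None},
    {"age": "presbyopic", "dob": "1960-05-01"},
]

def code_missing(record, flag="missing"):
    return {k: (flag if v is None else v) for k, v in record.items()}

coded = [code_missing(r) for r in records]
print(coded[0]["dob"])  # missing
```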
# Inaccurate Values
- Typographical errors in nominal attributes
- Typographical and measurement errors in numeric attributes
- Deliberate errors
- ex. Incorrect ZIP codes, unsanitised inputs
- Duplicate examples
# Weka and ARFF
## Weather Dataset in ARFF
![](Pasted%20image%2020240920132732.png)
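The ARFF layout shown in the slide can be sketched as a string: `@relation` names the dataset, one `@attribute` line per column lists its nominal values, and `@data` introduces comma-separated rows. The two data rows here are invented for illustration:

```python
# Weather dataset header in Weka's ARFF format (two illustrative rows).
arff = """\
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {yes, no}
@attribute play {yes, no}

@data
sunny,hot,high,no,no
overcast,cool,normal,yes,yes
"""
print(arff.splitlines()[0])  # @relation weather
```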
### Getting to Know the Data
- First task, get to know data
- Simple visualisations useful:
- Nominal: bar graph
- Numeric: histograms
- 2D and 3D plots show dependencies
- Need to consult experts
- Too much data? Take sample.
# Concept Descriptions
- Output of DM algorithm
- Many ways of representing:
- Decision Trees
- Rules
- Linear Regression Functions
## Decision Trees
- Divide-and-Conquer approach
- Trees drawn upside down
- Node at top is root
- Edges are branches
- Rectangles represent leaves
- Leaves assign classification
- Nodes involve testing attribute
### Decision Tree with Nominal Attributes
![](Pasted%20image%2020240920133218.png)
- Number of branches usually equal to number of values
- Attribute not tested more than once.
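A nominal tree of this shape can be sketched as nested tuples, assuming the slide shows the standard weather tree from Witten et al. (outlook at the root, humidity and windy below):

```python
# Inner nodes are (attribute, {value: subtree}); leaves are class labels.
tree = ("outlook", {
    "sunny":    ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy":    ("windy", {"yes": "no", "no": "yes"}),
})

def classify(node, example):
    # Descend from the root, following the branch for the tested attribute,
    # until a leaf (a plain class label) is reached.
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[example[attribute]]
    return node

print(classify(tree, {"outlook": "sunny", "humidity": "high"}))  # no
```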
### Decision Tree with Numeric Attributes
![](Pasted%20image%2020240920133316.png)
- Test whether value is greater or less than constant
- Attribute may be tested multiple times
### Decision Trees with Missing Values
- Not clear which branch should be taken when node tests attribute with missing value
- Does absence of a value have significance?
- Yes => Treat as separate value during training
- No => Treat in special way during testing
- Assign sample to most popular branch
# Classification Rules
- Popular alternative to decision tree
- Antecedent (pre-condition) - series of tests
- Tests usually logically ANDed together
- Consequent (conclusion) - usually a class
- Individual rules often logically ORed together
## If-Then Rules for Contact Lenses
![](Pasted%20image%2020240920133706.png)
# Nuggets
- Are rules independent pieces of knowledge ("nuggets")?
- Problem: this view ignores the process of executing the rules
- Ordered set (decision list)
- Order important for interpretation
- Unordered set
- Rules may overlap and lead to different conclusions for the same example
- Needs conflict resolution
## Executing Rules
- What if $\geq$ 2 rules conflict?
- Give no conclusion?
- Go with the rule that covers largest no. training samples?
- What if no rule applies to a test example?
- Give no conclusion?
- Go with class that is most frequent?
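Execution of an ordered rule set (decision list) can be sketched as below: the first matching rule fires, and a default class handles examples no rule covers. The two rules are illustrative:

```python
# Each rule is (antecedent dict, class); order encodes priority.
rules = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": "yes"}, "no"),
]

def decide(example, rules, default="yes"):
    for antecedent, cls in rules:
        # Fire the first rule whose tests all hold for this example
        if all(example.get(a) == v for a, v in antecedent.items()):
            return cls
    return default  # no rule applied: fall back to the default class

print(decide({"outlook": "overcast"}, rules))  # yes (default)
```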
## Special Case: Boolean Classes
- Assumption: if example does not belong to class "yes", belongs to "no"
- Solution: only learn rules for class "yes", use default rule for "no"
![](Pasted%20image%2020240920134203.png)
- Order is important, no conflicts.