|Decission tree: A decision tree is a machine learning model that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements._____________Annotation: The above characterizations of concepts are neither definitions nor exhausting presentations of problems related to them. Instead, they are intended to give a short introduction to the contributions below. – Lexicon of Arguments. |
Peter Norvig on Decision Tree - Dictionary of Arguments
Norvig I 698
Def Decision tree/Norvig/Russell: A decision tree represents a function DECISION TREE that takes as input a vector of attribute values and returns a “decision”—a single output value. The input and output values can be discrete or continuous. A decision tree reaches its decision by performing a sequence of tests. Each internal node in the tree corresponds to a test of the value of one of the input attributes, Ai, and the branches from the node are labeled with the possible values of the attribute, Ai =vik. Each leaf node in the tree specifies a value to be returned by the function.
A Boolean decision tree is logically equivalent to the assertion that the goal attribute is true
if and only if the input attributes satisfy one of the paths leading to a leaf with value true.
Writing this out in propositional logic, we have
Goal ⇔ (Path1 ∨ Path2 ∨ · · ·) ,
where each Path is a conjunction of attribute-value tests required to follow that path. Thus, the whole expression is equivalent to disjunctive normal form. ((s) >Normal form/logic.)
Unfortunately, no matter how we measure size, it is an intractable problem to find the smallest
consistent tree; there is no way to efficiently search through the 22n trees. With some simple heuristics, however, we can find a good approximate solution: a small (but not smallest) consistent tree. The decision tree learning algorithm adopts a greedy divide-and-conquer strategy: always test the most important attribute first. This test divides the problem up into smaller subproblems that can then be solved recursively.
“Most important attribute”: the one that makes the most difference to the classification of an example.
Decision tree learning algorithm: see Norvig I 702.
Norvig I 705
Problems: the decision tree learning algorithm will generate a large tree when there is actually no pattern to be found.
Overfitting: algorithm will seize on any pattern it can find in the input. If it turns out that there are 2 rolls of a 7-gram blue die with fingers crossed and they both come out 6, then the algorithm may construct a path that predicts 6 in that case.
Solution: decision tree pruning combats overfitting. Pruning works by eliminating nodes that are not clearly relevant.
Norvig I 706
Missing data: In many domains, not all the attribute values will be known for every example.
Norvig I 707
Multivalued attributes: When an attribute has many possible values, the information gain measure gives an inappropriate indication of the attribute’s usefulness. In the extreme case, an attribute such as exact time has a different value for every example, which means each subset of examples is a singleton with a unique classification, and the information gain measure would have its highest value for this attribute.
Continuous and integer-valued input attributes: Continuous or integer-valued attributes such as height and weight, have an infinite set of possible values. Rather than generate infinitely many branches, decision-tree learning algorithms typically find the split point that gives the highest information gain.
Continuous-valued output attributes: If we are trying to predict a numerical output value, such as the price of an apartment, then we need a regression tree rather than a classification tree. A regression tree has at each leaf a linear function of some subset of numerical attributes, rather than a single value.
Norvig I 758
History: The first notable use of decision trees was in EPAM, the “Elementary Perceiver And Memorizer” (Feigenbaum, 1961)(1), which was a simulation of human concept learning. ID3 (Quinlan, 1979)(2) added the crucial idea of choosing the attribute with maximum entropy; it is the basis for the decision tree algorithm in this chapter. Information theory was developed by Claude Shannon to aid in the study of communication (Shannon and Weaver, 1949)(3). (Shannon also contributed one of the earliest examples of machine learning, a mechanical mouse named Theseus that learned to navigate through a maze by trial and error.) The χ2 method of tree pruning was described by Quinlan (1986)(4). C4.5, an industrial-strength decision tree package, can be found in Quinlan (1993)(5). An independent tradition of decision tree learning exists in the statistical literature. Classification and Regression Trees (Breiman et al., 1984)(6), known as the “CART book,” is the principal reference.
1. Feigenbaum, E. A. (1961). The simulation of verbal learning behavior. Proc. Western Joint Computer
Conference, 19, 121-131.
2. Quinlan, J. R. (1979). Discovering rules from large collections of examples: A case study. In Michie,
D. (Ed.), Expert Systems in the Microelectronic Age. Edinburgh University Press.
3. Shannon, C. E. and Weaver, W. (1949). The Mathematical Theory of Communication. University of
4. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.
5. Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
6. Breiman, L., Friedman, J., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees.
Wadsworth International Group._____________Explanation of symbols: Roman numerals indicate the source, arabic numerals indicate the page number. The corresponding books are indicated on the right hand side. ((s)…): Comment by the sender of the contribution. Translations: Dictionary of Arguments The note [Concept/Author], [Author1]Vs[Author2] or [Author]Vs[term] resp. "problem:"/"solution:", "old:"/"new:" and "thesis:" is an addition from the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.
Stuart J. Russell
Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010