Philosophy Dictionary of Arguments

Artificial neural networks: Artificial neural networks are computational models inspired by the structure of biological neural networks. They process information through interconnected nodes, or "neurons," to perform tasks like pattern recognition and machine learning. See also Machine Learning, Neural networks, Pattern recognition.
_____________
Annotation: The above characterizations of concepts are neither definitions nor exhaustive presentations of the problems related to them. Instead, they are intended to give a short introduction to the contributions below. – Lexicon of Arguments.

 

Peter Norvig on Artificial Neural Networks - Dictionary of Arguments

Norvig I 728
Artificial Neural Networks/Norvig/Russell: Neural networks are composed of nodes or units (…) connected by directed links. A link from unit i to unit j serves to propagate the activation a_i from i to j. Each link also has a numeric weight w_{i,j} associated with it, which determines the strength and sign of the connection. Just as in linear regression models, each unit has a dummy input a_0 = 1 with an associated weight w_{0,j}.
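((s) A minimal sketch of this computation in Python; the helper name unit_output and the use of NumPy are illustrative choices, not from the text:)

import numpy as np

def unit_output(weights, inputs, g):
    # Compute a unit's activation a_j = g(sum_i w_{i,j} * a_i),
    # where weights[0] is w_{0,j}, paired with the dummy input a_0 = 1.
    activations = np.concatenate(([1.0], inputs))
    return g(np.dot(weights, activations))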
Norvig I 729
Perceptrons: The activation function g is typically either a hard threshold (…), in which case the unit is called a perceptron, or a logistic function (…), in which case the term sigmoid perceptron is sometimes used. Both of these nonlinear activation functions ensure the important property that the entire network of units can represent a nonlinear function (…).
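((s) The two activation functions can be written directly; a standalone illustrative sketch:)

import math

def hard_threshold(z):
    # Perceptron activation: fires iff the weighted input sum is non-negative.
    return 1.0 if z >= 0 else 0.0

def logistic(z):
    # Sigmoid activation: a soft, differentiable threshold with values in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

Passing either function as g to the unit_output sketch above yields a perceptron or a sigmoid perceptron, respectively.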
Forms of a network: a) A feed-forward network has connections only in one direction—that is, it forms a directed acyclic graph. Every node receives input from “upstream” nodes and delivers output to “downstream” nodes; there are no loops. A feed-forward network represents a function of its current input; thus, it has no internal state other than the weights themselves.
b) A recurrent network, on the other hand, feeds its outputs back into its own inputs. This means that the activation levels of the network form a dynamical system that may reach a stable state or exhibit oscillations or even chaotic behavior.
Layers: a) Feed-forward networks are usually arranged in layers, such that each unit receives input only from units in the immediately preceding layer.
b) Multilayer networks have one or more layers of hidden units that are not connected to the outputs of the network.
Training/Learning: For example, if we want to train a network to add two input bits, each a 0 or a 1, we will need one output for the sum bit and one for the carry bit. Also, when the learning problem involves classification into more than two classes—for example, when learning to categorize images of handwritten digits—it is common to use one output unit for each class.
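((s) A minimal feed-forward sketch of the two-bit adder just described; the weights are hand-set for illustration rather than learned:)

import numpy as np

def hard_threshold(z):
    return (z >= 0).astype(float)

# Hidden layer: h1 = OR(x1, x2), h2 = AND(x1, x2).
# Each row holds [w_{0,j} (bias), w_{1,j}, w_{2,j}].
W1 = np.array([[-0.5, 1.0, 1.0],
               [-1.5, 1.0, 1.0]])
# Output layer: sum = h1 AND NOT h2 (i.e. XOR), carry = h2.
W2 = np.array([[-0.5, 1.0, -1.0],
               [-0.5, 0.0, 1.0]])

def forward(x):
    a = np.concatenate(([1.0], x))     # dummy input a_0 = 1
    h = hard_threshold(W1 @ a)
    h = np.concatenate(([1.0], h))
    return hard_threshold(W2 @ h)      # [sum bit, carry bit]

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, forward(np.array(x, dtype=float)))  # e.g. [1, 1] -> [0., 1.]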
Norvig I 731
Any desired functionality can be obtained by connecting large numbers of units into (possibly recurrent) networks of arbitrary depth. The problem was that nobody knew how to train such networks.
This turns out to be an easy problem if we think of a network the right way: as a function h_w(x) parameterized by the weights w.
Norvig I 732
(…) we have the output expressed as a function of the inputs and the weights. (…) because the function represented by a network can be highly nonlinear—composed, as it is, of nested nonlinear soft threshold functions—we can see neural networks as a tool for doing nonlinear regression.
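((s) A minimal backpropagation sketch of this view (illustrative NumPy, not the book's code): the network is treated as h_w(x) and its weights are adjusted by gradient descent on squared loss, here fitting XOR, a function no single-layer network can represent:)

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of 4 sigmoid units, one sigmoid output unit.
W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)
alpha = 0.5

for step in range(20000):
    h = sigmoid(X @ W1 + b1)       # hidden activations
    out = sigmoid(h @ W2 + b2)     # network output h_w(x)
    # Backpropagate the squared-loss gradient through the nested sigmoids:
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= alpha * h.T @ d_out; b2 -= alpha * d_out.sum(axis=0)
    W1 -= alpha * X.T @ d_h;   b1 -= alpha * d_h.sum(axis=0)

print(out.round(2))  # typically converges to ~[[0], [1], [1], [0]]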
Norvig I 736
Learning in neural networks: just as with >Bayesian networks, we also need to understand how to find the best network structure. If we choose a network that is too big, it will be able to memorize all the examples by forming a large lookup table, but will not necessarily generalize well to inputs that have not been seen before.
Norvig I 737
Optimal brain damage: The optimal brain damage algorithm begins with a fully connected network and removes connections from it. After the network is trained for the first time, an information-theoretic approach identifies an optimal selection of connections that can be dropped. The network is then retrained, and if its performance has not decreased, the process is repeated. In addition to removing connections, it is also possible to remove units that are not contributing much to the result.
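((s) A schematic of the prune-retrain loop; note that real optimal brain damage ranks connections by a second-derivative saliency estimate, whereas this sketch substitutes weight magnitude, and the train/evaluate callbacks are hypothetical stand-ins:)

import numpy as np

def obd_sketch(W, train, evaluate, frac=0.1):
    # train(W, mask) -> retrained weights; evaluate(W, mask) -> score.
    mask = np.ones(W.shape, dtype=bool)        # True = connection kept
    W = train(W, mask)                         # initial training
    baseline = evaluate(W, mask)
    while True:
        saliency = np.where(mask, np.abs(W), np.inf)
        k = max(1, int(frac * mask.sum()))     # drop the k least salient links
        drop = np.unravel_index(np.argsort(saliency, axis=None)[:k], W.shape)
        mask[drop] = False
        W = train(W * mask, mask)              # retrain the pruned network
        if evaluate(W, mask) < baseline:       # performance decreased: stop
            return W * mask, mask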
Parametric models: A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.
Nonparametric models: A nonparametric model is one that cannot be characterized by a bounded set of parameters. For example, suppose that each hypothesis we generate simply retains within itself all of the training examples and uses all of them to predict the next example. Such a hypothesis family would be nonparametric because the effective number of parameters is unbounded: it grows with the number of examples. This approach is called instance-based learning or memory-based learning. The simplest instance-based learning method is table lookup: take all the training examples, put them in a lookup table, and then when asked for h(x), see if x is in the table; (…).
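((s) Table lookup in a few lines, an illustrative sketch: the hypothesis simply memorizes the examples, so its effective parameter count grows with the data:)

def table_lookup_learner(examples):
    # Memorize all (x, y) pairs; the "parameters" are the examples themselves.
    table = {tuple(x): y for x, y in examples}
    def h(x):
        return table.get(tuple(x))  # None if x was never seen
    return h

h = table_lookup_learner([((0, 0), 0), ((0, 1), 1)])
print(h((0, 1)), h((1, 1)))         # 1 None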
Norvig I 738
We can improve on table lookup with a slight variation: given a query xq, find the k examples that are nearest to xq. This is called k-nearest neighbors lookup. ((s) Cf. >Local/global/Philosophical theories.)
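((s) An illustrative k-nearest-neighbors sketch using Euclidean distance and majority vote:)

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every example
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]                   # majority class among them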
Norvig I 744
Support vector machines/SVM: The support vector machine or SVM framework is currently the most popular approach for “off-the-shelf” supervised learning: if you don’t have any specialized prior knowledge about a domain, then the SVM is an excellent method to try first. Properties of SVMs (see the sketch after this list):
1. SVMs construct a maximum margin separator - a decision boundary with the largest possible distance to example points. This helps them generalize well.
2. SVMs create a linear separating hyperplane, but they have the ability to embed the data into a higher-dimensional space, using the so-called kernel trick.
3. SVMs are a nonparametric method - they retain training examples and potentially need to store them all. On the other hand, in practice they often end up retaining only a small fraction of the number of examples - sometimes as few as a small constant times the number of dimensions.
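((s) A minimal sketch of these three properties, assuming scikit-learn is installed; the RBF kernel supplies the implicit higher-dimensional embedding, and support_vectors_ shows how few examples are actually retained:)

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)  # circular boundary, not linearly separable

clf = SVC(kernel='rbf', C=1.0)   # kernel trick: maximum-margin separator in feature space
clf.fit(X, y)
print(len(clf.support_vectors_), "of", len(X), "examples retained as support vectors")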
Norvig I 745
Instead of minimizing expected empirical loss on the training data, SVMs attempt to minimize expected generalization loss. We don’t know where the as-yet-unseen points may fall, but under the probabilistic assumption that they are drawn from the same distribution as the previously seen examples, there are some arguments from computational learning theory (…) suggesting that we minimize generalization loss by choosing the separator that is farthest away from the examples we have seen so far.
Norvig I 748
Ensemble Learning: >Learning/AI Research.
Norvig I 757
Linear regression is a widely used model. The optimal parameters of a linear regression model can be found by gradient descent search, or computed exactly.
A linear classifier with a hard threshold—also known as a perceptron—can be trained by a simple weight update rule to fit data that are linearly separable. If the data are not linearly separable, the rule fails to converge.
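((s) An illustrative sketch of both points: the exact normal-equation solution for linear regression, and the perceptron update rule on linearly separable data:)

import numpy as np

rng = np.random.default_rng(0)

# Linear regression: solve (X^T X) w = X^T y exactly.
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # dummy feature x_0 = 1
y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=50)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # close to [2, -3]; gradient descent would reach the same optimum

# Perceptron rule: w_i <- w_i + alpha * (y - h_w(x)) * x_i.
Xc = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
yc = (Xc @ np.array([0.5, 1.0, -1.0]) > 0).astype(float)  # separable by construction
w, alpha = np.zeros(3), 0.1
for _ in range(100):                       # repeated passes; converges when separable
    for x_i, y_i in zip(Xc, yc):
        h = float(x_i @ w >= 0)
        w += alpha * (y_i - h) * x_i       # updates only on misclassified examples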
Norvig I 758
Logistic regression replaces the perceptron’s hard threshold with a soft threshold defined by a logistic function. Gradient descent works well even for noisy data that are not linearly separable.
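((s) A minimal sketch of the corresponding gradient-descent update; the extra h(1 - h) factor is the derivative of the logistic function under squared loss. Modern treatments usually minimize cross-entropy instead, which gives the simpler gradient (y - h) x:)

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.1, steps=5000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        h = logistic(X @ w)
        w += alpha * X.T @ ((y - h) * h * (1 - h))  # batch gradient step
    return w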
Norvig I 760
History: The term logistic function comes from Pierre-François Verhulst (1804–1849), a statistician who used the curve to model population growth with limited resources, a more realistic model than the unconstrained geometric growth proposed by Thomas Malthus. Verhulst called it the courbe logistique, because of its relation to the logarithmic curve. The term regression is due to Francis Galton, a nineteenth-century statistician, cousin of Charles Darwin, and initiator of the fields of meteorology, fingerprint analysis, and statistical correlation, who used it in the sense of regression to the mean. The term curse of dimensionality comes from Richard Bellman (1961)(1).
Logistic regression can be solved with gradient descent, or with the Newton-Raphson method (Newton, 1671(2); Raphson, 1690(3)). A variant of the Newton method called L-BFGS is sometimes used for large-dimensional problems; the L stands for “limited memory,” meaning that it avoids creating the full matrices all at once, and instead creates parts of them on the fly. BFGS are the authors’ initials (Byrd et al., 1995)(4).
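((s) An illustrative use of a limited-memory BFGS solver, assuming SciPy is available: logistic regression fit by minimizing the negative log-likelihood with scipy.optimize.minimize:)

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(w, X, y):
    z = X @ w
    return np.sum(np.logaddexp(0, z) - y * z)  # -sum_i log p(y_i | x_i, w)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = (X @ np.array([0.0, 2.0, -1.0]) + rng.normal(size=100) > 0).astype(float)

res = minimize(neg_log_likelihood, np.zeros(3), args=(X, y), method='L-BFGS-B')
print(res.x)  # fitted weight vector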
The ideas behind kernel machines come from Aizerman et al. (1964)(5) (who also introduced the kernel trick), but the full development of the theory is due to Vapnik and his colleagues (Boser et al., 1992)(6). SVMs were made practical with the introduction of the soft-margin classifier for handling noisy data in a paper that won the 2008 ACM Paris Kanellakis Theory and Practice Award (Cortes and Vapnik, 1995)(7), and of the Sequential Minimal Optimization (SMO) algorithm for efficiently solving SVM problems using quadratic programming (Platt, 1999)(8). SVMs have proven to be very popular and effective for tasks such as text categorization (Joachims, 2001)(9), computational genomics (Cristianini and Hahn, 2007)(10), and natural language processing, such as the handwritten digit recognition of DeCoste and Schölkopf (2002)(11).
As part of this process, many new kernels have been designed that work with strings, trees, and other non-numerical data types. A related technique that also uses the kernel trick to implicitly represent an exponential feature space is the voted perceptron (Freund and Schapire, 1999(12); Collins and Duffy, 2002(13)). Textbooks on SVMs include Cristianini and Shawe-Taylor (2000)(14) and Schölkopf and Smola (2002)(15). A friendlier exposition appears in the AI Magazine article by Cristianini and Schölkopf (2002)(16). Bengio and LeCun (2007)(17) show some of the limitations of SVMs and other local, nonparametric methods for learning functions that have a global structure but do not have local smoothness.
Ensemble learning is an increasingly popular technique for improving the performance of learning algorithms. Bagging (Breiman, 1996)(18), the first effective method, combines hypotheses learned from multiple bootstrap data sets, each generated by subsampling the original data set. The boosting method described in this chapter originated with theoretical work by Schapire (1990)(19).
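((s) A schematic of bagging; train(X, y) is a hypothetical base learner returning a 0/1 predictor:)

import numpy as np

def bagging_predict(train, X_train, y_train, x, n_models=25, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n = len(X_train)
    votes = 0
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)       # bootstrap: sample n indices with replacement
        h = train(X_train[idx], y_train[idx])  # hypothesis from the resampled set
        votes += h(x)
    return int(votes > n_models / 2)           # majority vote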
The ADABOOST algorithm was developed by Freund and Schapire (1996)(20)
Norvig I 761
and analyzed theoretically by Schapire (2003)(21). Friedman et al. (2000)(22) explain boosting from a statistician’s viewpoint. Online learning is covered in a survey by Blum (1996)(23) and a book by Cesa-Bianchi and Lugosi (2006)(24). Dredze et al. (2008)(25) introduce the idea of confidence-weighted online learning for classification: in addition to keeping a weight for each parameter, they also maintain a measure of confidence, so that a new example can have a large effect on features that were rarely seen before (and thus had low confidence) and a small effect on common features that have already been well estimated.


1. Bellman, R. E. (1961). Adaptive Control Processes: A Guided Tour. Princeton University Press.
2. Newton, I. (1664-1671). Methodus fluxionum et serierum infinitarum. Unpublished notes.
3. Raphson, J. (1690). Analysis aequationum universalis. Apud Abelem Swalle, London.
4. Byrd, R. H., Lu, P., Nocedal, J., and Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing, 16(5), 1190-1208.
5. Aizerman, M., Braverman, E., and Rozonoer, L. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, 821-837.
6. Boser, B., Guyon, I., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In COLT-92.
7. Cortes, C. and Vapnik, V. N. (1995). Support-vector networks. Machine Learning, 20, 273–297.
8. Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning, pp. 185-208. MIT Press.
9. Joachims, T. (2001). A statistical learning model of text classification with support vector machines. In SIGIR-01, pp. 128-136.
10. Cristianini, N. and Hahn, M. (2007). Introduction to Computational Genomics: A Case Studies Approach. Cambridge University Press.
11. DeCoste, D. and Schölkopf, B. (2002). Training invariant support vector machines. Machine Learning, 46(1), 161–190.
12. Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In ICML-96.
13. Collins, M. and Duffy, K. (2002). New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In ACL-02.
14. Cristianini, N. and Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press.
15. Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press.
16. Cristianini, N. and Schölkopf, B. (2002). Support vector machines and kernel methods: The new generation of learning machines. AIMag, 23(3), 31–41.
17. Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In Bottou, L., Chapelle, O., DeCoste, D., and Weston, J. (Eds.), Large-Scale Kernel Machines. MIT Press.
18. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
19. Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
20. Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In ICML-96.
21. Schapire, R. E. (2003). The boosting approach to machine learning: An overview. In Denison, D. D., Hansen, M. H., Holmes, C., Mallick, B., and Yu, B. (Eds.), Nonlinear Estimation and Classification. Springer.
22. Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2), 337–374.
23. Blum, A. L. (1996). On-line algorithms in machine learning. In Proc. Workshop on On-Line Algorithms, Dagstuhl, pp. 306–325.
24. Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.
25. Dredze, M., Crammer, K., and Pereira, F. (2008). Confidence-weighted linear classification. In ICML-08, pp. 264–271.

_____________
Explanation of symbols: Roman numerals indicate the source, Arabic numerals indicate the page number. The corresponding books are indicated on the right-hand side. ((s)…): Comment by the sender of the contribution. Translations: Dictionary of Arguments
The note [Concept/Author], [Author1]Vs[Author2] or [Author]Vs[term], or "problem:"/"solution:", "old:"/"new:" and "thesis:", is an addition from the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.

Norvig I
Peter Norvig
Stuart J. Russell
Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ 2010

