| Disputed term/author/ism | Author | Entry | Reference |
|---|---|---|---|
| Artificial Neural Networks | Norvig | Norvig I 728 Artificial Neural Networks/Norvig/Russell: Neural networks are composed of nodes or units (…) connected by directed links. A link from unit i to unit j serves to propagate the activation ai from i to j. Each link also has a numeric weight wi,j associated with it, which determines the strength and sign of the connection. Just as in linear regression models, each unit has a dummy input a0 = 1 with an associated weight w0,j. Norvig I 729 Perceptrons: The activation function g is typically either a hard threshold (…), in which case the unit is called a perceptron, or a logistic function (…), in which case the term sigmoid perceptron is sometimes used. Both of these nonlinear activation functions ensure the important property that the entire network of units can represent a nonlinear function (…). Forms of a network: a) A feed-forward network has connections only in one direction—that is, it forms a directed acyclic graph. Every node receives input from “upstream” nodes and delivers output to “downstream” nodes; there are no loops. A feed-forward network represents a function of its current input; thus, it has no internal state other than the weights themselves. b) A recurrent network, on the other hand, feeds its outputs back into its own inputs. This means that the activation levels of the network form a dynamical system that may reach a stable state or exhibit oscillations or even chaotic behavior. Layers: a) Feed-forward networks are usually arranged in layers, such that each unit receives input only from units in the immediately preceding layer. b) Multilayer networks have one or more layers of hidden units that are not connected to the outputs of the network. Training/Learning: For example, if we want to train a network to add two input bits, each a 0 or a 1, we will need one output for the sum bit and one for the carry bit. 
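The unit computation described above can be sketched in a few lines of Python. This is a minimal illustration, not from the text; the function names and the AND-gate weights are assumptions chosen for the example:

```python
import math

def unit_output(inputs, weights, g):
    # a_j = g(sum_i w_i,j * a_i), with the dummy input a0 = 1 paired with weights[0]
    s = sum(w * a for w, a in zip(weights, [1.0] + list(inputs)))
    return g(s)

hard_threshold = lambda s: 1.0 if s >= 0 else 0.0   # perceptron
logistic = lambda s: 1.0 / (1.0 + math.exp(-s))     # sigmoid perceptron

# With weights w0 = -1.5, w1 = 1, w2 = 1, a threshold unit acts as a logical AND:
print(unit_output([1, 1], [-1.5, 1.0, 1.0], hard_threshold))  # -> 1.0
print(unit_output([0, 1], [-1.5, 1.0, 1.0], hard_threshold))  # -> 0.0
```

Replacing `hard_threshold` with `logistic` gives the soft-threshold variant of the same unit.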
Also, when the learning problem involves classification into more than two classes—for example, when learning to categorize images of handwritten digits—it is common to use one output unit for each class. Norvig I 731 Any desired functionality can be obtained by connecting large numbers of units into (possibly recurrent) networks of arbitrary depth. The problem was that nobody knew how to train such networks. This turns out to be an easy problem if we think of a network the right way: as a function hw(x) parameterized by the weights w. Norvig I 732 (…) we have the output expressed as a function of the inputs and the weights. (…) because the function represented by a network can be highly nonlinear—composed, as it is, of nested nonlinear soft threshold functions—we can see neural networks as a tool for doing nonlinear regression. Norvig I 736 Learning in neural networks: just as with >Bayesian networks, we also need to understand how to find the best network structure. If we choose a network that is too big, it will be able to memorize all the examples by forming a large lookup table, but will not necessarily generalize well to inputs that have not been seen before. Norvig I 737 Optimal brain damage: The optimal brain damage algorithm begins with a fully connected network and removes connections from it. After the network is trained for the first time, an information-theoretic approach identifies an optimal selection of connections that can be dropped. The network is then retrained, and if its performance has not decreased then the process is repeated. In addition to removing connections, it is also possible to remove units that are not contributing much to the result. Parametric models: A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs. 
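The two-bit addition example above can be realized as a small multilayer feed-forward network of hard-threshold units. The weights below are hand-set to make the point that the network is just a function hw(x) composed of unit computations; a real network would learn them:

```python
def step(s):  # hard-threshold activation
    return 1 if s >= 0 else 0

def unit(inputs, weights):  # weights[0] pairs with the dummy input a0 = 1
    return step(sum(w * a for w, a in zip(weights, [1] + list(inputs))))

def add_bits(x1, x2):
    # hidden layer
    h_or = unit([x1, x2], [-0.5, 1, 1])    # OR of the inputs
    h_and = unit([x1, x2], [-1.5, 1, 1])   # AND of the inputs
    # output layer: sum bit = OR and not AND (i.e. XOR); carry bit = AND
    s = unit([h_or, h_and], [-0.5, 1, -1])
    return s, h_and

for a in (0, 1):
    for b in (0, 1):
        print(a, b, add_bits(a, b))  # sum = a XOR b, carry = a AND b
```

The sum bit needs the hidden layer: XOR is not linearly separable, so no single-layer perceptron can compute it.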
Nonparametric models: A nonparametric model is one that cannot be characterized by a bounded set of parameters. For example, suppose that each hypothesis we generate simply retains within itself all of the training examples and uses all of them to predict the next example. Such a hypothesis family would be nonparametric because the effective number of parameters is unbounded: it grows with the number of examples. This approach is called instance-based learning or memory-based learning. The simplest instance-based learning method is table lookup: take all the training examples, put them in a lookup table, and then when asked for h(x), see if x is in the table; (…). Norvig I 738 We can improve on table lookup with a slight variation: given a query xq, find the k examples that are nearest to xq. This is called k-nearest neighbors lookup. ((s) Cf. >Local/global/Philosophical theories.) Norvig I 744 Support vector machines/SVM: The support vector machine or SVM framework is currently the most popular approach for “off-the-shelf” supervised learning: if you don’t have any specialized prior knowledge about a domain, then the SVM is an excellent method to try first. Properties of SVMs: 1. SVMs construct a maximum margin separator - a decision boundary with the largest possible distance to example points. This helps them generalize well. 2. SVMs create a linear separating hyperplane, but they have the ability to embed the data into a higher-dimensional space, using the so-called kernel trick. 3. SVMs are a nonparametric method - they retain training examples and potentially need to store them all. On the other hand, in practice they often end up retaining only a small fraction of the number of examples - sometimes as few as a small constant times the number of dimensions. Norvig I 745 Instead of minimizing expected empirical loss on the training data, SVMs attempt to minimize expected generalization loss. 
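The k-nearest-neighbors lookup just described fits in a few lines. This is a toy sketch; the data set and the two-class labels are invented for illustration:

```python
from collections import Counter

def knn_classify(xq, examples, k=3):
    """k-nearest-neighbors lookup: vote among the k examples closest to the query xq."""
    def dist2(x, y):  # squared Euclidean distance
        return sum((a - b) ** 2 for a, b in zip(x, y))
    nearest = sorted(examples, key=lambda ex: dist2(ex[0], xq))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

examples = [((0, 0), 'A'), ((0, 1), 'A'), ((1, 0), 'A'),
            ((5, 5), 'B'), ((5, 6), 'B'), ((6, 5), 'B')]
print(knn_classify((1, 1), examples))  # -> 'A'
print(knn_classify((5, 4), examples))  # -> 'B'
```

Note the nonparametric character: the "model" is simply the stored examples, and its size grows with the training set.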
We don’t know where the as-yet-unseen points may fall, but under the probabilistic assumption that they are drawn from the same distribution as the previously seen examples, there are some arguments from computational learning theory (…) suggesting that we minimize generalization loss by choosing the separator that is farthest away from the examples we have seen so far. Norvig I 748 Ensemble Learning: >Learning/AI Research. Norvig I 757 Linear regression is a widely used model. The optimal parameters of a linear regression model can be found by gradient descent search, or computed exactly. A linear classifier with a hard threshold—also known as a perceptron—can be trained by a simple weight update rule to fit data that are linearly separable. In other cases, the rule fails to converge. Norvig I 758 Logistic regression replaces the perceptron’s hard threshold with a soft threshold defined by a logistic function. Gradient descent works well even for noisy data that are not linearly separable. Norvig I 760 History: The term logistic function comes from Pierre-François Verhulst (1804–1849), a statistician who used the curve to model population growth with limited resources, a more realistic model than the unconstrained geometric growth proposed by Thomas Malthus. Verhulst called it the courbe logistique, because of its relation to the logarithmic curve. The term regression is due to Francis Galton, a nineteenth-century statistician, cousin of Charles Darwin, and initiator of the fields of meteorology, fingerprint analysis, and statistical correlation, who used it in the sense of regression to the mean. The term curse of dimensionality comes from Richard Bellman (1961)(1). Logistic regression can be solved with gradient descent, or with the Newton-Raphson method (Newton, 1671(2); Raphson, 1690(3)). 
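The "simple weight update rule" for the perceptron mentioned above can be sketched as follows. The toy data set, learning rate, and epoch count are assumptions; the data are chosen to be linearly separable so that the convergence theorem applies:

```python
def train_perceptron(data, lr=0.1, epochs=500):
    """Perceptron rule: w <- w + lr * (y - h(x)) * x, with dummy input x0 = 1."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for (x1, x2), y in data:
            x = [1.0, x1, x2]
            h = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            w = [wi + lr * (y - h) * xi for wi, xi in zip(w, x)]
    return w

# linearly separable toy data: class 1 iff x1 + x2 > 1
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1), ((2, 1), 1), ((1, 2), 1)]
w = train_perceptron(data)
predict = lambda x1, x2: 1 if w[0] + w[1] * x1 + w[2] * x2 >= 0 else 0
print([predict(x1, x2) for (x1, x2), _ in data])  # -> [0, 0, 0, 1, 1, 1]
```

On data that are not linearly separable, this same loop would keep updating forever, which is the failure to converge noted in the text.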
A variant of the Newton method called L-BFGS is sometimes used for large-dimensional problems; the L stands for “limited memory,” meaning that it avoids creating the full matrices all at once, and instead creates parts of them on the fly. BFGS are authors’ initials (Byrd et al., 1995)(4). The ideas behind kernel machines come from Aizerman et al. (1964)(5) (who also introduced the kernel trick), but the full development of the theory is due to Vapnik and his colleagues (Boser et al., 1992)(6). SVMs were made practical with the introduction of the soft-margin classifier for handling noisy data in a paper that won the 2008 ACM Theory and Practice Award (Cortes and Vapnik, 1995)(7), and of the Sequential Minimal Optimization (SMO) algorithm for efficiently solving SVM problems using quadratic programming (Platt, 1999)(8). SVMs have proven to be very popular and effective for tasks such as text categorization (Joachims, 2001)(9), computational genomics (Cristianini and Hahn, 2007)(10), and natural language processing, such as the handwritten digit recognition of DeCoste and Schölkopf (2002)(11). As part of this process, many new kernels have been designed that work with strings, trees, and other non-numerical data types. A related technique that also uses the kernel trick to implicitly represent an exponential feature space is the voted perceptron (Freund and Schapire, 1999(12); Collins and Duffy, 2002(13)). Textbooks on SVMs include Cristianini and Shawe-Taylor (2000)(14) and Schölkopf and Smola (2002)(15). A friendlier exposition appears in the AI Magazine article by Cristianini and Schölkopf (2002)(16). Bengio and LeCun (2007)(17) show some of the limitations of SVMs and other local, nonparametric methods for learning functions that have a global structure but do not have local smoothness. Ensemble learning is an increasingly popular technique for improving the performance of learning algorithms. 
Bagging (Breiman, 1996)(18), the first effective method, combines hypotheses learned from multiple bootstrap data sets, each generated by subsampling the original data set. The boosting method described in this chapter originated with theoretical work by Schapire (1990)(19). The ADABOOST algorithm was developed by Freund and Schapire Norvig I 761 (1996)(20) and analyzed theoretically by Schapire (2003)(21). Friedman et al. (2000)(22) explain boosting from a statistician’s viewpoint. Online learning is covered in a survey by Blum (1996)(23) and a book by Cesa-Bianchi and Lugosi (2006)(24). Dredze et al. (2008)(25) introduce the idea of confidence-weighted online learning for classification: in addition to keeping a weight for each parameter, they also maintain a measure of confidence, so that a new example can have a large effect on features that were rarely seen before (and thus had low confidence) and a small effect on common features that have already been well-estimated. 1. Bellman, R. E. (1961). Adaptive Control Processes: A Guided Tour. Princeton University Press. 2. Newton, I. (1664-1671). Methodus fluxionum et serierum infinitarum. Unpublished notes. 3. Raphson, J. (1690). Analysis aequationum universalis. Apud Abelem Swalle, London. 4. Byrd, R. H., Lu, P., Nocedal, J., and Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing, 16(5), 1190-1208. 5. Aizerman, M., Braverman, E., and Rozonoer, L. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, 821-837. 6. Boser, B., Guyon, I., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In COLT-92. 7. Cortes, C. and Vapnik, V. N. (1995). Support vector networks. Machine Learning, 20, 273-297. 8. Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. 
In Advances in Kernel Methods: Support Vector Learning, pp. 185-208. MIT Press. 9. Joachims, T. (2001). A statistical learning model of text classification with support vector machines. In SIGIR-01, pp. 128-136. 10. Cristianini, N. and Hahn, M. (2007). Introduction to Computational Genomics: A Case Studies Approach. Cambridge University Press. 11. DeCoste, D. and Schölkopf, B. (2002). Training invariant support vector machines. Machine Learning, 46(1), 161–190. 12. Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In ICML-96. 13. Collins, M. and Duffy, K. (2002). New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In ACL-02. 14. Cristianini, N. and Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press. 15. Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press. 16. Cristianini, N. and Schölkopf, B. (2002). Support vector machines and kernel methods: The new generation of learning machines. AIMag, 23(3), 31–41. 17. Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In Bottou, L., Chapelle, O., DeCoste, D., and Weston, J. (Eds.), Large-Scale Kernel Machines. MIT Press. 18. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140. 19. Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227. 20. Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In ICML-96. 21. Schapire, R. E. (2003). The boosting approach to machine learning: An overview. In Denison, D. D., Hansen, M. H., Holmes, C., Mallick, B., and Yu, B. (Eds.), Nonlinear Estimation and Classification. Springer. 22. Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2), 337–374. 23. Blum, A. L. (1996). On-line algorithms in machine learning. 
In Proc. Workshop on On-Line Algorithms, Dagstuhl, pp. 306–325. 24. Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press. 25. Dredze, M., Crammer, K., and Pereira, F. (2008). Confidence-weighted linear classification. In ICML-08, pp. 264–271. |
Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
| Bayesianism | Norvig | Norvig I 503 Bayesianism/Norvig/Russell: Bayes’ rule allows unknown probabilities to be computed from known conditional probabilities, usually in the causal direction. Applying Bayes’ rule with many pieces of evidence runs into the same scaling problems as does the full joint distribution. Conditional independence brought about by direct causal relationships in the domain might allow the full joint distribution to be factored into smaller, conditional distributions. The naive Bayes model assumes the conditional independence of all effect variables, given a single cause variable, and grows linearly with the number of effects. Norvig I 505 Bayesian probabilistic reasoning has been used in AI since the 1960s, especially in medical diagnosis. It was used not only to make a diagnosis from available evidence, but also to select further questions and tests by using the theory of information value (…) when available evidence was inconclusive (Gorry, 1968(1); Gorry et al., 1973(2)). One system outperformed human experts in the diagnosis of acute abdominal illnesses (de Dombal et al., 1974)(3). Lucas et al. (2004)(4) gives an overview. These early Bayesian systems suffered from a number of problems, however. Because they lacked any theoretical model of the conditions they were diagnosing, they were vulnerable to unrepresentative data occurring in situations for which only a small sample was available (de Dombal et al., 1981)(5). Even more fundamentally, because they lacked a concise formalism (…) for representing and using conditional independence information, they depended on the acquisition, storage, and processing of enormous tables of probabilistic data. Because of these difficulties, probabilistic methods for coping with uncertainty fell out of favor in AI from the 1970s to the mid-1980s. The naive Bayes model for joint distributions has been studied extensively in the pattern recognition literature since the 1950s (Duda and Hart, 1973)(6). 
It has also been used, often unwittingly, in information retrieval, beginning with the work of Maron (1961)(7). The probabilistic foundations of this technique, (…) were elucidated by Robertson and Sparck Jones (1976)(8). Independence: Domingos and Pazzani (1997)(9) provide an explanation Norvig I 506 for the surprising success of naive Bayesian reasoning even in domains where the independence assumptions are clearly violated. >Bayesian Networks/Norvig. 1. Gorry, G. A. (1968). Strategies for computer-aided diagnosis. Mathematical Biosciences, 2(3-4), 293-318. 2. Gorry, G. A., Kassirer, J. P., Essig, A., and Schwartz, W. B. (1973). Decision analysis as the basis for computer-aided management of acute renal failure. American Journal of Medicine, 55, 473-484. 3. de Dombal, F. T., Leaper, D. J., Horrocks, J. C., and Staniland, J. R. (1974). Human and computer-aided diagnosis of abdominal pain: Further report with emphasis on performance of clinicians. British Medical Journal, 1, 376–380. 4. Lucas, P., van der Gaag, L., and Abu-Hanna, A. (2004). Bayesian networks in biomedicine and health-care. Artificial Intelligence in Medicine. 5. de Dombal, F. T., Staniland, J. R., and Clamp, S. E. (1981). Geographical variation in disease presentation. Medical Decision Making, 1, 59–69. 6. Duda, R. O. and Hart, P. E. (1973). Pattern classification and scene analysis. Wiley. 7. Maron, M. E. (1961). Automatic indexing: An experimental inquiry. JACM, 8(3), 404-417. 8. Robertson, S. E. and Sparck Jones, K. (1976). Relevance weighting of search terms. J. American Society for Information Science, 27, 129-146. 9. Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero–one loss. Machine Learning, 29, 103–130. |
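The naive Bayes model described in this entry (one cause variable, conditionally independent effects) can be sketched from counts. This is a hypothetical toy example with invented medical-style labels; the Laplace smoothing is an added assumption to avoid zero probabilities:

```python
from collections import defaultdict

def nb_train(examples):
    """Count-based naive Bayes: estimates P(cause) and P(effect_i | cause)."""
    class_counts = defaultdict(int)
    feat_counts = defaultdict(lambda: defaultdict(int))
    for features, label in examples:
        class_counts[label] += 1
        for i, f in enumerate(features):
            feat_counts[label][(i, f)] += 1
    return class_counts, feat_counts

def nb_classify(features, class_counts, feat_counts):
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for label, n in class_counts.items():
        p = n / total  # P(cause)
        for i, f in enumerate(features):
            # Laplace smoothing, assuming two values per feature (an assumption)
            p *= (feat_counts[label][(i, f)] + 1) / (n + 2)
        if p > best_p:
            best, best_p = label, p
    return best

examples = [(('fever', 'cough'), 'flu'), (('fever', 'no-cough'), 'flu'),
            (('no-fever', 'cough'), 'cold'), (('no-fever', 'no-cough'), 'healthy')]
cc, fc = nb_train(examples)
print(nb_classify(('fever', 'cough'), cc, fc))  # -> 'flu'
```

The model's size grows linearly with the number of effect variables, as the entry notes, because only per-feature conditional counts are stored.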
Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
| Bayesianism | Russell | Norvig I 503 Bayesianism/Norvig/Russell: Bayes’ rule allows unknown probabilities to be computed from known conditional probabilities, usually in the causal direction. Applying Bayes’ rule with many pieces of evidence runs into the same scaling problems as does the full joint distribution. Conditional independence brought about by direct causal relationships in the domain might allow the full joint distribution to be factored into smaller, conditional distributions. The naive Bayes model assumes the conditional independence of all effect variables, given a single cause variable, and grows linearly with the number of effects. Norvig I 505 Bayesian probabilistic reasoning has been used in AI since the 1960s, especially in medical diagnosis. It was used not only to make a diagnosis from available evidence, but also to select further questions and tests by using the theory of information value (…) when available evidence was inconclusive (Gorry, 1968(1); Gorry et al., 1973(2)). One system outperformed human experts in the diagnosis of acute abdominal illnesses (de Dombal et al., 1974)(3). Lucas et al. (2004)(4) gives an overview. These early Bayesian systems suffered from a number of problems, however. Because they lacked any theoretical model of the conditions they were diagnosing, they were vulnerable to unrepresentative data occurring in situations for which only a small sample was available (de Dombal et al., 1981)(5). Even more fundamentally, because they lacked a concise formalism (…) for representing and using conditional independence information, they depended on the acquisition, storage, and processing of enormous tables of probabilistic data. Because of these difficulties, probabilistic methods for coping with uncertainty fell out of favor in AI from the 1970s to the mid-1980s. The naive Bayes model for joint distributions has been studied extensively in the pattern recognition literature since the 1950s (Duda and Hart, 1973)(6). 
It has also been used, often unwittingly, in information retrieval, beginning with the work of Maron (1961)(7). The probabilistic foundations of this technique, (…) were elucidated by Robertson and Sparck Jones (1976)(8). Independence: Domingos and Pazzani (1997)(9) provide an explanation Norvig I 506 for the surprising success of naive Bayesian reasoning even in domains where the independence assumptions are clearly violated. >Bayesian Networks/Norvig. 1. Gorry, G. A. (1968). Strategies for computer-aided diagnosis. Mathematical Biosciences, 2(3-4), 293-318. 2. Gorry, G. A., Kassirer, J. P., Essig, A., and Schwartz, W. B. (1973). Decision analysis as the basis for computer-aided management of acute renal failure. American Journal of Medicine, 55, 473-484. 3. de Dombal, F. T., Leaper, D. J., Horrocks, J. C., and Staniland, J. R. (1974). Human and computer-aided diagnosis of abdominal pain: Further report with emphasis on performance of clinicians. British Medical Journal, 1, 376–380. 4. Lucas, P., van der Gaag, L., and Abu-Hanna, A. (2004). Bayesian networks in biomedicine and health-care. Artificial Intelligence in Medicine. 5. de Dombal, F. T., Staniland, J. R., and Clamp, S. E. (1981). Geographical variation in disease presentation. Medical Decision Making, 1, 59–69. 6. Duda, R. O. and Hart, P. E. (1973). Pattern classification and scene analysis. Wiley. 7. Maron, M. E. (1961). Automatic indexing: An experimental inquiry. JACM, 8(3), 404-417. 8. Robertson, S. E. and Sparck Jones, K. (1976). Relevance weighting of search terms. J. American Society for Information Science, 27, 129-146. 9. Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero–one loss. Machine Learning, 29, 103–130. |
Russell I B. Russell/A.N. Whitehead Principia Mathematica Frankfurt 1986 Russell II B. Russell The ABC of Relativity, London 1958, 1969 German Edition: Das ABC der Relativitätstheorie Frankfurt 1989 Russell IV B. Russell The Problems of Philosophy, Oxford 1912 German Edition: Probleme der Philosophie Frankfurt 1967 Russell VI B. Russell "The Philosophy of Logical Atomism", in: B. Russell, Logic and Knowledge, ed. R. Ch. Marsh, London 1956, pp. 200-202 German Edition: Die Philosophie des logischen Atomismus In Eigennamen, U. Wolf (Hg) Frankfurt 1993 Russell VII B. Russell On the Nature of Truth and Falsehood, in: B. Russell, The Problems of Philosophy, Oxford 1912 German Edition: "Wahrheit und Falschheit" In Wahrheitstheorien, G. Skirbekk (Hg) Frankfurt 1996 Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
| Connectionism | Pauen | Pauen I 148 Neural networks/Fodor: work quite unlike computers (and computation), namely associatively. >Computational model, >Computation, >Neural networks, >Association. I 152 Learning: here neural networks are superior to computers, in which program and data are separated. >Learning, >Machine Learning, >Artificial Intelligence. I 155 VsNeural networks: they cannot explain the systematicity and productivity of thinking. >Thinking, >Consciousness, >Knowledge. I 153 Artificial neural networks/Pauen: Backpropagation: backward propagation of information through the network. The point: the weights of the connections can be adjusted differentially. Learning: here the intervention of the experimenter is needed. Large fault tolerance. Strength: pattern recognition. Cf. >Backtracking/Norvig. |
Pauen I M. Pauen Grundprobleme der Philosophie des Geistes Frankfurt 2001 |
| Intelligence | Newell, A./Simon, H. | Münch III 57ff Intelligence/Newell/Simon: there is no more a "principle of intelligence" than there is a "principle of life" that would explain the essence of life from its very nature. But that does not mean that there are no structural requirements for intelligence. Cf. >Principles. Münch III 69 General Problem Solver/Newell/Simon: (GPS) general mechanisms and schemes for performing different tasks: distinction nets, pattern recognition mechanisms, syntax analysis. >General Problem Solver, >Distinctions, >Networks, >Artificial Neural Networks, >Syntax, >Analysis, >Pattern Recognition, >Machine Learning, >Artificial Intelligence. Münch III 76 Definition Intelligence/Newell/Simon: a system with limited processing capacity must make wise decisions about what to do next. Prerequisite: the distribution of solutions must not be completely random! Mere inserting and testing is not intelligent. >Inserting. There is nothing mystical about the origin of intelligence: it comes from search trees. Allen Newell/Herbert Simon, “Computer Science as Empirical Inquiry: Symbols and Search” Communications of the Association for Computing Machinery 19 (1976), 113-126 |
Mü III D. Münch (Hrsg.) Kognitionswissenschaft Frankfurt 1992 |
| Learning | Gärdenfors | I 26 Learning/convexity/Gärdenfors: that properties are convex (i.e. that points lying in between points with a given property in the quality space also have that property) facilitates the learning of categories. --- I 42 Learning/Gärdenfors: when learning concepts, we must start from a few specimens, which we then generalize. (See Reed, 1972(1), Nosofsky, 1986(2), 1988(3), Langley, 1996(4)). When we accept prototypes, we can say that typical instances are obtained from the examples by finding something like a middle position in the domain (Langley, 1996(4), p. 99). This middle position can then again be used for a Voronoi tessellation of the region. ((s) Division of the domain into adjacent sections with a respective center.) >Prototype/Gärdenfors. 1. Reed, S. K. (1972). Pattern recognition and categorization. Cognitive Psychology, 3, 382-407. 2. Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39-57. 3. Nosofsky, R. M. (1988). Similarity, frequency, and category representations. Journal of Experimental Psychology: Learning, Memory and Cognition, 14, 54-65. 4. Langley, P. (1996). Elements of machine learning. San Francisco, CA: Morgan Kaufmann. |
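The "middle position" idea above can be sketched as a nearest-prototype classifier: each category's prototype is the mean of its examples, and classifying every point by its nearest prototype partitions the space into Voronoi cells. The point sets and labels below are invented for illustration:

```python
def prototype(examples):
    """Prototype as the mean ('middle position') of the example points."""
    n = len(examples)
    return tuple(sum(x[i] for x in examples) / n for i in range(len(examples[0])))

def classify(point, prototypes):
    """Nearest-prototype rule: each prototype's Voronoi cell is its category."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(prototypes, key=lambda label: dist2(point, prototypes[label]))

red = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
blue = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
protos = {'red': prototype(red), 'blue': prototype(blue)}
print(protos['red'])                  # mean of the red examples
print(classify((2.0, 2.0), protos))   # -> 'red'
```

The resulting categories are convex: with Euclidean distance, every Voronoi cell of a single prototype is a convex region.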
Gä I P. Gärdenfors The Geometry of Meaning Cambridge 2014 |
| Neural Networks | Norvig | Norvig I 761 Neural Networks/Norvig/Russell: Literature on neural networks: Cowan and Sharp (1988b(1), 1988a(2)) survey the early history, beginning with the work of McCulloch and Pitts (1943)(3). (John McCarthy has pointed to the work of Nicolas Rashevsky (1936(4), 1938(5)) as the earliest mathematical model of neural learning.) Norbert Wiener, a pioneer of cybernetics and control theory (Wiener, 1948)(6), worked with McCulloch and Pitts and influenced a number of young researchers including Marvin Minsky, who may have been the first to develop a working neural network in hardware in 1951 (see Minsky and Papert, 1988(7), pp. ix–x). Turing (1948)(8) wrote a research report titled Intelligent Machinery that begins with the sentence “I propose to investigate the question as to whether it is possible for machinery to show intelligent behaviour” and goes on to describe a recurrent neural network architecture he called “B-type unorganized machines” and an approach to training them. Unfortunately, the report went unpublished until 1969, and was all but ignored until recently. Frank Rosenblatt (1957)(9) invented the modern “perceptron” and proved the perceptron convergence theorem (1960), although it had been foreshadowed by purely mathematical work outside the context of neural networks (Agmon, 1954(10); Motzkin and Schoenberg, 1954(11)). Some early work was also done on multilayer networks, including Gamba perceptrons (Gamba et al., 1961)(12) and madalines (Widrow, 1962)(13). Learning Machines (Nilsson, 1965)(14) covers much of this early work and more. The subsequent demise of early perceptron research efforts was hastened (or, the authors later claimed, merely explained) by the book Perceptrons (Minsky and Papert, 1969)(15), which lamented the field’s lack of mathematical rigor. 
The book pointed out that single-layer perceptrons could represent only linearly separable concepts and noted the lack of effective learning algorithms for multilayer networks. The papers in (Hinton and Anderson, 1981)(16), based on a conference in San Diego in 1979, can be regarded as marking a renaissance of connectionism. The two-volume “PDP” (Parallel Distributed Processing) anthology (Rumelhart et al., 1986a)(17) and a short article in Nature (Rumelhart et al., 1986b)(18) attracted a great deal of attention—indeed, the number of papers on “neural networks” multiplied by a factor of 200 between 1980–84 and 1990–94. The analysis of neural networks using the physical theory of magnetic spin glasses (Amit et al., 1985)(19) tightened the links between statistical mechanics and neural network theory - providing not only useful mathematical insights but also respectability. The back-propagation technique had been invented quite early (Bryson and Ho, 1969)(20) but it was rediscovered several times (Werbos, 1974(21); Parker, 1985(22)). The probabilistic interpretation of neural networks has several sources, including Baum and Wilczek (1988)(23) and Bridle (1990)(24). The role of the sigmoid function is discussed by Jordan (1995)(25). Bayesian parameter learning for neural networks was proposed by MacKay Norvig I 762 (1992)(26) and is explored further by Neal (1996)(27). The capacity of neural networks to represent functions was investigated by Cybenko (1988(28), 1989(29)), who showed that two hidden layers are enough to represent any function and a single layer is enough to represent any continuous function. The “optimal brain damage” method (>Artificial neural networks/Norvig) for removing useless connections is by LeCun et al. (1989)(30), and Sietsma and Dow (1988)(31) show how to remove useless units. >Complexity/Norvig. Norvig I 763 For neural nets, Bishop (1995)(32), Ripley (1996)(33), and Haykin (2008)(34) are the leading texts. 
The field of computational neuroscience is covered by Dayan and Abbott (2001)(35). 1. Cowan, J. D. and Sharp, D. H. (1988b). Neural nets and artificial intelligence. Daedalus, 117, 85–121. 2. Cowan, J. D. and Sharp, D. H. (1988a). Neural nets. Quarterly Reviews of Biophysics, 21, 365–427. 3. McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–137. 4. Rashevsky, N. (1936). Physico-mathematical aspects of excitation and conduction in nerves. In Cold Springs Harbor Symposia on Quantitative Biology. IV: Excitation Phenomena, pp. 90–97. 5. Rashevsky, N. (1938). Mathematical Biophysics: Physico-Mathematical Foundations of Biology. University of Chicago Press. 6. Wiener, N. (1948). Cybernetics. Wiley. 7. Minsky, M. L. and Papert, S. (1988). Perceptrons: An Introduction to Computational Geometry (Expanded edition). MIT Press. 8. Turing, A. (1948). Intelligent machinery. Tech. rep. National Physical Laboratory. Reprinted in (Ince, 1992). 9. Rosenblatt, F. (1957). The perceptron: A perceiving and recognizing automaton. Report 85-460-1, Project PARA, Cornell Aeronautical Laboratory. 10. Agmon, S. (1954). The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6(3), 382–392. 11. Motzkin, T. S. and Schoenberg, I. J. (1954). The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6(3), 393–404. 12. Gamba, A., Gamberini, L., Palmieri, G., and Sanna, R. (1961). Further experiments with PAPA. Nuovo Cimento Supplemento, 20(2), 221–231. 13. Widrow, B. (1962). Generalization and information storage in networks of adaline “neurons”. In Self-Organizing Systems 1962, pp. 435–461. 14. Nilsson, N. J. (1965). Learning Machines: Foundations of Trainable Pattern-Classifying Systems. McGraw-Hill. Republished in 1990. 15. Minsky, M. L. and Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry (first edition). MIT Press. 16. Hinton, G. E. 
and Anderson, J. A. (1981). Parallel Models of Associative Memory. Lawrence Erlbaum Associates. 17. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986a). Learning internal representations by error propagation. In Rumelhart, D. E. and McClelland, J. L. (Eds.), Parallel Distributed Processing, Vol. 1, chap. 8, pp. 318–362. MIT Press. 18. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). Learning representations by back propagating errors. Nature, 323, 533–536. 19. Amit, D., Gutfreund, H., and Sompolinsky, H. (1985). Spin-glass models of neural networks. Physical Review, A 32, 1007–1018. 20. Bryson, A. E. and Ho, Y.-C. (1969). Applied Optimal Control. Blaisdell. 21. Werbos, P. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University. 22. Parker, D. B. (1985). Learning logic. Technical report TR-47, Center for Computational Research in Economics and Management Science, Massachusetts Institute of Technology. 23. Baum, E. and Wilczek, F. (1988). Supervised learning of probability distributions by neural networks. In Anderson, D. Z. (Ed.), Neural Information Processing Systems, pp. 52–61. American Institute of Physics. 24. Bridle, J. S. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Fogelman Soulié, F. and Hérault, J. (Eds.), Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag. 25. Jordan, M. I. (1995). Why the logistic function? A tutorial discussion on probabilities and neural networks. Computational cognitive science technical report 9503, Massachusetts Institute of Technology. 26. MacKay, D. J. C. (1992). A practical Bayesian framework for back-propagation networks. Neural Computation, 4(3), 448–472. 27. Neal, R. (1996). Bayesian Learning for Neural Networks. Springer-Verlag. 28. Cybenko, G. (1988). 
Continuous valued neural networks with two hidden layers are sufficient. Technical report, Department of Computer Science, Tufts University. 29. Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Controls, Signals, and Systems, 2, 303–314. 30. LeCun, Y., Jackel, L., Boser, B., and Denker, J. (1989). Handwritten digit recognition: Applications of neural network chips and automatic learning. IEEE Communications Magazine, 27(11), 41–46. 31. Sietsma, J. and Dow, R. J. F. (1988). Neural net pruning - Why and how. In IEEE International Conference on Neural Networks, pp. 325–333. 32. Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press. 33. Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press. 34. Haykin, S. (2008). Neural Networks: A Comprehensive Foundation. Prentice Hall. 35. Dayan, P. and Abbott, L. F. (2001). Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press. |
Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
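The perceptron and feed-forward material in the entry above (sigmoid units, a dummy input a0 = 1 with weight w0, directed acyclic "downstream" flow) can be sketched in a few lines of Python. This is an illustrative sketch, not code from Norvig and Russell; the XOR weights are hand-chosen assumptions for demonstration only.

```python
import math

def sigmoid(z):
    # Logistic activation g(z) = 1 / (1 + e^(-z)); a hard threshold
    # here instead would give a (non-differentiable) perceptron.
    return 1.0 / (1.0 + math.exp(-z))

def unit_output(inputs, weights):
    # Each unit has a dummy input a0 = 1 whose weight weights[0]
    # plays the role of w0,j in the entry above.
    z = weights[0] + sum(w * a for w, a in zip(weights[1:], inputs))
    return sigmoid(z)

def feed_forward(x, hidden_weights, output_weights):
    # Feed-forward: activation flows only "downstream", with no loops,
    # so the network is a function of its current input alone.
    hidden = [unit_output(x, w) for w in hidden_weights]
    return unit_output(hidden, output_weights)

# Hand-chosen (illustrative) weights: one hidden layer of two sigmoid
# units computing XOR, which no single-layer perceptron can represent.
HIDDEN = [[-10, 20, 20],   # roughly OR of the two inputs
          [30, -20, -20]]  # roughly NAND of the two inputs
OUTPUT = [-30, 20, 20]     # roughly AND of the two hidden units
```

With these weights, `feed_forward([0, 1], HIDDEN, OUTPUT)` is close to 1 while `feed_forward([1, 1], HIDDEN, OUTPUT)` is close to 0, illustrating the nonlinearity that multilayer networks add.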
| Optimization | Norvig | Norvig I 133 Optimization/Norvig/Russell: Linear programming is probably the most widely studied and broadly useful class of optimization problems. It is a special case of the more general problem of convex optimization, which allows the constraint region to be any convex region and the objective to be any function that is convex within the constraint region. Linear programming problems: here, constraints must be linear inequalities forming a convex set and the objective function is also linear. The time complexity of linear programming is polynomial in the number of variables. Def Convex: A set of points S is convex if the line joining any two points in S is also contained in S. A convex function is one for which the space “above” it forms a convex set; by definition, convex functions have no local (as opposed to global) minima. Under certain conditions, convex optimization problems are also polynomially solvable and may be feasible in practice with thousands of variables. Several important problems in machine learning and control theory can be formulated as convex optimization problems. >Search algorithms. Norvig I 155 Finding optimal solutions in continuous spaces is the subject matter of several fields, including optimization theory, optimal control theory, and the calculus of variations. The basic techniques are explained well by Bishop (1995)(1); Press et al. (2007)(2) cover a wide range of algorithms and provide working software. As Andrew Moore points out, researchers have taken inspiration for search and optimization algorithms from a wide variety of fields of study: metallurgy (simulated annealing), biology (genetic algorithms), economics (market-based algorithms), entomology (ant colony optimization), neurology (neural networks), animal behavior (reinforcement learning), mountaineering (hill climbing), and others. 
In the 1950s, several statisticians, including Box (1957)(3) and Friedman (1959)(4), used evolutionary techniques for optimization problems, but it wasn’t until Rechenberg (1965)(5) introduced evolution strategies to solve optimization problems for airfoils that the approach gained popularity. 1. Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press. 2. Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (2007). Numerical Recipes: The Art of Scientific Computing (third edition). Cambridge University Press. 3. Box, G. E. P. (1957). Evolutionary operation: A method of increasing industrial productivity. Applied Statistics, 6, 81–101. 4. Friedman, G. J. (1959). Digital simulation of an evolutionary process. General Systems Yearbook, 4, 171–184. 5. Rechenberg, I. (1965). Cybernetic solution path of an experimental problem. Library translation 1122, Royal Aircraft Establishment. |
Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
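The convexity definition in the entry above ("convex functions have no local, as opposed to global, minima") rests on the midpoint inequality f((a+b)/2) ≤ (f(a)+f(b))/2. A minimal sketch (not from the source) that tests this inequality on sample points:

```python
def midpoint_convex_on_samples(f, points, eps=1e-12):
    # A convex function satisfies f((a+b)/2) <= (f(a)+f(b))/2 for all
    # a, b; a single violation on the sample grid proves non-convexity.
    # (Passing the check on samples is, of course, only evidence.)
    for a in points:
        for b in points:
            if f((a + b) / 2) > (f(a) + f(b)) / 2 + eps:
                return False
    return True

xs = [x / 10.0 for x in range(-30, 31)]
# x**2 is convex, so its only minimum is global;
# -x**2 violates the inequality for any a != b.
```

Sampling can only refute convexity, never prove it; that asymmetry is why convex optimization relies on the analytic definition rather than on testing.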
| Optimization | Russell | Norvig I 133 Optimization/Norvig/Russell: Linear programming is probably the most widely studied and broadly useful class of optimization problems. It is a special case of the more general problem of convex optimization, which allows the constraint region to be any convex region and the objective to be any function that is convex within the constraint region. Linear programming problems: here, constraints must be linear inequalities forming a convex set and the objective function is also linear. The time complexity of linear programming is polynomial in the number of variables. Def Convex: A set of points S is convex if the line joining any two points in S is also contained in S. A convex function is one for which the space “above” it forms a convex set; by definition, convex functions have no local (as opposed to global) minima. Under certain conditions, convex optimization problems are also polynomially solvable and may be feasible in practice with thousands of variables. Several important problems in machine learning and control theory can be formulated as convex optimization problems. >Search algorithms. Norvig I 155 Finding optimal solutions in continuous spaces is the subject matter of several fields, including optimization theory, optimal control theory, and the calculus of variations. The basic techniques are explained well by Bishop (1995)(1); Press et al. (2007)(2) cover a wide range of algorithms and provide working software. As Andrew Moore points out, researchers have taken inspiration for search and optimization algorithms from a wide variety of fields of study: metallurgy (simulated annealing), biology (genetic algorithms), economics (market-based algorithms), entomology (ant colony optimization), neurology (neural networks), animal behavior (reinforcement learning), mountaineering (hill climbing), and others. 
In the 1950s, several statisticians, including Box (1957)(3) and Friedman (1959)(4), used evolutionary techniques for optimization problems, but it wasn’t until Rechenberg (1965)(5) introduced evolution strategies to solve optimization problems for airfoils that the approach gained popularity. 1. Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press. 2. Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (2007). Numerical Recipes: The Art of Scientific Computing (third edition). Cambridge University Press. 3. Box, G. E. P. (1957). Evolutionary operation: A method of increasing industrial productivity. Applied Statistics, 6, 81–101. 4. Friedman, G. J. (1959). Digital simulation of an evolutionary process. General Systems Yearbook, 4, 171–184. 5. Rechenberg, I. (1965). Cybernetic solution path of an experimental problem. Library translation 1122, Royal Aircraft Establishment. |
Russell I B. Russell/A.N. Whitehead Principia Mathematica Frankfurt 1986 Russell II B. Russell The ABC of Relativity, London 1958, 1969 German Edition: Das ABC der Relativitätstheorie Frankfurt 1989 Russell IV B. Russell The Problems of Philosophy, Oxford 1912 German Edition: Probleme der Philosophie Frankfurt 1967 Russell VI B. Russell "The Philosophy of Logical Atomism", in: B. Russell, Logic and Knowledge, ed. R. Ch. Marsh, London 1956, pp. 200-202 German Edition: Die Philosophie des logischen Atomismus In Eigennamen, U. Wolf (Hg) Frankfurt 1993 Russell VII B. Russell On the Nature of Truth and Falsehood, in: B. Russell, The Problems of Philosophy, Oxford 1912 - Dt. "Wahrheit und Falschheit" In Wahrheitstheorien, G. Skirbekk (Hg) Frankfurt 1996 Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
| Statistical Learning | Norvig | Norvig I 825 Statistical learning/Norvig/Russell: Statistical learning methods range from simple calculation of averages to the construction of complex models such as Bayesian networks. They have applications throughout computer science, engineering, computational biology, neuroscience, psychology, and physics. ((s) Cf. >Prior knowledge/Norvig). Bayesian learning methods: formulate learning as a form of probabilistic inference, using the observations to update a prior distribution over hypotheses. This approach provides a good way to implement Ockham’s razor, but quickly becomes intractable for complex hypothesis spaces. Maximum a posteriori (MAP) learning: selects a single most likely hypothesis given the data. The hypothesis prior is still used and the method is often more tractable than full Bayesian learning. Maximum-likelihood learning: simply selects the hypothesis that maximizes the likelihood of the data; it is equivalent to MAP learning with a uniform prior. In simple cases such as linear regression and fully observable Bayesian networks, maximum-likelihood solutions can be found easily in closed form. Naive Bayes learning is a particularly effective technique that scales well. Hidden variables/latent variables: When some variables are hidden, local maximum likelihood solutions can be found using the EM algorithm. Applications include clustering using mixtures of Gaussians, learning Bayesian networks, and learning hidden Markov models. Norvig I 823 EM Algorithm: Each involves computing expected values of hidden variables for each example and then recomputing the parameters, using the expected values as if they were observed values. Norvig I 825 Learning the structure of Bayesian networks is an example of model selection. This usually involves a discrete search in the space of structures. Some method is required for trading off model complexity against degree of fit. 
Nonparametric models: represent a distribution using the collection of data points. Thus, the number of parameters grows with the training set. Nearest-neighbors methods look at the examples nearest to the point in question, whereas kernel methods form a distance-weighted combination of all the examples. History: The application of statistical learning techniques in AI was an active area of research in the early years (see Duda and Hart, 1973)(1) but became separated from mainstream AI as the latter field concentrated on symbolic methods. A resurgence of interest occurred shortly after the introduction of Bayesian network models in the late 1980s; at roughly the same time, Norvig I 826 a statistical view of neural network learning began to emerge. In the late 1990s, there was a noticeable convergence of interests in machine learning, statistics, and neural networks, centered on methods for creating large probabilistic models from data. Naïve Bayes model: is one of the oldest and simplest forms of Bayesian network, dating back to the 1950s. Its surprising success is partially explained by Domingos and Pazzani (1997)(2). A boosted form of naive Bayes learning won the first KDD Cup data mining competition (Elkan, 1997)(3). Heckerman (1998)(4) gives an excellent introduction to the general problem of Bayes net learning. Bayesian parameter learning with Dirichlet priors for Bayesian networks was discussed by Spiegelhalter et al. (1993)(5). The BUGS software package (Gilks et al., 1994)(6) incorporates many of these ideas and provides a very powerful tool for formulating and learning complex probability models. The first algorithms for learning Bayes net structures used conditional independence tests (Pearl, 1988(7); Pearl and Verma, 1991(8)). Spirtes et al. (1993)(9) developed a comprehensive approach embodied in the TETRAD package for Bayes net learning.
Algorithmic improvements since then led to a clear victory in the 2001 KDD Cup data mining competition for a Bayes net learning method (Cheng et al., 2002)(10). (The specific task here was a bioinformatics problem with 139,351 features!) A structure-learning approach based on maximizing likelihood was developed by Cooper and Herskovits (1992)(11) and improved by Heckerman et al. (1994)(12). Several algorithmic advances since that time have led to quite respectable performance in the complete-data case (Moore and Wong, 2003(13); Teyssier and Koller, 2005(14)). One important component is an efficient data structure, the AD-tree, for caching counts over all possible combinations of variables and values (Moore and Lee, 1997)(15). Friedman and Goldszmidt (1996)(16) pointed out the influence of the representation of local conditional distributions on the learned structure. Hidden variables/missing data: The general problem of learning probability models with hidden variables and missing data was addressed by Hartley (1958)(17), who described the general idea of what was later called EM and gave several examples. Further impetus came from the Baum–Welch algorithm for HMM learning (Baum and Petrie, 1966)(18), which is a special case of EM. The paper by Dempster, Laird, and Rubin (1977)(19), which presented the EM algorithm in general form and analyzed its convergence, is one of the most cited papers in both computer science and statistics. (Dempster himself views EM as a schema rather than an algorithm, since a good deal of mathematical work may be required before it can be applied to a new family of distributions.) McLachlan and Krishnan (1997)(20) devote an entire book to the algorithm and its properties. The specific problem of learning mixture models, including mixtures of Gaussians, is covered by Titterington et al. (1985)(21). 
Within AI, the first successful system that used EM for mixture modeling was AUTOCLASS (Cheeseman et al., 1988(22); Cheeseman and Stutz, 1996(23)). AUTOCLASS has been applied to a number of real-world scientific classification tasks, including the discovery of new types of stars from spectral data (Goebel et al., 1989)(24) and new classes of proteins and introns in DNA/protein sequence databases (Hunter and States, 1992)(25). Maximum-likelihood parameter learning: For maximum-likelihood parameter learning in Bayes nets with hidden variables, EM and gradient-based methods were introduced around the same time by Lauritzen (1995)(26), Russell et al. (1995)(27), and Binder et al. (1997a)(28). The structural EM algorithm was developed by Friedman (1998)(29) and applied to maximum-likelihood learning of Bayes net structures with Norvig I 827 latent variables. Friedman and Koller (2003)(30) describe Bayesian structure learning. Causality/causal network: The ability to learn the structure of Bayesian networks is closely connected to the issue of recovering causal information from data. That is, is it possible to learn Bayes nets in such a way that the recovered network structure indicates real causal influences? For many years, statisticians avoided this question, believing that observational data (as opposed to data generated from experimental trials) could yield only correlational information—after all, any two variables that appear related might in fact be influenced by a third, unknown causal factor rather than influencing each other directly. Pearl (2000)(31) has presented convincing arguments to the contrary, showing that there are in fact many cases where causality can be ascertained and developing the causal network formalism to express causes and the effects of intervention as well as ordinary conditional probabilities.
Literature on statistical learning and pattern recognition: Good texts on Bayesian statistics include those by DeGroot (1970)(32), Berger (1985)(33), and Gelman et al. (1995)(34). Bishop (2007)(35) and Hastie et al. (2009)(36) provide an excellent introduction to statistical machine learning. For pattern classification, the classic text for many years has been Duda and Hart (1973)(1), now updated (Duda et al., 2001)(37). The annual NIPS (Neural Information Processing Systems) conference, whose proceedings are published as the series Advances in Neural Information Processing Systems, is now dominated by Bayesian papers. Papers on learning Bayesian networks also appear in the Uncertainty in AI and Machine Learning conferences and in several statistics conferences. Journals specific to neural networks include Neural Computation, Neural Networks, and the IEEE Transactions on Neural Networks. 1. Duda, R. O. and Hart, P. E. (1973). Pattern classification and scene analysis. Wiley. 2. Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103–130. 3. Elkan, C. (1997). Boosting and naive Bayesian learning. Tech. rep., Department of Computer Science and Engineering, University of California, San Diego. 4. Heckerman, D. (1998). A tutorial on learning with Bayesian networks. In Jordan, M. I. (Ed.), Learning in graphical models. Kluwer. 5. Spiegelhalter, D. J., Dawid, A. P., Lauritzen, S., and Cowell, R. (1993). Bayesian analysis in expert systems. Statistical Science, 8, 219–282. 6. Gilks, W. R., Thomas, A., and Spiegelhalter, D. J. (1994). A language and program for complex Bayesian modelling. The Statistician, 43, 169–178. 7. Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann. 8. Pearl, J. and Verma, T. (1991). A theory of inferred causation. In KR-91, pp. 441–452. 9. Spirtes, P., Glymour, C., and Scheines, R. (1993).
Causation, prediction, and search. Springer-Verlag. 10. Cheng, J., Greiner, R., Kelly, J., Bell, D. A., and Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. AIJ, 137, 43–90. 11. Cooper, G. and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347. 12. Heckerman, D., Geiger, D., and Chickering, D. M. (1994). Learning Bayesian networks: The combination of knowledge and statistical data. Technical report MSR-TR-94-09, Microsoft Research. 13. Moore, A. and Wong, W.-K. (2003). Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning. In ICML-03. 14. Teyssier, M. and Koller, D. (2005). Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In UAI-05, pp. 584–590. 15. Moore, A. W. and Lee, M. S. (1997). Cached sufficient statistics for efficient machine learning with large datasets. JAIR, 8, 67–91. 16. Friedman, N. and Goldszmidt, M. (1996). Learning Bayesian networks with local structure. In UAI-96, pp. 252–262. 17. Hartley, H. (1958). Maximum likelihood estimation from incomplete data. Biometrics, 14, 174–194. 18. Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 41. 19. Dempster, A. P., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society, 39 (Series B), 1–38. 20. McLachlan, G. J. and Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley. 21. Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985). Statistical analysis of finite mixture distributions. Wiley. 22. Cheeseman, P., Self, M., Kelly, J., and Stutz, J. (1988). Bayesian classification. In AAAI-88, Vol. 2, pp. 607–611. 23. Cheeseman, P. and Stutz, J. (1996). Bayesian classification (AutoClass): Theory and results. 
In Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. (Eds.), Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press. 24. Goebel, J., Volk, K., Walker, H., and Gerbault, F. (1989). Automatic classification of spectra from the infrared astronomical satellite (IRAS). Astronomy and Astrophysics, 222, L5–L8. 25. Hunter, L. and States, D. J. (1992). Bayesian classification of protein structure. IEEE Expert, 7(4), 67–75. 26. Lauritzen, S. (1995). The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19, 191–201. 27. Russell, S. J., Binder, J., Koller, D., and Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. In IJCAI-95, pp. 1146–1152. 28. Binder, J., Koller, D., Russell, S. J., and Kanazawa, K. (1997a). Adaptive probabilistic networks with hidden variables. Machine Learning, 29, 213–244. 29. Friedman, N. (1998). The Bayesian structural EM algorithm. In UAI-98. 30. Friedman, N. and Koller, D. (2003). Being Bayesian about Bayesian network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50, 95–125. 31. Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press. 32. DeGroot, M. H. (1970). Optimal Statistical Decisions. McGraw-Hill. 33. Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag. 34. Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. (1995). Bayesian Data Analysis. Chapman & Hall. 35. Bishop, C. M. (2007). Pattern Recognition and Machine Learning. Springer-Verlag. 36. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction (2nd edition). Springer-Verlag. 37. Duda, R. O., Hart, P. E., and Stork, D. G. (2001). Pattern Classification (2nd edition). Wiley. |
Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
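The EM description in the entry above ("computing expected values of hidden variables for each example and then recomputing the parameters, using the expected values as if they were observed values") can be sketched for the mixture-of-Gaussians case. This is an illustrative stdlib-only sketch, not AIMA's code; the quartile initialization and fixed iteration count are assumptions made for the example.

```python
import math
import random

def normal_pdf(x, mu, var):
    # Density of a 1-D Gaussian with mean mu and variance var.
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iters=50):
    # Crude initialization (an assumption): means at the quartiles.
    s = sorted(data)
    mu = [s[len(s) // 4], s[3 * len(s) // 4]]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: expected component memberships ("responsibilities")
        # of the hidden variable for each example.
        resp = []
        for x in data:
            p = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            z = sum(p)
            resp.append([pk / z for pk in p])
        # M-step: re-estimate parameters, treating the expected
        # memberships as if they were observed counts.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
            pi[k] = nk / len(data)
    return mu, var, pi
```

On data drawn from two well-separated Gaussians, the recovered means converge to a local maximum of the likelihood near the true values; as the entry notes, only a local maximum is guaranteed.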
| Weather Forecasting | Edwards | Edwards I 362 Weather Forecasting/Edwards: From the dawn of synoptic forecasting, weather forecasting comprised three principal steps: (1) collect the available data, (2) interpret the data to create a picture of the weather situation, and (3) predict how that picture will change during the forecast period. The second step, originally known as “diagnosis,” transformed raw data from a relatively few points into a coherent, self-consistent picture of atmospheric structure and motion.(1) As in a medical diagnosis, forecasters combined theory and experiential knowledge to reach a shared understanding of reality from incomplete and potentially ambiguous indications (symptoms). Analysis: For early NWP (Numerical Weather Prediction) , diagnosis or “analysis” proved the most difficult aspect of forecasting. Ultimately, it was also the most rewarding. In the long run, analysis would also connect forecasting with climatology in new, unexpected, and important ways. >Reanalysis/Climatology. Edwards I 364 Interpretation: Before numerical weather prediction, analysis was an interpretive process that involved a shifting combination of mathematics, graphical techniques, and pattern recognition. Human interpretation played a crucial role in data collection; (…).(2) Edwards I 369 Objective analysis: The JNWPU’s (Northwestern Polytechnical University) first, experimental analysis program defined a 1000×1000 km square around each gridpoint. Next, it searched for all available observed data within that square. If it found no data, the program skipped that gridpoint and moved to the next one. If it did find data, the program fitted a quadratic surface to all the data points within the search square. It then interpolated a value on that surface for the gridpoint. (…)This technique worked well for areas densely covered by observations, but performed poorly in large data-void regions.(3) >Models/Climatology, >Climate data/Edwards. 
Edwards I 391 Models/weather forecasting: Traditionally, scientists and philosophers alike understood mathematical models as expressions of theory - as constructs that relate dependent and independent variables to one another according to physical laws. On this view, you make a model to test a theory (or one expression of a theory). You take some measurements, fill them in as values for initial conditions in the model, then solve the equations, iterating into the future. From the point of view of operational forecasting, however, the main goal of analysis is not to explain weather but to reproduce it. You are generating a global data image, simulating and observing at the same time, checking and adjusting your simulation and your observations against each other. As the philosopher Eric Winsberg has argued, simulation modeling of this sort doesn’t test theory; it applies theory. This mode - application, not justification, of theory - is “unfamiliar to most philosophy of science.”(4) 1. V. Bjerknes, Dynamic Meteorology and Hydrography, Part II. Kinematics (Gibson Bros., Carnegie Institute, 1911); R. Daley, Atmospheric Data Analysis (Cambridge University Press, 1991). 2. See P. Bergthorsson and B. R. Döös, “Numerical Weather Map Analysis,” Tellus 7, no. 3 (1955), 329. 3. As one of the method’s designers observed, “straightforward interpolation between observations hundreds or thousands of miles apart is not going to give a usable value.” G. P. Cressman, “Dynamic Weather Prediction,” in Meteorological Challenges: A History, ed. D. P. McIntyre (Information Canada, 1972), 188. 4. E. Winsberg, “Sanctioning Models: The Epistemology of Simulation,” Science in Context 12, no. 2 (1999), 275. |
Edwards I Paul N. Edwards A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming Cambridge 2013 |
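The objective-analysis passage above describes estimating a gridpoint value from nearby irregular observations and skipping gridpoints in data voids. The JNWPU program fitted a quadratic surface; as a simpler illustration of the same idea, here is a sketch in the spirit of the Cressman-style distance weighting associated with the designer quoted in note 3. The weight function and radius are illustrative assumptions, not Edwards' text.

```python
import math

def cressman_weight(dist, radius):
    # Classic Cressman-style weight: positive inside the influence
    # radius, falling to zero at and beyond it.
    if dist >= radius:
        return 0.0
    r2, d2 = radius * radius, dist * dist
    return (r2 - d2) / (r2 + d2)

def analyze_gridpoint(gx, gy, observations, radius):
    """Interpolate a value at gridpoint (gx, gy) from irregular
    observations [(x, y, value), ...]. Returns None when no
    observation lies within the influence radius, mirroring the
    JNWPU program's 'skip this gridpoint' behavior in data voids."""
    num = den = 0.0
    for x, y, value in observations:
        w = cressman_weight(math.hypot(x - gx, y - gy), radius)
        num += w * value
        den += w
    return num / den if den > 0 else None
```

The design choice mirrors the quotation in note 3: rather than interpolating between arbitrarily distant observations, the scheme refuses to produce a value where the influence region is empty.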
| Disputed term/author/ism | Author Vs Author | Entry | Reference |
|---|---|---|---|
| Dennett, D. | Pauen Vs Dennett, D. | Pauen I 143 Swapped Spectra/VsDennett: Of course, a neural network can realize two very different forms of activity, e.g. pattern recognition and behavior control. There is also no reason why one such activity could not be combined with changing activities of the other type. A trained network can even be caused to give varying responses to the same pattern. |
Pauen I M. Pauen Grundprobleme der Philosophie des Geistes Frankfurt 2001 |