Disputed term/author/ism | Author | Entry | Reference |
---|---|---|---|
Artworks | Flusser | Rötzer I 66/67 Artwork/Flusser: The second law of thermodynamics is to be interpreted as saying that interesting things become increasingly rare. Because works of art are not based on any theory, especially not on information theory, they remain relatively uninformative and probable. >Second law of thermodynamics, >Information, >Probability, >Events, >Order. Flusser: it is not the greatness of works of art that is to be denied; rather, their production is to be freed from its mystical aura so that this greatness can be better assessed. >Aura. I 68 What does an author do? He collects information that he finds in already produced works according to the criteria of his time, and adds to it information from a concrete life. Among the self-acquired information there may also be noise, i.e., previously unavailable information. Rötzer I 70 The making of art needs to be mechanized and theorized. (Ethics and behaviour, aesthetics and experience are never separated.) >Ethics, >Behavior, >Aesthetics, >Experiences. Common sense/Flusser: common sense proves to be a reactionary element here; this is characteristic of the present. >Conservatism, >Present. |
Fl I V. Flusser Kommunikologie Mannheim 1996 Rötz I F. Rötzer Kunst machen? München 1991 |
Change | AI Research | Norvig I 566 Change/probability/time/inference/AI research/Norvig/Russell: Agents in partially observable environments must be able to keep track of the current state, to the extent that their sensors allow. (…) an agent maintains a belief state that represents which states of the world are currently possible. >Belief states/Norvig. From the belief state and a transition model, the agent can predict how the world might evolve in the next time step. From the percepts observed and a sensor model, the agent can update the belief state. [There are two ways of representing belief states] (…) a) by explicitly enumerated sets of states, b) by logical formulas. Those approaches defined belief states in terms of which world states were possible, but could say nothing about which states were likely or unlikely. Problem: a changing world is modeled using a variable for each aspect of the world state at each point in time. The transition and sensor models may be uncertain: the transition model describes the probability distribution of the variables at time t, given the state of the world at past times, while the sensor model describes the probability of each percept at time t, given the current state of the world. Solution: three specific kinds of models: hidden Markov models, Kalman filters, and dynamic Bayesian networks (which include hidden Markov models and Kalman filters as special cases). Norvig I 567 To assess the current state from the history of evidence and to predict the outcomes of treatment actions, we must model these changes. We view the world as a series of snapshots, or time slices, each of which contains a set of random variables, some observable and some not. ((s) Cf. >Four dimensionalism/Philosophical theories). Norvig I 568 (…) the next step is to specify how the world evolves (the transition model) and how the evidence variables get their values (the sensor model). Norvig I 570 Order: increasing the order can always be reformulated as an increase in the set of state variables, keeping the order fixed. Notice that adding state variables might improve the system’s predictive power but also increases the prediction requirements (…). Norvig I 603 Problem: data association: When trying to keep track of many objects, uncertainty arises as to which observations belong to which objects—the data association problem. The number of association hypotheses is typically intractably large, but MCMC and particle filtering algorithms for data association work well in practice. Norvig I 602 MCMC: An MCMC algorithm explores the space of assignment histories. Norvig I 603 Change: The changing state of the world is handled by using a set of random variables to represent the state at each point in time. Representations: can be designed to satisfy the Markov property, so that the future is independent of the past given the present. Combined with the assumption that the process is stationary—that is, the dynamics do not change over time—this greatly simplifies the representation. Probability: A temporal probability model can be thought of as containing a transition model describing the state evolution and a sensor model describing the observation process. >Inference/AI research. Historical development: Many of the basic ideas for estimating the state of dynamical systems came from the mathematician C. F. Gauss (1809)(1), who formulated a deterministic least-squares algorithm for the problem of estimating orbits from astronomical observations. A. A. 
Markov (1913)(2) developed what was later called the Markov assumption in his analysis of stochastic processes; Norvig I 604 (…). The general theory of Markov chains and their mixing times is covered by Levin et al. (2008)(3). Significant classified work on filtering was done during World War II by Wiener (1942)(4) for continuous-time processes and by Kolmogorov (1941)(5) for discrete-time processes. Although this work led to important technological developments over the next 20 years, its use of a frequency-domain representation made many calculations quite cumbersome. Direct state-space modeling of the stochastic process turned out to be simpler, as shown by Peter Swerling (1959)(6) and Rudolf Kalman (1960)(7). The hidden Markov model and associated algorithms for inference and learning, including the forward–backward algorithm, were developed by Baum and Petrie (1966)(8). The Viterbi algorithm first appeared in (Viterbi, 1967)(9). Similar ideas also appeared independently in the Kalman filtering community (Rauch et al., 1965)(10). The forward–backward algorithm was one of the main precursors of the general formulation of the EM algorithm (Dempster et al., 1977)(11) (…). Dynamic Bayesian networks (DBNs) can be viewed as a sparse encoding of a Markov process and were first used in AI by Dean and Kanazawa (1989b)(12), Nicholson and Brady (1992)(13), and Kjaerulff (1992)(14). The last work extends the HUGIN Bayes net system to accommodate dynamic Bayesian networks. The book by Dean and Wellman (1991)(15) helped popularize DBNs and the probabilistic approach to planning and control within AI. Murphy (2002)(16) provides a thorough analysis of DBNs. Dynamic Bayesian networks have become popular for modeling a variety of complex motion processes in computer vision (Huang et al., 1994(17); Intille and Bobick, 1999(18)). Like HMMs, they have found applications in speech recognition (Zweig and Russell, 1998(19); Richardson et al., 2000(20); Stephenson et al., 2000(21); Nefian et al., 2002(22); Livescu et al., 2003(23)), Norvig I 605 genomics (Murphy and Mian, 1999(24); Perrin et al., 2003(25); Husmeier, 2003(26)) and robot localization (Theocharous et al., 2004)(27). The link between HMMs and DBNs, and between the forward–backward algorithm and Bayesian network propagation, was made explicitly by Smyth et al. (1997)(28). A further unification with Kalman filters (and other statistical models) appears in Roweis and Ghahramani (1999)(29). Procedures exist for learning the parameters (Binder et al., 1997a(30); Ghahramani, 1998(31)) and structures (Friedman et al., 1998)(32) of DBNs. Norvig I 606 Data association: Data association for multitarget tracking was first described in a probabilistic setting by Sittler (1964)(33). The first practical algorithm for large-scale problems was the “multiple hypothesis tracker” or MHT algorithm (Reid, 1979)(34). Many important papers are collected by Bar-Shalom and Fortmann (1988)(35) and Bar-Shalom (1992)(36). The development of an MCMC algorithm for data association is due to Pasula et al. (1999)(37), who applied it to traffic surveillance problems. Oh et al. (2009)(38) provide a formal analysis and extensive experimental comparisons to other methods. Schulz et al. (2003)(39) describe a data association method based on particle filtering. Ingemar Cox analyzed the complexity of data association (Cox, 1993(40); Cox and Hingorani, 1994(41)) and brought the topic to the attention of the vision community.
He also noted the applicability of the polynomial-time Hungarian algorithm to the problem of finding most-likely assignments, which had long been considered an intractable problem in the tracking community. The algorithm itself was published by Kuhn (1955)(42), based on translations of papers published in 1931 by two Hungarian mathematicians, Dénes König and Jenö Egerváry. The basic theorem had been derived previously, however, in an unpublished Latin manuscript by the famous Prussian mathematician Carl Gustav Jacobi (1804–1851). 1. Gauss, C. F. (1809). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium. Sumtibus F. Perthes et I. H. Besser, Hamburg. 2. Markov, A. A. (1913). An example of statistical investigation in the text of “Eugene Onegin” illustrating coupling of “tests” in chains. Proc. Academy of Sciences of St. Petersburg, 7. 3. Levin, D. A., Peres, Y., and Wilmer, E. L. (2008). Markov Chains and Mixing Times. American Mathematical Society. 4. Wiener, N. (1942). The extrapolation, interpolation, and smoothing of stationary time series. OSRD 370, Report to the Services 19, Research Project DIC-6037, MIT. 5. Kolmogorov, A. N. (1941). Interpolation und Extrapolation von stationären zufälligen Folgen. Bulletin of the Academy of Sciences of the USSR, Ser. Math. 5, 3–14. 6. Swerling, P. (1959). First order error propagation in a stagewise smoothing procedure for satellite observations. J. Astronautical Sciences, 6, 46–52. 7. Kalman, R. (1960). A new approach to linear filtering and prediction problems. J. Basic Engineering, 82, 35–46. 8. Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37, 1554–1563. 9. Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2), 260–269. 10. Rauch, H. E., Tung, F., and Striebel, C. T. (1965). Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8), 1445–1450. 11. Dempster, A. P., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society, 39 (Series B), 1–38. 12. Dean, T. and Kanazawa, K. (1989b). A model for reasoning about persistence and causation. Computational Intelligence, 5(3), 142–150. 13. Nicholson, A. and Brady, J. M. (1992). The data association problem when monitoring robot vehicles using dynamic belief networks. In ECAI-92, pp. 689–693. 14. Kjaerulff, U. (1992). A computational scheme for reasoning in dynamic probabilistic networks. In UAI-92, pp. 121–129. 15. Dean, T. and Wellman, M. P. (1991). Planning and Control. Morgan Kaufmann. 16. Murphy, K. (2002). Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. thesis, UC Berkeley. 17. Huang, T., Koller, D., Malik, J., Ogasawara, G., Rao, B., Russell, S. J., and Weber, J. (1994). Automatic symbolic traffic scene analysis using belief networks. In AAAI-94, pp. 966–972. 18. Intille, S. and Bobick, A. (1999). A framework for recognizing multi-agent action from visual evidence. In AAAI-99, pp. 518–525. 19. Zweig, G. and Russell, S. J. (1998). Speech recognition with dynamic Bayesian networks. In AAAI-98, pp. 173–180. 20. Richardson, M., Bilmes, J., and Diorio, C. (2000). Hidden-articulator Markov models: Performance improvements and robustness to noise. In ICASSP-00. 21. Stephenson, T., Bourlard, H., Bengio, S., and Morris, A. (2000).
Automatic speech recognition using dynamic Bayesian networks with both acoustic and articulatory features. In ICSLP-00, pp. 951–954. 22. Nefian, A., Liang, L., Pi, X., Liu, X., and Murphy, K. (2002). Dynamic Bayesian networks for audiovisual speech recognition. EURASIP, Journal of Applied Signal Processing, 11, 1–15. 23. Livescu, K., Glass, J., and Bilmes, J. (2003). Hidden feature modeling for speech recognition using dynamic Bayesian networks. In EUROSPEECH-2003, pp. 2529–2532. 24. Murphy, K. and Mian, I. S. (1999). Modelling gene expression data using Bayesian networks. people.cs.ubc.ca/˜murphyk/Papers/ismb99.pdf. 25. Perrin, B. E., Ralaivola, L., and Mazurie, A. (2003). Gene networks inference using dynamic Bayesian networks. Bioinformatics, 19, II 138–II 148. 26. Husmeier, D. (2003). Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics, 19(17), 2271–2282. 27. Theocharous, G., Murphy, K., and Kaelbling, L. P. (2004). Representing hierarchical POMDPs as DBNs for multi-scale robot localization. In ICRA-04. 28. Smyth, P., Heckerman, D., and Jordan, M. I. (1997). Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9(2), 227–269. 29. Roweis, S. T. and Ghahramani, Z. (1999). A unifying review of linear Gaussian models. Neural Computation, 11(2), 305–345. 30. Binder, J., Koller, D., Russell, S. J., and Kanazawa, K. (1997a). Adaptive probabilistic networks with hidden variables. Machine Learning, 29, 213–244. 31. Ghahramani, Z. (1998). Learning dynamic Bayesian networks. In Adaptive Processing of Sequences and Data Structures, pp. 168–197. 32. Friedman, N., Murphy, K., and Russell, S. J. (1998). Learning the structure of dynamic probabilistic networks. In UAI-98. 33. Sittler, R. W. (1964). An optimal data association problem in surveillance theory. IEEE Transactions on Military Electronics, 8(2), 125–139. 34. Reid, D. B. (1979). An algorithm for tracking multiple targets. IEEE Trans. Automatic Control, 24(6), 843–854. 35. Bar-Shalom, Y. and Fortmann, T. E. (1988). Tracking and Data Association. Academic Press. 36. Bar-Shalom, Y. (Ed.). (1992). Multitarget-Multisensor Tracking: Advanced Applications. Artech House. 37. Pasula, H., Russell, S. J., Ostland, M., and Ritov, Y. (1999). Tracking many objects with many sensors. In IJCAI-99. 38. Oh, S., Russell, S. J., and Sastry, S. (2009). Markov chain Monte Carlo data association for multi-target tracking. IEEE Transactions on Automatic Control, 54(3), 481–497. 39. Schulz, D., Burgard, W., Fox, D., and Cremers, A. B. (2003). People tracking with mobile robots using sample-based joint probabilistic data association filters. Int. J. Robotics Research, 22(2), 99–116. 40. Cox, I. (1993). A review of statistical data association techniques for motion correspondence. IJCV, 10, 53–66. 41. Cox, I. and Hingorani, S. L. (1994). An efficient implementation and evaluation of Reid’s multiple hypothesis tracking algorithm for visual tracking. In ICPR-94, Vol. 1, pp. 437–442. 42. Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2, 83–97. |
Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
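Editorial note: the predict/update cycle described in this entry (predict with the transition model, then update the belief state with the sensor model) can be sketched concretely as hidden Markov model forward filtering. The following Python sketch is an illustration, not code from Russell and Norvig; the two-state world and all probability values are assumed toy numbers.

```python
import numpy as np

# Minimal HMM forward filtering over a two-state world.
# All numbers are assumed toy values, not from Russell and Norvig.

T = np.array([[0.7, 0.3],                # transition model P(X_t | X_{t-1})
              [0.3, 0.7]])

SENSOR = {True:  np.array([0.9, 0.2]),   # P(percept = True  | X)
          False: np.array([0.1, 0.8])}   # P(percept = False | X)

def forward(belief, percept):
    """One filtering step: predict with the transition model,
    then update with the sensor model and normalize."""
    predicted = belief @ T                  # P(X_t | e_1:t-1)
    updated = SENSOR[percept] * predicted   # unnormalized P(X_t | e_1:t)
    return updated / updated.sum()

belief = np.array([0.5, 0.5])               # initial belief state
for percept in [True, True, False]:
    belief = forward(belief, percept)
    print(belief)                            # belief state after each percept
```

Because the model is assumed stationary (the same T and SENSOR at every step), one fixed pair of arrays suffices for the whole sequence, which is exactly the simplification the entry describes.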
Chess | Norvig | Norvig I 192 Chess/artificial intelligence/Norvig/Russell: Chess was one of the first tasks undertaken in AI, with early efforts by many of the pioneers of computing, including Konrad Zuse in 1945, Norbert Wiener in his book Cybernetics (1948)(1), and Alan Turing in 1950 (see Turing et al., 1953)(2). But it was Claude Shannon’s article Programming a Computer for Playing Chess (1950)(3) that had the most complete set of ideas, describing a representation for board positions, an evaluation function, quiescence search, and some ideas for selective (nonexhaustive) game-tree search. Slater (1950)(4) and the commentators on his article also explored the possibilities for computer chess play. D. G. Prinz (1952)(5) completed a program that solved chess endgame problems but did not play a full game. Stan Ulam and a group at the Los Alamos National Lab produced a program that played chess on a 6×6 board with no bishops (Kister et al., 1957)(6). It could search 4 plies deep in about 12 minutes. Alex Bernstein wrote the first documented program to play a full game of standard chess (Bernstein and Roberts, 1958)(7). The first computer chess match featured the Kotok–McCarthy program from MIT (Kotok, 1962)(8) and the ITEP program written in the mid-1960s at Moscow’s Institute of Theoretical and Experimental Physics (Adelson-Velsky et al., 1970)(9). The first chess program to compete successfully with humans was MIT’s MACHACK-6 (Greenblatt et al., 1967)(10). The $10,000 prize for the first program to achieve a USCF (United States Chess Federation) rating of 2500 (near the grandmaster level) was awarded to DEEP THOUGHT (Hsu et al., 1990)(11) in 1989. The grand prize, $100,000, went to DEEP BLUE (Campbell et al., 2002(12); Hsu, 2004(13)) for its landmark victory over world champion Garry Kasparov in a 1997 exhibition match. Norvig I 193 HYDRA (Donninger and Lorenz, 2004(14)) is rated somewhere between 2850 and 3000, based mostly on its trouncing of Michael Adams. The RYBKA program is rated between 2900 and 3100, but this is based on a small number of games and is not considered reliable. Ross (2004)(15) shows how human players have learned to exploit some of the weaknesses of the computer programs. 1. Wiener, N. (1948). Cybernetics. Wiley. 2. Turing, A., Strachey, C., Bates, M. A., and Bowden, B. V. (1953). Digital computers applied to games. In Bowden, B. V. (Ed.), Faster than Thought, pp. 286–310. Pitman. 3. Shannon, C. E. (1950). Programming a computer for playing chess. Philosophical Magazine, 41(4), 256–275. 4. Slater, E. (1950). Statistics for the chess computer and the factor of mobility. In Symposium on Information Theory, pp. 150–152. Ministry of Supply. 5. Prinz, D. G. (1952). Robot chess. Research, 5, 261–266. 6. Kister, J., Stein, P., Ulam, S., Walden, W., and Wells, M. (1957). Experiments in chess. JACM, 4, 174–177. 7. Bernstein, A. and Roberts, M. (1958). Computer vs. chess player. Scientific American, 198(6), 96–105. 8. Kotok, A. (1962). A chess playing program for the IBM 7090. AI project memo 41, MIT Computation Center. 9. Adelson-Velsky, G. M., Arlazarov, V. L., Bitman, A. R., Zhivotovsky, A. A., and Uskov, A. V. (1970). Programming a computer to play chess. Russian Mathematical Surveys, 25, 221–262. 10. Greenblatt, R. D., Eastlake, D. E., and Crocker, S. D. (1967). The Greenblatt chess program. In Proc. Fall Joint Computer Conference, pp. 801–810. 11. Hsu, F.-H., Anantharaman, T. S., Campbell, M. S., and Nowatzyk, A. (1990). A grandmaster chess machine.
Scientific American, 263(4), 44–50. 12. Campbell, M. S., Hoane, A. J., and Hsu, F.-H. (2002). Deep Blue. AIJ, 134(1–2), 57–83. 13. Hsu, F.-H. (2004). Behind Deep Blue: Building the Computer that Defeated the World Chess Champion. Princeton University Press. 14. Donninger, C. and Lorenz, U. (2004). The chess monster Hydra. In Proc. 14th International Conference on Field-Programmable Logic and Applications, pp. 927–932. 15. Ross, P. E. (2004). Psyching out computer chess players. IEEE Spectrum, 41(2), 14–15. |
Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
Chess | Russell | Norvig I 192 Chess/artificial intelligence/Norvig/Russell: Chess was one of the first tasks undertaken in AI, with early efforts by many of the pioneers of computing, including Konrad Zuse in 1945, Norbert Wiener in his book Cybernetics (1948)(1), and Alan Turing in 1950 (see Turing et al., 1953)(2). But it was Claude Shannon’s article Programming a Computer for Playing Chess (1950)(3) that had the most complete set of ideas, describing a representation for board positions, an evaluation function, quiescence search, and some ideas for selective (nonexhaustive) game-tree search. Slater (1950)(4) and the commentators on his article also explored the possibilities for computer chess play. D. G. Prinz (1952)(5) completed a program that solved chess endgame problems but did not play a full game. Stan Ulam and a group at the Los Alamos National Lab produced a program that played chess on a 6×6 board with no bishops (Kister et al., 1957)(6). It could search 4 plies deep in about 12 minutes. Alex Bernstein wrote the first documented program to play a full game of standard chess (Bernstein and Roberts, 1958)(7). The first computer chess match featured the Kotok–McCarthy program from MIT (Kotok, 1962)(8) and the ITEP program written in the mid-1960s at Moscow’s Institute of Theoretical and Experimental Physics (Adelson-Velsky et al., 1970)(9). The first chess program to compete successfully with humans was MIT’s MACHACK-6 (Greenblatt et al., 1967)(10). The $10,000 prize for the first program to achieve a USCF (United States Chess Federation) rating of 2500 (near the grandmaster level) was awarded to DEEP THOUGHT (Hsu et al., 1990)(11) in 1989. The grand prize, $100,000, went to DEEP BLUE (Campbell et al., 2002(12); Hsu, 2004(13)) for its landmark victory over world champion Garry Kasparov in a 1997 exhibition match. Norvig I 193 HYDRA (Donninger and Lorenz, 2004(14)) is rated somewhere between 2850 and 3000, based mostly on its trouncing of Michael Adams. The RYBKA program is rated between 2900 and 3100, but this is based on a small number of games and is not considered reliable. Ross (2004)(15) shows how human players have learned to exploit some of the weaknesses of the computer programs. >Artificial Intelligence, >Machine Learning. 1. Wiener, N. (1948). Cybernetics. Wiley. 2. Turing, A., Strachey, C., Bates, M. A., and Bowden, B. V. (1953). Digital computers applied to games. In Bowden, B. V. (Ed.), Faster than Thought, pp. 286–310. Pitman. 3. Shannon, C. E. (1950). Programming a computer for playing chess. Philosophical Magazine, 41(4), 256–275. 4. Slater, E. (1950). Statistics for the chess computer and the factor of mobility. In Symposium on Information Theory, pp. 150–152. Ministry of Supply. 5. Prinz, D. G. (1952). Robot chess. Research, 5, 261–266. 6. Kister, J., Stein, P., Ulam, S., Walden, W., and Wells, M. (1957). Experiments in chess. JACM, 4, 174–177. 7. Bernstein, A. and Roberts, M. (1958). Computer vs. chess player. Scientific American, 198(6), 96–105. 8. Kotok, A. (1962). A chess playing program for the IBM 7090. AI project memo 41, MIT Computation Center. 9. Adelson-Velsky, G. M., Arlazarov, V. L., Bitman, A. R., Zhivotovsky, A. A., and Uskov, A. V. (1970). Programming a computer to play chess. Russian Mathematical Surveys, 25, 221–262. 10. Greenblatt, R. D., Eastlake, D. E., and Crocker, S. D. (1967). The Greenblatt chess program. In Proc. Fall Joint Computer Conference, pp. 801–810. 11. Hsu, F.-H., Anantharaman, T. S., Campbell, M. S., and Nowatzyk, A. (1990).
A grandmaster chess machine. Scientific American, 263(4), 44–50. 12. Campbell, M. S., Hoane, A. J., and Hsu, F.-H. (2002). Deep Blue. AIJ, 134(1–2), 57–83. 13. Hsu, F.-H. (2004). Behind Deep Blue: Building the Computer that Defeated the World Chess Champion. Princeton University Press. 14. Donninger, C. and Lorenz, U. (2004). The chess monster Hydra. In Proc. 14th International Conference on Field-Programmable Logic and Applications, pp. 927–932. 15. Ross, P. E. (2004). Psyching out computer chess players. IEEE Spectrum, 41(2), 14–15. |
Russell I B. Russell/A.N. Whitehead Principia Mathematica Frankfurt 1986 Russell II B. Russell The ABC of Relativity, London 1958, 1969 German Edition: Das ABC der Relativitätstheorie Frankfurt 1989 Russell IV B. Russell The Problems of Philosophy, Oxford 1912 German Edition: Probleme der Philosophie Frankfurt 1967 Russell VI B. Russell "The Philosophy of Logical Atomism", in: B. Russell, Logic and Knowledge, ed. R. Ch. Marsh, London 1956, pp. 200-202 German Edition: Die Philosophie des logischen Atomismus In Eigennamen, U. Wolf (Hg) Frankfurt 1993 Russell VII B. Russell On the Nature of Truth and Falsehood, in: B. Russell, The Problems of Philosophy, Oxford 1912 - Dt. "Wahrheit und Falschheit" In Wahrheitstheorien, G. Skirbekk (Hg) Frankfurt 1996 Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
Communication | Bateson | I 91 Communication/time use/acting/action/Bateson: communication through action takes place only in the present. Reference to time can be made only in language. An animal cannot express by its behaviour: I will not bite you. >Time, >Past, >Present, >Future. I 376 Communication/Logic/Paradoxes/Bateson: there is almost no formal theory dealing with analogue communication and, in particular, no equivalent of information theory or logical type theory. >Paradoxes, >Type theory. |
Bt I G. Bateson Steps to an Ecology of Mind, Collected Essays in Anthropology, Psychiatry, Evolution, and Epistemology, San Francisco 1972 German Edition: Ökologie des Geistes. Anthropologische, psychologische, biologische und epistemologische Perspektiven Frankfurt 1985 |
Communication | Flusser | I 12 Communication/communication theory/Flusser: Thesis: Flusser understands communication theory interpretively, not in the sense of information theory. >Communication theory, >Interpretation, cf. >Information theory. |
Fl I V. Flusser Kommunikologie Mannheim 1996 |
Complexes/Complexity | Norvig | Norvig I 712 Complexity/AI Research/Norvig/Russell: [one way of reducing complexity is] model selection with cross-validation on model size. An alternative approach is to search for a hypothesis that directly minimizes the weighted sum of Norvig I 713 empirical loss and the complexity of the hypothesis, which we will call the total cost: Cost(h) = EmpLoss(h) + λ Complexity(h), ĥ* = argmin_{h ∈ H} Cost(h). Here λ is a parameter, a positive number that serves as a conversion rate between loss and hypothesis complexity (which after all are not measured on the same scale). This approach combines loss and complexity into one metric, allowing us to find the best hypothesis all at once. Regularization: This process of explicitly penalizing complex hypotheses is called regularization (because it looks for a function that is more regular, or less complex). Note that the cost function requires us to make two choices: the loss function and the complexity measure, which is called a regularization function. The choice of regularization function depends on the hypothesis space. Another way to simplify models is to reduce the dimensions that the models work with. A process of feature selection can be performed to discard attributes that appear to be irrelevant. χ2 pruning is a kind of feature selection. MDL: The minimum description length or MDL hypothesis minimizes the total number of bits required. VsMDL: This works well in the limit, but for smaller problems there is a difficulty in that the choice of encoding for the program - for example, how best to encode a decision tree as a bit string - affects the outcome. >Learning theory/Norvig, >Learning/AI Research. Norvig I 759 History: Whereas the identification-in-the-limit approach concentrates on eventual convergence, the study of Kolmogorov complexity or algorithmic complexity, developed independently by Solomonoff (1964(1), 2009(2)) and Kolmogorov (1965)(3), attempts to provide a formal definition for the notion of simplicity used in Ockham’s razor. To escape the problem that simplicity depends on the way in which information is represented, it is proposed that simplicity be measured by the length of the shortest program for a universal Turing machine that correctly reproduces the observed data. Although there are many possible universal Turing machines, and hence many possible “shortest” programs, these programs differ in length by at most a constant that is independent of the amount of data. This beautiful insight, which essentially shows that any initial representation bias will eventually be overcome by the data itself, is marred only by the undecidability of computing the length of the shortest program. Approximate measures such as the minimum description length, or MDL (Rissanen, 1984(4), 2007(5)) can be used instead and have produced excellent results in practice. The text by Li and Vitanyi (1993)(6) is the best source for Kolmogorov complexity. Norvig I 762 The complexity of neural network learning has been investigated by researchers in computational learning theory. Early computational results were obtained by Judd (1990)(7), who showed that the general problem of finding a set of weights consistent with a set of examples is NP-complete, even under very restrictive assumptions. Some of the first sample complexity results were obtained by Baum and Haussler (1989)(8), who showed that the number of examples required for effective learning grows as roughly W log W, where W is the number of weights.
Since then, a much more sophisticated theory has been developed (Anthony and Bartlett, 1999)(9), including the important result that the representational capacity of a network depends on the size of the weights as well as on their number, a result that should not be surprising in the light of our discussion of regularization. 1. Solomonoff, R. J. (1964). A formal theory of inductive inference. Information and Control, 7, 1–22, 224–254. 2. Solomonoff, R. J. (2009). Algorithmic probability-theory and applications. In Emmert-Streib, F. and Dehmer, M. (Eds.), Information Theory and Statistical Learning. Springer. 3. Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Problems in Information Transmission, 1(1), 1–7. 4. Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, IT-30(4), 629–636. 5. Rissanen, J. (2007). Information and Complexity in Statistical Modeling. Springer. 6. Li, M. and Vitanyi, P. M. B. (1993). An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag. 7. Judd, J. S. (1990). Neural Network Design and the Complexity of Learning. MIT Press. 8. Baum, E. and Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1(1), 151–160. 9. Anthony, M. and Bartlett, P. (1999). Neural Network Learning: Theoretical Foundations. Cambridge University Press. |
Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
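Editorial note: the total-cost formula Cost(h) = EmpLoss(h) + λ Complexity(h) can be read directly as a selection loop over candidate hypotheses. The Python sketch below is an illustration, not the book's code; the polynomial hypothesis space, squared-error loss, and degree-as-complexity regularization function are all assumed stand-ins.

```python
import numpy as np

def total_cost(h, data, lam, emp_loss, complexity):
    """Cost(h) = EmpLoss(h) + lambda * Complexity(h)."""
    return emp_loss(h, data) + lam * complexity(h)

def select_hypothesis(hypotheses, data, lam, emp_loss, complexity):
    """h* = argmin over h in H of Cost(h): regularized model selection."""
    return min(hypotheses,
               key=lambda h: total_cost(h, data, lam, emp_loss, complexity))

# Stand-in hypothesis space: polynomial fits of degree 0..3.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 0.9, 2.1, 2.9])
hypotheses = [np.polyfit(xs, ys, d) for d in range(4)]
emp_loss = lambda h, d: float(np.mean((np.polyval(h, d[0]) - d[1]) ** 2))
complexity = lambda h: len(h) - 1    # degree as the regularization function
best = select_hypothesis(hypotheses, (xs, ys), lam=0.05,
                         emp_loss=emp_loss, complexity=complexity)
print("chosen degree:", len(best) - 1)
```

With λ = 0, the loop degenerates to pure empirical-loss minimization and picks the most complex exact fit; raising λ trades fit for simplicity, which is the conversion-rate role of λ described in the entry.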
Complexes/Complexity | Russell | Norvig I 712 Complexity/AI Research/Norvig/Russell: [one way of reducing complexity is] model selection with cross-validation on model size. An alternative approach is to search for a hypothesis that directly minimizes the weighted sum of Norvig I 713 empirical loss and the complexity of the hypothesis, which we will call the total cost: Cost(h) = EmpLoss(h) + λ Complexity(h), ĥ* = argmin_{h ∈ H} Cost(h). Here λ is a parameter, a positive number that serves as a conversion rate between loss and hypothesis complexity (which after all are not measured on the same scale). This approach combines loss and complexity into one metric, allowing us to find the best hypothesis all at once. Regularization: This process of explicitly penalizing complex hypotheses is called regularization (because it looks for a function that is more regular, or less complex). Note that the cost function requires us to make two choices: the loss function and the complexity measure, which is called a regularization function. The choice of regularization function depends on the hypothesis space. Another way to simplify models is to reduce the dimensions that the models work with. A process of feature selection can be performed to discard attributes that appear to be irrelevant. χ2 pruning is a kind of feature selection. MDL: The minimum description length or MDL hypothesis minimizes the total number of bits required. VsMDL: This works well in the limit, but for smaller problems there is a difficulty in that the choice of encoding for the program - for example, how best to encode a decision tree as a bit string - affects the outcome. >Learning theory/Norvig, >Learning/AI Research. Norvig I 759 History: Whereas the identification-in-the-limit approach concentrates on eventual convergence, the study of Kolmogorov complexity or algorithmic complexity, developed independently by Solomonoff (1964(1), 2009(2)) and Kolmogorov (1965)(3), attempts to provide a formal definition for the notion of simplicity used in Ockham’s razor. To escape the problem that simplicity depends on the way in which information is represented, it is proposed that simplicity be measured by the length of the shortest program for a universal Turing machine that correctly reproduces the observed data. Although there are many possible universal Turing machines, and hence many possible “shortest” programs, these programs differ in length by at most a constant that is independent of the amount of data. This beautiful insight, which essentially shows that any initial representation bias will eventually be overcome by the data itself, is marred only by the undecidability of computing the length of the shortest program. Approximate measures such as the minimum description length, or MDL (Rissanen, 1984(4), 2007(5)) can be used instead and have produced excellent results in practice. The text by Li and Vitanyi (1993)(6) is the best source for Kolmogorov complexity. Norvig I 762 The complexity of neural network learning has been investigated by researchers in computational learning theory. Early computational results were obtained by Judd (1990)(7), who showed that the general problem of finding a set of weights consistent with a set of examples is NP-complete, even under very restrictive assumptions. Some of the first sample complexity results were obtained by Baum and Haussler (1989)(8), who showed that the number of examples required for effective learning grows as roughly W log W, where W is the number of weights.
Since then, a much more sophisticated theory has been developed (Anthony and Bartlett, 1999)(9), including the important result that the representational capacity of a network depends on the size of the weights as well as on their number, a result that should not be surprising in the light of our discussion of regularization. 1. Solomonoff, R. J. (1964). A formal theory of inductive inference. Information and Control, 7, 1–22, 224–254. 2. Solomonoff, R. J. (2009). Algorithmic probability-theory and applications. In Emmert-Streib, F. and Dehmer, M. (Eds.), Information Theory and Statistical Learning. Springer. 3. Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Problems in Information Transmission, 1(1), 1–7. 4. Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, IT-30(4), 629–636. 5. Rissanen, J. (2007). Information and Complexity in Statistical Modeling. Springer. 6. Li, M. and Vitanyi, P. M. B. (1993). An Introduction to Kolmogorov Complexity and Its Applications. Springer-Verlag. 7. Judd, J. S. (1990). Neural Network Design and the Complexity of Learning. MIT Press. 8. Baum, E. and Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1(1), 151–160. 9. Anthony, M. and Bartlett, P. (1999). Neural Network Learning: Theoretical Foundations. Cambridge University Press. |
Russell I B. Russell/A.N. Whitehead Principia Mathematica Frankfurt 1986 Russell II B. Russell The ABC of Relativity, London 1958, 1969 German Edition: Das ABC der Relativitätstheorie Frankfurt 1989 Russell IV B. Russell The Problems of Philosophy, Oxford 1912 German Edition: Probleme der Philosophie Frankfurt 1967 Russell VI B. Russell "The Philosophy of Logical Atomism", in: B. Russell, Logic and Knowledge, ed. R. Ch. Marsh, London 1956, pp. 200-202 German Edition: Die Philosophie des logischen Atomismus In Eigennamen, U. Wolf (Hg) Frankfurt 1993 Russell VII B. Russell On the Nature of Truth and Falsehood, in: B. Russell, The Problems of Philosophy, Oxford 1912 - Dt. "Wahrheit und Falschheit" In Wahrheitstheorien, G. Skirbekk (Hg) Frankfurt 1996 Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
Decision Tree | Norvig | Norvig I 698 Def Decision tree/Norvig/Russell: A decision tree represents a function that takes as input a vector of attribute values and returns a “decision”—a single output value. The input and output values can be discrete or continuous. A decision tree reaches its decision by performing a sequence of tests. Each internal node in the tree corresponds to a test of the value of one of the input attributes, Ai, and the branches from the node are labeled with the possible values of the attribute, Ai = vik. Each leaf node in the tree specifies a value to be returned by the function. A Boolean decision tree is logically equivalent to the assertion that the goal attribute is true if and only if the input attributes satisfy one of the paths leading to a leaf with value true. Writing this out in propositional logic, we have Goal ⇔ (Path1 ∨ Path2 ∨ · · ·), where each Path is a conjunction of attribute-value tests required to follow that path. Thus, the whole expression is equivalent to disjunctive normal form. ((s) >Normal form/logic.) Unfortunately, no matter how we measure size, it is an intractable problem to find the smallest consistent tree; there is no way to efficiently search through the 2^(2^n) trees. With some simple heuristics, however, we can find a good approximate solution: a small (but not smallest) consistent tree. The decision tree learning algorithm adopts a greedy divide-and-conquer strategy: always test the most important attribute first. This test divides the problem up into smaller subproblems that can then be solved recursively. “Most important attribute”: the one that makes the most difference to the classification of an example. Decision tree learning algorithm: see Norvig I 702. Norvig I 705 Problems: the decision tree learning algorithm will generate a large tree when there is actually no pattern to be found. Overfitting: the algorithm will seize on any pattern it can find in the input. If it turns out that there are 2 rolls of a 7-gram blue die with fingers crossed and they both come out 6, then the algorithm may construct a path that predicts 6 in that case. Solution: decision tree pruning combats overfitting. Pruning works by eliminating nodes that are not clearly relevant. Norvig I 706 Missing data: In many domains, not all the attribute values will be known for every example. Norvig I 707 Multivalued attributes: When an attribute has many possible values, the information gain measure gives an inappropriate indication of the attribute’s usefulness. In the extreme case, an attribute such as exact time has a different value for every example, which means each subset of examples is a singleton with a unique classification, and the information gain measure would have its highest value for this attribute. Continuous and integer-valued input attributes: Continuous or integer-valued attributes, such as height and weight, have an infinite set of possible values. Rather than generate infinitely many branches, decision-tree learning algorithms typically find the split point that gives the highest information gain. Continuous-valued output attributes: If we are trying to predict a numerical output value, such as the price of an apartment, then we need a regression tree rather than a classification tree. A regression tree has at each leaf a linear function of some subset of numerical attributes, rather than a single value. >Learning/AI Research.
Norvig I 758 History: The first notable use of decision trees was in EPAM, the “Elementary Perceiver And Memorizer” (Feigenbaum, 1961)(1), which was a simulation of human concept learning. ID3 (Quinlan, 1979)(2) added the crucial idea of choosing the attribute with maximum entropy; it is the basis for the decision tree algorithm in this chapter. Information theory was developed by Claude Shannon to aid in the study of communication (Shannon and Weaver, 1949)(3). (Shannon also contributed one of the earliest examples of machine learning, a mechanical mouse named Theseus that learned to navigate through a maze by trial and error.) The χ2 method of tree pruning was described by Quinlan (1986)(4). C4.5, an industrial-strength decision tree package, can be found in Quinlan (1993)(5). An independent tradition of decision tree learning exists in the statistical literature. Classification and Regression Trees (Breiman et al., 1984)(6), known as the “CART book,” is the principal reference. 1. Feigenbaum, E. A. (1961). The simulation of verbal learning behavior. Proc. Western Joint Computer Conference, 19, 121-131. 2. Quinlan, J. R. (1979). Discovering rules from large collections of examples: A case study. In Michie, D. (Ed.), Expert Systems in the Microelectronic Age. Edinburgh University Press. 3. Shannon, C. E. and Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press. 4. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106. 5. Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann. 6. Breiman, L., Friedman, J., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth International Group. |
Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
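Editorial note: the "most important attribute" in the greedy strategy above is standardly chosen by entropy-based information gain, the idea ID3 introduced. The following Python sketch is an illustration with assumed toy data, not the book's code; the attribute names "patrons" and "hungry" echo the restaurant example but are placeholders.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum_k P(v_k) * log2 P(v_k) over the class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    """Expected reduction in entropy from splitting on attribute attr."""
    remainder = 0.0
    for v in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == v]
        remainder += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - remainder

# Assumed toy data; attribute names and values are placeholders.
examples = [{"patrons": "full", "hungry": True},
            {"patrons": "some", "hungry": False},
            {"patrons": "full", "hungry": True},
            {"patrons": "none", "hungry": False}]
labels = [False, True, True, False]          # the goal attribute
best = max(["patrons", "hungry"],
           key=lambda a: information_gain(examples, labels, a))
print("root test:", best)                    # the most important attribute
```

The greedy learner would place the winning attribute at the root and recurse on each branch's subset, which is the divide-and-conquer step the entry describes.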
Everyday Language | Lyons | I 92 Everyday language/information/slang/Lyons: In colloquial language there is a tendency to replace frequently used words with longer, "more colourful" synonyms because their information content has been worn out by frequent use. Slang changes frequently. >Language, >Language use, >Metaphors, >Metonymies, >Synonymy, >Information. I 100 Information/Information Theory/Linguistics/Lyons: Dilemma: 1. Statistical considerations are important for understanding the development and operation of language. 2. It is practically impossible to calculate the information exactly here. I 101 Solution: Linguistics today is more concerned with the structure of sentences than with utterances in concrete situations. >Syntax, >Semantics, >Grammar. |
Ly II John Lyons Semantics Cambridge, MA 1977 Lyons I John Lyons Introduction to Theoretical Lingustics, Cambridge/MA 1968 German Edition: Einführung in die moderne Linguistik München 1995 |
Information | Information, information theory: A character or a combination of characters contains information when it is clear to the recipient that this character or combination appears instead of another possible one. The supply of possible characters determines in part the probability of the occurrence of a character from this supply. In addition, the expected probability of a character's appearance can be raised by prior experience of regularities. The amount of information transmitted by a character depends on the improbability of that character's occurrence. |
||
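Editorial note: the dependence of information content on improbability is captured by Shannon's self-information, I(x) = -log2 p(x): the less probable a character's occurrence given the supply and prior regularities, the more bits it carries. A minimal Python sketch with assumed illustrative probabilities:

```python
from math import log2

def self_information(p):
    """I(x) = -log2 p(x): the less probable a character, the more bits."""
    return -log2(p)

print(self_information(1 / 26))   # ~4.70 bits: uniform 26-character supply
print(self_information(0.5))      # 1 bit: maximally uncertain binary choice
print(self_information(0.95))     # ~0.07 bits: an almost expected character
```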
Information | Kelly | I 958 Information/Kelly: the term is ambiguous: a) a certain number of bits; b) a significant signal. Signals/Kelly: as entropy increases, the bits increase, but the number of signals decreases. I use information here in the second sense: information is a signal that makes a difference. Cf. >Communication, >Message, >Information theory. |
Kelly I Kevin Kelly What Technology Wants New York 2011 |
Information | Lyons | I 100 Information/Information Theory/Linguistics/Lyons: Dilemma: 1. Statistical considerations are important for understanding the development and operation of language. 2. It is practically impossible to calculate the information exactly here. >Language evolution, >Language acquisition, >Language use, >Everyday language. I 101 Solution: Linguistics today is more concerned with the structure of sentences than with expressions in concrete situations. >Expressions, >Situations, >Linguistics, >Grammar. |
Ly II John Lyons Semantics Cambridge, MA 1977 Lyons I John Lyons Introduction to Theoretical Lingustics, Cambridge/MA 1968 German Edition: Einführung in die moderne Linguistik München 1995 |
Observation | Frith | I 156 Definition Ideal Bayesian Observer/Information Theory/Frith: the ideal Bayesian observer maximizes the mutual information between the world and itself. >Ideal observers, >Information, >Information theory, >Understanding, >Learning. |
Frith I Chris Frith Making up the Mind: How the Brain Creates Our Mental World, Hoboken/NJ 2007 German Edition: Wie unser Gehirn die Welt erschafft Heidelberg 2013 |
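Editorial note: mutual information here is Shannon's I(W;O) = Σ p(w,o) log2 [p(w,o) / (p(w) p(o))], which measures how much the observer's percepts tell it about world states. The Python sketch below is an illustration; the joint distribution is an assumed toy example, not from Frith.

```python
import numpy as np

def mutual_information(joint):
    """I(W;O) = sum over w,o of p(w,o) * log2( p(w,o) / (p(w) * p(o)) )."""
    pw = joint.sum(axis=1, keepdims=True)    # marginal over world states
    po = joint.sum(axis=0, keepdims=True)    # marginal over observations
    nz = joint > 0                           # skip zero-probability cells
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pw @ po)[nz])))

# Assumed toy joint distribution p(world state, observation).
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(mutual_information(joint))             # ~0.278 bits
```

An ideal observer in Frith's sense would adjust its model of the world so that this quantity is as large as its senses permit.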
Signals | Kelly | I 958 Signals/Kelly: as entropy increases, the bits increase, but the number of signals decreases. >Information, >Information theory, >Entropy, >Communication, >Code. |
Kelly I Kevin Kelly What Technology Wants New York 2011 |
Disputed term/author/ism | Author Vs Author | Entry | Reference |
---|---|---|---|
Information Theory | Skepticism Vs Information Theory | Brendel I 213 Information theory/internalism/information/Brendel: between the two there is still a gap, to which both Gettier and skepticism point. SkepticismVsInformation theory: one can never be certain that a signal r actually carries the information that s is F. ((s) >Interpretation). One would have to rule out all alternatives, and that is impossible for finite subjects. |
Bre I E. Brendel Wahrheit und Wissen Paderborn 1999 |
Luhmann, N. | Flusser Vs Luhmann, N. | I 11 Flusser: Communication theory is interpretive, not in the sense of information theory. I 13 Interpretation/FlusserVsMoles, FlusserVsLuhmann: interpretation instead of explanation: human communication is then recognized not as an improbable process but as intention. |
Fl I V. Flusser Kommunikologie Mannheim 1996 |