Disputed term/author/ism | Author | Entry | Reference |
---|---|---|---|
Adaptation | Deacon | I 330 Adaptation/brain/language/Deacon: when it comes to whether the brain adapts to certain requirements of language processing and language acquisition, the following is crucial: apart from the fact that a noun remains a noun and a tense change a tense change, regardless of the words involved, there must be further invariants across individuals, certain functions that are processed in the same way under all conditions. There must be certain invariant sensorimotor or mnemonic features that could be adapted. >Language acquisition, >Invariance, >Covariance, >Language processing. I 331 Such features are present in the case of animal alarm calls, e.g. when a distinction is made between predators on the ground and predators in the air. I 331/332 Symbols/symbolic learning/adaptation/Deacon: it is precisely the complex structure that symbols form among each other that makes it impossible for them to be genetically assimilated. >Symbols/Deacon. Most grammatical operations have no direct connection to things in the world. Therefore, there is hardly any innate reference in human language. Grammatical peculiarities also change from language to language, so that there is little constancy for possible adaptation. Deep structure: perhaps this is what is open to adaptation? In order for a function to be adapted, it is not necessary that a certain place in the brain remain constant for this function. Newer theories speak rather of "language programs" and "data structures". >Deep structure, >Grammar, >Symbolic learning. I 333 The way these structures are distributed in the brain should remain invariant if they are assimilated in an evolutionary process. >Language origins. Speech processing: interestingly, it is not the logical operations but the analysis of the physical signals that is assigned to specific brain regions. This has major consequences: the grammatical structures are the ones that have had the least chance of establishing a fixed place in the brain for their processing. >Brain/Deacon. |
Dea I T. W. Deacon The Symbolic Species: The Co-evolution of Language and the Brain New York 1998 Dea II Terrence W. Deacon Incomplete Nature: How Mind Emerged from Matter New York 2013 |
Artificial Neural Networks | Norvig | Norvig I 728 Artificial Neural Networks/Norvig/Russell: Neural networks are composed of nodes or units (…) connected by directed links. A link from unit i to unit j serves to propagate the activation a_i from i to j. Each link also has a numeric weight w_i,j associated with it, which determines the strength and sign of the connection. Just as in linear regression models, each unit has a dummy input a_0 = 1 with an associated weight w_0,j. Norvig I 729 Perceptrons: The activation function g is typically either a hard threshold (…), in which case the unit is called a perceptron, or a logistic function (…), in which case the term sigmoid perceptron is sometimes used. Both of these nonlinear activation functions ensure the important property that the entire network of units can represent a nonlinear function (…). Forms of a network: a) A feed-forward network has connections only in one direction - that is, it forms a directed acyclic graph. Every node receives input from "upstream" nodes and delivers output to "downstream" nodes; there are no loops. A feed-forward network represents a function of its current input; thus, it has no internal state other than the weights themselves. b) A recurrent network, on the other hand, feeds its outputs back into its own inputs. This means that the activation levels of the network form a dynamical system that may reach a stable state or exhibit oscillations or even chaotic behavior. Layers: a) Feed-forward networks are usually arranged in layers, such that each unit receives input only from units in the immediately preceding layer. b) Multilayer networks have one or more layers of hidden units that are not connected to the outputs of the network. Training/Learning: For example, if we want to train a network to add two input bits, each a 0 or a 1, we will need one output for the sum bit and one for the carry bit. Also, when the learning problem involves classification into more than two classes - for example, when learning to categorize images of handwritten digits - it is common to use one output unit for each class. Norvig I 731 Any desired functionality can be obtained by connecting large numbers of units into (possibly recurrent) networks of arbitrary depth. The problem was that nobody knew how to train such networks. This turns out to be an easy problem if we think of a network the right way: as a function h_w(x) parameterized by the weights w. Norvig I 732 (…) we have the output expressed as a function of the inputs and the weights. (…) because the function represented by a network can be highly nonlinear - composed, as it is, of nested nonlinear soft threshold functions - we can see neural networks as a tool for doing nonlinear regression. Norvig I 736 Learning in neural networks: just as with >Bayesian networks, we also need to understand how to find the best network structure. If we choose a network that is too big, it will be able to memorize all the examples by forming a large lookup table, but will not necessarily generalize well to inputs that have not been seen before. Norvig I 737 Optimal brain damage: The optimal brain damage algorithm begins with a fully connected network and removes connections from it. After the network is trained for the first time, an information-theoretic approach identifies an optimal selection of connections that can be dropped. The network is then retrained, and if its performance has not decreased then the process is repeated. 
In addition to removing connections, it is also possible to remove units that are not contributing much to the result. Parametric models: A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won't change its mind about how many parameters it needs. Nonparametric models: A nonparametric model is one that cannot be characterized by a bounded set of parameters. For example, suppose that each hypothesis we generate simply retains within itself all of the training examples and uses all of them to predict the next example. Such a hypothesis family would be nonparametric because the effective number of parameters is unbounded - it grows with the number of examples. This approach is called instance-based learning or memory-based learning. The simplest instance-based learning method is table lookup: take all the training examples, put them in a lookup table, and then when asked for h(x), see if x is in the table; (…). Norvig I 738 We can improve on table lookup with a slight variation: given a query x_q, find the k examples that are nearest to x_q. This is called k-nearest neighbors lookup. ((s) Cf. >Local/global/Philosophical theories.) Norvig I 744 Support vector machines/SVM: The support vector machine or SVM framework is currently the most popular approach for "off-the-shelf" supervised learning: if you don't have any specialized prior knowledge about a domain, then the SVM is an excellent method to try first. Properties of SVMs: 1. SVMs construct a maximum margin separator - a decision boundary with the largest possible distance to example points. This helps them generalize well. 2. SVMs create a linear separating hyperplane, but they have the ability to embed the data into a higher-dimensional space, using the so-called kernel trick. 3. SVMs are a nonparametric method - they retain training examples and potentially need to store them all. On the other hand, in practice they often end up retaining only a small fraction of the number of examples - sometimes as few as a small constant times the number of dimensions. Norvig I 745 Instead of minimizing expected empirical loss on the training data, SVMs attempt to minimize expected generalization loss. We don't know where the as-yet-unseen points may fall, but under the probabilistic assumption that they are drawn from the same distribution as the previously seen examples, there are some arguments from computational learning theory (…) suggesting that we minimize generalization loss by choosing the separator that is farthest away from the examples we have seen so far. Norvig I 748 Ensemble Learning: >Learning/AI Research. Norvig I 757 Linear regression is a widely used model. The optimal parameters of a linear regression model can be found by gradient descent search, or computed exactly. A linear classifier with a hard threshold - also known as a perceptron - can be trained by a simple weight update rule to fit data that are linearly separable. In other cases, the rule fails to converge. Norvig I 758 Logistic regression replaces the perceptron's hard threshold with a soft threshold defined by a logistic function. Gradient descent works well even for noisy data that are not linearly separable. 
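To make the weight-update idea concrete, here is a minimal sketch; it is not code from Russell and Norvig, and the OR dataset, learning rate, and epoch count are illustrative assumptions:

```python
# A minimal sketch: one sigmoid unit (logistic regression) trained by
# gradient descent on squared error. Dataset and hyperparameters are
# illustrative choices, not from Russell & Norvig.
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=5000, alpha=0.5):
    """examples: list of (inputs, label) pairs with label in {0, 1}.
    Returns weights; w[0] pairs with the dummy input a_0 = 1."""
    n = len(examples[0][0])
    random.seed(0)
    w = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, y in examples:
            a = [1.0] + list(x)          # prepend the dummy input a_0 = 1
            h = sigmoid(sum(wi * ai for wi, ai in zip(w, a)))
            # gradient step for L2 loss: w_i <- w_i + alpha*(y-h)*g'(in)*a_i,
            # where g'(in) = h*(1-h) for the logistic function g
            for i in range(len(w)):
                w[i] += alpha * (y - h) * h * (1 - h) * a[i]
    return w

# OR is linearly separable, so a single unit suffices (XOR would not be).
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train(data)
for x, y in data:
    h = sigmoid(sum(wi * ai for wi, ai in zip(w, [1.0] + list(x))))
    print(x, y, round(h, 2))
```

A hard-threshold perceptron would replace the sigmoid with a step function; the smooth, differentiable soft threshold is what makes the gradient-based update above possible.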
Norvig I 760 History: The term logistic function comes from Pierre-François Verhulst (1804–1849), a statistician who used the curve to model population growth with limited resources, a more realistic model than the unconstrained geometric growth proposed by Thomas Malthus. Verhulst called it the courbe logistique, because of its relation to the logarithmic curve. The term regression is due to Francis Galton, a nineteenth-century statistician, cousin of Charles Darwin, and initiator of the fields of meteorology, fingerprint analysis, and statistical correlation, who used it in the sense of regression to the mean. The term curse of dimensionality comes from Richard Bellman (1961)(1). Logistic regression can be solved with gradient descent, or with the Newton-Raphson method (Newton, 1671(2); Raphson, 1690(3)). A variant of the Newton method called L-BFGS is sometimes used for large-dimensional problems; the L stands for "limited memory," meaning that it avoids creating the full matrices all at once, and instead creates parts of them on the fly. BFGS are authors' initials (Byrd et al., 1995)(4). The ideas behind kernel machines come from Aizerman et al. (1964)(5) (who also introduced the kernel trick), but the full development of the theory is due to Vapnik and his colleagues (Boser et al., 1992)(6). SVMs were made practical with the introduction of the soft-margin classifier for handling noisy data in a paper that won the 2008 ACM Theory and Practice Award (Cortes and Vapnik, 1995)(7), and of the Sequential Minimal Optimization (SMO) algorithm for efficiently solving SVM problems using quadratic programming (Platt, 1999)(8). SVMs have proven to be very popular and effective for tasks such as text categorization (Joachims, 2001)(9), computational genomics (Cristianini and Hahn, 2007)(10), and natural language processing, such as the handwritten digit recognition of DeCoste and Schölkopf (2002)(11). As part of this process, many new kernels have been designed that work with strings, trees, and other non-numerical data types. A related technique that also uses the kernel trick to implicitly represent an exponential feature space is the voted perceptron (Freund and Schapire, 1999(12); Collins and Duffy, 2002(13)). Textbooks on SVMs include Cristianini and Shawe-Taylor (2000)(14) and Schölkopf and Smola (2002)(15). A friendlier exposition appears in the AI Magazine article by Cristianini and Schölkopf (2002)(16). Bengio and LeCun (2007)(17) show some of the limitations of SVMs and other local, nonparametric methods for learning functions that have a global structure but do not have local smoothness. Ensemble learning is an increasingly popular technique for improving the performance of learning algorithms. Bagging (Breiman, 1996)(18), the first effective method, combines hypotheses learned from multiple bootstrap data sets, each generated by subsampling the original data set. The boosting method described in this chapter originated with theoretical work by Schapire (1990)(19). The ADABOOST algorithm was developed by Freund and Schapire Norvig I 761 (1996)(20) and analyzed theoretically by Schapire (2003)(21). Friedman et al. (2000)(22) explain boosting from a statistician's viewpoint. Online learning is covered in a survey by Blum (1996)(23) and a book by Cesa-Bianchi and Lugosi (2006)(24). Dredze et al. 
(2008)(25) introduce the idea of confidence-weighted online learning for classification: in addition to keeping a weight for each parameter, they also maintain a measure of confidence, so that a new example can have a large effect on features that were rarely seen before (and thus had low confidence) and a small effect on common features that have already been well-estimated. 1. Bellman, R. E. (1961). Adaptive Control Processes: A Guided Tour. Princeton University Press. 2. Newton, I. (1664-1671). Methodus fluxionum et serierum infinitarum. Unpublished notes. 3. Raphson, J. (1690). Analysis aequationum universalis. Apud Abelem Swalle, London. 4. Byrd, R. H., Lu, P., Nocedal, J., and Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing, 16(5), 1190-1208. 5. Aizerman, M., Braverman, E., and Rozonoer, L. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, 821-837. 6. Boser, B., Guyon, I., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In COLT-92. 7. Cortes, C. and Vapnik, V. N. (1995). Support-vector networks. Machine Learning, 20, 273-297. 8. Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning, pp. 185-208. MIT Press. 9. Joachims, T. (2001). A statistical learning model of text classification with support vector machines. In SIGIR-01, pp. 128-136. 10. Cristianini, N. and Hahn, M. (2007). Introduction to Computational Genomics: A Case Studies Approach. Cambridge University Press. 11. DeCoste, D. and Schölkopf, B. (2002). Training invariant support vector machines. Machine Learning, 46(1), 161–190. 12. Freund, Y. and Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277–296. 13. Collins, M. and Duffy, K. (2002). New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In ACL-02. 14. Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press. 15. Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press. 16. Cristianini, N. and Schölkopf, B. (2002). Support vector machines and kernel methods: The new generation of learning machines. AIMag, 23(3), 31–41. 17. Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In Bottou, L., Chapelle, O., DeCoste, D., and Weston, J. (Eds.), Large-Scale Kernel Machines. MIT Press. 18. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140. 19. Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227. 20. Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In ICML-96. 21. Schapire, R. E. (2003). The boosting approach to machine learning: An overview. In Denison, D. D., Hansen, M. H., Holmes, C., Mallick, B., and Yu, B. (Eds.), Nonlinear Estimation and Classification. Springer. 22. Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2), 337–374. 23. Blum, A. L. (1996). On-line algorithms in machine learning. In Proc. Workshop on On-Line Algorithms, Dagstuhl, pp. 306–325. 24. Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press. 25. 
Dredze, M., Crammer, K., and Pereira, F. (2008). Confidence-weighted linear classification. In ICML-08, pp. 264–271. |
Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
Beliefs | Schiffer | I 273 Def subdoxastic/Stich: (1978)(1): a subdoxastic state is not a belief state, but an information-bearing state. Subdoxastic states are unconscious and inferentially insulated from beliefs. >Unconscious, >Belief state, >Beliefs, >Inference. E.g. if a transformational grammar is internally represented, the representing states would be subdoxastic. Schiffer thesis: language processing is done through a series of internal subdoxastic states. 1. Stephen P. Stich (1978). Beliefs and subdoxastic states. In: Philosophy of Science 45 (December): 499-518. --- I 26 Belief/Schiffer: problem: such a psychological theory does not yield the meaning of belief. - Solution: functionalist reduction. >Psycho functionalism. Ultimately: "Bel =def the first element of an ordered pair of functions (f,g) that satisfies T" ((s) i.e. that of which the theory says that it is belief) - ((s) "Loar-style"). >Meaning theory/Loar. I 28 Schiffer: It is already presupposed that one construes beliefs and desires as functions from propositions to (sets of) internal Z-types. >Functional role/Schiffer. The criterion for a Z-token n being a belief that p is that n is a token of a Z-type which has the functional role that the definition of Bel_T correlates with p. I 150 Belief property/SchifferVs: if belief properties existed, they would be irreducible (absurd). ((s) For Schiffer it is already established that there is a neural state for e.g. stepping back from a car. This neural state is the cause - a mental property would then exist in addition, and it would not be supported by any counterfactual conditional.) Counterfactual conditional/(s): indicates whether something is superfluous - or whether it is sufficient as an explanation. >Counterfactual conditionals. I 155 Belief properties/Schiffer: supposing they existed (language-independently), they would have to be simple (non-composite), i.e. not a function of other things. Vs: E.g. the property of loving Thatcher is composed of loving and Thatcher - but belief is no such relation (see above). Problem: if belief properties are semantically simple, then there are infinitely many of them. Then language learning is impossible. >Language acquisition, >Learning. I 163 Belief predicates: less problematic than belief properties: their irreducibility follows from their conceptual role. >Conceptual role. E.g. Ava would not have stepped back if she had not had the belief property that a car is coming. This is conceptually and ontologically independent of the singular term "the EC of the belief that a car comes". This is a benign predicate-dualism (in terms of conceptual roles). It has no causal power. Pleonastic: Ava stepped back because she had the belief property... I 164 Belief/(s): If Ava believes that a car is coming, she believes this in every possible world that is physically indistinguishable from the actual world. Problem: that cannot be proven - but it is probably true. Then, ultimately, she stepped back because she was in the neural state... SchifferVsEliminativism/SchifferVsChurchland: eliminativism would then have the result that nobody believes anything. >Eliminativism, >Reductionism. |
Schi I St. Schiffer Remnants of Meaning Cambridge 1987 |
Brain Development | Developmental Psychology | Upton I 70 Brain development/Developmental psychology/Upton: between the ages of two and five years, the changes that occur in the brain enable children to plan their actions, pay greater attention to tasks and increase their language skills. The brain does not grow as rapidly during this time period as it did in infancy, but there are still some dramatic anatomical changes that take place (Thompson et al., 2000)(1). During early childhood, children's brains show rapid growth in the prefrontal cortex in particular. The prefrontal cortex is an area of the frontal lobes that is known to be involved in two very important activities: planning and organising new actions, and maintaining attention to tasks (Blumenthal et al., 1999)(2). Other important changes include an increase in myelination of the cells in the brain. This myelination speeds up the rate at which information travels through the nervous system (Meier et al., 2004)(3). E.g. myelination of the area of the brain that controls hand-eye coordination is not completed until around four years of age. Brain-imaging studies have shown that children with lower rates of myelination in this area of the brain at four years of age show poorer hand-eye coordination than their peers (Pujol et al., 2004)(4). Upton I 71 Language/right hemisphere: Handedness has traditionally been thought to have a strong link to brain organisation. Paul Pierre Broca first described language regions in the left hemisphere of right-handers in the nineteenth century and, from then on, it was accepted that the reverse, that is, right-hemisphere language dominance, should be true of left-handers (Knecht et al., 2000)(5). However, in reality the left-hand side of the brain dominates in language processing for most people: around 95 per cent of right-handers process speech predominantly in the left hemisphere (Springer and Deutsch, 1985)(6), as do more than 50 per cent of left-handers (Knecht et al., 2000)(5). According to Knecht et al., left-handedness is neither a precondition nor a necessary consequence of right-hemisphere language dominance (Knecht et al., 2000(5), p. 2517). >Learning, >Learning theory, >Language acquisition, >Brain, >Lateralization of the brain, >Language. 1. Thompson, P.M., Giedd, J.N., Woods, R.P., MacDonald, D., Evans, A.C. and Toga, A.W. (2000) Growth patterns in the developing brain detected by using continuum mechanical tensor maps. Nature, 404: 190–93. 2. Blumenthal, J.A., Babyak, M.A., Moore, K.A., Craighead, W.E., Herman, S., Khatri, P., Waugh, R., Napolitano, M.A., Forman, L.M., Appelbaum, M., Doraiswamy, P.M. and Krishnan, K.R. (1999) Effects of exercise training on older patients with major depression. Archives of Internal Medicine, 159: 2349–56. 3. Meier, B.P. and Hinsz, V.B. (2004) A comparison of human aggression committed by groups and individuals: an interindividual-intergroup discontinuity. Journal of Experimental Social Psychology, 40: 551–59. 4. Pujol, J., López-Sala, A., Sebastián-Gallés, N., Deus, J., Cardoner, N., Soriano-Mas, C., Moreno, A. and Sans, A. (2004) Delayed myelination in children with developmental delay detected by volumetric MRI. NeuroImage, 22(2): 897–903. 5. Knecht, S., Dräger, B., Deppe, M., Bobe, L. and Lohmann, H. (2000) Handedness and hemispheric language dominance in healthy humans. Brain, 123(12): 2512–18. 6. Springer, S.P. and Deutsch, G. (1985) Left Brain, Right Brain. New York: W.H. Freeman. |
Upton I Penney Upton Developmental Psychology 2011 |
Information Extraction | AI Research | Norvig I 873 Information extraction/AI Research/Norvig/Russell: Information extraction is the process of acquiring knowledge by skimming a text and looking for occurrences of a particular class of object and for relationships among objects. A typical task is to extract instances of addresses from Web pages, with database fields for street, city, state, and zip code; (…). In a limited domain, this can be done with high accuracy. As the domain gets more general, more complex linguistic models and more complex learning techniques are necessary. Norvig I 874 A. Finite-state template-based information extraction: Attribute-based extraction system: (…) assumes that the entire text refers to a single object and the task is to extract attributes of that object. E.g., manufacturer; product; price. Relational extraction systems: deal with multiple objects and the relations among them. Norvig I 875 A relational extraction system can be built as a series of cascaded finite-state transducers. E.g., FASTUS consists of five stages: 1. Tokenization, 2. Complex-word handling, 3. Basic-group handling, 4. Complex-phrase handling, 5. Structure merging. Norvig I 876 B. Probabilistic models for information extraction: When information extraction must be attempted from noisy or varied input, (…) it is better to use a probabilistic model rather than a rule-based model. The simplest probabilistic model for sequences with hidden state is the hidden Markov model, or HMM. >Bayesian networks, >Statistical learning. (…) an HMM models a progression through a sequence of hidden states x_t, with an observation e_t at each step. To apply HMMs to information extraction, [one] can either build one big HMM for all the attributes or build a separate HMM for each attribute. HMMs are probabilistic, and thus tolerant to noise. (…) with HMMs there is graceful degradation with missing characters/words, and [one] get[s] a probability indicating the degree of match, not just a Boolean match/fail. Norvig I 877 (…) HMMs can be trained from data; they don't require laborious engineering of templates, and thus they can more easily be kept up to date as text changes over time. Norvig I 878 VsHMMs: Problem: One issue with HMMs for the information extraction task is that they model a lot of probabilities that we don't really need. An HMM is a generative model; it models the full joint probability of observations and hidden states, and thus can be used to generate samples. That is, we can use the HMM model not only to parse a text and recover the speaker and date, but also to generate a random instance of a text containing a speaker and a date. Solution: All we need in order to understand a text is a discriminative model, one that models the conditional probability of the hidden attributes given the observations (the text). Conditional random field: We don't need the independence assumptions of the Markov model - we can have an x_t that is dependent on x_1. A framework for this type of model is the conditional random field, or CRF, which models a conditional probability distribution of a set of target variables given a set of observed variables. Like Bayesian networks, CRFs can represent many different structures of dependencies among the variables. 
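To make the simplest case - attribute-based, template extraction - concrete, here is a minimal sketch; the field names, patterns, and sample sentence are hypothetical illustrations, not taken from FASTUS or any real system:

```python
# A minimal sketch of attribute-based (template/finite-state) extraction:
# assume the whole text describes one object and fill in one database
# record for it. Fields and patterns are invented for illustration.
import re

TEMPLATES = {
    "manufacturer": re.compile(r"by\s+([A-Z][A-Za-z]+)"),
    "product":      re.compile(r"the\s+([A-Z][A-Za-z0-9 ]+?)\s+(?:laptop|phone)"),
    "price":        re.compile(r"\$([0-9]+(?:\.[0-9]{2})?)"),
}

def extract(text):
    """Return a record of all attributes whose template matches the text."""
    record = {}
    for field, pattern in TEMPLATES.items():
        match = pattern.search(text)
        if match:
            record[field] = match.group(1)
    return record

print(extract("Now only $399.00: the Aria 5 laptop by Examplex."))
# {'manufacturer': 'Examplex', 'product': 'Aria 5', 'price': '399.00'}
```

A probabilistic extractor in the HMM or CRF style would replace these hard match/fail templates with per-field sequence models that degrade gracefully on noisy input.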
Norvig I 879 Ontology extraction: [different from] information extraction as finding a specific set of relations (e.g., speaker, time, location) in a specific text (e.g., a talk announcement), [ontology extraction] is building a large knowledge base or ontology of facts from a corpus. This is different in three ways: First, it is open-ended - we want to acquire facts about all types of domains, not just one specific domain. Second, with a large corpus, this task is dominated by precision, not recall - just as with >question answering on the Web (…). Third, the results can be statistical aggregates gathered from multiple sources, rather than being extracted from one specific text. E.g., Hearst (1992)(1) looked at the problem of learning an ontology of concept categories and subcategories from a large corpus. The work concentrated on templates that are very general (not tied to a specific domain) and have high precision (are Norvig I 880 almost always correct when they match) but low recall (do not always match). Here is one of the most productive templates: NP such as NP (, NP)* (,)? ((and | or) NP)?. Here ["such as", "and", "or"] and commas must appear literally in the text, but the parentheses are for grouping, the asterisk means repetition of zero or more, and the question mark means optional. Problems: The biggest weakness in this approach is the sensitivity to noise. If one of the first few templates is incorrect, errors can propagate quickly. One way to limit this problem is to not accept a new example unless it is verified by multiple templates, and not accept a new template unless it discovers multiple examples that are also found by other templates. Machine reading: (…) a system that could read on its own and build up its own database. Such a system would be relation-independent; it would work for any relation. In practice, these systems work on all relations in parallel, because of the I/O demands of large corpora. They behave less like a traditional information extraction system that is targeted at a few relations and more like a human reader who learns from the text itself; because of this the field has been called machine reading. A representative machine-reading system is TEXTRUNNER (Banko and Etzioni, 2008)(2). TEXTRUNNER uses cotraining to boost its performance, but it needs something to bootstrap from. In the case of Hearst (1992)(1), specific patterns (e.g., such as) provided the bootstrap, and for Brin (1998)(3), it was a set of five author-title pairs. Norvig I 884 Early information extraction programs include GUS (Bobrow et al., 1977)(4) and FRUMP (DeJong, 1982)(5). Recent information extraction has been pushed forward by the annual Message Understanding Conferences (MUC), sponsored by the U.S. government. The FASTUS finite-state system was done by Hobbs et al. (1997)(6). It was based in part on the idea from Pereira and Wright (1991)(7) of using FSAs as approximations to phrase-structure grammars. Surveys of template-based systems are given by Roche and Schabes (1997)(8) and Appelt (1999)(9). Norvig I 885 Freitag and McCallum (2000)(10) discuss HMMs for information extraction. CRFs were introduced by Lafferty et al. (2001)(11); an example of their use for information extraction is described in (McCallum, 2003)(12) and a tutorial with practical guidance is given by (Sutton and McCallum, 2007)(13). Sarawagi (2007)(14) gives a comprehensive survey. 1. Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In COLING-92. 2. Banko, M. and Etzioni, O. 
(2008). The tradeoffs between open and traditional relation extraction. In ACL-08, pp. 28–36. 3. Brin, S. (1998). Extracting patterns and relations from the world wide web. In WebDB Workshop at EDBT-98. 4. Bobrow, D. G., Kaplan, R., Kay, M., Norman, D. A., Thompson, H., and Winograd, T. (1977). GUS, a frame driven dialog system. AIJ, 8, 155–173. 5. DeJong, G. (1982). An overview of the FRUMP system. In Lehnert, W. and Ringle, M. (Eds.), Strategies for Natural Language Processing, pp. 149–176. Lawrence Erlbaum. 6. Hobbs, J. R., Appelt, D., Bear, J., Israel, D., Kameyama, M., Stickel, M. E., and Tyson, M. (1997). FASTUS: A cascaded finite-state transducer for extracting information from natural-language text. In Roche, E. and Schabes, Y. (Eds.), Finite-State Devices for Natural Language Processing, pp. 383–406. MIT Press. 7. Pereira, F. and Wright, R. N. (1991). Finite-state approximation of phrase structure grammars. In ACL-91, pp. 246–255. 8. Roche, E. and Schabes, Y. (1997). Finite-State Language Processing (Language, Speech and Communication). Bradford Books. 9. Appelt, D. (1999). Introduction to information extraction. CACM, 12(3), 161–172. 10. Freitag, D. and McCallum, A. (2000). Information extraction with HMM structures learned by stochastic optimization. In AAAI-00. 11. Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML-01. 12. McCallum, A. (2003). Efficiently inducing features of conditional random fields. In UAI-03. 13. Sutton, C. and McCallum, A. (2007). An introduction to conditional random fields for relational learning. In Getoor, L. and Taskar, B. (Eds.), Introduction to Statistical Relational Learning. MIT Press. 14. Sarawagi, S. (2007). Information extraction. Foundations and Trends in Databases, 1(3), 261–377. |
Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
Innateness | Field | II 388 Rules/in the head/brain/Field: there may be rules that are "written in the head". >Rules, >Language processing, >Information processing, >Language of thought, >Brain, >Brain/Brainstates, >Thinking, >Cognition. Problem: if a rule is "written in the head", then some part of the brain must read it, and this reading is in turn governed by rules. These rules might themselves be changeable. >Regress. |
Field I H. Field Realism, Mathematics and Modality Oxford New York 1989 Field II H. Field Truth and the Absence of Fact Oxford New York 2001 Field III H. Field Science without numbers Princeton New Jersey 1980 Field IV Hartry Field "Realism and Relativism", The Journal of Philosophy, 76 (1982), pp. 553-67 In Theories of Truth, Paul Horwich Aldershot 1994 |
Lateralization of the Brain | Developmental Psychology | Upton I 71 Lateralization/Language/right hemisphere/developmental psychology/Upton: Handedness has traditionally been thought to have a strong link to brain organisation. Paul Pierre Broca first described language regions in the left hemisphere of right-handers in the nineteenth century and, from then on, it was accepted that the reverse, that is, right-hemisphere language dominance, should be true of left-handers (Knecht et al., 2000)(1). However, in reality the left-hand side of the brain dominates in language processing for most people: around 95 per cent of right-handers process speech predominantly in the left hemisphere (Springer and Deutsch, 1985)(2), as do more than 50 per cent of left-handers (Knecht et al., 2000)(1). According to Knecht et al., left-handedness is neither a precondition nor a necessary consequence of right-hemisphere language dominance (Knecht et al., 2000(1), p. 2517). Left-handedness is more frequently seen in creative and artistic individuals, such as musicians and artists, than would be expected by chance (Schachter and Ransil, 1996)(3). This might be explained by the finding that left-handers tend to have exceptional visual-spatial skills (Holtzen, 2000)(4), meaning that they are better able to recognise and represent shape and form (Ghayas and Adil, 2007)(5). Studies have shown a tendency for left-handers to score highly on intelligence tests (e.g. Bower, 1985(6); Ghayas and Adil, 2007(5)); however, it has also been noted that left-handers are more likely to have reading problems than right-handers (Natsopoulos et al., 1998), which may be related to the way they process language. >Language Development/Developmental psychology, >Brain, >Brain development. 1. Knecht, S., Dräger, B., Deppe, M., Bobe, L. and Lohmann, H. (2000) Handedness and hemispheric language dominance in healthy humans. Brain, 123(12): 2512–18. 2. Springer, S.P. and Deutsch, G. (1985) Left Brain, Right Brain. New York: W.H. Freeman. 3. Schachter, S.C. and Ransil, B.J. (1996) Handedness distributions in nine professional groups. Perceptual and Motor Skills, 82: 51–63. 4. Holtzen, D.W. (2000) Handedness and professional tennis. International Journal of Neuroscience, 105: 109–19. 5. Ghayas, S. and Adil, A. (2007) Effect of handedness on intelligence level of students. Journal of the Indian Academy of Applied Psychology, 33(1): 85–91. 6. Bower, B. (1985) The left hand of math and verbal talent. Science News, 127(17): 263. |
Upton I Penney Upton Developmental Psychology 2011 |
Meaning Theory | Schiffer | I 12 Meaning theory/Schiffer: assuming compositionality, you can identify a language with the system of conventions in a population P. - Then one has (with Davidson) the form of a meaning theory. - No one has ever carried this out. >Compositionality, >Meaning theory/Davidson. I 182 Truth theory/Schiffer: a truth theory cannot be a meaning theory, because knowledge of it would not be sufficient for understanding the language. >Truth theory, >Understanding. I 220 Meaning theory/Schiffer: not every language needs a correct meaning theory - because it has to do without the relation theory for belief. >Relation theory. I 222 The relation theory for belief is wrong if languages have no compositional truth-theoretic semantics - otherwise it would be true. I 261 Meaning/meaning theory/language/Schiffer: Thesis: all theories of language and thought are based on false presuppositions. Error: to think that language comprehension is a process of inference. Then every sentence would have to have a feature, and this feature could not consist merely in the sentence's having such-and-such a meaning, for that would be semantic. We need a non-semantic description. Problem: E.g. "she gave it to him" does not even have semantic properties. E.g. "snow is white" has its semantic properties only contingently. >Semantic properties. I 264 SchifferVsGrice: we cannot formulate our semantic knowledge in non-semantic terms. >Intentions/Grice. I 265 Meaning theory/meaning/SchifferVsMeaning theory: all theories have failed. Thesis: there is no meaning theory. (This is the no-theory theory of mental representation.) Schiffer: Meaning is not an entity - therefore there is also no theory of this object. I 269 Schiffer: Meaning is also determinable without a meaning theory. I 269 No-theory theory of mental representation: there is no theory of intentionality, because having a concept does not mean that what one quantifies over are real entities. The schema "x believes y iff __" cannot be completed. The questions about our language processing are empirical, not philosophical. >Language use, >Language behavior. |
Schi I St. Schiffer Remnants of Meaning Cambridge 1987 |
Phonetics | Psychological Theories | Slater I 192 Phonetics/psychological theories: Liberman, Harris, Hoffman, and Griffith (1957)(1) summarized a decade of research at Haskins Laboratories that revealed a special property of the human adult auditory system, in contrast to every other type of auditory stimulus, whose perception conformed to invariant principles such as Weber's Law. Def Weber's Law: differences in intensity and frequency are discriminated in proportional steps, not absolute steps. LibermanVsWeber's Law: Liberman et al. provided compelling evidence that certain classes of speech sounds (notably stop consonants) are not perceived in this monotonic manner. Rather, speech is perceived in a non-monotonic manner, with discontinuities in discrimination that fall approximately at the edges of perceptual categories. Subsequent work from Haskins (Liberman, Harris, Kinney, & Lane, 1961(2); Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967(3)) provided even more definitive evidence for what became known as categorical perception (CP). Categorical perception (CP): This special mode of perception was characterized by two crucial properties: (a) tokens presented from a physical continuum were identified (labeled) as a member of one category or the other, with a sharp transition in identification (ID) at the category boundary, and (b) failure of within-category discrimination and a peak in between-category discrimination for tokens that straddled the category boundary. >Language development/psychological theories. Language development: Because no speech production was required to document the presence of CP, one could avoid the circular logic of claiming that competence was limited by production deficiencies. Thus, if one could develop a method to test infants on a speech perception task, and if their performance conformed to the CP pattern of discrimination and identification observed in adults, then the presence of a functioning speech mode (i.e., an innate and linguistically relevant perceptual system) would be demonstrated. Slater I 197 Development: There is no question that infants are better at some phonetic discriminations than adults. For example, infants from a Japanese-speaking environment can discriminate the /r/-/l/ contrast (Tsushima et al., 1994)(4), even though it is not used phonemically by adult speakers of Japanese, and these adult speakers have great difficulty improving their /r/-/l/ discrimination even after extensive training (Lively, Pisoni, Yamada, Tohkura & Yamada, 1994)(5). This suggests that listening experience must play a substantial role in at least some phonetic category discrimination. Werker and Tees (1984) were the first to show the time-course of such a tuning by the listening environment. Infants from an English-speaking environment were able at six months of age to discriminate two non-native phonetic contrasts (from Hindi and from Salish, a Native American language), thereby surpassing their adult English-speaking parents. But by 12 months of age the discriminative abilities of infants from an English-speaking environment for these two non-native contrasts had fallen to near chance. Slater I 198 Consonant discrimination: (…) experience with the native language can exert a substantial role in consonant discrimination over the second six months of postnatal life. 
(…) Kuhl, Williams, Lacerda, Stevens, and Lindblom (1992)(6) showed that the effect of native-language experience operates even earlier, on vowel contrasts, with language-specific tuning by six months of age. Recent evidence from Kuhl, Tsao, and Liu (2003)(7) suggests that social interaction, rather than mere passive listening, plays a key role in this process of attuning the phonetic categories, and further work from Tsao, Liu, and Kuhl (2004)(8) suggests that early attunement is predictive of later levels of vocabulary size. >Phonemes, >Phonology, >Categorical perception, >P.D. Eimas. 1. Liberman, A. M., Harris, K. S., Hoffman, H. S., & Griffith, B. C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358–368. 2. Liberman, A. M., Harris, K. S., Kinney, J., & Lane, H. (1961). The discrimination of relative onset-time of the components of certain speech and non-speech patterns. Journal of Experimental Psychology, 61, 379–388. 3. Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431–461. 4. Tsushima, T., Takizawa, O., Sasaki, M., Siraki, S., Nishi, K., Kohno, M., Menyuk, P., & Best, C. (1994, October). Discrimination of English /r-l/ and /w-y/ by Japanese infants at 6–12 months: Language-specific developmental changes in speech perception abilities. Paper presented at the International Conference on Spoken Language Processing, Yokohama, Japan. 5. Lively, S. E., Pisoni, D. B., Yamada, R. A., Tohkura, Y., & Yamada, T. (1994). Training Japanese listeners to identify English /r/ and /l/. III. Long-term retention of new phonetic categories. Journal of the Acoustical Society of America, 96, 2076–2087. 6. Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., & Lindblom, B. (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255, 606–608. 7. Kuhl, P. K., Tsao, F.-M., & Liu, H.-M. (2003). Foreign-language experience in infancy: Effects of short-term exposure and social interaction on phonetic learning. Proceedings of the National Academy of Sciences, 100, 9096–9101. 8. Tsao, F.-M., Liu, H.-M., & Kuhl, P. K. (2004). Speech perception in infancy predicts language development in the second year of life: A longitudinal study. Child Development, 75, 1067–1084. Richard N. Aslin, "Language Development. Revisiting Eimas et al.'s /ba/ and /pa/ Study", in: Alan M. Slater and Paul C. Quinn (eds.) 2012. Developmental Psychology. Revisiting the Classic Studies. London: Sage Publications |
Slater I Alan M. Slater Paul C. Quinn Developmental Psychology. Revisiting the Classic Studies London 2012 |
Prior Knowledge | Norvig | Norvig I 777 Prior knowledge/AI Research/Norvig/Russell: To understand the role of prior knowledge, we need to talk about the logical relationships among hypotheses, example descriptions, and classifications. Let Descriptions denote the conjunction of all the example descriptions in the training set, and let Classifications denote the conjunction of all the example classifications. Then a Hypothesis that "explains the observations" must satisfy the following property (recall that |= means "logically entails"): Hypothesis ∧ Descriptions |= Classifications. Entailment constraint: We call this kind of relationship an entailment constraint, in which Hypothesis is the "unknown." Pure inductive learning means solving this constraint, where Hypothesis is drawn from some predefined hypothesis space. >Hypotheses/AI Research. Software agents/knowledge/learning/Norvig: The modern approach is to design agents that already know something and are trying to learn some more. An autonomous learning agent that uses background knowledge must somehow obtain the background knowledge in the first place (…). This method must itself be a learning process. The agent's life history will therefore be characterized by cumulative, or incremental, development. Norvig I 778 Learning with background knowledge: allows much faster learning than one might expect from a pure induction program. Explanation-based learning/EBL: the entailment constraints satisfied by EBL are the following: Hypothesis ∧ Descriptions |= Classifications; Background |= Hypothesis. Norvig I 779 (…) it was initially thought to be a way to learn from examples. But because it requires that the background knowledge be sufficient to explain the hypothesis, which in turn explains the observations, the agent does not actually learn anything factually new from the example. The agent could have derived the example from what it already knew, although that might have required an unreasonable amount of computation. EBL is now viewed as a method for converting first-principles theories into useful, special-purpose knowledge. Relevance/observations/RBL: the prior background knowledge concerns the relevance of a set of features to the goal predicate. This knowledge, together with the observations, allows the agent to infer a new, general rule that explains the observations: Hypothesis ∧ Descriptions |= Classifications; Background ∧ Descriptions ∧ Classifications |= Hypothesis. We call this kind of generalization relevance-based learning, or RBL. (…) whereas RBL does make use of the content of the observations, it does not produce hypotheses that go beyond the logical content of the background knowledge and the observations. It is a deductive form of learning and cannot by itself account for the creation of new knowledge starting from scratch. Entailment constraint: Background ∧ Hypothesis ∧ Descriptions |= Classifications. That is, the background knowledge and the new hypothesis combine to explain the examples. Knowledge-based inductive learning/KBIL algorithms: Algorithms that satisfy [the entailment] constraint are called knowledge-based inductive learning, or KBIL, algorithms. KBIL algorithms (…) have been studied mainly in the field of inductive logic programming, or ILP. Norvig I 780 Explanation-based learning: The basic idea of memo functions is to accumulate a database of input-output pairs; when the function is called, it first checks the database to see whether it can avoid solving the problem from scratch. 
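The following is a minimal sketch of such a memo function; the Python decorator form and the Fibonacci example are illustrative assumptions, not code from Russell and Norvig:

```python
# A minimal sketch of a memo function as just described: cache
# input-output pairs and check the cache before solving from scratch.
def memoize(f):
    table = {}                      # the accumulated input-output pairs
    def wrapper(*args):
        if args not in table:       # solve from scratch only once
            table[args] = f(*args)
        return table[args]
    return wrapper

@memoize
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(40))  # 102334155, computed quickly: each subproblem solved once
```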
Explanation-based learning takes this a good deal further, by creating general rules that cover an entire class of cases. Norvig I 781 General rules: The basic idea behind EBL is first to construct an explanation of the observation using prior knowledge, and then to establish a definition of the class of cases for which the same explanation structure can be used. This definition provides the basis for a rule covering all of the cases in the class. Explanation: The "explanation" can be a logical proof, but more generally it can be any reasoning or problem-solving process whose steps are well defined. The key is to be able to identify the necessary conditions for those same steps to apply to another case. Norvig I 782 EBL: 1. Given an example, construct a proof that the goal predicate applies to the example using the available background knowledge. Norvig I 783 2. In parallel, construct a generalized proof tree for the variabilized goal using the same inference steps as in the original proof. 3. Construct a new rule whose left-hand side consists of the leaves of the proof tree and whose right-hand side is the variabilized goal (after applying the necessary bindings from the generalized proof). 4. Drop any conditions from the left-hand side that are true regardless of the values of the variables in the goal. Norvig I 794 Inverse resolution: Inverse resolution is based on the observation that if the example Classifications follow from Background ∧ Hypothesis ∧ Descriptions, then one must be able to prove this fact by resolution (because resolution is complete). If we can "run the proof backward," then we can find a Hypothesis such that the proof goes through. Norvig I 795 Inverse entailment: The idea is to change the entailment constraint Background ∧ Hypothesis ∧ Descriptions |= Classifications to the logically equivalent form Background ∧ Descriptions ∧ ¬Classifications |= ¬Hypothesis. Norvig I 796 An inverse resolution procedure that inverts a complete resolution strategy is, in principle, a complete algorithm for learning first-order theories. That is, if some unknown Hypothesis generates a set of examples, then an inverse resolution procedure can generate Hypothesis from the examples. This observation suggests an interesting possibility: Suppose that the available examples include a variety of trajectories of falling bodies. Would an inverse resolution program be theoretically capable of inferring the law of gravity? The answer is clearly yes, because the law of gravity allows one to explain the examples, given suitable background mathematics. Norvig I 798 Literature: The current-best-hypothesis approach is an old idea in philosophy (Mill, 1843)(1). Early work in cognitive psychology also suggested that it is a natural form of concept learning in humans (Bruner et al., 1957)(2). In AI, the approach is most closely associated with the work of Patrick Winston, whose Ph.D. thesis (Winston, 1970)(3) addressed the problem of learning descriptions of complex objects. Version space: The version space method (Mitchell, 1977(4), 1982(5)) takes a different approach, maintaining the set of all consistent hypotheses and eliminating those found to be inconsistent with new examples. The approach was used in the Meta-DENDRAL Norvig I 799 expert system for chemistry (Buchanan and Mitchell, 1978)(6), and later in Mitchell's (1983)(7) LEX system, which learns to solve calculus problems. 
A third influential thread was formed by the work of Michalski and colleagues on the AQ series of algorithms, which learned sets of logical rules (Michalski, 1969(8); Michalski et al., 1986(9)). EBL: EBL had its roots in the techniques used by the STRIPS planner (Fikes et al., 1972)(10). When a plan was constructed, a generalized version of it was saved in a plan library and used in later planning as a macro-operator. Similar ideas appeared in Anderson's ACT* architecture, under the heading of knowledge compilation (Anderson, 1983)(11), and in the SOAR architecture, as chunking (Laird et al., 1986)(12). Schema acquisition (DeJong, 1981)(13), analytical generalization (Mitchell, 1982)(5), and constraint-based generalization (Minton, 1984)(14) were immediate precursors of the rapid growth of interest in EBL stimulated by the papers of Mitchell et al. (1986)(15) and DeJong and Mooney (1986)(16). Hirsh (1987)(17) introduced the EBL algorithm described in the text, showing how it could be incorporated directly into a logic programming system. Van Harmelen and Bundy (1988)(18) explain EBL as a variant of the partial evaluation method used in program analysis systems (Jones et al., 1993)(19). VsEBL: Initial enthusiasm for EBL was tempered by Minton's finding (1988)(20) that, without extensive extra work, EBL could easily slow down a program significantly. Formal probabilistic analysis of the expected payoff of EBL can be found in Greiner (1989)(21) and Subramanian and Feldman (1990)(22). An excellent survey of early work on EBL appears in Dietterich (1990)(23). Relevance: Relevance information in the form of functional dependencies was first developed in the database community, where it is used to structure large sets of attributes into manageable subsets. Functional dependencies were used for analogical reasoning by Carbonell and Collins (1973)(24) and rediscovered and given a full logical analysis by Davies and Russell (Davies, 1985(25); Davies and Russell, 1987(26)). Prior knowledge: Their role as prior knowledge in inductive learning was explored by Russell and Grosof (1987)(27). The equivalence of determinations to a restricted-vocabulary hypothesis space was proved in Russell (1988)(28). Learning: Learning algorithms for determinations and the improved performance obtained by RBDTL were first shown in the FOCUS algorithm, due to Almuallim and Dietterich (1991)(29). Tadepalli (1993)(30) describes a very ingenious algorithm for learning with determinations that shows large improvements in learning speed. Inverse deduction: The idea that inductive learning can be performed by inverse deduction can be traced to W. S. Jevons (1874)(31) (…). Computational investigations began with the remarkable Ph.D. thesis by 
Relational Learning: The field of relational learning was reinvigorated by Muggleton and Buntine (1988)(37), whose CIGOL program incorporated a slightly incomplete version of inverse resolution and was capable of generating new predicates. The inverse resolution method also appears in (Russell, 1986)(38), with a simple algorithm given in a footnote. The next major system was GOLEM (Muggleton and Feng, 1990)(39), which uses a covering algorithm based on Plotkin's concept of relative least general generalization. ITOU (Rouveirol and Puget, 1989)(40) and CLINT (De Raedt, 1992)(41) were other systems of that era. Natural language: More recently, PROGOL (Muggleton, 1995)(42) has taken a hybrid (top-down and bottom-up) approach to inverse entailment and has been applied to a number of practical problems, particularly in biology and natural language processing. Uncertainty: Muggleton (2000)(43) describes an extension of PROGOL to handle uncertainty in the form of stochastic logic programs. Inductive logic programming/ILP: A formal analysis of ILP methods appears in Muggleton (1991)(44), a large collection of papers in Muggleton (1992)(45), and a collection of techniques and applications in the book by Lavrač and Džeroski (1994)(46). Page and Srinivasan (2002)(47) give a more recent overview of the field's history and challenges for the future. Early complexity results by Haussler (1989) suggested that learning first-order sentences was intractable. However, with better understanding of the importance of syntactic restrictions on clauses, positive results have been obtained even for clauses with recursion (Džeroski et al., 1992)(48). Learnability results for ILP are surveyed by Kietz and Džeroski (1994)(49) and Cohen and Page (1995)(50). Discovery systems/VsILP: Although ILP now seems to be the dominant approach to constructive induction, it has not been the only approach taken. So-called discovery systems aim to model the process of scientific discovery of new concepts, usually by a direct search in the space of concept definitions. Doug Lenat's Automated Mathematician, or AM (Davis and Lenat, 1982)(51), used discovery heuristics expressed as expert system rules to guide its search for concepts and conjectures in elementary number theory. Unlike most systems designed for mathematical reasoning, AM lacked a concept of proof and could only make conjectures. It rediscovered Goldbach's conjecture and the Unique Prime Factorization theorem. AM's architecture was generalized in the EURISKO system (Lenat, 1983)(52) by adding a mechanism capable of rewriting the system's own discovery heuristics. EURISKO was applied in a number of areas other than mathematical discovery, although with less success than AM. The methodology of AM and EURISKO has been controversial (Ritchie and Hanna, 1984(53); Lenat and Brown, 1984(54)). 1. Mill, J. S. (1843). A System of Logic, Ratiocinative and Inductive: Being a Connected View of the Principles of Evidence, and Methods of Scientific Investigation. J.W. Parker, London. 2. Bruner, J. S., Goodnow, J. J., and Austin, G. A. (1957). A Study of Thinking. Wiley. 3. Winston, P. H. (1970). Learning structural descriptions from examples. Technical report MAC-TR-76, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology. 4. Mitchell, T.M. (1977). Version spaces: A candidate elimination approach to rule learning. In IJCAI-77, pp. 305–310. 5. Mitchell, T. M. (1982). Generalization as search. AIJ, 18(2), 203–226. 6. Buchanan, B. 
G., Mitchell, T. M., Smith, R. G., and Johnson, C. R. (1978). Models of learning systems. In Encyclopedia of Computer Science and Technology, Vol. 11. Dekker. 7. Mitchell, T. M., Utgoff, P. E., and Banerji, R. (1983). Learning by experimentation: Acquiring and refining problem-solving heuristics. In Michalski, R. S., Carbonell, J. G., and Mitchell, T. M. (Eds.), Machine Learning: An Artificial Intelligence Approach, pp. 163–190. Morgan Kaufmann. 8. Michalski, R. S. (1969). On the quasi-minimal solution of the general covering problem. In Proc. First International Symposium on Information Processing, pp. 125–128. 9. Michalski, R. S., Mozetič, I., Hong, J., and Lavrač, N. (1986). The multi-purpose incremental learning system AQ15 and its testing application to three medical domains. In AAAI-86, pp. 1041–1045. 10. Fikes, R. E., Hart, P. E., and Nilsson, N. J. (1972). Learning and executing generalized robot plans. AIJ, 3(4), 251–288. 11. Anderson, J. R. (1983). The Architecture of Cognition. Harvard University Press. 12. Laird, J., Rosenbloom, P. S., and Newell, A. (1986). Chunking in Soar: The anatomy of a general learning mechanism. Machine Learning, 1, 11–46. 13. DeJong, G. (1981). Generalizations based on explanations. In IJCAI-81, pp. 67–69. 14. Minton, S. (1984). Constraint-based generalization: Learning game-playing plans from single examples. In AAAI-84, pp. 251–254. 15. Mitchell, T. M., Keller, R., and Kedar-Cabelli, S. (1986). Explanation-based generalization: A unifying view. Machine Learning, 1, 47–80. 16. DeJong, G. and Mooney, R. (1986). Explanation-based learning: An alternative view. Machine Learning, 1, 145–176. 17. Hirsh, H. (1987). Explanation-based generalization in a logic programming environment. In IJCAI-87. 18. van Harmelen, F. and Bundy, A. (1988). Explanation-based generalisation = partial evaluation. AIJ, 36(3), 401–412. 19. Jones, N. D., Gomard, C. K., and Sestoft, P. (1993). Partial Evaluation and Automatic Program Generation. Prentice-Hall. 20. Minton, S. (1988). Quantitative results concerning the utility of explanation-based learning. In AAAI-88, pp. 564–569. 21. Greiner, R. (1989). Towards a formal analysis of EBL. In ICML-89, pp. 450–453. 22. Subramanian, D. and Feldman, R. (1990). The utility of EBL in recursive domain theories. In AAAI-90, Vol. 2, pp. 942–949. 23. Dietterich, T. (1990). Machine learning. Annual Review of Computer Science, 4, 255–306. 24. Carbonell, J. R. and Collins, A. M. (1973). Natural semantics in artificial intelligence. In IJCAI-73, pp. 344–351. 25. Davies, T. R. (1985). Analogy. Informal note INCSLI-85-4, Center for the Study of Language and Information (CSLI). 26. Davies, T. R. and Russell, S. J. (1987). A logical approach to reasoning by analogy. In IJCAI-87, Vol. 1, pp. 264–270. 27. Russell, S. J. and Grosof, B. (1987). A declarative approach to bias in concept learning. In AAAI-87. 28. Russell, S. J. (1988). Tree-structured bias. In AAAI-88, Vol. 2, pp. 641–645. 29. Almuallim, H. and Dietterich, T. (1991). Learning with many irrelevant features. In AAAI-91, Vol. 2, pp. 547–552. 30. Tadepalli, P. (1993). Learning from queries and examples with tree-structured bias. In ICML-93, pp. 322–329. 31. Jevons, W. S. (1874). The Principles of Science. Routledge/Thoemmes Press, London. 32. Plotkin, G. (1971). Automatic Methods of Inductive Inference. Ph.D. thesis, Edinburgh University. 33. Shapiro, E. (1981). An algorithm that infers theories from facts. In IJCAI-81, p. 1064. 34. Quinlan, J. R. (1986). Induction of decision trees.
Machine Learning, 1, 81–106. 35. Clark, P. and Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3, 261–283. 36. Quinlan, J. R. (1990). Learning logical definitions from relations. Machine Learning, 5(3), 239–266. 37. Muggleton, S. H. and Buntine, W. (1988). Machine invention of first-order predicates by inverting resolution. In ICML-88, pp. 339–352. 38. Russell, S. J. (1986). A quantitative analysis of analogy by similarity. In AAAI-86, pp. 284–288. 39. Muggleton, S. H. and Feng, C. (1990). Efficient induction of logic programs. In Proc. Workshop on Algorithmic Learning Theory, pp. 368–381. 40. Rouveirol, C. and Puget, J.-F. (1989). A simple and general solution for inverting resolution. In Proc. European Working Session on Learning, pp. 201–210. 41. De Raedt, L. (1992). Interactive Theory Revision: An Inductive Logic Programming Approach. Academic Press. 42. Muggleton, S. H. (1995). Inverse entailment and Progol. New Generation Computing, 13(3–4), 245–286. 43. Muggleton, S. H. (2000). Learning stochastic logic programs. In Proc. AAAI 2000 Workshop on Learning Statistical Models from Relational Data. 44. Muggleton, S. H. (1991). Inductive logic programming. New Generation Computing, 8, 295–318. 45. Muggleton, S. H. (1992). Inductive Logic Programming. Academic Press. 46. Lavrač, N. and Džeroski, S. (1994). Inductive Logic Programming: Techniques and Applications. Ellis Horwood. 47. Page, C. D. and Srinivasan, A. (2002). ILP: A short look back and a longer look forward. Submitted to Journal of Machine Learning Research. 48. Džeroski, S., Muggleton, S. H., and Russell, S. J. (1992). PAC-learnability of determinate logic programs. In COLT-92, pp. 128–135. 49. Kietz, J.-U. and Džeroski, S. (1994). Inductive logic programming and learnability. SIGART Bulletin, 5(1), 22–32. 50. Cohen, W. W. and Page, C. D. (1995). Learnability in inductive logic programming: Methods and results. New Generation Computing, 13(3–4), 369–409. 51. Davis, R. and Lenat, D. B. (1982). Knowledge-Based Systems in Artificial Intelligence. McGraw-Hill. 52. Lenat, D. B. (1983). EURISKO: A program that learns new heuristics and domain concepts: The nature of heuristics, III: Program design and results. AIJ, 21(1–2), 61–98. 53. Ritchie, G. D. and Hanna, F. K. (1984). AM: A case study in AI methodology. AIJ, 23(3), 249–268. 54. Lenat, D. B. and Brown, J. S. (1984). Why AM and EURISKO appear to work. AIJ, 23(3), 269–294. |
Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
Question Answering | AI Research | Norvig I 872 Question Answering/AI Research/Norvig/Russell: Information retrieval is the task of finding documents that are relevant to a query, where the query may be a question, or just a topic area or concept. Question answering is a somewhat different task, in which the query really is a question, and the answer is not a ranked list of documents but rather a short response—a sentence, or even just a phrase. Cf. >Information retrieval. There have been question-answering NLP (natural language processing) systems since the 1960s, but only since 2001 have such systems used Web information retrieval to radically increase their breadth of coverage. The ASKMSR system (Banko et al., 2002)(1) is a typical Web-based question-answering system. It is based on the intuition that most questions will be answered many times on the Web, so question answering should be thought of as a problem in precision, not recall. We don’t have to deal with all the different ways that an answer might be phrased—we only have to find one of them. E.g., [Who killed Abraham Lincoln?] – Web entry: “John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln’s life.” Problem: To use this passage to answer the question, the system would have to know that ending a life can be a killing, that “He” refers to Booth, and several other linguistic and semantic facts. Norvig I 873 ASKMSR does not attempt this kind of sophistication—it knows nothing about pronoun reference, or about killing, or any other verb. It does know 15 different kinds of questions, and how they can be rewritten as queries to a search engine. It knows that [Who killed Abraham Lincoln] can be rewritten as the query [* killed Abraham Lincoln] and as [Abraham Lincoln was killed by *]. It issues these rewritten queries and examines the results that come back - not the full Web pages, just the short summaries of text that appear near the query terms. The results are broken into 1-, 2-, and 3-grams (>Language models/Norvig) and tallied for frequency in the result sets and for weight: an n-gram that came back from a very specific query rewrite (such as the exact phrase match query [“Abraham Lincoln was killed by *”]) would get more weight than one from a general query rewrite, such as [Abraham OR Lincoln OR killed]. ASKMSR relies upon the breadth of the content on the Web rather than on its own depth of understanding. >Information Extraction, >Information retrieval. Norvig I 885 History: Banko et al. (2002)(1) present the ASKMSR question-answering system; a similar system is due to Kwok et al. (2001)(2). Pasca and Harabagiu (2001)(3) discuss a contest-winning question-answering system. Two early influential approaches to automated knowledge engineering were by Riloff (1993)(4), who showed that an automatically constructed dictionary performed almost as well as a carefully handcrafted domain-specific dictionary, and by Yarowsky (1995)(5), who showed that the task of word sense classification (…) could be accomplished through unsupervised training on a corpus of unlabeled text with accuracy as good as supervised methods. The idea of simultaneously extracting templates and examples from a handful of labeled examples was developed independently and simultaneously by Blum and Mitchell (1998)(6), who called it cotraining and by Brin (1998)(7), who called it DIPRE (Dual Iterative Pattern Relation Extraction). You can see why the term cotraining has stuck.
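The rewrite-and-tally scheme described above is simple enough to sketch end to end. The following is a minimal illustration, not the actual ASKMSR implementation: the rewrite patterns and weights cover only the one question type from the example (the real system knew 15), and search() is a stub returning canned snippets rather than querying a live search engine.

    from collections import Counter

    def rewrites(question):
        """Turn 'Who killed X?' into (query, weight) pairs; more specific = heavier."""
        x = question.removeprefix("Who killed ").rstrip("?")
        return [(f'"{x} was killed by"', 5),   # exact-phrase rewrite: high weight
                (f'"killed {x}"', 5),
                (f'killed {x}', 2)]            # loose bag-of-words rewrite: low weight

    def search(query):
        """Stub standing in for a web search engine returning result snippets."""
        return ["John Wilkes Booth altered history with a bullet. He will forever "
                "be known as the man who ended Abraham Lincoln's life.",
                "Abraham Lincoln was killed by John Wilkes Booth in 1865."]

    def ngrams(text, n):
        words = text.lower().replace(".", " ").replace(",", " ").split()
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    def answer(question):
        scores = Counter()
        for query, weight in rewrites(question):
            for snippet in search(query):
                for n in (1, 2, 3):                  # tally 1-, 2-, and 3-grams
                    for gram in ngrams(snippet, n):
                        scores[gram] += weight
        qwords = set(question.lower().rstrip("?").split())
        # Drop n-grams that merely echo the question; prefer high score, then length.
        candidates = {g: s for g, s in scores.items() if not set(g.split()) & qwords}
        return max(candidates, key=lambda g: (candidates[g], len(g)))

    print(answer("Who killed Abraham Lincoln?"))   # -> 'john wilkes booth'

Breaking ties toward longer n-grams lets the full answer phrase outscore its own constituent words; the real system filtered and recombined candidate n-grams more carefully.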
Similar early work, under the name of bootstrapping, was done by Jones et al. (1999)(8). The method was advanced by the QXTRACT (Agichtein and Gravano, 2003)(9) and KNOWITALL (Etzioni et al., 2005)(10) systems. Machine reading was introduced by Mitchell (2005)(11) and Etzioni et al. (2006)(12) and is the focus of the TEXTRUNNER project (Banko et al., 2007(13); Banko and Etzioni, 2008(14)). (Cf. >Information extraction). (…) it is also possible to do information extraction based on the physical structure or layout of text rather than on the linguistic structure. Lists and tables in both HTML and relational databases are home to data that can be extracted and consolidated (Hurst, 2000(15); Pinto et al., 2003(16); Cafarella et al., 2008(17)). The Association for Computational Linguistics (ACL) holds regular conferences and publishes the journal Computational Linguistics. There is also an International Conference on Computational Linguistics (COLING). The textbook by Manning and Schütze (1999)(18) covers statistical language processing, while Jurafsky and Martin (2008)(19) give a comprehensive introduction to speech and natural language processing. 1. Banko, M., Brill, E., Dumais, S. T., and Lin, J. (2002). AskMSR: Question answering using the worldwide web. In Proc. AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases, pp. 7–9. 2. Kwok, C., Etzioni, O., and Weld, D. S. (2001). Scaling question answering to the web. In Proc. 10th International Conference on the World Wide Web. 3. Pasca, M. and Harabagiu, S. M. (2001). High performance question/answering. In SIGIR-01, pp. 366–374. 4. Riloff, E. (1993). Automatically constructing a dictionary for information extraction tasks. In AAAI-93, pp. 811–816. 5. Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In ACL-95, pp. 189–196. 6. Blum, A. L. and Mitchell, T. M. (1998). Combining labeled and unlabeled data with co-training. In COLT-98, pp. 92–100. 7. Brin, S. (1998). Extracting patterns and relations from the world wide web. In Proc. WebDB Workshop at EDBT-98. 8. Jones, R., McCallum, A., Nigam, K., and Riloff, E. (1999). Bootstrapping for text learning tasks. In Proc. IJCAI-99 Workshop on Text Mining: Foundations, Techniques, and Applications, pp. 52–63. 9. Agichtein, E. and Gravano, L. (2003). Querying text databases for efficient information extraction. In Proc. IEEE Conference on Data Engineering. 10. Etzioni, O., Cafarella, M. J., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D. S., and Yates, A. (2005). Unsupervised named-entity extraction from the web: An experimental study. AIJ, 165(1), 91–134. 11. Mitchell, T. M. (2005). Reading the web: A breakthrough goal for AI. AIMag, 26(3), 12–16. 12. Etzioni, O., Banko, M., and Cafarella, M. J. (2006). Machine reading. In AAAI-06. 13. Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., and Etzioni, O. (2007). Open information extraction from the web. In IJCAI-07. 14. Banko, M. and Etzioni, O. (2008). The tradeoffs between open and traditional relation extraction. In ACL-08, pp. 28–36. 15. Hurst, M. (2000). The Interpretation of Text in Tables. Ph.D. thesis, Edinburgh. 16. Pinto, D., McCallum, A., Wei, X., and Croft, W. B. (2003). Table extraction using conditional random fields. In SIGIR-03. 17. Cafarella, M. J., Halevy, A., Zhang, Y., Wang, D. Z., and Wu, E. (2008). WebTables: Exploring the power of tables on the web. In VLDB-2008. 18. Manning, C. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press. 19. Jurafsky, D. and Martin, J. H. (2008).
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd edition). Prentice-Hall. |
Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
Reading Acquisition | Neuroimaging | Upton I 101 Reading acquisition/Neuroimaging/Upton: (…) evidence from neuro-imaging studies and from studies of patients with cerebellar lesions suggests that the cerebellum also plays an important role in a range of high-level cognitive functions, such as language, previously believed to be under the sole control of the cortex (Booth et al., 2007)(1). According to the cerebellar deficit hypothesis (Nicolson et al., 1995)(2), both literacy and automaticity problems can be explained by abnormal cerebellar function. Indeed, there is evidence from both behavioural and neuro-imaging tests that dyslexia is associated with cerebellar impairment in about 80 per cent of cases (Nicolson et al., 2001)(3). It therefore seems that not only does motor development create the opportunity for cognitive functions to develop, (…) but that the interrelatedness of cognitive and motor development might also be based on shared neural systems (Ojemann, 1984(4); Diamond, 2000(5)). >Language, >Language acquisition, >Speaking, >Writing. 1. Booth, JR, Wood, L, Lu, D, Houk, JC and Bitan, T (2007) The role of the basal ganglia and cerebellum in language processing. Brain Research, 1133(1): 136–44. 2. Nicolson, RI, Fawcett, AJ and Dean, P (1995) Time estimation deficits in developmental dyslexia: evidence of cerebellar involvement. Proceedings of the Royal Society B, 259(1354): 43–7. 3. Nicolson, RI, Fawcett, AJ and Dean, P (2001) Developmental dyslexia: the cerebellar deficit hypothesis. Trends in Neurosciences, 24(9): 508–11. 4. Ojemann, GA (1984) Common cortical and thalamic mechanisms for language and motor functions. American Journal of Physiology, 246: 901–3. 5. Diamond, A (2000) Close interrelation of motor development and cognitive development and of the cerebellum and prefrontal cortex. Child Development, 71(1): 44–56. |
Upton I Penney Upton Developmental Psychology 2011 |
Text Classification | AI Research | Norvig I 865 Text Classification/text categorization/AI Research/Norvig/Russell: (…) given a text of some kind, decide which of a predefined set of classes it belongs to. Language identification and genre classification are examples of text classification, as are sentiment analysis (classifying a movie or product review as positive or negative) and spam detection (classifying an email message as spam or not-spam). >Spam/AI Research. Norvig I 884 Manning and Schütze (1999)(1) and Sebastiani (2002)(2) survey text-classification techniques. Joachims (2001)(3) uses statistical learning theory and support vector machines to give a theoretical analysis of when classification will be successful. Apté et al. (1994)(4) report an accuracy of 96% in classifying Reuters news articles into the “Earnings” category. Koller and Sahami (1997)(5) report accuracy up to 95% with a naive Bayes classifier, and up to 98.6% with a Bayes classifier that accounts for some dependencies among features. Lewis (1998)(6) surveys forty years of application of naive Bayes techniques to text classification and retrieval. Schapire and Singer (2000)(7) show that simple linear classifiers can often achieve accuracy almost as good as more complex models and are more efficient to evaluate. Nigam et al. (2000)(8) show how to use the EM algorithm to label unlabeled documents, thus learning a better classification model. Witten et al. (1999)(9) describe compression algorithms for classification, and show the deep connection between the LZW compression algorithm and maximum-entropy language models. 1. Manning, C. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press. 2. Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47. 3. Joachims, T. (2001). A statistical learning model of text classification with support vector machines. In SIGIR-01, pp. 128–136. 4. Apté, C., Damerau, F., and Weiss, S. (1994). Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12, 233–251. 5. Koller, D. and Sahami, M. (1997). Hierarchically classifying documents using very few words. In ICML-97, pp. 170–178. 6. Lewis, D. D. (1998). Naive Bayes at forty: The independence assumption in information retrieval. In ECML-98, pp. 4–15. 7. Schapire, R. E. and Singer, Y. (2000). Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2/3), 135–168. 8. Nigam, K., McCallum, A., Thrun, S., and Mitchell, T. M. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2–3), 103–134. 9. Witten, I. H., Moffat, A., and Bell, T. C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images (second edition). Morgan Kaufmann. |
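Several of the results surveyed in this entry rest on the naive Bayes text classifier, which is compact enough to sketch. The following is a minimal version for the spam/not-spam task, assuming a toy corpus and add-one (Laplace) smoothing; it illustrates the general technique, not any of the specific systems cited above.

    import math
    from collections import Counter

    def train(docs):
        """docs: list of (text, label). Returns log-priors, word counts, vocabulary."""
        classes = {label for _, label in docs}
        prior = {c: math.log(sum(l == c for _, l in docs) / len(docs)) for c in classes}
        counts = {c: Counter() for c in classes}
        for text, label in docs:
            counts[label].update(text.lower().split())
        vocab = set().union(*counts.values())
        return prior, counts, vocab

    def classify(text, prior, counts, vocab):
        """Pick argmax_c of log P(c) + sum_w log P(w|c), with Laplace smoothing."""
        scores = {}
        for c in prior:
            total = sum(counts[c].values())
            scores[c] = prior[c] + sum(
                math.log((counts[c][w] + 1) / (total + len(vocab)))
                for w in text.lower().split() if w in vocab)
        return max(scores, key=scores.get)

    docs = [("win money now", "spam"), ("cheap money offer", "spam"),
            ("meeting agenda today", "ham"), ("project meeting notes", "ham")]
    model = train(docs)
    print(classify("cheap money", *model))    # -> spam
    print(classify("meeting today", *model))  # -> ham

Modeling some dependencies among features, as in Koller and Sahami's classifier, improves on this, but the independence assumption above is the forty-year-old baseline that Lewis surveys.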
Norvig I Peter Norvig Stuart J. Russell Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010 |
Language | Schiffer, St. | I 273 Schiffer: Language processing is done by a series of subdoxastic internal states. |
|