AI Research on Text Classification - Dictionary of Arguments
Norvig I 865
Text Classification/text categorization/AI Research/Norvig/Russell: (…) given a text of some kind, decide which of a predefined set of classes it belongs to. Language identification and genre classification are examples of text classification, as are sentiment analysis (classifying a movie or product review as positive or negative) and spam detection (classifying an email message as spam or not-spam). >Spam/AI Research.
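As an illustration of the task (a text goes in, one of a predefined set of classes comes out), the following is a minimal sketch of a spam / not-spam classifier. It assumes a naive Bayes model with add-one smoothing and whitespace tokenization; these choices, the function names `train` and `classify`, and the toy training data are assumptions made for the example and are not taken from the source.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label) pairs. Returns the counts used by classify()."""
    class_docs = Counter()               # number of training documents per class
    word_counts = defaultdict(Counter)   # per-class word frequencies
    vocab = set()                        # all words seen in training
    for text, label in examples:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab

def classify(text, model):
    """Return the class with the highest log-posterior under naive Bayes."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)    # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # add-one (Laplace) smoothing avoids zero probabilities
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up training data (illustrative only).
TRAINING = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting agenda for monday", "not-spam"),
    ("lunch on monday with the team", "not-spam"),
]
model = train(TRAINING)
print(classify("win a cash prize", model))
print(classify("agenda for the team meeting", model))
```

The same two functions would work unchanged for other label sets (for example languages or genres), since the classes are simply whatever labels appear in the training pairs.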
Norvig I 884
Manning and Schütze (1999)(1) and Sebastiani (2002)(2) survey text-classification techniques. Joachims (2001)(3) uses statistical learning theory and support vector machines to give a theoretical analysis of when classification will be successful. Apté et al. (1994)(4) report an accuracy of 96% in classifying Reuters news articles into the “Earnings” category. Koller and Sahami (1997)(5) report accuracy up to 95% with a naive Bayes classifier, and up to 98.6% with a Bayes classifier that accounts for some dependencies among features. Lewis (1998)(6) surveys forty years of application of naive Bayes techniques to text classification and retrieval. Schapire and Singer (2000)(7) show that simple linear classifiers can often achieve accuracy almost as good as more complex models and are more efficient to evaluate. Nigam et al. (2000)(8) show how to use the EM algorithm to label unlabeled documents, thus learning a better classification model. Witten et al. (1999)(9) describe compression algorithms for classification, and show the deep connection between the LZW compression algorithm and maximum-entropy language models.
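The idea of classifying by compression can be sketched as follows: append the new document to each class's training text and choose the class whose compressed size grows least. This is only one simple way to use a compressor as a classifier and is not claimed to be the method of Witten et al.; it also uses the DEFLATE algorithm from Python's `zlib` module rather than LZW. The function names and the toy corpora are assumptions made for the example.

```python
import zlib

def compressed_size(text: str) -> int:
    """Size in bytes of the zlib (DEFLATE) compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def classify_by_compression(document: str, corpora: dict) -> str:
    """Pick the class whose corpus grows least when the document is appended."""
    def extra_bytes(label: str) -> int:
        corpus = corpora[label]
        return compressed_size(corpus + " " + document) - compressed_size(corpus)
    return min(corpora, key=extra_bytes)

# Toy, made-up corpora (illustrative only; real corpora would be far larger).
CORPORA = {
    "earnings": (
        "net income per share increased in the third quarter. "
        "the board said the dividend was raised after profit rose. "
        "quarterly revenue and operating profit beat forecasts."
    ),
    "sports": (
        "the home team scored a late goal to win the match. "
        "the coach praised the goalkeeper after the final whistle. "
        "fans celebrated the championship title in the stadium."
    ),
}

print(classify_by_compression(
    "net income per share increased and the dividend was raised", CORPORA))
print(classify_by_compression(
    "the coach praised the home team after the final whistle", CORPORA))
```

A document that shares long substrings with a class corpus can be encoded mostly as back-references to that corpus, so it adds few bytes; that is the sense in which a compressor acts as an implicit language model here.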
1. Manning, C. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
2. Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
3. Joachims, T. (2001). A statistical learning model of text classification with support vector machines. In SIGIR-01, pp. 128–136.
4. Apté, C., Damerau, F., and Weiss, S. (1994). Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12, 233–251.
5. Koller, D. and Sahami, M. (1997). Hierarchically classifying documents using very few words. In ICML-97, pp. 170–178.
6. Lewis, D. D. (1998). Naive Bayes at forty: The independence assumption in information retrieval. In ECML-98, pp. 4–15.
7. Schapire, R. E. and Singer, Y. (2000). BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2–3), 135–168.
8. Nigam, K., McCallum, A., Thrun, S., and Mitchell, T. M. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2–3), 103–134.
9. Witten, I. H., Moffat, A., and Bell, T. C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images (second edition). Morgan Kaufmann.
_____________
Explanation of symbols: Roman numerals indicate the source, Arabic numerals indicate the page number. The corresponding books are listed below. ((s)…): Comment by the sender of the contribution. Translations: Dictionary of Arguments. The note [Concept/Author], [Author1]Vs[Author2] or [Author]Vs[term], as well as "problem:"/"solution:", "old:"/"new:" and "thesis:", is an addition from the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.
Norvig I: Stuart J. Russell, Peter Norvig
Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ 2010