DATA 690 Special Topics: Introduction to Natural Language Processing

Course Description: This course aims to teach the use of natural language processing (NLP) as a set of methods for exploring and reasoning about text as data. The focus will be on the applied side of NLP. Students will use existing NLP methods and libraries in Python to solve textual problems. Topics include language modeling, text classification, sentiment analysis, summarization, and machine translation.
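
As a taste of the applied style of the course, the sketch below classifies short texts with an off-the-shelf Python library. It is a minimal illustration only, assuming scikit-learn is installed; the toy sentences and labels are hypothetical, not course material.

    # Minimal text-classification sketch (assumes scikit-learn is installed).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy labeled corpus (illustrative only).
    texts = [
        "I loved this movie",
        "what a great book",
        "terrible plot and acting",
        "I hated every minute",
    ]
    labels = ["pos", "pos", "neg", "neg"]

    # Bag-of-words features feeding a Naive Bayes classifier.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)

    print(model.predict(["a great movie", "a terrible book"]))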

Prerequisites: DATA 602.

References:

  • Daniel Jurafsky and James H. Martin, “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition,” Prentice Hall, 2008 (2nd edition)
  • Christopher D. Manning and Hinrich Schütze, “Foundations of Statistical Natural Language Processing,” MIT Press, 2000
  • Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, “Introduction to Information Retrieval,” Cambridge University Press, 2008
  • Nitin Indurkhya and Fred J. Damerau, editors, “Handbook of Natural Language Processing,” CRC Press, 2010 (2nd edition)

Learning Outcomes: After this course, students should be able to

  • Understand the key concepts of NLP for describing and analyzing language
  • Describe the typical problems and processing layers in NLP
  • Analyze NLP problems to decompose them into independent components
  • Choose appropriate solutions for solving typical NLP problems (tokenizing, tagging, parsing)
  • Assess and evaluate NLP-based systems

Tentative Schedule

  • Introduction to NLP
  • Basic text processing: tokenization and segmentation; word normalization (stemming, lemmatization, morphological analyzers); regular expressions; edit distance
  • N-grams, perplexity, and methods of smoothing (a short worked sketch appears after this schedule)
  • Language models and their applications: input prediction, error correction, speech recognition, and text generation
  • Tagging: POS tagging and named entity recognition
  • Hidden Markov models and the Viterbi algorithm
  • Midterm Exam/Project
  • Text classification, sentiment analysis, and the Naive Bayes classifier
  • Performance measures: Accuracy, precision, recall, and F-measure
  • Parsing: Trees, context-free grammars, probabilistic approaches to parsing, lexicalized PCFGs, and the CKY algorithm
  • Machine Translation: Direct, transfer-based, interlingual, and statistical MT
  • Computational Semantics: Word senses and meanings; WordNet; semantic similarity measures: thesaurus-based and distributional methods.
  • Text Summarization: Extractive and abstractive summarization, multiple-document summarization, and query-based summarization
  • Unsupervised Text Summarization and Evaluation of Summarization Systems
  • Final Exam/Project
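
A short worked sketch for the language-modeling unit is given below: it estimates a bigram model with add-one (Laplace) smoothing from a toy corpus and computes perplexity on a held-out sentence. The corpus and the held-out sentence are illustrative only, and the code is a minimal plain-Python sketch rather than a required implementation.

    import math
    from collections import Counter

    # Toy training corpus (illustrative only), with sentence-boundary markers.
    corpus = [
        ["<s>", "the", "cat", "sat", "</s>"],
        ["<s>", "the", "dog", "sat", "</s>"],
        ["<s>", "the", "cat", "ran", "</s>"],
    ]

    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
    V = len(unigrams)  # vocabulary size, used by add-one smoothing

    def prob(prev, word):
        # Add-one (Laplace) smoothed bigram probability P(word | prev).
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

    def perplexity(sentence):
        # Perplexity is the exponential of the average negative log-probability per bigram.
        pairs = list(zip(sentence, sentence[1:]))
        log_prob = sum(math.log(prob(a, b)) for a, b in pairs)
        return math.exp(-log_prob / len(pairs))

    # Held-out sentence containing an unseen bigram ("dog", "ran").
    print(perplexity(["<s>", "the", "dog", "ran", "</s>"]))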