CSE 398/498: Natural Language Processing

Course Information

Syllabus PDF

Description: Overview of modern natural language processing techniques: text normalization, language models, part-of-speech tagging, hidden Markov models, syntactic and dependency parsing, semantics, word sense, reference resolution, dialog agents, machine translation. Two class projects to design, implement and evaluate classic NLP algorithms. Credit will not be given for both CSE 398 and CSE 498.

Lectures: Tuesday/Thursday 1:10-2:25, Packard Lab 258

Office Hours: Thursday 4:30 pm - 6:30 pm, Packard Lab 329

Prerequisites: For CSE 398: (MATH 231 or ECO 045) and CSE 017. For CSE 498: instructor permission is required.
We will mainly use Java for projects. Relevant programming and math concepts will be discussed briefly only when necessary.

Formats: 1 closed-book mid-term, 4 coding projects, 1 final presentation.

Grading: Mid-term (15%), 4 coding projects (15% each), final (25%). There is no homework or quiz. Late submissions will be penalized 20% of the total grade per late day (24 hours or part thereof), and no assignment will be accepted more than four days after its due date. The projects will be graded partly on your programs' performance, measured by the metrics defined in each project.
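
The late policy above reduces to a short calculation. The sketch below is one reading of it, written in Java because the course projects use Java. The class and method names are hypothetical, and two points are assumptions rather than stated policy: the 20% is taken from the assignment's maximum score, and a submission that is no longer accepted scores 0.

```java
/** Sketch of the late policy: 20% per late day, nothing accepted after four days. */
final class LatePolicy {
    static final int PENALTY_PERCENT_PER_DAY = 20; // from the grading policy
    static final int MAX_LATE_DAYS = 4;            // from the grading policy

    /** Late days charged: every started 24-hour period counts as a full day. */
    static int lateDays(double hoursLate) {
        if (hoursLate <= 0) {
            return 0;
        }
        return (int) Math.ceil(hoursLate / 24.0);
    }

    /**
     * Score after the late penalty.
     * Assumption: the penalty is a share of maxScore, and a submission
     * that is past the four-day limit scores 0.
     */
    static double penalizedScore(double rawScore, double maxScore, double hoursLate) {
        int days = lateDays(hoursLate);
        if (days > MAX_LATE_DAYS) {
            return 0.0; // not accepted
        }
        double penalty = maxScore * (PENALTY_PERCENT_PER_DAY * days) / 100.0;
        return Math.max(0.0, rawScore - penalty);
    }

    public static void main(String[] args) {
        System.out.println(penalizedScore(90, 100, 0));   // on time: 90.0
        System.out.println(penalizedScore(90, 100, 25));  // two late days: 50.0
        System.out.println(penalizedScore(90, 100, 100)); // past four days: 0.0
    }
}
```

Under this reading, a submission that is 25 hours late is charged two late days, because the second 24-hour period has already started.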

Textbooks

SLP2 = Speech and Language Processing, 2nd Edition, by Daniel Jurafsky and James H. Martin.

SLP3 = Speech and Language Processing, 3rd Edition, by Daniel Jurafsky and James H. Martin. Most chapters freely available at Link.

FSNLP = Foundations of Statistical Natural Language Processing, by Christopher D. Manning and Hinrich Schütze. Cambridge, Mass.: MIT Press, 2000. Paper book available at Linderman Reserve and Ebook available to Lehigh users.

Online Resources

Coursesite: for posting grades only Link.

Piazza: for posting questions to be answered by the instructor or other students Link.

This website: for general information and resources (code, data, projects).

Schedule

The following topics will be covered (tentatively): words (language models); grammar (part-of-speech tagging, inference and training algorithms for HMMs, grammar and syntactic parsing, dependency parsing); semantics (word sense, semantic role labeling); discourse (coreference resolution and summarization); applications (machine translation, conversational agents).

Projects

There are four programming projects of increasing difficulty, corresponding to 4 levels of NLP: words, syntax, semantics and pragmatics. Each student must implement the projects in Java to get full credit (these are NOT team projects!). Project descriptions and code sketches will be released about 3 weeks before the due date of the corresponding project. Sharing or copying solutions is considered a violation of the honor code. This includes, but is not limited to, copying solutions from the web, textbook solution manuals, and previous years' submissions.

Project 1 asks you to implement a corpus-based spell-checker using N-grams. Several variations of N-gram estimators will be tested on some popular corpora. Through this project, you will learn how to process text data using Java, how to extract and use statistics from texts, and how to evaluate a spell checker. [Project Description]
Project 2 asks you to build a POS tagger using an HMM. Through this project, key algorithms for HMMs, such as the Viterbi and Baum–Welch algorithms, will be implemented and tested. [Project Description]
Project 3 will guide you in building a simple CFG syntactic parser. [Project Description]
Project 4 asks you to implement the Lesk algorithm for word sense disambiguation. [Project Description]
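
As an illustration of the kind of N-gram estimator that Project 1 mentions, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing. It is not the project's reference design: the class name, method names, and the `<s>`/`</s>` padding tokens are assumptions made for this example, and the actual project description may require different estimators or interfaces.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative bigram model with add-one smoothing (hypothetical names throughout). */
final class BigramModel {
    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    /** Counts the bigrams of one tokenized sentence, padded with <s> and </s>. */
    void addSentence(List<String> tokens) {
        String previous = "<s>";
        for (String token : tokens) {
            count(previous, token);
            previous = token;
        }
        count(previous, "</s>");
    }

    private void count(String previous, String current) {
        vocabulary.add(current); // word types that can follow a context
        contextCounts.merge(previous, 1, Integer::sum);
        bigramCounts.merge(previous + " " + current, 1, Integer::sum);
    }

    /**
     * P(current | previous) = (c(previous, current) + 1) / (c(previous) + V),
     * where V is the vocabulary size. A word never seen in training gets the
     * same score as an unseen in-vocabulary bigram; a fuller system would
     * typically map such words to an unknown-word token first.
     */
    double probability(String previous, String current) {
        int bigram = bigramCounts.getOrDefault(previous + " " + current, 0);
        int context = contextCounts.getOrDefault(previous, 0);
        return (bigram + 1.0) / (context + vocabulary.size());
    }

    public static void main(String[] args) {
        BigramModel model = new BigramModel();
        model.addSentence(Arrays.asList("the", "cat", "sat"));
        model.addSentence(Arrays.asList("the", "dog", "sat"));
        // Compare two candidate words following "the".
        System.out.println(model.probability("the", "cat")); // seen bigram: 2/7
        System.out.println(model.probability("the", "sat")); // unseen bigram: 1/7
    }
}
```

With this toy corpus the vocabulary has five types (the, cat, dog, sat, `</s>`) and "the" occurs twice as a context, so the seen bigram "the cat" scores (1 + 1) / (2 + 5) and the unseen bigram "the sat" scores (0 + 1) / (2 + 5). A spell checker could rank candidate corrections by such scores, though how candidates are generated and evaluated is left to the project description.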

Datasets:

CMU datasets for information retrieval.
Datasets for training POS tagger.
Treebank annotations for training syntactic parser.
Amazon QA data.
Amazon product review data.
Yelp restaurant review data: Yelp challenge.
Sentiment analysis challenges: SemEval-2016.
Google and Microsoft n-gram query: A Stack Overflow answer.
Annotated reviews for sentiment analysis (along with papers) Prof. Bing Liu's project page.
Google Word Sense Disambiguation Corpora Link.