Course Information

Syllabus PDF

Description: This course introduces students to the data, algorithms, models and tools of modern text analytics through lectures, course projects and presentations. Topics covered include text representation, classification, clustering, core natural language processing, sentiment and opinion analysis, neural-network-based approaches, trustworthiness issues, data integration, crowdsourcing and collective intelligence. Only minimal knowledge of probability, statistics and programming is necessary to attend this course.

Lectures: Tuesday/Thursday 1:10-2:25 at Packard Lab 258

Office Hours: Thursday 4:30 pm - 6:30 pm, Packard Lab 329

Prerequisites: For CSE 398: (MATH 231 or ECO 045) and (CSE 017). For CSE 498: instructor permission is required.
We will mainly use Python for demonstration purposes, although any programming language can be used for the projects. Python is quite readable and intuitive, and you should be able to learn it quickly if you have used Java or C++. Three lectures at the beginning of the course will be devoted to the core programming and math tools.

Formats: 1 closed-book mid-term, 2 projects (see more details at the end of this page), open-book in-class quizzes, presentations.

Grading: Mid-term 25%, project 1 15%, project 2 50% (proposal 10%, presentation 10%, deliverables 30%), in-class quizzes 10%. Late submissions will be penalized 20% of the total grade per late day (24 hours or part thereof) after the due date. No assignment will be accepted more than four days after its due date.
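As an illustration only, the arithmetic of the grading policy above can be sketched in Python. The function names, the 0-100 score scale, and the reading that the 20% penalty is taken from the assignment's maximum score (rather than from the earned score) are assumptions made for this sketch, not official policy; ask the instructor if in doubt.

```python
import math

# Weights copied from the grading policy above; they sum to 1.0.
WEIGHTS = {
    "midterm": 0.25,
    "project1": 0.15,
    "project2_proposal": 0.10,
    "project2_presentation": 0.10,
    "project2_deliverables": 0.30,
    "quizzes": 0.10,
}


def late_penalized(score, hours_late, max_score=100.0):
    """Apply the late penalty to one assignment score.

    Assumption: 20% of max_score is deducted per late day, and any
    part of a 24-hour period counts as a full day.
    """
    if hours_late <= 0:
        return score
    days_late = math.ceil(hours_late / 24)
    if days_late > 4:
        return 0.0  # more than four days late: not accepted
    return max(0.0, score - 0.20 * max_score * days_late)


def final_grade(scores):
    """Weighted sum of component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

For example, under these assumptions a project scored 90 and submitted 25 hours late counts as two late days, so `late_penalized(90, 25)` deducts 40 points.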

Textbooks

This course will use content from various books that are freely available online or through the Lehigh library.

Required

Students are encouraged to read the required materials listed in the schedule section before attending class. Problems in exams and quizzes will be based on the required readings.

IIR = Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008. Download.

SAOM = Sentiment Analysis and Opinion Mining, by Bing Liu. Morgan & Claypool Publishers, May 2012. Download.

FSNLP = Foundations of Statistical Natural Language Processing, by Christopher D. Manning and Hinrich Schütze. Cambridge, Mass.: MIT Press, 2000. Paper book available at Linderman Reserve and ebook available to Lehigh users.

Supplementary

These are excellent materials for getting alternative viewpoints and more details on the topics covered in the lectures. You are NOT required to read them and they will NOT be in your exams or quizzes.

NLPP = Natural Language Processing with Python, by Steven Bird, Edward Loper and Ewan Klein. O’Reilly Media Inc, 2009. Link (NLP algorithms implemented off-the-shelf).

SLP3 = Speech and Language Processing, by Daniel Jurafsky and James H. Martin. Draft of June 26, 2015. Link.

SA = Sentiment Analysis: Mining Opinions, Sentiments, and Emotions, by Bing Liu. Cambridge University Press, 2015. (The most current book on the topic, though you need to purchase it. We will use SAOM as the major source.)

TDA = Twitter Data Analytics, by Shamanth Kumar, Fred Morstatter and Huan Liu. Springer, 2013. Download (Mining texts and social network data on Twitter, with code).

PRML = Pattern Recognition and Machine Learning, by Christopher M. Bishop. Springer, 2006. Available to Lehigh users (Good source for basic machine learning algorithms like classification, clustering and probabilistic models, with a Bayesian flavor. Recommended only for advanced readers).

ESL = The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor Hastie, Robert Tibshirani and Jerome Friedman. Second Edition. Springer, 2013. Download (not recommended for beginners).

Online Resources

Coursesite: for posting grades and sending out notifications.

Piazza: for posting questions and discussions about assignments and lectures.

This website: for posting materials (slides, code, data, projects); it is the most up-to-date source.

Schedule


Projects:

The projects can be done in any programming language you feel comfortable with. Sharing or copying solutions is considered a violation of the honor code. This includes, but is not limited to, copying solutions from the web, textbook solution manuals and previous years' submissions.

Project 1

Project 2


Datasets: