CSE 398/498: Text Mining

Course Information

Syllabus PDF

Description: This course introduces students to the data, algorithms, models and tools of modern text analytics through lectures, course projects and presentations. Topics covered include text representation, classification, clustering, core natural language processing, sentiment and opinion analysis, neural-network-based approaches, trustworthiness issues, data integration, crowdsourcing and collective intelligence. Only minimal knowledge of probability, statistics and programming is needed to take this course.

Lectures: Tuesday/Thursday 1:10-2:25 at Packard Lab 258

Office Hours: Thursday 4:30-6:30 pm, Packard Lab 329

Prerequisites: For CSE 398: (MATH 231 or ECO 045) and (CSE 017). For CSE 498, instructor permission is required.
We will mainly use Python for demonstration purposes, although any programming language can be used for the projects. Python is quite readable and intuitive, and you should be able to learn it quickly if you have programmed in Java or C++. Three lectures at the beginning of the course will be devoted to the core programming and math tools.

Formats: 1 closed-book mid-term, 2 projects (see more details at the end of this page), open-book in-class quizzes, presentations.

Grading: Mid-term 25%, project 1 15%, project 2 50% (proposal 10%, presentation 10%, deliverables 30%), in-class quizzes 10%. Late submissions are penalized 20% of the total grade per late day (24 hours or part thereof) after the due date. No assignment will be accepted more than four days after its due date.
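The grading arithmetic above can be sketched in a few lines of Python. This is an illustrative sketch only, not an official grade calculator: the weights come from the grading policy, while the function names and the reading of the late policy (20% of the assignment's maximum score deducted for each started 24-hour period, and a score of zero once more than four days have passed) are assumptions.

```python
import math

# Weights taken from the grading policy; the key names are illustrative.
WEIGHTS = {
    "midterm": 0.25,
    "project1": 0.15,
    "project2_proposal": 0.10,
    "project2_presentation": 0.10,
    "project2_deliverables": 0.30,
    "quizzes": 0.10,
}


def late_penalized(score, hours_late, max_score=100.0):
    """Apply the late policy under the assumed reading described above."""
    if hours_late <= 0:
        return score
    # "24 hours or part thereof": any started day counts as a full late day.
    late_days = math.ceil(hours_late / 24)
    if late_days > 4:
        # Assumption: a submission that is not accepted scores zero.
        return 0.0
    return max(0.0, score - 0.20 * max_score * late_days)


def course_total(scores):
    """Weighted total on a 0-100 scale; `scores` maps component name to 0-100."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

For example, under these assumptions a project scored 90 and submitted 25 hours late counts as two late days, so `late_penalized(90, 25)` gives 50.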

Textbooks

This course will use content from various books that are freely available online or through the Lehigh library.

Required

Students are encouraged to read the required materials listed in the schedule section before attending class. Problems in exams and quizzes will be based on the required readings.

IIR = Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. Cambridge University Press, 2008. Download.

SAOM= Sentiment Analysis and Opinion Mining, by Bing Liu. Morgan & Claypool Publishers, May 2012. Download.

FSNLP= Foundations of Statistical Natural Language Processing, by Christopher D. Manning and Hinrich Schütze. Cambridge, Mass.: MIT Press, 2000. Paper book available at Linderman Reserve and Ebook available to Lehigh users.

Supplementary

These are excellent materials for you to get alternative viewpoints and more details of the materials covered in the lectures. You are NOT required to read them and they will NOT be in your exams or quizzes.

NLPP= Natural Language Processing with Python, by Bird, Steven, Edward Loper and Ewan Klein. O’Reilly Media Inc, 2009. Link (NLP algorithms implemented off-the-shelf).

SA = Sentiment Analysis: Mining Opinions, Sentiments, and Emotions, by Bing Liu. Cambridge University Press, 2015. (The most current book on the topic, though you need to purchase it. We will use SAOM as the main source).

TDA = Twitter Data Analytics, by Shamanth Kumar, Fred Morstatter, and Huan Liu. Springer, 2013. Download (Mining texts and social network data on Twitter, with code).

PRML= Pattern Recognition and Machine Learning, by Christopher M. Bishop. Springer, 2006. Available to Lehigh users (Good source for basic machine learning algorithms like classification, clustering, probabilistic models, with a Bayesian flavor. Recommended only for advanced readers).

ESL = The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor Hastie, Robert Tibshirani and Jerome Friedman. Second Edition. Springer, 2013. Download (not recommended for beginners).

Online Resources

Coursesite: here for posting grades and sending out notifications.

Piazza: For posting questions and discussions about assignments and lectures.

This website: for posting materials (slides, code, data, projects); it is the most up-to-date source.

Schedule

Projects:

The projects can be done in any programming language you feel comfortable with. Sharing or copying solutions is considered a violation of the honor code. This includes, but is not limited to, copying solutions from the web, textbook solution manuals and previous years' submissions.

Project 1

This is an individual project: each student completes it alone. The project consists of implementing some existing text mining algorithms.
For non-CS students: if you feel that you don't have sufficient programming skills to accomplish this project, please discuss more suitable alternatives with the instructor.
This year, we will implement a classifier and an information extractor for sentiment analysis using Amazon reviews.
The description of the project is here. The dataset is here. The link to the Stanford tagger is here. You may find a Python notebook with file input and output and an example of the POS tagger here (download the file to your local disk and open it in Jupyter).
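As a starting point for reading the review data, here is a minimal sketch that is not part of the official project materials. It assumes the dataset is a gzipped file with one JSON object per line, and that each review carries a `reviewText` string and a numeric `overall` star rating. Check both assumptions against the actual file. The function names and the rule mapping ratings to labels (4-5 stars positive, 1-2 negative, 3 skipped) are illustrative choices, not project requirements.

```python
import gzip
import json


def read_reviews(path):
    """Yield one dict per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def to_labeled_examples(reviews, text_key="reviewText", rating_key="overall"):
    """Turn reviews into (text, label) pairs for a sentiment classifier.

    Assumed convention: rating >= 4 is "pos", rating <= 2 is "neg",
    and 3-star or incomplete reviews are skipped.
    """
    for review in reviews:
        text = review.get(text_key, "")
        rating = review.get(rating_key)
        if rating is None or not text:
            continue
        if rating >= 4:
            yield text, "pos"
        elif rating <= 2:
            yield text, "neg"
```

A call such as `list(to_labeled_examples(read_reviews("reviews_Musical_Instruments_5.json.gz")))` would then give the labeled pairs, assuming the file has been downloaded to the working directory.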

Project 2

This is a team project for teams of up to 3 students. Undergraduate students are encouraged to team up with graduate students. I recommend forming a team with complementary skills (programming, domain knowledge, written and oral presentation, etc.).
The project deliverables include a 15-20 minute in-class presentation and a final product (runnable demos or quantitative results, along with a technical report).
Individual team members will specify their and other team members' contributions to the project in their technical report. In the extreme case where a team member contributes superficially, the grades of the member for the project will be adjusted accordingly. The final presentations will also be evaluated by other students in the classroom.
Details can be found here. Deadlines of various deliverables can be found in the course schedule.

Datasets:

Amazon QA data: http://jmcauley.ucsd.edu/data/amazon/qa/
Amazon product review data: http://www.cse.lehigh.edu/~sxie/teaching/data/reviews_Musical_Instruments_5.json.gz
Yelp restaurant review data: Yelp challenge.
Sentiment analysis challenges: SemEval-2016.
Google and Microsoft n-gram query: a Stack Overflow answer.
Annotated reviews for sentiment analysis (along with papers): Prof. Bing Liu's project page.