Suggested Team Projects
Module 1: Optimization

Project 1
[AdaGrad or Adam]
We have learned gradient descent and its stochastic version in class, and now you will explore more advanced optimization algorithms
for faster and more stable model training.
You can choose one of the following:
John Duchi, Elad Hazan and Yoram Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR, 2011.
Diederik P. Kingma, Jimmy Ba, Adam: A Method for Stochastic Optimization. ICLR, 2015.
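As a warm-up, the Adam update rule from the Kingma & Ba paper can be sketched in a few lines of numpy. This is a toy version with hand-picked settings (the learning rate, step count, and quadratic objective are our own illustration choices), not the authors' implementation:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient and its
    square, with bias correction, then a coordinate-wise scaled step."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2       # second-moment estimate
    m_hat = m / (1 - beta1**t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                  # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = ||theta - target||^2 with Adam.
target = np.array([1.0, -2.0])
theta = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    grad = 2 * (theta - target)
    theta, m, v = adam_step(theta, m=m, v=v, t=t, grad=grad, lr=0.05)
```

AdaGrad can be obtained from a similar template by accumulating squared gradients without the exponential decay or bias correction.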

Project 2
[Asynchronous distributed optimization algorithms]
In our class, we learned how to solve a model by updating all parameters at the same time, based on the most recent values of all parameters.
A gradient can be computed only after every parameter has been updated in the previous iteration.
Can we update parameters using out-of-date values to break this bottleneck?
In this project, you will explore the power of asynchronous distributed optimization for large-scale problems.
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning, ICML, 2016.
B. Recht, C. Re, S. Wright, and F. Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. NIPS, 2011.
A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. NIPS, 2011.
H. R. Feyzmahdavian, A. Aytekin, and M. Johansson. An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Transactions on Automatic Control, 2016.
T. Paine, H. Jin, J. Yang, Z. Lin, and T. Huang. GPU asynchronous stochastic gradient descent to speed up neural network training. 2013.
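To make the idea concrete, here is a toy, Hogwild!-style experiment: several Python threads run SGD on a shared weight vector for a least-squares problem with no locking at all, so every thread reads whatever possibly stale values the vector currently holds. The data, step size, and thread count are arbitrary illustration choices, not the paper's setup:

```python
import threading
import numpy as np

# Toy lock-free ("Hogwild!"-style) SGD for least squares: threads update a
# shared weight vector in place, using whatever stale values it holds.
rng = np.random.default_rng(0)
n, d = 2000, 10
true_w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ true_w + 0.01 * rng.normal(size=n)

w = np.zeros(d)  # shared parameters; no lock protects them

def worker(indices, lr=0.01, epochs=20):
    global w
    for _ in range(epochs):
        for i in indices:
            g = (X[i] @ w - y[i]) * X[i]   # stochastic gradient at sample i
            w -= lr * g                    # unsynchronized in-place update

parts = np.array_split(rng.permutation(n), 4)
threads = [threading.Thread(target=worker, args=(p,)) for p in parts]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With CPython's GIL the threads mostly interleave rather than run truly in parallel, but the point of the exercise (convergence despite unsynchronized, possibly stale reads and writes) is the same.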

Project 3
[Dual decomposition for graphical model inference] We learned how to use message passing and sampling for graphical model inference.
A third alternative is to adopt an optimization-based approach.
We ask you to develop a distributed algorithm called dual decomposition, which solves optimization problems whose objective is a sum of functions sharing optimization variables.
You need to be familiar with basic duality theory in optimization (nothing too deep: if you can handle the SVM dual, you can handle this paper).
David Sontag, Amir Globerson and Tommi Jaakkola. Introduction to Dual Decomposition for Inference.
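A one-dimensional toy example shows the mechanics (our own construction, far simpler than graphical-model inference): to minimize (x-1)^2 + (x-3)^2, duplicate x into x1 and x2, dualize the consensus constraint x1 = x2 with a multiplier, and run dual ascent, so each subproblem is solved independently:

```python
# Toy dual decomposition: minimize f1(x) + f2(x) with f1(x) = (x-1)^2 and
# f2(x) = (x-3)^2. Duplicate x into x1, x2, add the constraint x1 = x2,
# dualize it with multiplier lam, and run dual (sub)gradient ascent.
def solve_subproblem_1(lam):   # argmin_x1 (x1-1)^2 + lam*x1, closed form
    return 1.0 - lam / 2.0

def solve_subproblem_2(lam):   # argmin_x2 (x2-3)^2 - lam*x2, closed form
    return 3.0 + lam / 2.0

lam, step = 0.0, 0.5
for _ in range(100):
    x1 = solve_subproblem_1(lam)   # the two subproblems could run in parallel
    x2 = solve_subproblem_2(lam)
    lam += step * (x1 - x2)        # dual ascent on the consensus violation
```

Both copies converge to the consensus optimum x* = 2; in the paper, each "subproblem" is instead an easy inference problem on a sub-tree of the graph.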

Project X
[Convex optimization and learning theory]
This paper connects optimization to the balancing of bias and variance.
Hongseok Namkoong, John C. Duchi. Variance-based Regularization with Convex Objectives.

Project X
[No local minima for matrix completion]
Matrix completion is widely used in collaborative filtering for recommendation systems (e.g., Netflix).
The formulated optimization problem is non-convex, so it is not clear whether local minima are also global ones.
This paper tries to answer this question.
Rong Ge, Jason D. Lee, Tengyu Ma. Matrix Completion has No Spurious Local Minimum.
Module 2: Graphical models

Project 4
[Collective inference for network classification]
You have probably heard of fake news on Facebook, but
you may not have realized that there are large numbers of fake reviews on Amazon and Yelp.
The following papers apply Markov Random Fields and message passing to find suspicious accounts and content on social networks and review websites.
J. Jia, B. Wang, L. Zhang, and N. Z. Gong. AttriInfer: Inferring user attributes in online social networks using Markov random fields. WWW, 2017.
Binghui Wang, Neil Zhenqiang Gong, Hao Fu. GANG: Detecting Fraudulent Users in Online Social Networks via Guilt-by-Association on Directed Graphs. ICDM, 2017.
Shebuti Rayana and Leman Akoglu. Collective Opinion Spam Detection: Bridging Review Networks and Metadata. KDD, 2015.

Project 5
[Belief propagation on GPU]
The vanilla belief propagation algorithm is sequential in nature: you loop through the edges of a graph and compute the messages one by one until convergence.
One may not want to wait long for convergence, especially during human-machine interaction.
To exploit the many cores of a GPU to speed up inference, a non-trivial reformulation of belief propagation is needed.
This paper presents an elegant reformulation that computes belief propagation using sparse matrix multiplication.
Reid M. Bixler. Sparse Matrix Belief Propagation. Master's Thesis, Virginia Tech, 2018.
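The key idea, messages as matrix-vector products, can already be seen on a tiny chain. The following is a toy example of ours (dense matrices, a three-node chain), not the thesis's sparse GPU formulation:

```python
import numpy as np

# Sum-product belief propagation on a 3-node chain x1 - x2 - x3 of binary
# variables, with each message computed as a matrix-vector product.
psi12 = np.array([[2.0, 0.5], [0.5, 1.0]])   # pairwise potential on (x1, x2)
psi23 = np.array([[1.0, 2.0], [3.0, 1.0]])   # pairwise potential on (x2, x3)

m1_to_2 = psi12.T @ np.ones(2)   # message from x1 to x2: sum over x1
m3_to_2 = psi23 @ np.ones(2)     # message from x3 to x2: sum over x3
belief2 = m1_to_2 * m3_to_2
belief2 /= belief2.sum()         # marginal p(x2); exact on a tree

# Brute-force check: enumerate all 2^3 configurations.
p = np.zeros(2)
for a in range(2):
    for b in range(2):
        for c in range(2):
            p[b] += psi12[a, b] * psi23[b, c]
p /= p.sum()
```

Stacking all messages of a large sparse graph into one matrix product is what lets a GPU compute them in parallel.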

Project 6
[Graphical models for multi-instance learning]
You are asked to apply directed graphical models to a natural language processing problem called "relation extraction".
You will be given bags of text snippets, each bag representing a pair of words; a relation needs to be predicted for the two words from the snippets containing them.
Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, Christopher D. Manning. Multi-instance Multi-label Learning for Relation Extraction. EMNLP, 2012.

Project 7
[Monte Carlo sampling for topic models]
We have learned graphical models and several inference algorithms.
Here we ask you to apply a Gibbs sampler to learn topics of text documents in an unsupervised way.
The implementation is not that hard, but understanding the model requires sophisticated manipulation of exponential families.
Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. PNAS, April 6, 2004, 101 (suppl 1), 5228–5235.
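To show how short the implementation can be, here is a minimal collapsed Gibbs sampler for LDA on a toy corpus. The corpus, number of topics, and hyperparameters are our own illustration choices; the full conditional follows the count-based formula used in the paper:

```python
import numpy as np

# Minimal collapsed Gibbs sampler for LDA on a toy 4-word vocabulary.
rng = np.random.default_rng(0)
docs = [[0, 0, 1, 1], [0, 1, 1, 0], [2, 3, 3, 2], [3, 2, 2, 3]]
V, K, alpha, beta = 4, 2, 0.1, 0.1

z = [[rng.integers(K) for _ in doc] for doc in docs]  # topic of each token
ndk = np.zeros((len(docs), K))   # document-topic counts
nkw = np.zeros((K, V))           # topic-word counts
nk = np.zeros(K)                 # topic totals
for d, doc in enumerate(docs):
    for i, word in enumerate(doc):
        ndk[d, z[d][i]] += 1; nkw[z[d][i], word] += 1; nk[z[d][i]] += 1

for _ in range(500):             # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, word in enumerate(doc):
            k = z[d][i]          # remove the current assignment
            ndk[d, k] -= 1; nkw[k, word] -= 1; nk[k] -= 1
            # full conditional p(z = k | all other assignments), up to a constant
            p = (ndk[d] + alpha) * (nkw[:, word] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k          # add back with the freshly sampled topic
            ndk[d, k] += 1; nkw[k, word] += 1; nk[k] += 1

phi = (nkw + beta) / (nkw.sum(1, keepdims=True) + V * beta)  # topic-word dists
```

On this corpus the sampler should separate the two word groups {0, 1} and {2, 3} into distinct topics.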
Module 3: Deep learning

Project 8
[Explainable RNN as information extraction]
Predictions from texts, such as the sentiment they express, are usually numeric, while the rationales behind the predictions can be hard to figure out.
This project asks you to implement a Recurrent Neural Network (RNN),
a deep learning model not covered in depth in our lectures,
that can explain to the user which text snippets lead to the predictions.
GPUs are available in Sandbox at your disposal, and you can use TensorFlow/PyTorch for your experiments.
Tao Lei, Regina Barzilay and Tommi Jaakkola. Rationalizing Neural Predictions. EMNLP, 2016.

Project 9
[Text generation using RNN]
You can use machine learning models to generate realistic natural language.
Here we ask you to apply an RNN to generate reviews for products on Amazon.
A dataset will be provided, and you can use existing source code online, but you are required to deliver substantial experiments.
GPUs are available in Sandbox at your disposal, and you can use TensorFlow/PyTorch for your experiments.
Ilya Sutskever, James Martens, and Geoffrey Hinton. Generating Text with Recurrent Neural Networks. ICML, 2011.
Li Dong, Shaohan Huang, Furu Wei, Mirella Lapata, Ming Zhou and Ke Xu. Learning to Generate Product Reviews from Attributes. EACL, 2017.

Project 10
[Deep Q-Learning for continuous action spaces] You have implemented a simple Q-learning algorithm using an MLP, with discrete action spaces.
In robotics, the action space can be continuous, and we ask you to go deeper and implement an actor-critic algorithm
with a continuous action space.
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver and Daan Wierstra. Continuous control with deep reinforcement learning. ICLR, 2016.

Project 11
[Deep Q-Learning for video game playing] This video explains most of the implementation, but you have to do it by yourselves.
See this GitHub repo for more information.

Project 12
[Variational autoencoder] The autoencoder can be formulated in a Bayesian framework and optimized using variational inference.
This project will allow you to extend your knowledge of neural networks, graphical models, and optimization by seeing how they are combined in a deep generative model.
Diederik P. Kingma, Max Welling. Auto-Encoding Variational Bayes. ICLR, 2014.
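Two ingredients at the heart of the paper can be sketched in plain numpy; this is a sketch of the math only (our own toy values), not a trainable model:

```python
import numpy as np

# Two VAE ingredients: the reparameterization trick and the analytic KL
# term of the ELBO, for a diagonal-Gaussian encoder q(z|x) and a
# standard-normal prior p(z).
rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I): the sampling step becomes
    # a deterministic, differentiable function of mu and sigma.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

mu = np.array([0.0, 1.0])
log_var = np.zeros(2)                     # sigma = 1 in both dimensions
z = reparameterize(mu, log_var)
kl = kl_to_standard_normal(mu, log_var)   # only mu[1] = 1 contributes: 0.5
```

In the full model these two pieces are combined with a reconstruction term and optimized by stochastic gradient descent.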
Module 4: Dimension reduction

Project 13
[EM algorithm for probabilistic PCA] An EM algorithm is proposed to find the parameters of the probabilistic PCA model.
It allows you to estimate parameters from streaming data rather than computing eigenvectors of a large covariance matrix,
and is thus more computationally efficient than eigendecomposition.
Sam Roweis. EM algorithms for PCA and SPCA. NIPS, 1998.
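In the zero-noise limit the two EM steps have simple closed forms. The following numpy sketch (our own synthetic, exactly low-rank data) follows the alternating least-squares view of the paper and never forms the full covariance matrix:

```python
import numpy as np

# EM for PCA in the zero-noise limit (Roweis): alternate between inferring
# latent coordinates X and refitting the loading matrix W.
rng = np.random.default_rng(0)
d, k, n = 5, 2, 500
W_true = rng.normal(size=(d, k))
Y = W_true @ rng.normal(size=(k, n))        # data lying in a k-dim subspace
Y = Y - Y.mean(axis=1, keepdims=True)

W = rng.normal(size=(d, k))                 # random initialization
for _ in range(50):
    X = np.linalg.solve(W.T @ W, W.T @ Y)   # E-step: project data onto W
    W = Y @ X.T @ np.linalg.inv(X @ X.T)    # M-step: refit the loadings

# W should span the same subspace as W_true; compare orthogonal projectors.
P = W @ np.linalg.solve(W.T @ W, W.T)
P_true = W_true @ np.linalg.solve(W_true.T @ W_true, W_true.T)
```

The E-step touches one batch of data at a time, which is what makes a streaming variant possible.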

Project 14
[MLE for probabilistic PCA] An MLE algorithm is proposed to find the parameters of the probabilistic PCA model.
As with traditional PCA, this algorithm needs to compute eigenvectors of a covariance matrix, which can be large.
This project is for you to learn another formulation that also leads to eigendecomposition.
Tipping, M. E. and C. M. Bishop (1999b). Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61(3), 611–622.

Project 15
[Kernel PCA] Traditional PCA, including probabilistic PCA, is linear.
Kernels can be introduced into PCA to handle nonlinearity in the data, much like in SVM, although here we are dealing with unsupervised learning.
Scholkopf, B., A. Smola, and K.-R. Muller (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5), 1299–1319.
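A bare-bones numpy sketch of the algorithm: form an RBF kernel matrix, double-center it, and take the top eigenvectors as nonlinear components. The bandwidth, data, and number of components are arbitrary illustration choices:

```python
import numpy as np

# Kernel PCA sketch: RBF kernel matrix, centering in feature space,
# then an eigendecomposition in place of the covariance eigenproblem.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

def rbf_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

K = rbf_kernel(X, X)
n = K.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J                          # double-centering of the kernel
eigval, eigvec = np.linalg.eigh(Kc)
idx = np.argsort(eigval)[::-1][:2]      # top-2 components
alphas = eigvec[:, idx] / np.sqrt(eigval[idx])  # unit-norm in feature space
Z = Kc @ alphas                         # projections of the training points
```

Note that everything is phrased in terms of the kernel matrix, so the (possibly infinite-dimensional) feature map is never constructed.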

Project 16
[Kernel CCA] Canonical correlation analysis tries to reduce dimensionality of data that have two modalities.
For example, to classify webpages that contain both text (the first modality) and images (the second modality),
you can use CCA to merge the two modalities into a shared latent space, where classification can tap information from both modalities.
Hardoon, David R., Szedmak, Sandor R., and Shawe-Taylor, John R. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16:2639–2664, 2004.

Project 17
[Interpretable matrix factorization] The basis and representation vectors produced by PCA are not interpretable: what do the entries in these vectors mean?
The interpretability of the outcome is important for the humans who use the results.
CUR matrix factorization uses vectors from the data matrix itself as "bases" to represent the whole dataset, so that the factors have the same meaning as the input vectors.
Michael W. Mahoney and Petros Drineas. CUR matrix decompositions for improved data analysis. PNAS, January 20, 2009, 106 (3), 697–702.
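The flavor of CUR fits in a few lines of numpy. The sketch below samples columns and rows proportionally to their squared norms and fits the middle factor by pseudoinverse; this is a common textbook variant for illustration, not exactly the leverage-score scheme of the paper:

```python
import numpy as np

# CUR sketch: approximate a low-rank matrix with actual columns C and rows R
# of the data matrix, so the factors keep the meaning of the original data.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))   # exactly rank 3

pc = (A**2).sum(0); pc /= pc.sum()       # column sampling probabilities
pr = (A**2).sum(1); pr /= pr.sum()       # row sampling probabilities
cols = rng.choice(40, size=8, replace=False, p=pc)
rows = rng.choice(50, size=8, replace=False, p=pr)
C, R = A[:, cols], A[rows, :]
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # least-squares middle factor
err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```

Because A is exactly rank 3 and 8 sampled columns/rows span its column and row spaces, the relative error here is essentially zero; on noisy data CUR trades some accuracy for interpretability.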
Module 5: Clustering

Project 18
[Information-theoretic clustering] A clustering problem can be formulated using information theory.
The idea is to treat the latent clustering variables as a bridge connecting the observed data vectors and the identifiers of the data.
The EM algorithm can be used to solve the resulting optimization problem.
Tishby, N., Pereira, F. and Bialek, W. (1999). The information bottleneck method. In The 37th Allerton Conference on Communication, Control, and Computing.

Project 19
[Spectral clustering] k-means assumes that the input data are in the form of vectors.
However, many datasets have relations between data items (represented by a graph), for example a social network of persons.
Even vectorial data can be transformed into a graph.
Spectral clustering is an algorithm that partitions a graph into regions that serve as clusters.
Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416.
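A minimal numpy sketch on a toy two-community random graph (our own construction; we also split on the second eigenvector directly instead of running k-means, which suffices for two clusters):

```python
import numpy as np

# Spectral clustering sketch: build a graph with two dense communities,
# take an eigenvector of the normalized Laplacian, and split on it.
rng = np.random.default_rng(0)
n = 20                                   # two communities of 10 nodes each
A = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        same = (i < 10) == (j < 10)
        if rng.random() < (0.9 if same else 0.05):
            A[i, j] = A[j, i] = 1.0

deg = A.sum(1)
L = np.eye(n) - A / np.sqrt(np.outer(deg, deg))  # normalized Laplacian
eigval, eigvec = np.linalg.eigh(L)
fiedler = eigvec[:, 1]                   # second-smallest eigenvector
labels = (fiedler > np.median(fiedler)).astype(int)
```

For k > 2 clusters one keeps the first k eigenvectors and runs k-means on their rows, as described in the tutorial.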
Module 6: Classification and Regression

Project 20
[Multiclass SVM] The SVM we studied is for binary classification. What if your data have more than two classes (called "multiclass classification")?
SVM can be reformulated to deal with multiclass problems.
Crammer, K. and Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research 2, 265–292.

Project X
Optimizing SVM through subgradient methods.
Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro.
Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.
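The core loop is only a few lines. Here is a toy numpy sketch with the paper's 1/(lambda * t) step size on synthetic, linearly separable data (the data and hyperparameters are our own illustration choices):

```python
import numpy as np

# Pegasos-style stochastic subgradient descent on the L2-regularized
# hinge loss, with the decreasing step size eta_t = 1 / (lambda * t).
rng = np.random.default_rng(0)
n, d, lam = 500, 2, 0.01
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([1.0, -1.0]))    # separable by w* along (1, -1)

w = np.zeros(d)
for t in range(1, 20001):
    i = rng.integers(n)                   # pick one random example
    eta = 1.0 / (lam * t)
    if y[i] * (X[i] @ w) < 1:             # margin violated: hinge is active
        w = (1 - eta * lam) * w + eta * y[i] * X[i]
    else:                                 # only the regularizer contributes
        w = (1 - eta * lam) * w

train_acc = np.mean(np.sign(X @ w) == y)
```

The appeal of the method is that each iteration touches a single example, so the cost per step is independent of the dataset size.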

Project 21
[Multi-task Classification] This problem studies how to use the same set of features to produce multiple outputs for multiple prediction tasks.
For example, you can use the same Facebook user profile to predict sex, age, and music preference at the same time.
Argyriou, A., Evgeniou, T., and Pontil, M. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

Project 22
[Ranking] Wondering how Facebook or Twitter recommends friends to you? This can be solved by ranking all Facebook/Twitter users based on your interest in them.
Regression models can be adapted to ranking, but there are more specialized models, algorithms, and metrics for ranking problems.
The following paper uses an SVM-style algorithm to optimize a ranking metric called NDCG.
Chapelle, O., Le, Q. and Smola, A. (2007). Large margin optimization of ranking measures. In NIPS Workshop: Machine Learning for Web Search.
Module 7: Learning Theory

Project 23
[Why use entropy in decision tree learning] There is a proof that entropy can provide a boosting-like advantage when learning a tree using C4.5 or ID3.
Michael Kearns and Y. Mansour. On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. Journal of Computer and System Sciences, 58(1), 1999, pages 109–128. Earlier version in Proceedings of the 28th ACM Symposium on the Theory of Computing, pp. 459–468, 1996, ACM Press.
Module 8: Combining models

Project 24
Bayesian optimization (BayesOpt) is now widely used for tuning deep network hyperparameters,
and this paper finds BayesOpt useful for combining multiple classification models too.
Julien-Charles Levesque, Christian Gagné and Robert Sabourin.
Bayesian Hyperparameter Optimization for Ensemble Learning. UAI'16: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence.

Project 25
Use a graphical model with variational inference to find the best way to combine decisions made by multiple classification models.
Qiang Liu, Jian Peng, Alexander T. Ihler.
Variational Inference for Crowdsourcing.
Advances in Neural Information Processing Systems 25 (NIPS 2012).
Module 9: FAT (Fair, Accountable, and Transparent) Machine Learning
More papers can be found
here.

Project 26
How can we explain why a classification model fails in a security-critical application?
This paper explores the answers with deep models.
Wenbo Guo, Dongliang Mu, Jun Xu, Purui Su, Gang Wang, and Xinyu Xing.
LEMNA: Explaining Deep Learning based Security Applications. CCS '18.

Project 27
What does it mean for a probabilistic classifier to be fair? Is it possible to guarantee fairness?
This theoretical paper formally defines several conditions of fairness and shows that it is impossible to satisfy all of them simultaneously.
Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan.
Inherent Trade-Offs in the Fair Determination of Risk Scores.

Project 28
Machine learning is designed to work with humans.
Decisions made by a model will be used downstream by human decision makers, who may have inherent bias.
This paper designs a learning model that can interact with humans while taking the bias into account.
David Madras, Toni Pitassi, and Richard Zemel.
Predict Responsibly: Improving Fairness and Accuracy by Learning to Defer

Project X
Explaining model predictions through influence functions.
Pang Wei Koh, Percy Liang.
Understanding Black-box Predictions via Influence Functions.

Project X
Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, Percy Liang
Fairness Without Demographics in Repeated Loss Minimization

Project X
Lydia Liu, Sarah Dean, Esther Rolf, Max Simchowitz, Moritz Hardt
Delayed Impact of Fair Machine Learning

Project X
Anish Athalye, Nicholas Carlini, David Wagner
Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples