CS 326/426: Machine learning projects

Suggested Team Projects

Module 1: optimization

Project 1 [AdaGrad or Adam] We have learned gradient descent and its stochastic version in class, and now you will explore more advanced optimization algorithms for faster and more stable model training. You can choose one of the following:
John Duchi, Elad Hazan and Yoram Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR, 2011. Diederik P. Kingma, Jimmy Ba, Adam: A Method for Stochastic Optimization. ICLR, 2015.
Project 2 [Asynchronous distributed optimization algorithms] In our class, we learned how to solve a model by updating all parameters at the same time, based on the most updated values of all parameters. The gradient is available only after all parameters are updated in the last iteration. Can we update parameters using out-of-date values to beat this bottleneck? In this project, you will explore the power of asynchorous distributed optimization for large-scale optimization.
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning , ICML, 2016.
B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lockfree approach to parallelizing stochastic gradient descent. NIPS, 2011.
A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. NIPS, 2011.
H. R. Feyzmahdavian, A. Aytekin, and M. Johansson. An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Transactions on Automatic Control, 2016.
T. Paine, H. Jin, J. Yang, Z. Lin, and T. Huang. GPU asynchronous stochastic gradient descent to speed up neural network training. 2013.
Project 3 [Dual decomposition for graphical model inference] We learned how to use message passing and sampling for graphical model inference. The third alternative is to adopt an optimization-based approach. We ask you to develop a distributed algorithm called dual decomposition that solves optimization problems that is a sum of objective functions sharing optimization variables. You need to be familiar with the basic duality theory in optimization (but not very deep, if you can deal with SVM, you can deal with this paper).
David Sontag, Amir Globerson and Tommi Jaakkola. Introduction to Dual Decomposition for Inference.
Project X [Convex optimization and learning theory] This paper connects optimization to bias and variance balancing.
Hongseok Namkoong, John C. Duchi. Variance-based Regularization with Convex Objectives .
Project X [No local minima for matrix completion] Matrix completion is widely used in collaborative filtering for recommendation system (Netflix). It is not clear whether the formulated optimization problem is convex or not, and thus whether local minima are also global ones. This paper tries to answer this question.
Rong Ge, Jason D. Lee, Tengyu Ma. Matrix Completion has No Spurious Local Minimum .

Module 2: graphical models

Project 4 [Collective inference for network classification] You probably have heard of fake news on Facebook, but you may have not realized that there are large number of fake reviews on Amazon and Yelp. The following papers applied Markov Random Fields and message passing to find suspicious accounts contents on social networks and review websites.
J. Jia, B. Wang, L. Zhang, and N. Z. Gong, AttriInfer: Inferring user attributes in online social networks using markov random fields, in WWW, 2017.
Binghui Wang, Neil Zhenqiang Gong, Hao Fu, GANG: Detecting Fraudulent Users in Online Social Networks via Guilt-by-Association on Directed Graphs, in ICDM, 2017.
Shebuti Rayana and Leman Akoglu, Collective Opinion Spam Detection: Bridging Review Networks and Metadata, KDD, 2015
Project 5 [Belief propagation on GPU] The vanilla belief propagation algorithm is sequential in nature: you loop through the edges of a graph and compute the messages one-by-one until convergence. One may not want to wait for a long time for the convergence, especially during human-machine interactions. To exploit multi-core on GPU to speed up inference, non-trivial re-formulation of belief propagation is needed. This paper presents an elegant re-formulation using sparse matrix multipilcation to compute belief propagation.
Reid M. Bixler, Sparse Matrix Belief Propagation., Master Thesis, Virginia Tech. 2018
Project 6 [Graphical models for multi-instance learning] You are asked to apply directed graphical models to a natural language processing problem called "relation extraction". You will be given bags of text of snippets, with each bag representing a pair of words where a relation needs to be predicted for the two words contained in the text snippet.
Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, Christopher D. Manning. Multi-instance Multi-label Learning for Relation Extraction, EMNLP, 2012.
Project 7 [Monte Carlo sampling for topic models] We have learned graphical model and several inference algorithms. Here we ask you to apply a Gibbs sampler to learn topics of text documents in an unsupervised way. Implementation is not that hard, but understanding the model require sophisticated manipulation of exponential families.
Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. PNAS April 6, 2004 101 (suppl 1) 5228-5235.

Module 3: Deep learning

Project 8 [Explainable RNN as information extraction] Predictions from texts, such as sentiment expressed by the texts, are usually numeric, while the rationales behind the predictions can be hard to figure out. This project asks you to implement RNN, a deep learning model called "Rurrent Neural Network" (not covered in depth in our lecture), that can explain to the user what text snippets lead to the predictions. GPUs are available in Sandbox at your disposal and you can use Tensorflow/PyTorch for your experiments.
Tao Lei, Regina Barzilay and Tommi Jaakkola. Rationalizing Neural Predictions, EMNLP, 2016.
Project 9 [Text generation using RNN] You can use machine learning model to generate realistic natural language. Here we ask you to apply RNN to generate reviews for products on Amazon. Dataset will be provided and you can use existing source codes online, but you're required to deliver substantial experiments. GPUs are available in Sandbox at your disposal and you can use Tensorflow/PyTorch for your experiments.
Ilya Sutskever, James Martens, and Geoffrey Hinton. Generating Text with Recurrent Neural Networks, ICML, 2011.
Li Dong, Shaohan Huang, Furu Wei, Mirella Lapata, Ming Zhou and Ke Xu. Learning to Generate Product Reviews from Attributes, EACL, 2017.
Project 10 [Deep Q-Learning for continuous action space] You have implemented a simple Q-learning algorithm using MLP, with discrete action spaces. In robotics, the action space can be continuous, and we asked you to go deeper and implement an actor-critic algorithm with continuous action space.
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver and Daan Wierstra. Continuous control with deep reinforcement learning, ICLR, 2016.
Project 11 [Deep Q-Learning for video game playing] This video explains pretty much of the implementation, but you have to do it by yourselves.
See this github repo for more information.
Project 12 [Variational autoencoder] Autoencoder can be formulated using a Bayesian framework and optimized using variational inference. This work will allow you to extend your knowledge in neural network, graphical model and optimization by seeing how they are combined to tackle a deep learning model.
Diederik P. Kingma, Max Welling. Auto-Encoding Variational Bayes, ICLR, 2014.

Module 4: Dimension reduction

Project 13 [EM algorithm for probabilistic PCA] An EM algorithm is proposed to find parameters of the probabilistic PCA model. It allows you to estimate parameters using streaming data rather than computing eigenvectors of a large covariance matrix, and is thus more computational efficient than eigen-decomposition.
Sam Roweis, EM algorithms for PCA and SPCA, in NIPS 1998
Project 14 [MLE for probabilistic PCA] An MLE algorithm is proposed to find parameters of the probabilistic PCA model. Same as traditional PCA, this algorithm needs to compute eigenvectors of a covariance matrix, which can be large. The project is for you to learn another formulation that also leads to eigen-decomposition.
Tipping, M. E. and C. M. Bishop (1999b). Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 21(3), 611–622.
Project 15 [Kernel PCA] Traditional PCA, including probabilistic PCA, are linear. Kernels can be introduced into PCA to handle non-linearity in the data, pretty much like SVM, although we are dealing with unsupervised learning.
Scholkopf, B., A. Smola, and K.-R. Muller (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5), 1299–1319.
Project 16 [Kernel CCA] Canonical correlation analysis tries to reduce dimensionality of data that have two modalities. For example, to classify webpages that can contain both texts (first modality) and images (the second modality), you can use CCA to merge the two modalities into a shared latent space, where classification can tap on information from both modalities.
Hardoon, David R., Szedmak, Sandor R., and Shawe Taylor, John R. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16:2639–2664, 2004.
Project 17 [Interpretable matrix factorization] The basis and representation vectors produced by PCA are not interpretable: what does the entries in these vectors mean? The interpretability of the outcome is important for humans who use the results. CUR matrix factorization uses vectors in the data matrix as "bases" to represent the whole dataset, so that the vectors has the same meaning as the input vectors.
Michael W. Mahoney and Petros Drineas CUR matrix decompositions for improved data analysis. PNAS January 20, 2009 106 (3) 697-702.

Module 5: Clustering

Project 18 [Information theoretical clustering] A clustering problem can be formulated using information theory. The idea is to treat the latent clustering variables as a bridge connecting the observed data vectors and the identifiers of the data. EM algorithm can be used to solve the resulting optimization problem.
Tishby, N., Pereira, F. and Bialek, W. (1999), The information bottleneck method, in The 37’th Allerton conference on communication, control, and computing.
Project 19 [Spectral clustering] k-means assumes that the input data are in a form of vectors. However, many datasets have relation between data items (represented by a graph), for example, a social network of persons. Even the vectorial data can be transformed to a graph. Spectral clustering is a clustering algorithm that partitions a graph into regions as clusters.
Von Luxburg, U. (2007), A tutorial on spectral clustering, Statistics and Computing, 17(4), 395–416.

Module 6: Classification and Regression

Project 20 [Multi-class SVM] The SVM we studied is for binary classification. What if your data have more than two classes (called "Multi-class classification")? SVM can be re-formulated to deal with multi-class problems.
Crammer, K. and Singer, Y. (2001), On the algorithmic implementation of multi-class kernel-based vector machines, Journal of Machine Learning Research 2, 265–292.
Project X Optimizing SVM through sub-gradient.
Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM
Project 21 [Multi-task Classification] The problem studies how to use the same set of features to produce multiple output for multiple prediction tasks. For example, you can use the same Facebook user profile to predict the sex, age and music preference at the same time.
Argyriou, A., Evgeniou, T., and Pontil, M. Convex multi-task feature learning. Machine Learning, 73 (3):243–272, 2008.
Project 22 [Ranking] Wondering how Facebook or Twitter recommend friends to you? This can be solved by ranking all Facebook/Twitter users based on your interest into them. Regression models can be adapted to ranking, but there are more specific models, algorithms and metrics for ranking problems. The following paper uses an SVM-style algorithm to optimize a ranking metric called NDCG.
Chapelle, O., Le, Q. and Smola, A. (2007), Large margin optimization of ranking measures, in NIPS workshop: Machine learning for Web search (Machine Learning).

Module 7: Learning Theory

Project 23 [Why using Entropy in Decision Tree Learning] There is a proof that entropy can encourage a boosting-like advantage when learning a tree using C4.5 or ID3.
Michael Kearns and Y. Mansour. On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. On the algorithmic implementation of multi-class kernel-based vector machines, Journal of Computer and Systems Sciences, 58(1), 1999, pages 109-128. Earlier version in Proceedings of the 28th ACM Symposium on the Theory of Computing, pp.459-468, 1996, ACM Press.

Module 8: Combining models

Project 24 Bayesian optimization (BayesOpt) is now widely used in tuning deep network hyperparameters, and this paper finds BayesOpt useful for combining multiple classification models too.
Julien-Charles Levesque, Christian Gagne and Robert Sabourin. Bayesian Hyperparameter Optimization for Ensemble Learning, UAI'16 Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence.
Project 25 Use graphical model with variational optimization to find best way to combine decisions made by multiple classification models.
Qiang Liu, Jian Peng, Alexander T. Ihler. Variational Inference for Crowdsourcing, Advances in Neural Information Processing Systems 25 (NIPS 2012).

Module 9: FAT (Fair, Accountable, and Transparent) Machine Learning

More papers can be found here.

Project 26 How to explain why a classification model fails in security-critical application? This paper explore the answers with deep models.
Wenbo Guo, Dongliang Mu, Jun Xu, Purui Su, Gang Wang, and Xinyu Xing. LEMNA: Explaining Deep Learning based Security Applications , CCS ’18
Project 27 What do you mean when a probabilistic classifier is fair? Is it possile to guarantee fairness? This theoretical paper formally define the conditions of fairness and show that it is impossible to satisfy all of them simultaneously.
Inherent Trade-Offs in the Fair Determination of Risk Scores
Project 28 Machine learning is designed to work with humans. Decisions made by a model will be used downstream by human decision makers, who may have inherent bias. This paper designed a learning model that can interact with humans while taking into account the the bias.
David Madras, Toni Pitassi, and Richard Zemel. Predict Responsibly: Improving Fairness and Accuracy by Learning to Defer
Project X Explaining linear models through influence matrix.
Pang Wei Koh, Percy Liang Understanding Black-box Predictions via Influence Functions
Project X
Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, Percy Liang Fairness Without Demographics in Repeated Loss Minimization
Project X
Lydia Liu, Sarah Dean, Esther Rolf, Max Simchowitz, Moritz Hardt Delayed Impact of Fair Machine Learning
Project X
Anish Athalye, Nicholas Carlini, David Wagner Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples