Research Areas

All publications can be found here and on Google Scholar.
Explainable and fair ML on graphs
Graphs are ubiquitous in many applications, such as molecular biochemistry, neuroscience, the Internet, computer vision, NLP, and crowdsourcing [ICDM2021c]. Machine learning on graphs, especially with neural networks, has demonstrated accurate predictive power. PI Xie is investigating explainability and fairness beyond accuracy. (i) On large graphs, power-law degree distributions are common and can lead to fairness issues in graphical models that affect end-users. We propose a linear system to certify whether multiple desired fairness criteria can be fulfilled simultaneously, and, if not, a multi-objective optimization algorithm to find Pareto fronts for efficient trade-offs among the criteria [CIKM2021]. To reduce optimization cost, the team proposes continuous Pareto front exploration that exploits the smoothness of the set of Pareto optima. (ii) Graphical models can be hard for human users to understand due to multiplexed information propagation over many edges. The team has published a series of works addressing the challenges of making graphical models more interpretable, such as large discrete search spaces [ICDM2019], axiomatic attribution [CIKM2020], multi-objective explanations [ICDM2021a], and robustness of explanations via constrained optimization [ICDM2021b].
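As a toy illustration of the weighted-sum idea underlying Pareto front tracing (not the continuous exploration algorithm from the paper), the sketch below scalarizes two hypothetical conflicting criteria and sweeps the trade-off weight; the quadratic objectives f1 and f2 are assumptions chosen only so the minimizer has a closed form:

```python
def pareto_front_weighted_sum(weights):
    """Trace the Pareto front of two toy conflicting criteria,
    f1(x) = (x - 1)^2 and f2(x) = (x + 1)^2, by weighted-sum
    scalarization: minimize w*f1 + (1-w)*f2 over a grid of weights.
    The scalarized objective is quadratic, so its minimizer is
    x* = (w - (1 - w)) / (w + (1 - w)) = 2w - 1."""
    points = []
    for w in weights:
        x = 2.0 * w - 1.0  # closed-form minimizer of the scalarization
        points.append(((x - 1.0) ** 2, (x + 1.0) ** 2))
    return points

# A coarse sweep of the trade-off weight recovers a set of
# mutually non-dominated points: improving f1 worsens f2.
front = pareto_front_weighted_sum([i / 10 for i in range(11)])
```

For convex criteria, every weight yields a Pareto-optimal point; smoothness of this weight-to-optimum map is what a continuous exploration method can exploit instead of re-solving from scratch per weight.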
Trustworthy fraud detection
Online media platforms, such as Yelp, TripAdvisor, and Amazon, are full of opinionated information that can significantly influence a large number of customers' decisions. Due to the "word-of-mouth" effect, dishonest businesses have adopted unethical or even illegal marketing strategies, paying spammers to post fake reviews (opinion spam) to promote or demote target businesses and products, undermining the trustworthiness of online content. To address this issue, trustworthy (defined by AIR = "Accurate, Interpretable, and Robust") fraud detection is required (sketched in [CIC2018]). We have adopted propagation over networks [ICDM2011], temporal patterns [KDD2012], text features [DSAA2015], and multi-source data [BigData2016a, BigData2015]. Spam detectors are also constantly under attack by adversarial spammers in changing environments, so robust detectors are critical [BigData2018, KDD2020].
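A minimal sketch of the network-propagation idea (in the spirit of propagation-based detection, not the actual [ICDM2011] algorithm): spam suspicion spreads back and forth between reviewers and the products they review, with known seed labels clamped each round. The toy review data and damping scheme are assumptions for illustration:

```python
# Hypothetical toy data: reviewer -> list of products reviewed.
reviews = {"r1": ["p1", "p2"], "r2": ["p1"], "r3": ["p3"]}
known_spam = {"r2": 1.0}  # seed: r2 is a confirmed spammer

def propagate(reviews, seeds, iters=20, damping=0.5):
    """Propagate spam suspicion over the reviewer-product graph:
    each product inherits the mean suspicion of its reviewers, and
    each reviewer blends their old score with the mean suspicion of
    the products they reviewed; seed labels are clamped every round."""
    r_score = {r: seeds.get(r, 0.0) for r in reviews}
    products = {p for ps in reviews.values() for p in ps}
    p_score = {p: 0.0 for p in products}
    for _ in range(iters):
        for p in products:
            rs = [r for r, ps in reviews.items() if p in ps]
            p_score[p] = sum(r_score[r] for r in rs) / len(rs)
        for r, ps in reviews.items():
            mean_p = sum(p_score[p] for p in ps) / len(ps)
            r_score[r] = damping * r_score[r] + (1 - damping) * mean_p
        r_score.update(seeds)  # clamp known labels
    return r_score, p_score

r_score, p_score = propagate(reviews, known_spam)
```

Here r1, who shares a reviewed product with the seeded spammer r2, ends up more suspicious than r3, who touches no suspicious product.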
Ensemble and model fusion
An ensemble of multiple models, if fused properly, can provide more predictive power than any constituent model. Traditional ensemble methods, with a long history, studied the fusion of a small number of predictive models for binary and multi-class prediction. We move this field forward by fusing many unidentifiable predictive models, such as crowdsourcing workers, with sparse and structured outputs such as sets, rankings, and trees. The challenges are to gauge each individual model's performance and to take into account extra knowledge of the output space. Please check out these three papers [ICDM2013, DSAA2015, CIKM2016a], along with others [SDM2012, KDD2014, SDM2015b]. With Dr. Qi Li from Iowa State, we extended the framework to address fusion problems on sequential data found in natural language processing [ICDM2021c].
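To illustrate how an individual model's performance can be gauged without ground truth during fusion, here is a crude iteratively re-weighted majority vote over hypothetical crowdsourced labels (a simplification in the spirit of Dawid-Skene-style estimation, not the papers' structured-output methods):

```python
from collections import Counter

# Hypothetical crowdsourced labels: item -> {worker: label}.
votes = {
    "q1": {"w1": "A", "w2": "A", "w3": "B"},
    "q2": {"w1": "B", "w2": "B", "w3": "B"},
    "q3": {"w1": "A", "w2": "B", "w3": "B"},
}

def fuse(votes, iters=5):
    """Iteratively re-weighted majority voting: workers who agree
    with the current consensus earn higher weight, and the consensus
    is recomputed under those weights (a crude EM-style loop)."""
    weights = {w: 1.0 for labels in votes.values() for w in labels}
    consensus = {}
    for _ in range(iters):
        # E-like step: weighted vote per item.
        for item, labels in votes.items():
            tally = Counter()
            for w, lab in labels.items():
                tally[lab] += weights[w]
            consensus[item] = tally.most_common(1)[0][0]
        # M-like step: a worker's weight is their agreement rate.
        for w in weights:
            agree = sum(1 for item, labels in votes.items()
                        if labels.get(w) == consensus[item])
            total = sum(1 for labels in votes.values() if w in labels)
            weights[w] = agree / total
    return consensus, weights

consensus, weights = fuse(votes)
```

On this toy input, worker w2 agrees with the consensus on every item and receives the highest weight; structured outputs (sets, rankings, trees) would replace the per-item label comparison with a structure-aware agreement measure.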
Extreme Multi-labeled Learning
Multi-labeled learning assigns more than one semantic concept to a data item and has found wide application in areas such as bioinformatics, healthcare, e-commerce, and social media. Big data have changed the landscape of multi-labeled learning by increasing the number of labels to an unprecedented scale. For example, there are tens of thousands of tags (as labels) for texts on Stack Overflow and Yahoo Answers, and millions of tags for images on Flickr. Extreme multi-labeled learning scales up traditional multi-labeled learning to handle a large number of labels with varying importance, sparsity, and informativeness. Our current research addresses the scalability challenges in many aspects with the help of NLP, knowledge bases, and crowdsourcing. See [BigData2016a, CIKM2016b, SDM2016a] for on-going research on this topic.
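As a small sketch of a common extreme multi-label baseline (nearest-neighbor label transfer, not the methods in the cited papers), the snippet below scores candidate tags by the similarity between a query item and the training items carrying each tag, then returns the top-k tags; the toy tagged items are assumptions:

```python
from collections import Counter

# Hypothetical training items: (feature set, assigned tag set).
train = [
    ({"python", "loop"}, {"python", "iteration"}),
    ({"python", "dict"}, {"python", "hash-map"}),
    ({"css", "layout"}, {"css", "flexbox"}),
]

def predict_tags(features, train, k=2):
    """Nearest-neighbor label transfer: score each candidate tag by
    summing the Jaccard similarity between the query features and
    every training item that carries the tag, then return top-k.
    Only tags seen near the query get non-zero scores, which keeps
    the scoring sparse even with a huge label vocabulary."""
    scores = Counter()
    for feats, tags in train:
        sim = len(features & feats) / len(features | feats)
        for t in tags:
            scores[t] += sim
    return [t for t, _ in scores.most_common(k)]
```

A query sharing features with the first training item recovers that item's tags; at extreme scale the linear scan over training items would be replaced by approximate nearest-neighbor search.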

Funding

We are thankful to the following funding agencies for their support of our research.
  • CAREER: Bilevel Optimization for Accountable Machine Learning on Graphs (NSF IIS-2145922)
  • Program in the Foundations and Applications of Mathematical Optimization and Data Science (Lehigh Research Future Grant)
  • Efficient, explainable and robust data scientific methods for smart engineering systems (Lehigh Accelerator Grant)
  • Algorithms, systems, and theories for exploiting data dependencies in crowdsourcing (NSF IIS-2008155)
  • Learning Dynamic and Robust Defenses Against Co-Adaptive Spammers (NSF CNS-1931042)