Research Areas

All publications can be found here and on Google Scholar.
Trustworthy fraud detection
Online media, such as Yelp, TripAdvisor and Amazon, are full of opinionated information that can easily and significantly influence the decisions of a large number of customers. Because of this "word-of-mouth" effect, dishonest businesses have adopted unethical or even illegal marketing strategies, paying spammers to post fake reviews (opinion spam) that promote or demote target businesses and products. This undermines the trustworthiness of online content. To address the issue, trustworthy fraud detection is required, where trustworthy is defined by AIR ("Accurate, Interpretable and Reliable"); the idea is sketched in [CIC2018]. We have approached detection through propagation over networks [ICDM2011], temporal patterns [KDD2012], text features [DSAA2015] and multi-source data [BigData2016a BigData2015]. Spam detectors are also constantly under attack by adversarial spammers, so maintaining proactive detectors is critical [BigData2018]. Since humans (model developers and users) are in the detection loop, detection reliability and interpretability are desired. We propose model debugging and interpretation to deliver these desiderata [Pre-print].
Ensemble and model fusion
An ensemble of multiple models, if fused properly, can provide more predictive power than any constituent model. Traditional ensemble methods, which have a long history, study the fusion of a small number of predictive models for binary and multi-class prediction. We move this field forward by fusing many unidentifiable predictive models, such as crowdsourcing workers, whose outputs are sparse and structured, such as sets, rankings, and trees. The challenges are to gauge each individual model's performance and to take into account the extra knowledge of the output space. Please check out these three papers [ICDM2013 DSAA2015 CIKM2016a] along with others [SDM2012 KDD2014 SDM2015b].
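To make the first challenge concrete, here is a minimal sketch of one generic way to gauge worker reliability without ground truth: alternate between a weighted vote and re-estimating each worker's weight from agreement with that vote. This is an illustration only and is not the method of the papers cited above. The function name `fuse_votes`, the smoothing constants, and the toy data are all assumptions made for the example, and it handles only flat labels, not structured outputs such as rankings or trees.

```python
from collections import defaultdict


def fuse_votes(votes, n_iters=10):
    """Fuse sparse votes given as (worker, item, label) triples.

    Hypothetical helper for illustration: workers may label only a
    subset of the items, and labels must be sortable (e.g. strings).
    """
    workers = {w for w, _, _ in votes}
    weights = {w: 1.0 for w in workers}  # start by trusting everyone equally
    consensus = {}
    for _ in range(n_iters):
        # Step 1: weighted vote per item under the current worker weights.
        scores = defaultdict(lambda: defaultdict(float))
        for w, item, label in votes:
            scores[item][label] += weights[w]
        # Ties are broken deterministically in favour of the smallest label.
        consensus = {
            item: max(sorted(label_scores), key=label_scores.get)
            for item, label_scores in scores.items()
        }
        # Step 2: re-estimate each worker's weight as the smoothed
        # fraction of their votes that agree with the consensus.
        agree = defaultdict(int)
        total = defaultdict(int)
        for w, item, label in votes:
            total[w] += 1
            agree[w] += int(label == consensus[item])
        weights = {w: (agree[w] + 1) / (total[w] + 2) for w in workers}
    return consensus, weights


# Toy data (made up): workers "a" and "b" mostly agree, "c" disagrees.
votes = [
    ("a", "x1", "spam"), ("b", "x1", "spam"), ("c", "x1", "ham"),
    ("a", "x2", "ham"), ("b", "x2", "ham"), ("c", "x2", "spam"),
    ("a", "x3", "spam"), ("c", "x3", "ham"),
    ("b", "x4", "ham"),
]
consensus, weights = fuse_votes(votes)
```

On item x3 the two votes start out tied; once worker c's weight drops after disagreeing on x1 and x2, the tie resolves in favour of worker a. Exploiting knowledge of a structured output space would need more than this flat-label scheme.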
Extreme multi-labeled learning
Multi-labeled learning is a technique for assigning more than one semantic concept to a data item, and it has found wide application in areas such as bioinformatics, healthcare, e-commerce, and social media. Big data has changed the landscape of multi-labeled learning by increasing the number of labels to an unprecedented scale. For example, there are tens of thousands of tags (as labels) for texts on Stack Overflow and Yahoo Answers, and millions of tags for images on Flickr. Extreme multi-labeled learning tries to scale up traditional multi-labeled learning to handle a large number of labels with varying importance, sparsity, and informativeness. My current research addresses these scalability challenges from several angles with the help of NLP, knowledge bases, and crowdsourcing. See [BigData2016a CIKM2016b SDM2016a] for the ongoing research on the topic.