III: Small: Domain-Agnostic Dataset Search

Participants:

Investigators: Brian D. Davison (PI: Lehigh University, Dept. of Computer Science and Engineering), Jeff Heflin (Co-PI: Lehigh University, Dept. of Computer Science and Engineering), and Haiyan Jia (Co-PI: Lehigh University, Dept. of Journalism and Communication).
Student participants: Helen Borchart '21, Zhiyu Chen, PhD '22, Jessica Hicks '19, Yujie Ji, MS '20, Alexandra Johnson '19, Drake Johnson (from California University of Pennsylvania), Alissa Landberg '22, dePaul Miller '20, Larrisa Miller '20, Mericel Mirabal '22, Ethan Moscot '22, Kishan Patel '23, Lixuan Qiu '20, Keith Register (from Princeton), Emma Stein '20, Mathangi Sundar (from BITS, India), Mohamed Trabelsi, PhD '22, Ngan Tran '21, Xuewei Brooks Wang '20, Hui Ye, MS '20, Yang Yi '18, and Yifan Zhang, MS '23.

Description:

This research will provide the technology for, and develop a prototype of, a tool that can ultimately help many kinds of scientists locate data for exploratory analysis and hypothesis testing. This work will thus enable public dataset discovery and reuse, regardless of who produced the data or where it is stored. A dataset search engine using these methods benefits society by helping researchers accelerate their work and reduce duplicated effort. It will also benefit others, such as data journalists, for whom data promises a new source of evidence and story discovery and a new means of storytelling and fact-checking, supporting reporting that is both meaningful and trustworthy. More broadly, this work will help any data analyst locate relevant datasets.

This project will contribute to the training of graduate and undergraduate students (both within and separate from the requested REU supplement). This involvement will make it possible to broaden participation by underrepresented groups and to develop educational materials. The researchers will incorporate results of this work in courses, including Data Science, Web Search Engines, Data Journalism, and Semantic Web Topics.

Existing dataset search services are cumbersome: they focus on searching descriptions rather than the data itself, and they cater to searchers looking within their own discipline. The project's goal is to develop a prototype dataset search engine incorporating new techniques for full-content indexing to enable searchers to find data across the web, regardless of domain. The investigators will combine principles and novel methods from information retrieval, databases, and data mining. The design and development of the prototype will also take a user-centric approach, involving professionals and practitioners in observational, interview, and experimental studies to inform and guide this process.
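The contrast between description-only search and full-content indexing can be illustrated with a toy inverted index. This is only a minimal sketch under stated assumptions, not the project's actual method: the two example datasets, their field names (`description`, `columns`, `rows`), and the helper functions `tokenize`, `build_index`, and `search` are all hypothetical and chosen purely for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a value and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", str(text).lower())


def build_index(datasets, include_content):
    """Map each token to the set of dataset ids whose indexed text contains it.

    With include_content=False only the description is indexed; with
    include_content=True column names and cell values are indexed as well.
    """
    index = defaultdict(set)
    for ds_id, ds in datasets.items():
        texts = [ds["description"]]
        if include_content:
            texts.extend(ds["columns"])
            for row in ds["rows"]:
                texts.extend(row)
        for text in texts:
            for token in tokenize(text):
                index[token].add(ds_id)
    return index


def search(index, query):
    """Return ids of datasets that match every query token (boolean AND)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical example data, invented for this sketch.
DATASETS = {
    "air-quality": {
        "description": "Daily sensor readings from municipal monitoring stations",
        "columns": ["station", "city", "pm25"],
        "rows": [["S1", "Bethlehem", 12.4], ["S2", "Allentown", 9.8]],
    },
    "transit": {
        "description": "Bus ridership counts by route",
        "columns": ["route", "city", "riders"],
        "rows": [["R7", "Easton", 1520]],
    },
}

description_index = build_index(DATASETS, include_content=False)
full_index = build_index(DATASETS, include_content=True)
```

In this toy setting, a query for a value that appears only inside the data (for example, a city name stored in a cell) finds nothing in `description_index` but does match in `full_index`. A real system would also need ranking, schema handling, and scalable storage, none of which this sketch attempts.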

The outcomes of this work include:

1. The development of new principles, methods, and technologies for constructing search indexes from hundreds of thousands of real-world public datasets. The researchers will create novel methods for a) full-content indexing and analysis, b) inferring additional metadata, such as attribute names, when the existing descriptors are lacking, and c) inferring additional descriptors that can be used to resolve schema and data heterogeneity.
2. An understanding of searchers' cognitive processes as they search for and consider the use of datasets. A social cognitive model will be built to describe human-system interactions in dataset searches and to predict the effectiveness of the system in various scenarios.
3. The development of novel interfaces to support the search, exploration, and presentation of datasets to these searchers. Through this process, the researchers will develop a set of instruments for evaluating the dataset search technology and interface from the user's perspective.

Research results will be disseminated broadly by presenting and publishing at conferences and in journals, sharing on the web, giving talks, and making developed software open source.

News:

Our partnership with data.world was mentioned in their blog post on September 8, 2020.
A search engine for datasets, Lehigh Research Review, June 2020.
Lehigh research team to investigate a 'Google for research data', EurekAlert!, 20-Aug-2018.

Publications:

H. Jia, L. Miller, J. Hicks, E. Moscot, A. Landberg, J. Heflin, and B. D. Davison. (2022) Truth in a Sea of Data: Adoption and Use of Data Search Tools among Researchers and Journalists. In Information, Communication and Society, 26(16): 3239-3258. Taylor & Francis, November. DOI: 10.1080/1369118X.2022.2147398
Z. Chen. (2022) Dataset Search and Augmentation. Doctoral dissertation, Department of Computer Science and Engineering, Lehigh University, August.
M. Trabelsi. (2022) Leveraging Dataset Content in Neural Models for Search and Curation. Doctoral dissertation, Department of Computer Science and Engineering, Lehigh University, August.
M. Trabelsi, Z. Chen, S. Zhang, B. D. Davison, and J. Heflin. (2022) StruBERT: Structure-aware BERT for Table Search and Matching. In Proceedings of the 31st edition of the Web Conference, pp. 442-451, online, April. DOI: 10.1145/3485447.3511972
Z. Chen, M. Trabelsi, J. Heflin, D. Yin and B. D. Davison. (2021) MGNETS: Multi-Graph Neural Networks for Table Search. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM), pp. 2945-2949, online, November. DOI: 10.1145/3459637.3482140
M. Trabelsi, Z. Chen, B. D. Davison, and J. Heflin. (2021) Neural Ranking Models for Document Retrieval. Information Retrieval, 24:400-444, October. DOI: 10.1007/s10791-021-09398-0
J. Heflin, B. D. Davison, and H. Jia. (2021) Exploring Datasets via Cell-Centric Indexing. In Proceedings of DESIRES 2021: Second International Conference on Design of Experimental Search and Information REtrieval Systems, CEUR Workshop Proceedings, Volume 2950, pp. 53-60, Padua, Italy, September.
Z. Chen, S. Zhang, and B. D. Davison. (2021) WTR: A Test Collection for Web Table Retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2514-2520, July. DOI: 10.1145/3404835.3463260
H. Borchart. (2021) Query Refinement in Dataset Search. Senior Project Report, Cognitive Science Program, Lehigh University, May.
M. Trabelsi, Z. Chen, B. D. Davison, and J. Heflin. (2020) A Hybrid Deep Model for Learning to Rank Data Tables. In Proceedings of the 2020 IEEE International Conference on Big Data (IEEE BigData 2020), December.
M. Trabelsi, Z. Chen, B. D. Davison, and J. Heflin. (2020) Relational Graph Embeddings for Table Retrieval. In Seventh International Workshop on High Performance Big Graph Data Management, Analysis, and Mining (BigGraphs 2020), held with IEEE BigData 2020, December.
D. Johnson, K. Register, B. D. Davison, and J. Heflin. (2020) An Exploratory Interface for Dataset Repositories Using Cell-Centric Indexing. Poster paper in Proceedings of the 2020 IEEE International Conference on Big Data (IEEE BigData 2020), pp. 5716-5718, December.
Z. Chen, M. Trabelsi, B. D. Davison, and J. Heflin. (2020) Towards Knowledge Acquisition of Metadata on AI Progress. In Proceedings of the ISWC 2020 Demos and Industry Tracks: From Novel Ideas to Industrial Practice, co-located with the 19th International Semantic Web Conference (ISWC 2020), CEUR Workshop Proceedings, Vol. 2721, pages 232-237, November.
L. Qiu, H. Jia, B. D. Davison, and J. Heflin. (2020) An Architecture for Cell-Centric Indexing of Datasets. In Proceedings of PROFILES'20: 7th International Workshop on Dataset PROFILing and Search, pages 82-96, held with ISWC 2020, November.
Z. Chen, M. Trabelsi, J. Heflin, Y. Xu, and B. D. Davison. (2020) Table Search Using a Deep Contextualized Language Model. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 589-598, July.
L. Miller. (2020) Facilitating Dataset Search of Non-Expert Users through Heuristic and Systematic Information Processing. Honors Thesis, Cognitive Science Program, Lehigh University, May.
E. Stein. (2020) How Communication and Customization Influence Perceived Credibility, Usability and Adoption of Dataset Search Tools. Senior Project Report, Cognitive Science Program, Lehigh University, May.
Z. Chen, H. Jia, J. Heflin, and B. D. Davison. (2020) Leveraging Schema Labels to Enhance Dataset Search. In Proceedings of the 42nd European Conference on Information Retrieval (ECIR 2020), pages 267-280, April.
M. Trabelsi, B. D. Davison, and J. Heflin. (2019) Improved Table Retrieval Using Multiple Context Embeddings for Attributes. In Proceedings of the 2019 IEEE International Conference on Big Data (BigData), pages 1238-1244, Los Angeles, CA, December.
Z. Chen. (2018) Challenges and Progress in Dataset Search. Presentation at the Eighth BCS-IRSG Symposium on Future Directions in Information Access (FDIA 2018), co-located with the 8th International Conference on the Theory of Information Retrieval, Tianjin, China, September 2018.
Y. Yi, Z. Chen, J. Heflin and B. D. Davison. (2018) Recognizing Quantity Names for Tabular Data. In Joint Proceedings of the First International Workshop on Professional Search (ProfS2018); the Second Workshop on Knowledge Graphs and Semantics for Text Retrieval, Analysis, and Understanding (KG4IR); and the International Workshop on Data Search (DATA:SEARCH'18), pages 68-73. Presented at the International Workshop on Data Search (DATA:SEARCH'18). Co-located with SIGIR 2018, Ann Arbor, Michigan, USA, July.
Z. Chen, H. Jia, J. Heflin and B. D. Davison. (2018) Generating Schema Labels through Dataset Content Analysis. In Companion Proceedings of the The Web Conference (WWW '18), pages 1515-1522. Presented at the International Workshop on Profiling and Searching Data on the Web (Profiles & Data:Search'18, co-located with The Web Conference), Lyon, France, April. Best paper award.

This material is based upon work supported by Lehigh University under an internal seed grant and the National Science Foundation under Grant No. 1816325 (III: Small: Domain-Agnostic Dataset Search). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Last modified: 8 December 2022