This research will provide the technology and develop the prototype of a tool that can ultimately assist many kinds of scientists to locate data that they can use to perform exploratory analysis and test hypotheses. Thus, this work will enable public dataset discovery and reuse, regardless of who produced the data or where it is stored. A dataset search engine using these methods benefits society by helping researchers to accelerate their work and reduce duplicate efforts. It will also benefit others, such as data journalists, as data promises a new source of evidence and for story discovery, a new way for story-telling and fact-checking, to make reporting that is both meaningful and trustworthy. This work will help any data analyst locate relevant datasets.
This project will impact the training of graduate students and undergraduates (both within and separate from the requested REU supplement). This involvement will make it possible to broaden participation by underrepresented groups and the development of educational materials. The researchers will incorporate results of this work in courses, including Data Science, Web Search Engines, Data Journalism, and Semantic Web Topics.
Existing dataset search services are cumbersome, focusing on searching descriptions, not data, and cater to searchers looking within their own discipline. The project's goal is to develop a prototype dataset search engine incorporating new techniques for full-content indexing to enable searchers to find data across the web, regardless of domain. The investigators will combine principles and novel methods from information retrieval, databases, and data mining. The design and development of the prototype will also take a user-centric approach, involving professionals and practitioners in observational, interview and experimental studies to inform and guide this process.
The outcomes of this work include: 1. The development of new principles, methods, and technologies for the construction of search indexes from hundreds of thousands of real-world public datasets: the researchers will create novel methods for a) full-content indexing and analysis, b) inferring additional metadata such as attribute names when the existing descriptors are lacking and, c) inferring additional descriptors that can be used to resolve schema and data heterogeneity. 2. The understanding of searchers' cognitive processes as they search for and consider use of datasets. A social cognitive model will be built to describe human-system interactions in dataset searches, and to predict the effectiveness of the system in various scenarios. 3. The development of novel interfaces to support the search, exploration, and presentation of datasets to such users. Through this process, the researchers will develop a set of instruments for evaluating the dataset search technology and interface from the user's perspective. Research results will be disseminated broadly by presenting and publishing at conferences and journals, sharing on the web, giving talks, and making developed software open source.