An Architecture for Cell-Centric Indexing of Datasets

Lixuan Qiu, Haiyan Jia, Brian D. Davison and Jeff Heflin.

Workshop Paper (15 pages)
Official published version: http://ceur-ws.org/Vol-2722/profiles2020-paper-2.pdf
Author's version: PDF (523KB)

Abstract

Increasingly, large collections of datasets are made available to the public via the Web, ranging from government-curated datasets like those of data.gov to communally sourced datasets such as Wikipedia tables. It has become clear that traditional search techniques are insufficient for such sources, especially when the user is unfamiliar with the terminology used by the creators of the relevant datasets. We propose to address this problem by elevating the datum to a first-class object that is indexed, thereby making it less dependent on how a dataset is structured. In a data table, a cell contains a value for a particular row as described by a particular column. In our cell-centric indexing approach, we index the metadata of each cell, so that information about its dataset and column simply becomes metadata rather than constraining concepts. In this paper we define cell-centric indexing and present a system architecture that supports its use in exploring datasets. We describe how cell-centric indexing can be implemented using traditional information retrieval technology and evaluate the scalability of the architecture.
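
As a rough illustration of the idea in the abstract, the sketch below treats every non-empty table cell as its own indexable document, with its column name and dataset title carried along as metadata fields, and builds a small fielded inverted index over those documents. It is not the implementation described in the paper: the field names (`content`, `column_name`, `dataset_title`, `row_context`), the helper functions, and the toy table are all assumptions made for this example.

```python
from collections import defaultdict


def cells_of(dataset_title, header, rows):
    """Yield one document per non-empty cell (hypothetical field names)."""
    for row in rows:
        for col_idx, value in enumerate(row):
            if value == "":
                continue  # empty cells carry no datum to index
            yield {
                "content": value,
                "column_name": header[col_idx],
                "dataset_title": dataset_title,
                # other non-empty values from the same row, kept as context
                "row_context": [
                    v for i, v in enumerate(row) if i != col_idx and v != ""
                ],
            }


def build_index(cell_docs):
    """Build a fielded inverted index: (field, token) -> set of cell ids."""
    docs = list(cell_docs)
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for field in ("content", "column_name", "dataset_title"):
            for token in doc[field].lower().split():
                index[(field, token)].add(doc_id)
    return docs, index


# Made-up toy table, for illustration only.
docs, index = build_index(cells_of(
    "Example cities",
    ["city", "state", "note"],
    [["Bethlehem", "PA", "steel"], ["Allentown", "PA", ""]],
))

# A term can be looked up as a cell value, a column name, or a title word.
value_hits = sorted(index[("content", "bethlehem")])
column_hits = sorted(index[("column_name", "state")])
```

Because dataset and column information are ordinary fields on each cell document, the same keyword can be matched against cell values, column names, or dataset titles without the query having to know how any particular table is laid out.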

In Joint Proceedings of Workshops AI4LEGAL2020, NLIWOD, PROFILES 2020, QuWeDa 2020 and SEMIFORM2020, Colocated with the 19th International Semantic Web Conference (ISWC 2020), CEUR Workshop Proceedings, Volume 2722, pages 82-96. Presented at PROFILES'20: 7th International Workshop on Dataset PROFILing and Search, November 2020.


Last modified: 11 January 2020
Brian D. Davison