Generating Schema Labels through Dataset Content Analysis

Zhiyu Chen, Haiyan Jia, Jeff Heflin, and Brian D. Davison
Workshop Best Paper Award Winner

Paper (8 pages)
ACM published version: https://doi.org/10.1145/3184558.3191601
Author copy: PDF (639KB)

Abstract

Impoverished descriptions and convoluted schema labels are common challenges in data-centric tasks such as schema matching and data linking, especially when datasets can span domains. To address these issues, we consider the task of schema label generation. Typically, schema labels are created by dataset providers and are useful for users to understand a dataset. The motivation behind the task is that a lot of data linking systems require overlapping information between two datasets and rely on unique identifiers of schema labels. Moreover, it is common for schema labels in different datasets to have different identifiers even when they refer to the same concept. With no naming standard for schema labels, unintelligible labels are widely found in real-world datasets. For example, many schema labels contain abbreviations and compound nouns that hinder automated matching of attributes in corresponding datasets. Through schema label generation, more common (and thus understandable) schema labels can be provided to allow for broader schema matches in contexts such as dataset search and data linking. We develop a variety of features based on analysis of dataset content to enable machine learning methods to recommend useful labels. We test our approach on two real-world data collections and demonstrate that our method is able to outperform the alternative approach.

In Companion Proceedings of The Web Conference (WWW '18), pages 1515-1522. Presented at the International Workshop on Profiling and Searching Data on the Web (Profiles & Data:Search'18, co-located with The Web Conference), Lyon, France, April 2018.

Last modified: 1 May 2018
Brian D. Davison