Paper (6 pages)
In this era of Big Data, many public web repositories let people access, retrieve, and store data. Dataset search queries naturally include quantities along with their units. Knowing the units in which data is expressed is important when that data is to be used, yet units are often absent from the data schema. Column names, however, often give a quantity name, with or without a corresponding unit, and sometimes in abbreviated form. Quantity names (e.g., length, weight, and time) can be matched to a set of relevant units. There is therefore a significant need to automatically determine the quantity names for column values. We investigate the potential to recognize the quantity names to which units belong. We assign each column a class label corresponding to its quantity name, thereby framing the problem as a multi-class classification task, and then construct a variety of features based on the column name and column content. Using a random forest, we show that these features are useful for predicting quantity names for columns in tables.
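To make the feature idea concrete, the following is a minimal sketch of how features might be derived from a column's name and content. It is an illustration only, not the feature set used in the paper: the `UNIT_HINTS` vocabulary, the `column_features` function, and every feature name are assumptions introduced here.

```python
import re
from statistics import mean

# Hypothetical unit abbreviations per quantity name (an assumption for
# illustration; the paper's actual vocabulary is not given in the abstract).
UNIT_HINTS = {
    "length": {"m", "cm", "mm", "km", "ft", "in"},
    "weight": {"kg", "g", "mg", "lb", "oz"},
    "time": {"s", "sec", "min", "hr", "h", "ms"},
}


def _to_float(value):
    """Return value as a float, or None if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def column_features(name, values):
    """Build a small feature dict from a column name and its cell values."""
    # Split the column name into lowercase alphabetic tokens.
    tokens = set(re.findall(r"[a-z]+", name.lower()))
    # Keep only the cells that parse as numbers.
    numbers = [x for x in (_to_float(v) for v in values) if x is not None]

    features = {}
    # Name-based features: quantity words and unit abbreviations in the name.
    for quantity, units in UNIT_HINTS.items():
        features[f"name_mentions_{quantity}"] = int(quantity in tokens)
        features[f"name_has_{quantity}_unit"] = int(bool(tokens & units))

    # Content-based features: simple statistics over the parsed values.
    features["numeric_fraction"] = len(numbers) / len(values) if values else 0.0
    features["mean_value"] = mean(numbers) if numbers else 0.0
    features["integer_fraction"] = (
        sum(x.is_integer() for x in numbers) / len(numbers) if numbers else 0.0
    )
    return features


feats = column_features("Height (cm)", ["172", "165.5", "n/a", "180"])
```

Feature dictionaries of this kind, one per column and each paired with a quantity-name label, could then be vectorized and passed to a random forest implementation such as scikit-learn's `RandomForestClassifier`. Which features and settings work well is an empirical question that this sketch does not address.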
In Joint Proceedings of the First International Workshop on Professional Search (ProfS2018); the Second Workshop on Knowledge Graphs and Semantics for Text Retrieval, Analysis, and Understanding (KG4IR); and the International Workshop on Data Search (DATA:SEARCH'18), pages 68-73. Co-located with SIGIR 2018, Ann Arbor, Michigan, USA, July 2018.