Normalizing Microtext

Zhenzhen Xue, Dawei Yin and Brian D. Davison

Full Paper (6 pages)
Author's version: PDF (94KB)

Abstract

Computer-mediated communication has produced a new form of written text, microtext, that differs markedly from well-written text. Tweets and SMS messages, which are limited in length and may contain misspellings, slang, or abbreviations, are two typical examples. Microtext poses new challenges for standard natural language processing tools, which are usually designed for well-written text. The objective of this work is to normalize microtext, producing text suitable for further processing. We propose a normalization approach based on the source-channel model that incorporates four factors: an orthographic factor, a phonetic factor, a contextual factor, and acronym expansion. Experiments show that our approach normalizes Twitter messages reasonably well and outperforms existing algorithms on a public SMS data set.
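To make the source-channel idea concrete, here is a minimal sketch in Python. It is not the authors' implementation. The toy lexicon, the acronym table, the consonant-skeleton "phonetic" key, the equal mixing of the two similarity scores, and the CHANNEL_WEIGHT exponent are all assumptions chosen for illustration, and the contextual factor is reduced to a unigram prior. Each noisy token t is mapped to the candidate word w that maximizes log P(w) + CHANNEL_WEIGHT * log P(t | w).

```python
import math
from difflib import SequenceMatcher

# Hypothetical toy resources; a real system would estimate these from corpora.
UNIGRAM = {
    "tomorrow": 0.002, "tonight": 0.002, "see": 0.01, "you": 0.02,
    "to": 0.03, "great": 0.004, "later": 0.003,
}
ACRONYMS = {"lol": "laughing out loud", "btw": "by the way"}

# Assumed weight that sharpens the channel score relative to the prior.
CHANNEL_WEIGHT = 5.0


def skeleton(word):
    """Crude phonetic key: first letter plus later non-vowels, repeats collapsed."""
    key = word[:1]
    for ch in word[1:]:
        if ch not in "aeiou" and ch != key[-1]:
            key += ch
    return key


def channel(token, word):
    """Stand-in for P(token | word): equal mix of spelling and sound similarity."""
    orthographic = SequenceMatcher(None, token, word).ratio()
    phonetic = SequenceMatcher(None, skeleton(token), skeleton(word)).ratio()
    return 0.5 * orthographic + 0.5 * phonetic


def normalize_token(token):
    """Return the most likely intended word (or expansion) for one token."""
    token = token.lower()
    if token in ACRONYMS:          # acronym expansion by table lookup
        return ACRONYMS[token]
    if token in UNIGRAM:           # already a known word
        return token

    def score(word):
        # The small constant avoids log(0) when there is no overlap at all.
        return math.log(UNIGRAM[word]) + CHANNEL_WEIGHT * math.log(channel(token, word) + 1e-9)

    return max(UNIGRAM, key=score)


def normalize(message):
    """Normalize a whitespace-tokenized message token by token."""
    return " ".join(normalize_token(t) for t in message.split())


print(normalize("see you tmrw"))
```

In this sketch the frequent word "to" would beat "tomorrow" for the token "tmrw" if the prior and channel scores were weighted equally, which is why the channel term is given extra weight. A fuller model would replace the unigram prior with a context-sensitive language model.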

In Proceedings of The AAAI-11 Workshop on Analyzing Microtext, pages 74-79, San Francisco, 8 August 2011.


Last modified: 10 August 2011
Brian D. Davison