Short Text Classification Improved by Learning Multi-Granularity Topics
Mengen Chen, Xiaoming Jin, Dou Shen
Short text is becoming ubiquitous nowadays; examples include tweets, search snippets, product reviews and so on. Understanding short text is pressing and beneficial for many parties, including Web users and Web service providers. However, short text differs from traditional documents in its shortness and sparsity, which hinders the direct application of conventional machine learning and text mining algorithms. Two major approaches have been explored in the literature to enrich the representation of short text. One is to fetch contextual information about the short text so as to directly add more text; the other is to derive latent topics from an existing large corpus, through which pieces of short text can be bridged. The latter approach is more elegant and efficient in most cases. The common practice along this direction is to derive latent topics at a single granularity through well-known topic models such as LDA. However, topics at a single granularity are not sufficient to cover and differentiate the feature space of the short text under consideration. In this paper, we move forward along this direction by proposing a method that leverages topics at multiple granularities, with which the short text can be modeled more precisely. Taking short text classification as an example, we compare our proposed method with a state-of-the-art baseline on one open data set. Our method reduces the classification error by $20.25$~\% and $16.68$~\% respectively on two classifiers.
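As a rough illustration of the multi-granularity idea (a minimal sketch with made-up numbers, not the authors' actual pipeline): topic models of different sizes are assumed to be trained offline on a large external corpus; at prediction time each short text is mapped to one topic distribution per model, and the distributions are concatenated into a single feature vector for a downstream classifier.

```python
def multi_granularity_features(topic_dists):
    """Concatenate topic distributions inferred at several
    granularities (e.g. K=10, K=50, K=200 topics) into a single
    feature vector for a downstream classifier.

    `topic_dists` is a list of per-model topic distributions; the
    models themselves are hypothetical here.
    """
    features = []
    for dist in topic_dists:
        total = sum(dist)
        # Normalize each distribution so that models with more
        # topics do not dominate the combined feature vector.
        features.extend(p / total for p in dist)
    return features

# Toy example: distributions from a coarse (K=2) and a finer (K=4)
# hypothetical topic model for one tweet.
coarse = [0.8, 0.2]
fine = [0.5, 0.3, 0.1, 0.1]
vec = multi_granularity_features([coarse, fine])
print(len(vec))  # 6
```

The resulting vector can be fed to any standard classifier; the intuition is that coarse topics group related short texts broadly while fine topics separate them, so together they cover the feature space better than a single granularity.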