Synopsis

The Language Function Analysis 2011 Corpus (LFA-11) is a German text corpus of promotional text, reviews and blog posts on music and smartphones. The texts were manually classified with respect to their topic relevance, language function, and sentiment polarity.

The purpose of the corpus is to provide textual data for the development and evaluation of approaches to language function analysis and sentiment analysis. To that end, each text is classified by language function (personal, commercial, or informational) as well as by sentiment polarity (positive, negative, or neutral).

Download

To download the corpus use the following links:

If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus via [bib].

Research

The corpus consists of two separate collections, which contain the texts about music and smartphones, respectively. The music collection consists of 2,713 promotional texts and reviews from both users and professionals. The smartphone collection contains 2,093 blog posts on smartphones from the Spinn3r corpus.

Annotations are provided in the form of a UTF-8 encoded XMI file, which is preformatted for Apache UIMA. Each of the two collections is split into a training set (50% of the texts), a validation set (25%), and a test set (25%). The distribution of the different annotations over the sets is shown in the tables below.
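
As a first look at such a file, the following minimal Python sketch counts the top-level elements of an XMI document by type. It relies only on the standard library rather than on Apache UIMA itself. The `lfa` namespace and the `LanguageFunction` and `Sentiment` type names in the sample are invented for illustration; the actual type system of LFA-11 is described in the corpus documentation and may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_annotation_types(xmi_text: str) -> Counter:
    """Count the direct children of the XMI root element by local tag name."""
    root = ET.fromstring(xmi_text)
    counts = Counter()
    for element in root:
        # ElementTree renders namespaced tags as "{uri}local"; keep only "local".
        local_name = element.tag.rsplit("}", 1)[-1]
        counts[local_name] += 1
    return counts


# Hypothetical miniature XMI document. The "lfa" namespace and its type
# names are assumptions made for this sketch, not the real LFA-11 schema.
SAMPLE_XMI = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
         xmlns:cas="http:///uima/cas.ecore"
         xmlns:lfa="http:///example/lfa.ecore"
         xmi:version="2.0">
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="Tolles Album!"/>
  <lfa:LanguageFunction xmi:id="2" sofa="1" begin="0" end="13" value="personal"/>
  <lfa:Sentiment xmi:id="3" sofa="1" begin="0" end="13" value="positive"/>
</xmi:XMI>"""

type_counts = count_annotation_types(SAMPLE_XMI)
for type_name, n in sorted(type_counts.items()):
    print(type_name, n)
```

To inspect a file on disk instead of an inline string, `ET.parse(path).getroot()` can replace the `ET.fromstring` call.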

Distribution of relevance annotations

Music
Relevance         true            false
Training set      1327 (97.9%)    28 (2.1%)
Validation set    673 (99.1%)     6 (0.9%)
Test set          662 (97.5%)     17 (2.5%)

Smartphone
Relevance         true            false
Training set      561 (53.6%)     486 (46.4%)
Validation set    307 (58.7%)     216 (41.3%)
Test set          287 (54.9%)     236 (45.1%)

Three different functions of language are analyzed:

  • commercial: the text is of obvious commercial interest. It seems to predominantly aim at persuading the reader to buy or like the product.
  • informational: the text predominantly appears to be informative in a journalistic manner.
  • personal: the text probably represents the personal view of a private individual on the product.

The distribution of these functions over the sets of both collections is shown in the tables below.

Distribution of language function annotations

Music
Function          Commercial     Informational    Personal
Training set      127 (9.4%)     707 (52.2%)      521 (38.5%)
Validation set    72 (10.6%)     188 (27.7%)      419 (61.7%)
Test set          68 (10.0%)     269 (39.6%)      342 (50.4%)

Smartphone
Function          Commercial     Informational    Personal
Training set      90 (8.6%)      411 (39.3%)      546 (52.1%)
Validation set    36 (6.9%)      208 (39.8%)      279 (53.4%)
Test set          28 (5.4%)      193 (36.9%)      302 (57.7%)

Distribution of sentiment annotations

Music
Sentiment         Positive        Negative       Neutral
Training set      1003 (74.0%)    93 (6.9%)      259 (19.1%)
Validation set    558 (82.2%)     39 (5.7%)      82 (12.1%)
Test set          514 (75.7%)     50 (7.4%)      115 (16.9%)

Smartphone
Sentiment         Positive        Negative       Neutral
Training set      205 (19.6%)     104 (9.9%)     738 (70.5%)
Validation set    110 (21.0%)     70 (13.4%)     343 (65.6%)
Test set          84 (16.1%)      80 (15.3%)     359 (68.6%)
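
The percentages in these tables are each label's share of the texts in the respective set. A minimal Python sketch of that computation follows, using the sentiment counts of the smartphone training set; the function name `distribution` is introduced here for illustration only.

```python
def distribution(counts: dict[str, int]) -> dict[str, float]:
    """Return each label's share of the total as a percentage, rounded to one decimal."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must contain at least one text")
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Sentiment counts for the smartphone training set, copied from the table.
smartphone_training = {"positive": 205, "negative": 104, "neutral": 738}
shares = distribution(smartphone_training)
print(shares)  # {'positive': 19.6, 'negative': 9.9, 'neutral': 70.5}
```

The same function can be applied to any row of the tables to check that the counts and the reported shares agree.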

For more information, please download the LFA-11 corpus documentation, lfa-11-documentation.pdf (last accessed on 04/23/2013).

People

Publications