The Language Function Analysis 2011 Corpus (LFA-11) is a German text corpus of promotional texts, reviews, and blog posts on music and smartphones. The texts were manually classified with respect to their topic relevance, language function, and sentiment polarity.
The purpose of the corpus is to provide textual data for the development and evaluation of approaches to language function analysis and sentiment analysis. Therefore, each text is classified by language function (personal, commercial, or informational) as well as by sentiment (positive, negative, neutral).
To download the corpus use the following links:
(4.9 MB, MD5 sum: 02051a782b664938b51c34e193efc343).
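The published MD5 sum can be used to check that the download arrived intact. The following is a minimal sketch using only the Python standard library; the file name `lfa-11.zip` is a hypothetical placeholder, since the actual archive name is not given here.

```python
import hashlib
from pathlib import Path

# Checksum published on this page for the corpus archive.
EXPECTED_MD5 = "02051a782b664938b51c34e193efc343"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path).lower() == expected.lower()


if __name__ == "__main__":
    # Hypothetical local file name; replace it with the path of the downloaded archive.
    archive = Path("lfa-11.zip")
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again.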
If you use the dataset in your research, please send us a copy of your publication. We kindly ask you to refer to the corpus via [bib].
The corpus consists of two separate collections, which contain the texts about music and smartphones, respectively. The music collection consists of 2,713 promotional texts and reviews from both users and professionals. The smartphone collection contains 2,093 blog posts on smartphones from the Spinn3r corpus.
Annotations are provided in the form of a UTF-8 encoded XMI file, which is preformatted for Apache UIMA. Each of the two collections is split into a training set (50% of the texts), a validation set (25%), and a test set (25%). The distribution of the different annotations over the sets is shown in the tables below.
Topic relevance in the music collection (2,713 texts):
|Training set||1327 (97.9%)||28 (2.1%)|
|Validation set||673 (99.1%)||6 (0.9%)|
|Test set||662 (97.5%)||17 (2.5%)|
Topic relevance in the smartphone collection (2,093 texts):
|Training set||561 (53.6%)||486 (46.4%)|
|Validation set||307 (58.7%)||216 (41.3%)|
|Test set||287 (54.9%)||236 (45.1%)|
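Because XMI is an XML format, the annotation files can be given a first inspection with a standard XML parser, without a UIMA installation. The sketch below only counts elements by tag name; it makes no assumption about the corpus's annotation type names, which should be taken from the corpus documentation, and `count_xmi_elements` is an illustrative helper rather than part of the corpus tooling.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_xmi_elements(source):
    """Count the elements of an XMI document by local tag name (namespace stripped).

    `source` may be a file path or a binary file object.
    """
    counts = Counter()
    # iterparse streams the document, so large files are not loaded all at once.
    for _event, element in ET.iterparse(source, events=("end",)):
        # ElementTree writes namespaced tags as "{namespace-uri}local-name".
        local_name = element.tag.rsplit("}", 1)[-1]
        counts[local_name] += 1
        element.clear()  # free the element's content once it has been counted
    return counts
```

Calling `count_xmi_elements` on one of the corpus files and printing the result of `most_common()` gives a quick overview of which annotation types a file contains and how often each occurs.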
Three different functions of language are analyzed:
- commercial: the text is of obvious commercial interest. It seems to predominantly aim at persuading the reader to buy or like the product.
- informational: the text predominantly appears to be informative in a journalistic manner.
- personal: the text probably represents the personal view on the product of a private individual.
The distribution of the individual language functions in the corpus is shown in the tables below.
Music collection:
|Training set||127 ( 9.4%)||707 (52.2%)||521 (38.5%)|
|Validation set||72 (10.6%)||188 (27.7%)||419 (61.7%)|
|Test set||68 (10.0%)||269 (39.6%)||342 (50.4%)|
Smartphone collection:
|Training set||90 ( 8.6%)||411 (39.3%)||546 (52.1%)|
|Validation set||36 ( 6.9%)||208 (39.8%)||279 (53.4%)|
|Test set||28 ( 5.4%)||193 (36.9%)||302 (57.7%)|
Sentiment polarity in the music collection:
|Training set||1003 (74.0%)||93 ( 6.9%)||259 (19.1%)|
|Validation set||558 (82.2%)||39 ( 5.7%)||82 (12.1%)|
|Test set||514 (75.7%)||50 ( 7.4%)||115 (16.9%)|
Sentiment polarity in the smartphone collection:
|Training set||205 (19.6%)||104 ( 9.9%)||738 (70.5%)|
|Validation set||110 (21.0%)||70 (13.4%)||343 (65.6%)|
|Test set||84 (16.1%)||80 (15.3%)||359 (68.6%)|
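Each percentage in the tables is the share of a count in its row total, that is, in the number of texts in the respective set. A minimal sketch of that arithmetic, using the first training-set row of the language function table as input (the helper name `shares` is illustrative):

```python
def shares(counts):
    """Return each count as a percentage of the row total, rounded to one decimal."""
    total = sum(counts)
    if total == 0:
        raise ValueError("row total must be positive")
    return [round(100 * count / total, 1) for count in counts]


# First training-set row of the language function table.
training_row = [127, 707, 521]
print(sum(training_row))     # 1355
print(shares(training_row))  # [9.4, 52.2, 38.5]
```

Because each share is rounded independently, the percentages of a row may add up to 99.9% or 100.1% rather than exactly 100%.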
For more information please download the LFA-11 Corpus documentation lfa-11-documentation.pdf (last accessed on 04/23/2013).
- Henning Wachsmuth
- Kathrin Bujna