The Webis Tripad 2013 Sentiment Corpus is an English text corpus of 2100 hotel reviews for the development and evaluation of approaches to sentiment flow analysis. Each document in this corpus is assigned an overall rating score, some metadata, and two kinds of annotations. First, each statement of a review's text has been classified with respect to its sentiment polarity (positive, negative, objective) by Amazon Mechanical Turk (AMT) workers. Second, hotel aspects mentioned in the texts were tagged by in-house domain experts.
For example, the sentence "The service was perfect and the rooms were clean." consists of two statements, "The service was perfect" and "the rooms were clean", both classified as positive. The aspect in the first statement is "service"; in the second, it is "rooms".
To download the corpus, use the following link:
(2.9 MB, MD5 sum 97466da5b8e13072001e8f335d403ce8).
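The MD5 sum given above can be used to check that the download arrived intact. The following is a minimal sketch, not part of the corpus distribution: the function names `md5_of_file` and `verify_download` and the archive file name in the usage comment are assumptions made for illustration.

```python
import hashlib

# Checksum as stated on this page for the corpus archive.
EXPECTED_MD5 = "97466da5b8e13072001e8f335d403ce8"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so the whole file never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches `expected`."""
    return md5_of_file(path) == expected.lower()


# Usage (the file name is hypothetical; use whatever name the archive was saved under):
# verify_download("webis-tripad-13.zip")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.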
The text files are formatted in JSON.
The Webis Tripad 2013 Sentiment Corpus is an enriched version of a subset of the TripAdvisor Data Set. That data set served as a starting point because of (1) its previous application to opinion analysis, (2) the metadata given for each review, namely an overall rating and seven aspect ratings (value, room, location, cleanliness, check in/front desk, service, business service), all ranging from one to five stars, and (3) its size of 246,399 hotel reviews from 1850 hotels in 65 locations. The original data was crawled from TripAdvisor over a one-month period from February 14, 2009 to March 15, 2009.
The chosen subset contains 2100 reviews from seven locations (Amsterdam, Barcelona, Berlin, Paris, San Francisco, Seattle, Sydney), equally distributed with respect to their assigned overall sentiment score that ranges from 1 to 5. Hence, there are 300 reviews per city, 60 for each score.
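The balanced design described above (7 cities × 5 scores × 60 reviews = 2100) can be checked mechanically once the reviews are loaded. The sketch below is illustrative only: the record keys `city` and `score` are hypothetical and need not match the field names in the corpus JSON, and the records are synthetic placeholders, not corpus data.

```python
from collections import Counter
from itertools import product

CITIES = ["Amsterdam", "Barcelona", "Berlin", "Paris",
          "San Francisco", "Seattle", "Sydney"]
SCORES = [1, 2, 3, 4, 5]
PER_CELL = 60  # reviews per (city, score) combination


def is_balanced(reviews, per_cell=PER_CELL):
    """Check that every (city, score) pair occurs exactly `per_cell` times."""
    # "city" and "score" are assumed keys, chosen for this illustration.
    counts = Counter((r["city"], r["score"]) for r in reviews)
    all_cells_present = set(counts) == set(product(CITIES, SCORES))
    return all_cells_present and all(n == per_cell for n in counts.values())


# Synthetic placeholder records that mimic the sampling scheme.
placeholder = [
    {"city": city, "score": score}
    for city, score in product(CITIES, SCORES)
    for _ in range(PER_CELL)
]

print(len(placeholder))          # 2100
print(is_balanced(placeholder))  # True
```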
|Documents||2100||(per document numbers)|
Please note that the dataset you can download here is an updated version of the corpus, as we identified some defective reviews in the original TripAdvisor dataset that were due to parsing errors. The reviews in question have been repaired or replaced by hand, using the original content fetched from the TripAdvisor website. In addition, the data format was changed from UIMA XML to JSON to increase compatibility. However, the results of our latest analysis (cf. Section Publications) are based on the version before the update, which you can download here.
Students: Dora Spensberger