Same Side Stance Classification
Synopsis
Identifying (classifying) the stance of an argument towards a particular topic is a fundamental task in computational argumentation. The stance of an argument as considered here is a two-valued function: it can either be "pro" a topic (= yes, I agree) or "con" a topic (= no, I do not agree).
With the new task »same side (stance) classification« we address a simpler variant of this problem: Given two arguments regarding a certain topic, the task is to decide whether or not the two arguments have the same stance.
Important Dates
- Submission open: June 14th, 2019
- Submission closed: June 21st, 2019 (11:59PM UTC-12:00)
- Result Presentation at ArgMining 2019: August 1st, 2019
- Notebooks due (optional): January 24th, 2020
Task
Given two arguments on a topic, decide whether they are on the same side or not.
Rationale. Current stance classification algorithms require knowledge about the topic an argument addresses, i.e., the classifiers must be trained for a particular topic and hence cannot be reliably applied to other (= across) topics. Observe that same side classification need not distinguish between topic-specific pro- and con-vocabulary; "only" the argument similarity within a stance needs to be assessed. That is, same side classification can probably be solved independently of a topic or a domain, so to speak, in a topic-agnostic fashion. We believe that the development of such a technology has game-changing potential. It could be used to tackle, for instance, the following tasks: measuring the strength of bias within an argumentation, structuring a discussion, finding out who or what is challenging in a discussion, or filtering wrongly labeled arguments in a large argument corpus, all without relying on knowledge of a topic or a domain.
We invite you to participate in the same side stance classification task! To keep the effort low, participants need to submit only a short description of their models along with the predicted labels for the provided test set. We plan to present the results of the task at the 6th Workshop on Argument Mining at ACL 2019, including a discussion of this task and the received submissions. Follow-up challenges of this task, which will, among other things, include much larger datasets to allow for deep learning, are in preparation and will be advertised after we have evaluated your feedback.
Data
The dataset used in the task is derived from the following four sources: idebate.org, debatepedia.org, debatewise.org and debate.org. Each instance in the dataset holds the following fields:
Input Format
- id: The id of the instance
- topic: The title of the debate. It can be a general topic (e.g., abortion) or a topic with a stance (e.g., abortion should be legalized).
- argument1: A pro or con argument related to the topic.
- argument1_id: The ID of argument1.
- argument2: A pro or con argument related to the topic.
- argument2_id: The ID of argument2.
Output Format
- is_same_stance: True or False. True in case argument1 and argument2 have the same stance towards the topic, and False otherwise.
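For orientation, here is a minimal sketch of how such instances could be loaded and inspected, assuming the data is shipped as CSV files with the fields listed above (the file name training.csv is only a placeholder):

```python
import csv

# Placeholder file name; use the file(s) actually provided with the task data.
TRAINING_FILE = "training.csv"

def load_instances(path):
    """Read task instances with the fields described above into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

instances = load_instances(TRAINING_FILE)
for row in instances[:3]:
    print(row["id"], "|", row["topic"])
    print("  argument1:", row["argument1"][:80])
    print("  argument2:", row["argument2"][:80])
    print("  is_same_stance:", row.get("is_same_stance"))  # present in the training data only
```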
Evaluation
We choose to work with the two most discussed topics in the considered four sources: abortion and gay marriage. Two experiments are set up for the task of same side stance classification:
- Within Topics: The training set contains arguments for a set of topics (abortion and gay marriage), and the test set contains arguments related to the same set of topics (abortion and gay marriage). The following table gives an overview of the data.

Class | Topic: Abortion | Topic: Gay Marriage |
---|---|---|
Same Side | 20,834 | 13,277 |
Different Side | 20,006 | 9,786 |
Total | 40,840 | 23,063 |

- Cross Topics: The training set contains arguments for one topic (abortion), and the test set contains arguments related to another set of topics. The following table gives an overview of the data.

Class | # of instances |
---|---|
Same Side | 31,195 |
Different Side | 29,853 |
Total | 61,048 |
Submission
We provide a test set for each experiment setting (within topics and cross topics). Train your model, predict the labels of the test set, and send us the results as a comma-separated CSV file with the header columns id and label. The id is the instance id provided in the test set, and the label has the value 'True' (if the two arguments are on the same side) or 'False' (if they are not). The results will be evaluated and ranked by the achieved accuracy.
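As an illustration of the expected workflow, the sketch below trains a simple TF-IDF baseline (scikit-learn, the pair encoding, and the label parsing are our own assumptions, not part of the task description), predicts the test labels, and writes a submission file with the required id and label columns:

```python
import csv

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def pair_text(row):
    # Concatenate both arguments of a pair; more elaborate pair encodings are possible.
    return row["argument1"] + " " + row["argument2"]

def train_and_predict(train_rows, test_rows):
    # train_rows/test_rows are lists of dicts as produced by load_instances() above;
    # the training rows additionally carry the gold label in "is_same_stance"
    # (assumed here to be encoded as the strings "True"/"False").
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                          LogisticRegression(max_iter=1000))
    model.fit([pair_text(r) for r in train_rows],
              [r["is_same_stance"] == "True" for r in train_rows])
    return model.predict([pair_text(r) for r in test_rows])

def write_submission(test_rows, predictions, path="submission.csv"):
    # Required format: a comma-separated file with the header "id,label" and True/False labels.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label"])
        for row, pred in zip(test_rows, predictions):
            writer.writerow([row["id"], "True" if pred else "False"])
```

Accuracy, the ranking criterion, is then simply the fraction of test instances whose predicted label matches the gold label.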
Results Within Topics
# | Team | University | Precision | Recall | Accuracy |
---|---|---|---|---|---|
1 | ReCAP✝ | Trier University | 0.85 | 0.66 | 0.77 |
1 | ASV | Leipzig University | 0.79 | 0.73 | 0.77 |
2 | IBM Research | IBM Research | 0.69 | 0.59 | 0.66 |
3 | UKP | TU Darmstadt | 0.68 | 0.52 | 0.64 |
4 | HHU SSSC | Düsseldorf University | 0.70 | 0.33 | 0.60 |
5 | ReCAP✝ | Trier University | 0.65 | 0.24 | 0.56 |
6 | DBS* | LMU | 0.53 | 1.00 | 0.55 |
7 | ACQuA✝ | MLU Halle | 0.53 | 0.57 | 0.54 |
8 | Paderborn University | Paderborn University | 0.59 | 0.19 | 0.53 |
9 | sam | Potsdam University | 0.51 | 0.58 | 0.51 |
10 | ACQuA✝ | MLU Halle | 0.50 | 0.11 | 0.50 |
Model overview (rule-based approach): same side classification is treated as a sentiment analysis task; two arguments are predicted to be on the same side if both carry positive or both carry negative sentiment, using the lists of positive and negative words by Minqing Hu and Bing Liu.
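A rough sketch of such a rule, assuming the Hu & Liu opinion lexicon is available as two plain word lists (the file names and the simple token counting are placeholders for illustration, not the submitted system):

```python
def load_lexicon(path):
    # One word per line; the Hu & Liu lexicon files contain comment lines starting with ";".
    with open(path, encoding="latin-1") as f:
        return {line.strip() for line in f if line.strip() and not line.startswith(";")}

POSITIVE = load_lexicon("positive-words.txt")  # placeholder file names
NEGATIVE = load_lexicon("negative-words.txt")

def leans_positive(text):
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return score >= 0

def same_side(argument1, argument2):
    # Rule: predict "same side" if both arguments lean positive or both lean negative.
    return leans_positive(argument1) == leans_positive(argument2)
```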
✝ The ReCAP and ACQuA teams submitted 8 and 6 approaches, respectively. For each team, the table shows the best and the worst performing model.
* In addition to their BERT model, the DBS team from LMU also exploited the fact that the "same side" relation is an equivalence relation: if a test set contains arguments from the training set, labels can be deduced via the transitivity property (among others). However, the test set underlying the results shown above does not contain such exploitable algebraic or logical structures, in order to compare, as intended, the language processing power of the submissions. We would like to mention this fact since the DBS team informed us about their use of this possibility beforehand, which shows their fair-mindedness, and, not least, since their exploitation of the algebraic structure is a smart move.
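For completeness, here is a minimal sketch of how such structure could be exploited in principle, assuming that some argument pairs with known labels overlap with a test set (an illustration of the idea, not the DBS implementation):

```python
class UnionFind:
    """Disjoint sets over argument ids; arguments known to share a stance end up in one set."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # path compression
        return self.parent[x]

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

# Assumed input: (argument1_id, argument2_id, is_same_stance) triples with known gold labels.
known_pairs = []

uf = UnionFind()
for a1, a2, same in known_pairs:
    if same:
        uf.union(a1, a2)  # "same side" is transitive, so merge the two sets

def deduce_same_side(a1, a2):
    # Known "different side" pairs between two sets would analogously allow deducing False.
    if uf.find(a1) == uf.find(a2):
        return True  # both arguments provably share a stance
    return None      # cannot be deduced from transitivity alone
```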
Results Cross Topics
We evaluate all submitted predictions on a balanced subset of the cross topics test set:
# | Team | University | Precision | Recall | Accuracy |
---|---|---|---|---|---|
1 | ReCAP✝ | Trier University | 0.72 | 0.72 | 0.73 |
2 | ASV | Leipzig University | 0.72 | 0.72 | 0.72 |
3 | HHU SSSC | Düsseldorf University | 0.72 | 0.53 | 0.66 |
4 | DBS | LMU | 0.67 | 0.53 | 0.63 |
4 | UKP | TU Darmstadt | 0.64 | 0.59 | 0.63 |
5 | IBM Research | IBM Research | 0.62 | 0.49 | 0.60 |
6 | Paderborn University | Paderborn University | 0.60 | 0.38 | 0.56 |
7 | ReCAP✝ | Trier University | 0.70 | 0.11 | 0.53 |
8 | sam | Potsdam University | 0.51 | 0.52 | 0.51 |
9 | ACQuA✝ | MLU Halle | 0.50 | 0.57 | 0.50 |
9 | ACQuA✝ | MLU Halle | 0.46 | 0.00 | 0.50 |
✝ The ReCAP and ACQuA teams submitted 8 and 6 approaches, respectively. For each team, the table shows the best and the worst performing model.