Same Side Stance Classification
Synopsis
Identifying (classifying) the stance of an argument towards a particular topic is a fundamental task in computational argumentation. The stance of an argument as considered here is a two-valued function: it can either be "pro" a topic (= yes, I agree) or "con" a topic (= no, I do not agree).
With the new task »same side (stance) classification« we address a simpler variant of this problem: Given two arguments regarding a certain topic, the task is to decide whether or not the two arguments have the same stance.
Important Dates
- Submission open: June 14th, 2019
- Submission closed: June 21st, 2019 (11:59PM UTC-12:00)
- Result Presentation at ArgMining 2019: August 1st, 2019
- Notebooks due (optional): January 24th, 2020
Task
Given two arguments on a topic, decide whether they are on the same side or not.
Rationale. Current stance classification algorithms require knowledge about the topic an argument addresses, i.e., the classifiers must be trained for a particular topic and hence cannot be reliably applied to other (= across) topics. Observe that same side classification need not distinguish between topic-specific pro- and con-vocabulary; "only" the argument similarity within a stance needs to be assessed. That is, same side classification can probably be solved independently of a topic or a domain, so to speak, in a topic-agnostic fashion. We believe that the development of such a technology has game-changing potential. It could be used to tackle, for instance, the following tasks: measuring the strength of bias within an argumentation, structuring a discussion, finding out who or what is challenging in a discussion, or filtering wrongly labeled arguments in a large argument corpus, all without relying on knowledge of a topic or a domain.
We invite you to participate in the same side stance classification task! To keep the effort low, participants need to submit only a short description of their models along with the predicted labels for the provided test set. We plan to present the results of the task at the 6th Workshop on Argument Mining at ACL 2019, including a discussion of this task and the received submissions. Follow-up challenges of this task, which will, among other things, include much larger datasets to allow for deep learning, are in preparation and will be advertised after we have evaluated your feedback.
Data
The dataset used in the task is derived from the following four sources: idebate.org, debatepedia.org, debatewise.org and debate.org. Each instance in the dataset holds the following fields:
Input Format
- id: The id of the instance
- topic: The title of the debate. It can be a general topic (e.g., abortion) or a topic with a stance (e.g., abortion should be legalized).
- argument1: A pro or con argument related to the topic.
- argument1_id: The ID of argument1.
- argument2: A pro or con argument related to the topic.
- argument2_id: The ID of argument2.
Output Format
- is_same_stance: True or False. True in case argument1 and argument2 have the same stance towards the topic, and False otherwise.
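For orientation, here is a minimal sketch of how such instances could be loaded and inspected, assuming the data is shipped as CSV files with the fields listed above (the file name training.csv is only a placeholder):

```python
import csv

# Placeholder file name; use the file(s) actually provided with the task data.
TRAINING_FILE = "training.csv"

def load_instances(path):
    """Read task instances with the fields described above into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

instances = load_instances(TRAINING_FILE)
for row in instances[:3]:
    print(row["id"], "|", row["topic"])
    print("  argument1:", row["argument1"][:80])
    print("  argument2:", row["argument2"][:80])
    print("  is_same_stance:", row.get("is_same_stance"))  # present in the training data only
```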
Evaluation
We choose to work with the two most discussed topics in the considered four sources: abortion and gay marriage. Two experiments are set up for the task of same side stance classification:
- Within Topics: The training set contains arguments for a set of topics (abortion and gay marriage), and the test set contains arguments related to the same set of topics (abortion and gay marriage). The following table gives an overview of the data.

Class | Topic: Abortion | Topic: Gay Marriage |
---|---|---|
Same Side | 20,834 | 13,277 |
Different Side | 20,006 | 9,786 |
Total | 40,840 | 23,063 |

- Cross Topics: The training set contains arguments for one topic (abortion), and the test set contains arguments related to another set of topics. The following table gives an overview of the data.

Class | # of instances |
---|---|
Same Side | 31,195 |
Different Side | 29,853 |
Total | 61,048 |
Submission
We provide a test set for each experiment setting (within topics and cross topics). Train your model, predict the labels of the test set, and send us the results as a comma-separated CSV file with the header columns id and label. The id is the instance id provided in the test set, and the label has the value 'True' (if the two arguments are on the same side) or 'False' (if they are not). The results will be evaluated and ranked by the achieved accuracy.
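As an illustration of the expected workflow, the sketch below trains a simple TF-IDF baseline (scikit-learn, the pair encoding, and the label parsing are our own assumptions, not part of the task description), predicts the test labels, and writes a submission file with the required id and label columns:

```python
import csv

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def pair_text(row):
    # Concatenate both arguments of a pair; more elaborate pair encodings are possible.
    return row["argument1"] + " " + row["argument2"]

def train_and_predict(train_rows, test_rows):
    # train_rows/test_rows are lists of dicts as produced by load_instances() above;
    # the training rows additionally carry the gold label in "is_same_stance"
    # (assumed here to be encoded as the strings "True"/"False").
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                          LogisticRegression(max_iter=1000))
    model.fit([pair_text(r) for r in train_rows],
              [r["is_same_stance"] == "True" for r in train_rows])
    return model.predict([pair_text(r) for r in test_rows])

def write_submission(test_rows, predictions, path="submission.csv"):
    # Required format: a comma-separated file with the header "id,label" and True/False labels.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label"])
        for row, pred in zip(test_rows, predictions):
            writer.writerow([row["id"], "True" if pred else "False"])
```

Accuracy, the ranking criterion, is then simply the fraction of test instances whose predicted label matches the gold label.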
Results Within Topics
# | Team | University | Precision | Recall | Accuracy |
---|---|---|---|---|---|
1 | ReCAP✝ | Trier University | 0.85 | 0.66 | 0.77 |
1 | ASV | Leipzig University | 0.79 | 0.73 | 0.77 |
2 | IBM Research | IBM Research | 0.69 | 0.59 | 0.66 |
3 | UKP | TU Darmstadt | 0.68 | 0.52 | 0.64 |
4 | HHU SSSC | Düsseldorf University | 0.70 | 0.33 | 0.60 |
5 | ReCAP✝ | Trier University | 0.65 | 0.24 | 0.56 |
6 | DBS* | LMU | 0.53 | 1.00 | 0.55 |
7 | ACQuA✝ | MLU Halle | 0.53 | 0.57 | 0.54 |
8 | Paderborn University | Paderborn University | 0.59 | 0.19 | 0.53 |
9 | sam | Potsdam University | 0.51 | 0.58 | 0.51 |
10 | ACQuA✝ | MLU Halle | 0.50 | 0.11 | 0.50 |
Model overview (rule-based approach): same side classification is treated as a sentiment analysis task; two arguments are predicted to be on the same side if both carry positive or both carry negative sentiment, using the lists of positive and negative words by Minqing Hu and Bing Liu.
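A rough sketch of such a rule, assuming the Hu & Liu opinion lexicon is available as two plain word lists (the file names and the simple token counting are placeholders for illustration, not the submitted system):

```python
def load_lexicon(path):
    # One word per line; the Hu & Liu lexicon files contain comment lines starting with ";".
    with open(path, encoding="latin-1") as f:
        return {line.strip() for line in f if line.strip() and not line.startswith(";")}

POSITIVE = load_lexicon("positive-words.txt")  # placeholder file names
NEGATIVE = load_lexicon("negative-words.txt")

def leans_positive(text):
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return score >= 0

def same_side(argument1, argument2):
    # Rule: predict "same side" if both arguments lean positive or both lean negative.
    return leans_positive(argument1) == leans_positive(argument2)
```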
✝ The ReCAP and ACQuA teams submitted 8 and 6 approaches, respectively. For each team, the table shows the best and the worst performing model.
* In addition to their BERT model, the DBS team from LMU also exploited the fact that the "same side" relation is an equivalence relation: if a test set contains arguments from the training set, labels can be deduced via the transitivity property (among others). However, the test set underlying the results shown above does not contain such exploitable algebraic or logical structures, in order to compare, as intended, the language processing power of the submissions. We would like to mention this fact since the DBS team informed us about their use of this possibility beforehand, which shows their fair-mindedness, and, not least, since their exploitation of the algebraic structure is a smart move.
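For completeness, here is a minimal sketch of how such structure could be exploited in principle, assuming that some argument pairs with known labels overlap with a test set (an illustration of the idea, not the DBS implementation):

```python
class UnionFind:
    """Disjoint sets over argument ids; arguments known to share a stance end up in one set."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # path compression
        return self.parent[x]

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

# Assumed input: (argument1_id, argument2_id, is_same_stance) triples with known gold labels.
known_pairs = []

uf = UnionFind()
for a1, a2, same in known_pairs:
    if same:
        uf.union(a1, a2)  # "same side" is transitive, so merge the two sets

def deduce_same_side(a1, a2):
    # Known "different side" pairs between two sets would analogously allow deducing False.
    if uf.find(a1) == uf.find(a2):
        return True  # both arguments provably share a stance
    return None      # cannot be deduced from transitivity alone
```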
Results Cross Topics
We evaluate all submitted predictions on a balanced subset of the cross topics test set:
# | Team | University | Precision | Recall | Accuracy |
---|---|---|---|---|---|
1 | ReCAP✝ | Trier University | 0.72 | 0.72 | 0.73 |
2 | ASV | Leipzig University | 0.72 | 0.72 | 0.72 |
3 | HHU SSSC | Düsseldorf University | 0.72 | 0.53 | 0.66 |
4 | DBS | LMU | 0.67 | 0.53 | 0.63 |
4 | UKP | TU Darmstadt | 0.64 | 0.59 | 0.63 |
5 | IBM Research | IBM Research | 0.62 | 0.49 | 0.60 |
6 | Paderborn University | Paderborn University | 0.60 | 0.38 | 0.56 |
7 | ReCAP✝ | Trier University | 0.70 | 0.11 | 0.53 |
8 | sam | Potsdam University | 0.51 | 0.52 | 0.51 |
9 | ACQuA✝ | MLU Halle | 0.50 | 0.57 | 0.50 |
9 | ACQuA✝ | MLU Halle | 0.46 | 0.00 | 0.50 |
✝ The ReCAP and ACQuA teams submitted 8 and 6 approaches, respectively. For each team, the table shows the best and the worst performing model.