The Webis Causal Question Answering 2022 (Webis-CausalQA-22) corpus comprises 1.1M causal question-answer pairs collected from public QA datasets.
The Webis-CausalQA-22 corpus is available for download.
After downloading, unzip the archive. The resulting directory (440 MB after unzipping) contains 10
zip archives, each named after the QA dataset from which its causal QA pairs were extracted, and a
Jupyter notebook for processing and interactively exploring the data. You do not need to unzip the 10 archives: the notebook reads the archived files directly using the
zipfile Python library. All samples from the QA datasets were converted into the
JSONL format, which allows efficient iteration over the QA pairs; iterating over them takes less than a minute on a laptop CPU.
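
As a rough illustration of reading the JSONL data without extracting the inner archives, the following minimal sketch iterates over the records inside one zip archive. It is not the code from the notebook: the function name `iter_qa_pairs` is hypothetical, and the sketch assumes that the files inside each archive end in `.jsonl` and are UTF-8 encoded.

```python
import io
import json
import zipfile
from typing import Iterator


def iter_qa_pairs(archive) -> Iterator[dict]:
    """Yield one dict per non-empty line of every .jsonl member of a zip archive.

    `archive` may be a path or a binary file-like object.
    Assumption: members use the .jsonl extension and UTF-8 encoding.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.endswith(".jsonl"):
                continue  # skip anything that is not a JSONL file
            with zf.open(name) as raw:
                # zf.open returns a binary stream; wrap it to decode text lines
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.strip()
                    if line:
                        yield json.loads(line)
```

Because the function is a generator, records are decoded one line at a time rather than loaded into memory all at once. It can be called with the path of any of the 10 archives, for example `for pair in iter_qa_pairs(path_to_archive): ...`.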
Each QA pair is represented as a dictionary with keys such as
"question", "answer", and "context". The original key names were preserved to reflect the structure of the respective QA dataset, so they may vary from sample to sample. This makes it possible to run a QA system developed for a particular dataset on that dataset's causal sample and compare the system's effectiveness. The helper functions in the notebook resolve these differences, so you can simply run the code as is. Using the notebook, you can iterate over all samples to create a single large corpus of causal QA pairs or extract instances from specific samples.