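Dataset:

webis/conclugen

English

Dataset Card for ConcluGen

Dataset Summary

The ConcluGen corpus was constructed for the task of argument summarization. It contains 136,996 pairs of argumentative texts and their conclusions, collected from the ChangeMyView subreddit, a forum for debating controversial topics.

The corpus comes in three variants: topics, aspects, and targets. Each variant encodes the corresponding information in the argument text by means of control codes. These codes supply additional argumentative knowledge for generating more informative conclusions.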
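As a rough illustration of the control-code encoding, the sketch below assembles an encoded input string. The delimiter tokens (`<|TOPIC|>`, `<|ARGUMENT|>`, `<|ASPECTS|>`, `<|TARGETS|>`, `<|CONCLUSION|>`) and their order are copied from the data instances in this card; the function name `encode_argument` and its parameters are hypothetical, and this is not the authors' preprocessing code.

```python
from typing import Optional, Sequence


def encode_argument(
    argument: str,
    topic: str,
    aspects: Optional[Sequence[str]] = None,
    targets: Optional[Sequence[str]] = None,
) -> str:
    """Build a control-code-encoded input string (hypothetical helper).

    The instances in this card carry either aspects or targets, never both;
    pass at most one of them to mimic a single variant.
    """
    parts = [f"<|TOPIC|>{topic}", f"<|ARGUMENT|>{argument}"]
    if aspects:
        parts.append("<|ASPECTS|>" + ", ".join(aspects))
    if targets:
        parts.append("<|TARGETS|>" + ", ".join(targets))
    # The trailing code marks where the conclusion is expected to follow.
    parts.append("<|CONCLUSION|>")
    return "".join(parts)


encoded = encode_argument(
    argument="Even with perfect use the pill may not work.",
    topic="Should Planned Parenthood Be Defunded?",
)
print(encoded)
```

Called with only a topic, the helper produces the shape of the topic variant; adding `aspects` or `targets` produces the shape of the other two.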

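Supported Tasks and Leaderboards

Argument summarization, conclusion generation

Languages

English ('en'), as written by Reddit users on the r/changemyview subreddit.

Dataset Structure

Data Instances

An example consists of a unique 'id', an 'argument', and its 'conclusion'.

Base

Contains only the argument and its conclusion.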

{'id': 'ee11c116-23df-4795-856e-8b6c6626d5ed',
 'argument': "In my opinion, the world would be a better place if alcohol was illegal. I've done a little bit of research to get some numbers, and I was quite shocked at what I found. Source On average, one in three people will be involved in a drunk driving crash in their lifetime. In 2011, 9,878 people died in drunk driving crashes Drunk driving costs each adult in this country almost 500 per year. Drunk driving costs the United States 132 billion a year. Every day in America, another 27 people die as a result of drunk driving crashes. Almost every 90 seconds, a person is injured in a drunk driving crash. These are just the driving related statistics. They would each get reduced by at least 75 if the sale of alcohol was illegal. I just don't see enough positives to outweigh all the deaths and injuries that result from irresponsible drinking. Alcohol is quite literally a drug, and is also extremely addicting. It would already be illegal if not for all these pointless ties with culture. Most people wouldn't even think to live in a world without alcohol, but in my opinion that world would be a better, safer, and more productive one. , or at least defend the fact that it's legal.",
 'conclusion': 'I think alcohol should be illegal.'}

Topic

The argument is encoded together with the discussion topic.

{"id":"b22272fd-00d2-4373-b46c-9c1d9d21e6c2","argument":"<|TOPIC|>Should Planned Parenthood Be Defunded?<|ARGUMENT|>Even the best contraceptive methods such as surgical sterilisation can fail, and even with perfect use the pill may not work.<|CONCLUSION|>","conclusion":"Even with the best intentions and preparation, contraceptives can and do fail."}

Aspects

The argument is encoded together with the discussion topic and the aspects of the argument.

{"id":"adc92826-7892-42d4-9405-855e845bf027","argument":"<|TOPIC|>Gender Neutral Bathrooms: Should They be Standard?<|ARGUMENT|>Men's toilets and women's urine have different odours due to hormone differences in each biological sex. As a result, the urine of one sex may smell much worse to the other sex and vice versa, meaning that it is logical to keep their toilet facilities separate.<|ASPECTS|>hormone differences, urine, separate, facilities, different odours, smell much worse<|CONCLUSION|>","conclusion":"Men and women, because of their different biological characteristics, each need a different type of bathroom. Gender-segregated bathrooms reflect and honour these differences."}

Targets

The argument is encoded together with the discussion topic and possible conclusion targets.

{"id":"c9a87a03-edda-42be-9c0d-1e7d2d311816","argument":"<|TOPIC|>Australian republic vs. monarchy<|ARGUMENT|>The monarchy is a direct reflection of Australia's past as a British colony and continues to symbolize Australia's subservience to the British crown. Such symbolism has a powerfully negative effect on Australians' sense of independence and identity. Ending the monarchy and establishing a republic would constitute a substantial stride in the direction of creating a greater sense of independence and national pride and identity.<|TARGETS|>Such symbolism, The monarchy, Ending the monarchy and establishing a republic<|CONCLUSION|>","conclusion":"Ending the monarchy would foster an independent identity in Australia"}
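To work with the encoded variants, it can help to split the `argument` string back into its parts. The following is a minimal sketch, assuming every control code has the form `<|NAME|>` with an uppercase name, as in the instances above. The function name `parse_encoded_argument` is hypothetical, and the sample string is an abbreviated version of the targets instance.

```python
import re
from typing import Dict

# Matches control codes such as <|TOPIC|> or <|ASPECTS|>.
_CODE = re.compile(r"<\|([A-Z]+)\|>")


def parse_encoded_argument(text: str) -> Dict[str, str]:
    """Map each control code (lowercased) to the text that follows it."""
    fields: Dict[str, str] = {}
    matches = list(_CODE.finditer(text))
    for i, match in enumerate(matches):
        # A field runs until the next control code or the end of the string.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[match.group(1).lower()] = text[match.end():end].strip()
    return fields


example = (
    "<|TOPIC|>Australian republic vs. monarchy"
    "<|ARGUMENT|>The monarchy is a direct reflection of Australia's past as a British colony."
    "<|TARGETS|>Such symbolism, The monarchy"
    "<|CONCLUSION|>"
)
parsed = parse_encoded_argument(example)
# Naive split: a target that itself contains a comma would be broken apart.
targets = [t.strip() for t in parsed["targets"].split(",")]
print(parsed["topic"], targets)
```

Base-variant arguments contain no control codes, so the function returns an empty dictionary for them. The `conclusion` key is present but empty, because the encoded input ends with `<|CONCLUSION|>` and the conclusion itself is stored in the separate `conclusion` field.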

Data Fields

  • id: a string identifier for each example.
  • argument: the argumentative text.
  • conclusion: the conclusion of the argumentative text.

Data Splits

Each variant of the dataset, including the base dataset, is split into training, validation, and test sets.

| | Train | Validation | Test |
|---------|---------|------------|-------|
| Base | 116,922 | 12,224 | 1,373 |
| Aspects | 120,142 | 12,174 | 1,357 |
| Targets | 109,376 | 11,053 | 1,237 |
| Topic | 121,588 | 12,335 | 1,372 |
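The relative size of each split follows directly from the counts in the table. The sketch below recomputes per-variant totals and split shares; the numbers are copied from the table, while the dictionary keys and the helper name `split_shares` are chosen here for illustration only.

```python
from typing import Dict, Tuple

# Split sizes as listed in the table above.
SPLITS: Dict[str, Dict[str, int]] = {
    "base":    {"train": 116_922, "validation": 12_224, "test": 1_373},
    "aspects": {"train": 120_142, "validation": 12_174, "test": 1_357},
    "targets": {"train": 109_376, "validation": 11_053, "test": 1_237},
    "topic":   {"train": 121_588, "validation": 12_335, "test": 1_372},
}


def split_shares(counts: Dict[str, int]) -> Tuple[int, Dict[str, float]]:
    """Return the total number of examples and each split's fraction of it."""
    total = sum(counts.values())
    return total, {name: n / total for name, n in counts.items()}


for variant, counts in SPLITS.items():
    total, shares = split_shares(counts)
    summary = ", ".join(f"{name} {share:.1%}" for name, share in shares.items())
    print(f"{variant}: {total:,} examples ({summary})")
```

For every variant the shares come out at roughly 90% training, 9% validation, and 1% test.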

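Dataset Creation

Curation Rationale

ConcluGen was built as a first step towards argument summarization technology. The rules of the subreddit ensure high-quality data suitable for the task.

Source Data

Initial Data Collection and Normalization

Reddit ChangeMyView

Who are the source language producers?

Users of the subreddit r/changemyview. Further demographic information is not available from the data source.

Annotations

The dataset is augmented with automatically extracted knowledge, such as the aspects of the argument, the discussion topic, and possible conclusion targets.

Annotation process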

[N/A]

Who are the annotators?

[N/A]

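Personal and Sensitive Information

Only the argumentative text and its conclusion are provided. No personal information about the posters is included.

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

The licensing status of the dataset hinges on the legal status of the Pushshift.io data, which is currently unclear.

Citation Information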

@inproceedings{syed:2021,
  author    = {Shahbaz Syed and
               Khalid Al Khatib and
               Milad Alshomary and
               Henning Wachsmuth and
               Martin Potthast},
  editor    = {Chengqing Zong and
               Fei Xia and
               Wenjie Li and
               Roberto Navigli},
  title     = {Generating Informative Conclusions for Argumentative Texts},
  booktitle = {Findings of the Association for Computational Linguistics: {ACL/IJCNLP}
               2021, Online Event, August 1-6, 2021},
  pages     = {3482--3493},
  publisher = {Association for Computational Linguistics},
  year      = {2021},
  url       = {https://doi.org/10.18653/v1/2021.findings-acl.306},
  doi       = {10.18653/v1/2021.findings-acl.306}
}