Dataset Card for Scientific Exaggeration Detection

Dataset Summary

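Public trust in science depends on honest and factually accurate communication of scientific papers. However, recent studies have shown a tendency of news media to exaggerate the findings of scientific papers. Given this, we present a formalization and study of the problem of exaggeration detection in science communication. While there is an abundance of scientific papers and popular media articles written about them, very few articles include a direct link to the original paper, which makes data collection challenging. We address this by curating a set of labeled press release/abstract pairs from existing expert-annotated studies, suitable for benchmarking the performance of machine learning models on the task. Using this data together with previous studies on scientific exaggeration detection, we introduce MT-PET, a multi-task version of Pattern Exploiting Training (PET) that leverages knowledge from complementary cloze-style QA tasks to improve few-shot learning. We demonstrate that MT-PET outperforms both PET and supervised learning, both when data is limited and when data for the main task is abundant.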

Dataset Structure

The training and test data are derived from the InSciOut studies of Sumner et al. 2014 and Bratton et al. 2019. The dataset contains the following fields:

original_file_id: The ID of the original spreadsheet in the Sumner/Bratton data from which the annotations are derived
press_release_conclusion: The conclusion sentence from the press release
press_release_strength: The strength label for the press release
abstract_conclusion: The conclusion sentence from the abstract
abstract_strength: The strength label for the abstract
exaggeration_label: The final exaggeration label

The exaggeration label is one of "same", "exaggerates", or "downplays". The strength label is one of the following:

0: Statement of no relationship
1: Statement of correlation
2: Conditional statement of causation
3: Statement of causation
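
The card does not spell out how exaggeration_label relates to the two strength labels. The following is a minimal sketch, assuming that the label reflects a comparison of the ordered strength codes above (a stronger claim in the press release than in the abstract counts as "exaggerates", a weaker one as "downplays"). The `Strength` enum and the `compare_strengths` helper are hypothetical names introduced here for illustration, and the sketch also assumes the strength fields are stored as integers. The exaggeration_label stored in the dataset should be treated as authoritative.

```python
from enum import IntEnum


class Strength(IntEnum):
    """Claim strength codes as listed in this card (0 = weakest, 3 = strongest)."""
    NO_RELATIONSHIP = 0
    CORRELATION = 1
    CONDITIONAL_CAUSATION = 2
    CAUSATION = 3


def compare_strengths(press_release_strength: int, abstract_strength: int) -> str:
    """Hypothetical helper: compare two strength codes.

    Assumption (not stated in the card): the exaggeration label follows
    from the ordering of the press release and abstract strength codes.
    """
    # Strength(...) raises ValueError for codes outside 0..3.
    press_release = Strength(press_release_strength)
    abstract = Strength(abstract_strength)
    if press_release > abstract:
        return "exaggerates"
    if press_release < abstract:
        return "downplays"
    return "same"


# Press release states causation (3), abstract states correlation (1).
print(compare_strengths(3, 1))  # prints "exaggerates"
```

A check like this could be used to flag rows where the stored label and the two strength codes appear to disagree, rather than to overwrite the stored label.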

Dataset Creation

For details on how the dataset was created, see Section 4 of the paper. The original InSciOut data can be found here.

Citation

@inproceedings{wright2021exaggeration,
    title={{Semi-Supervised Exaggeration Detection of Health Science Press Releases}},
    author={Dustin Wright and Isabelle Augenstein},
    booktitle = {Proceedings of EMNLP},
    publisher = {Association for Computational Linguistics},
    year = 2021
}

Thanks to @dwright37 for providing this dataset.

Author:

copenlu

Dataset size:

345.67 KB