The "hyperpartisan_news_detection" dataset was created for PAN @ SemEval 2019 Task 4. Given the text of a news article, the task is to decide whether it follows a hyperpartisan argumentation, i.e., whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person.
The dataset has two main configurations: `byarticle` and `bypublisher`.
byarticle

An example of 'train' looks as follows. This example was too long and was cropped:

{
    "hyperpartisan": true,
    "published_at": "2020-01-01",
    "text": "\"<p>This is a sample article which will contain lots of text</p>\\n \\n<p>Lorem ipsum dolor sit amet, consectetur adipiscing el...",
    "title": "Example article 1",
    "url": "http://www.example.com/example1"
}

bypublisher

An example of 'train' looks as follows. This example was too long and was cropped:

{
    "bias": 3,
    "hyperpartisan": false,
    "published_at": "2020-01-01",
    "text": "\"<p>This is a sample article which will contain lots of text</p>\\n \\n<p>Phasellus bibendum porta nunc, id venenatis tortor fi...",
    "title": "Example article 4",
    "url": "https://example.com/example4"
}
The data fields are the same among all splits.
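The field layout above can be sketched with toy records mirroring the cropped examples (values are illustrative; the plain-dict row view assumed here matches what the Hugging Face `datasets` library yields per example):

```python
# Toy records mirroring the cropped examples above (values illustrative).
byarticle_row = {
    "hyperpartisan": True,                 # bool: does the article follow hyperpartisan argumentation?
    "published_at": "2020-01-01",          # publication date as a string
    "text": "<p>This is a sample article which will contain lots of text</p>",
    "title": "Example article 1",
    "url": "http://www.example.com/example1",
}

# The bypublisher configuration carries one extra field, `bias`.
bypublisher_row = dict(byarticle_row, bias=3)

# The two configurations otherwise share the same fields.
extra = set(bypublisher_row) - set(byarticle_row)
print(sorted(extra))  # ['bias']
```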
| | train |
|---|---|
| byarticle | 645 |
| | train | validation |
|---|---|---|
| bypublisher | 600000 | 150000 |
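Since the `text` field holds raw HTML (as the cropped examples show), a common preprocessing step is to strip the markup before feeding articles to a classifier. A minimal standard-library sketch (`strip_html` is a helper name invented here, not part of the dataset):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content between HTML tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    """Return the article text with HTML tags removed."""
    parser = TextExtractor()
    parser.feed(text)
    return "".join(parser.parts)

print(strip_html("<p>Lorem ipsum dolor sit amet</p>"))  # Lorem ipsum dolor sit amet
```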
The collection (including labels) is licensed under a Creative Commons Attribution 4.0 International License.
@inproceedings{kiesel-etal-2019-semeval,
    title = "{S}em{E}val-2019 Task 4: Hyperpartisan News Detection",
    author = "Kiesel, Johannes and Mestre, Maria and Shukla, Rishabh and Vincent, Emmanuel and Adineh, Payam and Corney, David and Stein, Benno and Potthast, Martin",
    booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/S19-2145",
    doi = "10.18653/v1/S19-2145",
    pages = "829--839",
    abstract = "Hyperpartisan news is news that takes an extreme left-wing or right-wing standpoint. If one is able to reliably compute this meta information, news articles may be automatically tagged, this way encouraging or discouraging readers to consume the text. It is an open question how successfully hyperpartisan news detection can be automated, and the goal of this SemEval task was to shed light on the state of the art. We developed new resources for this purpose, including a manually labeled dataset with 1,273 articles, and a second dataset with 754,000 articles, labeled via distant supervision. The interest of the research community in our task exceeded all our expectations: The datasets were downloaded about 1,000 times, 322 teams registered, of which 184 configured a virtual machine on our shared task cloud service TIRA, of which in turn 42 teams submitted a valid run. The best team achieved an accuracy of 0.822 on a balanced sample (yes : no hyperpartisan) drawn from the manually tagged corpus; an ensemble of the submitted systems increased the accuracy by 0.048.",
}
Thanks to @thomwolf, @ghomasHudson for adding this dataset.