Dataset: hyperpartisan_news_detection

Language: en

Multilinguality: monolingual

Size: 1M<n<10M

Language creators: found

Source datasets: original

License: cc-by-4.0

Dataset Card for "hyperpartisan_news_detection"

Dataset Summary

The "hyperpartisan_news_detection" dataset was created for PAN @ SemEval 2019 Task 4. Given the text of a news article, the task is to decide whether it follows hyperpartisan argumentation, i.e., whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person.

The dataset has two main parts:

  • byarticle: labeled through crowdsourcing on a per-article basis. It contains only articles for which the crowdsourcing workers reached a consensus.
  • bypublisher: labeled by the overall bias of the publisher, as provided by BuzzFeed journalists or MediaBiasFactCheck.com.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

byarticle
  • Size of downloaded dataset files: 1.00 MB
  • Size of the generated dataset: 2.80 MB
  • Total amount of disk used: 3.81 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "hyperpartisan": true,
    "published_at": "2020-01-01",
    "text": "\"<p>This is a sample article which will contain lots of text</p>\\n    \\n<p>Lorem ipsum dolor sit amet, consectetur adipiscing el...",
    "title": "Example article 1",
    "url": "http://www.example.com/example1"
}
bypublisher
  • Size of downloaded dataset files: 1.00 GB
  • Size of the generated dataset: 5.61 GB
  • Total amount of disk used: 6.61 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "bias": 3,
    "hyperpartisan": false,
    "published_at": "2020-01-01",
    "text": "\"<p>This is a sample article which will contain lots of text</p>\\n    \\n<p>Phasellus bibendum porta nunc, id venenatis tortor fi...",
    "title": "Example article 4",
    "url": "https://example.com/example4"
}
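As the cropped examples above show, the `text` field stores the article body as raw HTML (paragraphs wrapped in `<p>` tags). A minimal sketch of stripping those tags with only Python's standard library follows; the helper names are illustrative, not part of any dataset tooling, and the sample string is adapted from the example above.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects the text content of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def strip_tags(html: str) -> str:
    """Return the plain text of an HTML fragment (illustrative helper)."""
    parser = TagStripper()
    parser.feed(html)
    return parser.text()


# Sample adapted from the cropped 'text' field shown above.
sample = "<p>This is a sample article which will contain lots of text</p>\n<p>Lorem ipsum dolor sit amet</p>"
print(strip_tags(sample))
# → This is a sample article which will contain lots of text
#   Lorem ipsum dolor sit amet
```

A production pipeline might prefer an HTML-aware library, but the sketch shows that the field needs HTML handling before tokenization.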

Data Fields

The data fields are the same among all splits.

byarticle
  • text: a string feature.
  • title: a string feature.
  • hyperpartisan: a bool feature.
  • url: a string feature.
  • published_at: a string feature.
bypublisher
  • text: a string feature.
  • title: a string feature.
  • hyperpartisan: a bool feature.
  • url: a string feature.
  • published_at: a string feature.
  • bias: a classification label, with possible values including right (0), right-center (1), least (2), left-center (3), left (4).
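The `bias` field stores the integer code, in the order listed above. A plain-Python sketch of the mapping (not the datasets library API; the helper names are hypothetical):

```python
# Order taken from the field description above: right (0) ... left (4).
BIAS_NAMES = ["right", "right-center", "least", "left-center", "left"]


def bias_int2str(label_id: int) -> str:
    """Map a stored integer label to its bias name."""
    return BIAS_NAMES[label_id]


def bias_str2int(name: str) -> int:
    """Map a bias name back to its integer code."""
    return BIAS_NAMES.index(name)


print(bias_int2str(3))  # a record with "bias": 3, like the example above, is "left-center"
```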

Data Splits

byarticle
  • train: 645 examples
bypublisher
  • train: 600,000 examples
  • validation: 150,000 examples

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

The collection (including labels) is licensed under a Creative Commons Attribution 4.0 International License.

Citation Information

@inproceedings{kiesel-etal-2019-semeval,
    title = "{S}em{E}val-2019 Task 4: Hyperpartisan News Detection",
    author = "Kiesel, Johannes  and
      Mestre, Maria  and
      Shukla, Rishabh  and
      Vincent, Emmanuel  and
      Adineh, Payam  and
      Corney, David  and
      Stein, Benno  and
      Potthast, Martin",
    booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/S19-2145",
    doi = "10.18653/v1/S19-2145",
    pages = "829--839",
    abstract = "Hyperpartisan news is news that takes an extreme left-wing or right-wing standpoint. If one is able to reliably compute this meta information, news articles may be automatically tagged, this way encouraging or discouraging readers to consume the text. It is an open question how successfully hyperpartisan news detection can be automated, and the goal of this SemEval task was to shed light on the state of the art. We developed new resources for this purpose, including a manually labeled dataset with 1,273 articles, and a second dataset with 754,000 articles, labeled via distant supervision. The interest of the research community in our task exceeded all our expectations: The datasets were downloaded about 1,000 times, 322 teams registered, of which 184 configured a virtual machine on our shared task cloud service TIRA, of which in turn 42 teams submitted a valid run. The best team achieved an accuracy of 0.822 on a balanced sample (yes : no hyperpartisan) drawn from the manually tagged corpus; an ensemble of the submitted systems increased the accuracy by 0.048.",
}

Contributions

Thanks to @thomwolf and @ghomasHudson for adding this dataset.