数据集:
clickbait_news_bg
任务:
文本分类子任务:
fact-checking语言:
bg计算机处理:
monolingual大小:
1K<n<10K语言创建人:
expert-generated批注创建人:
expert-generated源数据集:
original许可:
license:unknownThis is a corpus of Bulgarian news over a fixed period of time, whose factuality had been questioned. The news come from 377 different sources from various domains, including politics, interesting facts and tips&tricks.
The dataset was prepared for the Hack the Fake News hackathon. It was provided by the Bulgarian Association of PR Agencies and is available in Gitlab .
The corpus was automatically collected, and then annotated by students of journalism.
The training dataset contains 2,815 examples, where 1,940 (i.e., 69%) are fake news and 1,968 (i.e., 70%) are click-baits; There are 761 testing examples.
There is 98% correlation between fake news and clickbaits.
One important aspect about the training dataset is that it contains many repetitions. This should not be surprising as it attempts to represent a natural distribution of factual vs. fake news on-line over a period of time. As publishers of fake news often have a group of websites that feature the same deceiving content, we should expect some repetition. In particular, the training dataset contains 434 unique articles with duplicates. These articles have three reposts each on average, with the most reposted article appearing 45 times. If we take into account the labels of the reposted articles, we can see that if an article is reposted, it is more likely to be fake news. The number of fake news that have a duplicate in the training dataset are 1018 whereas, the number of articles with genuine content that have a duplicate article in the training set is 322.
(The dataset description is from the following paper .)
[More Information Needed]
Bulgarian
[More Information Needed]
Each entry in the dataset consists of the following elements:
fake_news_score - a label indicating whether the article is fake or not
click_bait_score - another label indicating whether it is a click-bait
content_title - article heading
content_url - URL of the original article
content_published_time - date of publication
content - article content
The training dataset contains 2,815 examples, where 1,940 (i.e., 69%) are fake news and 1,968 (i.e., 70%) are click-baits;
The validation dataset contains 761 testing examples.
[More Information Needed]
[More Information Needed]
Who are the source language producers?[More Information Needed]
[More Information Needed]
Who are the annotators?[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]