Dataset:

clickbait_news_bg

Task:

Text classification

Subtask:

fact-checking

Language:

Bulgarian

Multilinguality:

monolingual

Size:

1K<n<10K

Language creators:

expert-generated

Annotation creators:

expert-generated

Source datasets:

original

License:

license:unknown

Dataset Card for Clickbait/Fake News in Bulgarian

Dataset Summary

This is a corpus of Bulgarian news articles, collected over a fixed period of time, whose factuality has been questioned. The articles come from 377 different sources and cover various domains, including politics, interesting facts, and tips and tricks.

The dataset was prepared for the Hack the Fake News hackathon. It was provided by the Bulgarian Association of PR Agencies and is available on GitLab.

The corpus was automatically collected, and then annotated by students of journalism.

The training dataset contains 2,815 examples, of which 1,940 (i.e., 69%) are fake news and 1,968 (i.e., 70%) are clickbait. There are 761 testing examples.

There is a 98% correlation between the fake news and clickbait labels.
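The card does not say how the correlation was computed. As a minimal sketch, assuming both labels are encoded as 0/1 (an assumption, not something the card states), the Pearson correlation of two binary labels reduces to the phi coefficient. The toy labels below are made up for illustration and are not taken from the dataset.

```python
from math import sqrt


def phi_coefficient(xs, ys):
    """Pearson correlation of two equal-length 0/1 sequences (phi coefficient)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    # Cells of the 2x2 contingency table.
    n11 = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 1)
    n10 = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 1)
    n00 = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 0)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        raise ValueError("correlation is undefined when a label is constant")
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical toy labels (1 = fake / clickbait, 0 = not); not real data.
toy_fake = [1, 1, 1, 0, 0, 1, 0, 1]
toy_click = [1, 1, 1, 0, 0, 1, 1, 1]
toy_phi = phi_coefficient(toy_fake, toy_click)
print(f"phi on toy labels: {toy_phi:.3f}")
```

Running the same function on the real `fake_news_score` and `click_bait_score` columns would show whether this reading matches the reported figure.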

One important aspect of the training dataset is that it contains many repetitions. This should not be surprising, as the dataset attempts to represent the natural distribution of factual vs. fake news online over a period of time. Publishers of fake news often run a group of websites that feature the same deceptive content, so some repetition is expected. In particular, the training dataset contains 434 unique articles that have duplicates. These articles have three reposts each on average, and the most reposted article appears 45 times. Taking the labels of the reposted articles into account, an article that is reposted is more likely to be fake news: 1,018 fake news articles have a duplicate in the training dataset, whereas 322 articles with genuine content do.

(The dataset description is from the following paper.)
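The repetition figures can be sanity-checked with a little arithmetic, and duplicates can be found by grouping rows on their text. The sketch below assumes that reposts share identical content after case and whitespace normalization and that `fake_news_score` is 1 for fake and 0 for genuine; both are assumptions made for illustration, and the toy rows are invented.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def group_reposts(articles):
    """Map normalized content to its labels, keeping only duplicated articles."""
    groups = defaultdict(list)
    for article in articles:
        groups[normalize(article["content"])].append(article["fake_news_score"])
    return {text: labels for text, labels in groups.items() if len(labels) > 1}


# Hypothetical toy rows; not taken from the dataset.
toy_articles = [
    {"content": "Shock  news!", "fake_news_score": 1},
    {"content": "shock news!", "fake_news_score": 1},
    {"content": "Budget approved", "fake_news_score": 0},
    {"content": "SHOCK NEWS!", "fake_news_score": 1},
]
toy_reposts = group_reposts(toy_articles)

# Arithmetic on the counts quoted in the card, reading 1,018 and 322 as the
# numbers of training rows that belong to a duplicate group.
rows_with_duplicates = 1018 + 322
copies_per_article = rows_with_duplicates / 434
fake_share_among_duplicates = 1018 / rows_with_duplicates
print(f"{copies_per_article:.2f} copies per duplicated article")
print(f"{fake_share_among_duplicates:.0%} of duplicated rows are fake")
```

Under that reading, roughly three copies per duplicated article agrees with the average quoted above, and the fake share among duplicated rows is higher than the 69% overall share.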

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Bulgarian

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

Each entry in the dataset consists of the following elements:

fake_news_score - a label indicating whether the article is fake or not
click_bait_score - another label indicating whether it is a click-bait
content_title - article heading
content_url - URL of the original article
content_published_time - date of publication
content - article content
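
The field list above can be written down as a small record type. This is only a sketch: the field names come from the list, but the Python types (integer labels, the publication time kept as a string) and the sample values are assumptions, since the card does not specify the encoding.

```python
from dataclasses import dataclass, fields


@dataclass
class Article:
    fake_news_score: int         # assumed integer label
    click_bait_score: int        # assumed integer label
    content_title: str           # article heading
    content_url: str             # URL of the original article
    content_published_time: str  # kept as a string; the format is not documented
    content: str                 # article content


def article_from_row(row):
    """Build an Article from a dict, failing loudly if a field is absent."""
    names = [f.name for f in fields(Article)]
    missing = [name for name in names if name not in row]
    if missing:
        raise KeyError(f"row is missing fields: {missing}")
    return Article(**{name: row[name] for name in names})


# Hypothetical row with placeholder values; not taken from the dataset.
sample_row = {
    "fake_news_score": 1,
    "click_bait_score": 1,
    "content_title": "Example headline",
    "content_url": "https://example.com/article",
    "content_published_time": "2017-01-01",
    "content": "Example body text.",
}
article = article_from_row(sample_row)
print(article.content_title)
```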

Data Splits

The training dataset contains 2,815 examples, of which 1,940 (i.e., 69%) are fake news and 1,968 (i.e., 70%) are clickbait.

The validation split contains the 761 testing examples.
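
The percentages quoted for the splits follow directly from the counts. A quick check, using only the numbers stated above (the constant names are chosen here for illustration):

```python
TRAIN_EXAMPLES = 2815
TRAIN_FAKE = 1940
TRAIN_CLICKBAIT = 1968
VALIDATION_EXAMPLES = 761


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


fake_pct = share(TRAIN_FAKE, TRAIN_EXAMPLES)
clickbait_pct = share(TRAIN_CLICKBAIT, TRAIN_EXAMPLES)
total_examples = TRAIN_EXAMPLES + VALIDATION_EXAMPLES
train_pct = share(TRAIN_EXAMPLES, total_examples)

print(f"fake news in training set: {fake_pct:.1f}%")
print(f"clickbait in training set: {clickbait_pct:.1f}%")
print(f"training share of all {total_examples} examples: {train_pct:.1f}%")
```

Both classes are the majority in the training set, so a baseline that always predicts "fake" or "clickbait" would already reach about 69% to 70% accuracy there.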

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

Thanks to @tsvm, @lhoestq for adding this dataset.

Author:

Anonymous

Dataset size:

13.27 KB