Dataset: guardian_authorship
Tasks:
Languages:
Multilinguality: monolingual
Size: 1K&lt;n&lt;10K
Language creators: found
Annotation creators: found
Source datasets: original
License:
This is a cross-topic authorship attribution dataset, provided by Stamatatos 2013.

1. The cross-topic scenarios are based on Table 4 of Stamatatos 2017 (e.g. cross_topic_1 => row 1: P S U&W).
2. The cross-genre scenarios are based on Table 5 of the same paper (e.g. cross_genre_1 => row 1: B P S&U&W).
3. The same-topic/same-genre scenarios are created by grouping all of the datasets as follows. For example, to use same_topic with a 60-40 split of the dataset:

```python
train_ds = load_dataset('guardian_authorship', name="cross_topic_<<#>>",
                        split='train[:60%]+validation[:60%]+test[:60%]')
tests_ds = load_dataset('guardian_authorship', name="cross_topic_<<#>>",
                        split='train[-40%:]+validation[-40%:]+test[-40%:]')
```

IMPORTANT: `train+validation+test[:60%]` generates the wrong splits, because the data is imbalanced.
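The difference between the two slicing forms can be sketched in plain Python arithmetic, using the cross_topic_1 split sizes from the table below. This assumes the usual reading of the `datasets` split syntax, where a `[:60%]` slice binds to the single split name it follows rather than to the whole concatenation:

```python
# Split sizes for cross_topic_1 (train / validation / test article counts).
train, val, test = 112, 62, 207

# Correct form: [:60%] is applied to each split separately, so every split
# contributes its own 60% and the per-split balance is preserved.
correct = int(train * 0.6) + int(val * 0.6) + int(test * 0.6)

# Wrong form: in 'train+validation+test[:60%]' the slice binds only to
# 'test', so all of train and validation are taken plus 60% of test.
wrong = train + val + int(test * 0.6)

print(correct, wrong)
```

Because the splits have very different sizes, the two expressions select different numbers of examples, which is why the per-split slicing shown above is required.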
cross_genre_1: an example of 'train' looks as follows.
{ "article": "File 1a\n", "author": 0, "topic": 4 }

cross_genre_2: an example of 'validation' looks as follows.
{ "article": "File 1a\n", "author": 0, "topic": 1 }

cross_genre_3: an example of 'validation' looks as follows.
{ "article": "File 1a\n", "author": 0, "topic": 2 }

cross_genre_4: an example of 'validation' looks as follows.
{ "article": "File 1a\n", "author": 0, "topic": 3 }

cross_topic_1: an example of 'validation' looks as follows.
{ "article": "File 1a\n", "author": 0, "topic": 1 }
The data fields are the same among all splits.
| name | train | validation | test |
|---|---|---|---|
| cross_genre_1 | 63 | 112 | 269 |
| cross_genre_2 | 63 | 62 | 319 |
| cross_genre_3 | 63 | 90 | 291 |
| cross_genre_4 | 63 | 117 | 264 |
| cross_topic_1 | 112 | 62 | 207 |
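As a quick sanity check on the table above, the four cross_genre configurations all sum to the same number of articles, consistent with each row being a different repartition of one document pool. A small sketch hardcoding the table's numbers:

```python
# Split sizes copied from the table above: (train, validation, test).
split_sizes = {
    "cross_genre_1": (63, 112, 269),
    "cross_genre_2": (63, 62, 319),
    "cross_genre_3": (63, 90, 291),
    "cross_genre_4": (63, 117, 264),
    "cross_topic_1": (112, 62, 207),
}

# Total articles per configuration; the cross_genre rows differ only in
# how documents are assigned to train/validation/test, not in size.
totals = {name: sum(sizes) for name, sizes in split_sizes.items()}
print(totals)
```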
```
@article{article,
  author  = {Stamatatos, Efstathios},
  year    = {2013},
  month   = {01},
  pages   = {421-439},
  title   = {On the robustness of authorship attribution based on character n-gram features},
  volume  = {21},
  journal = {Journal of Law and Policy}
}

@inproceedings{stamatatos2017authorship,
  title     = {Authorship attribution using text distortion},
  author    = {Stamatatos, Efstathios},
  booktitle = {Proc. of the 15th Conf. of the European Chapter of the Association for Computational Linguistics},
  volume    = {1},
  pages     = {1138--1149},
  year      = {2017}
}
```
Thanks to @thomwolf, @eltoto1219, and @malikaltakrori for adding this dataset.