Dataset:

yoruba_text_c3

Tasks:

Text Generation

Fill-Mask

Sub-tasks:

language-modeling masked-language-modeling

Languages:

Yorùbá

Multilinguality:

monolingual

Size:

100K<n<1M

Language creators:

found

Annotation creators:

expert-generated

Source datasets:

original

License:

cc-by-nc-4.0


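Dataset Card for Yorùbá Text C3

Dataset Summary

Yorùbá Text C3 was collected from various web sources (the Bible, JW300, books, news articles, Wikipedia, and others) to compare pre-trained word embeddings (FastText and BERT) with embeddings trained on curated Yorùbá text. The dataset contains clean text (that is, text with correct Yorùbá diacritics), such as the Bible and JW300, alongside noisy text with incorrect or missing diacritics from other online sources such as Wikipedia, BBC Yorùbá, and VON Yorùbá.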
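The distinction between clean and noisy text can be illustrated with a rough heuristic. The following is a minimal sketch, not the curators' method: it assumes that tone marks (grave, acute) and the dot below are the relevant diacritics, and the function name `diacritic_ratio` is hypothetical. It can only flag absent diacritics, not incorrect ones.

```python
import unicodedata

# Combining marks assumed relevant for Yorùbá in this sketch:
# grave (U+0300) and acute (U+0301) for tone, dot below (U+0323) for ẹ, ọ, ṣ.
YORUBA_MARKS = {"\u0300", "\u0301", "\u0323"}


def diacritic_ratio(text: str) -> float:
    """Return the fraction of letters that carry at least one assumed mark."""
    # NFD splits precomposed characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    letters = 0
    marked = 0
    pending = False  # True while the latest base letter has not been counted as marked
    for ch in decomposed:
        if ch.isalpha():
            letters += 1
            pending = True
        elif ch in YORUBA_MARKS:
            if pending:
                marked += 1
                pending = False  # further marks on the same letter are not recounted
        elif unicodedata.combining(ch) == 0:
            # Spaces, digits, punctuation: no letter is waiting for a mark.
            pending = False
    return marked / letters if letters else 0.0


clean = "dáàbò bò"  # diacritics present
noisy = "daabo bo"  # same words with diacritics stripped
print(diacritic_ratio(clean), diacritic_ratio(noisy))
```

A low ratio over a large sample may suggest text with missing diacritics, but any threshold would need to be chosen empirically.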

Supported Tasks and Leaderboards

For training word embeddings and language models on Yorùbá text.
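For the fill-mask task, training examples are typically built by hiding some tokens and asking the model to recover them. The sketch below is a simplified illustration under stated assumptions: it splits on whitespace rather than using a subword tokenizer, the `[MASK]` placeholder and the 15% default rate are assumptions, and `mask_tokens` is a hypothetical helper rather than part of any library.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "[MASK]"  # assumed placeholder; real tokenizers define their own


def mask_tokens(
    sentence: str, mask_prob: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[Optional[str]]]:
    """Replace each whitespace token with MASK_TOKEN with probability mask_prob.

    Returns the masked tokens and, per position, the original token if it
    was masked or None if it was left unchanged.
    """
    rng = random.Random(seed)  # local generator keeps the sketch reproducible
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for token in sentence.split():
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


masked, labels = mask_tokens("o máa ń ṣe àyẹ̀wò wọ̀nyí", mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

Production masked-language-model pipelines usually operate on subword tokens and apply additional replacement rules, so this should be read only as an outline of the idea.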

Languages

The supported language is Yorùbá.

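Dataset Structure

Data Instances

Each data point is a single sentence on its own line:

{'text': 'lílo àkàbà — ǹjẹ́ o máa ń ṣe àyẹ̀wò wọ̀nyí tó lè dáàbò bò ẹ́'}

Data Fields

text: a string feature holding one sentence per line.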
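The one-sentence-per-line layout maps directly onto records with a single text field. Here is a minimal sketch of that mapping, assuming plain UTF-8 lines as input; the helper name `lines_to_records` and the choice to skip blank lines are assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator


def lines_to_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {'text': sentence} record per non-empty line."""
    for line in lines:
        sentence = line.strip()
        if sentence:  # assumption: blank lines carry no data and are skipped
            yield {"text": sentence}


# Two short sample sentences followed by a blank line.
sample = "lílo àkàbà\nǹjẹ́ o máa ń ṣe àyẹ̀wò wọ̀nyí\n\n"
records = list(lines_to_records(sample.splitlines()))
for record in records:
    print(record)
```

The same function accepts an open file handle, since iterating over a text file also yields one line at a time.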

Data Splits

The dataset contains only a training split.

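Dataset Creation

Curation Rationale

The dataset was created to help introduce resources for a new language, Yorùbá.

Source Data

Initial Data Collection and Normalization

The dataset comes from various web sources such as the Bible, JW300, books, news articles, and Wikipedia. For a summary of the dataset and its statistics, see the table in the paper.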

Who are the source language producers?

Jehovah's Witnesses (JW300), Yorùbá Bible, Yorùbá Wikipedia, BBC Yorùbá, VON Yorùbá, Global Voices Yorùbá

For other sources, see https://www.aclweb.org/anthology/2020.lrec-1.335/

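Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

The dataset is biased toward the religious domain (Christianity) because it includes JW300 and the Bible.

Other Known Limitations

[More Information Needed]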

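Additional Information

Dataset Curators

The dataset was curated by Jesujoba Alabi and David Adelani, students at Saarland University.

Licensing Information

The data is released under the Creative Commons Attribution-NonCommercial 4.0 license.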

Citation Information

@inproceedings{alabi-etal-2020-massive,
    title = "Massive vs. Curated Embeddings for Low-Resourced Languages: the Case of {Y}or{\`u}b{\'a} and {T}wi",
    author = "Alabi, Jesujoba  and
      Amponsah-Kaakyire, Kwabena  and
      Adelani, David  and
      Espa{\~n}a-Bonet, Cristina",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.335",
    pages = "2754--2762",
    abstract = "The success of several architectures to learn semantic representations from unannotated text and the availability of these kind of texts in online multilingual resources such as Wikipedia has facilitated the massive and automatic creation of resources for multiple languages. The evaluation of such resources is usually done for the high-resourced languages, where one has a smorgasbord of tasks and test sets to evaluate on. For low-resourced languages, the evaluation is more difficult and normally ignored, with the hope that the impressive capability of deep learning architectures to learn (multilingual) representations in the high-resourced setting holds in the low-resourced setting too. In this paper we focus on two African languages, Yor{\`u}b{\'a} and Twi, and compare the word embeddings obtained in this way, with word embeddings obtained from curated corpora and a language-dependent processing. We analyse the noise in the publicly available corpora, collect high quality and noisy data for the two languages and quantify the improvements that depend not only on the amount of data but on the quality too. We also use different architectures that learn word representations both from surface forms and characters to further exploit all the available information which showed to be important for these languages. For the evaluation, we manually translate the wordsim-353 word pairs dataset from English into Yor{\`u}b{\'a} and Twi. We extend the analysis to contextual word embeddings and evaluate multilingual BERT on a named entity recognition task. For this, we annotate with named entities the Global Voices corpus for Yor{\`u}b{\'a}. As output of the work, we provide corpora, embeddings and the test suits for both languages.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

Contributions

Thanks to @dadelani for adding this dataset.

Author:

Anonymous

Dataset size:

19.38 MB