Datasets:

para_crawl

Tasks:

Translation

Languages:

Multilinguality:

translation

Size Categories:

10M<n<100M

Language Creators:

found

Annotations Creators:

no-annotation

Source Datasets:

original

License:

cc0-1.0

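Dataset Card for "para_crawl"

Dataset Summary

Web-scale parallel corpora for the official European languages.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances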

enbg

Size of downloaded dataset files: 103.75 MB
Size of the generated dataset: 356.54 MB
Total amount of disk used: 460.27 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"bg\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}

encs

Size of downloaded dataset files: 196.41 MB
Size of the generated dataset: 638.07 MB
Total amount of disk used: 834.48 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"cs\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}

enda

Size of downloaded dataset files: 182.81 MB
Size of the generated dataset: 598.62 MB
Total amount of disk used: 781.43 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"da\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}

ende

Size of downloaded dataset files: 1.31 GB
Size of the generated dataset: 4.00 GB
Total amount of disk used: 5.30 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"de\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}

enel

Size of downloaded dataset files: 193.56 MB
Size of the generated dataset: 688.07 MB
Total amount of disk used: 881.62 MB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "translation": "{\"el\": \". “A felirat faragott karnis a bejárat fölött, templom épült 14 Július 1643, A földesúr és felesége Jeremiás Murguleţ, C..."
}
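
The configuration names above are formed from `en` followed by the two-letter code of the other language. The following is a minimal sketch of loading one configuration, assuming the Hugging Face `datasets` library is installed, network access is available, and the dataset is still published under the identifier `para_crawl`. None of this is stated in the card itself, and the helper names `config_name` and `load_pair` are hypothetical.

```python
def config_name(target_lang: str, source_lang: str = "en") -> str:
    """Build a configuration name such as 'enbg' from two language codes."""
    return f"{source_lang}{target_lang}".lower()


def load_pair(target_lang: str, split: str = "train"):
    """Load one language pair (assumes the `datasets` package and network access)."""
    # Imported lazily so that config_name() works without the dependency.
    from datasets import load_dataset

    return load_dataset("para_crawl", config_name(target_lang), split=split)


if __name__ == "__main__":
    print(config_name("bg"))  # enbg
```

Depending on the installed version of `datasets`, extra arguments may be needed to load script-based datasets, so treat `load_pair` as a starting point rather than a guaranteed call.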

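Data Fields

The data fields are the same among all splits.

enbg

translation: a multilingual string variable, with possible languages including en, bg.

encs

translation: a multilingual string variable, with possible languages including en, cs.

enda

translation: a multilingual string variable, with possible languages including en, da.

ende

translation: a multilingual string variable, with possible languages including en, de.

enel

translation: a multilingual string variable, with possible languages including en, el.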
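
As a sketch of how the `translation` field might be consumed, assume each record is a dictionary whose `translation` value maps language codes to strings. The record below is invented for illustration and is not taken from the corpus, and `split_pair` is a hypothetical helper.

```python
from typing import Dict, Tuple


def split_pair(
    record: Dict[str, Dict[str, str]],
    target_lang: str,
    source_lang: str = "en",
) -> Tuple[str, str]:
    """Return (source sentence, target sentence) from one record."""
    translation = record["translation"]
    return translation[source_lang], translation[target_lang]


# Invented record that mimics the assumed layout of an "ende" example.
example = {"translation": {"en": "Good morning.", "de": "Guten Morgen."}}

source, target = split_pair(example, "de")
print(source, "->", target)
```

A missing language code raises `KeyError`, which makes a mismatch between the chosen configuration and the requested language visible immediately.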

Data Splits

name	train
enbg	1039885
encs	2981949
enda	2414895
ende	16264448
enel	1985233
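
The counts above can be summed to check them against the `10M<n<100M` size category given in the card header. A small sketch that uses only the numbers from the table:

```python
# Training-set sizes copied from the table above.
train_counts = {
    "enbg": 1_039_885,
    "encs": 2_981_949,
    "enda": 2_414_895,
    "ende": 16_264_448,
    "enel": 1_985_233,
}

total = sum(train_counts.values())
largest = max(train_counts, key=train_counts.get)
share = train_counts[largest] / total

print(f"total train examples: {total:,}")
print(f"largest config: {largest} ({share:.1%} of all pairs)")
print("within 10M<n<100M:", 10_000_000 < total < 100_000_000)
```

The five listed configurations add up to 24,686,410 sentence pairs, with `ende` accounting for roughly two thirds of them.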

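Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Additional Information

Dataset Curators

More Information Needed

Licensing Information

Creative Commons CC0 license ("no rights reserved").

Citation Information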

@inproceedings{banon-etal-2020-paracrawl,
    title = "{P}ara{C}rawl: Web-Scale Acquisition of Parallel Corpora",
    author = "Ba{\~n}{\'o}n, Marta  and
      Chen, Pinzhen  and
      Haddow, Barry  and
      Heafield, Kenneth  and
      Hoang, Hieu  and
      Espl{\`a}-Gomis, Miquel  and
      Forcada, Mikel L.  and
      Kamran, Amir  and
      Kirefu, Faheem  and
      Koehn, Philipp  and
      Ortiz Rojas, Sergio  and
      Pla Sempere, Leopoldo  and
      Ram{\'\i}rez-S{\'a}nchez, Gema  and
      Sarr{\'\i}as, Elsa  and
      Strelec, Marek  and
      Thompson, Brian  and
      Waites, William  and
      Wiggins, Dion  and
      Zaragoza, Jaume",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.acl-main.417",
    doi = "10.18653/v1/2020.acl-main.417",
    pages = "4555--4567",
    abstract = "We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.",
}

Contributions

Thanks to @thomwolf, @lewtun, @patrickvonplaten, @mariamabarham for adding this dataset.

Author:

Anonymous

Dataset size:

44.87 KB