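Dataset: emea
Tasks: translation
Multilinguality: multilingual
Size: 1M<n<10M
Language creators: found
Annotation creators: found
Source datasets: original
License: unknown

To load a language pair that is not part of the predefined configurations, simply pass the two language codes as a pair. Valid language pairs are listed on the homepage given in the dataset description: http://opus.nlpl.eu/EMEA.php. For example: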
from datasets import load_dataset

dataset = load_dataset("emea", lang1="en", lang2="nl")
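A minimal sketch of how the two language codes might be checked before they are passed to `load_dataset`. The helper name `emea_pair_kwargs` and its validation rules (lowercase alphabetic codes, two distinct languages) are assumptions made for illustration and are not part of the `datasets` library; the OPUS page linked above remains the authoritative list of valid pairs.

```python
def emea_pair_kwargs(lang1, lang2):
    """Build keyword arguments for load_dataset from two language codes.

    Hypothetical helper: it only checks the shape of the codes, not
    whether the pair actually exists in EMEA.
    """
    for code in (lang1, lang2):
        if not (code.isalpha() and code.islower()):
            raise ValueError(f"not a lowercase language code: {code!r}")
    if lang1 == lang2:
        raise ValueError("a language pair needs two different languages")
    return {"path": "emea", "lang1": lang1, "lang2": lang2}


kwargs = emea_pair_kwargs("en", "nl")
print(kwargs)
# With the datasets library installed, the pair could then be loaded with:
#   from datasets import load_dataset
#   dataset = load_dataset(**kwargs)
```

Keeping the check separate from the actual download means a malformed code fails early, before any network request is made.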
[More Information Needed]
[More Information Needed]
Here is an example from the en-nl configuration:
{'id': '4', 'translation': {'en': 'EPAR summary for the public', 'nl': 'EPAR-samenvatting voor het publiek'}}
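The record layout shown in this example can be unpacked into a source/target pair as sketched below. The record is copied from the en-nl example; the function name `to_pair` is a hypothetical helper introduced here for illustration only.

```python
# Record copied from the en-nl example in this card.
example = {
    "id": "4",
    "translation": {
        "en": "EPAR summary for the public",
        "nl": "EPAR-samenvatting voor het publiek",
    },
}


def to_pair(record, src, tgt):
    """Return (source_text, target_text) for one record."""
    translation = record["translation"]
    return translation[src], translation[tgt]


source, target = to_pair(example, "en", "nl")
print(source)
print(target)
```

Swapping the `src` and `tgt` arguments gives the reverse translation direction from the same record.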
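The data fields are id (a string identifier) and translation (a dictionary mapping each language code to the sentence in that language), as in the example above.
Sizes of some language pairs: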
name | train |
---|---|
bg-el | 1044065 |
cs-et | 1053164 |
de-mt | 1000532 |
fr-sk | 1062753 |
es-lt | 1051370 |
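As a quick sanity check, the train sizes in the table can be compared against the 1M<n<10M size category listed in the card metadata. This is only a sketch: the numbers are copied from the table, and the function name `in_size_category` is a hypothetical helper.

```python
# Train split sizes copied from the table above.
train_sizes = {
    "bg-el": 1044065,
    "cs-et": 1053164,
    "de-mt": 1000532,
    "fr-sk": 1062753,
    "es-lt": 1051370,
}


def in_size_category(n, low=1_000_000, high=10_000_000):
    """True if n lies strictly inside the 1M<n<10M size category."""
    return low < n < high


for name, size in train_sizes.items():
    print(f"{name}: {size:,} examples, in 1M<n<10M: {in_size_category(size)}")
```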
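[More Information Needed]
[More Information Needed]
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
[More Information Needed]
Annotation process: [More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]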
@InProceedings{TIEDEMANN12.463,
  author = {J{\"o}rg Tiedemann},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
}
Thanks to @abhishekkrthakur for adding this dataset.