Datasets:

qanastek/ECDC

Tasks:

translation

Languages:

en

Multilinguality:

en-sv en-pl en-hu

Size Categories:

100K<n<1M

Language Creators:

found

Source Datasets:

extended

License:

other

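ECDC: An Overview of the European Union's Highly Multilingual Parallel Corpora

Dataset Summary

In October 2012, the European Union (EU) agency "European Centre for Disease Prevention and Control" (ECDC) released a translation memory (TM), i.e. a collection of sentences and their professionally produced translations, covering 25 languages. The data is distributed via the web pages of the EC's Joint Research Centre (JRC).

Supported Tasks and Leaderboards

translation: The dataset can be used to train a translation model.

Languages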

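In this version, the corpus consists of source and target sentence pairs in which English is paired with each of the 24 other languages listed below (25 languages in total).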

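List of languages: English (en), Swedish (sv), Polish (pl), Hungarian (hu), Lithuanian (lt), Latvian (lv), German (de), Finnish (fi), Slovak (sk), Slovenian (sl), French (fr), Czech (cs), Danish (da), Italian (it), Maltese (mt), Dutch (nl), Portuguese (pt), Romanian (ro), Spanish (es), Estonian (et), Bulgarian (bg), Greek (el), Irish (ga), Icelandic (is) and Norwegian (no).

Load the dataset with HuggingFace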

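from datasets import load_dataset

# Load the English-Italian pair; 'force_redownload' bypasses the local cache.
dataset = load_dataset("qanastek/ECDC", "en-it", split='train', download_mode='force_redownload')
print(dataset)     # dataset summary (features and number of rows)
print(dataset[0])  # first example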

Dataset Structure

Data Instances

key,lang,source_text,target_text
doc_0,en-bg,Vaccination against hepatitis C is not yet available.,Засега няма ваксина срещу хепатит С.
doc_1355,en-bg,Varicella infection,Инфекция с варицела
doc_2349,en-bg,"If you have any questions about the processing of your e-mail and related personal data, do not hesitate to include them in your message.","Ако имате въпроси относно обработката на вашия адрес на електронна поща и свързаните лични данни, не се колебайте да ги включите в съобщението си."
doc_192,en-bg,Transmission can be reduced especially by improving hygiene in food production handling.,Предаването на инфекцията може да бъде ограничено особено чрез подобряване на хигиената при манипулациите в хранителната индустрия.
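
The rows above contain commas inside quoted fields, so splitting on "," alone would break them. A minimal sketch of parsing such rows with Python's standard csv module follows; it assumes the data is stored as ordinary CSV exactly as displayed, and the SAMPLE constant and read_rows helper are illustrative names, not part of the dataset.

```python
import csv
import io

# Two of the rows shown above, copied verbatim (header included).
SAMPLE = '''key,lang,source_text,target_text
doc_1355,en-bg,Varicella infection,Инфекция с варицела
doc_2349,en-bg,"If you have any questions about the processing of your e-mail and related personal data, do not hesitate to include them in your message.","Ако имате въпроси относно обработката на вашия адрес на електронна поща и свързаните лични данни, не се колебайте да ги включите в съобщението си."
'''


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE)
for row in rows:
    # Quoted fields keep their embedded commas intact.
    print(row["key"], row["lang"], len(row["source_text"]))
```

csv.DictReader handles the quoting rules, so each row always yields exactly the four fields named in the header.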

Data Fields

key: The document identifier, a string.

lang: The pair of source and target languages, a string.

source_text: The source text, a string.

target_text: The target text, a string.
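
As an illustration of how these four fields fit together, the sketch below splits the lang value into its two language codes and reshapes a row into a nested "translation" dictionary keyed by language code. That target layout is an assumption about what downstream training code might expect, and split_lang_pair and to_translation_example are hypothetical helper names, not part of the dataset.

```python
def split_lang_pair(lang):
    """Split a pair such as 'en-bg' into (source_code, target_code)."""
    src, tgt = lang.split("-", 1)
    return src, tgt


def to_translation_example(row):
    """Reshape one row into an id plus a dict keyed by language code."""
    src, tgt = split_lang_pair(row["lang"])
    return {
        "id": row["key"],
        "translation": {src: row["source_text"], tgt: row["target_text"]},
    }


# A row taken from the data instances above.
row = {
    "key": "doc_1355",
    "lang": "en-bg",
    "source_text": "Varicella infection",
    "target_text": "Инфекция с варицела",
}
example = to_translation_example(row)
print(example)
```

Because the function takes and returns a plain dictionary, it could also be passed to the map method of a loaded dataset, which would add the new columns to each example.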

Data Splits

lang key (number of entries)
en-bg 2567
en-cs 2562
en-da 2577
en-de 2560
en-el 2530
en-es 2564
en-et 2581
en-fi 2617
en-fr 2561
en-ga 1356
en-hu 2571
en-is 2511
en-it 2534
en-lt 2545
en-lv 2542
en-mt 2539
en-nl 2510
en-no 2537
en-pl 2546
en-pt 2531
en-ro 2555
en-sk 2525
en-sl 2545
en-sv 2527
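
To get a quick feel for how evenly the language pairs are covered, the figures in the table can be summarised as follows. This is a small sketch that simply copies the counts above into a dictionary; SPLIT_SIZES is an illustrative name and the numbers are not read from the dataset itself.

```python
# Counts copied from the table above (language pair -> number of entries).
SPLIT_SIZES = {
    "en-bg": 2567, "en-cs": 2562, "en-da": 2577, "en-de": 2560,
    "en-el": 2530, "en-es": 2564, "en-et": 2581, "en-fi": 2617,
    "en-fr": 2561, "en-ga": 1356, "en-hu": 2571, "en-is": 2511,
    "en-it": 2534, "en-lt": 2545, "en-lv": 2542, "en-mt": 2539,
    "en-nl": 2510, "en-no": 2537, "en-pl": 2546, "en-pt": 2531,
    "en-ro": 2555, "en-sk": 2525, "en-sl": 2545, "en-sv": 2527,
}

total = sum(SPLIT_SIZES.values())
smallest = min(SPLIT_SIZES, key=SPLIT_SIZES.get)
largest = max(SPLIT_SIZES, key=SPLIT_SIZES.get)

print(f"pairs: {len(SPLIT_SIZES)}, total entries: {total}")
print(f"smallest: {smallest} ({SPLIT_SIZES[smallest]})")
print(f"largest: {largest} ({SPLIT_SIZES[largest]})")
```

On the figures in the table this gives 24 pairs and 59,993 entries in total, with en-ga the smallest pair and en-fi the largest; all other pairs sit within roughly 100 entries of each other.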

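Dataset Creation

Curation Rationale

For details, see the corresponding pages.

Source Data

Who are the source language producers?

All data in this corpus was uploaded by the JRC.

Personal and Sensitive Information

The corpus contains no personal or sensitive information.

Considerations for Using the Data

Other Known Limitations

The nature of the task introduces variability in the quality of the target translations.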

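Additional Information

Dataset Curators

Hugging Face ECDC: Labrak Yanis, Dufour Richard (not affiliated with the original corpus)

An overview of the European Union's highly multilingual parallel corpora: Steinberger Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe Gilbro.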

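Licensing Information

By downloading or using the ECDC-Translation Memory, you are bound by the ECDC-TM usage conditions (PDF).

No Warranty

Each work is provided "as is", without any warranty, obligation or liability of any kind, either express or implied, including but not limited to any implied warranty of merchantability, integration, satisfactory quality and fitness for a particular purpose.

Except in cases of wilful misconduct or damage directly caused to natural persons, the owner will not be liable for any incidental, consequential, direct or indirect damages, including but not limited to loss of data, lost profits or any other financial loss arising from the use of the work, even if the owner has been notified of the possibility of such loss, damages, claims or costs, nor for any claim by any third party. The owner may be liable under national statutory product liability laws insofar as such laws apply to the work.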

Citation Information

Please cite the following paper when using this dataset.

@article{10.1007/s10579-014-9277-0,
  author = {Steinberger, Ralf and Ebrahim, Mohamed and Poulis, Alexandros and Carrasco-Benitez, Manuel and Schl\"{u}ter, Patrick and Przybyszewski, Marek and Gilbro, Signe},
  title = {An Overview of the European Union's Highly Multilingual Parallel Corpora},
  year = {2014},
  issue_date = {December 2014},
  publisher = {Springer-Verlag},
  address = {Berlin, Heidelberg},
  volume = {48},
  number = {4},
  issn = {1574-020X},
  url = {https://doi.org/10.1007/s10579-014-9277-0},
  doi = {10.1007/s10579-014-9277-0},
  abstract = {Starting in 2006, the European Commission's Joint Research Centre and other European Union organisations have made available a number of large-scale highly-multilingual parallel language resources. In this article, we give a comparative overview of these resources and we explain the specific nature of each of them. This article provides answers to a number of question, including: What are these linguistic resources? What is the difference between them? Why were they originally created and why was the data released publicly? What can they be used for and what are the limitations of their usability? What are the text types, subject domains and languages covered? How to avoid overlapping document sets? How do they compare regarding the formatting and the translation alignment? What are their usage conditions? What other types of multilingual linguistic resources does the EU have? This article thus aims to clarify what the similarities and differences between the various resources are and what they can be used for. It will also serve as a reference publication for those resources, for which a more detailed description has been lacking so far (EAC-TM, ECDC-TM and DGT-Acquis).},
  journal = {Lang. Resour. Eval.},
  month = {dec},
  pages = {679--707},
  numpages = {29},
  keywords = {DCEP, EAC-TM, EuroVoc, JRC EuroVoc Indexer JEX, Parallel corpora, DGT-TM, Eur-Lex, Highly multilingual, Linguistic resources, DGT-Acquis, European Union, ECDC-TM, JRC-Acquis, Translation memory}
}