Dataset:

qanastek/WMT-16-PubMed

Tasks:

Translation

Multilinguality:

multilingual

Size:

100K<n<1M

Language creators:

found

Source datasets:

extended

WMT-16-PubMed: Parallel Translation Corpus from the WMT'16 Biomedical Translation Task (PubMed)

Dataset Summary

WMT-16-PubMed is a parallel corpus for neural machine translation, collected and aligned for ACL 2016.

Supported Tasks and Leaderboards

Translation: the dataset can be used to train translation models.

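Languages

The corpus consists of source and target sentence pairs across 4 languages, with English paired against each of the other three:

List of languages: English (en), Spanish (es), French (fr), Portuguese (pt).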

Load the dataset with Hugging Face

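from datasets import load_dataset

# All language pairs are stored in a single "train" split.
# download_mode="force_redownload" bypasses the local cache; drop it to reuse cached files.
dataset = load_dataset("qanastek/WMT-16-PubMed", split="train", download_mode="force_redownload")
print(dataset)     # column names and row count
print(dataset[0])  # first sentence pair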

Dataset Structure

Data Instances

         lang    doc_id                                     workshop publisher                                        source_text                                        target_text
0       en-fr  26839447  WMT'16 Biomedical Translation Task - PubMed    pubmed  Global Health: Where Do Physiotherapy and Reha...  La place des cheveux et des poils dans les rit...
1       en-fr  26837117  WMT'16 Biomedical Translation Task - PubMed    pubmed                                            Carabin                                       Les Carabins
2       en-fr  26837116  WMT'16 Biomedical Translation Task - PubMed    pubmed                                In Process Citation  Le laboratoire d'Anatomie, Biomécanique et Org...
3       en-fr  26837115  WMT'16 Biomedical Translation Task - PubMed    pubmed  Comment on the misappropriation of bibliograph...  Du détournement des références bibliographique...
4       en-fr  26837114  WMT'16 Biomedical Translation Task - PubMed    pubmed  Anti-aging medicine, a science-based, essentia...  La médecine anti-âge, une médecine scientifiqu...
...       ...       ...                                          ...       ...                                                ...                                                ...
973972  en-pt  20274330  WMT'16 Biomedical Translation Task - PubMed    pubmed     Myocardial infarction, diagnosis and treatment     Infarto do miocárdio; diagnóstico e tratamento
973973  en-pt  20274329  WMT'16 Biomedical Translation Task - PubMed    pubmed                          The health areas politics                     A política dos campos de saúde
973974  en-pt  20274328  WMT'16 Biomedical Translation Task - PubMed    pubmed  The role in tissue edema and liquid exchanges ...  O papel dos tecidos nos edemas e nas trocas lí...
973975  en-pt  20274327  WMT'16 Biomedical Translation Task - PubMed    pubmed  About suppuration of the wound after thoracopl...  Sôbre as supurações da ferida operatória após ...
973976  en-pt  20274326  WMT'16 Biomedical Translation Task - PubMed    pubmed  Experimental study of liver lesions in the tre...  Estudo experimental das lesões hepáticas no tr...

Data Fields

lang: The pair of source and target languages, of type String.

source_text: The source text, of type String.

target_text: The target text, of type String.
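The lang field packs both language codes into one string, so a small helper can split it and reshape a row into the nested "translation" dictionary that many Hugging Face translation examples use. This is a minimal sketch: the function names split_lang_pair and to_translation_example are hypothetical, and the hyphen-separated source-target layout is assumed from the sample rows above.

```python
from typing import Dict, Tuple


def split_lang_pair(lang: str) -> Tuple[str, str]:
    """Split a pair code such as "en-fr" into ("en", "fr")."""
    source, sep, target = lang.partition("-")
    if not sep or not source or not target:
        raise ValueError(f"unexpected language pair code: {lang!r}")
    return source, target


def to_translation_example(row: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Reshape one row into {"translation": {src_code: text, tgt_code: text}}."""
    source, target = split_lang_pair(row["lang"])
    return {"translation": {source: row["source_text"], target: row["target_text"]}}


# Sample row copied from the data instances shown in this card.
row = {"lang": "en-fr", "source_text": "Carabin", "target_text": "Les Carabins"}
print(to_translation_example(row))
# {'translation': {'en': 'Carabin', 'fr': 'Les Carabins'}}
```

On a loaded dataset, a function like this would typically be applied with dataset.map(...), preferably after filtering to a single language pair so that every row produces the same keys.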

Data Splits

en-es: 285,584 pairs

en-fr: 614,093 pairs

en-pt: 74,300 pairs
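Because every pair lives in the one train split, selecting a language pair means filtering on lang. The sketch below counts rows per pair and keeps a single pair, using a few in-memory rows in place of the real dataset; count_pairs and select_pair are illustrative names, not part of any library.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Pair sizes as listed in this card.
EXPECTED_SIZES = {"en-es": 285_584, "en-fr": 614_093, "en-pt": 74_300}


def count_pairs(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count how many rows belong to each language pair."""
    return Counter(row["lang"] for row in rows)


def select_pair(rows: Iterable[Dict[str, str]], pair: str) -> List[Dict[str, str]]:
    """Keep only the rows whose lang equals the requested pair."""
    return [row for row in rows if row["lang"] == pair]


# Tiny stand-in for the dataset, built from rows shown in the data instances.
sample = [
    {"lang": "en-fr", "source_text": "Carabin", "target_text": "Les Carabins"},
    {"lang": "en-pt", "source_text": "The health areas politics",
     "target_text": "A política dos campos de saúde"},
]
print(count_pairs(sample))
print(len(select_pair(sample, "en-pt")))
print(sum(EXPECTED_SIZES.values()))  # 973977 sentence pairs in total
```

The total of 973,977 agrees with the last row index (973976) in the data instances. On the loaded dataset, the equivalent selection would be dataset.filter(lambda row: row["lang"] == "en-pt").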

Dataset Creation

Curation Rationale

For details, see the corresponding pages.

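Source Data

Who are the source language producers?

The shared task was organized by:

  • Antonio Jimeno Yepes (IBM Research Australia)
  • Aurélie Névéol (LIMSI, CNRS, France)
  • Mariana Neves (Hasso-Plattner Institute, Germany)
  • Karin Verspoor (University of Melbourne, Australia)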

Personal and Sensitive Information

The corpus contains no personal or sensitive information.

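Considerations for Using the Data

Other Known Limitations

The nature of the task introduces variability in the quality of the target translations.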
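One way to cope with this variability is to drop obviously unusable pairs before training. The following is a heuristic sketch, not a procedure described by the dataset authors: the placeholder string comes from the sample rows in this card, while the length-ratio threshold of 3.0 and the function name looks_usable are assumptions chosen for illustration.

```python
from typing import Dict

# Placeholder that appears as a source text in the sample rows of this card.
PLACEHOLDER_TEXTS = {"In Process Citation"}


def looks_usable(row: Dict[str, str], max_length_ratio: float = 3.0) -> bool:
    """Return False for pairs that are empty, placeholders, or badly length-mismatched."""
    source = row["source_text"].strip()
    target = row["target_text"].strip()
    if not source or not target:
        return False
    if source in PLACEHOLDER_TEXTS or target in PLACEHOLDER_TEXTS:
        return False
    # Compare character lengths; a very lopsided pair is unlikely to be a translation.
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    return longer / shorter <= max_length_ratio


good = {"source_text": "Carabin", "target_text": "Les Carabins"}
# Target text shortened for the example.
bad = {"source_text": "In Process Citation", "target_text": "Le laboratoire d'Anatomie"}
print(looks_usable(good), looks_usable(bad))  # True False
```

On the loaded dataset, such a predicate could be passed directly to dataset.filter(looks_usable). The threshold should be tuned on a sample, since titles and their translations can legitimately differ in length.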

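Additional Information

Dataset Curators

Hugging Face WMT-16-PubMed: Labrak Yanis, Dufour Richard (not affiliated with the original corpus)

WMT'16 Shared Task: Biomedical Translation Task:

  • Antonio Jimeno Yepes (IBM Research Australia)
  • Aurélie Névéol (LIMSI, CNRS, France)
  • Mariana Neves (Hasso-Plattner Institute, Germany)
  • Karin Verspoor (University of Melbourne, Australia)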

Citation Information

Please cite the following paper when using this dataset.

@inproceedings{bojar-etal-2016-findings,
    title = {Findings of the 2016 Conference on Machine Translation},
    author = {
      Bojar, Ondrej  and
      Chatterjee, Rajen  and
      Federmann, Christian  and
      Graham, Yvette  and
      Haddow, Barry  and
      Huck, Matthias  and
      Jimeno Yepes, Antonio  and
      Koehn, Philipp  and
      Logacheva, Varvara  and
      Monz, Christof  and
      Negri, Matteo  and
      Neveol, Aurelie  and
      Neves, Mariana  and
      Popel, Martin  and
      Post, Matt  and
      Rubino, Raphael  and
      Scarton, Carolina  and
      Specia, Lucia  and
      Turchi, Marco  and
      Verspoor, Karin  and
      Zampieri, Marcos
    },
    booktitle = {Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers},
    month = aug,
    year = {2016},
    address = {Berlin, Germany},
    publisher = {Association for Computational Linguistics},
    url = {https://aclanthology.org/W16-2301},
    doi = {10.18653/v1/W16-2301},
    pages = {131--198},
}