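Dataset:
gsarti/magpie
Language:
en
Multilinguality:
monolingual
Size:
10K<n<100K
Language creators:
expert-generated
Annotation creators:
expert-generated
Source datasets:
original
License:
cc-by-4.0

The MAGPIE corpus (Haagsma et al. 2020) is a large sense-annotated corpus of potentially idiomatic expressions (PIEs), based on the British National Corpus (BNC). Potentially idiomatic expressions are like idioms, but the term also covers literal uses of idiomatic expressions, such as "I leave work at the end of the day." for the idiom "at the end of the day". This version of the dataset reflects the filtered subset used by Dankers et al. (2022) in their investigation of how PIEs are represented by NMT models. The authors use 37k samples annotated as fully figurative or fully literal, covering 1482 idioms that contain nouns, numerals or color adjectives (which they refer to as keywords). Because idioms show syntactic and morphological variability, the focus is mostly on nouns. PIEs and their contexts are separated using the word-level annotations of the original corpus.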
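The language data in MAGPIE is in English (BCP-47 en).
The magpie configuration contains sentences annotated for the presence, usage and type of potentially idiomatic expressions. An example from the train split of the magpie config (default) is shown below.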
{
  'sentence': 'There seems to be a dearth of good small tools across the board.',
  'annotation': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
  'idiom': 'across the board',
  'usage': 'figurative',
  'variant': 'identical',
  'pos_tags': ['ADV', 'VERB', 'PART', 'VERB', 'DET', 'NOUN', 'ADP', 'ADJ', 'ADJ', 'NOUN', 'ADP', 'DET', 'NOUN']
}
The text is provided as-is, without further preprocessing or tokenization.
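The annotation field in the example can be read as a token-level mask over the sentence. The following is a minimal sketch of that reading, assuming the mask aligns with whitespace-separated tokens (this holds for the example record, where both have 13 entries, but it is an assumption rather than a documented guarantee). The helper name extract_pie_tokens is hypothetical and not part of the dataset.

```python
def extract_pie_tokens(sentence, annotation):
    """Return the tokens of `sentence` that are flagged with 1 in `annotation`.

    Assumption: `annotation` has one entry per whitespace-separated token.
    """
    tokens = sentence.split()
    if len(tokens) != len(annotation):
        # Fail loudly instead of silently misaligning tokens and flags.
        raise ValueError(
            f"{len(tokens)} tokens but {len(annotation)} annotation flags"
        )
    return [token for token, flag in zip(tokens, annotation) if flag == 1]


# Example record copied from the dataset card.
example = {
    "sentence": "There seems to be a dearth of good small tools across the board.",
    "annotation": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    "idiom": "across the board",
}

pie_tokens = extract_pie_tokens(example["sentence"], example["annotation"])
print(pie_tokens)  # ['across', 'board.']
```

In this record only "across" and "board." are flagged, while "the" is not, and the final period stays attached to the last token because the text is not tokenized further.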
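The fields are those shown in the example record above: sentence, annotation, idiom, usage, variant and pos_tags. The table below gives the number of examples in the train split of the magpie config.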
config | train |
---|---|
magpie | 44451 |
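Because a PIE can be used either figuratively or literally, a natural first check on records of this shape is how the usage labels are distributed per idiom. The sketch below tallies them with the standard library. The function name usage_counts_by_idiom is hypothetical, the label value 'literal' is assumed by analogy with 'figurative' from the example record, and the last two sample records are invented for illustration only.

```python
from collections import Counter, defaultdict


def usage_counts_by_idiom(records):
    """Count how often each usage label occurs for each idiom."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record["idiom"]][record["usage"]] += 1
    # Convert to plain dicts so the result is easy to print or compare.
    return {idiom: dict(counter) for idiom, counter in counts.items()}


records = [
    # Taken from the example record on the dataset card.
    {"idiom": "across the board", "usage": "figurative"},
    # Invented records; the 'literal' label value is an assumption.
    {"idiom": "at the end of the day", "usage": "literal"},
    {"idiom": "at the end of the day", "usage": "figurative"},
]

summary = usage_counts_by_idiom(records)
for idiom, counts in summary.items():
    print(idiom, counts)
```

The same function should work on any iterable of dict-like rows that expose idiom and usage keys.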
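For more information on the creation of the dataset, see the original paper MAGPIE: A Large Corpus of Potentially Idiomatic Expressions. For more information on the filtering of the selected idioms, see the paper Can Transformer be Too Compositional? Analysing Idiom Processing in Neural Machine Translation.
The original authors are the curators of the original dataset. For problems or updates concerning this version of the dataset, please contact gabriele.sarti996@gmail.com.
The dataset is licensed under the Creative Commons 4.0 license (CC-BY-4.0).
If you use this corpus in your work, please cite the authors: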
@inproceedings{haagsma-etal-2020-magpie,
  title = "{MAGPIE}: A Large Corpus of Potentially Idiomatic Expressions",
  author = "Haagsma, Hessel and Bos, Johan and Nissim, Malvina",
  booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
  month = may,
  year = "2020",
  address = "Marseille, France",
  publisher = "European Language Resources Association",
  url = "https://aclanthology.org/2020.lrec-1.35",
  pages = "279--287",
  language = "English",
  ISBN = "979-10-95546-34-4",
}

@inproceedings{dankers-etal-2022-transformer,
  title = "Can Transformer be Too Compositional? Analysing Idiom Processing in Neural Machine Translation",
  author = "Dankers, Verna and Lucas, Christopher and Titov, Ivan",
  booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month = may,
  year = "2022",
  address = "Dublin, Ireland",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.acl-long.252",
  doi = "10.18653/v1/2022.acl-long.252",
  pages = "3608--3626",
}