Dataset:
ekinakyurek/ftrace
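Paper: https://arxiv.org/abs/2205.11482
FTRACE is a zero-shot information retrieval benchmark for tracing a language model's predictions back to training examples. In the accompanying paper, we evaluate commonly studied influence methods, including gradient-based (TracIn) and embedding-based approaches. The dataset contains two parts. First, the factual queries whose knowledge we trace are extracted from existing LAMA queries (Petroni et al., 2019). Second, Wikidata sentences are extracted from the TREx corpus (Elsahar et al., 2018). We annotate each extracted sentence with the facts it states, so that these facts can be matched against the facts in the query set. In both parts, we provide (input, target) pairs formatted as a masked language modeling task; see the examples below. The same data can also be used in other formulations, for example autoregressive completion, by processing the inputs_pretokenized and targets_pretokenized fields.
An example of 'Abstracts' looks as follows.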
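The autoregressive reformulation mentioned above could be sketched roughly as follows. This is only an illustration, not the processing used in the paper: the helper name `to_autoregressive` is hypothetical, and the sketch assumes each record contains exactly one sentinel of the form `<extra_id_N>` and that the target string starts with that same sentinel, as in the examples on this card.

```python
import re

# Sentinel tokens such as <extra_id_0> mark the masked span.
SENTINEL = re.compile(r"<extra_id_\d+>")


def to_autoregressive(inputs_pretokenized: str, targets_pretokenized: str):
    """Split a single-mask record into (prompt, completion, suffix).

    Hypothetical helper: assumes exactly one sentinel in the input and a
    target of the form "<sentinel> answer".
    """
    sentinels = SENTINEL.findall(inputs_pretokenized)
    if len(sentinels) != 1:
        raise ValueError("this sketch handles exactly one sentinel per input")
    sentinel = sentinels[0]

    # Text before the mask becomes the prompt; text after it is the suffix.
    prefix, suffix = inputs_pretokenized.split(sentinel, 1)

    if not targets_pretokenized.startswith(sentinel):
        raise ValueError("target does not start with the input's sentinel")
    completion = targets_pretokenized[len(sentinel):].strip()

    return prefix.rstrip(), completion, suffix


# Usage with the 'Queries' example record from this card.
prompt, completion, suffix = to_autoregressive(
    "Paul Ehrlich used to work in <extra_id_0> .",
    "<extra_id_0> Frankfurt",
)
print(repr(prompt), repr(completion), repr(suffix))
```

A strictly left-to-right model only sees `prompt`, so any context in `suffix` is lost. Whether that is acceptable depends on where the mask falls in the sentence.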
{"inputs_pretokenized": "The name Austroasiatic comes from the Latin words for \"south\" and \"Asia\", hence \"<extra_id_0>\".",
"targets_pretokenized": "<extra_id_0> South Asia",
"page_uri": "Q33199",
"masked_uri": "Q771405",
"masked_type": "subject",
"example_uris": "Q33199-1-Q48-Q771405-1",
"facts": "P361,Q48,Q771405;P30,Q48,Q771405",
"id": 8}
An example of 'Queries' looks as follows.
{"inputs_pretokenized": "Paul Ehrlich used to work in <extra_id_0> .",
"targets_pretokenized": "<extra_id_0> Frankfurt",
"uuid": "5b063008-a8ba-4064-9f59-e70102bb8c50",
"obj_uri": "Q1794",
"sub_uri": "Q57089",
"predicate_id": "P937",
"obj_surface": "Frankfurt",
"sub_surface": "Paul Ehrlich"}
The data fields are the same across all splits.
name | train |
---|---|
Abstracts | 1560453 |
Queries | 31479 |
LAMA: https://github.com/facebookresearch/LAMA
TRex:12,393,21
Parts of this dataset are available under the Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0) and the Creative Commons Attribution-NonCommercial 4.0 International License.
The main paper should be cited as follows:
@misc{https://doi.org/10.48550/arxiv.2205.11482,
doi = {10.48550/ARXIV.2205.11482},
url = {https://arxiv.org/abs/2205.11482},
author = {Akyürek, Ekin and Bolukbasi, Tolga and Liu, Frederick and Xiong, Binbin and Tenney, Ian and Andreas, Jacob and Guu, Kelvin},
keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), FOS: Computer and information sciences},
title = {Tracing Knowledge in Language Models Back to the Training Data},
publisher = {arXiv},
year = {2022},
}
Please also cite Petroni et al., 2019 for the query set, and Elsahar et al., 2018 for the abstract set.
@inproceedings{petroni2019language,
title={Language Models as Knowledge Bases?},
author={F. Petroni and T. Rockt{\"{a}}schel and A. H. Miller and P. Lewis and A. Bakhtin and Y. Wu and S. Riedel},
booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year={2019}
}
@inproceedings{elsahar2018t,
title={T-rex: A large scale alignment of natural language with knowledge base triples},
author={Elsahar, Hady and Vougiouklis, Pavlos and Remaci, Arslen and Gravier, Christophe and Hare, Jonathon and Laforest, Frederique and Simperl, Elena},
booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
year={2018}
}