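TAPAS large model fine-tuned on Sequential Question Answering (SQA)

This model is available in two versions. The default version corresponds to the tapas_sqa_inter_masklm_large_reset checkpoint of the original GitHub repository. It was pre-trained on MLM and on an additional step that the authors call intermediate pre-training, and then fine-tuned on SQA. It uses relative position embeddings (i.e. the position index is reset at every cell of the table).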

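The other (non-default) version that can be used is:

no_reset, which corresponds to tapas_sqa_inter_masklm_large (intermediate pre-training, absolute position embeddings).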
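The difference between the two versions can be illustrated with a small sketch. It is not taken from the TAPAS code base: the function name `position_ids` and the toy tokens are hypothetical, and the trailing [SEP] is left out for brevity. It only shows what "resetting the position index at every cell" means compared with absolute positions.

```python
from typing import List


def position_ids(question: List[str], cells: List[List[str]], reset: bool) -> List[int]:
    """Position index for each token of: [CLS] question [SEP] cell tokens...

    reset=False: absolute positions 0, 1, 2, ... over the whole sequence.
    reset=True:  the counter restarts at 0 at the first token of every table cell.
    """
    prefix_len = len(question) + 2  # [CLS] + question tokens + [SEP]
    ids = list(range(prefix_len))
    for cell in cells:
        start = 0 if reset else len(ids)
        ids.extend(range(start, start + len(cell)))
    return ids


# Toy example: a 4-token question and four cells, one of which has two tokens.
question = ["how", "many", "movies", "?"]
cells = [["actor"], ["movies"], ["brad", "pitt"], ["87"]]

absolute = position_ids(question, cells, reset=False)
relative = position_ids(question, cells, reset=True)
print(absolute)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(relative)  # [0, 1, 2, 3, 4, 5, 0, 0, 0, 1, 0]
```

With the reset scheme, a token's position says where it sits inside its cell rather than where it sits in the flattened sequence, so position indices stay small even for large tables.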

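Disclaimer: The team releasing TAPAS did not write a model card for this model, so this model card has been written by the Hugging Face team and contributors.

Results on SQA - Dev Accuracy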

Size	Reset	Dev Accuracy
LARGE	noreset	0.7223
LARGE	reset	0.7289
BASE	noreset	0.6737
BASE	reset	0.6874
MEDIUM	noreset	0.6464
MEDIUM	reset	0.6561
SMALL	noreset	0.5876
SMALL	reset	0.6155
MINI	noreset	0.4574
MINI	reset	0.5148
TINY	noreset	0.2004
TINY	reset	0.2375

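Model description

TAPAS is a BERT-like transformers model pre-trained in a self-supervised fashion on a large corpus of English data from Wikipedia. This means it was pre-trained on the raw tables and associated texts only, with no human labelling of any kind (which is why it can use large amounts of publicly available data), using an automatic process to generate inputs and labels from those texts. More precisely, it was pre-trained with two objectives: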

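Masked language modeling (MLM): given a (flattened) table and its associated context, the model randomly masks 15% of the words in the input, then runs the entire (partially masked) sequence through the model. The model then has to predict the masked words. This differs from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models such as GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of a table and its associated text.
Intermediate pre-training: to encourage numerical reasoning on tables, the authors additionally pre-trained the model on a balanced dataset of millions of syntactically created training examples. Here, the model must predict (classify) whether a sentence is supported or refuted by the contents of a table. The training examples are created from synthetic as well as counterfactual statements.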
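The masking step of the MLM objective can be sketched as follows. This is a simplified illustration rather than the authors' pre-training code: the helper `mask_tokens`, the seed, and the toy token list are hypothetical, whitespace tokens stand in for WordPiece units, and every chosen token is replaced by [MASK] (the card does not say whether other replacement strategies are used).

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: List[str], rate: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[Optional[str]]]:
    """Replace about `rate` of the tokens with [MASK]; return (inputs, labels).

    labels[i] holds the original token at masked positions and None elsewhere,
    so a loss would only be computed where the model has to predict a token.
    """
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(len(tokens) * rate)))
    chosen = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in chosen else None for i, tok in enumerate(tokens)]
    return inputs, labels


# Question tokens followed by flattened table tokens (20 tokens in total).
tokens = (
    "how many movies ? actor movies brad pitt 87 leonardo di caprio 53 "
    "george clooney 69 matt damon 80 tom"
).split()

inputs, labels = mask_tokens(tokens)
print(inputs.count(MASK))  # 3, i.e. 15% of 20 tokens
```

Because the whole partially masked sequence is visible at once, a prediction at a masked position can draw on tokens to its left and right, which is what makes the learned representation bidirectional.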

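This way, the model learns an inner representation of the English language used in tables and associated texts, which can then be used to extract features useful for downstream tasks such as answering questions about a table, or determining whether a sentence is supported or refuted by the contents of a table. Fine-tuning is done by adding a cell selection head on top of the pre-trained model, and then jointly training this randomly initialized classification head with the base model on SQA.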
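A cell selection head of this kind can be sketched in plain Python. The sketch assumes a common formulation (a linear layer scores each token, token scores are averaged per cell, and a sigmoid turns the average into a selection probability); the function `cell_probabilities`, the 0.5 threshold, and the toy dimensions are illustrative assumptions, not the released implementation.

```python
import math
import random
from typing import Dict, List


def cell_probabilities(
    hidden: List[List[float]],
    cell_ids: List[int],
    weights: List[float],
    bias: float = 0.0,
) -> Dict[int, float]:
    """Score tokens with a linear layer, average per cell, apply a sigmoid.

    cell_ids[i] is the table cell that token i belongs to;
    -1 marks question and special tokens, which are never selected.
    """
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for vec, cell in zip(hidden, cell_ids):
        if cell < 0:
            continue
        logit = sum(w * h for w, h in zip(weights, vec)) + bias
        sums[cell] = sums.get(cell, 0.0) + logit
        counts[cell] = counts.get(cell, 0) + 1
    return {c: 1.0 / (1.0 + math.exp(-sums[c] / counts[c])) for c in sums}


rng = random.Random(0)
hidden_size = 4
# Randomly initialised head weights, as they would be before fine-tuning.
weights = [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
# Stand-in for the encoder output: one vector per token.
hidden = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(6)]
cell_ids = [-1, -1, 0, 1, 1, 2]

probs = cell_probabilities(hidden, cell_ids, weights)
selected = [cell for cell, p in probs.items() if p > 0.5]
```

Before fine-tuning, the probabilities are close to 0.5 for every cell because the head is random; joint training with the base model is what moves them towards the annotated answer cells.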

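Intended uses & limitations

You can use this model to answer questions about a table, for example in a conversational set-up.

For code examples, see the TAPAS documentation on the Hugging Face website.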
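As a rough starting point, a conversational query could look like the sketch below. Several things here are assumptions rather than facts from this card: the Hub id `google/tapas-large-finetuned-sqa` is inferred from the model name, the table and questions are made up, and the code presumes a transformers installation (with PyTorch and pandas) that ships the `table-question-answering` pipeline. The TAPAS documentation remains the authoritative reference.

```python
from typing import Dict, List

# Assumption: Hub id inferred from the model name; it is not stated in this card.
MODEL_ID = "google/tapas-large-finetuned-sqa"

# Every table cell is given as a string, including numbers.
TABLE: Dict[str, List[str]] = {
    "Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
    "Number of movies": ["87", "53", "69"],
}

# Hypothetical follow-up questions; the second one refers back to the first.
QUERIES = [
    "Which actor appeared in the most movies?",
    "How many movies is that?",
]


def answer_sequence(table, queries, model_id=MODEL_ID):
    """Answer the questions in order (downloads the model on first use)."""
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import pipeline

    qa = pipeline("table-question-answering", model=model_id)
    # sequential=True asks the pipeline to process the questions one after
    # another, which suits conversational models such as SQA.
    return qa(table=table, query=queries, sequential=True)
```

Calling `answer_sequence(TABLE, QUERIES)` should return one result per question. If the non-default version is published under the revision name no_reset, passing `revision="no_reset"` to `pipeline` would presumably select it; that is untested here.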

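Training procedure

Preprocessing

The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The inputs of the model are then of the form: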

[CLS] Question [SEP] Flattened table [SEP]
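The layout above can be reproduced with a few lines of Python. This is only a sketch: the helper `build_input` is hypothetical, the table is assumed to be flattened row by row with the header first, and whitespace splitting stands in for WordPiece, which would further break words into sub-word units.

```python
from typing import List


def build_input(question: str, table: List[List[str]]) -> str:
    """Lowercase the text and lay it out as [CLS] question [SEP] flattened table [SEP]."""
    q_tokens = question.lower().split()
    # Flatten row by row: header cells first, then each data row in order.
    t_tokens = [tok for row in table for cell in row for tok in cell.lower().split()]
    return " ".join(["[CLS]", *q_tokens, "[SEP]", *t_tokens, "[SEP]"])


table = [
    ["Actor", "Movies"],
    ["Brad Pitt", "87"],
    ["George Clooney", "69"],
]
text = build_input("How many movies does Brad Pitt have?", table)
print(text)
# [CLS] how many movies does brad pitt have? [SEP] actor movies brad pitt 87 george clooney 69 [SEP]
```

A real tokenizer would also record which row and column each table token came from, since the flat string alone no longer carries the table structure.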

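Fine-tuning

The model was fine-tuned on 32 Cloud TPU v3 cores for 200,000 steps with a maximum sequence length of 512 and a batch size of 128. In this setup, fine-tuning takes around 20 hours. The optimizer used is Adam with a learning rate of 1.25e-5 and a warmup ratio of 0.2. An inductive bias is added so that the model only selects cells from the same column. This is reflected by the select_one_column parameter of TapasConfig. See also table 12 of the original paper.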
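The single-column inductive bias can be illustrated with a simplified sketch. It is not the logic behind `select_one_column` in the library: here a column is scored by the mean selection probability of its cells, and the function name `restrict_to_one_column`, the 0.5 threshold, and the toy probabilities are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column)


def restrict_to_one_column(cell_probs: Dict[Cell, float], threshold: float = 0.5) -> List[Cell]:
    """Keep only the selected cells that lie in the single best-scoring column."""
    by_column: Dict[int, List[float]] = {}
    for (_, col), p in cell_probs.items():
        by_column.setdefault(col, []).append(p)
    if not by_column:
        return []
    # Simplification: the best column is the one with the highest mean probability.
    best = max(by_column, key=lambda c: sum(by_column[c]) / len(by_column[c]))
    return sorted(cell for cell, p in cell_probs.items() if cell[1] == best and p > threshold)


cell_probs = {
    (0, 0): 0.10, (0, 1): 0.90,
    (1, 0): 0.60, (1, 1): 0.80,
    (2, 0): 0.20, (2, 1): 0.30,
}
print(restrict_to_one_column(cell_probs))  # [(0, 1), (1, 1)]
```

Cell (1, 0) clears the threshold on its own but is dropped because it lies outside the chosen column, which is the effect the constraint is meant to have: answers to SQA questions are expected to come from one column.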

BibTeX entry and citation info

@misc{herzig2020tapas,
      title={TAPAS: Weakly Supervised Table Parsing via Pre-training}, 
      author={Jonathan Herzig and Paweł Krzysztof Nowak and Thomas Müller and Francesco Piccinno and Julian Martin Eisenschlos},
      year={2020},
      eprint={2004.02349},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

@misc{eisenschlos2020understanding,
      title={Understanding tables with intermediate pre-training}, 
      author={Julian Martin Eisenschlos and Syrine Krichene and Thomas Müller},
      year={2020},
      eprint={2010.00571},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@InProceedings{iyyer2017search-based,
author = {Iyyer, Mohit and Yih, Scott Wen-tau and Chang, Ming-Wei},
title = {Search-based Neural Structured Learning for Sequential Question Answering},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},
year = {2017},
month = {July},
abstract = {Recent work in semantic parsing for question answering has focused on long and complicated questions, many of which would seem unnatural if asked in a normal conversation between two humans. In an effort to explore a conversational QA setting, we present a more realistic task: answering sequences of simple but inter-related questions. We collect a dataset of 6,066 question sequences that inquire about semi-structured tables from Wikipedia, with 17,553 question-answer pairs in total. To solve this sequential question answering task, we propose a novel dynamic neural semantic parsing framework trained using a weakly supervised reward-guided search. Our model effectively leverages the sequential context to outperform state-of-the-art QA systems that are designed to answer highly complex questions.},
publisher = {Association for Computational Linguistics},
url = {https://www.microsoft.com/en-us/research/publication/search-based-neural-structured-learning-sequential-question-answering/},
}

Author:

Google AI

Dataset size:

2.51 GB