TAPAS base model fine-tuned on Sequential Question Answering (SQA)

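This model has two versions which can be used. The default version corresponds to the tapas_sqa_inter_masklm_base_reset checkpoint of the original GitHub repository. This model was pre-trained on MLM and on an additional step that the authors call intermediate pre-training, and was then fine-tuned on SQA. It uses relative position embeddings (i.e. the position index is reset at every cell of the table).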

The other (non-default) version which can be used is:

no_reset, which corresponds to tapas_sqa_inter_masklm_base (intermediate pre-training, absolute position embeddings).
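
The difference between the two variants can be sketched in a few lines of Python. This is a minimal illustration, not code from the TAPAS repository: the function name `position_ids` and the input representation (the question followed by the table cells, each given as a list of tokens) are assumptions made for the example.

```python
from typing import List


def position_ids(segments: List[List[str]], reset: bool) -> List[int]:
    """Return one position index per token.

    segments: the question followed by the table cells, each a list of tokens
              (hypothetical representation, chosen for illustration).
    reset=True  -> the index restarts at 0 in every segment (relative positions).
    reset=False -> the index keeps counting across segments (absolute positions).
    """
    ids: List[int] = []
    offset = 0
    for tokens in segments:
        for i in range(len(tokens)):
            ids.append(i if reset else offset + i)
        offset += len(tokens)
    return ids


segments = [
    ["how", "many", "goals", "?"],  # question
    ["player"],                     # header cell
    ["lionel", "messi"],            # table cell
    ["goals"],                      # header cell
    ["30"],                         # table cell
]
print(position_ids(segments, reset=True))
print(position_ids(segments, reset=False))
```

With `reset=True` every cell starts again at index 0, so the model sees where a token sits inside its cell rather than inside the whole sequence.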

Disclaimer: The team releasing TAPAS did not write a model card for this model, so this model card has been written by the Hugging Face team and contributors.

Results on SQA - Dev Accuracy

Size	Reset	Dev Accuracy
LARGE	noreset	0.7223
LARGE	reset	0.7289
BASE	noreset	0.6737
BASE	reset	0.6874
MEDIUM	noreset	0.6464
MEDIUM	reset	0.6561
SMALL	noreset	0.5876
SMALL	reset	0.6155
MINI	noreset	0.4574
MINI	reset	0.5148
TINY	noreset	0.2004
TINY	reset	0.2375

Model description

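TAPAS is a BERT-like transformers model pretrained on a large corpus of English data from Wikipedia in a self-supervised fashion. This means it was pretrained on the raw tables and associated texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives: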

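Masked language modeling (MLM): taking a (flattened) table and associated context, the model randomly masks 15% of the words in the input, then runs the entire (partially masked) sequence through the model. The model then has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of a table and associated text.
Intermediate pre-training: to encourage numerical reasoning on tables, the authors additionally pre-trained the model on a balanced dataset of millions of synthetically created training examples. Here, the model must predict (classify) whether a sentence is supported or refuted by the contents of a table. The training examples are created from synthetic as well as counterfactual statements.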
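
The masking step of the MLM objective can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' preprocessing code: each token is masked independently with probability 0.15 (so about 15% of tokens on average), special tokens are left untouched, and the names `mask_tokens` and `MASK` are hypothetical.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}


def mask_tokens(
    tokens: List[str], mask_prob: float = 0.15, seed: Optional[int] = None
) -> Tuple[List[str], List[Optional[str]]]:
    """Mask tokens at random and return (masked_tokens, labels).

    labels[i] holds the original token where a mask was applied, and None
    elsewhere, so a loss would only be computed on masked positions.
    Assumption: independent per-token masking, which is a simplification.
    """
    rng = random.Random(seed)
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if tok not in SPECIAL and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = ["[CLS]", "how", "many", "goals", "[SEP]", "player", "goals", "messi", "30", "[SEP]"]
masked, labels = mask_tokens(tokens, seed=0)
print(masked)
print(labels)
```

Because the model receives the whole partially masked sequence at once, it can use tokens on both sides of a mask, from the question as well as from the table, to predict the missing word.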

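This way, the model learns an inner representation of the English language used in tables and associated texts, which can then be used to extract features useful for downstream tasks such as answering questions about a table, or determining whether a sentence is entailed or refuted by the contents of a table. Fine-tuning is done by adding a cell selection head on top of the pre-trained model, and then jointly training this randomly initialized classification head with the base model on SQA.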

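Intended uses & limitations

You can use this model for answering questions related to a table in a conversational set-up.

For code examples, we refer to the documentation of TAPAS on the HuggingFace website.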
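
As a starting point, a conversational query might look like the sketch below, which uses the `table-question-answering` pipeline from the `transformers` library. Several things here are assumptions rather than facts stated in this card: the hub identifier `google/tapas-base-finetuned-sqa`, the example table and questions, and the helper names `validate_table` and `answer_sequentially`. The actual call is left commented out because it downloads the model weights.

```python
from typing import Dict, List


def validate_table(table: Dict[str, List[str]]) -> None:
    """Check that the table has equal-length columns and string-only cells."""
    lengths = {len(col) for col in table.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of rows")
    for name, col in table.items():
        for value in col:
            if not isinstance(value, str):
                raise ValueError(f"cell {value!r} in column {name!r} must be a string")


def answer_sequentially(
    table: Dict[str, List[str]],
    queries: List[str],
    model_name: str = "google/tapas-base-finetuned-sqa",  # assumed hub identifier
):
    """Answer a sequence of related questions about one table."""
    # Imported lazily so this file can be loaded without transformers installed.
    from transformers import pipeline

    validate_table(table)
    tqa = pipeline("table-question-answering", model=model_name)
    # sequential=True lets each question build on the answer to the previous one.
    return tqa(table=table, query=queries, sequential=True)


# Made-up example data; every cell is given as a string.
table = {
    "Actor": ["Brad Pitt", "Leonardo DiCaprio", "George Clooney"],
    "Number of movies": ["87", "53", "69"],
}
queries = ["Which actors are listed?", "How many movies does the first one have?"]

# Uncomment to run (downloads the model weights on first use):
# for question, result in zip(queries, answer_sequentially(table, queries)):
#     print(question, "->", result["answer"])
```

The second question only makes sense in the context of the first, which is the conversational setting SQA targets.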

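Training procedure

Preprocessing

The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The inputs of the model are then of the form: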

[CLS] Question [SEP] Flattened table [SEP]
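
The input layout above can be sketched with a toy example. This is only an illustration: whitespace splitting stands in for the real WordPiece tokenizer, the row-by-row flattening with the header first is an assumption of this sketch, and `build_input` is a hypothetical helper name.

```python
from typing import List


def build_input(question: str, header: List[str], rows: List[List[str]]) -> List[str]:
    """Build a '[CLS] question [SEP] flattened table [SEP]' token sequence.

    Assumption: lowercasing plus whitespace splitting replaces WordPiece,
    and the table is flattened row by row, starting with the header.
    """
    tokens = ["[CLS]"] + question.lower().split() + ["[SEP]"]
    for row in [header] + rows:
        for cell in row:
            tokens.extend(cell.lower().split())
    tokens.append("[SEP]")
    return tokens


example = build_input(
    "How many goals?",
    header=["Player", "Goals"],
    rows=[["Lionel Messi", "30"]],
)
print(" ".join(example))
```

In the real model each token also carries row, column, and position information, which this toy flattening does not show.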

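Fine-tuning

The model was fine-tuned on 32 Cloud TPU v3 cores for 200,000 steps with a maximum sequence length of 512 and a batch size of 128. In this setup, fine-tuning takes around 20 hours. The optimizer used is Adam with a learning rate of 1.25e-5 and a warmup ratio of 0.2. An inductive bias is added such that the model only selects cells of the same column. This is reflected by the select_one_column parameter of TapasConfig. See also table 12 of the original paper.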
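
To make the warmup ratio concrete: with 200,000 steps and a ratio of 0.2, the learning rate would ramp up over the first 40,000 steps before reaching 1.25e-5. The sketch below assumes a linear warmup followed by a linear decay to zero, as is common for BERT-style training. The decay shape is an assumption of this example and is not stated in this card, and `learning_rate` is a hypothetical helper.

```python
def learning_rate(
    step: int,
    total_steps: int = 200_000,
    peak_lr: float = 1.25e-5,
    warmup_ratio: float = 0.2,
) -> float:
    """Learning rate at a given step.

    Linear warmup to peak_lr over the first warmup_ratio * total_steps steps,
    then (ASSUMPTION) linear decay to zero at total_steps.
    """
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 20_000, 40_000, 120_000, 200_000):
    print(step, learning_rate(step))
```

Under these assumptions the rate is half of its peak at step 20,000, peaks at step 40,000, and is back to half of the peak at step 120,000.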

BibTeX entry and citation info

@misc{herzig2020tapas,
      title={TAPAS: Weakly Supervised Table Parsing via Pre-training}, 
      author={Jonathan Herzig and Paweł Krzysztof Nowak and Thomas Müller and Francesco Piccinno and Julian Martin Eisenschlos},
      year={2020},
      eprint={2004.02349},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

@misc{eisenschlos2020understanding,
      title={Understanding tables with intermediate pre-training}, 
      author={Julian Martin Eisenschlos and Syrine Krichene and Thomas Müller},
      year={2020},
      eprint={2010.00571},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@InProceedings{iyyer2017search-based,
author = {Iyyer, Mohit and Yih, Scott Wen-tau and Chang, Ming-Wei},
title = {Search-based Neural Structured Learning for Sequential Question Answering},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},
year = {2017},
month = {July},
abstract = {Recent work in semantic parsing for question answering has focused on long and complicated questions, many of which would seem unnatural if asked in a normal conversation between two humans. In an effort to explore a conversational QA setting, we present a more realistic task: answering sequences of simple but inter-related questions. We collect a dataset of 6,066 question sequences that inquire about semi-structured tables from Wikipedia, with 17,553 question-answer pairs in total. To solve this sequential question answering task, we propose a novel dynamic neural semantic parsing framework trained using a weakly supervised reward-guided search. Our model effectively leverages the sequential context to outperform state-of-the-art QA systems that are designed to answer highly complex questions.},
publisher = {Association for Computational Linguistics},
url = {https://www.microsoft.com/en-us/research/publication/search-based-neural-structured-learning-sequential-question-answering/},
}

Author:

Google AI

Dataset size:

845 MB