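TAPAS medium model fine-tuned on Sequential Question Answering (SQA)

This model has two versions which can be used. The default version corresponds to the tapas_sqa_inter_masklm_medium_reset checkpoint of the original GitHub repository. This model was pre-trained on MLM and an additional step which the authors call intermediate pre-training, and then fine-tuned on SQA. It uses relative position embeddings (i.e. the position index is reset at every cell of the table).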

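The other (non-default) version which can be used is:

no_reset, which corresponds to tapas_sqa_inter_masklm_medium (intermediate pre-training, absolute position embeddings).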
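
The difference between the two variants can be illustrated with a small sketch. It is a simplified illustration, not the TAPAS implementation: the function name position_ids and the toy tokens are made up for this example, and it assumes that question tokens are always counted from 0 while cell tokens either restart at 0 in every cell (reset) or keep counting (no_reset).

```python
from typing import List


def position_ids(question: List[str], cells: List[List[str]], reset: bool) -> List[int]:
    """Return one position index per token of the flattened input.

    `question` holds the question tokens (including [CLS] and [SEP]);
    `cells` holds the tokens of each table cell in reading order.
    """
    ids = list(range(len(question)))  # question tokens are counted from 0
    next_absolute = len(question)
    for cell in cells:
        for offset in range(len(cell)):
            # reset: restart at 0 in every cell; no_reset: keep counting
            ids.append(offset if reset else next_absolute + offset)
        next_absolute += len(cell)
    return ids


question = ["[CLS]", "who", "won", "?", "[SEP]"]
cells = [["name"], ["year"], ["roger", "federer"], ["2017"]]

print(position_ids(question, cells, reset=True))   # [0, 1, 2, 3, 4, 0, 0, 0, 1, 0]
print(position_ids(question, cells, reset=False))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With reset, a token's position only says where it sits inside its own cell, so the index no longer grows with the size of the table.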

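Disclaimer: The team releasing TAPAS did not write a model card for this model, so this model card has been written by the Hugging Face team and contributors.

Results on SQA - Dev Accuracy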

Size	Reset	Dev Accuracy	Link
LARGE	noreset	0.7223	1236321
LARGE	reset	0.7289	1237321
BASE	noreset	0.6737	1238321
BASE	reset	0.6874	1239321
MEDIUM	noreset	0.6464	12310321
MEDIUM	reset	0.6561	12311321
SMALL	noreset	0.5876	12312321
SMALL	reset	0.6155	12313321
MINI	noreset	0.4574	12314321
MINI	reset	0.5148	12315321
TINY	noreset	0.2004	12316321
TINY	reset	0.2375	12317321

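Model description

TAPAS is a BERT-like transformer model pre-trained on a large corpus of English data from Wikipedia in a self-supervised fashion. This means it was pre-trained on the raw tables and associated texts only, with no human labelling of any kind (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. More precisely, it was pre-trained with two objectives: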

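Masked Language Modeling (MLM): taking a (flattened) table and associated context, the model randomly masks 15% of the words in the input, then runs the entire (partially masked) sequence through the model. The model then has to predict the masked words. This differs from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of a table and associated text.
Intermediate pre-training: to encourage numerical reasoning on tables, the authors additionally pre-trained the model on a balanced dataset of millions of syntactically created training examples. Here, the model must predict (classify) whether a sentence is supported or refuted by the contents of a table. The training examples are created based on synthetic as well as counterfactual statements.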
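
The masking step of the MLM objective above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the helper mask_tokens and its seed parameter are hypothetical, it works on whitespace tokens rather than WordPiece tokens, and it always substitutes [MASK] for a selected token.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: List[str], rate: float = 0.15, seed: int = 0
) -> Tuple[List[str], List[Optional[str]]]:
    """Mask roughly `rate` of the tokens and return (masked input, labels).

    A label is the original token at a masked position and None elsewhere,
    so a loss would only be computed where something was hidden.
    """
    rng = random.Random(seed)
    # Mask at least one token, but never more tokens than exist.
    n_mask = min(len(tokens), max(1, round(len(tokens) * rate)))
    picked = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in picked else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in picked else None for i, tok in enumerate(tokens)]
    return masked, labels


# Toy flattened table plus context, already split into 20 tokens.
tokens = (
    "player country titles year roger federer switzerland 20 2018 "
    "rafael nadal spain 22 2022 list of grand slam winners in tennis"
).split()[:20]

masked, labels = mask_tokens(tokens)
print(masked)
print([label for label in labels if label is not None])
```

Because the model sees the tokens on both sides of every masked position, it can use the header, the neighbouring cells, and the surrounding text at once when predicting the hidden token.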

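This way, the model learns an inner representation of the English language used in tables and associated texts, which can then be used to extract features useful for downstream tasks such as answering questions about a table, or determining whether a sentence is entailed or refuted by the contents of a table. Fine-tuning is done by adding a cell selection head on top of the pre-trained model, and then jointly training this randomly initialized classification head with the base model on SQA.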
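
To make the idea of a cell selection head more concrete, here is a rough sketch of how per-token scores could be turned into a set of selected cells. The decoding rule is an assumption made for illustration (sigmoid on each token logit, average per cell, keep cells above 0.5), and select_cells, the logits, and the cell coordinates are all made up; the actual head and its decoding are described in the original paper.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column)


def select_cells(
    token_logits: List[float], token_cells: List[Cell], threshold: float = 0.5
) -> List[Cell]:
    """Pick the cells whose average token probability exceeds `threshold`.

    `token_logits[i]` is the score of table token i and `token_cells[i]`
    is the (row, column) of the cell that token belongs to.
    """
    sums: Dict[Cell, float] = defaultdict(float)
    counts: Dict[Cell, int] = defaultdict(int)
    for logit, cell in zip(token_logits, token_cells):
        prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
        sums[cell] += prob
        counts[cell] += 1
    return sorted(cell for cell in sums if sums[cell] / counts[cell] > threshold)


# Six table tokens spread over a 2x2 table (hypothetical scores).
logits = [3.0, 2.0, -4.0, -1.0, 0.5, -3.0]
cells = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 0), (1, 1)]

print(select_cells(logits, cells))  # [(0, 0)]
```

Averaging over the tokens of a cell means a multi-token cell is selected as a whole, rather than token by token.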

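Intended uses & limitations

You can use this model for answering questions related to a table in a conversational set-up.

For code examples, we refer to the documentation of TAPAS on the HuggingFace website.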

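Training procedure

Preprocessing

The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The inputs of the model are then of the form: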

[CLS] Question [SEP] Flattened table [SEP]
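
As a rough illustration of this input format, the sketch below builds such a string from a question and a small table. It is a simplification under stated assumptions: the helpers flatten_table and build_input are hypothetical, the table is assumed to be flattened header first and then row by row, and WordPiece tokenization is left out (only lowercasing is applied).

```python
from typing import List


def flatten_table(header: List[str], rows: List[List[str]]) -> List[str]:
    """Return all cells in reading order: header first, then row by row."""
    cells = list(header)
    for row in rows:
        cells.extend(row)
    return cells


def build_input(question: str, header: List[str], rows: List[List[str]]) -> str:
    """Assemble '[CLS] question [SEP] flattened table [SEP]' in lowercase."""
    parts = ["[CLS]", question.lower(), "[SEP]"]
    parts += [cell.lower() for cell in flatten_table(header, rows)]
    parts.append("[SEP]")
    return " ".join(parts)


header = ["Name", "Year"]
rows = [["Roger Federer", "2017"], ["Rafael Nadal", "2018"]]

print(build_input("Who won in 2017?", header, rows))
# [CLS] who won in 2017? [SEP] name year roger federer 2017 rafael nadal 2018 [SEP]
```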

Fine-tuning

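The model was fine-tuned on 32 Cloud TPU v3 cores for 200,000 steps with a maximum sequence length of 512 and a batch size of 128. In this setup, fine-tuning takes around 20 hours. The optimizer used is Adam with a learning rate of 1.25e-5 and a warmup ratio of 0.2. An inductive bias is added such that the model only selects cells of the same column. This is reflected by the select_one_column parameter of TapasConfig. See also table 12 of the original paper.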
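
To see what these numbers imply, the sketch below turns the step count, learning rate, and warmup ratio into a per-step learning rate. Only the three constants come from this card; the shape of the schedule (linear warmup to the peak, then linear decay to zero) is an assumption made for illustration, and learning_rate is a hypothetical helper.

```python
TOTAL_STEPS = 200_000   # fine-tuning steps stated above
WARMUP_RATIO = 0.2      # fraction of steps spent warming up
PEAK_LR = 1.25e-5       # learning rate stated above


def learning_rate(step: int) -> float:
    """Learning rate at `step`, assuming linear warmup then linear decay."""
    warmup_steps = round(TOTAL_STEPS * WARMUP_RATIO)  # 40,000 steps
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - warmup_steps)


for step in (0, 20_000, 40_000, 120_000, 200_000):
    print(step, learning_rate(step))
```

Under this assumption the first 40,000 steps ramp the learning rate up, and the remaining 160,000 steps bring it back down.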

BibTeX entry and citation info

@misc{herzig2020tapas,
      title={TAPAS: Weakly Supervised Table Parsing via Pre-training}, 
      author={Jonathan Herzig and Paweł Krzysztof Nowak and Thomas Müller and Francesco Piccinno and Julian Martin Eisenschlos},
      year={2020},
      eprint={2004.02349},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

@misc{eisenschlos2020understanding,
      title={Understanding tables with intermediate pre-training}, 
      author={Julian Martin Eisenschlos and Syrine Krichene and Thomas Müller},
      year={2020},
      eprint={2010.00571},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@InProceedings{iyyer2017search-based,
author = {Iyyer, Mohit and Yih, Scott Wen-tau and Chang, Ming-Wei},
title = {Search-based Neural Structured Learning for Sequential Question Answering},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},
year = {2017},
month = {July},
abstract = {Recent work in semantic parsing for question answering has focused on long and complicated questions, many of which would seem unnatural if asked in a normal conversation between two humans. In an effort to explore a conversational QA setting, we present a more realistic task: answering sequences of simple but inter-related questions. We collect a dataset of 6,066 question sequences that inquire about semi-structured tables from Wikipedia, with 17,553 question-answer pairs in total. To solve this sequential question answering task, we propose a novel dynamic neural semantic parsing framework trained using a weakly supervised reward-guided search. Our model effectively leverages the sequential context to outperform state-of-the-art QA systems that are designed to answer highly complex questions.},
publisher = {Association for Computational Linguistics},
url = {https://www.microsoft.com/en-us/research/publication/search-based-neural-structured-learning-sequential-question-answering/},
}

Author:

Google AI

Dataset size:

320.22 MB