TAPAS small model fine-tuned on Sequential Question Answering (SQA)

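This model is available in two versions. The default version corresponds to the tapas_sqa_inter_masklm_small_reset checkpoint of the original GitHub repository. It was pre-trained on MLM, followed by an additional step that the authors call intermediate pre-training, and then fine-tuned on SQA. It uses relative position embeddings (i.e. the position index is reset at every cell of the table).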

The other (non-default) version that can be used is:

no_reset, which corresponds to tapas_sqa_inter_masklm_small (intermediate pre-training, absolute position embeddings).
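
The difference between the two versions can be sketched in plain Python. This is an illustration only, not the TAPAS implementation: the function name `position_ids` and the toy token-to-cell layout are assumptions made for this example. With `reset=True` the position index restarts at 0 in every table cell; with `reset=False` it keeps counting over the whole sequence.

```python
from typing import List, Optional


def position_ids(cell_ids: List[Optional[int]], reset: bool) -> List[int]:
    """Return one position index per token.

    cell_ids[i] is the id of the table cell that token i belongs to,
    or None for tokens outside the table (question, special tokens).
    This layout is a hypothetical simplification for illustration.
    """
    if not reset:
        # Absolute positions: simply count tokens from the start.
        return list(range(len(cell_ids)))

    positions = []
    previous_cell = None
    offset = 0
    for i, cell in enumerate(cell_ids):
        if cell is None:
            # Tokens outside the table keep their absolute position.
            positions.append(i)
            previous_cell = None
            continue
        if cell != previous_cell:
            # A new cell starts: reset the position index.
            offset = 0
        positions.append(offset)
        offset += 1
        previous_cell = cell
    return positions


# Toy sequence: 3 question tokens, then cells 0, 0, 1, 2, 2, 2.
cells = [None, None, None, 0, 0, 1, 2, 2, 2]
absolute = position_ids(cells, reset=False)
relative = position_ids(cells, reset=True)
print(absolute)
print(relative)
```

In the reset variant, the position index alone no longer tells the model where a cell sits in the sequence; only the offset within the cell is encoded.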

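Disclaimer: The team releasing TAPAS did not write a model card for this model, so this model card has been written by the Hugging Face team and contributors.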

Results on SQA - Dev Accuracy

Size	Reset	Dev Accuracy
LARGE	noreset	0.7223
LARGE	reset	0.7289
BASE	noreset	0.6737
BASE	reset	0.6874
MEDIUM	noreset	0.6464
MEDIUM	reset	0.6561
SMALL	noreset	0.5876
SMALL	reset	0.6155
MINI	noreset	0.4574
MINI	reset	0.5148
TINY	noreset	0.2004
TINY	reset	0.2375

Model description

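TAPAS is a BERT-like Transformer model pre-trained in a self-supervised fashion on a large corpus of English data from Wikipedia. This means it was pre-trained on raw tables and their associated text only, with no human labelling of any kind (which is why it can use large amounts of publicly available data); inputs and labels were generated by an automatic process. More precisely, it was pre-trained with two objectives: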

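Masked language modeling (MLM): given a (flattened) table and its associated context, 15% of the words in the input are randomly masked, and the entire (partially masked) sequence is then run through the model. The model has to predict the masked words. This differs from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models such as GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of a table and its associated text.
Intermediate pre-training: to encourage numerical reasoning on tables, the authors additionally pre-trained the model on a balanced dataset of millions of syntactically created training examples. Here, the model must predict (classify) whether a sentence is supported or refuted by the contents of a table. The training examples are created from synthetic as well as counterfactual statements.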
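
The masking step of the MLM objective can be sketched as follows. This is a simplified illustration, not the pre-training code: the helper `mask_tokens`, the whitespace-level toy tokens, and the choice to mask exactly 15% of positions (rounded) with a fixed seed are all assumptions made for this example.

```python
import random
from typing import Dict, List, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: List[str], mask_rate: float = 0.15, seed: int = 0
) -> Tuple[List[str], Dict[int, str]]:
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence and a {position: original token} dict,
    which plays the role of the prediction targets.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    chosen = rng.sample(range(len(tokens)), n_mask)

    masked = list(tokens)
    labels = {}
    for i in chosen:
        labels[i] = tokens[i]  # what the model must predict
        masked[i] = MASK       # what the model actually sees
    return masked, labels


# Question tokens followed by a flattened toy table (20 tokens in total).
tokens = [
    "which", "city", "is", "the", "capital", "of", "france", "?",
    "country", "capital", "france", "paris", "germany", "berlin",
    "spain", "madrid", "italy", "rome", "japan", "tokyo",
]
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

Because the model sees the unmasked tokens on both sides of each `[MASK]`, it can use question and table context in either direction when predicting the original token.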

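In this way, the model learns an inner representation of the English language as used in tables and associated text. This representation can then be used to extract features for downstream tasks such as answering questions about a table, or determining whether a sentence is entailed or refuted by the contents of a table. Fine-tuning is done by adding a cell selection head on top of the pre-trained model and then training this randomly initialized classification head jointly with the base model on SQA.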

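Intended uses & limitations

You can use this model to answer questions about a table in a conversational set-up.

For code examples, see the TAPAS documentation on the Hugging Face website.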

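Training procedure

Preprocessing

The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The inputs of the model are then of the form: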

[CLS] Question [SEP] Flattened table [SEP]
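
A minimal sketch of how such an input string could be assembled is shown below. It is an illustration under stated assumptions: the helper `build_input` is hypothetical, the table is flattened row by row with the header first, and cells are joined with spaces instead of being split into WordPiece tokens.

```python
from typing import List


def build_input(question: str, header: List[str], rows: List[List[str]]) -> str:
    """Build a '[CLS] question [SEP] flattened table [SEP]' string.

    Assumption: the table is flattened row by row, header first.
    Real preprocessing would additionally apply WordPiece tokenization.
    """
    cells = list(header)
    for row in rows:
        cells.extend(row)
    flattened = " ".join(cell.lower() for cell in cells)
    return f"[CLS] {question.lower()} [SEP] {flattened} [SEP]"


example = build_input(
    "Which city is the capital of France?",
    ["Country", "Capital"],
    [["France", "Paris"], ["Germany", "Berlin"]],
)
print(example)
```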

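Fine-tuning

The model was fine-tuned on 32 Cloud TPU v3 cores for 200,000 steps with a maximum sequence length of 512 and a batch size of 128. In this setup, fine-tuning takes around 20 hours. The optimizer used is Adam with a learning rate of 1.25e-5 and a warmup ratio of 0.2. An inductive bias is added so that the model only selects cells from the same column. This is reflected by the select_one_column parameter of TapasConfig. See also table 12 of the original paper.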
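
The arithmetic behind these hyperparameters can be made concrete with a short sketch. The step count, batch size, learning rate, and warmup ratio come from the text above; the shape of the schedule (linear warmup followed by linear decay to zero) is an assumption made for this example and is not stated in this model card.

```python
TOTAL_STEPS = 200_000
BATCH_SIZE = 128
PEAK_LR = 1.25e-5
WARMUP_RATIO = 0.2

# 20% of the 200,000 steps are spent warming up.
WARMUP_STEPS = round(TOTAL_STEPS * WARMUP_RATIO)

# Total number of training examples processed during fine-tuning.
EXAMPLES_SEEN = TOTAL_STEPS * BATCH_SIZE


def learning_rate(step: int) -> float:
    """Learning rate at a given step (assumed linear warmup and decay)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


print(WARMUP_STEPS, EXAMPLES_SEEN)
for step in (0, 20_000, 40_000, 120_000, 200_000):
    print(step, learning_rate(step))
```

Under these assumptions the learning rate reaches its peak of 1.25e-5 at step 40,000 and falls back to zero at step 200,000.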

BibTeX entry and citation info

@misc{herzig2020tapas,
      title={TAPAS: Weakly Supervised Table Parsing via Pre-training}, 
      author={Jonathan Herzig and Paweł Krzysztof Nowak and Thomas Müller and Francesco Piccinno and Julian Martin Eisenschlos},
      year={2020},
      eprint={2004.02349},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

@misc{eisenschlos2020understanding,
      title={Understanding tables with intermediate pre-training}, 
      author={Julian Martin Eisenschlos and Syrine Krichene and Thomas Müller},
      year={2020},
      eprint={2010.00571},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@InProceedings{iyyer2017search-based,
author = {Iyyer, Mohit and Yih, Scott Wen-tau and Chang, Ming-Wei},
title = {Search-based Neural Structured Learning for Sequential Question Answering},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},
year = {2017},
month = {July},
abstract = {Recent work in semantic parsing for question answering has focused on long and complicated questions, many of which would seem unnatural if asked in a normal conversation between two humans. In an effort to explore a conversational QA setting, we present a more realistic task: answering sequences of simple but inter-related questions. We collect a dataset of 6,066 question sequences that inquire about semi-structured tables from Wikipedia, with 17,553 question-answer pairs in total. To solve this sequential question answering task, we propose a novel dynamic neural semantic parsing framework trained using a weakly supervised reward-guided search. Our model effectively leverages the sequential context to outperform state-of-the-art QA systems that are designed to answer highly complex questions.},
publisher = {Association for Computational Linguistics},
url = {https://www.microsoft.com/en-us/research/publication/search-based-neural-structured-learning-sequential-question-answering/},
}

Author:

Google AI

Dataset size:

223.92 MB