T5 base fine-tuned on MP-DocVQA

This is T5 base fine-tuned on the Multipage DocVQA (MP-DocVQA) dataset.

This model was used as a baseline in Hierarchical multimodal transformers for Multi-Page DocVQA.

Results on the MP-DocVQA dataset are reported in Table 2.
Training hyperparameters can be found in Table 8 of Appendix D.

How to use

Here is how to use this model in PyTorch to answer a question about a given text context:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the fine-tuned T5 tokenizer and seq2seq model
tokenizer = T5Tokenizer.from_pretrained("rubentito/t5-base-mpdocvqa")
model = T5ForConditionalGeneration.from_pretrained("rubentito/t5-base-mpdocvqa")

context = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."
question = "What has Huggingface done?"
# Question and context are packed into a single text prompt
input_text = "question: {:s}  context: {:s}".format(question, context)

encoding = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**encoding)  # tensor of generated token ids
answer = tokenizer.decode(output[0], skip_special_tokens=True)
print(answer)
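MP-DocVQA asks questions over documents that span several pages, but the snippet above takes a single context string. One simple way to bridge the two is to concatenate the OCR text of all pages into one context before building the prompt. This approach is an assumption and is not taken from the paper. The helper `build_t5_input` below is hypothetical. Only the `question: ... context: ...` template comes from the snippet above.

```python
from typing import List


def build_t5_input(question: str, page_texts: List[str], separator: str = " ") -> str:
    """Hypothetical helper: join per-page OCR text into one T5 prompt.

    Empty or whitespace-only pages are skipped; pages are kept in order.
    """
    context = separator.join(t.strip() for t in page_texts if t.strip())
    # Same prompt template as the usage snippet (note the two spaces).
    return "question: {:s}  context: {:s}".format(question, context)


if __name__ == "__main__":
    pages = ["Invoice no. 42.", "   ", "Total due: 100 EUR."]
    print(build_t5_input("What is the total due?", pages))
```

Long documents will exceed what the encoder can take in. Pass `truncation=True` and a `max_length` to the tokenizer call, e.g. `tokenizer(input_text, return_tensors="pt", truncation=True, max_length=...)`. Check the sequence length against the training hyperparameters in Table 8 of Appendix D rather than guessing it.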

Metrics

Average Normalized Levenshtein Similarity (ANLS)

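This is the standard evaluation metric for text-based VQA tasks (ST-VQA and DocVQA). It evaluates the model's reasoning capability while smoothly penalizing OCR recognition errors. See Scene Text Visual Question Answering for details.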
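Here is a rough illustration of how ANLS is computed. For each question i, the prediction o_i is compared with every ground-truth answer a_ij using the normalized Levenshtein distance NL, which is the edit distance divided by the length of the longer string. The per-answer score is 1 - NL if NL < τ and 0 otherwise, with threshold τ = 0.5. The best score over the ground truths is kept, and ANLS is the mean over all N questions. The sketch below follows that definition. The lowercasing and whitespace stripping are assumptions about preprocessing, not necessarily what the official evaluation script does.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


def anls(predictions: Sequence[str], ground_truths: Sequence[List[str]],
         tau: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity to any answer."""
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for gt in answers:
            # Assumed normalization: strip whitespace and lowercase.
            nl = normalized_levenshtein(pred.strip().lower(), gt.strip().lower())
            score = 1.0 - nl if nl < tau else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

The threshold τ is what keeps ANLS tolerant of small OCR-style errors: a prediction one character off from a four-character answer (NL = 0.25) still earns 0.75. An unrelated string scores 0 rather than getting partial credit.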

Answer Page Prediction Accuracy (APPA)

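In the MP-DocVQA task, models can also output the index of the page that contains the information required to answer the question. This subtask is evaluated with accuracy, i.e., whether the predicted page is the correct one. See Hierarchical multimodal transformers for Multi-Page DocVQA for details.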
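Below is a minimal sketch of APPA as plain accuracy over predicted page indices. It assumes one predicted and one ground-truth page index per question. Returning a percentage is an assumption made to match the scale of the APPA column in the results table. The function name `appa` is illustrative.

```python
from typing import Sequence


def appa(pred_pages: Sequence[int], gt_pages: Sequence[int]) -> float:
    """Percentage of questions whose predicted answer page is correct."""
    if len(pred_pages) != len(gt_pages):
        raise ValueError("need exactly one prediction per question")
    if not gt_pages:
        return 0.0
    correct = sum(int(p == g) for p, g in zip(pred_pages, gt_pages))
    return 100.0 * correct / len(gt_pages)
```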

Model results

Extended experimental results for this model are reported in Table 2 of Hierarchical multimodal transformers for Multi-Page DocVQA. You can also check the live leaderboard at the RRC Portal.

Model	HF name	Parameters	ANLS	APPA (%)
BERT large	rubentito/bert-large-mpdocvqa	334M	0.4183	51.6177
Longformer base	rubentito/longformer-base-mpdocvqa	148M	0.5287	71.1696
BigBird ITC base	rubentito/bigbird-base-itc-mpdocvqa	131M	0.4929	67.5433
LayoutLMv3 base	rubentito/layoutlmv3-base-mpdocvqa	125M	0.4538	51.9426
T5 base	rubentito/t5-base-mpdocvqa	223M	0.5050	0.0000
Hi-VT5	rubentito/hivt5-base-mpdocvqa	316M	0.6201	79.23

Citation information

@article{tito2022hierarchical,
  title={Hierarchical multimodal transformers for Multi-Page DocVQA},
  author={Tito, Rub{\`e}n and Karatzas, Dimosthenis and Valveny, Ernest},
  journal={arXiv preprint arXiv:2212.05935},
  year={2022}
}

Author:

Rubèn Tito

Dataset size:

851.19 MB