MQDD - Multimodal Question Duplicity Detection

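This repository publishes trained models and other supporting materials for the paper MQDD – Pre-training of Multimodal Question Duplicity Detection for Software Engineering Domain. For more information, see the paper. The Stack Overflow Dataset (SOD) and the Stack Overflow Duplicity Dataset (SODD) presented in the paper are available from our Stack Overflow Dataset repository.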

To get only the pre-trained model, see UWB-AIR/MQDD-pretrained.

Fine-tuned MQDD

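We release a version of our MQDD model fine-tuned for the duplicate detection task. The model follows the two-tower architecture shown in the figure below: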

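The standalone encoder, without the duplicate detection head, can be loaded with the following source code snippet. Such a model can be used, for example, to build a search system based on the Faiss library: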

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("UWB-AIR/MQDD-duplicates")
model = AutoModel.from_pretrained("UWB-AIR/MQDD-duplicates")
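To illustrate the search use case, here is a minimal sketch of how encoder outputs might be turned into fixed-size vectors and compared. It rests on several assumptions that do not come from the paper or this repository: mean pooling over token vectors is only one possible pooling strategy, the function names `mean_pool`, `l2_normalize`, and `search` are hypothetical, and plain NumPy stands in for Faiss. The search step is an exact inner-product search over unit-length vectors, which is the kind of lookup a flat inner-product Faiss index performs.

```python
import numpy as np


def mean_pool(hidden, mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden: array of shape (batch, seq_len, dim), e.g. the encoder's last hidden state
    mask:   array of shape (batch, seq_len), 1 for real tokens and 0 for padding
    NOTE: mean pooling is an assumption made for this sketch.
    """
    mask = mask[..., None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def l2_normalize(x):
    """Scale each row to unit length so that inner product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def search(index_vecs, query_vecs, k):
    """Return (scores, ids) of the k most similar indexed vectors for each query."""
    scores = l2_normalize(query_vecs) @ l2_normalize(index_vecs).T
    top = np.argsort(-scores, axis=1)[:, :k]
    return np.take_along_axis(scores, top, axis=1), top
```

In a real system, `hidden` and `mask` would come from the tokenizer and encoder loaded above, and the brute-force `search` would be replaced by a Faiss index once the collection grows large.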

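Checkpoints of the full two-tower model are available from our GoogleDrive folder. To load the model, use the model implementation in models/MQDD_model.py from our GitHub repository, together with the following source code, which constructs the model and loads its checkpoint: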

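import torch

from MQDD_model import ClsHeadModelMQDD

# Build the two-tower model on top of the published encoder
model = ClsHeadModelMQDD("UWB-AIR/MQDD-duplicates")
# Load the checkpoint onto the CPU and restore the trained weights
ckpt = torch.load("model.pt", map_location="cpu")
model.load_state_dict(ckpt["model_state"])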

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

http://creativecommons.org/licenses/by-nc-sa/4.0/

How should I cite MQDD?

For now, please cite the arXiv paper:

@misc{https://doi.org/10.48550/arxiv.2203.14093,
  doi = {10.48550/ARXIV.2203.14093},
  url = {https://arxiv.org/abs/2203.14093},
  author = {Pašek, Jan and Sido, Jakub and Konopík, Miloslav and Pražák, Ondřej},
  title = {MQDD -- Pre-training of Multimodal Question Duplicity Detection for Software Engineering Domain},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}

Authors:

Artificial Intelligence Research Group at the University of West Bohemia in Pilsen

Dataset size:

559.39 MB