Model:
BeIR/query-gen-msmarco-t5-large-v1
This model is the t5-base model from docTTTTTquery.
The T5-base model was trained on the MS MARCO Passage Dataset, which consists of about 500,000 real search queries from Bing together with their relevant passages.
The model can be used for query generation, in order to learn semantic search models without requiring annotated training data: Synthetic Query Generation.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
model = T5ForConditionalGeneration.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')

para = "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."

# Encode the passage and sample several candidate queries for it
input_ids = tokenizer.encode(para, return_tensors='pt')
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=3)

print("Paragraph:")
print(para)

print("\nGenerated Queries:")
for i in range(len(outputs)):
    query = tokenizer.decode(outputs[i], skip_special_tokens=True)
    print(f'{i + 1}: {query}')
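To illustrate the synthetic-query workflow mentioned above, the sketch below pairs each generated query with its source passage and trains a small bi-encoder on those pairs. It is a minimal sketch, assuming the sentence-transformers library is installed; the base model 'distilbert-base-uncased', the batch size, and the tiny in-memory passage list are illustrative choices, not part of this model card.

from torch.utils.data import DataLoader
from transformers import T5Tokenizer, T5ForConditionalGeneration
from sentence_transformers import SentenceTransformer, InputExample, losses

# Query generator (the model described in this card)
tokenizer = T5Tokenizer.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
gen_model = T5ForConditionalGeneration.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')

# Illustrative corpus; in practice this would be your unlabeled passage collection
passages = [
    "Python is an interpreted, high-level and general-purpose programming language.",
    "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris.",
]

# Build synthetic (query, passage) pairs: every sampled query becomes a positive example
train_examples = []
for passage in passages:
    input_ids = tokenizer.encode(passage, return_tensors='pt')
    outputs = gen_model.generate(
        input_ids=input_ids,
        max_length=64,
        do_sample=True,
        top_p=0.95,
        num_return_sequences=3)
    for output in outputs:
        query = tokenizer.decode(output, skip_special_tokens=True)
        train_examples.append(InputExample(texts=[query, passage]))

# Train a bi-encoder on the synthetic pairs; MultipleNegativesRankingLoss uses
# in-batch negatives, so no annotated relevance judgments are required
bi_encoder = SentenceTransformer('distilbert-base-uncased')  # assumed base model, swap as needed
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)
train_loss = losses.MultipleNegativesRankingLoss(bi_encoder)
bi_encoder.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, show_progress_bar=True)

After training, the bi-encoder can embed queries and passages into the same vector space for semantic search over the original corpus.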