Dataset:

Cohere/miracl-zh-queries-22-12

Tasks:

Text Retrieval

Sub-tasks:

document-retrieval

Languages:

Chinese

Multilinguality:

multilingual

Annotations Creators:

expert-generated

License:

apache-2.0


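MIRACL (zh) embedded with the cohere.ai multilingual-22-12 encoder

We encoded the MIRACL dataset using the cohere.ai multilingual-22-12 embedding model.

The query embeddings can be found in Cohere/miracl-zh-queries-22-12 and the corpus embeddings in Cohere/miracl-zh-corpus-22-12.

For the original datasets, see miracl/miracl and miracl/miracl-corpus.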

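Dataset info:

MIRACL 🌍🙌🌏 (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages, which together have more than three billion native speakers.

The corpus for each language is prepared from a Wikipedia dump. We keep only the plain text and discard images, tables, etc. Each article is segmented into multiple passages using WikiExtractor, based on natural discourse units (e.g., \n\n in the wiki markup). Each of these passages is a "document", or unit of retrieval. We preserve the Wikipedia article title for each passage.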
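The segmentation step described above can be pictured with a small sketch. This is not WikiExtractor itself: it assumes the article is already plain text, splits on blank lines, and attaches the title to each passage. The function name, the `article_id` argument, and the `article_id#index` docid format are hypothetical choices made for illustration.

```python
def split_into_passages(article_id: str, title: str, article_text: str) -> list[dict]:
    """Split plain article text on blank lines and keep the title with each passage."""
    passages = []
    for chunk in article_text.split("\n\n"):
        text = chunk.strip()
        if not text:
            continue  # skip empty chunks produced by runs of blank lines
        passages.append({
            "docid": f"{article_id}#{len(passages)}",  # hypothetical docid format
            "title": title,
            "text": text,
        })
    return passages


passages = split_into_passages("7", "Beijing", "First paragraph.\n\nSecond paragraph.")
for p in passages:
    print(p["docid"], p["title"], p["text"])
```

Each returned dict then plays the role of one retrieval unit, with the same `docid`, `title`, and `text` fields that the corpus records expose.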

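Embeddings

We compute the embeddings for title+" "+text using our multilingual-22-12 embedding model, a state-of-the-art model for semantic search in 100 languages. If you want to learn more about this model, have a look at cohere.ai multilingual embedding model.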
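As a minimal sketch of the input format described above, the string that gets embedded for each passage is the title, a single space, and the passage text. The helper name `build_embedding_input` and the sample records are made up for illustration.

```python
def build_embedding_input(doc: dict) -> str:
    """Concatenate title and text exactly as title + " " + text."""
    return doc["title"] + " " + doc["text"]


# Made-up sample records with the same fields as the corpus
sample_docs = [
    {"title": "Beijing", "text": "Beijing is the capital of China."},
    {"title": "Shanghai", "text": "Shanghai is a city on the Yangtze estuary."},
]
inputs = [build_embedding_input(d) for d in sample_docs]
print(inputs[0])
```

A list like `inputs` is what would be passed as the texts to embed if you wanted to encode additional passages in the same way.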

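Loading the dataset

In miracl-zh-corpus-22-12, we provide the corpus embeddings. Note that, depending on the selected split, the respective files can be quite large.

You can load the dataset like this: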

from datasets import load_dataset
docs = load_dataset("Cohere/miracl-zh-corpus-22-12", split="train")

Or you can stream it without downloading it first:

from datasets import load_dataset
docs = load_dataset("Cohere/miracl-zh-corpus-22-12", split="train", streaming=True)

for doc in docs:
    docid = doc['docid']
    title = doc['title']
    text = doc['text']
    emb = doc['emb']
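When streaming, it is often enough to look at the first few records before committing to a full download. The sketch below takes any iterable of records with `docid` and `emb` fields (such as the streamed `docs` above) and collects only the first `n`. The helper name `take_embeddings` and the fake record generator are illustrative assumptions.

```python
from itertools import islice


def take_embeddings(docs, n):
    """Collect docids and embeddings from the first n records of an iterable."""
    docids, embeddings = [], []
    for doc in islice(docs, n):  # stops after n records without reading the rest
        docids.append(doc["docid"])
        embeddings.append(doc["emb"])
    return docids, embeddings


def fake_stream():
    """Stand-in for a streamed dataset: yields made-up records lazily."""
    for i in range(1000):
        yield {"docid": f"doc{i}", "title": "t", "text": "x", "emb": [float(i), 1.0]}


docids, embeddings = take_embeddings(fake_stream(), 5)
print(docids)
```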

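Search

Have a look at miracl-zh-queries-22-12, where we provide the query embeddings for the MIRACL dataset.

To search in the documents, you must use dot-product.

Then compare this query embedding against the document embeddings, either with a vector database (recommended) or by computing the dot product directly.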
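To see why the choice of similarity measure matters, here is a toy illustration on made-up 2-dimensional vectors. It is a general property of the two measures, not a statement about how the multilingual-22-12 vectors are scaled: when vector lengths differ, dot product and cosine similarity can rank documents differently.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 0.0]
doc_a = [3.0, 3.0]  # long vector, 45 degrees away from the query
doc_b = [1.0, 0.1]  # short vector, almost aligned with the query

print("dot:   ", dot(query, doc_a), dot(query, doc_b))
print("cosine:", cosine(query, doc_a), cosine(query, doc_b))
```

Here the dot product prefers `doc_a` while cosine similarity prefers `doc_b`, so swapping one measure for the other can change the ranking.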

A full search example:

# Attention! For large datasets, this requires a lot of memory to store
# all document embeddings and to compute the dot product scores.
# Only use this for smaller datasets. For large datasets, use a vector DB

from datasets import load_dataset
import torch

# Load documents + embeddings
docs = load_dataset("Cohere/miracl-zh-corpus-22-12", split="train")
doc_embeddings = torch.tensor(docs['emb'])

# Load queries
queries = load_dataset("Cohere/miracl-zh-queries-22-12", split="dev")

# Select the first query as example
qid = 0
query = queries[qid]
# Embed only the selected query; unsqueeze to shape (1, dim) for torch.mm
query_embedding = torch.tensor(query['emb']).unsqueeze(0)

# Compute dot score between query embedding and document embeddings
dot_scores = torch.mm(query_embedding, doc_embeddings.transpose(0, 1))
top_k = torch.topk(dot_scores, k=3)

# Print results
print("Query:", query['query'])
for doc_id in top_k.indices[0].tolist():
    print(docs[doc_id]['title'])
    print(docs[doc_id]['text'])
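The example above keeps every document embedding in memory. As a hedged alternative for larger corpora, the sketch below scores documents one at a time and keeps only the best `k` in a heap, so it also works on a streamed dataset. It is pure Python for clarity and therefore slow, and the function name `top_k_streaming` is made up. A vector database remains the recommended option.

```python
import heapq


def top_k_streaming(query_embedding, docs, k=3):
    """Return the k (score, docid) pairs with the highest dot product, best first."""
    heap = []  # min-heap holding at most k (score, -index, docid) entries
    for i, doc in enumerate(docs):
        score = sum(q * d for q, d in zip(query_embedding, doc["emb"]))
        entry = (score, -i, doc["docid"])  # -i makes earlier documents win ties
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heappushpop(heap, entry)  # replace the current weakest entry
    return [(score, docid) for score, _, docid in sorted(heap, reverse=True)]


# Made-up toy corpus with the same field names as the dataset
toy_docs = [
    {"docid": "a", "emb": [1.0, 0.0]},
    {"docid": "b", "emb": [0.0, 1.0]},
    {"docid": "c", "emb": [2.0, 2.0]},
    {"docid": "d", "emb": [0.5, 0.5]},
]
print(top_k_streaming([1.0, 1.0], toy_docs, k=2))
```

Memory use stays proportional to `k` rather than to the corpus size, at the cost of a full pass over the documents for every query.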

You can get embeddings for new queries using our API:

# Run: pip install cohere
import os
import cohere

# Assumption: your Cohere API key is stored in the COHERE_API_KEY environment variable
co = cohere.Client(os.environ["COHERE_API_KEY"])
texts = ['my search query']
response = co.embed(texts=texts, model='multilingual-22-12')
query_embedding = response.embeddings[0]  # Get the embedding for the first text

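Performance

In the following table we compare the cohere multilingual-22-12 model with Elasticsearch version 8.6.0 lexical search (title and passage indexed as independent fields). Note that Elasticsearch doesn't support all languages that are part of the MIRACL dataset.

We compute nDCG@10 (a ranking-based metric) as well as hit@3: is at least one relevant document among the top-3 results? We find that hit@3 is easier to interpret, as it reflects the share of queries for which a relevant document is found among the top-3 results.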
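For readers who want to reproduce these two metrics, here is a minimal per-query sketch. The card does not spell out the exact formula used, so this assumes binary relevance and the common log2 discount for nDCG. The function names and docids are made up. Averaging the per-query values over all queries and multiplying by 100 gives numbers on the same scale as the tables.

```python
import math


def hit_at_k(ranked_docids, relevant_docids, k=3):
    """1.0 if at least one relevant document is in the top k, else 0.0."""
    return 1.0 if any(d in relevant_docids for d in ranked_docids[:k]) else 0.0


def ndcg_at_k(ranked_docids, relevant_docids, k=10):
    """Binary-relevance nDCG with a log2 rank discount (assumed formulation)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, d in enumerate(ranked_docids[:k])
        if d in relevant_docids
    )
    ideal_hits = min(len(relevant_docids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["d3", "d1", "d9"]  # system ranking for one query
relevant = {"d1"}            # annotated relevant documents for that query
print(hit_at_k(ranked, relevant), ndcg_at_k(ranked, relevant))
```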

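Note: MIRACL only annotated a small fraction of passages (10 per query) for relevancy. Especially for larger Wikipedias (like English), we often found many more relevant passages. This is known as annotation holes. Real nDCG@10 and hit@3 performance is likely higher than shown.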

Model	cohere multilingual-22-12 nDCG@10	cohere multilingual-22-12 hit@3	ES 8.6.0 nDCG@10	ES 8.6.0 hit@3
miracl-ar	64.2	75.2	46.8	56.2
miracl-bn	61.5	75.7	49.2	60.1
miracl-de	44.4	60.7	19.6	29.8
miracl-en	44.6	62.2	30.2	43.2
miracl-es	47.0	74.1	27.0	47.2
miracl-fi	63.7	76.2	51.4	61.6
miracl-fr	46.8	57.1	17.0	21.6
miracl-hi	50.7	62.9	41.0	48.9
miracl-id	44.8	63.8	39.2	54.7
miracl-ru	49.2	66.9	25.4	36.7
Avg	51.7	67.5	34.7	46.0

Further languages (not supported by Elasticsearch):

Model	cohere multilingual-22-12 nDCG@10	cohere multilingual-22-12 hit@3
miracl-fa	44.8	53.6
miracl-ja	49.0	61.0
miracl-ko	50.9	64.8
miracl-sw	61.4	74.5
miracl-te	67.8	72.3
miracl-th	60.2	71.9
miracl-yo	56.4	62.2
miracl-zh	43.8	56.5
Avg	54.3	64.6

Author:

Cohere

Dataset size:

15.62 MB