Dataset:

wiki_dpr

English

Dataset Card for "wiki_dpr"

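Dataset Summary

This is the Wikipedia dataset used to evaluate Dense Passage Retrieval (DPR) models. It contains 21 million passages from Wikipedia along with their DPR embeddings. Each Wikipedia article was split into multiple non-overlapping text blocks of 100 words, and each block serves as one passage.

The Wikipedia dump is the one from December 20, 2018.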
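
The 100-word, non-overlapping split described above can be illustrated with a minimal sketch. It assumes plain whitespace tokenization and a shorter final block; the actual DPR preprocessing may tokenize and handle the last block differently, and the function name `split_into_passages` is hypothetical.

```python
from typing import List


def split_into_passages(text: str, words_per_passage: int = 100) -> List[str]:
    """Split text into consecutive, non-overlapping passages.

    Assumption: words are whitespace-delimited. The last passage may be
    shorter than `words_per_passage`.
    """
    if words_per_passage <= 0:
        raise ValueError("words_per_passage must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


# Toy article of 250 placeholder words (not real Wikipedia text).
article = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(article)
print([len(p.split()) for p in passages])  # [100, 100, 50]
```

Because the blocks do not overlap, joining the passages with single spaces reproduces the whitespace-normalized article.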

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

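Dataset Structure

Data Instances

Each instance contains a passage of up to 100 words, the title of the Wikipedia page it comes from, and the passage's DPR embedding (a 768-dimensional vector).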
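
As a rough illustration of how such instances might be used for retrieval, the sketch below ranks passages by the inner product between a query vector and each passage embedding, which is the similarity used in the DPR paper. Everything here is a toy assumption: the vectors are 3-dimensional instead of 768-dimensional, the records are made up, and `rank_passages` is a hypothetical helper rather than part of any library.

```python
from typing import Any, Dict, List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def rank_passages(
    query_embedding: Sequence[float],
    instances: List[Dict[str, Any]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return (id, score) pairs for the top_k highest-scoring passages."""
    scored = [
        (inst["id"], inner_product(query_embedding, inst["embeddings"]))
        for inst in instances
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy records with the same field names as the dataset (values are invented).
toy_instances = [
    {"id": "a", "title": "Toy A", "text": "toy passage", "embeddings": [1.0, 0.0, 0.0]},
    {"id": "b", "title": "Toy B", "text": "toy passage", "embeddings": [0.0, 1.0, 0.0]},
    {"id": "c", "title": "Toy C", "text": "toy passage", "embeddings": [0.5, 0.5, 0.0]},
]
query = [0.9, 0.1, 0.0]  # hypothetical query embedding
print([pid for pid, _ in rank_passages(query, toy_instances)])  # ['a', 'c']
```

A linear scan like this is only practical for small collections; at 21 million passages an approximate or exact vector index would normally be used instead.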

psgs_w100.multiset.compressed
  • Size of downloaded dataset files: 70.97 GB
  • Size of the generated dataset: 78.42 GB
  • Total amount of disk used: 152.26 GB

An example instance looks as follows.

This example was too long and was cropped:

{'id': '1',
 'text': 'Aaron Aaron ( or ; "Ahärôn") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother\'s spokesman ("prophet") to the Pharaoh. Part of the Law (Torah) that Moses received from',
 'title': 'Aaron',
 'embeddings': [-0.07233893871307373,
   0.48035329580307007,
   0.18650995194911957,
   -0.5287084579467773,
   -0.37329429388046265,
   0.37622880935668945,
   0.25524479150772095,
   ...
   -0.336689829826355,
   0.6313082575798035,
   -0.7025573253631592]}
psgs_w100.multiset.exact
  • Size of downloaded dataset files: 70.97 GB
  • Size of the generated dataset: 78.42 GB
  • Total amount of disk used: 187.38 GB

An example instance looks as follows.

This example was too long and was cropped:

{'id': '1',
 'text': 'Aaron Aaron ( or ; "Ahärôn") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother\'s spokesman ("prophet") to the Pharaoh. Part of the Law (Torah) that Moses received from',
 'title': 'Aaron',
 'embeddings': [-0.07233893871307373,
   0.48035329580307007,
   0.18650995194911957,
   -0.5287084579467773,
   -0.37329429388046265,
   0.37622880935668945,
   0.25524479150772095,
   ...
   -0.336689829826355,
   0.6313082575798035,
   -0.7025573253631592]}
psgs_w100.multiset.no_index
  • Size of downloaded dataset files: 70.97 GB
  • Size of the generated dataset: 78.42 GB
  • Total amount of disk used: 149.38 GB

An example instance looks as follows.

This example was too long and was cropped:

{'id': '1',
 'text': 'Aaron Aaron ( or ; "Ahärôn") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother\'s spokesman ("prophet") to the Pharaoh. Part of the Law (Torah) that Moses received from',
 'title': 'Aaron',
 'embeddings': [-0.07233893871307373,
   0.48035329580307007,
   0.18650995194911957,
   -0.5287084579467773,
   -0.37329429388046265,
   0.37622880935668945,
   0.25524479150772095,
   ...
   -0.336689829826355,
   0.6313082575798035,
   -0.7025573253631592]}
psgs_w100.nq.compressed
  • Size of downloaded dataset files: 70.97 GB
  • Size of the generated dataset: 78.42 GB
  • Total amount of disk used: 152.26 GB

An example instance looks as follows.

This example was too long and was cropped:

{'id': '1',
 'text': 'Aaron Aaron ( or ; "Ahärôn") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother\'s spokesman ("prophet") to the Pharaoh. Part of the Law (Torah) that Moses received from',
 'title': 'Aaron',
 'embeddings': [0.013342111371457577,
   0.582173764705658,
   -0.31309744715690613,
   -0.6991612911224365,
   -0.5583199858665466,
   0.5187504887580872,
   0.7152731418609619,
   ...
   -0.5385938286781311,
   0.8093984127044678,
   -0.4741983711719513]}
psgs_w100.nq.exact
  • Size of downloaded dataset files: 70.97 GB
  • Size of the generated dataset: 78.42 GB
  • Total amount of disk used: 187.38 GB

An example instance looks as follows.

This example was too long and was cropped:

{'id': '1',
 'text': 'Aaron Aaron ( or ; "Ahärôn") is a prophet, high priest, and the brother of Moses in the Abrahamic religions. Knowledge of Aaron, along with his brother Moses, comes exclusively from religious texts, such as the Bible and Quran. The Hebrew Bible relates that, unlike Moses, who grew up in the Egyptian royal court, Aaron and his elder sister Miriam remained with their kinsmen in the eastern border-land of Egypt (Goshen). When Moses first confronted the Egyptian king about the Israelites, Aaron served as his brother\'s spokesman ("prophet") to the Pharaoh. Part of the Law (Torah) that Moses received from',
 'title': 'Aaron',
 'embeddings': [0.013342111371457577,
   0.582173764705658,
   -0.31309744715690613,
   -0.6991612911224365,
   -0.5583199858665466,
   0.5187504887580872,
   0.7152731418609619,
   ...
   -0.5385938286781311,
   0.8093984127044678,
   -0.4741983711719513]}

Data Fields

The data fields are the same among all splits.

psgs_w100.multiset.compressed
  • id: a string feature.
  • text: a string feature.
  • title: a string feature.
  • embeddings: a list of float32 features.
psgs_w100.multiset.exact
  • id: a string feature.
  • text: a string feature.
  • title: a string feature.
  • embeddings: a list of float32 features.
psgs_w100.multiset.no_index
  • id: a string feature.
  • text: a string feature.
  • title: a string feature.
  • embeddings: a list of float32 features.
psgs_w100.nq.compressed
  • id: a string feature.
  • text: a string feature.
  • title: a string feature.
  • embeddings: a list of float32 features.
psgs_w100.nq.exact
  • id: a string feature.
  • text: a string feature.
  • title: a string feature.
  • embeddings: a list of float32 features.
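
The field layout above can be checked on a loaded record with a small sketch like the following. It is an illustration only: `validate_instance` is a hypothetical helper, the 768 default comes from the Data Instances section, and the check accepts ordinary Python floats because float32 values typically become Python floats once converted to a list.

```python
from typing import Any, Dict, List

# Scalar fields and their expected Python types, as listed in the card.
EXPECTED_FIELDS = {"id": str, "text": str, "title": str}


def validate_instance(instance: Dict[str, Any], embedding_dim: int = 768) -> List[str]:
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(instance.get(name), expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    embeddings = instance.get("embeddings")
    if not isinstance(embeddings, list) or not all(
        isinstance(value, float) for value in embeddings
    ):
        problems.append("embeddings: expected a list of floats")
    elif len(embeddings) != embedding_dim:
        problems.append(
            f"embeddings: expected {embedding_dim} values, got {len(embeddings)}"
        )
    return problems


# Invented records for demonstration; only the field names follow the card.
good = {"id": "1", "text": "toy passage", "title": "Aaron", "embeddings": [0.0] * 768}
bad = {"id": 1, "text": "toy passage", "title": "Aaron", "embeddings": [0.0] * 3}
print(validate_instance(good))  # []
print(validate_instance(bad))
```

Note that `id` is a string in this dataset, so a record carrying an integer id is reported as a mismatch.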

Data Splits

name                           train
psgs_w100.multiset.compressed  21015300
psgs_w100.multiset.exact       21015300
psgs_w100.multiset.no_index    21015300
psgs_w100.nq.compressed        21015300
psgs_w100.nq.exact             21015300
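
As a back-of-the-envelope check on the sizes reported earlier, the row count above and the 768-dimensional float32 embeddings give a lower bound on the raw embedding storage. This is only an estimate: it ignores the text, titles, ids, and any file-format overhead or compression.

```python
NUM_PASSAGES = 21_015_300   # rows in the train split, from the table above
EMBEDDING_DIM = 768         # embedding size, from the Data Instances section
BYTES_PER_FLOAT32 = 4

raw_bytes = NUM_PASSAGES * EMBEDDING_DIM * BYTES_PER_FLOAT32
print(raw_bytes)                       # 64559001600
print(f"{raw_bytes / 1e9:.2f} GB")     # 64.56 GB
print(f"{raw_bytes / 2**30:.2f} GiB")  # 60.13 GiB
```

The embeddings alone therefore account for most of the 78.42 GB reported as the generated dataset size.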

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information

@misc{karpukhin2020dense,
    title={Dense Passage Retrieval for Open-Domain Question Answering},
    author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih},
    year={2020},
    eprint={2004.04906},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Contributions

Thanks to @thomwolf, @lewtun, and @lhoestq for adding this dataset.