Dataset: miracl/miracl-corpus
Multilinguality: multilingual
Annotations creators: expert-generated
ArXiv: arxiv:2210.09984
License: apache-2.0

Dataset Card for MIRACL Corpus

MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual retrieval dataset that focuses on search across 18 different languages, which collectively encompass over three billion native speakers around the world.

This dataset contains the corpora for the 16 "known languages"; the corpora for the remaining 2 "surprise languages" will be released at a later date.

The corpus for each language is prepared from a Wikipedia dump, where we keep only the plain text and discard images, tables, etc. Each article is segmented into multiple passages using WikiExtractor based on natural discourse units (e.g., \n\n in the wiki markup); the idea is sketched below. Each of these passages constitutes a "document", i.e., the unit of retrieval. We preserve the Wikipedia article title of each passage.
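
For intuition only, here is a minimal sketch of the segmentation step. This is not the actual WikiExtractor pipeline; the function name and the plain-text input are illustrative assumptions.

# Illustrative sketch only -- the real corpus was produced with WikiExtractor.
def segment_article(article_id, title, plain_text):
    """Split an article's plain text into passages on blank lines and
    assign sequential docids of the form X#Y."""
    passages = [p.strip() for p in plain_text.split('\n\n') if p.strip()]
    return [
        {'docid': f'{article_id}#{i}', 'title': title, 'text': passage}
        for i, passage in enumerate(passages)
    ]

# A two-paragraph article yields two retrieval units, e.g. 39#0 and 39#1.
units = segment_article('39', 'Albedo', 'First paragraph...\n\nSecond paragraph...')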

Dataset Structure

Each retrieval unit contains three fields: docid, title, and text. Consider an example from the English corpus:

{
    "docid": "39#0",
    "title": "Albedo", 
    "text": "Albedo (meaning 'whiteness') is the measure of the diffuse reflection of solar radiation out of the total solar radiation received by an astronomical body (e.g. a planet like Earth). It is dimensionless and measured on a scale from 0 (corresponding to a black body that absorbs all incident radiation) to 1 (corresponding to a body that reflects all incident radiation)."
}

The docid has the schema X#Y, where all passages sharing the same X come from the same Wikipedia article, and Y numbers the passages within that article sequentially. The text field contains the text of the passage, and the title field contains the name of the article the passage comes from.
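
As a quick illustration of this schema, the article id and the passage index can be recovered by splitting on '#'; parse_docid below is a small assumed helper, not part of the dataset tooling.

def parse_docid(docid):
    """Split a docid of the form 'X#Y' into (article_id, passage_index)."""
    article_id, passage_index = docid.split('#')
    return article_id, int(passage_index)

assert parse_docid('39#0') == ('39', 0)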

The collection can be loaded with the Hugging Face datasets library:

import datasets

lang = 'ar'  # or any of the 16 languages
miracl_corpus = datasets.load_dataset('miracl/miracl-corpus', lang)['train']
for doc in miracl_corpus:
    docid = doc['docid']  # passage identifier with the X#Y schema
    title = doc['title']  # title of the source Wikipedia article
    text = doc['text']    # passage text
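
For the larger corpora (English alone has over 32 million passages), it may be preferable to stream the split instead of downloading it in full. The sketch below uses the standard streaming mode of the datasets library; adjust it to your own setup.

import datasets

# Iterate over the English corpus without materializing the whole split first.
miracl_en = datasets.load_dataset('miracl/miracl-corpus', 'en', streaming=True)['train']
for doc in miracl_en:
    print(doc['docid'], doc['title'])
    break  # peek at the first passage only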

Dataset Statistics

The following table lists the number of passages and Wikipedia articles in the collection for each language.

Language          # of Passages   # of Articles
Arabic (ar)           2,061,414         656,982
Bengali (bn)            297,265          63,762
English (en)         32,893,221       5,758,285
Spanish (es)         10,373,953       1,669,181
Persian (fa)          2,207,172         857,827
Finnish (fi)          1,883,509         447,815
French (fr)          14,636,953       2,325,608
Hindi (hi)              506,264         148,107
Indonesian (id)       1,446,315         446,330
Japanese (ja)         6,953,614       1,133,444
Korean (ko)           1,486,752         437,373
Russian (ru)          9,543,918       1,476,045
Swahili (sw)            131,924          47,793
Telugu (te)             518,079          66,353
Thai (th)               542,166         128,179
Chinese (zh)          4,934,368       1,246,389