Dataset: joelito/EU_Wikipedias
Multilinguality: multilingual
Size: 10M<n<100M
Language creators: found
Annotation creators: other
Source datasets: original
License: cc-by-4.0

Dataset Card for EU_Wikipedias: A dataset of Wikipedias in the EU languages

Dataset Summary

A Wikipedia dataset containing cleaned articles in the 24 EU languages. It is built from the Wikipedia dumps ( https://dumps.wikimedia.org/ ), with one split per language. Each example contains the full text of one Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.).

Supported Tasks and Leaderboards

The dataset supports the fill-mask task.
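As a minimal sketch of how an article's text could be turned into a fill-mask example (the whole-word masking scheme and the `[MASK]` token here are illustrative assumptions, not tied to any particular model):

```python
import random


def make_fill_mask_example(text, mask_token="[MASK]", seed=0):
    """Replace one randomly chosen word with the mask token.

    Returns the masked text and the original word (the label).
    """
    words = text.split()
    rng = random.Random(seed)
    idx = rng.randrange(len(words))
    label = words[idx]
    masked = words[:idx] + [mask_token] + words[idx + 1:]
    return " ".join(masked), label


masked_text, label = make_fill_mask_example("Brussels is the capital of Belgium")
```

In practice a tokenizer-level masking strategy (e.g. masking subword tokens) would be used instead; this only shows the shape of the task.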

Languages

The following languages are supported: bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv

Dataset Structure

The data files are organized as {date}/{language}_{shard}.jsonl.xz. At the moment, only the date '20221120' is supported.

Use the dataset like this:

```python
from datasets import load_dataset

dataset = load_dataset('joelito/EU_Wikipedias', date="20221120", language="de", split='train', streaming=True)
```

Data Instances

The file format is jsonl.xz and there is one split available (train).

| Source | Size (MB) | Words | Documents | Words/Document |
|---|---:|---:|---:|---:|
| 20221120.all | 86034 | 9506846949 | 26481379 | 359 |
| 20221120.bg | 1261 | 88138772 | 285876 | 308 |
| 20221120.cs | 1904 | 189580185 | 513851 | 368 |
| 20221120.da | 679 | 74546410 | 286864 | 259 |
| 20221120.de | 11761 | 1191919523 | 2740891 | 434 |
| 20221120.el | 1531 | 103504078 | 215046 | 481 |
| 20221120.en | 26685 | 3192209334 | 6575634 | 485 |
| 20221120.es | 6636 | 801322400 | 1583597 | 506 |
| 20221120.et | 538 | 48618507 | 231609 | 209 |
| 20221120.fi | 1391 | 115779646 | 542134 | 213 |
| 20221120.fr | 9703 | 1140823165 | 2472002 | 461 |
| 20221120.ga | 72 | 8025297 | 57808 | 138 |
| 20221120.hr | 555 | 58853753 | 198746 | 296 |
| 20221120.hu | 1855 | 167732810 | 515777 | 325 |
| 20221120.it | 5999 | 687745355 | 1782242 | 385 |
| 20221120.lt | 409 | 37572513 | 203233 | 184 |
| 20221120.lv | 269 | 25091547 | 116740 | 214 |
| 20221120.mt | 29 | 2867779 | 5030 | 570 |
| 20221120.nl | 3208 | 355031186 | 2107071 | 168 |
| 20221120.pl | 3608 | 349900622 | 1543442 | 226 |
| 20221120.pt | 3315 | 389786026 | 1095808 | 355 |
| 20221120.ro | 1017 | 111455336 | 434935 | 256 |
| 20221120.sk | 506 | 49612232 | 238439 | 208 |
| 20221120.sl | 543 | 58858041 | 178472 | 329 |
| 20221120.sv | 2560 | 257872432 | 2556132 | 100 |
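Since each shard is an xz-compressed JSON Lines file, it can also be read directly with the Python standard library; a minimal sketch (the field names inside each JSON object are not specified by this card, so the rows are treated as opaque dicts):

```python
import json
import lzma


def read_jsonl_xz(path):
    """Yield one JSON object per line from an xz-compressed JSON Lines file."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

For normal use, the `load_dataset` call shown above is the recommended path; this is only useful for inspecting raw shards.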

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

This dataset has been created by downloading the Wikipedias for the 24 EU languages using olm/wikipedia. For more information about the creation of the dataset, please refer to prepare_wikipedias.py.
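As a rough sketch of that process (the `olm/wikipedia` call pattern with `language` and `date` parameters is an assumption about that loader's interface; prepare_wikipedias.py remains the authoritative reference):

```python
# The 24 EU languages covered by this dataset.
EU_LANGUAGES = ["bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr",
                "ga", "hr", "hu", "it", "lt", "lv", "mt", "nl", "pl", "pt",
                "ro", "sk", "sl", "sv"]


def download_all(date: str = "20221120"):
    # Assumed call pattern for the olm/wikipedia loader; requires the
    # `datasets` library and network access, so it is not executed here.
    from datasets import load_dataset
    return {lang: load_dataset("olm/wikipedia", language=lang, date=date)
            for lang in EU_LANGUAGES}
```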

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

TODO add citation

Contributions

Thanks to @JoelNiklaus for adding this dataset.