Dataset:
EleutherAI/wikitext_document_level
Preprint:
arxiv:1609.07843
This is a modified version of https://huggingface.co/datasets/wikitext that returns whole Wiki pages instead of Wiki text line by line. The original readme is reproduced below.
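To make the difference between line-level and page-level records concrete, here is a minimal sketch of how line-by-line records could be regrouped into pages. It is an illustration only, not necessarily how this dataset was built. It assumes that a top-level article title is a line of the form " = Title = " and that section headings use repeated signs such as " = = Section = = "; the function name `group_lines_into_pages` is hypothetical.

```python
import re
from typing import Iterable, List

# Assumption: article titles look like " = Title = ", while section
# headings look like " = = Section = = " (more than one "=" per side).
_TITLE_RE = re.compile(r"^= [^=].* =$")


def group_lines_into_pages(lines: Iterable[str]) -> List[str]:
    """Concatenate line-level records into one string per article."""
    pages: List[List[str]] = []
    for line in lines:
        if _TITLE_RE.match(line.strip()):
            # A top-level title starts a new page.
            pages.append([line])
        elif pages:
            # Any other line belongs to the page currently being built.
            pages[-1].append(line)
        elif line.strip():
            # Non-blank text before the first title is kept as its own page.
            pages.append([line])
    return ["".join(page) for page in pages]
```

Blank lines that appear before the first title are dropped; everything else is preserved verbatim inside its page.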
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.
Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers, all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long-term dependencies.
wikitext-103-raw-v1
An example of 'validation' looks as follows.
This example was too long and was cropped: { "text": "\" The gold dollar or gold one @-@ dollar piece was a coin struck as a regular issue by the United States Bureau of the Mint from..." }
wikitext-103-v1
An example of 'train' looks as follows.
This example was too long and was cropped: { "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..." }
wikitext-2-raw-v1
An example of 'train' looks as follows.
This example was too long and was cropped: { "text": "\" The Sinclair Scientific Programmable was introduced in 1975 , with the same case as the Sinclair Oxford . It was larger than t..." }
wikitext-2-v1
An example of 'train' looks as follows.
This example was too long and was cropped: { "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..." }
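The examples above contain joiner tokens such as "@-@" (as in "gold one @-@ dollar"). WikiText uses the same convention for commas and periods inside numbers ("@,@" and "@.@"). The following is a small sketch of how these could be collapsed for display purposes; the helper name `join_split_tokens` is hypothetical, and the sketch deliberately leaves all other tokenization spacing (for example around commas and parentheses) untouched.

```python
import re

# Matches " @-@ ", " @,@ " and " @.@ " including the surrounding spaces.
_JOINER_RE = re.compile(r" @([-,.])@ ")


def join_split_tokens(text: str) -> str:
    """Replace WikiText joiner tokens with the plain character they stand for."""
    return _JOINER_RE.sub(r"\1", text)


print(join_split_tokens("The gold dollar or gold one @-@ dollar piece"))
```

For language-model evaluation the text is normally left exactly as distributed, so a transformation like this is better suited to inspection than to preprocessing.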
The data fields are the same among all splits.
name | train | validation | test |
---|---|---|---|
wikitext-103-raw-v1 | 1801350 | 3760 | 4358 |
wikitext-103-v1 | 1801350 | 3760 | 4358 |
wikitext-2-raw-v1 | 36718 | 3760 | 4358 |
wikitext-2-v1 | 36718 | 3760 | 4358 |
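As a usage illustration, the configuration and split names in the table can be passed to the Hugging Face `datasets` library. The sketch below assumes that `datasets` is installed and that this repository exposes the same configuration names as listed above; the wrapper `load_wikitext_split` and the helper `validate_request` are hypothetical names. Column names are not assumed here, so inspect the returned object rather than relying on a particular field.

```python
from typing import Any, Tuple

REPO_ID = "EleutherAI/wikitext_document_level"
CONFIG_NAMES = (
    "wikitext-103-raw-v1",
    "wikitext-103-v1",
    "wikitext-2-raw-v1",
    "wikitext-2-v1",
)
SPLIT_NAMES = ("train", "validation", "test")


def validate_request(config: str, split: str) -> Tuple[str, str]:
    """Check the names against the table before any download starts."""
    if config not in CONFIG_NAMES:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIG_NAMES}")
    if split not in SPLIT_NAMES:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLIT_NAMES}")
    return config, split


def load_wikitext_split(config: str, split: str) -> Any:
    """Load one split of one configuration (requires network access)."""
    config, split = validate_request(config, split)
    # Imported lazily so the validation helper works without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)
```

A call such as `load_wikitext_split("wikitext-2-raw-v1", "validation")` would then return the requested split, and printing the result shows its actual columns and row count.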
The dataset is available under the Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0).
@misc{merity2016pointer,
  title={Pointer Sentinel Mixture Models},
  author={Stephen Merity and Caiming Xiong and James Bradbury and Richard Socher},
  year={2016},
  eprint={1609.07843},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Thanks to @thomwolf, @lewtun, @patrickvonplaten, @mariamabarham for adding this dataset.