Dataset: wikitext_tl39

Dataset Card for WikiText-TL-39

Dataset Summary

A large-scale, unlabeled text dataset with 39 million tokens in the training set. It is inspired by the original WikiText Long Term Dependency dataset (Merity et al., 2016). "TL" stands for "Tagalog." The dataset was published in Cruz & Cheng (2019).

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Filipino/Tagalog

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

  • text (str)

The dataset is plain text and has only one field ("text"), since it was compiled for language modeling.
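As a minimal sketch of what the single-field layout implies for downstream code, the snippet below treats each record as a dictionary with one string under "text". The sample rows are made-up placeholders rather than real dataset content, the helper names are hypothetical, and whitespace splitting is only an assumed stand-in for whatever tokenization the paper uses.

```python
from typing import Dict, Iterable, Iterator


def iter_texts(records: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield the stripped, non-empty contents of each record's "text" field."""
    for record in records:
        text = record["text"].strip()
        if text:
            yield text


def count_whitespace_tokens(records: Iterable[Dict[str, str]]) -> int:
    """Count tokens by splitting on whitespace (an assumed, simplistic tokenizer)."""
    return sum(len(text.split()) for text in iter_texts(records))


# Made-up placeholder rows that only mimic the {"text": str} shape.
sample_records = [
    {"text": "Ang Pilipinas ay isang bansa ."},
    {"text": "   "},  # blank rows are skipped
    {"text": "Maynila ang kabisera nito ."},
]

texts = list(iter_texts(sample_records))
total_tokens = count_whitespace_tokens(sample_records)
print(len(texts), total_tokens)
```

Because there are no labels or metadata columns, any loader that exposes "text" as a string can feed these helpers directly.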

Data Splits

Split   Documents   Tokens
Train   120,975     39M
Valid   25,919      8M
Test    25,921      8M

Please see the paper (Cruz & Cheng, 2019) for more details on the dataset splits.
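To make the relative split sizes easier to read, the following sketch computes each split's share of documents from the counts listed above. The dictionary layout and variable names are illustrative assumptions; only the document and token figures come from this card.

```python
# Document and token counts as listed in this card (tokens in millions).
splits = {
    "train": {"documents": 120_975, "tokens_millions": 39},
    "valid": {"documents": 25_919, "tokens_millions": 8},
    "test": {"documents": 25_921, "tokens_millions": 8},
}

total_documents = sum(s["documents"] for s in splits.values())

# Fraction of all documents that falls in each split.
document_share = {
    name: s["documents"] / total_documents for name, s in splits.items()
}

for name, share in document_share.items():
    print(f"{name}: {share:.1%} of documents")
```

By document count this works out to roughly a 70/15/15 split across train, validation, and test.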

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Tagalog Wikipedia

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

Thanks to @jcblaisecruz02 for adding this dataset.