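Dataset:
fujiki/wiki40b_ja
Language:
ja
License:
cc-by-sa-4.0

This dataset is a reformatted version of the Japanese portion of the wiki40b dataset. When using this dataset, please cite the original paper: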
@inproceedings{guo-etal-2020-wiki,
    title = "{W}iki-40{B}: Multilingual Language Model Dataset",
    author = "Guo, Mandy and
      Dai, Zihang and
      Vrande{\v{c}}i{\'c}, Denny and
      Al-Rfou, Rami",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.297",
    pages = "2440--2452",
    abstract = "We propose a new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families. With around 40 billion characters, we hope this new resource will accelerate the research of multilingual modeling. We train monolingual causal language models using a state-of-the-art model (Transformer-XL) establishing baselines for many languages. We also introduce the task of multilingual causal language modeling where we train our model on the combined text of 40+ languages from Wikipedia with different vocabulary sizes and evaluate on the languages individually. We released the cleaned-up text of 40+ Wikipedia language editions, the corresponding trained monolingual language models, and several multilingual language models with different fixed vocabulary sizes.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}