Dataset:
snow_simplified_japanese_corpus
SNOW T15: A simplified corpus for the Japanese language, consisting of 50,000 manually simplified and aligned sentences. The corpus contains the original sentences, their simplified versions, and English translations of the original sentences. It can be used for automatic text simplification as well as for translating simple Japanese into English and vice versa. The core vocabulary is restricted to 2,000 words, selected by accounting for several factors such as meaning preservation, variation, simplicity, and the UniDic word segmentation criterion. For details, refer to the explanation page on Japanese simplification ( http://www.jnlp.org/research/Japanese_simplification ). The original texts are from "small_parallel_enja: 50k En/Ja Parallel Corpus for Testing SMT Methods", a bilingual corpus for machine translation.
SNOW T23: An expansion corpus of 35,000 sentences rewritten in easy Japanese (simple Japanese vocabulary) based on SNOW T15. The original texts are from "Tanaka Corpus" ( http://www.edrdg.org/wiki/index.php/Tanaka_Corpus ).
It can be used for automatic text simplification in Japanese as well as for translating simple Japanese into English and vice versa.
Japanese, simplified Japanese, and English.
SNOW T15 is an xlsx file with the columns ID, "#日本語(原文)" (Japanese (original)), "#やさしい日本語" (simplified Japanese), and "#英語(原文)" (English (original)). SNOW T23 is an xlsx file with the columns ID, "#日本語(原文)" (Japanese (original)), "#やさしい日本語" (simplified Japanese), "#英語(原文)" (English (original)), and "#固有名詞" (proper noun).
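As a minimal sketch of working with these columns, the snippet below renames the Japanese column headers to English field names. The English keys (`original_ja`, `simplified_ja`, and so on) and the helper `normalize_record` are hypothetical names chosen for illustration. The sketch assumes each spreadsheet row has already been read into a dictionary keyed by its header, and the header strings should be checked against the actual files.

```python
# Hypothetical mapping from the xlsx column headers to English field names.
COLUMN_MAP = {
    "ID": "id",
    "#日本語(原文)": "original_ja",
    "#やさしい日本語": "simplified_ja",
    "#英語(原文)": "original_en",
    "#固有名詞": "proper_noun",  # listed for SNOW T23 only
}


def normalize_record(row):
    """Rename known headers; columns absent from the row are skipped."""
    return {new: row[old] for old, new in COLUMN_MAP.items() if old in row}


# Illustrative row (not taken from the corpus).
example_row = {
    "ID": 1,
    "#日本語(原文)": "原文の例",
    "#やさしい日本語": "やさしい文の例",
    "#英語(原文)": "An example sentence.",
}
record = normalize_record(example_row)
print(record)
```

Because missing columns are skipped, the same helper can be applied to rows from either file: a SNOW T15 row simply yields no `proper_noun` key.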
The data is not split.
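Since no official split is provided, users who need train/validation/test partitions have to create their own. The following is a minimal sketch of one reproducible way to do so; the 80/10/10 proportions, the seed, and the function name `split_indices` are assumptions made for illustration, not part of the dataset.

```python
import random


def split_indices(n_examples, valid_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle indices 0..n_examples-1 and cut them into three disjoint lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = round(n_examples * test_fraction)
    n_valid = round(n_examples * valid_fraction)
    test = indices[:n_test]
    valid = indices[n_test:n_test + n_valid]
    train = indices[n_test + n_valid:]
    return train, valid, test


# 50,000 is the stated size of SNOW T15.
train_idx, valid_idx, test_idx = split_indices(50_000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Splitting by index rather than by copying rows makes it easy to store the split alongside the data and to reuse it across experiments.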
A dataset for the study of automatic conversion to simplified Japanese (Japanese simplification).
SNOW T15: The original texts are from "small_parallel_enja: 50k En/Ja Parallel Corpus for Testing SMT Methods", which is a bilingual corpus for machine translation.
SNOW T23: The original texts are from "Tanaka Corpus" ( http://www.edrdg.org/wiki/index.php/Tanaka_Corpus ).
[N/A]
SNOW T15: Five students in the laboratory manually rewrote the original Japanese sentences into simplified Japanese. The core vocabulary is restricted to 2,000 words, selected by accounting for several factors such as meaning preservation, variation, simplicity, and the UniDic word segmentation criterion.
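To illustrate what a core-vocabulary restriction means in practice, here is a minimal sketch that lists the tokens of a sentence falling outside a given vocabulary. It assumes the sentence has already been segmented into words (for example with a UniDic-based analyzer) and uses a toy vocabulary; the actual 2,000-word list is not reproduced here, and the function name `out_of_core_tokens` is hypothetical.

```python
def out_of_core_tokens(tokens, core_vocab):
    """Return the tokens, in order, that are not in the core vocabulary."""
    return [tok for tok in tokens if tok not in core_vocab]


# Toy vocabulary and pre-segmented sentence, for illustration only.
toy_core_vocab = {"私", "は", "本", "を", "読む"}
toy_tokens = ["私", "は", "小説", "を", "読む"]
print(out_of_core_tokens(toy_tokens, toy_core_vocab))
```

A check of this kind could flag words that a rewriter would need to paraphrase. How the annotators actually enforced the restriction is not described here.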
SNOW T23: Seven people, recruited through crowdsourcing, rewrote all the sentences manually. Each worker rewrote 5,000 sentences, of which 100 sentences were shared across all workers. The average sentence length was kept as nearly equal as possible across workers so that the amount of work did not vary among them.
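The figures for SNOW T23 can be checked with simple arithmetic. The sketch below assumes the reading that the 100 shared sentences are counted within each worker's 5,000; the variable names are illustrative.

```python
N_WORKERS = 7
SENTENCES_PER_WORKER = 5_000
SHARED_SENTENCES = 100  # rewritten by every worker

# Total number of rewrites produced by all workers together.
total_rewrites = N_WORKERS * SENTENCES_PER_WORKER
# Rewrites of sentences that only one worker handled.
worker_specific = N_WORKERS * (SENTENCES_PER_WORKER - SHARED_SENTENCES)
# Distinct source sentences: worker-specific ones plus the shared set.
distinct_sources = worker_specific + SHARED_SENTENCES

print(total_rewrites, worker_specific, distinct_sources)
```

Under this reading, the 35,000 total rewrites match the stated corpus size, while the number of distinct source sentences is smaller because the shared set is rewritten seven times.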
Five students for SNOW T15, seven crowd workers for SNOW T23.
The datasets are part of SNOW, a collection of Japanese language resources and tools created by the Natural Language Processing Laboratory, Nagaoka University of Technology, Japan.
CC BY 4.0
@inproceedings{maruyama-yamamoto-2018-simplified,
    title = "Simplified Corpus with Core Vocabulary",
    author = "Maruyama, Takumi and Yamamoto, Kazuhide",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://www.aclweb.org/anthology/L18-1185",
}

@inproceedings{yamamoto-2017-simplified-japanese,
    title = "やさしい日本語対訳コーパスの構築",
    author = "山本 和英 and 丸山 拓海 and 角張 竜晴 and 稲岡 夢人 and 小川 耀一朗 and 勝田 哲弘 and 髙橋 寛治",
    booktitle = "言語処理学会第23回年次大会",
    month = mar,
    year = "2017",
    address = "茨城, 日本",
    publisher = "言語処理学会",
    url = "https://www.anlp.jp/proceedings/annual_meeting/2017/pdf_dir/B5-1.pdf",
}

@inproceedings{katsuta-yamamoto-2018-crowdsourced,
    title = "Crowdsourced Corpus of Sentence Simplification with Core Vocabulary",
    author = "Katsuta, Akihiro and Yamamoto, Kazuhide",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://www.aclweb.org/anthology/L18-1072",
}
Thanks to @forest1988 and @lhoestq for adding this dataset.