Dataset: alt
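The ALT project aims to advance the state of the art in Asian natural language processing (NLP) through open collaboration on developing and using ALT. It was first described by NICT and UCSY (see Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch, and Eiichiro Sumita (2016)). According to the project web page, it was subsequently developed under ASEAN IVO.
Construction of ALT began by sampling about 20,000 sentences from English Wikinews, which were then translated into the other languages.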
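Machine translation, dependency parsing
13 languages are supported, identified in the data by the codes bg, en, fil, hi, id, ja, khm, lo, ms, my, th, vi, and zh.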
ALT Parallel Corpus
{ "SNT.URLID": "80188", "SNT.URLID.SNTID": "1", "url": "http://en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal", "bg": "[translated sentence]", "en": "[translated sentence]", "en_tok": "[translated sentence]", "fil": "[translated sentence]", "hi": "[translated sentence]", "id": "[translated sentence]", "ja": "[translated sentence]", "khm": "[translated sentence]", "lo": "[translated sentence]", "ms": "[translated sentence]", "my": "[translated sentence]", "th": "[translated sentence]", "vi": "[translated sentence]", "zh": "[translated sentence]" }
ALT Treebank
{ "SNT.URLID": "80188", "SNT.URLID.SNTID": "1", "url": "http://en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal", "status": "draft/reviewed", "value": "(S (S (BASENP (NNP Italy)) (VP (VBP have) (VP (VP (VP (VBN defeated) (BASENP (NNP Portugal))) (ADVP (RB 31-5))) (PP (IN in) (NP (BASENP (NNP Pool) (NNP C)) (PP (IN of) (NP (BASENP (DT the) (NN 2007) (NNP Rugby) (NNP World) (NNP Cup)) (PP (IN at) (NP (BASENP (NNP Parc) (FW des) (NNP Princes)) (COMMA ,) (BASENP (NNP Paris) (COMMA ,) (NNP France))))))))))) (PERIOD .))" }
ALT Myanmar Transliteration
{ "en": "CASINO", "my": [ "ကက်စီနို", "ကစီနို", "ကာစီနို", "ကာဆီနို" ] }
ALT Parallel Corpus
The fields bg, en, fil, hi, id, ja, khm, lo, ms, my, th, vi, and zh hold the sentence in the corresponding target language.
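Because every record carries the same sentence under one key per language code, a translation pair can be read straight off a record. The sketch below is illustrative only: the helper name `translation_pair` is hypothetical, and treating a missing or empty field as "no translation available" is an assumption, not something the dataset description states.

```python
from typing import Optional, Tuple

# Language codes as they appear in the parallel-corpus fields.
LANG_CODES = ("bg", "en", "fil", "hi", "id", "ja", "khm",
              "lo", "ms", "my", "th", "vi", "zh")


def translation_pair(record: dict, src: str, tgt: str) -> Optional[Tuple[str, str]]:
    """Return (source, target) sentences from one parallel record.

    Returns None when either side is absent or empty (an assumed
    convention for untranslated sentences).
    """
    for code in (src, tgt):
        if code not in LANG_CODES:
            raise ValueError(f"unknown language code: {code}")
    src_text = record.get(src)
    tgt_text = record.get(tgt)
    if not src_text or not tgt_text:
        return None
    return src_text, tgt_text


# Hypothetical record shaped like the parallel-corpus instance;
# the non-English values are placeholders, not real data.
example = {
    "SNT.URLID": "80188",
    "SNT.URLID.SNTID": "1",
    "en": "Italy have defeated Portugal 31-5 in Pool C of the 2007 Rugby World Cup at Parc des Princes, Paris, France.",
    "ja": "[translated sentence]",
    "vi": "",
}

pair = translation_pair(example, "en", "ja")
```

Returning `None` rather than raising keeps the helper usable in a filter over many records, where incomplete rows are simply skipped.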
The ALT Treebank annotation scheme differs from language to language; see the project's annotation guidelines for details.
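For the English treebank instance shown earlier, the `value` field is a bracketed parse in which each leaf has the form `(TAG word)`. A minimal sketch of recovering the tagged tokens from such a string follows. It assumes that leaves never contain spaces or brackets inside the word, which holds for the example but may not hold for every language; the function name `leaves` is hypothetical.

```python
import re
from typing import List, Tuple

# A leaf is "(TAG word)": a tag and a single token with no nested
# brackets. Internal nodes such as "(VP (VBP have) ...)" do not match
# because their second element starts with "(".
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def leaves(tree: str) -> List[Tuple[str, str]]:
    """Return (tag, word) pairs in left-to-right order."""
    return LEAF.findall(tree)


# Shortened fragment in the same bracket style as the treebank instance.
fragment = (
    "(S (BASENP (NNP Italy)) (VP (VBP have) "
    "(VP (VBN defeated) (BASENP (NNP Portugal)))) (PERIOD .))"
)

tagged = leaves(fragment)
sentence = " ".join(word for _, word in tagged)
```

A regular expression is enough here because only the leaves are needed; rebuilding the full constituent structure would call for a proper recursive parser.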
| | train | valid | test |
|---|---|---|---|
| # articles | 1698 | 98 | 97 |
| # sentences | 18088 | 1000 | 1018 |
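The ALT project was initiated by the National Institute of Information and Communications Technology, Japan (NICT) in 2014. NICT started building the Japanese and English ALT and, in the same year, collaborated with the University of Computer Studies, Yangon, Myanmar (UCSY) to build the Myanmar ALT. In 2015, the Agency for the Assessment and Application of Technology, Indonesia (BPPT), the Institute for Infocomm Research, Singapore (I2R), the Institute of Information Technology, Vietnam (IOIT), and the National Institute of Posts, Telecoms and ICT, Cambodia (NIPTICT) joined to create ALT for Indonesian, Malay, Vietnamese, and Khmer.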
[More Information Needed]
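Who are the source language producers?
The dataset was sampled from English Wikinews in 2014. Linguistic experts from the following institutions annotated it with word segmentation, part-of-speech tags, and syntactic information, in addition to word alignment information: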
[More Information Needed]
Who are the annotators?
[More Information Needed]
Creative Commons Attribution 4.0 International (CC BY 4.0)
If you use this dataset, please cite:
Hammam Riza, Michael Purwoadi, Gunarso, Teduh Uliniansyah, Aw Ai Ti, Sharifah Mahani Aljunied, Luong Chi Mai, Vu Tat Thang, Nguyen Phuong Thai, Vichet Chea, Rapid Sun, Sethserey Sam, Sopheap Seng, Khin Mar Soe, Khin Thandar Nwet, Masao Utiyama, Chenchen Ding (2016). "Introduction of the Asian Language Treebank." Oriental COCOSDA.
BibTeX:
@inproceedings{riza2016introduction,
  title={Introduction of the asian language treebank},
  author={Riza, Hammam and Purwoadi, Michael and Uliniansyah, Teduh and Ti, Aw Ai and Aljunied, Sharifah Mahani and Mai, Luong Chi and Thang, Vu Tat and Thai, Nguyen Phuong and Chea, Vichet and Sam, Sethserey and others},
  booktitle={2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA)},
  pages={1--6},
  year={2016},
  organization={IEEE}
}
Thanks to @chameleonTK for adding this dataset.