Dataset:

best2009

Tasks:

token classification

Languages:

Multilinguality:

monolingual

Size:

100K<n<1M

Language creators:

found

Annotation creators:

expert-generated

Source datasets:

original

Other:

word-tokenization

License:

cc-by-nc-sa-3.0


Dataset Card for best2009

Dataset Summary

best2009 is a Thai word-tokenization dataset compiled by NECTEC from encyclopedia entries, novels, news and articles (148,995 training lines and 2,252 test lines). It was created for the BEST 2010: Word Tokenization Competition. The test set answers are not provided publicly.

Supported Tasks and Leaderboards

word tokenization

Languages

Thai

Dataset Structure

Data Instances

{'char': ['?', 'ภ', 'ู', 'ม', 'ิ', 'ป', 'ั', 'ญ', 'ญ', 'า', 'ช', 'า', 'ว', 'บ', '้', 'า', 'น', '\n'], 'char_type': [4, 1, 10, 1, 10, 1, 4, 1, 1, 10, 1, 10, 1, 1, 9, 10, 1, 4], 'fname': 'encyclopedia_00031.txt', 'is_beginning': [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]}
{'char': ['ภ', 'ู', 'ม', 'ิ', 'ป', 'ั', 'ญ', 'ญ', 'า', 'ช', 'า', 'ว', 'บ', '้', 'า', 'น', ' ', 'ห', 'ม', 'า', 'ย', 'ถ', 'ึ', 'ง', ' ', 'ค', 'ว', 'า', 'ม', 'ร', 'ู', '้', 'ข', 'อ', 'ง', 'ช', 'า', 'ว', 'บ', '้', 'า', 'น', ' ', 'ซ', 'ึ', '่', 'ง', 'เ', 'ร', 'ี', 'ย', 'น', 'ร', 'ู', '้', 'ม', 'า', 'จ', 'า', 'ก', 'พ', '่', 'อ', 'แ', 'ม', '่', ' ', 'ป', 'ู', '่', 'ย', '่', 'า', 'ต', 'า', 'ย', 'า', 'ย', ' ', 'ญ', 'า', 'ต', 'ิ', 'พ', 'ี', '่', 'น', '้', 'อ', 'ง', ' ', 'ห', 'ร', 'ื', 'อ', 'ผ', 'ู', '้', 'ม', 'ี', 'ค', 'ว', 'า', 'ม', 'ร', 'ู', '้', 'ใ', 'น', 'ห', 'ม', 'ู', '่', 'บ', '้', 'า', 'น', 'ใ', 'น', 'ท', '้', 'อ', 'ง', 'ถ', 'ิ', '่', 'น', 'ต', '่', 'า', 'ง', 'ๆ', '\n'], 'char_type': [1, 10, 1, 10, 1, 4, 1, 1, 10, 1, 10, 1, 1, 9, 10, 1, 5, 3, 1, 10, 1, 1, 10, 1, 5, 1, 1, 10, 1, 1, 10, 9, 1, 1, 1, 1, 10, 1, 1, 9, 10, 1, 5, 1, 10, 9, 1, 11, 1, 10, 1, 1, 1, 10, 9, 1, 10, 1, 10, 1, 1, 9, 1, 11, 1, 9, 5, 1, 10, 9, 1, 9, 10, 1, 10, 1, 10, 1, 5, 1, 10, 1, 10, 1, 10, 9, 1, 9, 1, 1, 5, 3, 1, 10, 1, 3, 10, 9, 1, 10, 1, 1, 10, 1, 1, 10, 9, 11, 1, 3, 1, 10, 9, 1, 9, 10, 1, 11, 1, 1, 9, 1, 1, 1, 10, 9, 1, 1, 9, 10, 1, 7, 4], 'fname': 'encyclopedia_00031.txt', 'is_beginning': [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]}

Data Fields

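fname : file name; also indicates whether the text comes from articles, news, encyclopedia or novels
char : the characters of the line, one per list element
char_type : character type of each character, as adopted from deepcut
is_beginning : 1 if the character is the beginning of a word, 0 otherwise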
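
As a minimal sketch of how these fields fit together, the snippet below rebuilds words from `char` and `is_beginning` and reads the genre from `fname`. The sample is a slice of the first instance shown under Data Instances. The helper names `chars_to_words` and `genre_from_fname` are hypothetical, and the `<genre>_<number>.txt` file-name pattern is an assumption inferred from the single example `encyclopedia_00031.txt`.

```python
def chars_to_words(chars, is_beginning):
    """Group characters into words; a new word starts wherever is_beginning is 1."""
    words = []
    for ch, begins in zip(chars, is_beginning):
        if begins == 1 or not words:
            words.append(ch)      # start a new word
        else:
            words[-1] += ch       # extend the current word
    return words


def genre_from_fname(fname):
    """Assumption: file names look like '<genre>_<number>.txt'."""
    return fname.split("_", 1)[0]


# Slice of the first instance above (leading and trailing characters omitted).
sample = {
    "char": ['ภ', 'ู', 'ม', 'ิ', 'ป', 'ั', 'ญ', 'ญ', 'า',
             'ช', 'า', 'ว', 'บ', '้', 'า', 'น'],
    "is_beginning": [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    "fname": "encyclopedia_00031.txt",
}

words = chars_to_words(sample["char"], sample["is_beginning"])
genre = genre_from_fname(sample["fname"])
print(words)
print(genre)
```

Joining the resulting words back together reproduces the original character sequence, which is a quick sanity check on any prediction in the same format.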

Data Splits

statistic	train	test
# lines	148,995	2,252
avg words per line	39.05	NA
total words	5,818,521	NA
avg characters per line	140.39	202.79
total characters	20,918,132	456,684
# lines articles	16,990	NA
# lines encyclopedia	50,631	NA
# lines novels	50,140	NA
# lines news	31,234	NA
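
The averages and per-genre counts in this table can be cross-checked from the totals. The short sketch below only redoes the arithmetic with the figures listed above; the variable names are illustrative.

```python
# Figures copied from the Data Splits table.
train_lines, test_lines = 148_995, 2_252
train_words = 5_818_521
train_chars, test_chars = 20_918_132, 456_684
genre_lines = {
    "articles": 16_990,
    "encyclopedia": 50_631,
    "novels": 50_140,
    "news": 31_234,
}

avg_words_train = round(train_words / train_lines, 2)
avg_chars_train = round(train_chars / train_lines, 2)
avg_chars_test = round(test_chars / test_lines, 2)
genre_total = sum(genre_lines.values())

print(avg_words_train, avg_chars_train, avg_chars_test)
print(genre_total == train_lines)
```

The per-genre line counts sum to the total number of training lines, and the recomputed averages match the table to two decimal places.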

Dataset Creation

Curation Rationale

The dataset was created by NECTEC for the BEST 2010: Word Tokenization Competition.

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

Respective authors of the articles, news, encyclopedia and novels

Annotations

Annotation process

Detailed annotation guidelines can be found in BEST_Guideline_Release1.pdf, which is included in the uncompressed files. The word tokenization standard used is InterBEST2009.

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

All data are curated from public sources. No personal or sensitive information is expected to be included.

Considerations for Using the Data

Social Impact of Dataset

Word tokenization dataset from articles, news, encyclopedia and novels.

Discussion of Biases

The texts are relatively formal, drawn from articles, news, encyclopedia and novels.
The word tokenization standard used is InterBEST2009.

Other Known Limitations

Some tags unrelated to word tokenization (<NE> and <AB>) have been removed.
No word boundaries are provided for the test set.
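
For illustration only, the tag cleanup described above might look like the sketch below. It is not the curators' actual preprocessing script. It assumes that raw lines separate words with a `|` delimiter and wrap some spans in `<NE>...</NE>` or `<AB>...</AB>` tags; the card does not state the raw format, so both the delimiter and the helper names `strip_tags` and `to_labels` are assumptions.

```python
import re

# Matches opening and closing <NE> and <AB> tags.
TAG_PATTERN = re.compile(r"</?(?:NE|AB)>")


def strip_tags(line):
    """Remove <NE>/<AB> tags while keeping the text they enclose."""
    return TAG_PATTERN.sub("", line)


def to_labels(line, delimiter="|"):
    """Convert a delimited line into parallel char / is_beginning lists.

    Assumption: words are separated by `delimiter` in the raw text.
    """
    chars, is_beginning = [], []
    for word in strip_tags(line).split(delimiter):
        for i, ch in enumerate(word):
            chars.append(ch)
            is_beginning.append(1 if i == 0 else 0)
    return chars, is_beginning


# Toy ASCII input, not taken from the corpus.
chars, labels = to_labels("<NE>ab</NE>|cd|<AB>e</AB>|")
print(chars)
print(labels)
```

Empty segments, such as the one after a trailing delimiter, contribute no characters, so the two output lists always have the same length.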

Additional Information

Dataset Curators

NECTEC

Licensing Information

CC-BY-NC-SA 3.0

Citation Information

Dataset:

@inproceedings{kosawat2009best,
  title={BEST 2009: Thai word segmentation software contest},
  author={Kosawat, Krit and Boriboon, Monthika and Chootrakool, Patcharika and Chotimongkol, Ananlada and Klaithin, Supon and Kongyoung, Sarawoot and Kriengket, Kanyanut and Phaholphinyo, Sitthaa and Purodakananda, Sumonmas and Thanakulwarapas, Tipraporn and others},
  booktitle={2009 Eighth International Symposium on Natural Language Processing},
  pages={83--88},
  year={2009},
  organization={IEEE}
}
@inproceedings{boriboon2009best,
  title={Best corpus development and analysis},
  author={Boriboon, Monthika and Kriengket, Kanyanut and Chootrakool, Patcharika and Phaholphinyo, Sitthaa and Purodakananda, Sumonmas and Thanakulwarapas, Tipraporn and Kosawat, Krit},
  booktitle={2009 International Conference on Asian Language Processing},
  pages={322--327},
  year={2009},
  organization={IEEE}
}

Character type features:

@inproceedings{haruechaiyasak2009tlex,
  title={TLex: Thai lexeme analyser based on the conditional random fields},
  author={Haruechaiyasak, Choochart and Kongyoung, Sarawoot},
  booktitle={Proceedings of 8th International Symposium on Natural Language Processing},
  year={2009}
}

Contributions

Thanks to @cstorm125 for adding this dataset.

Author:

Anonymous

Dataset size:

17.72 KB