数据集:
lccc
任务:
对话子任务:
dialogue-generation语言:
zh计算机处理:
monolingual大小:
10M<n<100M语言创建人:
other批注创建人:
other源数据集:
original预印本库:
arxiv:2008.03946许可:
mitLCCC: Large-scale Cleaned Chinese Conversation corpus (LCCC) is a large Chinese dialogue corpus originate from Chinese social medias. A rigorous data cleaning pipeline is designed to ensure the quality of the corpus. This pipeline involves a set of rules and several classifier-based filters. Noises such as offensive or sensitive words, special symbols, emojis, grammatically incorrect sentences, and incoherent conversations are filtered.
LCCC是一套来自于中文社交媒体的对话数据,我们设计了一套严格的数据过滤流程来确保该数据集中对话数据的质量。 这一数据过滤流程中包括一系列手工规则以及若干基于机器学习算法所构建的分类器。 我们所过滤掉的噪声包括:脏字脏词、特殊字符、颜表情、语法不通的语句、上下文不相关的对话等。
LCCC is in Chinese
LCCC中的对话是中文的
{ "dialog": ["火锅 我 在 重庆 成都 吃 了 七八 顿 火锅", "哈哈哈哈 ! 那 我 的 嘴巴 可能 要 烂掉 !", "不会 的 就是 好 油腻"] }
We do not provide the offical split for LCCC-large. But we provide a split for LCCC-base:
train | valid | test |
---|---|---|
6,820,506 | 20,000 | 10,000 |
[Needs More Information]
[Needs More Information]
Who are the source language producers?[Needs More Information]
[Needs More Information]
Who are the annotators?[Needs More Information]
[Needs More Information]
[Needs More Information]
[Needs More Information]
[Needs More Information]
[Needs More Information]
MIT License
Copyright (c) 2020 lemon234071
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@inproceedings{wang2020chinese, title={A Large-Scale Chinese Short-Text Conversation Dataset}, author={Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie}, booktitle={NLPCC}, year={2020}, url={https://arxiv.org/abs/2008.03946} }
Thanks to Yinhe Zheng for adding this dataset.