Datasets:

lccc

Tasks:

conversational

Languages:

zh

Multilinguality:

monolingual

Size Categories:

10M<n<100M

Language Creators:

other

Annotations Creators:

other

Source Datasets:

original

ArXiv:

arxiv:2008.03946

License:

mit

Dataset Card for LCCC

Dataset Summary

LCCC (Large-scale Cleaned Chinese Conversation) is a large Chinese dialogue corpus collected from Chinese social media. A rigorous data cleaning pipeline is designed to ensure the quality of the corpus. The pipeline combines a set of hand-crafted rules with several classifier-based filters built with machine learning. The noise it removes includes offensive or sensitive words, special symbols, emojis and emoticons, grammatically incorrect sentences, and incoherent conversations whose turns are unrelated to their context.
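As an illustration of the rule-based stage of the cleaning pipeline described above, the following is a minimal sketch only. The blocklist, the character whitelist, and the function names `clean_utterance` and `keep_dialog` are hypothetical and are not taken from the LCCC release; the actual rules and the classifier-based filters are described in the paper, not in this card.

```python
import re

# Hypothetical blocklist; the word lists used by the real pipeline are not given in this card.
BLOCKLIST = {"badword"}

# Assumed whitelist: CJK ideographs, ASCII letters and digits, whitespace, and
# common ASCII/Chinese punctuation. Anything else (emojis, decorative symbols)
# is treated as noise.
NOISE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s,.!?,。!?、]")


def clean_utterance(text):
    """Replace noisy characters with spaces and collapse repeated whitespace."""
    return " ".join(NOISE.sub(" ", text).split())


def keep_dialog(dialog):
    """Return False if any utterance is empty after cleaning or contains a blocked word."""
    for utterance in dialog:
        cleaned = clean_utterance(utterance)
        if not cleaned:
            return False
        if any(word in cleaned for word in BLOCKLIST):
            return False
    return True
```

A dialogue would be kept only if `keep_dialog` returns `True` for it; a real pipeline would then pass the survivors to learned filters for grammaticality and coherence.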

Supported Tasks and Leaderboards

  • dialogue-generation: The dataset can be used to train a model for generating dialogue responses.
  • response-retrieval: The dataset can be used to train a reranker model that can be used to implement a retrieval-based dialogue model.
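Both tasks above need (context, response) training examples. One common way to derive them from a multi-turn dialogue is sketched below; the helper name `context_response_pairs` is hypothetical, and treating every utterance after the first as a response is an assumption rather than a procedure prescribed by this card.

```python
def context_response_pairs(dialog):
    """Yield (context, response) pairs from one dialogue.

    Each utterance after the first is treated as the response to all
    utterances that precede it.
    """
    for i in range(1, len(dialog)):
        yield dialog[:i], dialog[i]


dialog = [
    "火锅 我 在 重庆 成都 吃 了 七八 顿 火锅",
    "哈哈哈哈 ! 那 我 的 嘴巴 可能 要 烂掉 !",
    "不会 的 就是 好 油腻",
]
pairs = list(context_response_pairs(dialog))
```

For generation, the context is the model input and the response is the target. For retrieval, the same pairs can serve as positive examples, with negatives sampled from other dialogues.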

Languages

The dialogues in LCCC are in Chinese (zh).

Dataset Structure

Data Instances

{
    "dialog": ["火锅 我 在 重庆 成都 吃 了 七八 顿 火锅", "哈哈哈哈 ! 那 我 的 嘴巴 可能 要 烂掉 !", "不会 的 就是 好 油腻"]
}
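The utterances in the instance above are already word-segmented, with tokens separated by single spaces. A small sketch of reading such a record and recovering either the token list or the unsegmented text follows; the helper names are hypothetical, and joining tokens without spaces is an assumption that suits Chinese text but would also glue together any embedded Latin-script words.

```python
import json

raw = (
    '{"dialog": ["火锅 我 在 重庆 成都 吃 了 七八 顿 火锅", '
    '"哈哈哈哈 ! 那 我 的 嘴巴 可能 要 烂掉 !", '
    '"不会 的 就是 好 油腻"]}'
)
record = json.loads(raw)


def tokens(utterance):
    """Split a pre-segmented utterance into its tokens."""
    return utterance.split()


def unsegmented(utterance):
    """Drop the segmentation spaces to recover plain Chinese text."""
    return "".join(utterance.split())


first_tokens = tokens(record["dialog"][0])
last_plain = unsegmented(record["dialog"][2])
```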

Data Fields

  • dialog (list of strings): The utterances that make up one dialogue, in order.

Data Splits

We do not provide an official split for LCCC-large, but we do provide one for LCCC-base:

train        valid     test
6,820,506    20,000    10,000
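To put the LCCC-base split sizes in proportion, the short calculation below derives each split's share of the total. The counts are the ones listed above; the variable names are illustrative.

```python
# Split sizes for LCCC-base as listed in this card.
SPLIT_SIZES = {"train": 6_820_506, "valid": 20_000, "test": 10_000}

total = sum(SPLIT_SIZES.values())

# Share of each split as a percentage of the total, rounded to two decimals.
percentages = {
    name: round(100 * size / total, 2) for name, size in SPLIT_SIZES.items()
}

for name, pct in percentages.items():
    print(f"{name}: {SPLIT_SIZES[name]:,} ({pct}%)")
```

The validation and test sets together account for well under one percent of LCCC-base, so nearly all of the data is available for training.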

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

MIT License

Copyright (c) 2020 lemon234071

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Citation Information

@inproceedings{wang2020chinese,
  title={A Large-Scale Chinese Short-Text Conversation Dataset},
  author={Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
  booktitle={NLPCC},
  year={2020},
  url={https://arxiv.org/abs/2008.03946}
}

Contributions

Thanks to Yinhe Zheng for adding this dataset.