Dataset: silver/mmchat
Tasks: conversational
Sub-tasks: dialogue-generation
Languages: zh
Multilinguality: monolingual
Size: 10M<n<100M
Language creators: found
Annotation creators: no-annotation
Source datasets: original
License: other
MMChat is a large-scale dialogue dataset that contains image-grounded dialogues in Chinese. Each dialogue in MMChat is associated with one or more images (at most 9 images per dialogue). We design various strategies to ensure the quality of the dialogues in MMChat.
MMChat comes in four versions: mmchat, mmchat_raw, mmchat_lccc_filtered, and mmchat_hf.
If you want high-quality multi-modal dialogues that are closely related to the given images, we suggest using the mmchat_hf version. If you only care about the quality of the dialogue texts, we suggest using the mmchat_lccc_filtered version.
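The recommendation above can be sketched as a small helper. This is a minimal sketch, not an official API: `choose_version` and `load_mmchat` are hypothetical names, and the code assumes that the version names double as configuration names for `load_dataset` from the Hugging Face `datasets` library, under the repository id `silver/mmchat` given in the card metadata. Depending on the installed `datasets` version, loading may need extra arguments.

```python
def choose_version(need_image_grounding: bool) -> str:
    """Map the card's recommendation to a version name (hypothetical helper)."""
    # mmchat_hf: dialogues that are closely related to their images.
    # mmchat_lccc_filtered: only the quality of the dialogue text matters.
    return "mmchat_hf" if need_image_grounding else "mmchat_lccc_filtered"


def load_mmchat(version: str):
    """Load one version; assumes version names are `datasets` config names."""
    # Imported lazily so choose_version works without the library installed.
    from datasets import load_dataset

    return load_dataset("silver/mmchat", version)


print(choose_version(need_image_grounding=True))
```

Calling `load_mmchat(choose_version(True))` would then attempt the download; it is left out of the sketch because it needs network access.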
The dialogues in MMChat are in Chinese.
Several versions of MMChat are available. For mmchat, mmchat_raw, and mmchat_lccc_filtered, the following instance applies:
{ "dialog": ["你只拍出了你十分之一的美", "你的头像竟然换了,奥"], "weibo_content": "分享图片", "imgs": ["https://wx4.sinaimg.cn/mw2048/d716a6e2ly1fmug2w2l9qj21o02yox6p.jpg"] }
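As a hedged illustration of this layout, the sketch below checks that a record has the three fields shown in the instance and respects the limit of one to nine images per dialogue stated in the description. The function name `check_instance` is hypothetical, and the exact checks are assumptions drawn only from the example instance, not from an official schema.

```python
from urllib.parse import urlparse

MAX_IMAGES = 9  # the card states at most 9 images per dialogue


def check_instance(instance: dict) -> list:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    dialog = instance.get("dialog")
    if not isinstance(dialog, list) or not all(isinstance(u, str) for u in dialog):
        problems.append("dialog must be a list of strings")
    if not isinstance(instance.get("weibo_content"), str):
        problems.append("weibo_content must be a string")
    imgs = instance.get("imgs")
    if not isinstance(imgs, list) or not 1 <= len(imgs) <= MAX_IMAGES:
        problems.append("imgs must be a list of 1 to 9 image URLs")
    else:
        for url in imgs:
            # Only http(s) URLs appear in the example instance.
            if not isinstance(url, str) or urlparse(url).scheme not in ("http", "https"):
                problems.append(f"not an http(s) URL: {url!r}")
    return problems


example = {
    "dialog": ["你只拍出了你十分之一的美", "你的头像竟然换了,奥"],
    "weibo_content": "分享图片",
    "imgs": ["https://wx4.sinaimg.cn/mw2048/d716a6e2ly1fmug2w2l9qj21o02yox6p.jpg"],
}
print(check_instance(example))
```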
For mmchat_hf, the following instance applies:
{ "dialog": ["白百合", "啊?", "有点像", "还好吧哈哈哈牙像", "有男盆友没呢", "还没", "和你说话呢。没回我"], "weibo_content": "补一张昨天礼仪的照片", "imgs": ["https://ww2.sinaimg.cn/mw2048/005Co9wdjw1eyoz7ib9n5j307w0bu3z5.jpg"], "labels": { "image_qualified": true, "dialog_qualified": true, "dialog_image_related": true } }
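The extra `labels` field can be used to filter records, and the `dialog` list can be unrolled into training pairs. The following is a minimal sketch under assumptions: the helper names `is_fully_qualified` and `context_response_pairs` are hypothetical, and pairing each utterance with all preceding utterances is one common convention for dialogue generation, not necessarily the preprocessing used by the dataset authors.

```python
LABEL_KEYS = ("image_qualified", "dialog_qualified", "dialog_image_related")


def is_fully_qualified(instance: dict) -> bool:
    """True only if all three annotation labels are present and true."""
    labels = instance.get("labels", {})
    return all(labels.get(key, False) for key in LABEL_KEYS)


def context_response_pairs(dialog: list) -> list:
    """Pair each utterance (from the second on) with all preceding utterances."""
    return [(dialog[:i], dialog[i]) for i in range(1, len(dialog))]


example = {
    "dialog": ["白百合", "啊?", "有点像", "还好吧哈哈哈牙像", "有男盆友没呢", "还没", "和你说话呢。没回我"],
    "weibo_content": "补一张昨天礼仪的照片",
    "imgs": ["https://ww2.sinaimg.cn/mw2048/005Co9wdjw1eyoz7ib9n5j307w0bu3z5.jpg"],
    "labels": {
        "image_qualified": True,
        "dialog_qualified": True,
        "dialog_image_related": True,
    },
}

if is_fully_qualified(example):
    for context, response in context_response_pairs(example["dialog"]):
        print(len(context), "->", response)
```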
For mmchat, we provide the following splits:
train | valid | test |
---|---|---|
115,842 | 4,000 | 1,000 |
For the other versions, we do not provide an official split. More statistics are listed here:
mmchat | Count |
---|---|
Sessions | 120.84 K |
Sessions with more than 4 utterances | 17.32 K |
Utterances | 314.13 K |
Images | 198.82 K |
Avg. utterance per session | 2.599 |
Avg. image per session | 2.791 |
Avg. character per utterance | 8.521 |
mmchat_hf | Count |
---|---|
Sessions | 19.90 K |
Sessions with more than 4 utterances | 8.91 K |
Total annotated sessions | 100.01 K |
Utterances | 81.06 K |
Images | 52.66 K |
Avg. utterance per session | 4.07 |
Avg. image per session | 2.70 |
Avg. character per utterance | 11.93 |
mmchat_raw | Count |
---|---|
Sessions | 4.257 M |
Sessions with more than 4 utterances | 2.304 M |
Utterances | 18.590 M |
Images | 4.874 M |
Avg. utterance per session | 4.367 |
Avg. image per session | 1.670 |
Avg. character per utterance | 14.104 |
mmchat_lccc_filtered | Count |
---|---|
Sessions | 492.6 K |
Sessions with more than 4 utterances | 208.8 K |
Utterances | 1.986 M |
Images | 1.066 M |
Avg. utterance per session | 4.031 |
Avg. image per session | 2.514 |
Avg. character per utterance | 11.336 |
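As a quick sanity check on the tables above, the reported average number of utterances per session can be recomputed from the session and utterance counts, and the mmchat split sizes can be summed and compared with its session count. This is only an illustrative sketch: the dictionary names are hypothetical, the inputs are the rounded figures from the tables, and the image counts are left out because the tables do not say how images shared between sessions are counted.

```python
# (sessions, utterances, reported avg. utterances per session), from the tables.
STATS = {
    "mmchat": (120.84e3, 314.13e3, 2.599),
    "mmchat_hf": (19.90e3, 81.06e3, 4.07),
    "mmchat_raw": (4.257e6, 18.590e6, 4.367),
    "mmchat_lccc_filtered": (492.6e3, 1.986e6, 4.031),
}

# Official split sizes for the mmchat version.
SPLITS = {"train": 115_842, "valid": 4_000, "test": 1_000}


def avg_utterances(sessions: float, utterances: float) -> float:
    """Average utterances per session from the two totals."""
    return utterances / sessions


for name, (sessions, utterances, reported) in STATS.items():
    computed = avg_utterances(sessions, utterances)
    # Inputs are rounded, so only approximate agreement is expected.
    print(f"{name}: computed {computed:.3f}, reported {reported}")

total_sessions = sum(SPLITS.values())
print(f"mmchat splits sum to {total_sessions} sessions")
```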
Who are the source language producers? [Needs More Information]
Who are the annotators? [Needs More Information]
other-weibo
This dataset is collected from Weibo. Please refer to the detailed policy that governs the use of this dataset, and restrict its usage to non-commercial purposes.
@inproceedings{zheng2022MMChat,
  author = {Zheng, Yinhe and Chen, Guanyi and Liu, Xin and Sun, Jian},
  title = {MMChat: Multi-Modal Chat Dataset on Social Media},
  booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
  year = {2022},
  publisher = {European Language Resources Association},
}
@inproceedings{wang2020chinese,
  title = {A Large-Scale Chinese Short-Text Conversation Dataset},
  author = {Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
  booktitle = {NLPCC},
  year = {2020},
  url = {https://arxiv.org/abs/2008.03946}
}
Thanks to Yinhe Zheng for adding this dataset.