Dataset:

silver/mmchat

Tasks:

Conversational

Sub-tasks:

dialogue-generation

Languages:

Multilinguality:

monolingual

Size:

10M<n<100M

Language creators:

found

Annotation creators:

no-annotation

Source datasets:

original

ArXiv:

arxiv:2108.07154 arxiv:2008.03946

License:

other

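Dataset Card for MMChat

Dataset Summary

MMChat is a large-scale dialogue dataset of image-grounded dialogues in Chinese. Each dialogue in MMChat is associated with one or more images (at most 9 images per dialogue). We designed several strategies to ensure the quality of the dialogues in MMChat.

MMChat comes in 4 versions:

mmchat: The MMChat dataset used in our paper.
mmchat_hf: Contains human annotations for 100K dialogue sessions.
mmchat_raw: The raw dialogues used to construct MMChat.
mmchat_lccc_filtered: Raw dialogues filtered using the LCCC dataset.

If you want high-quality multi-modal dialogues that are closely related to the given images, use the mmchat_hf version. If you only care about the quality of the dialogue text, use the mmchat_lccc_filtered version.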

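Supported Tasks and Leaderboards

dialogue-generation: The dataset can be used to train a model that generates dialogue responses.
response-retrieval: The dataset can be used to train a reranker model, which can in turn be used to build a retrieval-based dialogue model.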
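Both tasks need sessions broken into training examples. The following is a minimal sketch, assuming each utterance after the first is treated as a response to all utterances before it; this pairing scheme and the helper name `build_pairs` are illustrative assumptions, not the authors' preprocessing. The sample record is the one shown under Data Instances.

```python
from typing import Dict, List, Tuple


def build_pairs(record: Dict) -> List[Tuple[List[str], str]]:
    """Split one session into (context, response) pairs.

    Assumption: every utterance after the first is a response to
    all of the utterances that precede it.
    """
    dialog = record["dialog"]
    return [(dialog[:i], dialog[i]) for i in range(1, len(dialog))]


# Sample record copied from the Data Instances section.
record = {
    "dialog": ["你只拍出了你十分之一的美", "你的头像竟然换了，奥"],
    "weibo_content": "分享图片",
    "imgs": ["https://wx4.sinaimg.cn/mw2048/d716a6e2ly1fmug2w2l9qj21o02yox6p.jpg"],
}

pairs = build_pairs(record)
for context, response in pairs:
    print(context, "->", response)
```

A generation model could be trained to produce the response from the context, while a reranker could score the true response against sampled negatives for the same context.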

Languages

The dialogues in MMChat are in Chinese.

Dataset Structure

Data Instances

MMChat is available in several versions. Records in mmchat, mmchat_raw, and mmchat_lccc_filtered have the following form:

{
  "dialog": ["你只拍出了你十分之一的美", "你的头像竟然换了，奥"],
  "weibo_content": "分享图片",
  "imgs": ["https://wx4.sinaimg.cn/mw2048/d716a6e2ly1fmug2w2l9qj21o02yox6p.jpg"]
}

Records in mmchat_hf have the following form:

{
  "dialog": ["白百合", "啊？", "有点像", "还好吧哈哈哈牙像", "有男盆友没呢", "还没", "和你说话呢。没回我"],
  "weibo_content": "补一张昨天礼仪的照片",
  "imgs": ["https://ww2.sinaimg.cn/mw2048/005Co9wdjw1eyoz7ib9n5j307w0bu3z5.jpg"],
  "labels": {
    "image_qualified": true, 
    "dialog_qualified": true, 
    "dialog_image_related": true
  }
}

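Data Fields

dialog (list of strings): The utterances in the dialogue.
weibo_content (string): The Weibo post that the dialogue belongs to.
imgs (list of strings): The URLs of the images.
labels (dict): Human-annotated labels for the dialogue (shown only for mmchat_hf in the instances above), with the following keys:
image_qualified (bool): Whether the image is of high quality.
dialog_qualified (bool): Whether the dialogue is of high quality.
dialog_image_related (bool): Whether the dialogue is related to the image.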
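The label fields make it straightforward to keep only sessions that pass every human check. The sketch below assumes records are plain dictionaries shaped like the mmchat_hf instance above; the helper name `keep_fully_qualified` and the second sample record (with one label set to false) are hypothetical and exist only to show the filter rejecting something.

```python
from typing import Dict, Iterable, List

REQUIRED_LABELS = ("image_qualified", "dialog_qualified", "dialog_image_related")


def keep_fully_qualified(records: Iterable[Dict]) -> List[Dict]:
    """Keep records whose three annotation labels are all true.

    Records without a "labels" field are dropped.
    """
    kept = []
    for rec in records:
        labels = rec.get("labels") or {}
        if all(labels.get(key) is True for key in REQUIRED_LABELS):
            kept.append(rec)
    return kept


records = [
    {   # Instance from the card: all three labels are true.
        "dialog": ["白百合", "啊？", "有点像"],
        "weibo_content": "补一张昨天礼仪的照片",
        "imgs": ["https://ww2.sinaimg.cn/mw2048/005Co9wdjw1eyoz7ib9n5j307w0bu3z5.jpg"],
        "labels": {"image_qualified": True, "dialog_qualified": True,
                   "dialog_image_related": True},
    },
    {   # Hypothetical record: the dialogue is unrelated to the image.
        "dialog": ["example utterance", "example reply"],
        "weibo_content": "example post",
        "imgs": ["https://example.com/hypothetical.jpg"],
        "labels": {"image_qualified": True, "dialog_qualified": True,
                   "dialog_image_related": False},
    },
]

kept = keep_fully_qualified(records)
print(len(kept), "of", len(records), "records kept")
```

Requiring all three labels is one possible policy; a text-only use case might check `dialog_qualified` alone.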

Data Splits

For mmchat, we provide the following splits:

train	valid	test
115,842	4,000	1,000

No official splits are provided for the other versions. More statistics for each version follow:

mmchat	Count
Sessions	120.84 K
Sessions with more than 4 utterances	17.32 K
Utterances	314.13 K
Images	198.82 K
Avg. utterance per session	2.599
Avg. image per session	2.791
Avg. character per utterance	8.521

mmchat_hf	Count
Sessions	19.90 K
Sessions with more than 4 utterances	8.91 K
Total annotated sessions	100.01 K
Utterances	81.06 K
Images	52.66 K
Avg. utterance per session	4.07
Avg. image per session	2.70
Avg. character per utterance	11.93

mmchat_raw	Count
Sessions	4.257 M
Sessions with more than 4 utterances	2.304 M
Utterances	18.590 M
Images	4.874 M
Avg. utterance per session	4.367
Avg. image per session	1.670
Avg. character per utterance	14.104

mmchat_lccc_filtered	Count
Sessions	492.6 K
Sessions with more than 4 utterances	208.8 K
Utterances	1.986 M
Images	1.066 M
Avg. utterance per session	4.031
Avg. image per session	2.514
Avg. character per utterance	11.336
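The statistics above can be recomputed for any subset of records. The sketch below is one plausible reading of the table rows, not the authors' script: it assumes "Images" counts distinct URLs, that "more than 4 utterances" means `len(dialog) > 4`, and that characters are counted as Python string lengths. The function name `session_stats` is hypothetical.

```python
from typing import Dict, Iterable


def session_stats(records: Iterable[Dict]) -> Dict[str, float]:
    """Compute per-version summary statistics from raw records."""
    sessions = long_sessions = utterances = characters = image_refs = 0
    unique_images = set()
    for rec in records:
        dialog = rec["dialog"]
        sessions += 1
        long_sessions += int(len(dialog) > 4)   # assumed reading of the row
        utterances += len(dialog)
        characters += sum(len(u) for u in dialog)
        image_refs += len(rec["imgs"])
        unique_images.update(rec["imgs"])       # assumed: distinct URLs
    if sessions == 0 or utterances == 0:
        raise ValueError("need at least one non-empty session")
    return {
        "sessions": sessions,
        "sessions_gt_4_utterances": long_sessions,
        "utterances": utterances,
        "images": len(unique_images),
        "avg_utterance_per_session": utterances / sessions,
        "avg_image_per_session": image_refs / sessions,
        "avg_character_per_utterance": characters / utterances,
    }


# The two instances shown under Data Instances.
records = [
    {"dialog": ["你只拍出了你十分之一的美", "你的头像竟然换了，奥"],
     "imgs": ["https://wx4.sinaimg.cn/mw2048/d716a6e2ly1fmug2w2l9qj21o02yox6p.jpg"]},
    {"dialog": ["白百合", "啊？", "有点像", "还好吧哈哈哈牙像", "有男盆友没呢",
                "还没", "和你说话呢。没回我"],
     "imgs": ["https://ww2.sinaimg.cn/mw2048/005Co9wdjw1eyoz7ib9n5j307w0bu3z5.jpg"]},
]

stats = session_stats(records)
for name, value in stats.items():
    print(f"{name}: {value}")
```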

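Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information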

other-weibo

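This dataset was collected from Weibo. Consult the applicable Weibo usage policy before using the dataset, and restrict your use of it to non-commercial purposes.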

Citation Information

@inproceedings{zheng2022MMChat,
  author    = {Zheng, Yinhe and Chen, Guanyi and Liu, Xin and Sun, Jian},
  title     = {MMChat: Multi-Modal Chat Dataset on Social Media},
  booktitle = {Proceedings of The 13th Language Resources and Evaluation Conference},
  year      = {2022},
  publisher = {European Language Resources Association},
}

@inproceedings{wang2020chinese,
  title={A Large-Scale Chinese Short-Text Conversation Dataset},
  author={Wang, Yida and Ke, Pei and Zheng, Yinhe and Huang, Kaili and Jiang, Yong and Zhu, Xiaoyan and Huang, Minlie},
  booktitle={NLPCC},
  year={2020},
  url={https://arxiv.org/abs/2008.03946}
}

Contributions

Thanks to Yinhe Zheng for adding this dataset.

Author:

silver

Dataset size:

1.05 GB