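Dataset: ubuntu_dialogs_corpus
Task: conversational
Subtask: dialogue-generation
Language: en
Multilinguality: monolingual
Size: 1M<n<10M
Language creators: found
Annotations creators: found
Source datasets: original
Preprint: arxiv:1506.08909
License: unknown

The Ubuntu Dialogue Corpus is a dataset of almost one million multi-turn dialogues, with a total of over seven million utterances and 100 million words. It provides a unique resource for research into building dialogue managers based on neural language models that can make use of large amounts of unlabeled data. The dataset has both the multi-turn property of conversations in the Dialog State Tracking Challenge datasets and the unstructured nature of interactions from microblog services such as Twitter.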
An example from "train" looks as follows.
This example was too long and was cropped: { "Context": "\"i think we could import the old comment via rsync , but from there we need to go via email . i think it be easier than cach the...", "Label": 1, "Utterance": "basic each xfree86 upload will not forc user to upgrad 100mb of font for noth __eou__ no someth i do in my spare time . __eou__" }
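The cropped example suggests that each record carries `Context`, `Label`, and `Utterance` fields, and that individual utterances inside a text field are terminated by the `__eou__` marker. The following is a minimal sketch of how such a field might be split into separate utterances. The `split_utterances` helper and the shortened `record` are hypothetical, written only for illustration, and are not part of the dataset or of any library.

```python
from typing import List

# End-of-utterance marker, as it appears in the cropped "train" example.
EOU = "__eou__"


def split_utterances(text: str) -> List[str]:
    """Split a text field on the __eou__ marker and drop empty pieces."""
    return [part.strip() for part in text.split(EOU) if part.strip()]


# Hypothetical record shaped like the cropped example; the Context value
# is shortened and given illustrative markers, since the original is truncated.
record = {
    "Context": "i think we could import the old comment via rsync __eou__ "
               "but from there we need to go via email __eou__",
    "Label": 1,
    "Utterance": "basic each xfree86 upload will not forc user to upgrad "
                 "100mb of font for noth __eou__ no someth i do in my spare time . __eou__",
}

context_turns = split_utterances(record["Context"])
response_turns = split_utterances(record["Utterance"])

for turn in response_turns:
    print(turn)
```

Stripping whitespace and discarding empty pieces avoids a trailing empty string, since each field ends with the marker.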
The data fields are the same across all splits.
Data splits:
name | train |
---|---|
train | 127422 |
@article{DBLP:journals/corr/LowePSP15,
  author        = {Ryan Lowe and Nissan Pow and Iulian Serban and Joelle Pineau},
  title         = {The Ubuntu Dialogue Corpus: {A} Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems},
  journal       = {CoRR},
  volume        = {abs/1506.08909},
  year          = {2015},
  url           = {http://arxiv.org/abs/1506.08909},
  archivePrefix = {arXiv},
  eprint        = {1506.08909},
  timestamp     = {Mon, 13 Aug 2018 16:48:23 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/LowePSP15.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
Thanks to @thomwolf, @patrickvonplaten, and @lewtun for adding this dataset.