Datasets:

pszemraj/qmsum-cleaned

Language:

en

Source datasets:

tau/scrolls

License:

apache-2.0

qmsum-cleaned

prefixes

Note that each "document" in `input` is prefixed by a question/prompt describing what the model is supposed to do. You may want to handle this prefix explicitly, or prepend a similar prompt to the inputs of models trained on this dataset.
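One way to handle the prefix explicitly is to split it off before further processing. The sketch below is not part of the dataset tooling: `split_prefix` is a hypothetical helper, and it assumes the prompt ends at the first newline or at the first `.`, `?` or `!` followed by whitespace, which may not hold for every row.

```python
import re
from typing import Tuple

# Assumption: the prompt ends at the first newline, or at the first
# '.', '?' or '!' that is followed by whitespace.
_PREFIX_END = re.compile(r"(?<=[.?!])\s+|\n+")


def split_prefix(text: str) -> Tuple[str, str]:
    """Split one `input` string into (prefix, remainder)."""
    parts = _PREFIX_END.split(text.strip(), maxsplit=1)
    if len(parts) == 1:
        # No boundary found: treat the whole string as the prefix.
        return parts[0], ""
    return parts[0].strip(), parts[1].strip()


if __name__ == "__main__":
    prefix, body = split_prefix(
        "Summarize the whole meeting. Project Manager: Okay, let's start."
    )
    print(prefix)
    print(body)
```

A regex like this is only a rough stand-in for a real sentence splitter, so it is worth spot-checking the extracted prefixes before relying on them.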

Most frequent "prefixes" separated via sentence-splitter in the train split:

| | Sentence | Count |
|---|---|---|
| 0 | Summarize the whole meeting. | 121 |
| 1 | Summarize the meeting | 25 |
| 2 | What did the team discuss about the product cost? | 4 |
| 3 | How did Marketing design the product evaluation? | 4 |
| 4 | Summarize the wrap up of the meeting. | 3 |
| 5 | What did the group discuss about user requirements of the new remote control? | 3 |
| 6 | What did the team discuss during the product evaluation? | 3 |
| 7 | Summarize the meeting. | 2 |
| 8 | Summarize what was said about digits form | 2 |
| 9 | What was discussed in the meeting? | 2 |
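A count like the one above could be approximated as follows. This is a rough sketch rather than the script used for this card: `first_sentence` and `top_prefixes` are hypothetical names, and the regex is a crude substitute for the sentence splitter, so exact counts may differ.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def first_sentence(text: str) -> str:
    # Assumption: the first sentence ends at the first newline, or at the
    # first '.', '?' or '!' followed by whitespace.
    return re.split(r"(?<=[.?!])\s+|\n+", text.strip(), maxsplit=1)[0].strip()


def top_prefixes(inputs: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent first sentences with their counts."""
    return Counter(first_sentence(t) for t in inputs).most_common(n)


if __name__ == "__main__":
    # Made-up sample strings for illustration, not rows from the dataset.
    sample = [
        "Summarize the whole meeting. A: hi",
        "Summarize the whole meeting. B: hello",
        "What was discussed in the meeting? C: okay",
    ]
    for sentence, count in top_prefixes(sample):
        print(count, sentence)
```

With the Hugging Face `datasets` library installed, the strings could come from `load_dataset("pszemraj/qmsum-cleaned", split="train")["input"]`.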

wordcloud

Visualized as a wordcloud (train split):

token counts