Dataset:
pszemraj/qmsum-cleaned
It's worth noting that each "document" in the `input` column is prefixed by a question/prompt stating what the model is supposed to do. You may want to handle this explicitly, e.g. by stripping the prompt, or by prefixing inputs the same way when using models trained on this dataset.
The most frequent "prefixes" in the train split, separated with a sentence splitter:
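One way to handle the prefix is to split it off before training or inference. A minimal sketch, assuming the prompt is the first sentence of the document and ends with `.` or `?` (the exact delimiter is not documented here):

```python
import re

def split_prompt(document: str) -> tuple[str, str]:
    """Split the leading question/prompt from the rest of a record.

    Assumes the prompt is the first sentence, terminated by '.' or '?'.
    Falls back to returning the whole text as the prompt if no
    terminator is found.
    """
    m = re.match(r"\s*(.+?[.?])\s+(.*)", document, flags=re.DOTALL)
    if m:
        return m.group(1), m.group(2)
    return document.strip(), ""

prompt, body = split_prompt(
    "Summarize the whole meeting. Project Manager: Okay, let's begin ..."
)
```

A heavier-duty sentence splitter (e.g. from nltk or spaCy) would be more robust for prompts containing abbreviations or unusual punctuation.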
| Sentence | Count |
|---|---|
| Summarize the whole meeting. | 121 |
| Summarize the meeting | 25 |
| What did the team discuss about the product cost? | 4 |
| How did Marketing design the product evaluation? | 4 |
| Summarize the wrap up of the meeting. | 3 |
| What did the group discuss about user requirements of the new remote control? | 3 |
| What did the team discuss during the product evaluation? | 3 |
| Summarize the meeting. | 2 |
| Summarize what was said about digits form | 2 |
| What was discussed in the meeting? | 2 |
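A tally like the table above can be reproduced by counting the first sentence of each record. A self-contained sketch over a few inline examples (in practice you would iterate over the train split's `input` column, which is an assumption about the column name; the naive splitter here stands in for a real sentence splitter):

```python
from collections import Counter

def first_sentence(text: str) -> str:
    # Naive split on the first '.' or '?'; a proper sentence splitter
    # (e.g. nltk's sent_tokenize) is more robust in practice.
    for i, ch in enumerate(text):
        if ch in ".?":
            return text[: i + 1].strip()
    return text.strip()

# Stand-in records; real usage would read the dataset's train split.
docs = [
    "Summarize the whole meeting. PM: Right, let's get started ...",
    "Summarize the whole meeting. ID: Okay, so the prototype ...",
    "What was discussed in the meeting? UI: First off, the buttons ...",
]

counts = Counter(first_sentence(d) for d in docs)
```

`counts.most_common()` then yields the (prefix, count) pairs shown in the table.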
Visualized as a wordcloud (train split):