Dataset:
machelreid/m2d2
Preprint:
arxiv:2210.07370
License:
cc-by-nc-4.0
From the paper "M2D2: A Massively Multi-domain Language Modeling Dataset" (Reid et al., EMNLP 2022).
To load the dataset:
import datasets

# Replace "cs.CL" with the domain of your choice.
dataset = datasets.load_dataset("machelreid/m2d2", "cs.CL")
print(dataset['train'][0]['text'])
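Since each domain is loaded as its own configuration, working with several domains means calling `load_dataset` once per domain. The following is a minimal sketch of that pattern, not part of the dataset's official tooling. The helper names `load_domains` and `first_text` and the `load_fn` parameter are hypothetical, introduced only for illustration; the sketch also assumes every domain exposes a `train` split with a `text` field, as the `cs.CL` example does.

```python
from typing import Any, Callable, Dict, Iterable, Optional

DATASET_ID = "machelreid/m2d2"


def load_domains(
    domains: Iterable[str],
    load_fn: Optional[Callable[[str, str], Any]] = None,
) -> Dict[str, Any]:
    """Load one dataset object per domain name, keyed by domain.

    `load_fn` is a hypothetical hook so the loader can be swapped out
    (for example in offline tests); by default it is datasets.load_dataset.
    """
    if load_fn is None:
        # Imported lazily so the helper can be used without the library
        # when a custom load_fn is supplied.
        import datasets

        load_fn = datasets.load_dataset
    return {domain: load_fn(DATASET_ID, domain) for domain in domains}


def first_text(dataset: Any, split: str = "train") -> str:
    """Return the text of the first example in the given split.

    Assumes the split contains records with a "text" field, as in the
    cs.CL example above.
    """
    return dataset[split][0]["text"]
```

For example, `first_text(load_domains(["cs.CL"])["cs.CL"])` should return the same string that the snippet above prints. Any other domain name passed in has to be a configuration that actually exists for this dataset.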
If you find this data useful, please cite this work:
@article{reid2022m2d2,
  title   = {M2D2: A Massively Multi-domain Language Modeling Dataset},
  author  = {Machel Reid and Victor Zhong and Suchin Gururangan and Luke Zettlemoyer},
  year    = {2022},
  journal = {arXiv preprint arXiv:2210.07370}
}