Dataset:

codeparrot/github-jupyter-text-code-pairs


This is a parsed version of github-jupyter-parsed, containing pairs of markdown text and code. The preprocessing script is provided in preprocessing.py. The data is deduplicated and consists of 451662 examples.
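To illustrate what a markdown and code pair looks like, here is a minimal sketch of how such pairs could be extracted from notebook cells and deduplicated. It is not the contents of preprocessing.py: the function names `extract_pairs` and `deduplicate`, the rule of pairing a markdown cell with the code cell directly after it, and exact-match deduplication are all assumptions made for illustration. The sample cells are made up and follow the usual Jupyter cell layout (`cell_type`, `source`).

```python
def extract_pairs(cells):
    """Pair each markdown cell with the code cell that directly follows it.

    Hypothetical helper: the real preprocessing script may pair cells differently.
    """
    pairs = []
    for prev, curr in zip(cells, cells[1:]):
        if prev["cell_type"] == "markdown" and curr["cell_type"] == "code":
            # A cell's source may be a string or a list of lines; join handles both.
            text = "".join(prev["source"]).strip()
            code = "".join(curr["source"]).strip()
            if text and code:
                pairs.append((text, code))
    return pairs


def deduplicate(pairs):
    """Drop exact duplicate pairs, keeping first occurrences in order (assumed strategy)."""
    seen = set()
    unique = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


# Made-up notebook cells, for illustration only.
cells = [
    {"cell_type": "markdown", "source": ["Load the data"]},
    {"cell_type": "code", "source": ["import pandas as pd\n", "df = pd.read_csv('data.csv')"]},
    {"cell_type": "markdown", "source": ["Load the data"]},
    {"cell_type": "code", "source": ["import pandas as pd\n", "df = pd.read_csv('data.csv')"]},
    {"cell_type": "code", "source": ["df.head()"]},
]

pairs = deduplicate(extract_pairs(cells))
for text, code in pairs:
    print("TEXT:", text)
    print("CODE:", code)
```

In this sketch the trailing `df.head()` cell is skipped because no markdown cell precedes it, and the repeated pair is kept only once.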

For a similar dataset pairing text with Python code, see the CoNaLa benchmark, built from StackOverflow, in which some samples were curated by annotators.