Dataset:

codeparrot/codeparrot-valid-near-deduplication


CodeParrot 🦜 Dataset after near deduplication (validation)

Dataset Description

A dataset of Python files from GitHub. We performed near deduplication of the codeparrot-clean-train split from codeparrot-clean. Exact deduplication can miss a fair amount of nearly identical files. We used MinHash with a Jaccard threshold (default=0.85) to create duplicate clusters. These clusters are then reduced to unique files based on the exact Jaccard similarity. For more details, please refer to this repo.
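The two-stage idea described above (cluster candidates with MinHash, then reduce each cluster using exact Jaccard similarity) can be illustrated with a minimal, standard-library-only sketch. It is not the code from the referenced repo: the tokenization, the number of hash functions (`NUM_PERM`), the all-pairs comparison, and every function name here are assumptions made for illustration. A pipeline at dataset scale would typically use locality-sensitive hashing instead of comparing every pair.

```python
import hashlib
import re
from itertools import combinations

NUM_PERM = 128    # number of hash functions; an assumed value
THRESHOLD = 0.85  # Jaccard threshold quoted in the description


def tokens(code):
    """Split source text into a set of identifier-like tokens (assumed scheme)."""
    return {t for t in re.split(r"[^A-Za-z0-9_]+", code) if t}


def minhash(token_set, num_perm=NUM_PERM):
    """MinHash signature: the smallest salted 64-bit hash for each seed."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            (int.from_bytes(
                hashlib.blake2b(t.encode(), digest_size=8, salt=salt).digest(),
                "little")
             for t in token_set),
            default=0,  # empty files all get the same signature
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_deduplicate(files, threshold=THRESHOLD):
    """Return the sorted indices of the files kept after near deduplication."""
    token_sets = [tokens(f) for f in files]
    signatures = [minhash(ts) for ts in token_sets]

    # Stage 1: build duplicate clusters (union-find) from MinHash estimates.
    parent = list(range(len(files)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(files)), 2):
        if estimated_jaccard(signatures[i], signatures[j]) >= threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(files)):
        clusters.setdefault(find(i), []).append(i)

    # Stage 2: inside each cluster, keep a file only if its exact Jaccard
    # similarity to every file already kept is below the threshold.
    kept = []
    for members in clusters.values():
        cluster_kept = []
        for i in members:
            if all(exact_jaccard(token_sets[i], token_sets[k]) < threshold
                   for k in cluster_kept):
                cluster_kept.append(i)
        kept.extend(cluster_kept)
    return sorted(kept)


files = [
    "def add(a, b):\n    return a + b\n",
    "def add(a,b):\n\treturn a+b\n",      # differs only in whitespace
    "import os\nprint(os.getcwd())\n",
]
print(near_deduplicate(files))  # [0, 2]
```

The second file differs from the first only in whitespace, so a byte-level exact deduplication would keep both, while the token-based comparison treats them as duplicates and keeps one.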