Dataset: neulab/docprompting-conala
Tasks: Text2Text Generation
Languages: code
Multilinguality: monolingual
Source datasets: original
Other: code-generation, doc retrieval, retrieval augmented generation
License: mit

This is a re-split of the CoNaLa dataset. For each code snippet in the dev and test sets, at least one function is held out from the training set. The split is designed to test a code generation model's ability to generate unseen functions. We further ensure that examples from the same Stack Overflow post (the same question_id prefix before the -) are in the same split.
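The two split properties described above can be checked with a short sketch. This is not the authors' splitting code. It assumes the post identifier is the part of question_id before the first "-", and it assumes each example's functions are available as a set of names. The helper names and the toy records are hypothetical.

```python
from collections import defaultdict


def post_id(question_id: str) -> str:
    """Return the part of question_id before the first '-' (assumed post identifier)."""
    return question_id.split("-", 1)[0]


def posts_spanning_splits(splits):
    """Map each post id found in more than one split to the sorted names of those splits."""
    seen = defaultdict(set)
    for split_name, examples in splits.items():
        for example in examples:
            seen[post_id(example["question_id"])].add(split_name)
    return {pid: sorted(names) for pid, names in seen.items() if len(names) > 1}


def has_held_out_function(example_functions, train_functions):
    """True if at least one function used by the example never appears in training."""
    return any(fn not in train_functions for fn in example_functions)


# Toy, made-up records; real question_id values may look different.
toy_splits = {
    "train": [{"question_id": "100-1"}, {"question_id": "100-2"}],
    "validation": [{"question_id": "200-1"}],
    "test": [{"question_id": "300-1"}, {"question_id": "300-7"}],
}

leaks = posts_spanning_splits(toy_splits)
print(leaks)  # an empty dict here: no toy post appears in two splits
```

An empty result from posts_spanning_splits means no post leaks across splits. Under the held-out property, has_held_out_function should return True for every dev and test example.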
This dataset is used to evaluate code generation.
English - Python code.
from datasets import load_dataset

dataset = load_dataset("neulab/docprompting-conala")
DatasetDict({
    train: Dataset({
        features: ['nl', 'cmd', 'question_id', 'cmd_name', 'oracle_man', 'canonical_cmd'],
        num_rows: 2135
    })
    test: Dataset({
        features: ['nl', 'cmd', 'question_id', 'cmd_name', 'oracle_man', 'canonical_cmd'],
        num_rows: 543
    })
    validation: Dataset({
        features: ['nl', 'cmd', 'question_id', 'cmd_name', 'oracle_man', 'canonical_cmd'],
        num_rows: 201
    })
})

code_docs = load_dataset("neulab/docprompting-conala", "docs")
DatasetDict({
    train: Dataset({
        features: ['doc_id', 'doc_content'],
        num_rows: 34003
    })
})
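One way to use the two configurations together is to look up the documentation text for each example. The sketch below assumes that oracle_man holds a list of doc_id values pointing into the docs configuration, which the card does not state explicitly. The helper names, doc IDs, and toy rows are hypothetical, and it runs on in-memory rows rather than downloading the dataset.

```python
def build_doc_index(doc_rows):
    """Build a doc_id -> doc_content lookup from rows shaped like the docs configuration."""
    return {row["doc_id"]: row["doc_content"] for row in doc_rows}


def oracle_docs(example, doc_index):
    """Return the doc_content of every oracle_man id found in the index (unknown ids are skipped)."""
    return [doc_index[doc_id] for doc_id in example["oracle_man"] if doc_id in doc_index]


# Toy, made-up rows that mimic the feature names listed above.
toy_docs = [
    {"doc_id": "os.getcwd", "doc_content": "Return the current working directory."},
    {"doc_id": "os.listdir", "doc_content": "Return a list of the entries in a directory."},
]
toy_example = {
    "nl": "get the current working directory",
    "cmd": "os.getcwd()",
    "oracle_man": ["os.getcwd", "id.not.in.index"],
}

index = build_doc_index(toy_docs)
for content in oracle_docs(toy_example, index):
    print(content)
```

With the real data, the same helpers would presumably take the rows of code_docs["train"] and an example from dataset["test"], provided the oracle_man assumption holds.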
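train/dev/test: 2,135 / 201 / 543 examples, each with the fields nl, cmd, question_id, cmd_name, oracle_man, and canonical_cmd.
docs: 34,003 entries, each with the fields doc_id and doc_content.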
The dataset was crawled from Stack Overflow, automatically filtered, then curated by annotators. For more details, please refer to the original paper.
@article{zhou2022doccoder,
  title={DocCoder: Generating Code by Retrieving and Reading Docs},
  author={Zhou, Shuyan and Alon, Uri and Xu, Frank F and Jiang, Zhengbao and Neubig, Graham},
  journal={arXiv preprint arXiv:2207.05987},
  year={2022}
}