Dataset:
neulab/mconala
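MCoNaLa is a multilingual code/natural-language challenge dataset containing 896 natural language-code (NL-Code) pairs in three natural languages: Spanish, Japanese, and Russian.
Spanish, Japanese, Russian; Python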
    from datasets import load_dataset

    # Spanish subset
    load_dataset("neulab/mconala", "es")
    DatasetDict({
        test: Dataset({
            features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
            num_rows: 341
        })
    })

    # Japanese subset
    load_dataset("neulab/mconala", "ja")
    DatasetDict({
        test: Dataset({
            features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
            num_rows: 210
        })
    })

    # Russian subset
    load_dataset("neulab/mconala", "ru")
    DatasetDict({
        test: Dataset({
            features: ['question_id', 'intent', 'rewritten_intent', 'snippet'],
            num_rows: 345
        })
    })
Field | Type | Description |
---|---|---|
question_id | int | Stack Overflow post ID of the sample |
intent | string | Title of the Stack Overflow post, used as the initial NL intent |
rewritten_intent | string | NL intent rewritten by human annotators |
snippet | string | Python code solution to the NL intent |
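
The fields above can be combined into an (intent, snippet) pair for evaluation. The following is a minimal sketch, not part of the dataset's tooling: the helper name `to_nl_code_pair`, the fallback to `intent` when `rewritten_intent` is empty, and the example record are all assumptions made for illustration.

```python
from typing import Optional, Tuple


def to_nl_code_pair(record: dict, prefer_rewritten: bool = True) -> Tuple[str, str]:
    """Return (natural-language intent, Python snippet) for one record.

    Hypothetical helper: prefers `rewritten_intent` and falls back to
    `intent` if the rewritten text is missing or empty (an assumption).
    """
    rewritten: Optional[str] = record.get("rewritten_intent")
    if prefer_rewritten and rewritten:
        nl = rewritten
    else:
        nl = record["intent"]
    return nl.strip(), record["snippet"]


# Hypothetical record for illustration only; not an actual dataset row.
example = {
    "question_id": 0,
    "intent": "<post title in Spanish, Japanese, or Russian>",
    "rewritten_intent": "<annotator-rewritten intent>",
    "snippet": "print('hello')",
}

nl, code = to_nl_code_pair(example)
print(nl)
print(code)
```

A row obtained from `load_dataset("neulab/mconala", "es")["test"]` is a plain dict with the same four keys, so it could be passed to this helper unchanged.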
The dataset contains 341 Spanish samples, 210 Japanese samples, and 345 Russian samples.
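
As a quick sanity check, the per-language counts can be summed and compared against the stated total of 896 pairs. This is a small sketch: the dictionary name `SUBSET_SIZES` is an assumption, and the counts are copied from the text above rather than read from the dataset.

```python
# Per-language test-split sizes as stated in this card (not loaded from the Hub).
SUBSET_SIZES = {"es": 341, "ja": 210, "ru": 345}

# Total number of NL-Code pairs across the three subsets.
total = sum(SUBSET_SIZES.values())

# Share of each language in percent, rounded to one decimal place.
shares = {lang: round(100 * n / total, 1) for lang, n in SUBSET_SIZES.items()}

print(total)
print(shares)
```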
    @article{wang2022mconala,
      title={MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages},
      author={Zhiruo Wang and Grace Cuenca and Shuyan Zhou and Frank F. Xu and Graham Neubig},
      journal={arXiv preprint arXiv:2203.08388},
      year={2022}
    }