Dataset: lighteval/mmlu
Tasks:
Sub-tasks: multiple-choice-qa
Languages: English
Multilinguality: monolingual
Size categories: 10K<n<100K
Language creators: expert-generated
Annotation creators: no-annotation
Source datasets: original
License:
Created by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021).
This is a massive multitask test consisting of multiple-choice questions from a wide range of fields. The test covers the humanities, social sciences, hard sciences, and other important areas of knowledge. It comprises 57 tasks, including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem-solving ability.
Full list of tasks: ['abstract_algebra', 'anatomy', 'astronomy', 'business_ethics', 'clinical_knowledge', 'college_biology', 'college_chemistry', 'college_computer_science', 'college_mathematics', 'college_medicine', 'college_physics', 'computer_security', 'conceptual_physics', 'econometrics', 'electrical_engineering', 'elementary_mathematics', 'formal_logic', 'global_facts', 'high_school_biology', 'high_school_chemistry', 'high_school_computer_science', 'high_school_european_history', 'high_school_geography', 'high_school_government_and_politics', 'high_school_macroeconomics', 'high_school_mathematics', 'high_school_microeconomics', 'high_school_physics', 'high_school_psychology', 'high_school_statistics', 'high_school_us_history', 'high_school_world_history', 'human_aging', 'human_sexuality', 'international_law', 'jurisprudence', 'logical_fallacies', 'machine_learning', 'management', 'marketing', 'medical_genetics', 'miscellaneous', 'moral_disputes', 'moral_scenarios', 'nutrition', 'philosophy', 'prehistory', 'professional_accounting', 'professional_law', 'professional_medicine', 'professional_psychology', 'public_relations', 'security_studies', 'sociology', 'us_foreign_policy', 'virology', 'world_religions']
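As a minimal illustration (not part of the original card), each task above can be loaded with the `datasets` library, assuming the configuration names on the Hub match the task identifiers in the list:

```python
# Minimal sketch: load one MMLU subject with the `datasets` library.
# Assumes the Hub config names match the task identifiers listed above
# and that a "test" split is exposed; adjust names if the repo differs.
from datasets import load_dataset

anatomy = load_dataset("lighteval/mmlu", "anatomy")
print(anatomy)             # available splits and their sizes
print(anatomy["test"][0])  # a single multiple-choice example
```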
| Model | Authors | Humanities | Social Science | STEM | Other | Average |
|---|---|---|---|---|---|---|
| UnifiedQA | Khashabi et al., 2020 | 45.6 | 56.6 | 40.2 | 54.6 | 48.9 |
| GPT-3 (few-shot) | Brown et al., 2020 | 40.8 | 50.4 | 36.7 | 48.8 | 43.9 |
| GPT-2 | Radford et al., 2019 | 32.8 | 33.3 | 30.2 | 33.1 | 32.4 |
| Random Baseline | N/A | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
English.
An example from the anatomy subtask looks as follows:
{ "question": "What is the embryological origin of the hyoid bone?", "choices": ["The first pharyngeal arch", "The first and second pharyngeal arches", "The second pharyngeal arch", "The second and third pharyngeal arches"], "answer": "D" }
|  | auxiliary_train | dev | val | test |
|---|---|---|---|---|
| TOTAL | 99842 | 285 | 1531 | 14042 |
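The split sizes above can be checked programmatically. The sketch below is an assumption-laden example: it presumes an aggregate "all" configuration exists and that the split names follow the table (the validation split may be exposed as "validation" rather than "val" in the actual repository).

```python
# Sketch: print the number of rows in each split of the aggregate config.
# The "all" config name and the exact split names are assumptions.
from datasets import load_dataset

mmlu_all = load_dataset("lighteval/mmlu", "all")
for split_name, split in mmlu_all.items():
    print(f"{split_name}: {len(split)} examples")
```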
Transformer models have driven recent progress by pretraining on massive text corpora, including all of Wikipedia, thousands of books, and the contents of numerous websites. These models consequently see extensive information about specialized topics, most of which is not assessed by existing NLP benchmarks. To bridge the gap between the wide-ranging knowledge that pretrained models are exposed to and the existing measures of success, we introduce a new benchmark for assessing models across a diverse set of subjects that humans learn.
Initial data collection and normalization.
[More Information Needed]
Who are the source language producers? [More Information Needed]
[More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
If you find this dataset useful in your research, please consider citing the test as well as the ETHICS dataset it draws from:
@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}
Thanks to @andyzoujm for adding this dataset.