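Dataset:
lex_glue
Inspired by the recent widespread use of the GLUE multi-task benchmark NLP dataset (Wang et al., 2018), the subsequent and more difficult SuperGLUE (Wang et al., 2019), other earlier multi-task NLP benchmarks (Conneau and Kiela, 2018; McCann et al., 2018), and similar initiatives in other domains (Peng et al., 2019), we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a benchmark dataset for evaluating the performance of NLP methods on legal tasks. LexGLUE is based on seven existing legal NLP datasets, selected using criteria drawn largely from SuperGLUE.
As with GLUE and SuperGLUE (Wang et al., 2019b,a), one of our goals is to push towards generic (or "foundation") models that can cope with multiple NLP tasks, in our case legal NLP tasks, possibly with limited task-specific fine-tuning. Another goal is to provide a convenient and informative entry point for NLP researchers and practitioners who wish to explore or develop methods for legal NLP. With these goals in mind, the datasets included in LexGLUE and the tasks they address have been simplified in several ways, so that newcomers and generic models can address all tasks more easily.
The LexGLUE benchmark is accompanied by experimental infrastructure that relies on the Hugging Face Transformers library and resides at: https://github.com/coastalcph/lex-glue.
The supported tasks are the following: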
Dataset | Source | Sub-domain | Task Type | Classes |
ECtHR (Task A) | 1239321 | ECHR | Multi-label classification | 10+1 |
ECtHR (Task B) | 12310321 | ECHR | Multi-label classification | 10+1 |
SCOTUS | 12311321 | US Law | Multi-class classification | 14 |
EUR-LEX | 12312321 | EU Law | Multi-label classification | 100 |
LEDGAR | 12313321 | Contracts | Multi-class classification | 100 |
UNFAIR-ToS | 12314321 | Contracts | Multi-label classification | 8+1 |
CaseHOLD | 12315321 | US Law | Multiple choice QA | n/a |
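A minimal sketch of how a single task might be loaded with the Hugging Face `datasets` library. The configuration names are the section labels used in this card (`ecthr_a` is inferred by analogy with `ecthr_b`), and the default identifier `lex_glue` is taken from the card title. Both are assumptions: newer `datasets` releases may require a namespaced identifier or additional arguments.

```python
# Task types copied from the table above; keys are the configuration names
# used as section labels in this card ("ecthr_a" is inferred, not stated).
LEXGLUE_TASKS = {
    "ecthr_a": "Multi-label classification",
    "ecthr_b": "Multi-label classification",
    "scotus": "Multi-class classification",
    "eurlex": "Multi-label classification",
    "ledgar": "Multi-class classification",
    "unfair_tos": "Multi-label classification",
    "case_hold": "Multiple choice QA",
}


def load_lexglue_task(config_name, dataset_id="lex_glue"):
    """Load one LexGLUE task; `dataset_id` is an assumption (see the note above)."""
    if config_name not in LEXGLUE_TASKS:
        raise ValueError(f"Unknown LexGLUE configuration: {config_name!r}")
    # Imported lazily so the metadata above is usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(dataset_id, config_name)
```

Calling `load_lexglue_task("scotus")` needs network access and an installed `datasets` package; the helper name itself is hypothetical and only illustrates the pattern.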
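ecthr_a
The European Court of Human Rights (ECtHR) hears allegations that a state has breached human rights provisions of the European Convention on Human Rights (ECHR). For each case, the dataset provides a list of factual paragraphs (facts) from the case description. Each case is mapped to the ECHR articles that were violated (if any).
ecthr_b
The European Court of Human Rights (ECtHR) hears allegations that a state has breached human rights provisions of the European Convention on Human Rights (ECHR). For each case, the dataset provides a list of factual paragraphs (facts) from the case description. Each case is mapped to the ECHR articles that were allegedly violated (that is, considered by the court).
scotus
The US Supreme Court (SCOTUS) is the highest federal court in the United States and generally hears only the most controversial or otherwise complex cases that have not been sufficiently resolved by lower courts. This is a single-label multi-class classification task: given a document (a court opinion), the task is to predict the relevant issue area. The 14 issue areas cluster 278 issues whose focus is the subject matter of the controversy (dispute).
eurlex
European Union (EU) legislation is published on the EUR-Lex portal. All EU laws are annotated by the EU's Publications Office with multiple concepts from the EuroVoc thesaurus. EuroVoc is a multilingual thesaurus maintained by the Publications Office; it contains more than 7,000 concepts referring to various activities of the EU and its Member States (e.g., economics, health care, trade). Given a document, the task is to predict its EuroVoc labels (concepts).
ledgar
The LEDGAR dataset targets contract provision (paragraph) classification. The contract provisions come from US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.
unfair_tos
The UNFAIR-ToS dataset contains 50 Terms of Service (ToS) from online platforms (e.g., YouTube, eBay, Facebook). The dataset has been annotated at the sentence level for unfair contractual terms (sentences), meaning terms that potentially violate user rights according to European consumer law.
case_hold
The CaseHOLD (Case Holdings on Legal Decisions) dataset includes multiple-choice questions about holdings of US court cases from the Harvard Law Library case law corpus. Holdings are short summaries of legal rulings that accompany referenced decisions relevant to the present case. The input consists of an excerpt (or prompt) from a court decision that contains a reference to a particular case, with the holding statement masked out. The model must identify the correct (masked) holding statement from a selection of five choices.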
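For the multi-label tasks, a list of label ids is commonly turned into a fixed-length 0/1 vector before training. The sketch below shows one way to do this. The function name and the optional extra "no label" slot are illustrative assumptions; in particular, reading the "+1" in class counts such as "10+1" as a dedicated slot for documents without any label is an interpretation, not something this card states.

```python
def to_multi_hot(label_ids, num_labels, add_no_label_class=False):
    """Turn a list of label ids into a 0/1 vector of length num_labels (+1)."""
    size = num_labels + 1 if add_no_label_class else num_labels
    vector = [0] * size
    for label_id in label_ids:
        if not 0 <= label_id < num_labels:
            raise ValueError(f"label id {label_id} outside [0, {num_labels})")
        vector[label_id] = 1
    # Optional extra slot (assumed convention): set only when no label applies.
    if add_no_label_class and not label_ids:
        vector[num_labels] = 1
    return vector


# A document tagged with label ids 5 and 6 out of 10 possible labels.
print(to_multi_hot([5, 6], num_labels=10))
# A document with no labels, using the assumed extra slot.
print(to_multi_hot([], num_labels=10, add_no_label_class=True))
```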
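The current leaderboard includes several Transformer-based (Vaswani et al., 2017) pre-trained language models, which achieve state-of-the-art performance in most NLP tasks (Bommasani et al., 2021) and NLU benchmarks (Wang et al., 2019a). Results reported by Chalkidis et al. (2021):
Task-wise Test Results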
Dataset | ECtHR A | ECtHR B | SCOTUS | EUR-LEX | LEDGAR | UNFAIR-ToS | CaseHOLD |
Model | μ-F1 / m-F1 | μ-F1 / m-F1 | μ-F1 / m-F1 | μ-F1 / m-F1 | μ-F1 / m-F1 | μ-F1 / m-F1 | μ-F1 / m-F1 |
TFIDF+SVM | 64.7 / 51.7 | 74.6 / 65.1 | 78.2 / 69.5 | 71.3 / 51.4 | 87.2 / 82.4 | 95.4 / 78.8 | n/a |
Medium-sized Models (L=12, H=768, A=12) | |||||||
BERT | 71.2 / 63.6 | 79.7 / 73.4 | 68.3 / 58.3 | 71.4 / 57.2 | 87.6 / 81.8 | 95.6 / 81.3 | 70.8 |
RoBERTa | 69.2 / 59.0 | 77.3 / 68.9 | 71.6 / 62.0 | 71.9 / 57.9 | 87.9 / 82.3 | 95.2 / 79.2 | 71.4 |
DeBERTa | 70.0 / 60.8 | 78.8 / 71.0 | 71.1 / 62.7 | 72.1 / 57.4 | 88.2 / 83.1 | 95.5 / 80.3 | 72.6 |
Longformer | 69.9 / 64.7 | 79.4 / 71.7 | 72.9 / 64.0 | 71.6 / 57.7 | 88.2 / 83.0 | 95.5 / 80.9 | 71.9 |
BigBird | 70.0 / 62.9 | 78.8 / 70.9 | 72.8 / 62.0 | 71.5 / 56.8 | 87.8 / 82.6 | 95.7 / 81.3 | 70.8 |
Legal-BERT | 70.0 / 64.0 | 80.4 / 74.7 | 76.4 / 66.5 | 72.1 / 57.4 | 88.2 / 83.0 | 96.0 / 83.0 | 75.3 |
CaseLaw-BERT | 69.8 / 62.9 | 78.8 / 70.3 | 76.6 / 65.9 | 70.7 / 56.6 | 88.3 / 83.0 | 96.0 / 82.3 | 75.4 |
Large-sized Models (L=24, H=1024, A=16) | |||||||
RoBERTa | 73.8 / 67.6 | 79.8 / 71.6 | 75.5 / 66.3 | 67.9 / 50.3 | 88.6 / 83.6 | 95.8 / 81.6 | 74.4 |
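The table reports two scores per task. Reading μ-F1 as micro-averaged F1 (counts pooled over all labels) and m-F1 as macro-averaged F1 (unweighted mean of per-label F1), the difference can be sketched in plain Python as follows. The function names are hypothetical, and scoring a label with no true or predicted instances as 0.0 is an assumed convention; the benchmark's own evaluation code may handle that case differently.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred, num_labels):
    """y_true / y_pred: one collection of label ids per document."""
    tp = [0] * num_labels
    fp = [0] * num_labels
    fn = [0] * num_labels
    for gold, pred in zip(y_true, y_pred):
        gold, pred = set(gold), set(pred)
        for label in range(num_labels):
            if label in pred and label in gold:
                tp[label] += 1
            elif label in pred:
                fp[label] += 1
            elif label in gold:
                fn[label] += 1
    # Micro: pool the counts over labels, then compute a single F1.
    micro = f1_from_counts(sum(tp), sum(fp), sum(fn))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1_from_counts(t, p, n) for t, p, n in zip(tp, fp, fn)) / num_labels
    return micro, macro


# Toy data: label 0 is frequent and always right, label 1 is rare and always wrong.
gold = [[0], [0, 1], [2], [0]]
pred = [[0], [0], [2, 1], [0]]
micro, macro = micro_macro_f1(gold, pred, num_labels=3)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the rare label drags the macro score down much more than the micro score, which is the usual reason both are reported for datasets with skewed label distributions.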
Averaged (Mean over Tasks) Test Results
Averaging | Arithmetic | Harmonic | Geometric |
Model | μ-F1 / m-F1 | μ-F1 / m-F1 | μ-F1 / m-F1 |
Medium-sized Models (L=12, H=768, A=12) | |||
BERT | 77.8 / 69.5 | 76.7 / 68.2 | 77.2 / 68.8 |
RoBERTa | 77.8 / 68.7 | 76.8 / 67.5 | 77.3 / 68.1 |
DeBERTa | 78.3 / 69.7 | 77.4 / 68.5 | 77.8 / 69.1 |
Longformer | 78.5 / 70.5 | 77.5 / 69.5 | 78.0 / 70.0 |
BigBird | 78.2 / 69.6 | 77.2 / 68.5 | 77.7 / 69.0 |
Legal-BERT | 79.8 / 72.0 | 78.9 / 70.8 | 79.3 / 71.4 |
CaseLaw-BERT | 79.4 / 70.9 | 78.5 / 69.7 | 78.9 / 70.3 |
Large-sized Models (L=24, H=1024, A=16) | |||
RoBERTa | 79.4 / 70.8 | 78.4 / 69.1 | 78.9 / 70.0 |
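As a worked example of the three averaging schemes, the sketch below recomputes the averages from BERT's per-task μ-F1 scores in the task-wise table, using Python's standard `statistics` module. Treating the single CaseHOLD score as the seventh value is an assumption about how the averages were formed, although the results appear consistent with the BERT row above.

```python
from statistics import fmean, geometric_mean, harmonic_mean

# BERT's per-task μ-F1 scores, copied from the task-wise table
# (ECtHR A, ECtHR B, SCOTUS, EUR-LEX, LEDGAR, UNFAIR-ToS, CaseHOLD).
bert_micro_f1 = [71.2, 79.7, 68.3, 71.4, 87.6, 95.6, 70.8]

averages = {
    "arithmetic": fmean(bert_micro_f1),
    "harmonic": harmonic_mean(bert_micro_f1),
    "geometric": geometric_mean(bert_micro_f1),
}

for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

Rounded to one decimal, these come out at 77.8, 76.7 and 77.2. The harmonic mean is the lowest of the three because it penalizes weak tasks most heavily, and the geometric mean sits in between.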
We only consider English datasets, to make experimentation easier for researchers across the globe.
ecthr_a
An example of 'train' looks as follows.
{ "text": ["8. The applicant was arrested in the early morning of 21 October 1990 ...", ...], "labels": [6] }
ecthr_b
An example of 'train' looks as follows.
{ "text": ["8. The applicant was arrested in the early morning of 21 October 1990 ...", ...], "labels": [5, 6] }
scotus
An example of 'train' looks as follows.
{ "text": "Per Curiam\nSUPREME COURT OF THE UNITED STATES\nRANDY WHITE, WARDEN v. ROGER L. WHEELER\n Decided December 14, 2015\nPER CURIAM.\nA death sentence imposed by a Kentucky trial court and\naffirmed by the ...", "label": 8 }
eurlex
An example of 'train' looks as follows.
{ "text": "COMMISSION REGULATION (EC) No 1629/96 of 13 August 1996 on an invitation to tender for the refund on export of wholly milled round grain rice to certain third countries ...", "labels": [4, 20, 21, 35, 68] }
ledgar
An example of 'train' looks as follows.
{ "text": "All Taxes shall be the financial responsibility of the party obligated to pay such Taxes as determined by applicable law and neither party is or shall be liable at any time for any of the other party ...", "label": 32 }
unfair_tos
An example of 'train' looks as follows.
{ "text": "tinder may terminate your account at any time without notice if it believes that you have violated this agreement.", "label": 2 }
case_hold
An example of 'test' looks as follows.
{ "context": "In Granato v. City and County of Denver, No. CIV 11-0304 MSK/BNB, 2011 WL 3820730 (D.Colo. Aug. 20, 2011), the Honorable Marcia S. Krieger, now-Chief United States District Judge for the District of Colorado, ruled similarly: At a minimum, a party asserting a Mo-nell claim must plead sufficient facts to identify ... to act pursuant to City or State policy, custom, decision, ordinance, re d 503, 506-07 (3d Cir.l985)(<HOLDING>).", "endings": ["holding that courts are to accept allegations in the complaint as being true including monell policies and writing that a federal court reviewing the sufficiency of a complaint has a limited task", "holding that for purposes of a class certification motion the court must accept as true all factual allegations in the complaint and may draw reasonable inferences therefrom", "recognizing that the allegations of the complaint must be accepted as true on a threshold motion to dismiss", "holding that a court need not accept as true conclusory allegations which are contradicted by documents referred to in the complaint", "holding that where the defendant was in default the district court correctly accepted the fact allegations of the complaint as true" ], "label": 0 }
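A record with `context`, `endings` and `label` fields, as in the multiple-choice example above, is often handled by pairing the shared context with each candidate ending, scoring every pair, and picking the highest-scoring one. The sketch below illustrates that pattern only. The helper names, the shortened record and the scores are made up for illustration and do not come from the benchmark's code.

```python
def build_choice_pairs(example):
    """Pair the shared context with each candidate holding."""
    return [(example["context"], ending) for ending in example["endings"]]


def pick_answer(scores):
    """Return the index of the highest-scoring candidate."""
    return max(range(len(scores)), key=scores.__getitem__)


# Shortened, hypothetical record with the same fields as the example above.
example = {
    "context": "... the court ruled similarly (<HOLDING>).",
    "endings": ["holding A", "holding B", "holding C", "holding D", "holding E"],
    "label": 0,
}

pairs = build_choice_pairs(example)
# Placeholder scores standing in for a model's output, one per (context, ending) pair.
scores = [0.9, 0.1, 0.3, 0.2, 0.4]
prediction = pick_answer(scores)
print(len(pairs), prediction, prediction == example["label"])
```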
Dataset | Training | Development | Test | Total |
ECtHR (Task A) | 9,000 | 1,000 | 1,000 | 11,000 |
ECtHR (Task B) | 9,000 | 1,000 | 1,000 | 11,000 |
SCOTUS | 5,000 | 1,400 | 1,400 | 7,800 |
EUR-LEX | 55,000 | 5,000 | 5,000 | 65,000 |
LEDGAR | 60,000 | 10,000 | 10,000 | 80,000 |
UNFAIR-ToS | 5,532 | 2,275 | 1,607 | 9,414 |
CaseHOLD | 45,000 | 3,900 | 3,900 | 52,800 |
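The split sizes above can be checked and turned into proportions with a few lines of Python. The numbers are copied from the table; the variable names are illustrative.

```python
# (training, development, test, total) per dataset, copied from the table above.
SPLITS = {
    "ECtHR (Task A)": (9_000, 1_000, 1_000, 11_000),
    "ECtHR (Task B)": (9_000, 1_000, 1_000, 11_000),
    "SCOTUS": (5_000, 1_400, 1_400, 7_800),
    "EUR-LEX": (55_000, 5_000, 5_000, 65_000),
    "LEDGAR": (60_000, 10_000, 10_000, 80_000),
    "UNFAIR-ToS": (5_532, 2_275, 1_607, 9_414),
    "CaseHOLD": (45_000, 3_900, 3_900, 52_800),
}

for name, (train, dev, test, total) in SPLITS.items():
    # Each row's three splits should add up to the stated total.
    assert train + dev + test == total, name
    print(f"{name}: train {train / total:.1%}, dev {dev / total:.1%}, test {test / total:.1%}")
```

The training share varies noticeably between tasks, which is worth keeping in mind when comparing scores on the smaller datasets.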
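Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. 2022. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland.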
@inproceedings{chalkidis-etal-2021-lexglue,
  title={LexGLUE: A Benchmark Dataset for Legal Language Understanding in English},
  author={Chalkidis, Ilias and Jana, Abhik and Hartung, Dirk and Bommarito, Michael and Androutsopoulos, Ion and Katz, Daniel Martin and Aletras, Nikolaos},
  year={2022},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
  address={Dublin, Ireland},
}
Thanks to @iliaschalkidis for adding this dataset.