Dataset: lexlms/lex_files

Language: English

Dataset Card for "LexFiles"

Dataset Summary

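LexFiles is a new, diverse English multinational legal corpus. It comprises 11 distinct sub-corpora covering legislation and case law from 6 primarily English-speaking legal systems (EU, Council of Europe, Canada, US, UK, India). The corpus contains approximately 19 billion tokens. By comparison, the "Pile of Law" corpus released by Henderson et al. contains 32 billion tokens in total, and the majority of its sub-corpora (26/30) come from the United States. As a result, that corpus as a whole is significantly biased towards the US legal system in general, and towards federal or state jurisdiction in particular.

Dataset Specifications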

| Corpus | Corpus alias | Documents | Tokens | Pct. | Sampl. (a=0.5) [2] | Sampl. (a=0.2) [2] |
|---|---|---|---|---|---|---|
| EU Legislation | eu-legislation | 93.7K | 233.7M | 1.2% | 5.0% | 8.0% |
| EU Court Decisions | eu-court-cases | 29.8K | 178.5M | 0.9% | 4.3% | 7.6% |
| ECtHR Decisions | ecthr-cases | 12.5K | 78.5M | 0.4% | 2.9% | 6.5% |
| UK Legislation | uk-legislation | 52.5K | 143.6M | 0.7% | 3.9% | 7.3% |
| UK Court Decisions | uk-court-cases | 47K | 368.4M | 1.9% | 6.2% | 8.8% |
| Indian Court Decisions | indian-court-cases | 34.8K | 111.6M | 0.6% | 3.4% | 6.9% |
| Canadian Legislation | canadian-legislation | 6K | 33.5M | 0.2% | 1.9% | 5.5% |
| Canadian Court Decisions | canadian-court-cases | 11.3K | 33.1M | 0.2% | 1.8% | 5.4% |
| U.S. Court Decisions [1] | court-listener | 4.6M | 11.4B | 59.2% | 34.7% | 17.5% |
| U.S. Legislation | us-legislation | 518 | 1.4B | 7.4% | 12.3% | 11.5% |
| U.S. Contracts | us-contracts | 622K | 5.3B | 27.3% | 23.6% | 15.0% |
| Total | lexlms/lex_files | 5.8M | 18.8B | 100% | 100% | 100% |

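[1] We consider only U.S. court decisions from 1965 onwards (cf. post Civil Rights Act), as a hard threshold that excludes cases relying on severely outdated and, in many cases, harmful legal standards. The other corpora consist of more recent documents.

[2] Sampling (Sampl.) ratios are computed following the exponential sampling introduced by Lample et al. (2019).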
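As an illustration of footnote [2], the sketch below recomputes sampling ratios from the token counts in the table. It assumes the usual form of exponential smoothing, q_i = p_i^a / sum_j p_j^a, where p_i is a sub-corpus's share of all tokens and `a` is the exponent shown in the column headers. The names `TOKENS_M` and `sampling_ratios` are hypothetical helpers introduced here, not part of the dataset or its tooling.

```python
# Token counts in millions, copied from the Tokens column (rounded as printed).
TOKENS_M = {
    "eu-legislation": 233.7,
    "eu-court-cases": 178.5,
    "ecthr-cases": 78.5,
    "uk-legislation": 143.6,
    "uk-court-cases": 368.4,
    "indian-court-cases": 111.6,
    "canadian-legislation": 33.5,
    "canadian-court-cases": 33.1,
    "court-listener": 11_400.0,
    "us-legislation": 1_400.0,
    "us-contracts": 5_300.0,
}


def sampling_ratios(token_counts, alpha):
    """Return smoothed sampling ratios q_i = p_i**alpha / sum_j p_j**alpha.

    Assumption: this is the smoothing meant by "exponential sampling";
    alpha=1.0 reproduces the plain token share (the Pct. column).
    """
    total = sum(token_counts.values())
    weights = {name: (count / total) ** alpha for name, count in token_counts.items()}
    norm = sum(weights.values())
    return {name: weight / norm for name, weight in weights.items()}


for alpha in (1.0, 0.5, 0.2):
    ratios = sampling_ratios(TOKENS_M, alpha)
    print(f"a={alpha}")
    for name, ratio in ratios.items():
        print(f"  {name:<24} {ratio:6.1%}")
```

Because the inputs are the rounded figures from the table, the results should land close to the published columns but may differ slightly in the last digit. Lower values of `a` flatten the distribution, so the small sub-corpora are sampled more often than their raw token share would suggest.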

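Additional corpora not considered for pre-training, since they do not represent factual legal knowledge:

| Corpus | Corpus alias | Documents | Tokens |
|---|---|---|---|
| Legal web pages from C4 | legal-c4 | 284K | 340M |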
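A minimal sketch of loading a single sub-corpus with the Hugging Face `datasets` library follows. It assumes that each corpus alias in the first table is also a configuration name of `lexlms/lex_files` and that a `train` split exists. Neither is confirmed by this card, so check the dataset page before relying on it. `CORPUS_ALIASES` and `load_subcorpus` are hypothetical names introduced for illustration.

```python
# The 11 pre-training sub-corpus aliases listed in the specification table.
CORPUS_ALIASES = (
    "eu-legislation",
    "eu-court-cases",
    "ecthr-cases",
    "uk-legislation",
    "uk-court-cases",
    "indian-court-cases",
    "canadian-legislation",
    "canadian-court-cases",
    "court-listener",
    "us-legislation",
    "us-contracts",
)


def load_subcorpus(alias, split="train", streaming=True):
    """Load one LexFiles sub-corpus by its alias.

    Assumptions: the alias doubles as the configuration name, and the
    requested split exists. Streaming avoids downloading a whole
    sub-corpus up front, which matters for the multi-billion-token ones.
    """
    if alias not in CORPUS_ALIASES:
        raise ValueError(f"unknown corpus alias: {alias!r}")
    # Imported lazily so the alias check works without the library installed.
    from datasets import load_dataset

    return load_dataset("lexlms/lex_files", alias, split=split, streaming=streaming)
```

A call such as `load_subcorpus("ecthr-cases")` would then return an iterable dataset under these assumptions. The names and types of the record fields are not stated in this card, so inspect the first record before writing any processing code.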

Citation

Ilias Chalkidis*, Nicolas Garneau*, Catalina E.C. Goanta, Daniel Martin Katz, and Anders Søgaard. LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development. 2023. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada.

@inproceedings{chalkidis-garneau-etal-2023-lexlms,
    title = {{LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development}},
    author = "Chalkidis*, Ilias and 
              Garneau*, Nicolas and
              Goanta, Catalina and 
              Katz, Daniel Martin and 
              Søgaard, Anders",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    month = jun,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2305.07507",
}