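Datasets: lexlms/lex_files
Languages: en
Multilinguality: monolingual
Size Categories: 1M<n<10M
Language Creators: found
Annotations Creators: no-annotation
Source Datasets: extended
ArXiv: arxiv:2305.07507
License: cc-by-nc-sa-4.0

LeXFiles is a new, diverse English multinational legal corpus. It comprises 11 distinct sub-corpora covering legislation and case law from 6 primarily English-speaking legal systems (EU, Council of Europe, Canada, US, UK, India). The corpus contains approximately 19 billion tokens. In comparison, the "Pile of Law" corpus released by Henderson et al. comprises 32 billion tokens in total, and the majority of its sub-corpora (26/30) come from the United States. As a whole it is therefore heavily biased towards the US legal system, and towards federal or state jurisdiction in particular.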
Corpus | Corpus alias | Documents | Tokens | Pct. | Sampl. (a=0.5) | Sampl. (a=0.2) |
---|---|---|---|---|---|---|
EU Legislation | eu-legislation | 93.7K | 233.7M | 1.2% | 5.0% | 8.0% |
EU Court Decisions | eu-court-cases | 29.8K | 178.5M | 0.9% | 4.3% | 7.6% |
ECtHR Decisions | ecthr-cases | 12.5K | 78.5M | 0.4% | 2.9% | 6.5% |
UK Legislation | uk-legislation | 52.5K | 143.6M | 0.7% | 3.9% | 7.3% |
UK Court Decisions | uk-court-cases | 47K | 368.4M | 1.9% | 6.2% | 8.8% |
Indian Court Decisions | indian-court-cases | 34.8K | 111.6M | 0.6% | 3.4% | 6.9% |
Canadian Legislation | canadian-legislation | 6K | 33.5M | 0.2% | 1.9% | 5.5% |
Canadian Court Decisions | canadian-court-cases | 11.3K | 33.1M | 0.2% | 1.8% | 5.4% |
U.S. Court Decisions [1] | court-listener | 4.6M | 11.4B | 59.2% | 34.7% | 17.5% |
U.S. Legislation | us-legislation | 518 | 1.4B | 7.4% | 12.3% | 11.5% |
U.S. Contracts | us-contracts | 622K | 5.3B | 27.3% | 23.6% | 15.0% |
Total | lexlms/lex_files | 5.8M | 18.8B | 100% | 100% | 100% |
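[1] We consider only U.S. court decisions from 1965 onwards (cf. post Civil Rights Act), as a hard threshold for excluding cases that rely on outdated and, in many cases, harmful legal standards. The other corpora consist of more recent documents.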
[2] Sampling (Sampl.) ratios are computed following the exponential sampling introduced by Lample et al. (2019).
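The sampling columns can be approximately reproduced from the token counts in the table. The following is a minimal sketch, assuming the usual form of exponential sampling, in which each sub-corpus share p_i is raised to the exponent a and the results are renormalized (q_i = p_i^a / Σ_j p_j^a). The names `TOKENS_M` and `sampling_ratios` are illustrative and are not part of the dataset or of any library. Because the token counts are the rounded values from the table, the computed ratios may differ from the published ones in the last digit.

```python
# Token counts in millions, copied from the table above (rounded values).
TOKENS_M = {
    "eu-legislation": 233.7,
    "eu-court-cases": 178.5,
    "ecthr-cases": 78.5,
    "uk-legislation": 143.6,
    "uk-court-cases": 368.4,
    "indian-court-cases": 111.6,
    "canadian-legislation": 33.5,
    "canadian-court-cases": 33.1,
    "court-listener": 11_400.0,
    "us-legislation": 1_400.0,
    "us-contracts": 5_300.0,
}


def sampling_ratios(token_counts, alpha):
    """Return exponentially smoothed sampling ratios.

    Assumed formula: q_i = p_i**alpha / sum_j(p_j**alpha), where p_i is the
    share of tokens in sub-corpus i. alpha=1 keeps the raw proportions;
    smaller alpha values up-weight the small sub-corpora.
    """
    total = sum(token_counts.values())
    weights = {
        name: (count / total) ** alpha for name, count in token_counts.items()
    }
    norm = sum(weights.values())
    return {name: weight / norm for name, weight in weights.items()}


if __name__ == "__main__":
    for alpha in (0.5, 0.2):
        ratios = sampling_ratios(TOKENS_M, alpha)
        print(f"a={alpha}")
        for name, ratio in ratios.items():
            print(f"  {name:<22} {ratio:6.1%}")
```

Lowering the exponent flattens the distribution: with a=0.2 the U.S. court decisions fall from well over half of the tokens to under a fifth of the sampled data, while the small Canadian sub-corpora rise to several percent each.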
Additional corpora that were not considered for pre-training, since they do not represent factual legal knowledge:
Corpus | Corpus alias | Documents | Tokens |
---|---|---|---|
Legal web pages from C4 | legal-c4 | 284K | 340M |
@inproceedings{chalkidis-garneau-etal-2023-lexlms,
    title = {{LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development}},
    author = "Chalkidis*, Ilias and Garneau*, Nicolas and Goanta, Catalina and Katz, Daniel Martin and Søgaard, Anders",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    month = june,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2305.07507",
}