Model:
casehold/legalbert
Legal-BERT model and tokenizer files from When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings.
The pretraining corpus was constructed by ingesting the entire Harvard Law case corpus from 1965 to the present ( https://case.law/ ). This corpus is large (37GB), comprising 3,446,187 legal decisions from all federal and state courts, and is larger than the BookCorpus/Wikipedia corpus (15GB) originally used to train BERT.
The model was initialized with the base BERT model (uncased, 110M parameters), bert-base-uncased, and trained for an additional 1M steps on the MLM and NSP objectives, with tokenization and sentence segmentation adapted to legal text (see the paper for details).
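For a quick check of the checkpoint, the model and tokenizer can be loaded with the Hugging Face Transformers auto classes. The snippet below is a minimal sketch, not part of the original card: the model id `casehold/legalbert` comes from this card, while the example sentence is purely illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the Legal-BERT tokenizer and MLM head from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("casehold/legalbert")
model = AutoModelForMaskedLM.from_pretrained("casehold/legalbert")

# Predict a masked token in an (illustrative) legal sentence.
inputs = tokenizer("The court granted the motion for summary [MASK].", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Decode the highest-scoring token at the [MASK] position.
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_idx].argmax(-1)
print(tokenizer.decode(predicted_id))
```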
Please refer to the scripts in the casehold repository for computing Legal-BERT's pretraining loss and for fine-tuning it on the classification and multiple-choice tasks described in the paper: Overruling, Terms of Service, and CaseHOLD.
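As a rough illustration of such a fine-tuning setup (a sketch, not the casehold repository's actual scripts), the example below fine-tunes Legal-BERT on a binary classification task such as Overruling with the Transformers `Trainer`. The CSV file names and the `text`/`label` column names are placeholders; substitute the task data from the repository.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Hypothetical file paths and column names; replace with the task data
# released alongside the casehold repository.
dataset = load_dataset(
    "csv",
    data_files={"train": "overruling_train.csv", "validation": "overruling_dev.csv"},
)

tokenizer = AutoTokenizer.from_pretrained("casehold/legalbert")
model = AutoModelForSequenceClassification.from_pretrained("casehold/legalbert", num_labels=2)

def tokenize(batch):
    # Assumes the CSV has a "text" column; the "label" column is passed
    # through and renamed to "labels" by the default data collator.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="legalbert-overruling",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```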
```bibtex
@inproceedings{zhengguha2021,
  title={When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset},
  author={Lucia Zheng and Neel Guha and Brandon R. Anderson and Peter Henderson and Daniel E. Ho},
  year={2021},
  eprint={2104.08671},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  booktitle={Proceedings of the 18th International Conference on Artificial Intelligence and Law},
  publisher={Association for Computing Machinery}
}
```
Lucia Zheng, Neel Guha, Brandon R. Anderson, Peter Henderson, and Daniel E. Ho. 2021. When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. In Proceedings of the 18th International Conference on Artificial Intelligence and Law (ICAIL '21), June 21-25, 2021, São Paulo, Brazil. ACM Inc., New York, NY (in press). arXiv: 2104.08671 [cs.CL].