Dataset:

pierreguillou/lener_br_finetuning_language_model

Language:

pt

Multilinguality:

monolingual

Other:

lener_br

Dataset Card for "LeNER-Br language modeling"

Dataset Summary

The LeNER-Br language modeling dataset is a collection of legal texts in Portuguese from the LeNER-Br dataset (official site).

The legal texts were downloaded from this link (93.6 MB) and processed to create a DatasetDict with a train split and a validation split (20% of the texts).
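
The processing script itself is not shown here, so the following is only a minimal sketch of how an 80/20 split of this kind could be produced. The function name `train_validation_split`, the seed, and the placeholder documents are assumptions for illustration. With the `datasets` library, `Dataset.train_test_split(test_size=0.2)` would be a common alternative.

```python
import random


def train_validation_split(texts, validation_fraction=0.2, seed=42):
    """Shuffle the texts and hold out a fraction of them for validation."""
    shuffled = list(texts)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return {
        "train": shuffled[n_validation:],
        "validation": shuffled[:n_validation],
    }


# Placeholder documents standing in for the legal texts (hypothetical content).
texts = [f"documento {i}" for i in range(19065)]
splits = train_validation_split(texts)
print(len(splits["train"]), len(splits["validation"]))  # 15252 3813
```

Holding out 20% of 19,065 texts gives 3,813 validation rows and 15,252 training rows, which matches the row counts listed under "Dataset structure".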

The LeNER-Br language modeling dataset allows the fine-tuning of language models such as BERTimbau base and large.

Language

Portuguese from Brazil.

Blog post

NLP | Modelos e Web App para Reconhecimento de Entidade Nomeada (NER) no domínio jurídico brasileiro (29/12/2021)

Dataset structure

DatasetDict({
    validation: Dataset({
        features: ['text'],
        num_rows: 3813
    })
    train: Dataset({
        features: ['text'],
        num_rows: 15252
    })
})

Use

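# In a notebook cell; from a shell, run "pip install datasets" instead
!pip install datasets
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub as a DatasetDict
dataset = load_dataset("pierreguillou/lener_br_finetuning_language_model")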
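
Once loaded, the object behaves like a dictionary that maps split names to datasets, and iterating over a split yields rows with a single `text` field. As a rough sketch of how the splits might be inspected, the helper below (the name `summarize_splits` is an assumption, not part of the `datasets` library) reports the row count and average text length per split. It is shown on a tiny hypothetical stand-in so it runs without downloading anything.

```python
def summarize_splits(splits):
    """Return {split_name: (num_rows, average_text_length_in_characters)}."""
    summary = {}
    for name, rows in splits.items():
        lengths = [len(row["text"]) for row in rows]
        average = sum(lengths) / len(lengths) if lengths else 0.0
        summary[name] = (len(lengths), average)
    return summary


# Tiny stand-in with the same layout as the loaded DatasetDict (hypothetical texts).
example = {
    "train": [{"text": "abc"}, {"text": "abcde"}],
    "validation": [{"text": "ab"}],
}
print(summarize_splits(example))  # {'train': (2, 4.0), 'validation': (1, 2.0)}
```

With the real data, the same call would be `summarize_splits(dataset)`.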