Dataset:

rcds/swiss_criticality_prediction

Multilinguality:

multilingual

Size:

100K<n<1M

Language creators:

expert-generated

Annotation creators:

machine-generated

Source datasets:

original

Preprint:

arxiv:2306.09237

Dataset Card for Criticality Prediction

Dataset Summary

Legal Criticality Prediction (LCP) is a multilingual, diachronic dataset of 139K Swiss Federal Supreme Court (FSCS) cases annotated with two criticality labels. The bge_label is a binary label (critical, non-critical), while the citation_label has 5 classes (critical-1, critical-2, critical-3, critical-4, non-critical). The critical classes of the citation_label are distinct subsets of the critical class of the bge_label. This dataset poses a challenging text classification task. We also provide additional metadata, such as the publication year, the law area and the canton of origin of each case, to promote robustness and fairness studies in the critical area of legal NLP.

Supported Tasks and Leaderboards

LCP can be used as a text classification task.
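As a rough illustration of the classification setup, the sketch below maps the two label sets to integer ids and turns a record into a (text, label id) pair. The label names come from this card; the id order, the choice of the facts field as model input, and the helper name to_example are assumptions made for this example only.

```python
# Label vocabularies as listed in this card; the id order is an arbitrary choice.
BGE_LABELS = ["non-critical", "critical"]
CITATION_LABELS = ["non-critical", "critical-1", "critical-2", "critical-3", "critical-4"]

BGE_LABEL2ID = {label: i for i, label in enumerate(BGE_LABELS)}
CITATION_LABEL2ID = {label: i for i, label in enumerate(CITATION_LABELS)}


def to_example(record, text_field="facts", label_field="bge_label"):
    """Turn one record (a dict) into a (text, label_id) pair for a classifier.

    Hypothetical helper: using "facts" as the default input is an assumption.
    """
    label2id = BGE_LABEL2ID if label_field == "bge_label" else CITATION_LABEL2ID
    return record[text_field], label2id[record[label_field]]


# Placeholder record with only the fields this sketch needs.
record = {
    "facts": "Faits : (placeholder text)",
    "bge_label": "critical",
    "citation_label": "critical-1",
}

print(to_example(record))
print(to_example(record, label_field="citation_label"))
```

Passing text_field="considerations" would switch the input text without changing the label handling.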

Languages

Switzerland has four official languages, three of which (German, French and Italian) are represented in the dataset. The decisions are written by the judges and clerks in the language of the proceedings. Documents per language: German (91k), French (33k), Italian (15k).

Dataset Structure

{
  "decision_id": "008d8a52-f0ea-4820-a18c-d06066dbb407",
  "language": "fr",
  "year": "2018",
  "chamber": "CH_BGer_004",
  "region": "Federation",
  "origin_chamber": "338.0",
  "origin_court": "127.0",
  "origin_canton": "24.0",
  "law_area": "civil_law",
  "law_sub_area": null,
  "bge_label": "critical",
  "citation_label": "critical-1",
  "facts": "Faits : A. A.a. Le 17 août 2007, C.X._, née le 14 février 1944 et domiciliée...",
  "considerations": "Considérant en droit : 1. Interjeté en temps utile (art. 100 al. 1 LTF) par les défendeurs qui ont succombé dans leurs conclusions (art. 76 LTF) contre une décision...",
  "rulings": "Par ces motifs, le Tribunal fédéral prononce : 1. Le recours est rejeté. 2. Les frais judiciaires, arrêtés à 10'000 fr., sont mis solidairement à la charge des recourants..."
}

Data Fields

decision_id: (str) a unique identifier for the document
language: (str) one of (de, fr, it)
year: (int) the publication year
chamber: (str) the chamber of the case
region: (str) the region of the case
origin_chamber: (str) the chamber of the origin case
origin_court: (str) the court of the origin case
origin_canton: (str) the canton of the origin case
law_area: (str) the law area of the case
law_sub_area: (str) the law sub area of the case
bge_label: (str) critical or non-critical
citation_label: (str) critical-1, critical-2, critical-3, critical-4, non-critical
facts: (str) the facts of the case
considerations: (str) the considerations of the case
rulings: (str) the rulings of the case
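The field list above can be turned into a light sanity check. The following is a minimal sketch, assuming records are plain dicts; validate_record is a hypothetical helper, and the consistency rule encodes the summary's statement that the critical citation classes are subsets of the critical bge class.

```python
ALLOWED_LANGUAGES = {"de", "fr", "it"}
ALLOWED_BGE = {"critical", "non-critical"}
ALLOWED_CITATION = {"critical-1", "critical-2", "critical-3", "critical-4", "non-critical"}


def validate_record(record):
    """Return a list of problems found in one record (empty list: nothing found)."""
    problems = []
    language = record.get("language")
    bge = record.get("bge_label")
    citation = record.get("citation_label")

    if language not in ALLOWED_LANGUAGES:
        problems.append(f"unexpected language: {language!r}")
    if bge not in ALLOWED_BGE:
        problems.append(f"unexpected bge_label: {bge!r}")
    if citation not in ALLOWED_CITATION:
        problems.append(f"unexpected citation_label: {citation!r}")

    # A "critical-N" citation label combined with a non-critical bge label
    # would contradict the subset relation described in the summary.
    if isinstance(citation, str) and citation.startswith("critical-") and bge != "critical":
        problems.append("citation_label is critical but bge_label is not")
    return problems


good = {"language": "fr", "bge_label": "critical", "citation_label": "critical-1"}
bad = {"language": "en", "bge_label": "non-critical", "citation_label": "critical-2"}

print(validate_record(good))
print(validate_record(bad))
```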

Data Instances

[More Information Needed]

Data Splits

The dataset was split by date (date-stratified):

  • Train: 2002-2015
  • Validation: 2016-2017
  • Test: 2018-2022

Number of documents per language and split:

Language  Subset  Total   Training  Validation  Test
German    de      81264   56592     19601       5071
French    fr      49354   29263     11117       8974
Italian   it      7913    5220      1901        792
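The year ranges and document counts above can be expressed as a small lookup. This is a sketch for illustration only: split_for_year is a hypothetical helper, not part of the dataset tooling, and the counts are copied from the figures in this section.

```python
# Year ranges of the date-based split (range end is exclusive).
SPLIT_YEARS = {
    "train": range(2002, 2016),       # 2002-2015
    "validation": range(2016, 2018),  # 2016-2017
    "test": range(2018, 2023),        # 2018-2022
}


def split_for_year(year):
    """Return the split name for a publication year given as int or str."""
    year = int(year)  # the example instance in this card stores the year as a string
    for name, years in SPLIT_YEARS.items():
        if year in years:
            return name
    raise ValueError(f"year {year} is outside 2002-2022")


# (training, validation, test) document counts per language subset.
DOCS = {
    "de": (56592, 19601, 5071),
    "fr": (29263, 11117, 8974),
    "it": (5220, 1901, 792),
}
TOTALS = {lang: sum(counts) for lang, counts in DOCS.items()}

print(split_for_year("2018"))
print(TOTALS)
```

Summing the three split counts per language reproduces the totals listed for each subset.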

Dataset Creation

Curation Rationale

The dataset was created by Stern (2023).

Source Data

Initial Data Collection and Normalization

The original data are published by the Swiss Federal Supreme Court ( https://www.bger.ch ) in unprocessed formats (HTML). The documents were downloaded from the Entscheidsuche portal ( https://entscheidsuche.ch ) in HTML.
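Since the source documents are HTML, some text extraction step is needed before the decisions can be used as plain text. The sketch below shows one generic way to do this with Python's standard library; it is not the pipeline used to build this dataset, and TextExtractor, html_to_text and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample markup, not an actual court decision.
sample = (
    "<html><head><style>p{}</style></head>"
    "<body><h1>Urteil</h1><p>Sachverhalt: <b>A.</b> Text</p></body></html>"
)
print(html_to_text(sample))
```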

Who are the source language producers?

The decisions are written by the judges and clerks in the language of the proceedings.

Annotations

Annotation process

bge_label:

  • All bger_references in the bge header were extracted (for bge, see rcds/swiss_rulings).
  • The bger file_names were compared with the extracted references.

citation_label:

  • All citations of each bger case were counted and weighted.
  • The cited cases were divided into four classes, depending on the number of citations.
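The two labelling procedures can be sketched roughly as follows. This is an illustration under loud assumptions, not the authors' implementation: the weighting function, the cut points, and the choice that critical-1 denotes the most-cited class are all hypothetical, as are the function names and the toy inputs.

```python
import bisect
from collections import defaultdict


def bge_label(file_name, bge_references):
    """'critical' if this bger file name appears among the references found in bge headers."""
    return "critical" if file_name in bge_references else "non-critical"


def weighted_citation_counts(citations, weight):
    """Sum a weight per citation; citations is an iterable of (cited_case, citing_year)."""
    counts = defaultdict(float)
    for cited_case, citing_year in citations:
        counts[cited_case] += weight(citing_year)
    return dict(counts)


def citation_label(score, cut_points=(2.0, 5.0, 10.0)):
    """Map a weighted citation score to one of four classes.

    Assumptions: the cut points are invented, and critical-1 is taken
    to be the most-cited class.
    """
    if score <= 0:
        return "non-critical"
    # Number of cut points at or below the score, from 0 to 3.
    bucket = bisect.bisect_right(cut_points, score)
    return f"critical-{4 - bucket}"


def recency_weight(year):
    """Hypothetical weighting: recent citations count fully, older ones half."""
    return 1.0 if year >= 2015 else 0.5


citations = [("case_a", 2020), ("case_a", 2021), ("case_b", 2005)]
scores = weighted_citation_counts(citations, recency_weight)
labels = {case: citation_label(score) for case, score in scores.items()}

print(bge_label("case_a", {"case_a"}))
print(scores)
print(labels)
```

Three cut points split the positive scores into four ranges, which is one simple way to obtain four classes from a continuous score.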
Who are the annotators?

Stern processed the data and introduced the bge_label and citation_label. Metadata is published by the Swiss Federal Supreme Court ( https://www.bger.ch ).

Personal and Sensitive Information

The dataset contains publicly available court decisions from the Swiss Federal Supreme Court. Personal or sensitive information has been anonymized by the court before publication according to the following guidelines: https://www.bger.ch/home/juridiction/anonymisierungsregeln.html .

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

We release the data under CC-BY-4.0, which complies with the court's licensing terms ( https://www.bger.ch/files/live/sites/bger/files/pdf/de/urteilsveroeffentlichung_d.pdf ).

© Swiss Federal Supreme Court, 2002-2022

The copyright for the editorial content of this website and the consolidated texts, which is owned by the Swiss Federal Supreme Court, is licensed under the Creative Commons Attribution 4.0 International licence. This means that you can re-use the content provided you acknowledge the source and indicate any changes you have made. Source: https://www.bger.ch/files/live/sites/bger/files/pdf/de/urteilsveroeffentlichung_d.pdf

Citation Information

Please cite our arXiv preprint:

@misc{rasiah2023scale,
      title={SCALE: Scaling up the Complexity for Advanced Language Model Evaluation},
      author={Vishvaksenan Rasiah and Ronja Stern and Veton Matoshi and Matthias Stürmer and Ilias Chalkidis and Daniel E. Ho and Joel Niklaus},
      year={2023},
      eprint={2306.09237},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions

Thanks to @Stern5497 for adding this dataset.