数据集:

greek_legal_code

语言:

el

计算机处理:

monolingual

大小:

10K<n<100K

语言创建人:

found

批注创建人:

found

源数据集:

original

预印本库:

arxiv:2109.15298

许可:

cc-by-4.0
中文

Dataset Card for Greek Legal Code

Dataset Summary

Greek_Legal_Code (GLC) is a dataset consisting of approx. 47k legal resources from Greek legislation. The origin of GLC is “Permanent Greek Legislation Code - Raptarchis”, a collection of Greek legislative documents classified into multi-level (from broader to more specialized) categories.

Topics

GLC consists of 47 legislative volumes and each volume corresponds to a main thematic topic. Each volume is divided into thematic sub categories which are called chapters and subsequently, each chapter breaks down to subjects which contain the legal resources. The total number of chapters is 389 while the total number of subjects is 2285, creating an interlinked thematic hierarchy. So, for the upper thematic level (volume) GLC has 47 classes. For the next thematic level (chapter) GLC offers 389 classes and for the inner and last thematic level (subject), GLC has 2285 classes.

GLC classes are divided into three categories for each thematic level: frequent classes, which occur in more than 10 training documents and can be found in all three subsets (training, development and test); few-shot classes which appear in 1 to 10 training documents and also appear in the documents of the development and test sets, and zero-shot classes which appear in the development and/or test, but not in the training documents.

Supported Tasks and Leaderboards

The dataset supports:

Multi-class Text Classification: Given the text of a document, a model predicts the corresponding class.

Few-shot and Zero-shot learning: As already noted, the classes can be divided into three groups: frequent, few-shot, and zero- shot, depending on whether they were assigned to more than 10, fewer than 10 but at least one, or no training documents, respectively.

Level Total Frequent Few-Shot (<10) Zero-Shot
Volume 47 47 0 0
Chapter 389 333 53 3
Subject 2285 712 1431 142

Languages

All documents are written in Greek.

Dataset Structure

Data Instances

{
  "text": "179. ΑΠΟΦΑΣΗ ΥΠΟΥΡΓΟΥ ΜΕΤΑΦΟΡΩΝ ΚΑΙ ΕΠΙΚΟΙΝΩΝΙΩΝ Αριθ. Β-οικ. 68425/4765 της 2/17 Νοεμ. 2000 (ΦΕΚ Β΄ 1404) Τροποποίηση της 42000/2030/81 κοιν. απόφασης του Υπουργού Συγκοινωνιών «Κωδικοποίηση και συμπλήρωση καν. Αποφάσεων» που εκδόθηκαν κατ’ εξουσιοδότηση του Ν.Δ. 102/73 «περί οργανώσεως των δια λεωφορείων αυτοκινήτων εκτελουμένων επιβατικών συγκοινωνιών». ",
  "volume": 24,  # "ΣΥΓΚΟΙΝΩΝΙΕΣ"
}

Data Fields

The following data fields are provided for documents ( train , dev , test ):

text : ( str ) The full content of each document, which is represented by its header and articles (i.e., the main_body ). label : ( class label ): Depending on the configurarion, the volume/chapter/subject of the document. For volume-level class it belongs to specifically: ["ΚΟΙΝΩΝΙΚΗ ΠΡΟΝΟΙΑ", "ΓΕΩΡΓΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΡΑΔΙΟΦΩΝΙΑ ΚΑΙ ΤΥΠΟΣ", "ΒΙΟΜΗΧΑΝΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΥΓΕΙΟΝΟΜΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΠΟΛΕΜΙΚΟ ΝΑΥΤΙΚΟ", "ΤΑΧΥΔΡΟΜΕΙΑ - ΤΗΛΕΠΙΚΟΙΝΩΝΙΕΣ", "ΔΑΣΗ ΚΑΙ ΚΤΗΝΟΤΡΟΦΙΑ", "ΕΛΕΓΚΤΙΚΟ ΣΥΝΕΔΡΙΟ ΚΑΙ ΣΥΝΤΑΞΕΙΣ", "ΠΟΛΕΜΙΚΗ ΑΕΡΟΠΟΡΙΑ", "ΝΟΜΙΚΑ ΠΡΟΣΩΠΑ ΔΗΜΟΣΙΟΥ ΔΙΚΑΙΟΥ", "ΝΟΜΟΘΕΣΙΑ ΑΝΩΝΥΜΩΝ ΕΤΑΙΡΕΙΩΝ ΤΡΑΠΕΖΩΝ ΚΑΙ ΧΡΗΜΑΤΙΣΤΗΡΙΩΝ", "ΠΟΛΙΤΙΚΗ ΑΕΡΟΠΟΡΙΑ", "ΕΜΜΕΣΗ ΦΟΡΟΛΟΓΙΑ", "ΚΟΙΝΩΝΙΚΕΣ ΑΣΦΑΛΙΣΕΙΣ", "ΝΟΜΟΘΕΣΙΑ ΔΗΜΩΝ ΚΑΙ ΚΟΙΝΟΤΗΤΩΝ", "ΝΟΜΟΘΕΣΙΑ ΕΠΙΜΕΛΗΤΗΡΙΩΝ ΣΥΝΕΤΑΙΡΙΣΜΩΝ ΚΑΙ ΣΩΜΑΤΕΙΩΝ", "ΔΗΜΟΣΙΑ ΕΡΓΑ", "ΔΙΟΙΚΗΣΗ ΔΙΚΑΙΟΣΥΝΗΣ", "ΑΣΦΑΛΙΣΤΙΚΑ ΤΑΜΕΙΑ", "ΕΚΚΛΗΣΙΑΣΤΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΕΚΠΑΙΔΕΥΤΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΔΗΜΟΣΙΟ ΛΟΓΙΣΤΙΚΟ", "ΤΕΛΩΝΕΙΑΚΗ ΝΟΜΟΘΕΣΙΑ", "ΣΥΓΚΟΙΝΩΝΙΕΣ", "ΕΘΝΙΚΗ ΑΜΥΝΑ", "ΣΤΡΑΤΟΣ ΞΗΡΑΣ", "ΑΓΟΡΑΝΟΜΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΔΗΜΟΣΙΟΙ ΥΠΑΛΛΗΛΟΙ", "ΠΕΡΙΟΥΣΙΑ ΔΗΜΟΣΙΟΥ ΚΑΙ ΝΟΜΙΣΜΑ", "ΟΙΚΟΝΟΜΙΚΗ ΔΙΟΙΚΗΣΗ", "ΛΙΜΕΝΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΑΣΤΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΠΟΛΙΤΙΚΗ ΔΙΚΟΝΟΜΙΑ", "ΔΙΠΛΩΜΑΤΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΔΙΟΙΚΗΤΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΑΜΕΣΗ ΦΟΡΟΛΟΓΙΑ", "ΤΥΠΟΣ ΚΑΙ ΤΟΥΡΙΣΜΟΣ", "ΕΘΝΙΚΗ ΟΙΚΟΝΟΜΙΑ", "ΑΣΤΥΝΟΜΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΑΓΡΟΤΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΕΡΓΑΤΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΠΟΙΝΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΕΜΠΟΡΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΕΠΙΣΤΗΜΕΣ ΚΑΙ ΤΕΧΝΕΣ", "ΕΜΠΟΡΙΚΗ ΝΑΥΤΙΛΙΑ", "ΣΥΝΤΑΓΜΑΤΙΚΗ ΝΟΜΟΘΕΣΙΑ" ] \

The labels can also be a the chapter-level or subject-level class it belongs to. Some chapter labels are omitted due to size (389 classes). Some subject labels are also omitted due to size (2285 classes).

Data Splits

Split No of Documents Avg. words
Train 28,536 600
Development 9,511 574
Test 9,516 595

Dataset Creation

Curation Rationale

The dataset was curated by Papaloukas et al. (2021) with the hope to support and encourage further research in NLP for the Greek language.

Source Data

Initial Data Collection and Normalization

The Permanent Greek Legislation Code - Raptarchis is a thorough catalogue of Greek legislation since the creation of the Greek state in 1834 until 2015. It includes Laws, Royal and Presidential Decrees, Regulations and Decisions, retrieved from the Official Government Gazette, where Greek legislation is published. This collection is one of the official, publicly available sources of classified Greek legislation suitable for classification tasks.

Currently, the original catalogue is publicly offered in MS Word (.doc) format through the portal e-Themis, the legal database and management service of it, under the administration of the Ministry of the Interior (Affairs). E-Themis is primarily focused on providing legislation on a multitude of predefined thematic categories, as described in the catalogue. The main goal is to help users find legislation of interest using the thematic index.

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

The dataset does not include personal or sensitive information.

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

Papaloukas et al. (2021)

Licensing Information

[More Information Needed]

Citation Information

Christos Papaloukas, Ilias Chalkidis, Konstantinos Athinaios, Despina-Athanasia Pantazi and Manolis Koubarakis. Multi-granular Legal Topic Classification on Greek Legislation. Proceedings of the 3rd Natural Legal Language Processing (NLLP) Workshop, Punta Cana, Dominican Republic, 2021

@inproceedings{papaloukas-etal-2021-glc,
    title = "Multi-granular Legal Topic Classification on Greek Legislation",
    author = "Papaloukas, Christos and Chalkidis, Ilias and Athinaios, Konstantinos and Pantazi, Despina-Athanasia and Koubarakis, Manolis",
    booktitle = "Proceedings of the Natural Legal Language Processing Workshop 2021",
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2109.15298",
    doi = "10.48550/arXiv.2109.15298",
    pages = "63--75"
}

Contributions

Thanks to @christospi for adding this dataset.