Dataset:

projecte-aina/Parafraseja

Language:

ca

Multilinguality:

monolingual

Language creators:

found

Annotation creators:

CLiC-UB

Dataset Card for Parafraseja

Dataset Summary

Parafraseja is a dataset of 21,984 sentence pairs, each labelled to indicate whether the two sentences are paraphrases. The original sentences were collected from TE-ca and STS-ca. For each sentence, an annotator wrote one sentence that was a paraphrase and another that was not. The annotation guidelines are available.

Supported Tasks and Leaderboards

This dataset is mainly intended to train models for paraphrase detection.

Languages

The dataset is in Catalan (ca-CA).

Dataset Structure

The dataset consists of sentence pairs labelled "Parafrasis" or "No Parafrasis", stored in JSONL format.

Data Instances

  {
    "id": "te1_14977_1",
    "source": "teca",
    "original": "La 2a part consta de 23 cap\u00edtols, cadascun dels quals descriu un ocell diferent.",
    "new": "La segona part consisteix en vint-i-tres cap\u00edtols, cada un dels quals descriu un ocell diferent.",
    "label": "Parafrasis"
  }
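
Because each line of a JSONL file is a standalone JSON object, a record like the one above can be read with Python's standard json module. The following is a minimal sketch, not the curators' own tooling: the helper name read_jsonl is hypothetical, and the in-memory string stands in for one line of a split file, assuming one JSON object per line.

```python
import io
import json


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSONL stream (hypothetical helper)."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for a single line of a split file (assumed layout).
sample = io.StringIO(
    '{"id": "te1_14977_1", "source": "teca", '
    '"original": "La 2a part consta de 23 cap\\u00edtols, cadascun dels quals descriu un ocell diferent.", '
    '"new": "La segona part consisteix en vint-i-tres cap\\u00edtols, cada un dels quals descriu un ocell diferent.", '
    '"label": "Parafrasis"}\n'
)

records = list(read_jsonl(sample))
print(records[0]["original"])  # the \u00ed escape is decoded to "í" by json.loads
```

To read a real split, the same helper could be passed a file opened with encoding="utf-8".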

Data Fields

  • original: original sentence
  • new: new sentence, which could be a paraphrase or a non-paraphrase
  • label: relation between original and new
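
For training a binary classifier, the string labels typically need to be converted to integers. The sketch below assumes the only label values are the two strings named in this card; the names LABEL_TO_ID and encode_label are hypothetical, and assigning 1 to "Parafrasis" is an arbitrary choice.

```python
# Hypothetical mapping; the card only specifies the two label strings.
LABEL_TO_ID = {"No Parafrasis": 0, "Parafrasis": 1}


def encode_label(record):
    """Return the integer class for a record, failing loudly on unknown labels."""
    label = record["label"]
    if label not in LABEL_TO_ID:
        raise ValueError(f"unexpected label: {label!r}")
    return LABEL_TO_ID[label]


example = {"original": "...", "new": "...", "label": "Parafrasis"}
print(encode_label(example))
```

Raising on unknown values, rather than defaulting to a class, makes any unexpected label in the data visible early.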

Data Splits

  • dev.json: 2,000 examples
  • test.json: 4,000 examples
  • train.json: 15,984 examples
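
As a quick consistency check, the split sizes listed above can be summed and compared with the 21,984 pairs stated in the summary. This is only an arithmetic sketch using the figures from this card; the variable names are illustrative.

```python
# Split sizes as listed in this card.
SPLIT_SIZES = {"train.json": 15_984, "dev.json": 2_000, "test.json": 4_000}

total = sum(SPLIT_SIZES.values())

# Share of each split as a percentage of the whole dataset.
shares = {name: round(100 * size / total, 1) for name, size in SPLIT_SIZES.items()}

print(total)
print(shares)
```

The total matches the dataset size given in the summary, and the split is roughly 73% train, 9% dev and 18% test.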

Dataset Creation

Curation Rationale

We created this corpus to contribute to the development of language models in Catalan, a low-resource language.

Source Data

The original sentences of this dataset came from STS-ca and TE-ca.

Initial Data Collection and Normalization

11,543 of the original sentences came from TE-ca, and 10,441 came from STS-ca.

Who are the source language producers?

TE-ca and STS-ca come from the Catalan Textual Corpus, which consists of several corpora gathered from web crawling and public corpora, and Vilaweb, a Catalan newswire.

Annotations

The dataset is annotated with the label "Parafrasis" or "No Parafrasis" for each pair of sentences.

Annotation process

Each pair was annotated by a single annotator and reviewed by a second one.

Who are the annotators?

The annotators were native Catalan speakers with a background in linguistics.

Personal and Sensitive Information

No personal or sensitive information is included.

Considerations for Using the Data

Social Impact of Dataset

We hope this corpus contributes to the development of language models in Catalan, a low-resource language.

Discussion of Biases

We are aware that this data might contain biases. We have not taken any steps to reduce their impact.

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)

This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.

Licensing Information

Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International.

Contributions

[N/A]