Dataset:

opus_dogc

Tasks:

translation

Languages:

ca es

Multilinguality:

translation

Size categories:

1M<n<10M

Language creators:

expert-generated

Annotation creators:

no-annotation

Source datasets:

original

License:

cc0-1.0

Dataset Card for OPUS DOGC

Dataset Summary

OPUS DOGC is a collection of documents from the Official Journal of the Government of Catalonia (Diari Oficial de la Generalitat de Catalunya, DOGC), in Catalan and Spanish, provided by Antoni Oliver Gonzalez from the Universitat Oberta de Catalunya.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The dataset is multilingual, with parallel text in:

  • Catalan
  • Spanish

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

A data instance contains the following fields:

  • ca : the Catalan text
  • es : the aligned Spanish text
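
As a rough illustration of how the two fields might be read, the sketch below pulls the Catalan and Spanish sides out of a single record. The record contents are invented for the example and are not taken from the corpus, and `get_pair` is a hypothetical helper name. It also assumes the fields may sit either at the top level of the record or under a `translation` key, since the exact layout is not documented here.

```python
from typing import Mapping, Tuple

# Hypothetical records for illustration only; the sentences are made up
# and are not taken from the corpus.
FLAT_RECORD = {"ca": "Bon dia.", "es": "Buenos días."}
NESTED_RECORD = {"translation": {"ca": "Bon dia.", "es": "Buenos días."}}


def get_pair(record: Mapping) -> Tuple[str, str]:
    """Return the (Catalan, Spanish) text of one record.

    Assumption: the "ca" and "es" fields are either at the top level
    or nested under a "translation" key; adjust to the loaded data.
    """
    # Fall back to the record itself when there is no "translation" key.
    fields = record.get("translation", record)
    return fields["ca"], fields["es"]


if __name__ == "__main__":
    for rec in (FLAT_RECORD, NESTED_RECORD):
        ca, es = get_pair(rec)
        print(f"ca: {ca} | es: {es}")
```

Accepting both layouts keeps the helper usable whichever way the records turn out to be structured; inspecting one real record first is the safer way to confirm.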

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

The dataset is in the public domain under CC0 1.0.

Citation Information

@inproceedings{tiedemann-2012-parallel,
    title = "Parallel Data, Tools and Interfaces in {OPUS}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
    pages = "2214--2218",
    abstract = "This paper presents the current status of OPUS, a growing language resource of parallel corpora and related tools. The focus in OPUS is to provide freely available data sets in various formats together with basic annotation to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. In this paper, we report about new data sets and their features, additional annotation tools and models provided from the website and essential interfaces and on-line services included in the project.",
}

Contributions

Thanks to @albertvillanova for adding this dataset.