Dataset Card for Opus Ubuntu

Dataset Summary

These are translations of the Ubuntu software package messages, donated by the Ubuntu community.

To load a language pair that is not among the predefined configs, pass the two language codes as lang1 and lang2. The valid pairs are listed on the homepage given in the Dataset Description: http://opus.nlpl.eu/Ubuntu.php. For example:

from datasets import load_dataset

dataset = load_dataset("opus_ubuntu", lang1="it", lang2="pl")

Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More Information Needed]

Dataset Structure

Data Instances

Example instance:

{
  'id': '0', 
  'translation': {
    'it': 'Comprende Gmail, Google Docs, Google+, YouTube e Picasa',
    'pl': 'Zawiera Gmail, Google Docs, Google+, YouTube oraz Picasa'
  }
}

Data Fields

Each instance has two fields:

  • id : the id of the example
  • translation : a dictionary containing translated texts in two languages.
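As a minimal sketch of how these two fields might be read, the snippet below works on the example instance from the Data Instances section as a plain Python dictionary, so it runs without downloading anything. The helper name to_pair is hypothetical and not part of the dataset or of any library.

```python
from typing import Dict, Tuple

# Example instance copied from the "Data Instances" section.
example = {
    "id": "0",
    "translation": {
        "it": "Comprende Gmail, Google Docs, Google+, YouTube e Picasa",
        "pl": "Zawiera Gmail, Google Docs, Google+, YouTube oraz Picasa",
    },
}


def to_pair(instance: Dict, src: str, tgt: str) -> Tuple[str, str]:
    """Return (source_text, target_text) for the given language codes.

    Hypothetical helper: it only assumes the 'translation' field is a
    dictionary keyed by language code, as described above.
    """
    translation = instance["translation"]
    return translation[src], translation[tgt]


source_text, target_text = to_pair(example, "it", "pl")
print(example["id"], "|", source_text, "->", target_text)
```

Swapping the src and tgt arguments gives the reverse translation direction from the same instance.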

Data Splits

Each subset consists of a single train split. The number of examples for some language pairs is listed below:

language pair   train
as-bs           8583
az-cs           293
bg-de           184
br-es_PR        125
bn-ga           7324
br-hi           15551
br-la           527
bs-szl          646
br-uz           1416
br-yi           2799
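The pair names above appear to follow a lang1-lang2 pattern, which would map onto the lang1 and lang2 arguments used when loading a pair. The sketch below rests on that assumption: it splits a pair name on the first hyphen only, so a code such as es_PR stays intact. The sizes are copied from the table; split_pair and TRAIN_SIZES are hypothetical names used only for illustration.

```python
from typing import Dict, Tuple

# Train-set sizes copied from the table above.
TRAIN_SIZES: Dict[str, int] = {
    "as-bs": 8583,
    "az-cs": 293,
    "bg-de": 184,
    "br-es_PR": 125,
    "bn-ga": 7324,
    "br-hi": 15551,
    "br-la": 527,
    "bs-szl": 646,
    "br-uz": 1416,
    "br-yi": 2799,
}


def split_pair(pair: str) -> Tuple[str, str]:
    """Split a 'lang1-lang2' name into its two language codes.

    Assumes the first hyphen separates the two codes.
    """
    lang1, lang2 = pair.split("-", 1)
    return lang1, lang2


# Find the listed pair with the most training examples.
largest = max(TRAIN_SIZES, key=TRAIN_SIZES.get)
print(split_pair("br-es_PR"))
print(largest, TRAIN_SIZES[largest])
```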

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

[More Information Needed]

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

[More Information Needed]

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

BSD "Revised" license (see https://help.launchpad.net/Legal#Translations_copyright)

Citation Information

@InProceedings{TIEDEMANN12.463,
  author = {J{\"o}rg Tiedemann},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
 }

Contributions

Thanks to @rkc007 for adding this dataset.