Dataset:

albertvillanova/sat

Dataset Card for SAT

Dataset Summary

The SAT (Style Augmented Translation) dataset contains roughly 3.3 million English-Vietnamese text pairs.

Supported Tasks and Leaderboards

  • Machine Translation

Languages

The languages in the dataset are:

  • Vietnamese (vi)
  • English (en)

Dataset Structure

Data Instances

{
  'translation': {
    'en': 'Rachel Pike : The science behind a climate headline',
    'vi': 'Khoa học đằng sau một tiêu đề về khí hậu'
  }
}

Data Fields

  • translation:
    • en: Parallel text in English.
    • vi: Parallel text in Vietnamese.
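
As a minimal sketch of how these fields can be read, the snippet below unpacks one record into a (source, target) pair. The record is the sample from "Data Instances"; the helper name `to_pair` and its `src`/`tgt` parameters are hypothetical and not part of the dataset or any library.

```python
from typing import Dict, Tuple

# Sample record copied from the "Data Instances" section of this card.
record: Dict[str, Dict[str, str]] = {
    "translation": {
        "en": "Rachel Pike : The science behind a climate headline",
        "vi": "Khoa học đằng sau một tiêu đề về khí hậu",
    }
}


def to_pair(
    example: Dict[str, Dict[str, str]], src: str = "en", tgt: str = "vi"
) -> Tuple[str, str]:
    """Return (source_text, target_text) for one example.

    Hypothetical helper: `src` and `tgt` are the language keys
    found under the `translation` field.
    """
    translation = example["translation"]
    return translation[src], translation[tgt]


source_text, target_text = to_pair(record)
print(source_text)
print(target_text)
```

Swapping `src` and `tgt` gives the Vietnamese-to-English direction from the same record.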

Data Splits

The dataset is split into "train" and "test".

                      train      test
Number of examples    3359574    7221
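
The split sizes above can be checked with a little arithmetic, and a split can presumably be fetched with the Hugging Face `datasets` library. The sketch below assumes the dataset is still hosted under the id `albertvillanova/sat` and that `datasets` is installed; `load_split` is a hypothetical wrapper and needs network access when called.

```python
# Split sizes as reported in the table on this card.
SPLIT_SIZES = {"train": 3_359_574, "test": 7_221}

total_examples = sum(SPLIT_SIZES.values())
test_share = SPLIT_SIZES["test"] / total_examples


def load_split(name: str):
    """Load one split from the Hub (hypothetical wrapper).

    Assumes the dataset id "albertvillanova/sat" is still available.
    """
    if name not in SPLIT_SIZES:
        raise ValueError(f"unknown split: {name!r}")
    # Imported lazily so the arithmetic above runs without `datasets`.
    from datasets import load_dataset

    return load_dataset("albertvillanova/sat", split=name)


print(f"total examples: {total_examples}")
print(f"test share: {test_share:.4%}")
```

The test split is a small fraction of the data (about 0.2%), so nearly all pairs are available for training.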

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

Unknown.

Citation Information

Unknown.

Contributions

Thanks to @albertvillanova for adding this dataset.