Dataset:

albertvillanova/sat

Tasks:

Text Generation

Translation

Languages:

Multilinguality:

translation

Size:

1M<n<10M

Language creators:

found

Annotation creators:

no-annotation

Source datasets:

original, extended|bible_para, extended|kde4

Other:

conditional-text-generation

License:

license:unknown


Dataset Card for SAT

Dataset Summary

The SAT (Style Augmented Translation) dataset contains about 3.37 million English-Vietnamese text pairs.

Supported Tasks and Leaderboards

Machine Translation

Languages

The languages in the dataset are:

Vietnamese (vi)
English (en)

Dataset Structure

Data Instances

{
  'translation': {
    'en': 'Rachel Pike : The science behind a climate headline',
    'vi': 'Khoa học đằng sau một tiêu đề về khí hậu'
  }
}

Data Fields

translation:
- en: Parallel text in English.
- vi: Parallel text in Vietnamese.
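As a minimal sketch of how the nested `translation` field might be consumed, the snippet below unpacks one record into a (source, target) pair. The record is the one shown under Data Instances; the helper name `to_pair` and its `src`/`tgt` parameters are hypothetical and not part of the dataset or of any library.

```python
from typing import Dict, Tuple

# Example record copied from the "Data Instances" section of this card.
example: Dict[str, Dict[str, str]] = {
    "translation": {
        "en": "Rachel Pike : The science behind a climate headline",
        "vi": "Khoa học đằng sau một tiêu đề về khí hậu",
    }
}


def to_pair(record: Dict[str, Dict[str, str]],
            src: str = "en",
            tgt: str = "vi") -> Tuple[str, str]:
    """Return (source_text, target_text) from one record.

    Hypothetical helper: assumes the record has a "translation" dict
    keyed by language code, as described under Data Fields.
    """
    translation = record["translation"]
    return translation[src], translation[tgt]


source, target = to_pair(example)
print(source)
print(target)
```

Swapping the arguments (`to_pair(example, src="vi", tgt="en")`) gives the Vietnamese-to-English direction from the same record.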

Data Splits

The dataset is split into "train" and "test" sets.

Split	Number of examples
train	3359574
test	7221
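The split sizes reported in this section can serve as a quick sanity check after downloading the data. The sketch below only does arithmetic on the numbers from this card; the names `EXPECTED_SPLIT_SIZES`, `split_fractions`, and `find_mismatches` are illustrative assumptions, not part of the dataset.

```python
from typing import Dict, Optional, Tuple

# Split sizes as reported in the "Data Splits" section of this card.
EXPECTED_SPLIT_SIZES: Dict[str, int] = {"train": 3359574, "test": 7221}


def split_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def find_mismatches(
    actual: Dict[str, int],
    expected: Dict[str, int] = EXPECTED_SPLIT_SIZES,
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {split: (expected, actual)} for every split whose size differs.

    A split missing from `actual` is reported with None as its actual size.
    """
    return {
        name: (count, actual.get(name))
        for name, count in expected.items()
        if actual.get(name) != count
    }


total = sum(EXPECTED_SPLIT_SIZES.values())
fractions = split_fractions(EXPECTED_SPLIT_SIZES)
print(total)  # 3366795
print(f"test share: {fractions['test']:.4%}")
```

If the data is loaded with the Hugging Face `datasets` library (assuming the `albertvillanova/sat` repository is loadable in your environment), the observed sizes could be passed in as `find_mismatches({name: len(split) for name, split in dataset.items()})`; an empty result means the counts match this card.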

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

Unknown.

Citation Information

Unknown.

Contributions

Thanks to @albertvillanova for adding this dataset.

Author:

albertvillanova

Dataset size:

6.67 KB