Datasets:

Bingsu/KSS_Dataset

Languages:

ko

Multilinguality:

monolingual

Size Categories:

10K<n<100K

Language Creators:

expert-generated

Annotations Creators:

expert-generated

Source Datasets:

original

Description from the original author

KSS Dataset: Korean Single speaker Speech Dataset

KSS Dataset is designed for the Korean text-to-speech task. It consists of audio files recorded by a professional female voice actress and their aligned text extracted from my books. As the copyright holder, and by courtesy of the publishers, I release this dataset to the public. To the best of my knowledge, this is the first publicly available speech dataset for Korean.

File Format

Each line in transcript.v.1.3.txt is delimited by | into six fields.

  • A. Audio file path
  • B. Original script
  • C. Expanded script
  • D. Decomposed script
  • E. Audio duration (seconds)
  • F. English translation

e.g.,

1/1_0470.wav|저는 보통 20분 정도 낮잠을 잡니다.|저는 보통 이십 분 정도 낮잠을 잡니다.|저는 보통 이십 분 정도 낮잠을 잡니다.|4.1|I usually take a nap for 20 minutes.
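The six-field layout above can be read with a plain string split. The following is a minimal sketch, assuming that no field contains a literal | character; the names TranscriptEntry and parse_transcript_line are hypothetical helpers introduced here for illustration and are not part of the dataset.

```python
from typing import NamedTuple


class TranscriptEntry(NamedTuple):
    """One line of the transcript file (hypothetical container for illustration)."""
    audio_path: str
    original_script: str
    expanded_script: str
    decomposed_script: str
    duration: float  # seconds
    english_translation: str


def parse_transcript_line(line: str) -> TranscriptEntry:
    """Split a transcript line on '|' into its six fields (A to F)."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 6:
        # Assumption: fields never contain '|', so anything else is malformed.
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    path, original, expanded, decomposed, duration, english = fields
    return TranscriptEntry(path, original, expanded, decomposed,
                           float(duration), english)


# The example line quoted above.
example_line = (
    "1/1_0470.wav|저는 보통 20분 정도 낮잠을 잡니다."
    "|저는 보통 이십 분 정도 낮잠을 잡니다."
    "|저는 보통 이십 분 정도 낮잠을 잡니다."
    "|4.1|I usually take a nap for 20 minutes."
)
entry = parse_transcript_line(example_line)
print(entry.audio_path, entry.duration, entry.english_translation)
```

To read the whole file, the same function could be applied to each line of transcript.v.1.3.txt opened with encoding="utf-8".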

Specification

License

CC BY-NC-SA 4.0. You CANNOT use this dataset for ANY COMMERCIAL purpose. Otherwise, you can use it freely.

Citation

If you want to cite KSS Dataset, please refer to this:

Kyubyong Park, KSS Dataset: Korean Single speaker Speech Dataset, https://kaggle.com/bryanpark/korean-single-speaker-speech-dataset , 2018

Reference

Check out this for a project using this KSS Dataset.

Contact

You can contact me at kbpark.linguist@gmail.com .

April, 2018.

Kyubyong Park

Dataset Summary

12,853 Korean audio files with transcriptions.

Supported Tasks and Leaderboards

text-to-speech

Languages

Korean

Dataset Structure

Data Instances

>>> from datasets import load_dataset

>>> dataset = load_dataset("Bingsu/KSS_Dataset")
>>> dataset["train"].features
{'audio': Audio(sampling_rate=44100, mono=True, decode=True, id=None),
 'original_script': Value(dtype='string', id=None),
 'expanded_script': Value(dtype='string', id=None),
 'decomposed_script': Value(dtype='string', id=None),
 'duration': Value(dtype='float32', id=None),
 'english_translation': Value(dtype='string', id=None)}
>>> dataset["train"][0]
{'audio': {'path': None,
  'array': array([ 0.00000000e+00,  3.05175781e-05, -4.57763672e-05, ...,
          0.00000000e+00, -3.05175781e-05, -3.05175781e-05]),
  'sampling_rate': 44100},
 'original_script': '그는 괜찮은 척하려고 애쓰는 것 같았다.',
 'expanded_script': '그는 괜찮은 척하려고 애쓰는 것 같았다.',
 'decomposed_script': '그는 괜찮은 척하려고 애쓰는 것 같았다.',
 'duration': 3.5,
 'english_translation': 'He seemed to be pretending to be okay.'}
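As a sanity check on a record shaped like the one above, the clip length can be recomputed from the decoded samples and the sampling rate. This is a minimal sketch: audio_seconds is a hypothetical helper, and the record below is a synthetic stand-in (two seconds of silence) so the snippet runs without downloading the dataset. The duration values shown in this card appear to be rounded to one decimal place, so a comparison against real records would likely need a tolerance.

```python
def audio_seconds(example: dict) -> float:
    """Length of a clip in seconds: number of samples / sampling rate."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in with the same keys as a dataset record (not real data).
fake_example = {
    "audio": {
        "path": None,
        "array": [0.0] * 88200,   # 2 seconds of silence at 44.1 kHz
        "sampling_rate": 44100,
    },
    "duration": 2.0,
}

computed = audio_seconds(fake_example)
# Compare with the stored duration, allowing for rounding in the metadata.
matches = abs(computed - fake_example["duration"]) < 0.1
print(computed, matches)
```

With the real dataset, the same function could be called on dataset["train"][0], since the 'array' value supports len().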

Data Splits

Split    # of examples
train    12853