Dataset:

PedroDKE/LibriS2S

Tasks:

Text-to-Speech

Automatic Speech Recognition

Translation

Languages:

Multilinguality:

multilingual

Size:

10K<n<100K

ArXiv:

arxiv:2204.10593 arxiv:1910.07924

Tags:

LibriS2S LibrivoxDeEn Speech-to-Speech translation Speech-to-Speech+translation

License:

cc-by-nc-sa-4.0

LibriS2S

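This repository contains the scripts and alignment data needed to build, on top of the original dataset, a dataset of (German audio, German transcription, English audio, English transcription) quadruplets that can be used for speech-to-speech translation research. Because it builds on that dataset, the alignment data is released under the same open license.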
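
To make the quadruplet structure concrete, here is a minimal sketch of how one aligned entry could be represented in Python. The class name, field names and file paths are all hypothetical; they are not taken from the repository's scripts.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Quadruplet:
    """One aligned German/English entry. All field names are hypothetical."""
    de_audio: Path  # German audio segment
    de_text: str    # German transcription
    en_audio: Path  # English audio segment
    en_text: str    # English transcription


def is_complete(q: Quadruplet) -> bool:
    """Return True when both transcriptions are non-empty after stripping."""
    return bool(q.de_text.strip()) and bool(q.en_text.strip())


# Made-up paths and sentences, for illustration only.
example = Quadruplet(
    de_audio=Path("DE/some_book/segment_0001.wav"),
    de_text="Guten Morgen.",
    en_audio=Path("EN/some_book/segment_0001.wav"),
    en_text="Good morning.",
)
```

A check such as `is_complete(example)` could be used to drop entries with an empty transcription before training.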

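The alignments were collected by downloading the English audiobooks and aligning the audio with the text using automated tools. For details, see the original paper (published at LREC 2022).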

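Data

The English/German audio is available in the EN/DE folders and can be downloaded from the link. If you run into any problems with the download, please describe them here or at the link.

The repository is structured as follows:

Alignments: contains the alignment data for each book and chapter.
DE: contains the German audio for each chapter of each book.
EN: contains the English audio for each chapter of each book.
Example: contains example files for the scraping and alignment instructions.
LibrivoxDeEn_alignments: contains the base alignment data from the LibrivoxDeEn dataset.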
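
After downloading and unpacking, it can be useful to verify that the expected top-level folders are present. The sketch below assumes English spellings of the folders listed above; the exact names and capitalization in your copy may differ, so treat `EXPECTED_FOLDERS` as an assumption to adjust.

```python
from pathlib import Path

# Assumed folder names; adjust to match the actual download.
EXPECTED_FOLDERS = ("Alignments", "DE", "EN", "Example", "LibrivoxDeEn_alignments")


def missing_folders(root, expected=EXPECTED_FOLDERS):
    """Return the expected folder names that are not directories under root."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]
```

Calling `missing_folders("path/to/LibriS2S")` returns an empty list when every expected folder exists.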

If you find that any part of the data is missing, feel free to open an issue! The complete set of zipped files is about 52 GB.

Scraping a book from Librivox

To download all chapters from a Librivox URL, use the following command:

python scrape_audio_from_librivox.py \
--url https://librivox.org/undine-by-friedrich-de-la-motte-fouque/ \
--save_dir ./examples
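
To scrape several books in one go, the command above can be wrapped in a small Python loop. This is only a sketch: it assumes `scrape_audio_from_librivox.py` is in the current working directory and uses only the `--url` and `--save_dir` flags shown above. The function names are hypothetical.

```python
import subprocess
import sys


def build_scrape_command(url, save_dir, script="scrape_audio_from_librivox.py"):
    """Build the argument list for one scraping run (flags as shown above)."""
    return [sys.executable, script, "--url", url, "--save_dir", save_dir]


def scrape_books(urls, save_dir):
    """Run the scraping script once per URL; raises if a run fails."""
    for url in urls:
        subprocess.run(build_scrape_command(url, save_dir), check=True)
```

For example, `scrape_books(["https://librivox.org/undine-by-friedrich-de-la-motte-fouque/"], "./examples")` reproduces the single command above.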

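Aligning a Librivox book with the LibrivoxDeEn text

To align the previously downloaded book with the text and tsv table provided by LibrivoxDeEn, use the following command (based on the example provided in this repository):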

python align_text_and_audio.py \
--text_dir ./example/en_text/ \
--audio_path ./example/audio_chapters/ \
--aeneas_path ./example/aeneas/ \
--en_audio_export_path ./example/sentence_level_audio/ \
--total_alignment_path ./example/bi-lingual-alignment/ \
--librivoxdeen_alignment ./example/undine_data.tsv \
--aeneas_head_max 120 \
--aeneas_tail_min 5
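
Before starting an alignment run, it may help to confirm that the inputs exist. The sketch below assumes that `--text_dir` and `--audio_path` point to existing directories and that `--librivoxdeen_alignment` points to an existing tsv file; the helper name is hypothetical and the script itself may perform its own checks.

```python
from pathlib import Path


def missing_inputs(text_dir, audio_path, librivoxdeen_alignment):
    """Return the flags whose paths are missing (assumed to be inputs)."""
    checks = {
        "--text_dir": Path(text_dir).is_dir(),
        "--audio_path": Path(audio_path).is_dir(),
        "--librivoxdeen_alignment": Path(librivoxdeen_alignment).is_file(),
    }
    return [flag for flag, ok in checks.items() if not ok]
```

With the example layout, `missing_inputs("./example/en_text/", "./example/audio_chapters/", "./example/undine_data.tsv")` should return an empty list.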

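Note: the example folder in this repository already contains the first two chapters scraped from Librivox, together with the text and tsv table from LibrivoxDeEn (trimmed to the first 2 chapters only). Additional data to align can be scraped with the same script shown earlier and combined with the data provided by LibriVoxDeEn.

This repository also provides complete alignment data for the following 8 books, whose LibrivoxDeEn ids are: 9, 10, 13, 18, 23, 108, 110, 120.

Other books such as 11, 36, 67 and 54 are also part of the LibrivoxDeEn dataset, but their chapters do not correspond one-to-one (for example, the German version of book 67 has 27 chapters, while the English version has 29). Alignments for these books are provided, but if you scrape and align them yourself, your results may differ from the ones in this repository.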
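
A quick way to spot books whose chapters do not map one-to-one is to compare chapter counts per language. The sketch below is illustrative: the function name and the input format are hypothetical, and only the counts for book 67 come from the text above.

```python
def needs_realignment(chapter_counts):
    """Return the book ids whose German and English chapter counts differ.

    chapter_counts maps a book id to (german_chapters, english_chapters).
    """
    return [book for book, (de, en) in chapter_counts.items() if de != en]


counts = {
    "67": (27, 29),              # counts stated above for book 67
    "placeholder_book": (10, 10),  # hypothetical entry, not real data
}
mismatched = needs_realignment(counts)
```

Here `mismatched` contains only `"67"`, since its German and English chapter counts differ.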

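Metrics on the alignment data provided in this repository

Some metrics were collected on the alignment data provided in this repository and are shown here. The manually aligned books are not counted in this table or in the next figure; the full table can be found in the original paper.

Metric	German	English
Number of files	18868	18868
Total time (hh:mm:ss)	39:11:08	40:52:31
Speakers	41	22

Note: speakers are counted separately for each book, so some speakers may be counted more than once.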
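
As a worked example, the totals in the table can be turned into an average duration per file. This sketch assumes each file is one aligned audio segment; the helper names are hypothetical, and the inputs are the figures from the table.

```python
def hms_to_seconds(hms):
    """Convert an 'hh:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def mean_file_seconds(total_hms, n_files):
    """Average duration per file, in seconds."""
    return hms_to_seconds(total_hms) / n_files


german_mean = mean_file_seconds("39:11:08", 18868)   # 141068 s over 18868 files
english_mean = mean_file_seconds("40:52:31", 18868)  # 147151 s over 18868 files
```

Under that assumption, the German files average roughly 7.5 seconds and the English files roughly 7.8 seconds.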

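Hours per book aligned in this repository:

[Figure: hours per book aligned in this repository]

When using this work, please cite the original paper and the authors of LibrivoxDeEn.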

@inproceedings{jeuris-niehues-2022-libris2s,
   title = "{L}ibri{S}2{S}: A {G}erman-{E}nglish Speech-to-Speech Translation Corpus",
   author = "Jeuris, Pedro  and
     Niehues, Jan",
   booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
   month = jun,
   year = "2022",
   address = "Marseille, France",
   publisher = "European Language Resources Association",
   url = "https://aclanthology.org/2022.lrec-1.98",
   pages = "928--935",
   abstract = "Recently, we have seen an increasing interest in the area of speech-to-text translation. This has led to astonishing improvements in this area. In contrast, the activities in the area of speech-to-speech translation is still limited, although it is essential to overcome the language barrier. We believe that one of the limiting factors is the availability of appropriate training data. We address this issue by creating LibriS2S, to our knowledge the first publicly available speech-to-speech training corpus between German and English. For this corpus, we used independently created audio for German and English leading to an unbiased pronunciation of the text in both languages. This allows the creation of a new text-to-speech and speech-to-speech translation model that directly learns to generate the speech signal based on the pronunciation of the source language. Using this created corpus, we propose Text-to-Speech models based on the example of the recently proposed FastSpeech 2 model that integrates source language information. We do this by adapting the model to take information such as the pitch, energy or transcript from the source speech as additional input.",
}

@article{beilharz19,
 title = {LibriVoxDeEn: A Corpus for German-to-English Speech Translation and Speech Recognition},
 author = {Beilharz, Benjamin and Sun, Xin and Karimova, Sariya and Riezler, Stefan},
 journal = {Proceedings of the Language Resources and Evaluation Conference},
 journal-abbrev = {LREC},
 year = {2020},
 city = {Marseille, France},
 url = {https://arxiv.org/pdf/1910.07924.pdf}
}

Author:

PedroDKE

Dataset size:

663.04 MB