Dataset:
PedroDKE/LibriS2S
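This repository contains scripts and alignment data for building, on top of the original dataset, a dataset of (German audio, German transcript, English audio, English transcript) quadruplets that can be used for speech-to-speech translation research. For this reason, the alignment data is released under the same open-source license.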
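To make the quadruplet structure concrete, here is a minimal sketch of how one entry could be represented in Python. The class name, field names, file paths, and sentences are all hypothetical and are not taken from the repository's scripts.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SpeechPair:
    """One (German audio, German transcript, English audio, English transcript) entry."""

    de_audio: Path
    de_text: str
    en_audio: Path
    en_text: str


# Hypothetical paths and sentences, for illustration only.
example = SpeechPair(
    de_audio=Path("DE/some_book/00001.wav"),
    de_text="Guten Morgen.",
    en_audio=Path("EN/some_book/00001.wav"),
    en_text="Good morning.",
)

print(example.de_text, "->", example.en_text)
```

Keeping both audio paths and both transcripts in one record means the same entry can serve speech-to-speech, speech-to-text, or text-to-speech experiments.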
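The alignment data was collected by downloading the English audiobooks and aligning the audio with the text using automated tools. For details, see the original paper (published at LREC 2022). The English and German audio can be found in the EN and DE folders and can be downloaded from the link. If you run into any problems with the download, please describe them here or at the link.
The repository is structured as follows:

If you find anything missing in the data, feel free to open an issue! The complete compressed archive is about 52 GB.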
To download all chapters from a LibriVox URL, use the following command:

    python scrape_audio_from_librivox.py \
        --url https://librivox.org/undine-by-friedrich-de-la-motte-fouque/ \
        --save_dir ./examples

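If several books need to be scraped, the command above can be wrapped in a small Python driver. This is only a sketch: `build_scrape_command` and `scrape_books` are hypothetical helper names, and the only flags used are the `--url` and `--save_dir` options shown above. It assumes the script is run from the repository root.

```python
import subprocess
import sys


def build_scrape_command(url: str, save_dir: str) -> list[str]:
    """Build the argument list for one call to scrape_audio_from_librivox.py."""
    return [
        sys.executable,
        "scrape_audio_from_librivox.py",
        "--url", url,
        "--save_dir", save_dir,
    ]


def scrape_books(urls: list[str], save_dir: str) -> None:
    """Run the scraper once per URL and stop at the first failing call."""
    for url in urls:
        subprocess.run(build_scrape_command(url, save_dir), check=True)


command = build_scrape_command(
    "https://librivox.org/undine-by-friedrich-de-la-motte-fouque/",
    "./examples",
)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a failed download raise an error instead of passing silently.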
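To align the previously downloaded book with the text and TSV table provided by LibrivoxDeEn, use the following command (based on the example provided in this repository):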

    python align_text_and_audio.py \
        --text_dir ./example/en_text/ \
        --audio_path ./example/audio_chapters/ \
        --aeneas_path ./example/aeneas/ \
        --en_audio_export_path ./example/sentence_level_audio/ \
        --total_alignment_path ./example/bi-lingual-alignment/ \
        --librivoxdeen_alignment ./example/undine_data.tsv \
        --aeneas_head_max 120 \
        --aeneas_tail_min 5

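Before starting a long alignment run, it can help to confirm that the input paths exist. The sketch below assumes that `--text_dir`, `--audio_path`, and `--librivoxdeen_alignment` are inputs and that the remaining path options are output locations; that split is a guess from the option names, not something the script documents here. `missing_inputs` is a hypothetical helper.

```python
from pathlib import Path

# Paths taken from the example command; treating them as inputs is an assumption.
ALIGNMENT_INPUTS = [
    "./example/en_text/",
    "./example/audio_chapters/",
    "./example/undine_data.tsv",
]


def missing_inputs(paths: list[str]) -> list[str]:
    """Return the entries of `paths` that do not exist on disk."""
    return [path for path in paths if not Path(path).exists()]


missing = missing_inputs(ALIGNMENT_INPUTS)
if missing:
    print("Missing inputs:", ", ".join(missing))
else:
    print("All assumed inputs are present.")
```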
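Note: the example folder in this repository already contains the first two chapters scraped from LibriVox, together with the text and TSV table from LibrivoxDeEn (trimmed to the first 2 chapters only). Additional data to align can be scraped with the same script shown earlier and combined with the data provided by LibrivoxDeEn.
This repository also provides complete alignment data for the following 8 books, identified by their LibrivoxDeEn ids: 9, 10, 13, 18, 23, 108, 110, 120.
Other books, such as 11, 36, 67 and 54, are also part of the LibrivoxDeEn dataset, but their chapters do not correspond one-to-one (for example, the German version of book 67 has 27 chapters while the English version has 29). Alignment data for these books is provided as well, but the results may differ if you scrape the data yourself.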
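A quick way to spot books affected by this is to compare chapter counts per language. The sketch below uses the counts for book 67 given above; the second entry is a made-up book added only to show the non-flagged case, and `needs_manual_alignment` is a hypothetical helper. Equal counts do not guarantee a one-to-one correspondence, so this is only a first filter.

```python
def needs_manual_alignment(chapter_counts: dict[int, tuple[int, int]]) -> list[int]:
    """Return ids of books whose German and English chapter counts differ."""
    return sorted(
        book_id
        for book_id, (de_chapters, en_chapters) in chapter_counts.items()
        if de_chapters != en_chapters
    )


chapter_counts = {
    67: (27, 29),    # (German, English) chapter counts stated above
    9999: (12, 12),  # hypothetical book, for illustration only
}

print(needs_manual_alignment(chapter_counts))
```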
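Some statistics were collected from the alignment data provided in this repository and are shown here. The manually aligned books are excluded from this table and from the next figure; the full table can be found in the original paper.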
| | German | English |
|---|---|---|
| Number of files | 18868 | 18868 |
| Total time (hh:mm:ss) | 39:11:08 | 40:52:31 |
| Speakers | 41 | 22 |
Note: speakers are counted separately for each book, so some speakers may be counted more than once.
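As a rough sanity check on the table, the average duration per file can be derived from the total time and the number of files. The figures below are copied from the table; `hms_to_seconds` is a hypothetical helper, and the averages are derived values, not numbers reported by the authors.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


NUM_FILES = 18868
TOTAL_TIME = {"German": "39:11:08", "English": "40:52:31"}

average_seconds = {
    language: hms_to_seconds(total) / NUM_FILES
    for language, total in TOTAL_TIME.items()
}

for language, average in average_seconds.items():
    print(f"{language}: {average:.2f} s per file on average")
```

Both languages come out at a little under 8 seconds per file, which is consistent with sentence-level segments.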
Number of hours for each book aligned in this repository:
When using this work, please cite the original paper and the authors of LibrivoxDeEn.

    @inproceedings{jeuris-niehues-2022-libris2s,
        title = "{L}ibri{S}2{S}: A {G}erman-{E}nglish Speech-to-Speech Translation Corpus",
        author = "Jeuris, Pedro and Niehues, Jan",
        booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
        month = jun,
        year = "2022",
        address = "Marseille, France",
        publisher = "European Language Resources Association",
        url = "https://aclanthology.org/2022.lrec-1.98",
        pages = "928--935",
        abstract = "Recently, we have seen an increasing interest in the area of speech-to-text translation. This has led to astonishing improvements in this area. In contrast, the activities in the area of speech-to-speech translation is still limited, although it is essential to overcome the language barrier. We believe that one of the limiting factors is the availability of appropriate training data. We address this issue by creating LibriS2S, to our knowledge the first publicly available speech-to-speech training corpus between German and English. For this corpus, we used independently created audio for German and English leading to an unbiased pronunciation of the text in both languages. This allows the creation of a new text-to-speech and speech-to-speech translation model that directly learns to generate the speech signal based on the pronunciation of the source language. Using this created corpus, we propose Text-to-Speech models based on the example of the recently proposed FastSpeech 2 model that integrates source language information. We do this by adapting the model to take information such as the pitch, energy or transcript from the source speech as additional input.",
    }

    @article{beilharz19,
        title = {LibriVoxDeEn: A Corpus for German-to-English Speech Translation and Speech Recognition},
        author = {Beilharz, Benjamin and Sun, Xin and Karimova, Sariya and Riezler, Stefan},
        journal = {Proceedings of the Language Resources and Evaluation Conference},
        journal-abbrev = {LREC},
        year = {2020},
        city = {Marseille, France},
        url = {https://arxiv.org/pdf/1910.07924.pdf}
    }