This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.
This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.
```python
from espnet2.bin.asr_inference import Speech2Text

model = Speech2Text.from_pretrained(
    "TalTechNLP/espnet2_estonian",
    lm_weight=0.6,
    ctc_weight=0.4,
    beam_size=60,
)

# read a sound file with 16k sample rate
import soundfile

speech, rate = soundfile.read("speech.wav")
assert rate == 16000

text, *_ = model(speech)
print(text[0])
```

Limitations and bias
Since this model was trained mostly on broadcast speech and texts from the web, it might have problems correctly decoding the following:
Acoustic training data:
Type | Amount (h) |
---|---|
Broadcast speech | 591 |
Spontaneous speech | 53 |
Elderly speech corpus | 53 |
Talks, lectures | 49 |
Parliament speeches | 31 |
Total | 761 |
Language model training data:
Standard ESPnet2 Conformer recipe.
dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
---|---|---|---|---|---|---|---|---|
decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/aktuaalne2021.testset | 2864 | 56575 | 93.1 | 4.5 | 2.4 | 2.0 | 8.9 | 63.4 |
decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.devset | 273 | 4677 | 93.9 | 3.6 | 2.4 | 1.2 | 7.3 | 46.5 |
decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/jutusaated.testset | 818 | 11093 | 94.7 | 2.7 | 2.5 | 0.9 | 6.2 | 45.0 |
decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.devset | 1207 | 13865 | 82.3 | 8.5 | 9.3 | 3.4 | 21.2 | 74.1 |
decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave/www-trans.testset | 1648 | 22707 | 86.4 | 7.6 | 6.0 | 2.5 | 16.1 | 75.7 |
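
The columns above appear to follow the usual sclite-style scoring convention: Corr, Sub, Del and Ins are percentages of the reference word count (Wrd), and Err is their error sum (for the first row, 4.5 + 2.4 + 2.0 = 8.9). The following is a minimal sketch of how such figures can be derived for a single reference/hypothesis pair, assuming plain whitespace tokenization and a standard minimum-edit-distance alignment. The function names `align_counts` and `error_rates` are hypothetical and are not part of ESPnet or sclite; the actual scoring tool may normalize text and break alignment ties differently.

```python
from typing import Dict, List, Tuple


def align_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int, int]:
    """Return (correct, substitutions, deletions, insertions) for one
    minimum-edit-distance alignment of hyp against ref."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # Each cell holds (total_errors, correct, sub, del, ins) for ref[:i] vs hyp[:j].
    table = [[(0, 0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = (i, 0, 0, i, 0)  # every reference word deleted
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, rows):
        for j in range(1, cols):
            e, c, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, c + 1, s, d, n)      # match
            else:
                diagonal = (e + 1, c, s + 1, d, n)  # substitution
            e, c, s, d, n = table[i - 1][j]
            deletion = (e + 1, c, s, d + 1, n)
            e, c, s, d, n = table[i][j - 1]
            insertion = (e + 1, c, s, d, n + 1)
            table[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])
    _, c, s, d, n = table[-1][-1]
    return c, s, d, n


def error_rates(reference: str, hypothesis: str) -> Dict[str, float]:
    """Express the alignment counts as percentages of the reference length."""
    c, s, d, n = align_counts(reference.split(), hypothesis.split())
    total = c + s + d  # number of reference words
    if total == 0:
        raise ValueError("reference must contain at least one word")
    pct = lambda count: 100.0 * count / total
    return {"Corr": pct(c), "Sub": pct(s), "Del": pct(d), "Ins": pct(n),
            "Err": pct(s + d + n)}


# One substituted word and one deleted word out of four reference words.
rates = error_rates("tere hommikust kallid kuulajad", "tere hommikus kallid")
print(rates)
```

Because insertions are counted against the reference length, Err can exceed 100 − Corr, which is why the Corr and Err columns in the table do not sum to 100.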
```bibtex
@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
```