模型:
Harveenchadha/vakyansh-wav2vec2-hindi-him-4200
检查空格演示 here
在多语言预训练模型 CLSRIL-23 上进行微调。原始的fairseq检查点存在 here 。使用此模型时,请确保语音输入采样率为16kHz。
注意:此模型的结果没有使用语言模型,因此在某些情况下可能会有更高的词错误率。
此模型使用了4200小时的印地语标记数据进行训练。标记数据目前不在公共领域中。
模型是使用Vakyansh团队在Ekstep上建立的实验平台进行训练的。这是 training repository 。
如果你想在wandb上查看训练日志,可以 here 。
可以直接使用该模型(无需语言模型),如下所示:
import soundfile as sf import torch from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor import argparse def parse_transcription(wav_file): # load pretrained model processor = Wav2Vec2Processor.from_pretrained("Harveenchadha/vakyansh-wav2vec2-hindi-him-4200") model = Wav2Vec2ForCTC.from_pretrained("Harveenchadha/vakyansh-wav2vec2-hindi-him-4200") # load audio audio_input, sample_rate = sf.read(wav_file) # pad input values and return pt tensor input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values # INFERENCE # retrieve logits & take argmax logits = model(input_values).logits predicted_ids = torch.argmax(logits, dim=-1) # transcribe transcription = processor.decode(predicted_ids[0], skip_special_tokens=True) print(transcription)
可以按照以下方式对Common Voice的印地语测试数据进行评估。
import torch import torchaudio from datasets import load_dataset, load_metric from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor import re test_dataset = load_dataset("common_voice", "hi", split="test") wer = load_metric("wer") processor = Wav2Vec2Processor.from_pretrained("Harveenchadha/vakyansh-wav2vec2-hindi-him-4200") model = Wav2Vec2ForCTC.from_pretrained("Harveenchadha/vakyansh-wav2vec2-hindi-him-4200") model.to("cuda") resampler = torchaudio.transforms.Resample(48_000, 16_000) chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“]' # Preprocessing the datasets. # We need to read the aduio files as arrays def speech_file_to_array_fn(batch): batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower() speech_array, sampling_rate = torchaudio.load(batch["path"]) batch["speech"] = resampler(speech_array).squeeze().numpy() return batch test_dataset = test_dataset.map(speech_file_to_array_fn) # Preprocessing the datasets. # We need to read the aduio files as arrays def evaluate(batch): inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True) with torch.no_grad(): logits = model(inputs.input_values.to("cuda")).logits pred_ids = torch.argmax(logits, dim=-1) batch["pred_strings"] = processor.batch_decode(pred_ids, skip_special_tokens=True) return batch result = test_dataset.map(evaluate, batched=True, batch_size=8) print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
测试结果:33.17%
感谢Ekstep基金会使这成为可能。Vakyansh团队将开源所有印度语言的语音模型。