使用Python进行AI语音转文本转语音—操作指南

2024年02月18日由 alex 发表 497 0

程序流程

服务器运行后，用户将听到应用程序“说话”，提示他们选择想要交谈的人物并开始与他们选择的角色交谈。每次他们想大声说话时，他们都应该在说话时按住键盘上的某个键。当他们说完（并释放按键）时，他们的录音将被Whisper（语音到文本模型OpenAI）转录，并且转录将被发送ChatGPT以获取响应。将使用文本转语音库大声读出响应，用户将听到它。

执行

注意：该项目是在Windows操作系统上开发的，并包含该pyttsx3库，该库缺乏与M1/M2芯片的兼容性。由于pyttsx3Mac 不支持，建议用户探索与 macOS 环境兼容的替代文本转语音库。

集成 Openai

我使用了两个 OpenAI 模型： Whisper 用于语音到文本的转录，ChatGPT API 用于根据用户对所选数字的输入生成回复。

获得 OpenAI API 密钥后，请将其设置为环境变量，以便在调用 API 时使用。确保不要将密钥推送到代码库或任何公共位置，也不要不安全地共享密钥。

语音转文本--创建转录

语音转文本功能是通过 OpenAI 模型 Whisper 实现的。

以下是负责转录功能的代码片段：

async def get_transcript(audio_file_path: str, 
                         text_to_draw_while_waiting: str) -> Optional[str]:
    openai.api_key = os.environ.get("OPENAI_API_KEY")
    audio_file = open(audio_file_path, "rb")
    transcript = None
    async def transcribe_audio() -> None:
        nonlocal transcript
        try:
            response = openai.Audio.transcribe(
                model="whisper-1", file=audio_file, language="en")
            transcript = response.get("text")
        except Exception as e:
            print(e)
    draw_thread = Thread(target=print_text_while_waiting_for_transcription(
        text_to_draw_while_waiting))
    draw_thread.start()
    transcription_task = asyncio.create_task(transcribe_audio())
    await transcription_task
    if transcript is None:
        print("Transcription not available within the specified timeout.")
    return transcript

该函数被标记为异步（async），因为 API 调用可能需要一段时间才能返回响应，我们等待它以确保在收到响应之前程序不会继续运行。

正如你所看到的，get_transcript 函数也调用了 print_text_while_waiting_for_transcription 函数。为什么呢？因为获取转录是一项耗时的任务，我们希望让用户知道程序正在积极处理他们的请求，而不是卡住或无响应。因此，在用户等待下一步时，这段文字会逐渐打印出来。

使用 FuzzyWuzzy 进行文本比较的字符串匹配

将语音转录为文本后，我们要么按原样使用，要么尝试将其与现有字符串进行比较。

比较用例包括：从预定义的选项列表中选择一个数字，决定是否继续播放，以及当选择继续播放时，决定是选择一个新的数字还是坚持当前的数字。

在这种情况下，我们希望将用户的口语输入转录与列表中的选项进行比较，因此我们决定使用 FuzzyWuzzy 库进行字符串匹配。

这样，只要匹配分数超过预定义的阈值，就能从列表中选择最接近的选项。

下面是我们的函数片段：

def detect_chosen_option_from_transcript(
        transcript: str, options: List[str]) -> str:
    best_match_score = 0
    best_match = ""
    for option in options:
        score = fuzz.token_set_ratio(transcript.lower(), option.lower())
        if score > best_match_score:
            best_match_score = score
            best_match = option
    if best_match_score >= 70:
        return best_match
    else:
        return ""

获取 ChatGPT 响应

转录完成后，我们就可以将其发送到 ChatGPT 以获得响应。

对于每个 ChatGPT 请求，我们都添加了一个提示，要求得到简短而有趣的回复。我们还告诉 ChatGPT 假装是哪个人物。

因此，我们的功能如下：

这样，只要匹配分数超过预定义的阈值，就能从列表中选择最接近的选项。

下面是我们的函数片段：

def get_gpt_response(transcript: str, chosen_figure: str) -> str:
    system_instructions = get_system_instructions(chosen_figure)
    try:
        return make_openai_request(
            system_instructions=system_instructions, 
            user_question=transcript).choices[0].message["content"]
    except Exception as e:
        logging.error(f"could not get ChatGPT response. error: {str(e)}")
        raise e

系统说明如下：

def get_system_instructions(figure: str) -> str:
    return f"You provide funny and short answers. You are: {figure}"

文本转语音

在文本到语音部分，我们选择了一个名为 pyttsx3 的 Python 库。这一选择不仅简单易用，而且还具有一些额外的优势。它是免费的，提供两种语音选择--男声和女声，并允许你选择以每分钟字数为单位的说话速度（语速）。

用户启动游戏时，可以从预定义的选项列表中选择一个角色。如果我们在列表中找不到与他们所说的相匹配的角色，我们就会从 "后备人物 "列表中随机选择一个角色。在这两个列表中，每个角色都与一个性别相关联，因此我们的文本到语音功能也会收到与所选性别相对应的语音 ID。

这就是我们的文本到语音功能：

def text_to_speech(text: str, gender: str = Gender.FEMALE.value) -> None:
    engine = pyttsx3.init()
    engine.setProperty("rate", WORDS_PER_MINUTE_RATE)
    voices = engine.getProperty("voices")
    voice_id = voices[0].id if gender == "male" else voices[1].id
    engine.setProperty("voice", voice_id)
    engine.say(text)
    engine.runAndWait()

主要流程

现在我们已经大致完成了应用程序的所有部分，重头戏来了！主要流程概述如下。你可能会注意到一些我们没有深入研究的函数（如 choose_figure、play_round），但你可以通过查看 repoplay_round来探索完整的代码。最终，这些高级函数中的大部分都与我们上面介绍的内部函数相关联。

下面是主要游戏流程的一个片段：

import asyncio
from src.handle_transcript import text_to_speech
from src.main_flow_helpers import choose_figure, start, play_round, \
    is_another_round

def farewell() -> None:
    farewell_message = "It was great having you here, " \
                       "hope to see you again soon!"
    print(f"\n{farewell_message}")
    text_to_speech(farewell_message)

async def get_round_settings(figure: str) -> dict:
    new_round_choice = await is_another_round()
    if new_round_choice == "new figure":
        return {"figure": "", "another_round": True}
    elif new_round_choice == "no":
        return {"figure": "", "another_round": False}
    elif new_round_choice == "yes":
        return {"figure": figure, "another_round": True}

async def main():
    start()
    another_round = True
    figure = ""
    while True:
        if not figure:
            figure = await choose_figure()
        while another_round:
            await play_round(chosen_figure=figure)
            user_choices = await get_round_settings(figure)
            figure, another_round = \
                user_choices.get("figure"), user_choices.get("another_round")
            if not figure:
                break
        if another_round is False:
            farewell()
            break

if __name__ == "__main__":
    asyncio.run(main())

结论

在本文中，我们学习了如何使用 Python 创建语音到文本到语音的游戏，并将其与人工智能交织在一起。我们使用该Whisper模型进行OpenAI语音识别，使用该FuzzyWuzzy库进行文本匹配，通过其开发人员 API 挖掘其对话魔力，并通过文本转语音ChatGPT将其变为现实。pyttsx3虽然OpenAI的服务（Whisper以及ChatGPT针对开发人员）的成本确实较低，但它的预算友好。

文章来源：https://medium.com/towards-data-science/speech-to-text-to-speech-with-ai-using-python-a-how-to-guide-ee9b0b0ef082

标签：

人工智能 Python

0 评论

欢迎关注ATYUN官方公众号

商务合作及内容投稿请联系邮箱:bd@atyun.com

上一篇 FastText：彻底改变词嵌入和文本分类

下一篇【Python】如何用Python来绘制交互式地图

评论登录

要发表评论，您必须先登录。

jonatasgrosman/wav2vec2-large-xlsr-53-english facebook/dino-vitb16 bert-base-uncased xlm-roberta-large xlm-roberta-base gpt2 microsoft/resnet-50 facebook/dino-vits8

AGENTIC AI如何塑造未来