Conversational Media Platform: Chatting with Podcasts and Videos Using OpenAI, Qdrant, and Gemma2

Published July 1, 2024 by alex

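In the digital age, our conversational media platform changes how you engage with podcasts and videos. Using OpenAI Whisper for transcription, we convert spoken words into text, which makes the content easier to search and interact with. Qdrant, our semantic search engine, lets users retrieve specific segments. We add interactivity by integrating the Gemma2 LLM hosted on Ollama, so users can chat with the content and receive insightful responses. This hybrid approach combines the robustness of cloud services with the efficiency of local solutions, balancing performance and reliability.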

Architecture:

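At the core of the architecture, audio content from podcasts and YouTube videos is first sent to a transcription service. The service uses OpenAI Whisper for speech-to-text conversion. Whisper is known for its accuracy and its ability to handle varying audio quality, which makes it a reliable choice for transcribing multimedia content.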

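Once the audio has been transcribed into text, embeddings of the transcript are pushed to Qdrant, a high-performance vector search engine hosted on GCP. Qdrant manages the large volume of data involved and provides efficient, scalable, and fast retrieval of specific content segments, so users can search and navigate large amounts of material.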
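
To make the retrieval step concrete, here is a minimal sketch of what "find the most similar segments" means, using brute-force cosine similarity over a handful of toy vectors. It is only an illustration: Qdrant uses an approximate index rather than a full scan, and the chunk names and three-dimensional vectors below are made up for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], store: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine_similarity(query, vec), chunk_id) for chunk_id, vec in store.items()]
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Hypothetical embeddings; real ones have hundreds of dimensions
toy_store = {
    "chunk-about-hnsw": [0.9, 0.1, 0.0],
    "chunk-about-whisper": [0.0, 0.2, 0.9],
    "chunk-about-indexing": [0.7, 0.6, 0.1],
}
query_vector = [1.0, 0.2, 0.0]

for chunk_id, score in top_k(query_vector, toy_store):
    print(f"{chunk_id}: {score:.3f}")
```

The full scan here is linear in the number of stored chunks, which is exactly the cost a dedicated vector engine is meant to avoid at scale.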

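The transcribed text and its embeddings are then used by a language model hosted on Ollama. This model (Gemma 2 in this case) lets users interact with the content by asking questions and receiving insightful replies. Integrating an LLM (large language model) makes the experience more interactive, so that it feels like a conversation with an intelligent companion.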
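
As a rough illustration of how retrieved transcript chunks can reach the model, the sketch below assembles a prompt from a question and a list of chunks. The template wording, the build_prompt name, and the max_chars budget are assumptions made for this example; in the project itself LlamaIndex builds the prompt internally.

```python
def build_prompt(question: str, chunks: list[str], max_chars: int = 2000) -> str:
    """Number the retrieved chunks and place them ahead of the question."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        # Stop adding context once the (hypothetical) character budget is spent
        if used + len(entry) > max_chars:
            break
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n"
        "If the context is not enough, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


retrieved = ["  HNSW is a graph index. ", "Whisper transcribes audio."]
print(build_prompt("What is HNSW?", retrieved))
```

Capping the context keeps the prompt within the model's window; a character count is a crude stand-in for a token count.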

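The overall integration is powered by LlamaIndex, which ties the components together and keeps the data flow and operations running smoothly. LlamaIndex manages the interaction between the transcription service, Qdrant, and Ollama, retrieving the relevant context and producing coherent responses to user queries.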

Implementation

The project has three main files: audio_transcription_service.py, video_transcription_service.py, and main.py.

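audio_transcription_service.py defines a class, AudioTranscription, that transcribes audio files with OpenAI's Whisper model. The class is initialized by loading the "small" variant of the Whisper model. The transcribe method takes an optional directory path for the audio files and a flag that enables logging. It walks every entry under the given directory and checks that each one is a regular file. If logging is enabled, it prints the file path. For each file, the method attempts to transcribe the audio and capture word-level timestamps. On success, it logs the transcription result (if logging is enabled), makes sure the transcription directory exists, and writes the transcribed text to a new file named after the original audio file with the suffix "_transcript.txt". If any error occurs during transcription, the method prints the error and returns False. If all files are processed without errors, it returns True.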

import os
from pathlib import Path

import whisper

# Directory reserved for transcriptions (created if missing)
dir_path = '../transcriptions'

class AudioTranscription:
    def __init__(self):
        # Load the "small" Whisper checkpoint once and reuse it for every file
        self.model = whisper.load_model("small")
    def transcribe(self, audio_file_dir: str = '', is_log_enabled: bool = False) -> bool:
        for file_path in Path(audio_file_dir).rglob('*'):
            # Skip directories and text files such as transcripts from earlier runs
            if not file_path.is_file() or file_path.suffix == '.txt':
                continue
            if is_log_enabled:
                print(file_path)
            try:
                result = self.model.transcribe(audio=str(file_path), word_timestamps=True)
                if is_log_enabled:
                    print(result)
                # Ensure the directory exists
                os.makedirs(dir_path, exist_ok=True)
                # The transcript is written next to the audio file, where main.py looks for .txt files
                with open(file=f'{file_path}_transcript.txt', mode='w', encoding='utf-8') as transcription:
                    transcription.write(result.get('text', ''))
            except Exception as e:
                # e.__cause__ is usually None, so print the exception itself
                print('Error', e)
                return False
        return True
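
The class above requests timestamps but only saves the plain text. As a hedged sketch of how the timing information could be kept, the helper below turns the "segments" list of a Whisper-style result into timestamped lines. The function names and the sample dictionary are hypothetical; the sample is hand-written to mirror the shape of a Whisper result and is not real model output.

```python
from typing import Any


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS (fractions are dropped)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segments_to_lines(result: dict[str, Any]) -> list[str]:
    """Turn a Whisper-style result into '[start - end] text' lines."""
    lines = []
    for segment in result.get("segments", []):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} - {end}] {segment['text'].strip()}")
    return lines


# Hand-written sample shaped like a Whisper result (assumption, not model output)
sample_result = {
    "text": " Vector search finds similar items. It relies on embeddings.",
    "segments": [
        {"start": 0.0, "end": 3.2, "text": " Vector search finds similar items."},
        {"start": 3.2, "end": 65.9, "text": " It relies on embeddings."},
    ],
}

for line in segments_to_lines(sample_result):
    print(line)
```

Writing these lines instead of the bare text would let a retrieved chunk point back to a position in the podcast or video.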

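video_transcription_service.py downloads the audio from a YouTube video and converts it to an MP3 file. It first imports the necessary libraries: pytube for handling YouTube videos, moviepy.editor for audio conversion, and os for file operations. The function download_youtube_audio takes a YouTube URL and an output path for the MP3 file. It defines a temporary file name for the downloaded audio, creates a YouTube object from the given URL, and selects an audio-only stream. That stream is downloaded to the temporary file. The moviepy library then converts the MP4 audio file to MP3 and saves it to the specified output path. After the conversion, the temporary MP4 file is deleted as cleanup, with a check that the file exists before the deletion is attempted. Finally, the script calls download_youtube_audio with a specific YouTube URL and output path to download the audio and save it as an MP3 at that location. This script follows an article by Chiawei Lim on Medium.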

from pytube import YouTube
import moviepy.editor as mp
import os

def download_youtube_audio(youtube_url, output_path):
    # Temporary file name for the downloaded audio
    temp_audio_file = 'temp_audio.mp4'
    # Create a YouTube object
    yt = YouTube(youtube_url)
    # Select the first available audio-only stream
    audio_stream = yt.streams.filter(only_audio=True).first()
    if audio_stream is None:
        raise ValueError(f"No audio-only stream found for {youtube_url}")
    # Download the audio stream to a temporary file
    audio_stream.download(filename=temp_audio_file)
    # Convert the downloaded file to MP3
    clip = mp.AudioFileClip(temp_audio_file)
    clip.write_audiofile(output_path)
    # Release the file handle before deleting the temporary file
    clip.close()
    # Remove the temporary MP4 file if it still exists
    if os.path.exists(temp_audio_file):
        os.remove(temp_audio_file)
        print(f"{temp_audio_file} has been deleted successfully.")
    else:
        print(f"{temp_audio_file} does not exist in the current directory or it's already deleted")

youtube_url = "https://www.youtube.com/watch?v=mXNrhyw4q84&t=99s"
output_path = '../data/how_vector_search_algo_works.mp3'
download_youtube_audio(youtube_url, output_path)
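
The script uses one fixed temporary file name, so two downloads running at the same time would overwrite each other. One possible way around that, sketched below with only the standard library, is to derive the temporary name from the video ID in the URL. The extract_video_id helper and the set of accepted hosts are assumptions for this example, not part of pytube.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Hosts this sketch recognizes (assumed list, extend as needed)
WATCH_HOSTS = {"www.youtube.com", "youtube.com", "m.youtube.com"}
SHORT_HOSTS = {"youtu.be"}


def extract_video_id(youtube_url: str) -> Optional[str]:
    """Return the video ID from a watch or short URL, or None if not found."""
    parsed = urlparse(youtube_url)
    host = parsed.netloc.lower()
    if host in SHORT_HOSTS:
        return parsed.path.lstrip("/") or None
    if host in WATCH_HOSTS:
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None


video_id = extract_video_id("https://www.youtube.com/watch?v=mXNrhyw4q84&t=99s")
temp_audio_file = f"temp_audio_{video_id}.mp4"
print(temp_audio_file)
```

Extra query parameters such as the t=99s start offset are ignored, so the same video always maps to the same temporary name.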

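main.py first transcribes the audio files in the "./data" directory using the AudioTranscription class, with logging enabled. If the transcription succeeds, it sets up logging and loads the text data from the "data" directory, using a sentence splitter to break it into manageable chunks. It then initializes the Qdrant vector store client and sets up the embedding model with OllamaEmbedding. These components are configured globally in Settings.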
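
To show what a chunk size and a chunk overlap do, here is a toy sliding-window chunker. It is a simplification: it counts whitespace-separated words, whereas LlamaIndex's SentenceSplitter measures chunks in tokens and tries to keep sentences intact. The chunk_words name and the small default sizes are choices made for this example.

```python
def chunk_words(text: str, chunk_size: int = 8, chunk_overlap: int = 2) -> list[str]:
    """Split text into windows of chunk_size words that share chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


sample = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9"
for chunk in chunk_words(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.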

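The text chunks are turned into nodes, and the content of each node is embedded with the embedding model. The nodes are then indexed in a VectorStoreIndex. A VectorIndexRetriever is initialized to retrieve the top 5 most similar embeddings, and a RetrieverQueryEngine is created to query the indexed data. In addition, a HyDEQueryTransform instance is set up to improve query processing.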
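
HyDE (Hypothetical Document Embeddings) searches with the embedding of a model-written answer rather than only the embedding of the question. The sketch below illustrates that idea with injected stand-in functions. It is not LlamaIndex's implementation: the generate and embed callables, the prompt wording, and the choice to combine vectors by averaging are all assumptions made for the example.

```python
from typing import Callable

Vector = list[float]


def hyde_query_embedding(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Vector],
    include_original: bool = True,
) -> Vector:
    """Embed a hypothetical answer (and optionally the query) and average the vectors."""
    hypothetical_answer = generate(
        f"Write a short passage that answers the question: {query}"
    )
    texts = [hypothetical_answer]
    if include_original:
        texts.append(query)
    vectors = [embed(text) for text in texts]
    # Element-wise mean of the collected vectors
    return [sum(values) / len(vectors) for values in zip(*vectors)]


# Stand-ins for an LLM and an embedding model, for demonstration only
def fake_generate(prompt: str) -> str:
    return "abcd"


def fake_embed(text: str) -> Vector:
    return [float(len(text)), 1.0]


print(hyde_query_embedding("hi", fake_generate, fake_embed))
```

The intuition is that a plausible answer tends to sit closer to the relevant transcript chunks in embedding space than a short question does.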

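The program then enters a loop that keeps accepting user queries, processes them with hyde_query_engine, and prints the responses. The loop exits when the user types "bye" or "exit", at which point the Qdrant client is closed.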

import os
from services.audio_transcription_service import AudioTranscription
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
    Settings,
    get_response_synthesizer)
from llama_index.core.query_engine import RetrieverQueryEngine, TransformQueryEngine
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TextNode, MetadataMode
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from dotenv import load_dotenv, find_dotenv
import qdrant_client
import logging
_ = load_dotenv(find_dotenv())
is_audio_transcribed = AudioTranscription().transcribe(audio_file_dir='./data', is_log_enabled=True)
if is_audio_transcribed:
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
    # load the local data directory and chunk the data for further processing
    docs = SimpleDirectoryReader(input_dir="data", required_exts=[".txt"]).load_data(show_progress=True)
    text_parser = SentenceSplitter(chunk_size=512, chunk_overlap=100)
    # Connect to the Qdrant instance given by the environment and wrap it in a vector store
    logger.info("initializing the vector store related objects")
    client = qdrant_client.QdrantClient(url=os.environ['qdrant_url'], api_key=os.environ['qdrant_api_key'])
    vector_store = QdrantVectorStore(client=client, collection_name="research_papers")
    # local vector embeddings model
    logger.info("initializing the OllamaEmbedding")
    embed_model = OllamaEmbedding(model_name='nomic-embed-text:latest', base_url='http://localhost:11434')
    logger.info("initializing the global settings")
    Settings.embed_model = embed_model
    Settings.llm = Ollama(model="gemma2:latest", base_url='http://localhost:11434', request_timeout=600)
    Settings.transformations = [text_parser]
    text_chunks = []
    doc_ids = []
    nodes = []
    logger.info("enumerating docs")
    for doc_idx, doc in enumerate(docs):
        curr_text_chunks = text_parser.split_text(doc.text)
        text_chunks.extend(curr_text_chunks)
        doc_ids.extend([doc_idx] * len(curr_text_chunks))
    logger.info("enumerating text_chunks")
    for idx, text_chunk in enumerate(text_chunks):
        node = TextNode(text=text_chunk)
        src_doc = docs[doc_ids[idx]]
        node.metadata = src_doc.metadata
        nodes.append(node)
    logger.info("enumerating nodes")
    for node in nodes:
        node_embedding = embed_model.get_text_embedding(
            node.get_content(metadata_mode=MetadataMode.ALL)
        )
        node.embedding = node_embedding
    logger.info("initializing the storage context")
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    logger.info("indexing the nodes in VectorStoreIndex")
    index = VectorStoreIndex(
        nodes=nodes,
        storage_context=storage_context,
        transformations=Settings.transformations,
    )
    logger.info("initializing the VectorIndexRetriever with top_k as 5")
    vector_retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
    response_synthesizer = get_response_synthesizer()
    logger.info("creating the RetrieverQueryEngine instance")
    vector_query_engine = RetrieverQueryEngine(
        retriever=vector_retriever,
        response_synthesizer=response_synthesizer,
    )
    logger.info("creating the HyDEQueryTransform instance")
    hyde = HyDEQueryTransform(include_original=True)
    hyde_query_engine = TransformQueryEngine(vector_query_engine, hyde)
    logger.info("retrieving the response to the query")
    # Start a loop to continually get input from the user
    while True:
        # Get a query from the user
        user_query = input("Enter your query [type 'bye' to 'exit']: ")
        # Check if the user wants to terminate the loop
        if user_query.strip().lower() in ("bye", "exit"):
            # Close the Qdrant client before leaving the loop
            client.close()
            break
        response = hyde_query_engine.query(str_or_query_bundle=user_query)
        print(response)

Once the code is up and running, you can start asking questions about the YouTube videos or podcasts indexed in Qdrant.

Conclusion

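In summary, our conversational media platform shows what integrating advanced AI technologies can do for the way we interact with multimedia content. By using OpenAI Whisper for accurate audio transcription, Qdrant for efficient semantic search, and Ollama for intelligent, context-aware responses, we created a smooth and engaging user experience. The platform bridges passive content consumption and active interaction, letting users dig into podcasts and videos with ease.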

Source: https://medium.com/towardsdev/conversational-media-platform-chatting-with-podcasts-and-videos-using-openai-qdrant-and-gemma2-4208ab7e90ee
