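Introduction
Given the sheer volume of content uploaded to YouTube every day, digging useful information out of long videos is like looking for a needle in a haystack. Whether you are a researcher, a student, or a content creator, being able to extract a video's main idea and key points quickly saves time and makes you more productive. This is where an AI-based video summarizer is useful: it automatically condenses a long transcript into a concise, insightful summary.
In this article, we walk through building a YouTube video summarizer that uses LangChain to manage the AI workflow, a Llama model for language generation, and the YouTube Transcript API to fetch transcripts. By the end, you will have a working tool that turns any YouTube video with an available transcript into a clearly structured, readable summary, well suited to fast-paced research or a quick preview before diving deeper into the content.
Environment Setup
To build the summarizer, we need to install a few libraries. Each one handles a different part of the pipeline, from transcript retrieval to language generation.
Installing Ollama:
To use Ollama, first install the Ollama application on your system, then download the Llama 3.2 model (for example, by running ollama pull llama3.2).
Install the remaining libraries with the following pip command: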
pip install streamlit langchain langchain-community youtube-transcript-api
Import the required libraries:
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain_community.chat_models import ChatOllama
from langchain.prompts import PromptTemplate
from youtube_transcript_api import YouTubeTranscriptApi
from typing import Optional
import re
Defining the YouTubeSummarizer class
class YouTubeSummarizer:
    def __init__(self):
        """
        Initialize the YouTube Summarizer
        """
        self.llm = ChatOllama(temperature=0, model="llama3.2")

        # Initialize text splitter for long transcripts
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=10000,
            chunk_overlap=1000,
            separators=["\n\n", "\n", " ", ""]
        )

        # Custom prompts for the summary chain
        self.map_prompt_template = """
        Summarize the following part of a YouTube video transcript:
        "{text}"
        KEY POINTS AND TAKEAWAYS:
        """
        self.combine_prompt_template = """
        Create a detailed summary of the YouTube video based on these transcript summaries:
        "{text}"
        Please structure the summary as follows:
        1. Main Topic/Theme
        2. Key Points
        3. Important Details
        4. Conclusions/Takeaways
        DETAILED SUMMARY:
        """

        # Create the summary chain
        self.map_prompt = PromptTemplate(
            template=self.map_prompt_template,
            input_variables=["text"]
        )
        self.combine_prompt = PromptTemplate(
            template=self.combine_prompt_template,
            input_variables=["text"]
        )
        self.chain = load_summarize_chain(
            llm=self.llm,
            chain_type="map_reduce",
            map_prompt=self.map_prompt,
            combine_prompt=self.combine_prompt,
            verbose=False
        )
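In the YouTubeSummarizer class, the __init__ method sets up the core components needed to summarize a YouTube transcript. Let's go through them one by one:
1. self.llm = ChatOllama(temperature=0, model="llama3.2"): creates the chat model, backed by the llama3.2 model served locally by Ollama. Setting temperature to 0 keeps the output as deterministic as possible.
2. self.text_splitter = RecursiveCharacterTextSplitter(...): splits long transcripts into chunks of at most 10,000 characters with a 1,000-character overlap, trying the separators in order (paragraph break, line break, space, then individual characters).
3. self.map_prompt_template and self.combine_prompt_template: the raw prompt strings. The map prompt asks for the key points of a single transcript chunk; the combine prompt merges the chunk summaries into a four-part structured summary.
4. self.map_prompt and self.combine_prompt: PromptTemplate objects that wrap those strings and declare text as their only input variable.
5. self.chain = load_summarize_chain(...): builds a map_reduce summarization chain that applies the map prompt to every chunk and then the combine prompt to the collected results.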
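To make the chunking and map_reduce ideas above more concrete, here is a minimal, library-free sketch. It is not how LangChain implements either step: split_with_overlap is a simplified fixed-window splitter (the real RecursiveCharacterTextSplitter also respects separators), and fake_map and fake_combine are hypothetical stand-ins for the two LLM calls.

```python
from typing import Callable, List


def split_with_overlap(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Cut text into fixed-size windows; neighbours share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def map_reduce_summarize(
    chunks: List[str],
    map_fn: Callable[[str], str],
    combine_fn: Callable[[str], str],
) -> str:
    """Summarize each chunk (map), then summarize the joined results (reduce)."""
    partial_summaries = [map_fn(chunk) for chunk in chunks]
    return combine_fn("\n".join(partial_summaries))


# Hypothetical stand-ins for the LLM calls, so the sketch runs offline.
def fake_map(chunk: str) -> str:
    return chunk[:5].upper()


def fake_combine(joined: str) -> str:
    return "SUMMARY: " + joined.replace("\n", " | ")


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=2)
result = map_reduce_summarize(chunks, fake_map, fake_combine)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
print(result)  # SUMMARY: ABCDE | IJKLM | QRSTU
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk.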
Defining the methods of the YouTubeSummarizer class:
The extract_video_id function:
    def extract_video_id(self, youtube_url: str) -> Optional[str]:
        """
        Extract video ID from various forms of YouTube URLs
        Args:
            youtube_url (str): YouTube video URL
        Returns:
            str: Video ID if found, None otherwise
        """
        patterns = [
            r'(?:youtube\.com\/watch\?v=|youtu\.be\/|youtube\.com\/embed\/)([^&\n?]*)',
            r'(?:youtube\.com\/shorts\/)([^&\n?]*)'
        ]
        # Try each pattern in turn and return the first captured ID
        for pattern in patterns:
            match = re.search(pattern, youtube_url)
            if match:
                return match.group(1)
        return None
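The extract_video_id function retrieves the unique video ID from the various YouTube URL formats. This matters because the video ID is what is used to fetch the transcript. Here is how it works:
Purpose
The function takes a YouTube URL and identifies the video ID in it, whichever of the supported URL formats is used.
Parameters
youtube_url (str): the YouTube video URL.
Return value
Optional[str]: the video ID if one is found, otherwise None.
Patterns
The function uses two regular expression patterns to match different kinds of YouTube URLs:
Pattern 1: (?:youtube\.com\/watch\?v=|youtu\.be\/|youtube\.com\/embed\/)([^&\n?]*)
This pattern covers:
Standard watch URLs (youtube.com/watch?v=...)
Short links (youtu.be/...)
Embed URLs (youtube.com/embed/...)
It captures everything after these URL segments up to the first &, newline, or ?, which isolates the video ID.
Pattern 2: (?:youtube\.com\/shorts\/)([^&\n?]*)
This pattern covers YouTube Shorts URLs (youtube.com/shorts/...) and captures the ID in the same way.
Code logic
Iterate over the patterns: the function loops through both patterns to cover the different URL formats.
Regular expression matching: for each pattern, it calls re.search() to check whether the URL matches.
If a match is found, the ID is taken from the first capture group (match.group(1)) and returned.
If no pattern matches, the function returns None, meaning the URL is invalid or unsupported.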
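As a quick check of the matching logic described above, the following sketch runs the same two patterns as a standalone function (without self) against a few URL shapes. The example.com URL and the t=10s parameter are made up for illustration.

```python
import re
from typing import Optional

PATTERNS = [
    r'(?:youtube\.com\/watch\?v=|youtu\.be\/|youtube\.com\/embed\/)([^&\n?]*)',
    r'(?:youtube\.com\/shorts\/)([^&\n?]*)',
]


def extract_video_id(youtube_url: str) -> Optional[str]:
    """Return the first captured video ID, or None if no pattern matches."""
    for pattern in PATTERNS:
        match = re.search(pattern, youtube_url)
        if match:
            return match.group(1)
    return None


examples = [
    "https://www.youtube.com/watch?v=6K3wiD6ACWg&t=10s",
    "https://youtu.be/6K3wiD6ACWg?si=WYbngXId1RW28ADr",
    "https://www.youtube.com/embed/6K3wiD6ACWg",
    "https://www.youtube.com/shorts/6K3wiD6ACWg",
    "https://example.com/video",  # not a YouTube URL
]
for url in examples:
    print(url, "->", extract_video_id(url))
```

The first four URLs all yield 6K3wiD6ACWg and the last yields None. Note that the capture group uses *, so a URL such as https://youtu.be/ with nothing after the slash returns an empty string rather than None; the "if not video_id" check in summarize_video treats both as invalid.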
Defining the get_transcript function
    def get_transcript(self, video_id: str) -> str:
        """
        Get the transcript of a YouTube video
        Args:
            video_id (str): YouTube video ID
        Returns:
            str: Combined transcript text
        """
        try:
            transcript_list = YouTubeTranscriptApi.get_transcript(video_id)
            # Keep only the text of each entry and join into one string
            return " ".join([entry['text'] for entry in transcript_list])
        except Exception as e:
            raise Exception(f"Error getting transcript: {str(e)}")
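The get_transcript function retrieves the transcript of a YouTube video from the given video ID. This transcript is the raw material for the summary. Let's look at the key parts:
Purpose
This function calls the YouTube Transcript API, fetches the transcript for the given video ID, and merges the whole transcript into a single string that is easy to process.
Parameters
video_id (str): the YouTube video ID.
Return value
str: the full transcript text as a single string.
Code logic
Fetch the transcript: YouTubeTranscriptApi.get_transcript(video_id) returns a list of transcript entries.
Join the transcript entries: the text field of every entry is joined with single spaces into one string.
Exception handling:
If something goes wrong while fetching the transcript (for example, no transcript is available or the video ID is invalid), the function raises a new exception whose message starts with "Error getting transcript:" followed by the original error.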
Example usage
summarizer = YouTubeSummarizer()
transcript = summarizer.get_transcript("6K3wiD6ACWg")
print(transcript)
# Output: a single string such as "Hello world Welcome to the video ..."
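Error handling
Because every failure is re-raised with the same "Error getting transcript:" prefix, a caller such as summarize_video can catch it in one place and report it.
In short, get_transcript retrieves and flattens the video transcript into a single string suitable for summarization. This keeps the handling of long spoken content simple, with no need to deal with individual timestamps.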
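The joining step can be tried without any network access. The sketch below uses hand-written sample entries; it assumes each entry is a dictionary with a text field plus start and duration timing fields, of which only text is used. The helper name join_transcript and the sample data are hypothetical.

```python
from typing import Dict, List, Union

Entry = Dict[str, Union[str, float]]


def join_transcript(entries: List[Entry]) -> str:
    """Drop the timing fields and join the text of every entry with spaces."""
    return " ".join(str(entry["text"]) for entry in entries)


# Hypothetical sample data standing in for a fetched transcript.
sample_entries: List[Entry] = [
    {"text": "Hello world", "start": 0.0, "duration": 1.5},
    {"text": "Welcome to the video", "start": 1.5, "duration": 2.0},
]

joined = join_transcript(sample_entries)
print(joined)  # Hello world Welcome to the video
```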
Defining the summarize_video function
    def summarize_video(self, youtube_url: str) -> dict:
        """
        Summarize a YouTube video given its URL
        Args:
            youtube_url (str): YouTube video URL
        Returns:
            dict: Summary result with status and content
        """
        try:
            # Extract video ID
            video_id = self.extract_video_id(youtube_url)
            if not video_id:
                return {
                    "status": "error",
                    "message": "Invalid YouTube URL"
                }
            # Get transcript
            transcript = self.get_transcript(video_id)
            # Split transcript into chunks
            texts = self.text_splitter.create_documents([transcript])
            # Generate summary
            summary = self.chain.run(texts)
            return {
                "status": "success",
                "summary": summary,
                "video_id": video_id
            }
        except Exception as e:
            return {
                "status": "error",
                "message": str(e)
            }
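The summarize_video function orchestrates the whole summarization process, combining the helper methods to produce a concise summary from a YouTube URL. Let's walk through each part of its workflow:
Purpose
This function takes a YouTube URL, retrieves and processes its transcript, and generates a structured summary. It returns a dictionary containing the result and some metadata.
Parameters
youtube_url (str): the YouTube video URL.
Return value
dict: on success, a dictionary with status, summary, and video_id; on failure, a dictionary with status and message.
Code logic
Extract the video ID: extract_video_id is called on the URL. If no ID is found, the function returns an error dictionary with the message "Invalid YouTube URL".
Get the transcript: get_transcript fetches the transcript text for that ID.
Split the transcript into chunks: text_splitter.create_documents turns the transcript into a list of documents small enough for the model.
Generate the summary: the map_reduce chain runs over the chunks and returns the final summary text.
Return the success dictionary: status is set to "success", and the summary and video ID are included.
Exception handling: any exception raised along the way is caught and returned as an error dictionary whose message is the exception text.
The summarize_video function ties these steps together, turning a YouTube URL into an informative, readable summary. Because it reports failures through the returned dictionary instead of raising, callers only need to check the status field.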
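The status-dictionary contract described above can be exercised in isolation. The sketch below mirrors the control flow of summarize_video but takes the three steps as plain callables, so it runs without LangChain, Ollama, or network access. The function summarize and the fake_* helpers are hypothetical stand-ins, not part of the article's class.

```python
from typing import Callable, Dict, Optional


def summarize(
    url: str,
    extract_id: Callable[[str], Optional[str]],
    fetch_transcript: Callable[[str], str],
    summarize_text: Callable[[str], str],
) -> Dict[str, str]:
    """Run the pipeline and always return a status dictionary, never raise."""
    try:
        video_id = extract_id(url)
        if not video_id:
            return {"status": "error", "message": "Invalid YouTube URL"}
        transcript = fetch_transcript(video_id)
        return {
            "status": "success",
            "summary": summarize_text(transcript),
            "video_id": video_id,
        }
    except Exception as exc:
        return {"status": "error", "message": str(exc)}


# Hypothetical stand-ins for the real steps.
def fake_extract(url: str) -> Optional[str]:
    return url.rsplit("/", 1)[-1] or None


def fake_fetch(video_id: str) -> str:
    if video_id == "no-captions":
        raise Exception("Error getting transcript: captions unavailable")
    return "some transcript text"


def fake_summarize(text: str) -> str:
    return text.upper()


ok = summarize("https://youtu.be/abc123", fake_extract, fake_fetch, fake_summarize)
bad_url = summarize("https://youtu.be/", fake_extract, fake_fetch, fake_summarize)
no_caps = summarize("https://youtu.be/no-captions", fake_extract, fake_fetch, fake_summarize)

for outcome in (ok, bad_url, no_caps):
    print(outcome)
```

All three calls return normally: the first with status "success", the other two with status "error" and a message explaining what went wrong.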
The complete script:
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain_community.chat_models import ChatOllama
from langchain.prompts import PromptTemplate
from youtube_transcript_api import YouTubeTranscriptApi
from typing import Optional
import re


class YouTubeSummarizer:
    def __init__(self):
        """
        Initialize the YouTube Summarizer
        """
        self.llm = ChatOllama(temperature=0, model="llama3.2")

        # Initialize text splitter for long transcripts
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=10000,
            chunk_overlap=1000,
            separators=["\n\n", "\n", " ", ""]
        )

        # Custom prompts for the summary chain
        self.map_prompt_template = """
        Summarize the following part of a YouTube video transcript:
        "{text}"
        KEY POINTS AND TAKEAWAYS:
        """
        self.combine_prompt_template = """
        Create a detailed summary of the YouTube video based on these transcript summaries:
        "{text}"
        Please structure the summary as follows:
        1. Main Topic/Theme
        2. Key Points
        3. Important Details
        4. Conclusions/Takeaways
        DETAILED SUMMARY:
        """

        # Create the summary chain
        self.map_prompt = PromptTemplate(
            template=self.map_prompt_template,
            input_variables=["text"]
        )
        self.combine_prompt = PromptTemplate(
            template=self.combine_prompt_template,
            input_variables=["text"]
        )
        self.chain = load_summarize_chain(
            llm=self.llm,
            chain_type="map_reduce",
            map_prompt=self.map_prompt,
            combine_prompt=self.combine_prompt,
            verbose=False
        )

    def extract_video_id(self, youtube_url: str) -> Optional[str]:
        """
        Extract video ID from various forms of YouTube URLs
        Args:
            youtube_url (str): YouTube video URL
        Returns:
            str: Video ID if found, None otherwise
        """
        patterns = [
            r'(?:youtube\.com\/watch\?v=|youtu\.be\/|youtube\.com\/embed\/)([^&\n?]*)',
            r'(?:youtube\.com\/shorts\/)([^&\n?]*)'
        ]
        for pattern in patterns:
            match = re.search(pattern, youtube_url)
            if match:
                return match.group(1)
        return None

    def get_transcript(self, video_id: str) -> str:
        """
        Get the transcript of a YouTube video
        Args:
            video_id (str): YouTube video ID
        Returns:
            str: Combined transcript text
        """
        try:
            transcript_list = YouTubeTranscriptApi.get_transcript(video_id)
            return " ".join([entry['text'] for entry in transcript_list])
        except Exception as e:
            raise Exception(f"Error getting transcript: {str(e)}")

    def summarize_video(self, youtube_url: str) -> dict:
        """
        Summarize a YouTube video given its URL
        Args:
            youtube_url (str): YouTube video URL
        Returns:
            dict: Summary result with status and content
        """
        try:
            # Extract video ID
            video_id = self.extract_video_id(youtube_url)
            if not video_id:
                return {
                    "status": "error",
                    "message": "Invalid YouTube URL"
                }
            # Get transcript
            transcript = self.get_transcript(video_id)
            # Split transcript into chunks
            texts = self.text_splitter.create_documents([transcript])
            # Generate summary
            summary = self.chain.run(texts)
            return {
                "status": "success",
                "summary": summary,
                "video_id": video_id
            }
        except Exception as e:
            return {
                "status": "error",
                "message": str(e)
            }


def main():
    # Initialize summarizer
    summarizer = YouTubeSummarizer()
    # Example YouTube video URL
    video_url = "https://youtu.be/6K3wiD6ACWg?si=WYbngXId1RW28ADr"
    # Get summary
    result = summarizer.summarize_video(video_url)
    if result["status"] == "success":
        print("\nVideo Summary:")
        print(result["summary"])
    else:
        print(f"\nError: {result['message']}")


if __name__ == "__main__":
    main()
Output:
Video Summary:
**Detailed Summary**
**Main Topic/Theme:** The Daily Routine and Habits of Marcus Aurelius, a Roman Emperor and Philosopher
**Key Points:**
* Deep work is essential for productivity and focus
* Balance is crucial in life to avoid burnout and maintain well-being
* Marcus Aurelius's approach to life shows that one can find peace and relaxation even in the midst of chaos
**Important Details:**
* Marcus Aurelius's daily routine included meditation, journaling, and reflection on his day
* He practiced Memento Mori, which means "remember that you will die," to stay humble and focused on what's truly important
* He prioritized discipline over fame and wealth, recognizing that true power comes from within
* Despite being a busy emperor, he made time for self-care, including exercise, reading, and relaxation
* He remained humble despite his success, recognizing that his fame was fleeting and that true value lies in living a virtuous life
**Conclusions/Takeaways:**
* To achieve productivity and focus, it's essential to prioritize deep work and minimize distractions
* Finding balance between work and personal life is crucial for maintaining well-being and avoiding burnout
* Marcus Aurelius's approach to life shows that one can find peace and relaxation even in the midst of chaos by cultivating humility, discipline, and self-awareness
* By incorporating simple habits like meditation, journaling, and reflection into daily routine, individuals can set themselves up for success and live a more virtuous life
Overall, this summary highlights the importance of living a balanced and disciplined life, prioritizing deep work and self-care, and cultivating humility in order to achieve productivity, focus, and overall well-being.
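Turning the code into a Streamlit app:
Steps to run the Streamlit app:
Set up the environment
Before running the Streamlit app, make sure all the required libraries are installed.
pip install streamlit
pip install -r requirements.txt
Create a Python script file
Use the following code: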
import streamlit as st
from langchain_community.chat_models import ChatOllama
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate
from youtube_transcript_api import YouTubeTranscriptApi
from typing import Optional
import re


class YouTubeSummarizer:
    def __init__(self):
        self.llm = ChatOllama(temperature=0, model="llama3.2")
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=10000,
            chunk_overlap=1000,
            separators=["\n\n", "\n", " ", ""]
        )
        self.map_prompt_template = """
        Summarize the following part of a YouTube video transcript:
        "{text}"
        KEY POINTS AND TAKEAWAYS:
        """
        self.combine_prompt_template = """
        Create a detailed summary of the YouTube video based on these transcript summaries:
        "{text}"
        Please structure the summary as follows:
        1. Main Topic/Theme
        2. Key Points
        3. Important Details
        4. Conclusions/Takeaways
        DETAILED SUMMARY:
        """
        self.map_prompt = PromptTemplate(
            template=self.map_prompt_template,
            input_variables=["text"]
        )
        self.combine_prompt = PromptTemplate(
            template=self.combine_prompt_template,
            input_variables=["text"]
        )
        self.chain = load_summarize_chain(
            llm=self.llm,
            chain_type="map_reduce",
            map_prompt=self.map_prompt,
            combine_prompt=self.combine_prompt,
            verbose=False
        )

    def extract_video_id(self, youtube_url: str) -> Optional[str]:
        patterns = [
            r'(?:youtube\.com\/watch\?v=|youtu\.be\/|youtube\.com\/embed\/)([^&\n?]*)',
            r'(?:youtube\.com\/shorts\/)([^&\n?]*)'
        ]
        for pattern in patterns:
            match = re.search(pattern, youtube_url)
            if match:
                return match.group(1)
        return None

    def get_transcript(self, video_id: str) -> str:
        try:
            transcript_list = YouTubeTranscriptApi.get_transcript(video_id)
            return " ".join([entry['text'] for entry in transcript_list])
        except Exception as e:
            raise Exception(f"Error getting transcript: {str(e)}")

    def summarize_video(self, youtube_url: str) -> dict:
        try:
            video_id = self.extract_video_id(youtube_url)
            if not video_id:
                return {
                    "status": "error",
                    "message": "Invalid YouTube URL"
                }
            transcript = self.get_transcript(video_id)
            texts = self.text_splitter.create_documents([transcript])
            summary = self.chain.run(texts)
            return {
                "status": "success",
                "summary": summary,
                "video_id": video_id
            }
        except Exception as e:
            return {
                "status": "error",
                "message": str(e)
            }


def main():
    st.set_page_config(
        page_title="YouTube Video Summarizer",
        layout="wide"
    )

    # Add custom CSS
    st.markdown("""
        <style>
        .big-font {
            font-size:24px !important;
            font-weight: bold;
        }
        .summary-box {
            padding: 20px;
            border-radius: 10px;
            background-color: #f0f2f6;
            margin: 10px 0;
        }
        </style>
    """, unsafe_allow_html=True)

    # Header
    st.markdown('<p class="big-font">YouTube Video Summarizer</p>', unsafe_allow_html=True)
    st.markdown("Powered by LangChain and Ollama")

    # Sidebar with information
    with st.sidebar:
        st.markdown("### About")
        st.markdown("""
        This app uses AI to create summaries of YouTube videos.

        **Features:**
        - Supports regular YouTube videos and shorts
        - Provides structured summaries
        - Uses local Ollama model

        **Note:** Videos must have closed captions/transcripts available.
        """)
        st.markdown("### Instructions")
        st.markdown("""
        1. Paste a YouTube URL
        2. Click 'Generate Summary'
        3. Wait for the AI to process the video
        """)

    # Main content
    col1, col2 = st.columns([2, 1])
    with col1:
        # URL input
        youtube_url = st.text_input("Enter YouTube URL:", placeholder="https://youtube.com/watch?v=...")
        # Generate button
        if st.button("Generate Summary", type="primary"):
            if youtube_url:
                try:
                    # Show loading spinner
                    with st.spinner("Generating summary... This may take a few moments."):
                        summarizer = YouTubeSummarizer()
                        result = summarizer.summarize_video(youtube_url)
                    if result["status"] == "success":
                        # Display video thumbnail
                        video_id = result["video_id"]
                        st.image(f"https://img.youtube.com/vi/{video_id}/maxresdefault.jpg",
                                 use_column_width=True)
                        # Display summary
                        st.markdown("### Summary")
                        st.markdown('<div class="summary-box">', unsafe_allow_html=True)
                        st.markdown(result["summary"])
                        st.markdown('</div>', unsafe_allow_html=True)
                        # Add copy button
                        st.markdown("### Actions")
                        if st.button("Copy Summary to Clipboard"):
                            st.write("Summary copied!")
                            st.session_state.clipboard = result["summary"]
                    else:
                        st.error(f"Error: {result['message']}")
                except Exception as e:
                    st.error(f"An error occurred: {str(e)}")
            else:
                st.warning("Please enter a YouTube URL")
    with col2:
        if youtube_url and 'video_id' in locals():
            st.markdown("### Original Video")
            st.video(youtube_url)


if __name__ == "__main__":
    main()
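Save the code as streamlit_app.py
Run the Streamlit app
streamlit run streamlit_app.py
Open the Streamlit interface
Once the app is running, a new tab showing the Streamlit interface opens in your default web browser.
Alternatively, open a browser manually and go to the address shown in the terminal, which usually looks like:
http://localhost:8501/
Example of the app interface: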