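In this article, I'll show you how to create a video lecture from an article.
Initialize the LLM
I'll use Google Gemini Flash because (a) it is the cheapest frontier LLM at the moment; (b) it is multimodal, so it can read and understand images; and (c) it supports controlled generation, which means we can ensure that the LLM's output matches a desired structure.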
import pdfkit
import os
import google.generativeai as genai
from dotenv import load_dotenv
load_dotenv("../genai_agents/keys.env")
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
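Note that I'm using the Google Generative AI package, not Google Cloud Vertex AI. The two packages are different. The Google Generative AI package supports Pydantic objects for controlled generation; the Vertex AI package currently supports only JSON.
Get a PDF of the article
I used Python to download the article as a PDF and upload it to a temporary storage location that Gemini can read: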
ARTICLE_URL = "https://lakshmanok.medium...."
pdfkit.from_url(ARTICLE_URL, "article.pdf")
pdf_file = genai.upload_file("article.pdf")
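Unfortunately, something about Medium prevents pdfkit from getting the images in the article (perhaps because they are webm and not png ...). So my slides will be based only on the article's text, not its images.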
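One practical note on the download step: pdfkit is a thin wrapper around the wkhtmltopdf command-line tool, so that binary has to be installed and on the PATH before pdfkit.from_url will work. Here is a minimal sketch of a pre-flight check. The function name has_wkhtmltopdf is my own and is not part of pdfkit:

```python
import shutil


def has_wkhtmltopdf(binary="wkhtmltopdf"):
    """Return True if the given executable is found on the PATH."""
    return shutil.which(binary) is not None


if not has_wkhtmltopdf():
    print("wkhtmltopdf not found; install it before calling pdfkit.from_url")
```

Running this before the download turns a fairly opaque pdfkit error into a clear message.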
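Create lecture notes in JSON
Here, the data format I want is a set of slides, each with a title, key points, and lecture notes. The lecture as a whole also has a title and an attribution.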
from typing import List
from pydantic import BaseModel

class Slide(BaseModel):
    title: str
    key_points: List[str]
    lecture_notes: str

class Lecture(BaseModel):
    slides: List[Slide]
    lecture_title: str
    based_on_article_by: str
Let's tell Gemini what we want it to do:
lecture_prompt = """
You are a university professor who needs to create a lecture to
a class of undergraduate students.
* Create a 10-slide lecture based on the following article.
* Each slide should contain the following information:
- title: a single sentence that summarizes the main point
- key_points: a list of between 2 and 5 bullet points. Use phrases, not full sentences.
- lecture_notes: 3-10 sentences explaining the key points in easy-to-understand language. Expand on the points using other information from the article.
* Also, create a title for the lecture and attribute the original article's author.
"""
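The prompt is straightforward: ask Gemini to read the article, extract the key points, and create lecture notes.
Now, call the model, passing in the PDF file and asking it to populate the desired structure: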
model = genai.GenerativeModel(
    "gemini-1.5-flash-001",
    system_instruction=[lecture_prompt]
)
generation_config = {
    "temperature": 0.7,
    "response_mime_type": "application/json",
    "response_schema": Lecture
}
response = model.generate_content(
    [pdf_file],
    generation_config=generation_config,
    stream=False
)
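A few things to note about the code above: the prompt is passed in as the system instruction, the response type is set to JSON with the Lecture class as its schema, and the PDF file is the only input to generate_content, with streaming turned off so that we wait for the complete response.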
The result is JSON, so parse it into a Python object:
import json
lecture = json.loads(response.text)
For example, this is what the third slide looks like:
{'key_points': [
    'Silver layer cleans, structures, and prepares data for self-service analytics.',
    'Data is denormalized and organized for easier use.',
    'Type 2 slowly changing dimensions are handled in this layer.',
    'Governance responsibility lies with the source team.'
  ],
 'lecture_notes': 'The silver layer takes data from the bronze layer and transforms it into a usable format for self-service analytics. This involves cleaning, structuring, and organizing the data. Type 2 slowly changing dimensions, which track changes over time, are also handled in this layer. The governance of the silver layer rests with the source team, which is typically the data engineering team responsible for the source system.',
 'title': 'The Silver Layer: Data Transformation and Preparation'
}
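The response schema constrains the shape of the output, but the counts requested in the prompt (10 slides, 2 to 5 key points per slide) are only instructions to the model. A small sanity check on the parsed dictionary can catch drift. This is a sketch only. check_lecture is a hypothetical helper of my own, not part of any library, and the sample data is made up:

```python
def check_lecture(lecture, expected_slides=10):
    """Return a list of problems found in a parsed lecture dict (empty if none)."""
    problems = []
    for key in ("slides", "lecture_title", "based_on_article_by"):
        if key not in lecture:
            problems.append(f"missing top-level field: {key}")
    slides = lecture.get("slides", [])
    if len(slides) != expected_slides:
        problems.append(f"expected {expected_slides} slides, got {len(slides)}")
    for number, slide in enumerate(slides, start=1):
        for key in ("title", "key_points", "lecture_notes"):
            if key not in slide:
                problems.append(f"slide {number}: missing field {key}")
        points = slide.get("key_points", [])
        if not 2 <= len(points) <= 5:
            problems.append(f"slide {number}: {len(points)} key points (want 2 to 5)")
    return problems


# Made-up sample: one slide with too few key points.
sample = {
    "lecture_title": "Example",
    "based_on_article_by": "Example Author",
    "slides": [
        {"title": "Only slide", "key_points": ["one"], "lecture_notes": "Notes."}
    ],
}
print(check_lecture(sample, expected_slides=1))
```

If the list comes back non-empty, the simplest remedy is to call the model again.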
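Convert to PowerPoint
We can use the Python package python-pptx (imported as pptx) to create a presentation with notes and bullet points. The code to create the slides looks like this: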
from pptx import Presentation

presentation = Presentation()
for slidejson in lecture['slides']:
    # Layout 1 of the default template is "Title and Content"
    slide = presentation.slides.add_slide(presentation.slide_layouts[1])
    title = slide.shapes.title
    title.text = slidejson['title']
    # bullets
    textframe = slide.placeholders[1].text_frame
    for key_point in slidejson['key_points']:
        p = textframe.add_paragraph()
        p.text = key_point
        p.level = 1
    # notes
    notes_frame = slide.notes_slide.notes_text_frame
    notes_frame.text = slidejson['lecture_notes']
presentation.save("lecture.pptx")
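The result is a PowerPoint presentation in which each slide has a title, bullet points, and speaker notes.
Read the notes aloud and save the audio
We now have the lecture notes, so let's create an audio file for each slide.
The code below takes some text and has an AI voice read it aloud. We save the resulting audio as an mp3 file: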
from google.cloud import texttospeech

def convert_text_audio(text, audio_mp3file):
    """Synthesizes speech from the input string of text."""
    tts_client = texttospeech.TextToSpeechClient()
    input_text = texttospeech.SynthesisInput(text=text)
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Standard-C",
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )
    response = tts_client.synthesize_speech(
        request={"input": input_text, "voice": voice, "audio_config": audio_config}
    )
    # The response's audio_content is binary.
    with open(audio_mp3file, "wb") as out:
        out.write(response.audio_content)
    print(f"{audio_mp3file} written.")
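What's happening in the code above? We use Google Cloud's Text-to-Speech API, ask for a standard US English female voice, pass in the input text, and write the returned binary audio to an mp3 file. The file extension has to match the MP3 audio encoding we requested.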
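One thing to watch for: the Text-to-Speech API limits how much input text a single request can carry (5,000 bytes, as far as I know; check the current quotas). Lecture notes of 3-10 sentences should fit comfortably, but if you feed in longer text, you can split it on sentence boundaries first. A rough sketch, where split_for_tts and the 4,500-byte default are my own choices rather than anything from the API:

```python
import re


def split_for_tts(text, max_bytes=4500):
    """Split text into chunks of whole sentences, aiming for at most
    max_bytes of UTF-8 per chunk. A single sentence longer than
    max_bytes is kept whole rather than cut mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to convert_text_audio separately and the resulting files joined afterwards.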
Now, create the audio by iterating over the slides and passing in the lecture notes:
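outdir = "article_audio"  # assumed: the directory name the video step reads from
os.makedirs(outdir, exist_ok=True)
filenames = []
for slideno, slide in enumerate(lecture['slides']):
    text = f"On to {slide['title']} \n"
    text += slide['lecture_notes'] + "\n\n"
    filename = os.path.join(outdir, f"audio_{slideno+1:02}.mp3")
    convert_text_audio(text, filename)
    filenames.append(filename)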
The result is a set of audio files. You can concatenate them using pydub:
import pydub

# audio_files: the mp3 files written above, as paths that pydub can open
combined = pydub.AudioSegment.empty()
for audio_file in audio_files:
    audio = pydub.AudioSegment.from_file(audio_file)
    combined += audio
    # pause for 4 seconds
    silence = pydub.AudioSegment.silent(duration=4000)
    combined += silence
combined.export("lecture.wav", format="wav")
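Because each slide's audio is followed by a fixed 4-second pause, it is easy to work out where each slide starts in lecture.wav, which is handy for chapter markers. A small sketch of the arithmetic. Both helper names are hypothetical, and the durations would come from len(audio), which pydub reports in milliseconds:

```python
def slide_start_times(durations_ms, pause_ms=4000):
    """Return the offset (ms) at which each slide's audio starts in the combined file."""
    starts, offset = [], 0
    for duration in durations_ms:
        starts.append(offset)
        offset += duration + pause_ms
    return starts


def as_timestamp(ms):
    """Format a millisecond offset as M:SS."""
    seconds = ms // 1000
    return f"{seconds // 60}:{seconds % 60:02d}"


# Made-up durations of 30 s, 45 s and 20 s
for start in slide_start_times([30000, 45000, 20000]):
    print(as_timestamp(start))
```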
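Create slide images
Annoyingly, there is no easy way to render PowerPoint slides as images using Python. You need a machine with Office installed to do that, which is not easy to automate. A simple way to render the images is to use the Python Imaging Library (PIL):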
import textwrap
from PIL import Image, ImageDraw, ImageFont

def wrap(text, width):
    # assumed helper: break text into lines of at most `width` characters
    return "\n".join(textwrap.wrap(text, width=width))

def text_to_image(output_path, title, keypoints):
    image = Image.new("RGB", (1000, 750), "black")
    draw = ImageDraw.Draw(image)
    title_font = ImageFont.truetype("Coval-Black.ttf", size=42)
    draw.multiline_text((10, 25), wrap(title, 50), font=title_font)
    text_font = ImageFont.truetype("Coval-Light.ttf", size=36)
    for ptno, keypoint in enumerate(keypoints):
        draw.multiline_text((10, (ptno+2)*100), wrap(keypoint, 60), font=text_font)
    image.save(output_path)
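The resulting images aren't great, but they are usable (you can tell no one pays me to write production code anymore).
Create the video
Now that we have a set of audio files and a set of image files, we can use the Python package moviepy to create a video clip: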
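One reason the images are rough is that every bullet gets a fixed 100-pixel slot, so a key point that wraps onto several lines can run into the next one. A possible refinement is to compute each bullet's vertical position from the number of wrapped lines. This is a sketch. layout_keypoints is a hypothetical helper, and the 44-pixel line height and 24-pixel gap are guesses for a 36-point font, not measured values:

```python
import textwrap


def layout_keypoints(keypoints, width=60, top=200, line_height=44, gap=24):
    """Return (y, wrapped_text) pairs so that wrapped bullets do not overlap."""
    placed, y = [], top
    for keypoint in keypoints:
        lines = textwrap.wrap(keypoint, width=width) or [""]
        placed.append((y, "\n".join(lines)))
        # Advance by the height of this bullet plus a gap before the next one.
        y += len(lines) * line_height + gap
    return placed
```

Inside text_to_image, the loop would then draw each wrapped_text at (10, y) instead of at (10, (ptno+2)*100).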
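# moviepy 1.x API
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

# here slide_files and audio_files are bare file names inside the two directories
clips = []
for slide, audio in zip(slide_files, audio_files):
    audio_clip = AudioFileClip(f"article_audio/{audio}")
    slide_clip = ImageClip(f"article_slides/{slide}").set_duration(audio_clip.duration)
    slide_clip = slide_clip.set_audio(audio_clip)
    clips.append(slide_clip)
full_video = concatenate_videoclips(clips)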
Now we can write it out:
full_video.write_videofile("lecture.mp4", fps=24, codec="mpeg4",
temp_audiofile='temp-audio.mp4', remove_temp=True)
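A caveat about the zip in the clip-building loop: it pairs files purely by position and silently stops at the shorter list, so a missing or mis-sorted file produces a video with the wrong narration over a slide. A defensive sketch that pairs files by the number embedded in their names. pair_by_number is a hypothetical helper, and the slide_NN.png naming is my assumption (the audio_NN.mp3 names come from the audio step):

```python
import re


def pair_by_number(slide_files, audio_files):
    """Pair slide and audio file names by the first number in each name."""
    def number(name):
        match = re.search(r"(\d+)", name)
        if match is None:
            raise ValueError(f"no number in file name: {name}")
        return int(match.group(1))

    slides = {number(name): name for name in slide_files}
    audios = {number(name): name for name in audio_files}
    if slides.keys() != audios.keys():
        unmatched = sorted(slides.keys() ^ audios.keys())
        raise ValueError(f"slides and audio do not line up: {unmatched}")
    return [(slides[n], audios[n]) for n in sorted(slides)]
```

The returned pairs can be used in place of zip(slide_files, audio_files).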
The end result? We have four artifacts, all created automatically from article.pdf:
lecture.json lecture.mp4 lecture.pptx lecture.wav
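Summary
Inspired by the podcast feature in NotebookLM, I set out to build an application that converts my articles into video lectures. The key steps are to prompt an LLM to produce slide content from the article, have another GenAI model turn the audio script into audio files, and then use existing Python APIs to combine them into a video.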