Using Generative AI to Automatically Create a Video Talk from an Article

Published on September 23, 2024 by alex

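In this article, I'll show you how to create a video talk from an article.

Initializing the LLM

I'll use Google Gemini Flash because (a) it is, at the time of writing, the least expensive frontier LLM; (b) it is multimodal, so it can read and understand images; and (c) it supports controlled generation, which means we can ensure that the LLM's output matches a desired structure.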

import pdfkit
import os
import google.generativeai as genai
from dotenv import load_dotenv
load_dotenv("../genai_agents/keys.env")
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

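Note that I'm using Google Generative AI, not Google Cloud Vertex AI. The two packages are different. The Google package supports Pydantic objects for controlled generation; the Vertex AI package currently supports only JSON.

Getting a PDF of the article

I use Python to download the article as a PDF and upload it to a temporary storage location that Gemini can read: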

ARTICLE_URL = "https://lakshmanok.medium...."
pdfkit.from_url(ARTICLE_URL, "article.pdf")
pdf_file = genai.upload_file("article.pdf")

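Because of something about Medium, pdfkit is unable to fetch the images in the article (perhaps because they are webm rather than png...). So my slides will be based only on the text of the article, not its images.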
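pdfkit is a wrapper around the wkhtmltopdf command-line tool, so the download step fails if that binary is not installed. A minimal sketch of a pre-flight check, assuming the binary is expected on PATH; the helper name have_wkhtmltopdf is hypothetical and not part of pdfkit:

```python
import shutil


def have_wkhtmltopdf(binary="wkhtmltopdf"):
    """Return True if the wkhtmltopdf executable can be found on PATH."""
    return shutil.which(binary) is not None


if not have_wkhtmltopdf():
    print("wkhtmltopdf not found; install it before calling pdfkit.from_url")
```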

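Creating lecture notes in JSON

Here, the data format I want is a set of slides, each of which has a title, key points, and lecture notes. The lecture as a whole also has a title and an attribution.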

from typing import List
from pydantic import BaseModel

class Slide(BaseModel):
    title: str
    key_points: List[str]
    lecture_notes: str
class Lecture(BaseModel):
    slides: List[Slide]
    lecture_title: str
    based_on_article_by: str

Let's tell Gemini what we want it to do:

lecture_prompt = """
You are a university professor who needs to create a lecture to
a class of undergraduate students.
* Create a 10-slide lecture based on the following article.
* Each slide should contain the following information:
  - title: a single sentence that summarizes the main point
  - key_points: a list of between 2 and 5 bullet points. Use phrases, not full sentences.
  - lecture_notes: 3-10 sentences explaining the key points in easy-to-understand language. Expand on the points using other information from the article.
* Also, create a title for the lecture and attribute the original article's author.
"""

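The prompt is pretty straightforward: ask Gemini to read the article, extract key points, and create lecture notes.

Now, invoke the model, passing in the PDF file and asking it to populate the desired structure: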

model = genai.GenerativeModel(
    "gemini-1.5-flash-001",
    system_instruction=[lecture_prompt]
)
generation_config={
    "temperature": 0.7,
    "response_mime_type": "application/json",
    "response_schema": Lecture
}
response = model.generate_content(
    [pdf_file],
    generation_config=generation_config,
    stream=False
)

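A few things to note about the code above:

We pass the prompt in as the system prompt, so that we don't need to keep resending it with each new input.
We specify the desired response type as JSON, and the schema as a Pydantic object.
We send the PDF file to the model and tell it to generate a response. We wait for it to complete (no need to stream).

The result is JSON, so parse it into a Python object: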

import json
lecture = json.loads(response.text)

For example, this is what the third slide looks like:

{'key_points': [
    'Silver layer cleans, structures, and prepares data for self-service analytics.',
    'Data is denormalized and organized for easier use.',
    'Type 2 slowly changing dimensions are handled in this layer.',
    'Governance responsibility lies with the source team.'
  ],
 'lecture_notes': 'The silver layer takes data from the bronze layer and transforms it into a usable format for self-service analytics. This involves cleaning, structuring, and organizing the data. Type 2 slowly changing dimensions, which track changes over time, are also handled in this layer. The governance of the silver layer rests with the source team, which is typically the data engineering team responsible for the source system.',
 'title': 'The Silver Layer: Data Transformation and Preparation'
}
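The response schema constrains the types of the fields, but limits such as "between 2 and 5 bullet points" are only requested in the prompt, so the model may not always honor them. A rough sketch of a sanity check on the parsed dictionary; the function name check_lecture and the sample data are made up for illustration:

```python
def check_lecture(lecture, min_points=2, max_points=5):
    """Return a list of human-readable problems found in a parsed lecture dict."""
    problems = []
    for field in ("lecture_title", "based_on_article_by"):
        if not lecture.get(field):
            problems.append(f"missing {field}")
    for number, slide in enumerate(lecture.get("slides", []), start=1):
        if not slide.get("title"):
            problems.append(f"slide {number}: missing title")
        count = len(slide.get("key_points", []))
        if not min_points <= count <= max_points:
            problems.append(f"slide {number}: {count} key points")
        if not slide.get("lecture_notes"):
            problems.append(f"slide {number}: missing lecture_notes")
    return problems


# Hypothetical sample with too few key points on its only slide.
sample = {
    "lecture_title": "Example",
    "based_on_article_by": "Someone",
    "slides": [
        {"title": "A", "key_points": ["one"], "lecture_notes": "Notes."},
    ],
}
print(check_lecture(sample))  # ['slide 1: 1 key points']
```

An empty list means the lecture matches what the prompt asked for; otherwise it may be worth calling the model again.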

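Converting to PowerPoint

We can use the Python package python-pptx (imported as pptx) to create a presentation with notes and bullet points. The code to create the slides looks like this: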

from pptx import Presentation

presentation = Presentation()
for slidejson in lecture['slides']:
    # layout 1 of the default template is "Title and Content"
    slide = presentation.slides.add_slide(presentation.slide_layouts[1])
    title = slide.shapes.title
    title.text = slidejson['title']
    # bullets
    textframe = slide.placeholders[1].text_frame
    for key_point in slidejson['key_points']:
        p = textframe.add_paragraph()
        p.text = key_point
        p.level = 1
    # notes
    notes_frame = slide.notes_slide.notes_text_frame
    notes_frame.text = slidejson['lecture_notes']
presentation.save("lecture.pptx")

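The result is a PowerPoint presentation like this:

Reading the notes aloud and saving the audio

We have the lecture notes, so now let's create an audio file for each slide.

The code below takes some text and has an AI voice read it out. We save the resulting audio as an mp3 file: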

from google.cloud import texttospeech
def convert_text_audio(text, audio_mp3file):
    """Synthesizes speech from the input string of text."""
    tts_client = texttospeech.TextToSpeechClient()    
    input_text = texttospeech.SynthesisInput(text=text)
    
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name="en-US-Standard-C",
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )
    response = tts_client.synthesize_speech(
        request={"input": input_text, "voice": voice, "audio_config": audio_config}
    )
    # The response's audio_content is binary.
    with open(audio_mp3file, "wb") as out:
        out.write(response.audio_content)
        print(f"{audio_mp3file} written.")

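What's happening in the code above?

We are using Google Cloud's Text-to-Speech API.
We ask it to use a standard US-accent female voice.
We then give it the input text and ask it to generate audio.
We save the audio as an mp3 file. Note that the file format has to match the audio encoding.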
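To make the last point concrete, here is a small sketch that derives the encoding from the output filename, so the two cannot drift apart. The helper name and the mapping are illustrative assumptions. The values are names of members of texttospeech.AudioEncoding, and my understanding is that LINEAR16 output comes with a WAV header, so treat that entry as something to verify:

```python
import os

# Names of texttospeech.AudioEncoding members, keyed by file extension.
ENCODING_BY_EXTENSION = {
    ".mp3": "MP3",
    ".wav": "LINEAR16",
    ".ogg": "OGG_OPUS",
}


def encoding_name_for(path):
    """Return the AudioEncoding member name that matches the file extension."""
    extension = os.path.splitext(path)[1].lower()
    try:
        return ENCODING_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError(f"no known audio encoding for {extension!r}") from None


print(encoding_name_for("article_audio/audio_01.mp3"))  # MP3
```

Inside convert_text_audio, the member itself could then be looked up with getattr(texttospeech.AudioEncoding, encoding_name_for(audio_mp3file)).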

Now, create the audio by iterating through the slides and passing in the lecture notes:

outdir = "article_audio"  # directory the video step reads the audio from
os.makedirs(outdir, exist_ok=True)
filenames = []
for slideno, slide in enumerate(lecture['slides']):
    text = f"On to {slide['title']} \n"
    text += slide['lecture_notes'] + "\n\n"
    filename = os.path.join(outdir, f"audio_{slideno+1:02}.mp3")
    convert_text_audio(text, filename)
    filenames.append(filename)

The result is a bunch of audio files. You can concatenate them using pydub:

import pydub

combined = pydub.AudioSegment.empty()
for audio_file in filenames:  # full paths collected in the loop above
    audio = pydub.AudioSegment.from_file(audio_file)
    combined += audio
    # pause for 4 seconds
    silence = pydub.AudioSegment.silent(duration=4000)
    combined += silence
combined.export("lecture.wav", format="wav")
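Because every clip is followed by a 4-second pause, each slide's position in the combined audio is the sum of the clips and pauses before it. A sketch of that arithmetic, with made-up durations in milliseconds (the unit pydub reports via len(segment)); the function name is hypothetical:

```python
def slide_start_times(durations_ms, pause_ms=4000):
    """Return the start offset (ms) of each clip and the total length (ms)."""
    starts = []
    position = 0
    for duration in durations_ms:
        starts.append(position)
        # Each clip is followed by one pause, including the last clip.
        position += duration + pause_ms
    return starts, position


# Hypothetical clip lengths: 30 s, 45 s and 20 s.
starts, total = slide_start_times([30000, 45000, 20000])
print(starts, total)  # [0, 34000, 83000] 107000
```

Offsets like these could be used, for instance, to jump to a given slide in the combined audio file.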

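Creating images of the slides

Rather annoyingly, there is no easy way to render PowerPoint slides as images using Python. You need a machine with Office software installed to do that, which is not the kind of thing that is easy to automate. A simple way to render the images is to use the Python Imaging Library (PIL):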

import textwrap
from PIL import Image, ImageDraw, ImageFont

def wrap(text, width):
    # assumed helper: break text into lines of at most `width` characters
    return textwrap.fill(text, width)

def text_to_image(output_path, title, keypoints):
    image = Image.new("RGB", (1000, 750), "black")
    draw = ImageDraw.Draw(image)
    title_font = ImageFont.truetype("Coval-Black.ttf", size=42)
    draw.multiline_text((10, 25), wrap(title, 50), font=title_font)
    text_font = ImageFont.truetype("Coval-Light.ttf", size=36)
    for ptno, keypoint in enumerate(keypoints):
        draw.multiline_text((10, (ptno+2)*100), wrap(keypoint, 60), font=text_font)
    image.save(output_path)

The resulting image is not great, but it is serviceable (you can tell no one pays me to write production code anymore):
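One reason the result is rough: the code places each key point a fixed 100 pixels below the previous one, via (ptno+2)*100, so a key point that wraps onto three or more lines can run into the next. A possible refinement is to advance by the number of wrapped lines. This is only a sketch; the line height of 44 pixels for the 36-point font and the 20-pixel gap are guesses, not measured values:

```python
import textwrap


def layout_keypoints(keypoints, width=60, top=200, line_height=44, gap=20):
    """Return (y, wrapped_text) pairs so that wrapped bullets do not overlap."""
    placed = []
    y = top
    for keypoint in keypoints:
        wrapped = textwrap.fill(keypoint, width)
        placed.append((y, wrapped))
        # Advance by the height of this bullet plus a gap before the next one.
        line_count = wrapped.count("\n") + 1
        y += line_count * line_height + gap
    return placed


# Hypothetical key points; the second one wraps onto two lines.
points = ["Short point", "x" * 50 + " " + "y" * 50, "Another short point"]
for y, text in layout_keypoints(points):
    print(y, repr(text))
```

Each (y, text) pair could then be drawn with draw.multiline_text((10, y), text, font=text_font).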

Creating a video

Now that we have a set of audio files and a set of image files, we can use the Python package moviepy to create video clips:

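import os
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips  # moviepy 1.x
# Assumed: one image per slide in article_slides/ and one mp3 per slide in
# article_audio/, named so that sorting keeps the two lists aligned.
slide_files = sorted(os.listdir("article_slides"))
audio_files = sorted(os.listdir("article_audio"))
clips = []
for slide, audio in zip(slide_files, audio_files):
    audio_clip = AudioFileClip(f"article_audio/{audio}")
    slide_clip = ImageClip(f"article_slides/{slide}").set_duration(audio_clip.duration)
    slide_clip = slide_clip.set_audio(audio_clip)
    clips.append(slide_clip)
full_video = concatenate_videoclips(clips)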

Now we can write it out:

full_video.write_videofile("lecture.mp4", fps=24, codec="mpeg4", 
                           temp_audiofile='temp-audio.mp4', remove_temp=True)
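One thing worth guarding against in the loop that pairs slides with audio: zip stops silently at the shorter list and pairs purely by position, so a missing or misnamed file would shift every later slide. A minimal sketch of a stricter pairing by the number embedded in each filename. The slide_NN.png naming is an assumption for illustration, since the article does not show how the image files are named:

```python
import re


def pair_by_number(slide_files, audio_files):
    """Pair slide and audio files by the first number in each filename."""
    def number_of(name):
        match = re.search(r"(\d+)", name)
        if match is None:
            raise ValueError(f"no slide number in {name!r}")
        return int(match.group(1))

    slides = {number_of(name): name for name in slide_files}
    audios = {number_of(name): name for name in audio_files}
    if slides.keys() != audios.keys():
        unmatched = sorted(slides.keys() ^ audios.keys())
        raise ValueError(f"slide numbers without a partner: {unmatched}")
    return [(slides[n], audios[n]) for n in sorted(slides)]


# Hypothetical filenames, deliberately out of order.
pairs = pair_by_number(
    ["slide_02.png", "slide_01.png"],
    ["audio_01.mp3", "audio_02.mp3"],
)
print(pairs)  # [('slide_01.png', 'audio_01.mp3'), ('slide_02.png', 'audio_02.mp3')]
```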

The end result? We have four artifacts, all created automatically from article.pdf:

lecture.json  lecture.mp4  lecture.pptx  lecture.wav

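A JSON file with the key points, lecture notes, and so on.
A PowerPoint file that you can modify. The slides have the key points, and the notes section of each slide has the lecture notes.
An audio file of an AI voice reading out the lecture notes.
An mp4 movie of the audio and images (which I uploaded to YouTube). This is the video talk that I set out to create.

Summary

Inspired by the podcast feature of NotebookLM, I set out to build an application that would convert my articles into video talks. The key steps are to prompt an LLM to produce slide contents from the article, use another GenAI model to convert the audio script into audio files, and then use existing Python APIs to put them together into a video.

Source: https://lakshmanok.medium.com/using-generative-ai-to-automatically-create-a-video-talk-from-an-article-6381c44c5fe0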
