[Guide] An AI Tool for Quickly Generating Document Summaries

Published by alex on October 15, 2024

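Building a web application that summarizes PDF documents with a language model is simpler than you might expect. With libraries such as Streamlit, LangChain, and Hugging Face Transformers, you can put together a capable summarization tool. In this article, I walk through the code and the process of building a document summarization app.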

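Key tools and libraries

Streamlit: a fast, easy-to-use framework for building web applications in Python.
LangChain: a framework for building LLM applications; here it is used to load and split text documents.
Transformers: Hugging Face's library of pretrained models for natural language processing (NLP) tasks.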

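Let's start by creating a virtual environment. The activation command below is for Windows; on macOS or Linux, use source venv/bin/activate instead.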

python -m venv venv
.\venv\Scripts\activate

Next, save the following as requirements.txt and install it with pip install -r requirements.txt:

streamlit==1.25.0  
langchain==0.0.324  
transformers 
torch
pypdf
sentencepiece
python-dotenv
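
After installing, it can help to confirm that each package actually resolved in the active virtual environment. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is hypothetical and not part of the original article.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    # Distribution names taken from the requirements list above.
    wanted = ["streamlit", "langchain", "transformers", "torch",
              "pypdf", "sentencepiece", "python-dotenv"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'MISSING'}")
```

Any package reported as MISSING usually means the install ran outside the activated environment.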

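Step 1: Set up the environment

The first step is to import the required libraries and load environment variables. Here we use the python-dotenv library to load sensitive values such as API keys.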

import os
from dotenv import load_dotenv
import streamlit as st
from transformers import pipeline, T5Tokenizer, T5ForConditionalGeneration
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader
import base64
# Load environment variables
load_dotenv()
# Model and tokenizer loading
model_name = "MBZUAI/LaMini-Flan-T5-248M"
token = os.getenv("api_key")

Make sure to create a .env file in the project directory and add your API key or other sensitive values to it. For example:

api_key=YOUR_API_KEY
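
Because `os.getenv` returns `None` when a variable is missing, a small guard can make a misconfigured `.env` file obvious at startup instead of failing later. This is only a sketch: `require_env` is a hypothetical helper, and it assumes the variable name `api_key` from the example above. The code in this article reads the key into `token` but does not pass it anywhere, so the guard matters only if your model or endpoint actually requires a key.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value
```

In the app, this would be called after `load_dotenv()`, for example `token = require_env("api_key")`.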

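Step 2: Load the pretrained model and tokenizer

Next, we load the pretrained model and tokenizer. This example uses LaMini-Flan-T5, a variant of the T5 model that works well for summarization tasks.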

# Load tokenizer and model
tokenizer = T5Tokenizer.from_pretrained(model_name, legacy=False)
model = T5ForConditionalGeneration.from_pretrained(model_name)

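Step 3: Process and split the document

Because most PDFs span multiple pages and contain a lot of text, we need to split the content into manageable chunks. For this we use LangChain's RecursiveCharacterTextSplitter, which breaks the document into smaller chunks.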

# File loader and preprocessing
def file_preprocessing(file):
    loader = PyPDFLoader(file)
    pages = loader.load_and_split()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)
    texts = text_splitter.split_documents(pages)
    final_texts = ""
    for text in texts:
        final_texts += text.page_content
    return final_texts

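This function reads the PDF file, splits it into pages, splits the pages into smaller chunks, and then concatenates the chunk text into a single string.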
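
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of overlapping character windows. It is not LangChain's algorithm (the recursive splitter also tries to break on separators such as paragraphs and sentences), and `split_with_overlap` is a hypothetical name used only for illustration.

```python
def split_with_overlap(text, chunk_size=200, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(chunk)
```

Each window shares its last `chunk_overlap` characters with the start of the next one. One consequence worth keeping in mind: joining overlapping chunks back into one string, as `file_preprocessing` does, can repeat the overlapped text.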

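Step 4: Summarize the text with an LLM pipeline

Once the document has been processed, the model can summarize it through a Hugging Face pipeline. Here we define a function that takes a file path, preprocesses the text, and generates a summary.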

# LLM pipeline
def llm_pipeline(filepath):
    pipe_sum = pipeline(
        'summarization',
        model=model,
        tokenizer=tokenizer,
        max_length=500, 
        min_length=50)
    input_text = file_preprocessing(filepath)
    result = pipe_sum(input_text)
    return result[0]['summary_text']
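
The function above sends the whole document to the model in one call. Models in the T5 family are typically configured for fairly short inputs (often around 512 tokens), so a long PDF may be truncated or summarized poorly. One common workaround, sketched below, is to summarize the text piece by piece and join the partial summaries. This is an alternative under stated assumptions, not the original author's method: `summarize_in_chunks` is a hypothetical helper, `summarize` can be any callable that maps a string to a shorter string, and `max_chars=1500` is an arbitrary placeholder, since character count is only a rough proxy for token count.

```python
def summarize_in_chunks(text, summarize, max_chars=1500):
    """Summarize long text piece by piece, then join the partial summaries.

    `summarize` is any callable that maps a string to a shorter string.
    """
    # Cut the text into consecutive, non-overlapping pieces.
    pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    # Skip pieces that contain only whitespace.
    partial_summaries = [summarize(piece) for piece in pieces if piece.strip()]
    return " ".join(partial_summaries)
```

With the pipeline from this step, the callable could be something like `lambda piece: pipe_sum(piece)[0]['summary_text']`.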

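Step 5: Display the PDF and the summary in the app

To keep the app user-friendly, we let users upload a PDF through Streamlit's file_uploader. We also use Streamlit's columns to show the uploaded PDF and its summary side by side.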

@st.cache_data
def displayPDF(file):
    with open(file, "rb") as f:
        base64_pdf = base64.b64encode(f.read()).decode('utf-8')
    pdf_display = f'<iframe src="data:application/pdf;base64,{base64_pdf}" width="100%" height="600" type="application/pdf"></iframe>'
    st.markdown(pdf_display, unsafe_allow_html=True)
# Streamlit app
st.set_page_config(layout="wide")
def main():
    st.title("Document Summarization App using LLMs")
    uploaded_file = st.file_uploader("Upload your PDF file", type=['pdf'])
    if uploaded_file is not None:
        if st.button("Summarize"):
            col1, col2 = st.columns(2)
            if not os.path.exists('data'):
                os.makedirs('data')
            filepath = "data/" + uploaded_file.name
            with open(filepath, "wb") as temp_file:
                temp_file.write(uploaded_file.read())
            with col1:
                st.info("Uploaded File")
                displayPDF(filepath)
            with col2:
                summary = llm_pipeline(filepath)
                st.info("Summarization Complete")
                st.success(summary)
if __name__ == "__main__":
    main()
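
The app builds the save path by concatenating `"data/"` with the uploaded file's name. Browsers normally send only a base name, so this usually works, but a defensive version can strip any directory components before writing to disk. The sketch below uses only the standard library; `safe_upload_path` is a hypothetical helper, not part of the original code.

```python
from pathlib import Path


def safe_upload_path(upload_dir, filename):
    """Build a destination path inside upload_dir, ignoring directory parts in filename."""
    # Normalize Windows separators so the base name is extracted on any OS.
    basename = Path(filename.replace("\\", "/")).name
    if not basename or basename in {".", ".."}:
        raise ValueError(f"Unusable file name: {filename!r}")
    directory = Path(upload_dir)
    # Create the directory if needed; no separate existence check is required.
    directory.mkdir(parents=True, exist_ok=True)
    return directory / basename
```

In `main()`, this could replace the `os.path.exists` / `os.makedirs` pair and the string concatenation, for example `filepath = str(safe_upload_path("data", uploaded_file.name))`.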

Step 6: Run the app

To run the Streamlit app, save the code as app.py and run the following command in a terminal:

streamlit run app.py

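This launches the app in your web browser, where you can upload a PDF file, summarize it, and view the result.

Conclusion

In this article, we covered how to build a simple web application that summarizes documents with a language model. By combining Streamlit, LangChain, and Hugging Face Transformers models, you can deploy a useful NLP application quickly and with little effort.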

Source: https://medium.com/@himanshugangwar0509/document-summarization-app-using-language-model-5de603ff2d09

