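One approach is to summarize the document as shown in the LangChain documentation. The problem, however, is the high computational cost and, with it, the high monetary cost. A thousand-page document contains roughly 250,000 words, and every one of them has to be fed into the LLM. The results then need further processing, for example with the map-reduce method. A conservative estimate with GPT-3.5 Turbo and a 4k context puts the cost above 1 USD per document for summarization alone. Even with free resources, such as the unofficial HuggingChat API, the number of API calls required would amount to abuse. I therefore needed a different approach.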
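
To make the scale concrete, here is a rough back-of-the-envelope sketch of how many tokens and how many first-pass (map) calls a thousand-page document implies. The ratio of about 0.75 words per token and the 1,024 tokens reserved for the prompt and the response are assumptions chosen for illustration; they are not figures from the estimate above.

```python
import math

def estimate_map_calls(num_words, words_per_token=0.75,
                       context_tokens=4096, reserved_tokens=1024):
    """Roughly estimate the token count and the number of map-step LLM calls."""
    # Assumed rule of thumb: one token covers about 0.75 English words
    total_tokens = math.ceil(num_words / words_per_token)
    # Tokens of source text that fit into a single call (assumed budget)
    usable_tokens = context_tokens - reserved_tokens
    num_calls = math.ceil(total_tokens / usable_tokens)
    return total_tokens, num_calls

total_tokens, num_calls = estimate_map_calls(250_000)
print(f"{total_tokens} tokens, at least {num_calls} calls before the reduce step")
```

Under these assumptions a single document already needs more than a hundred calls, and the reduce step adds further calls on top.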





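I implemented this code in Google Colab. The required libraries are gensim for LDA, pypdf for PDF processing, nltk for word processing, and LangChain for its prompt templates and its interface to the OpenAI API.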

import gensim
import nltk
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from pypdf import PdfReader
from langchain.chains import LLMChain
from langchain.prompts import ChatPromptTemplate
from langchain.llms import OpenAI

# Download the NLTK stopword lists (only needed once per environment)
nltk.download("stopwords")


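def preprocess(text, stop_words):
    """
    Tokenizes and preprocesses the input text, removing stopwords and short
    tokens.

    Parameters:
        text (str): The input text to preprocess.
        stop_words (set): A set of stopwords to be removed from the text.

    Returns:
        list: A list of preprocessed tokens.
    """
    result = []
    # simple_preprocess lowercases and tokenizes; deacc=True strips accents
    for token in simple_preprocess(text, deacc=True):
        # Keep only tokens that are not stopwords and are longer than 3 characters
        if token not in stop_words and len(token) > 3:
            result.append(token)
    return result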
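
To see what this filtering does without installing gensim or nltk, the following is a standard-library approximation. It is only a sketch: `preprocess_sketch` and the tiny stopword set are hypothetical, and the regular expression is a simplification of what `simple_preprocess` does (for instance, it does not strip accents).

```python
import re

def preprocess_sketch(text, stop_words, min_len=4):
    """Approximate the filtering step: lowercase, tokenize, drop stopwords and short tokens."""
    # Simplified tokenizer: runs of ASCII letters only (an assumption, not gensim's rule)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words and len(t) >= min_len]

# Hypothetical, tiny stopword set used only for this illustration
stop_words = {"the", "and", "into", "from"}
sentence = "Gregor woke from uneasy dreams and found himself transformed into an insect."
tokens = preprocess_sketch(sentence, stop_words)
print(tokens)
```

Short words such as "an" disappear because of the length filter, and "from", "and" and "into" are removed as stopwords.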


def get_topic_lists_from_pdf(file, num_topics, words_per_topic):
    """
    Extracts topics and their associated words from a PDF document using the
    Latent Dirichlet Allocation (LDA) algorithm.

    Parameters:
        file (str): The path to the PDF file for topic extraction.
        num_topics (int): The number of topics to discover.
        words_per_topic (int): The number of words to include per topic.

    Returns:
        list: A list of num_topics sublists, each containing relevant words
        for a topic.
    """
    # Load the pdf file
    loader = PdfReader(file)
    # Extract the text from each page into a list. Each page is considered a document
    documents = []
    for page in loader.pages:
        documents.append(page.extract_text())
    # Preprocess the documents
    stop_words = set(stopwords.words(['english', 'spanish']))
    processed_documents = [preprocess(doc, stop_words) for doc in documents]
    # Create a dictionary and a corpus
    dictionary = corpora.Dictionary(processed_documents)
    corpus = [dictionary.doc2bow(doc) for doc in processed_documents]
    # Build the LDA model
    lda_model = LdaModel(
        corpus,
        num_topics=num_topics,
        id2word=dictionary,
    )
    # Retrieve the topics and their corresponding words
    topics = lda_model.print_topics(num_words=words_per_topic)
    # Store each list of words from each topic into a list
    topics_ls = []
    for topic in topics:
        # topic is a (topic_id, 'weight*"word" + weight*"word" + ...') tuple
        words = topic[1].split("+")
        topic_words = [word.split("*")[1].replace('"', '').strip() for word in words]
        topics_ls.append(topic_words)
    return topics_ls
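
The last step of the function parses the strings returned by `print_topics`, which pair each word with its weight in the form `weight*"word" + weight*"word"`. The sketch below runs the same parsing on made-up sample data so it can be tried without training a model; the sample weights and words, and the helper name `words_from_topics`, are hypothetical.

```python
# Hypothetical sample in the (topic_id, formatted_string) shape returned by print_topics
sample_topics = [
    (0, '0.020*"gregor" + 0.012*"father" + 0.010*"room"'),
    (1, '0.015*"sister" + 0.011*"door" + 0.009*"mother"'),
]

def words_from_topics(topics):
    """Strip the weights and quotes, keeping only the words of each topic."""
    topics_ls = []
    for _, topic_string in topics:
        terms = topic_string.split("+")
        # Each term looks like '0.020*"gregor" ': keep the part after '*'
        words = [term.split("*")[1].replace('"', "").strip() for term in terms]
        topics_ls.append(words)
    return topics_ls

print(words_from_topics(sample_topics))
```

String parsing works, but it depends on the exact output format. gensim's `show_topics` with `formatted=False` returns the word and weight pairs directly, which may be a more robust alternative.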


def topics_from_pdf(llm, file, num_topics, words_per_topic):
    """
    Generates descriptive prompts for LLM based on topic words extracted from a
    PDF document.

    This function takes the output of `get_topic_lists_from_pdf` function,
    which consists of a list of topic-related words for each topic, and
    generates an output string in table of content format.

    Parameters:
        llm (LLM): An instance of the Large Language Model (LLM) for generating
        responses.
        file (str): The path to the PDF file for extracting topic-related words.
        num_topics (int): The number of topics to consider.
        words_per_topic (int): The number of words per topic to include.

    Returns:
        str: A response generated by the language model based on the provided
        topic words.
    """
    # Extract topics and convert to string
    list_of_topicwords = get_topic_lists_from_pdf(file, num_topics,
                                                  words_per_topic)
    string_lda = ""
    for topic_words in list_of_topicwords:
        string_lda += str(topic_words) + "\n"
    # Create the template
    template_string = '''Describe the topic of each of the {num_topics} 
        double-quote delimited lists in a simple sentence and also write down 
        three possible different subthemes. The lists are the result of an 
        algorithm for topic discovery.
        Do not provide an introduction or a conclusion, only describe the 
        topics. Do not mention the word "topic" when describing the topics.
        Use the following template for the response.
        1: <<<(sentence describing the topic)>>>
        - <<<(Phrase describing the first subtheme)>>>
        - <<<(Phrase describing the second subtheme)>>>
        - <<<(Phrase describing the third subtheme)>>>
        2: <<<(sentence describing the topic)>>>
        - <<<(Phrase describing the first subtheme)>>>
        - <<<(Phrase describing the second subtheme)>>>
        - <<<(Phrase describing the third subtheme)>>>
        ...
        n: <<<(sentence describing the topic)>>>
        - <<<(Phrase describing the first subtheme)>>>
        - <<<(Phrase describing the second subtheme)>>>
        - <<<(Phrase describing the third subtheme)>>>
        Lists: """{string_lda}""" '''
    # LLM call
    prompt_template = ChatPromptTemplate.from_template(template_string)
    chain = LLMChain(llm=llm, prompt=prompt_template)
    response = chain.run({
        "string_lda" : string_lda,
        "num_topics" : num_topics
        })
    return response

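The previous function converts the lists of words into a string. It then builds the prompt with LangChain's ChatPromptTemplate object; note that the prompt defines the structure of the response. Finally, the function calls the GPT-3.5 Turbo model. The return value is the response given by the LLM.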
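
The string conversion and the template filling can be tried without an API key. The sketch below uses plain `str.format` as a stand-in for `ChatPromptTemplate`, with a shortened template and made-up word lists; both are assumptions for illustration only.

```python
# Hypothetical word lists, standing in for the output of get_topic_lists_from_pdf
topic_words = [["gregor", "father", "room"], ["sister", "door", "mother"]]

# One Python list literal per line, as in the function described above
string_lda = "".join(str(words) + "\n" for words in topic_words)

# Shortened stand-in for the full template; the placeholders have the same names
template = (
    "Describe the topic of each of the {num_topics} lists.\n"
    'Lists: """{string_lda}"""'
)
prompt = template.format(num_topics=len(topic_words), string_lda=string_lda)
print(prompt)
```

Printing the filled prompt before sending it is a cheap way to check that the lists and the topic count arrive in the form the instructions expect.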


openai_key = "sk-p..."
llm = OpenAI(openai_api_key=openai_key, max_tokens=-1)

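Next, we call the topics_from_pdf function. I chose values for the number of topics and the number of words per topic. I also picked a public domain book, The Metamorphosis, for testing. The file is stored in my personal Drive and is downloaded with the gdown library.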

!gdown https://drive.google.com/uc?id=1mpXUmuLGzkVEqsTicQvBPcpPJW0aPqdL
file = "./the-metamorphosis.pdf"
num_topics = 6
words_per_topic = 30
summary = topics_from_pdf(llm, file, num_topics, words_per_topic)
print(summary)


1: Exploring the transformation of Gregor Samsa and the effects on his family and lodgers
- Understanding Gregor's metamorphosis
- Examining the reactions of Gregor's family and lodgers
- Analyzing the impact of Gregor's transformation on his family
2: Examining the events surrounding the discovery of Gregor's transformation
- Investigating the initial reactions of Gregor's family and lodgers
- Analyzing the behavior of Gregor's family and lodgers
- Exploring the physical changes in Gregor's environment
3: Analyzing the pressures placed on Gregor's family due to his transformation
- Examining the financial strain on Gregor's family
- Investigating the emotional and psychological effects on Gregor's family
- Examining the changes in family dynamics due to Gregor's metamorphosis
4: Examining the consequences of Gregor's transformation
- Investigating the physical changes in Gregor's environment
- Analyzing the reactions of Gregor's family and lodgers
- Investigating the emotional and psychological effects on Gregor's family
5: Exploring the impact of Gregor's transformation on his family
- Analyzing the financial strain on Gregor's family
- Examining the changes in family dynamics due to Gregor's metamorphosis
- Investigating the emotional and psychological effects on Gregor's family
6: Investigating the physical changes in Gregor's environment
- Analyzing the reactions of Gregor's family and lodgers
- Examining the consequences of Gregor's transformation
- Exploring the impact of Gregor's transformation on his family
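
Because the prompt fixes the layout of the response, the output can be turned into a data structure with a few lines of parsing. The following is a minimal sketch, assuming the model follows the `n: sentence` and `- subtheme` layout shown above; `parse_outline` is a hypothetical helper, not part of the original code.

```python
import re

def parse_outline(response):
    """Parse 'n: topic' lines followed by '- subtheme' lines into a list of dicts."""
    outline = []
    for line in response.splitlines():
        line = line.strip()
        if not line:
            continue
        heading = re.match(r"^(\d+):\s*(.+)$", line)
        if heading:
            outline.append({"topic": heading.group(2), "subthemes": []})
        elif line.startswith("-") and outline:
            # Attach the subtheme to the most recent topic
            outline[-1]["subthemes"].append(line.lstrip("- ").strip())
    return outline

sample = """1: Exploring the transformation of Gregor Samsa and the effects on his family and lodgers
- Understanding Gregor's metamorphosis
- Examining the reactions of Gregor's family and lodgers
- Analyzing the impact of Gregor's transformation on his family"""

outline = parse_outline(sample)
print(outline[0]["topic"])
print(outline[0]["subthemes"])
```

A model may not always respect the requested layout, so lines that match neither pattern are simply skipped here.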


This approach also works for technical books. For example, The Foundations of Geometry by David Hilbert (1899), which is also in the public domain:

1: Analyzing the properties of geometric shapes and their relationships
- Exploring the axioms of geometry
- Analyzing the congruence of angles and lines
- Investigating theorems of geometry
2: Studying the behavior of rational functions and algebraic equations
- Examining the straight lines and points of a problem
- Investigating the coefficients of a function
- Examining the construction of a definite integral
3: Investigating the properties of a number system
- Exploring the domain of a true group
- Analyzing the theorem of equal segments
- Examining the circle of arbitrary displacement
4: Examining the area of geometric shapes
- Analyzing the parallel lines and points
- Investigating the content of a triangle
- Examining the measures of a polygon
5: Examining the theorems of algebraic geometry
- Exploring the congruence of segments
- Analyzing the system of multiplication
- Investigating the valid theorems of a call
6: Investigating the properties of a figure
- Examining the parallel lines of a triangle
- Analyzing the equation of joining sides
- Examining the intersection of segments




