Document Topic Extraction with LLMs and the LDA Algorithm

Published September 18, 2023, by alex

Introduction


I am developing a web application for chatting with PDF files that can handle large documents (over 1,000 pages). But before starting a conversation with a document, I wanted the application to give the user a brief summary of its main topics, making it easier to begin the interaction.


One way to do this is to summarize the document using the approach shown in LangChain's documentation. The problem, however, is the high computational cost and, with it, the high monetary cost. A thousand-page document contains roughly 250,000 words, and every one of them has to be fed into the LLM. Moreover, the results must be further processed, for example with a map-reduce approach. A conservative estimate using GPT-3.5 Turbo with a 4k context puts the cost at over $1 per document, just for the summary. Even with free resources such as the unofficial HuggingChat API, the number of API calls required would amount to abuse. So I needed a different approach.
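As a rough back-of-envelope check, here is one set of assumptions that roughly reproduces that figure (the token ratio, overhead factor, and price are my assumptions, not authoritative rates):

WORDS = 250_000
TOKENS_PER_WORD = 4 / 3        # common rule of thumb for English text
PRICE_PER_1K = 0.002           # USD per 1k tokens, GPT-3.5 Turbo's original flat rate
MAP_REDUCE_OVERHEAD = 1.5      # chunks are read once, intermediate summaries re-read

tokens = WORDS * TOKENS_PER_WORD * MAP_REDUCE_OVERHEAD
print(f"~${tokens / 1000 * PRICE_PER_1K:.2f} per document")  # ~$1.00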


The LDA Algorithm to the Rescue


The LDA algorithm is a natural choice for this problem. It takes a set of "documents" (in this context, a "document" is a piece of text) and returns, for each one, a list of topics along with the words associated with each topic. For our case, it is the list of words associated with each topic that matters. These word lists encode the content of the file, so they can be fed to the LLM as a prompt for summarization.
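To make this concrete, here is a minimal toy sketch (my own illustration, not part of the app) of what gensim's LDA returns for three tiny pre-tokenized "documents":

from gensim import corpora
from gensim.models import LdaModel

# Three toy "documents", already tokenized
docs = [["cat", "dog", "pet", "vet"],
        ["stock", "market", "trade", "price"],
        ["dog", "pet", "vet", "leash"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
# Each topic comes back as a weighted list of words, e.g.
# (0, '0.167*"pet" + 0.167*"vet" + 0.167*"dog" + ...')
print(lda.print_topics(num_words=4))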


Two key considerations must be addressed before getting high-quality results: choosing the hyperparameters for the LDA algorithm and deciding on the output format. The most important hyperparameter is the number of topics, since it has the greatest influence on the final result. As for the output format, one that works remarkably well is a nested bulleted list, in which each topic is represented as a bulleted list with subentries that further describe it. As for why this works, I think that with this format the model can focus on extracting content from the lists, without the complexity of writing paragraphs full of connectors and relationships.


Implementation


I implemented the code in Google Colab. The required libraries are gensim for LDA, pypdf for PDF processing, nltk for word processing, and LangChain for its prompt templates and its interface to the OpenAI API.
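In a fresh Colab session, the dependencies can be installed with pip. No version pins are given here; note that LangChain's API has changed quickly, so the imports below assume a version contemporary with this article:

!pip install gensim pypdf nltk langchain openai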


import gensim
import nltk
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from pypdf import PdfReader
from langchain.chains import LLMChain
from langchain.prompts import ChatPromptTemplate
from langchain.llms import OpenAI


Next, I defined a utility function, preprocess, to help process the input text. It removes stopwords and short tokens.


def preprocess(text, stop_words):
    """
    Tokenizes and preprocesses the input text, removing stopwords and short 
    tokens.
    Parameters:
        text (str): The input text to preprocess.
        stop_words (set): A set of stopwords to be removed from the text.
    Returns:
        list: A list of preprocessed tokens.
    """
    result = []
    for token in simple_preprocess(text, deacc=True):
        if token not in stop_words and len(token) > 3:
            result.append(token)
    return result
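A quick sanity check of what the function produces (assuming the nltk stopword corpus has been downloaded):

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
print(preprocess("The LDA algorithm discovers topics in text", stop_words))
# ['algorithm', 'discovers', 'topics', 'text']  ("the"/"in" are stopwords, "lda" is too short)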


The second function, get_topic_lists_from_pdf, implements the LDA portion of the code. It takes the path to a PDF file, the number of topics, and the number of words per topic, and returns a list. Each element of that list contains the words associated with one topic. Here, each page of the PDF file is treated as one "document".


def get_topic_lists_from_pdf(file, num_topics, words_per_topic):
    """
    Extracts topics and their associated words from a PDF document using the 
    Latent Dirichlet Allocation (LDA) algorithm.
    Parameters:
        file (str): The path to the PDF file for topic extraction.
        num_topics (int): The number of topics to discover.
        words_per_topic (int): The number of words to include per topic.
    Returns:
        list: A list of num_topics sublists, each containing relevant words 
        for a topic.
    """
    # Load the pdf file
    loader = PdfReader(file)
    # Extract the text from each page into a list. Each page is considered a document
    documents = []
    for page in loader.pages:
        documents.append(page.extract_text())
    # Preprocess the documents
    nltk.download('stopwords')
    stop_words = set(stopwords.words(['english','spanish']))
    processed_documents = [preprocess(doc, stop_words) for doc in documents]
    # Create a dictionary and a corpus
    dictionary = corpora.Dictionary(processed_documents)
    corpus = [dictionary.doc2bow(doc) for doc in processed_documents]
    # Build the LDA model
    lda_model = LdaModel(
        corpus, 
        num_topics=num_topics, 
        id2word=dictionary, 
        passes=15
        )
    # Retrieve the topics and their corresponding words
    topics = lda_model.print_topics(num_words=words_per_topic)
    # Store each list of words from each topic into a list
    topics_ls = []
    for topic in topics:
        words = topic[1].split("+")
        topic_words = [word.split("*")[1].replace('"', '').strip() for word in words]
        topics_ls.append(topic_words)
    return topics_ls
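As an aside, the string parsing at the end of the function can be avoided: gensim's show_topics accepts formatted=False and returns (word, probability) pairs directly. A drop-in replacement for the last block could look like this:

    # Retrieve the topics as (topic_id, [(word, probability), ...]) tuples
    topics = lda_model.show_topics(
        num_topics=num_topics, num_words=words_per_topic, formatted=False)
    topics_ls = [[word for word, _ in topic_words] for _, topic_words in topics]
    return topics_ls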



The next function, topics_from_pdf, invokes the LLM model. As noted earlier, the model is asked to format the output as a nested bulleted list.


def topics_from_pdf(llm, file, num_topics, words_per_topic):
    """
    Generates descriptive prompts for LLM based on topic words extracted from a 
    PDF document.
    This function takes the output of `get_topic_lists_from_pdf` function, 
    which consists of a list of topic-related words for each topic, and 
    generates an output string in table of content format.
    Parameters:
        llm (LLM): An instance of the Large Language Model (LLM) for generating 
        responses.
        file (str): The path to the PDF file for extracting topic-related words.
        num_topics (int): The number of topics to consider.
        words_per_topic (int): The number of words per topic to include.
    Returns:
        str: A response generated by the language model based on the provided 
        topic words.
    """
    # Extract topics and convert to string
    list_of_topicwords = get_topic_lists_from_pdf(file, num_topics, 
                                                  words_per_topic)
    string_lda = ""
    for word_list in list_of_topicwords:
        string_lda += str(word_list) + "\n"
    # Create the template
    template_string = '''Describe the topic of each of the {num_topics} 
        double-quote delimited lists in a simple sentence and also write down 
        three possible different subthemes. The lists are the result of an 
        algorithm for topic discovery.
        Do not provide an introduction or a conclusion, only describe the 
        topics. Do not mention the word "topic" when describing the topics.
        Use the following template for the response.
        1: <<<(sentence describing the topic)>>>
        - <<<(Phrase describing the first subtheme)>>>
        - <<<(Phrase describing the second subtheme)>>>
        - <<<(Phrase describing the third subtheme)>>>
        2: <<<(sentence describing the topic)>>>
        - <<<(Phrase describing the first subtheme)>>>
        - <<<(Phrase describing the second subtheme)>>>
        - <<<(Phrase describing the third subtheme)>>>
        ...
        n: <<<(sentence describing the topic)>>>
        - <<<(Phrase describing the first subtheme)>>>
        - <<<(Phrase describing the second subtheme)>>>
        - <<<(Phrase describing the third subtheme)>>>
        Lists: """{string_lda}""" '''
    # LLM call
    prompt_template = ChatPromptTemplate.from_template(template_string)
    chain = LLMChain(llm=llm, prompt=prompt_template)
    response = chain.run({
        "string_lda" : string_lda,
        "num_topics" : num_topics
        })
    return response


In the previous function, the word lists are converted into a string. Then the prompt is created using LangChain's ChatPromptTemplate object; note that the prompt defines the structure of the response. Finally, the function calls the GPT-3.5 Turbo model. The return value is the response given by the LLM.


Now it is time to call the functions. We first set the API key; instructions for obtaining one are available from OpenAI.


openai_key = "sk-p..."
llm = OpenAI(openai_api_key=openai_key, max_tokens=-1)


Next, we call the topics_from_pdf function. I chose values for the number of topics and the number of words per topic, and picked a public-domain book, The Metamorphosis by Franz Kafka, for the test. The file is stored on my personal drive and downloaded with the gdown library.


!gdown https://drive.google.com/uc?id=1mpXUmuLGzkVEqsTicQvBPcpPJW0aPqdL
file = "./the-metamorphosis.pdf"
num_topics = 6
words_per_topic = 30
summary = topics_from_pdf(llm, file, num_topics, words_per_topic)


The result is shown below:


1: Exploring the transformation of Gregor Samsa and the effects on his family and lodgers
- Understanding Gregor's metamorphosis
- Examining the reactions of Gregor's family and lodgers
- Analyzing the impact of Gregor's transformation on his family
2: Examining the events surrounding the discovery of Gregor's transformation
- Investigating the initial reactions of Gregor's family and lodgers
- Analyzing the behavior of Gregor's family and lodgers
- Exploring the physical changes in Gregor's environment
3: Analyzing the pressures placed on Gregor's family due to his transformation
- Examining the financial strain on Gregor's family
- Investigating the emotional and psychological effects on Gregor's family
- Examining the changes in family dynamics due to Gregor's metamorphosis
4: Examining the consequences of Gregor's transformation
- Investigating the physical changes in Gregor's environment
- Analyzing the reactions of Gregor's family and lodgers
- Investigating the emotional and psychological effects on Gregor's family
5: Exploring the impact of Gregor's transformation on his family
- Analyzing the financial strain on Gregor's family
- Examining the changes in family dynamics due to Gregor's metamorphosis
- Investigating the emotional and psychological effects on Gregor's family
6: Investigating the physical changes in Gregor's environment
- Analyzing the reactions of Gregor's family and lodgers
- Examining the consequences of Gregor's transformation
- Exploring the impact of Gregor's transformation on his family


The output is pretty good, and it took only a few seconds! It correctly extracted the main ideas from the book.


The approach also works with technical books. For example, The Foundations of Geometry by David Hilbert (1899), also in the public domain:


1: Analyzing the properties of geometric shapes and their relationships
- Exploring the axioms of geometry
- Analyzing the congruence of angles and lines
- Investigating theorems of geometry
2: Studying the behavior of rational functions and algebraic equations
- Examining the straight lines and points of a problem
- Investigating the coefficients of a function
- Examining the construction of a definite integral
3: Investigating the properties of a number system
- Exploring the domain of a true group
- Analyzing the theorem of equal segments
- Examining the circle of arbitrary displacement
4: Examining the area of geometric shapes
- Analyzing the parallel lines and points
- Investigating the content of a triangle
- Examining the measures of a polygon
5: Examining the theorems of algebraic geometry
- Exploring the congruence of segments
- Analyzing the system of multiplication
- Investigating the valid theorems of a call
6: Investigating the properties of a figure
- Examining the parallel lines of a triangle
- Analyzing the equation of joining sides
- Examining the intersection of segments


Conclusion


Combining the LDA algorithm with an LLM for topic extraction from large documents produces good results while dramatically reducing cost and processing time. We went from hundreds of API calls to just one, and from minutes of processing to a few seconds.


The quality of the output depends heavily on its format; in this case, the nested bulleted list worked well. The number of topics and the number of words per topic also matter for the quality of the result. I recommend trying different prompts, numbers of topics, and numbers of words per topic to find the optimal combination for a given document, as sketched below.
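A minimal sketch of such a manual sweep (the ranges are arbitrary assumptions, and each call costs one LLM request):

for num_topics in (4, 6, 8):
    for words_per_topic in (20, 30, 40):
        print(f"--- {num_topics} topics, {words_per_topic} words per topic ---")
        print(topics_from_pdf(llm, file, num_topics, words_per_topic))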

Source: https://medium.com/towards-data-science/document-topic-extraction-with-large-language-models-llm-and-the-latent-dirichlet-allocation-e4697e4dae87