Topic Modeling Open-Source Research with the OpenAlex API

Published July 29, 2024 by alex

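What Is Topic Modeling?

Topic modeling is an unsupervised machine learning technique that analyzes documents and uses semantic similarity to identify "topics". It resembles clustering, except that a document does not have to belong to exactly one topic: each document can be a mixture of several. It is best thought of as a way to group the content found across a corpus. Topic modeling has many applications, but it is mainly used to make sense of large volumes of text.

For example, a retail chain might model customer surveys and reviews to identify negative feedback and drill down into the key issues customers raise. In our case, we will import a large number of articles and abstracts to understand the key topics in the dataset.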

OpenAlex

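OpenAlex is a free-to-use catalog of the global research system. It has indexed more than 250 million news items, articles, abstracts, and more.

Fortunately, it offers a free (but limited) and flexible API that lets us quickly retrieve tens of thousands of articles while applying filters such as year, media type, and keywords.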
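To make the filtering concrete, here is a minimal sketch of how a single search request could be assembled. The helper name `build_works_url` and its defaults are assumptions made for illustration; the endpoint, the `filter` syntax, and the `search`, `page`, and `per-page` parameters mirror the request used later in this article.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org/works"

def build_works_url(search_terms, start_year, end_year, page=1, per_page=200):
    """Return the URL for one page of a filtered works search (hypothetical helper)."""
    params = {
        "search": search_terms,
        # several filters are combined with commas
        "filter": f"publication_year:{start_year}-{end_year},type:article",
        "per-page": per_page,
        "page": page,
    }
    # urlencode escapes spaces, quotes, colons, and commas for us
    return f"{BASE_URL}?{urlencode(params)}"

url = build_works_url("drone", 2016, 2022)
print(url)
```

Letting `urlencode` build the query string avoids hand-escaping search terms that contain spaces or quotation marks.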

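Creating the Data Pipeline

When we pull data from the API, we will apply a few criteria. First, we will only collect documents published between 2016 and 2022. We want fairly recent language, because the terminology and taxonomy of some subjects change over long periods of time.

We will also add key terms and run multiple searches. Normally we might ingest random subject areas, but here we use key terms to narrow the search. That way we know roughly how many high-level topics there are and can compare that with the model's output. Below, we create a function that takes key terms and runs a search through the API.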

import pandas as pd
import requests

def import_data(pages, start_year, end_year, search_terms):
    
    """
    This function uses the OpenAlex API to conduct a search on works and returns a dataframe with the associated works.
    
    Inputs: 
        - pages: int, number of pages to loop through
        - search_terms: str, keywords to search for (must be formatted according to OpenAlex standards)
        - start_year and end_year: int, years to set as a range for filtering works
    """
    
    #collect one dataframe per page
    page_frames = []
    
    #pages are 1-indexed, so add 1 to include the final page
    for page in range(1, pages + 1):
        
        #use parameters to conduct request (requests handles the URL encoding)
        response = requests.get('https://api.openalex.org/works',
                                params = {'page': page, 'per-page': 200,
                                          'filter': f'publication_year:{start_year}-{end_year},type:article',
                                          'search': search_terms})
        #fail early on an HTTP error instead of parsing an error payload
        response.raise_for_status()
        
        #format the page of results as a dataframe
        page_frames.append(pd.DataFrame(response.json()['results']))
    
    #combine all pages into a single dataframe
    search_results = pd.concat(page_frames, ignore_index = True)
    
    #subset to relevant features
    search_results = search_results[["id", "title", "display_name", "publication_year", "publication_date",
                                        "type", "countries_distinct_count","institutions_distinct_count",
                                        "has_fulltext", "cited_by_count", "keywords", "referenced_works_count", "abstract_inverted_index"]]
    
    return(search_results)

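Below is an example of a search that uses the syntax OpenAlex expects. Note that this particular call passes 2016 and 2024 as the year range.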

#search for Trusted AI and Autonomy
ai_search = import_data(35, 2016, 2024, "'artificial intelligence' OR 'deep learn' OR 'neural net' OR 'autonomous' OR drone")
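Because several searches like this one can return the same work, the combined results need deduplicating. The sketch below shows the idea on plain lists of records (the shape of `response.json()['results']`) rather than on DataFrames; the helper name `merge_unique_works` and the sample records are made up for illustration.

```python
def merge_unique_works(*result_lists):
    """Merge lists of work records, keeping the first record seen for each id."""
    unique = {}
    for results in result_lists:
        for work in results:
            # setdefault leaves an existing entry untouched, so duplicates are skipped
            unique.setdefault(work["id"], work)
    return list(unique.values())

# Made-up records standing in for two overlapping searches
ai_results = [{"id": "W1", "title": "Neural nets"}, {"id": "W2", "title": "Drone autonomy"}]
hmi_results = [{"id": "W2", "title": "Drone autonomy"}, {"id": "W3", "title": "Human-machine interfaces"}]

combined = merge_unique_works(ai_results, hmi_results)
print([work["id"] for work in combined])  # ['W1', 'W2', 'W3']
```

With DataFrames, the same result can be had with `pd.concat([...]).drop_duplicates(subset="id")`.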

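After compiling our searches and dropping duplicate documents, we have to clean the data to prepare it for the topic model. There are two main problems with our current output.

Abstracts are returned as an inverted index (for legal reasons). However, we can use the index to reconstruct the original text.
Once we have the original text, it will be raw and unprocessed, which adds noise and hurts our model. We will apply traditional NLP preprocessing to get it ready for the model.

Below is a function that reconstructs the original text from an inverted index.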

def undo_inverted_index(inverted_index):
    
    """
    The purpose of the function is to 'undo' an inverted index. It takes an inverted index and
    returns the original string.
    """
    #create empty lists to store uninverted index
    word_index = []
    words_unindexed = []
    
    #loop through index and return key-value pairs
    for k,v in inverted_index.items(): 
        for index in v: word_index.append([k,index])
    #sort by the index
    word_index = sorted(word_index, key = lambda x : x[1])
    
    #join only the values and flatten
    for pair in word_index:
        words_unindexed.append(pair[0])
    words_unindexed = ' '.join(words_unindexed)
    
    return(words_unindexed)
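To see what the inversion looks like, here is a small worked example. It uses a compact variant of the function above, written for illustration under the name `rebuild_text`; the sample index is made up, and the handling of a missing abstract is an added assumption, since a work without an abstract may come back with no index at all.

```python
def rebuild_text(inverted_index):
    """Rebuild a string from a {word: [positions]} mapping; return '' if it is missing."""
    if not inverted_index:
        return ""
    # flip the mapping so that each position points at its word
    position_to_word = {
        position: word
        for word, positions in inverted_index.items()
        for position in positions
    }
    # read the words back in position order
    return " ".join(position_to_word[p] for p in sorted(position_to_word))

sample_index = {"the": [0, 3], "cat": [1], "saw": [2], "dog": [4]}
print(rebuild_text(sample_index))  # the cat saw the dog
print(repr(rebuild_text(None)))    # ''
```

A word that occurs more than once, such as "the" here, simply carries several positions.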

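Now that we have the raw text, we can run the traditional preprocessing steps, such as normalization, stopword removal, and lemmatization. The functions below can be mapped over a list or Series of documents.

import re
import nltk

#the NLTK tokenizer, stopword list, and WordNet data must be downloaded once
#with nltk.download (the exact resource names depend on the NLTK version)

def preprocess(text):
    
    """
    This function takes in a string, converts it to lowercase, cleans
    it (removes special characters and numbers), and tokenizes it.
    """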
    
    #convert to lowercase
    text = text.lower()
    
    #remove special character and digits
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    
    #tokenize
    tokens = nltk.word_tokenize(text)
    
    return(tokens)

def remove_stopwords(tokens):
    
    """
    This function takes in a list of tokens (from the 'preprocess' function) and 
    removes a list of stopwords. Custom stopwords can be added to the 'custom_stopwords' list.
    """
    
    #set default and custom stopwords
    stop_words = nltk.corpus.stopwords.words('english')
    custom_stopwords = []
    stop_words.extend(custom_stopwords)
    
    #filter out stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]
    
    return(filtered_tokens)

def lemmatize(tokens):
    
    """
    This function conducts lemmatization on a list of tokens (from the 'remove_stopwords' function).
    This shortens each word down to its root form to improve modeling results.
    """
    
    #initialize lemmatizer and lemmatize
    lemmatizer = nltk.WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    return(lemmatized_tokens)

def clean_text(text):
    
    """
    This function uses the previously defined functions to take a string and
    run it through the entire data preprocessing process.
    """
    
    #clean, tokenize, and lemmatize a string
    tokens = preprocess(text)
    filtered_tokens = remove_stopwords(tokens)
    lemmatized_tokens = lemmatize(filtered_tokens)
    #join the tokens back into one string (named to avoid shadowing the function)
    cleaned = ' '.join(lemmatized_tokens)
    
    return(cleaned)
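To follow what the cleaning steps do to a single string, here is a dependency-free sketch. It is a simplification, not the pipeline above: `str.split` stands in for the NLTK tokenizer, the stopword set is a tiny made-up one, and lemmatization is left out. The function name `simple_clean` is hypothetical.

```python
import re

def simple_clean(text, stop_words):
    """Lowercase, strip digits and punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)      # remove digits
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation and other special characters
    tokens = text.split()                # whitespace tokenization (simplification)
    return [token for token in tokens if token not in stop_words]

stop_words = {"in", "a", "the", "of"}
print(simple_clean("Deep Learning in 2021: A Survey of the Field!", stop_words))
# ['deep', 'learning', 'survey', 'field']
```

Keeping the stopwords in a set makes each membership check constant time, which matters once the function is mapped over tens of thousands of abstracts.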

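Now that we have a series of preprocessed documents, we can create our first topic model!

Creating the Topic Model

For our topic model, we will use gensim to create a Latent Dirichlet Allocation (LDA) model. LDA is one of the most widely used topic modeling approaches because it is effective at identifying high-level topics in a corpus. The following packages are used to create the model.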

import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

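Before creating the model, we have to prepare the corpus and the ID mappings. This takes only a few lines of code.

#convert the preprocessed text to a list (data is the dataframe holding the cleaned text in "clean_text")
documents = list(data["clean_text"])
#separate by ' ' to tokenize each article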
texts = [x.split(' ') for x in documents]
#construct word ID mappings
id2word = Dictionary(texts)
#use word ID mappings to build corpus
corpus = [id2word.doc2bow(text) for text in texts]
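For readers new to the bag-of-words format, the following sketch imitates in plain Python what the ID mapping and the corpus look like. It is an illustration only, not gensim's implementation: the helper names are made up, and gensim may assign the integer ids in a different order.

```python
from collections import Counter

def build_vocabulary(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token_to_id = {}
    for tokens in texts:
        for token in tokens:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bag_of_words(tokens, token_to_id):
    """Return sorted (token_id, count) pairs for the tokens found in the vocabulary."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())

sample_texts = [["drone", "autonomy", "drone"], ["neural", "network", "autonomy"]]
vocabulary = build_vocabulary(sample_texts)
sample_corpus = [to_bag_of_words(tokens, vocabulary) for tokens in sample_texts]

print(vocabulary)     # {'drone': 0, 'autonomy': 1, 'neural': 2, 'network': 3}
print(sample_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1), (3, 1)]]
```

Each document becomes a list of (id, count) pairs, so word order is discarded and only frequencies reach the model.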

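Now we can create a topic model. As shown below, there are many parameters that affect the model's performance. You can read about them in the gensim documentation.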

#build LDA model
lda_model = LdaModel(corpus = corpus, id2word = id2word, num_topics = 10, decay = 0.5,
                     random_state = 0, chunksize = 100, alpha = 'auto', per_word_topics = True)

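The most important parameter is the number of topics. Here we set it to 10 arbitrarily. Because we do not know how many topics there should be, this parameter definitely needs to be tuned. But how do we measure the quality of the model?

This is where the coherence score comes in. With the c_v measure used here, the score ranges from 0 to 1. Coherence measures topic quality by checking that topics are sound and distinct, meaning that the top words of a topic should belong together. We want well-defined topics with clear boundaries between them. Although this is ultimately somewhat subjective, it gives a good sense of the quality of the results.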

#compute coherence score
coherence_model_lda = CoherenceModel(model = lda_model, texts = texts, dictionary = id2word, coherence = 'c_v')
coherence_score = coherence_model_lda.get_coherence()
print(coherence_score)

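Here we get a coherence score of about 0.48, which is not bad, but not yet ready for production.

Visualizing Our Topic Model

Topic models can be hard to visualize. Fortunately, there is a great module, pyLDAvis, that automatically generates an interactive visualization, letting us view the topics in a vector space and drill into each one.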

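import pyLDAvis
#the gensim helpers live in pyLDAvis.gensim_models (pyLDAvis 3.x); older releases named this submodule pyLDAvis.gensim
import pyLDAvis.gensim_models
#create Topic Distance Visualization 
pyLDAvis.enable_notebook()
lda_viz = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
lda_viz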

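This produces a useful visualization that lets us quickly see how our model is doing. Looking at the vector space, we can see that some topics are distinct and well defined. However, we also find some topics that overlap.

We can click a topic to see its most relevant tokens. As we adjust the relevance metric (lambda), sliding left surfaces tokens that are specific to the topic, while sliding right surfaces tokens that are relevant to the topic but less specific to it.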
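The effect of the slider can be reproduced by hand. The sketch below uses the relevance definition from the LDAvis method that pyLDAvis ports, as I understand it: lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)). The two terms and their probabilities are invented purely for illustration.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """Blend in-topic probability (lam = 1) with lift over the corpus (lam = 0)."""
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(p_word_given_topic / p_word)

# Invented values: (probability within the topic, probability across the whole corpus)
terms = {
    "model": (0.10, 0.08),        # frequent in the topic, but frequent everywhere
    "quadcopter": (0.02, 0.002),  # rarer overall, but concentrated in this topic
}

def rank_terms(lam):
    """Return the terms ordered from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda word: relevance(*terms[word], lam), reverse=True)

print(rank_terms(1.0))  # ['model', 'quadcopter']
print(rank_terms(0.0))  # ['quadcopter', 'model']
```

At lambda = 1 the ranking follows raw in-topic frequency, while at lambda = 0 it rewards terms that are much more likely inside the topic than in the corpus overall.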

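Clicking into each topic, I can vaguely make out the subjects I originally searched for. For example, topic 5 appears to line up with my "human-machine interface" search. There is also a group of topics that seem related to biotechnology, although some are clearer than others.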

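Optimizing the Topic Model

Judging by the pyLDAvis interface and the coherence score of 0.48, we clearly have room for improvement. As a final step, let's write a function that loops over different parameter values to try to optimize the coherence score. Below is a function that tests different numbers of topics and decay rates. It computes the coherence score for each parameter combination and saves the results in a DataFrame.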

def lda_model_evaluation():
    
    """
    This function loops through a number of parameters for an LDA model, creates the model,
    computes the coherence score, and saves the results in a pandas dataframe. The output dataframe
    contains the values of the parameters tested and the resulting coherence score.
    """
    
    #define empty lists to save results
    topic_number, decay_rate_list, score  = [], [], []
    
    #loop through a number of parameters
    for topics in range(5,12):
        for decay_rate in [0.5, 0.6, 0.7]:
                
                #build LDA model
                lda_model = LdaModel(corpus = corpus, id2word = id2word, num_topics = topics, decay = decay_rate,
                               random_state = 0, chunksize = 100, alpha = 'auto', per_word_topics = True)
                
                #compute coherence score
                coherence_model_lda = CoherenceModel(model = lda_model, texts = texts, dictionary = id2word, coherence = 'c_v')
                coherence_score = coherence_model_lda.get_coherence()
                
                #append parameters to lists
                topic_number.append(topics)
                decay_rate_list.append(decay_rate)
                score.append(coherence_score)
                
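                #report progress (only the scores are recorded, the model itself is not saved)
                print(f"Evaluated num_topics = {topics}, decay = {decay_rate}")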
    
    #gather result into a dataframe
    results = {"Number of Topics": topic_number,
                "Decay Rate": decay_rate_list,
                "Score": score}
    
    results = pd.DataFrame(results)
    
    return(results)

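By searching only a few small ranges for two parameters, we were able to identify settings that raise the coherence score from 0.48 to 0.55, a sizable improvement.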
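Picking the winning combination out of the results is a one-liner. The sketch below works on a list of row dictionaries, the shape produced by `results.to_dict("records")`; the scores are placeholders invented for the example and are not the values obtained in this article.

```python
# Placeholder rows with invented scores
rows = [
    {"Number of Topics": 5, "Decay Rate": 0.5, "Score": 0.41},
    {"Number of Topics": 8, "Decay Rate": 0.6, "Score": 0.47},
    {"Number of Topics": 11, "Decay Rate": 0.7, "Score": 0.44},
]

# keep the row with the highest coherence score
best = max(rows, key=lambda row: row["Score"])
print(best["Number of Topics"], best["Decay Rate"])  # 8 0.6
```

On the DataFrame itself, `results.sort_values("Score", ascending=False).head()` shows the strongest combinations at a glance.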

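Conclusion

In this article, we:

Introduced topic modeling and the OpenAlex data source
Built a data pipeline that pulls data from the API and prepares it for an NLP model
Built an LDA model and visualized the results with pyLDAvis
Wrote code to help us find the best parameters

Source: https://towardsdatascience.com/topic-modeling-open-source-research-with-the-openalex-api-5191c7db9156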
