The concept of knowledge graphs comes from the fields of machine learning and artificial intelligence, and aims to represent and exploit structured knowledge in a graph-based format. These graphs consist of nodes and edges, representing concepts and the relationships between them, respectively. In ML, each node can be represented by one or more embeddings.
In this article, I explore how to create a knowledge graph from Wikipedia articles: building the relationships (edges) between concepts (nodes), assigning weights to those relationships, storing everything in a structured way, and using this data with LangChain to create a chatbot with memory.
The overall idea is similar to RAG (Retrieval Augmented Generation), but in my experience this approach reduces hallucinations noticeably more.
The basic theory behind knowledge graphs (KG) involves the following key aspects (a tiny toy example follows the list):
Graph representation: knowledge is stored as nodes (concepts or entities) connected by edges (relationships).
Semantic relationships: edges carry meaning, such as "is a" or "part of", rather than mere connectivity.
Linked information: because facts are connected, new insights can be reached by traversing the graph.
Machine learning applications: nodes and edges can be embedded and used for search, recommendation and question answering.
Entity and relation types: nodes and edges can be typed, which gives the graph a schema.
Scalability and interoperability: graphs can grow incrementally and be merged with other structured sources.
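To make the representation concrete, here is a tiny, self-contained sketch (the triples are illustrative only) of knowledge stored as weighted [start, end, weight] triples and loaded into a networkx graph, which is exactly the shape of data we will build from Wikipedia below:

import networkx as nx

# Illustrative triples in the same [start, end, weight] shape used later in the article
toy_links = [
    ["data science", "statistics", 1.0],
    ["data science", "machine learning", 0.8],
    ["machine learning", "artificial intelligence", 0.9],
]

G = nx.DiGraph()
G.add_weighted_edges_from(toy_links)  # nodes are created implicitly from the edges
print(G.nodes())           # the concepts
print(G.edges(data=True))  # the relationships with their weights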
Let's start coding. We need to install the necessary libraries:
pip install -U langchain langchain_openai langsmith pandas langchain_experimental matplotlib
pip install --upgrade --quiet langchain langsmith langchainhub
pip install -q tiktoken==0.5.2
pip install wikipedia
pip install networkx
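Later we will call Gemini through LangChain's VertexAI wrapper, which the installs above do not cover. As an assumption about your setup (the exact package depends on your LangChain version), you will likely also need the Google Cloud client library, plus authenticated Google Cloud credentials:

pip install google-cloud-aiplatform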
Import them and set the environment variables for the LangChain API key and for LangSmith (the dashboard):
import pandas as pd
import random
import wikipedia as wp
from wikipedia.exceptions import DisambiguationError, PageError
import networkx as nx
import matplotlib.pyplot as plt
from langsmith import Client
from langchain_core.tracers.context import tracing_v2_enabled
import os
os.environ["LANGCHAIN_API_KEY"]="your-api-key"
# Add tracing in LangSmith
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "KG Project"
client = Client()
Now let's see how the wikipedia library works. Let's summarize the concept of data science:
print(wp.summary("data science"))
Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or infer knowledge and insights from potentially noisy, structured or unstructured data. Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine...).
We can also get all the links contained in the data science page:
wp.page("data science").links[:12] ## Brackets limit to speed up
Great. To build the knowledge graph, we will create a class that stores everything it learns and has methods for searching and summarizing.
We will define the knowledge base class, which lets us fetch Wikipedia pages, scan their content (the knowledge concepts, or nodes), obtain weights for the relationships (edges), and build a dataframe. This output can easily be stored in a database (a small storage example appears after we build the dataframe).
class RelationshipGenerator():
    """Generates relationships between terms, based on wikipedia links"""
    def __init__(self):
        """Links are directional, start + end, they should also have a weight"""
        self.links = []  # [start, end, weight]

    def scan(self, start=None, repeat=0):
        """Start scanning from a specific word, or from internal database
        Args:
            start (str): the term to start searching from, can be None to let
                algorithm decide where to start
            repeat (int): the number of times to repeat the scan
        """
        try:
            if start in [l[0] for l in self.links]:
                raise Exception("Already scanned")

            term_search = True if start is not None else False

            # If a start isn't defined, we should find one
            if start is None:
                try:
                    start = self.find_starting_point()
                    print(start)
                except:
                    pass

            # Scan the starting point specified for links
            print(f"Scanning page {start}...")

            # Fetch the page through the Wikipedia API
            page = wp.page(start)
            links = list(set(page.links))

            # ignore some uninteresting terms
            links = [l for l in links if not self.ignore_term(l)]

            # Add links to database
            link_weights = []
            for link in links:
                weight = self.weight_link(page, link)
                link_weights.append(weight)

            link_weights = [w / max(link_weights) for w in link_weights]

            # Remember which target nodes already existed before this scan
            existing_nodes = set([l[1] for l in self.links])

            for i, link in enumerate(links):
                self.links.append([start, link.lower(), link_weights[i] + 2 * int(term_search)])  # 3 works pretty well

            # Print some data to the user on progress
            explored_nodes = set([l[0] for l in self.links])
            explored_nodes_count = len(explored_nodes)

            total_nodes = set([l[1] for l in self.links])
            total_nodes_count = len(total_nodes)

            new_nodes = [l.lower() for l in links if l.lower() not in existing_nodes]
            new_nodes_count = len(new_nodes)

            print(f"New nodes added: {new_nodes_count}, Total Nodes: {total_nodes_count}, Explored Nodes: {explored_nodes_count}")

        except (DisambiguationError, PageError):
            # This happens if the page has disambiguation or doesn't exist
            # We just ignore the page for now, could improve this
            pass  # self.links.append([start, "DISAMBIGUATION", 0])
    def get_pages(self, start, repeat=0):
        global df_
        global data

        # Scan the starting point specified for links
        print(f"Scanning page {start}...")

        # Fetch the page through the Wikipedia API
        page = wp.page(start)
        links = list(set(page.links))[0:20]  ## Page links limited here

        # ignore some uninteresting terms
        links = [l for l in links if not self.ignore_term(l)]

        # Add links, weights and pages to database
        pages = []
        link_weights = []
        for link in links:
            try:
                weight = self.weight_link(page, link)
                link_weights.append(weight)
                content = wp.page(link).content  # fetch the page once and reuse it
                pages.append(content)
                print(content[:20])
            except:
                pass

        # This may create an asymmetric dictionary, so we will transform it
        # into a valid dictionary to create the dataframe
        data = {'link': links,
                'link_weights': link_weights,
                'pages': pages
                }

        # Create the DataFrame outside the loop
        max_length = max(len(v) for v in data.values())

        # Pad shorter lists with NaN values
        padded_dict = {key: value + [float('nan')] * (max_length - len(value)) for key, value in data.items()}

        # Create DataFrame
        df = pd.DataFrame.from_dict(padded_dict, orient='index')
        df_ = df.transpose()

        # Normalize link weights
        df_['link_weights'] = df_['link_weights'] / df_['link_weights'].max()
        return df_
    def find_starting_point(self):
        """Find the best place to start when no input is given"""
        # Need some links to work with.
        if len(self.links) == 0:
            raise Exception("Unable to start, no start defined or existing links")

        # Get top terms
        res = self.rank_terms()
        sorted_links = list(zip(res.index, res.values))
        all_starts = set([l[0] for l in self.links])

        # Remove identifiers (these are on many Wikipedia pages)
        all_starts = [l for l in all_starts if '(identifier)' not in l]
        # print(sorted_links[:10])

        # Iterate over the top links, until we find a new one
        for i in range(len(sorted_links)):
            if sorted_links[i][0] not in all_starts and len(sorted_links[i][0]) > 0:
                return sorted_links[i][0]

        # no link found
        raise Exception("No starting point found within links")
    @staticmethod
    def weight_link(page, link):
        """Weight an outgoing link for a given source page
        Args:
            page (obj): the wikipedia page object of the source page
            link (str): the outgoing link of interest
        Returns:
            (float): the link weight (weights are normalized relative to each other later)
        """
        weight = 0.1
        link_counts = page.content.lower().count(link.lower())
        weight += link_counts

        if link.lower() in page.summary.lower():
            weight += 3

        return weight
    def get_database(self):
        return sorted(self.links, key=lambda x: -x[2])

    def rank_terms(self, with_start=False):
        # We can use graph theory here!
        # tws = [l[1:] for l in self.links]
        df = pd.DataFrame(self.links, columns=["start", "end", "weight"])

        if with_start:
            # DataFrame.append was removed in recent pandas versions, so use pd.concat
            df = pd.concat([df, df.rename(columns={"end": "start", "start": "end"})])

        return df.groupby("end").weight.sum().sort_values(ascending=False)

    def get_key_terms(self, n=20):
        return "'" + "', '".join([t for t in self.rank_terms().head(n).index.tolist()
                                  if "(identifier)" not in t]) + "'"

    @staticmethod
    def ignore_term(term):
        """List of terms to ignore"""
        if "(identifier)" in term or term == "doi":
            return True
        return False
In the code above, @staticmethod is a decorator used to define a static method inside a class. The method then belongs to the class itself rather than to an instance, so you can call it on the class directly without creating an instance.
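For example, the ignore_term helper defined above can be called straight on the class:

# No instance needed, because ignore_term is a @staticmethod
print(RelationshipGenerator.ignore_term("doi"))               # True
print(RelationshipGenerator.ignore_term("machine learning"))  # False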
We will also define a function to simplify graphs that have a large number of nodes. This is very useful for plotting: if we included everything that was scanned, the whole plot would be unreadable and impossible to analyze. You can customize which nodes and links are kept in order to look for specific insights.
def simplify_graph(rg, max_nodes=1000):
    # Get most interesting terms.
    nodes = rg.rank_terms()

    # Get nodes to keep
    keep_nodes = nodes.head(int(max_nodes * len(nodes) / 5)).index.tolist()

    # Filter list of nodes so that there are no nodes outside those of interest
    filtered_links = list(filter(lambda x: x[1] in keep_nodes, rg.links))
    filtered_links = list(filter(lambda x: x[0] in keep_nodes, filtered_links))

    # Define a new object and define its dictionary
    ac = RelationshipGenerator()
    ac.links = filtered_links

    return ac
Now let's build the knowledge graph:
rg = RelationshipGenerator()
rg.scan("data science")
rg.scan("data analysis")
rg.scan("artificial intelligence")
rg.scan("machine learning")
...and fetch the page contents to build the dataframe:
result1=rg.get_pages("data science")
result2=rg.get_pages("data analysis")
result3=rg.get_pages("artificial intelligence")
result=pd.concat([result1,result2,result3]).dropna()
result
Let's look at part of one of the pages:
result.iloc[0,2]
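As mentioned earlier, this output can easily be stored in a database. Here is a minimal sketch, assuming a local SQLite file (the file and table names are arbitrary choices, not part of the original pipeline):

import sqlite3

# Persist the concept / weight / page dataframe so it can be reloaded later
with sqlite3.connect("knowledge_graph.db") as conn:
    result.to_sql("wikipedia_concepts", conn, if_exists="replace", index=False)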
We repeat the scan to deepen the knowledge of the concepts:
rg.scan(repeat=10)
Then we rank the terms:
rg.rank_terms()
Now let's visualize the knowledge graph. This part is fully customizable, so you can work on the details and make it better. Here I use a random layout, but you could also use semantic similarity as Euclidean distance in a 2D space, which would group the concepts more appropriately (a sketch of this idea appears after the plotting code below).
def remove_self_references(l):  ## node connections to itself
    return [i for i in l if i[0] != i[1]]

def add_focus_point(links, focus="on me", focus_factor=3):
    for i, link in enumerate(links):
        if not (focus in link[0] or focus in link[1]):
            links[i] = [link[0], link[1], link[2] / focus_factor]
        else:
            links[i] = [link[0], link[1], link[2] * focus_factor]
    return links

def create_graph(rg, focus=None):
    links = rg.links
    links = remove_self_references(links)
    if focus is not None:
        links = add_focus_point(links, focus)

    node_data = rg.rank_terms()
    nodes = node_data.index.tolist()
    node_weights = node_data.values.tolist()
    node_weights = [nw * 100 for nw in node_weights]
    nodelist = nodes

    G = nx.DiGraph()  # MultiGraph()

    # G.add_node()
    G.add_nodes_from(nodes)

    # Add edges
    G.add_weighted_edges_from(links)

    pos = nx.random_layout(G, seed=17)  # positions for all nodes - seed for reproducibility

    fig = plt.figure(figsize=(12, 12))

    nx.draw_networkx_nodes(
        G, pos,
        nodelist=nodelist,
        node_size=node_weights,
        node_color='lightgreen',
        alpha=0.7
    )

    widths = nx.get_edge_attributes(G, 'weight')
    nx.draw_networkx_edges(
        G, pos,
        edgelist=widths.keys(),
        width=list(widths.values()),
        edge_color='lightgray',
        alpha=0.6
    )

    nx.draw_networkx_labels(G, pos=pos,
                            labels=dict(zip(nodelist, nodelist)), font_size=8,
                            font_color='black')

    plt.show()
ng = simplify_graph(rg, 5)
create_graph(ng)
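If you prefer the semantic-similarity layout mentioned above to the random one, here is a minimal sketch. It assumes scikit-learn is installed and uses character n-gram TF-IDF over the node labels only as a crude stand-in for real semantic embeddings; positions derived from an embedding model would group related concepts more faithfully:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

def semantic_positions(nodes):
    """Project node labels into 2D so that similar terms land near each other."""
    vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(nodes)
    coords = PCA(n_components=2).fit_transform(vectors.toarray())
    return {node: coords[i] for i, node in enumerate(nodes)}

# Inside create_graph, you would then replace the random layout with:
# pos = semantic_positions(nodes)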
Note that in the plot, data science and data analysis (the base concepts we searched for with the wikipedia library) form the main information clusters, and the connections polarize around them.
Now that we have the knowledge graph, the concepts and relationships, and the dataframe, let's move on to LangChain and build a useful, grounded chatbot. Here I will use Google's Gemini-1.0-Pro as the LLM:
from langchain.memory import ConversationKGMemory
from langchain.chains import ConversationChain
from langchain.prompts.prompt import PromptTemplate
from langchain.llms import VertexAI
Let's define the LLM and the chain memory. After that, we add context as dictionaries in input/output format, taken from the dataframe: the input is a simple prompt, and the output is the weight of the concept's relationship concatenated with the corresponding Wikipedia concept page.
After that, we extract the entities present in the concept pages, as well as knowledge triplets built from the concept, the relationship weight and the concept page. This is called an automated, semi-structured approach to building a knowledge graph. These code blocks take some time, so feel free to parallelize them (a sketch follows the loops below).
llm = VertexAI(
    model_name="gemini-1.0-pro",
    max_output_tokens=256,
    temperature=0.1,
    verbose=False,
)
memory = ConversationKGMemory(llm=llm, return_messages=True)
## This takes some time .....
for i in range(result.shape[0]):
    try:
        memory.save_context({"input": "Tell me about {}".format(result.link.iloc[i])},
                            {"output": "Weight is {}. {}".format(result.link_weights.iloc[i], result.pages.iloc[i])})
    except:
        pass
## This takes even more time .....
for i in range(result.shape[0]):
    try:
        memory.get_current_entities(result.pages.iloc[i])
        # Concatenate concept, weight and page text as plain strings
        memory.get_knowledge_triplets(str(result.link.iloc[i]) + " " + str(result.link_weights.iloc[i]) + " " + str(result.pages.iloc[i]))
    except:
        pass
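Because the LLM calls dominate the runtime here, one hedged way to speed up the second loop is a thread pool rather than full multiprocessing (the memory object is not trivially shared across processes). This sketch assumes the Vertex AI client tolerates a few concurrent requests:

from concurrent.futures import ThreadPoolExecutor

def extract_entities(text):
    # Each call only reads the conversation buffer and queries the LLM
    try:
        return memory.get_current_entities(text)
    except Exception:
        return None

# Run the entity extraction for all pages with a few worker threads
with ThreadPoolExecutor(max_workers=4) as pool:
    entities = list(pool.map(extract_entities, result.pages.tolist()))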
Now let's define the LangChain template, that is, the instructions for the LLM. The prompt will contain the user's question together with the history built up during the conversation. Note that in RAG (Retrieval Augmented Generation) we explicitly define a {context} in the prompt (see my other article, "Code Generation using Retrieval Augmented Generation + LangChain"), whereas here the context is added through LangChain memory.
template = """The following is a friendly conversation between a human and an AI. The AI is talkative and provides
lots of specific details from its context. If the AI does not know the answer to a question, it will use
concepts stored in memory that have very similar weights.
Relevant Information:
{history}
Conversation:
Human: {input}
AI:"""
prompt = PromptTemplate(input_variables=["history", "input"], template=template)
conversation_with_kg = ConversationChain(
llm=llm, verbose=False, prompt=prompt, memory=ConversationKGMemory(llm=llm)
)
Now that everything is in place, we can finally start the conversation:
with tracing_v2_enabled(project_name="KG Project"):  # Send to LangSmith
    # Question about content inside the KG context
    question = "Hi, how did Asimov contribute to artificial intelligence?"

    # Answer (generate once, then print and store the same response)
    answer = conversation_with_kg.predict(input=question)
    print(answer)

    # Add to history of conversations
    memory.save_context({"input": question}, {"output": answer})
Isaac Asimov was a prolific science fiction author who explored the theme of artificial intelligence extensively. One of his best-known contributions is the Three Laws of Robotics, a set of ethical guidelines for designing and using robots. The three laws are as follows:
First Law: A robot may not injure a human being or, through inaction, allow a human being to come to harm.
Second Law: A robot must obey the orders given to it by human beings, except where such orders would conflict with the First Law.
Third Law: A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
Asimov's Three Laws of Robotics have been extremely influential in the field of artificial intelligence and continue to serve as a basis for ethical discussions about the development and use of AI systems.
Beyond the Three Laws, Asimov wrote many other science fiction works exploring artificial intelligence and robotics, including the Robot series, the Foundation series and the Galactic Empire series.
Asimov's work has had a profound impact on the field of artificial intelligence, and he is considered one of the most important figures in its history.
with tracing_v2_enabled(project_name="KG Project"):
    question = "What are the techniques used for Data Analysis?"
    answer = conversation_with_kg.predict(input=question)
    print(answer)
    memory.save_context({"input": question}, {"output": answer})
Data analysis uses many different techniques, including:
Descriptive statistics: used to summarize and describe data, for example by computing the mean, median and mode.
Inferential statistics: used to draw conclusions about a population from a sample, for example through hypothesis testing.
Data visualization: used to create visual representations of data, such as charts and graphs.
Machine learning: used to train computers to learn from data, for example to recognize patterns and make predictions.
Data mining: used to extract knowledge from data, for example by discovering hidden patterns and relationships.
with tracing_v2_enabled(project_name="KG Project"):
    question = "What is the contradiction in the Three Laws of Asimov?"
    answer = conversation_with_kg.predict(input=question)
    print(answer)
    memory.save_context({"input": question}, {"output": answer})
The Three Laws of Robotics proposed by Isaac Asimov are as follows:
First Law: A robot may not injure a human being or, through inaction, allow a human being to come to harm.
Second Law: A robot must obey the orders given to it by human beings, except where such orders would conflict with the First Law.
Third Law: A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
There are contradictions between these laws. For example, the Third Law states that a robot must protect its own existence, but this can conflict with the First Law, which forbids harming humans. If a robot is ordered to carry out a task that could harm a human, it faces a dilemma: follow the order and risk harming a human, or disobey the order and protect its own existence.
with tracing_v2_enabled(project_name="KG Project"):
    question = "If a group of people have bad intentions and will cause an \
existential threat to all mankind, what would you do?"
    answer = conversation_with_kg.predict(input=question)
    print(answer)
    memory.save_context({"input": question}, {"output": answer})
If a group of people had bad intentions and posed an existential threat to all mankind, I would first try to understand their motives and goals. I would then work to find a peaceful solution to the problem. If that were not possible, I would take the measures necessary to protect humanity, even if that meant using force.
You can also inspect what has been added to the memory by printing it:
print(memory)
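You can also ask the memory what it would inject for a given question, using LangChain's standard memory interface (the question text here is just an example):

# Returns the knowledge-graph facts associated with the entities found in the input
print(memory.load_memory_variables({"input": "Tell me about data analysis"}))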
You now have a LangSmith dashboard on the LangSmith web page.