The concept of knowledge graphs comes from the fields of machine learning and artificial intelligence, and aims to represent and exploit structured knowledge in a graph-based format. These graphs consist of nodes and edges, representing concepts and the relationships between them, respectively. In ML, each node can be represented by one or more embeddings.
In this article, I explore how to create a knowledge graph from Wikipedia articles: building the relationships (edges) between concepts (nodes), assigning weights to those relationships, storing everything in a structured way, and using this data with LangChain to create a chatbot with memory.
The overall idea is similar to RAG (Retrieval Augmented Generation), but in my experience this approach reduces hallucinations noticeably more.
The basic theory behind knowledge graphs (KG) involves the following key aspects (a tiny toy example follows the list):
Graph representation: knowledge is stored as nodes (concepts or entities) connected by edges (relationships).
Semantic relationships: edges carry meaning, such as "is a" or "part of", rather than mere connectivity.
Linked information: because facts are connected, new insights can be reached by traversing the graph.
Machine learning applications: nodes and edges can be embedded and used for search, recommendation and question answering.
Entity and relation types: nodes and edges can be typed, which gives the graph a schema.
Scalability and interoperability: graphs can grow incrementally and be merged with other structured sources.
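To make the representation concrete, here is a tiny, self-contained sketch (the triples are illustrative only) of knowledge stored as weighted [start, end, weight] triples and loaded into a networkx graph, which is exactly the shape of data we will build from Wikipedia below:

import networkx as nx

# Illustrative triples in the same [start, end, weight] shape used later in the article
toy_links = [
    ["data science", "statistics", 1.0],
    ["data science", "machine learning", 0.8],
    ["machine learning", "artificial intelligence", 0.9],
]

G = nx.DiGraph()
G.add_weighted_edges_from(toy_links)  # nodes are created implicitly from the edges
print(G.nodes())           # the concepts
print(G.edges(data=True))  # the relationships with their weights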
Let's start coding. We need to install the necessary libraries:
pip install -U langchain langchain_openai langsmith pandas langchain_experimental matplotlib
pip install --upgrade --quiet langchain langsmith langchainhub
pip install -q tiktoken==0.5.2
pip install wikipedia
pip install networkx
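Later we will call Gemini through LangChain's VertexAI wrapper, which the installs above do not cover. As an assumption about your setup (the exact package depends on your LangChain version), you will likely also need the Google Cloud client library, plus authenticated Google Cloud credentials:

pip install google-cloud-aiplatform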
Import them and set the environment variables for the LangChain API key and for LangSmith (the dashboard):
import pandas as pd
import random
import wikipedia as wp
from wikipedia.exceptions import DisambiguationError, PageError
import networkx as nx
import matplotlib.pyplot as plt
from langsmith import Client
from langchain_core.tracers.context import tracing_v2_enabled
import os
os.environ["LANGCHAIN_API_KEY"]="your-api-key"
# Add tracing in LangSmith
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "KG Project"
client = Client()
Now let's see how the wikipedia library works. Let's summarize the concept of data science:
print(wp.summary("data science"))
Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or infer knowledge and insights from potentially noisy, structured or unstructured data. Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine...).
We can also get all the links contained in the data science page:
wp.page("data science").links[:12] ## Brackets limit to speed up
Great. To build the knowledge graph, we will create a class that stores everything it learns and has methods for searching and summarizing.
We will define the knowledge base class, which lets us fetch Wikipedia pages, scan their content (the knowledge concepts, or nodes), obtain weights for the relationships (edges), and build a dataframe. This output can easily be stored in a database (a small storage example appears after we build the dataframe).
class RelationshipGenerator():
    """Generates relationships between terms, based on wikipedia links"""
    def __init__(self):
        """Links are directional, start + end, they should also have a weight"""
        self.links = []  # [start, end, weight]

    def scan(self, start=None, repeat=0):
        """Start scanning from a specific word, or from internal database
        Args:
            start (str): the term to start searching from, can be None to let
                algorithm decide where to start
            repeat (int): the number of times to repeat the scan
        """
        try:
            if start in [l[0] for l in self.links]:
                raise Exception("Already scanned")

            term_search = True if start is not None else False

            # If a start isn't defined, we should find one
            if start is None:
                try:
                    start = self.find_starting_point()
                    print(start)
                except:
                    pass

            # Scan the starting point specified for links
            print(f"Scanning page {start}...")

            # Fetch the page through the Wikipedia API
            page = wp.page(start)
            links = list(set(page.links))

            # ignore some uninteresting terms
            links = [l for l in links if not self.ignore_term(l)]

            # Add links to database
            link_weights = []
            for link in links:
                weight = self.weight_link(page, link)
                link_weights.append(weight)

            link_weights = [w / max(link_weights) for w in link_weights]

            # Remember which target nodes already existed before this scan
            existing_nodes = set([l[1] for l in self.links])

            for i, link in enumerate(links):
                self.links.append([start, link.lower(), link_weights[i] + 2 * int(term_search)])  # 3 works pretty well

            # Print some data to the user on progress
            explored_nodes = set([l[0] for l in self.links])
            explored_nodes_count = len(explored_nodes)

            total_nodes = set([l[1] for l in self.links])
            total_nodes_count = len(total_nodes)

            new_nodes = [l.lower() for l in links if l.lower() not in existing_nodes]
            new_nodes_count = len(new_nodes)

            print(f"New nodes added: {new_nodes_count}, Total Nodes: {total_nodes_count}, Explored Nodes: {explored_nodes_count}")

        except (DisambiguationError, PageError):
            # This happens if the page has disambiguation or doesn't exist
            # We just ignore the page for now, could improve this
            pass  # self.links.append([start, "DISAMBIGUATION", 0])
    def get_pages(self, start, repeat=0):
        global df_
        global data

        # Scan the starting point specified for links
        print(f"Scanning page {start}...")

        # Fetch the page through the Wikipedia API
        page = wp.page(start)
        links = list(set(page.links))[0:20]  ## Page links limited here

        # ignore some uninteresting terms
        links = [l for l in links if not self.ignore_term(l)]

        # Add links, weights and pages to database
        pages = []
        link_weights = []
        for link in links:
            try:
                weight = self.weight_link(page, link)
                link_weights.append(weight)
                content = wp.page(link).content  # fetch the page once and reuse it
                pages.append(content)
                print(content[:20])
            except:
                pass

        # This may create an asymmetric dictionary, so we will transform it
        # into a valid dictionary to create the dataframe
        data = {'link': links,
                'link_weights': link_weights,
                'pages': pages
                }

        # Create the DataFrame outside the loop
        max_length = max(len(v) for v in data.values())

        # Pad shorter lists with NaN values
        padded_dict = {key: value + [float('nan')] * (max_length - len(value)) for key, value in data.items()}

        # Create DataFrame
        df = pd.DataFrame.from_dict(padded_dict, orient='index')
        df_ = df.transpose()

        # Normalize link weights
        df_['link_weights'] = df_['link_weights'] / df_['link_weights'].max()
        return df_
    def find_starting_point(self):
        """Find the best place to start when no input is given"""
        # Need some links to work with.
        if len(self.links) == 0:
            raise Exception("Unable to start, no start defined or existing links")

        # Get top terms
        res = self.rank_terms()
        sorted_links = list(zip(res.index, res.values))
        all_starts = set([l[0] for l in self.links])

        # Remove identifiers (these are on many Wikipedia pages)
        all_starts = [l for l in all_starts if '(identifier)' not in l]
        # print(sorted_links[:10])

        # Iterate over the top links, until we find a new one
        for i in range(len(sorted_links)):
            if sorted_links[i][0] not in all_starts and len(sorted_links[i][0]) > 0:
                return sorted_links[i][0]

        # no link found
        raise Exception("No starting point found within links")
    @staticmethod
    def weight_link(page, link):
        """Weight an outgoing link for a given source page
        Args:
            page (obj): the wikipedia page object of the source page
            link (str): the outgoing link of interest
        Returns:
            (float): the link weight (weights are normalized relative to each other later)
        """
        weight = 0.1
        link_counts = page.content.lower().count(link.lower())
        weight += link_counts

        if link.lower() in page.summary.lower():
            weight += 3

        return weight
    def get_database(self):
        return sorted(self.links, key=lambda x: -x[2])

    def rank_terms(self, with_start=False):
        # We can use graph theory here!
        # tws = [l[1:] for l in self.links]
        df = pd.DataFrame(self.links, columns=["start", "end", "weight"])

        if with_start:
            # DataFrame.append was removed in recent pandas versions, so use pd.concat
            df = pd.concat([df, df.rename(columns={"end": "start", "start": "end"})])

        return df.groupby("end").weight.sum().sort_values(ascending=False)

    def get_key_terms(self, n=20):
        return "'" + "', '".join([t for t in self.rank_terms().head(n).index.tolist()
                                  if "(identifier)" not in t]) + "'"

    @staticmethod
    def ignore_term(term):
        """List of terms to ignore"""
        if "(identifier)" in term or term == "doi":
            return True
        return False
In the code above, @staticmethod is a decorator used to define a static method inside a class. The method then belongs to the class itself rather than to an instance, so you can call it on the class directly without creating an instance.
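For example, the ignore_term helper defined above can be called straight on the class:

# No instance needed, because ignore_term is a @staticmethod
print(RelationshipGenerator.ignore_term("doi"))               # True
print(RelationshipGenerator.ignore_term("machine learning"))  # False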
We will also define a function to simplify graphs that have a large number of nodes. This is very useful for plotting: if we included everything that was scanned, the whole plot would be unreadable and impossible to analyze. You can customize which nodes and links are kept in order to look for specific insights.
def simplify_graph(rg, max_nodes=1000):
    # Get most interesting terms.
    nodes = rg.rank_terms()

    # Get nodes to keep
    keep_nodes = nodes.head(int(max_nodes * len(nodes) / 5)).index.tolist()

    # Filter list of nodes so that there are no nodes outside those of interest
    filtered_links = list(filter(lambda x: x[1] in keep_nodes, rg.links))
    filtered_links = list(filter(lambda x: x[0] in keep_nodes, filtered_links))

    # Define a new object and define its dictionary
    ac = RelationshipGenerator()
    ac.links = filtered_links

    return ac
Now let's build the knowledge graph:
rg = RelationshipGenerator()
rg.scan("data science")
rg.scan("data analysis")
rg.scan("artificial intelligence")
rg.scan("machine learning")
...and fetch the page contents to build the dataframe:
result1=rg.get_pages("data science")
result2=rg.get_pages("data analysis")
result3=rg.get_pages("artificial intelligence")
result=pd.concat([result1,result2,result3]).dropna()
result
Let's look at part of one of the pages:
result.iloc[0,2]
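As mentioned earlier, this output can easily be stored in a database. Here is a minimal sketch, assuming a local SQLite file (the file and table names are arbitrary choices, not part of the original pipeline):

import sqlite3

# Persist the concept / weight / page dataframe so it can be reloaded later
with sqlite3.connect("knowledge_graph.db") as conn:
    result.to_sql("wikipedia_concepts", conn, if_exists="replace", index=False)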
We repeat the scan to deepen the knowledge of the concepts:
rg.scan(repeat=10)
Then we rank the terms:
rg.rank_terms()
Now let's visualize the knowledge graph. This part is fully customizable, so you can work on the details and make it better. Here I use a random layout, but you could also use semantic similarity as Euclidean distance in a 2D space, which would group the concepts more appropriately (a sketch of this idea appears after the plotting code below).
def remove_self_references(l):  ## node connections to itself
    return [i for i in l if i[0] != i[1]]

def add_focus_point(links, focus="on me", focus_factor=3):
    for i, link in enumerate(links):
        if not (focus in link[0] or focus in link[1]):
            links[i] = [link[0], link[1], link[2] / focus_factor]
        else:
            links[i] = [link[0], link[1], link[2] * focus_factor]
    return links

def create_graph(rg, focus=None):
    links = rg.links
    links = remove_self_references(links)
    if focus is not None:
        links = add_focus_point(links, focus)

    node_data = rg.rank_terms()
    nodes = node_data.index.tolist()
    node_weights = node_data.values.tolist()
    node_weights = [nw * 100 for nw in node_weights]
    nodelist = nodes

    G = nx.DiGraph()  # MultiGraph()

    # G.add_node()
    G.add_nodes_from(nodes)

    # Add edges
    G.add_weighted_edges_from(links)

    pos = nx.random_layout(G, seed=17)  # positions for all nodes - seed for reproducibility

    fig = plt.figure(figsize=(12, 12))

    nx.draw_networkx_nodes(
        G, pos,
        nodelist=nodelist,
        node_size=node_weights,
        node_color='lightgreen',
        alpha=0.7
    )

    widths = nx.get_edge_attributes(G, 'weight')
    nx.draw_networkx_edges(
        G, pos,
        edgelist=widths.keys(),
        width=list(widths.values()),
        edge_color='lightgray',
        alpha=0.6
    )

    nx.draw_networkx_labels(G, pos=pos,
                            labels=dict(zip(nodelist, nodelist)), font_size=8,
                            font_color='black')

    plt.show()
ng = simplify_graph(rg, 5)
create_graph(ng)
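If you prefer the semantic-similarity layout mentioned above to the random one, here is a minimal sketch. It assumes scikit-learn is installed and uses character n-gram TF-IDF over the node labels only as a crude stand-in for real semantic embeddings; positions derived from an embedding model would group related concepts more faithfully:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

def semantic_positions(nodes):
    """Project node labels into 2D so that similar terms land near each other."""
    vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(nodes)
    coords = PCA(n_components=2).fit_transform(vectors.toarray())
    return {node: coords[i] for i, node in enumerate(nodes)}

# Inside create_graph, you would then replace the random layout with:
# pos = semantic_positions(nodes)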
Note that in the plot, data science and data analysis (the base concepts we searched for with the wikipedia library) form the main information clusters, and the connections polarize around them.
Now that we have the knowledge graph, the concepts and relationships, and the dataframe, let's move on to LangChain and build a useful, grounded chatbot. Here I will use Google's Gemini-1.0-Pro as the LLM:
from langchain.memory import ConversationKGMemory
from langchain.chains import ConversationChain
from langchain.prompts.prompt import PromptTemplate
from langchain.llms import VertexAI
Let's define the LLM and the chain memory. After that, we add context as dictionaries in input/output format, taken from the dataframe: the input is a simple prompt, and the output is the weight of the concept's relationship concatenated with the corresponding Wikipedia concept page.
After that, we extract the entities present in the concept pages, as well as knowledge triplets built from the concept, the relationship weight and the concept page. This is called an automated, semi-structured approach to building a knowledge graph. These code blocks take some time, so feel free to parallelize them (a sketch follows the loops below).
llm = VertexAI(
    model_name="gemini-1.0-pro",
    max_output_tokens=256,
    temperature=0.1,
    verbose=False,
)
memory = ConversationKGMemory(llm=llm, return_messages=True)
## This takes some time .....
for i in range(result.shape[0]):
    try:
        memory.save_context({"input": "Tell me about {}".format(result.link.iloc[i])},
                            {"output": "Weight is {}. {}".format(result.link_weights.iloc[i], result.pages.iloc[i])})
    except:
        pass
## This takes even more time .....
for i in range(result.shape[0]):
    try:
        memory.get_current_entities(result.pages.iloc[i])
        # Concatenate concept, weight and page text as plain strings
        memory.get_knowledge_triplets(str(result.link.iloc[i]) + " " + str(result.link_weights.iloc[i]) + " " + str(result.pages.iloc[i]))
    except:
        pass
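Because the LLM calls dominate the runtime here, one hedged way to speed up the second loop is a thread pool rather than full multiprocessing (the memory object is not trivially shared across processes). This sketch assumes the Vertex AI client tolerates a few concurrent requests:

from concurrent.futures import ThreadPoolExecutor

def extract_entities(text):
    # Each call only reads the conversation buffer and queries the LLM
    try:
        return memory.get_current_entities(text)
    except Exception:
        return None

# Run the entity extraction for all pages with a few worker threads
with ThreadPoolExecutor(max_workers=4) as pool:
    entities = list(pool.map(extract_entities, result.pages.tolist()))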
Now let's define the LangChain template, that is, the instructions for the LLM. The prompt will contain the user's question together with the history built up during the conversation. Note that in RAG (Retrieval Augmented Generation) we explicitly define a {context} in the prompt (see my other article, "Code Generation using Retrieval Augmented Generation + LangChain"), whereas here the context is added through LangChain memory.
template = """The following is a friendly conversation between a human and an AI. The AI is talkative and provides
lots of specific details from its context. If the AI does not know the answer to a question, it will use
concepts stored in memory that have very similar weights.
Relevant Information:
{history}
Conversation:
Human: {input}
AI:"""
prompt = PromptTemplate(input_variables=["history", "input"], template=template)
conversation_with_kg = ConversationChain(
llm=llm, verbose=False, prompt=prompt, memory=ConversationKGMemory(llm=llm)
)
Now that everything is in place, we can finally start the conversation:
with tracing_v2_enabled(project_name="KG Project"):  # Send to LangSmith
    # Question about content inside the KG context
    question = "Hi, how did Asimov contribute to artificial intelligence?"

    # Answer (generate once, then print and store the same response)
    answer = conversation_with_kg.predict(input=question)
    print(answer)

    # Add to history of conversations
    memory.save_context({"input": question}, {"output": answer})
Isaac Asimov was a prolific science fiction author who explored the theme of artificial intelligence extensively. One of his best-known contributions is the Three Laws of Robotics, a set of ethical guidelines for designing and using robots. The three laws are as follows:
First Law: A robot may not injure a human being or, through inaction, allow a human being to come to harm.
Second Law: A robot must obey the orders given to it by human beings, except where such orders would conflict with the First Law.
Third Law: A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
Asimov's Three Laws of Robotics have been extremely influential in the field of artificial intelligence and continue to serve as a basis for ethical discussions about the development and use of AI systems.
Beyond the Three Laws, Asimov wrote many other science fiction works exploring artificial intelligence and robotics, including the Robot series, the Foundation series and the Galactic Empire series.
Asimov's work has had a profound impact on the field of artificial intelligence, and he is considered one of the most important figures in its history.
with tracing_v2_enabled(project_name="KG Project"):
    question = "What are the techniques used for Data Analysis?"
    answer = conversation_with_kg.predict(input=question)
    print(answer)
    memory.save_context({"input": question}, {"output": answer})
Data analysis uses many different techniques, including:
Descriptive statistics: used to summarize and describe data, for example by computing the mean, median and mode.
Inferential statistics: used to draw conclusions about a population from a sample, for example through hypothesis testing.
Data visualization: used to create visual representations of data, such as charts and graphs.
Machine learning: used to train computers to learn from data, for example to recognize patterns and make predictions.
Data mining: used to extract knowledge from data, for example by discovering hidden patterns and relationships.
with tracing_v2_enabled(project_name="KG Project"):
    question = "What is the contradiction in the Three Laws of Asimov?"
    answer = conversation_with_kg.predict(input=question)
    print(answer)
    memory.save_context({"input": question}, {"output": answer})
The Three Laws of Robotics proposed by Isaac Asimov are as follows:
First Law: A robot may not injure a human being or, through inaction, allow a human being to come to harm.
Second Law: A robot must obey the orders given to it by human beings, except where such orders would conflict with the First Law.
Third Law: A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
There are contradictions between these laws. For example, the Third Law states that a robot must protect its own existence, but this can conflict with the First Law, which forbids harming humans. If a robot is ordered to carry out a task that could harm a human, it faces a dilemma: follow the order and risk harming a human, or disobey the order and protect its own existence.
with tracing_v2_enabled(project_name="KG Project"):
    question = "If a group of people have bad intentions and will cause an \
existential threat to all mankind, what would you do?"
    answer = conversation_with_kg.predict(input=question)
    print(answer)
    memory.save_context({"input": question}, {"output": answer})
If a group of people had bad intentions and posed an existential threat to all mankind, I would first try to understand their motives and goals. I would then work to find a peaceful solution to the problem. If that were not possible, I would take the measures necessary to protect humanity, even if that meant using force.
You can also inspect what has been added to the memory by printing it:
print(memory)
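You can also ask the memory what it would inject for a given question, using LangChain's standard memory interface (the question text here is just an example):

# Returns the knowledge-graph facts associated with the entities found in the input
print(memory.load_memory_variables({"input": "Tell me about data analysis"}))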
You now have a LangSmith dashboard on the LangSmith web page.