Legal Document RAG: Multi-Graph Multi-Agent Recursive Retrieval through Legal Clauses

Published on September 10, 2024 by alex

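In this article and the accompanying open-source repository, we demonstrate intelligent navigation of legal document clauses with a multi-agent system. The system uses a lexical graph (the document hierarchy) and chunk linking within a multi-graph, multi-agent workflow to navigate regulatory documents. The stack used here includes Reducto.AI, WhyHow.AI, LangGraph, and LlamaIndex.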




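A very specific problem we run into with documents, legal documents in particular, is the need to build a document hierarchy over the clauses in a document. This is because clauses sometimes refer to other clauses to establish their full meaning and context.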


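To get the full context, you have to recursively navigate to and retrieve every clause that is mentioned (even footnotes!): traverse the document hierarchy graph to find the mentioned clause, check whether it mentions any further clauses, and repeat. Recursive retrieval also applies beyond legal documents to a range of other document elements, including page numbers, multimodal data (such as images), and hyperlinks to other documents or external data.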
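
The recursion described above can be sketched in a few lines of Python. This is a minimal illustration and not the code from the repository: the `CLAUSES` dictionary, the `REFERENCE` pattern, and the `collect_context` function are hypothetical names, and the clause texts are short paraphrases of the walkthrough later in this article rather than quotations from the regulatory document.

```python
import re
from collections import deque

# Hypothetical toy clause store (paraphrased texts, for illustration only).
CLAUSES = {
    "6.3": "The board must be satisfied that combined duties do not weaken controls (footnote 3).",
    "footnote 3": "See paragraphs 7.3 and 7.4.",
    "7.3": "The CCO must keep independence and focus when holding other duties.",
    "7.4": "Compliance duties must not be shared with internal audit, see paragraph 9.1.",
    "9.1": "Internal audit independently reviews the compliance function.",
}

# Assumed reference syntax: "footnote N" or a paragraph number such as "7.3".
REFERENCE = re.compile(r"footnote \d+|\d+\.\d+")


def collect_context(start_ids, clauses, max_depth=5):
    """Breadth-first walk that follows every reference found in a clause."""
    seen = []
    queue = deque((clause_id, 0) for clause_id in start_ids)
    while queue:
        clause_id, depth = queue.popleft()
        # Skip clauses already collected (guards against cycles) or unknown ids.
        if clause_id in seen or clause_id not in clauses:
            continue
        seen.append(clause_id)
        if depth < max_depth:
            for ref in REFERENCE.findall(clauses[clause_id]):
                queue.append((ref, depth + 1))
    return seen


print(collect_context(["6.3"], CLAUSES))
```

The `seen` list and the `max_depth` bound are there so that circular references between clauses cannot cause an endless loop.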


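In this notebook and example, we show how to create a document hierarchy graph of the clauses in a legal document, a definitions graph for the same document, and a multi-graph, multi-agent recursive retrieval flow. The flow answers questions by intelligently traversing the document hierarchy graph and the definitions graph to gather the full context for the question asked.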


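In this example, we use a simple regulatory document issued by the Central Bank of Malaysia, specifically this document, which governs compliance rules for various organizations.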


As a quick baseline, when we ran this through ChatGPT using GPT-4o (with its built-in agentic reasoning), we got the following answer:


The management of control functions by the Board and the Chief Compliance 
Officer (CCO) involves several key responsibilities as outlined in the 
compliance document:
Responsibilities of the Board:
1. Oversight of Compliance Risk Management: The Board is responsible for 
overseeing the management of compliance risk within the financial institution. This includes approving the compliance policy, establishing the compliance function, and appointing the CCO.
2. Approval and Support for the Compliance Function: The Board must 
ensure that the compliance function and the CCO have the appropriate standing, 
authority, and independence to carry out their duties effectively. The Board 
must also ensure that the compliance function is adequately resourced.
3. Engagement with the CCO: The Board should engage with the CCO regularly to 
discuss issues faced by the compliance function and ensure the CCO has direct 
access to the Board.
4. Annual Evaluation: The Board must evaluate the effectiveness of the 
institution’s management of compliance risk at least annually, considering 
assessments from senior management, internal audit, and the CCO.
Responsibilities of the Chief Compliance Officer (CCO):
1. Coordination and Management of Compliance Risk: The CCO is responsible for 
coordinating the identification, management, and monitoring of compliance risk
across the institution. This includes ensuring compliance monitoring and 
testing are consistent across the organization.
2. Reporting and Advising: The CCO must report regularly to senior management 
on the findings and analysis of compliance risks and must ensure that reports 
are readily available to internal audit and regulatory authorities. The CCO 
also advises the Board and senior management on legal and regulatory 
requirements, keeping them informed of developments and their implications.
3. Independence: The CCO must maintain independence from business lines to 
effectively carry out the role of a control function. This includes ensuring 
that there is no conflict of interest in their responsibilities, reporting 
lines, or remuneration.
4. Resources and Training: The CCO must ensure that the compliance function 
is sufficiently resourced with officers who have the necessary qualifications 
and experience. The CCO is also responsible for ensuring that adequate training
 is provided to officers on relevant legal and regulatory requirements.
Interaction with Other Control Functions:
The compliance function should be organized in a way that allows effective 
management of compliance risk, considering the size, nature, and complexity of
the financial institution’s operations. Where responsibilities are shared 
between a dedicated compliance unit and other control functions,
clear documentation and coordination are required to ensure that the CCO can 
perform their duties effectively without impairing independence or focus.
By fulfilling these responsibilities, the Board and the CCO can ensure that 
the institution's compliance risks are managed effectively, supporting the 
overall safety and soundness of the financial institution.


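This answer is interesting but incomplete, because it:

  • Appears to latch onto the keyword "compliance" and surfaces every compliance obligation tied to other clauses, rather than focusing on the Board and the CCO specifically. For example, it raises cross-jurisdictional coordination (paragraph 8.2) and remuneration independence (paragraph 7.8) in the context of "compliance".
  • Most importantly, misses the explicit references to paragraphs 7.3 and 7.4, which set out the main obligations between the Board and the CCO, namely that Board approval is required when the CCO shares control function responsibilities. It also misses paragraph 9.1 on the separation of the audit and compliance functions.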


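This is understandable, because recursive retrieval of clauses, pages, and footers is not an explicit part of the semantic-similarity retrieval that typical RAG relies on.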


Summary of the Multi-Graph Multi-Agent Workflow




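Creating the Graphs

In this notebook, we first extract the document structure as parsed by Reducto's document ingestion engine. The document structure breaks each page down into distinct elements such as section headers, list items, or footers.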




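The elements are then combined based on the order in which they appear and their implied hierarchy (for example, a "Section Header" is the parent of a "List Item"). We then analyze the links in the document to identify connections between the extracted elements that can be modeled in the lexical graph.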
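
One way to turn reading order plus an implied hierarchy into parent-child triples is a simple stack, sketched below. The element type names, the `LEVEL` ranking, the `"contains"` relation label, and the `build_triples` function are assumptions made for illustration; they are not Reducto's output schema or the schema used in the repository.

```python
# Assumed ranking of element types: a lower number sits higher in the hierarchy.
LEVEL = {"Title": 0, "Section Header": 1, "List Item": 2, "Footer": 2}


def build_triples(elements):
    """Build (parent, relation, child) triples from elements in reading order.

    `elements` is a list of (element_id, element_type) tuples.
    """
    triples = []
    stack = []  # currently open ancestors as (element_id, level)
    for element_id, element_type in elements:
        level = LEVEL[element_type]
        # Close every open element that is at the same level or deeper.
        while stack and stack[-1][1] >= level:
            stack.pop()
        # Whatever remains on top of the stack is the nearest ancestor.
        if stack:
            triples.append((stack[-1][0], "contains", element_id))
        stack.append((element_id, level))
    return triples


elements = [
    ("section-6", "Section Header"),
    ("6.3", "List Item"),
    ("6.4", "List Item"),
    ("section-7", "Section Header"),
    ("7.2", "List Item"),
]
for triple in build_triples(elements):
    print(triple)
```

Triples in this shape are the kind of input a graph store can ingest, with links such as footnote references added as additional relations.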




We then import these chunks and triples into WhyHow's Knowledge Graph Studio and use our SDK to create the lexical graph there.


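We also create a legal definitions graph for the document. Legal documents each have their own definitions section that specifies how certain terms are to be interpreted. These can vary by document, use case, and client. In this case, pages 4-5 of the document contain the definitions. The text is extracted and passed to GPT-4o, which is prompted to extract the legal terms and their definitions verbatim and return them as structured output. The output is converted to a CSV file and uploaded as a separate graph using the SDK and a predefined schema. The Definition Agent calls this definitions graph when needed to augment the context with the specific relevant definitions. In this workflow, the Definition Agent is called after the initial clauses are retrieved.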
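
The CSV conversion step could look roughly like the sketch below, which uses only Python's standard `csv` module. The column names `term` and `definition`, the `definitions_to_csv` function, and the sample values are assumptions for illustration; the sample definitions are placeholders, not text quoted from the regulatory document, and the upload to WhyHow is not shown.

```python
import csv
import io


def definitions_to_csv(definitions):
    """Serialize a {term: definition} mapping to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["term", "definition"])
    writer.writeheader()
    for term, definition in definitions.items():
        # Strip stray whitespace left over from the extraction step.
        writer.writerow({"term": term.strip(), "definition": definition.strip()})
    return buffer.getvalue()


# Placeholder values standing in for the structured output returned by the LLM.
sample = {
    "CCO": "placeholder definition text",
    "senior management": "placeholder definition, with a comma",
}
print(definitions_to_csv(sample))
```

Using `csv.DictWriter` rather than joining strings by hand means definitions that contain commas or quotes are escaped correctly.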




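We then import the nodes from WhyHow into the notebook and index the node information locally with LlamaIndex, keeping the embeddings generated by WhyHow. We combine LlamaIndex's vector, BM25, and keyword retrievers. In legal use cases, querying and retrieval depend on precise terminology, and adding the BM25 and keyword retrievers helps with this. BM25 helps identify key terms in highly repetitive text, while the keyword retriever ensures that important terms are retrieved when needed even if they appear infrequently.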
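
To make the lexical side of this concrete, here is a from-scratch sketch of BM25 scoring combined with an exact keyword match. It does not use LlamaIndex and leaves out the vector retriever entirely; `bm25_scores`, `keyword_hits`, `hybrid_retrieve`, the sample documents, and the way the two result sets are merged are all assumptions for illustration, since the article does not describe how the retrievers' results are combined.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25 (docs must be non-empty)."""
    tokenized = [tokenize(doc) for doc in docs]
    n = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in counts:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = counts[term]
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def keyword_hits(keyword, docs):
    """Indices of documents containing the exact phrase, case-insensitively."""
    return [i for i, doc in enumerate(docs) if keyword.lower() in doc.lower()]


def hybrid_retrieve(query, keyword, docs, top_k=2):
    """Top BM25 matches first, then any exact keyword hits not already picked."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    picked = [i for i in ranked[:top_k] if scores[i] > 0]
    for i in keyword_hits(keyword, docs):
        if i not in picked:
            picked.append(i)
    return picked


DOCS = [
    "the board must approve the appointment of the cco",
    "the cco must report to senior management",
    "internal audit reviews the compliance function",
]
print(hybrid_retrieve("board approve", "internal audit", DOCS))
```

The keyword pass guarantees that a document containing a required phrase is returned even when its BM25 score for the query is zero.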


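LangGraph is used to build a multi-agent workflow around the lexical graph, using the WhyHow SDK and GPT-4o. In essence, when a query is passed in, the system first searches for relevant vector chunks through an Initial Search Agent. In this case, the vector chunks are clauses or sub-clauses. The Definition Agent is then called to augment these clauses with relevant definitions. Next, the Router Agent detects whether there are other linked sections or footnotes that need to be consulted and, if so, the appropriate sections are retrieved and taken into account. If the subsequently retrieved clauses refer to yet more clauses, as is the case here, the Recursive Retrieval Agent retrieves them recursively.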


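The first clauses it retrieves are paragraphs 6.3 and 7.2. The definitions graph is consulted to check whether any additional context from the definitions section should be included. This adds the definitions of "CCO" and "senior management".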


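Sub-paragraph 6.3(f) of paragraph 6.3 reads:

  • "Where the CCO also carries out responsibilities relating to other control functions[3], it should be ensured that a sound overall control environment is not compromised by the CCO performing multiple responsibilities at the same time."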


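Based on the information in the first clauses retrieved, the Router Agent detects whether the material mentions other clauses or footers. In this example, a footnote (footnote 3) is attached to the first clause. The Router Agent then triggers the Footer Parsing Agent, which identifies the relevant footnote and returns it:

  • "See paragraphs 7.3 and 7.4."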
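
In the workflow described here an LLM reads the footer and writes the follow-up searches. As a rough deterministic stand-in, the sketch below pulls paragraph numbers out of a footer with a regular expression and builds search queries in the "Return paragraph x to answer the query ..." form used as an example in the router prompt in the appendix. The `PARAGRAPH_NUMBER` pattern and the `footer_to_queries` function are hypothetical and assume paragraph numbers look like `7.3` or `7.4(b)`.

```python
import re

# Assumed numbering style: "7.3", optionally followed by a lettered sub-paragraph "(b)".
PARAGRAPH_NUMBER = re.compile(r"\d+\.\d+(?:\([a-z]\))?")


def footer_to_queries(footer_text, user_query):
    """Turn a footer such as 'See paragraphs 7.3 and 7.4.' into search queries."""
    numbers = PARAGRAPH_NUMBER.findall(footer_text)
    return [
        f"Return paragraph {number} to answer the query '{user_query}'"
        for number in numbers
    ]


queries = footer_to_queries(
    "See paragraphs 7.3 and 7.4.",
    "How can the Board and the CCO manage control functions?",
)
for query in queries:
    print(query)
```

A regex like this only covers references written in the expected numbering style, which is one reason an LLM-based router is more flexible for free-form footers.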


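This calls for another traversal, in which the Recursive Retrieval Agent walks the lexical graph and retrieves the chunks for paragraphs 7.3 and 7.4.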


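The new information from paragraphs 7.3 and 7.4 is merged in. Paragraph 7.4(b) refers to paragraph 9.1:

  • "The responsibilities of the compliance function cannot be shared with internal audit, nor can the CCO assume responsibility for internal audit, as this would undermine the independent review process described in paragraph 9.1."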


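Paragraph 7.4 contains a link to paragraph 9.1. When the Router Agent detects this link, it instructs the Recursive Retrieval Agent to retrieve it on the next pass. The Answering Agent keeps track of all the information received and summarizes it into the final answer returned to the user.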


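The final answer reflects our ideal flow: it draws on the definitions pages, paragraphs 6.3 and 7.2, footnote 3, and paragraphs 7.3, 7.4, and 9.1, gathering all the relevant information through intelligent traversal to give the user an accurate summary.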


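To make sure this was not down to luck, we ran the final query 3 times, and the results show that the relevant information was retrieved successfully.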


To manage control functions effectively, the Board and the Chief Compliance 
Officer (CCO) have distinct responsibilities that they must exercise:
Responsibilities of the Board:
- Approve critical decisions regarding the CCO, including appointment, 
remuneration, and termination (Section 6.3(a)).
- Ensure the CCO has sufficient stature to engage effectively with senior 
management (Section 6.3(b)).
- Regularly engage with the CCO to discuss compliance issues and consider 
interactions without senior management present (Section 6.3(c)).
- Provide the CCO with unimpeded access to communicate with the board directly
(Section 6.3(d)).
- Support the CCO with adequate resources to perform duties effectively, 
including competent staff (Section 6.3(e)).
- Satisfy themselves that combined responsibilities, if any, do not compromise
the control environment (Section 6.3(f)).
Responsibilities of the CCO:
- Coordinate the identification and management of institution-wide compliance 
risks (Section 7.2(b)).
- Ensure consistent conduct of compliance monitoring and testing across the 
organization (Section 7.2(b)).
- Maintain independence and sufficient focus on compliance duties, even when 
tasked with additional control functions (Section 7.3).
Shared Responsibilities & Coordination:
- The board must approve any sharing of compliance function responsibilities 
between the compliance unit and other control functions (Section 7.4(a)).
- Function responsibilities, including timely communication of issues, should 
be well-defined and documented (Section 7.2(a)).
- Effective arrangements for coordination among control functions should be in
place to facilitate the CCO’s responsibilities (Section 7.2(d)).
- Compliance responsibilities must not compromise the separation of the 
internal audit function (Section 9.1).
The board should ensure comprehensive oversight, and the CCO should maintain
effective coordination and communication across the organization to manage
control functions efficiently.


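In summary, through this exercise we built a system that demonstrates:

  • A multi-graph system, where each graph represents a different process and goal within the RAG system.
  • Automated lexical graph creation for RAG using Reducto, WhyHow, and LlamaIndex.
  • A multi-agent system that traverses the document the way the document intends a human to read and navigate it, returning answers from each section and sub-section in a structured way.
  • A multi-graph, multi-agent system orchestrated with LangGraph.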


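WhyHow.AI's Knowledge Graph Studio platform (currently in beta) is the easiest way to build modular, agentic knowledge graphs, combining workflows for LLMs, developers, and non-technical domain experts.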


Appendix

Agent Code Snippets


Definition Agent

  • Retrieves definitions for the terms mentioned in the query.


from typing import Dict, Optional

from whyhow import WhyHow

def definitions_search(query_prompt: str, client: Optional[WhyHow] = None) -> Dict[str, str]:
    """
    Search for definitions of terms in a question prompt and return them as a dictionary.
    """
    if client is None:
        client = WhyHow(api_key=WHYHOW_API_KEY, base_url=WHYHOW_API_URL)
    definitions_response = client.graphs.query_unstructured(
        graph_id=definitions_graph.graph_id,
        query=query_prompt,
    )
    
    response_text = definitions_response.answer
    term_def_pairs = response_text.split('\n')
    definitions_dict = {}
    
    for pair in term_def_pairs:
        if ':' in pair:
            term, definition = pair.split(':', 1)
            definitions_dict[term.strip()] = definition.strip()
    
    return definitions_dict
query_prompt = """Return me definitions for the terms in this query: "How can the Board and the CCO manage control functions?" Ensure the term-definition pairs are separated by newlines, properly capitalised"""
definitions_dict = definitions_search(query_prompt)

def print_prompt_definitions_dict(definitions_dict):
    prompt = "Relevant Definitions:\n"
    for term, definition in definitions_dict.items():
        prompt += f"{term}: {definition}\n"
    return prompt
print(print_prompt_definitions_dict(definitions_dict))


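Router Agent

  • Decides whether the process should stop or continue. It also identifies relevant nodes that contain footer information or link to other nodes, for the Recursive Retrieval Agent to fetch.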


def router_agent(state: AgentState) -> AgentState:
    # decide if the process should stop or continue
    starter_prompt_footer = f"""
        You are an intelligent agent overseeing a multi-agent retrieval process of graph nodes from a document. These nodes are to answer the query: 
        ```{state['query']}```
        
        Below this request is a list of nodes that were automatically retrieved. 
        
        You must determine if the list of nodes is enough to answer the query. If there isn't enough information, you must identify any relevant footer information in the nodes.
        
        A node can contain footer information asking to look in another section/part of the document, which will require a separate natural language search. 
        Example: If the footer says "see paragraph x", a search query e.g. "Return paragraph x to answer the query '{state['query']}'" should be made. 
    
        If there are no further nodes worth analyzing, return an empty response. ONLY RETURN QUERIES FOR FOOTERS THAT ARE RELEVANT TO ANSWERING THE QUERY
        
        Else, if any relevant nodes require a footer search, specify the node_id and the search query.
        Nodes are identified by node_id and must be quoted in backticks.     
    """
    
    starter_prompt_link = f"""
        You are an intelligent agent overseeing a multi-agent retrieval process of graph nodes from a document. These nodes are to answer the query: 
        ```{state['query']}```
        
        Below this request is a list of nodes that were automatically retrieved. 
        
        You must determine if the list of nodes is enough to answer the query. If there isn't enough information, you must identify any linked nodes that could be worth exploring.
        
        If there are no further nodes worth analyzing, return an empty response.
        
        Return a list of node_ids. ONLY RETURN NODE_IDS for NODES THAT ARE RELEVANT TO ANSWERING THE QUERY. Nodes are identified by node_id and must be quoted in backticks.
    """
    
    # collect latest nodes, and all nodes
    last_fetched_nodes_flattened: Dict[str, MultiAgentSearchLocalNode] = {}
    all_nodes_flattened: Dict[str, MultiAgentSearchLocalNode] = {}


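Supervisor Agent

  • Monitors the context window and drops nodes when too much context is in use.
  • Also tracks search failures, where a footer search or linked-node lookup finds no nodes or no relevant information. If searches fail repeatedly too many times, it ends the retrieval process early.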


def supervisor_agent(state: AgentState) -> AgentState:
    
    # Look for search failures. This might be an instance where multiple searches were made for certain parts of the document, but no relevant information was found.
    # This means that the search has to be ended prematurely to prevent infinite loops.
    printout = ""
    for node in state["previous_nodes"]:
        printout += node.print_node_prompt()
    for node in state["last_fetched_context_nodes"]:
        printout += node.print_node_prompt()
        
    prompt = f"""
You are a supervisor agent overseeing the multi-agent retrieval process of graph nodes from a document. The nodes are to answer the query:
```{state['query']}```

Below is a list of nodes that were automatically retrieved, followed by a list of errors. If there are many similar, repeated errors in the retrieval process , where no further linked or relevant nodes could be retrieved, return END to end the process. Else return CONTINUE. 
Return only a single word, either END or CONTINUE.
"""
    
    completion = openai_client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": printout},
            # search_failures is a list of strings, so join it into one message body
            {"role": "user", "content": "\n".join(state['search_failures'])},
        ],
    )
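
The snippet above asks the LLM to decide between END and CONTINUE. A deterministic guard along the following lines could complement that decision so the loop is bounded even if the model keeps answering CONTINUE. This is a sketch under assumptions: `should_stop` and its thresholds are hypothetical and not part of the original notebook, although the `search_failures` and `pass_count` values it reads correspond to state keys that appear in the recursive retrieval snippet.

```python
from collections import Counter


def should_stop(search_failures, pass_count, max_repeats=2, max_passes=5):
    """Return True when retrieval should end early.

    Stops when the pass budget is used up, or when the same failure
    message has been logged `max_repeats` times or more.
    """
    if pass_count >= max_passes:
        return True
    repeats = Counter(search_failures)
    return any(count >= max_repeats for count in repeats.values())


failures = [
    "Failed to fetch node with id: abc",
    "Failed to fetch node with id: abc",
]
print(should_stop(failures, pass_count=1))
```

Counting identical messages is a crude notion of "similar, repeated errors"; the LLM-based check in the snippet above can also recognize failures that are worded differently.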


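Recursive Retrieval Agent

  • Retrieves information for the nodes flagged by the Router Agent.
  • Fetches newly linked nodes, or runs a keyword search over the document for footers (for example, when a footer says to refer to paragraphs 7.3 and 7.4 but there is no link, a search is run). In this step, an LLM call prunes redundant nodes.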


def recursive_retrieval(state: AgentState) -> AgentState:
    current_nodes = state["last_fetched_context_nodes"]
    
    for current_node in current_nodes:
        state["previous_nodes"].append(current_node)
    new_current_nodes = []
    # look up the nodes to fetch by id    
    
    for node_id in state["node_links_to_fetch"]:
        # sometimes GPT returns node ids with or without backticks
        node_id = node_id.strip("`")
        if node_id in local_nodes_map:
            new_current_nodes.append(local_nodes_map[node_id])
        else:
            state["search_failures"].append(f"Failed to fetch node with id: {node_id}")

    for node_id, search_query in state["node_footers_to_fetch"].items():
        # fetch nodes by keyword and bm25 search
        footer_retrieved_nodes = retrieve_with_keywords_bm25(search_query)
        # LLM prunes nodes that are not relevant
        footer_retrieved_nodes, _ = prune_nodes(search_query, footer_retrieved_nodes)
        for node in footer_retrieved_nodes:
            new_current_nodes.append(node)
        # if no nodes fetched, log failure
        if len(footer_retrieved_nodes) == 0:
            state["search_failures"].append(
                f"Failed to fetch nodes for query: {search_query}"
            )
    state["last_fetched_context_nodes"] = new_current_nodes
    state["pass_count"] += 1
    state["node_footers_to_fetch"] = {}
    state["node_links_to_fetch"] = []
    return state


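Answering Agent

  • Produces the final answer to the query from all the information gathered during retrieval and the definitions in the document.
  • Where possible, it also tries to cite sources, such as paragraph numbers.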


def answering_agent(state: AgentState) -> AgentState:
    # answer the query
    prompt = f"""
You are an answering agent. You will be given a list of document nodes that were automatically retrieved by the system. These nodes are to answer the query:
```{state['query']}```
Give references to sections/paragraphs if possible, but do not output full node ids with backticks and the hash. 
"""


Variations of the Answers Retrieved by the Multi-Graph Multi-Agent System


Answer #2:


To manage control functions effectively, both the Board and the Chief 
Compliance Officer (CCO) must collaborate and establish a structured 
framework. Key measures include:
Board’s Responsibilities:
- The Board must approve the appointment, remuneration, and termination of the
CCO, ensuring they have the necessary stature to engage effectively with
senior management (S 6.3(a), (b)).
- The Board should engage regularly with the CCO to discuss compliance issues
and ensure the CCO has direct access to the Board (S 6.3(c), (d)).
- The Board must also assure that the CCO is granted adequate resources and
support, including competent officers, to fulfill their duties (S 6.3(e)).
CCO’s Role and Independence:
- The CCO should oversee the institution-wide compliance risk management and 
ensure that responsibilities are clearly documented and communicated in the 
compliance policy (S 7.2(a), (b)).
- If the CCO is involved in other control functions, they must ensure their 
independence and ability to focus on compliance responsibilities remain 
uncompromised (S 7.3).
Coordination with Other Functions:
- Proper coordination between the CCO and other control functions is essential.
This includes robust information flows and mechanisms for seeking compliance
advice (S 7.2(d)).
- The Board must approve the sharing of compliance responsibilities to ensure
there is no dilution of focus or effectiveness (S 7.4(a)).
Separation from Internal Audit:
- Compliance functions should remain distinct from the internal audit to 
maintain the efficacy of independent reviews (S 7.4(b), S 9.1).
Training and Capacity Building:
- The compliance function is responsible for training and providing guidance
to other officers on legal and regulatory requirements to manage compliance 
risk effectively (S 8.11).
By implementing these steps, the Board and the CCO can effectively manage
control functions, maintaining robust compliance and governance frameworks.


Answer #3:


The management of control functions by the Board and the Chief Compliance 
Officer (CCO) involves several key responsibilities and structures:
Role of the Board in Managing CCO Functions:
- The board must approve the appointment, remuneration, and termination of the
CCO (Section 6.3(a)).
- The board should ensure that the CCO has sufficient stature to interact 
effectively with the CEO and senior management (Section 6.3(b)).
- Regular engagement between the board and the CCO is important to discuss 
compliance issues directly (Section 6.3(c)).
- The CCO must have direct, unimpeded access to the board (Section 6.3(d)).
- There must be adequate resources and support for the CCO to perform his
duties effectively (Section 6.3(e)).
Shared Responsibilities and Independence:
- Where compliance functions are shared, the board must approve this 
arrangement, and responsibilities should be clearly defined and documented
in the compliance policy (Section 7.2).
- The CCO should not assume responsibilities for internal audit, as this can
compromise independent review processes (Sections 7.4, 9.1).
- The CCO must ensure that their independence and ability to focus on
compliance are not impaired by additional responsibilities (Section 7.3).
Responsibilities Within the Organization:
- Compliance is the responsibility of all officers within the institution. 
Business lines manage compliance risk through their managerial controls, 
while the compliance function ensures that these controls are adequate 
(Section 1.2).
- The internal audit function provides independent assurance on the quality 
and effectiveness of the institution’s controls, including those concerning
compliance (Section 1.2(c)).
Coordination Across Control Functions:
- Arrangements for coordination among control functions and the CCO must 
promote a consistent approach to managing compliance risk, with adequate 
information flows and avenues for advice (Section 7.2(d)).
By following these guidelines, the Board and the CCO can manage the
compliance control functions effectively, ensuring that compliance risks
are appropriately identified, managed, and mitigated across the organization.



Source: https://medium.com/enterprise-rag/legal-document-rag-multi-graph-multi-agent-recursive-retrieval-through-legal-clauses-c90e073e0052