Legal Case Search Engine: Using Qdrant, Llama 3, and LangChain

Published on July 2, 2024 by alex

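Introduction

As GenAI sees wider use and adoption in real-world applications built on RAG, we often run into limitations such as:


  1. Data scarcity
  2. Prediction problems in the LLM
  3. Retrieving relevant results for a query, typically from a vector database


The third point is arguably the biggest limitation when building any RAG application, because it determines whether we can find relevant results for a query and so build the "context" for that query before sending it to the LLM.


In this article, I will try to reproduce this retrieval limitation and show you how to overcome it.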


Figure: A simple query search in a vector database using LangChain.


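The problem

In this article, I will build a legal case search engine using Qdrant, Llama 3 (via Groq's API), and LangChain.


The main focus is on "filtering relevant results", because legal case documents are rich in detail. We have a lot of metadata to work with, so the key question is: how do we use it?


Another problem that can arise here is that legal cases come with summaries, and many cases are very similar in nature. If you feed only these summaries into the vector database, you have no control over the results a query search returns.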




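To solve this, we will look at 3 filtering techniques:


  1. LangChain's self-querying
  2. Qdrant's payload filters
  3. Semantic filtering (which we will build ourselves)


Before that, let's look at our dataset: https://github.com/sachink1729/legal-cases-search-using-self-query-qdrant-llama3-langchain/blob/main/dataset.json


I generated this dataset with ChatGPT; if you want to see the prompt, it is here: https://chatgpt.com/share/849d5987-1f47-4766-9738-2bcc87430f4b


As you can see, the dataset is a list of dictionaries; each dictionary is one legal case containing the page content and its metadata. We will explore this dataset in more detail later.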


Let's start coding

Install the required libraries with the following commands:


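!pip install langchain-huggingface
!pip install qdrant-client
!pip install langchain-qdrant
!pip install langchain-community
# langchain-groq provides the ChatGroq class imported later in this article
!pip install langchain-groq
!pip install lark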


Let's take a look at our dataset.


import json

# Load the list of cases from the dataset file
with open("dataset.json") as f:
    dataset = json.load(f)['cases_list']
dataset[0]


Output:


{'page_content': "The Supreme Court upheld the Centre's 2016 demonetisation scheme in a 4:1 majority, ruling that demonetisation was proportionate to the Union's objectives and implemented reasonably.",
'metadata': {'year': 2023,
'court': 'Supreme Court of India',
'judges': ['S. Abdul Nazeer',
'B.R. Gavai',
'A.S. Bopanna',
'V. Ramasubramanian',
'B.V. Nagarathna'],
'legal_topics': ['Constitutional Law', 'Economic Policy'],
'relevant_laws': ['Article 370 of the Indian Constitution']}}


The metadata has 5 fields: year, court, judges, legal_topics, and relevant_laws. They are self-explanatory.


There are 30 unique cases in this dataset in total.
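A quick way to get a feel for metadata like this is to count how often each value occurs. The sketch below does that with the standard library on two records copied from this article's outputs; `sample_cases` and `summarize` are illustrative names, and the same function could be pointed at the full `dataset` list.

```python
from collections import Counter

# Two records copied from the outputs shown in this article (the real dataset has 30).
sample_cases = [
    {
        "page_content": "The Supreme Court upheld the Centre's 2016 demonetisation scheme in a 4:1 majority, ruling that demonetisation was proportionate to the Union's objectives and implemented reasonably.",
        "metadata": {
            "year": 2023,
            "court": "Supreme Court of India",
            "judges": ["S. Abdul Nazeer", "B.R. Gavai", "A.S. Bopanna",
                       "V. Ramasubramanian", "B.V. Nagarathna"],
            "legal_topics": ["Constitutional Law", "Economic Policy"],
            "relevant_laws": ["Article 370 of the Indian Constitution"],
        },
    },
    {
        "page_content": "The Supreme Court of the United Kingdom ruled that Uber drivers are workers and entitled to employment rights such as minimum wage and holiday pay.",
        "metadata": {
            "year": 2021,
            "court": "Supreme Court of the United Kingdom",
            "judges": ["Lord Reed", "Lord Hodge", "Lady Arden"],
            "legal_topics": ["Employment Law", "Gig Economy"],
            "relevant_laws": ["Employment Rights Act 1996"],
        },
    },
]

def summarize(cases):
    """Count cases per year and per legal topic."""
    years = Counter(case["metadata"]["year"] for case in cases)
    topics = Counter(
        topic for case in cases for topic in case["metadata"]["legal_topics"]
    )
    return years, topics

years, topics = summarize(sample_cases)
print(years)
print(topics.most_common(3))
```

Counts like these show which fields are selective enough to be useful as filters.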


Loading the Hugging Face embedding model

Load the embedding model through the langchain-huggingface integration:


from langchain_huggingface import HuggingFaceEmbeddings

model_name = "BAAI/bge-small-en-v1.5"
model_kwargs = {'device': 'cpu'}  # run the model on CPU
encode_kwargs = {'normalize_embeddings': True}  # return unit-length vectors
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)


Test the model with the following call:


hf.embed_query("demonetization case")


It returns a list of floats, which is the vector representation of the query:


[-0.08326546847820282,
0.022219663485884666,
-0.017243655398488045,
-0.053167618811130524,
0.05313488468527794,
-0.016739359125494957,

-0.043265871703624725,
0.06921685487031937,
-0.032498423010110855,
-0.01773829013109207,
0.0005499106482602656]


The output is truncated, but if you check the length of this list it is 384, which is the embedding size of the "BAAI/bge-small-en-v1.5" model.
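The `normalize_embeddings: True` setting used when loading the model asks for unit-length vectors. As a rough illustration with made-up 3-dimensional vectors (real embeddings from this model have 384 dimensions), the sketch below shows L2 normalization and why cosine similarity then reduces to a plain dot product. The helper names are hypothetical and not part of any library.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [v / norm for v in vec]

def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors, purely for illustration
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print(math.sqrt(dot(a_unit, a_unit)))     # length of a normalized vector
print(cosine(a, b), dot(a_unit, b_unit))  # the two values agree up to rounding
```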


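Next, let's build a self-query retriever!


LangChain self-query retriever

To keep the naming clear, let's load the dataset again under the name langchain_dataset. This variable will be used for the self-query retriever, which is why we keep it separate from the original dataset.


with open("dataset.json") as f:
    langchain_dataset = json.load(f)['cases_list']


We will preprocess langchain_dataset.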


import re

def remove_punctuation(text):
    # Drop everything except word characters, whitespace, and commas, then lowercase
    no_punct_text = re.sub(r'[^\w\s,]', '', text)
    return no_punct_text.lower()

# Now use this function:
for x in langchain_dataset:
    # Values (ints, lists) are stringified first, so every metadata field becomes a string
    for key, val in x['metadata'].items():
        x['metadata'][key] = remove_punctuation(str(val))
    x['page_content'] = remove_punctuation(str(x['page_content']))


Let's look at the result:


langchain_dataset[0]


{'page_content': 'the supreme court upheld the centres 2016 demonetisation scheme in a 41 majority, ruling that demonetisation was proportionate to the unions objectives and implemented reasonably',
'metadata': {'year': '2023',
'court': 'supreme court of india',
'judges': 's abdul nazeer, br gavai, as bopanna, v ramasubramanian, bv nagarathna',
'legal_topics': 'constitutional law, economic policy',
'relevant_laws': 'article 370 of the indian constitution'}}
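To see what this preprocessing does to individual values, the short sketch below reapplies the same regular expression to a few values taken from the first record. It is only an illustration, but it makes two side effects visible: integers and lists end up as plain strings, and "4:1" collapses to "41", as seen in the output above.

```python
import re

def remove_punctuation(text):
    # Keep word characters, whitespace, and commas; drop the rest; lowercase
    return re.sub(r'[^\w\s,]', '', text).lower()

judges = ['S. Abdul Nazeer', 'B.R. Gavai']

print(remove_punctuation(str(judges)))       # s abdul nazeer, br gavai
print(remove_punctuation(str(2023)))         # 2023 (now a string, not an int)
print(remove_punctuation("a 4:1 majority"))  # a 41 majority
```

Because every field becomes a string, the metadata field types declared for the retriever later are all "string" as well.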


Next, let's initialize the Qdrant client and build the vector store.


from langchain_core.documents import Document
from langchain_community.vectorstores.qdrant import Qdrant

# Wrap each case in a LangChain Document
for i, x in enumerate(langchain_dataset):
    langchain_dataset[i] = Document(
        page_content=x['page_content'],
        metadata=x['metadata'],
    )

# Build an in-memory Qdrant collection from the documents
vectorstore = Qdrant.from_documents(
    langchain_dataset,
    hf,
    location=":memory:",
    collection_name="langchain_legal",
)


Below is the main part: setting up the metadata field information. This is how we tell the retriever which fields are available.


from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_groq import ChatGroq

# Every field is declared as a string because preprocessing stringified all metadata values
metadata_field_info = [
    AttributeInfo(
        name="court",
        description="The judiciary court's name.",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the court decision was given.",
        type="string",
    ),
    AttributeInfo(
        name="judges",
        description="The list of names of the case judges.",
        type="string",
    ),
    AttributeInfo(
        name="legal_topics",
        description="list of the topic names of the case",
        type="string",
    ),
    AttributeInfo(
        name="relevant_laws",
        description="list of relevant laws that applied on the case decision",
        type="string",
    ),
]
document_content_description = "Brief summary of a court case"


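Finally, let's put everything together and initialize our retriever.


In this experiment we access Llama 3 through Groq's API. If you have not signed up yet, do so first, get your API key, and use it in place of "your-groq-key". Sign up here: https://console.groq.com/docs/quickstart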


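from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_groq import ChatGroq

# No model is named here, so whichever default the installed langchain-groq version
# provides is used. To be sure Llama 3 is called, pass model=... with a Llama 3
# model ID taken from Groq's model list.
llm = ChatGroq(temperature=0, api_key="your-groq-key")
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
)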


Now let's see how it performs:


retriever.invoke("demonetization case summary")


Result:


[Document(page_content="The Supreme Court upheld the Centre's 2016 demonetisation scheme in a 4:1 majority, ruling that demonetisation was proportionate to the Union's objectives and implemented reasonably.", metadata={'year': '2023', 'court': 'supreme court of india', 'judges': 's abdul nazeer, br gavai, as bopanna, v ramasubramanian, bv nagarathna', 'legal_topics': 'constitutional law, economic policy', 'relevant_laws': 'article 370 of the indian constitution', '_id': 'aaf5d793903e4a6aa2bcd68ee11ccfad', '_collection_name': 'langchain_legal'}),
Document(page_content='The Supreme Court of the United Kingdom ruled that Uber drivers are workers and entitled to employment rights such as minimum wage and holiday pay.', metadata={'year': '2021', 'court': 'supreme court of the united kingdom', 'judges': 'lord reed, lord hodge, lady arden', 'legal_topics': 'employment law, gig economy', 'relevant_laws': 'employment rights act 1996', '_id': '07dc3ad5dbf243cfa6c67e3bea2589a0', '_collection_name': 'langchain_legal'}),
Document(page_content='The U.S. Supreme Court invalidated a federal law prohibiting sports betting outside Nevada, holding it violated the anti-commandeering rule.', metadata={'year': '2018', 'court': 'supreme court of the united states', 'judges': 'majority opinion by justice samuel alito', 'legal_topics': 'sports law, constitutional law', 'relevant_laws': 'professional and amateur sports protection act paspa', '_id': 'cd0196513549463798f0805851bcad48', '_collection_name': 'langchain_legal'}),
Document(page_content='The Supreme Court of India decriminalized consensual homosexual acts between adults, overturning a colonial-era law under Section 377 of the Indian Penal Code.', metadata={'year': '2018', 'court': 'supreme court of india', 'judges': 'chief justice dipak misra', 'legal_topics': 'human rights, lgbtq rights', 'relevant_laws': 'section 377 of the indian penal code', '_id': 'f6f893a7103f4ec1a355d80691e74cfd', '_collection_name': 'langchain_legal'})]


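The first result is exactly what we were looking for. On closer inspection, though, you will notice that this query did not use any filter.


Let's try a query with a filter: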


# This example only specifies a filter
retriever.invoke("cases in year 2020")


Result:


[Document(page_content='The International Court of Justice ruled that Myanmar must take measures to prevent the genocide of the Rohingya people.', metadata={'year': '2020', 'court': 'international court of justice', 'judges': 'president abdulqawi yusuf', 'legal_topics': 'international law, human rights', 'relevant_laws': 'convention on the prevention and punishment of the crime of genocide', '_id': '7829c80e30eb4d31948623f6873ff46f', '_collection_name': 'langchain_legal'}),
Document(page_content='The Constitutional Court of South Africa ruled that domestic workers are entitled to the same compensation benefits as other workers.', metadata={'year': '2020', 'court': 'constitutional court of south africa', 'judges': 'chief justice mogoeng mogoeng', 'legal_topics': 'labor law, equality', 'relevant_laws': 'compensation for occupational injuries and diseases act', '_id': 'bb34eeb9ce5a49f39da01e4ed3764367', '_collection_name': 'langchain_legal'})]


Great, these cases are all from 2020; check the "year" field of both results.
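Conceptually, a self-query retriever asks the LLM to split the user's request into a free-text query and a metadata filter, then applies that filter during the vector search. The sketch below imitates only the filtering half in plain Python, using three abridged metadata records from this article and a hypothetical `StructuredRequest` class; it is not LangChain's implementation. Note that `year` is compared as a string, because the preprocessing step turned every metadata value into a string.

```python
from dataclasses import dataclass

@dataclass
class StructuredRequest:
    """Hypothetical stand-in for what the LLM extracts from a user query."""
    query: str      # free text to embed; may be empty
    attribute: str  # metadata field to filter on
    value: object   # value the field must equal

def apply_eq_filter(records, request):
    """Keep the records whose metadata field equals the requested value exactly."""
    return [
        r for r in records
        if r["metadata"].get(request.attribute) == request.value
    ]

# Abridged metadata (only two fields each) from records shown in this article
records = [
    {"metadata": {"year": "2020", "court": "international court of justice"}},
    {"metadata": {"year": "2020", "court": "constitutional court of south africa"}},
    {"metadata": {"year": "2023", "court": "supreme court of india"}},
]

# "cases in year 2020" has no free-text part, only a filter on `year`
request = StructuredRequest(query="", attribute="year", value="2020")
matches = apply_eq_filter(records, request)
print([m["metadata"]["court"] for m in matches])

# The integer 2020 matches nothing, since the stored values are strings
print(apply_eq_filter(records, StructuredRequest("", "year", 2020)))
```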


Here is another example:


# This example only specifies a filter
retriever.invoke("cases with topic Constitutional Law")


[Document(page_content='The U.S. Supreme Court decided that same-sex marriage is a constitutional right under the Fourteenth Amendment.', metadata={'year': '2015', 'court': 'supreme court of the united states', 'judges': 'majority opinion by justice anthony kennedy', 'legal_topics': 'constitutional law, civil rights', 'relevant_laws': 'fourteenth amendment of the us constitution', '_id': '6480fdbd1a8143b383b8fa03b8748cab', '_collection_name': 'langchain_legal'}),
Document(page_content="The U.S. Supreme Court ruled that state laws banning abortion are unconstitutional, recognizing a woman's right to privacy in making medical decisions.", metadata={'year': '1973', 'court': 'supreme court of the united states', 'judges': 'majority opinion by justice harry blackmun', 'legal_topics': 'constitutional law, reproductive rights', 'relevant_laws': 'fourteenth amendment of the us constitution', '_id': '0b193c8b40794f6dac578dfd046fbbd1', '_collection_name': 'langchain_legal'}),
Document(page_content='The U.S. Supreme Court held that the Defense of Marriage Act (DOMA) was unconstitutional as it violated the Fifth Amendment by denying federal recognition of same-sex marriages.', metadata={'year': '2013', 'court': 'supreme court of the united states', 'judges': 'majority opinion by justice anthony kennedy', 'legal_topics': 'constitutional law, lgbtq rights', 'relevant_laws': 'fifth amendment of the us constitution', '_id': '2b7a5bb46e1a4492b88d312fbc335014', '_collection_name': 'langchain_legal'}),
Document(page_content='The Constitutional Court of South Africa ruled that the use of corporal punishment in the home is unconstitutional.', metadata={'year': '2019', 'court': 'constitutional court of south africa', 'judges': 'chief justice mogoeng mogoeng', 'legal_topics': 'family law, human rights', 'relevant_laws': 'south african constitution', '_id': 'b284224c8cbb4878b2b43906ef57073d', '_collection_name': 'langchain_legal'})]


This is another good example of self-querying at work.


Now, let's see where it fails!


retriever.invoke("cases where Judge Sylvia Steiner was the judge")


This raises an error:


OutputParserException: Parsing text
```json
{
"query": "",
"filter": "eq(contains(judges, \"Sylvia Steiner\"), true)"
}
```


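Explanation:


- The query is an empty string because the user is not looking for any specific text in the case content.


- The filter uses the `eq` comparison statement to check whether the `judges` attribute contains "Sylvia Steiner".


- The `contains` function is used to check whether the `judges` attribute is a string containing the name "Sylvia Steiner".


- The `true` value is used as the comparison value of the `eq` statement to check whether the `contains` function returns `true`.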


raised following error:
Unexpected token Token('COMMA', ',') at line 1, column 19.
Expected one of:
* LPAR
Previous tokens: [Token('CNAME', 'judges')]
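Since a malformed filter makes the retriever raise instead of returning results, one pragmatic safeguard (not part of the original article) is to catch the failure and fall back to a plain similarity search. The sketch below uses hypothetical stand-in functions so it runs on its own; in this article's setup, `primary` could be `retriever.invoke` and `fallback` a call such as `vectorstore.similarity_search`.

```python
def retrieve_with_fallback(query, primary, fallback):
    """Try the filtered retriever first; use the unfiltered search if it raises."""
    try:
        return primary(query), "primary"
    except Exception as exc:  # broad on purpose; narrow to the parser error in real code
        print(f"Filtered retrieval failed ({type(exc).__name__}); using fallback.")
        return fallback(query), "fallback"

# Hypothetical stand-ins for the real retriever and vector store
def failing_self_query(query):
    raise ValueError("could not parse filter")

def plain_search(query):
    return [f"unfiltered result for: {query}"]

results, source = retrieve_with_fallback(
    "cases where Judge Sylvia Steiner was the judge",
    failing_self_query,
    plain_search,
)
print(source, results)
```

The trade-off is that the fallback ignores the filter entirely, so its results should be treated as unfiltered.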


Here is an example that returns no results at all, which is odd.


retriever.invoke("demonetization case of Supreme court of India")


You will get an empty list.


Next, let's look at the second approach: Qdrant's payload filters.


Qdrant's payload filters

First, let's create a new collection using Qdrant's official client.


from qdrant_client.models import PointStruct
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
client = QdrantClient(":memory:")
client.recreate_collection(
    collection_name="law_docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
# NOTE: consider splitting the data into chunks to avoid hitting the server's payload size limit
# or use `upload_collection` or `upload_points` methods which handle this for you
# WARNING: uploading points one-by-one is not recommended due to requests overhead
client.upsert(
    collection_name="law_docs",
    points=[
        PointStruct(
            id=idx,
            vector=hf.embed_query(element['page_content']),
            payload=element['metadata']
        )
        for idx, element in enumerate(dataset)
    ]
)
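The comments in the code above recommend splitting larger uploads into chunks. A minimal standard-library sketch of that idea follows; `chunked` is a hypothetical helper and the integers stand in for `PointStruct` objects.

```python
from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch

points = list(range(7))  # stand-in for a list of PointStruct objects
for batch in chunked(points, 3):
    print(batch)
```

With real data, each batch would be passed to `client.upsert(collection_name="law_docs", points=batch)`. For the 30 cases used here, a single call is fine.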


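As it stands, this approach takes manual work, because we have to scan the whole query sentence to find which fields it mentions before we can apply the conditions.


Here we will show just one example: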


from qdrant_client import models
client.scroll(
    collection_name="law_docs",
    scroll_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="year",
                match=models.MatchValue(value=2023),
            ),
        ]
    ),
)


This looks up all records whose year is 2023, and you will get:


([Record(id=0, payload={'year': 2023, 'court': 'Supreme Court of India', 'judges': ['S. Abdul Nazeer', 'B.R. Gavai', 'A.S. Bopanna', 'V. Ramasubramanian', 'B.V. Nagarathna'], 'legal_topics': ['Constitutional Law', 'Economic Policy'], 'relevant_laws': ['Article 370 of the Indian Constitution']}, vector=None, shard_key=None),
Record(id=1, payload={'year': 2023, 'court': 'Supreme Court of the United States', 'judges': ['K.M. Joseph', 'Ajay Rastogi', 'Aniruddha Bose', 'Hrishikesh Roy', 'C.T. Ravikumar'], 'legal_topics': ['Election Law', 'Constitutional Law'], 'relevant_laws': ['Election Commission Act']}, vector=None, shard_key=None),
Record(id=4, payload={'year': 2023, 'court': 'Supreme Court of India', 'judges': ['K.M. Joseph', 'Ajay Rastogi', 'Aniruddha Bose', 'Hrishikesh Roy', 'C.T. Ravikumar'], 'legal_topics': ['Health Law', 'Right to Die'], 'relevant_laws': ['Guidelines for Passive Euthanasia']}, vector=None, shard_key=None)],
None)


That's correct!
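The same kind of `MatchValue` condition can also target list-valued fields such as `judges`: according to Qdrant's filtering documentation, a match on an array field succeeds when at least one element equals the value. The plain-Python sketch below imitates that `must` behaviour on two abridged payloads from the output above so the semantics are easy to see; `field_matches` and `matches_all` are hypothetical helpers, not Qdrant code.

```python
def field_matches(payload, key, value):
    """True if payload[key] equals value, or contains it when the field is a list."""
    stored = payload.get(key)
    if isinstance(stored, list):
        return value in stored
    return stored == value

def matches_all(payload, conditions):
    """Imitate a `must` clause: every (key, value) condition has to hold."""
    return all(field_matches(payload, k, v) for k, v in conditions.items())

# Abridged payloads keyed by point id
payloads = {
    0: {"year": 2023, "court": "Supreme Court of India",
        "judges": ["S. Abdul Nazeer", "B.R. Gavai", "A.S. Bopanna"]},
    4: {"year": 2023, "court": "Supreme Court of India",
        "judges": ["K.M. Joseph", "Ajay Rastogi"]},
}

conditions = {"year": 2023, "judges": "A.S. Bopanna"}
hits = [pid for pid, p in payloads.items() if matches_all(p, conditions)]
print(hits)
```

The match is exact, so the value has to be spelled exactly as stored, including case and punctuation.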


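As an exercise, try implementing this as a filtering pipeline that covers all the clauses and conditions Qdrant offers. Now that would be something!


Now, let's talk about semantic filtering.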


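Semantic filtering

The third technique, and the best one I found, is to use a separate vector database to filter the results!


Here is how it works:

  1. The main vector database stays the same, i.e. the "law_docs" collection; this is the first level.
  2. We build a new vector database called "metadata" as the second level.
  3. Both databases are searched with the query to find the best matches.


Let's create the new collection, but first we need to preprocess the text, because each entry's entire metadata has to be provided as a single piece of text: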


metadata_fields = [x['metadata'] for x in dataset]
metadata_list = []
for elem in metadata_fields:
    s = ''
    for key in elem:
        s = s + f"{key} : {elem[key]}\n"
    s = s.strip().lower().replace('.','').replace("'",'').replace('[','').replace(']','')
    # s = remove_punctuation(s)
    metadata_list.append(s)

metadata_client = QdrantClient(":memory:")
metadata_client.recreate_collection(
    collection_name="metadata",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

metadata_client.upsert(
    collection_name="metadata",
    points=[
        PointStruct(
            id=idx,
            vector=hf.embed_query(element),
        )
        for idx, element in enumerate(metadata_list)
    ]
)
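To make the flattening step concrete, the sketch below applies the same string operations to the metadata of the first record and prints the text that gets embedded. `flatten_metadata` is a hypothetical wrapper around the logic of the loop above.

```python
def flatten_metadata(metadata):
    """Join all metadata fields into one lowercase text block."""
    s = ''
    for key in metadata:
        s = s + f"{key} : {metadata[key]}\n"
    return (s.strip().lower()
             .replace('.', '').replace("'", '')
             .replace('[', '').replace(']', ''))

first_metadata = {
    'year': 2023,
    'court': 'Supreme Court of India',
    'judges': ['S. Abdul Nazeer', 'B.R. Gavai', 'A.S. Bopanna',
               'V. Ramasubramanian', 'B.V. Nagarathna'],
    'legal_topics': ['Constitutional Law', 'Economic Policy'],
    'relevant_laws': ['Article 370 of the Indian Constitution'],
}

flat = flatten_metadata(first_metadata)
print(flat)
```

Because names are stored in this stripped, lowercase form (for example "as bopanna"), queries phrased the same way tend to land closer to the right metadata entry.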


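Let's define the main function that drives the results:


As mentioned, we run a two-level vector search: one against the original vector database, and another to check whether any of the first-level hits also match in the second level. We then sort those common hits by how well they match the query!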


def semantic_filtering(text):
    first_level = []    # ids from the summary search, in ranking order
    matching_hits = []  # (score, id) pairs that pass the metadata check

    query_vector = hf.embed_query(text)

    # First level: search the case summaries
    hits = client.search("law_docs", query_vector, limit=5)
    for h in hits:
        first_level.append(h.id)

    # Second level: search the flattened metadata with the same query vector
    filter_hits = metadata_client.search("metadata", query_vector, limit=5)
    filter_hits_dict = {fh.id: fh for fh in filter_hits}

    # Keep only the ids that appear in both levels
    common_hits = [hit for hit in first_level if hit in filter_hits_dict]
    for hit in common_hits:
        filter_hit_detail = filter_hits_dict[hit]
        if filter_hit_detail.score > 0.65:
            # A list of pairs avoids losing hits that happen to share a score
            matching_hits.append((filter_hit_detail.score, hit))

    # Sort by metadata score, best first
    sorted_matching_hits = sorted(matching_hits, reverse=True)

    if sorted_matching_hits:
        print("semantic_filtering")
        return [dataset[hit] for score, hit in sorted_matching_hits]
    else:
        print("No filter found")
        return [dataset[hit] for hit in first_level]


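Notice the line "if filter_hit_detail.score > 0.65": this is where we control the threshold, and it is very sensitive to noise. Choose this score carefully, or try to find a better way to compute it, perhaps by combining the cosine score we get from the vector database with a custom metric suited to the problem statement.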
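One way to act on that suggestion, sketched here under assumptions, is to blend the summary-search score and the metadata-search score into a single number instead of thresholding the metadata score alone. The weights, the helper names, and the example scores below are all made up for illustration; they do not come from this article's experiments and would need tuning on real queries.

```python
def combined_score(content_score, metadata_score, metadata_weight=0.5):
    """Weighted average of the two cosine scores (the weight is an assumption)."""
    if not 0.0 <= metadata_weight <= 1.0:
        raise ValueError("metadata_weight must be between 0 and 1")
    return (1.0 - metadata_weight) * content_score + metadata_weight * metadata_score

def rank_hits(content_scores, metadata_scores, threshold=0.65, metadata_weight=0.5):
    """Rank ids found in both searches by blended score, dropping low scorers."""
    ranked = []
    for hit_id, content_score in content_scores.items():
        if hit_id not in metadata_scores:
            continue  # only ids present in both levels are considered
        score = combined_score(content_score, metadata_scores[hit_id], metadata_weight)
        if score > threshold:
            ranked.append((score, hit_id))
    return [hit_id for score, hit_id in sorted(ranked, reverse=True)]

# Made-up scores keyed by point id
content_scores = {0: 0.82, 7: 0.74, 12: 0.60}
metadata_scores = {0: 0.70, 12: 0.90, 21: 0.66}

print(rank_hits(content_scores, metadata_scores))
```

A blended score lets a strong summary match compensate for a weaker metadata match, and the other way round, at the cost of one more parameter to tune.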


Finally, let's run the semantic filter.


# Example usage
text = "gay marriage cases"
results = semantic_filtering(text)
results


This gives:


[{'page_content': 'The U.S. Supreme Court held that the Defense of Marriage Act (DOMA) was unconstitutional as it violated the Fifth Amendment by denying federal recognition of same-sex marriages.',
  'metadata': {'year': 2013,
   'court': 'Supreme Court of the United States',
   'judges': ['Majority opinion by Justice Anthony Kennedy'],
   'legal_topics': ['Constitutional Law', 'LGBTQ+ Rights'],
   'relevant_laws': ['Fifth Amendment of the U.S. Constitution']}},
 {'page_content': 'The Supreme Court of Japan upheld a law requiring married couples to share a surname, ruling it did not violate constitutional guarantees of equality.',
  'metadata': {'year': 2015,
   'court': 'Supreme Court of Japan',
   'judges': ['Chief Justice Itsuro Terada'],
   'legal_topics': ['Family Law', 'Equality'],
   'relevant_laws': ['Japanese Civil Code']}}]


Let's take another example:


semantic_filtering("demonetisation case judge as boppana".lower())


This gives:


semantic_filtering
[{'page_content': "The Supreme Court upheld the Centre's 2016 demonetisation scheme in a 4:1 majority, ruling that demonetisation was proportionate to the Union's objectives and implemented reasonably.",
  'metadata': {'year': 2023,
   'court': 'Supreme Court of India',
   'judges': ['S. Abdul Nazeer',
    'B.R. Gavai',
    'A.S. Bopanna',
    'V. Ramasubramanian',
    'B.V. Nagarathna'],
   'legal_topics': ['Constitutional Law', 'Economic Policy'],
   'relevant_laws': ['Article 370 of the Indian Constitution']}},
 {'page_content': 'The Constitutional Court of South Africa ruled that domestic workers are entitled to the same compensation benefits as other workers.',
  'metadata': {'year': 2020,
   'court': 'Constitutional Court of South Africa',
   'judges': ['Chief Justice Mogoeng Mogoeng'],
   'legal_topics': ['Labor Law', 'Equality'],
   'relevant_laws': ['Compensation for Occupational Injuries and Diseases Act']}},
 {'page_content': 'The Constitutional Court of South Africa ruled that the use of corporal punishment in the home is unconstitutional.',
  'metadata': {'year': 2019,
   'court': 'Constitutional Court of South Africa',
   'judges': ['Chief Justice Mogoeng Mogoeng'],
   'legal_topics': ['Family Law', 'Human Rights'],
   'relevant_laws': ['South African Constitution']}}]


If no filter can be found from the text, we simply return the first-level results.


semantic_filtering("cases about money".lower())


This gives:


No filter found
[{'page_content': "The Supreme Court upheld the Centre's 2016 demonetisation scheme in a 4:1 majority, ruling that demonetisation was proportionate to the Union’s objectives and implemented reasonably.",
  'metadata': {'year': 2023,
   'court': 'Supreme Court of India',
   'judges': ['S. Abdul Nazeer',
    'B.R. Gavai',
    'A.S. Bopanna',
    'V. Ramasubramanian',
    'B.V. Nagarathna'],
   'legal_topics': ['Constitutional Law', 'Economic Policy'],
   'relevant_laws': ['Article 370 of the Indian Constitution']}},
 {'page_content': 'In a case concerning the rights of students with disabilities, the United States Department of Justice concluded that Alabama’s foster care system discriminates against students with emotional and behavioral disabilities.',
  'metadata': {'year': 2022,
   'court': 'United States Department of Justice',
   'judges': ['Investigation by the Civil Rights Division'],
   'legal_topics': ['Disability Law', 'Education Law'],
   'relevant_laws': ['Title II of the Americans with Disabilities Act']}},
 {'page_content': 'The U.S. Supreme Court ruled that a state cannot require out-of-state sellers to collect sales tax unless they have a significant presence in the state.',
  'metadata': {'year': 2018,
   'court': 'Supreme Court of the United States',
   'judges': ['Majority opinion by Justice Kennedy'],
   'legal_topics': ['Tax Law', 'Commerce Clause'],
   'relevant_laws': ['Commerce Clause of the U.S. Constitution']}},
 {'page_content': 'The Supreme Court of the United Kingdom ruled that Uber drivers are workers and entitled to employment rights such as minimum wage and holiday pay.',
  'metadata': {'year': 2021,
   'court': 'Supreme Court of the United Kingdom',
   'judges': ['Lord Reed', 'Lord Hodge', 'Lady Arden'],
   'legal_topics': ['Employment Law', 'Gig Economy'],
   'relevant_laws': ['Employment Rights Act 1996']}},
 {'page_content': 'The Court of Justice of the European Union ruled that data retention laws requiring telecommunications companies to retain users’ data indiscriminately are invalid.',
  'metadata': {'year': 2016,
   'court': 'Court of Justice of the European Union',
   'judges': ['Judge Koen Lenaerts'],
   'legal_topics': ['Data Privacy', 'Telecommunications Law'],
   'relevant_laws': ['Directive 2006/24/EC']}}]


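Conclusion

In this article, we saw how important metadata filtering is when building smart RAG applications! We built a legal case search system and looked at 3 different techniques for storing and using metadata:


  1. LangChain's self-querying
  2. Qdrant's payload filters
  3. Semantic filtering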
Source: https://medium.com/ai-advances/building-a-legal-case-search-engine-using-qdrant-llama-3-langchain-and-exploring-different-655ed5b25f30