[Guide] How to Get JSON Output from LLMs

Published by alex on August 21, 2024

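Large language models (LLMs) are good at generating text, but getting structured output such as JSON usually takes clever prompting and some hope that the LLM understands. Thankfully, JSON mode is becoming more common in LLM frameworks and services. It lets you define the exact output schema you want.

This post covers constrained generation using JSON mode. We will use a complex, nested, and realistic JSON schema example to guide LLM frameworks and APIs such as Llama.cpp or the Gemini API to generate structured data, specifically tourist location information. It builds on a previous post about constrained generation using Guidance, but focuses on the more widely adopted JSON mode.

Although JSON mode is more limited than Guidance, its broader support makes it more accessible, especially with cloud-based LLM providers.

During a personal project, I found that while JSON mode worked directly with Llama.cpp, getting it to work with the Gemini API required some extra steps. This post shares those solutions to help you use JSON mode effectively.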

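JSON Schema: Tourist Location Document

Our example schema represents a TouristLocation. It is a non-trivial structure with nested objects, lists, enums, and various data types such as strings and numbers.

Here is a simplified version: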

{
  "name": "string",
  "location_long_lat": ["number", "number"],
  "climate_type": {"type": "string", "enum": ["tropical", "desert", "temperate", "continental", "polar"]},
  "activity_types": ["string"],
  "attraction_list": [
    {
      "name": "string",
      "description": "string"
    }
  ],
  "tags": ["string"],
  "description": "string",
  "most_notably_known_for": "string",
  "location_type": {"type": "string", "enum": ["city", "country", "establishment", "landmark", "national park", "island", "region", "continent"]},
  "parents": ["string"]
}
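
To make the shape above concrete, here is a minimal sketch of a standard-library check for a candidate document. The helper name `find_problems`, the sample document, and its placeholder coordinates are assumptions for illustration only; the checks cover just a few of the fields and are not a full JSON Schema validator.

```python
# Enum values copied from the simplified schema above.
CLIMATE_TYPES = {"tropical", "desert", "temperate", "continental", "polar"}
LOCATION_TYPES = {
    "city", "country", "establishment", "landmark",
    "national park", "island", "region", "continent",
}


def is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def find_problems(doc: dict) -> list:
    """Return a list of human-readable problems (hypothetical helper)."""
    problems = []
    if not isinstance(doc.get("name"), str):
        problems.append("name must be a string")
    coords = doc.get("location_long_lat")
    if not (isinstance(coords, list) and len(coords) == 2 and all(map(is_number, coords))):
        problems.append("location_long_lat must be [number, number]")
    if doc.get("climate_type") not in CLIMATE_TYPES:
        problems.append("climate_type is not one of the allowed values")
    if doc.get("location_type") not in LOCATION_TYPES:
        problems.append("location_type is not one of the allowed values")
    for attraction in doc.get("attraction_list", []):
        if not (
            isinstance(attraction, dict)
            and isinstance(attraction.get("name"), str)
            and isinstance(attraction.get("description"), str)
        ):
            problems.append("each attraction needs a string name and description")
    return problems


# Illustrative document; the coordinates are placeholders, not real data.
example = {
    "name": "Casablanca",
    "location_long_lat": [0.0, 0.0],
    "climate_type": "temperate",
    "location_type": "city",
    "attraction_list": [{"name": "Old Medina", "description": "A historic walled city"}],
}
print(find_problems(example))
```

In practice a schema library does this work for you; the sketch only shows what "conforming to the schema" means for a handful of fields.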

You can write this kind of schema by hand, or generate it with the Pydantic library. Here is a simplified example:

from typing import List

from pydantic import BaseModel, Field


class TouristLocation(BaseModel):
    """Model for a tourist location"""

    high_season_months: List[int] = Field(
        [], description="List of months (1-12) when the location is most visited"
    )
    tags: List[str] = Field(
        ...,
        description="List of tags describing the location (e.g. accessible, sustainable, sunny, cheap, pricey)",
        min_length=1,
    )
    description: str = Field(..., description="Text description of the location")


# Example usage and schema output
location = TouristLocation(
    high_season_months=[6, 7, 8],
    tags=["beach", "sunny", "family-friendly"],
    description="A beautiful beach with white sand and clear blue water.",
)
schema = location.model_json_schema()
print(schema)

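This code defines a simplified version of the TouristLocation data class using Pydantic. It has three fields:

high_season_months: a list of integers representing the months of the year (1-12) when the location is most visited. Defaults to an empty list.
tags: a list of strings describing the location, with tags such as "accessible" or "sustainable". This field is required (...) and must contain at least one element (min_length=1).
description: a string field containing a text description of the location. This field is also required.

The code then creates an instance of the TouristLocation class and calls model_json_schema() to get the JSON schema representation of the model. This schema defines the structure and data types expected for the class.

model_json_schema() returns: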
{'description': 'Model for a tourist location',
 'properties': {'description': {'description': 'Text description of the '
                                               'location',
                                'title': 'Description',
                                'type': 'string'},
                'high_season_months': {'default': [],
                                       'description': 'List of months (1-12) '
                                                      'when the location is '
                                                      'most visited',
                                       'items': {'type': 'integer'},
                                       'title': 'High Season Months',
                                       'type': 'array'},
                'tags': {'description': 'List of tags describing the location '
                                        '(e.g. accessible, sustainable, sunny, '
                                        'cheap, pricey)',
                         'items': {'type': 'string'},
                         'minItems': 1,
                         'title': 'Tags',
                         'type': 'array'}},
 'required': ['tags', 'description'],
 'title': 'TouristLocation',
 'type': 'object'}
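
The `minItems` and `required` entries in the schema mirror constraints that Pydantic also enforces when a model is constructed. The following is a minimal sketch of that behavior, assuming Pydantic v2; the helper `failing_fields` is a hypothetical name, and the model is repeated (without descriptions) only so the snippet runs on its own.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class TouristLocation(BaseModel):
    """Same three fields as the simplified model above."""

    high_season_months: List[int] = Field([])
    tags: List[str] = Field(..., min_length=1)
    description: str = Field(...)


def failing_fields(payload: dict) -> list:
    """Return the field paths that fail validation (hypothetical helper)."""
    try:
        TouristLocation(**payload)
    except ValidationError as exc:
        return [error["loc"] for error in exc.errors()]
    return []


# A valid payload produces no errors.
print(failing_fields({"tags": ["beach"], "description": "A quiet beach."}))
# An empty tags list and a missing description are both reported.
print(failing_fields({"tags": []}))
```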

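Now that we have our schema, let's see how to enforce it: first in Llama.cpp through its Python wrapper, and then with Gemini's API.

Method 1: The Straightforward Approach with Llama.cpp

Llama.cpp is a C++ library for running Llama models locally. It is beginner-friendly and has an active community. We will use it through its Python wrapper.

Here is how to generate TouristLocation data with it: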

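# Imports (TouristLocation and the example `location` instance are assumed to be defined as above)
import time

from llama_cpp import Llama

# Model init: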
checkpoint = "lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF"
model = Llama.from_pretrained(
    repo_id=checkpoint,
    n_gpu_layers=-1,
    filename="*Q4_K_M.gguf",
    verbose=False,
    n_ctx=12_000,
)
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that outputs in JSON. "
        f"Follow this schema {TouristLocation.model_json_schema()}",
    },
    {"role": "user", "content": "Generate information about Hawaii, US."},
    {"role": "assistant", "content": f"{location.model_dump_json()}"},
    {"role": "user", "content": "Generate information about Casablanca"},
]
response_format = {
    "type": "json_object",
    "schema": TouristLocation.model_json_schema(),
}
start = time.time()
outputs = model.create_chat_completion(
    messages=messages, max_tokens=1200, response_format=response_format
)
print(outputs["choices"][0]["message"]["content"])
print(f"Time: {time.time() - start}")

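The code first imports the necessary libraries and initializes the LLM. It then defines a list of messages for the conversation with the model: a system message instructing the model to output JSON that follows a specific schema, user requests for information about Hawaii and Casablanca, and an assistant response that uses the specified schema.

Under the hood, Llama.cpp uses a context-free grammar to constrain the structure, so it generates valid JSON output for the new city.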
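
To give an intuition for grammar-constrained decoding, here is a toy sketch. It is not Llama.cpp's implementation: a small finite set of valid strings stands in for the grammar, decoding is per character rather than per token, and `random_model` is a hypothetical stand-in for the LLM's scores. The idea it illustrates is that, at every step, continuations that would break validity are masked out before the best-scoring one is picked.

```python
import random

CLIMATES = ["tropical", "desert", "temperate", "continental", "polar"]
# Finite stand-in for a grammar: every string the "grammar" accepts.
VALID_OUTPUTS = ['{"climate_type": "%s"}' % climate for climate in CLIMATES]


def allowed_next_chars(prefix: str) -> set:
    """Characters that keep `prefix` a prefix of at least one valid output."""
    return {
        candidate[len(prefix)]
        for candidate in VALID_OUTPUTS
        if candidate.startswith(prefix) and len(candidate) > len(prefix)
    }


def constrained_generate(score) -> str:
    """Greedy decoding where disallowed characters are never considered."""
    output = ""
    while output not in VALID_OUTPUTS:
        candidates = sorted(allowed_next_chars(output))
        # Only characters permitted by the constraint compete on score.
        output += max(candidates, key=lambda char: score(output, char))
    return output


rng = random.Random(0)


def random_model(prefix: str, char: str) -> float:
    # Hypothetical model: assigns arbitrary scores, ignoring its inputs.
    return rng.random()


result = constrained_generate(random_model)
print(result)
```

Even with a model that scores characters at random, the output is always one of the valid strings, which is why a constrained backend can guarantee well-formed JSON regardless of how the prompt is followed.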

The Llama.cpp call above produces the following generated string:

{'activity_types': ['shopping', 'food and wine', 'cultural'],
 'attraction_list': [{'description': 'One of the largest mosques in the world '
                                     'and a symbol of Moroccan architecture',
                      'name': 'Hassan II Mosque'},
                     {'description': 'A historic walled city with narrow '
                                     'streets and traditional shops',
                      'name': 'Old Medina'},
                     {'description': 'A historic square with a beautiful '
                                     'fountain and surrounding buildings',
                      'name': 'Mohammed V Square'},
                     {'description': 'A beautiful Catholic cathedral built in '
                                     'the early 20th century',
                      'name': 'Casablanca Cathedral'},
                     {'description': 'A scenic waterfront promenade with '
                                     'beautiful views of the city and the sea',
                      'name': 'Corniche'}],
 'climate_type': 'temperate',
 'description': 'A large and bustling city with a rich history and culture',
 'location_type': 'city',
 'most_notably_known_for': 'Its historic architecture and cultural '
                           'significance',
 'name': 'Casablanca',
 'parents': ['Morocco', 'Africa'],
 'tags': ['city', 'cultural', 'historical', 'expensive']}

This can then be parsed into an instance of our Pydantic class.
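
A minimal sketch of that parsing step, assuming Pydantic v2. `TouristLocationLite` is a hypothetical, trimmed stand-in for the full TouristLocation model, and `raw` stands in for the string returned in `outputs["choices"][0]["message"]["content"]`; the values are taken from the output shown above.

```python
import json
from typing import List

from pydantic import BaseModel


class Attraction(BaseModel):
    name: str
    description: str


class TouristLocationLite(BaseModel):
    """Hypothetical trimmed stand-in for the full TouristLocation model."""

    name: str
    tags: List[str]
    attraction_list: List[Attraction]


# Stand-in for the JSON string returned by the model.
raw = json.dumps(
    {
        "name": "Casablanca",
        "tags": ["city", "cultural", "historical", "expensive"],
        "attraction_list": [
            {
                "name": "Hassan II Mosque",
                "description": "One of the largest mosques in the world "
                "and a symbol of Moroccan architecture",
            }
        ],
    }
)

# model_validate_json parses the string and validates it in one step,
# raising pydantic.ValidationError if the content does not match the model.
parsed = TouristLocationLite.model_validate_json(raw)
print(parsed.attraction_list[0].name)
```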

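Method 2: Overcoming the Gemini API's Quirks

The Gemini API, Google's managed LLM service, claims in its documentation to have limited JSON mode support for Gemini Flash 1.5. However, it can be made to work with a few adjustments.

Here are the general instructions to make it work: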

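import os

import google.generativeai as genai
from google.generativeai.types import ContentDict

# replace_value_in_dict and delete_keys_recursive are defined further below
schema = TouristLocation.model_json_schema()
# Inline every $ref with its full definition, then drop the now-unused $defs
schema = replace_value_in_dict(schema.copy(), schema.copy())
del schema["$defs"]
# Strip the keys and fields that this walkthrough removes for Gemini
delete_keys_recursive(schema, key_to_delete="title")
delete_keys_recursive(schema, key_to_delete="location_long_lat")
delete_keys_recursive(schema, key_to_delete="default")
delete_keys_recursive(schema, key_to_delete="minItems")
print(schema)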
messages = [
    ContentDict(
        role="user",
        parts=[
            "You are a helpful assistant that outputs in JSON. "
            f"Follow this schema {TouristLocation.model_json_schema()}"
        ],
    ),
    ContentDict(role="user", parts=["Generate information about Hawaii, US."]),
    ContentDict(role="model", parts=[f"{location.model_dump_json()}"]),
    ContentDict(role="user", parts=["Generate information about Casablanca"]),
]
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
# Per the docs, `response_mime_type` with `response_schema` targets Gemini 1.5 Pro; Flash works here once the schema is cleaned as above
model = genai.GenerativeModel(
    "gemini-1.5-flash",
    # Set the `response_mime_type` to output JSON
    # Pass the schema object to the `response_schema` field
    generation_config={
        "response_mime_type": "application/json",
        "response_schema": schema,
    },
)
response = model.generate_content(messages)
print(response.text)

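Here is how to overcome Gemini's limitations:

1. Replace $ref with full definitions: Gemini stumbles on schema references ($ref), which are used for nested object definitions. Replace them with the full definition from the schema.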

def replace_value_in_dict(item, original_schema):
    # Source: https://github.com/pydantic/pydantic/issues/889
    if isinstance(item, list):
        return [replace_value_in_dict(i, original_schema) for i in item]
    elif isinstance(item, dict):
        if list(item.keys()) == ["$ref"]:
            definitions = item["$ref"][2:].split("/")
            res = original_schema.copy()
            for definition in definitions:
                res = res[definition]
            return res
        else:
            return {
                key: replace_value_in_dict(i, original_schema)
                for key, i in item.items()
            }
    else:
        return item
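
One thing to watch for: the function above returns a resolved definition as-is, so a `$ref` nested inside that definition stays unresolved. Below is a sketch of a variant that keeps resolving inside the target. The name `inline_refs` and the `Kind` enum in the sample schema are hypothetical, and the sketch assumes only local `#/...` references and a schema without self-referencing definitions (those would recurse forever).

```python
def inline_refs(node, root):
    """Return a copy of `node` with local {"$ref": "#/..."} entries inlined."""
    if isinstance(node, list):
        return [inline_refs(item, root) for item in node]
    if isinstance(node, dict):
        ref = node.get("$ref")
        if len(node) == 1 and isinstance(ref, str) and ref.startswith("#/"):
            target = root
            # Walk the path segments after "#/", e.g. ["$defs", "Attraction"].
            for part in ref[2:].split("/"):
                target = target[part]
            # Recurse so references nested inside the target are resolved too.
            return inline_refs(target, root)
        return {key: inline_refs(value, root) for key, value in node.items()}
    return node


# Illustrative schema with a reference nested inside another definition.
schema = {
    "$defs": {
        "Attraction": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "kind": {"$ref": "#/$defs/Kind"},
            },
        },
        "Kind": {"type": "string", "enum": ["museum", "beach"]},
    },
    "type": "object",
    "properties": {
        "attraction_list": {"type": "array", "items": {"$ref": "#/$defs/Attraction"}},
    },
}

flat = inline_refs(schema, schema)
flat.pop("$defs")  # no longer needed once everything is inlined
print(flat)
```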

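2. Remove unsupported keys: Gemini does not yet handle keys such as "title", "anyOf", or "minItems". Remove them from the schema. This makes the schema less readable and less restrictive, but if we want to stick with Gemini we have no other choice.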

def delete_keys_recursive(d, key_to_delete):
    if isinstance(d, dict):
        # Delete the key if it exists
        if key_to_delete in d:
            del d[key_to_delete]
        # Recursively process all items in the dictionary
        for k, v in d.items():
            delete_keys_recursive(v, key_to_delete)
    elif isinstance(d, list):
        # Recursively process all items in the list
        for item in d:
            delete_keys_recursive(item, key_to_delete)

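3. One-shot or few-shot prompting for enums: Gemini sometimes struggles with enums and outputs all possible values instead of a single choice. The values are also separated by "|" within a single string, which makes them invalid under our schema. Use one-shot prompting, providing one correctly formatted example, to steer it toward the desired behavior.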
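
The failure mode described in step 3 is easy to detect after the fact, which can be useful for deciding when to retry or add another example to the prompt. The following is a small sketch with a hypothetical helper, `check_enum_value`; it is not part of the original code.

```python
CLIMATE_TYPES = ["tropical", "desert", "temperate", "continental", "polar"]


def check_enum_value(value, allowed) -> str:
    """Classify an enum field as "ok", "pipe_joined", or "invalid"."""
    if not isinstance(value, str):
        return "invalid"
    if value in allowed:
        return "ok"
    # Detect several allowed values joined by "|" in a single string.
    parts = [part.strip() for part in value.split("|")]
    if len(parts) > 1 and all(part in allowed for part in parts):
        return "pipe_joined"
    return "invalid"


print(check_enum_value("temperate", CLIMATE_TYPES))
print(check_enum_value("tropical|desert|temperate|continental|polar", CLIMATE_TYPES))
```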

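By applying these transformations and providing clear examples, you can successfully generate structured JSON output with the Gemini API.

Conclusion

JSON mode lets you get structured data directly from LLMs, making them more useful in practical applications. While frameworks like Llama.cpp offer a straightforward implementation, you may run into issues with cloud services like the Gemini API.

Source: https://medium.com/towards-data-science/how-to-get-json-output-from-llms-a-practical-guide-838234ba3bab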
