Harnessing Gemini: A Guide to Generating Content from Structured Data

Published on October 18, 2024 by alex

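Introduction

In an era of rapid advances in artificial intelligence (AI), the ability to analyze and leverage large datasets is crucial. While a RAG (Retrieval-Augmented Generation) environment is often ideal for such tasks, some situations call for generating content from a smaller dataset.

Gemini's ability to handle a large number of tokens makes it a promising solution. By combining a prompt with an uploaded file, it can make effective use of even limited data. However, when working with structured formats such as CSV or JSON, it is essential to make sure the AI interprets the information accurately.

This report explores a practical way to do this with Python scripts. It walks through the specific technique, with examples, of describing the data with schemas so that Gemini can understand a small structured dataset and generate content from it.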

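Workflow

The workflow for generating content with Gemini from CSV data consists of the following steps:

Data preparation: The input is a CSV file containing the necessary data. A schema describing that data is then created in one of two ways.
Using the CSV file directly: A CSV schema is generated to define the structure and types of the data in the CSV file so that Gemini can understand and process it.
CSV-to-JSON conversion: If the CSV data is converted to JSON, a JSON schema is created to describe the structure and types of the JSON data so that Gemini can clearly understand the input.
Output schema definition: A JSON schema is created to specify the structure and types required of the output. This schema can be placed in the prompt itself or passed as the response schema parameter to guide Gemini's generation.
Content generation: Gemini generates the requested content from the prepared input data and the defined prompt. The output follows the specified output schema.
Result return: The generated content is returned as the final output.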
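
The "create a CSV schema" step above can also be automated rather than written by hand. The following is a minimal sketch, not part of the original scripts: the helper name infer_csv_schema is hypothetical, and it assumes the CSV has a header row and that a column is numeric only when every value parses as a float.

```python
import csv
import io


def infer_csv_schema(csv_text):
    """Build a schema dict (column name and type) from CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    fields = []
    for index, name in enumerate(header):
        values = [row[index] for row in body if index < len(row)]
        # A column counts as "number" only if it has values and all of them parse.
        numeric = bool(values) and all(is_number(v) for v in values)
        fields.append({"name": name, "type": "number" if numeric else "string"})
    return {
        "description": 'Order of "fields" is the order of columns of CSV data.',
        "fields": fields,
    }


# Two illustrative rows in the Year, Region, Population layout.
sample = "Year,Region,Population\n1975,Hokkaido,5338206\n1975,Aomori,1468646\n"
schema = infer_csv_schema(sample)
print(schema["fields"])
```

The resulting dict has the same "fields" layout as the hand-written csvSchema used in the sample scripts, so it could be serialized into the prompt in the same way.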

Usage

1. Create an API key

Visit https://ai.google.dev/gemini-api/docs/api-key and create an API key. Then enable the Generative Language API in the API console. This API key is used in the scripts below.
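
The sample scripts below hard-code the key as a placeholder string. One common alternative, sketched here under assumptions, is to read it from an environment variable. The variable name GEMINI_API_KEY and the helper load_api_key are hypothetical choices, not something the original article or the library prescribes.

```python
import os


def load_api_key(env_var="GEMINI_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key


# Usage in the sample scripts (replacing the "###" placeholder):
# api_key = load_api_key()
```

This keeps the key out of the source file, which matters if the scripts are shared or committed to a repository.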

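2. Sample data

This report uses the sample data shown in the image in the original article. Although that image shows a Google Spreadsheet, the actual tests used CSV data converted from the spreadsheet. The CSV file is named sample.csv.

The sample data comes from e-Stat and uses columns "A", "B", and "C", which hold the year, region, and population respectively. Although the image shows only "Hokkaido", the actual data covers all prefectures. The data has 2,303 rows and 3 columns and is used in CSV format by the scripts below.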
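
Before uploading, it can be worth confirming that the exported CSV has the expected shape. The sketch below uses a hypothetical helper, csv_shape. The source does not say whether the 2,303 figure includes the header row, so this version simply counts every row, header included.

```python
import csv
import io


def csv_shape(csv_text):
    """Return (row_count, column_count), counting the header row as a row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        # Covers both ragged rows and completely empty input.
        raise ValueError(f"Rows have inconsistent column counts: {sorted(widths)}")
    return len(rows), widths.pop()


# Usage with the real file:
# with open("sample.csv", "r", newline="") as f:
#     print(csv_shape(f.read()))
```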

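3. Main script

This is a Python script.

It contains the main class used to test the sample scripts below. Create a file named GenerateContent.py containing the following script. The sample scripts below import this file as GenerateContent.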

import google.generativeai as genai
import io
import json
import requests
import time

class Main:
    def __init__(self):
        self.genai = None
        self.model = None
        self.api_key = None
    def run(self, object):
        self.api_key = object["api_key"]
        self._setInstances(object)
        print("Get file...")
        file = self._uploadFile(object["name"], object["data"])
        print("Generate content...")
        response = self.model.generate_content(
            [file, object["prompt"]], request_options={"timeout": 600}
        )
        data = None
        try:
            data = json.loads(response.text)
        except json.JSONDecodeError:
            data = response.text
        return data
    def _setInstances(self, object):
        genai.configure(api_key=self.api_key)
        generation_config = {"response_mime_type": "application/json"}
        if "response_schema" in object:
            generation_config["response_schema"] = object["response_schema"]
        self.genai = genai
        self.model = genai.GenerativeModel(
            model_name="gemini-1.5-flash-002", # or gemini-1.5-pro-002
            generation_config=generation_config,
        )
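    def _uploadFile(self, name, text):
        file = None
        try:
            # Reuse the file if one with this name was already uploaded.
            file = genai.get_file(f"files/{name}")
        except Exception:
            # Otherwise upload the text through the REST multipart endpoint.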
            requests.post(
                f"https://generativelanguage.googleapis.com/upload/v1beta/files?uploadType=multipart&key={self.api_key}",
                files={
                    "data": (
                        "metadata",
                        json.dumps(
                            {
                                "file": {
                                    "mimeType": "text/plain",
                                    "name": f"files/{name}",
                                }
                            }
                        ),
                        "application/json",
                    ),
                    "file": ("file", io.StringIO(text), "text/plain"),
                },
            )
            time.sleep(2)
            file = genai.get_file(f"files/{name}")
            print("File was uploaded.")
        while file.state.name == "PROCESSING":
            print(".", end="")
            time.sleep(10)
            file = genai.get_file(file.name)
        if file.state.name == "FAILED":
            raise ValueError(file.state.name)
        return file
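
The run method above returns parsed JSON when the response text is valid JSON and falls back to the raw text otherwise. That behavior can be isolated and tried without calling the API. The helper name parse_response_text is hypothetical; the logic mirrors the try/except in run.

```python
import json


def parse_response_text(text):
    """Return parsed JSON when possible, otherwise the raw text unchanged."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


print(parse_response_text('[{"region": "Tokyo"}]'))  # a list containing one dict
print(parse_response_text("not json"))  # the original string
```

Because of this fallback, callers should check the type of the returned value before indexing into it.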

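Examples

Pattern 1

In this pattern, the CSV data is used directly. I used a CSV schema to help Gemini understand the CSV data and a JSON schema to export the result as JSON. Both appear in the script below as csvSchema and jsonSchema, and the prompt itself is the variable prompt.

The function createData returns the raw CSV data.

In this script, the JSON schema for the output is included in the prompt.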

import GenerateContent
import json

api_key = "###"  # Please set your API key.
filename = "sample.csv"  # Please set your CSV file with the path.

def createPrompt():
    csvSchema = {
        "description": 'Order of "fields" is the order of columns of CSV data.\n"name" is the column name.\n"type" is the type of value in the column.',
        "fields": [
            {"name": "Year", "type": "number"},
            {"name": "Region", "type": "string"},
            {"name": "Population", "type": "number"},
        ],
    }
    jsonSchema = {
        "description": "JSON schema for outputting the result.",
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "region": {"description": "Region name.", "type": "string"},
                "reason": {
                    "description": "Reasons for the population increase.",
                    "type": "string",
                },
                "measures": {
                    "description": "Details of measures to keep the population increasing.",
                    "type": "string",
                },
                "currentPopulation": {
                    "description": "Current population.",
                    "type": "number",
                },
                "futurePopulationWithMeasures": {
                    "description": "Future population after 50 years with measures to keep the population increasing.",
                    "type": "number",
                },
                "futurePopulationWithoutMeasures": {
                    "description": "Future population after 50 years without measures to keep the population increasing.",
                    "type": "number",
                },
            },
            "required": [
                "region",
                "reason",
                "measures",
                "currentPopulation",
                "futurePopulationWithMeasures",
                "futurePopulationWithoutMeasures",
            ],
        },
    }
    prompt = "\n".join(
        [
            "Run the following steps.",
            '1. Read the CSV data in the following text file. The CSV schema of this data is "CSVSchema".',
            f"<CSVSchema>{json.dumps(csvSchema)}</CSVSchema>",
            "2. Using the data collected and your knowledge, predict 3 regions that will have the largest increase in population in the future in the order of increase. Return the region name, detailed reasons for the increase, and measures to keep the population increasing by considering the features of the region. Also, return the current population and the population 50 years later predicted by you with and without measures to keep the population increasing.",
            '3. Return the result by following "JSONSchema".',
            f"<JSONSchema>{json.dumps(jsonSchema)}</JSONSchema>",
        ]
    )
    return prompt

def createData(filename):
    # Read the whole CSV file as text and close the file handle promptly.
    with open(filename, "r") as f:
        return f.read()

data = createData(filename)
prompt = createPrompt()
object = {
    "api_key": api_key,
    "name": "sample-name-1a",
    "data": data,
    "prompt": prompt,
}
res = GenerateContent.Main().run(object)
print(res)

In this script, the JSON schema for the output is passed through response_schema.

import GenerateContent
import json

api_key = "###"  # Please set your API key.
filename = "sample.csv"  # Please set your CSV file with the path.

def createPrompt():
    csvSchema = {
        "description": 'Order of "fields" is the order of columns of CSV data.\n"name" is the column name.\n"type" is the type of value in the column.',
        "fields": [
            {"name": "Year", "type": "number"},
            {"name": "Region", "type": "string"},
            {"name": "Population", "type": "number"},
        ],
    }
    prompt = "\n".join(
        [
            "Run the following steps.",
            '1. Read the CSV data in the following text file. The CSV schema of this data is "CSVSchema".',
            f"<CSVSchema>{json.dumps(csvSchema)}</CSVSchema>",
            "2. Using the data collected and your knowledge, predict 3 regions that will have the largest increase in population in the future in the order of increase. Return the region name, detailed reasons for the increase, and measures to keep the population increasing by considering the features of the region. Also, return the current population and the population 50 years later predicted by you with and without measures to keep the population increasing.",
        ]
    )
    return prompt

def createData(filename):
    # Read the whole CSV file as text and close the file handle promptly.
    with open(filename, "r") as f:
        return f.read()

jsonSchema = {
    "description": "JSON schema for outputting the result.",
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "region": {"description": "Region name.", "type": "string"},
            "reason": {
                "description": "Reasons for the population increase.",
                "type": "string",
            },
            "measures": {
                "description": "Details of measures to keep the population increasing.",
                "type": "string",
            },
            "currentPopulation": {
                "description": "Current population.",
                "type": "number",
            },
            "futurePopulationWithMeasures": {
                "description": "Future population after 50 years with measures to keep the population increasing.",
                "type": "number",
            },
            "futurePopulationWithoutMeasures": {
                "description": "Future population after 50 years without measures to keep the population increasing.",
                "type": "number",
            },
        },
        "required": [
            "region",
            "reason",
            "measures",
            "currentPopulation",
            "futurePopulationWithMeasures",
            "futurePopulationWithoutMeasures",
        ],
    },
}
data = createData(filename)
prompt = createPrompt()
object = {
    "api_key": api_key,
    "name": "sample-name-1b",
    "data": data,
    "prompt": prompt,
    "response_schema": jsonSchema,
}
res = GenerateContent.Main().run(object)
print(res)

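Pattern 2

In this pattern, the CSV data is used after being converted to JSON. I used one JSON schema to help Gemini understand the JSON data and another to export the result as JSON. Both appear in the script below as jsonSchema1 and jsonSchema2, and the prompt itself is the variable prompt.

The function createData returns JSON data converted from the CSV data, as shown below.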

[
  {"region": "Hokkaido", "populations": [{"year": "1975", "population": "5338206"} ,,,]},
  {"region": "Aomori", "populations": [{"year": "1975", "population": "1468646"} ,,,]},
  {"region": "Iwate", "populations": [{"year": "1975", "population": "1385563"} ,,,]},
  ,
  ,
  ,
]
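
The sample above stores the year and population as strings, while jsonSchema1 in the scripts below declares both as numbers. The following sketch shows one possible variation on the conversion that emits integers instead. It is not the article's createData: the helper name csv_to_grouped_json is hypothetical, and it assumes every data row has exactly three columns with integer year and population values.

```python
import csv
import io
import json


def csv_to_grouped_json(csv_text):
    """Group (year, region, population) rows by region, converting numbers to int."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    grouped = {}
    for year, region, population in reader:
        grouped.setdefault(region, []).append(
            {"year": int(year), "population": int(population)}
        )
    return json.dumps(
        [{"region": k, "populations": v} for k, v in grouped.items()]
    )


# Two illustrative rows in the Year, Region, Population layout.
sample = "Year,Region,Population\n1975,Hokkaido,5338206\n1975,Aomori,1468646\n"
print(csv_to_grouped_json(sample))
```

Regions appear in the order they are first seen in the CSV, because Python dicts preserve insertion order.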

In this script, the JSON schema for the output is included in the prompt.

import GenerateContent
import csv
import json

api_key = "###"  # Please set your API key.
filename = "sample.csv"  # Please set your CSV file with the path.

def createPrompt():
    jsonSchema1 = {
        "description": 'JSON schema of the inputted value. The filename is "blobName@sample.txt".',
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "region": {"description": "Region name.", "type": "string"},
                "populations": {
                    "description": "Populations for each year.",
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "year": {"type": "number", "description": "Year."},
                            "population": {
                                "type": "number",
                                "description": "Population.",
                            },
                        },
                        "required": ["year", "population"],
                    },
                },
            },
            "required": ["region", "populations"],
        },
    }
    jsonSchema2 = {
        "description": "JSON schema for outputting the result.",
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "region": {"description": "Region name.", "type": "string"},
                "reason": {
                    "description": "Reasons for the population increase.",
                    "type": "string",
                },
                "measures": {
                    "description": "Details of measures to keep the population increasing.",
                    "type": "string",
                },
                "currentPopulation": {
                    "description": "Current population.",
                    "type": "number",
                },
                "futurePopulationWithMeasures": {
                    "description": "Future population after 50 years with measures to keep the population increasing.",
                    "type": "number",
                },
                "futurePopulationWithoutMeasures": {
                    "description": "Future population after 50 years without measures to keep the population increasing.",
                    "type": "number",
                },
            },
            "required": [
                "region",
                "reason",
                "measures",
                "currentPopulation",
                "futurePopulationWithMeasures",
                "futurePopulationWithoutMeasures",
            ],
        },
    }
    prompt = "\n".join(
        [
            "Run the following steps.",
            '1. Read the JSON data in the following text file. The JSON schema of this data is "JSONSchema1".',
            f"<JSONSchema1>{json.dumps(jsonSchema1)}</JSONSchema1>",
            "2. Using the data collected and your knowledge, predict 3 regions that will have the largest increase in population in the future in the order of increase. Return the region name, detailed reasons for the increase, and measures to keep the population increasing by considering the features of the region. Also, return the current population and the population 50 years later predicted by you with and without measures to keep the population increasing.",
            '3. Return the result by following "JSONSchema2".',
            f"<JSONSchema2>{json.dumps(jsonSchema2)}</JSONSchema2>",
        ]
    )
    return prompt

def createData(filename):
    # Read the CSV rows, skipping the header, and close the file promptly.
    with open(filename, "r", newline="") as f:
        ar = list(csv.reader(f, delimiter=","))[1:]
    obj = {}
    for r in ar:
        year, region, population = r
        v = {"year": year, "population": population}
        obj[region] = (obj[region] + [v]) if region in obj else [v]
    arr = [{"region": k, "populations": v} for (k, v) in obj.items()]
    return json.dumps(arr)

data = createData(filename)
prompt = createPrompt()
object = {
    "api_key": api_key,
    "name": "sample-name-2a",
    "data": data,
    "prompt": prompt,
}
res = GenerateContent.Main().run(object)
print(res)

In this script, the JSON schema for the output is passed through response_schema.

import GenerateContent
import csv
import json

api_key = "###"  # Please set your API key.
filename = "sample.csv"  # Please set your CSV file with the path.

def createPrompt():
    jsonSchema1 = {
        "description": 'JSON schema of the inputted value. The filename is "blobName@sample.txt".',
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "region": {"description": "Region name.", "type": "string"},
                "populations": {
                    "description": "Populations for each year.",
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "year": {"type": "number", "description": "Year."},
                            "population": {
                                "type": "number",
                                "description": "Population.",
                            },
                        },
                        "required": ["year", "population"],
                    },
                },
            },
            "required": ["region", "populations"],
        },
    }
    prompt = "\n".join(
        [
            "Run the following steps.",
            '1. Read the JSON data in the following text file. The JSON schema of this data is "JSONSchema1".',
            f"<JSONSchema1>{json.dumps(jsonSchema1)}</JSONSchema1>",
            "2. Using the data collected and your knowledge, predict 3 regions that will have the largest increase in population in the future in the order of increase. Return the region name, detailed reasons for the increase, and measures to keep the population increasing by considering the features of the region. Also, return the current population and the population 50 years later predicted by you with and without measures to keep the population increasing.",
        ]
    )
    return prompt

def createData(filename):
    # Read the CSV rows, skipping the header, and close the file promptly.
    with open(filename, "r", newline="") as f:
        ar = list(csv.reader(f, delimiter=","))[1:]
    obj = {}
    for r in ar:
        year, region, population = r
        v = {"year": year, "population": population}
        obj[region] = (obj[region] + [v]) if region in obj else [v]
    arr = [{"region": k, "populations": v} for (k, v) in obj.items()]
    return json.dumps(arr)

jsonSchema2 = {
    "description": "JSON schema for outputting the result.",
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "region": {"description": "Region name.", "type": "string"},
            "reason": {
                "description": "Reasons for the population increase.",
                "type": "string",
        },
        "measures": {
            "description": "Details of measures to keep the population increasing.",
                "type": "string",
            },
            "currentPopulation": {
                "description": "Current population.",
                "type": "number",
            },
            "futurePopulationWithMeasures": {
                "description": "Future population after 50 years with measures to keep the population increasing.",
                "type": "number",
            },
            "futurePopulationWithoutMeasures": {
                "description": "Future population after 50 years without measures to keep the population increasing.",
                "type": "number",
            },
        },
        "required": [
            "region",
            "reason",
            "measures",
            "currentPopulation",
            "futurePopulationWithMeasures",
            "futurePopulationWithoutMeasures",
        ],
    },
}
data = createData(filename)
prompt = createPrompt()
object = {
    "api_key": api_key,
    "name": "sample-name-2b",
    "data": data,
    "prompt": prompt,
    "response_schema": jsonSchema2,
}
res = GenerateContent.Main().run(object)
print(res)

Results

[
  {
    "region": "Tokyo",
    "reason": "Tokyo's robust economy, diverse job market, and well-established infrastructure continue to attract both domestic and international migrants.  Its status as a global hub for business and culture ensures ongoing population growth.",
    "measures": "Invest in affordable housing, improve public transportation, enhance green spaces and recreational facilities to improve quality of life, and continue promoting Tokyo as a global center for innovation and opportunity.",
    "currentPopulation": 14086000,
    "futurePopulationWithMeasures": 16000000,
    "futurePopulationWithoutMeasures": 15000000
  },
  {
    "region": "Osaka",
    "reason": "Osaka is a major economic center with a strong industrial base and a thriving service sector. Its vibrant culture and relatively lower cost of living compared to Tokyo attract individuals seeking opportunities.",
    "measures": "Focus on attracting skilled workers and entrepreneurs by offering tax incentives and streamlining business regulations. Improve affordable housing options and educational facilities. Promote Osaka's cultural attractions to attract tourists and residents.",
    "currentPopulation": 8763000,
    "futurePopulationWithMeasures": 10500000,
    "futurePopulationWithoutMeasures": 9500000
  },
  {
    "region": "Aichi",
    "reason": "Aichi Prefecture benefits from its position as a major manufacturing and automotive hub. This strong industrial base and associated employment opportunities fuel consistent population growth.",
    "measures": "Promote further diversification of the economy beyond automotive manufacturing to ensure long-term resilience. Invest in education and technology to attract highly skilled professionals. Develop sustainable infrastructure to enhance quality of life.",
    "currentPopulation": 7477000,
    "futurePopulationWithMeasures": 9000000,
    "futurePopulationWithoutMeasures": 8000000
  }
]

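Running the scripts in the sections above produced the following observations:

Data format handling: With the corresponding schema (CSV schema or JSON schema), Gemini successfully processed both the CSV and the JSON structured data.
Schema effectiveness: Both schema approaches proved effective in helping Gemini understand the input data.
Variability of the reason field: Because the temperature is not zero, the exact value of the "reason" field may vary slightly from run to run, while the values of the other fields stay consistent.
Accuracy of the region predictions: In some cases the predicted regions differed from what was expected. The main point here, however, is whether Gemini understood the input data correctly, regardless of the specific regions it predicted.
These findings highlight Gemini's ability to process structured data formats effectively and to use schemas to improve its understanding of them.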
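
The observation about run-to-run variability can be checked mechanically by comparing two parsed results field by field. The following is a minimal sketch with a hypothetical helper, differing_fields. It pairs items by position and ignores any extra items when the two lists differ in length. The two small lists are made-up illustration data, not actual output.

```python
def differing_fields(run_a, run_b):
    """Return the sorted field names whose values differ between two result lists."""
    changed = set()
    for item_a, item_b in zip(run_a, run_b):
        for key in item_a.keys() | item_b.keys():
            if item_a.get(key) != item_b.get(key):
                changed.add(key)
    return sorted(changed)


# Made-up illustration data: only the wording of "reason" differs.
first = [{"region": "Tokyo", "reason": "Strong economy.", "currentPopulation": 14086000}]
second = [{"region": "Tokyo", "reason": "Robust job market.", "currentPopulation": 14086000}]
print(differing_fields(first, second))
```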

Source: https://medium.com/google-cloud/harnessing-geminis-power-a-guide-to-generating-content-from-structured-data-45080dac0bbb
