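Introduction
In an era of rapid progress in artificial intelligence (AI), the ability to analyze and make use of large datasets is essential. Although a RAG (Retrieval-Augmented Generation) environment is often the ideal choice for such tasks, there are cases where content must be generated from a smaller dataset.
Gemini, which can process a large number of tokens, is a promising solution. By combining a prompt with an uploaded file, it can make effective use of even limited data. When the data is in a structured format such as CSV or JSON, however, it is essential to make sure the AI interprets the information correctly.
This article explores a practical approach to doing so with Python scripts. It walks through specific techniques and provides examples showing how to guide the AI to understand small structured datasets and generate content based on them.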
Flow
The flow diagram shows how content is generated using Gemini and CSV data. The steps involved are as follows:
Usage
1. Create an API key
Visit https://ai.google.dev/gemini-api/docs/api-key and create an API key. Then enable the Generative Language API in the API console. This API key is used in the scripts below.
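Rather than hard-coding the key in each script, it can be read from an environment variable. The sketch below assumes a variable named GEMINI_API_KEY; that name and the helper load_api_key are hypothetical, chosen only for illustration, and are not part of the scripts in this article.

```python
import os


def load_api_key(env_var="GEMINI_API_KEY", environ=None):
    """Return the API key stored in an environment variable.

    `env_var` is a hypothetical variable name used for illustration.
    `environ` defaults to os.environ and can be replaced when testing.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key
```

With this helper, the line api_key = "###" in the sample scripts could become api_key = load_api_key().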
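2. Sample data
This report uses the sample data shown above. Although the image shows a Google Spreadsheet, the actual tests used CSV data converted from that spreadsheet. The CSV file is named sample.csv.
The sample data comes from e-Stat, specifically columns "A", "B", and "C", which hold the year, region, and population, respectively. Although only "Hokkaido" is visible in the image, the actual data covers all prefectures. The data has 2,303 rows and 3 columns and is used in CSV format by the scripts below.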
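Before uploading, it can be worth confirming that the CSV file really has the expected shape. The following is a minimal sketch using only the standard library. The helper summarize_csv is a hypothetical name, the header names are assumed from the CSV schema used later in this article, and the three data rows reuse the 1975 values that appear in the JSON example further down.

```python
import csv
import io


def summarize_csv(text):
    """Return (row_count, column_count, header) for CSV text.

    The row count includes the header row.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        raise ValueError("The CSV data is empty.")
    widths = {len(r) for r in rows}
    if len(widths) != 1:
        raise ValueError(f"Rows have inconsistent column counts: {sorted(widths)}")
    return len(rows), widths.pop(), rows[0]


# Small stand-in for sample.csv (header names are an assumption).
sample = (
    "Year,Region,Population\n"
    "1975,Hokkaido,5338206\n"
    "1975,Aomori,1468646\n"
    "1975,Iwate,1385563\n"
)
print(summarize_csv(sample))  # (4, 3, ['Year', 'Region', 'Population'])
```

For the real file, the text would come from reading sample.csv, and the column count should be 3.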
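3. Main script
This is a Python script.
It is the main class used to test the sample scripts below. Create a file named GenerateContent.py containing the following script. The sample scripts below import this GenerateContent.py file.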
import google.generativeai as genai
import io
import json
import requests
import time


class Main:
    def __init__(self):
        self.genai = None
        self.model = None
        self.api_key = None

    def run(self, params):
        self.api_key = params["api_key"]
        self._setInstances(params)

        print("Get file...")
        file = self._uploadFile(params["name"], params["data"])

        print("Generate content...")
        response = self.model.generate_content(
            [file, params["prompt"]], request_options={"timeout": 600}
        )

        # Return parsed JSON when possible; otherwise return the raw text.
        try:
            data = json.loads(response.text)
        except json.JSONDecodeError:
            data = response.text
        return data

    def _setInstances(self, params):
        genai.configure(api_key=self.api_key)
        # Ask for JSON output; attach a response schema when one is given.
        generation_config = {"response_mime_type": "application/json"}
        if "response_schema" in params:
            generation_config["response_schema"] = params["response_schema"]
        self.genai = genai
        self.model = genai.GenerativeModel(
            model_name="gemini-1.5-flash-002",  # or gemini-1.5-pro-002
            generation_config=generation_config,
        )

    def _uploadFile(self, name, text):
        file = None
        try:
            # Reuse the file if it has already been uploaded under this name.
            file = genai.get_file(f"files/{name}")
        except Exception:
            # Otherwise, upload the text with a multipart request.
            requests.post(
                f"https://generativelanguage.googleapis.com/upload/v1beta/files?uploadType=multipart&key={self.api_key}",
                files={
                    "data": (
                        "metadata",
                        json.dumps(
                            {
                                "file": {
                                    "mimeType": "text/plain",
                                    "name": f"files/{name}",
                                }
                            }
                        ),
                        "application/json",
                    ),
                    "file": ("file", io.StringIO(text), "text/plain"),
                },
            )
            time.sleep(2)
            file = genai.get_file(f"files/{name}")
            print("File was uploaded.")

        # Wait until the uploaded file has been processed.
        while file.state.name == "PROCESSING":
            print(".", end="")
            time.sleep(10)
            file = genai.get_file(file.name)
        if file.state.name == "FAILED":
            raise ValueError(file.state.name)
        return file
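The run method above tries to parse the response text as JSON and falls back to the raw text when parsing fails. That fallback can be looked at in isolation, as in the sketch below; the function name parse_response_text is illustrative and is not part of the class.

```python
import json


def parse_response_text(text):
    """Return the decoded JSON value if `text` is valid JSON, else the text itself."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text


print(parse_response_text('[{"region": "Tokyo"}]'))  # parsed into a list with one dict
print(parse_response_text("not json"))  # returned unchanged as a string
```

A caller can therefore check whether the result is a list or a str to tell the two cases apart.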
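Samples
Pattern 1
In this pattern, the CSV data is used directly. To help Gemini understand the CSV data, I used a CSV schema. To export the result as JSON data, I also used a JSON schema. You can see csvSchema and jsonSchema in the script below, and the prompt is built in the variable prompt.
The function createData returns the raw CSV data.
In this script, the JSON schema for the output is included in the prompt.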
import GenerateContent
import json

api_key = "###"  # Please set your API key.
filename = "sample.csv"  # Please set your CSV file with the path.


def createPrompt():
    # Describes the columns of the uploaded CSV data.
    csvSchema = {
        "description": 'Order of "fields" is the order of columns of CSV data.\n"name" is the column name.\n"type" is the type of value in the column.',
        "fields": [
            {"name": "Year", "type": "number"},
            {"name": "Region", "type": "string"},
            {"name": "Population", "type": "number"},
        ],
    }
    # Describes the JSON structure expected in the response.
    jsonSchema = {
        "description": "JSON schema for outputting the result.",
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "region": {"description": "Region name.", "type": "string"},
                "reason": {
                    "description": "Reasons for the population increase.",
                    "type": "string",
                },
                "measures": {
                    "description": "Details of measures to keep the population increasing.",
                    "type": "string",
                },
                "currentPopulation": {
                    "description": "Current population.",
                    "type": "number",
                },
                "futurePopulationWithMeasures": {
                    "description": "Future population after 50 years with measures to keep the population increasing.",
                    "type": "number",
                },
                "futurePopulationWithoutMeasures": {
                    "description": "Future population after 50 years without measures to keep the population increasing.",
                    "type": "number",
                },
            },
            "required": [
                "region",
                "reason",
                "measures",
                "currentPopulation",
                "futurePopulationWithMeasures",
                "futurePopulationWithoutMeasures",
            ],
        },
    }
    prompt = "\n".join(
        [
            "Run the following steps.",
            '1. Read the CSV data in the following text file. The CSV schema of this data is "CSVSchema".',
            f"<CSVSchema>{json.dumps(csvSchema)}</CSVSchema>",
            "2. Using the data collected and your knowledge, predict 3 regions that will have the largest increase in population in the future in the order of increase. Return the region name, detailed reasons for the increase, and measures to keep the population increasing by considering the features of the region. Also, return the current population and the population 50 years later predicted by you with and without measures to keep the population increasing.",
            '3. Return the result by following "JSONSchema".',
            f"<JSONSchema>{json.dumps(jsonSchema)}</JSONSchema>",
        ]
    )
    return prompt


def createData(filename):
    # Return the raw CSV text.
    with open(filename, "r") as f:
        return f.read()


data = createData(filename)
prompt = createPrompt()
params = {
    "api_key": api_key,
    "name": "sample-name-1a",
    "data": data,
    "prompt": prompt,
}
res = GenerateContent.Main().run(params)
print(res)
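Because the JSON schema in this script is only part of the prompt text, the response may not always contain every field. A minimal post-check is sketched below. The helper missing_fields is a hypothetical name, and the key list is copied from the "required" entry of jsonSchema above.

```python
REQUIRED_KEYS = [
    "region",
    "reason",
    "measures",
    "currentPopulation",
    "futurePopulationWithMeasures",
    "futurePopulationWithoutMeasures",
]


def missing_fields(result, required=REQUIRED_KEYS):
    """Return {item_index: [missing keys]} for items that lack required keys.

    Raises TypeError when `result` is not a list, for example when the
    response could not be parsed as JSON and came back as plain text.
    """
    if not isinstance(result, list):
        raise TypeError("Expected a JSON array of objects.")
    problems = {}
    for index, item in enumerate(result):
        if not isinstance(item, dict):
            problems[index] = list(required)
            continue
        missing = [key for key in required if key not in item]
        if missing:
            problems[index] = missing
    return problems
```

An empty dictionary means every item carries all required keys; anything else points at the items worth inspecting or regenerating.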
In this script, the JSON schema for the output is passed through response_schema.
import GenerateContent
import json

api_key = "###"  # Please set your API key.
filename = "sample.csv"  # Please set your CSV file with the path.


def createPrompt():
    # Describes the columns of the uploaded CSV data.
    csvSchema = {
        "description": 'Order of "fields" is the order of columns of CSV data.\n"name" is the column name.\n"type" is the type of value in the column.',
        "fields": [
            {"name": "Year", "type": "number"},
            {"name": "Region", "type": "string"},
            {"name": "Population", "type": "number"},
        ],
    }
    prompt = "\n".join(
        [
            "Run the following steps.",
            '1. Read the CSV data in the following text file. The CSV schema of this data is "CSVSchema".',
            f"<CSVSchema>{json.dumps(csvSchema)}</CSVSchema>",
            "2. Using the data collected and your knowledge, predict 3 regions that will have the largest increase in population in the future in the order of increase. Return the region name, detailed reasons for the increase, and measures to keep the population increasing by considering the features of the region. Also, return the current population and the population 50 years later predicted by you with and without measures to keep the population increasing.",
        ]
    )
    return prompt


def createData(filename):
    # Return the raw CSV text.
    with open(filename, "r") as f:
        return f.read()


# JSON schema for the output, passed as response_schema instead of in the prompt.
jsonSchema = {
    "description": "JSON schema for outputting the result.",
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "region": {"description": "Region name.", "type": "string"},
            "reason": {
                "description": "Reasons for the population increase.",
                "type": "string",
            },
            "measures": {
                "description": "Details of measures to keep the population increasing.",
                "type": "string",
            },
            "currentPopulation": {
                "description": "Current population.",
                "type": "number",
            },
            "futurePopulationWithMeasures": {
                "description": "Future population after 50 years with measures to keep the population increasing.",
                "type": "number",
            },
            "futurePopulationWithoutMeasures": {
                "description": "Future population after 50 years without measures to keep the population increasing.",
                "type": "number",
            },
        },
        "required": [
            "region",
            "reason",
            "measures",
            "currentPopulation",
            "futurePopulationWithMeasures",
            "futurePopulationWithoutMeasures",
        ],
    },
}

data = createData(filename)
prompt = createPrompt()
params = {
    "api_key": api_key,
    "name": "sample-name-1b",
    "data": data,
    "prompt": prompt,
    "response_schema": jsonSchema,
}
res = GenerateContent.Main().run(params)
print(res)
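The CSV schema in the prompt declares a name and a type for each column, so the claim it makes about the data can be checked locally before the file is uploaded. The sketch below is an illustration under assumptions: check_rows is a hypothetical helper, the CSV is assumed to start with a header row, and "number" is interpreted as any value that float() accepts. It does not check the column count.

```python
import csv
import io

# Column description copied from csvSchema in the scripts above.
CSV_FIELDS = [
    {"name": "Year", "type": "number"},
    {"name": "Region", "type": "string"},
    {"name": "Population", "type": "number"},
]


def _is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def check_rows(text, fields=CSV_FIELDS, has_header=True):
    """Return (line_number, column_name, value) for cells that are not numeric
    although the schema declares the column as "number"."""
    rows = list(csv.reader(io.StringIO(text)))
    start = 1 if has_header else 0
    errors = []
    for line_no, row in enumerate(rows[start:], start=start + 1):
        for field, value in zip(fields, row):
            if field["type"] == "number" and not _is_number(value):
                errors.append((line_no, field["name"], value))
    return errors
```

An empty list means every numeric column parsed cleanly; each tuple otherwise identifies the line, the column, and the offending value.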
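Pattern 2
In this pattern, the CSV data is converted to JSON data before use. To help Gemini understand the JSON data, I used a JSON schema. To export the result as JSON data, I also used a JSON schema. You can see jsonSchema1 and jsonSchema2 in the script below, and the prompt is built in the variable prompt.
The function createData returns JSON data converted from the CSV data, as follows.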
[
  {"region": "Hokkaido", "populations": [{"year": "1975", "population": "5338206"}, ...]},
  {"region": "Aomori", "populations": [{"year": "1975", "population": "1468646"}, ...]},
  {"region": "Iwate", "populations": [{"year": "1975", "population": "1385563"}, ...]},
  ...
]
In this script, the JSON schema for the output is included in the prompt.
import GenerateContent
import csv
import json

api_key = "###"  # Please set your API key.
filename = "sample.csv"  # Please set your CSV file with the path.


def createPrompt():
    # Describes the JSON data that is uploaded as the input.
    jsonSchema1 = {
        "description": 'JSON schema of the inputted value. The filename is "blobName@sample.txt".',
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "region": {"description": "Region name.", "type": "string"},
                "populations": {
                    "description": "Populations for each year.",
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "year": {"type": "number", "description": "Year."},
                            "population": {
                                "type": "number",
                                "description": "Population.",
                            },
                        },
                        "required": ["year", "population"],
                    },
                },
            },
            "required": ["region", "populations"],
        },
    }
    # Describes the JSON structure expected in the response.
    jsonSchema2 = {
        "description": "JSON schema for outputting the result.",
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "region": {"description": "Region name.", "type": "string"},
                "reason": {
                    "description": "Reasons for the population increase.",
                    "type": "string",
                },
                "measures": {
                    "description": "Details of measures to keep the population increasing.",
                    "type": "string",
                },
                "currentPopulation": {
                    "description": "Current population.",
                    "type": "number",
                },
                "futurePopulationWithMeasures": {
                    "description": "Future population after 50 years with measures to keep the population increasing.",
                    "type": "number",
                },
                "futurePopulationWithoutMeasures": {
                    "description": "Future population after 50 years without measures to keep the population increasing.",
                    "type": "number",
                },
            },
            "required": [
                "region",
                "reason",
                "measures",
                "currentPopulation",
                "futurePopulationWithMeasures",
                "futurePopulationWithoutMeasures",
            ],
        },
    }
    prompt = "\n".join(
        [
            "Run the following steps.",
            '1. Read the JSON data in the following text file. The JSON schema of this data is "JSONSchema1".',
            f"<JSONSchema1>{json.dumps(jsonSchema1)}</JSONSchema1>",
            "2. Using the data collected and your knowledge, predict 3 regions that will have the largest increase in population in the future in the order of increase. Return the region name, detailed reasons for the increase, and measures to keep the population increasing by considering the features of the region. Also, return the current population and the population 50 years later predicted by you with and without measures to keep the population increasing.",
            '3. Return the result by following "JSONSchema2".',
            f"<JSONSchema2>{json.dumps(jsonSchema2)}</JSONSchema2>",
        ]
    )
    return prompt


def createData(filename):
    # Read the CSV rows, skipping the header row.
    with open(filename, "r", newline="") as f:
        ar = list(csv.reader(f, delimiter=","))[1:]
    # Group the rows by region. csv.reader yields strings, so the year and
    # population values stay strings here.
    obj = {}
    for r in ar:
        year, region, population = r
        v = {"year": year, "population": population}
        obj.setdefault(region, []).append(v)
    arr = [{"region": k, "populations": v} for (k, v) in obj.items()]
    return json.dumps(arr)


data = createData(filename)
prompt = createPrompt()
params = {
    "api_key": api_key,
    "name": "sample-name-2a",
    "data": data,
    "prompt": prompt,
}
res = GenerateContent.Main().run(params)
print(res)
In this script, the JSON schema for the output is passed through response_schema.
import GenerateContent
import csv
import json

api_key = "###"  # Please set your API key.
filename = "sample.csv"  # Please set your CSV file with the path.


def createPrompt():
    # Describes the JSON data that is uploaded as the input.
    jsonSchema1 = {
        "description": 'JSON schema of the inputted value. The filename is "blobName@sample.txt".',
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "region": {"description": "Region name.", "type": "string"},
                "populations": {
                    "description": "Populations for each year.",
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "year": {"type": "number", "description": "Year."},
                            "population": {
                                "type": "number",
                                "description": "Population.",
                            },
                        },
                        "required": ["year", "population"],
                    },
                },
            },
            "required": ["region", "populations"],
        },
    }
    prompt = "\n".join(
        [
            "Run the following steps.",
            '1. Read the JSON data in the following text file. The JSON schema of this data is "JSONSchema1".',
            f"<JSONSchema1>{json.dumps(jsonSchema1)}</JSONSchema1>",
            "2. Using the data collected and your knowledge, predict 3 regions that will have the largest increase in population in the future in the order of increase. Return the region name, detailed reasons for the increase, and measures to keep the population increasing by considering the features of the region. Also, return the current population and the population 50 years later predicted by you with and without measures to keep the population increasing.",
        ]
    )
    return prompt


def createData(filename):
    # Read the CSV rows, skipping the header row.
    with open(filename, "r", newline="") as f:
        ar = list(csv.reader(f, delimiter=","))[1:]
    # Group the rows by region. csv.reader yields strings, so the year and
    # population values stay strings here.
    obj = {}
    for r in ar:
        year, region, population = r
        v = {"year": year, "population": population}
        obj.setdefault(region, []).append(v)
    arr = [{"region": k, "populations": v} for (k, v) in obj.items()]
    return json.dumps(arr)


# JSON schema for the output, passed as response_schema instead of in the prompt.
jsonSchema2 = {
    "description": "JSON schema for outputting the result.",
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "region": {"description": "Region name.", "type": "string"},
            "reason": {
                "description": "Reasons for the population increase.",
                "type": "string",
            },
            "measures": {
                "description": "Details of measures to keep the population increasing.",
                "type": "string",
            },
            "currentPopulation": {
                "description": "Current population.",
                "type": "number",
            },
            "futurePopulationWithMeasures": {
                "description": "Future population after 50 years with measures to keep the population increasing.",
                "type": "number",
            },
            "futurePopulationWithoutMeasures": {
                "description": "Future population after 50 years without measures to keep the population increasing.",
                "type": "number",
            },
        },
        "required": [
            "region",
            "reason",
            "measures",
            "currentPopulation",
            "futurePopulationWithMeasures",
            "futurePopulationWithoutMeasures",
        ],
    },
}

data = createData(filename)
prompt = createPrompt()
params = {
    "api_key": api_key,
    "name": "sample-name-2b",
    "data": data,
    "prompt": prompt,
    "response_schema": jsonSchema2,
}
res = GenerateContent.Main().run(params)
print(res)
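jsonSchema1 declares year and population as numbers, whereas csv.reader yields strings, so createData above emits values such as "1975" in quotes. If matching the declared types matters, the values can be converted while grouping. The following is a sketch of that variant rather than the method used in this article: create_typed_data is a hypothetical name, the toy rows are made up for illustration, and every year and population is assumed to be a plain integer.

```python
import csv
import io
import json


def create_typed_data(csv_text):
    """Group CSV rows by region, converting year and population to int."""
    rows = list(csv.reader(io.StringIO(csv_text)))[1:]  # skip the header row
    grouped = {}
    for year, region, population in rows:
        grouped.setdefault(region, []).append(
            {"year": int(year), "population": int(population)}
        )
    return json.dumps([{"region": k, "populations": v} for k, v in grouped.items()])


# Toy rows, made up for illustration only (not taken from sample.csv).
toy = "Year,Region,Population\n2000,A,100\n2001,A,110\n2000,B,50\n"
print(create_typed_data(toy))
```

int() raises ValueError on blank or formatted values such as "1,000", so real data may need cleaning first.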
Result
[
  {
    "region": "Tokyo",
    "reason": "Tokyo's robust economy, diverse job market, and well-established infrastructure continue to attract both domestic and international migrants. Its status as a global hub for business and culture ensures ongoing population growth.",
    "measures": "Invest in affordable housing, improve public transportation, enhance green spaces and recreational facilities to improve quality of life, and continue promoting Tokyo as a global center for innovation and opportunity.",
    "currentPopulation": 14086000,
    "futurePopulationWithMeasures": 16000000,
    "futurePopulationWithoutMeasures": 15000000
  },
  {
    "region": "Osaka",
    "reason": "Osaka is a major economic center with a strong industrial base and a thriving service sector. Its vibrant culture and relatively lower cost of living compared to Tokyo attract individuals seeking opportunities.",
    "measures": "Focus on attracting skilled workers and entrepreneurs by offering tax incentives and streamlining business regulations. Improve affordable housing options and educational facilities. Promote Osaka's cultural attractions to attract tourists and residents.",
    "currentPopulation": 8763000,
    "futurePopulationWithMeasures": 10500000,
    "futurePopulationWithoutMeasures": 9500000
  },
  {
    "region": "Aichi",
    "reason": "Aichi Prefecture benefits from its position as a major manufacturing and automotive hub. This strong industrial base and associated employment opportunities fuel consistent population growth.",
    "measures": "Promote further diversification of the economy beyond automotive manufacturing to ensure long-term resilience. Invest in education and technology to attract highly skilled professionals. Develop sustainable infrastructure to enhance quality of life.",
    "currentPopulation": 7477000,
    "futurePopulationWithMeasures": 9000000,
    "futurePopulationWithoutMeasures": 8000000
  }
]
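The three population figures returned for each region are easier to compare as relative growth. The sketch below derives growth percentages and the difference attributed to the measures from a result shaped like the one above. The helper growth_summary is a hypothetical name, and the numbers are copied from the result.

```python
def growth_summary(items):
    """Return growth relative to the current population for each region."""
    summary = []
    for item in items:
        current = item["currentPopulation"]
        with_m = item["futurePopulationWithMeasures"]
        without_m = item["futurePopulationWithoutMeasures"]
        summary.append(
            {
                "region": item["region"],
                "growth_with_measures_pct": round((with_m - current) / current * 100, 1),
                "growth_without_measures_pct": round((without_m - current) / current * 100, 1),
                "effect_of_measures": with_m - without_m,
            }
        )
    return summary


# Numeric fields copied from the result above.
result = [
    {"region": "Tokyo", "currentPopulation": 14086000,
     "futurePopulationWithMeasures": 16000000, "futurePopulationWithoutMeasures": 15000000},
    {"region": "Osaka", "currentPopulation": 8763000,
     "futurePopulationWithMeasures": 10500000, "futurePopulationWithoutMeasures": 9500000},
    {"region": "Aichi", "currentPopulation": 7477000,
     "futurePopulationWithMeasures": 9000000, "futurePopulationWithoutMeasures": 8000000},
]
for row in growth_summary(result):
    print(row)
```

In this result, the gap between the two predictions is exactly 1,000,000 for every region, which suggests the model produced round estimates rather than values derived from the data.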
When the scripts in the sections above were run, the following results were observed: