如何使用索引加速SQL查询（Python版）？

2023年08月11日由 camellia 发表 366 0

学习如何使用Python的内置sqlite3模块处理SQLite数据库。还将学习如何创建索引以加快查询速度。

假设你正在查阅一本书的内容。你希望能够更快地找到你正在寻找的信息。你会怎么做呢？你可能会查找术语索引，然后跳转到引用特定术语的页面。SQL中的索引的工作方式与书籍中的索引类似。

在大多数现实世界的系统中，你将对一个包含大量记录的数据库表运行查询（想象一下数百万条记录）。需要通过全表扫描来检索结果的查询将非常缓慢。如果你知道你经常需要根据某些列查询信息，可以在这些列上创建数据库的索引。这将大大提高查询的速度。

那么今天我们将学习如何使用Python的sqlite3模块连接到和查询SQLite数据库。我们还将学习如何添加索引，并查看它如何提高性能。

为了跟随本教程进行编码，你应该在你的工作环境中安装了Python 3.7+和SQLite。

注：本教程中的示例和样本输出适用于Ubuntu LTS 22.04上的Python 3.10和SQLite3（版本3.37.2）。

在Python中连接到数据库

我们将使用内置的sqlite3模块。在我们开始运行查询之前，我们需要：

连接到数据库
创建数据库游标以运行查询

为了连接到数据库，我们将使用以下代码：

来自 sqlite3 模块的 connect（）函数。建立连接后，我们可以调用连接对象来创建数据库游标，如下所示：

cursor()

import sqlite3



# connect to the db

db_conn = sqlite3.connect('people_db.db')

db_cursor = db_conn.cursor()

在这里，我们尝试连接到数据库people_db。

如果数据库不存在，则运行上述代码片段将为我们创建sqlite数据库。

创建表并插入记录

现在，我们将在数据库中创建一个表，并填充它的记录。

让我们在数据库中创建一个名为people的表，具有以下字段：people_db

名字
电子邮件
工作

# main.py

...

# create table

db_cursor.execute('''CREATE TABLE people (

                  id INTEGER PRIMARY KEY,

                  name TEXT,

                  email TEXT,

                  job TEXT)''')





...



# commit the transaction and close the cursor and db connection

db_conn.commit()

db_cursor.close()

db_conn.close()

使用Faker生成合成数据

我们现在必须在表中插入记录。为此，我们将使用 Faker——一个用于合成数据生成的 Python 包——可通过 pip 安装：

$ pip install faker

安装Faker后，可以将类导入到Python脚本中：Faker

# main.py

...

from faker import Faker

...

下一步是生成记录并将其插入人员表。为了让我们知道索引如何加快查询速度，让我们插入大量记录。在这里，我们将插入10万条记录；将变量设置为10000.num_records。

然后，我们执行以下操作：

通过Fakerfake实例化一个对象并设置种子，以便我们获得可重现性。

通过调用对象的first_name()和last_name()方法获取一个名字字符串。

通过调用对象的domain_name()方法生成一个虚假的域名。

使用名字和姓氏以及域生成电子邮件字段。

使用job()获取每个单独记录的作业。

我们生成记录并将其插入到表中：people

# create and insert records

fake = Faker() # be sure to import: from faker import Faker 

Faker.seed(42)



num_records = 100000



for _ in range(num_records):

    first = fake.first_name()

    last = fake.last_name()

    name = f"{first} {last}"

    domain = fake.domain_name()

    email = f"{first}.{last}@{domain}"

    job = fake.job()

    db_cursor.execute('INSERT INTO people (name, email, job) VALUES (?,?,?)', (name,email,job))



# commit the transaction and close the cursor and db connection

db_conn.commit()

db_cursor.close()

db_conn.close()

现在，main.py 文件具有以下代码：

# main.py

# imports

import sqlite3

from faker import Faker



# connect to the db

db_conn = sqlite3.connect('people_db.db')

db_cursor = db_conn.cursor()



# create table

db_cursor.execute('''CREATE TABLE people (

                  id INTEGER PRIMARY KEY,

                  name TEXT,

                  email TEXT,

                  job TEXT)''')





# create and insert records

fake = Faker()

Faker.seed(42)



num_records = 100000



for _ in range(num_records):

    first = fake.first_name()

    last = fake.last_name()

    name = f"{first} {last}"

    domain = fake.domain_name()

    email = f"{first}.{last}@{domain}"

    job = fake.job()

    db_cursor.execute('INSERT INTO people (name, email, job) VALUES (?,?,?)', (name,email,job))



# commit the transaction and close the cursor and db connection

db_conn.commit()

db_cursor.close()

db_conn.close()

运行此脚本一次，以使用记录数填充表。num_records

查询数据库

现在我们有了包含10万条记录的表，让我们对表运行一个示例查询。people

让我们运行一个查询来：

获取职位名称为“产品经理”的记录的名称和电子邮件，以及
将查询结果限制为10条记录。

我们将使用 time 模块中的默认计时器来获取查询的大致执行时间。

# sample_query.py



import sqlite3

import time



db_conn = sqlite3.connect("people_db.db")

db_cursor = db_conn.cursor()



t1 = time.perf_counter_ns()



db_cursor.execute("SELECT name, email FROM people WHERE job='Product manager' LIMIT 10;")



res = db_cursor.fetchall()

t2 = time.perf_counter_ns()



print(res)

print(f"Query time without index: {(t2-t1)/1000} us")

下面是输出：

Output >>

[

    ("Tina Woods", "Tina.Woods@smith.com"),

    ("Toni Jackson", "Toni.Jackson@underwood.com"),

    ("Lisa Miller", "Lisa.Miller@solis-west.info"),

    ("Katherine Guerrero", "Katherine.Guerrero@schmidt-price.org"),

    ("Michelle Lane", "Michelle.Lane@carr-hardy.com"),

    ("Jane Johnson", "Jane.Johnson@graham.com"),

    ("Matthew Odom", "Matthew.Odom@willis.biz"),

    ("Isaac Daniel", "Isaac.Daniel@peck.com"),

    ("Jay Byrd", "Jay.Byrd@bailey.info"),

    ("Thomas Kirby", "Thomas.Kirby@west.com"),

]



Query time without index: 448.275 us

你还可以通过在命令行运行来调用 SQLite 命令行客户端：sqlite3 db_name

$ sqlite3 people_db.db

SQLite version 3.37.2 2022-01-06 13:25:41

Enter ".help" for usage hints.

要获取索引列表，你可以运行：.index

sqlite> .index

由于当前没有索引，因此不会列出任何索引。

你还可以像这样检查查询计划：

sqlite> EXPLAIN QUERY PLAN SELECT name, email FROM people WHERE job='Product Manager' LIMIT 10;

QUERY PLAN

`--SCAN people

这里的查询计划是扫描所有效率低下的行。

在特定列上创建索引

若要在特定列上创建数据库索引，可以使用以下语法：

CREATE INDEX index-name on table (column(s))

假设我们需要经常查找具有特定职位的个人的记录。在作业列上创建索引会有所帮助：people_job_index

# create_index.py



import time

import sqlite3



db_conn = sqlite3.connect('people_db.db')



db_cursor =db_conn.cursor()



t1 = time.perf_counter_ns()



db_cursor.execute("CREATE INDEX people_job_index ON people (job)")



t2 = time.perf_counter_ns()



db_conn.commit()



print(f"Time to create index: {(t2 - t1)/1000} us")





Output >>

Time to create index: 338298.6 us

尽管创建索引需要这么长时间，但这是一次性操作。运行多个查询时，你仍将获得显著的加速。

现在，如果你在 SQLite 命令行客户端上运行，你将获得：.index

sqlite> .index

people_job_index

使用索引查询数据库

如果你现在查看查询计划，你应该能够看到我们现在使用作业列上的索引搜索表：peoplepeople_job_index

sqlite> EXPLAIN QUERY PLAN SELECT name, email FROM people WHERE job='Product manager' LIMIT 10;

QUERY PLAN

`--SEARCH people USING INDEX people_job_index (job=?)

你可以重新运行sample_query.py。仅修改语句并查看查询现在运行需要多长时间：print()

# sample_query.py



import sqlite3

import time



db_conn = sqlite3.connect("people_db.db")

db_cursor = db_conn.cursor()



t1 = time.perf_counter_ns()



db_cursor.execute("SELECT name, email FROM people WHERE job='Product manager' LIMIT 10;")



res = db_cursor.fetchall()

t2 = time.perf_counter_ns()



print(res)

print(f"Query time with index: {(t2-t1)/1000} us")

下面是输出：

Output >>

[

    ("Tina Woods", "Tina.Woods@smith.com"),

    ("Toni Jackson", "Toni.Jackson@underwood.com"),

    ("Lisa Miller", "Lisa.Miller@solis-west.info"),

    ("Katherine Guerrero", "Katherine.Guerrero@schmidt-price.org"),

    ("Michelle Lane", "Michelle.Lane@carr-hardy.com"),

    ("Jane Johnson", "Jane.Johnson@graham.com"),

    ("Matthew Odom", "Matthew.Odom@willis.biz"),

    ("Isaac Daniel", "Isaac.Daniel@peck.com"),

    ("Jay Byrd", "Jay.Byrd@bailey.info"),

    ("Thomas Kirby", "Thomas.Kirby@west.com"),

]



Query time with index: 167.179 us

我们看到查询现在大约需要167.179微秒来执行。

性能改进

对于我们的示例查询，使用索引查询的速度约为2.68倍。在执行时间上，我们得到了62.71%的百分比加速。

你也可以尝试运行更多查询：查询，并查看性能改进。

还要注意：因为我们只在作业列上创建了索引，所以如果你正在运行涉及其他列的查询，则查询的运行速度不会比没有索引的情况快。

文章来源：https://www.kdnuggets.com/2023/08/speed-sql-queries-indexes-python-edition.html

标签：

SQL查询 Python

0 评论

欢迎关注ATYUN官方公众号

商务合作及内容投稿请联系邮箱:bd@atyun.com

上一篇时间序列分析：Python中的ARIMA模型

下一篇从实验到部署：MLflow 的入门指南

评论登录

要发表评论，您必须先登录。

jonatasgrosman/wav2vec2-large-xlsr-53-english facebook/dino-vitb16 bert-base-uncased xlm-roberta-large xlm-roberta-base gpt2 microsoft/resnet-50 facebook/dino-vits8

AGENTIC AI如何塑造未来