Automated Text Data Extraction and Form-Filling System

Published by alex on October 20, 2023

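In this article, we demonstrate a system that extracts information to speed up the home/apartment insurance quoting process. Instead of having prospective customers fill in every required field on the website by hand, we can develop an algorithm that lets them upload their rental/lease agreement or other legal document, from which our system captures the relevant details and fills in the form automatically.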

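I. Project Setup

The project breaks down into three main phases: (1) data extraction; (2) retrieval of the relevant information; (3) form filling. Of these, the second is the difficult one, while the other two are fairly easy with today's technology. This article therefore focuses mainly on that phase.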

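II. Implementation

1. Text Data Extraction

There are many OCR packages for text extraction, but our goal is to choose one that covers every type of input format.

Ranging from unstructured to structured input, I can think of four formats in which prospective customers might upload their legal documents: text-based PDF, image-based PDF, image, or plain PDF format (which is rare).

To extract text from these formats, I recommend the PyPDF2 package in Python. Note that PyPDF2 reads the text layer embedded in a PDF; image-based input has no such layer and needs a separate OCR step first.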

# Import library
from PyPDF2 import PdfReader
# Open the PDF file (PdfReader also accepts a file path directly)
pdf_file = PdfReader("data/sample.pdf")
# Read all the pages in the PDF
pages = list(pdf_file.pages)
# Join all the pages into a single string (a page with no extractable text contributes an empty string)
text = '\n'.join([page.extract_text() or "" for page in pages])

In this example, all of the text is stored in a single string named "text".
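Scanned pages usually come back with little or no text, so it can help to detect them before moving on. The following is a minimal sketch under that assumption; the helper name `pages_needing_ocr` and the `min_chars` threshold are illustrative and not part of the original pipeline.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return the indices of pages whose extracted text is too short to be useful.

    page_texts: one string (or None) per page, e.g. the results of
    page.extract_text(). min_chars is an arbitrary, illustrative threshold.
    """
    flagged = []
    for index, page_text in enumerate(page_texts):
        stripped = (page_text or "").strip()
        if len(stripped) < min_chars:
            flagged.append(index)
    return flagged


# Hypothetical input: the first page has a text layer, the second is a scan
sample_pages = [
    "RESIDENTIAL TENANCY AGREEMENT between the landlord and the tenant",
    "",
]
print(pages_needing_ocr(sample_pages))
```

Pages flagged this way could then be routed to whichever OCR tool the team prefers, while the rest go straight to the information retrieval phase.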

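2. Information Retrieval

Suppose we only need five elements from this demo data: first_name, last_name, address, phone, and date_of_birth.

a. Regular Expression Approach

This traditional approach has several drawbacks: it only works on structured input data, and the list of keywords to search for can be never-ending.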

# Import Regular Expression
import re
# Create empty lists to store our data
first_names = []
last_names = []
addresses = []
phones = []
dates_of_birth = []
# Define a function to capture the information from text file using Regular Expression
def extract_info_1(text, first_names, last_names, addresses, phones, dates_of_birth):
    # Keywords that may precede each piece of information in the document
    Address_keys = ["Location", "Located at", "Address", "Residence", "Premises", "Residential address"]
    DOB_keys = ["Born on", "DOB", "Birth date", "Date of birth"]
    Phone_keys = ["Phone", "Telephone", "Contact number", "Call at", "Phone number", "Mobile number"]
    First_Name_keys = ["First name", "Given name", "First", "Given", "Tenant", "First and last name"]
    Last_Name_keys = ["Last name", "Family name", "Surname", "Last"]
    # Pair each keyword list with the list that collects its matches
    targets = [
        (Address_keys, addresses),
        (DOB_keys, dates_of_birth),
        (Phone_keys, phones),
        (First_Name_keys, first_names),
        (Last_Name_keys, last_names),
    ]
    for keywords, results in targets:
        for keyword in keywords:
            # Capture everything after "keyword:" up to the end of the line
            matches = re.findall(re.escape(keyword) + r"\s*:\s*(.*)", text, re.IGNORECASE)
            results.extend(matches)
# Apply function
extract_info_1(text, first_names, last_names, addresses, phones, dates_of_birth)

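To capture all of the information correctly, we only need to keep each keyword list up to date; the function then captures everything that follows those keywords.

For example, a customer's address may follow any of these keywords: Address_keys = ["Location", "Located at", "Address", "Residence", "Premises", "Residential address", ...]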
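To make the keyword idea concrete, here is a small self-contained sketch. The sample text and the helper name `find_after_keywords` are made up for illustration; the matching rule (take whatever follows "keyword:" on the same line) is the one described above.

```python
import re

# Hypothetical snippet standing in for text extracted from a rental agreement
sample_text = """Tenant: John Smith
Address: 123 Main St, Toronto, ON, M4B 1B3
Phone number: (123) 456-7890
Date of birth: 1990-05-17"""


def find_after_keywords(text, keywords):
    """Collect whatever follows 'keyword:' on the same line, for each keyword."""
    found = []
    for keyword in keywords:
        pattern = re.escape(keyword) + r"\s*:\s*(.*)"
        found.extend(re.findall(pattern, text, re.IGNORECASE))
    return found


print(find_after_keywords(sample_text, ["Address", "Located at"]))
print(find_after_keywords(sample_text, ["Phone number"]))
print(find_after_keywords(sample_text, ["Surname"]))
```

The last call returns an empty list, which illustrates the drawback mentioned earlier: a field is missed whenever the document uses a keyword that is not yet in the list.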

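b. Rule-Based Approach

This is also a traditional approach, but it can be quite useful because we may know the pattern of each piece of information.

For example, since our customers are located in Canada, their phone numbers are likely to follow one of these patterns:

(123) 456-7890, 123-456-7890, 123 456 7890, 1234567890, or any of these prefixed with +1.

We can therefore apply rules like these to each element in order to extract it.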

# Import Regular Expression
import re
# Define a function to capture the information from text file using Regular Expression
def extract_info_2(text, first_names, last_names, addresses, phones, dates_of_birth):
    # RULES
    # Address pattern
    # example: 123 Main St, Toronto, ON, M4B 1B3
    address_pattern = re.compile(r"""
        # Match the street address, which may include house/apartment number, street name, and street type
        (\d+\s+\b[A-Z][a-z]+\b\s+\b[A-Z][a-z]+\b)
        \s*,\s*
        # Match the city, which should start with an uppercase letter
        ([A-Z][a-z]+)\s*,\s*
        # Match the province abbreviation (only ON, Ontario, in this demo) and an optional comma, as in the example above
        (ON)\s*,?\s*
        # Match the postal code, which should be formatted as A1A 1A1
        (\b[A-Z]\d[A-Z]\s?\d[A-Z]\d\b)
        """, re.VERBOSE)
    # Phone number pattern
    # example: (123) 456-7890 or 123-456-7890 or 1234567890 or these phones with +1
    phone_pattern = re.compile(r"""
        (?:
            # match a phone number starting with the country code (1)
            1\s*[\.-]?\s*\(?(\d{3})\)?[\.-]?\s*(\d{3})[\.-]?\s*(\d{4})
            |
            # match a phone number without the country code
            \(?(\d{3})\)?[\.-]?\s*(\d{3})[\.-]?\s*(\d{4})
        )
        """, re.VERBOSE)
    dob_pattern = re.compile(r"""
        # Match dates in the format YYYY-MM-DD
        (\d{4})-(\d{2})-(\d{2})
        |
        # Match dates in the format MTH DD, YYYY
        (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(\d{1,2}),\s+(\d{4})
        |
        # Match dates in the format Month DD YYYY
        (January|February|March|April|May|June|July|August|September|October|November|December)\s+(\d{1,2})\s+(\d{4})
        """, re.VERBOSE)
    # Extract information using the new regex patterns
    address_matches = address_pattern.findall(text)
    if address_matches:
        addresses.extend(address_matches)
    phone_matches = phone_pattern.findall(text)
    if phone_matches:
        phones.extend(["-".join(filter(None, match)) for match in phone_matches])
    dob_matches = dob_pattern.findall(text)
    if dob_matches:
        dates_of_birth.extend(["-".join(filter(None, match)) for match in dob_matches])

# Apply function
extract_info_2(text, first_names, last_names, addresses, phones, dates_of_birth)

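Explanation of the code:

The code above defines three regular expression patterns (address_pattern, phone_pattern, and dob_pattern), with comments that explain their purpose and give examples of the expected input format. Let's go through each pattern:

Address pattern (address_pattern):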

    address_pattern = re.compile(r"""
        # Match the street address, which may include house/apartment number, street name, and street type
        (\d+\s+\b[A-Z][a-z]+\b\s+\b[A-Z][a-z]+\b)
        \s*,\s*
        # Match the city, which should start with an uppercase letter
        ([A-Z][a-z]+)\s*,\s*
        # Match the province abbreviation (only ON, Ontario, in this demo) and an optional comma, as in the example above
        (ON)\s*,?\s*
        # Match the postal code, which should be formatted as A1A 1A1
        (\b[A-Z]\d[A-Z]\s?\d[A-Z]\d\b)
        """, re.VERBOSE)

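This regular expression matches street addresses in a specific format, such as "123 Main St, Toronto, ON, M4B 1B3".

It breaks the address into several parts:

House or apartment number: \d+
Street name (two words, each starting with an uppercase letter followed by lowercase letters): \b[A-Z][a-z]+\b
City (starts with an uppercase letter): [A-Z][a-z]+
Province abbreviation (for example, "ON" for Ontario): (ON)
Postal code (formatted as A1A 1A1): \b[A-Z]\d[A-Z]\s?\d[A-Z]\d\b
\s* matches optional whitespace, and \s*,\s* matches the commas that separate these components.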

Phone number pattern (phone_pattern):

    # Phone number pattern
    # example: (123) 456-7890 or 123-456-7890 or 1234567890 or these phones with +1
    phone_pattern = re.compile(r"""
        (?:
            # match a phone number starting with the country code (1)
            1\s*[\.-]?\s*\(?(\d{3})\)?[\.-]?\s*(\d{3})[\.-]?\s*(\d{4})
            |
            # match a phone number without the country code
            \(?(\d{3})\)?[\.-]?\s*(\d{3})[\.-]?\s*(\d{4})
        )
        """, re.VERBOSE)

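It matches phone numbers in the following formats:

With the country code (for example, +1 or 1): 1\s*[\.-]?\s*\(?(\d{3})\)?[\.-]?\s*(\d{3})[\.-]?\s*(\d{4})
Without the country code: \(?(\d{3})\)?[\.-]?\s*(\d{3})[\.-]?\s*(\d{4})
The two (\d{3}) groups capture the area code and the three-digit prefix, and (\d{4}) captures the line number.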
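As a quick check of these rules, the sketch below runs a simplified variant of the phone pattern over the formats listed earlier and normalizes each hit. The pattern is an assumption written for this example (it folds the two alternatives into one and also consumes an optional leading "+"), and the helper name `normalize_phones` is hypothetical.

```python
import re

# Simplified variant of phone_pattern: one alternative with an optional country code
phone_pattern = re.compile(r"""
    (?:\+?1\s*[\.-]?\s*)?     # optional country code such as +1 or 1-
    \(?(\d{3})\)?[\.-]?\s*    # area code, with optional parentheses
    (\d{3})[\.-]?\s*          # three-digit prefix
    (\d{4})                   # four-digit line number
    """, re.VERBOSE)


def normalize_phones(text):
    """Return every phone number found in text in the form AAA-PPP-LLLL."""
    return ["-".join(groups) for groups in phone_pattern.findall(text)]


samples = "(123) 456-7890, 123-456-7890, 123 456 7890, 1234567890, +1 123 456 7890"
print(normalize_phones(samples))
```

All five spellings normalize to the same string, which is what makes the later frequency-based ranking of candidates meaningful.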

Date of birth pattern (dob_pattern):

    dob_pattern = re.compile(r"""
        # Match dates in the format YYYY-MM-DD
        (\d{4})-(\d{2})-(\d{2})
        |
        # Match dates in the format MTH DD, YYYY
        (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(\d{1,2}),\s+(\d{4})
        |
        # Match dates in the format Month DD YYYY
        (January|February|March|April|May|June|July|August|September|October|November|December)\s+(\d{1,2})\s+(\d{4})
        """, re.VERBOSE)

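This regular expression matches dates of birth in several formats: YYYY-MM-DD, MTH DD, YYYY, and Month DD YYYY.

It consists of three main groups separated by | (alternation):

Dates in the format YYYY-MM-DD: (\d{4})-(\d{2})-(\d{2})
Dates in the format MTH DD, YYYY (for example, Jan 1, 2023): (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+(\d{1,2}),\s+(\d{4})
Dates in the format Month DD YYYY (for example, January 1 2023): (January|February|March|April|May|June|July|August|September|October|November|December)\s+(\d{1,2})\s+(\d{4})
\d{4}, \d{2}, and \d{1,2} match the year, month, and day components, respectively.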
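Because the three date formats differ, joining the captured groups with hyphens gives strings such as "Jan-23-1994" for the month-name formats. One possible way to bring every match to a single YYYY-MM-DD form is sketched below; `to_iso_date` and `DATE_FORMATS` are hypothetical names, and the sketch assumes English month names and the default C/English locale.

```python
from datetime import datetime

# strptime formats corresponding to the three alternatives in dob_pattern
DATE_FORMATS = ["%Y-%m-%d", "%b %d, %Y", "%B %d %Y"]


def to_iso_date(raw):
    """Convert a matched date string to YYYY-MM-DD, or return None if no format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(to_iso_date("1994-01-23"))
print(to_iso_date("Jan 23, 1994"))
print(to_iso_date("January 23 1994"))
```

A side benefit of parsing with `strptime` is that impossible dates (for example, month 13) are rejected instead of being passed on to the form.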

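Address extraction (address_pattern):

The address_pattern regular expression is designed to find complete street addresses in a specific format, made up of several components (street number and name, city, province, and postal code).

findall(text) returns a list of all matches found in the input text.

If any matching addresses are found, the code appends them all to the addresses list (each match is a tuple of its captured components).

Phone number extraction (phone_pattern):

The phone_pattern regular expression finds phone numbers in various formats, with or without the country code.

findall(text) returns a list of all phone number matches found in the input text.

To keep the format consistent, the code processes each match by joining its components (area code, prefix, line number) with hyphens (-).

These formatted phone numbers are then appended to the phones list.

Date of birth extraction (dob_pattern):

The dob_pattern regular expression finds dates of birth in different formats (YYYY-MM-DD, MTH DD, YYYY, and Month DD YYYY).

findall(text) returns a list of all date-of-birth matches found in the input text.

The code processes each match by joining its components with hyphens (-).

These formatted dates of birth are then appended to the dates_of_birth list.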

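c. NER (Named Entity Recognition) Model

spaCy's NER (Named Entity Recognition) model classifies entities into types in order to identify and label the different kinds of named entities in a text. The entity types it recognizes include, but are not limited to:

PERSON: names of people or individuals.
ORG: names of organizations, companies, or institutions.
GPE: geopolitical entities, such as countries, cities, and states.
LOC: non-geopolitical locations, such as natural landmarks or bodies of water.
DATE: dates expressed in various formats (for example, January 1, 2023 or 2023-01-01).
TIME: times expressed in various formats (for example, 3:00 PM or 15:00:00).
MONEY: monetary values, including the currency symbol (such as "$100" or "€50").
PERCENT: percentage values (for example, "10%" or "50%").
QUANTITY: measurements or quantities (for example, "5 kg" or "10 meters").
ORDINAL: ordinal numbers (for example, "first", "second", "third").
CARDINAL: cardinal numbers (such as "one", "two", "three").
PRODUCT: names of products, such as brand names or specific items.
EVENT: names of events, conferences, or occasions.
LANGUAGE: names of languages or language-related terms.
LAW: legal references, such as laws or regulations.
WORK_OF_ART: titles of works of art, books, or musical pieces.
NORP: nationalities or religious or political groups.
FAC: names of facilities, buildings, or structures.
PHONE: phone numbers (a custom type, not among the labels of spaCy's pretrained English models).

Below is an example of the result of using NER to extract data from text:

With this approach, we can use the entities highlighted above (1, 3, 5, and 19, that is, PERSON, GPE, DATE, and PHONE) to extract the information we need.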

import spacy
# Load the small English pipeline, which includes an NER component
nlp = spacy.load("en_core_web_sm")
# Define a function to capture the information from text using Named Entity Recognition
def extract_info_3(text, first_names, last_names, addresses, phones, dates_of_birth):
    # Initialize variables to store extracted information
    first_name = ""
    last_name = ""
    address = ""
    phone = ""
    date_of_birth = ""
    # Process the text with spaCy NER model
    doc = nlp(text)
    # Extract the information using Named Entity Recognition
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            # Check if the entity text is a first name
            if not first_name:
                first_name = ent.text.strip()
            else:
                # If we already have a first name, assume the current entity is the last name
                last_name = ent.text.strip()
        elif ent.label_ == "GPE":
            # GPE represents geographical entities, which could include addresses
            address = ent.text.strip()
        elif ent.label_ == "PHONE":
            # PHONE is a custom entity type; the pretrained pipeline only emits it if a custom component is added
            phone = ent.text.strip()
        elif ent.label_ == "DATE":
            # DATE entity type for dates, which could include date of birth
            date_of_birth = ent.text.strip()
    # Append the extracted information to the pre-defined lists
    first_names.append(first_name)
    last_names.append(last_name)
    addresses.append(address)
    phones.append(phone)
    dates_of_birth.append(date_of_birth)
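The function above treats the first PERSON entity as the first name and a later one as the last name. In practice a PERSON entity often spans the full name, so one alternative is to split a single entity instead. The helper below is a rough sketch of that idea; `split_full_name` is a hypothetical name, and it assumes Western name order and simply drops middle names.

```python
def split_full_name(full_name):
    """Split a full name into (first_name, last_name).

    Assumes the first token is the given name and the final token is the
    family name; anything in between is ignored.
    """
    parts = full_name.split()
    if not parts:
        return "", ""
    if len(parts) == 1:
        return parts[0], ""
    return parts[0], parts[-1]


print(split_full_name("John Smith"))
print(split_full_name("Mary Anne O'Neil"))
```

Inside extract_info_3, such a helper could be applied to `ent.text` of the first PERSON entity, although names with multi-word family names would need more careful handling.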

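d. Hybrid Approach

This approach combines all of the methods above: we simply pass the text through all three functions so that each one appends its findings to the lists. Within each list, we then rank the elements, putting the values that are repeated most often first, followed by the less frequent ones.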
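The ranking step can be sketched with `collections.Counter`. This is only an illustration of the idea: `rank_by_frequency` is a hypothetical helper, and it assumes every candidate in a list is a plain string (tuple matches would need to be joined first).

```python
from collections import Counter


def rank_by_frequency(candidates):
    """Return the unique non-empty candidates, most frequent first.

    Candidates with equal counts keep the order in which they first appeared.
    """
    cleaned = [value.strip() for value in candidates if value and value.strip()]
    counts = Counter(cleaned)
    return [value for value, _ in counts.most_common()]


# Hypothetical contents of the phones list after running all three functions
phones = ["123-456-7890", "", "123-456-7890", "987-654-3210"]
print(rank_by_frequency(phones))
```

The first element of each ranked list is then the most likely value for that field, and the remaining ones can serve as fallbacks.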

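e. Language Model Approach

Because legal documents and their language are complex, and because many entities can appear in a single document, none of the traditional methods may be a good choice. A large language model that can understand the context of each document is likely to perform better. That is why I recommend using an LLM such as the ChatGPT API for this task.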

# 1. Import the necessary libraries
import openai
import os
import pandas as pd
import time
# 2. Set your API key from your OpenAI account
openai.api_key = '<YOUR API KEY>'
# 3. Set up a function to get the result
# (openai.ChatCompletion is the interface of openai package versions before 1.0)
def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0,
        )
    return response.choices[0].message["content"]
# 4. Create a prompt from our instructions followed by the document text
question = "Read and understand the document, then help me extract 5 pieces of information including (1) First name, (2) last name, (3) date of birth, (4) address, (5) phone number  of tenants only. Here is the content of the document: " + text
# 5. Query the API
response = get_completion(question)
print(response)

After this step, all that remains is to parse the response to obtain the required information.
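How the response is parsed depends on how the model was asked to answer. The sketch below assumes the prompt is extended to request a JSON object with the keys first_name, last_name, date_of_birth, address, and phone; that request, the key names, and the helper `parse_llm_response` are all assumptions made for illustration, not part of the prompt shown above.

```python
import json

EXPECTED_FIELDS = ["first_name", "last_name", "date_of_birth", "address", "phone"]


def parse_llm_response(response_text):
    """Extract the five fields from a reply that contains a JSON object.

    Returns a dict with every expected field; missing ones become empty strings.
    Raises ValueError if no JSON object can be found or parsed.
    """
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in the response")
    data = json.loads(response_text[start:end + 1])
    return {field: str(data.get(field, "")) for field in EXPECTED_FIELDS}


# Hypothetical reply; real model output may be worded differently
sample_reply = (
    'Here is the result: {"first_name": "John", "last_name": "Smith", '
    '"date_of_birth": "1990-05-17", "address": "123 Main St, Toronto, ON, M4B 1B3"}'
)
print(parse_llm_response(sample_reply))
```

Slicing from the first "{" to the last "}" tolerates a short sentence before or after the JSON, which chat models sometimes add even when asked not to.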

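III. Automatic Form Filling

Because we control all of the fields to be filled, this step is easy for any data scientist to carry out.

It depends heavily on the internal system we have, but I can demonstrate how to fill in a form from the outside by using Selenium to browse the web page and populate its fields.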

from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time
# Load the website
web = webdriver.Chrome()
web.get('https://secure.sonnet.ca/#/quoting/property/about_you?lang=en')
# Wait for the page to load before filling out the form
time.sleep(5)
# Inputs field
#ADDRESS
Address_input = "50 Laughton Ave M6N 2W9" # KEY
Address_fill = web.find_element("xpath",'//*[@id="addressInput"]')
Address_fill.send_keys(Address_input)
# FIRST NAME
FirstName_input = "Kiel" # KEY
FirstName_fill = web.find_element("xpath",'//*[@id="firstName"]')
FirstName_fill.send_keys(FirstName_input)
# LAST NAME
LastName_input = "Dang" # KEY
LastName_fill = web.find_element("xpath",'//*[@id="lastName"]')
LastName_fill.send_keys(LastName_input)
# MONTH OF BIRTH
dropdown_month = web.find_element("id","month-0Button")
dropdown_month.click()
option = web.find_element("xpath", "//span[contains(text(), 'January')]") #KEY
option.click()
# DAY OF BIRTH
Date_input = "23" # KEY
Date_fill = web.find_element("xpath",'//*[@id="date-0"]')
Date_fill.send_keys(Date_input)
# YEAR OF BIRTH
Year_input = "1994" # KEY
Year_fill = web.find_element("xpath",'//*[@id="year-0"]')
Year_fill.send_keys(Year_input)
# Prevent auto closing the web after application finishes the script.
input("Press enter to close the browser")
web.quit()
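The script above hard-codes the month, day, and year. To connect it to the extraction phase, an extracted date of birth has to be split into those three form values. A minimal sketch, assuming the date has already been normalized to YYYY-MM-DD; the helper name `dob_form_values` is hypothetical.

```python
from datetime import date

# Month names as they appear in the form's dropdown (English, spelled out)
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def dob_form_values(iso_date):
    """Turn 'YYYY-MM-DD' into the (month name, day, year) strings used by the form."""
    parsed = date.fromisoformat(iso_date)
    return MONTH_NAMES[parsed.month - 1], str(parsed.day), str(parsed.year)


print(dob_form_values("1994-01-23"))
```

The three returned strings would replace the literals marked "KEY" in the month, day, and year steps of the Selenium script.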

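Conclusion

In developing this automated text data extraction and form-filling system, we explored a technical area that addresses the growing demand for efficient data processing. The core motivation of the project is to use automation to take over tedious tasks from people, minimize errors, and enable the faster decision-making that drives business growth.

Source: https://medium.com/@kirudang/automated-text-data-extraction-and-form-filling-system-8c97250da6aa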
