Microsoft Phi3 Vision: OCR Data Extraction from Documents

Published on June 18, 2024 by alex

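The Phi3 models are the latest generation of Microsoft's small language models. They come in four variants:

Phi-3-mini. A 3.8B-parameter language model, available in two context lengths (128K and 4K)
Phi-3-small. A 7B-parameter language model, available in two context lengths (128K and 8K)
Phi-3-medium. A 14B-parameter language model, available in two context lengths (128K and 4K)
Phi-3-vision. A 4.2B-parameter multimodal model with language and vision capabilities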

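In this article I am interested in the multimodal vision-language model. As the official documentation explains, Phi-3-Vision-128K-Instruct is a lightweight, state-of-the-art open multimodal model. It offers visual and text input capabilities for general-purpose AI systems and applications that require:

memory/compute-constrained environments;
latency-bound scenarios;
general image understanding;
optical character recognition (OCR);
chart and table understanding.

I wanted to check how well the model extracts data when it is used as an OCR engine on personal documents such as identity cards, driving licenses, and health insurance cards. The documents used in this test are facsimiles: they are not originals and do not belong to real people.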

Model instantiation

To use the model in inference mode, I built the following environment:

1) conda create -n llm_images python=3.10
2) conda activate llm_images
3) pip install torch==2.3.0 torchvision==0.18.0
4) pip install packaging
5) pip install pillow==10.3.0 chardet==5.2.0 flash_attn==2.5.8 accelerate==0.30.1 bitsandbytes==0.43.1 Requests==2.31.0 transformers==4.40.2 
6) pip uninstall jupyter
7) conda install -c anaconda jupyter
8) conda update jupyter
9) pip install --upgrade 'nbconvert>=7' 'mistune>=2'
10) pip install cchardet
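
Because the environment pins exact versions, it can help to confirm what actually got installed before loading the model. The following is a minimal sketch, not part of the original setup: the `EXPECTED` dictionary copies a subset of the pins from the commands above, and `check_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Pins copied from the pip commands above (a subset, for illustration).
EXPECTED = {
    "torch": "2.3.0",
    "torchvision": "0.18.0",
    "transformers": "4.40.2",
    "accelerate": "0.30.1",
    "bitsandbytes": "0.43.1",
    "flash_attn": "2.5.8",
    "pillow": "10.3.0",
}


def check_versions(expected):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Strip local build tags such as "+cu121" before comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in check_versions(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means every listed package matches its pin; a `None` in the second position means the package is not installed in the active environment.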

Once the environment was ready, I downloaded the model from the Hugging Face Hub:

# Import necessary libraries
from PIL import Image
import requests
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
from transformers import BitsAndBytesConfig
import torch
from IPython.display import display
import time

# Define model ID
model_id = "microsoft/Phi-3-vision-128k-instruct"
# Load processor
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Define BitsAndBytes configuration for 4-bit quantization
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load model with 4-bit quantization and map to CUDA
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    quantization_config=nf4_config,
)

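Next, I prepared a Python function that takes the messages to send to the model and the path of the image. It displays the image, then prints the inference time and the model's response.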

def model_inference(messages, path_image):
    
    start_time = time.time()
    
    image = Image.open(path_image)
    # Prepare prompt with image token
    prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Process prompt and image for model input
    inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
    # Generate text response using model
    generate_ids = model.generate(
        **inputs,
        eos_token_id=processor.tokenizer.eos_token_id,
        max_new_tokens=500,
        do_sample=False,
    )
    # Remove input tokens from generated response
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1] :]
    # Decode generated IDs to text
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    display(image)
    end_time = time.time()
    print("Inference time: {}".format(end_time - start_time))
    # Print the generated response
    print(response)

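In what follows, I show how to extract data from each of the documents. For each side of a document (front or back), I prepared a specific prompt that names the fields to extract.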
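
The prompts used in the sections that follow all share the same shape: the `<|image_1|>` placeholder, a fixed instruction, and a quoted list of field names. As a rough sketch of how that template could be factored out, the hypothetical helper `build_ocr_messages` below (not part of the original article) assembles the message list from a list of field names and an optional extra instruction.

```python
def build_ocr_messages(fields, extra_instruction=""):
    """Build a single-turn message list for one image (hypothetical helper)."""
    # Quote each field name the same way the hand-written prompts do.
    field_list = ", ".join(f"'{name}'" for name in fields)
    content = (
        "<|image_1|>\n"
        "OCR the text of the image. Extract the text of the following fields "
        f"and put it in a JSON format: {field_list}."
    )
    if extra_instruction:
        content += " " + extra_instruction
    return [{"role": "user", "content": content}]


messages = build_ocr_messages(
    ["COGNOME /Surname", "NOME/NAME"],
    extra_instruction="Read the code at the top right and put it in the JSON field 'CODE'",
)
print(messages[0]["content"])
```

The returned list has the same structure as the hand-written prompts, so it could be passed to `model_inference` in their place.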

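Identity card OCR

Front

For the front of the Italian identity card, I used the following prompt to extract the main personal data and return it in JSON format.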

prompt_cie_front = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Extract the text of the following fields and put it in a JSON format: \
'Comune Di/ Municipality', 'COGNOME /Surname', 'NOME/NAME', 'LUOGO E DATA DI NASCITA/\
PLACE AND DATE OF BIRTH', 'SESSO/SEX', 'STATURA/HEIGHT', 'CITADINANZA/NATIONALITY',\
'EMISSIONE/ ISSUING', 'SCADENZA /EXPIRY'. Read the code at the top right and put it in the JSON field 'CODE'"}]
# Path to the local image file
path_image = "/home/randellini/llm_images/resources/cie_fronte.jpg"
# inference
model_inference(prompt_cie_front, path_image)

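For this image I obtained the following output. Note that the card's unique code sits at the top right of the card and has no field label attached to it. To extract its value, I specified in the prompt that the model must read the code at the top right and put it in a JSON field named 'CODE'. The only error is that the first 0 of the unique code was read as an uppercase letter O.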

Inference time: 9.793543815612793
{
"Comune Di/ Municipality": "SERENELLA MARITTIMA",
"COGNOME /Surname": "ROSSI",
"NOME/NAME": "BIANCA",
"LUOGO E DATA DI NASCITA": "PINO SULLA SPONDA DEL LAGO MAGGIORE (VA) 30.12.1964",
"SESSO/SEX": "F",
"STATURA/HEIGHT": "180",
"CITADINANZA/NATIONALITY": "ITA",
"EMISSIONE/ ISSUING": "30.05.2022",
"SCADENZA /EXPIRY": "30.12.2031",
"CODE": "CAO000AA"
}
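
Confusions between 0 and O like the one above can sometimes be repaired after the fact when the expected layout of the code is known. The sketch below is an illustration only: the mask `"LLDDDDLL"` is a hypothetical layout inferred from the eight-character value printed above, not an official specification of the card number, and the look-alike table is an assumption.

```python
# Characters that OCR commonly confuses, keyed by the expected character class.
TO_DIGIT = {"O": "0", "I": "1", "S": "5", "B": "8"}
TO_LETTER = {digit: letter for letter, digit in TO_DIGIT.items()}


def normalize_code(raw, mask):
    """Repair look-alike characters using a mask of 'L' (letter) and 'D' (digit).

    Returns the repaired code, or None when the length or a character class
    still does not match the mask.
    """
    raw = raw.strip().upper()
    if len(raw) != len(mask):
        return None
    fixed = []
    for char, kind in zip(raw, mask):
        if kind == "D":
            char = TO_DIGIT.get(char, char)
            if not char.isdigit():
                return None
        else:
            char = TO_LETTER.get(char, char)
            if not char.isalpha():
                return None
        fixed.append(char)
    return "".join(fixed)


# Hypothetical mask: two letters, four digits, two letters.
print(normalize_code("CAO000AA", "LLDDDDLL"))  # prints CA0000AA
```

A rule like this only helps for fields with a fixed layout; for free-text fields there is no mask to check against.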

Back

To extract the data on the back, I used the following prompt

prompt_cie_back = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Extract the text of the following fields and put it in a JSON format: \
'CODICE FISCALE/FISCAL CODE', 'ESTREMI ATTO DI NASCITA', 'INDIRIZZO DI RESIDENZA/RESIDENCE'"}]
# Path to the local image file
path_image = "/home/randellini/llm_images/resources/cie_retro.jpg"
# inference
model_inference(prompt_cie_back, path_image)

I obtained the following result. There is only one error: the third character of the fiscal code, an uppercase S, is missing.

Inference time: 4.082342147827148
{
  "codice_fiscale": "RSBNC64T70G677R",
  "estremi_atto_di_nascita": "00000.0A00",
  "indirizzo_di_residenza": "Via Salaria, 712"
}
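
A dropped character like this one is easy to catch automatically, because an Italian fiscal code has a fixed 16-character layout. The following sketch checks only that layout. It is a simplification under stated assumptions: the check-character algorithm and the "omocodia" variants (where some digits are replaced by letters) are not handled, and `looks_like_fiscal_code` is a hypothetical helper name.

```python
import re

# Simplified layout: 6 letters, 2 digits, a month letter, 2 digits,
# 1 letter, 3 digits, 1 check letter (16 characters in total).
FISCAL_CODE_RE = re.compile(r"[A-Z]{6}\d{2}[A-EHLMPR-T]\d{2}[A-Z]\d{3}[A-Z]")


def looks_like_fiscal_code(value):
    """Return True when the value has the basic shape of a fiscal code."""
    return FISCAL_CODE_RE.fullmatch(value.strip().upper()) is not None


# Value extracted from the back of the identity card (one character short).
print(looks_like_fiscal_code("RSBNC64T70G677R"))   # prints False
# Value extracted from the health insurance card later in this article.
print(looks_like_fiscal_code("RSSMRO62B25E205Y"))  # prints True
```

A failed check does not say which character is wrong, but it is a cheap signal that the field should be re-read or reviewed by a person.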

Driving license OCR

For the front of the Italian driving license, I used the following prompt

prompt_ld_front = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Extract the text of the following fields and put it in a JSON format: \
'1.', '2.', '3.', '4a.', '4b.', '4c.', '5.','9.'"}]
# Path to the local image file
path_image = "/home/randellini/llm_images/resources/patente_fronte.png"
# inference
model_inference(prompt_ld_front, path_image)

obtaining the following result

Inference time: 5.2030909061431885
{
"1": "ROSSI",
"2": "MARIA",
"3": "01/01/65",
"4a": "01/03/2014",
"4b": "01/01/2025",
"4c": "MIT-UCO",
"5": "A0A000000A",
"9": "B"
}
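
Since the license identifies its fields only by number, the JSON keys are hard to read downstream. One option, sketched here, is to rename them after parsing. The mapping follows the standard field numbering printed on EU-format licenses; the readable names and the helper `rename_fields` are my own choices for illustration.

```python
import json

# Field numbers as printed on EU-format driving licenses.
FIELD_NAMES = {
    "1": "surname",
    "2": "first_name",
    "3": "date_and_place_of_birth",
    "4a": "issue_date",
    "4b": "expiry_date",
    "4c": "issuing_authority",
    "5": "license_number",
    "9": "categories",
}


def rename_fields(raw_json):
    """Parse the model's JSON and replace numeric keys with readable names."""
    data = json.loads(raw_json)
    # The prompt uses '4a.' while the output uses '4a', so drop a trailing dot.
    return {FIELD_NAMES.get(key.rstrip("."), key): value for key, value in data.items()}


raw = '{"1": "ROSSI", "2": "MARIA", "4b": "01/01/2025", "9": "B"}'
print(rename_fields(raw))
```

Keys that are not in the mapping are passed through unchanged, so an unexpected field from the model is kept rather than silently dropped.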

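For the back of the Italian driving license, I have not yet found a prompt that correctly reads the values in columns '9.', '10.', '11.' and '12.' of the table. In addition, '12.' appears twice: first as the name of a column in the table, and then as a field at the bottom left of the card.

This last field matters because it lists special obligations the driver must comply with. For example, code 01 indicates the obligation to drive wearing lenses or glasses.

Health insurance card OCR

Front

To read the values on the front of the Italian health insurance card, I used the following prompt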

prompt_hic_front = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Extract the text of the following fields and put it in a JSON format: \
'Codice Fiscale', 'Sesso', 'Cognome', 'Nome', 'Luogo di nascita', 'Provincia', 'Data di nascita', 'Data di scadenza'"}]
# Path to the local image file
path_image = "/home/randellini/llm_images/resources/tessera_sanitaria_fronte.jpg"
# inference
model_inference(prompt_hic_front, path_image)

The result is as follows

Inference time: 7.003508806228638
```json
{
  "Codice Fiscale": "RSSMRO62B25E205Y",
  "Sesso": "M",
  "Cognome": "ROSSI",
  "Nome": "MARIO",
  "Luogo di nascita": "CASSINA DE' PECCHI",
  "Provincia": "MI",
  "Data di nascita": "25/02/1962",
  "Data di scadenza": "10/10/2019"
}
```
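
Unlike the earlier outputs, this response wraps the JSON in a Markdown code fence, so passing it straight to `json.loads` would fail. A minimal sketch of a tolerant parser follows; `parse_model_json` is a hypothetical helper, and it assumes the response contains exactly one JSON object.

```python
import json
import re

# Build the fence marker programmatically to keep this example readable.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_model_json(response):
    """Parse a JSON object from a response that may be wrapped in a code fence."""
    match = FENCE_RE.search(response)
    payload = match.group(1) if match else response
    # Fall back to the outermost braces in case extra text surrounds the object.
    start, end = payload.find("{"), payload.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model response")
    return json.loads(payload[start : end + 1])


fenced = FENCE + 'json\n{"Sesso": "M", "Nome": "MARIO"}\n' + FENCE
print(parse_model_json(fenced))
print(parse_model_json('{"Sesso": "M", "Nome": "MARIO"}'))
```

To use it, `model_inference` would need to return `response` instead of only printing it.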

Back

To read the back of the card, I used the following prompt

prompt_hic_back = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Extract the text of the following fields and put it in a JSON format: \
'3 Cognome', '4 Nome', '5 Data di nascita', '6 Numero identificativo personale', '7 Numero identificazione dell'istituzione', 'Numero di identificazione della tessera', '9 Scadenza'"}]
# Path to the local image file
path_image = "/home/randellini/llm_images/resources/tessera_sanitaria_retro.jpg"
# inference
model_inference(prompt_hic_back, path_image)

obtaining

Inference time: 7.403932809829712
{
"3 Cognome": "ROSSI",
"4 Nome": "MARIO",
"5 Data di nascita": "25/02/1962",
"6 Numero identificativo personale": "RSSMRO62B25E205Y",
"7 Numero identificazione dell'istituzione": "0030 - LOMBARDIA",
"Numero di identificazione della tessera": "80380800301234567890",
"9 Scadenza": "01/01/2006"
}
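
Several fields appear on both sides of the health insurance card, which gives a cheap consistency check on the extraction. The sketch below compares the two outputs shown above. The pairing of front and back keys is my own reading of the field names, and `cross_check` is a hypothetical helper.

```python
# Values copied from the two outputs shown above.
front = {
    "Codice Fiscale": "RSSMRO62B25E205Y",
    "Cognome": "ROSSI",
    "Nome": "MARIO",
    "Data di nascita": "25/02/1962",
    "Data di scadenza": "10/10/2019",
}
back = {
    "3 Cognome": "ROSSI",
    "4 Nome": "MARIO",
    "5 Data di nascita": "25/02/1962",
    "6 Numero identificativo personale": "RSSMRO62B25E205Y",
    "9 Scadenza": "01/01/2006",
}

# Assumed correspondence between front and back field names.
PAIRS = [
    ("Codice Fiscale", "6 Numero identificativo personale"),
    ("Cognome", "3 Cognome"),
    ("Nome", "4 Nome"),
    ("Data di nascita", "5 Data di nascita"),
    ("Data di scadenza", "9 Scadenza"),
]


def cross_check(front, back, pairs):
    """Return the field pairs whose values differ between the two sides."""
    return [(f, b) for f, b in pairs if front.get(f) != back.get(b)]


print(cross_check(front, back, PAIRS))
# prints [('Data di scadenza', '9 Scadenza')]
```

Here only the expiry dates disagree, which may simply reflect the specimen images used for the two sides rather than an OCR error.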

Source: https://medium.com/@enrico.randellini/exploring-microsoft-phi3-vision-language-model-as-ocr-for-document-data-extraction-c269f7694d62
