Fine-Tuning the Llama 3 Model for Relation Extraction Using UBIAI Data

Published August 19, 2024 by alex

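Extracting meaningful relationships from unstructured text has become essential. This article walks through fine-tuning a large language model (LLM), specifically Llama 3, for relation extraction. We discuss how UBIAI, an advanced annotation platform, can be used to annotate the data and so improve how well the model identifies and classifies semantic relationships in text.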


What Is Relation Extraction?

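Relation extraction is a core NLP task that bridges unstructured text and structured knowledge. It involves identifying the entities in a text and determining the semantic relationships that connect them. For example, in the sentence "Tesla, founded by Elon Musk, is revolutionizing the electric vehicle industry," a relation extraction system can identify the entities "Tesla," "Elon Musk," and "electric vehicle industry," and extract relationships such as "founded by" and "revolutionizing."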
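To make the target output concrete, here is a minimal sketch of how the Tesla sentence could be represented as structured data. The `Relation` class and the label names `FOUNDED_BY` and `REVOLUTIONIZING` are illustrative assumptions, not part of any particular library or of the UBIAI schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One extracted relationship: head entity, relation label, tail entity."""
    head: str
    label: str
    tail: str


sentence = "Tesla, founded by Elon Musk, is revolutionizing the electric vehicle industry."

# Entities a relation extraction system would first identify in the sentence.
entities = ["Tesla", "Elon Musk", "electric vehicle industry"]

# Hypothetical relation labels, chosen only for this illustration.
relations = [
    Relation("Tesla", "FOUNDED_BY", "Elon Musk"),
    Relation("Tesla", "REVOLUTIONIZING", "electric vehicle industry"),
]

for rel in relations:
    print(f"({rel.head}) -[{rel.label}]-> ({rel.tail})")
```

Each relation links two entity mentions that appear in the sentence, which is the structure the fine-tuned model is later asked to produce as text.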


Large Language Models (LLMs)

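The arrival of large language models opened a new era of NLP capability. Trained on large text corpora, these models acquire a built-in grasp of language structure and semantics. Llama 3, a state-of-the-art LLM that can understand and generate human-like text across many domains, exemplifies this capability. By fine-tuning Llama 3 for relation extraction, we can draw on that language understanding to extract nuanced relationships from text with high accuracy.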


Data Preparation for Fine-Tuning


Preparing the Data: The UBIAI Advantage

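High-quality data is at the heart of any successful machine learning project. This is where UBIAI stands out as an annotation platform. It goes beyond traditional labeling tools by offering the following features: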

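  1. Advanced document processing: UBIAI extracts text from a variety of document formats while preserving the structure and contextual information that relation extraction depends on.
  2. Intuitive annotation interface: The platform gives annotators a user-friendly environment for identifying entities and defining relationships, which helps keep labels consistent and accurate.
  3. Quality control mechanisms: UBIAI includes built-in validation tools and inter-annotator agreement features that improve the reliability of the annotated dataset.
  4. Customizable annotation schema: Users can define custom entity types and relation categories to tailor the annotation process to a specific domain or use case.
  5. Collaborative workflow: UBIAI supports team-based annotation projects, with efficient task distribution and collaboration among annotators.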
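The inter-annotator agreement mentioned in the list can be illustrated with Cohen's kappa, one common agreement statistic for two annotators. This is a sketch only: the article does not say which metric UBIAI computes, and the label sequences below are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random with their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical relation labels assigned by two annotators to the same four candidate pairs.
annotator_a = ["MUST_HAVE", "REQUIRES", "MUST_HAVE", "NICE_TO_HAVE"]
annotator_b = ["MUST_HAVE", "REQUIRES", "NICE_TO_HAVE", "NICE_TO_HAVE"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

A value near 1 indicates strong agreement beyond chance, while a value near 0 suggests the annotation guidelines need tightening before the data is used for fine-tuning.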


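With UBIAI, researchers and data scientists can build high-fidelity datasets designed specifically for training relation extraction models. This carefully annotated data is the foundation for fine-tuning Llama 3 so that it can extract complex relationships from text across different domains.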


Data Preprocessing

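After the annotated data is exported from UBIAI in JSON format, it must be preprocessed into the format required for fine-tuning Llama 3. The following Python script demonstrates the process: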


import json
import pandas as pd


def preprocess_json(data, possible_relationships):
    """Turn one UBIAI record into a single prompt + response training string."""
    # Extract the relevant information
    document = data['document']
    tokens = data['tokens']
    relations = data['relations']
    # Create a mapping of token index to its text and entity label
    token_info = {i: {'text': t['text'], 'label': t['entityLabel']} for i, t in enumerate(tokens)}
    # Format the entities and relationships
    entities = [(t['text'], t['entityLabel']) for t in tokens]
    formatted_entities = ", ".join([f"{text} ({label})" for text, label in entities])
    formatted_relations = []
    for r in relations:
        child_index = r['child']
        head_index = r['head']
        # Skip relations whose indices point outside the token list
        if child_index < len(tokens) and head_index < len(tokens):
            child = token_info[child_index]['text']
            head = token_info[head_index]['text']
            relation_label = r['relationLabel']
            formatted_relations.append(f"{child} -> {head} ({relation_label})")
    formatted_relations = "; ".join(formatted_relations)
    # Create the formatted prompt and response.
    # NOTE: the role names (system, user, assistant) run straight into the text here;
    # the chat-template special tokens that would normally delimit them appear to be
    # missing from this listing and should match the template used during training.
    prompt = f"systemExtract relationships between entities from the following text.user Text: \"{document}\" Entities: {formatted_entities}. Possible relationships: {', '.join(possible_relationships)}."
    response = f"assistantThe relations between the entities: {formatted_relations}"
    full_prompt = prompt + response
    return full_prompt


# List of possible relationships (customize as needed)
possible_relationships = ["MUST_HAVE", "REQUIRES", "NICE_TO_HAVE"]
input_path = "/content/UBIAI_REL_data.json"
# Read the input JSON file (a list of annotated records)
with open(input_path, 'r') as file:
    data = json.load(file)
# Preprocess every record into one training string
data = [preprocess_json(j, possible_relationships) for j in data]
# Convert to a DataFrame with a single "text" column
df = pd.DataFrame(data, columns=["text"])
# Save to CSV
df.to_csv('fine_tuning_data.csv', index=False)


This script formats the data into prompt-response pairs suitable for fine-tuning models such as Llama 3.
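In the script, the role names run directly into the prompt text, which suggests that role delimiters were lost somewhere along the way. As a sketch only, and assuming the ChatML markers `<|im_start|>` and `<|im_end|>` (the format that `setup_chat_format` from trl configures), a training string with explicit delimiters could be built as follows. The helper name `build_chatml_example` and the sample record are hypothetical.

```python
def build_chatml_example(document, entities, relations, possible_relationships):
    """Return one training string with explicit ChatML role delimiters (assumed format)."""
    entity_text = ", ".join(f"{text} ({label})" for text, label in entities)
    relation_text = "; ".join(f"{child} -> {head} ({label})" for child, head, label in relations)
    system = "Extract relationships between entities from the following text."
    user = (
        f'Text: "{document}" Entities: {entity_text}. '
        f"Possible relationships: {', '.join(possible_relationships)}."
    )
    assistant = f"The relations between the entities: {relation_text}"
    # Each turn is wrapped as <|im_start|>role ... <|im_end|>
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}<|im_end|>\n"
    )


# Hypothetical record, invented for illustration only.
example = build_chatml_example(
    document="5+ years of Python experience required.",
    entities=[("5+ years", "EXPERIENCE"), ("Python", "SKILLS")],
    relations=[("5+ years", "Python", "REQUIRES")],
    possible_relationships=["MUST_HAVE", "REQUIRES", "NICE_TO_HAVE"],
)
print(example)
```

Whatever delimiters are used, the training strings should follow the same chat template that the tokenizer applies at inference time.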


Fine-Tuning Llama 3

To fine-tune Llama 3 for relation extraction, you would typically use the Hugging Face Transformers library in Python. Here is a basic setup to get you started:


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig
from trl import setup_chat_format
model_id = "meta-llama/Meta-Llama-3-8B"
# Tokenizer setup
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'
tokenizer.model_max_length = 2048
# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
# Model setup
device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available() else None
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    quantization_config=bnb_config
)
model, tokenizer = setup_chat_format(model, tokenizer)
model = prepare_model_for_kbit_training(model)
# LoRA configuration
peft_config = LoraConfig(
    lora_alpha=128,
    lora_dropout=0.05,
    r=256,
    bias="none",
    target_modules=["q_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)


This code sets up the Llama 3 model with 4-bit quantization and LoRA (Low-Rank Adaptation) for efficient fine-tuning.
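To get a feel for what `r=256` and `lora_alpha=128` imply, the following back-of-the-envelope sketch counts the parameters LoRA adds for the seven target modules. The layer dimensions are assumptions taken from the published Llama 3 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); check them against the model's `config.json` before relying on the numbers.

```python
# Assumed Llama 3 8B dimensions (verify against the model's config.json).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024          # 8 key/value heads x head dimension 128
NUM_LAYERS = 32

R = 256                # LoRA rank from the configuration above
LORA_ALPHA = 128

# (input features, output features) of each targeted projection in one decoder layer
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """LoRA adds two matrices per layer: A (r x in_features) and B (out_features x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R   # factor applied to the low-rank update B @ A

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"Update scaling (alpha/r):  {scaling}")
```

Under these assumptions the adapter holds roughly 671 million trainable parameters, which is large for LoRA; a smaller rank reduces memory use at the cost of adapter capacity.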


Training Configuration

Next, we set the training arguments:


from transformers import TrainingArguments
args = TrainingArguments(
    output_dir="sft_model_path",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="adamw_8bit",
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    report_to="tensorboard",
)


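These arguments define various aspects of the training process, such as the number of epochs, batch size, learning rate, and optimization strategy.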
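The interaction between batch size, gradient accumulation, and warmup can be worked out by hand. The sketch below approximates how the number of optimizer updates is derived from these arguments; the dataset size of 1,000 examples and the single GPU are assumptions made purely for the arithmetic.

```python
import math

# Values taken from the training arguments above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 1
warmup_ratio = 0.03

# Assumptions for illustration only.
num_devices = 1
num_examples = 1000

# Number of examples that contribute to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Batches the dataloader yields per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(num_examples / (per_device_train_batch_size * num_devices))
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_updates = updates_per_epoch * num_train_epochs

# Warmup length implied by warmup_ratio.
warmup_steps = math.ceil(total_updates * warmup_ratio)

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer updates:    {total_updates}")
print(f"warmup steps:         {warmup_steps}")
```

One thing worth checking in your own setup: as far as I understand the scheduler options, `"constant"` keeps the learning rate fixed from the first step, so the warmup ratio would only take effect with a schedule such as `"constant_with_warmup"`.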


Training the Model

Now we can set up the trainer and start the fine-tuning process:


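from datasets import load_dataset
from trl import SFTTrainer
# Load the CSV written by the preprocessing script (its single column is named "text")
dataset = load_dataset("csv", data_files="fine_tuning_data.csv", split="train")
trainer = SFTTrainer(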
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=512,
    tokenizer=tokenizer,
    dataset_text_field="text",
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    }
)
trainer.train()
trainer.save_model()


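This code initializes the SFTTrainer (Supervised Fine-Tuning Trainer) and starts the training process. After fine-tuning, the following script merges the newly learned weights into the base model.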


from peft import PeftModel
base_model = "meta-llama/Meta-Llama-3-8B"
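# Directory holding the trained LoRA adapter; the trainer above saves to its
# output_dir ("sft_model_path"), so point this at wherever the adapter was written.
new_model = "/content/REL_finetuned_llm"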
base_model_reload = AutoModelForCausalLM.from_pretrained(
    base_model,
    return_dict=True,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
base_model_reload, tokenizer = setup_chat_format(base_model_reload, tokenizer)
model = PeftModel.from_pretrained(base_model_reload, new_model)
model = model.merge_and_unload()
model.save_pretrained("llama-3-8b-REL")
tokenizer.save_pretrained("llama-3-8b-REL")


This code loads the base model, applies the fine-tuned weights, and saves the merged model.
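Conceptually, merging a LoRA adapter folds the low-rank update into each targeted weight matrix, W' = W + (lora_alpha / r) * B A, so that no adapter is needed at inference time. The toy example below illustrates that arithmetic in plain Python with made-up 2x2 values; it is a sketch of the idea, not the PEFT implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A)."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy values: a 2x2 base weight and a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]        # shape r x in_features  = 1 x 2
lora_b = [[0.5], [1.0]]      # shape out_features x r = 2 x 1

merged = merge_lora(weight, lora_a, lora_b, lora_alpha=2, r=1)
print(merged)

# The merged matrix gives the same output as base weight plus adapter path.
x = [[1.0], [1.0]]
adapter_path = matmul(lora_b, matmul(lora_a, x))
separate = [[wx[0] + 2.0 * ax[0]] for wx, ax in zip(matmul(weight, x), adapter_path)]
print(matmul(merged, x) == separate)
```

Because the merged weights have the same shape as the originals, the saved model loads like any ordinary checkpoint.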


Inference

Finally, we can run inference with the fine-tuned model:


from transformers import pipeline
messages = [{"role": "user", "content": """Extract relationships between entities from the following text. Text: "1+ years development experience on Java stack AppConnect / API's experience is added advantage. Compute, Network and Storage Monitoring Tools (Ex: Netcool) Application Performance Tools (IBM APM) Cloud operations and Automation Tools (VmWare, ICAM, ...) Proven Record of developing enterprise class products and applications. Preferred Tech and Prof Experience None EO Statement IBM is committed to creating a diverse environment and is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status. IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status. ." Entities: 1+ years (EXPERIENCE), development (SKILLS). Possible relationships: EXPERIENCE_IN, LOCATED_IN, WORKS_FOR, PART_OF, CREATED_BY."""}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)
outputs = pipe(prompt, max_new_tokens=120, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
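The generated text still has to be turned back into structured relations. Here is a minimal parsing sketch that assumes the model reproduces the training response format, "The relations between the entities: child -> head (LABEL); ...". The function name `parse_relations` and the sample output string are hypothetical; real generations may deviate from the format and need more defensive handling.

```python
import re

# child -> head (LABEL), where LABEL is upper-case letters and underscores
RELATION_PATTERN = re.compile(r"^(?P<child>.+?)\s*->\s*(?P<head>.+?)\s*\((?P<label>[A-Z_]+)\)")


def parse_relations(generated_text):
    """Extract (child, head, label) triples from the model's answer."""
    marker = "The relations between the entities:"
    # Use the last occurrence of the marker so the echoed prompt is ignored.
    _, found, tail = generated_text.rpartition(marker)
    if not found:
        return []
    triples = []
    for chunk in tail.split(";"):
        match = RELATION_PATTERN.match(chunk.strip())
        if match:
            triples.append((match["child"], match["head"], match["label"]))
    return triples


# Hypothetical model output, written by hand for illustration.
sample_output = "The relations between the entities: 1+ years -> development (EXPERIENCE_IN)"
print(parse_relations(sample_output))
```

Chunks that do not match the pattern are skipped silently, which keeps the parser tolerant of stray text after the last relation.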


Conclusion

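Fine-tuning Llama 3 for relation extraction involves several key steps: data annotation, preprocessing, model setup, fine-tuning, and inference. By following this article, you can use Llama 3 to extract meaningful relationships from text data. Its flexibility and robustness make it well suited to a range of NLP tasks, including relation extraction.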


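This article demonstrated the end-to-end pipeline using data annotated in UBIAI and showed how an advanced language model can turn raw text into structured information. Combining UBIAI's annotation features with Llama 3's strong language understanding provides a powerful tool for extracting valuable insights from unstructured text data.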

Source: https://medium.com/ubiai-nlp/fine-tuning-llama-3-model-for-relation-extraction-using-ubiai-data-1fd1946a62e0