Model:

dfurman/mpt-7b-instruct-orca

License:

cc-by-sa-3.0

Preprints:

arxiv:2108.12409 arxiv:2205.14135

Other:

MosaicML custom_code mpt

Datasets:

ehartford/dolphin

Libraries:

Transformers Safetensors PyTorch

Task:

Text generation

English

mpt-7b-instruct-orca

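This short-form instruction-following model was built by finetuning MPT-7B on the first 100k rows of the ehartford/dolphin dataset (an open-source implementation of Microsoft's Orca).

Model license: CC-BY-SA-3.0
Basic usage: notebook
Finetuning code: script
Loss curves: plot
Runtime stats: table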

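Example prompts and responses

Example 1:

Input:

You are a helpful assistant. Write me a numbered list of things to do in New York City.

Output:

1. Visit the Statue of Liberty: The Statue of Liberty is a symbol of freedom and democracy, and it is a must-see attraction in New York City.

2. Take a walk in Central Park: Central Park is a green space in the heart of Manhattan, and it is a great place to relax, enjoy nature, and people-watch.

3. Visit the Metropolitan Museum of Art: The Met is one of the largest and best-known art museums in the world, with a large collection of art from around the globe.

4. Take a stroll through Times Square: Times Square is the heart of New York City, known for its iconic billboards, bright lights, and bustling energy.

5. Visit the Empire State Building: The Empire State Building is a famous skyscraper in Midtown Manhattan, and its observation deck offers panoramic views of the city.

6. Explore the Museum of Modern Art (MoMA): MoMA is a famous art museum in Midtown Manhattan that showcases modern and contemporary art from around the world.

7. Visit the 9/11 Memorial and Museum: The 9/11 Memorial and Museum is a somber but important attraction in Lower Manhattan that honors the victims of the 9/11 terrorist attacks.

8. Take a ferry ride to the Statue of Liberty: A ferry ride to the Statue of Liberty is a great way to see the city from the water.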

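Example 2:

Input:

You are a helpful assistant. Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does this make logical sense? Let's work this out step by step to make sure we have the right answer.

Output:

Step 1: Identify the days the barber works.

The barber works on Mondays, Wednesdays, and Fridays.

Step 2: Identify the day Daniel went for a haircut.

Daniel went for a haircut on Sunday.

Step 3: Compare the day Daniel went for a haircut with the days the barber works.

Sunday is not one of the days the barber works.

Step 4: Determine whether the statement makes logical sense.

The statement does not make logical sense, because Daniel went for a haircut on a day the barber does not work.

Answer: No, the statement does not make logical sense. Daniel should have gone for a haircut on a day the barber works, such as Monday, Wednesday, or Friday.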

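Example 3:

Input:

You are a helpful assistant. Write a short email inviting my friends to a dinner party on Friday. Respond succinctly.

Output:

Subject: Invitation to Dinner Party on Friday

Dear Friend,

I hope this email finds you well. I am writing to invite you to a dinner party that I am hosting on Friday. The party will be a casual gathering, and I would love to have you join us.

The dinner will be held at my home from 7:00 pm to 10:00 pm. The address is 123 Main Street, San Francisco, CA 94122.

The dinner will be a potluck, so please bring a dish to share. I will provide the main course, so you can focus on bringing a side dish, dessert, or appetizer.

Please RSVP by replying to this email or by calling me at 555-555-5555. I look forward to seeing you on Friday!

Yours,

Your Friendly Assistant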

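Model description

The architecture is a modification of a standard decoder-only transformer.

The model has been modified from a standard transformer in the following ways:

It uses FlashAttention.
It uses ALiBi (Attention with Linear Biases) and does not use positional embeddings.
It does not use biases.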
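To make the ALiBi item concrete, here is a minimal pure-Python sketch of the bias described in the ALiBi paper (arXiv:2108.12409): each head gets a fixed slope, and attention scores are penalized in proportion to the distance between query and key. The function names `alibi_slopes` and `alibi_bias` are illustrative and are not taken from the MPT code, whose implementation may differ in detail.

```python
import math


def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head slopes from the ALiBi paper; assumes n_heads is a power of two."""
    if n_heads <= 0 or n_heads & (n_heads - 1) != 0:
        raise ValueError("this sketch only handles power-of-two head counts")
    ratio = 2.0 ** (-8.0 / n_heads)
    return [ratio ** (h + 1) for h in range(n_heads)]


def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Causal additive bias for one head: -slope * (query_pos - key_pos)."""
    return [
        [-slope * (q - k) if k <= q else -math.inf for k in range(seq_len)]
        for q in range(seq_len)
    ]


slopes = alibi_slopes(32)  # 32 heads, matching n_heads for this model
bias = alibi_bias(slopes[0], seq_len=4)
print(f"first slope: {slopes[0]:.4f}, last slope: {slopes[-1]:.6f}")
print(bias[3])  # penalties grow with distance from the last query position
```

Because the bias depends only on relative distance, nothing in it is tied to a fixed maximum position, which is why the sequence length can be raised later.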

Hyperparameter	Value
n_parameters	6.65B
n_layers	32
n_heads	32
d_model	4096
vocab size	50432
sequence length	2048
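As a rough sanity check, the parameter count can be approximately recovered from the other hyperparameters. The sketch below assumes an MLP expansion ratio of 4 and an embedding matrix shared with the output layer; neither is stated in this card, so treat both as assumptions. It also relies on the points listed above: no bias terms and no learned positional embeddings.

```python
n_layers = 32
d_model = 4096
vocab_size = 50432
expansion_ratio = 4  # assumption: MLP hidden size is 4 * d_model

attention = 4 * d_model * d_model              # Q, K, V and output projections
mlp = 2 * expansion_ratio * d_model * d_model  # up- and down-projection
norms_per_block = 2 * d_model                  # two LayerNorm weight vectors, no biases
per_block = attention + mlp + norms_per_block

embedding = vocab_size * d_model               # assumption: shared with the output layer
final_norm = d_model

total = n_layers * per_block + embedding + final_norm
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the estimate rounds to the 6.65B listed in the table.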

Finetuning description

This model was trained for about 12 hours on a single H100 (80 GB PCIe) using the Lambda Labs platform.

Run: July 5, 2023 (link)

Arguments summary: {'lr': 2e-5, 'num_epochs': 1, 'seed': 43}
Log summary: {'train_runtime': 61098.1062, 'train_samples_per_second': 1.637, 'train_steps_per_second': 0.409, 'train_loss': 1.4058428125, 'epoch': 1.0}
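The log summary can be cross-checked with a little arithmetic. The sketch below assumes `train_runtime` is in seconds, as reported by the Hugging Face Trainer. The implied sample count matches the 100k rows mentioned earlier, and the ratio of the two rates suggests about 4 samples per optimizer step; whether that is a per-device batch size or includes gradient accumulation is not stated here. The logged runtime works out to roughly 17 hours, which is longer than the "about 12 hours" quoted above, and this card does not explain the difference.

```python
log = {
    "train_runtime": 61098.1062,  # seconds
    "train_samples_per_second": 1.637,
    "train_steps_per_second": 0.409,
}

samples_seen = log["train_runtime"] * log["train_samples_per_second"]
optimizer_steps = log["train_runtime"] * log["train_steps_per_second"]
samples_per_step = log["train_samples_per_second"] / log["train_steps_per_second"]
runtime_hours = log["train_runtime"] / 3600

print(f"samples seen:     {samples_seen:,.0f}")
print(f"optimizer steps:  {optimizer_steps:,.0f}")
print(f"samples per step: {samples_per_step:.2f}")
print(f"runtime in hours: {runtime_hours:.2f}")
```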

Plot derived from the tfevents log at link.

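Pretraining data

For more details on the pretraining process, see MPT-7B.

The data was tokenized using the EleutherAI/gpt-neox-20b tokenizer.

Limitations and biases

The following language is modified from EleutherAI's GPT-NeoX-20B

This model can produce factually incorrect output and should not be relied on to produce factually accurate information. It was trained on various public datasets. Although great efforts have been made to clean the pretraining data, the model may still generate lewd, biased, or otherwise offensive output.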

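How to use

Basic usage: notebook.

Note: This model requires that trust_remote_code=True be passed to the from_pretrained method. This is because we use a custom model architecture that is not yet part of the transformers package.

It includes options for many training efficiency features such as FlashAttention (Dao et al. 2022), ALiBi, QK LayerNorm, and more.

First, install the package dependencies: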

!pip install -q -U transformers einops accelerate torch
!pip install -q -U triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python

Basic model loading:

import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
  'dfurman/mpt-7b-instruct-orca',
  trust_remote_code=True,
  device_map="auto",
)

To use the optimized triton implementation of FlashAttention, you can load the model on a GPU with attn_impl='triton' and bfloat16 precision:

import torch
import transformers

name = 'dfurman/mpt-7b-instruct-orca'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'meta'

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  torch_dtype=torch.bfloat16,
  trust_remote_code=True,
  device_map="auto",
)

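Although the model was trained with a sequence length of 2048, ALiBi makes it possible to increase the maximum sequence length during finetuning and/or inference. Note that larger context windows require more available VRAM. For example: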

import transformers

name = 'dfurman/mpt-7b-instruct-orca'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  trust_remote_code=True,
  device_map="auto",
)
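To give a feel for how VRAM use grows with the context window, the sketch below estimates the size of the key/value cache alone. It assumes standard multi-head attention (so keys and values each have width d_model per layer), bfloat16 storage, a batch size of 1, and that caching is enabled during generation. The helper name `kv_cache_bytes` is hypothetical, and activations and model weights are not counted.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int = 32,
    d_model: int = 4096,
    bytes_per_value: int = 2,  # bfloat16
    batch_size: int = 1,
) -> int:
    """Bytes needed to cache keys and values for every layer and position."""
    return 2 * n_layers * d_model * seq_len * bytes_per_value * batch_size


for seq_len in (2048, 4096, 8192):
    gib = kv_cache_bytes(seq_len) / 2**30
    print(f"{seq_len:>5} tokens -> {gib:.1f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with sequence length, so doubling max_seq_len doubles this part of the memory footprint.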

This model was trained with the EleutherAI/gpt-neox-20b tokenizer. It can be loaded directly from this model's repo:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('dfurman/mpt-7b-instruct-orca')

Once the model and tokenizer are loaded, text can be generated with the following function:

import transformers
import torch

# text generation function
def mpt_generate(
    model: transformers.AutoModelForCausalLM,
    tokenizer: transformers.AutoTokenizer,
    prompt: str,
    max_new_tokens: int = 128,
    temperature: float = 1.0,
) -> str:
    """
    Generate a completion for a prompt
    Uses Hugging Face GenerationConfig defaults
        https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig
    Args:
        model (transformers.AutoModelForCausalLM): Model for text generation
        tokenizer (transformers.AutoTokenizer): Tokenizer for model
        prompt (str): Prompt for text generation
        max_new_tokens (int, optional): Max new tokens after the prompt to generate.
            Defaults to 128.
        temperature (float, optional): The value used to modulate the next token probabilities.
            Defaults to 1.0
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        return_token_type_ids=False,
    ).to(device)

    # when running Torch modules in lower precision, it is best practice to use the torch.autocast context manager.
    with torch.autocast("cuda", dtype=torch.bfloat16):
        response = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            return_dict_in_generate=True,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )

    decoded_output = tokenizer.decode(
        response["sequences"][0],
        skip_special_tokens=True,
    )  # grab output in natural language

    return decoded_output[len(prompt) :]  # remove prompt from output

We can now generate text! For example:

prompt = "You are a helpful assistant. Here is a recipe for vegan banana bread:\n"

response = mpt_generate(
    model,
    tokenizer,
    prompt,
    max_new_tokens=150,
    temperature=0.92,
)

print(response)
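The prompts in this card all follow the same pattern: a short system message, a space, the instruction, and (in the code examples) a trailing newline. A small helper can keep that pattern consistent. This is only a sketch: `build_prompt` is a hypothetical name, and the card does not say whether this exact format was used during finetuning.

```python
def build_prompt(instruction: str, system: str = "You are a helpful assistant.") -> str:
    """Join a system message and an instruction in the style of this card's examples."""
    return f"{system.strip()} {instruction.strip()}\n"


prompt = build_prompt("Write me a numbered list of things to do in New York City.")
print(repr(prompt))
```

The resulting string can be passed as the prompt argument of the generation function shown earlier.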

Runtime tests

runtime / 50 tokens (sec)	GPU	attn	torch dtype	VRAM (GB)
0.61	1x H100 (80 GB PCIe)	triton	bfloat16	12
0.67	1x H100 (80 GB PCIe)	torch	bfloat16	12
1.17	1x A100 (40 GB SXM)	triton	bfloat16	13
1.36	1x A100 (40 GB SXM)	torch	bfloat16	13
2.25	1x V100 (16 GB SXM)	torch	float16	13
3.75	1x V100 (16 GB SXM)	torch	fp4	4
4.84	1x Tesla T4 (15 GB)	torch	float16	13
8.77	1x Tesla T4 (15 GB)	torch	fp4	4

The runtime statistics above (leftmost column) were generated for each test with the following code, as in the corresponding notebook.

import time

import torch
import tqdm

prompt = "You are a helpful assistant. Write me a long list of things to do in San Francisco:\n"

runtimes = []
for i in tqdm.tqdm(range(100)):
    start = time.time()
    response = mpt_generate(
        model,
        tokenizer,
        prompt,
        max_new_tokens=50,
        temperature=0.92,
    )
    end = time.time()
    runtimes.append(end - start)
    assert len(tokenizer.encode(response)) == 50

avg_runtime = torch.mean(torch.tensor(runtimes)).item()
print(f"Runtime avg in seconds: {avg_runtime}")  # time in seconds
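The runtime table can be turned into throughput figures, and its VRAM column compared against a back-of-the-envelope estimate of weight storage. The sketch below uses the 6.65B parameter count from the hyperparameter table and assumes 2 bytes per parameter for bfloat16/float16 and 0.5 bytes for fp4. It ignores activations, caches, and quantization overhead, and whether the table reports GB or GiB is not stated, so expect only rough agreement. The helper names are illustrative.

```python
N_PARAMS = 6.65e9
BYTES_PER_PARAM = {"bfloat16": 2.0, "float16": 2.0, "fp4": 0.5}


def tokens_per_second(seconds_per_50_tokens: float) -> float:
    """Convert the table's 'runtime / 50 tokens' column into tokens per second."""
    return 50 / seconds_per_50_tokens


def weights_gib(dtype: str) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return N_PARAMS * BYTES_PER_PARAM[dtype] / 2**30


rows = [  # (seconds per 50 tokens, setup, dtype, reported VRAM) copied from the table
    (0.61, "H100, triton", "bfloat16", 12),
    (1.17, "A100, triton", "bfloat16", 13),
    (2.25, "V100, torch", "float16", 13),
    (3.75, "V100, torch", "fp4", 4),
    (8.77, "T4, torch", "fp4", 4),
]

for secs, setup, dtype, reported in rows:
    print(
        f"{setup:<13} {dtype:<9} {tokens_per_second(secs):6.1f} tok/s  "
        f"weights ~{weights_gib(dtype):.1f} GiB (table: {reported})"
    )
```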

Acknowledgements

This model was finetuned by Daniel Furman on July 5, 2023 and is intended primarily for research purposes.

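Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

MosaicML citation for MPT-7B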

@online{MosaicML2023Introducing,
    author    = {MosaicML NLP Team},
    title     = {Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs},
    year      = {2023},
    url       = {www.mosaicml.com/blog/mpt-7b},
    note      = {Accessed: 2023-07-02},
    urldate   = {2023-07-02}
}

Author:

Daniel Furman

Dataset size:

24.77 GB