Model:

tiiuae/falcon-7b

Task:

Text generation

Libraries:

PyTorch Transformers

Datasets:

tiiuae/falcon-refinedweb

Language:

Other:

RefinedWebModel custom_code text-generation-inference

Preprints:

arxiv:2205.14135 arxiv:1911.02150 arxiv:2101.00027 arxiv:2005.14165 arxiv:2104.09864 arxiv:2306.01116

License:

apache-2.0

🚀 Falcon-7B

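Falcon-7B is a 7B-parameter causal decoder-only model built by TII and trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. It is made available under the Apache 2.0 license.

Paper coming soon 😊.

🤗 To get started with Falcon (inference, finetuning, quantization, etc.), we recommend reading this great blogpost from HF!

Why use Falcon-7B?

It outperforms comparable open-source models (e.g., MPT-7B, StableLM, RedPajama, etc.), thanks to being trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. See the OpenLLM Leaderboard.
It features an architecture optimized for inference, with FlashAttention (Dao et al., 2022) and multiquery (Shazeer et al., 2019).
It is made available under a permissive Apache 2.0 license allowing for commercial use, without any royalties or restrictions.

⚠️ This is a raw, pretrained model, which should be further finetuned for most use cases. If you are looking for a version better suited to taking generic instructions in a chat format, we recommend taking a look at Falcon-7B-Instruct.

🔥 Looking for an even more powerful model? Falcon-40B is Falcon-7B's big brother!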

import torch
import transformers
from transformers import AutoTokenizer

# Hugging Face Hub identifier of the pretrained checkpoint.
model = "tiiuae/falcon-7b"

tokenizer = AutoTokenizer.from_pretrained(model)

# Build a text-generation pipeline. bfloat16 halves memory relative to float32,
# trust_remote_code=True allows the custom model code shipped with the checkpoint,
# and device_map="auto" (requires the accelerate package) places the weights.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Sample one continuation with top-k sampling; max_length counts the prompt too.
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

💥 Falcon LLMs require PyTorch 2.0 for use with transformers!

For fast inference with Falcon, check out Text Generation Inference! Read more in this blogpost.

You will need at least 16GB of memory to swiftly run inference with Falcon-7B.
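
As a rough sanity check on the 16GB figure, the sketch below estimates the memory taken by the weights alone. It assumes a nominal 7 billion parameters (the exact count is not given here) and ignores activations, the attention cache, and framework overhead, so it is a lower bound rather than a measurement. The helper name `weight_memory_gb` is illustrative.

```python
# Back-of-the-envelope weight memory for a nominal 7B-parameter model.
# Assumption: exactly 7e9 parameters; the real count differs slightly.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Return the size of the weights in decimal gigabytes (1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
```

Under these assumptions the bfloat16 weights come to about 14 GB, which leaves little headroom below 16GB for everything else.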

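Model Card for Falcon-7B

Model Details

Model Description

Developed by: https://www.tii.ae;
Model type: Causal decoder-only;
Language(s) (NLP): English and French;
License: Apache 2.0.

Model Source

Paper: coming soon.

Uses

Direct Use

Research on large language models; as a foundation for further specialization and finetuning for specific use cases (e.g., summarization, text generation, chatbots, etc.).

Out-of-Scope Use

Production use without adequate assessment of risks and mitigation; any use case that may be considered irresponsible or harmful.

Bias, Risks, and Limitations

Falcon-7B is trained on English and French data only, and will not generalize appropriately to other languages. Furthermore, as it is trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.

Recommendations

We recommend that users of Falcon-7B consider finetuning it for the specific set of tasks of interest, and that guardrails and appropriate precautions be taken for any production use.

How to Get Started with the Model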

import torch
import transformers
from transformers import AutoTokenizer

# Hugging Face Hub identifier of the pretrained checkpoint.
model = "tiiuae/falcon-7b"

tokenizer = AutoTokenizer.from_pretrained(model)

# Build a text-generation pipeline. bfloat16 halves memory relative to float32,
# trust_remote_code=True allows the custom model code shipped with the checkpoint,
# and device_map="auto" (requires the accelerate package) places the weights.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Sample one continuation with top-k sampling; max_length counts the prompt too.
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
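
The `do_sample=True` and `top_k=10` arguments above ask the pipeline to sample each next token from the 10 most likely candidates instead of always taking the single most likely one. The following is a minimal pure-Python sketch of that idea, not the transformers implementation; the function name `sample_top_k` and the toy scores are illustrative assumptions.

```python
import math
import random


def sample_top_k(logits, k, rng=random):
    """Sample a token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (subtract the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # Draw one index in proportion to its weight.
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]  # hypothetical scores for 5 tokens
    rng = random.Random(0)
    draws = [sample_top_k(toy_logits, k=2, rng=rng) for _ in range(1000)]
    # Only the two best tokens (indices 3 and 0) can ever be drawn.
    print(sorted(set(draws)))
```

With `k=1` the function reduces to greedy decoding; larger values of `k` trade determinism for variety.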

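Training Details

Training Data

Falcon-7B was trained on 1,500B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset which we enhanced with curated corpora. Significant components from our curated corpora were inspired by The Pile (Gao et al., 2020).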

Data source	Fraction	Tokens	Sources
RefinedWeb-English	79%	1,185B	massive web crawl
Books	7%	110B
Conversations	6%	85B	Reddit, StackOverflow, HackerNews
Code	3%	45B
RefinedWeb-French	3%	45B	massive web crawl
Technical	2%	30B	arXiv, PubMed, USPTO, etc.
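
The token counts in the table can be cross-checked against the 1,500B total and the stated fractions. The sketch below does that arithmetic; the key used for the first row is an assumed label for the English web-crawl portion, and the percentages are rounded to the nearest integer as the table appears to do.

```python
# Token counts in billions, copied from the table above.
# The first key is an assumed label for the English web-crawl row.
tokens_b = {
    "RefinedWeb-English": 1185,
    "Books": 110,
    "Conversations": 85,
    "Code": 45,
    "RefinedWeb-French": 45,
    "Technical": 30,
}


def fractions(counts):
    """Return each source's share of the total as a rounded percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


if __name__ == "__main__":
    print("total:", sum(tokens_b.values()), "B tokens")
    for name, pct in fractions(tokens_b).items():
        print(f"{name}: {pct}%")
```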

The data was tokenized with the Falcon-7B/40B tokenizer.

Training Procedure

Falcon-7B was trained on 384 A100 40GB GPUs, using a 2D parallelism strategy (PP=2, DP=192) combined with ZeRO.
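
The GPU count follows from the two parallelism degrees: each model replica is split into PP pipeline stages, and DP replicas train side by side. A minimal sketch of that arithmetic, assuming one GPU per pipeline stage (the card does not spell this out); the function name is illustrative:

```python
def gpus_required(pipeline_parallel: int, data_parallel: int) -> int:
    """GPUs needed when every pipeline stage of every replica gets one GPU."""
    return pipeline_parallel * data_parallel


if __name__ == "__main__":
    print(gpus_required(pipeline_parallel=2, data_parallel=192))  # 384
```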

Training Hyperparameters

Hyperparameter	Value	Comment
Precision	bfloat16
Optimizer	AdamW
Learning rate	6e-4	4B tokens warm-up, cosine decay to 1.2e-5
Weight decay	1e-1
Z-loss	1e-4
Batch size	2304	30B tokens ramp-up
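
The learning-rate row describes a warm-up followed by a cosine decay. The sketch below is one plausible reading of it, not the actual training code: it assumes the warm-up is linear from zero, that the schedule is indexed by tokens seen, and that the decay runs over the remainder of the 1,500B-token run.

```python
import math

PEAK_LR = 6e-4
MIN_LR = 1.2e-5
WARMUP_TOKENS = 4e9
TOTAL_TOKENS = 1500e9  # assumed decay horizon: the full training run


def learning_rate(tokens_seen: float) -> float:
    """Linear warm-up to PEAK_LR, then cosine decay down to MIN_LR."""
    if tokens_seen < WARMUP_TOKENS:
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    progress = (tokens_seen - WARMUP_TOKENS) / (TOTAL_TOKENS - WARMUP_TOKENS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    for t in (0, 2e9, 4e9, 750e9, 1500e9):
        print(f"{t / 1e9:>6.0f}B tokens -> lr = {learning_rate(t):.2e}")
```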

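Speeds, Sizes, Times

Training happened in early March 2023 and took about two weeks.

Evaluation

Paper coming soon.

See the OpenLLM Leaderboard for early results.

Technical Specifications

Model Architecture and Objective

Falcon-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).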
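
To make the objective concrete, here is a minimal pure-Python sketch of a next-token cross-entropy loss. It illustrates the general technique only; the function name and the toy inputs are assumptions, and real training code would operate on batched tensors.

```python
import math


def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits[t] holds one score per vocabulary entry for the token that
    follows position t; token_ids is the observed sequence.
    """
    losses = []
    for t in range(len(token_ids) - 1):
        scores = logits[t]
        target = token_ids[t + 1]  # the label is the *next* token
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[target])  # -log softmax(scores)[target]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy example: vocabulary of 3 tokens, sequence of 3 tokens.
    toy_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(next_token_loss(toy_logits, [0, 1, 2]))  # uniform scores give log(3)
```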

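The architecture is broadly adapted from the GPT-3 paper (Brown et al., 2020), with the following differences:

Positional embeddings: rotary (Su et al., 2021);
Attention: multiquery (Shazeer et al., 2019) and FlashAttention (Dao et al., 2022);
Decoder block: parallel attention/MLP with a single layer norm.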
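
Of the differences listed above, rotary positional embeddings are the easiest to illustrate in a few lines. The sketch below rotates consecutive pairs of a vector by position-dependent angles, following the convention in Su et al. (2021) with an assumed base of 10000; Falcon's own code may lay out the pairs differently. It shows the key property: the dot product of a rotated query and key depends only on their relative offset.

```python
import math


def rotary(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 0.5, -0.3, 2.0]
    k = [0.2, -1.0, 0.7, 0.1]
    # Same offset (3 positions apart) at different absolute positions:
    # the two scores agree up to floating-point rounding.
    print(dot(rotary(q, 5), rotary(k, 2)))
    print(dot(rotary(q, 105), rotary(k, 102)))
```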

Hyperparameter	Value	Comment
Layers	32
d_model	4544	Increased to compensate for multiquery
head_dim	64	Reduced to optimise for FlashAttention
Vocabulary	65024
Sequence length	2048
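
The table implies the number of query heads and hints at why multiquery helps inference. The sketch below derives both, assuming that multiquery here means a single key/value head shared by all query heads (as in Shazeer et al., 2019) and that the cache is stored in bfloat16; the function names are illustrative.

```python
D_MODEL = 4544
HEAD_DIM = 64
LAYERS = 32
SEQ_LEN = 2048
BYTES_BF16 = 2


def n_query_heads(d_model: int, head_dim: int) -> int:
    """Number of query heads implied by the model width and head size."""
    assert d_model % head_dim == 0
    return d_model // head_dim


def kv_cache_bytes(n_kv_heads: int, seq_len: int = SEQ_LEN) -> int:
    """Bytes for keys plus values over all layers for one sequence."""
    return 2 * LAYERS * n_kv_heads * HEAD_DIM * seq_len * BYTES_BF16


if __name__ == "__main__":
    heads = n_query_heads(D_MODEL, HEAD_DIM)
    print("query heads:", heads)
    print("multi-head cache:", kv_cache_bytes(heads) / 2**20, "MiB")
    print("multiquery cache:", kv_cache_bytes(1) / 2**20, "MiB")
```

Under these assumptions a full 2048-token sequence needs 16 MiB of key/value cache with multiquery, against 1136 MiB if each of the 71 query heads kept its own keys and values.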

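Compute Infrastructure

Hardware

Falcon-7B was trained on AWS SageMaker, on 384 A100 40GB GPUs in P4d instances.

Software

Falcon-7B was trained with a custom distributed training codebase, Gigatron. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels (FlashAttention, etc.).

Citation

Paper coming soon 😊. In the meantime, you can use the following information to cite: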

@article{falcon40b,
  title={{Falcon-40B}: an open large language model with state-of-the-art performance},
  author={Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Hesslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme},
  year={2023}
}

To learn more about the pretraining dataset, see the 📓 RefinedWeb paper.

@article{refinedweb,
  title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
  author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
  journal={arXiv preprint arXiv:2306.01116},
  eprint={2306.01116},
  eprinttype = {arXiv},
  url={https://arxiv.org/abs/2306.01116},
  year={2023}
}

License

Falcon-7B is made available under the Apache 2.0 license.

Contact

falconllm@tii.ae

Author:

Technology Innovation Institute

Dataset size:

13.45 GB