Model:

tiiuae/falcon-rw-7b

Task:

Text generation

Libraries:

PyTorch Transformers

Datasets:

tiiuae/falcon-refinedweb

Language:

English

Other:

RefinedWebModel custom_code falcon text-generation-inference

Preprints:

arxiv:2306.01116 arxiv:2005.14165 arxiv:2108.12409 arxiv:2205.14135

License:

apache-2.0

Falcon-RW-7B

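Falcon-RW-7B is a 7B-parameter causal decoder-only model built by TII and trained on 350B tokens of RefinedWeb. It is made available under the Apache 2.0 license.

See the paper on arXiv (https://arxiv.org/abs/2306.01116) for more details.

RefinedWeb is a high-quality web dataset built through stringent filtering and large-scale deduplication. Falcon-RW-7B, trained on RefinedWeb only, matches or outperforms comparable models trained on curated data.

⚠️ Falcon is now available as a core model in the transformers library! To use the in-library version, install the latest version of transformers with pip install git+https://github.com/huggingface/transformers.git, then remove the trust_remote_code=True argument from from_pretrained().

⚠️ This model is intended for use as a research artifact, to study the influence of training on web data alone. If you are interested in state-of-the-art models, we recommend Falcon-7B/40B, both trained on more than 1,000 billion tokens.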

from transformers import AutoTokenizer
import transformers
import torch

model = "tiiuae/falcon-rw-7b"

# Load the tokenizer and build a text-generation pipeline in bfloat16;
# device_map="auto" places the weights on the available devices.
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Sample one continuation with top-k sampling (k=10).
# max_length counts the prompt tokens plus the generated tokens.
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
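
The pipeline call above samples with do_sample=True and top_k=10. As a rough illustration of what top-k sampling does, here is a minimal standard-library sketch. It is not the transformers implementation; the function name top_k_sample and the toy logits are hypothetical, chosen only for this example.

```python
import heapq
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token id, considering only the k highest logits."""
    # Keep the (index, logit) pairs with the k largest logits.
    top = heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])
    ids = [i for i, _ in top]
    # Softmax over the kept logits, shifted by the max for stability.
    peak = max(score for _, score in top)
    weights = [math.exp(score - peak) for _, score in top]
    return rng.choices(ids, weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]  # hypothetical 5-token vocabulary
drawn = {top_k_sample(toy_logits, k=2, rng=rng) for _ in range(200)}
print(drawn)  # only ids 0 and 3 can appear: they hold the two largest logits
```

With k=1 the sketch reduces to greedy decoding, since only the single highest logit survives the cut.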

Falcon LLMs require PyTorch 2.0 for use with transformers!

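Model Card for Falcon-RW-7B

Model Details

Model Description

Developed by: https://www.tii.ae;
Model type: causal decoder-only;
Language(s) (NLP): English;
License: Apache 2.0.

Model Source

Paper: https://arxiv.org/abs/2306.01116.

Uses

Direct Use

Research on large language models, specifically the influence of adequately filtered and deduplicated web data on the properties of large language models (fairness, safety, limitations, capabilities, etc.).

Out-of-Scope Use

Production use without adequate assessment of risks and mitigations; any use case that may be considered irresponsible or harmful.

Broadly speaking, we recommend Falcon-7B/40B for any use not directly related to research on web data pipelines.

Bias, Risks, and Limitations

Falcon-RW-7B is trained on English data only and will not generalize appropriately to other languages. Furthermore, because it is trained on a large-scale corpus representative of the web, it will carry the stereotypes and biases commonly encountered online.

Recommendations

We recommend that users of Falcon-RW-7B consider finetuning it for their specific tasks, and that guardrails and appropriate precautions be put in place for any production use.

How to Get Started with the Model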

from transformers import AutoTokenizer
import transformers
import torch

model = "tiiuae/falcon-rw-7b"

# Load the tokenizer and build a text-generation pipeline in bfloat16;
# device_map="auto" places the weights on the available devices.
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Sample one continuation with top-k sampling (k=10).
# max_length counts the prompt tokens plus the generated tokens.
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Girafatron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

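Training Details

Training Data

Falcon-RW-7B was trained on 350B tokens of RefinedWeb, a high-quality filtered and deduplicated web dataset. The data was tokenized with the Falcon-7B/40B tokenizer.

Training Procedure

Falcon-RW-7B was trained on 256 A100 40GB GPUs, using a 3D parallelism strategy (TP=2, PP=2, DP=64) combined with ZeRO.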
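
The three parallelism degrees multiply to the GPU count: 2 × 2 × 64 = 256. The sketch below checks that arithmetic and shows one possible way to map a flat GPU rank to (data, pipeline, tensor) coordinates. The rank ordering, with tensor parallelism varying fastest, is an assumption made for illustration; the card does not describe how Gigatron lays out ranks.

```python
TP, PP, DP = 2, 2, 64  # degrees stated in the card


def world_size(tp, pp, dp):
    """Total number of GPUs implied by the three parallelism degrees."""
    return tp * pp * dp


def rank_to_coords(rank, tp=TP, pp=PP):
    """Map a flat rank to (dp, pp, tp) indices.

    Assumed ordering (hypothetical): tensor-parallel index varies fastest,
    then pipeline stage, then data-parallel replica.
    """
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


print(world_size(TP, PP, DP))
print(rank_to_coords(5))
```

Under this assumed ordering, each group of four consecutive ranks holds one full model replica (two pipeline stages, each split across two GPUs), and there are 64 such replicas.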

Training Hyperparameters

Hyperparameters were adapted from the GPT-3 paper (Brown et al., 2020).

Hyperparameter	Value	Comment
Precision	bfloat16
Optimizer	AdamW
Learning rate	1.2e-4	500M tokens warm-up, cosine decay to 1.2e-5
Weight decay	1e-1
Batch size	1024	4B tokens ramp-up
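
The learning-rate row can be read as a schedule over tokens seen. The following is a minimal sketch under stated assumptions: the warm-up is taken to be linear from zero, and the cosine decay is taken to end at 350B tokens (the training budget stated in this card). Neither detail is given in the table, so treat both as guesses.

```python
import math

PEAK_LR = 1.2e-4        # from the table
FINAL_LR = 1.2e-5       # from the table
WARMUP_TOKENS = 500e6   # from the table
TOTAL_TOKENS = 350e9    # assumption: decay ends at the full token budget


def learning_rate(tokens_seen):
    """Learning rate after a given number of training tokens."""
    if tokens_seen < WARMUP_TOKENS:
        # Assumed linear warm-up from 0 to the peak learning rate.
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    decay_span = TOTAL_TOKENS - WARMUP_TOKENS
    progress = min(1.0, (tokens_seen - WARMUP_TOKENS) / decay_span)
    # Cosine factor goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


for tokens in (0, 250e6, 500e6, 175e9, 350e9):
    print(f"{tokens:>16,.0f} tokens -> lr = {learning_rate(tokens):.3e}")
```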

Speeds, Sizes, Times

Training took place in early 2023 and lasted about five days.
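
For a sense of scale, the figures in this card imply an average throughput that can be estimated with simple arithmetic. This is a back-of-the-envelope sketch only: it assumes exactly five days of uninterrupted training on all 256 GPUs, which the card states only approximately.

```python
TOKENS = 350e9  # training tokens, from the card
DAYS = 5        # "about five days": treated as exact here (assumption)
GPUS = 256      # A100 40GB GPUs, from the card

seconds = DAYS * 24 * 3600
tokens_per_second = TOKENS / seconds
tokens_per_gpu_second = tokens_per_second / GPUS

print(f"{tokens_per_second:,.0f} tokens/s overall")
print(f"{tokens_per_gpu_second:,.0f} tokens/s per GPU")
```

Under these assumptions the estimate comes out at roughly 810,000 tokens per second overall, or a little under 3,200 tokens per second per GPU.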

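Evaluation

See the paper on arXiv for in-depth evaluation results.

Technical Specifications

Model Architecture and Objective

Falcon-RW-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predicting the next token).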
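
The next-token objective can be illustrated with a small standard-library sketch: the logits at position t are scored against the token at position t+1, and the loss is the mean cross-entropy over those positions. The function name and the toy inputs are hypothetical; actual training code would use a tensor library and batched inputs.

```python
import math


def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits: one list of vocabulary scores per position.
    token_ids: the token id observed at each position.
    """
    total = 0.0
    # Position t predicts token t+1, so the last position has no target.
    for t in range(len(token_ids) - 1):
        row = logits[t]
        peak = max(row)
        # Log of the softmax normaliser, computed stably.
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[token_ids[t + 1]]
    return total / (len(token_ids) - 1)


# Uniform logits over a 4-token toy vocabulary give a loss of log(4).
uniform = [[0.0] * 4 for _ in range(3)]
print(next_token_loss(uniform, [0, 1, 2]))
```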

The architecture is adapted from the GPT-3 paper (Brown et al., 2020), but uses ALiBi (Press et al., 2021) and FlashAttention (Dao et al., 2022).
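
ALiBi replaces positional embeddings with a per-head linear penalty on attention scores that grows with the distance between query and key. The sketch below follows the slope formula from the ALiBi paper for power-of-two head counts. It is an illustration, not the model's code, and the figure of 64 heads is inferred from d_model / head_dim in this card rather than stated directly.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads)."""
    assert num_heads & (num_heads - 1) == 0, "sketch covers powers of two only"
    ratio = 2.0 ** (-8.0 / num_heads)
    return [ratio ** (h + 1) for h in range(num_heads)]


def alibi_bias(slope, seq_len):
    """Bias added to attention scores for one head.

    Visible keys (j <= i) get -slope * (i - j). Future keys get -inf,
    which is the usual causal mask rather than part of ALiBi itself.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(64)  # 64 heads assumed: 4096 / 64
print(slopes[0], slopes[-1])
for row in alibi_bias(0.5, 3):
    print(row)
```

Heads with large slopes attend mostly to nearby tokens, while heads with small slopes can look further back.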

Hyperparameter	Value	Comment
Layers	36	Increased due to a config error when switching from a multi-query architecture
d_model	4096
head_dim	64	Reduced to optimise for FlashAttention
Vocabulary	65024
Sequence length	2048
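
The table values are enough for a rough parameter-count estimate. The sketch below uses the common approximation of 12 · d_model² parameters per transformer layer (attention plus a 4x-wide MLP) and one embedding matrix. The MLP width and the single shared embedding are assumptions not stated in this card, and biases and layer norms are ignored.

```python
LAYERS, D_MODEL, HEAD_DIM, VOCAB = 36, 4096, 64, 65024  # from the table

# Head count implied by the table (not listed explicitly).
num_heads = D_MODEL // HEAD_DIM

# Assumed per-layer cost: 4*d^2 for attention + 8*d^2 for a 4x-wide MLP.
block_params = 12 * D_MODEL ** 2 * LAYERS
# Assumed single embedding matrix shared with the output projection.
embedding_params = VOCAB * D_MODEL
total_params = block_params + embedding_params

print(f"heads: {num_heads}")
print(f"estimated parameters: {total_params / 1e9:.2f}B")
```

The estimate lands at about 7.5B parameters, consistent with the 7B class in the model's name.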

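Compute Infrastructure

Hardware

Falcon-RW-7B was trained on AWS SageMaker, on 256 A100 40GB GPUs in P4d instances.

Software

Falcon-RW-7B was trained with Gigatron, a custom distributed training codebase. Gigatron combines a 3D parallelism strategy with ZeRO and high-performance Triton kernels (FlashAttention, etc.).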

Citation

@article{refinedweb,
  title={The {R}efined{W}eb dataset for {F}alcon {LLM}: outperforming curated corpora with web data, and web data only},
  author={Guilherme Penedo and Quentin Malartic and Daniel Hesslow and Ruxandra Cojocaru and Alessandro Cappelli and Hamza Alobeidli and Baptiste Pannier and Ebtesam Almazrouei and Julien Launay},
  journal={arXiv preprint arXiv:2306.01116},
  eprint={2306.01116},
  eprinttype = {arXiv},
  url={https://arxiv.org/abs/2306.01116},
  year={2023}
}

Contact

falconllm@tii.ae

Author:

Technology Innovation Institute

Dataset size:

14.5 GB