Model:

bigscience/bloom-3b

Task:

Text generation

Libraries:

PyTorch Safetensors Transformers

Other:

bloom Eval Results text-generation-inference

Preprints:

arxiv:1909.08053 arxiv:2110.02861 arxiv:2108.12409

License:

bigscience-bloom-rail-1.0


BLOOM LM

BigScience Large Open-science Open-access Multilingual Language Model

Model Card

Version 1.0 / 26 May 2022

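Model Details

Basics

This section provides information for anyone who wants to know about the model.

Developed by: BigScience (website)

All collaborators are either volunteers or have an agreement with their employer. (Further breakdown of participants forthcoming.)

Model Type: Transformer-based Language Model

Version: 1.0.0

Languages: Multiple; see training data

License: RAIL License v1.0 (link)

Release Date Estimate: Monday, 11 July 2022

Send Questions to: bigscience-contact@googlegroups.com

Cite as: BigScience, BigScience Language Open-science Open-access Multilingual (BLOOM) Language Model. International, May 2021-May 2022

Funded by:

The French government.
Hugging Face (website).
Organizations of contributors. (Further breakdown of organizations forthcoming.)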

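Technical Specifications

This section provides information for people who work on model development.

For full details on replicating training, see the BLOOM training README.

Model Architecture: Modified from Megatron-LM GPT2 (see paper, BLOOM Megatron code):

Decoder-only architecture
Layer normalization applied to the word embeddings layer (StableEmbedding; see code, paper)
ALiBi positional encodings (see paper), with GeLU activation functions
3,002,557,440 parameters:
- 642,252,800 embedding parameters
- 30 layers, 32 attention heads
- Hidden layers are 2560-dimensional
- Sequence length of 2048 tokens used (see BLOOM tokenizer, tokenizer description)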
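The parameter figures in the list above can be cross-checked with a little arithmetic. The sketch below is not taken from the BLOOM code base. It assumes a standard GPT-style decoder block (fused query/key/value projection, a 4x-wide MLP, biases everywhere, two LayerNorms per block), one LayerNorm after the embeddings and one at the output, an output projection tied to the input embeddings, and ALiBi contributing no learned parameters. All function names are hypothetical.

```python
# Figures quoted in the architecture list above.
HIDDEN = 2560
LAYERS = 30
EMBEDDING_PARAMS = 642_252_800
TOTAL_PARAMS = 3_002_557_440


def embedding_rows(embedding_params: int, hidden: int) -> int:
    """Rows of the embedding matrix implied by its parameter count."""
    rows, remainder = divmod(embedding_params, hidden)
    if remainder:
        raise ValueError("embedding parameters are not a multiple of the hidden size")
    return rows


def block_params(hidden: int) -> int:
    """Parameters of one decoder block (ASSUMED layout, biases included)."""
    attention = 3 * hidden * hidden + 3 * hidden   # fused query/key/value projection
    attention += hidden * hidden + hidden          # attention output projection
    mlp = 4 * hidden * hidden + 4 * hidden         # up-projection to 4 * hidden
    mlp += 4 * hidden * hidden + hidden            # down-projection back to hidden
    layer_norms = 2 * (2 * hidden)                 # two LayerNorms, weight + bias each
    return attention + mlp + layer_norms


def total_params(embedding_params: int, hidden: int, layers: int) -> int:
    """Embeddings + decoder blocks + embedding LayerNorm + final LayerNorm."""
    extra_layer_norms = 2 * (2 * hidden)
    return embedding_params + layers * block_params(hidden) + extra_layer_norms


print(embedding_rows(EMBEDDING_PARAMS, HIDDEN))
print(total_params(EMBEDDING_PARAMS, HIDDEN, LAYERS) == TOTAL_PARAMS)
```

Under these assumptions the total matches the quoted 3,002,557,440 exactly. The embedding matrix works out to 250,880 rows, 200 more than the 250,680-entry vocabulary given in the Tokenization section; the card does not explain the difference.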

Objective Function: Cross Entropy with mean reduction (see API documentation).
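As an illustration of what "cross entropy with mean reduction" computes, here is a minimal pure-Python sketch. It is not the training code: the function names and the toy logits are made up for the example, and a real training run would use a framework implementation over batched tensors.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one vector of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def cross_entropy_mean(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Average negative log-likelihood of the target token at each position."""
    if len(logits) != len(targets) or not targets:
        raise ValueError("need one target per position and at least one position")
    losses = [-log_softmax(row)[target] for row, target in zip(logits, targets)]
    return sum(losses) / len(losses)  # "mean" reduction: average over positions


# Toy example (hypothetical numbers): 2 positions, vocabulary of 4 tokens.
toy_logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0]]
toy_targets = [2, 0]
print(cross_entropy_mean(toy_logits, toy_targets))
```

The first position is a uniform guess over four tokens and contributes ln 4; the second is a confident correct prediction and contributes close to zero, so the mean sits between the two.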

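Compute infrastructure: Jean Zay public supercomputer, provided by the French government (see announcement).

Hardware: 384 A100 80GB GPUs (48 nodes):
- Additional 32 A100 80GB GPUs (4 nodes) in reserve
- 8 GPUs per node, using NVLink 4 inter-GPU connects and 4 OmniPath links
- CPU: AMD
- CPU memory: 512GB per node
- GPU memory: 640GB per node
- Inter-node connect: Omni-Path Architecture (OPA)
- NCCL-communications network: a fully dedicated subnet
- Disc IO network: shared network with other types of nodes
Software:
- Megatron-DeepSpeed (Github link)
- DeepSpeed (Github link)
- PyTorch (pytorch-1.11 w/ CUDA-11.5; see Github link)
- apex (Github link)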
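The hardware figures listed above are mutually consistent, which a few lines of arithmetic confirm. This is only a sanity check on the quoted numbers; the variable names are illustrative.

```python
# Figures quoted in the hardware list above.
NODES = 48
RESERVE_NODES = 4
GPUS_PER_NODE = 8
GPU_MEMORY_GB = 80

training_gpus = NODES * GPUS_PER_NODE
reserve_gpus = RESERVE_NODES * GPUS_PER_NODE
gpu_memory_per_node_gb = GPUS_PER_NODE * GPU_MEMORY_GB
total_gpu_memory_gb = training_gpus * GPU_MEMORY_GB

print(f"training GPUs: {training_gpus}")
print(f"reserve GPUs: {reserve_gpus}")
print(f"GPU memory per node: {gpu_memory_per_node_gb} GB")
print(f"GPU memory across training nodes: {total_gpu_memory_gb} GB")
```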

Training

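Training logs: Tensorboard link

Number of epochs: 1 (current target)
Dates:
- Started 11 March 2022, 11:42am PST
- Ended 5 July 2022
Estimated cost of training: Equivalent of $2-5M in cloud computing (including preliminary experiments)
Server training location: Île-de-France, France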

Tokenization

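The BLOOM tokenizer (link) is a learned subword tokenizer trained using:

A byte-level Byte Pair Encoding (BPE) algorithm
A simple pre-tokenization rule, no normalization
A vocabulary size of 250,680

It was trained on a subset of the 1.5TB of preprocessed text, using alpha-weighting per language.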
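To make "byte-level BPE" concrete, the toy sketch below learns merges over raw UTF-8 bytes and replays them to tokenize new text. It is an illustration of the general technique only, not the BLOOM tokenizer's training code: it applies no pre-tokenization rule (the card does not spell one out), ignores alpha-weighting, and uses hypothetical function names and a made-up two-sentence corpus.

```python
from collections import Counter


def apply_merge(seq: list[bytes], pair: tuple[bytes, bytes]) -> list[bytes]:
    """Replace every non-overlapping occurrence of `pair` with its concatenation."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_byte_bpe(corpus: list[str], num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to `num_merges` merges over UTF-8 bytes (toy version)."""
    # Each text starts as a sequence of single-byte symbols; nothing is normalized.
    sequences = [[bytes([b]) for b in text.encode("utf-8")] for text in corpus]
    merges: list[tuple[bytes, bytes]] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def encode(text: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Tokenize `text` by replaying the learned merges in order."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


merges = train_byte_bpe(["low lower lowest", "slow slowly"], num_merges=3)
tokens = encode("lowly", merges)
print(merges)
print(tokens)
```

Because the base alphabet is the 256 possible byte values, any input string can be encoded without an unknown-token fallback, and concatenating the tokens always reproduces the original bytes.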

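Environmental Impact

The training supercomputer, Jean Zay (website), uses mostly nuclear energy. The heat generated by it is reused for heating campus housing.

Estimated carbon emissions: (Forthcoming upon completion of training.)

Estimated electricity usage: (Forthcoming upon completion of training.)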

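Uses

This section addresses questions around how the model is intended to be used, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model. It provides information for anyone considering using the model or who is affected by the model.

Intended Use

This model is being created in order to enable public research on large language models (LLMs). LLMs are intended to be used for language generation or as a pretrained base model that can be further fine-tuned for specific tasks. The use cases below are not exhaustive.

Direct Use

Text generation
Exploring characteristics of language generated by a language model
- Examples: Cloze tests, counterfactuals, generations with reframings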
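For the text-generation use listed above, a minimal sketch with the Transformers library named in this card's header might look as follows. It assumes `transformers` and PyTorch are installed and that the checkpoint can be downloaded; `generate_text` and its defaults are hypothetical helpers written for this example, not part of any library.

```python
MODEL_ID = "bigscience/bloom-3b"  # model name from the header of this card


def generate_text(prompt: str, max_new_tokens: int = 20) -> str:
    """Continue `prompt` with the model's default decoding settings."""
    # Imported lazily so that defining this helper does not require the
    # heavy dependencies or a network connection.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # The returned sequence contains the prompt followed by the continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("The capital of France is")` downloads the weights on first use and returns the prompt plus up to 20 new tokens. The helper reloads the model on every call to keep the sketch short; real code would load it once and reuse it.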

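Downstream Use

Tasks that leverage language models include: Information Extraction, Question Answering, Summarization

Misuse and Out-of-scope Use

This section addresses what users ought not do with the model.

See the BLOOM License, Attachment A, for detailed usage restrictions. The list below is non-exhaustive, but covers some easily foreseeable problematic use cases.

Out-of-scope Uses

Using the model in high-stakes settings is out of scope for this model. The model is not designed for critical decisions nor for uses with any material consequences on an individual's livelihood or wellbeing. The model can output content that appears factual but is not correct.

Out-of-scope Uses Include:

Usage in biomedical domains, political and legal domains, or finance domains
Usage for evaluating or scoring individuals, such as for employment, education, or credit
Applying the model for critical automatic decisions, generating factual content, creating reliable summaries, or generating predictions that must be correct

Misuse

Intentionally using the model for harm, violating human rights, or other kinds of malicious activities is a misuse of this model. This includes:

Spam generation
Disinformation and influence operations
Disparagement and defamation
Harassment and abuse
Deception
Unconsented impersonation and imitation
Unconsented surveillance
Generating content without attribution to the model, as specified in the RAIL License, Use Restrictions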

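Intended Users

Direct Users

General public
Researchers
Students
Educators
Engineers/developers
Non-commercial entities
Community advocates, including human and civil rights groups

Indirect Users

Users of derivatives created by Direct Users, such as those using software with an intended use
Users of Derivatives of the Model, as described in the License

Others Affected (Parties Prenantes)

People and groups referred to by the LLM
People and groups exposed to outputs of, or decisions based on, the LLM
People and groups whose original work is included in the LLM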

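Training Data

This section provides a high-level overview of the training data. It is relevant for anyone who wants to know the basics of what the model is learning.

Details for each dataset are provided in individual Data Cards.

Training data includes:

45 natural languages
12 programming languages
1.5TB of pre-processed text, converted into 350B unique tokens (see the tokenizer section for more.)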
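The two corpus figures above imply an average number of bytes per token, a rough indicator of how compactly the tokenizer encodes the corpus. The sketch assumes that "1.5TB" means 1.5 x 10^12 bytes and that both figures refer to the same text; neither is stated explicitly in the card.

```python
# Figures quoted in the list above; 1 TB is ASSUMED to mean 10**12 bytes.
CORPUS_BYTES = 1.5e12
CORPUS_TOKENS = 350e9

bytes_per_token = CORPUS_BYTES / CORPUS_TOKENS
print(f"average bytes per token: {bytes_per_token:.2f}")
```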

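Languages

The pie chart shows the distribution of languages in training data.

The following table shows the further distribution of Niger-Congo and Indic languages in the training data.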

Niger Congo	Percentage	Indic	Percentage
Chi Tumbuka	0.00002	Assamese	0.01
Kikuyu	0.00004	Odia	0.04
Bambara	0.00004	Gujarati	0.04
Akan	0.00007	Marathi	0.05
Xitsonga	0.00007	Punjabi	0.05
Sesotho	0.00007	Kannada	0.06
Chi Chewa	0.0001	Nepali	0.07
Setswana	0.0002	Telugu	0.09
Northern Sotho	0.0002	Malayalam	0.10
Fon	0.0002	Urdu	0.10
Kirundi	0.0003	Tamil	0.20
Wolof	0.0004	Bengali	0.50
Luganda	0.0004	Hindi	0.70
Chi Shona	0.001
Isi Zulu	0.001
Igbo	0.001
Xhosa	0.001
Kinyarwanda	0.003
Yoruba	0.006
Swahili	0.02

The following table shows the distribution of programming languages.

Extension	Language	Number of files
java	Java	5,407,724
php	PHP	4,942,186
cpp	C++	2,503,930
py	Python	2,435,072
js	JavaScript	1,905,518
cs	C#	1,577,347
rb	Ruby	678,413
cc	C++	443,054
hpp	C++	391,048
lua	Lua	352,317
go	Go	227,763
ts	TypeScript	195,254
C	C	134,537
scala	Scala	92,052
hh	C++	67,161
H	C++	55,899
tsx	TypeScript	33,107
rs	Rust	29,693
phpt	PHP	9,702
c++	C++	1,342
h++	C++	791
php3	PHP	540
phps	PHP	270
php5	PHP	166
php4	PHP	29
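The table above is broken down by file extension, so several languages appear on more than one row. A short sketch that totals the counts per language is shown below; the rows are copied from the table, and the helper name is hypothetical.

```python
from collections import defaultdict

# (extension, language, number of files) rows copied from the table above.
# The Ruby figure is printed there as "6,78,413"; it is read here as 678,413,
# which ASSUMES a misplaced comma.
ROWS = [
    ("java", "Java", 5_407_724), ("php", "PHP", 4_942_186),
    ("cpp", "C++", 2_503_930), ("py", "Python", 2_435_072),
    ("js", "JavaScript", 1_905_518), ("cs", "C#", 1_577_347),
    ("rb", "Ruby", 678_413), ("cc", "C++", 443_054),
    ("hpp", "C++", 391_048), ("lua", "Lua", 352_317),
    ("go", "Go", 227_763), ("ts", "TypeScript", 195_254),
    ("C", "C", 134_537), ("scala", "Scala", 92_052),
    ("hh", "C++", 67_161), ("H", "C++", 55_899),
    ("tsx", "TypeScript", 33_107), ("rs", "Rust", 29_693),
    ("phpt", "PHP", 9_702), ("c++", "C++", 1_342),
    ("h++", "C++", 791), ("php3", "PHP", 540),
    ("phps", "PHP", 270), ("php5", "PHP", 166),
    ("php4", "PHP", 29),
]


def files_per_language(rows: list[tuple[str, str, int]]) -> dict[str, int]:
    """Sum the file counts of all extensions that map to the same language."""
    totals: dict[str, int] = defaultdict(int)
    for _extension, language, count in rows:
        totals[language] += count
    return dict(totals)


totals = files_per_language(ROWS)
for language, count in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{language:<12}{count:>12,}")
```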

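Risks and Limitations

This section identifies foreseeable harms and misunderstandings.

Model may:

Overrepresent some viewpoints and underrepresent others
Contain stereotypes
Contain personal information
Generate:
- Hateful, abusive, or violent language
- Discriminatory or prejudicial language
- Content that may not be appropriate for all settings, including sexual content
Make errors, including producing incorrect information as if it were factual
Generate irrelevant or repetitive outputs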

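Evaluation

This section describes the evaluation protocols and provides the results.

Metrics

This section describes the different ways performance is calculated and why.

Includes: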

Metric	Why chosen
Perplexity	Standard metric for quantifying model improvements during training
Cross Entropy Loss	Standard objective for language models.

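And multiple different metrics for specific tasks. (More evaluation metrics forthcoming upon completion of evaluation protocol.)

Factors

This section lists some different aspects of BLOOM models. Its focus is on aspects that are likely to give rise to high variance in model behavior.

Language, such as English or Yoruba
Domain, such as newswire or stories
Demographic characteristics, such as gender or nationality

Results

Results are based on the Factors and Metrics.

Zero-shot evaluations:

See this repository for JSON files: https://github.com/bigscience-workshop/evaluation-results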

Task	Language	Metric	BLOOM-2B5
arc_challenge	eng	acc ↑	0.28
arc_easy	eng	acc ↑	0.595
axb (Median of 10 prompts)	eng	acc ↑	0.443
axg (Median of 10 prompts)	eng	acc ↑	0.5
boolq (Median of 11 prompts)	eng	acc ↑	0.617
cb (Median of 15 prompts)	eng	acc ↑	0.304
cola (Median of 5 prompts)	eng	acc ↑	0.611
copa (Median of 9 prompts)	eng	acc ↑	0.63
crows_pairs_english (Median of 6 prompts)	eng	acc ↑	0.497
crows_pairs_french (Median of 7 prompts)	fra	acc ↑	0.503
diabla (Median of 2 prompts)	eng	acc ↑	0.289
gsarti/flores_101_afr	afr	byte_perplexity ↓	6.501
gsarti/flores_101_amh	amh	byte_perplexity ↓	3.973
gsarti/flores_101_ara	ara	byte_perplexity ↓	1.808
gsarti/flores_101_asm	asm	byte_perplexity ↓	5.699
gsarti/flores_101_ast	ast	byte_perplexity ↓	3.925
gsarti/flores_101_azj	azj	byte_perplexity ↓	6.943
gsarti/flores_101_bel	bel	byte_perplexity ↓	3.614
gsarti/flores_101_ben	ben	byte_perplexity ↓	5.121
gsarti/flores_101_bos	bos	byte_perplexity ↓	5.653
gsarti/flores_101_bul	bul	byte_perplexity ↓	2.701
gsarti/flores_101_cat	cat	byte_perplexity ↓	2.305
gsarti/flores_101_ceb	ceb	byte_perplexity ↓	6.291
gsarti/flores_101_ces	ces	byte_perplexity ↓	5.447
gsarti/flores_101_ckb	ckb	byte_perplexity ↓	3.726
gsarti/flores_101_cym	cym	byte_perplexity ↓	12.539
gsarti/flores_101_dan	dan	byte_perplexity ↓	5.183
gsarti/flores_101_deu	deu	byte_perplexity ↓	3.118
gsarti/flores_101_ell	ell	byte_perplexity ↓	2.468
gsarti/flores_101_eng	eng	byte_perplexity ↓	2.019
gsarti/flores_101_est	est	byte_perplexity ↓	9.117
gsarti/flores_101_fas	fas	byte_perplexity ↓	3.058
gsarti/flores_101_fin	fin	byte_perplexity ↓	6.847
gsarti/flores_101_fra	fra	byte_perplexity ↓	1.998
gsarti/flores_101_ful	ful	byte_perplexity ↓	11.466
gsarti/flores_101_gle	gle	byte_perplexity ↓	8.681
gsarti/flores_101_glg	glg	byte_perplexity ↓	3.03
gsarti/flores_101_guj	guj	byte_perplexity ↓	4.955
gsarti/flores_101_hau	hau	byte_perplexity ↓	10.758
gsarti/flores_101_heb	heb	byte_perplexity ↓	3.6
gsarti/flores_101_hin	hin	byte_perplexity ↓	4.713
gsarti/flores_101_hrv	hrv	byte_perplexity ↓	5.822
gsarti/flores_101_hun	hun	byte_perplexity ↓	6.44
gsarti/flores_101_hye	hye	byte_perplexity ↓	3.658
gsarti/flores_101_ibo	ibo	byte_perplexity ↓	5.565
gsarti/flores_101_ind	ind	byte_perplexity ↓	2.16
gsarti/flores_101_isl	isl	byte_perplexity ↓	8.082
gsarti/flores_101_ita	ita	byte_perplexity ↓	2.969
gsarti/flores_101_jav	jav	byte_perplexity ↓	7.057
gsarti/flores_101_jpn	jpn	byte_perplexity ↓	2.776
gsarti/flores_101_kam	kam	byte_perplexity ↓	11.073
gsarti/flores_101_kan	kan	byte_perplexity ↓	5.552
gsarti/flores_101_kat	kat	byte_perplexity ↓	2.523
gsarti/flores_101_kaz	kaz	byte_perplexity ↓	3.39
gsarti/flores_101_kea	kea	byte_perplexity ↓	8.919
gsarti/flores_101_kir	kir	byte_perplexity ↓	3.729
gsarti/flores_101_kor	kor	byte_perplexity ↓	3.933
gsarti/flores_101_lao	lao	byte_perplexity ↓	2.908
gsarti/flores_101_lav	lav	byte_perplexity ↓	7.777
gsarti/flores_101_lin	lin	byte_perplexity ↓	7.525
gsarti/flores_101_lit	lit	byte_perplexity ↓	7.369
gsarti/flores_101_ltz	ltz	byte_perplexity ↓	8.801
gsarti/flores_101_lug	lug	byte_perplexity ↓	8.483
gsarti/flores_101_luo	luo	byte_perplexity ↓	11.976
gsarti/flores_101_mal	mal	byte_perplexity ↓	4.616
gsarti/flores_101_mar	mar	byte_perplexity ↓	5.483
gsarti/flores_101_mkd	mkd	byte_perplexity ↓	2.966
gsarti/flores_101_mlt	mlt	byte_perplexity ↓	15.005
gsarti/flores_101_mon	mon	byte_perplexity ↓	3.411
gsarti/flores_101_mri	mri	byte_perplexity ↓	7.474
gsarti/flores_101_msa	msa	byte_perplexity ↓	2.571
gsarti/flores_101_mya	mya	byte_perplexity ↓	2.414
gsarti/flores_101_nld	nld	byte_perplexity ↓	4.128
gsarti/flores_101_nob	nob	byte_perplexity ↓	5.403
gsarti/flores_101_npi	npi	byte_perplexity ↓	5.199
gsarti/flores_101_nso	nso	byte_perplexity ↓	8.155
gsarti/flores_101_nya	nya	byte_perplexity ↓	8.18
gsarti/flores_101_oci	oci	byte_perplexity ↓	4.862
gsarti/flores_101_orm	orm	byte_perplexity ↓	12.912
gsarti/flores_101_ory	ory	byte_perplexity ↓	5.189
gsarti/flores_101_pan	pan	byte_perplexity ↓	4.698
gsarti/flores_101_pol	pol	byte_perplexity ↓	4.626
gsarti/flores_101_por	por	byte_perplexity ↓	1.975
gsarti/flores_101_pus	pus	byte_perplexity ↓	4.496
gsarti/flores_101_ron	ron	byte_perplexity ↓	4.965
gsarti/flores_101_rus	rus	byte_perplexity ↓	2.05
gsarti/flores_101_slk	slk	byte_perplexity ↓	6.451
gsarti/flores_101_slv	slv	byte_perplexity ↓	6.62
gsarti/flores_101_sna	sna	byte_perplexity ↓	8.462
gsarti/flores_101_snd	snd	byte_perplexity ↓	5.466
gsarti/flores_101_som	som	byte_perplexity ↓	11.959
gsarti/flores_101_spa	spa	byte_perplexity ↓	1.897
gsarti/flores_101_srp	srp	byte_perplexity ↓	2.871
gsarti/flores_101_swe	swe	byte_perplexity ↓	5.055
gsarti/flores_101_swh	swh	byte_perplexity ↓	3.697
gsarti/flores_101_tam	tam	byte_perplexity ↓	4.539
gsarti/flores_101_tel	tel	byte_perplexity ↓	5.807
gsarti/flores_101_tgk	tgk	byte_perplexity ↓	3.599
gsarti/flores_101_tgl	tgl	byte_perplexity ↓	5.667
gsarti/flores_101_tha	tha	byte_perplexity ↓	2.366
gsarti/flores_101_tur	tur	byte_perplexity ↓	4.885
gsarti/flores_101_ukr	ukr	byte_perplexity ↓	2.724
gsarti/flores_101_umb	umb	byte_perplexity ↓	12.767
gsarti/flores_101_urd	urd	byte_perplexity ↓	1.98
gsarti/flores_101_uzb	uzb	byte_perplexity ↓	12.002
gsarti/flores_101_vie	vie	byte_perplexity ↓	1.766
gsarti/flores_101_wol	wol	byte_perplexity ↓	9.144
gsarti/flores_101_xho	xho	byte_perplexity ↓	7.403
gsarti/flores_101_yor	yor	byte_perplexity ↓	5.913
gsarti/flores_101_zho_simpl	zho_simpl	byte_perplexity ↓	2.277
gsarti/flores_101_zho_trad	zho_trad	byte_perplexity ↓	2.518
gsarti/flores_101_zul	zul	byte_perplexity ↓	8.534
headqa	esp	acc ↑	0.264
hellaswag	eng	acc ↑	0.412
logiqa	eng	acc ↑	0.207
mathqa	eng	acc ↑	0.25
mc_taco	eng	em ↑	0.119
mnli (Median of 15 prompts)	eng	acc ↑	0.355
mnli_mismatched (Median of 15 prompts)	eng	acc ↑	0.352
mrpc	eng	acc ↑	0.586
multirc (Median of 11 prompts)	eng	acc ↑	0.538
openbookqa	eng	acc ↑	0.216
piqa	eng	acc ↑	0.708
prost	eng	acc ↑	0.227
pubmedqa	eng	acc ↑	0.616
qnli	eng	acc ↑	0.507
qqp (Median of 7 prompts)	eng	acc ↑	0.384
race	eng	acc ↑	0.352
rte (Median of 6 prompts)	eng	acc ↑	0.477
sciq	eng	acc ↑	0.892
sst (Median of 6 prompts)	eng	acc ↑	0.518
triviaqa	eng	acc ↑	0.042
tydiqa_primary (Median of 24 prompts)	eng	acc ↑	0.301
webqs	eng	acc ↑	0.017
wic (Median of 11 prompts)	eng	acc ↑	0.502
winogrande	eng	acc ↑	0.586
wnli (Median of 6 prompts)	eng	acc ↑	0.472
wsc (Median of 11 prompts)	eng	acc ↑	0.442
humaneval	python	pass@1 ↑	0.155
humaneval	python	pass@10 ↑	0.322
humaneval	python	pass@100 ↑	0.555
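The last three rows of the table report pass@k on HumanEval. The card does not say how those scores were computed; the sketch below shows the unbiased estimator that is commonly used with this benchmark (introduced alongside HumanEval by Chen et al., 2021), on the assumption that something equivalent was applied. The function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from n generated samples is correct, given that c of the n passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (samples, passing samples)."""
    if not results:
        raise ValueError("need at least one problem")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `pass_at_k(4, 1, 2)` considers the six ways of choosing two of four samples; three of those pairs miss the single passing sample, so the estimate is 0.5.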

Train-time Evaluation:

As of 25 May 2022, 15:00 PST:

Training Loss: 2.0
Validation Loss: 2.2
Perplexity: 8.9

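Glossary and Calculations

This section defines common terms and how metrics are calculated.

Loss: A calculation of the difference between what the model has learned and what the data shows ("groundtruth"). The lower the loss, the better. The training process aims to minimize the loss.
Perplexity: This is based on what the model estimates the probability of new data is. The lower the perplexity, the better. If the model is 100% correct at predicting the next token it will see, then the perplexity is 1. Mathematically this is calculated using entropy.
High-stakes settings: Such as those identified as "high-risk AI systems" and "unacceptable risk AI systems" in the European Union's proposed Artificial Intelligence (AI) Act.
Critical decisions: Such as those defined in the United States' proposed Algorithmic Accountability Act.
Human rights: Includes those rights defined in the Universal Declaration of Human Rights, as well as the protected categories in personal information regulation.
Personal Data and Personal Information: Personal data and information is defined in multiple data protection regulations, such as "personal data" in the European Union's General Data Protection Regulation, and "personal information" in the Republic of South Africa's Protection of Personal Information Act and the People's Republic of China's Personal information protection law.
Sensitive characteristics: This includes specifically protected categories in human rights (see UDHR, Article 2) and personal information regulation (see GDPR, Article 9; Protection of Personal Information Act, Chapter 1)
Deception: Doing something to intentionally mislead individuals to believe something that is false, such as by creating bots or chatbots on social media that pose as real people, or generating text documents without making consumers aware that the text is machine generated.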
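The Loss and Perplexity entries above are linked by a simple relation: perplexity is the exponential of the mean negative log-probability the model assigns to the observed tokens. The sketch below assumes the loss is a mean cross-entropy measured in nats (natural logarithm), which the card does not state explicitly; the function names are illustrative.

```python
import math
from typing import Sequence


def perplexity(token_probabilities: Sequence[float]) -> float:
    """exp of the mean negative log-probability assigned to the observed tokens."""
    if not token_probabilities:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)
    return math.exp(mean_nll)


def perplexity_from_loss(mean_cross_entropy_nats: float) -> float:
    """Perplexity implied by a mean cross-entropy loss measured in nats."""
    return math.exp(mean_cross_entropy_nats)


print(perplexity([1.0, 1.0, 1.0]))            # perfect predictions
print(perplexity([0.25, 0.25, 0.25, 0.25]))   # uniform guess over 4 tokens
print(perplexity_from_loss(2.2))              # validation loss quoted in this card
```

A model that always assigns probability 1 to the correct next token gets a perplexity of exactly 1, as the glossary entry says, and a uniform guess over four tokens gets 4. Under the nats assumption, a validation loss that rounds to 2.2 corresponds to a perplexity between roughly 8.6 and 9.5, which is consistent with the 8.9 reported in the train-time evaluation.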

Model Card Authors

Ordered roughly chronologically and by amount of time spent.

Margaret Mitchell, Giada Pistilli, Yacine Jernite, Ezinwanne Ozoani, Marissa Gerchick, Nazneen Rajani, Sasha Luccioni, Irene Solaiman, Maraim Masoud, Somaieh Nikpoor, Carlos Muñoz Ferrandis, Stas Bekman, Christopher Akiki, Danish Contractor, David Lansky, Angelina McMillan-Major, Tristan Thrush, Suzana Ilić, Gérard Dupont, Shayne Longpre, Manan Dey, Stella Biderman, Douwe Kiela, Emi Baylor, Teven Le Scao, Aaron Gokaslan, Julien Launay, Niklas Muennighoff

Author:

BigScience Workshop

Dataset size:

11.2 GB


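Table of Contents

Model Details
- Basics
- Technical Specifications
- Environmental Impact
Uses
- Intended Use
- Misuse and Out-of-scope Use
- Intended Users
Training Data
Risks and Limitations
Evaluation
- Metrics
- Factors
- Results
Recommendations
Glossary and Calculations
More Information
- Dataset Creation
- Technical Specifications
- Initial Results
Model Card Authors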