Model:

GeorgiaTechResearchInstitute/galactica-6.7b-evol-instruct-70k

Task:

Text generation

Libraries:

PyTorch Transformers

Datasets:

victor123/evol_instruct_70k

Other:

opt galactica wizardlm alpaca

Preprint:

arxiv:2304.12244

License:

cc-by-nc-4.0

GALACTICA 6.7B Evol-Instruct

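GALACTICA 6.7B fine-tuned on the Evol-Instruct 70k dataset.

The model card for the base model is in the original Galactica repo, and the original GALACTICA paper is listed under Citation.

The Evol-Instruct-70k dataset is available on the Hugging Face Hub as victor123/evol_instruct_70k, and it originates from the WizardLM GitHub repo (arXiv:2304.12244).

Model Details

The GALACTICA models are trained on a large-scale scientific corpus and are designed to perform scientific tasks.

Model Use

The GALACTICA model card specifies that the primary intended users of the GALACTICA models are researchers studying language models applied to the scientific domain. It cautions against using the GALACTICA models in production without safeguards, because the models can produce inaccurate information. The original GALACTICA models are available under a non-commercial CC BY-NC 4.0 license, and models based on the Evol-Instruct-70k dataset are additionally subject to the OpenAI Terms of Service.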

Example Outputs

Full Prompt:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Write out Maxwell's equations and explain each one.

### Response:

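Response:

Maxwell's equations are a set of four equations that describe the behavior of electromagnetic fields. These equations were developed by James Clerk Maxwell in 1864 and are widely used in the study of electromagnetism.

Gauss's law: This equation states that the electric charge enclosed by a closed surface is equal to the electric flux through the surface. Mathematically, it can be expressed as:

$$\nabla \bullet \textbf{E} = \frac{\rho}{\epsilon_0}$$

where E is the electric field, ρ is the charge density, ε0 is the electric constant, and ∇ is the divergence operator.

Faraday's law: This equation states that the electric field around a closed loop is proportional to the rate of change of the magnetic flux through the loop. Mathematically, it can be expressed as:

$$\nabla \times \textbf{E} = -\frac{\partial\textbf{B}}{\partial t}$$

where B is the magnetic field, t is time, and ∇× is the curl operator.

Gauss's law for magnetism: This equation states that the magnetic charge enclosed by a closed surface is equal to the magnetic flux through the surface. Mathematically, it can be expressed as:

$$\nabla \bullet \textbf{B} = 0$$

where B is the magnetic field.

Ampere's law: This equation states that the magnetic field around a closed loop is proportional to the electric current through the loop. Mathematically, it can be expressed as:

$$\nabla \times \textbf{B} = \mu_0\textbf{J} + \mu_0\epsilon_0\frac{\partial \textbf{E}}{\partial t}$$

where J is the current density, μ0 is the magnetic constant, and ε0 is the electric constant.

These equations form the basis of electromagnetic theory and are used to describe the behavior of electromagnetic fields in applications ranging from radio waves to high-energy particles.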

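Training Data

This model was trained by fine-tuning the pre-trained GALACTICA 6.7B model on the Evol-Instruct-70k dataset. The GALACTICA models were trained on 106 billion tokens of open-access scientific text and data, including papers, textbooks, scientific websites, encyclopedias, and more. Fine-tuning the base GALACTICA model on the Evol-Instruct-70k dataset lets users query the resulting model in an instruction-response fashion.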
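
To make the instruction-response format concrete, the following is a minimal sketch of how one dataset record might be turned into a single training string using the Alpaca-style template shown in this card. The record field names `instruction` and `output`, the helper name `build_training_text`, and the single newline between the response marker and the answer are assumptions for illustration; the card does not describe the actual preprocessing code.

```python
# Alpaca-style "no input" template, as quoted elsewhere in this model card.
ALPACA_NO_INPUT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_training_text(record: dict) -> str:
    """Join one instruction/response pair into a single training string.

    Assumes (hypothetically) that the record has 'instruction' and 'output'
    keys and that the answer follows the response marker after one newline.
    """
    prompt = ALPACA_NO_INPUT_TEMPLATE.format(instruction=record["instruction"].strip())
    return f"{prompt}\n{record['output'].strip()}"


example = build_training_text({"instruction": "Add 2 and 2.", "output": "4"})
print(example)
```

At inference time only the prompt part is supplied, and the model is expected to continue the text after the response marker.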

How to Use

The GALACTICA Evol-Instruct-70K weights are made available for use with the transformers library.

# pip install accelerate
import torch  # required for torch.float16 below
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("GeorgiaTechResearchInstitute/galactica-6.7b-evol-instruct-70k")
model = AutoModelForCausalLM.from_pretrained("GeorgiaTechResearchInstitute/galactica-6.7b-evol-instruct-70k", device_map="auto", torch_dtype=torch.float16)

# the evol-instruct models were fine-tuned with the same hidden prompts as the Alpaca project
no_input_prompt_template = ("Below is an instruction that describes a task. "
                            "Write a response that appropriately completes the request.\n\n"
                            "### Instruction:\n{instruction}\n\n### Response:")
prompt = "Write out Maxwell's equations and explain the meaning of each one."
formatted_prompt = no_input_prompt_template.format_map({'instruction': prompt})

tokenized_prompt = tokenizer(formatted_prompt, return_tensors="pt").input_ids.to(model.device)
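# max_new_tokens=256 is an illustrative budget; the library default generation length
# is short and may be smaller than this prompt unless the checkpoint overrides it
out_tokens = model.generate(tokenized_prompt, max_new_tokens=256)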

print(tokenizer.batch_decode(out_tokens, skip_special_tokens=False, clean_up_tokenization_spaces=False))
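
The decoded string above contains the full prompt followed by the model's answer. A small post-processing step can isolate the answer; the sketch below is one possible way to do it with plain string handling. The helper name `extract_response` and the default end-of-sequence string `"</s>"` are assumptions, so pass your tokenizer's actual EOS string if it defines one.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str, eos_token: str = "</s>") -> str:
    """Return the text after the first response marker, cut at the EOS token.

    If the marker is absent, the whole string is treated as the answer.
    The default eos_token is an assumed value, not taken from the model card.
    """
    head, sep, tail = decoded.partition(RESPONSE_MARKER)
    answer = tail if sep else head
    if eos_token:
        # Keep only the text before the first end-of-sequence token.
        answer = answer.split(eos_token, 1)[0]
    return answer.strip()


sample = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nSay hello.\n\n### Response:\nHello there.</s>"
)
print(extract_response(sample))
```

With the snippet above it, this could be applied as `extract_response(tokenizer.batch_decode(out_tokens)[0])`.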

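Training Resources

GALACTICA 6.7B Evol-Instruct was fine-tuned in about 22 hours on 8 A100 80GB GPUs, using 16-bit mixed precision, an effective batch size of 64, and a maximum context window of 2048 tokens. The model was trained using full-shard data parallelism.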
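
As a rough worked example of what these figures imply, the sketch below computes the total GPU-hours, an upper bound on tokens per optimizer step, and the per-device batch size and gradient-accumulation combinations that would yield an effective batch size of 64 on 8 GPUs. The card does not state which combination was actually used, so the helper `batch_options` and its output are illustrative only.

```python
from typing import List, Tuple


def batch_options(effective_batch: int, num_gpus: int) -> List[Tuple[int, int]]:
    """List (per_device_batch, grad_accum_steps) pairs such that
    per_device_batch * grad_accum_steps * num_gpus == effective_batch."""
    if effective_batch % num_gpus != 0:
        raise ValueError("effective batch size must be divisible by the GPU count")
    per_gpu_total = effective_batch // num_gpus
    return [
        (batch, per_gpu_total // batch)
        for batch in range(1, per_gpu_total + 1)
        if per_gpu_total % batch == 0
    ]


NUM_GPUS = 8
HOURS = 22           # "about 22 hours" in the card, so this is approximate
EFFECTIVE_BATCH = 64
MAX_CONTEXT = 2048

gpu_hours = NUM_GPUS * HOURS                      # approximate total GPU-hours
max_tokens_per_step = EFFECTIVE_BATCH * MAX_CONTEXT  # upper bound, if every sequence is full length
options = batch_options(EFFECTIVE_BATCH, NUM_GPUS)

print(gpu_hours, max_tokens_per_step, options)
```

The token count is an upper bound because most instruction-response pairs are shorter than the 2048-token window.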

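Performance and Limitations

Qualitative evaluation suggests that the Evol-Instruct-70k fine-tuned Galactica models are more controllable and more attentive to user prompts than the Alpaca fine-tuned GALPACA models.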

Citation

@inproceedings{GALACTICA,
    title={GALACTICA: A Large Language Model for Science},
    author={Ross Taylor and Marcin Kardas and Guillem Cucurull and Thomas Scialom and Anthony Hartshorn and Elvis Saravia and Andrew Poulton and Viktor Kerkez and Robert Stojnic},
    year={2022}
}

@misc{xu2023wizardlm,
      title={WizardLM: Empowering Large Language Models to Follow Complex Instructions}, 
      author={Can Xu and Qingfeng Sun and Kai Zheng and Xiubo Geng and Pu Zhao and Jiazhan Feng and Chongyang Tao and Daxin Jiang},
      year={2023},
      eprint={2304.12244},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Author:

Georgia Tech Research Institute

Dataset size:

25.57 GB