Model:

GeorgiaTechResearchInstitute/starcoder-gpteacher-code-instruct

Task:

Text generation

Libraries:

PyTorch Transformers

Datasets:

bigcode/the-stack-dedup teknium1/GPTeacher-codegen

Other:

gpt_bigcode Code-Gen text-generation-inference

Preprints:

arxiv:1911.02150 arxiv:2205.14135 arxiv:2207.14255 arxiv:2305.06161

License:

bigcode-openrail-m

StarCoder GPTeacher-Codegen Fine-Tuned

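This model is bigcode/starcoder fine-tuned on the teknium1/GPTeacher codegen dataset (GPT-4 code instruction fine-tuning).

Model Details

The base StarCoder models are 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2), with opt-out requests excluded. The model uses Multi Query Attention, a context window of 8192 tokens, and was trained using the Fill-in-the-Middle objective on 1 trillion tokens.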
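
One practical effect of Multi Query Attention is a much smaller key/value cache during generation, because all query heads share a single key/value head. The sketch below compares the cache size at the full 8192-token context. The layer count, head count, and head dimension are assumptions chosen for illustration (they are not stated in this card), and `kv_cache_bytes` is a hypothetical helper, not part of any library.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: every layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed, illustrative dimensions (not given in this model card).
N_LAYERS = 40
N_HEADS = 48
HEAD_DIM = 128
CONTEXT = 8192  # context window stated in the model details

# Standard multi-head attention keeps one key/value head per query head.
multi_head = kv_cache_bytes(CONTEXT, N_LAYERS, N_HEADS, HEAD_DIM)
# Multi-query attention keeps a single key/value head shared by all query heads.
multi_query = kv_cache_bytes(CONTEXT, N_LAYERS, 1, HEAD_DIM)

print(f"multi-head : {multi_head / 2**30:.2f} GiB")
print(f"multi-query: {multi_query / 2**20:.0f} MiB")
print(f"reduction  : {multi_head // multi_query}x")
```

Under these assumptions the cache shrinks by a factor equal to the number of query heads, which is what makes long contexts cheaper to serve.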

Repository: bigcode/Megatron-LM
Project Website: bigcode-project.org
Paper: 💫StarCoder: May the source be with you!
Point of Contact: contact@bigcode-project.org
Languages: 80+ programming languages

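Uses

Intended use

The base model was trained on GitHub code and then fine-tuned to follow instructions. A prompt such as "Write a function that computes the square root." should yield a reasonable result. The original repository recommends few-shot prompting the base model with the Tech Assistant prompt so that it behaves like a technical assistant. This fine-tuned model uses the Alpaca prompts instead.

Generation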
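
As a rough sketch of how an Alpaca-style prompt can be assembled, the helper below fills in the template with or without an extra input. The with-input wording is the one used in this card's generation example. The no-input wording is the standard Alpaca variant, and whether this model saw it during fine-tuning is an assumption. `build_prompt` is a hypothetical helper name.

```python
# Template used when the instruction comes with additional context.
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:"
)

# Standard Alpaca variant without an input section (assumed to apply here).
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:"
)


def build_prompt(instruction, extra_input=""):
    """Return an Alpaca-style prompt, choosing the template by input presence."""
    if extra_input.strip():
        return ALPACA_WITH_INPUT.format(instruction=instruction, input=extra_input)
    return ALPACA_NO_INPUT.format(instruction=instruction)


print(build_prompt("Write a function that computes the square root."))
```

Braces inside `extra_input` (for example C source code) are safe, because only the template is parsed for placeholders, not the substituted values.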

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "GeorgiaTechResearchInstitute/starcoder-gpteacher-code-instruct"
device = "cuda"

input_prompt = ("Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n"
                "### Instruction:\n"
                "{instruction}\n\n"
                "### Input:\n"
                "{input}\n\n"
                "### Response:")

prompt = "Please explain the following program."
# Triple quotes are required because the C source spans multiple lines.
extra_input = """send(to, from, count)
register short *to, *from;
register count;
{
  register n = (count + 7) / 8;
  switch (count % 8) {
  case 0: do { *to = *from++;
  case 7:      *to = *from++;
  case 6:      *to = *from++;
  case 5:      *to = *from++;
  case 4:      *to = *from++;
  case 3:      *to = *from++;
  case 2:      *to = *from++;
  case 1:      *to = *from++;
    } while (--n > 0);
  }
}"""
prompt = input_prompt.format_map({"instruction": prompt, "input": extra_input})
                  
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True, torch_dtype=torch.float16).to(device)

inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
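# Set an explicit generation budget; 256 is an illustrative value, not one given by the model authors.
outputs = model.generate(inputs, max_new_tokens=256)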
print(tokenizer.decode(outputs[0]))
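
The decoded output contains the full prompt followed by the model's answer. A minimal sketch for keeping only the answer is shown below. `extract_response` is a hypothetical helper, and the default end-of-sequence string `<|endoftext|>` is an assumption; in practice `tokenizer.eos_token` could be passed instead.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded, eos_token="<|endoftext|>"):
    """Return the text after the first response marker, without the EOS token."""
    # If the marker is missing, fall back to the whole decoded string.
    head, marker, tail = decoded.partition(RESPONSE_MARKER)
    answer = tail if marker else head
    # Drop the end-of-sequence token and anything generated after it.
    answer = answer.split(eos_token, 1)[0]
    return answer.strip()


decoded = (
    "### Instruction:\nExplain the program.\n\n"
    "### Response:\nIt copies data between two buffers.<|endoftext|>"
)
print(extract_response(decoded))
```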

Example Outputs

Full Prompt:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
What does this program do and how does it work?
### Input:
send(to, from, count)
register short *to, *from;
register count;
{
    register n = (count + 7) / 8;
    switch (count % 8) {
        case 0: *to = *from++;
        case 7: *to = *from++;
        case 6: *to = *from++;
        case 5: *to = *from++;
        case 4: *to = *from++;
        case 3: *to = *from++;
        case 2: *to = *from++;
        case 1: *to = *from++;
    }
    while (--n > 0) {
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
        *to = *from++;
    }
}
### Response:

Response:

This program is designed to efficiently copy bytes from one memory location to another. It copies 'count' bytes from the location pointed to by 'from' to the location pointed to by 'to', and works correctly regardless of address alignment. The function 'foo' is a wrapper function that initializes the 'to' and 'from' pointers, checks for proper memory bounds, and performs byte alignment if necessary.

To use this program, you must declare two pointer variables, 'to' and 'from', and then call the 'foo' function with the following parameters:

foo(to, from, count)

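Attribution & Other Requirements

The pretraining dataset of the model was filtered for permissive licenses only. Nevertheless, the model can reproduce source code from the dataset verbatim. The license of that code may require attribution or impose other specific requirements, which must be respected. The BigCode project provides a search index that lets you search the pretraining data to identify where generated code came from and apply the proper attribution to your code.

Limitations

The model was trained on source code from 80+ programming languages. The predominant natural language in the source code is English, although other languages are also present. As such, the model can generate code snippets given some context, but the generated code is not guaranteed to work as intended. It can be inefficient and may contain bugs or exploits. See the original paper for an in-depth discussion of the model's limitations. The fine-tuning process makes the model more responsive to direct user input; however, this is an early attempt at instruction fine-tuning StarCoder models, and the results may not be representative of the model's full potential.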

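Training

Model

Architecture: GPT-2 model with multi-query attention and Fill-in-the-Middle objective
Pretraining steps: 250k
Pretraining tokens: 1 trillion
Precision: bfloat16
Fine-tuning instruct-response pairs: 4.5k
Fine-tuning context length: 1024
Fine-tuning epochs: 3
Fine-tuning learning rate: 2e-5
Fine-tuning optimization: FSDP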
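
To put the fine-tuning scale in perspective, the figures above give an upper bound on the number of tokens seen during fine-tuning. This is a back-of-the-envelope sketch: it assumes every pair fills the whole 1024-token context, which real examples usually do not, and it treats "4.5k" as exactly 4,500.

```python
# Figures taken from the training table above.
PAIRS = 4_500            # instruct-response pairs ("4.5k", treated as exact)
CONTEXT_LENGTH = 1024    # fine-tuning context length in tokens
EPOCHS = 3
PRETRAINING_TOKENS = 1_000_000_000_000  # 1 trillion

# Upper bound: assumes every example uses the full context window.
max_finetune_tokens = PAIRS * CONTEXT_LENGTH * EPOCHS
share = max_finetune_tokens / PRETRAINING_TOKENS

print(f"fine-tuning tokens (upper bound): {max_finetune_tokens:,}")
print(f"share of pretraining tokens     : {share:.4%}")
```

Even under this generous bound, fine-tuning touches only a tiny fraction of the pretraining token count, which is consistent with the short training time reported for this model.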

Hardware

GPUs: 8 Tesla A100
Training time: 5 hours
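
A rough estimate of the weight memory helps explain why sharded training across several GPUs is used for a model of this size. The sketch below counts parameters only, in bfloat16, and assumes the weights are split evenly across all 8 GPUs. Gradients, optimizer state, and activations are deliberately left out, so real usage is considerably higher.

```python
PARAMS = 15.5e9          # parameter count stated in the model details
BYTES_PER_PARAM = 2      # bfloat16 stores each value in 2 bytes
NUM_GPUS = 8             # from the hardware list above

# Total weight memory in gigabytes (1 GB = 1e9 bytes).
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
# Assumption: parameters are sharded evenly across all GPUs.
weights_per_gpu_gb = weights_gb / NUM_GPUS

print(f"weights, unsharded: {weights_gb:.1f} GB")
print(f"weights per GPU   : {weights_per_gpu_gb:.3f} GB")
```

The same unsharded figure applies when loading the checkpoint in float16 for inference, since float16 also uses 2 bytes per parameter.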

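License

The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement here. This model was also fine-tuned on outputs from OpenAI GPT-4 and is therefore additionally subject to OpenAI's terms of service.

Citation

The base model's HF repository can be found here.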

@article{li2023starcoder,
      title={StarCoder: may the source be with you!}, 
      author={Raymond Li and Loubna Ben Allal and Yangtian Zi and Niklas Muennighoff and Denis Kocetkov and Chenghao Mou and Marc Marone and Christopher Akiki and Jia Li and Jenny Chim and Qian Liu and Evgenii Zheltonozhskii and Terry Yue Zhuo and Thomas Wang and Olivier Dehaene and Mishig Davaadorj and Joel Lamy-Poirier and João Monteiro and Oleh Shliazhko and Nicolas Gontier and Nicholas Meade and Armel Zebaze and Ming-Ho Yee and Logesh Kumar Umapathi and Jian Zhu and Benjamin Lipkin and Muhtasham Oblokulov and Zhiruo Wang and Rudra Murthy and Jason Stillerman and Siva Sankalp Patel and Dmitry Abulkhanov and Marco Zocca and Manan Dey and Zhihan Zhang and Nour Fahmy and Urvashi Bhattacharyya and Wenhao Yu and Swayam Singh and Sasha Luccioni and Paulo Villegas and Maxim Kunakov and Fedor Zhdanov and Manuel Romero and Tony Lee and Nadav Timor and Jennifer Ding and Claire Schlesinger and Hailey Schoelkopf and Jan Ebert and Tri Dao and Mayank Mishra and Alex Gu and Jennifer Robinson and Carolyn Jane Anderson and Brendan Dolan-Gavitt and Danish Contractor and Siva Reddy and Daniel Fried and Dzmitry Bahdanau and Yacine Jernite and Carlos Muñoz Ferrandis and Sean Hughes and Thomas Wolf and Arjun Guha and Leandro von Werra and Harm de Vries},
      year={2023},
      eprint={2305.06161},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Author:

Georgia Tech Research Institute

Dataset size:

58.93 GB