
mpt-7b-instruct-orca

This short-form instruction-following model was built by finetuning MPT-7B on the first 100k rows of the ehartford/dolphin dataset (an open-source implementation of Microsoft's Orca).

  • Model license: CC-By-SA-3.0
  • Basic usage: notebook
  • Finetuning code: script
  • Loss curves: plot
  • Runtime stats: table
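
As a rough illustration of the data selection described above, the snippet below loads the first 100k rows of ehartford/dolphin with the Hugging Face datasets library. This is a hedged sketch: the split name and file layout are assumptions, and the linked finetuning script is the authoritative reference for the actual data preparation.

from datasets import load_dataset

# Illustrative only: grab the first 100k rows used for finetuning.
# The "train" split name and slice syntax are assumptions about how the
# ehartford/dolphin repo resolves under load_dataset; pass data_files if needed.
dataset = load_dataset("ehartford/dolphin", split="train[:100000]")
print(dataset)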

Example prompts and responses

Example 1:

User:

You are a helpful assistant. Write me a numbered list of things to do in New York City.\n

mpt-7b-instruct-orca:

  1. Visit the Statue of Liberty: The Statue of Liberty is a symbol of freedom and democracy, and it is a must-see attraction in New York City.
  2. Take a walk in Central Park: Central Park is a massive green space in the heart of Manhattan, and it is a great place to relax, enjoy nature, and people-watch.
  3. Visit the Metropolitan Museum of Art: The Met is one of the world's largest and most renowned art museums, housing an extensive collection of art from around the world.
  4. Take a stroll through Times Square: Times Square is the heart of the city, and it is a must-see for its iconic billboards, bright lights, and bustling energy.
  5. Visit the Empire State Building: The Empire State Building is a famous skyscraper in Midtown Manhattan, offering panoramic views of the city from its observation deck.
  6. Explore the Museum of Modern Art (MoMA): MoMA is a renowned art museum in Midtown Manhattan, showcasing modern and contemporary art from around the world.
  7. Visit the 9/11 Memorial and Museum: The 9/11 Memorial and Museum is a somber but important attraction in Lower Manhattan, commemorating the victims of the September 11th attacks.
  8. Take a ferry ride to the Statue of Liberty: A ferry ride to the Statue of Liberty is a great way to see the city from the water

    Example 2:

    User:

    You are a helpful assistant. Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does this make logical sense? Let's work this out in a step by step fashion to make sure we have the right answer.\n

    mpt-7b-instruct-orca:

    Step 1: Identify the days of the week when the barber works.

    The barber works on Mondays, Wednesdays, and Fridays.

    Step 2: Determine the day of the week on which Daniel went for a haircut.

    Daniel went for a haircut on Sunday.

    Step 3: Compare the day of the week on which Daniel went for a haircut to the days when the barber works.

    Sunday is not one of the days when the barber works.

    Step 4: Determine if the statement makes logical sense.

    The statement does not make logical sense because Daniel went for a haircut on a day when the barber does not work.

    Answer: No, the statement does not make logical sense. Daniel should have gone for a haircut on a day when the barber works, such as Monday, Wednesday, or Friday.

    Example 3:

    User:

    You are a helpful assistant. Write a short email inviting my friends to a dinner party on Friday. Respond succinctly.\n

    mpt-7b-instruct-orca:

    Subject: Invitation to Dinner Party on Friday

    Dear Friend,

    I hope this email finds you well. I am writing to invite you to a dinner party that I am hosting on Friday. The party will be a casual gathering, and I would love to have you join us.

    The dinner will be held at my home, and the time is from 7:00 pm to 10:00 pm. The address is 123 Main Street, San Francisco, CA 94122.

    The dinner will be a potluck, so please bring a dish to share. I will provide the main course, so you can focus on bringing a side dish, dessert, or appetizer.

    Please RSVP by replying to this email or by calling me at 555-555-5555. I look forward to seeing you on Friday!

    Yours,

    Your Friendly Assistant

    Model Description

    The architecture is a modification of a standard decoder-only transformer, incorporating training-efficiency features such as FlashAttention and ALiBi (see MPT-7B for the full set of architectural changes).

    Key model hyperparameters:

    | Hyperparameter  | Value |
    | --------------- | ----- |
    | n_parameters    | 6.65B |
    | n_layers        | 32    |
    | n_heads         | 32    |
    | d_model         | 4096  |
    | vocab size      | 50432 |
    | sequence length | 2048  |
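
    These values can also be read off the model's configuration without downloading the weights. The snippet below is a hedged sketch: the attribute names (d_model, n_heads, n_layers, vocab_size) mirror the table above and the MPT config conventions used later in this card (e.g., config.max_seq_len), but they should be verified against the loaded config.

    import transformers

    # Load only the config to inspect architecture hyperparameters.
    config = transformers.AutoConfig.from_pretrained(
        'dfurman/mpt-7b-instruct-orca',
        trust_remote_code=True,
    )

    # Attribute names assumed from the table above; verify on the loaded config.
    print(config.d_model)      # expected: 4096
    print(config.n_heads)      # expected: 32
    print(config.n_layers)     # expected: 32
    print(config.vocab_size)   # expected: 50432
    print(config.max_seq_len)  # expected: 2048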

    Finetuning Description

    This model was trained on a single H100 (80 GB PCIe) for roughly 17 hours (61,098 seconds, per the log summary below) using the Lambda Labs Platform.

    Run: July 5, 2023 (link)

    • Args summary: {'lr': 2e-5, 'num_epochs': 1, 'seed': 43}
    • Log summary: {'train_runtime': 61098.1062, 'train_samples_per_second': 1.637, 'train_steps_per_second': 0.409, 'train_loss': 1.4058428125, 'epoch': 1.0}

    Plot derived from the tfevents log at runs/jul_5_23_3_15_00_sft-instruction-mpt-7b-orca (link).
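
    For orientation, the args summary above corresponds roughly to the following Hugging Face TrainingArguments. This is an illustrative sketch, not the actual finetuning script (see the linked script for the real configuration); the batch size, precision, and output path shown here are assumptions.

    import transformers

    # Hypothetical reconstruction of the run configuration from the args summary above.
    # Only learning_rate, num_train_epochs, and seed come from this card; the rest are placeholders.
    training_args = transformers.TrainingArguments(
        output_dir="sft-instruction-mpt-7b-orca",  # placeholder
        learning_rate=2e-5,              # 'lr'
        num_train_epochs=1,              # 'num_epochs'
        seed=43,                         # 'seed'
        per_device_train_batch_size=4,   # assumption, not stated in this card
        bf16=True,                       # assumption: bfloat16 training on the H100
        logging_dir="runs/jul_5_23_3_15_00_sft-instruction-mpt-7b-orca",
        report_to="tensorboard",         # consistent with the tfevents log above
    )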

    Pretraining Data

    For more details on the pretraining process, see MPT-7B.

    The data was tokenized using the EleutherAI/gpt-neox-20b tokenizer.

    Limitations and Biases

    The following language is modified from EleutherAI's GPT-NeoX-20B.

    This model can produce factually incorrect output, and should not be relied on to produce factually accurate information. This model was trained on various public datasets. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

    How to Use

    Basic Usage: Notebook

    Note: This model requires that trust_remote_code=True be passed to the from_pretrained method. This is because we use a custom model architecture that is not yet part of the transformers package.

    The custom MPT architecture includes options for many training-efficiency features, such as FlashAttention (Dao et al. 2022), ALiBi, QK LayerNorm, and more.

    First, install the package dependencies:

    !pip install -q -U transformers einops accelerate torch
    !pip install -q -U triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python
    

    Basic model loading:

    import transformers
    
    model = transformers.AutoModelForCausalLM.from_pretrained(
      'dfurman/mpt-7b-instruct-orca',
      trust_remote_code=True,
      device_map="auto",
    )
    

    To use the optimized triton implementation of FlashAttention, you can load the model on GPU with attn_impl='triton' and with bfloat16 precision:

    import torch
    import transformers
    
    name = 'dfurman/mpt-7b-instruct-orca'
    
    config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
    config.attn_config['attn_impl'] = 'triton'
    config.init_device = 'meta'
    
    model = transformers.AutoModelForCausalLM.from_pretrained(
      name,
      config=config,
      torch_dtype=torch.bfloat16,
      trust_remote_code=True,
      device_map="auto",
    )
    

    Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. Note that larger context windows require more available VRAM. For example:

    import transformers
    
    name = 'dfurman/mpt-7b-instruct-orca'
    
    config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
    config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096
    
    model = transformers.AutoModelForCausalLM.from_pretrained(
      name,
      config=config,
      trust_remote_code=True,
      device_map="auto",
    )
    

    This model was trained with the EleutherAI/gpt-neox-20b tokenizer. It can be invoked directly from this model's repo:

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained('dfurman/mpt-7b-instruct-orca')
    

    Once loaded, the model and tokenizer can be used with the following code:

    import transformers
    import torch
    
    # text generation function
    def mpt_generate(
        model: transformers.AutoModelForCausalLM,
        tokenizer: transformers.AutoTokenizer,
        prompt: str,
        max_new_tokens: int = 128,
        temperature: float = 1.0,
    ) -> str:
        """
        Generate a response to a prompt using the loaded model and tokenizer.
        Uses Hugging Face GenerationConfig defaults unless overridden here:
            https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig
        Args:
            model (transformers.AutoModelForCausalLM): Model for text generation
            tokenizer (transformers.AutoTokenizer): Tokenizer for model
            prompt (str): Prompt for text generation
            max_new_tokens (int, optional): Max new tokens after the prompt to generate.
                Defaults to 128.
            temperature (float, optional): The value used to modulate the next token probabilities.
                Defaults to 1.0
        """
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
        inputs = tokenizer(
            prompt,
            return_tensors="pt",
            return_token_type_ids=False,
        ).to(device)
    
        # when running Torch modules in lower precision, it is best practice to use the torch.autocast context manager.
        with torch.autocast("cuda", dtype=torch.bfloat16):
            response = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                return_dict_in_generate=True,
                eos_token_id=tokenizer.eos_token_id,
                pad_token_id=tokenizer.pad_token_id,
            )
    
        decoded_output = tokenizer.decode(
            response["sequences"][0],
            skip_special_tokens=True,
        )  # grab output in natural language
    
        return decoded_output[len(prompt) :]  # remove prompt from output
    

    We can now generate text! For example:

    prompt = "You are a helpful assistant. Here is a recipe for vegan banana bread:\n"
    
    response = mpt_generate(
        model,
        tokenizer,
        prompt,
        max_new_tokens=150,
        temperature=0.92,
    )
    
    print(response)
    

    Runtime tests

    | runtime / 50 tokens (sec) | GPU                  | attn   | torch dtype | VRAM (GB) |
    | ------------------------- | -------------------- | ------ | ----------- | --------- |
    | 0.61                      | 1x H100 (80 GB PCIe) | triton | bfloat16    | 12        |
    | 0.67                      | 1x H100 (80 GB PCIe) | torch  | bfloat16    | 12        |
    | 1.17                      | 1x A100 (40 GB SXM)  | triton | bfloat16    | 13        |
    | 1.36                      | 1x A100 (40 GB SXM)  | torch  | bfloat16    | 13        |
    | 2.25                      | 1x V100 (16 GB SXM)  | torch  | float16     | 13        |
    | 3.75                      | 1x V100 (16 GB SXM)  | torch  | fp4         | 4         |
    | 4.84                      | 1x Tesla T4 (15 GB)  | torch  | float16     | 13        |
    | 8.77                      | 1x Tesla T4 (15 GB)  | torch  | fp4         | 4         |
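
    The fp4 rows above correspond to 4-bit quantized loading, which this card does not otherwise show. The snippet below is a hedged sketch using the bitsandbytes integration in transformers (it also requires the bitsandbytes package); the exact quantization settings used for the benchmark are assumptions.

    import torch
    import transformers

    name = 'dfurman/mpt-7b-instruct-orca'

    # Assumed 4-bit ("fp4") quantization config; the benchmarked settings may differ.
    quant_config = transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="fp4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    model = transformers.AutoModelForCausalLM.from_pretrained(
        name,
        quantization_config=quant_config,
        trust_remote_code=True,
        device_map="auto",
    )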

    The runtime statistics above (leftmost column) were generated with the following code for each test, as per the corresponding notebook.

    prompt = "You are a helpful assistant. Write me a long list of things to do in San Francisco:\n"
    
    runtimes = []
    for i in tqdm.tqdm(range(100)):
        start = time.time()
        response = mpt_generate(
            model,
            tokenizer,
            prompt,
            max_new_tokens=50,
            temperature=0.92,
        )
        end = time.time()
        runtimes.append(end - start)
        assert len(tokenizer.encode(response)) == 50
    
    avg_runtime = torch.mean(torch.tensor(runtimes)).item()
    print(f"Runtime avg in seconds: {avg_runtime}")  # time in seconds
    

    Acknowledgements

    This model was finetuned by Daniel Furman on July 5, 2023, and is intended primarily for research purposes.

    Disclaimer

    The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

    MosaicML Citation for MPT-7B

    @online{MosaicML2023Introducing,
        author    = {MosaicML NLP Team},
        title     = {Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs},
        year      = {2023},
        url       = {www.mosaicml.com/blog/mpt-7b},
        note      = {Accessed: 2023-07-02},
        urldate   = {2023-07-02}
    }