[Guide] How to Run NVIDIA's llama-3.1-nemotron-70b-instruct Locally

Published by alex on October 21, 2024

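Running large language models (LLMs) locally is increasingly popular among developers, researchers, and AI enthusiasts. One model drawing a lot of attention is llama-3.1-nemotron-70b-instruct, a powerful LLM customized by NVIDIA to improve the helpfulness of its generated responses. In this article, we explore several ways to run the model on a local machine, starting with the user-friendly Ollama platform.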

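Method 1: Running llama-3.1-nemotron-70b-instruct Locally with Ollama

Ollama is an excellent tool for running LLMs locally. It offers a simple setup process and supports a wide range of models, including llama-3.1-nemotron-70b-instruct.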

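Installation

Visit the official Ollama website (https://ollama.ai) and download the version for your operating system.
On Linux, you can instead install Ollama by running the following command in a terminal: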

curl -fsSL https://ollama.ai/install.sh | sh

Running llama-3.1-nemotron

Once Ollama is installed, a single command is enough to run the llama-3.1-nemotron-70b-instruct model:

ollama run nemotron:70b-instruct-q5_K_M

If the model is not already on your system, this command downloads it and then starts an interactive session.
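To check ahead of time whether the model is already on disk, one option is to ask the local Ollama server which models it has stored. The sketch below assumes the server is listening on its default address (http://localhost:11434) and that its `/api/tags` endpoint returns a JSON object with a `models` list whose entries carry a `name` field. The helper names are hypothetical, and the demonstration uses a hand-written sample response so that it runs without a server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address of a local Ollama server


def local_model_names(tags_response: dict) -> list:
    """Extract model names from a parsed /api/tags response."""
    return [entry["name"] for entry in tags_response.get("models", [])]


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Ask a running Ollama server for its locally stored models."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return json.load(resp)


# Offline demonstration with a hand-written (hypothetical) sample response.
sample = {"models": [{"name": "nemotron:70b-instruct-q5_K_M"}, {"name": "llama3.1:8b"}]}
names = local_model_names(sample)
print("nemotron:70b-instruct-q5_K_M" in names)
```

With a server running, `local_model_names(fetch_tags())` would list the models that `ollama run` can start without downloading anything first.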

Using the Model

Once the model has loaded, you can start interacting with it by typing a prompt. For example:

>>> What are the key features of llama-3.1-nemotron-70b-instruct?
Llama-3.1-Nemotron-70B-Instruct is a large language model with several key features:
1. Customized by NVIDIA: The model has been fine-tuned by NVIDIA to improve the helpfulness and quality of its responses.
2. Based on Llama 3.1: It builds upon the Llama 3.1 architecture, which is known for its strong performance across various tasks.
3. 70 billion parameters: This large parameter count allows for complex reasoning and a wide range of capabilities.
4. Instruct-tuned: The model is specifically designed to follow instructions and generate helpful responses to user queries.
5. RLHF training: It has been trained using Reinforcement Learning from Human Feedback, specifically the REINFORCE algorithm.
6. Specialized reward model: The training process utilized Llama-3.1-Nemotron-70B-Reward for optimization.
7. HelpSteer2-Preference prompts: These were used during the training process to further improve the model's helpfulness.
8. Extended context length: Like other Llama 3.1 models, it likely supports a longer context window of 128K tokens.
9. Multilingual capabilities: It can understand and generate text in multiple languages.
10. Strong reasoning abilities: The model excels in tasks requiring complex reasoning and problem-solving.
These features make llama-3.1-nemotron-70b-instruct a powerful and versatile language model suitable for a wide range of applications, from general conversation to specialized tasks in various domains.

For more advanced use cases, you can integrate Ollama with Python using a library such as LangChain. Here is a simple example:

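from langchain_community.llms import Ollama
# Connect to the local Ollama server and select the model
ollama = Ollama(base_url="http://localhost:11434", model="nemotron:70b-instruct-q5_K_M")
# invoke() takes a single prompt string and returns the generated text
response = ollama.invoke("Explain the concept of quantum entanglement.")
print(response)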

This lets you integrate the model seamlessly into your Python projects and applications.
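If you would rather not add a LangChain dependency, the same local server can be reached with the standard library alone. This is a minimal sketch, assuming Ollama's `/api/generate` endpoint accepts a JSON body with `model`, `prompt`, and `stream` fields and returns the text under a `response` key. The function names are hypothetical, and only the request construction is executed here, so no server is needed to try it.

```python
import json
import urllib.request


def build_generate_request(prompt, model="nemotron:70b-instruct-q5_K_M",
                           base_url="http://localhost:11434"):
    """Build (but do not send) a POST request for Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request to a running server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.load(resp)["response"]


# Inspect the request without contacting the server.
req = build_generate_request("Explain the concept of quantum entanglement.")
print(req.get_method(), req.full_url)
```

Calling `generate("Explain the concept of quantum entanglement.")` sends the request. Setting `stream` to false asks for a single JSON reply instead of a stream of partial chunks.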

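Method 2: Using llama.cpp

llama.cpp is a popular C++ implementation of Llama model inference, optimized for running on CPUs. It may require more setup than Ollama, but it offers greater flexibility and finer control over model parameters.

Installation

Clone the llama.cpp repository: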

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

Build the project:

make

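Downloading the Model

To run llama-3.1-nemotron-70b-instruct, you need to download the model weights. For llama.cpp these are typically distributed in GGUF format, which replaced the older GGML format.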

mkdir -p models
cd models
wget https://huggingface.co/TheBloke/Llama-3.1-Nemotron-70B-Instruct-GGUF/resolve/main/llama-3.1-nemotron-70b-instruct.Q4_K_M.gguf
cd ..

Running the Model

Once you have the model file, you can run it with the following command:

./main -m models/llama-3.1-nemotron-70b-instruct.Q4_K_M.gguf -n 1024 -p "Hello, how are you today?"

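This command loads the model and generates a response to the given prompt. You can adjust various parameters, such as the number of tokens to generate (-n) or the temperature, which controls randomness.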
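To vary these parameters from a script, the command can be assembled as an argument list in Python. The sketch below is illustrative only: the helper name is hypothetical, the binary path `./main` is taken from the command above, and it assumes the temperature is set with llama.cpp's `--temp` flag. It only builds and prints the command; nothing is executed.

```python
import shlex


def build_llama_command(model_path, prompt, n_predict=1024, temperature=0.8,
                        binary="./main"):
    """Assemble a llama.cpp invocation as an argument list (not executed here)."""
    return [
        binary,
        "-m", model_path,            # path to the GGUF model file
        "-n", str(n_predict),        # number of tokens to generate
        "--temp", str(temperature),  # sampling temperature (assumed flag name)
        "-p", prompt,                # the prompt text
    ]


cmd = build_llama_command(
    "models/llama-3.1-nemotron-70b-instruct.Q4_K_M.gguf",
    "Hello, how are you today?",
    temperature=0.2,
)
print(shlex.join(cmd))  # shell-quoted form, handy for copy and paste
```

Passing the list to `subprocess.run(cmd, check=True)` from the llama.cpp directory would start the run. Using a list rather than a single string avoids shell quoting problems when the prompt contains quotes or spaces.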

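Method 3: Using Hugging Face Transformers

Hugging Face's Transformers library provides a high-level API for working with a wide range of language models, including llama-3.1-nemotron-70b-instruct.

Installation

First, install the required libraries: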

pip install transformers torch accelerate

Running the Model

Here is a Python script that loads and uses the model:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
# Prepare the input
prompt = "Explain the concept of quantum computing in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate the response
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)
# Decode and print the response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

This approach gives you finer-grained control over the model's behavior and integrates with other Hugging Face tools and pipelines.
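Before trying this on your own hardware, it helps to estimate how much memory the weights need. The back-of-the-envelope sketch below multiplies the parameter count by the bits stored per weight. It covers the weights only (activations and the KV cache add more), the function name is hypothetical, and the 4-bit figure is an idealized lower bound, since practical 4-bit quantization formats store somewhat more than 4 bits per weight on average.

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9  # 70 billion parameters

fp16_gb = weights_size_gb(N_PARAMS, 16)  # torch.float16: 2 bytes per weight
int4_gb = weights_size_gb(N_PARAMS, 4)   # idealized 4-bit quantization

print(f"float16: about {fp16_gb:.0f} GB")  # about 140 GB
print(f"4-bit:   about {int4_gb:.0f} GB")  # about 35 GB
```

At roughly 140 GB in float16, the script above needs several large GPUs or offloading to CPU memory, which `device_map="auto"` can arrange at a cost in speed. The quantized builds used in Methods 1 and 2 are far smaller.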

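Conclusion

Running llama-3.1-nemotron-70b-instruct locally opens up many possibilities for developers and researchers. Whether you choose the simplicity of Ollama, the flexibility of llama.cpp, or the integration capabilities of Hugging Face Transformers, you now have the tools to put this advanced language model to work on your own hardware.

Source: https://medium.com/@sebastian-petrus/how-to-run-nvidia-llama-3-1-nemotron-70b-instruct-locally-a58ad283aaff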
