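With the rapid growth of generative AI and language models that can understand and extract information from documents, we are entering an era in which models such as GPT help people extract, interpret, and distill knowledge more effectively.
In this article, we walk through one such use case: using the OpenAI API and LangChain to build a system that extracts information from a document and answers questions about it. Let's get started.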
Step 1: Set up your environment
In this article, we will use Jupyter Notebook and the OpenAI API.
The following packages are needed for this guide.
pip install openai tiktoken chromadb langchain BeautifulSoup4
We also need to set the OpenAI API credentials. Replace the placeholder value with your own key.
import os
os.environ["OPENAI_API_KEY"] = "sk-xxxx"
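Hardcoding the key in a notebook makes it easy to leak when the file is shared. A minimal sketch of an alternative, assuming the key was exported in the shell before the notebook was started; the helper name require_api_key is hypothetical and is not part of the OpenAI or LangChain libraries:

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or fail with a clear message.

    Illustrative helper only; not a function of any library used in this guide.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key
```

Calling require_api_key() at the top of the notebook surfaces a missing key immediately instead of at the first API call.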
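Step 2: Fetch the document content
We will use an article from the Berkeley Artificial Intelligence Research (BAIR) blog: https://bair.berkeley.edu/blog/2023/07/14/ddpo/
LangChain provides a convenient way to read data from a URL and convert it into its document format. See the LangChain documentation on document loaders for more details.
We will use WebBaseLoader, as shown below.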
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://bair.berkeley.edu/blog/2023/07/14/ddpo/")
data = loader.load()
print(data[0].page_content)
> '\n\n\n\n\nTraining Diffusion Models with Reinforcement Learning – The Berkeley Artificial Intelligence Research Blog\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSubscribe\nAbout\nArchive\nBAIR\n\n\n\n\n\n\n\n\nTraining Diffusion Models with Reinforcement Learning\n\nKevin Black \xa0\xa0\n \n \n Jul 14, 2023\n \n \n\n\n\n\n\n\n\n\n\n\nTraining Diffusion Models with Reinforcement Learning\n\n\n\n\n\n\nreplay\n\n\nDiffusion models have recently emerged as the de facto standard for generating complex, high-dimensional outputs. You may know them for their ability to produce stunning AI art and hyper-realistic synthetic images, but they have also found success in other applications such as drug design\xa0and continuous control. The key idea behind diffusion models is to iteratively transform random noise into a sample, such as an image or protein structure. This is typically motivated as a maximum likelihood estimation\xa0problem, where the model is trained to generate samples that match the training data as closely as possible.\nHowever, most use cases of diffusion models are not directly concerned with matching the training data, but instead with a downstream objective. We don’t just want an image that looks like existing images, but one that has a specific type of appearance; we don’t just want a drug molecule that is physically plausible, but one that is as effective as possible. In this post, we show how diffusion models can be trained on these downstream objectives directly using reinforcement learning (RL). To do this, we finetune Stable Diffusion\xa0on a variety of objectives, including image compressibility, human-perceived aesthetic quality, and prompt-image alignment. 
The last of these objectives uses feedback from a large vision-language model\xa0to improve the model’s performance on unusual prompts, demonstrating how powerful AI models can be used to improve each other\xa0without any humans in the loop.\n\n\n\n\n\n\n A diagram illustrating the prompt-image alignment objective. It uses LLaVA, a large vision-language model, to evaluate generated images.\n \n\n\nDenoising Diffusion Policy Optimization\nWhen turning diffusion into an RL problem, we make only the most basic assumption: given a sample (e.g. an image), we have access to a reward function\xa0that we can\xa0evaluate to tell us how “good” that sample is. Our goal is for the diffusion model to generate samples that maximize this reward function.\nDiffusion models are typically trained using a loss function derived from maximum likelihood estimation (MLE), meaning they are encouraged to generate samples that make the training data look more likely. In the RL setting, we no longer have training data, only samples from the diffusion model and their associated rewards. One way we can still use the same MLE-motivated loss function is by treating the samples as training data and incorporating the rewards by weighting the loss for each sample by its reward. This gives us an algorithm that we call reward-weighted regression (RWR), after existing algorithms\xa0from RL literature.\nHowever, there are a few problems with this approach. One is that RWR is not a particularly exact algorithm — it maximizes the reward only approximately\xa0(see Nair et. al., Appendix A).\xa0The MLE-inspired loss for diffusion is also not exact and is instead derived using a variational bound\xa0on the true likelihood of each sample. 
This means that RWR maximizes the reward through two levels of approximation, which we find significantly hurts its performance.\n\n\n\n\n\n We evaluate two variants of DDPO and two variants of RWR on three reward functions and find that DDPO consistently achieves the best performance.\n \n\n\nThe key insight of our algorithm, which we call denoising diffusion policy optimization (DDPO), is that we can better maximize the reward of the final sample if we pay attention to the entire sequence of denoising steps that got us there. To do this, we reframe the diffusion process as a multi-step Markov decision process (MDP). In MDP terminology: each denoising step is an action, and the agent\xa0only gets a reward on the final step of each denoising trajectory\xa0when the final sample is produced. This framework allows us to apply many powerful algorithms from RL literature that are designed specifically for multi-step MDPs. Instead of using the approximate likelihood of the final sample, these algorithms use the exact likelihood of each denoising step, which is extremely easy to compute.\nWe chose to apply policy gradient algorithms due to their ease of implementation and past success in language model finetuning. This led to two variants of DDPO: DDPOSF, which uses the simple score function estimator of the policy gradient also known as REINFORCE; and DDPOIS, which uses a more powerful importance sampled estimator. DDPOIS\xa0is our best-performing algorithm and its implementation closely follows that of proximal policy optimization (PPO).\nFinetuning Stable Diffusion Using DDPO\nFor our main results, we finetune Stable Diffusion v1-4\xa0using DDPOIS. We have four tasks, each defined by a different reward function:\n\nCompressibility: How easy is the image to compress using the JPEG algorithm? The reward is the negative file size of the image (in kB) when saved as a JPEG.\nIncompressibility:\xa0How hard\xa0is the image to compress using the JPEG algorithm? 
The reward is the positive\xa0file size of the image (in kB) when saved as a JPEG.\nAesthetic Quality: How aesthetically appealing is the image to the human eye? The reward is the output of the LAION aesthetic predictor, which is a neural network trained on human preferences.\nPrompt-Image Alignment: How well does the image represent what was asked for in the prompt? This one is a bit more complicated: we feed the image into LLaVA, ask it to describe the image, and then compute the similarity between that description and the original prompt using BERTScore.\n\nSince Stable Diffusion is a text-to-image model, we also need to pick a set of prompts to give it during finetuning. For the first three tasks, we use simple prompts of the form “a(n) [animal]”. For prompt-image alignment, we use prompts of the form “a(n) [animal] [activity]”, where the activities are “washing dishes”, “playing chess”, and “riding a bike”. We found that Stable Diffusion often struggled to produce images that matched the prompt for these unusual scenarios, leaving plenty of room for improvement with RL finetuning.\nFirst, we illustrate the performance of DDPO on the simple rewards (compressibility, incompressibility, and aesthetic quality). All of the images are generated with the same random seed. In the top left quadrant, we illustrate what “vanilla” Stable Diffusion generates for nine different animals; all of the RL-finetuned models show a clear qualitative difference. Interestingly, the aesthetic quality model (top right) tends towards minimalist black-and-white line drawings, revealing the kinds of images that the LAION aesthetic predictor considers “more aesthetic”.1\n\n\n\nNext, we demonstrate DDPO on the more complex prompt-image alignment task. Here, we show several snapshots from the training process: each series of three images shows samples for the same prompt and random seed over time, with the first sample coming from vanilla Stable Diffusion. 
Interestingly, the model shifts towards a more cartoon-like style, which was not intentional. We hypothesize that this is because animals doing human-like activities are more likely to appear in a cartoon-like style in the pretraining data, so the model shifts towards this style to more easily align with the prompt by leveraging what it already knows.\n\n\n\nUnexpected Generalization\nSurprising generalization has been found to arise when finetuning large language models with RL: for example, models finetuned on instruction-following only in English often improve in other languages. We find that the same phenomenon occurs with text-to-image diffusion models. For example, our aesthetic quality model was finetuned using prompts that were selected from a list of 45 common animals. We find that it generalizes not only to unseen animals but also to everyday objects.\n\n\n\nOur prompt-image alignment model used the same list of 45 common animals during training, and only three activities. We find that it generalizes not only to unseen animals but also to unseen activities, and even novel combinations of the two.\n\n\n\nOveroptimization\nIt is well-known that finetuning on a reward function, especially a learned one, can lead to reward overoptimization\xa0where\xa0the model exploits the reward function to achieve a high reward in a non-useful way. 
Our setting is no exception: in all the tasks, the model eventually destroys any meaningful image content to maximize reward.\n\n\n\nWe also discovered that LLaVA is susceptible to typographic attacks: when optimizing for alignment with respect to prompts of the form “[n]\xa0animals”, DDPO was able to successfully fool LLaVA by instead generating text loosely resembling the correct number.\n\n\n\nThere is currently no general-purpose method for preventing overoptimization, and we highlight this problem as an important area for future work.\nConclusion\nDiffusion models are hard to beat when it comes to producing complex, high-dimensional outputs. However, so far they’ve mostly been successful in applications where the goal is to learn patterns from lots and lots of data (for example, image-caption pairs). What we’ve found is a way to effectively train diffusion models in a way that goes beyond pattern-matching — and without necessarily requiring any training data. The possibilities are limited only by the quality and creativity of your reward function.\nThe way we used DDPO in this work is inspired by the recent successes of language model finetuning. OpenAI’s GPT models, like Stable Diffusion, are first trained on huge amounts of Internet data; they are then finetuned with RL to produce useful tools like ChatGPT. Typically, their reward function is learned from human preferences, but others\xa0have more recently figured out how to produce powerful chatbots using reward functions based on AI feedback instead. Compared to the chatbot regime,\xa0our experiments are small-scale and limited in scope. But considering the enormous success of this “pretrain + finetune” paradigm in language modeling, it certainly seems like it’s worth pursuing further in the world of diffusion models. 
We hope that others can build on our work to improve large diffusion models, not just for text-to-image generation, but for many exciting applications such as video generation, music generation, \xa0image editing, protein synthesis, robotics, and more.\nFurthermore, the “pretrain + finetune” paradigm is not the only way to use DDPO. As long as you have a good reward function, there’s nothing stopping you from training with RL from the start. While this setting is as-yet unexplored, this is a place where the strengths of DDPO could really shine. Pure RL has long been applied to a wide variety of domains ranging from playing games\xa0to robotic manipulation\xa0to nuclear fusion\xa0to chip design. Adding the powerful expressivity of diffusion models to the mix has the potential to take existing applications of RL to the next level — or even to discover new ones.\n\nThis post is based on the following paper:\n\n\nTraining Diffusion Models with Reinforcement Learning\n\nKevin\xa0Black*,\n Michael\xa0Janner*,\n Yilun\xa0Du,\n Ilya\xa0Kostrikov,\n and Sergey\xa0Levine\n\narXiv Preprint.\n\n\n\nIf you want to learn more about DDPO, you can check out the paper, website, original code, or get the model weights on Hugging Face. If you want to use DDPO in your own project, check out my PyTorch + LoRA implementation where you can finetune Stable Diffusion with less than 10GB of GPU memory!\nIf DDPO inspires your work, please cite it with:\n@misc{black2023ddpo,\n title={Training Diffusion Models with Reinforcement Learning}, \n author={Kevin Black and Michael Janner and Yilun Du and Ilya Kostrikov and Sergey Levine},\n year={2023},\n eprint={2305.13301},\n archivePrefix={arXiv},\n primaryClass={cs.LG}\n}\n\n\n\n\n\n\n So, it turns out that the aesthetic score model we used was not exactly... correct. 
Check out this GitHub issue for the riveting details involving Google Cloud TPUs, floating point formats, and the CLIP image encoder.\n ↩\n\n\n\n\n\n\n\nSubscribe to our RSS feed.\n\n \n\n Spread the word: \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
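The raw page_content above is dominated by runs of newlines and non-breaking spaces (\xa0) left over from the HTML. The original walkthrough passes the text on unchanged; the sketch below is an optional cleanup step, and the function name normalize_whitespace is an assumption made for illustration:

```python
import re


def normalize_whitespace(text):
    """Collapse the whitespace noise left over from HTML extraction."""
    text = text.replace("\xa0", " ")        # non-breaking spaces -> plain spaces
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)    # drop spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # 3+ newlines -> a single blank line
    return text.strip()


sample = "\n\n\n\nSubscribe\nAbout\n\n\n\n\nKevin Black \xa0\xa0\n \n \n Jul 14, 2023\n"
print(repr(normalize_whitespace(sample)))
```

Cleaning before splitting means fewer of the 500 characters in each chunk are spent on blank lines.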
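Step 3: Split the document into fixed-size chunks
Once we have the document content, the next step is to split it into fixed-size chunks so that the text fits within the context window of the chosen model. We will use RecursiveCharacterTextSplitter with a chunk size of 500 characters and no overlap.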
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)
all_splits = text_splitter.split_documents(data)
print(all_splits[0])
print(all_splits[1])
> page_content='Training Diffusion Models with Reinforcement Learning – The Berkeley Artificial Intelligence Research Blog\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSubscribe\nAbout\nArchive\nBAIR\n\n\n\n\n\n\n\n\nTraining Diffusion Models with Reinforcement Learning\n\nKevin Black \xa0\xa0\n \n \n Jul 14, 2023\n \n \n\n\n\n\n\n\n\n\n\n\nTraining Diffusion Models with Reinforcement Learning\n\n\n\n\n\n\nreplay' metadata={'source': 'https://bair.berkeley.edu/blog/2023/07/14/ddpo/', 'title': 'Training Diffusion Models with Reinforcement Learning – The Berkeley Artificial Intelligence Research Blog', 'description': 'The BAIR Blog', 'language': 'No language found.'}
page_content='Diffusion models have recently emerged as the de facto standard for generating complex, high-dimensional outputs. You may know them for their ability to produce stunning AI art and hyper-realistic synthetic images, but they have also found success in other applications such as drug design\xa0and continuous control. The key idea behind diffusion models is to iteratively transform random noise into a sample, such as an image or protein structure. This is typically motivated as a maximum likelihood' metadata={'source': 'https://bair.berkeley.edu/blog/2023/07/14/ddpo/', 'title': 'Training Diffusion Models with Reinforcement Learning – The Berkeley Artificial Intelligence Research Blog', 'description': 'The BAIR Blog', 'language': 'No language found.'}
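To make the splitting step more concrete, here is a simplified sketch of the recursive idea: try a coarse separator first and fall back to finer ones only for pieces that are still too long. It is an illustration under stated assumptions, not LangChain's implementation; the function name recursive_split is hypothetical, and overlap handling is left out.

```python
def recursive_split(text, chunk_size=500, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Simplified illustration of recursive splitting; not LangChain's code.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits in the open chunk
            continue
        if current.strip():
            chunks.append(current.strip())  # close the open chunk
        if len(piece) <= chunk_size:
            current = piece              # start a new chunk with this piece
        else:
            # The piece alone is too long: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Preferring paragraph and line breaks over arbitrary cuts is why the second chunk in the output above ends mid-sentence at a word boundary rather than mid-word.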
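Step 4: Store the document chunks in a vector store
Once the document is split into chunks, the next step is to create embeddings for the text and store them in a vector store. We can do this as follows.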
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
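An embedding maps each chunk to a vector so that related texts end up close together. The sketch below illustrates that idea with a toy bag-of-words vector standing in for OpenAIEmbeddings and with cosine similarity as the comparison. Both toy_embed and cosine_similarity are hypothetical names, and cosine is only one common choice; the metric a given vector store actually uses depends on its configuration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


query = toy_embed("denoising steps in DDPO")
related = toy_embed("DDPO treats denoising steps as actions")
unrelated = toy_embed("subscribe to our RSS feed")
print(cosine_similarity(query, related) > cosine_similarity(query, unrelated))
```

A learned embedding model captures meaning rather than exact word overlap, but the comparison step works the same way.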
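Step 5: Retrieve similar documents for a query
With the embeddings generated, we can run a vector similarity search to fetch the document chunks relevant to a query.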
question = "What steps are used for the DDPO algorithm?"
docs = vectorstore.similarity_search(question)
print(f"Retrieved {len(docs)} documents")
print(docs[0].page_content)
> Retrieved 4 documents
The key insight of our algorithm, which we call denoising diffusion policy optimization (DDPO), is that we can better maximize the reward of the final sample if we pay attention to the entire sequence of denoising steps that got us there. To do this, we reframe the diffusion process as a multi-step Markov decision process (MDP). In MDP terminology: each denoising step is an action, and the agent only gets a reward on the final step of each denoising trajectory when the final sample is produced.
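The search returns a fixed number of top-ranked chunks (four in the output above) rather than every chunk above some threshold. A minimal sketch of that top-k selection, assuming a toy word-overlap score in place of real embedding similarity; word_overlap and top_k are hypothetical names, and k=4 simply mirrors the four documents retrieved above:

```python
import heapq


def word_overlap(query, chunk):
    """Toy relevance score: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


def top_k(query, chunks, k=4, score=word_overlap):
    """Return the k highest-scoring chunks, best first."""
    return heapq.nlargest(k, chunks, key=lambda chunk: score(query, chunk))


chunks = [
    "ddpo uses denoising steps",
    "subscribe to our rss feed",
    "the ddpo algorithm steps",
    "spread the word",
]
print(top_k("ddpo algorithm steps", chunks, k=2))
```

A smaller k keeps the prompt short; a larger k gives the model more context at the cost of more tokens.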
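Step 6: Generate an answer from the retrieved documents
Now that we have the retrieved chunks for our query, we can create an LLM chain that generates an answer based on the context supplied by the vector store retriever. We will write a prompt that instructs the LLM to answer the query, to admit when it does not know the answer, and to end every answer with a fixed message.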
from langchain.prompts import PromptTemplate
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Explain the answer in 3 sentences at max. Be concise.
Always say "Done!" at the end of the answer.
{context}
Question: {question}
Answer:"""
prompt = PromptTemplate.from_template(template)
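The {context} and {question} placeholders in the template are filled in when the chain runs. The sketch below shows that substitution with Python's built-in str.format as a stand-in for PromptTemplate, using a shortened template. The build_prompt helper and the choice to join chunks with blank lines are assumptions for illustration, not what the chain in this article does internally.

```python
TEMPLATE = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Answer:"""


def build_prompt(chunks, question, template=TEMPLATE):
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)  # one blank line between chunks
    return template.format(context=context, question=question)


print(build_prompt(
    ["Each denoising step is an action.", "The reward arrives on the final step."],
    "What steps are used for the DDPO algorithm?",
))
```

Printing the filled prompt like this is a quick way to check exactly what text the model will receive.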
Using this prompt, we create an LLM chain as follows.
from langchain.schema.runnable import RunnablePassthrough
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
)
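The code above introduces a new name, RunnablePassthrough. It passes its input through unchanged, so the question reaches the prompt as-is while the retriever fills in the context. Both belong to the LangChain Expression Language, whose | operator makes it easier to build custom LLM chains by connecting components as shown above.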
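To show how this kind of composition can work, here is a toy sketch of | chaining with a pass-through step and a dict-style fan-out. It is not LangChain's implementation: the Step class, the fan_out helper, and the fake retriever and prompt are all hypothetical stand-ins.

```python
class Step:
    """Wrap a function so steps can be chained with | (toy illustration)."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b builds a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))


def fan_out(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})


passthrough = Step(lambda value: value)  # plays the role of RunnablePassthrough
fake_retriever = Step(lambda question: ["chunk about " + question])
fake_prompt = Step(lambda d: f"{d['context'][0]}\nQuestion: {d['question']}")

chain = fan_out(context=fake_retriever, question=passthrough) | fake_prompt
print(chain.invoke("DDPO"))
```

The same input string goes to both branches: the retriever turns it into context, and the pass-through hands it to the prompt untouched.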
Now we can use the LLM chain to answer our queries.
qa_chain.invoke("What steps are used for the DDPO algorithm?").content
> 'The DDPO algorithm uses a sequence of denoising steps to maximize the reward of the final sample. Each denoising step is considered an action in a multi-step Markov decision process (MDP). The agent only receives a reward on the final step of each denoising trajectory when the final sample is produced. Done!'
qa_chain.invoke("What's the main idea discussed in the article?").content
> 'The main idea discussed in the article is the problem of overoptimization and the need for a general-purpose method to prevent it. Done!'
With this, we have a working system that we can readily use to help us interpret and understand documents.