Parrot

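1. What is Parrot?

Parrot is a paraphrase-based utterance augmentation framework purpose-built to accelerate the training of NLU models. A paraphrase framework is more than just a paraphrasing model. For details on the library and its usage, refer to the project's GitHub page.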

Installation

pip install git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git

Quickstart


from parrot import Parrot
import torch
import warnings
warnings.filterwarnings("ignore")

''' 
uncomment to get reproducible paraphrase generations
def random_state(seed):
  torch.manual_seed(seed)
  if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

random_state(1234)
'''

#Init models (make sure you init ONLY once if you integrate this to your code)
parrot = Parrot(model_tag="prithivida/parrot_paraphraser_on_T5", use_gpu=False)

phrases = ["Can you recommed some upscale restaurants in Newyork?",
           "What are the famous places we should not miss in Russia?"
]

for phrase in phrases:
  print("-"*100)
  print("Input_phrase: ", phrase)
  print("-"*100)
  para_phrases = parrot.augment(input_phrase=phrase)
  for para_phrase in para_phrases:
    print(para_phrase)

----------------------------------------------------------------------
Input_phrase: Can you recommed some upscale restaurants in Newyork?
----------------------------------------------------------------------
list some excellent restaurants to visit in new york city?
what upscale restaurants do you recommend in new york?
i want to try some upscale restaurants in new york?
recommend some upscale restaurants in newyork?
can you recommend some high end restaurants in newyork?
can you recommend some upscale restaurants in new york?
can you recommend some upscale restaurants in newyork?
----------------------------------------------------------------------
Input_phrase: What are the famous places we should not miss in Russia?
----------------------------------------------------------------------
what should we not miss when visiting russia?
recommend some of the best places to visit in russia?
list some of the best places to visit in russia?
can you list the top places to visit in russia?
show the places that we should not miss in russia?
list some famous places which we should not miss in russia?
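
The loop above assumes that augment always returns a list of plain strings. As a defensive sketch only, the helper below (paraphrase_texts is a hypothetical name, not part of Parrot) assumes, without confirmation from this README, that some versions may return nothing when no candidate survives, or may return (text, score) pairs. It normalizes all of those cases to a list of strings.

```python
from typing import Callable, List, Optional, Sequence, Tuple, Union

# Assumption: a paraphrase is either a bare string or a (text, score) pair.
Candidate = Union[str, Tuple[str, float]]


def paraphrase_texts(
    augment: Callable[..., Optional[Sequence[Candidate]]],
    phrase: str,
    **knobs,
) -> List[str]:
    """Call `augment` and always return a (possibly empty) list of strings."""
    # Treat a None result as "no paraphrases" instead of failing in the loop.
    results = augment(input_phrase=phrase, **knobs) or []
    texts: List[str] = []
    for item in results:
        # Keep only the text when the item is a (text, score) pair.
        text = item if isinstance(item, str) else item[0]
        if text not in texts:  # drop exact duplicates, keep original order
            texts.append(text)
    return texts
```

With the objects from the quickstart, the call would look like paraphrase_texts(parrot.augment, phrase), and any of the knobs described below can be passed as extra keyword arguments.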

Knobs

para_phrases = parrot.augment(input_phrase=phrase,
                              diversity_ranker="levenshtein",
                              do_diverse=False,
                              max_return_phrases=10,
                              max_length=32,
                              adequacy_threshold=0.99,
                              fluency_threshold=0.90)
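
To make the role of these knobs concrete, here is one plausible reading of how the thresholds and max_return_phrases could interact: filter candidates on adequacy and fluency, rank the survivors by diversity, then truncate. This is a sketch and not Parrot's actual code. The ScoredCandidate type, the select_paraphrases function, and all scores are hypothetical; in the real library the scores come from models.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredCandidate:
    text: str
    adequacy: float   # hypothetical score in [0, 1]
    fluency: float    # hypothetical score in [0, 1]
    diversity: float  # hypothetical score, higher means more different


def select_paraphrases(
    candidates: List[ScoredCandidate],
    adequacy_threshold: float = 0.99,
    fluency_threshold: float = 0.90,
    max_return_phrases: int = 10,
) -> List[str]:
    """Filter by the two thresholds, rank by diversity, keep the top N."""
    kept = [
        c for c in candidates
        if c.adequacy >= adequacy_threshold and c.fluency >= fluency_threshold
    ]
    # Most diverse survivors first.
    kept.sort(key=lambda c: c.diversity, reverse=True)
    return [c.text for c in kept[:max_return_phrases]]
```

Under this reading, raising either threshold can only shrink the result, so very strict values may leave no paraphrases at all for some inputs.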

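2. Why Parrot?

Huggingface lists 12 paraphrase models, RapidAPI lists 7 freemium and commercial paraphrasers such as QuillBot, Rasa has discussed an experimental paraphraser for augmenting text data, Sentence-transformers offers a paraphrase mining utility, and NLPAug offers word-level augmentation with PPDB (a multi-million-entry paraphrase database). While these attempts at paraphrasing are great, some gaps remain, and paraphrasing is still not a mainstream option for text augmentation when building NLU models. Parrot is a humble attempt to fill some of these gaps.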

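What makes a good paraphrase? Almost all conditioned text generation models are validated on two factors: (1) whether the generated text conveys the same meaning as the original context (adequacy), and (2) whether the text is fluent and grammatically correct (fluency). Neural machine translation outputs, for instance, are tested for adequacy and fluency. A good paraphrase, however, should be adequate and fluent while differing from the original in its surface lexical form. Under this definition, the three key metrics that measure the quality of paraphrases are:

Adequacy (Is the meaning preserved adequately?)
Fluency (Is the paraphrase fluent English?)
Diversity (lexical / phrasal / syntactical) (How much has the paraphrase changed the original sentence?)

Parrot offers knobs to control adequacy, fluency, and diversity according to your needs.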
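
Of the three metrics, diversity is the easiest to illustrate without a model. The sketch below measures it as character-level Levenshtein (edit) distance normalized by the longer string, which is one way a "levenshtein" diversity ranker could work. It is an illustration under that assumption, not Parrot's implementation, and the function names are hypothetical.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def diversity(original: str, paraphrase: str) -> float:
    """Normalized edit distance in [0, 1]; 0 means identical (ignoring case)."""
    a, b = original.lower(), paraphrase.lower()
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def rank_by_diversity(original: str, paraphrases: List[str]) -> List[str]:
    """Order paraphrases from most to least different from the original."""
    return sorted(paraphrases, key=lambda p: diversity(original, p), reverse=True)
```

A surface measure like this says nothing about adequacy or fluency, which is why diversity is only meaningful for candidates that have already passed those two checks.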

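What makes a paraphraser a good augmentor? To train an NLU model we need not just a lot of utterances, but utterances annotated with intents and slots/entities. The typical flow is:

Given an input utterance plus its input annotations, a good augmentor produces N output paraphrases while preserving the intent and slots.
The output paraphrases are then converted into annotated data using the input annotations from the first step.
The annotated dataset created from the output paraphrases then becomes the training dataset for the NLU model.

In general, however, a paraphraser is a generative model and does not guarantee that slots/entities are preserved. So the ability to generate high-quality paraphrases in a constrained fashion, without trading off the intents and slots for lexical dissimilarity, is what makes a paraphraser a good augmentor. More on this in section 3 below.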
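
The flow described above can be sketched in a few lines. The example assumes the simplest possible constraint: a paraphrase is kept only if every slot value from the input annotation appears in it verbatim (ignoring case), and the annotation is then re-attached as character spans. The function names, the dictionary layout, and the intent label in the usage note are all hypothetical and not part of Parrot.

```python
import re
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int, str]  # (start, end, slot_name)


def project_slots(paraphrase: str, slots: Dict[str, str]) -> Optional[List[Span]]:
    """Locate every slot value in `paraphrase`; return None if one is missing."""
    spans: List[Span] = []
    for name, value in slots.items():
        # Whole-word, case-insensitive match of the original slot value.
        pattern = r"\b" + re.escape(value) + r"\b"
        match = re.search(pattern, paraphrase, flags=re.IGNORECASE)
        if match is None:
            return None  # the paraphrase lost or altered this slot
        spans.append((match.start(), match.end(), name))
    return sorted(spans)


def annotate(paraphrases: List[str], intent: str, slots: Dict[str, str]) -> List[dict]:
    """Build annotated training rows from paraphrases that kept all slots."""
    dataset = []
    for text in paraphrases:
        spans = project_slots(text, slots)
        if spans is not None:
            dataset.append({"text": text, "intent": intent, "slots": spans})
    return dataset
```

For example, with the slot {"city": "new york"} and a made-up intent such as "find_restaurant", the quickstart output "what upscale restaurants do you recommend in new york?" would be kept, while "recommend some upscale restaurants in newyork?" would be discarded because the slot value no longer appears verbatim.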

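3. Scope

In the space of conversational engines, knowledge bots are those we ask questions such as "When was the Berlin Wall torn down?", transactional bots are those we give commands such as "Turn on the music please", and voice assistants are those that can both answer questions and act on commands. Parrot mainly focuses on augmenting text typed into or spoken to conversational interfaces, in order to build robust NLU models. (People usually neither type nor speak long paragraphs to conversational interfaces, so the pre-trained model is trained on text samples with a maximum length of 32.)

While Parrot predominantly aims to be a text augmentor for building good NLU models, it can also be used as a pure-play paraphraser.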

Author:

Prithivi Da

Dataset size:

851.2 MB