Dataset:
flax-sentence-embeddings/paws-jsonl
This dataset is a JSONL version of the PAWS dataset from https://github.com/google-research-datasets/paws. It contains only the PAWS-Wiki Labeled (Final) and PAWS-Wiki Labeled (Swap-only) training sections of the original PAWS dataset. Duplicate entries have been removed.
Each line contains a dict in the following format:
{"guid": <id>, "texts": [anchor, positive]} or
{"guid": <id>, "texts": [anchor, positive, negative]}
positives_negatives.jsonl.gz: 24,723
positives_only.jsonl.gz: 13,487
Total: 38,210
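
The record format above can be read with the Python standard library alone. The following is a minimal sketch, assuming one of the two files has already been downloaded to the working directory; the helper name `read_examples` is hypothetical and not part of the dataset or any library.

```python
import gzip
import json


def read_examples(path):
    """Yield (guid, anchor, positive, negative) tuples from a .jsonl.gz file.

    `negative` is None for records whose "texts" list has only two entries.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            texts = record["texts"]
            if len(texts) not in (2, 3):
                raise ValueError(f"unexpected number of texts: {len(texts)}")
            negative = texts[2] if len(texts) == 3 else None
            yield record["guid"], texts[0], texts[1], negative


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the file was saved.
    for guid, anchor, positive, negative in read_examples("positives_negatives.jsonl.gz"):
        print(guid, anchor, positive, negative)
        break  # show only the first record
```

Returning `None` for the missing negative lets the same loop handle both files, so downstream code can branch on whether a hard negative is available.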
PAWS: Paraphrase Adversaries from Word Scrambling
The original PAWS dataset contains 108,463 human-labeled and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order information for the problem of paraphrase identification. It has two subsets, one based on Wikipedia and the other based on the Quora Question Pairs (QQP) dataset.