Datasets:

medical_questions_pairs

Languages:

en

Multilinguality:

monolingual

Size:

1K<n<10K

Language creators:

other

Annotation creators:

expert-generated

Source datasets:

original

Preprint:

arxiv:2008.13546

Dataset Card for medical_questions_pairs

Dataset Summary

This dataset consists of 3048 similar and dissimilar medical question pairs, hand-generated and labeled by Curai's doctors. The doctors were given a list of 1524 patient-asked questions randomly sampled from the publicly available crawl of HealthTap. Each question yields one similar pair and one dissimilar pair, produced by following these instructions given to the labelers:

  • Rewrite the original question in a different way while maintaining the same intent. Restructure the syntax as much as possible and change medical details that would not impact your response. e.g. "I'm a 22-y-o female" could become "My 26 year old daughter"
  • Come up with a related but dissimilar question for which the answer to the original question would be WRONG OR IRRELEVANT. Use similar key words.

The first instruction generates a positive question pair (similar) and the second generates a negative question pair (different). With the above instructions, the task was intentionally framed such that positive question pairs can look very different by superficial metrics, and negative question pairs can conversely look very similar. This ensures that the task is not trivial.
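To make the point about superficial metrics concrete, here is a minimal sketch that measures token overlap (Jaccard similarity) between two questions. The metric and the helper names `tokens` and `jaccard` are illustrative choices, not something the dataset authors prescribe. The first pair reuses the phrases from the labeler instructions; the second pair is invented for illustration and is not taken from the dataset.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and return its set of word-like tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the token sets (0.0 = disjoint, 1.0 = identical)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Phrases from the labeler instructions: same intent, no shared tokens.
paraphrase = jaccard("I'm a 22-y-o female", "My 26 year old daughter")

# Hypothetical pair (NOT from the dataset): different intent, heavy overlap.
distractor = jaccard(
    "Can I take ibuprofen with food?",
    "Can I take ibuprofen without food?",
)

print(f"paraphrase overlap: {paraphrase:.2f}")
print(f"distractor overlap: {distractor:.2f}")
```

Under this metric the same-intent phrases score lower than the different-intent pair, which is why a pure word-overlap baseline is expected to struggle on this task.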

Supported Tasks and Leaderboards

  • text-classification: The dataset can be used to train a model to identify similar and dissimilar medical question pairs.

Languages

The text in the dataset is in English.

Dataset Structure

Data Instances

Each instance contains four fields: dr_id, question_1, question_2, and label. Eleven doctors took part in this task, so dr_id ranges from 1 to 11. The label is 1 if the question pair is similar and 0 otherwise.

Data Fields

  • dr_id: Identifier of the doctor. Eleven doctors took part in this task, so dr_id ranges from 1 to 11.
  • question_1: The original question.
  • question_2: For similar pairs (label 1), a rewrite of the original question that keeps the same intent; for dissimilar pairs (label 0), a related question with a different intent.
  • label: 1 if the question pair is similar, 0 otherwise.
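The four fields can be pictured as a small typed record. The sketch below is only an illustration: the `QuestionPair` class and its validation rules are assumptions derived from the field descriptions in this card, not part of any official loader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionPair:
    """One record, using the four field names listed in this card."""
    dr_id: int
    question_1: str
    question_2: str
    label: int

    def __post_init__(self) -> None:
        # Ranges below are taken from the card: 11 doctors, binary label.
        if not 1 <= self.dr_id <= 11:
            raise ValueError(f"dr_id must be in 1..11, got {self.dr_id}")
        if self.label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {self.label}")
        if not self.question_1.strip() or not self.question_2.strip():
            raise ValueError("questions must be non-empty")


# Hypothetical record built from the phrases in the labeler instructions.
example = QuestionPair(
    dr_id=1,
    question_1="I'm a 22-y-o female",
    question_2="My 26 year old daughter",
    label=1,
)
print(example)
```

A check like this can catch malformed rows early when the data is read from a custom export rather than through a dataset library.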

Data Splits

The dataset currently consists of a single split (train), but it can be split further as required.

                              train
Non-similar question pairs    1524
Similar question pairs        1524
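Because each original question appears in two pairs (one similar, one dissimilar), a naive random split can place the same original question in both the training and evaluation portions. The sketch below shows one possible way to avoid that by splitting on question_1. The `group_split` helper, the 20% fraction, and the synthetic records are all assumptions made for illustration, and the approach assumes both pairs for a question share an identical question_1 string.

```python
import random
from collections import defaultdict


def group_split(records, test_fraction=0.2, seed=0):
    """Split records so every pair sharing a question_1 lands on the same side."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["question_1"]].append(rec)

    # Shuffle the distinct original questions, not the individual pairs.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])

    train = [r for k in keys if k not in test_keys for r in groups[k]]
    test = [r for k in keys if k in test_keys for r in groups[k]]
    return train, test


# Synthetic stand-in records: 10 original questions, each with one similar
# (label 1) and one dissimilar (label 0) partner, mirroring the card's layout.
records = [
    {"dr_id": 1, "question_1": f"original {i}",
     "question_2": f"{kind} {i}", "label": label}
    for i in range(10)
    for kind, label in (("rewrite", 1), ("distractor", 0))
]

train, test = group_split(records, test_fraction=0.2, seed=0)
print(len(train), len(test))
```

Splitting by original question also keeps both portions balanced between labels, since every group contributes exactly one pair of each kind.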

Dataset Creation

Curai's doctors were given a list of 1524 patient-asked questions randomly sampled from the publicly available crawl of HealthTap. Each question yields one similar pair and one dissimilar pair, produced by following these instructions given to the labelers:

  • Rewrite the original question in a different way while maintaining the same intent. Restructure the syntax as much as possible and change medical details that would not impact your response. e.g. "I'm a 22-y-o female" could become "My 26 year old daughter"
  • Come up with a related but dissimilar question for which the answer to the original question would be WRONG OR IRRELEVANT. Use similar key words.

The first instruction generates a positive question pair (similar) and the second generates a negative question pair (different). With the above instructions, the task was intentionally framed such that positive question pairs can look very different by superficial metrics, and negative question pairs can conversely look very similar. This ensures that the task is not trivial.

Curation Rationale

[More Information Needed]

Source Data

The source data consists of 1524 patient-asked questions randomly sampled from the publicly available crawl of HealthTap.

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

[More Information Needed]

Annotation process

Curai's doctors were given a list of 1524 patient-asked questions randomly sampled from the publicly available crawl of HealthTap. Each question yields one similar pair and one dissimilar pair, produced by following these instructions given to the labelers:

  • Rewrite the original question in a different way while maintaining the same intent. Restructure the syntax as much as possible and change medical details that would not impact your response. e.g. "I'm a 22-y-o female" could become "My 26 year old daughter"
  • Come up with a related but dissimilar question for which the answer to the original question would be WRONG OR IRRELEVANT. Use similar key words.

The first instruction generates a positive question pair (similar) and the second generates a negative question pair (different). With the above instructions, the task was intentionally framed such that positive question pairs can look very different by superficial metrics, and negative question pairs can conversely look very similar. This ensures that the task is not trivial.

Who are the annotators?

Curai's doctors

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

[More Information Needed]

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

[More Information Needed]

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@misc{mccreery2020effective,
      title={Effective Transfer Learning for Identifying Similar Questions: Matching User Questions to COVID-19 FAQs}, 
      author={Clara H. McCreery and Namit Katariya and Anitha Kannan and Manish Chablani and Xavier Amatriain},
      year={2020},
      eprint={2008.13546},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

Contributions

Thanks to @tuner007 for adding this dataset.