Dataset:

semaj83/ctmatch_ir

License:

mit

CTMatch Information Retrieval Dataset

This is a dataset of processed clinical trial documents. It largely duplicates the collection in datasets/ir_datasets, except that these documents have been preprocessed with ctproc to clean them and extract useful fields.

Note: The data are currently saved as text files because of the downstream task in ctmatch, though they may be converted to .csv in the future.

Each .txt file has exactly 374648 lines of corresponding data:

doc_texts.txt

  • texts extracted from documents processed with ctproc using and eligibility criteria fields only, structured as in this example from NCT00000102: "Inclusion Criteria: diagnosed with Congenital Adrenal Hyperplasia (CAH) normal ECG during baseline evaluation, Exclusion Criteria: history of liver disease, or elevated liver function tests history of cardiovascular disease"

doc_categories.txt :

  • 1 x 14 vectors of somewhat arbitrarily chosen topic probabilities (softmax output) generated by the zero-shot classification model facebook/bart-large-mnli (CTMatch.category_model(doc['condition'])), with the categories in lexical order: cancer, cardiac, endocrine, gastrointestinal, genetic, healthy, infection, neurological, other, pediatric, psychological, pulmonary, renal, reproductive
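Because the category order is fixed, a row of doc_categories.txt can be turned back into named probabilities. The following is a minimal sketch of that lookup. The on-disk number format is not documented here, so `parse_vector` (a hypothetical helper) assumes floats separated by commas or whitespace, optionally wrapped in brackets, and the example line is made up for illustration.

```python
# Category order as listed in the dataset description.
CATEGORIES = [
    "cancer", "cardiac", "endocrine", "gastrointestinal", "genetic",
    "healthy", "infection", "neurological", "other", "pediatric",
    "psychological", "pulmonary", "renal", "reproductive",
]


def parse_vector(line):
    """Parse one line of floats.

    ASSUMPTION: values are separated by commas and/or whitespace and may be
    wrapped in square brackets; adjust if the real files differ.
    """
    cleaned = line.strip().strip("[]").replace(",", " ")
    return [float(token) for token in cleaned.split()]


def top_categories(line, k=3):
    """Return the k most probable (category, probability) pairs for one row."""
    probs = parse_vector(line)
    if len(probs) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} values, got {len(probs)}")
    ranked = sorted(zip(CATEGORIES, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up row for illustration only (not taken from the dataset).
example_line = "0.02,0.01,0.70,0.03,0.10,0.01,0.02,0.02,0.03,0.02,0.01,0.01,0.01,0.01"
print(top_categories(example_line, k=2))
```

Sorting the pairs rather than taking a single argmax keeps the runner-up categories visible, which can be useful when the softmax output is spread across several topics.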

doc_embeddings.txt :

  • 1 x 384 embedding vectors taken from the last hidden state of the model, obtained by encoding doc_text with SentenceTransformers (sentence-transformers/all-MiniLM-L6-v2)
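One common way to use embeddings like these for retrieval is to rank documents by cosine similarity to a query vector. The sketch below illustrates that idea with toy 3-dimensional vectors. It is not taken from ctmatch, and the function names `cosine_similarity` and `rank_documents` are hypothetical. For the full 374648 x 384 matrix, a vectorized library would be the practical choice over pure Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, k=5):
    """Return the k best (row_index, similarity) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors for illustration; real rows would have 384 dimensions.
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
query_vec = [1.0, 0.0, 0.0]
ranking = rank_documents(query_vec, doc_vecs, k=2)
print(ranking)
```

The row indices returned here are positions in the embeddings file, so they can be used directly to look up the matching text and trial ID in the other files.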

index2docid.txt :

  • simple mapping of index to NCT IDs for filtering/reference throughout the IR program, in the same order as the vector and text representations
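Since the four files are aligned line by line, they can be read in lockstep without loading everything into memory. The sketch below shows one way to do that, assuming all four files sit in one local directory. The helper name `iter_aligned_rows` is hypothetical, and the lines are yielded as raw strings because the exact per-line format (for example, how index2docid.txt writes each entry) is not specified here.

```python
from contextlib import ExitStack
from itertools import zip_longest
from pathlib import Path

# File names as listed in the dataset description.
FILENAMES = (
    "index2docid.txt",
    "doc_texts.txt",
    "doc_categories.txt",
    "doc_embeddings.txt",
)


def iter_aligned_rows(directory):
    """Yield (row_number, (docid_line, text, categories_line, embedding_line)).

    Lines are returned unparsed. A ValueError is raised if the files turn out
    to have different lengths, since the row alignment would then be broken.
    """
    paths = [Path(directory) / name for name in FILENAMES]
    with ExitStack() as stack:
        handles = [stack.enter_context(open(p, encoding="utf-8")) for p in paths]
        for row_number, lines in enumerate(zip_longest(*handles)):
            if any(line is None for line in lines):
                raise ValueError(f"files differ in length at row {row_number}")
            yield row_number, tuple(line.rstrip("\n") for line in lines)


# Usage, assuming the files were downloaded to ./ctmatch_ir (hypothetical path):
# for row_number, (docid_line, text, cats, emb) in iter_aligned_rows("ctmatch_ir"):
#     ...
```

Using `zip_longest` instead of `zip` means a truncated file is reported as an error rather than silently shortening the iteration.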