Abstract
Computational text phenotyping is the practice of identifying patients with
certain disorders and traits from clinical notes. Rare diseases are challenging
to identify because few cases are available for machine learning and data
annotation requires domain experts. We propose a method using ontologies
and weak supervision, with recent pre-trained contextual representations from
Bi-directional Transformers (e.g. BERT). The ontology-based framework includes
two steps: (i) Text-to-UMLS, extracting phenotypes by contextually linking
mentions to concepts in the Unified Medical Language System (UMLS), using a Named
Entity Recognition and Linking (NER+L) tool, SemEHR, and weak supervision with
customised rules and contextual mention representation; (ii) UMLS-to-ORDO,
matching UMLS concepts to rare diseases in the Orphanet Rare Disease Ontology
(ORDO). The weakly supervised approach learns a phenotype confirmation model
that improves Text-to-UMLS linking without annotated data from domain experts.
We evaluated the approach on three clinical datasets of
discharge summaries and radiology reports from two institutions in the US and
the UK. Our best weakly supervised method achieved 81.4% precision and 91.4%
recall on extracting rare disease UMLS phenotypes from MIMIC-III discharge
summaries. Applied to clinical notes, the overall pipeline can surface rare
disease cases, most of which are not captured in structured data (manually
assigned ICD codes). Results on radiology reports from MIMIC-III and NHS Tayside were
consistent with the discharge summaries. We discuss the usefulness of the weak
supervision approach and propose directions for future studies.
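
The two-step flow described above (Text-to-UMLS, then UMLS-to-ORDO) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Mention` class, the function names, the placeholder concept identifiers, and the single mention-length rule are hypothetical and are not taken from SemEHR or from the paper's rule set. In the approach described, weak labels train a contextual phenotype confirmation model; here the rule is applied directly as a stand-in for that model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Mention:
    text: str      # surface form found in the clinical note
    umls_cui: str  # candidate UMLS concept proposed by an NER+L tool


# Hypothetical UMLS-to-ORDO mapping. The identifiers are placeholders,
# not real UMLS or ORDO codes.
UMLS_TO_ORDO: Dict[str, str] = {"CUI_A": "ORDO_1", "CUI_B": "ORDO_2"}


def weak_label(mention: Mention, min_chars: int = 4) -> bool:
    """Illustrative rule (an assumption): very short mentions are often
    ambiguous abbreviations, so they are treated as unconfirmed."""
    return len(mention.text) >= min_chars


def text_to_umls(mentions: List[Mention]) -> List[str]:
    """Step (i): keep only the candidate concepts whose mention is confirmed.
    The rule stands in for a learned confirmation model."""
    return [m.umls_cui for m in mentions if weak_label(m)]


def umls_to_ordo(cuis: List[str]) -> List[str]:
    """Step (ii): map confirmed UMLS concepts to rare diseases, dropping
    concepts that have no entry in the mapping."""
    return [UMLS_TO_ORDO[c] for c in cuis if c in UMLS_TO_ORDO]


def pipeline(mentions: List[Mention]) -> List[str]:
    """Run both steps in order on the mentions from one document."""
    return umls_to_ordo(text_to_umls(mentions))


if __name__ == "__main__":
    candidates = [
        Mention("Alport syndrome", "CUI_A"),  # confirmed and mapped
        Mention("HD", "CUI_B"),               # rejected by the length rule
        Mention("fever", "CUI_C"),            # confirmed but not a rare disease
    ]
    print(pipeline(candidates))
```

Keeping the confirmation step separate from the ontology matching step means the rule could later be replaced by a trained classifier without touching the mapping logic.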