Fall Research Expo 2020

A Phenotype-driven COVID-19 Knowledge Graph from Biomedical Literature Drives Hypothesis Generation

Since December 2019, the scientific community has experienced a literature explosion regarding the novel coronavirus originating in Wuhan, China. As a result, it has become increasingly difficult for researchers in the field to stay informed about new developments in the published corpus. To address this problem and help researchers collect, analyze, and organize the vast amount of information, we created a knowledge graph (KG) cataloguing the relationships found between entities, as evidenced by papers in the COVID-19 Open Research Dataset (CORD-19). We trained an embedding model to apply the KG to downstream tasks such as predicting new treatments, symptoms, and risk factors for COVID-19. The embedding model obtained a classification accuracy over 70%, with hits@10 of 0.61 and 0.18 depending on the expansiveness of the KG. We also built an interactive web application that lets researchers explore the KG and form novel questions. In conclusion, our KG compiles and extracts COVID-19 information useful for developing diagnostics and treatments. The web application is available at http://covid19nlp.wglab.org:3001/.
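For readers unfamiliar with the hits@10 metric quoted above, the following is a minimal sketch of how hits@k is typically computed for a KG embedding model. The abstract does not name the embedding model, so the TransE-style scoring function here is an assumption chosen for illustration, and the entity names, relation names, and vectors are hypothetical toy data, not results from the project.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)
Vec = List[float]


def transe_score(h: Vec, r: Vec, t: Vec) -> float:
    """TransE-style plausibility (an assumed model): negative L1 distance of h + r from t."""
    return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))


def tail_rank(triple: Triple, entities: Dict[str, Vec], relations: Dict[str, Vec]) -> int:
    """Rank of the true tail among all entities used as candidate tails (unfiltered setting)."""
    head, rel, true_tail = triple
    h, r = entities[head], relations[rel]
    scores = {name: transe_score(h, r, vec) for name, vec in entities.items()}
    true_score = scores[true_tail]
    # Rank = 1 + number of candidates that score strictly better than the true tail.
    return 1 + sum(1 for s in scores.values() if s > true_score)


def hits_at_k(ranks: List[int], k: int) -> float:
    """Fraction of test triples whose true tail is ranked in the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Hypothetical toy embeddings (2-dimensional for readability).
entities = {
    "COVID-19": [0.0, 0.0],
    "fever": [1.0, 0.0],
    "cough": [1.0, 0.2],
    "remdesivir": [0.0, 1.0],
}
relations = {
    "has_symptom": [1.0, 0.0],
    "treated_by": [0.0, 1.0],
}
test_triples = [
    ("COVID-19", "has_symptom", "fever"),
    ("COVID-19", "has_symptom", "cough"),
    ("COVID-19", "treated_by", "remdesivir"),
]

ranks = [tail_rank(t, entities, relations) for t in test_triples]
print("ranks:", ranks)
print("hits@1:", hits_at_k(ranks, 1))
print("hits@2:", hits_at_k(ranks, 2))
```

With a real KG the candidate set would be every entity in the graph and k would be 10, so a hits@10 of 0.61 would mean the correct entity appears among the ten highest-scoring candidates for 61% of held-out triples.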

PRESENTED BY
PURM - Penn Undergraduate Research Mentoring Program
College of Arts & Sciences 2023
Join Ryan for a virtual discussion

Comments

Are triples the convention for this kind of NLP and analysis? Would adding a fourth term as a metathesaurus concept or semantic network relation (e.g., A or B <verb> C, A <verb 1 or 2> affects B) be possible or useful? This work is extremely timely and presented very well!

Hi Ryan, congratulations on this very interesting project. I really liked how your research is interactive and very visual. I was curious though, how was the learning curve for developing the natural language processing? Were there any libraries or computational techniques you had to learn specifically for this research?