A Phenotype-driven COVID-19 Knowledge Graph from Biomedical Literature Drives Hypothesis Generation
Since December 2019, the scientific community has experienced a literature explosion regarding the novel coronavirus originating in Wuhan, China. As a result, it has become increasingly difficult for researchers in the field to stay informed about new developments in the published corpus. To address this problem and help researchers collect, analyze, and organize the vast amount of information, we created a knowledge graph (KG) cataloguing the relationships between entities as evidenced by papers in the COVID-19 Open Research Dataset (CORD-19). We trained an embedding model to apply the KG to downstream tasks such as predicting new treatments, symptoms, and risk factors for COVID-19. The embedding model obtained a classification accuracy over 70%, with hits@10 of 0.61 and 0.18, depending on the expansiveness of the KG. We also built an interactive web application that allows researchers to explore the KG and form novel questions. In conclusion, our KG compiles and extracts COVID-19 information useful for developing diagnostics and treatments. The web application is available at http://covid19nlp.wglab.org:3001/.
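For readers unfamiliar with the hits@10 metric reported above, the following is a minimal sketch of how it is commonly computed for KG embedding models. It assumes each test triple has already been assigned a 1-based rank for its true entity among all candidates; the function name `hits_at_k` and the example ranks are hypothetical and are not taken from the authors' code or results.

```python
from typing import Sequence


def hits_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Return the fraction of test triples whose true entity is ranked in the top k.

    `ranks` holds 1-based ranks of the correct entity for each test triple.
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks for ten test triples (illustrative only, not real data).
example_ranks = [1, 3, 12, 7, 48, 10, 2, 150, 9, 30]
score = hits_at_k(example_ranks, k=10)
print(f"hits@10 = {score:.2f}")
```

Under this definition, a hits@10 of 0.61 would mean that the correct entity appeared among the model's top ten candidates for 61% of the test triples.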
Comments
Triples
Are triples the convention for this kind of NLP and analysis? Would adding a fourth term, such as a metathesaurus concept or semantic network relation (e.g., A or B <verb> C, A <verb 1 or 2> affects B), be possible or useful? This work is extremely timely and presented very well!
Great Job!
Hi Ryan, congratulations on this very interesting project. I really liked how your research is interactive and very visual. I was curious, though: how was the learning curve for developing the natural language processing? Were there any libraries or computational techniques you had to learn specifically for this research?