Exploring Statistical and Computational Methods in Circadian Metabolomics
Metabolomics is a growing sub-field of systems biology centered on the large-scale study of small molecules (metabolites) in cells, biofluids, tissues, and organisms. Research has shown that analyzing metabolite concentrations is useful for biomarker discovery (e.g., distinguishing control from disease groups) and for advancing personalized medicine. When we consider metabolomics in light of the circadian rhythm, the 24-hour clock that regulates many internal processes in living organisms, we find an intricate but still poorly understood link between metabolism and the circadian clock. This summer, I was blessed to work under Dr. Aalim Weljie at the Department of Systems Pharmacology and Translational Therapeutics at Penn Medicine and explore this gap in knowledge through two mini-projects.
One of the most important aspects of metabolomics is the well-defined workflow that every experiment must follow: setting up and gathering the biological samples, acquiring the data (usually through mass spectrometry), performing statistical analysis, identifying the compounds in the data and the reaction chains (pathways) associated with them, and finally developing biological models by analyzing these pathways. To perform the identification step efficiently across thousands of data points, however, we need a lab database that holds this pathway information along with the IDs these metabolites carry in multiple online databases. This was my first project: create this lab database (it currently includes about 1500 metabolites) and develop R scripts that take the output of the different mass spectrometry methods and annotate it with the information in the database. This project helped me understand how many data structures are implemented in R, especially for text analysis, as I tried different ways to match the metabolite names in the mass spectrometry output to the names in the database while accounting for spelling errors, stray non-alphanumeric characters, and so on.
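As a rough illustration of the name-matching step, here is a minimal sketch in Python (the project itself used R, and this is not the lab's actual script). It assumes a tiny hypothetical database, `LAB_DB`, with placeholder IDs; the function names `normalize` and `annotate` and the similarity cutoff of 0.85 are likewise assumptions made for the example.

```python
import difflib
import re

# Hypothetical miniature lab database: canonical name -> placeholder annotations.
# The IDs and pathway labels are stand-ins, not real database entries.
LAB_DB = {
    "L-Tryptophan": {"id": "ID_001", "pathway": "pathway A"},
    "Citric acid": {"id": "ID_002", "pathway": "pathway B"},
    "D-Glucose": {"id": "ID_003", "pathway": "pathway C"},
}


def normalize(name):
    """Lowercase the name and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def annotate(raw_name, db=LAB_DB, cutoff=0.85):
    """Return the canonical database name for raw_name, or None if no match.

    An exact match on the normalized name is tried first; otherwise the
    closest normalized name is accepted if its similarity ratio >= cutoff.
    """
    by_key = {normalize(canonical): canonical for canonical in db}
    query = normalize(raw_name)
    if query in by_key:
        return by_key[query]
    close = difflib.get_close_matches(query, list(by_key), n=1, cutoff=cutoff)
    return by_key[close[0]] if close else None


if __name__ == "__main__":
    for raw in ["l tryptophan ", "citric_acid", "L-Tryptophane", "Caffeine"]:
        print(repr(raw), "->", annotate(raw))
```

Normalizing before comparing handles case and stray punctuation cheaply, so the fuzzy comparison only has to absorb genuine spelling differences. The cutoff trades missed matches against false ones and would need tuning on real metabolite names.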
To determine whether certain metabolite concentrations display circadian behavior, a typical experiment measures the levels of a wide range of metabolites at multiple timepoints over the course of 1-2 days. One then runs one or more statistical algorithms that test each metabolite for circadian behavior (as expected, the algorithms output p-values, and a p-value below 0.05 is considered significant). This was my second project: to explore these statistical algorithms and compare their outputs when run on the same dataset. With this project, I was able to learn a ton about different types of regression (linear, nonlinear, single, multiple, etc.) and how to properly wrangle data to perform these analyses in R.
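To make the idea of a rhythmicity test concrete, here is a minimal Python sketch of one common regression-based approach, single-component cosinor regression. It is offered as an assumption for illustration and is not necessarily among the algorithms compared in the project, which was carried out in R. The model y = M + a cos(2πt/T) + b sin(2πt/T) is fitted by least squares and compared with a constant-only model using an F-test. Because the numerator has 2 degrees of freedom, the p-value has the closed form (RSS_fit / RSS_flat)^((n - 3) / 2), valid under the usual assumption of independent normal errors. All function names and the example data are hypothetical.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def cosinor(times, values, period=24.0):
    """Fit a single-component cosinor model and test it against a flat line."""
    w = 2 * math.pi / period
    X = [[1.0, math.cos(w * t), math.sin(w * t)] for t in times]
    # Normal equations (X'X) beta = X'y for the three coefficients.
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * y for r, y in zip(X, values)) for i in range(3)]
    mesor, a, b = solve_linear(XtX, Xty)
    fitted = [mesor + a * r[1] + b * r[2] for r in X]
    rss_fit = sum((y - f) ** 2 for y, f in zip(values, fitted))
    mean = sum(values) / len(values)
    rss_flat = sum((y - mean) ** 2 for y in values)
    df2 = len(values) - 3
    # Exact F-test p-value for 2 numerator degrees of freedom.
    p = 1.0 if rss_flat == 0 else min(1.0, (rss_fit / rss_flat) ** (df2 / 2))
    return {
        "mesor": mesor,
        "amplitude": math.hypot(a, b),
        "peak_time_h": (math.atan2(b, a) / w) % period,
        "p_value": p,
    }


# Made-up example: samples every 4 h over 2 days.
times = [4 * i for i in range(12)]
wiggle = [0.1 * (-1) ** i for i in range(12)]  # small non-circadian variation
rhythmic = [10 + 3 * math.cos(2 * math.pi * (t - 6) / 24) + e
            for t, e in zip(times, wiggle)]
flat = [10 + e for e in wiggle]

if __name__ == "__main__":
    print("rhythmic:", cosinor(times, rhythmic))
    print("flat:    ", cosinor(times, flat))
```

In a real analysis the p-values would also be corrected for multiple testing across all metabolites, and since different algorithms make different assumptions about waveform shape and sampling, their outputs on the same dataset can disagree.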
All in all, I learned a great deal about how research is structured in a field in which I had virtually no prior experience, and I gained important statistical and technical skills along the way. I also established a great long-term research relationship with Dr. Weljie, his lab, and many of the other faculty in his department, to whose work I hope to contribute significantly in my remaining years at Penn.
Comments
Great poster!
This is truly a wonderful poster presentation. You did a great job defining terms so that a lay person can understand the complexities of metabolomics. Nice use of graphics, too. Congratulations!