Fall Research Expo 2022

Exploring Quantification Methods for Simple and Complex Tandem Repeats on Nanopore Sequencing Data

Tandem repeats are sequences of DNA bases that are repeated many times within a chromosome. These repeats can be grouped into variable number tandem repeats (VNTRs) and short tandem repeats (STRs), where the key difference is that VNTRs consist of comparatively longer repeat units. Tandem repeats are significant because they are often used as genetic markers for identification and are also associated with many diseases, such as Huntington’s disease, fragile X syndrome, and bipolar disorder.

My research explores and evaluates methods for quantifying tandem repeats (VNTRs and STRs) from long-read sequencing data, specifically Oxford Nanopore sequencing. The data were generated by Dr. Egli’s lab at Columbia University and split into two samples by generation date (09/24 and 03/11). These samples contained DNA reads (each a sequence of base pairs) with VNTRs in two regions of interest, one on chromosome 9 and the other on chromosome 11. To analyze these samples, I first calculated statistics such as total reads, mean read length, and sample enrichment, and then estimated repeat count with three repeat quantification methods: EMBOSS Needle pairwise alignment of primary reads followed by manual repeat estimation, and two computational tools, RepeatHMM and NanoRepeat. Afterwards, I evaluated the repeat estimates to gauge the efficacy and precision of each method.
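As a rough illustration of the first step, the summary statistics on total reads and mean read length could be computed as in the minimal Python sketch below. It assumes reads are available as standard 4-line FASTQ records; the function names and the tiny example input are hypothetical and are not taken from the project.

```python
from statistics import mean


def read_lengths_from_fastq(lines):
    """Yield the sequence length of each record in 4-line FASTQ text."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the second line of each record holds the bases
            yield len(line.strip())


def summarize_reads(lengths):
    """Return total read count and mean read length for an iterable of lengths."""
    lengths = list(lengths)
    return {
        "total_reads": len(lengths),
        "mean_read_length": mean(lengths) if lengths else 0.0,
    }


# Made-up two-read example, for illustration only (not project data).
example_fastq = """@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGTAC
+
IIIIII
""".splitlines()

stats = summarize_reads(read_lengths_from_fastq(example_fastq))
print(stats)
```

For real data, the same functions could be fed the lines of an opened FASTQ file instead of the inline example.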

Analysis quickly concentrated on the chromosome 11 region of interest because both VNTR samples had very few supporting reads for the chromosome 9 region. The chromosome 11 region showed approximately 60-fold and 850-fold enrichment in the 09/24 and 03/11 samples, respectively. Using manual estimation as a baseline, it became evident that all three methods produced concurring average repeat count estimates, with NanoRepeat slightly outperforming RepeatHMM.
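The abstract does not state how enrichment was calculated. One common definition, assumed here purely for illustration, divides the observed fraction of reads that fall in the target region by the fraction expected if reads were spread uniformly across the genome. The function name and all numbers in this sketch are hypothetical.

```python
def fold_enrichment(on_target_reads, total_reads, target_bp, genome_bp):
    """Fold enrichment under an ASSUMED definition:
    (on-target read fraction) / (fraction of the genome covered by the target).
    """
    if total_reads <= 0 or target_bp <= 0 or genome_bp <= 0:
        raise ValueError("total_reads, target_bp and genome_bp must be positive")
    observed = on_target_reads / total_reads  # fraction of reads on target
    expected = target_bp / genome_bp          # fraction expected by chance
    return observed / expected


# Made-up numbers, for illustration only (not the values behind the results above).
example = fold_enrichment(
    on_target_reads=5,
    total_reads=1000,
    target_bp=10_000,
    genome_bp=100_000_000,
)
print(round(example, 2))
```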

As a future extension, I am currently working with data generated by Dr. Li Fang at CHOP. This dataset comprises 12 Huntington’s disease (HTT) cell lines in which repeat count is inferred/validated through Sanger sequencing. Huntington’s disease is known to be caused by an STR expansion of trinucleotide CAG repeats. Comparing NanoRepeat and RepeatHMM estimates against these reference counts would allow for a more standardized evaluation of each tool's repeat detection ability.
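To make the idea of a repeat count concrete, the sketch below finds the longest uninterrupted run of a repeat unit such as CAG in a single read. It is a naive exact-match illustration with a hypothetical function name and a made-up read; it is not how RepeatHMM or NanoRepeat work, and exact matching like this would likely undercount on error-prone long reads.

```python
import re


def longest_repeat_run(sequence, unit="CAG"):
    """Return the number of units in the longest uninterrupted run of `unit`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [
        len(match.group(0)) // len(unit)
        for match in pattern.finditer(sequence.upper())
    ]
    return max(runs, default=0)


# Made-up read: flank, five CAG units, an interruption, one more CAG, flank.
example_read = "TTGA" + "CAG" * 5 + "CAACAG" + "CCGCCA"
print(longest_repeat_run(example_read))
```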