Computational methods for efficient processing and analysis of short-read Next-Generation DNA sequencing data

Nadukkalam Ravindran, Praveen

dc.contributor.author	Nadukkalam Ravindran, Praveen
dc.contributor.copyright-release	Yes	en_US
dc.contributor.degree	Doctor of Philosophy	en_US
dc.contributor.department	Faculty of Computer Science	en_US
dc.contributor.ethics-approval	Not Applicable	en_US
dc.contributor.external-examiner	Dr. Timothy Frasier	en_US
dc.contributor.graduate-coordinator	Dr. Michael McAllister	en_US
dc.contributor.manuscripts	Yes	en_US
dc.contributor.thesis-reader	Dr. Norbert Zeh	en_US
dc.contributor.thesis-reader	Dr. Nauzer Kalyaniwalla	en_US
dc.contributor.thesis-supervisor	Dr. Robert Beiko	en_US
dc.contributor.thesis-supervisor	Dr. Ian R. Bradbury	en_US
dc.date.accessioned	2020-04-13T12:42:01Z
dc.date.available	2020-04-13T12:42:01Z
dc.date.defence	2020-03-19
dc.date.issued	2020-04-13T12:42:01Z
dc.description.abstract	DNA sequencing has transformed the discipline of population genetics, which seeks to assess the level of genetic diversity within species or populations and to infer the geographic and temporal distributions between members of a population. Restriction-site associated DNA sequencing (RADSeq) is a next-generation sequencing (NGS) technique that produces data consisting of relatively short (typically 50 to 300 nucleotide) fragments or “reads” of sequenced DNA and enables large-scale analysis of individuals and populations. In this thesis, we describe computational methods that use graph-based structures to represent these short reads and to capture the relationships among them. A key challenge in RADSeq analysis is to identify optimal parameter settings for the assignment of reads to loci (singular: locus), which correspond to specific regions in the genome. A parameter sweep is computationally intensive, as the entire analysis needs to be run for each parameter set. We propose a graph-based structure (RADProc), which provides persistence and eliminates redundancy to enable parameter sweeps. For 20 green crab samples and 32 different parameter sets, RADProc took only 2.5 hours, while the widely used Stacks software took 78 hours. Another challenge is to identify paralogs: sequences that are highly similar due to recent duplication events but occur in different regions of the genome and should not be merged into the same locus. We introduce PMERGE, which identifies paralogs by clustering the catalog locus consensus sequences based on similarity. PMERGE is built on the fact that paralogs may be wrongly merged into a single locus in some but not all samples. PMERGE identified 62%-87% of paralogs in the Atlantic salmon and green crab datasets. Gene flow is the movement of alleles (specific sequence variants at a given locus) between populations and is an important indicator of population mixing that changes genetic diversity within the populations.
We use the RADProc graph to infer gene flow among populations using allele frequency differences in exclusively shared alleles in each pair of populations. The method successfully inferred gene flow patterns in simulated datasets and provided insights into reasons for observed hybridization at two locations in a green crab dataset.	en_US
dc.identifier.uri	http://hdl.handle.net/10222/78429
dc.language.iso	en	en_US
dc.subject	Computational methods	en_US
dc.subject	Software optimization	en_US
dc.subject	Population genetics	en_US
dc.subject	Short-read DNA sequence processing	en_US
dc.title	Computational methods for efficient processing and analysis of short-read Next-Generation DNA sequencing data	en_US

Files

Original bundle

Name: NadukkalamRavindran-Praveen-PhD-FCS-March-2020.pdf
Size: 7.58 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission

Collections

Faculty of Graduate Studies Online Theses