
Computational methods for efficient processing and analysis of short-read Next-Generation DNA sequencing data

dc.contributor.author: Nadukkalam Ravindran, Praveen
dc.contributor.copyright-release: Yes [en_US]
dc.contributor.degree: Doctor of Philosophy [en_US]
dc.contributor.department: Faculty of Computer Science [en_US]
dc.contributor.ethics-approval: Not Applicable [en_US]
dc.contributor.external-examiner: Dr. Timothy Frasier [en_US]
dc.contributor.graduate-coordinator: Dr. Michael McAllister [en_US]
dc.contributor.manuscripts: Yes [en_US]
dc.contributor.thesis-reader: Dr. Norbert Zeh [en_US]
dc.contributor.thesis-reader: Dr. Nauzer Kalyaniwalla [en_US]
dc.contributor.thesis-supervisor: Dr. Robert Beiko [en_US]
dc.contributor.thesis-supervisor: Dr. Ian R. Bradbury [en_US]
dc.date.accessioned: 2020-04-13T12:42:01Z
dc.date.available: 2020-04-13T12:42:01Z
dc.date.defence: 2020-03-19
dc.date.issued: 2020-04-13T12:42:01Z
dc.description.abstract: DNA sequencing has transformed the discipline of population genetics, which seeks to assess the level of genetic diversity within species or populations and to infer the geographic and temporal distributions among members of a population. Restriction-site associated DNA sequencing (RADSeq) is a next-generation sequencing (NGS) technique that produces relatively short (typically 50 to 300 nucleotide) fragments, or “reads”, of sequenced DNA and enables large-scale analysis of individuals and populations. In this thesis, we describe computational methods that use graph-based structures to represent these short reads and to capture the relationships among them. A key challenge in RADSeq analysis is to identify optimal parameter settings for the assignment of reads to loci (singular: locus), which correspond to specific regions in the genome. A parameter sweep is computationally intensive because the entire analysis must be rerun for each parameter set. We propose a graph-based structure (RADProc) that provides persistence and eliminates redundancy to enable parameter sweeps. For 20 green crab samples and 32 different parameter sets, RADProc took only 2.5 hours, while the widely used Stacks software took 78 hours. Another challenge is to identify paralogs: sequences that are highly similar due to recent duplication events but occur in different regions of the genome and should not be merged into the same locus. We introduce PMERGE, which identifies paralogs by clustering the catalog locus consensus sequences based on similarity. PMERGE builds on the fact that paralogs may be wrongly merged into a single locus in some but not all samples. PMERGE identified 62%-87% of paralogs in the Atlantic salmon and green crab datasets. Gene flow is the movement of alleles (specific sequence variants at a given locus) between populations, and it is an important indicator of population mixing that changes genetic diversity within the populations. We use the RADProc graph to infer gene flow among populations using allele frequency differences in exclusively shared alleles in each pair of populations. The method successfully inferred gene flow patterns in simulated datasets and provided insights into reasons for observed hybridization at two locations in a green crab dataset. [en_US]
dc.identifier.uri: http://hdl.handle.net/10222/78429
dc.language.iso: en [en_US]
dc.subject: Computational methods [en_US]
dc.subject: Software optimization [en_US]
dc.subject: Population genetics [en_US]
dc.subject: Short-read DNA sequence processing [en_US]
dc.title: Computational methods for efficient processing and analysis of short-read Next-Generation DNA sequencing data [en_US]

Files

Original bundle

Name: NadukkalamRavindran-Praveen-PhD-FCS-March-2020.pdf
Size: 7.58 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission