INFERRING ECOLOGICAL POPULATION STRUCTURE AND ENVIRONMENTAL ASSOCIATIONS THROUGH AUTOMATED ANALYSIS OF REPEAT-CONTAINING AND POLYMORPHIC DNA SEQUENCES
Abstract
Biodiversity conservation plays an important role in maintaining healthy ecosystems. Genetic diversity provides a foundation for understanding diversity at the organism and population levels. Genomic DNA markers make it possible to identify the genetic variation that distinguishes populations, and they can be used to investigate the forces that drive adaptation to different environments. Microsatellites, or short simple-repeat DNA sequences, are among the most popular genetic markers for many biological applications. However, microsatellite data require extensive manual checking for errors and characteristic signals, a laborious process that can take days or weeks for a single dataset. We have developed MEGASAT, a bioinformatics approach that automates microsatellite genotyping from DNA sequence data. MEGASAT uses fuzzy matching and counts of frequently observed sequences to distinguish true genotype signal from errors. We validated MEGASAT using microsatellite data from a population sample of 71 guppies from Trinidad. A combination of genotyping error estimation methods showed that MEGASAT-called genotypes are highly reproducible and accurate.
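The general idea of combining fuzzy matching with counts of frequently observed sequences can be illustrated with a minimal Python sketch. This is not MEGASAT's actual implementation: the function names, the Hamming-distance flank matching, the mismatch limit, and the `min_ratio` threshold for calling a second allele are all assumptions chosen for illustration.

```python
from collections import Counter


def within_mismatches(a, b, max_mismatches):
    """True if equal-length strings a and b differ at <= max_mismatches positions."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= max_mismatches


def find_fuzzy(read, pattern, max_mismatches, start=0):
    """Index of the first window of read (from start) that fuzzily matches pattern, else -1."""
    for i in range(start, len(read) - len(pattern) + 1):
        if within_mismatches(read[i:i + len(pattern)], pattern, max_mismatches):
            return i
    return -1


def call_genotype(reads, left_flank, right_flank, max_mismatches=1, min_ratio=0.3):
    """Call a diploid genotype as a pair of repeat-region lengths (in bases).

    Hypothetical rule: the most frequent length is one allele; the second most
    frequent length is the other allele only if its read count reaches
    min_ratio times the top count. Otherwise the sample is called homozygous.
    """
    length_counts = Counter()
    for read in reads:
        left = find_fuzzy(read, left_flank, max_mismatches)
        if left < 0:
            continue  # left flank not found: discard the read
        repeat_start = left + len(left_flank)
        right = find_fuzzy(read, right_flank, max_mismatches, repeat_start)
        if right < 0:
            continue  # right flank not found: discard the read
        length_counts[right - repeat_start] += 1

    if not length_counts:
        return None  # no usable reads for this locus

    ranked = length_counts.most_common(2)
    top_length, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] >= min_ratio * top_count:
        return tuple(sorted((top_length, ranked[1][0])))
    return (top_length, top_length)
```

Rare lengths, such as those produced by sequencing or amplification errors, fall below the assumed ratio threshold and are ignored, while reads with a small number of flank mismatches still contribute to the counts.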
We also developed a random forest (RF) based method to identify adaptive gene variants, and the environmental factors associated with them, in sea scallop data. Our approach uses the inverse Cholesky transformation to account for spatial autocorrelation in the genetic and environmental data, and ordination techniques to further explore the relationships between the two data sets. Variable importance rankings from RF models and ordination techniques were both applied to corrected and uncorrected data to identify which environmental variables play an important role in shaping the genetic structure of sea scallop populations.
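The inverse Cholesky transformation mentioned above can be sketched in a few lines of Python. The sketch assumes an exponential spatial covariance between sampling sites, exp(-distance / range), which is a common choice but is not stated in the text; the function names and the `range_param` argument are likewise hypothetical. Given a covariance matrix Sigma = L L^T, multiplying each data column by L^{-1} yields values whose assumed spatial covariance is the identity, and these decorrelated columns could then be passed to an RF model or an ordination.

```python
import math


def exp_covariance(coords, range_param):
    """Assumed spatial covariance between sites: exp(-distance / range_param)."""
    n = len(coords)
    return [[math.exp(-math.dist(coords[i], coords[j]) / range_param)
             for j in range(n)] for i in range(n)]


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def inverse_cholesky_transform(L, data):
    """Apply L^{-1} to data (rows = sites, columns = variables) by forward substitution."""
    n, n_vars = len(L), len(data[0])
    out = [[0.0] * n_vars for _ in range(n)]
    for v in range(n_vars):
        for i in range(n):
            s = sum(L[i][k] * out[k][v] for k in range(i))
            out[i][v] = (data[i][v] - s) / L[i][i]
    return out
```

The pure-Python routines keep the sketch dependency-free; for real data sets, `numpy.linalg.cholesky` together with `scipy.linalg.solve_triangular` would perform the same two steps more efficiently.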