dc.contributor.author: Manuele, Alexander
dc.date.accessioned: 2021-08-19T14:18:50Z
dc.date.available: 2021-08-19T14:18:50Z
dc.date.issued: 2021-08-19T14:18:50Z
dc.identifier.uri: http://hdl.handle.net/10222/80695
dc.description.abstract: Next-generation DNA sequencing technologies have made marker-gene DNA sequence data widely available. Analysis of microbiome data has many challenges, including sparsity, high cardinality, and intra-study dependencies during feature engineering. Language-modelling techniques may provide the means to overcome these challenges. The first step in sequence modelling is dividing the sequence into sensible tokens. We show that trained tokenization strategies, namely byte-pair encoding and unigram language modelling, can replace traditional sliding-window-based segmentation techniques for DNA marker genes in classification, clustering, and language-modelling tasks. We then propose a novel approach for feature representation of DNA marker genes: a training scheme to learn dense vector representations of DNA sequences using transformer language models optimized using DNA sequence pair-wise alignment scores. We demonstrate that our representations match or exceed previously published approaches for treatment of individual marker genes and of microbiome samples while providing fixed-length, low-cardinality representations of each. (en_US)
dc.language.iso: en (en_US)
dc.subject: Representation Learning (en_US)
dc.subject: Machine Learning (en_US)
dc.subject: Bioinformatics (en_US)
dc.subject: Language Modelling (en_US)
dc.title: Novel Approaches to Marker Gene Representation Learning Using Trained Tokenizers and Jointly Trained Transformer Models (en_US)
dc.date.defence: 2021-08-16
dc.contributor.department: Faculty of Computer Science (en_US)
dc.contributor.degree: Master of Computer Science (en_US)
dc.contributor.external-examiner: n/a (en_US)
dc.contributor.graduate-coordinator: Mike McAlister (en_US)
dc.contributor.thesis-reader: Vlado Keselj (en_US)
dc.contributor.thesis-reader: Sageev Oore (en_US)
dc.contributor.thesis-supervisor: Robert Beiko (en_US)
dc.contributor.ethics-approval: Not Applicable (en_US)
dc.contributor.manuscripts: Not Applicable (en_US)
dc.contributor.copyright-release: Not Applicable (en_US)