Show simple item record

dc.contributor.author: S Pradeep, Sushumna
dc.date.accessioned: 2024-10-01T14:23:12Z
dc.date.available: 2024-10-01T14:23:12Z
dc.date.issued: 2024-09-25
dc.identifier.uri: http://hdl.handle.net/10222/84633
dc.description (en_US): This thesis investigates various word embedding models, including PubMedBERT, BioBERT, SkipGram, CBOW, and GloVe, in the context of Literature-Based Discovery (LBD) within biomedical research, specifically focusing on cancer-related entities. The study evaluates the effectiveness of these models in identifying known functional relationships among genes, diseases, and chemicals, using curated data from the Comparative Toxicogenomics Database (CTD) as a reference. Initially, the research assesses how well these models capture existing interactions within the medical literature. Subsequently, it explores the models' capabilities to discover previously unknown functional relationships, specifically targeting relationships that emerged in CTD version 2024 but were absent in version 2022. Word embeddings were generated from PubMed abstracts up to 2022, and their functional relatedness was measured using cosine similarity for curated pairs from the CTD dataset. Performance was evaluated through precision and recall calculations at cosine similarity thresholds of 0.6, 0.7, and 0.8. Heatmaps were used to compare model performance. The findings indicate that PubMedBERT and BioBERT significantly outperformed traditional models like CBOW, SkipGram, and GloVe, particularly at a threshold of 0.7, which balances accuracy and data retrieval. Notably, the embeddings successfully captured functional relationships in newly curated pairs from the CTD dataset, including 42 disease-chemical pairs, 58 disease-gene pairs, and 83 chemical-gene pairs, demonstrating the models' potential for conducting LBD in biomedical literature.
dc.description.abstract (en_US): This thesis investigates word embedding models, including PubMedBERT, BioBERT, SkipGram, CBOW, and GloVe, in the context of Literature-Based Discovery (LBD) within biomedical research, with a specific focus on cancer-related entities. First, I study the effectiveness of word embedding models in identifying known functional relationships (e.g., interactions) between genes, diseases, and chemicals, as recorded in the medical literature. As a reference, I use curated functional relationships from the Comparative Toxicogenomics Database (CTD). The goal is to evaluate each word embedding model, highlighting its strengths and weaknesses in identifying functional relationships in particular, and in biomedical text mining in general. Next, I study the ability of word embedding models to discover previously unknown functional relationships from the medical literature. I create word embeddings from the medical literature up to 2022, and check whether they can identify functional relationships that were not in CTD at that time (i.e., functional relationships found in CTD version 2024 but not in CTD version 2022; a time-slicing approach). Success here would mean that word embedding models can conduct LBD: they can identify previously unknown functional relationships from the medical literature. I created word embeddings using CBOW, SkipGram, GloVe, BioBERT, and PubMedBERT, trained on PubMed abstracts up to 2022. After generating the embeddings, I measured functional relatedness using cosine similarity for curated pairs from the CTD dataset. To evaluate the performance of these models, I calculated precision and recall over the curated CTD pairs, comparing the cosine similarity of each pair's embedding vectors against thresholds of 0.6, 0.7, and 0.8. Once these values were obtained, heatmaps were plotted to compare model performance and identify which model produced the best results.
The findings reveal that PubMedBERT and BioBERT significantly outperform traditional models such as CBOW, SkipGram, and GloVe on both precision and recall, especially at a cosine similarity threshold of 0.7, which was identified as an optimal balance between accuracy and comprehensive data retrieval. The results also show that the word embeddings created from PubMed abstracts up to 2022 are able to capture functional relationships in newly curated pairs from the CTD dataset. Specifically, the dataset included 157 disease-chemical pairs, 138 disease-gene pairs, and 191 chemical-gene pairs; using the generated word embeddings, the models successfully captured relatedness in 42 disease-chemical pairs, 58 disease-gene pairs, and 83 chemical-gene pairs.
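The thresholded cosine-similarity evaluation described in the abstract can be sketched as follows. This is a minimal illustration, not code from the thesis: the name-to-vector dictionary, the function names, and the toy entities are all assumed for the example, and only the "fraction of curated pairs captured" (a recall-style measure) is shown, since precision additionally requires a candidate set of non-curated pairs.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def fraction_captured(curated_pairs, embeddings, threshold):
    """Fraction of curated CTD pairs whose entity embeddings meet the
    cosine similarity threshold (a recall-style measure).

    curated_pairs: list of (entity_a, entity_b) name tuples
    embeddings:    dict mapping entity name -> embedding vector
    threshold:     e.g. 0.6, 0.7, or 0.8 as in the thesis
    """
    captured = 0
    for a, b in curated_pairs:
        # Skip pairs with no embedding (entity absent from the corpus vocabulary).
        if a in embeddings and b in embeddings:
            if cosine_similarity(embeddings[a], embeddings[b]) >= threshold:
                captured += 1
    return captured / len(curated_pairs)

# Toy example with 2-d vectors standing in for real embeddings.
embs = {
    "gene_a":    np.array([1.0, 0.0]),
    "disease_b": np.array([1.0, 0.0]),  # identical direction -> similarity 1.0
    "chem_c":    np.array([0.0, 1.0]),  # orthogonal -> similarity 0.0
}
pairs = [("gene_a", "disease_b"), ("gene_a", "chem_c")]
print(fraction_captured(pairs, embs, 0.7))  # one of two pairs captured -> 0.5
```

Sweeping this over the three thresholds and all five models yields the per-model scores that the thesis visualizes as heatmaps.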
dc.language.iso (en_US): en
dc.subject (en_US): NLP
dc.subject (en_US): Word Embeddings
dc.subject (en_US): LBD
dc.title (en_US): Investigating Word Embedding Techniques for Extracting Disease, Gene, and Chemical Relationships from Biomedical Texts
dc.date.defence: 2024-08-26
dc.contributor.department (en_US): Faculty of Computer Science
dc.contributor.degree (en_US): Master of Computer Science
dc.contributor.external-examiner (en_US): N/A
dc.contributor.thesis-reader (en_US): Dr. Hassan Sajjad
dc.contributor.thesis-reader (en_US): Dr. Samina Abidi
dc.contributor.thesis-supervisor (en_US): Dr. Syed Sibte Raza Abidi
dc.contributor.thesis-supervisor (en_US): Dr. William Van Woensel
dc.contributor.ethics-approval (en_US): Not Applicable
dc.contributor.manuscripts (en_US): Not Applicable
dc.contributor.copyright-release (en_US): Not Applicable