Show simple item record

dc.contributor.author: S Pradeep, Sushumna
dc.date.accessioned: 2024-10-01T14:23:12Z
dc.date.available: 2024-10-01T14:23:12Z
dc.date.issued: 2024-09-25
dc.identifier.uri: http://hdl.handle.net/10222/84633
dc.description (en_US): This thesis investigates various word embedding models, including PubMedBERT, BioBERT, SkipGram, CBOW, and GloVe, in the context of Literature-Based Discovery (LBD) within biomedical research, specifically focusing on cancer-related entities. The study evaluates the effectiveness of these models in identifying known functional relationships among genes, diseases, and chemicals, using curated data from the Comparative Toxicogenomics Database (CTD) as a reference. Initially, the research assesses how well these models capture existing interactions within the medical literature. Subsequently, it explores the models' capabilities to discover previously unknown functional relationships, specifically targeting relationships that emerged in CTD version 2024 but were absent in version 2022. Word embeddings were generated from PubMed abstracts up to 2022, and their functional relatedness was measured using cosine similarity for curated pairs from the CTD dataset. Performance was evaluated through precision and recall calculations at cosine similarity thresholds of 0.6, 0.7, and 0.8. Heatmaps were used to compare model performance. The findings indicate that PubMedBERT and BioBERT significantly outperformed traditional models like CBOW, SkipGram, and GloVe, particularly at a threshold of 0.7, which balances accuracy and data retrieval. Notably, the embeddings successfully captured functional relationships in newly curated pairs from the CTD dataset, including 42 disease-chemical pairs, 58 disease-gene pairs, and 83 chemical-gene pairs, demonstrating the models' potential for conducting LBD in biomedical literature.
dc.description.abstract (en_US): This thesis investigates word embedding models, including PubMedBERT, BioBERT, SkipGram, CBOW, and GloVe, in the context of Literature-Based Discovery (LBD) within biomedical research, with a specific focus on cancer-related entities. First, I study the effectiveness of word embedding models in identifying known functional relationships (e.g., interactions) between genes, diseases, and chemicals, as recorded in the medical literature. As a reference, I use curated functional relationships from the Comparative Toxicogenomics Database (CTD). The goal is to evaluate each word embedding model, highlighting its strengths and weaknesses in identifying functional relationships in particular, and in biomedical text mining in general. Next, I study the ability of word embedding models to discover previously unknown functional relationships from the medical literature. I create word embeddings from the medical literature up to 2022, and check whether they can identify functional relationships that were not in CTD at that time (i.e., functional relationships found in CTD version 2024 but not in CTD version 2022; a time-slicing approach). Success here would mean that word embedding models can conduct LBD: they can identify previously unknown functional relationships from the medical literature. I created word embeddings using CBOW, SkipGram, GloVe, BioBERT, and PubMedBERT, trained on PubMed abstracts up to 2022. After generating the embeddings, I measured functional relatedness using cosine similarity for curated pairs from the CTD dataset. To evaluate the performance of these models, I calculated precision and recall over the curated CTD pairs, comparing the cosine similarity of each pair's embedding vectors against thresholds of 0.6, 0.7, and 0.8. Once these values were obtained, heatmaps were plotted to compare model performance and identify which model produced the best results.
The findings reveal that PubMedBERT and BioBERT significantly outperform traditional models such as CBOW, SkipGram, and GloVe on both precision and recall, especially at a cosine similarity threshold of 0.7, which was identified as an optimal balance between accuracy and comprehensive data retrieval. The results also show that the word embeddings created from PubMed abstracts up to 2022 are able to capture functional relationships in newly curated pairs from the CTD dataset. Specifically, the dataset included 157 disease-chemical pairs, 138 disease-gene pairs, and 191 chemical-gene pairs; using the generated word embeddings, the models successfully captured relatedness in 42 disease-chemical pairs, 58 disease-gene pairs, and 83 chemical-gene pairs.
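The thresholded cosine-similarity evaluation described in the abstract can be sketched as follows. This is a minimal illustration, not code from the thesis: the name-to-vector dictionary, the function names, and the toy entities are all assumed for the example, and only the "fraction of curated pairs captured" (a recall-style measure) is shown, since precision additionally requires a candidate set of non-curated pairs.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def fraction_captured(curated_pairs, embeddings, threshold):
    """Fraction of curated CTD pairs whose entity embeddings meet the
    cosine similarity threshold (a recall-style measure).

    curated_pairs: list of (entity_a, entity_b) name tuples
    embeddings:    dict mapping entity name -> embedding vector
    threshold:     e.g. 0.6, 0.7, or 0.8 as in the thesis
    """
    captured = 0
    for a, b in curated_pairs:
        # Skip pairs with no embedding (entity absent from the corpus vocabulary).
        if a in embeddings and b in embeddings:
            if cosine_similarity(embeddings[a], embeddings[b]) >= threshold:
                captured += 1
    return captured / len(curated_pairs)

# Toy example with 2-d vectors standing in for real embeddings.
embs = {
    "gene_a":    np.array([1.0, 0.0]),
    "disease_b": np.array([1.0, 0.0]),  # identical direction -> similarity 1.0
    "chem_c":    np.array([0.0, 1.0]),  # orthogonal -> similarity 0.0
}
pairs = [("gene_a", "disease_b"), ("gene_a", "chem_c")]
print(fraction_captured(pairs, embs, 0.7))  # one of two pairs captured -> 0.5
```

Sweeping this over the three thresholds and all five models yields the per-model scores that the thesis visualizes as heatmaps.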
dc.language.iso (en_US): en
dc.subject (en_US): NLP
dc.subject (en_US): Word Embeddings
dc.subject (en_US): LBD
dc.title (en_US): Investigating Word Embedding Techniques for Extracting Disease, Gene, and Chemical Relationships from Biomedical Texts
dc.date.defence: 2024-08-26
dc.contributor.department (en_US): Faculty of Computer Science
dc.contributor.degree (en_US): Master of Computer Science
dc.contributor.external-examiner (en_US): N/A
dc.contributor.thesis-reader (en_US): Dr. Hassan Sajjad
dc.contributor.thesis-reader (en_US): Dr. Samina Abidi
dc.contributor.thesis-supervisor (en_US): Dr. Syed Sibte Raza Abidi
dc.contributor.thesis-supervisor (en_US): Dr. William Van Woensel
dc.contributor.ethics-approval (en_US): Not Applicable
dc.contributor.manuscripts (en_US): Not Applicable
dc.contributor.copyright-release (en_US): Not Applicable