NORMA eResearch @NCI Library

RoBERTa-Based NLP System for Enhanced Disease Prediction from Symptom Descriptions

Anusury, Karthikeya (2024) RoBERTa-Based NLP System for Enhanced Disease Prediction from Symptom Descriptions. Masters thesis, Dublin, National College of Ireland.

PDF (Master of Science), 879kB
PDF (Configuration Manual), 798kB

Abstract

Because medical practice changes quickly, there is an urgent need to translate symptom descriptions rapidly and accurately into usable medical interpretations that support timely diagnosis. In this context, this work designs and compares state-of-the-art deep learning natural language processing models (BERT and RoBERTa) and machine learning models (XGBoost and Random Forest) for disease prediction from symptom descriptions. The study uses the SymptomsDisease246k dataset, retaining only diseases with at least 1000 samples for reliability and statistical significance. All models were evaluated on their handling of medical terminology, computational efficiency, and predictive accuracy. The methodology comprises rigorous data preprocessing, model implementation, and an extensive analysis based on performance metrics: accuracy, F1 score, precision, and recall. The transformer-based models show a significant improvement in accuracy over the traditional machine learning models. BERT slightly outperformed RoBERTa during short-term training, achieving the best accuracy of 0.8712 and an F1 score of 0.8720. During longer training, RoBERTa was more stable, though BERT still had a slight edge. XGBoost proved a strong baseline that balances performance and computational efficiency, reaching an accuracy of 0.8578. Random Forest, while less accurate, was the fastest to train. The study demonstrates the role transformer-based models can play in medical text analysis, while also showing that traditional machine learning methods remain useful in certain contexts. The results suggest that choosing a model for healthcare use involves a trade-off between performance and efficiency. Moreover, in a clinical environment, model predictions need validation by human experts. The study contributes to the development of AI-assisted clinical decision-making by offering broader insight into the strengths and limitations of different modeling approaches in medical text analysis.
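
The class-frequency filter and the evaluation metrics named in the abstract can be illustrated with a small sketch. It is not the thesis code: the function names, the `(text, label)` record layout, and the use of macro averaging are assumptions made for illustration, since the abstract does not state how the per-class scores were averaged.

```python
from collections import Counter


def filter_by_min_samples(records, min_samples=1000):
    """Keep only (text, label) pairs whose label occurs at least min_samples times.

    The 1000-sample threshold comes from the abstract; the record layout is assumed.
    """
    counts = Counter(label for _, label in records)
    return [(text, label) for text, label in records if counts[label] >= min_samples]


def macro_scores(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1.

    Macro averaging is an assumption; the abstract does not specify the averaging.
    """
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

With a filter of this kind, rare diseases are dropped before training, so every remaining class has enough examples for per-class precision and recall to be meaningful.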

Item Type: Thesis (Masters)
Supervisors: Yaqoob, Abid
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
P Language and Literature > P Philology. Linguistics > Computational linguistics. Natural language processing
R Medicine > Healthcare Industry
Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Machine learning
Divisions: School of Computing > Master of Science in Data Analytics
Depositing User: Ciara O'Brien
Date Deposited: 07 Aug 2025 08:52
Last Modified: 07 Aug 2025 08:52
URI: https://norma.ncirl.ie/id/eprint/8456
