Anusury, Karthikeya (2024) RoBERTa-Based NLP System for Enhanced Disease Prediction from Symptom Descriptions. Masters thesis, Dublin, National College of Ireland.
Abstract
Because medical treatment changes quickly, there is an urgent need to translate symptom descriptions rapidly and precisely into usable medical interpretations that support timely diagnosis. In this context, this work designs and compares state-of-the-art deep learning natural language processing models (BERT and RoBERTa) and machine learning models (XGBoost and Random Forest) for disease prediction from symptom descriptions. The study used the SymptomsDisease246k dataset, and only diseases with at least 1000 samples were selected for reliability and statistical significance. All models were evaluated on their handling of medical terminology, computational efficiency, and predictive accuracy. The methodology involves rigorous data preprocessing, model implementation, and an extensive analysis based on performance metrics such as accuracy, F1 score, precision, and recall. The transformer-based models show a significant improvement in accuracy over traditional machine learning. BERT slightly outperformed RoBERTa during short-term training, achieving the best accuracy of 0.8712 and an F1 score of 0.8720. During longer training, RoBERTa was more stable, though BERT still had a slight edge. XGBoost proved a strong baseline that balances performance and computational efficiency, yielding an accuracy of 0.8578. Random Forest, while less accurate, was the fastest to train. The study demonstrates the role that transformer-based models can play in medical text analysis, while also showing that traditional machine learning methods can be helpful in certain contexts. The results suggest that choosing a model for healthcare involves a trade-off between performance and efficiency. Moreover, in a clinical environment, model predictions need validation by human experts.
This study further contributes to the development of AI-assisted clinical decision-making by offering broader insight into the strengths and limitations of various modeling approaches in medical text analysis.
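The class-selection rule and the evaluation metrics named in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the function names `filter_by_min_samples` and `classification_metrics` are hypothetical, and macro averaging of precision, recall, and F1 is an assumption, since the abstract does not say which averaging was used.

```python
from collections import Counter


def filter_by_min_samples(texts, labels, min_samples=1000):
    """Keep only examples whose label occurs at least `min_samples` times."""
    counts = Counter(labels)
    kept = [(t, y) for t, y in zip(texts, labels) if counts[y] >= min_samples]
    return [t for t, _ in kept], [y for _, y in kept]


def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1.

    Macro averaging is an assumption of this sketch, not a detail
    confirmed by the abstract.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []

    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(classes)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

In this sketch the filter would run once on the full dataset before any train/test split, and the same metrics function would be applied to the predictions of each model so that the scores are directly comparable.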
Item Type: | Thesis (Masters) |
---|---|
Supervisors: | Yaqoob, Abid (email unspecified) |
Subjects: | Q Science > QA Mathematics > Electronic computers. Computer science; T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science; P Language and Literature > P Philology. Linguistics > Computational linguistics. Natural language processing; R Medicine > Healthcare Industry; Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Machine learning |
Divisions: | School of Computing > Master of Science in Data Analytics |
Depositing User: | Ciara O'Brien |
Date Deposited: | 07 Aug 2025 08:52 |
Last Modified: | 07 Aug 2025 08:52 |
URI: | https://norma.ncirl.ie/id/eprint/8456 |