Gosavi, Parth Nandkishor (2024) Comparative Study of Transformer Models for Text Classification in Healthcare. Masters thesis, Dublin, National College of Ireland.
PDF (Master of Science): Download (1MB)
PDF (Configuration Manual): Download (2MB)
Abstract
The volume of textual data in healthcare research is growing exponentially, creating significant challenges in storage and processing cost. Text classification, a key step in text mining, offers a powerful means of classifying and organizing this data. The problem is especially acute for text-based healthcare data such as medical findings and abstracts of scientific literature, so better approaches to text classification remain an open challenge in the field. Many existing techniques rely on classical machine learning models and rule-based methods, which often struggle to scale because of data sparsity and the complexity of medical language.
Transformer-based models such as BERT (Bidirectional Encoder Representations from Transformers), RoBERTa (Robustly Optimized BERT Pretraining Approach), DistilBERT (a distilled version of BERT) and XLNet have shown that they can address these issues. However, large-scale application to healthcare still faces significant hurdles, chiefly the heavy demand for compute and inconsistent generalization on domain-specific datasets.
This project evaluates transformer-based models for large-scale multi-label text classification on a biomedical dataset (PubMed). Documents are labelled according to the hierarchical MeSH (Medical Subject Headings) ontology, which introduces hierarchical relations among labels, label sparsity, and the complicated linguistic structures characteristic of medical documents. We conduct a thorough performance analysis of the models, comparing accuracy, F1-scores and training time. The findings highlight the trade-offs between computational cost and performance, and offer practical guidance on the suitability of these models for healthcare applications. The work thereby contributes to natural language processing research in the healthcare space and has actionable implications for decision support systems, patient data analysis and healthcare informatics.
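The thesis record does not include code, but the setup described above corresponds to standard multi-label fine-tuning of a pretrained transformer. The following is a minimal sketch, assuming the Hugging Face Transformers library; the model checkpoint, number of labels, example abstracts and hyperparameters are illustrative placeholders, not the thesis's actual configuration.

```python
# Minimal sketch: multi-label fine-tuning of a transformer on biomedical abstracts.
# All data, label counts and hyperparameters below are hypothetical.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-uncased"  # swap for roberta-base, distilbert-base-uncased, xlnet-base-cased
NUM_LABELS = 14                   # hypothetical number of MeSH headings used as labels

texts = ["Effect of statins on cardiovascular outcomes ...",
         "Deep learning for radiology report triage ..."]
labels = [[1, 0, 1] + [0] * 11,   # multi-hot vectors, one entry per MeSH heading
          [0, 1, 0] + [0] * 11]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

class AbstractDataset(Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True,
                             max_length=512, return_tensors="pt")
        self.labels = torch.tensor(labels, dtype=torch.float)  # floats for BCE loss
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.enc.items()}
        item["labels"] = self.labels[idx]
        return item

# problem_type="multi_label_classification" makes the model apply
# BCEWithLogitsLoss (an independent sigmoid per label) instead of softmax.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS,
    problem_type="multi_label_classification")

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args,
                  train_dataset=AbstractDataset(texts, labels))
trainer.train()

# Inference: sigmoid each logit and threshold at 0.5 to pick predicted headings.
model.eval()
with torch.no_grad():
    logits = model(**tokenizer(["Randomized trial of insulin therapy ..."],
                               return_tensors="pt", truncation=True)).logits
    preds = (torch.sigmoid(logits) > 0.5).int()
```

Swapping the checkpoint name is enough to reproduce the same comparison across BERT, RoBERTa, DistilBERT and XLNet, since the multi-label head and loss are handled identically by the library.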
BERT achieved the highest overall F1 score (0.8403) and was the most accurate, but took longer to train than the other models. RoBERTa offered the best balance of precision and computational efficiency. DistilBERT was the fastest model, at the expense of some accuracy. XLNet was able to model long-text dependencies but was the most computationally expensive.
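For reference, a brief sketch of how the reported metrics are typically computed for multi-label outputs, using scikit-learn; the arrays below are illustrative, not the thesis's results.

```python
# Minimal sketch of multi-label evaluation; y_true / y_pred are hypothetical
# multi-hot matrices (rows = abstracts, columns = MeSH headings).
import numpy as np
from sklearn.metrics import f1_score, accuracy_score

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])

# Micro-F1 pools every label decision; macro-F1 averages per-label F1 and is
# therefore more sensitive to rare headings.
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
# Subset accuracy: an abstract counts as correct only if all labels match.
print("subset accuracy:", accuracy_score(y_true, y_pred))
```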
Item Type: Thesis (Masters)
Supervisors: Qayum, Abdul
Uncontrolled Keywords: Transformer models; BERT; RoBERTa; DistilBERT; XLNet; ClinicalBERT; BioBERT; SciBERT; natural language processing (NLP); text classification; named entity recognition (NER); document summarization; electronic health records (EHRs); Medical Subject Headings (MeSH); multi-label classification; hierarchical labels; computational complexity; long sequence processing; domain-specific adaptation; tradeoffs; evaluation metrics; accuracy; precision; recall; F1-score; AUROC; interpretability; medical knowledge graphs (KGs)
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science; T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science; R Medicine > R Medicine (General); P Language and Literature > P Philology. Linguistics > Computational linguistics. Natural language processing; R Medicine > Healthcare Industry
Divisions: School of Computing > Master of Science in Data Analytics
Depositing User: Ciara O'Brien
Date Deposited: 02 Sep 2025 11:58
Last Modified: 02 Sep 2025 11:58
URI: https://norma.ncirl.ie/id/eprint/8702