Shakeel, Muhammad Hassan (2024) Evaluation of Large Language Models on MedQUAD Dataset. Masters thesis, Dublin, National College of Ireland.
PDF (Master of Science) | Download (1MB)
PDF (Configuration Manual) | Download (980kB)
Abstract
In recent times, small LLMs fine-tuned for domain-specific tasks have outperformed larger models such as GPT-2. This paper fine-tunes three small LLMs, Gemma-2 (2 billion parameters), Phi-2 (2.7 billion) and Llama-2 (7 billion), for question answering on the MedQUAD dataset. Among the three, Llama-2 outperformed the others with ROUGE-1 = 0.455, ROUGE-2 = 0.289, ROUGE-L = 0.373 and BLEU = 0.275. In human evaluation on the dimensions of informativeness, relevance, grammaticality, naturalness and sentiment, the three models performed similarly, though Llama-2 again led with an average score of 7.492. This paper observed a correlation between parameter count and performance: among the models tested, the larger model performed better than the smaller ones.
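The abstract reports ROUGE and BLEU, which are n-gram overlap metrics between a model's answer and a reference answer. As an illustration of what ROUGE-1 measures, here is a minimal sketch of ROUGE-1 F1 (unigram overlap); the thesis itself would typically use a standard library such as `rouge_score` or Hugging Face's `evaluate`, so this function is only a from-scratch restatement of the metric's definition, not the evaluation code used in the study.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each unigram counts at most as often as in the reference.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Example: a candidate covering half of the reference's unigrams
# gets precision 1.0 and recall 0.5, hence F1 = 2/3.
score = rouge1_f1("the cat sat on the mat", "the cat sat")
```

ROUGE-2 and ROUGE-L follow the same pattern over bigrams and longest common subsequences respectively, while BLEU is precision-oriented with a brevity penalty.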
Item Type: | Thesis (Masters) |
---|---|
Supervisors: | Trinh, Anh Duong |
Subjects: | Q Science > QA Mathematics > Electronic computers. Computer science; T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science; P Language and Literature > P Philology. Linguistics > Computational linguistics. Natural language processing |
Divisions: | School of Computing > Master of Science in Artificial Intelligence |
Depositing User: | Ciara O'Brien |
Date Deposited: | 20 Jun 2025 10:37 |
Last Modified: | 20 Jun 2025 10:37 |
URI: | https://norma.ncirl.ie/id/eprint/7968 |