NORMA eResearch @NCI Library

Assessing the Efficacy of Synthetic Data for Enhancing Machine Translation Models in Low Resource Domains

Yadav, Shweta (2023) Assessing the Efficacy of Synthetic Data for Enhancing Machine Translation Models in Low Resource Domains. Masters thesis, Dublin, National College of Ireland.

PDF (Master of Science), 1MB
PDF (Configuration manual), 1MB

Abstract

A synthetic (artificially generated) dataset mimics the statistical properties of real-world data but contains no real information. Data on rare events such as the Covid-19 pandemic is difficult to capture because such events occur infrequently. In addition, gathering real-world data is costly and time-consuming. In such cases, synthetic data can help create more balanced datasets for model training. This project investigates the effectiveness of using synthetic data to fine-tune machine translation models when training data is limited. The Covid-19 domain was chosen because of the urgency and importance of making pandemic-related information globally accessible. TICO-19, a publicly available dataset, was created to address this need. Medical terminology was extracted and passed to the OpenAI API to generate language-pair training data. The fine-tuned Davinci model was then evaluated on the blind test data provided with TICO-19 for English-to-French translation. Translation quality is measured with SacreBLEU: the fine-tuned model achieves a BLEU score of 19.54, significantly higher than the base model's 0.44. The adapted model's score is also comparable to that of the next-generation version of davinci, which achieves a BLEU score of 22.29.
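
For readers unfamiliar with the metric reported in the abstract, the following is a minimal sketch of corpus-level BLEU on the 0 to 100 scale. It is not the thesis's evaluation code and not SacreBLEU itself: it assumes plain whitespace tokenization, a single reference per sentence, and no smoothing, whereas SacreBLEU applies its own tokenizer and smoothing by default, so the numbers it produces will differ. The function names `ngrams` and `corpus_bleu` are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU: whitespace tokens, one reference, no smoothing."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


if __name__ == "__main__":
    # Toy English sentences, used only to exercise the function.
    hyps = ["the cat sat on the mat"]
    refs = ["the cat sat on a mat"]
    print(round(corpus_bleu(hyps, refs), 2))
```

Scores computed this way are only comparable with each other; comparing against published figures such as those above would require the same SacreBLEU configuration the thesis used, which the abstract does not state.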

Item Type: Thesis (Masters)
Supervisors: Nayak, Prashanth (Email: UNSPECIFIED)
Uncontrolled Keywords: OpenAI; davinci; TICO-19; low resource domain; machine translation; Covid19
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
Q Science > QH Natural history > QH301 Biology > Methods of research. Technique. Experimental biology > Data processing. Bioinformatics > Artificial intelligence
Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Artificial intelligence
R Medicine > Diseases > Outbreaks of disease > Epidemics > COVID-19 Pandemic, 2020-
Divisions: School of Computing > Master of Science in Data Analytics
Depositing User: Tamara Malone
Date Deposited: 09 Jan 2025 15:41
Last Modified: 09 Jan 2025 15:41
URI: https://norma.ncirl.ie/id/eprint/7292
