NORMA eResearch @NCI Library

Assessing the Efficacy of Synthetic Data for Enhancing Machine Translation Models in Low Resource Domains

Yadav, Shweta (2023) Assessing the Efficacy of Synthetic Data for Enhancing Machine Translation Models in Low Resource Domains. In: Big Data and Artificial Intelligence. Lecture Notes in Computer Science, 14418 . Springer, Cham. ISBN 978-3-031-49601-1

Full text not available from this repository.
Official URL:


An artificially generated dataset mimics real-world data in terms of its statistical properties, but it contains no real information. Data on rare events such as the Covid-19 pandemic are difficult to capture in real-world data because such events are infrequent. Additionally, the cost and time required to gather real-world data are a major challenge. In such cases, synthetic data can help create more balanced datasets for model training. This project investigates the effectiveness of using synthetic data for tuning machine translation models when training data is limited. The Covid-19 domain is chosen because of the urgency and importance of making pandemic-related information globally accessible. TICO-19, a publicly available dataset, was created to meet this need. Medical terms were extracted and passed to the OpenAI API to generate language-pair training data. The fine-tuned davinci model is then evaluated on the blind test data provided with TICO-19 for English-to-French translation. Translation quality is measured with SacreBLEU: the fine-tuned model achieves a BLEU score of 19.54, considerably higher than the base model's 0.44. The adapted model's score is also comparable to that of the next-generation version of davinci, which scores 22.29.
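
The BLEU figures in the abstract can be easier to interpret with the metric written out. The following is a minimal sketch of corpus-level BLEU, not the SacreBLEU implementation used in the study: it assumes one reference per hypothesis, plain whitespace tokenization, and no smoothing, and the function names `ngrams` and `corpus_bleu` are illustrative only. Because SacreBLEU applies its own tokenizer, scores from this sketch are not comparable to the published ones.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    Assumptions (not taken from the paper): one reference per hypothesis,
    whitespace tokenization, no smoothing.
    """
    matched = [0] * max_n  # clipped n-gram matches, per n-gram order
    total = [0] * max_n    # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each count by its reference count.
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize output shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    # Hypothetical English-to-French outputs, for illustration only.
    hyps = ["le virus se propage vite"]
    refs = ["le virus se propage rapidement"]
    print(round(corpus_bleu(hyps, refs), 2))
```

A score near 0, such as the base model's 0.44, means that almost no multi-word sequences in the output match the reference translations, while an output identical to its reference scores 100 under this sketch.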

Item Type: Book Section
Uncontrolled Keywords: OpenAI; davinci; TICO-19; low resource domain; machine translation; Covid-19
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
Q Science > QH Natural history > QH301 Biology > Methods of research. Technique. Experimental biology > Data processing. Bioinformatics > Artificial intelligence
Q Science > Q Science (General) > Self-organizing systems. Conscious automata > Artificial intelligence
R Medicine > RA Public aspects of medicine > Public Health System
Divisions: School of Computing > Staff Research and Publications
Depositing User: Tamara Malone
Date Deposited: 16 Jan 2024 17:15
Last Modified: 16 Jan 2024 17:18
