NORMA eResearch @NCI Library

Augmenting Training Data for Low-Resource Neural Machine Translation via Bilingual Word Embeddings and BERT Language Modelling

Ramesh, Akshai, Uhana, Haque Usuf, Parthasarathy, Venkatesh Balavadhani, Haque, Rejwanul and Way, Andy (2021) Augmenting Training Data for Low-Resource Neural Machine Translation via Bilingual Word Embeddings and BERT Language Modelling. In: 2021 International Joint Conference on Neural Networks (IJCNN), 18-22 July 2021, Shenzhen, China.

Full text not available from this repository.
Official URL: https://doi.org/10.1109/IJCNN52387.2021.9534211

Abstract

Neural machine translation (NMT) is often described as ‘data hungry’ as it typically requires large amounts of parallel data in order to build a good-quality machine translation (MT) system. However, most of the world's language pairs are low-resource or extremely low-resource. This situation becomes even worse if a specialised domain is taken into consideration for translation. In this paper, we present a novel data augmentation method which makes use of bilingual word embeddings (BWEs) learned from monolingual corpora and Bidirectional Encoder Representations from Transformers (BERT) language models (LMs). We augment a parallel training corpus by introducing new words (i.e. out-of-vocabulary (OOV) items) and increasing the presence of rare words on both sides of the original parallel training corpus. Our experiments on simulated low-resource German–English and French–English translation tasks show that the proposed data augmentation strategy can significantly improve state-of-the-art NMT systems and outperform the state-of-the-art data augmentation approach for low-resource NMT.

Item Type: Conference or Workshop Item (Paper)
Uncontrolled Keywords: Machine translation; Neural machine translation; Transformer; Language modelling
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
P Language and Literature > P Philology. Linguistics > Language Services
Divisions: School of Computing > Staff Research and Publications
Depositing User: Clara Chan
Date Deposited: 01 Oct 2021 15:16
Last Modified: 01 Oct 2021 15:16
URI: http://norma.ncirl.ie/id/eprint/5080
