Pre-training and fine-tuning multilingual sequence-to-sequence models for domain-specific low-resource neural machine translation


Date

2022

Abstract

Limited parallel data is a major bottleneck for morphologically rich Low-Resource Languages (LRLs), resulting in Neural Machine Translation (NMT) systems of poor quality. Language representation learning in a self-supervised sequence-to-sequence fashion has become a new paradigm that utilizes the largely available monolingual data and alleviates the parallel data scarcity issue in NMT. The language pairs supported by a Self-supervised Multilingual Sequence-to-sequence Pre-trained (SMSP) model can be fine-tuned on top of this pre-trained model with a small amount of parallel data. This study shows the viability of fine-tuning such SMSP models for an extremely low-resource, domain-specific NMT setting. We choose one such pre-trained model: mBART. We are the first to implement and demonstrate the viability of non-English-centric complete fine-tuning of SMSP models. To demonstrate this, we select the Sinhala, Tamil and English languages in an extremely low-resource setting in the domain of official government documents. This research explores ways to extend SMSP models to new domains and to improve the fine-tuning process of SMSP models so as to obtain high-quality translations in an extremely low-resource setting. We propose two novel approaches: (1) continual pre-training of the SMSP model in a self-supervised manner with domain-specific monolingual data to incorporate new domains, and (2) multistage fine-tuning of the SMSP model with in-domain and out-of-domain parallel data. Our experiments with Sinhala (Si), Tamil (Ta) and English (En) show that directly fine-tuning (single-step) the SMSP model mBART for LRLs significantly outperforms state-of-the-art Transformer-based NMT models across all six bilingual translation directions. We gain +7.17 BLEU on Si→En translation and +6.74 BLEU for the Ta→En direction. Most importantly, for non-English-centric Si-Ta fine-tuning, we surpass the state-of-the-art Transformer-based NMT model with gains of +4.11 BLEU on Ta→Si and +2.78 BLEU on Si→Ta. Moreover, our proposed approaches further improve performance by around +1 BLEU over the strong single-step direct mBART fine-tuning in all six directions. Finally, we propose a multi-model ensemble that improves performance in all cases and yields the overall best model, with a +2 BLEU improvement.
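
To make the setup concrete, the sketch below illustrates single-step fine-tuning of a multilingual sequence-to-sequence pre-trained model on a small parallel corpus using the Hugging Face Transformers library. It is an illustration only, not the training recipe reported in the thesis: the choice of the mBART-50 checkpoint, the si_LK/en_XX language codes, the file names and all hyperparameters are assumptions made for the example.

# Minimal sketch: single-step fine-tuning of mBART-50 on a small Si->En
# parallel corpus with Hugging Face Transformers. Checkpoint, language codes,
# data files and hyperparameters are illustrative assumptions, not the
# thesis' exact configuration.
from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "facebook/mbart-large-50"
tokenizer = MBart50TokenizerFast.from_pretrained(
    checkpoint, src_lang="si_LK", tgt_lang="en_XX"
)
model = MBartForConditionalGeneration.from_pretrained(checkpoint)

# Hypothetical JSON-lines files with one {"si": ..., "en": ...} record per line.
raw = load_dataset(
    "json", data_files={"train": "train.jsonl", "validation": "dev.jsonl"}
)

def preprocess(batch):
    # Encode the Sinhala source and the English target side.
    inputs = tokenizer(batch["si"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["en"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = raw.map(
    preprocess, batched=True, remove_columns=raw["train"].column_names
)

args = Seq2SeqTrainingArguments(
    output_dir="mbart50-si-en",  # illustrative output directory
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=5,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

The continual pre-training and multistage fine-tuning approaches described above would reuse the same machinery: the former replaces the parallel corpus with domain-specific monolingual data and a self-supervised denoising objective, while the latter chains out-of-domain and in-domain fine-tuning runs.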

Keywords

PRE-TRAINING, FINE-TUNING, LOW-RESOURCE LANGUAGES, MBART, PRE-TRAINED LANGUAGE MODELS, NEURAL MACHINE TRANSLATION, INFORMATION TECHNOLOGY - Dissertation, COMPUTER SCIENCE - Dissertation, COMPUTER SCIENCE & ENGINEERING - Dissertation

Citation

Thillainathan, S. (2022). Pre-training and fine-tuning multilingual sequence-to-sequence models for domain-specific low-resource neural machine translation [Master's thesis, University of Moratuwa]. Institutional Repository University of Moratuwa. http://dl.lib.uom.lk/handle/123/21664
