Sinhala-Tamil statistical machine translation (SMT) for official documents

dc.contributor.advisor: Ranathunga, S
dc.contributor.advisor: Jayasena, S
dc.contributor.author: Farhath, FF
dc.date.accept: 2018
dc.date.accessioned: 2018
dc.date.available: 2018
dc.date.issued: 2018
dc.description.abstract: Sinhala and Tamil are the declared official languages of Sri Lanka. This requires every government-related communication to be issued in both languages. Although the demand for translation is high, the number of available human translators is limited. One feasible way to boost productivity is to assist human translators with machine translation output, which they post-edit rather than translating from scratch. However, the Sinhala-Tamil pair does not have a well-performing machine translation system. Therefore, the focus of this research is to develop a machine translation system for short official government documents. This thesis presents two main contributions towards building ‘Si-Ta’, the first domain-adapted machine translation system for Sinhala-Tamil. The first contribution is building the baseline translation system. The second is implementing data pre-processing techniques to improve the translation quality of the baseline system. The baseline system was built using Moses, a phrase-based statistical translation system; this was the feasible option given the available resources. To improve translation quality, three main approaches were explored: (a) domain adaptation, (b) integration of terminology, dictionary, and name lists, and (c) addressing the out-of-vocabulary (OOV) problem using word-embedding-based paraphrasing. To adapt the system to the domain of official government documents, different language model design techniques and a data filtration technique were tested. Under terminology integration, experiments evaluated the effect of incorporating bilingual terminology lists into the system. Moreover, a novel data augmentation technique was tested that generates parallel data from bilingual lists and the available parallel data. Further, open-domain dictionary entries, as well as a list of person names and addresses, were integrated and evaluated. In addition, word-embedding-based paraphrasing was used along with a novel heuristic-based filtering to address the out-of-vocabulary issue. All of the above approaches improved over the baseline, apart from the data filtering technique. Even so, all of these scores were above those of the already available machine translation systems for this language pair. Although our techniques were evaluated only on the Sinhala-Tamil pair, they could be applied to other low-resource, highly inflectional language pairs. [en_US]
dc.identifier.accno: TH3871 [en_US]
dc.identifier.degree: Master of Philosophy [en_US]
dc.identifier.department: Department of Computer Science & Engineering [en_US]
dc.identifier.faculty: Engineering [en_US]
dc.identifier.uri: http://dl.lib.mrt.ac.lk/handle/123/15814
dc.language.iso: en [en_US]
dc.subject: COMPUTER SCIENCE AND ENGINEERING-Dissertations [en_US]
dc.subject: MACHINE TRANSLATION SYSTEMS [en_US]
dc.subject: SINHALA LANGUAGE-Translation [en_US]
dc.subject: TAMIL LANGUAGE-Translation [en_US]
dc.subject: STATISTICAL MACHINE TRANSLATION [en_US]
dc.title: Sinhala-Tamil statistical machine translation (SMT) for official documents [en_US]
dc.type: Thesis-Full-text [en_US]

Files

Original bundle

Name: TH3871-1.pdf
Size: 92.38 KB
Format: Adobe Portable Document Format
Description: Pre-text

Name: TH3871-2.pdf
Size: 70.48 KB
Format: Adobe Portable Document Format
Description: Post-text

Name: TH3871.pdf
Size: 1.58 MB
Format: Adobe Portable Document Format
Description: Full-thesis

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission