A rule-based lemmatizing approach for Sinhala language

Nandathilaka, M; Ahangama, S; Weerasuriya, GT

Date

2018

Authors

Nandathilaka, M

Ahangama, S

Weerasuriya, GT

Publisher

Information Technology Research Unit, Faculty of Information Technology, University of Moratuwa, Sri Lanka

Abstract

Research in speech recognition, natural language processing, language translation and deep learning is bridging the communication gap between humans, as well as between humans and machines. Sinhala is a native language of Sri Lanka, used by approximately 19 million people. Natural language processing tools for Sinhala have developed more slowly than those for European and other Asian languages. A lemmatizer for Sinhala can be used for morphological analysis and is an essential module in Sinhala language processing mechanisms. Lemmatizing is a complex process in morphological analysis in which the base (root) form of a word is derived. Little published work focuses on lemmatizing approaches for Sinhala. This paper presents a rule-based lemmatizing approach that can be used to determine the base form of Sinhala words with an accuracy of 77.3%. It differs from similar work in that the data used in the research were extracted from social media.
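
To make the idea of rule-based lemmatizing concrete, the following is a minimal sketch of a suffix-rewriting lemmatizer together with the kind of accuracy measure the abstract reports. It is not the authors' method: the abstract does not list the Sinhala rules, so the rule table TOY_RULES, the function names lemmatize and accuracy, the min_stem parameter and the English toy words are all hypothetical and serve only to show the mechanism.

```python
from typing import List, Tuple

# A rule maps a word-final suffix to its replacement.
Rule = Tuple[str, str]

# Hypothetical toy rules for illustration only; these are NOT the
# Sinhala rules used in the paper, which the abstract does not give.
TOY_RULES: List[Rule] = [
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def lemmatize(word: str, rules: List[Rule], min_stem: int = 2) -> str:
    """Apply the first matching suffix rule, trying longer suffixes first."""
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        stem = word[: len(word) - len(suffix)]
        # Skip rules that would leave an implausibly short stem.
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem + replacement
    # No rule applies: treat the word as already being in base form.
    return word


def accuracy(pairs: List[Tuple[str, str]], rules: List[Rule]) -> float:
    """Fraction of (word, gold lemma) pairs that the rules lemmatize correctly."""
    correct = sum(1 for word, gold in pairs if lemmatize(word, rules) == gold)
    return correct / len(pairs)


# Hypothetical evaluation data: (inflected word, expected base form).
TOY_GOLD = [
    ("studies", "study"),
    ("walked", "walk"),
    ("running", "run"),
    ("cats", "cat"),
]
```

On this toy data the rules produce "runn" for "running", so three of the four pairs are correct. A simple rule set of this kind has no way to handle such irregular cases unless further rules or exception lists are added.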

Keywords

Sinhala Morphology, Lemmatization, Inflection, Rule-based, Social media data

Citation

M. Nandathilaka, S. Ahangama and G. T. Weerasuriya, "A Rule-based Lemmatizing Approach for Sinhala Language," 2018 3rd International Conference on Information Technology Research (ICITR), 2018, pp. 1-5, doi: 10.1109/ICITR.2018.8736134.