SNLP - 2020
Permanent URI for this collection: http://192.248.9.226/handle/123/23136
Recent Submissions
- item: Conference-Abstract Pre-Text (National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Thayasivam, U; Rathnayaka, C.
- item: Conference-Abstract Issues in employing metaphorically equivalent phrases according to cognitive translation hypothesis in translating Sinhala text into English to train an AI translation software (National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Herath, L; Ranaweera, B; Thayasivam, U; Rathnayaka, C. This paper explores the issues in translating Sinhala metaphorical and cultural phrases into English that arise from the indirectness of their meanings, especially when translating material to train AI software. Current research shows that the Cognitive Translation Hypothesis (hereafter CTH) is one of the most successful solutions to this issue; it suggests that using metaphorically equivalent phrases in translation conveys the meaning effectively. However, one problem with metaphorically equivalent phrases is that the literal meaning of such a phrase in the source language can be entirely different from that of its counterpart in the target language, even though the metaphorical meanings are equivalent. Thus, translating material from Sinhala to English using CTH, especially to train AI software to translate a text feed, can be unsuccessful, because the software can confuse the literal and the metaphorical meanings that occur in the text feed, in both directions.
- item: Conference-Abstract Automatic Sinhala news text summarizer (National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Tennakoon, S; Thayasivam, U; Rathnayaka, C. With the present explosion of news circulating in the digital space, most of it unstructured textual data, there is a need to absorb the content of news easily and effectively. Although there are many Sinhala news sites, none offers recommendations, despite the popularity of recommender systems today. It would therefore be effective to present news in a summarized form that also matches user preferences. Our research aims to fill this gap by providing a centralized news platform that recommends news to its users clearly and concisely. The news articles are collected using web scraping and, after categorization, presented in summarized form. We also expect to detect grey sheep users and to provide separate recommendations to them in order to minimize recommendation errors. Here, grey sheep users are the user group with special tastes who may neither agree nor disagree with the majority of users. By implementing the proposed system, we hope to meet these requirements and build a user-friendly Sinhala news platform. Manually creating a summary can be time consuming and tedious. The main idea behind automatic text summarization is to identify the most significant information in the given content, reduce the text to fewer sentences without losing the fundamental ideas of the original, and present the result to end readers. A summarizer specific to the Sinhala language is a major requirement for such an application, because no available Sinhala news platform presents summarized and categorized news text to its users.
The objective is to produce a brief and accurate outline of voluminous news text, focusing on the key ideas that convey useful information without losing the overall meaning. The research aims to build the summarizer using the PyTeaser algorithm. Although PyTeaser cannot be used directly for the Sinhala language, language-specific modifications make it usable for Sinhala. PyTeaser assigns a total score to each sentence based on four features: Title Score, Keyword Frequency, Sentence Length and Sentence Position. The total score is computed by weighting these features, and the weights are constants. The sentences with the highest scores are then selected to produce the summary. The quality of the summary is evaluated using the F-measure against human-generated summaries produced by Sinhala experts. The research focuses on computing the F-measure for all possible weight combinations formed from the original PyTeaser weights and choosing the optimized weight vector that gives the best-quality news summary. The summaries for the proposed application are generated using the derived weight set, and the summarized news items are then expected to be recommended to end users via the recommender system, according to user preferences.
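The scoring scheme described in this abstract can be sketched as a weighted sum over the four named features. The following is a minimal illustration only, not PyTeaser's actual code: the feature formulas, the `WEIGHTS` values, the `IDEAL_LENGTH` constant and all function names are assumptions made for the example, and the whitespace tokenizer stands in for a proper Sinhala tokenizer.

```python
from collections import Counter

# Hypothetical weights; the abstract only states that the weights are constants.
WEIGHTS = {"title": 1.5, "keyword": 2.0, "length": 1.0, "position": 1.0}
IDEAL_LENGTH = 20  # assumed ideal sentence length, in words


def tokenize(text):
    # Naive whitespace tokenizer (assumption); strips common punctuation.
    words = (w.strip(".,!?;:\"'()").lower() for w in text.split())
    return [w for w in words if w]


def score_sentences(title, sentences, weights=WEIGHTS):
    """Return one weighted total score per sentence."""
    title_words = set(tokenize(title))
    freq = Counter(w for s in sentences for w in tokenize(s))
    total_words = sum(freq.values()) or 1
    scores = []
    for i, sentence in enumerate(sentences):
        words = tokenize(sentence)
        if not words:
            scores.append(0.0)
            continue
        # Title Score: share of title words that appear in the sentence.
        title_score = len(title_words & set(words)) / (len(title_words) or 1)
        # Keyword Frequency: mean relative document frequency of the words.
        keyword_score = sum(freq[w] / total_words for w in words) / len(words)
        # Sentence Length: closeness to the assumed ideal length.
        length_score = 1 - min(abs(IDEAL_LENGTH - len(words)) / IDEAL_LENGTH, 1)
        # Sentence Position: earlier sentences score higher (assumption).
        position_score = 1 - i / len(sentences)
        scores.append(
            weights["title"] * title_score
            + weights["keyword"] * keyword_score
            + weights["length"] * length_score
            + weights["position"] * position_score
        )
    return scores


def summarize(title, sentences, k=3, weights=WEIGHTS):
    """Pick the k highest-scoring sentences, kept in document order."""
    scores = score_sentences(title, sentences, weights)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Searching for a better weight vector, as the abstract proposes, would amount to calling `summarize` with different `weights` dictionaries and comparing each output with the expert summaries using the F-measure.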
- item: Conference-Abstract Named entity boundary detection for religious unhealthy statements in social media (National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Yapa, P; Thayasivam, U; Rathnayaka, C. Named entity recognition (NER) is one of the fundamental tasks in natural language processing. Type and boundary are the two components of a named entity (NE) in text. Various studies have treated the detection of NE type and NE boundary as two separate tasks, and there is no existing mechanism that considers both tasks together. NER for social media analysis, covering expressions, disputes and similar content, is also in high demand. Furthermore, most existing mechanisms have been developed with English as the test language, and there is strong demand for a system that supports multiple languages, including Sinhala. The intention, therefore, is to implement a system for NE boundary detection in the Sinhala language, considering religious unhealthy statements in social media. Detecting NE boundary and NE type together as an aggregate mechanism will improve the accuracy and performance of NE linking to knowledge bases. The approach involves identifying the existing mechanisms of NE type detection and NE boundary detection; identifying the complexity indexes, matrices and relationships of religious unhealthy statements in Sinhala; identifying the novelty in implementing NE boundary detection that considers the NE types of religious unhealthy statements; and finally enhancing NE linking through NE boundary detection. The ultimate target is to implement a prototype that outperforms the existing state-of-the-art baselines.
- item: Conference-Abstract Old Sinhala newspaper reader for people with visual impairment (National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Perera, V.; Gayashan, P.P.A; Nirmani, S; Thayasivam, U; Rathnayaka, C. In Sri Lanka, blind or visually impaired people refer to non-current documents such as old Sinhala newspapers for their studies, research and other academic purposes. The Department of National Archives in Sri Lanka provides repositories of non-current records to people who need to dig deep into past newspapers. However, it is extremely difficult for visually impaired people to acquire information from those archived newspapers without guidance. In recent years, the Sinhala Screen reader was implemented for the 'Ranawiru Sevana' rehabilitation facility at Ragama to help retired blind soldiers access documents and web sources and to ease their day-to-day computer-related work; it supports applications including web browsers, email clients, internet chat programs and office suites. However, Sri Lanka still has no proper offline application that assists blind or visually impaired people in reading physical printed newspapers published before the launch of e-newspapers. Hence, the proposed solution is a real-time offline old Sinhala newspaper reader with smart assistant support that delivers the preferred news articles to the user through predefined question sets. The system consists of tracking and segmenting newspaper articles from scanned newspaper pages, segmenting the elements in each article, character-by-character segmentation, character recognition, feature extraction, and the use of a classification model. In addition, a highly accurate word correction module will be used to ensure accurate recognition of Sinhala characters and to create the corresponding corrected words.
Word and nonword errors can occur because words are incomplete or incorrectly classified, owing to the variation of characters and the poor visual quality of the deteriorated old newspaper pages. At the final stage, the preferred news content will be provided to the user as audible clips by using speech to text and text to speech tools.
- item: Conference-Abstract An improved kNN algorithm using k-means and Fast text to predict sentiments expressed in Tamil texts (National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Thavareesan, S; Mahesan, S; Thayasivam, U; Rathnayaka, C. With the intention of developing a suitable approach to Sentiment Analysis on Tamil texts using K-means clustering with a k-Nearest Neighbour (k-NN) classifier, a corpus, UJ_Corpus_Opinions, consisting of 1518 positive and 1173 negative comments has been constructed. For training, 820 positive and 820 negative comments are taken, and for testing, 650 and 350 respectively. Bag of Words (BoW) and fastText vectors are used to create feature vectors. These feature vectors are clustered using K-means clustering. The cluster centroids are used as classification keys for the k-NN classifier. Two types of clustering techniques are utilised to develop two models: (i) using class-wise information, (ii) with no class-wise information. These two models are tested using K-Fold. All these four models are tested with the two types of feature vectors. The models are tested with varying numbers of centroids (Kc: 1..10), neighbours (Kn: 1..Kc) and folds (Kf: 1..10) to study their influence on accuracy. The accuracy increases with the value of Kc, and the highest accuracy (74%) is obtained for Kn=1 and Kf=2. Accuracy is, in general, found to be higher with fastText than with BoW. The model with fastText and class-wise clustering with K-Fold that obtained 74% accuracy has an F1-score of 0.74.
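The class-wise variant described in this abstract, where K-means centroids serve as the classification keys for k-NN, can be sketched roughly as follows. This is an illustrative sketch using only the Python standard library, not the authors' implementation: the function names, the deterministic centroid initialisation and the use of Euclidean distance are assumptions, and real inputs would be BoW or fastText vectors rather than toy points.

```python
import math
from collections import Counter


def kmeans(points, k, iters=20):
    """Plain K-means; initialised from the first k points (an assumption
    made so that the example is deterministic)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def fit_classwise(vectors, labels, kc):
    """Cluster each class separately; return (centroid, label) keys."""
    keys = []
    for label in sorted(set(labels)):
        members = [v for v, l in zip(vectors, labels) if l == label]
        for centroid in kmeans(members, min(kc, len(members))):
            keys.append((centroid, label))
    return keys


def predict(keys, vector, kn=1):
    """k-NN over the centroids: majority label among the kn nearest keys."""
    nearest = sorted(keys, key=lambda key: math.dist(vector, key[0]))[:kn]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Here `kc` and `kn` play the roles of Kc and Kn in the abstract. Because a query is compared only with the centroids rather than with every training vector, prediction cost depends on the number of centroids and not on the corpus size.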
- item: Conference-Abstract Smart mail sorting system for Sinhala handwritten addresses (National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Fernando, W.D.R.; Bhashani, S.D.P.; Ranasingha, L.P.V.H.; Thayasivam, U; Rathnayaka, C. In Sri Lanka, post offices play the most significant role in delivering letters, bills, bank statements, parcels, and much more. According to the 2018 performance report of the postal department, 258,866 letters were sorted and delivered through the Central Mail Exchange per day in Sri Lanka [1]. This mailing process could be made more effective and systematic by reducing the time taken by the manual process and the human error rate. Many systems for automating mail sorting exist in other countries. For example, Amazon deploys a robot that receives a package and deposits it in the correct location in the center according to the zip code, reducing missorts [2]. In Sri Lanka, however, there is no proper system to automate the mail sorting process, because the Sinhala alphabet consists of symbols that are complex and vary in shape and dimensions. Identifying each letter or modifier in a Sinhala text image is a challenge due to features such as overlapping or touching characters, cursive or non-cursive characters, and variation in the shape or dimensions of characters from person to person. This research proposes an automated system that takes an envelope image and classifies the envelope into the relevant postal division using the postal code. The proposed methodology has two main phases: identifying the relevant postal division, and digitizing the mail process. The first phase consists of three sub-processes: preprocessing and identifying elements; segmentation and character recognition; and error correction and identification of the relevant postal code.
In the second phase, the proposed system keeps a record of mail details that are passed by the system.
- item: Conference-Abstract Assessing the influence of structural ambiguity in technical translation with special reference to administrative documents in Sri Lanka (2020) Sewwand, N.; Amarasinghe, H.; Thayasivam, U.; Rathnayaka, C. Ambiguity is a crucial problem that translators often encounter when translating documents from the Source Language (SL) to the Target Language (TL). Each language comprises a system of linguistic elements (phonology, morphology, syntax and semantics) arranged in a proper sequence so that users derive a clear meaning. However, when these linguistic elements are improperly arranged, the intended message can be misinterpreted in several ways. In the context of structural ambiguity, Tuggy (1993) defines it as a “result of two or more syntactic structures that can be attributed to one string of word or sentence.” This research aimed to assess the influence of structural ambiguity in technical translation, with special reference to administrative documents in Sri Lanka. The study followed a qualitative approach, observing the English versions of administrative documents including gazettes, circulars, and amendments. The results indicated that ambiguity occurred in 2% of the selected text; 80% of these cases resulted from modification, while the rest took the form of co-ordination.
- item: Conference-Abstract Effectiveness of grammatically correct translation from English into Sinhalese: with special reference to the annual report 2019/20 of Sri Lankan airlines (National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Gunathilaka, D. D. I. M. B.; Thayasivam, U; Rathnayaka, C. Translation is, without doubt, a globally indispensable need in bringing together people who speak diverse vernaculars. To examine the effectiveness of grammatically correct translation from English into Sinhalese, the researcher conducted a case study based on the Annual Report (AR) 2019/20 of Sri Lankan Airlines. Data collected from the AR were analyzed together with secondary data to investigate how correctly the meaning was carried into Sinhalese, with special attention to Spellings, Subject-Verb Concord, Word Formation, Word Division, and Punctuations. The study found that although the fundamental meaning of the AR has been communicated in Sinhalese, there are a number of grammatically ambiguous words and phrases caused by the misuse of Spellings, Word Formation, Word Division, and Punctuations, which led to misinterpretations. No serious error related to Subject-Verb Concord was detected. Finally, the analyzed data clearly showed that grammar matters for an accurate translation.
- item: Conference-Abstract Lexical uncertainties confronted by government translators when translating legal documents from Sinhala to English (National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Hansani, J.A.M; Thayasivam, U; Rathnayaka, C. The objective of the present study is to identify the lexical uncertainties confronted by government translators who have less than one year of experience in their profession. A mixed methodology was employed to gather the primary data. An error analysis was carried out after 25 selected translators were asked to translate different legal documents. It was found that ambiguity, vagueness, lack of knowledge of context, vagueness and legal terminology were the main causes of unacceptable or less acceptable translations. Of those five lexical uncertainties, the researcher found legal terminology to be the one most frequently confronted by translators when translating legal documents. To collect more data, a questionnaire was also distributed to the sample population. To address these issues, the research recommends using and maintaining legal glossaries, meeting law experts, gathering legal knowledge, reading extensively, and getting help from online sources and books.
- item: Conference-Abstract Polysemy in technical translations (National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Senevirathne, S. A. T. A.; Thayasivam, U; Rathnayaka, C. Translators may find Polysemy, the coexistence of several meanings for a single word, a challenge, since it complicates their task. The situation gets worse in a technical translation, where the message should be rendered exactly as it is in the Source Text, unlike a literary translation, which gives the translator more freedom. This paper intends to examine how Polysemy can affect a technical translation when the translator is unable to apply the most appropriate term for the original text. It also aims to explore how a translator can choose the relevant words for the Target Text when doing a technical translation. Accordingly, a few technical documents and their translations will be analyzed to discover the situations where Polysemy has appeared and whether the translator has successfully tackled the polysemic situation. Thus, this paper will shed some light on how to handle Polysemy in technical translations.
- item: Conference-Abstract A Study of the language practices of the government service: with reference to the examination for official languages proficiency in Sri Lanka (National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Manthrirathna, P. S.; Thayasivam, U; Rathnayaka, C. As per Public Administration Circular 1/2014 issued by the Ministry of Public Administration and Management, all government sector employees in Sri Lanka should have proficiency in the official languages. Accordingly, Tamil-speaking employees and Sinhala-speaking employees should pass the Sinhala Proficiency Test and the Tamil Proficiency Test respectively. The aims and objectives of the Official Languages Proficiency Examination mainly address effective communication and translation, so that official work is done successfully and the needs of the country are met. The final examination allocates 70% of the marks to the written paper and 30% to the oral test. Government sector employees who possess this qualification are also paid a special allowance. However, even though employees pass this examination, it is questionable whether its aims are actually accomplished and whether it caters to the actual language needs of the country. Thus, this ongoing research investigates whether the practical component of the course is actually practiced within workplaces. The study also intends to establish whether government sector employees of all categories have the opportunity to sit the Official Languages Proficiency Examination, and the total number of candidates who passed the examination over the past five years.
- item: Conference-Abstract Post-Text (National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Thayasivam, U; Rathnayaka, C.