Browsing by Author "Thayasivam, U"

item: Conference-Abstract
ACTSEA : annotated corpus for Tamil & Sinhala emotion analysis
Jenarthanan, R; Senarath, Y; Thayasivam, U
The purpose of text emotion analysis is to detect and classify the feelings expressed in text. In recent years, text emotion analysis studies for English have increased because data are abundant. With the growth of social media, large amounts of data are now available for regional languages such as Tamil and Sinhala as well. However, these languages lack the annotated corpora needed for many NLP tasks, including emotion analysis. In this paper, we present our scalable semi-automatic approach to creating an annotated corpus named ACTSEA for Tamil and Sinhala to support emotion analysis. We also present our analysis of a sample of the produced data, along with useful findings, for the benefit of the low-resource NLP community. For ACTSEA, data were gathered from the Twitter platform and annotated manually after cleaning. We collected 600280 (Tamil) and 318308 (Sinhala) tweets in total, which makes our corpus the largest data collection currently available for these languages.
item: Conference-Full-text
Application of noise filter mechanism for T5-based text-to-SQL generation
(IEEE, 2023-12-09) Aadhil Rushdy, MR; Thayasivam, U; Abeysooriya, R; Adikariwattage, V; Hemachandra, K
The objective of the text-to-SQL task is to convert natural language queries into SQL queries. However, the presence of extensive text-to-SQL datasets across multiple domains, such as Spider, introduces the challenge of effectively generalizing to unseen data. Existing semantic parsing models have struggled to achieve notable performance improvements on these cross-domain datasets. As a result, recent advancements have focused on leveraging pre-trained language models to address this issue and enhance performance in text-to-SQL tasks. These approaches represent the latest and most promising attempts to tackle the challenges associated with generalization and performance improvement in this field. This paper proposes an approach that provides the encoder of a Seq2Seq model with the most pertinent schema items as input, and uses the decoder to generate accurate and valid cross-domain SQL queries by understanding the skeleton of the target SQL query. The proposed approach is evaluated on Spider, a well-known dataset for the text-to-SQL task, and obtains promising results: Exact Match accuracy and Execution accuracy are boosted to 72.7% and 80.2% respectively compared to the best related approaches.
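The two metrics named in this abstract can disagree on the same prediction, which may explain why Execution accuracy is the higher figure. The sketch below is illustrative only: the `singer` table, the two queries and the helper names are assumptions, and the string comparison is a simplification (Spider's official evaluation compares parsed query components, not raw strings).

```python
import sqlite3


def exact_match(predicted: str, gold: str) -> bool:
    """Simplified exact match: compare queries after case/whitespace normalisation."""
    def normalise(query: str) -> str:
        return " ".join(query.lower().split())
    return normalise(predicted) == normalise(gold)


def execution_match(conn: sqlite3.Connection, predicted: str, gold: str) -> bool:
    """True when the predicted query runs and returns the same rows as the gold query."""
    try:
        predicted_rows = conn.execute(predicted).fetchall()
    except sqlite3.Error:
        return False  # an invalid predicted query counts as a miss
    gold_rows = conn.execute(gold).fetchall()
    return sorted(predicted_rows) == sorted(gold_rows)


# Hypothetical toy database, not taken from Spider.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("A", 30), ("B", 45), ("C", 52)])

gold = "SELECT name FROM singer WHERE age > 40"
pred = "SELECT name FROM singer WHERE NOT age <= 40"

print(exact_match(pred, gold))            # False: the strings differ
print(execution_match(conn, pred, gold))  # True: both return the same rows
```

A prediction that is logically equivalent to the gold query but written differently fails the string-level check and still passes the execution check.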
item: Conference-Abstract
Automatic Sinhala news text summarizer
(National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Tennakoon, S; Thayasivam, U; Rathnayaka, C
With the present explosion of news circulating in the digital space, which consists mostly of unstructured textual data, there is a need to absorb the content of news easily and effectively. While there are many Sinhala news sites, none of them facilitates recommendation despite the popularity of recommender systems in this day and age. It would therefore be effective to present news in a summarized version that also tallies with user preferences. Our research aims to fill this gap by providing a centralized news platform which recommends news to its users clearly and concisely. The news articles are collected using web scraping and, after categorization, are presented in summarized form. We also expect to detect grey sheep users and to provide separate recommendations to them in order to minimize recommendation errors. Here, grey sheep users are the user group with special tastes, who may neither agree nor disagree with the majority of users. By implementing the proposed system, we hope to provide appropriate solutions to the mentioned requirements and build a user-friendly Sinhala news platform. From the application's point of view, manually creating a summary can be time consuming and tedious. The main idea behind automatic text summarization is to identify the most significant information in the given content, reduce the text to fewer sentences without losing the fundamental ideas of the original content, and present it to the end readers. A summarizer specific to the Sinhala language is a major requirement for such an application, because no Sinhala news platform currently presents summarized and categorized news text to its users. The objective is to produce a brief and exact outline of voluminous news messages while focusing on the key ideas that convey beneficial information, without losing the general meaning.
The research aims to build the summarizer using the PyTeaser algorithm. Even though PyTeaser cannot be used directly for the Sinhala language, language-specific modifications make it usable for Sinhala. The logic behind PyTeaser assigns a total score to each sentence based on four features: Title Score, Keyword Frequency, Sentence Length and Sentence Position. The total score is computed by weighting these features, and the weights are constants. The sentences with the highest scores are then selected to produce the summary. The quality of the summary is evaluated using the F-measure against human-generated summaries produced by Sinhala experts. The research computes the F-measure for all possible weight combinations formed from the original PyTeaser weights and chooses the optimized weight vector that provides the best-quality news summary. The summaries for the proposed application are generated using the derived weight set, and the summarized news is then expected to be recommended to end users via the recommender system, according to user preferences.
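The weighted scoring described in this abstract can be sketched roughly as follows. This is not PyTeaser's actual code: the weight values, the `IDEAL_LENGTH` constant, the whitespace tokenisation and the four simplified feature functions are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical weights; the abstract states only that the weights are constants.
WEIGHTS = {"title": 1.5, "keyword": 2.0, "length": 1.0, "position": 1.0}
IDEAL_LENGTH = 20  # assumed ideal sentence length, in words


def title_score(words, title_words):
    """Share of title words that appear in the sentence (simplified)."""
    return sum(w in title_words for w in words) / max(len(title_words), 1)


def keyword_score(words, keyword_freq):
    """Average relative frequency of the sentence's words among top keywords."""
    return sum(keyword_freq.get(w, 0.0) for w in words) / max(len(words), 1)


def length_score(words):
    """1.0 at the ideal length, falling linearly to 0.0."""
    return max(0.0, 1 - abs(IDEAL_LENGTH - len(words)) / IDEAL_LENGTH)


def position_score(index, total):
    """Earlier sentences score higher (a simplification)."""
    return 1 - index / total


def summarize(title, sentences, top_n=2):
    """Return the top_n highest-scoring sentences in their original order."""
    title_words = set(title.lower().split())
    tokenised = [s.lower().split() for s in sentences]
    counts = Counter(w for words in tokenised for w in words)
    total_words = sum(counts.values())
    keyword_freq = {w: c / total_words for w, c in counts.most_common(10)}
    scored = []
    for i, words in enumerate(tokenised):
        score = (WEIGHTS["title"] * title_score(words, title_words)
                 + WEIGHTS["keyword"] * keyword_score(words, keyword_freq)
                 + WEIGHTS["length"] * length_score(words)
                 + WEIGHTS["position"] * position_score(i, len(sentences)))
        scored.append((score, i))
    best = sorted(scored, reverse=True)[:top_n]
    return [sentences[i] for _, i in sorted(best, key=lambda pair: pair[1])]
```

The weight search described in the abstract would then amount to calling `summarize` with each candidate weight vector and keeping the one whose summaries achieve the highest F-measure against the expert summaries.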
item: Conference-Full-text
A deep learning ensemble hate speech detection approach for Sinhala tweets
(IEEE, 2022-07) Munasinghe, S; Thayasivam, U; Rathnayake, M; Adhikariwatte, V; Hemachandra, K
We live in an era where social media platforms play a key role in society. These platforms support most native languages, which has enabled people to express their opinions conveniently. It is also very common to observe people expressing very hateful opinions on social media platforms. Several studies have been carried out in this area for the Sinhala language with traditional machine learning models, and none of them have shown promising results. Further, current approaches are far behind the latest techniques used for high-resource languages. Hence this study presents a deep learning-based approach for hate speech detection, of a kind which has shown outstanding results for other languages. Moreover, a deep learning ensemble was constructed from these models to evaluate performance improvements. These models were trained and tested on a newly created dataset collected using the Twitter API. The model generalizability was further tested by applying it to a completely new dataset. The results show that the proposed approach has outperformed the traditional machine learning models and generalizes well. Finally, the experimentation with extra features also reveals that using them has a positive impact on performance.
item: Thesis-Full-text
Developing a retrieval-based Tamil language chatbot for closed domain
(2023) Kugathasan, K; Thayasivam, U
Chatbots are conversational systems that interact with humans via natural language. They are frequently used to respond to user queries and provide users with the information they need. To build a highly functional chatbot, a good corpus and a variety of language-related resources are required. Since Tamil is a low-resource language, those resources are not available for Tamil. Additionally, since Tamil is also a morphologically rich language, high inflexion and free word order pose key challenges to Tamil chatbots. For all the above reasons, it is evident that developing an effective End-to-End chat system is challenging even for a closed domain. This study introduces a novel method for building a chatbot in Tamil by leveraging a dataset extracted from the FAQ sections of Tamil banking websites and extending it to encompass the language's morphological complexity and rich inflectional structure. An intent is assigned to each query, and a multiclass intent classifier is developed to classify user intent. The CNN-based classifier demonstrated the highest performance, achieving an accuracy of 98.72%. While previous works on short-text classification in Tamil focused only on a few classes and used a very large dataset, our method produced a superior accuracy of over 98% using a small number of per-class examples, even with 56 classes and additional challenges such as class imbalance in the data. This shows our approach is better than any other approach for short text classification in Tamil. The major contribution of this research is the generation of the first-ever chat dataset for Tamil. Our research is the first of its kind in Tamil to show how an efficient context-less chatbot can be built using short text classification. Although this project is done for the Tamil language and for the Banking domain, this approach can be applied to other low-resourced languages and domains as well.
item: Conference-Full-text
Domain specific named entity recognition in Tamil
(IEEE, 2022-07) Murugathas, R; Thayasivam, U; Rathnayake, M; Adhikariwatte, V; Hemachandra, K
This paper presents a domain-specific Tamil Named Entity Recognizer for the history domain. The system uses a manually annotated corpus of 23k tokens, tagged with 36 tags related to the history domain. An NER model for Tamil is trained based on Conditional Random Fields (CRF) with features extracted based on the domain of interest and the language. Hyperparameter tuning with a random search algorithm is applied to find the best hyperparameters for the model. Tamil is a low-resource and morphologically rich language, which makes the task challenging. Despite that, the system achieved fair results, with micro-averaged Precision, Recall and F1-score of 87.9%, 67.1% and 76.1% respectively.
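As a quick consistency check on the reported figures, the F1-score is the harmonic mean of precision and recall. The helper name below is illustrative; the two input values are the micro-averaged scores quoted in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.879, 0.671)
print(f"{f1:.3f}")  # 0.761, matching the reported 76.1%
```

The gap between precision and recall suggests the recognizer is conservative: most entities it tags are correct, but it misses roughly a third of them.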
item: Conference-Abstract
Effectiveness of grammatically correct translation from English into Sinhalese: with special reference to the annual report 2019/20 of Sri Lankan airlines
(National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Gunathilaka, D. D. I. M. B.; Thayasivam, U; Rathnayaka, C
Translation is without doubt a globally indispensable need in bringing together all humans, who speak diverse vernaculars. To examine the effectiveness of grammatically correct translation from English into Sinhalese, the researcher conducted a case study based on the Annual Report (AR) 2019/20 of Sri Lankan Airlines. Data collected from the said AR were analyzed together with secondary data in order to investigate how the correct meaning is carried into Sinhalese, giving special attention to Spellings, Subject-Verb Concord, Word Formation, Word Division, and Punctuations. The analysis found that though the fundamental meaning of the AR has been communicated into Sinhalese, there are a number of grammatically ambiguous words and phrases due to the misuse of Spellings, Word Formation, Word Division, and Punctuations, which led to misinterpretations. Further, no serious error was detected related to Subject-Verb Concord. Finally, the analyzed data has clearly proved that grammar matters for an accurate translation.
item: Conference-Full-text
End To End Model For Speaker Identification With Minimal Training Data
(IEEE, 2021-07) Balakrishnan, S; Jathusan, K; Thayasivam, U; Adhikariwatte, W; Rathnayake, M; Hemachandra, K
Deep learning has achieved immense universality by outperforming GMM and i-vectors on speaker identification. Neural network approaches have obtained promising results when fed raw speech samples directly. SincNet is a modified Convolutional Neural Network (CNN) architecture based on parameterized sinc functions, which offer a very compact way to derive a customized filter bank for short utterances. This paper proposes an attention-based Long Short-Term Memory (LSTM) architecture that encourages discovering more meaningful speaker-related features with minimal training data. An attention layer built using neural networks offers a unique and efficient representation of the speaker characteristics, exploring the connection between an aspect and the content of short utterances. The proposed approach converges faster and performs better than SincNet in the speaker identification experiments carried out.
item: Conference-Full-text
Hybrid approach for accurate and interpretable representation learning of knowledge graph
(IEEE, 2020-07) Yogendran, N; Kanagarajah, A; Chandiran, K; Thayasivam, U; Weeraddana, C; Edussooriya, CUS; Abeysooriya, RP
Representation learning of knowledge graphs aims to embed both entities and relations into a low-dimensional space. However, knowledge graph embedding methods still have gaps in providing interpretation of the knowledge graph while encoding the semantic meaning of the concepts and the structured information of knowledge graphs. To address this issue, we propose a hybrid Accurate and Interpretable Representation Learning (AIRL) method for embedding the entities and relations of knowledge graphs by utilizing the rich information located in entity descriptions and hierarchical types of entities. Here we use a hybrid approach to learn interpretable knowledge representations by capturing the semantics and structure of entities using this rich information. We adopt the FB15K dataset, generated from the large knowledge graph Freebase, to evaluate the performance of the proposed model. The experimental results demonstrate that AIRL significantly outperforms translation embeddings and other state-of-the-art methods.
item: Conference-Abstract
An improved kNN algorithm using k-means and Fast text to predict sentiments expressed in Tamil texts
(National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Thavareesan, S; Mahesan, S; Thayasivam, U; Rathnayaka, C
With the intention of developing a suitable approach to performing Sentiment Analysis on Tamil texts using K-means clustering with a k-Nearest Neighbour (k-NN) classifier, a corpus, UJ_Corpus_Opinions, consisting of 1518 Positive and 1173 Negative comments has been constructed. For training, 820 positive and 820 negative comments are taken, and for testing 650 and 350 respectively. Bag of Words (BoW) and fastText vectors are used to create feature vectors. These feature vectors are clustered using K-means clustering. The cluster centroids are used as classification keys for the k-NN classifier. Two types of clustering techniques are utilised to develop two models: (i) using class-wise information, (ii) with no class-wise information. These two models are tested using K-Fold. All these four models are tested with the two types of feature vectors. These models are tested using varying numbers of centroids (Kc:1..10), neighbours (Kn:1..Kc) and folds (Kf:1..10) to study their influence on the accuracy. The accuracy increases with the value of Kc, and the highest accuracy (74%) is obtained for Kn=1 and Kf=2. Accuracy is, in general, found to be higher with fastText than with BoW. The model with fastText and class-wise clustering with K-Fold that obtained 74% accuracy has an F1-Score of 0.74.
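The class-wise variant described in this abstract, in which K-means centroids serve as the classification keys for k-NN, might look roughly like the sketch below. It is an illustration under stated assumptions rather than the authors' implementation: the toy 2-D vectors stand in for BoW or fastText features, the initial centroids are simply the first Kc points of each class, and all function names are hypothetical.

```python
import math
from collections import Counter


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; initial centroids are the first k points (assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid when a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids


def fit_classwise(vectors, labels, kc):
    """Cluster each class separately; every centroid becomes a labelled key."""
    keys = []
    for label in sorted(set(labels)):
        members = [v for v, l in zip(vectors, labels) if l == label]
        keys += [(centroid, label) for centroid in kmeans(members, kc)]
    return keys


def predict(keys, vector, kn=1):
    """Majority vote among the kn nearest centroids."""
    nearest = sorted(keys, key=lambda key: distance(vector, key[0]))[:kn]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy data standing in for real feature vectors.
vectors = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["negative"] * 3 + ["positive"] * 3
keys = fit_classwise(vectors, labels, kc=1)
print(predict(keys, (0.3, 0.2)))  # negative
print(predict(keys, (4.5, 5.5)))  # positive
```

Replacing the training points with a handful of centroids keeps the k-NN search small, which is one plausible motivation for the design; here `kc` and `kn` play the roles of Kc and Kn in the abstract.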
item: Conference-Abstract
Issues in employing metaphorically equivalent phrases according to cognitive translation hypothesis in translating Sinhala text into English to train an AI translation software
(National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Herath, L; Ranaweera, B; Thayasivam, U; Rathnayaka, C
This paper explores the issues in translating Sinhala metaphorical/cultural phrases into English that arise from the indirectness of their meanings, especially when translating material to train AI software. Current research shows that one of the most successful solutions for this issue is the Cognitive Translation Hypothesis (hereafter CTH), which suggests that using metaphorically equivalent phrases in translation is effective in conveying the meaning. However, one of the problems in using metaphorically equivalent phrases is that the literal meanings of such phrases in the source language can be entirely different from those in the target language, even though the metaphorical meanings are equivalent. Thus, translating material from Sinhala to English using CTH, especially to train AI software to translate a text feed, can be unsuccessful, since the software can confuse the literal meaning with the metaphorical meaning, in either direction, wherever they occur in the text feed.
item: Conference-Full-text
Language model-based spell-checker for Sri Lankan names and addresses
(IEEE, 2022-07) Udagedara, Y; Elikewela, B; Thayasivam, U; Rathnayake, M; Adhikariwatte, V; Hemachandra, K
Names are used abundantly in various applications, but traditional spell-checkers are not adapted to correcting errors in names. In this research, we propose a spell-checker for Sri Lankan names and addresses. The main challenge in building a spell-checker for names is the inability to create a comprehensive dictionary. Our spell-checker overcomes this challenge by utilizing a language model for evaluating the validity of names and a non-dictionary suggestion generator. The resulting spell-checker achieves up to 96% suggestion adequacy. This spell-checker can be used in applications directly, and the components built can be repurposed for other named entity-related research.
item: Conference-Abstract
Lexical uncertainties confronted by government translators when translating legal documents from Sinhala to English
(National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Hansani, J.A.M; Thayasivam, U; Rathnayaka, C
The objective of the present study is to find the lexical uncertainties confronted by government translators who have less than one year of experience in their profession. A mixed methodology was employed to gather the primary data. An error analysis was done after asking 25 selected translators to translate different legal documents. It was found that ambiguity, vagueness, lack of knowledge of context, vagueness and legal terminology were the main causes of unacceptable or less acceptable translations. Out of those five lexical uncertainties, the researcher found legal terminology to be the one most frequently confronted by translators in translating legal documents. To collect more data, a questionnaire was also distributed to the sample population. As recommendations to solve these issues, the research suggests the use and maintenance of legal glossaries, meeting law experts, gathering legal knowledge, extensive reading and getting help from online sources and books.
item: Conference-Full-text
Low resource multi-ASR speech command recognition
(IEEE, 2022-07) Mohamed, I; Thayasivam, U; Rathnayake, M; Adhikariwatte, V; Hemachandra, K
Spoken language understanding (SLU) has several applications, such as topic identification and intent detection. One of the primary underlying components used in SLU studies is ASR (Automatic Speech Recognition). In recent years we have seen major improvements in the ability of ASR systems to recognize spoken utterances. But it is still a challenging task for low-resource languages, as training an ASR model requires hundreds of hours of audio input. To overcome this issue, recent studies have used transfer learning techniques. However, the errors produced by the ASR models significantly affect the downstream natural language understanding (NLU) models used for intent or topic identification. In this work, we propose a multi-ASR setup to overcome this issue. We show that combining outputs from multiple ASR models can significantly increase the accuracy of low-resource speech-command transfer-learning tasks compared with using the output from a single ASR model. We have come up with CNN-based setups that can utilize outputs from pre-trained ASR models such as DeepSpeech2 and Wav2Vec 2.0. The experimental results show an 8% increase in accuracy over the current state-of-the-art low-resource speech-command phoneme-based speech intent classification methodology.
item: Conference-Abstract
Named entity boundary detection for religious unhealthy statements in social media
(National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Yapa, P; Thayasivam, U; Rathnayaka, C
Named entity recognition (NER) is one of the fundamental tasks in natural language processing. The type and the boundary are the two components of a named entity (NE) in text. Various studies have detected NE types and boundaries as two separate tasks, and there is no existing mechanism that considers both tasks together. NER in social media analysis, considering expressions, disputes etc., is also in high demand. Furthermore, most of these mechanisms have used English as the testing language corpus, and there is an overwhelming demand for a system which supports multiple languages including Sinhala. So, the intention is to implement a system for NE boundary detection for the Sinhala language, considering religious unhealthy statements in social media. Detecting both NE boundary and NE type as an aggregate mechanism will improve the accuracy and performance of NE linking to knowledge bases. The approach will be determined by aspects such as identifying the existing mechanisms of NE type detection and NE boundary detection, identifying the complexity indexes, matrices and relationships of religious unhealthy statements in Sinhala, identifying the novelty in implementing NE boundary detection considering NE types of religious unhealthy statements and finally enhancing NE linking considering NE boundary detection. The ultimate target is to implement a prototype which outperforms the existing state-of-the-art baselines.
item: Conference-Abstract
Old Sinhala newspaper reader for people with visual impairment
(National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Perera, V.; Gayashan, P.P.A; Nirmani, S; Thayasivam, U; Rathnayaka, C
In Sri Lanka, blind or visually impaired people refer to non-current documents such as old Sinhala newspapers for their studies, research and other academic purposes. The Department of National Archives in Sri Lanka provides repositories of non-current records to people who need to dive deep into past newspapers. However, it is extremely difficult for visually impaired people to acquire information from those archived newspapers without guidance. In recent years, the Sinhala Screen Reader was implemented for the 'Ranawiru Sevana' rehabilitation facility at Ragama to help retired blind soldiers access documents and web sources, to ease their day-to-day computer-related work, and to support applications including web browsers, email clients, internet chat programs and office suites. However, in Sri Lanka there is still no proper offline application to assist blind or visually impaired people with reading physical printed newspapers published before the launch of e-newspapers. Hence, the proposed solution is a real-time offline old Sinhala newspaper reader with smart assistant support that delivers the preferred news articles to the user with predefined question sets. The system consists of tracking and segmenting the newspaper article from scanned newspaper pages, segmentation of the elements in the article, character-by-character segmentation, character recognition, feature extraction and the use of a classification model. In addition, a highly accurate word correction module will be used to provide accurate recognition of Sinhala characters and create the corresponding corrected words. There can be word and non-word errors caused by incomplete and incorrectly classified words, owing to the variation of characters and the poor visual quality of the deteriorated old newspaper pages. At the final stage, the preferred news content will be provided to the user as audio clips using speech-to-text and text-to-speech tools.
item: Conference-Abstract
Polysemy in technical translations
(National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Senevirathne, S. A. T. A.; Thayasivam, U; Rathnayaka, C
Translators may find polysemy, the coexistence of several meanings for a single word, a challenge, since it complicates the task of the translator. The situation gets worse if the translation is technical, where the message should be rendered as it is in the Source Text, unlike a literary translation, which gives the translator more freedom. This paper intends to examine how polysemy can affect a technical translation when the translator is unable to apply the most appropriate term for the original text. It also aims to explore how a translator can choose the relevant words for the Target Text when doing a technical translation. Accordingly, a few technical documents and their translations will be analyzed in order to discover the situations where polysemy has appeared and whether the translator has successfully tackled the polysemic situation. Thus, this paper will shed some light on how to handle polysemy in technical translations.
item: Conference-Abstract
Post-Text
(National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Thayasivam, U; Rathnayaka, C
item: Conference-Abstract
Pre-Text
(National Language Processing Centre University of Moratuwa Sri Lanka, 2020) Thayasivam, U; Rathnayaka, C
item: Conference-Full-text
Shadow removal for documents with reflective textured surface
(IEEE, 2022-07) Ravindran, A; Arudselvam, U; Thayasivam, U; Rathnayake, M; Adhikariwatte, V; Hemachandra, K
An increasing number of institutions are converting from traditional verification to online digital verification of user documents. In Sri Lanka, this requires clean digital images of documents such as the National Identity Card (NIC), driver’s license etc., which often have background textures and reflective surfaces. Due to human error, uneven natural light and reflectance properties, such document images contain cast shadows which pose a difficulty for further processing. The NIC dataset itself is unique in nature. It has some properties characteristic of a document image, i.e., dark letters on a light background, and some properties characteristic of a natural image, i.e., background object textures. Therefore, the target domain or nature of the dataset is itself a novelty. For this domain, we propose a shadow removal mechanism based on the Dual Hierarchical Aggregation Network (DHAN) and the VGG-19 (convolutional network by the Visual Geometry Group) object detection deep learning model. Since no previous research has been done on this specific target area, we do not have a direct benchmark against which to compare our proposed methodology. Hence, we have experimented on our dataset with existing state-of-the-art models for both shadow removal for document images and shadow removal for natural images. Our proposed model reflects the typical backbone architecture of shadow removal models for natural images, removing shadows while preserving background textures. Our architecture can be used directly or added to an existing image processing pipeline. Although our target domain is relatively new, for comparison purposes we compared our model with a close relative, namely shadow removal on natural images. Our architecture results in an overall quality improvement of 12% and a 63% improvement in output resolution when compared with the state-of-the-art architecture in shadow removal for natural images.