Tamil news clustering

Fayaza MSF

dc.contributor.advisor	Ranathunga S
dc.contributor.author	Fayaza MSF
dc.date.accept	2020
dc.date.accessioned	2020
dc.date.available	2020
dc.date.issued	2020
dc.description.abstract	The web has an abundance of online news articles that are updated frequently. Readers have difficulty finding content of interest among the overwhelming number of news sources and tire of browsing multiple websites. The same holds for Tamil online news, where the number of published articles is rising. News aggregators and clustering techniques address this problem. Although many news aggregators exist for languages such as English, the only aggregator that supports Tamil is Google News, which is a noticeable shortage. Google News mainly covers Indian news and, when searching, gives more weight to words in the headline than to words in the body of an article [1]. This research focuses on clustering Tamil online news articles into related topics. The literature describes several clustering techniques and similarity measures for clustering documents in other languages. Tamil is an agglutinative language, so techniques used for English documents might not readily work for Tamil. The purpose of this research is to study the techniques available for other languages and to develop a mechanism for clustering Tamil online news articles according to their content similarity. As the first step, ten datasets were created by collecting news from nine news providers. Data was collected on nonadjacent days to obtain diverse data. TF-IDF and word embedding techniques were used to create vector representations of the data. The one-pass algorithm and the affinity propagation algorithm were used to cluster the news articles, since the number of clusters cannot be predefined and there is a high number of single-article clusters. The best results were achieved by applying word embeddings with the one-pass algorithm.
As another contribution of this research, we created a Tamil word embedding model with 21,077,843 words.	en_US
dc.identifier.accno	TH4360	en_US
dc.identifier.degree	MSc in Computer Science and Engineering	en_US
dc.identifier.department	Department of Computer Science and Engineering	en_US
dc.identifier.faculty	Engineering	en_US
dc.identifier.uri	http://dl.lib.uom.lk/handle/123/16509
dc.language.iso	en	en_US
dc.subject	COMPUTER SCIENCE - Dissertation	en_US
dc.subject	COMPUTER SCIENCE & ENGINEERING - Dissertation	en_US
dc.subject	CLUSTERING	en_US
dc.subject	TF-IDF	en_US
dc.subject	WORD EMBEDDING	en_US
dc.subject	ONE PASS ALGORITHM	en_US
dc.subject	AFFINITY PROPAGATION	en_US
dc.subject	COSINE SIMILARITY	en_US
dc.subject	CRAWLER	en_US
dc.title	Tamil news clustering	en_US
dc.type	Thesis-Full-text	en_US
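
The abstract names TF-IDF vectors, cosine similarity, and a one-pass clustering algorithm but does not give their details. The following is a minimal sketch of how those pieces could fit together, not the thesis's implementation. The function names, the smoothed IDF formula, the running-mean centroid update, and the 0.3 similarity threshold are all assumptions made for illustration, and the English tokens stand in for already tokenised Tamil articles. The abstract reports the best results with word embeddings; TF-IDF is used here only to keep the example self-contained.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight dicts) for tokenised documents."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF (an assumed variant): log((1 + n) / (1 + df)) + 1
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def one_pass_cluster(vectors, threshold=0.3):
    """Assign each vector, in order, to the most similar existing cluster,
    or start a new cluster if no centroid reaches the threshold.
    The number of clusters is not fixed in advance."""
    clusters = []  # each entry: {"centroid": dict, "members": [doc indices]}
    for index, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": dict(vec), "members": [index]})
            continue
        best["members"].append(index)
        size = len(best["members"])
        old = best["centroid"]
        # Update the centroid as the running mean of its member vectors.
        best["centroid"] = {
            term: (old.get(term, 0.0) * (size - 1) + vec.get(term, 0.0)) / size
            for term in set(old) | set(vec)
        }
    return [cluster["members"] for cluster in clusters]


# Hypothetical toy input: two related articles and one unrelated article.
articles = [
    ["election", "result", "vote"],
    ["election", "vote", "count"],
    ["cricket", "match", "score"],
]
groups = one_pass_cluster(tfidf_vectors(articles), threshold=0.3)
```

Because an article that matches no existing centroid opens its own cluster, this style of algorithm copes with an unknown number of clusters and with many single-article clusters, which is the motivation the abstract gives for choosing it. The result depends on the order in which articles arrive and on the threshold, so both would need tuning on real data.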

Collections

Master of Science in Computer Science and Engineering
