Tamil news clustering

dc.contributor.advisorRanathunga S
dc.contributor.authorFayaza MSF
dc.date.accept2020
dc.date.accessioned2020
dc.date.available2020
dc.date.issued2020
dc.description.abstractThe web has an abundance of online news articles that are updated frequently. Readers struggle to find content of interest among the overwhelming number of news sources and tire of browsing multiple websites. The same holds for Tamil online news, where the number of published articles is on the rise. News aggregators and clustering techniques address this problem. Although many news aggregators exist for languages such as English, the only one that supports Tamil is Google News, which leaves a noticeable gap. Google News mainly covers Indian news and, when searching, gives higher weight to words that appear in the headline than to words that appear in the body of the article [1]. This research focuses on clustering Tamil online news articles into related topics. The literature describes several clustering techniques and similarity measures for clustering documents in other languages. Tamil is an agglutinative language, so techniques designed for English documents might not readily work for Tamil. The purpose of this research is to study the techniques available for other languages and to develop a mechanism that clusters Tamil online news articles according to their content similarity. As the first step of this study, ten different datasets were created by collecting news from nine different news providers. Data was collected on nonadjacent days to obtain diverse data. TF-IDF and word embedding techniques were used to create vector representations of the data. The one-pass algorithm and the affinity propagation algorithm were used to cluster the news articles, because the number of clusters cannot be fixed in advance and many clusters contain only a single news article. The best results were achieved by combining word embeddings with the one-pass algorithm.
As another contribution of this research, we created a Tamil word embedding model with 21,077,843 words.en_US
dc.identifier.accnoTH4360en_US
dc.identifier.degreeMSc in Computer Science and Engineeringen_US
dc.identifier.departmentDepartment of Computer Science and Engineeringen_US
dc.identifier.facultyEngineeringen_US
dc.identifier.urihttp://dl.lib.uom.lk/handle/123/16509
dc.language.isoenen_US
dc.subjectCOMPUTER SCIENCE- Dissertationen_US
dc.subjectCOMPUTER SCIENCE & ENGINEERING - Dissertationen_US
dc.subjectCLUSTERINGen_US
dc.subjectTF-IDFen_US
dc.subjectWORD EMBEDDINGen_US
dc.subjectONE PASS ALGORITHMen_US
dc.subjectAFFINITY PROPAGATIONen_US
dc.subjectCOSINE SIMILARITYen_US
dc.subjectCRAWLERen_US
dc.titleTamil news clusteringen_US
dc.typeThesis-Full-texten_US
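
The abstract names one-pass clustering over document vectors with cosine similarity as the best-performing setup. The record does not give the thesis's exact procedure, so the following is only a minimal sketch of a generic one-pass (single-pass) scheme under stated assumptions: each article is already a numeric vector (for example TF-IDF weights or averaged word embeddings), an article joins the most similar existing cluster when its cosine similarity to that cluster's centroid reaches a threshold, and otherwise starts a new cluster. The function names, the centroid-as-running-mean update, and the threshold value are illustrative choices, not taken from the thesis.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def one_pass_cluster(vectors, threshold=0.5):
    """Assign each vector to a cluster in a single pass over the input.

    Returns a list of clusters, each a list of input indices.
    The threshold is a hypothetical default and would need tuning.
    """
    centroids = []  # running mean vector of each cluster
    members = []    # input indices belonging to each cluster
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for c, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            # No cluster is similar enough: start a new single-article cluster.
            centroids.append(list(vec))
            members.append([i])
        else:
            # Fold the new vector into the running mean of the chosen cluster.
            n = len(members[best])
            centroids[best] = [
                (m * n + x) / (n + 1) for m, x in zip(centroids[best], vec)
            ]
            members[best].append(i)
    return members


# Toy vectors standing in for article representations (made-up values).
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(one_pass_cluster(docs, threshold=0.8))  # [[0, 1], [2], [3]]
```

Because a new cluster is opened whenever no centroid is similar enough, the number of clusters does not have to be chosen beforehand and single-article clusters arise naturally, which matches the two reasons the abstract gives for selecting this family of algorithms.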