Detecting access patterns through analysis of web logs

dc.contributor.author: Algiriyage, N
dc.date.accept: 2015
dc.date.accessioned: 2015-09-16T04:57:59Z
dc.date.available: 2015-09-16T04:57:59Z
dc.date.issued: 2015-09-16
dc.description.abstract: With the evolution of the Internet and the continuous growth of the global information infrastructure, the amount of data collected online from transactions and events has increased drastically. Web server access log files collect substantial data about web visitor access patterns. Applying data mining techniques to such data, known as web mining, can reveal a great deal of useful information about navigational patterns. In this research we analyze the patterns of web crawlers and human visitors through web server access log files. The objectives of this research are to detect web crawlers, identify suspicious crawlers, detect Googlebot impersonation, and profile human visitors. During human visitor profiling we group similar web visitors into clusters based on their browsing patterns and profile them. We show that web crawlers can be identified and successfully classified using heuristics. We evaluated our proposed methodology using seven test crawler scenarios. We found that approximately 53.25% of web crawler sessions were from "known" crawlers and 34.16% exhibited suspicious behavior. We present an effective methodology to detect fake Googlebot crawlers by analyzing web access logs. We propose using Markov chain models to learn profiles of real and fake Googlebots based on their patterns of web resource access sequences. We calculated log-odds ratios for a given set of crawler sessions, and our results show that the higher the log-odds score, the higher the probability that a given sequence comes from the real Googlebot. Experimental results show that, at a threshold log-odds score, we can distinguish the real Googlebot from the fake. For human visitor profiling, an improved similarity measure is proposed and used as the distance measure in agglomerative hierarchical clustering of a data set from an e-commerce web site. To generate profiles, frequent itemset mining is applied over the clusters. Our results show that proper visitor clustering can be achieved with the improved similarity measure. (en_US)
dc.identifier.accno: 109008 (en_US)
dc.identifier.degree: MSc. (en_US)
dc.identifier.department: Department of Computer Science & Engineering (en_US)
dc.identifier.faculty: Engineering (en_US)
dc.identifier.uri: http://dl.lib.mrt.ac.lk/handle/123/11341
dc.language.iso: en (en_US)
dc.subject: COMPUTER SCIENCE AND ENGINEERING-Dissertations (en_US)
dc.subject: Web usage mining
dc.title: Detecting access patterns through analysis of web logs (en_US)
dc.type: Thesis-Abstract (en_US)
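
The Markov chain log-odds scoring mentioned in the abstract can be illustrated with a minimal sketch. This record does not give the thesis's actual state definitions, smoothing, training data, or threshold, so everything below is an assumption made for illustration: the resource-type states in STATES, the toy sessions, the add-one smoothing, and the function names train_markov and log_odds are all hypothetical.

```python
import math

# Hypothetical resource-type states; the thesis's real state set is not given here.
STATES = ["robots", "html", "css", "image"]


def train_markov(sessions, states=STATES, alpha=1.0):
    """Estimate first-order transition probabilities with add-alpha smoothing."""
    counts = {a: {b: 0 for b in states} for a in states}
    for seq in sessions:
        # Count each consecutive pair of requested resource types.
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {}
    for a in states:
        total = sum(counts[a].values()) + alpha * len(states)
        probs[a] = {b: (counts[a][b] + alpha) / total for b in states}
    return probs


def log_odds(seq, real, fake):
    """Sum of log(P_real / P_fake) over the transitions in one session."""
    return sum(math.log(real[a][b] / fake[a][b]) for a, b in zip(seq, seq[1:]))


# Toy training sessions, invented for this sketch (not data from the thesis).
real_sessions = [
    ["robots", "html", "html", "css"],
    ["robots", "html", "css", "html"],
]
fake_sessions = [
    ["html", "image", "image", "image"],
    ["image", "image", "html", "image"],
]

real_model = train_markov(real_sessions)
fake_model = train_markov(fake_sessions)

for session in (["robots", "html", "css"], ["html", "image", "image"]):
    print(session, round(log_odds(session, real_model, fake_model), 3))
```

In this sketch a positive score means the model trained on real sessions explains the sequence better than the model trained on fake ones. Where to place the decision threshold would have to be chosen from labelled data; the record does not state the value used in the thesis.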

Files

Original bundle

Name: 109008-1.pdf
Size: 73.68 KB
Format: Adobe Portable Document Format
Description: Pre Text

Name: 109008-2.pdf
Size: 83.96 KB
Format: Adobe Portable Document Format
Description: Post Text

Name: 109008.pdf
Size: 1.98 MB
Format: Adobe Portable Document Format
Description: Full Thesis