Analysis of semantic and non-semantic crawlers

Shridevi S; S. Sanket; J. Thakor; M. Dhivya

A focused crawler traverses the World Wide Web, selecting pages that are relevant to a predefined topic and ignoring those that are not. It collects domain-specific documents and is considered one of the most important ways to gather information. However, centralized crawlers are not adequate to spider meaningful and relevant portions of the Web. A crawler that is scalable and balances load well can improve overall performance. As the Web grows day by day, distributed web crawling is therefore of prime importance for downloading pages in less time and increasing crawler coverage. This paper describes different semantic and non-semantic web crawler architectures, broadly classifying them into non-semantic (serial, parallel and distributed) and semantic (distributed and focused). All of the aforementioned types are implemented using libraries provided by Python 3, and a comparative analysis is carried out among them. The purpose of this paper is to outline how different processes can be run in parallel and on a distributed system, and how they interact with each other using shared variables and message-passing algorithms. © 2021 CEUR-WS. All rights reserved.
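As an illustration of the shared-variable and message-passing interaction mentioned in the abstract, the following is a minimal sketch of a parallel crawler in Python 3. It is not the authors' implementation. The in-memory link graph FAKE_WEB, the function name crawl, and the worker count are all hypothetical, and a dictionary lookup stands in for real HTTP fetching and parsing so that the example runs without network access.

```python
import queue
import threading

# Hypothetical in-memory "web": page -> outgoing links.
# A dictionary lookup stands in for fetching and parsing a real page.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def crawl(seed, num_workers=3):
    """Crawl FAKE_WEB from seed with several threads; return the visited set."""
    frontier = queue.Queue()   # message passing: URLs are handed between workers
    visited = {seed}           # shared variable, guarded by the lock below
    lock = threading.Lock()

    def worker():
        while True:
            url = frontier.get()
            if url is None:    # sentinel: no more work
                frontier.task_done()
                return
            for link in FAKE_WEB.get(url, []):
                with lock:
                    if link in visited:
                        continue
                    visited.add(link)
                # Enqueue before task_done() so the frontier never looks
                # empty while this page still has undiscovered links.
                frontier.put(link)
            frontier.task_done()

    frontier.put(seed)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    frontier.join()            # blocks until every queued URL is processed
    for _ in threads:
        frontier.put(None)     # one sentinel per worker
    for t in threads:
        t.join()
    return visited


if __name__ == "__main__":
    print(sorted(crawl("a")))
```

The queue plays the role of the message channel and the locked set plays the role of the shared variable. A distributed variant would replace both with network-accessible equivalents, which this sketch does not attempt.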

Journal	CEUR Workshop Proceedings
Publisher	CEUR-WS
ISSN	1613-0073