Web page classification using n-gram based URL features
, C. Aravindan
Published in Institute of Electrical and Electronics Engineers Inc.
2014
Pages: 15 - 21
Abstract
The exponential increase in the number of web pages on the World Wide Web poses a great challenge for information filtering and makes topic-focused crawling for relevant information a time-consuming process. We propose a URL-based web page classification method that needs neither the web page content nor its link structure. In the proposed approach, character n-gram based features are extracted from URLs alone, and classification is done by Support Vector Machines and Maximum Entropy classifiers. The performance of the system was evaluated on two benchmark datasets, viz., ODP with 2 million URLs and WebKB with 4K URLs. We used F1 as the performance metric, and our experimental results showed an improvement of 20.5% on the WebKB dataset and 4.7% on the ODP dataset. © 2013 IEEE.
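To make the idea of character n-gram features from a URL concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not state the n-gram lengths or any preprocessing, so the helper name `url_char_ngrams`, the default range of 3 to 5 characters, the lowercasing, and the removal of the scheme are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def url_char_ngrams(url, n_min=3, n_max=5):
    """Count character n-grams in a URL.

    Hypothetical helper for illustration; the n-gram range and the
    preprocessing below are assumptions, not taken from the paper.
    """
    parts = urlsplit(url.lower())
    # Assumption: drop the scheme ("http://") and keep host, path and query.
    text = parts.netloc + parts.path
    if parts.query:
        text += "?" + parts.query

    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the URL text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Example URL is made up; trigrams only, to keep the output small.
features = url_char_ngrams("http://www.cs.example.edu/courses/cs101/", n_min=3, n_max=3)
print(features.most_common(5))
```

Sparse count vectors of this kind could then be passed to a classifier such as a Support Vector Machine or a Maximum Entropy model, as the abstract describes; that step is omitted here.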