Identifying health domain URLs using SVM

Rajalakshmi R

doi:10.1145/2791405.2791441

Published in Association for Computing Machinery

2015

Volume: 10-13-August-2015

Pages: 203 - 208

Abstract

The World Wide Web contains a large volume of information on a wide range of topics. In the health domain especially, people often search the web before consulting experts, but there is no guarantee that only relevant health-related pages are retrieved. There is therefore a need for an automated system that can help identify health-related web pages. This paper proposes a URL-based approach to identifying health domain URLs, which helps to avoid fetching irrelevant web pages. One of the difficulties in URL-based topic classification is selecting suitable URL features. Here, only the 4-grams derived from URLs are used as features to decide whether a web page is health related, without relying on any medical repository. Statistical dictionary-based methods have been reported in the literature, but the construction of such dictionaries is not automatic. A machine learning technique is proposed that automatically learns a statistical dictionary of terms from the training URLs. To classify a web page as either a health page or not, a binary SVM classifier is designed using a dictionary of 4-grams derived from URLs. The benchmark ODP (Open Directory Project) dataset was used to evaluate performance across various experiments. The proposed URL-based approach achieves a precision of 87%, which is a significant improvement over existing techniques. © 2015 ACM.
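The feature pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the 4-grams are extracted or how the dictionary is filtered, so everything below is an assumption made for illustration: the function names `url_4grams`, `build_dictionary`, and `to_vector` are hypothetical, the tokenisation (drop the scheme, lowercase, split on non-letters) is one plausible choice, and the `min_count` threshold is a stand-in for whatever statistical selection the paper applies.

```python
import re
from collections import Counter


def url_4grams(url, n=4):
    """Return character n-grams taken from the alphabetic tokens of a URL."""
    # Assumption: drop the scheme, lowercase, and split on non-letters.
    stripped = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url.lower())
    tokens = re.findall(r"[a-z]+", stripped)
    grams = []
    for token in tokens:
        # Tokens shorter than n contribute no n-grams.
        grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


def build_dictionary(urls, min_count=1):
    """Collect n-grams that occur in at least min_count training URLs."""
    doc_freq = Counter()
    for url in urls:
        # Count each n-gram once per URL (document frequency).
        doc_freq.update(set(url_4grams(url)))
    return sorted(g for g, c in doc_freq.items() if c >= min_count)


def to_vector(url, dictionary):
    """Encode a URL as a binary presence vector over the dictionary."""
    present = set(url_4grams(url))
    return [1 if g in present else 0 for g in dictionary]


if __name__ == "__main__":
    # Hypothetical training URLs, for illustration only.
    train = ["http://www.health.com", "http://healthy.org"]
    dictionary = build_dictionary(train)
    print(dictionary)
    print(to_vector("http://health.net", dictionary))
```

Vectors of this form could then be passed, together with health / non-health labels, to any binary SVM implementation. The classifier itself is left out here because the abstract gives no details of the kernel or parameters used.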