Improving the effectiveness of statistical feature selection algorithms using bag of synsets and its parallelization
, K.R. Chandran, A. Karthik, A. Vijay Samuel
Published in EuroJournals, Inc.
2011
Volume: 48
Issue: 4
Pages: 580 - 592
Abstract
Text categorization is a fundamental technique for mining massive amounts of textual data. The problem is high-dimensional, and most machine learning algorithms do not perform well when all the terms in the corpus are used. Feature selection is a pre-processing step that removes irrelevant and redundant terms from the corpus and increases the efficiency and effectiveness of the learning techniques. Categorizing documents in a language like English is more challenging because of phenomena such as polysemy and synonymy. Because writing styles differ, different words are used in documents to convey the same meaning. This weakens feature selection algorithms that are based on term frequency. In this research, the problem of synonymy and polysemy is addressed by representing documents as a bag of synsets rather than a bag of words. Because documents contain thousands of words, querying WordNet for the synonyms of each term increases the execution time of the algorithm. Hence, this paper parallelizes the sections of the feature selection algorithm that can be parallelized. The proposed parallel algorithms were implemented in a distributed environment built with the Hadoop distributed framework. Experiments were conducted with documents from the 20 Newsgroups dataset to study the advantages of the system. Features selected by the statistical techniques were analyzed using a simple naive Bayes classifier. Parallel execution of the algorithm was observed to decrease the execution time. The performance of the classifier increases with the bag of synsets representation and positive features. © EuroJournals Publishing, Inc. 2011.
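As a rough illustration of the bag of synsets idea described in the abstract, the following Python sketch maps each token to a synset identifier before counting, so that synonyms collapse into one feature. It is not the authors' implementation: the `TOY_SYNSETS` table is a hypothetical stand-in for a WordNet lookup, the synset identifiers are illustrative only, and no word sense disambiguation (which polysemy would require) is attempted.

```python
from collections import Counter

# Hypothetical toy lookup standing in for WordNet (an assumption for this
# sketch): each known word maps to a single illustrative synset identifier.
TOY_SYNSETS = {
    "car": "car.n.01",
    "automobile": "car.n.01",
    "auto": "car.n.01",
    "buy": "buy.v.01",
    "purchase": "buy.v.01",
}


def bag_of_words(tokens):
    """Count raw term frequencies."""
    return Counter(tokens)


def bag_of_synsets(tokens, synset_table=TOY_SYNSETS):
    """Count frequencies after replacing each token with its synset ID.

    Tokens missing from the table are kept unchanged as their own feature.
    """
    return Counter(synset_table.get(tok, tok) for tok in tokens)


doc = "buy a car or purchase an automobile".split()
words = bag_of_words(doc)
synsets = bag_of_synsets(doc)

print(words)
print(synsets)
```

In this toy document, "car" and "automobile" each count once as separate words but contribute to a single synset feature, which is the effect a frequency-based feature selection method would benefit from.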
About the journal
Journal: European Journal of Scientific Research
Publisher: EuroJournals, Inc.
ISSN: 1450-216X