Language identification from small text samples
Murthy K.N.,
Published in 2006
Volume: 13
Issue: 1
Pages: 57-80
Abstract
There is an increasing need to deal with multi-lingual documents today. If we could segment multi-lingual documents language-wise, it would be very useful both for exploration of linguistic phenomena, such as code-switching and code-mixing, and for computational processing of each segment as appropriate. Identification of language from a given small piece of text is therefore an important problem. This paper is about language identification from small text samples. In this paper, language identification is formulated as a generic machine learning problem: a supervised classification task in which features extracted from a training corpus are used for classification. Regression is a well-established technique for modelling and analysis, and it can also be used for classification. This paper gives a clear formulation of multiple linear regression for solving a two-class classification problem. Theoretical bases for verifying the adequacy of the model for the task and for analysing the significance of individual features are included. The method has been applied to pairwise language identification among several major Indian languages including Hindi, Bengali, Marathi, Punjabi, Oriya, Telugu, Tamil, Malayalam and Kannada. Some of these languages belong to the Indo-Aryan family while the others come from the Dravidian family of languages. Language identification has so far been a largely unexplored problem in the Indian context. Variations within and across language families have been explored. Variations with regard to sizes of test samples have also been explored. Performance is comparable to the best published results for other languages of the world. In most of the published work in language identification so far, bytes have been taken as the fundamental units of text. Indian scripts are primarily syllabic in nature, reflecting phonetic sound units in a more or less direct fashion. The fundamental units of writing are called aksharas.
One of the unique characteristics of Indian scripts is the concept of a script grammar. The script grammar, included in this paper, defines the set of valid aksharas. We hypothesize that aksharas are the more appropriate units of text in Indian languages, not characters or bytes. Our experimental results on language identification support this claim. © Taylor & Francis.
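The idea of using multiple linear regression as a two-class classifier can be illustrated with a minimal sketch. It is not the paper's implementation: the feature set (relative frequencies of two character bigrams), the toy Latin-script training strings, the omission of an intercept term, the small ridge term, and all function names are assumptions made for illustration. The paper argues for aksharas as units, and segmenting text into aksharas would require the script grammar, which is not reproduced here.

```python
from collections import Counter

# Hypothetical feature vocabulary: two character bigrams (a stand-in for aksharas).
VOCAB = ["th", "la"]

def features(text, vocab=VOCAB):
    """Relative frequency of each vocabulary bigram in the text."""
    grams = [text[i:i + 2] for i in range(len(text) - 1)]
    counts = Counter(grams)
    total = max(len(grams), 1)
    return [counts[g] / total for g in vocab]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit(samples, labels, ridge=1e-6):
    """Least-squares fit of y = w . x via the normal equations.

    Labels are +1 / -1. The tiny ridge term is an assumption added
    only to keep the system solvable on toy data.
    """
    xs = [features(s) for s in samples]
    d = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(x[i] * y for x, y in zip(xs, labels)) for i in range(d)]
    return solve(xtx, xty)

def predict(weights, text):
    """Classify by the sign of the regression output."""
    score = sum(w * f for w, f in zip(weights, features(text)))
    return 1 if score >= 0 else -1

# Toy training data: class +1 and class -1 are two made-up "languages".
train = ["the path", "this then", "la isla", "la sala"]
labels = [1, 1, -1, -1]
weights = fit(train, labels)

print(predict(weights, "that thing"))
print(predict(weights, "hola"))
```

The classifier thresholds the fitted regression output at zero, so each language pair needs its own fitted weight vector, in line with the pairwise setting described in the abstract.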
About the journal
Journal: Journal of Quantitative Linguistics
ISSN: 0929-6174
Open Access: No