An approach for Protein Secondary Structure prediction using prediction-based language models
L.D. Xavier,
Published in Institute of Electrical and Electronics Engineers Inc.
Prediction-based language models are a major concept in natural language processing (NLP); they learn from unstructured text data. Extracting insights from sequential data such as biological sequences is an important problem in genomics and proteomics, and classifying the secondary structure of proteins helps researchers understand protein function. This is considered an important preliminary step in drug development. Traditional techniques such as sequential models, probabilistic techniques, and statistical approaches have been widely applied to structure prediction, extracting insights from the amino acid sequence. However, handcrafted feature extraction is a tedious task that eventually leads to lower accuracy. Our novel approach creates vectors using word embeddings, which are assumed to capture the contextual information of amino acids, thereby improving the accuracy of secondary structure prediction. We consider this a promising solution to the secondary structure prediction problem. We propose a variation of the Continuous Bag of Words (CBOW) word-embedding method that retains the sequential information of all amino acids in the protein chain. The resulting vectors are used as input features for a deep neural network classifier, with the class labels Helix, Sheet, and Coil. We tested this NLP-based approach on the GenBank dataset. The analysis was run on infrastructure provided by Google Colab. © 2020 IEEE.
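As a rough illustration of the CBOW idea mentioned in the abstract, the sketch below treats each amino acid as a "word", builds (context, target) pairs from a sliding window, and averages the context embeddings to form the input vector CBOW would use to predict the target residue. It is a minimal sketch under stated assumptions, not the authors' method: the window size, the embedding dimension, the example sequence, and the names `cbow_pairs` and `context_vector` are all hypothetical, and the embeddings are random rather than trained. The abstract does not specify the CBOW variation or the classifier architecture.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def cbow_pairs(sequence, window=2):
    """Return (context, target) pairs: each residue paired with its neighbours.

    The window size is an assumption; the abstract does not state one.
    """
    pairs = []
    for i, target in enumerate(sequence):
        left = sequence[max(0, i - window):i]
        right = sequence[i + 1:i + 1 + window]
        context = list(left + right)
        if context:
            pairs.append((context, target))
    return pairs


def context_vector(context, embeddings):
    """CBOW input: the mean of the context residues' embedding vectors."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for aa in context:
        for k, value in enumerate(embeddings[aa]):
            total[k] += value
    return [t / len(context) for t in total]


# Hypothetical, untrained embeddings: random 4-dimensional vectors per residue.
rng = random.Random(0)
embeddings = {aa: [rng.uniform(-0.5, 0.5) for _ in range(4)] for aa in AMINO_ACIDS}

# Made-up example sequence, used only to show the shapes involved.
pairs = cbow_pairs("MKTAYIAK", window=2)
features = [context_vector(ctx, embeddings) for ctx, _ in pairs]
```

In a full pipeline, the embeddings would be trained so that each averaged context vector predicts its target residue, and the learned vectors would then be passed to a classifier over the Helix, Sheet, and Coil labels.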