Abstract
With the rapid growth in the number of protein sequences in the post-genomic age, an automated and accurate tool for predicting protein subcellular localization is becoming increasingly important. Many approaches have been proposed. Most of them aim to find the optimal classification scheme, and few consider reducing the complexity of the biological system. This work shows how to reduce that complexity with linear DR (Dimensionality Reduction) methods, which transform the original high-dimensional feature vectors into low-dimensional ones. A sequence encoding scheme that fuses the PSSM (Position-Specific Scoring Matrix) with Chou's PseAA (Pseudo Amino Acid) composition is proposed to represent the protein samples. The K-NN (K-Nearest Neighbor) classifier is then employed to identify subcellular localization from the reduced low-dimensional feature vectors. The experimental results are encouraging and indicate that linear DR methods are promising for complicated biological problems such as predicting the subcellular localization of Gram-negative bacterial proteins.
Keywords: Subcellular localization, PSSM, PseAAC, Linear dimensionality reduction, PCA, LDA