Abstract
Background: DNase I hypersensitive sites (DHSs) are important markers of DNA regulatory regions. Identifying them in DNA sequences is significant for both biomedical research and the discovery of new drugs. The existing experimental methods for doing so, however, are time-consuming and laborious, so new computational approaches are called for.
Method: To this end, a novel predictive model, called iDHSs-PseTNC, was constructed by integrating sequence-order information and the physicochemical properties of trinucleotides into the pseudo trinucleotide composition (PseTNC). In the model, a deep sparse auto-encoder was used to reconstruct the input and thereby learn a good representation of the input features, and a softmax classifier was added on top of the auto-encoder's coding layer. The deep sparse auto-encoder model obtained the best classification result, with every member of the training set correctly classified. Five-fold cross-validation results indicated that the new predictor remarkably outperformed the existing prediction methods for the same purpose.
Results: The accuracy (ACC) of iDHSs-PseTNC is slightly (0.3%) lower than that of iDHS-EL, constructed by Liu et al., whereas its Matthews correlation coefficient (MCC) is 3.45% higher. Among the existing predictors, iDHSs-PseTNC also achieves the highest success rates in both Pt and Py. To help experimental researchers obtain the results they need directly, an easy-to-use web server for identifying DHSs has been established with free access at http://www.jcibioinfo.cn/iDHSs-PseTNC, which allows fast and accurate computation.
Conclusion: The timely identification of DHSs in DNA sequences is significant for the in-depth study of DNA function and the development of new drugs. In this article, we proposed a novel method for predicting the DHSs of DNA by incorporating the physicochemical properties of trinucleotides into the pseudo trinucleotide composition and classifying the resulting features with a deep sparse auto-encoder. The results are promising enough for the predictor to be considered as an analytic tool for further genomic problems.
Keywords: DNase I hypersensitive sites, pseudo trinucleotide composition, deep sparse auto-encoder, web server, five-fold cross-validation, DNA sequence.
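
To make the feature encoding named in the Method paragraph more concrete, the following is a minimal sketch of a pseudo trinucleotide composition vector. It assumes the general pseudo k-tuple nucleotide composition form with k = 3 (64 trinucleotide frequencies followed by lam sequence-order correlation factors); the exact formulation used by iDHSs-PseTNC may differ. The function names `toy_properties` and `pse_tnc`, the parameters `lam` and `w`, and the two stand-in properties (GC fraction and purine fraction) are illustrative assumptions, not the physicochemical property set used in the paper.

```python
from itertools import product

# All 64 trinucleotides in a fixed order (AAA, AAC, ..., TTT).
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def toy_properties(tri):
    """Stand-in property vector for one trinucleotide (an assumption).

    The real model uses measured physicochemical properties; here two
    easily computed quantities are used purely for illustration.
    """
    gc = sum(b in "GC" for b in tri) / 3.0
    purine = sum(b in "AG" for b in tri) / 3.0
    return (gc, purine)


def pse_tnc(seq, lam=2, w=0.5):
    """Return a (64 + lam)-dimensional PseTNC-style feature vector.

    lam: number of sequence-order correlation tiers (hypothetical default).
    w:   weight of the correlation factors (hypothetical default).
    """
    seq = seq.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G, T")
    # Overlapping trinucleotides along the sequence.
    tris = [seq[i:i + 3] for i in range(len(seq) - 2)]
    n = len(tris)
    if n == 0 or lam >= n:
        raise ValueError("sequence too short for the requested lam")

    # Conventional trinucleotide composition: normalized occurrence counts.
    freqs = [tris.count(t) / n for t in TRINUCLEOTIDES]

    # Tier-j correlation: mean squared property difference between
    # trinucleotides that are j positions apart.
    thetas = []
    for j in range(1, lam + 1):
        total = 0.0
        for i in range(n - j):
            p = toy_properties(tris[i])
            q = toy_properties(tris[i + j])
            total += sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)
        thetas.append(total / (n - j))

    # Joint normalization so that the whole vector sums to 1.
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```

Calling `pse_tnc("ACGTACGTAGCTAGCTA", lam=2, w=0.5)` yields a 66-dimensional vector whose entries sum to 1. A vector of this kind would then serve as the input to the auto-encoder and softmax classifier described in the Method paragraph.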