CNNSplice: Robust Models for Splice Site Prediction Using Convolutional Neural Networks

We present CNNSplice, a set of deep convolutional neural network models for splice site (SS) prediction. Using five-fold cross-validation for model selection, we explore several models based on typical machine learning applications and propose five high-performing models that efficiently predict true and false splice sites in balanced and imbalanced datasets. Our evaluation results indicate that CNNSplice's models outperform existing methods across datasets from five organisms. In addition, our generality test shows that CNNSplice's models can predict and annotate splice sites in new or poorly trained genome datasets, indicating a broad application spectrum. Compared to existing splice site prediction tools, CNNSplice demonstrates improved prediction, interpretability, and generalizability on genomic datasets. CNNSplice is available as open-source software on GitHub and as a containerized application that can be run on any platform.

One-Hot Encoding and Parameter Tuning

In this study, we used one-hot encoding, a collection of binary integer variables, to represent the input genomic nucleotide bases. Specifically, we assigned [1 0 0 0], [0 1 0 0], [0 0 1 0], and [0 0 0 1] to Adenine (A), Cytosine (C), Guanine (G), and Thymine (T), respectively, so that each nucleotide is identified by the vector index that holds the value 1. Invalid nucleotide letters (N) were represented by [0 0 0 0]. The CNN architecture therefore receives a Z × 4 input matrix, where Z denotes the genomic sequence length and 4 the number of nucleotide types (A, C, G, T).
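The encoding described above can be sketched in a few lines of Python. This is an illustrative sketch, not the CNNSplice implementation: the names `ONE_HOT`, `UNKNOWN`, and `one_hot_encode` are hypothetical, and the handling of lowercase letters and of characters other than A, C, G, T, and N is an assumption.

```python
# Mapping taken from the text; the names used here are hypothetical.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}
UNKNOWN = [0, 0, 0, 0]  # encoding for the invalid letter N


def one_hot_encode(sequence):
    """Return a Z x 4 matrix (a list of Z rows) for a nucleotide string."""
    # Assumption: lowercase input is treated like uppercase, and any
    # character other than A, C, G, or T is encoded the same way as N.
    return [list(ONE_HOT.get(base, UNKNOWN)) for base in sequence.upper()]


matrix = one_hot_encode("ACGTN")
print(len(matrix), len(matrix[0]))  # Z rows, 4 columns
```

The resulting list of rows can be converted to whatever array type the chosen deep learning framework expects before it is passed to the network.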

Genomic Datasets Sequence Length Selection

To keep sequence lengths balanced and to allow a fair comparison with the baseline methods, we used a sequence length of 400, covering sequence positions 0-399. Thus, the acceptor and donor input regions used by all algorithms, across all species datasets in this study, have a length of 400 nucleotide bases.
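As a minimal sketch of this selection step, the snippet below keeps positions 0-399 of an input sequence. The names `SEQ_LEN` and `select_region` are hypothetical, and rejecting sequences shorter than 400 bases is an assumption, since the text does not say how such sequences are handled.

```python
SEQ_LEN = 400  # positions 0-399, as stated in the text


def select_region(sequence, length=SEQ_LEN):
    """Return the first `length` bases (positions 0 to length - 1)."""
    # Assumption: shorter sequences are rejected rather than padded;
    # the text does not specify this case.
    if len(sequence) < length:
        raise ValueError(
            f"sequence has {len(sequence)} bases, expected at least {length}"
        )
    return sequence[:length]


region = select_region("ACGT" * 150)  # a 600-base toy sequence
print(len(region))
```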