Introduction
GeneAlign is a coding exon prediction tool that predicts protein-coding genes by measuring the homology between a predicted sequence on one genome and related genes annotated on another genome. Identifying protein-coding genes is one of the most important tasks in analyzing newly sequenced genomes. With increasing numbers of experimentally verified gene annotations, it is feasible to identify genes in newly sequenced genomes by comparing them with the annotated genes of phylogenetically close organisms. GeneAlign applies CORAL, a heuristic linear-time alignment tool, to determine whether the regions flanked by candidate signals (initiation codon-GT, AG-GT, and AG-stop codon) are similar to annotated coding exons. Exploiting the conservation of gene structure and the sequence homology between protein-coding regions increases prediction accuracy.
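The candidate signals mentioned above can be illustrated with a minimal Python sketch. It is not GeneAlign's implementation and does not include CORAL: the function names, the `max_len` cutoff, and the choice to ignore reading frame are all assumptions made for illustration. The sketch only enumerates regions flanked by an initiation codon and GT, by AG and GT, or by AG and a stop codon; a real predictor would then score each region against annotated exons.

```python
import re

STOP_CODONS = ("TAA", "TAG", "TGA")

def find_positions(seq, motif):
    """Return every 0-based start index of motif in seq (overlaps allowed)."""
    return [m.start() for m in re.finditer(f"(?={motif})", seq)]

def candidate_regions(seq, max_len=300):
    """Enumerate (kind, start, end) candidate regions as half-open intervals.

    Illustrative only: reading frame is not checked, and max_len is a
    hypothetical cutoff that keeps the enumeration small.
    """
    seq = seq.upper()
    starts = find_positions(seq, "ATG")     # initiation codons
    donors = find_positions(seq, "GT")      # a region ends just before GT
    acceptors = find_positions(seq, "AG")   # a region begins just after AG
    stops = [p for codon in STOP_CODONS for p in find_positions(seq, codon)]

    regions = []
    for s in starts:
        # initiation codon ... GT
        regions += [("initial", s, d) for d in donors if s + 3 <= d <= s + max_len]
    for a in acceptors:
        begin = a + 2
        # AG ... GT
        regions += [("internal", begin, d) for d in donors
                    if begin < d <= begin + max_len]
        # AG ... stop codon (the stop codon is included in the region)
        regions += [("terminal", begin, p + 3) for p in stops
                    if begin <= p and p + 3 <= begin + max_len]
    return regions
```

Calling `candidate_regions("ATGAAAGTCCAGGCTTAA")` returns a list of tuples whose coordinates can be used to slice the sequence, for example `seq[start:end]`, before passing each slice to an alignment step.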
Accuracy Results
GeneAlign was tested on the Projector data set of 491 human-mouse homologous sequence pairs. At the gene level, both the average sensitivity and the average specificity of GeneAlign are 81%; at the exon level, both exceed 96%. The rates of missing exons and wrong exons are below 1%.
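Table: Prediction accuracy on the Projector data set.

| Prediction target | Program   | Gene level (%) | Exon level Sn (%) | Exon level Sp (%) | Exon level ME (%) | Exon level WE (%) |
|-------------------|-----------|----------------|-------------------|-------------------|-------------------|-------------------|
| Human             | GeneWise  | 61.91          | 92.56             | 93.60             | 1.50              | 0.32              |
| Human             | Projector | 51.32          | 93.78             | 86.99             | 0.88              | 8.59              |
| Human             | GeneAlign | 82.28          | 96.65             | 97.12             | 0.74              |                   |
| Mouse             | GeneWise  | 60.49          | 93.13             | 93.39             | 1.18              | 0.28              |
| Mouse             | Projector | 58.45          | 94.55             | 90.35             | 0.47              | 4.55              |
| Mouse             | GeneAlign | 79.23          | 96.63             | 96.39             | 0.49              | 0.58              |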
*Sensitivity (Sn) and specificity (Sp) are defined as Sn = TP/(TP+FN) and Sp = TP/(TP+FP), where TP, FN, and FP are the numbers of true positives, false negatives, and false positives, respectively. ME (missing exons) is the proportion of annotated exons not overlapped by any predicted exon, whereas WE (wrong exons) is the proportion of predicted exons not overlapped by any annotated exon.
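As an illustration of these four measures at the exon level, the following Python sketch computes them from two lists of exon intervals. It rests on assumptions not stated on this page: exons are half-open `(start, end)` intervals on the same sequence, and a predicted exon counts as a true positive only when both of its boundaries match an annotated exon exactly. The function names are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def exon_accuracy(annotated, predicted):
    """Return exon-level Sn, Sp, ME, and WE as fractions between 0 and 1.

    Assumption: a true positive is a predicted exon whose start and end
    both match an annotated exon exactly.
    """
    ann, pred = set(annotated), set(predicted)
    tp = len(ann & pred)   # exons predicted exactly
    fn = len(ann - pred)   # annotated exons not predicted exactly
    fp = len(pred - ann)   # predicted exons that match no annotation exactly

    # ME and WE use overlap, not exact boundary agreement.
    missing = [a for a in ann if not any(overlaps(a, p) for p in pred)]
    wrong = [p for p in pred if not any(overlaps(p, a) for a in ann)]

    return {
        "Sn": tp / (tp + fn) if ann else 0.0,
        "Sp": tp / (tp + fp) if pred else 0.0,
        "ME": len(missing) / len(ann) if ann else 0.0,
        "WE": len(wrong) / len(pred) if pred else 0.0,
    }
```

Note that an exon predicted with one wrong boundary lowers Sn and Sp but counts as neither missing nor wrong, which is why ME and WE can be far smaller than 1 - Sn and 1 - Sp.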
Data Sets
The test set is the Projector data set (Meyer and Durbin, 2004, http://www.sanger.ac.uk/Software/analysis/projector/), which contains 491 homologous gene pairs. The average number of exons per gene in the test set is 8.8. Of these gene pairs, 44% (216 out of 491) have the same number of coding exons and the same coding sequence length, 51% (249 out of 491) have the same number of exons but differ in coding sequence length, and 5% (26 out of 491) differ in the number of exons.
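The three categories can be expressed as a small Python sketch. The representation of a gene as a list of coding exon lengths, the function names, and the category labels are assumptions made for illustration; the counts used to check the percentages are the ones quoted above.

```python
def classify_pair(exon_lengths_a, exon_lengths_b):
    """Classify a homologous gene pair by exon count and total coding length.

    Each argument is a list of coding exon lengths (in bases) for one gene.
    """
    if len(exon_lengths_a) != len(exon_lengths_b):
        return "different exon number"
    if sum(exon_lengths_a) == sum(exon_lengths_b):
        return "same exon number, same coding length"
    return "same exon number, different coding length"

def percentages(counts):
    """Convert category counts to whole-number percentages of the total."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}

# Counts reported for the 491 gene pairs of the test set.
reported = {
    "same exon number, same coding length": 216,
    "same exon number, different coding length": 249,
    "different exon number": 26,
}
shares = percentages(reported)
```

Here `shares` holds the rounded percentage for each category, and the three counts sum to the 491 pairs of the test set.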
Test Set: Projector test set (fasta) and annotations (GTF)
Predictions: GeneAlign (GFF), Projector (GTF), GeneWise (GTF)
References
S.J. Hsieh, C.Y. Lin, Y.S. Chung, and C.Y. Tang, “Comparative Exon Prediction based on Heuristic Coding Region Alignment”, Proceedings of the International Symposium on Parallel Architectures, Algorithms, and Networks, 2005, pp. 14-19.
Please contact Shu Ju Hsieh with any questions.