

 Introduction

GeneAlign is a coding exon prediction tool that predicts protein-coding genes by measuring the homology between a predicted sequence on one genome and related genes annotated on another genome. Identifying protein-coding genes is one of the most important tasks in analyzing newly sequenced genomes. As the number of experimentally verified gene annotations grows, it becomes feasible to identify genes in a newly sequenced genome by comparison with the annotated genes of phylogenetically close organisms. GeneAlign applies CORAL, a heuristic linear-time alignment tool, to determine whether the regions flanked by candidate signals (initiation codon-GT, AG-GT and AG-stop codon) are similar to annotated coding exons. Exploiting both the conservation of gene structure and the sequence homology between protein-coding regions increases prediction accuracy.
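The idea of "regions flanked by candidate signals" can be illustrated with a minimal Python sketch. It is not GeneAlign's or CORAL's actual code: the function names `find_all` and `candidate_regions`, the length limits, and the coordinate convention are assumptions made for illustration. The sketch only enumerates candidate regions; it ignores reading frame and does not perform the similarity comparison against annotated exons.

```python
import re

# Stop codons of the standard genetic code.
STOP_CODONS = ("TAA", "TAG", "TGA")

def find_all(seq, motif):
    """Return every 0-based start position of motif in seq (overlaps allowed)."""
    return [m.start() for m in re.finditer(f"(?={motif})", seq)]

def candidate_regions(seq, min_len=1, max_len=500):
    """Enumerate candidate exon regions as half-open (start, end) intervals.

    Hypothetical helper, assuming an uppercase DNA string:
      initial  : starts at an ATG, ends just before a GT donor signal
      internal : starts just after an AG acceptor, ends just before a GT
      terminal : starts just after an AG acceptor, ends after a stop codon
    Reading frame is NOT checked in this sketch.
    """
    initiations = find_all(seq, "ATG")                 # region includes the ATG
    acceptors = [p + 2 for p in find_all(seq, "AG")]   # first base after AG
    donors = find_all(seq, "GT")                       # region ends before GT
    stops = sorted(p + 3 for c in STOP_CODONS for p in find_all(seq, c))

    def pair(lefts, rights):
        # Keep every (left, right) combination within the length limits.
        return [(l, r) for l in lefts for r in rights
                if min_len <= r - l <= max_len]

    return {
        "initial": pair(initiations, donors),
        "internal": pair(acceptors, donors),
        "terminal": pair(acceptors, stops),
    }

regions = candidate_regions("CCAGATGAAAGTCC")
```

In a real predictor each candidate would then be scored against the annotated exons of the related genome, and only candidates consistent with the conserved gene structure would be kept.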

Accuracy Results

GeneAlign was tested on the Projector data set of 491 human-mouse homologous sequence pairs. At the gene level, the average sensitivity and the average specificity of GeneAlign are both 81%; at the exon level, both exceed 96%. The rates of missing exons and wrong exons are both below 1%.

Table: Prediction accuracy on the Projector data set.

Program       Gene level (%)      Exon level (%)
              Sn       Sp         Sn       Sp       ME      WE

Human gene prediction
GeneWise      61.91    61.91      92.56    93.60    1.50    0.32
Projector     51.32    51.32      93.78    86.99    0.88    8.59
GeneAlign     82.28    82.28      96.65    97.12    0.74    0.32

Mouse gene prediction
GeneWise      60.49    60.49      93.13    93.39    1.18    0.28
Projector     58.45    58.45      94.55    90.35    0.47    4.55
GeneAlign     79.23    79.23      96.63    96.39    0.49    0.58

*Sensitivity (Sn) and specificity (Sp) are defined as Sn = TP/(TP+FN) and Sp = TP/(TP+FP), where TP, FN and FP are the numbers of true positives, false negatives and false positives. ME (missing exons) is the proportion of annotated exons not overlapped by any predicted exon, whereas WE (wrong exons) is the proportion of predicted exons not overlapped by any annotated exon.
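The four exon-level measures can be computed as in the following Python sketch. It assumes, as a labeled assumption not stated on this page, that a true positive is a predicted exon whose two boundaries exactly match an annotated exon. The names `overlaps` and `exon_level_accuracy` are hypothetical, exons are half-open (start, end) tuples, and both input lists are assumed to be non-empty.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def exon_level_accuracy(annotated, predicted):
    """Return Sn, Sp, ME and WE as fractions between 0 and 1.

    Assumption: TP counts predicted exons that match an annotated exon
    exactly; both lists must be non-empty.
    """
    ann, pred = set(annotated), set(predicted)
    tp = len(ann & pred)      # exactly matching exons
    fn = len(ann) - tp        # annotated exons not predicted exactly
    fp = len(pred) - tp       # predicted exons not matching exactly
    missing = sum(1 for a in ann if not any(overlaps(a, p) for p in pred))
    wrong = sum(1 for p in pred if not any(overlaps(p, a) for a in ann))
    return {
        "Sn": tp / (tp + fn),
        "Sp": tp / (tp + fp),
        "ME": missing / len(ann),
        "WE": wrong / len(pred),
    }

# Toy example: one exact match, one partial overlap, one spurious prediction.
annotated = [(0, 10), (20, 30), (40, 50), (60, 70)]
predicted = [(0, 10), (20, 28), (80, 90)]
scores = exon_level_accuracy(annotated, predicted)
```

In this toy example the partially overlapping prediction (20, 28) lowers Sn and Sp, but it counts as neither a missing exon nor a wrong exon, because ME and WE depend only on overlap.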

Data Sets

The test set is the Projector data set (Meyer and Durbin, 2004, http://www.sanger.ac.uk/Software/analysis/projector/), which contains 491 homologous gene pairs. Genes in the test set have 8.8 exons on average. 44% of the gene pairs (216 out of 491) have the same number of coding exons and the same coding sequence length. 51% (249 out of 491) have the same number of exons but differ in coding sequence length. The remaining 5% (26 out of 491) differ in the number of exons.
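As a quick consistency check, the three categories above should add up to the full set of 491 pairs, and the quoted percentages should follow from the counts. The short Python sketch below redoes that arithmetic; the category labels and variable names are chosen here for illustration.

```python
total_pairs = 491

# Counts taken from the paragraph above; labels are illustrative.
categories = {
    "same exon count, same CDS length": 216,
    "same exon count, different CDS length": 249,
    "different exon count": 26,
}

# The three categories partition the data set.
covered = sum(categories.values())

# Percentages rounded to the nearest integer, as quoted in the text.
percentages = {name: round(100 * n / total_pairs)
               for name, n in categories.items()}
```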

References

 


Please contact Shu Ju Hsieh with any questions.