

  Examples of input sequences in FASTA format

  Examples of input annotation file in GFF/GTF format

  Examples for download and testing

  GeneAlign Output

  Debugging the Output 


Examples of input sequences in FASTA format

>seqname position strand_for_prediction
ctcttttaggctgttcttgtgttggccccctgcttgtggtgctattttgcatgccacaga
aactatggttcctaaagcaaggctcagacatcctgccatgtatcccaactgtgatggact
aatattttttcaaaacagcaaagcaaaaaactctc

Fields of the description line:

  1. seqname - The name of the sequence.
  2. position (optional) - Determines the part of the genome that the UCSC Genome Browser will initially open to, in chromosome:start-end format.
  3. strand_for_prediction (optional) - Valid entries include "forward", "reverse". The default is "forward".

Note: To view the predicted results in the UCSC Genome Browser, the sequences must be retrieved from the forward strand. If the sequence used for prediction is retrieved from the mouse genome, the "seqname" must start with "Mm".

Example 1:
>Hs.3.ENST00000302762.1  chr3:156826614-156854401  reverse
ctcttttaggctgttcttgtgttggccccctgcttgtggtgctattttgcatgccacaga
aactatggttcctaaagcaaggctcagacatcctgccatgtatcccaactgtgatggact
aatattttttcaaaacagcaaagcaaaaaactctc

Example 2:
>Hs.2.ENST00000302762.1  forward
ctcttttaggctgttcttgtgttggccccctgcttgtggtgctattttgcatgccacaga
aactatggttcctaaagcaaggctcagacatcctgccatgtatcccaactgtgatggact
aatattttttcaaaacagcaaagcaaaaaactctc

Example 3:
>Mm.1.ENST00000302762.1 
ctcttttaggctgttcttgtgttggccccctgcttgtggtgctattttgcatgccacaga
aactatggttcctaaagcaaggctcagacatcctgccatgtatcccaactgtgatggact
aatattttttcaaaacagcaaagcaaaaaactctc
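
The description-line layout shown in the three examples can be checked before submitting a sequence. The following is a minimal sketch, not GeneAlign's own parser: the function name `parse_description` and the regular expression for the position field are assumptions made for illustration, based only on the chromosome:start-end format and the "forward"/"reverse" keywords described above.

```python
import re

# Assumed pattern for the optional "position" field: chromosome:start-end
POSITION_RE = re.compile(r"^(\S+):(\d+)-(\d+)$")
VALID_STRANDS = {"forward", "reverse"}


def parse_description(line):
    """Split a FASTA description line into seqname, position, and strand.

    Hypothetical helper: optional fields are recognized by their shape,
    and the strand defaults to "forward" when it is not given.
    """
    if not line.startswith(">"):
        raise ValueError("description line must start with '>'")
    tokens = line[1:].split()
    if not tokens:
        raise ValueError("description line has no seqname")
    seqname, position, strand = tokens[0], None, "forward"
    for token in tokens[1:]:
        if token in VALID_STRANDS:
            strand = token
        elif POSITION_RE.match(token):
            position = token
        else:
            raise ValueError(f"unrecognized field: {token!r}")
    return {"seqname": seqname, "position": position, "strand": strand}


if __name__ == "__main__":
    print(parse_description(
        ">Hs.3.ENST00000302762.1  chr3:156826614-156854401  reverse"))
    print(parse_description(">Mm.1.ENST00000302762.1 "))
```

A header that carries only a seqname, as in Example 3, comes back with no position and the default "forward" strand.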

 

Examples of input annotation file in GFF/GTF format

The first eight fields of GFF (General Feature Format) and GTF (Gene Transfer Format) are identical. For more information on the GFF/GTF formats, refer to http://www.sanger.ac.uk/Software/formats/GFF and http://genes.cs.wustl.edu/GTF2.html.

Here is a brief description of the GFF/GTF fields:

  1. seqname - The name of the sequence.
  2. source - The program that generated this feature.
  3. feature - The name of this type of feature. The feature type of coding exons must be "CDS".
  4. start - The starting position of the feature in the sequence. The first base is numbered 1.
  5. end - The ending position of the feature.
  6. score - A score between 0 and 100, or '.' (for don't know/don't care).
  7. strand - Valid entries are '+' and '-'.
  8. frame - If the feature is a coding exon, frame should be 0, 1, or 2, representing the reading frame of the first base.
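
The eight fields listed above can be validated line by line before uploading an annotation file. This is a minimal sketch under stated assumptions, not part of GeneAlign: the names `Feature` and `parse_gff_line` are hypothetical, fields are split on any whitespace (tabs or spaces), and none of the first eight fields is assumed to contain a space.

```python
from collections import namedtuple

# Hypothetical record type holding the eight shared GFF/GTF fields.
Feature = namedtuple(
    "Feature", "seqname source feature start end score strand frame")


def parse_gff_line(line):
    """Parse one annotation line and apply the checks described above."""
    fields = line.split()
    if len(fields) < 8:
        raise ValueError("expected at least 8 fields, got %d" % len(fields))
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    start, end = int(start), int(end)
    # Positions are 1-based and start must not exceed end.
    if start < 1 or end < start:
        raise ValueError("invalid coordinates: %d-%d" % (start, end))
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Coding exons need a reading frame of 0, 1, or 2.
    if feature == "CDS" and frame not in ("0", "1", "2"):
        raise ValueError("CDS frame must be 0, 1, or 2")
    return Feature(seqname, source, feature, start, end, score, strand, frame)


if __name__ == "__main__":
    print(parse_gff_line("HSCKIIBE GeneAlign CDS 1634 1705 . + 0"))
```

Any extra columns after the eighth field, such as GTF attributes, are ignored by this sketch.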

Example 1:

HSCKIIBE GeneAlign CDS 1634 1705 . + 0
HSCKIIBE GeneAlign CDS 2672 2774 . + 0
HSCKIIBE GeneAlign CDS 3344 3459 . + 2
HSCKIIBE GeneAlign CDS 3906 3981 . + 0
HSCKIIBE GeneAlign CDS 4128 4317 . + 2
HSCKIIBE GeneAlign CDS 4645 4735 . + 1

Example 2:

Mm.4.ENSMUST30521.1 EnsEMBL exon 137143182 137143929 . - .
Mm.4.ENSMUST30521.1 EnsEMBL CDS 137143182 137143901 . - 0
Mm.4.ENSMUST30521.1 EnsEMBL start_codon 137143899 137143901 . - 0
Mm.4.ENSMUST30521.1 EnsEMBL exon 137131214 137131372 . - .
Mm.4.ENSMUST30521.1 EnsEMBL CDS 137131214 137131372 . - 0
Mm.4.ENSMUST30521.1 EnsEMBL exon 137130549 137131003 . - .
Mm.4.ENSMUST30521.1 EnsEMBL CDS 137130563 137131003 . - 0
Mm.4.ENSMUST30521.1 EnsEMBL stop_codon 137130560 137130562 . - 0
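
The frame column of consecutive CDS lines can be cross-checked against the exon lengths. The sketch below assumes the GTF2 convention, in which frame is the number of bases to skip before the first complete codon, so that the next frame is (3 - ((length - frame) mod 3)) mod 3. The frames in both examples above are consistent with that convention. The function name `expected_frames` is hypothetical, and the first CDS is assumed to start in frame 0.

```python
def expected_frames(cds, strand="+"):
    """Return the expected frame of each CDS in transcription order.

    cds    -- list of (start, end) pairs, 1-based and inclusive
    strand -- '+' or '-'; on '-' the CDS with the highest start comes first
    Assumes the first CDS begins with a complete codon (frame 0).
    """
    ordered = sorted(cds, reverse=(strand == "-"))
    frames = []
    frame = 0
    for start, end in ordered:
        frames.append(frame)
        length = end - start + 1
        # GTF2 rule for the frame of the following CDS.
        frame = (3 - (length - frame) % 3) % 3
    return frames


if __name__ == "__main__":
    example1 = [(1634, 1705), (2672, 2774), (3344, 3459),
                (3906, 3981), (4128, 4317), (4645, 4735)]
    print(expected_frames(example1))  # [0, 0, 2, 0, 2, 1]
```

A mismatch between the computed list and the frame column usually points to a wrong coordinate or a missing CDS line rather than to a wrong frame value.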

Examples for download and testing

 

Example 1 (multi-exon gene): human.fasta mouse.fasta m_anno.txt (Output)

Example 2 (multi-exon gene): human.fasta mouse.fasta m_anno.txt (Output)

Example 3 (multi-exon gene): human.fasta mouse.fasta m_anno.txt (Output)

Example 4 (include initial micro_exon): human.fasta mouse.fasta m_anno.txt (Output)

Example 5 (include internal micro_exon): human.fasta mouse.fasta m_anno.txt (Output)

Example 6 (include terminal micro_exon): human.fasta mouse.fasta m_anno.txt (Output)

Example 7 (single-exon gene): human.fasta mouse.fasta m_anno.txt (Output)

Example 8 (predict on mouse sequence): mouse.fasta human.fasta h_anno.txt (Output)
 


GeneAlign Output

GeneAlign creates the following files:

result.htm - includes the coding exon annotation for the first DNA sequence and links to the results

result.gff - a text file in GFF format with the coding exon annotation of the first DNA sequence

alignment.htm - coding exon alignments

browser.txt (optional) - a text file that can be loaded into the UCSC Genome Browser


Debugging the Output

  • Incorrect output on the UCSC Genome Browser:
      • The first sequence was not retrieved from the forward strand: The output will be reversed. Try retrieving the sequence from the "Get DNA in Window" page of the UCSC Genome Browser.
      • The first human sequence was not retrieved from the Human May 2004 (hg17) assembly: The output will be shifted.
      • The first mouse sequence was not retrieved from the Mouse Aug. 2005 (mm7) assembly: The output will be shifted.
      • The sequence is inconsistent with the CDS annotation: Try retrieving the sequence from the "Get DNA in Window" page of the UCSC Genome Browser.
      • Where to find the "Get DNA in Window" page: An example of the "Get DNA in Window" page is given in retrieve.pdf.
  • No significant alignment:
      • Genes on the reverse strand: Add the word "reverse" to the "strand_for_prediction" field of the description line for the first sequence and run GeneAlign again. The default strand for prediction is the forward strand.
      • Gene annotation and the second sequence are inconsistent.
      • Highly dissimilar gene structures between input sequences: GeneAlign currently requires similar gene structures in the input sequences; some exons will be missed if the structures are highly dissimilar.

     

    Please contact Shu Ju Hsieh with any questions.