DIGAP is a database of improved gene annotation in 28 plant pathogens. Features of the database are as follows.
Firstly, the 'hypothetical genes' in each genome are checked. Some 'hypothetical genes' have been recognized as non-coding open reading frames (ORFs) using the Z curve method [1]. Evidence from principal component analysis (PCA), average length distribution and COG functional category occupation indicates that the identified non-coding ORFs are very unlikely to encode proteins. The method used to identify these non-coding ORFs is described in the Documents section, and the identified non-coding ORFs are listed in the Statistics section.
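As a rough illustration of the kind of features the Z curve method works with, the sketch below computes the nine phase-specific Z curve components of an ORF (x, y and z for each of the three codon positions). It is only a minimal sketch: the function name is hypothetical, and the actual classification procedure used by DIGAP is the one described in the Documents section, which is not reproduced here.

```python
from collections import Counter


def z_curve_features(orf: str) -> list[float]:
    """Return the 9 phase-specific Z curve components of an ORF.

    For each codon position (phase 0, 1, 2) the base frequencies
    a, c, g, t are converted to:
        x = (a + g) - (c + t)   purine vs. pyrimidine
        y = (a + c) - (g + t)   amino vs. keto
        z = (a + t) - (g + c)   weak vs. strong hydrogen bonding
    """
    orf = orf.upper()
    features = []
    for phase in range(3):
        bases = orf[phase::3]          # every third base, starting at this phase
        n = len(bases) or 1            # avoid division by zero on empty input
        freq = Counter(bases)
        a, c, g, t = (freq[b] / n for b in "ACGT")
        features += [(a + g) - (c + t), (a + c) - (g + t), (a + t) - (g + c)]
    return features
```

Feature vectors of this kind can then be passed to a discriminant or to PCA to separate coding from non-coding ORFs.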
Secondly, the translation initiation sites (TISs) of all the protein-coding genes in the 28 phytopathogens have been refined based on NCBI RefSeq [2], the ProTISA database [3] and an ab initio gene start site prediction program, GS-Finder [4]. A joint-jury method is used to relocate TISs: if at least two of the three systems give the same TIS, it is predicted to be the true TIS; otherwise, the TIS provided by ProTISA is adopted. The TIS relocation information is listed in the Statistics section.
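The joint-jury rule can be sketched as a simple majority vote with a fallback. This is a minimal sketch, assuming each source reports the TIS of a gene as a single integer coordinate; the function and parameter names are hypothetical and do not come from DIGAP itself.

```python
from collections import Counter


def joint_jury_tis(refseq: int, protisa: int, gs_finder: int) -> int:
    """Pick a TIS coordinate for one gene by majority vote.

    If at least two of the three sources agree, their shared
    coordinate is returned; otherwise the ProTISA coordinate is used.
    """
    votes = Counter([refseq, protisa, gs_finder])
    position, count = votes.most_common(1)[0]
    return position if count >= 2 else protisa
```

For example, `joint_jury_tis(100, 130, 100)` returns 100 because RefSeq and GS-Finder agree, while `joint_jury_tis(100, 130, 160)` falls back to the ProTISA value, 130.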
Thirdly, potential functions of a large number of 'hypothetical genes' have been predicted using sequence alignment tools. In the Browse section, red lines indicate 'hypothetical genes' that have been assigned functions. When a user clicks a DIGAP_ID number, the predicted functional information and basic information for that gene are listed.
Fourthly, the theoretical gene expression indices CAI [5] and E(g) [6] are calculated to indicate gene expression levels. Some highly expressed genes are important for the survival of these plant pathogens. Predicted highly expressed genes are marked with '*'.
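For orientation, the Codon Adaptation Index of Sharp and Li [5] is the geometric mean of the relative adaptiveness values of the codons in a gene, where each value is the codon's count in a reference set of highly expressed genes divided by the count of the most frequent synonymous codon. The following is a minimal sketch under stated assumptions: it uses the standard genetic code, skips Met, Trp and stop codons, and replaces zero counts with a pseudocount of 0.5 (a common workaround, not necessarily what DIGAP does). All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq: str) -> list[str]:
    """Split a coding sequence into complete codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes, pseudocount=0.5):
    """Compute w = count / max synonymous count from a reference gene set."""
    counts = Counter(c for g in reference_genes for c in codons(g) if c in CODON_TABLE)
    weights = {}
    for aa in set(AMINO) - set("*MW"):   # skip stops and single-codon amino acids
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        adjusted = {c: counts[c] or pseudocount for c in family}  # assumption: pseudocount for zeros
        top = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / top
    return weights


def cai(gene: str, weights) -> float:
    """Geometric mean of the codon weights over the scored codons of a gene."""
    logs = [math.log(weights[c]) for c in codons(gene) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0
```

In practice the reference set would be a group of genes assumed to be highly expressed, such as ribosomal protein genes, and genes with CAI values close to 1 would be the candidates for high expression.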
Finally, the homologues of antibacterial drug targets contained in TTD [7] and DrugBank [8] are enumerated in DIGAP, and 3D structures of these potential targets have been modeled. Most of the PDB templates provide information on active sites and inhibitors, which is very helpful for the discovery of new bactericides.
Together, the refined information in DIGAP provides more accurate annotation for research on the lifestyle, metabolism and pathogenicity of these phytopathogens.
Citation
Gao, N., Chen, L.L., Ji, H.F., Wang, W., Chang, J.W., Gao, B., Zhang, L., Zhang, S.C. and Zhang, H.Y. (2010) DIGAP - a Database of Improved Gene Annotation for Phytopathogens. BMC Genomics, 11, 54.
References
[1] Zhang, C.T. and Zhang, R. (1991) Analysis of distribution of bases in the coding sequences by a diagrammatic technique. Nucleic Acids Res., 19, 6313-6317.
[2] Pruitt, K.D., Tatusova, T. and Maglott, D.R. (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 35, D61-D65.
[3] Hu, G.Q., Zheng, X., Yang, Y.F., Ortet, P., She, Z.S. and Zhu, H. (2008) ProTISA: a comprehensive resource for translation initiation site annotation in prokaryotic genomes. Nucleic Acids Res., 36, D114-D119.
[4] Ou, H.Y., Guo, F.B. and Zhang, C.T. (2004) GS-Finder: a program to find bacterial gene start sites with a self-training method. Int. J. Biochem. Cell Biol., 36, 535-544.
[5] Sharp, P.M. and Li, W.H. (1987) The Codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res., 15, 1281-1295.
[6] Karlin, S., Mrázek, J. and Campbell, A.M. (1998) Codon usages in different gene classes of the Escherichia coli genome. Mol. Microbiol., 29, 1341-1355.
[7] Chen, X., Ji, Z.L. and Chen, Y.Z. (2002) TTD: Therapeutic Target Database. Nucleic Acids Res., 30, 412-415.
[8] Wishart, D.S., Knox, C., Guo, A.C., Cheng, D., Shrivastava, S., Tzur, D., Gautam, B. and Hassanali, M. (2008) DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res., 36, D901-D906.