Several recent papers have proposed new gene-detection algorithms that find protein-coding regions without requiring a learning dataset of already known genes. The fact that unsupervised gene detection is possible is closely connected to the existence of a cluster structure in oligomer frequency distributions. In this paper we study the cluster structure of several genomes in the space of their triplet frequencies, using a pure data exploration strategy. Several complete genomic sequences were analyzed by visualizing tables of triplet frequencies computed in a sliding window. The distribution of 64-dimensional vectors of triplet frequencies displays a well-detectable cluster structure. The structure was found to consist of seven clusters: six correspond to protein-coding information in the three possible phases on each of the two complementary strands, and one corresponds to the non-coding regions. The correspondence holds with high accuracy (higher than 90% at the nucleotide level). Visualizing and understanding this structure makes it possible to analyze effectively the performance of different gene-prediction tools. Since the method does not require extraction of ORFs, it can be applied even to unassembled genomes.

For LaTeX users:

@article{ANGorban2003-3,
  author  = {A. N. Gorban and A. Y. Zinovyev and T. G. Popova},
  title   = {Seven clusters in genomic triplet distributions},
  journal = {In Silico Biology},
  volume  = {3},
  pages   = {0039},
  year    = {2003}
}
\bibitem{ANGorban2003-3} A.N. Gorban, A.Y. Zinovyev, T.G. Popova, Seven clusters in genomic triplet distributions, In Silico Biology {\bf 3} (2003) 0039.
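
The sliding-window triplet frequencies mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the window width and step defaults, the counting of non-overlapping triplets from the window start, and the skipping of triplets with symbols outside A, C, G, T are all assumptions made for illustration.

```python
from itertools import product

# All 64 triplets over the DNA alphabet, in a fixed lexicographic order.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def triplet_frequencies(fragment):
    """Return a 64-dimensional vector of non-overlapping triplet frequencies.

    Triplets are read from the first position of the fragment (assumption).
    """
    counts = [0] * 64
    for i in range(0, len(fragment) - 2, 3):
        j = INDEX.get(fragment[i:i + 3])
        if j is not None:  # skip triplets containing N or other symbols
            counts[j] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 64
    return [c / total for c in counts]


def sliding_window_vectors(sequence, width=300, step=12):
    """Return (start, frequency_vector) pairs for each window position.

    width and step are hypothetical defaults, not values from the paper.
    """
    sequence = sequence.upper()
    return [
        (start, triplet_frequencies(sequence[start:start + width]))
        for start in range(0, len(sequence) - width + 1, step)
    ]
```

Each window yields one point in the 64-dimensional space. Because triplets are counted from the window start, a window shifted by one or two nucleotides reads a coding region in a different phase, which is one way to see why separate clusters for the three phases can arise.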