Periodicity of DNA in exons

Eskesen, Stephen T; Eskesen, Frank N; Kinghorn, Brian; Ruvinsky, Anatoly

doi:10.1186/1471-2199-5-12

Research article
Open access
Published: 18 August 2004

Periodicity of DNA in exons

Stephen T Eskesen¹,
Frank N Eskesen²,
Brian Kinghorn¹ &
…
Anatoly Ruvinsky¹

BMC Molecular Biology volume 5, Article number: 12 (2004) Cite this article

8916 Accesses
22 Citations
Metrics details

Abstract

Background

The periodic pattern of DNA in exons is a known phenomenon. It was suggested that one of the initial causes of periodicity could be the universal (RNY)_npattern (R = A or G, Y = C or U, N = any base) of ancient RNA. Two major questions were addressed in this paper. Firstly, the cause of DNA periodicity, which was investigated by comparisons between real and simulated coding sequences. Secondly, quantification of DNA periodicity was made using an evolutionary algorithm, which was not previously used for such purposes.

Results

We have shown that simulated coding sequences, which were composed using codon usage frequencies only, demonstrate DNA periodicity very similar to the observed in real exons. It was also found that DNA periodicity disappears in the simulated sequences, when the frequencies of codons become equal.

Frequencies of the nucleotides (and the dinucleotide AG) at each location along phase 0 exons were calculated for C. elegans, D. melanogaster and H. sapiens. Two models were used to fit these data, with the key objective of describing periodicity. Both of the models showed that the best-fit curves closely matched the actual data points. The first dynamic period determination model consistently generated a value, which was very close to the period equal to 3 nucleotides. The second fixed period model, as expected, kept the period exactly equal to 3 and did not detract from its goodness of fit.

Conclusions

Conclusion can be drawn that DNA periodicity in exons is determined by codon usage frequencies. It is essential to differentiate between DNA periodicity itself, and the length of the period equal to 3. Periodicity itself is a result of certain combinations of codons with different frequencies typical for a species. The length of period equal to 3, instead, is caused by the triplet nature of genetic code. The models and evolutionary algorithm used for characterising DNA periodicity are proven to be an effective tool for describing the periodicity pattern in a species, when a number of exons in the same phase are analysed.

Background

Periodicity of DNA in exons, with the period being equal to 3 nucleotides, has been well known for some time [1–6]. This periodicity reflects correlations between nucleotide positions along coding sequences [7], which is caused by the asymmetry in base composition at the three coding positions [8]. This periodicity has also been suggested as a reading-frame monitoring device during translation, due to interrupted periodic patterns matching with frame shifts downstream where the periodic pattern returns [9].

The triplet code has undergone evolution itself, from the earliest form of the triplet code to what exists today. The universal DNA periodicity observed in exons suggests a (RNY)_npattern (R = A or G, Y = C or U, N = any base), which probably was inherited from the earliest mRNA sequences [10, 11].

In this study comparisons between real and simulated coding sequences were used in attempt to better understand the cause of the DNA periodicity. The only data used by the simulation program were codon usage frequencies from real species. Thus the simulated coding sequences had frequencies of codons very similar to real species. The major difference, however, was a random position of codons in simulated sequences.

The periodicity of exons, as well as other coding statistics can be an additional tool for exon prediction programs [7]. The distance between two types of nucleotides is counted, and a period is determined by the distance between the similar frequencies. For example, if there is a nucleotide A at one point in a sequence, and other A's are more common when there are 2, 5, 8 and so on nucleotides between them, a period of three can be determined [7]. Additional methods of finding periodicity include Fourier analysis [12], the length shuffle Fourier transform algorithm [13], autocorrelation functions [14] and distance analysis [15]. We applied two models of an evolutionary algorithm [16, 17] to quantify DNA periodicity. Thus the second objective of this study was investigation of a new method for quantification of DNA periodicity.

Results

Periodicity of DNA in exons

As mentioned in the Background, periodic 3-nucleotide pattern has been known for eukaryotic exons for some time. We studied a question whether DNA periodicity similar to that observed in exons can be simulated in computer experiments utilising codon usage frequencies (CUF) of real species as the only source of information. The computer program GENERATE, which was used in these experiments, composed artificial coding sequences using CUF of several species as the only source of information. Thus despite random choice the frequencies of codons in simulated sequences were very similar to the real CUF.

As an example of these experiments, Figure 1A shows distribution of Adenine nucleotides in real Drosophila melanogaster exons (phase 0) and in simulated (Figure 1B) coding sequences (phase 0) created by GENERATE using D. melanogaster CUF. To avoid any significant influences of splicing signals, D. melanogaster exons aligned at the 5' end start from the 10^th nucleotide (4^th codon). In the simulated coding sequences periodicity is also highly pronounced and the periodicity patterns observed in D. melanogaster exons and simulated sequences are nearly identical (Figure 1A &1B). Other studied species C. elegans and H. sapiens despite significant differences in AT and GC content also show high similarity in the periodicity pattern between exons and simulated sequences (data are not shown). Periodicity of other nucleotides was also observed and it shown high similarity in both DNA of real exons and simulated sequences (data are not shown). The obvious conclusion following from this study is that CUF, which was the only source of information for the simulated coding sequences, is the crucial factor determining periodicity.

DNA periodicity in simulated coding sequences was dramatically reduced in the experiments where the frequencies of all non-stop codons were made equal (Figure 1C). This observation strongly supports the conclusion that codon usage frequencies determine DNA periodicity in exons. A very light periodicity of Adenine and Thymine (data are not shown) was caused by the fact that 3 stop codons and the corresponding combinations of nucleotides were present in the simulated coding sequences in different and much lesser frequencies than other codons. Cytosine, which is not a component of any stop-codon, does not show periodic pattern at all because frequencies of Cytosine containing codons were equal to frequencies of all other non stop codons. Finally, when frequencies of all codons, including stop codons, were made equal, no periodicity was observed in the simulated sequences (Figure 1D).

Thus the computer simulations lead to a firm conclusion that 3 nucleotide periodicity observed in DNA of exons is determined by codon usage frequencies. The triplet nature of genetic code is rather responsible for the length of the period but not periodicity itself, as some people might think.

Quantification of DNA periodicity using an evolutionary algorithm

Data sets on frequency of the nucleotides and all 16 dinucleotides at each location were constructed for Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens phase 0 exons. The frequencies of Adenine and dinucleotide pair AG are shown in this paper as an example. Two models were used to fit to these data, with the key objective of describing periodicity in the data.

Inspection of Figures 2 to 5 shows that periodicity is quite apparent, and that the frequencies between peak and trough frequencies are generally consistent, but sometimes trending in value. Two models were used to accommodate this shifting pattern. The first model is:

where

is predicted frequency at nucleotide position i, and b₁ to b₅ are parameters to be estimated from data. Because of irregularities in frequencies close to the 5' end of exons, nucleotide position 20 was take as position i = 1.

The component b₁ + b₂i fits an overall linear trend in frequencies, independently from finer-scale periodicity. For some data sets it could be useful to include a quadratic term for i.

Parameter b₃ gives the amplitude of periodic waves. As 2π radians describes a full cycle, periodicity is given by parameter b₅. However, if b₅ differs from exactly 3, this generates a shift in phase that is linear with nucleotide position. This shift combines with static phase shift parameter b₄ to fit the relative frequencies in adjacent groups of three locations. This is not an ideal model, in that parameter b₅ does not cleanly describe periodicity, but it proves to work well in practice.

The second model is:

where the value of Offset depends on nucleotide position I as follows:

if i < b4 Then Offset = b6

else if i < b4 + b5 Then Offset = b7

else Offset = b8

Thus, in this case, three regions are defined by parameters b₄ and b₅, and Offset is defined within region by parameters b₆ and b₇.

Model 2 fixes periodicity at 3 nucleotides, but allows for different patterns of relative frequency in regions chosen by the data.

The analysis task for both models is to find values of the b parameters that give a close fit between the real frequencies Y and the predicted frequencies . The criterion used for this was the sum of squared errors across nucleotide position:

Best-fitting b parameters were found using a form of evolutionary algorithm (Differential evolution, [17]) with modifications to improve robustness following [18].

To test if the above method could identify a quantifiable period in exons, several tests were performed. All phase 0 exons were extracted from the EID database, described in Material and Methods. This procedure dramatically enhanced the visible periodicity compared to exons of all three phases (Figure 1A.). Once the phase 0 exons were separated, they were aligned at the 5' end and data for all four single nucleotide frequencies were analysed using both methods. In addition to all single nucleotide frequencies, the dinucleotide frequency of AG was run through the analysis for both dynamic period determination best-fit curve, and the static period of 3 best-fit curve. This was done for the three studied species, C. elegans, D. melanogaster and H. sapiens for phase 0 exons. Several other dinucleotide pairs also shown clear periodicity patterns in exons (data are not shown). AG was used in this paper as an example. Introns were only run through the analysis for dynamic period determination and did not show clear and stable periodicity.

Phase 0 exons – dynamic period determination model (Model 1)

Curves of best-fit were created for the four separate nucleotides and the dinucleotide AG for C. elegans, D. melanogaster and H. sapiens phase 0 exons. These curves were created using model 1, in an attempt to find a periodicity within the given data. The first few positions of exons are frequently under different selection pressures, which do not always conform to the same pressures as the remainder of the exon. It was for this reason that the algorithm was run starting from position 20 to position 100 in exons.

Table 1, shows DNA periodicity in exons of C. elegans, D. melanogaster and H. sapiens as they were determined by the analysis. The amplitude of the periodicity was measured as the variation from the center-point of the sine curve. As mentioned earlier, the criterion value of goodness of fit is the sum of squared deviations between observed frequencies and frequencies predicted by the model with the prevailing parameters. This means that as the analysis runs through its generation cycles, it finds better fitting curves as it goes along and replaces the previous curve of best-fit. The criterion value is a reflection of this process, in that as it gets closer to zero, the closer the curve of best-fit represents actual data. The data show that in all cases, the determined period is very close to 3 in exons for all three species for all nucleotides and the dinucleotide pair AG. The range of periods goes from a low of 2.990391 in C. elegans nucleotide C, a difference from 3 is ~0.0096, to a maximum of 3.019349 in C. elegans dinucleotide pair AG, a difference from three of ~0.0193. The criterion values of goodness of fit are also very low for exons, with the largest among them at 0.0093573 in H. sapiens nucleotide C.

Table 1 Periods of best-fit curves in phase 0 aligned exons. Periods and criterion values of phase 0 C. elegans D. melanogaster and H. sapiens exons aligned at the 5' end. The four single nucleotide as well as a single dinucleotide pair, AG were studied.

Full size table

A comparison between the best-fit curves for nucleotide A and actual frequencies of nucleotide A in exons of the three studied species can be seen in Figure 2. Figure 3 shows a similar comparison for the dinucleotide pair AG. Best-fit curves for nucleotides C, G and T are not shown. The blue points on the graph represent data points predicted by fitting model 1 to actual data, represented by pink points. As can be seen in the graphs, the blue best-fit curves in both Figure 2 and Figure 3 both closely follow the pink line for actual data, which confirms that the best-fit line quite accurately portrays actual data.

Phase 0 exons – static period 3 model (Model 2)

The data were then fitted to model 2, keeping the period fixed at 3. With this algorithm, the ideal best-fit curve would remain nearly in the same pattern. Again phase 0 exons were used as sample data. Figure 4 shows the best-fit curve for nucleotide Adenine in C. elegans D. melanogaster and H. sapiens phase 0 exons with the period fixed at three. Figure 5 shows the best-fit curve for the dinucleotide pair AG in 0 exons in the same species. In both sets of graphs, pink points represent actual frequencies of nucleotides and the dinucleotide pair, AG, while the blue points represent the optimized curve of best-fit for these frequencies. It is clear from the graph that keeping the period fixed at exactly three in exons does not detract from the accuracy of the curve of best-fit. The curve of best-fit is remarkably similar to the actual data points.

Discussion

The fact of DNA periodicity in exons, as well as the lack of periodicity in introns, is known for some time [1–6]. "Such a periodic pattern reflects correlations between nucleotide positions along coding sequences (that is, the probability of finding a nucleotide at a given position in a coding sequences is not independent of the nucleotide occurring at some other even distant position). The correlations arise, in turn, because of the asymmetry in base composition at the three codon positions in coding sequences" [7, 8]. The simulation experiments described in the paper support such conclusion and provide a clear proof that frequency of codon usage is the key cause for DNA periodicity in exons (Figure 1). We have shown that simulations, which utilized only codon usage frequencies data, produced an exceptionally good match to periodicity observed in real exons. As soon as frequencies of all codons are set as equal, DNA periodicity in exons entirely disappears. It is reasonable to think that the asymmetry in base composition, studied by Guigó [7], might be caused by the codon usage frequency.

The results presented in this paper also demonstrate effectiveness of the evolutionary algorithm and the both models used to identify a periodicity pattern in exons. Although periods, which are seen in Table 1, are not precisely equal to 3, they are very close. This minor discrepancy is a result of the analysis compensating for slight changes in the pattern of frequencies over nucleotide position. When the period is fixed at exactly 3, and the program allows for change-over points where the curve of best-fit is adjusted to better suit the data, the curves of best-fit still closely match the actual data points (see Figures 4 and 5), revealing that the period of 3 is not simply coincidental when it is allowed to be determined by the program. As it can be seen in Figures 2,3,4,5 the amplitude of variation is much more narrow for C. elegans than for the two other species under consideration.

Introns do not show any specific period that can be determined by the analysis (results are not shown). Although the analysis does produce a period for each data set given, these periods are not consistent with each other, and the predictions do not fit the data well. As introns are not composed from codons, this is an additional indication supporting the conclusion that CUF determine periodicity pattern in exons.

Since only exons show a strong periodicity of three, this type of analysis can be in principle used as an additional component of exon finding tools. Such possibility was already considered [7]. Unfortunately, the methods discussed here being very effective in quantifying DNA periodicity in a set of many sequences, are not sensitive enough for a single sequence. Further modifications of the approach are necessary before it can be used in exon prediction programs.

Conclusions

Conclusion can be drawn that DNA periodicity in exons is determined by codon usage frequencies. It is essential to differentiate between DNA periodicity itself, and the length of the period equal to 3. Periodicity itself is a result of certain combinations of codons with different frequencies typical for a species. The length of period equal to 3, instead, is caused by the triplet nature of genetic code. The models and evolutionary algorithm used for characterising DNA periodicity are proven to be an effective tool for describing the periodicity pattern in a species, when a number of exons in the same phase are analysed.

Methods

Exon-Intron Database

Information relevant to C. elegans, D. melanogaster and H. sapiens, was extracted from the exon-intron database (EID), which was compiled in the W. Gilbert laboratory, Department of Molecular and Cellular Biology, Harvard University [18]. The database contains protein-coding intron-containing genes. From the version of the database that we used, the following data were extracted: C. elegans 14,836 genes and 98581 exons; D. melanogaster 13,361 genes and 58,801 exons, H. sapiens 7150 genes and 47908 exons.

exScan

This program calculates frequencies of nucleotides or any combination of nucleotides in a database. exScan align all exons in the database at either the 5' or 3' end. The program then searches the exons for given sequences and give a summary of the sequences found. exScan was used in order to obtain the frequencies for nucleotides and dinucleotide pairs in each position of exons aligned at the 5' end. A summary of the program's operation follows:

Command line for exScan selects the database to be used for searching.

The string(s) to be searched are also entered into the command line.

ExScan then aligns the exons by the 5' end and searches each exon for all matches to the search strings.

The output of the program will provide the number of matches for each search string entered at every position along the aligned exons.

These numbers are then converted into frequencies.

ExScan is written in the C++ programming language; its full description and the program itself are available upon request

GENERATE

The program simulates a required number of coding sequences, using as an input file CUF for a particular species and a generator of random numbers. Inclusion of stop-codon in a coding sequence terminates the gene. There are options, which allow establishing a minimal and a maximal length of genes as well a shape of gene length distribution. GENERATE was used in this study to show how CUF alone could create periodicity, even in randomly created sequences, which do not code for any real protein. The procedure when running GENERATE follows:

GENERATE accepts as input a file containing usage frequencies of all 64 codons. These codon usage frequencies for different species were taken from the database located at http://www.kazusa.or.jp/codon/. Thus, despite random choice the frequencies of codons in simulated sequences were very similar to the real CUF.

These frequencies are then used to construct a requested number of artificial genes, with the codons chosen randomly based on their frequencies.

Artificial genes all start with ATG, and will terminate once a stop codon is randomly chosen.

The artificial genes can then be used as a separate database for analysis with exScan as above.

GENERATE was written in the C++ programming language. Description of the program and program itself are available upon request.

Differential Evolution

The specific method used for fitting the two periodicity models was Differential Evolution (DE). As DE is a widely applicable method of general utility for optimization, the reader is directed to Storn and Price [13] for detailed description and example computer code. The concept is outlined here:

A population of candidate solutions is established. Each population member is constituted by a randomly sampled set of b parameters and is characterized by its fitness (its value on the prevailing objective function, the sum of squared errors across nucleotide position).

For each population member, a challenger is constructed. If this challenger has superior fitness, it will replace the population member in the next generation. A challenger is constructed as follows:

Three other population members are chosen at random. We can label these as a, b and c. Each parameter is then addressed in turn. With probability CR (CR = 0.4 was adopted) the parameter is simply taken from the population member that the challenger is challenging. Otherwise, a new parameter value is constructed as the value for member a plus F times the difference of the values for b and c. For this application, F = 0.4, except F = 1 every fourth generation and F = 2 every seventh generation, to help avoid local optima. In addition, mutation independent from differences between other solutions was invoked periodically, also to help avoid local optima.

Successful challengers replace their respective population members, and, together with surviving members constitute a new generation with higher mean fitness. The process continues over sufficient generations to achieve convergence close to an optimal solution, with the most fit solution being chosen.

References

Shepherd JC: Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification. Proc Natl Acad Sci. 1981, 78 (3): 1596-1600.
Article PubMed Central CAS PubMed Google Scholar
Shepherd JC: Periodic correlations in DNA sequences and evidence suggesting their evolutionary origin in a comma-less genetic code. Journal of Molecular Evolution. 1981, 17 (2): 94-102.
Article PubMed Google Scholar
Zhurkin VB: Periodicity in DNA primary structure is defined by secondary structure of the coded protein. Nucleic Acids Res. 1981, 9 (8): 1963-1971.
Article PubMed Central CAS PubMed Google Scholar
Bibb MJ, Findlay PR, Johnson MW: The relationship between base composition and codon usage in bacterial genes and its use for the simple and reliable identification of protein-coding sequences. Gene. 1984, 30 (1–3): 157-166. 10.1016/0378-1119(84)90116-1
Article CAS PubMed Google Scholar
Silverman BD, Linsker R: A measure of DNA periodicity. J Theor Biol. 1986, 118 (3): 295-300.
Article CAS PubMed Google Scholar
Baldi P, Brunak S, Chauvin Y, Engelbrecht J, Krogh A: Periodic sequence patterns in human exons. Proc Int Conf Intell Syst Mol Biol. 1995, 3: 30-38.
CAS PubMed Google Scholar
Guigó R: DNA composition, codon usage and exon prediction. 2000, Open "main.html" document., http://www.pdg.cnb.uam.es/cursos/FVi2001/GenomAna/GeneIdentification/SearchContent/
Google Scholar
Gutiérrez G, Oliver JL, Marin A: On the origin of the periodicity of three in protein coding DNA sequences. J Theor Biol. 1994, 167 (4): 413-414. 10.1006/jtbi.1994.1080
Article PubMed Google Scholar
Trifonov EN: Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16 S rRNA nucleotide sequences. J Mol Biol. 1987, 194 (4): 643-652.
Article CAS PubMed Google Scholar
Trifonov EN: Elucidating sequence codes: three codes for evolution. Ann NY Acad Sci. 1999, 870: 330-338.
Article CAS PubMed Google Scholar
Eigen M, Winkler-Oswatitsch R: Transfer-RNA: the early adaptor. Naturwissenschaften. 1981, 68 (5): 217-228.
Article CAS PubMed Google Scholar
Storn R, Price K: Differential evolution – A simple and efficient heuristic for global optimization over continuous spaces. J Global Optimization. 1997, 11: 341-359. 10.1023/A:1008202821328. 10.1023/A:1008202821328
Article Google Scholar
Piyasatian N, Kinghorn BP: Balancing genetic diversity, genetic merit and population viability in conservation programs. Journal of Animal Breeding and Genetics. 2003, 120: 137-149. 10.1046/j.1439-0388.2003.00383.x. 10.1046/j.1439-0388.2003.00383.x
Article Google Scholar
Tiwari S, Ramachandran S, Bhattacharya S, Ramaswamy R: Prediction of probable genes by Fourier analysis of genomic sequences. Computer Applications in the Biosciences. 1997, 13: 263-270.
CAS PubMed Google Scholar
Yan M, Lin Z, Zhang C: A new Fourier transform approach for protein coding measure based on the format of the Z curve. Bioinformatics. 1998, 14 (8): 685-690. 10.1093/bioinformatics/14.8.685
Article CAS PubMed Google Scholar
Arques DG, Lapayre JC, Michel CJ: Identification and simulation of shifted periodicities common to protein coding genes of eukaryotes, prokaryotes and viruses. J Theor Biol. 1995, 172 (3): 279-291. 10.1006/jtbi.1995.0024
Article CAS PubMed Google Scholar
Konopka AK, Smythers GW, Owens J, Maizel JV: Distance analysis helps to establish characteristic motifs in intron sequences. Gene Anal Tech. 1987, 4 (4): 63-74. 10.1016/0735-0651(87)90020-3
Article CAS PubMed Google Scholar
Saxonov S, Daizadeh I, Fedorov A, Gilbert W: EID: the Exon-Intron Database – an exhaustive database of protein-coding intron-containing genes. Nucleic Acids Res. 2000, 28: 185-190. 10.1093/nar/28.1.185
Article PubMed Central CAS PubMed Google Scholar

Download references

Author information

Authors and Affiliations

Institute of Genetics and Bioinformatics, University of New England, Armidale, NSW, Australia
Stephen T Eskesen, Brian Kinghorn & Anatoly Ruvinsky
Thomas J. Watson Research Center, Yorktown Heights, NY, USA
Frank N Eskesen

Authors

Stephen T Eskesen
View author publications
You can also search for this author in PubMed Google Scholar
Frank N Eskesen
View author publications
You can also search for this author in PubMed Google Scholar
Brian Kinghorn
View author publications
You can also search for this author in PubMed Google Scholar
Anatoly Ruvinsky
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Anatoly Ruvinsky.

Additional information

Competing interests

None declared.

Authors' contributions

SE conducted the data analysis and was involved in drafting the manuscript. FE designed and wrote the code for the C++ computer programs. BK assisted with the manuscript and wrote the code for modeling and fitting periodicity. AR drafted the manuscript, performed statistical analysis, conceived the study and participated in its design and coordination. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Eskesen, S.T., Eskesen, F.N., Kinghorn, B. et al. Periodicity of DNA in exons. BMC Molecular Biol 5, 12 (2004). https://doi.org/10.1186/1471-2199-5-12

Download citation

Received: 05 November 2003
Accepted: 18 August 2004
Published: 18 August 2004
DOI: https://doi.org/10.1186/1471-2199-5-12

Periodicity of DNA in exons