SpringerOpen Newsletter

Receive periodic news and updates relating to SpringerOpen.

Open Access Research

Novel methodologies for spectral classification of exon and intron sequences

Hon Keung Kwan1*, Benjamin Y M Kwan2 and Jennifer Y Y Kwan3

Author Affiliations

1 Department of Electrical and Computer Engineering, University of Windsor, 401 Sunset Avenue, Windsor, ON N9B 3P4, Canada

2 Faculty of Medicine, University of Ottawa, 451 Smyth Road, Ottawa, ON K1H 8M5, Canada

3 School of Medicine, Queen's University, 80 Barrie Street, Kingston, ON K7L 3N6, Canada

For all author emails, please log on.

EURASIP Journal on Advances in Signal Processing 2012, 2012:50  doi:10.1186/1687-6180-2012-50


The electronic version of this article is the complete one and can be found online at: http://asp.eurasipjournals.com/content/2012/1/50


Received:6 October 2011
Revisions received:6 October 2011
Accepted:28 February 2012
Published:28 February 2012

© 2012 Kwan et al; licensee Springer.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Digital processing of a nucleotide sequence requires it to be mapped to a numerical sequence in which the choice of nucleotide to numeric mapping affects how well its biological properties can be preserved and reflected from nucleotide domain to numerical domain. Digital spectral analysis of nucleotide sequences unfolds a period-3 power spectral value which is more prominent in an exon sequence as compared to that of an intron sequence. The success of a period-3 based exon and intron classification depends on the choice of a threshold value. The main purposes of this article are to introduce novel codes for 1-sequence numerical representations for spectral analysis and compare them to existing codes to determine appropriate representation, and to introduce novel thresholding methods for more accurate period-3 based exon and intron classification of an unknown sequence. The main findings of this study are summarized as follows: Among sixteen 1-sequence numerical representations, the K-Quaternary Code I offers an attractive performance. A windowed 1-sequence numerical representation (with window length of 9, 15, and 24 bases) offers a possible speed gain over non-windowed 4-sequence Voss representation which increases as sequence length increases. A winner threshold value (chosen from the best among two defined threshold values and one other threshold value) offers a top precision for classifying an unknown sequence of specified fixed lengths. An interpolated winner threshold value applicable to an unknown and arbitrary length sequence can be estimated from the winner threshold values of fixed length sequences with a comparable performance. In general, precision increases as sequence length increases. The study contributes an effective spectral analysis of nucleotide sequences to better reveal embedded properties, and has potential applications in improved genome annotation.

Keywords:
DNA sequence; numerical representation; nucleotide to numeric mapping; exon and intron sequences; coding and non-coding sequences; threshold value; thresholding; exon and intron classification; period-3; spectral analysis; discrete Fourier transform; gene detection; genome annotation

1. Introduction

It is known that the coding regions of a nucleotide sequence exhibit a period-3 spectral property because of the presence of codon structure [1], and this property can be used to identify regions of interest. A nucleotide sequence is a discrete sequence consisting of nucleotides C, G, A, and T in which digital spectral analysis can be used to reveal its hidden periodicities, spectral features, and genome structure. Digital spectral analysis is usually carried out by the discrete Fourier transform (DFT) which is a digital signal processing (DSP) technique. Genomic signal processing (GSP) involves the processing and analysis of a digital signal in the form of a numerical sequence mapped from a nucleotide sequence [1,2]. A general description on the biological aspects of this article can be found in [1,2]. To perform digital spectral analysis, each nucleotide of a nucleotide sequence has to be converted to a numerical value through a mapping. Such a mapping is called numerical representation and its choice affects how well the biological properties of a nucleotide sequence can be revealed in numerical domain. A nucleotide sequence can be numerically represented in the form of R-sequence, a survey on mappings with R ≥ 1 is given in [3,4]. 1-sequence numerical representation (with R = 1) is the most compact form of mapping in which one nucleotide is mapped to one fixed numerical value to form a single sequence. 1-sequence numerical representations and their relative performances are studied in this article for unknown exon and intron sequence classification with an improved accuracy.

In this article, nine 1-sequence numerical representations (Codes 1-9) [5-16] are identified though a literature search, which can be grouped under positive-integer-value, real-value, and complex-value numerical representations. In addition, seven 1-sequence complex-value numerical representations (Codes 10-16) are derived. In this article, all these sixteen numerical representations and the Voss representation [17] are compared based on the genomes of twelve organisms (including the human).

Genome annotation involves a process of identifying the locations of the coding regions (and associated genes) in a genome. Coding regions exhibit the period-3 property in the spectral domain which is less apparent in sequences other than exon sequences, and can therefore be used to detect exon sequences and to distinguish exon regions from intron regions in genomic sequences. The performance depends on an appropriate choice of a period-3 threshold value to classify between an exon sequence and an intron sequence. In [17,18], the cumulative distributions of coding and non-coding sequences are used to determine a period-3 threshold value. In this article, two threshold value determination methods are defined. Given a set of exon and intron sequences of a fixed length, all the three thresholding values are used to determine the winner threshold value applicable to any nucleotide sequence of the same length. Also, given a number of exon and intron sequence sets of different lengths, the winner threshold values can be interpolated to yield an interpolated winner threshold value applicable to an unknown nucleotide sequence of an arbitrary length. Some commonly used symbols (or abbreviations) are explained in Table 1.

Table 1. List of symbols

2. Methods and results

2.1. Numerical representation

There are a number of numerical representations for nucleotide sequences. In this article, we shall focus on direct and simple numerical representations which satisfy the following requirements: (a) Single sequence mapping for a nucleotide sequence; (b) Fixed value mapping for each nucleotide; and (c) Accessible to digital signal processing analysis. Sixteen 1-sequence numerical representations (Codes 1-16) which satisfy these three requirements are defined in Table 2.

Table 2. List of sixteen numerical representation codes

It is known that exons are rich in nucleotides C and G and introns are rich in nucleotides A and T [8,19]. For ease of viewing, C-G and A-T are paired when defining each of the sixteen numerical representations. The first group of numerical representations consists of five positive-integer-value numerical representations (as Codes 1-5 in Table 2) which shall be described as follows: The Integer Number representation [5,6] is obtained by mapping numerals {1, 3, 2, 0} respectively to the four nucleotides as C = 1, G = 3, A = 2, and T = 0. The Single Galois Indicator representation [7,8] maps the CGAT nucleotides to a Galois field of four, GF(4), which is formed by assigning numerical values to the nucleotides as C = 1, G = 3, A = 0, and T = 2 in a nucleotide sequence. This representation suggests that C < G and A < T. In the Paired Nucleotide Atomic Number representation [9], the paired nucleotides are assigned with atomic numbers as A, G = 62 and C, T = 42 respectively. In the Atomic Number representation [9], a numerical sequence is formed by assigning atomic numbers to each nucleotide as C = 58, G = 78, A = 70, and T = 66 in a nucleotide sequence. The Molecular Mass representation [10] of a nucleotide sequence is formed by mapping the four nucleotides to their molecular masses as C = 110, G = 150, A = 134, and T = 125, respectively.

The second group of numerical representations consists of three real-value numerical representations (as Codes 6-8 in Table 2) which can be described as: The Electron-Ion Interaction Pseudo-potential (EIIP) represents the distribution of the free electrons energies along a nucleotide sequence. In the EIIP representation, a single EIIP indicator sequence [11,12] is formed by substituting the EIIP of the nucleotides as C = 0.1340, G = 0.0806, A = 0.1260, and T = 0.1335 in a nucleotide sequence. In the Paired Numeric representation [6,8], nucleotides (C-G, A-T) are to be paired in a complementary manner and values of -1 and +1 are to be used to denote, respectively, C-G and A-T nucleotide pairs. In the Real Number representation [6,8,13], the nucleotide mappings are C = 0.5, G = -0.5, A = -1.5, and T = 1.5, which bears complementary property. The third group consists of one complex-value numerical representation (as Code 9 in Table 2) called the Complex Number representation [2,5,6,8,14-16], it reflects the complementary nature of C-G and A-T pairs by mapping nucleotides as C = -1-j, G = -1+j, A = 1+j, and T = 1-j.

In addition to the above nine 1-sequence numerical representations, seven 1-sequence numerical representations (listed as Codes 10-16 in Table 2) are derived in which each nucleotide of a sequence is mapped to either a single real-value element (± 1) or a single imaginary-value element (± j). Each of the Codes 10-16, namely, the K-Twin-Pair Code, the K-Bipolar-Pair Codes I and II, and the K-Quaternary Codes I-IV has its equivalent numerical representations. We define a numerical representation R2 to be an equivalent numerical representation of R1 if R2 gives rise to the same power spectrum as that of R1. An equivalent numerical representation can be obtained by multiplying the numerical represented value of each of the four bases [C, G, A, T] of a nucleotide sequence by the same constant which includes any of -1, and ± j. In particular, a complementary numerical representation obtained by inverting the signs of all the four bases [C, G, A, T] in a numerical representation is an equivalent numerical representation. Other forms of equivalent numerical representation exist, for example, the K-Twin-Pair Code [C, G, A, T] = [-1, -1, j, j] has an equivalent form [C, G, A, T] = [-j, -j, 1, 1]. The K-Quaternary Code III (Code 15) [C, G, A, T] = [-j, -1, 1, j] is identical to the pentanary code in [20] (with [20] specifies that x[n] = 0 for an unknown nucleotide at position n) which was derived for the three-dimensional DNA walk for graphical representation of nucleotide sequences. The K-Bipolar-Pair Code II (Code 12) [C, G, A, T] = [-1, 1, -j, j] has an equivalent numerical representation of [C,G,A,T] = [-j, j, 1, -1] adopted in [21]. The K-Quaternary Code IV (Code 16) [C, G, A, T] = [-j, -1, j, 1] has a complementary numerical representation of [C, G, A, T] = [j, 1, -j, -1] mentioned in [22]. In [23], complex numerical representations with simultaneous non-zero real part and non-zero imaginary part were adopted which are different from those of the Codes 10-16 in which either a real value or an imaginary value is used for each nucleotide to numeric mapping. It should be noted that the numerical representations in [21-23] were formulated for sequence alignment but not for period-3 power spectral analysis.

The Voss representation (i.e., the Code 17) is a commonly used numerical representation for period-3 spectral classification of exon and intron sequences [1,2,17,24,25]. The Voss representation is a 4-sequence representation (R = 4) in which each of the four nucleotides of a nucleotide sequence is represented by a separate numerical sequence (as C-sequence, G-sequence, A-sequence, and T-sequence) such that the nth position of the C-sequence is coded by 1 if the nth nucleotide of the sequence is C, otherwise it is coded by 0. In a similar manner, the coding procedure just described applies to the remaining three numerical sequences. A threshold value can be determined from the cumulative distributions of the signal-to-noise ratio of the peak at f = 1/3 for each set of exon and intron sequences [17]. Using the Voss representation, two known threshold values (Tc, T4) are adopted for comparison in this article. The Tc thresholding is determined from the exon and intron cumulative distribution functions [17] and the T4 thresholding is set to a fixed value four [17]. The Codes 1-16 are to be compared to the Voss representation (Code 17) in Section 2.4.

2.2. Spectral analysis

The purpose of the spectral analysis of a numerically represented nucleotide sequence is to compute its period-3 spectral component located at a frequency equal to 2π/3 in the DFT spectrum. Given a numerical sequence, x[l] for l = 1 to L, its finite-length DFT sequence, X[k] for k = 1 to L, and its inverse DFT, x[l] for l = 1 to L, are defined by

X [ k ] = 1 L l = 1 L x [ l ] W L ( k - 1 ) ( l - 1 )  for  1 k L (1)

x [ l ] = 1 L k = 1 L X [ k ] W L - ( k - 1 ) ( l - 1 )  for  1 l L (2)

W L = e - j 2 π L (3)

In this article, a rectangular windowing of length N bases is used and each consecutive window is right-shifted by three bases with an overlap window length of N-6 bases between two adjacent windows. The rectangular windowing is adopted to avoid distortion on the cumulative DFT spectrum. We define the mean cumulative DFT spectrum, XT[k], of all the windowed sequences, Xm[k] for m = 1 to M, as

X T [ k ] = 1 M m = 1 M X m [ k ]  for  1 k N (4)

The power spectrum S[k] is obtained from Equation 4 as

S [ k ] = X T [ k ] 2  for  1 k N (5)

From Equation 5, we define the normalized power spectrum Sn[k] of a numerically represented nucleotide sequence of length L as

S n [ k ] = L N S [ k ]  for  1 k N (6)

The period-3 power spectral value (or period-3 value) P3 can be obtained from the normalized power spectrum of a numerically represented sequence at k = N/3+1 as

P 3 = S n [ N 3 + 1 ] (7)

In all computer simulations, for a window length of WL equals to a sequence length of SL, one complete DFT analysis is adopted. For WL less than SL, each consecutive window is right-shifted by three bases with an overlap window length of "WL-6" bases between two adjacent windows. The period-3 values P3 computed by Equation 7 for 4583 exon sequences and that for 4583 intron sequences (both of sequence length 1250 bases, window length 24 bases, and window right-shift 3 bases) obtained by the K-Quaternary Code I are plotted in Figures 1 and 2. From these figures, it can be observed that the period-3 values of exon sequences are higher than those of the intron sequences. This is an important property which will be used to classify exon sequences from intron sequences.

thumbnailFigure 1. Period-3 power spectral values of 4583 exon sequences using K-Quaternary Code I. SL = 1250 bases, WL = 24 bases, and WRS = 3 bases.

thumbnailFigure 2. Period-3 power spectral values of 4583 intron sequences using K-Quaternary Code I. SL = 1250 bases, WL = 24 bases, and WRS = 3 bases.

The log10 power spectrum Sn[k] for 1 ≤ k N of an exon sequence and that of an intron sequence (both of sequence length 1250 bases, window length 24 bases, and window right-shift 3 bases) obtained by the K-Quaternary Code I are plotted in Figures 3 and 4. As shown in Figure 3 (or Figure 4), the exon (or intron) period-3 value at 2π/3 defined by Equation 7 has in fact been able to reveal a prominent power spectral value; also, the exon period-3 value is higher than the intron period-3 value. The DC power spectral value is usually the highest. Besides period-3 and DC power spectral value, power spectral values of other periodicities are clearly shown in Figures 3 and 4.

thumbnailFigure 3. Log10(Exon power spectrum) S[k] for k = 1 to 24 of exon sequence # 4583 using K-Quaternary Code I. SL = 1250 bases, WL = 24 bases, and WRS = 3 bases.

thumbnailFigure 4. Log10(Intron power spectrum) S[k] for k = 1 to 24 of intron sequence # 4583 using K-Quaternary Code I. SL = 1250 bases, WL = 24 bases, and WRS = 3 bases.

2.3. Winner threshold value

In this section, the statistics of the period-3 values computed from a training set of exon sequences and intron sequences are used to define the threshold values of two novel thresholding methods to classify an untrained sequence to be either an exon sequence or an intron sequence. As explained in Table 1, meanP3e and sdP3e represent, respectively, the mean and standard deviation of the period-3 values obtained from the exon sequences of a training set; and meanP3i and sdP3i represent, respectively, the mean and standard deviation of the period-3 values obtained from the intron sequences of the same training set. We define two threshold values called the mid threshold value, Tm, and the proportional threshold value, Tp, as

T m = ( m e a n P 3 i + m e a n P 3 e ) + ( s t d P 3 i - s t d P 3 e ) 2 (8)

T p = s d P 3 e × m e a n P 3 i + s d P 3 i × m e a n P 3 e s d P 3 e + s d P 3 i (9)

In both definitions, the exon cluster is centred at meanP3e and the intron cluster is centred at meanP3i. In Equation 8, the mid threshold value is determined by the mid-point between the exon cluster and the intron cluster, whereas in Equation 9, the proportional threshold value is determined in proportion to the standard deviations of the two clusters.

Besides the above two methods for determining a threshold value, the cross-over point of the cumulative distribution of all the exon period-3 values, F(P3e), and the complementary cumulative distribution of all the intron period-3 values, Fc(P3i), of a set of exon and intron training sequences can be used to determine a threshold value. We define such a cumulative distribution threshold value, Tc, as

T c  = Period - 3 value at minimum  F ( P 3 e ) - F c ( P 3 i ) (10)

Figure 5 shows a plot of F(P3e) and Fc(P3i) versus corresponding exon/intron period-3 power spectral value. Each of the three threshold values Tm, Tp, and Tc can be used to classify an unknown nucleotide sequence. For illustration, let Tt be any of Tm, Tp, and Tc, if an unknown nucleotide sequence has a computed period-3 power spectral value greater than or equal Tt, the unknown nucleotide sequence is classified as an exon sequence; otherwise it is classified as an intron sequence.

thumbnailFigure 5. Cumulative distribution threshold value (Tc) = 3.881141 at cross-over point of exon CDP3 PSV (red) and intron CCDP3 PSV (blue) using K-Quaternary Code I. Exon/intron sequence number = 3437, SL = 1250 bases, WL = 24 bases, and WRS = 3 bases.

The performance is measured objectively in terms of the exon classification (%) which is the percentage of correct classification of exon sequences, the intron classification (%) which is the percentage of correct classification of intron sequences, and the precision (%). These three terms are defined in Equations 11-13 as

exon classification =  number of correct exon classification exon number × 100 % (11)

intron classification =  number of correct intron classification intron number × 100 % (12)

precision =  number of correct exon classification + number of correct intron classification exon number + intron number x 100 % (13)

We define a winner threshold value, Tw, as the threshold value chosen among Tm, Tp, and Tc that yields the top classification (or the highest precision).

2.4. Classification and speed performances of codes 1-17

In this article, all the exon and intron sequences of the human and eleven model organisms were downloaded from the UCSC Genomes [26-29] as inputs to be processed by Matlab programs.

2.4.1. Classification performance of short sequences

The short sequence group I listed in Table 3 downloaded from the UCSC Human genome [26-29] consists of the thirteen short sequence sets ranging from 50 bases to 650 bases (at intervals of 50 bases) as [50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650]. For each of the thirteen short sequence sets, the training sequence numbers are 1 to 10005 and the testing sequence numbers are 10006 to 13340 with a train/test ratio of three for both exon and intron sequences within a set. Each of the thirteen sequence sets is trained and tested using a combination of sixteen numerical representations, four window lengths, and four threshold values (Tm, Tp, Tc, and T4). Figures 6(a-b) and 7 (a-c) plots the top classifications obtained using each of the thirteen untrained sub-sets of 3335 exon sequences and 3335 intron sequences of the short sequence group I. Figure 6 (a, top) displays a total of 13 × 4 precision values obtained using the thirteen sequence sets in thirteen combined columns with each combined column consists of four sub-columns of precision values. Each of the four sub-column precision values shown in Figure 6 (a, top) corresponds to a top classification with its code index, window length index, threshold index, and threshold value displayed, respectively, in Figures 6 (b, bottom) and Figure 7 (a-c). In each of the five sub-plots in Figures 6 (a-b) and 7 (a-c), the first sub-column corresponds to the top classification obtained by the top performer among the Codes 1 to 16, the second sub-column corresponds to the top classification obtained by the Code 10, the third sub-column corresponds to the top classification obtained by the Code 13, and the fourth sub-column corresponds to the top classification obtained by the Code 17. The Codes 10 and 13 are found to be among the top performers and therefore they are displayed in two separate sub-columns. As the Codes 1-16 are to be compared to the Code 17 (described in Section 2.1), therefore the Code 17 is also displayed as one sub-column.

Table 3. UCSC Human genome consisting of 2 short sequence groups.

thumbnailFigure 6. (a-b) Precision (a, top) and code index (b, bottom) of top classifications of 13 short sequence sets obtained by best of Codes 1-16 (dark blue), Code 10 (light blue), Code 13 (yellow), and Code 17 (red).

thumbnailFigure 7. (a-c) WL index (a, top), threshold index (b, middle), and threshold values (c, bottom) of top classifications of 13 short sequence sets obtained by best of Codes 1-16 (dark blue), Code 10 (light blue), Code 13 (yellow), and Code 17 (red).

The classification performance shown in Figures 6(a-b) and 7 (a-c) and summarized in Table 4 indicate that for the thirteen short sequence sets: (a) The top performer for 8 of these thirteen sequence sets is the Code 13, followed by the Code 17 as the top performer for the remaining 5 sequence sets (with the Code 13 ranks 2nd in 4 cases and the Code 10 ranks 2nd in 1 case, each of these 5 cases is also shown in Table 4 on the row following that of the top performer). (b) The window length of 9 bases yields 7 top classifications; the window length of 15 bases yields 2 top classifications; and for the window length equals to the sequence length, 4 top classifications are obtained. (c) The Tc thresholding yields 8 top classifications; followed by the Tp thresholding which yields 5 top classifications. In general, top classifications can often be achieved by the Code 13, using WL = 9 bases and Tp thresholding for SL = 50 to 450 bases; and using WL = 9 bases and Tc thresholding for SL = 500 to 650 bases.

Table 4. Code (Figure 6b, bottom), WL (bases) (Figure 7a, top), threshold method (Figure 7b, middle), and precision (%) (Figure 6a, top) of top classifications of 13 short sequence sets

2.4.2. Classification performance of long sequences

The long sequence group I listed in Table 5 downloaded from the UCSC Human genome [26-29] consists of the thirteen long sequence sets ranging from 650 bases to 1250 bases (at intervals of 50 bases) as [650, 700, 750, 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1200, 1250]. For each of the thirteen long sequence sets, the training sequence numbers are 1 to 3437 and the testing sequence numbers are 3438 to 4583 with a train/test ratio of three for both exon and intron sequences within a set. Each of the thirteen sequence sets is trained and tested using a combination of sixteen numerical representations, four window lengths, and four threshold values (Tm, Tp, Tc, and T4). The top classification results of Codes 1-16 are compared to those obtained by the Code 17 [17] (as described in Section 2.1). Figures 8(a-b) and 9 (a-c) plots the top classifications obtained using each of the thirteen untrained sub-sets consisting of 1146 exon sequences and 1146 intron sequences of the long sequence group I. The notations and explanations of Figures 6(a-b) and 7 (a-c) regarding the short sequence set I in Table 3 apply to Figures 8(a-b) and 9 (a-c) which display the corresponding results of the long sequence set I in Table 5.

Table 5. UCSC Human genome consisting of 2 long sequence groups

thumbnailFigure 8. (a-b) Precision (a, top) and code index (b, bottom) of top classifications of 13 long sequence sets obtained by best of Codes 1-16 (dark blue), Code 10 (light blue), Code 13 (yellow), and Code 17 (red).

thumbnailFigure 9. (a-c) WL index (a, top), threshold index (b, middle), and threshold value (c, bottom) of top classifications of 13 long sequence sets obtained by best of Codes 1-16 (dark blue), Code 10 (light blue), Code 13 (yellow), and Code 17 (red).

The classification performance shown in Figures 8(a-b) and 9 (a-c) and summarized in Table 6 indicates that for the thirteen long sequence sets: (a) The top performer for the thirteen sequence sets is the Code 13, except for SL = 1100 bases in which the Code 1 (with WL = 24 bases and T4 thresholding) is marginally higher than that of the Code 13 (with WL = 9 bases and Tp thresholding as shown in Table 6 on the row following that of the top performer). (b) The window length of 9 bases yields 5 top classifications; the window length of 15 bases yields 4 top classifications; and for window length equals to 24 bases, 3 top classifications are obtained. (c) The Tp thresholding yields 6 top classifications; followed by the T4 thresholding which yields 4 top classifications; and the Tc thresholding which yields 3 top classifications. In general, top classifications are achievable often using a combination of the Code 13, WL = 9 bases, and Tp thresholding.

Table 6. Code (Figure 8b, bottom), WL (bases) (Figure 9a, top), threshold method (Figure 9b, middle), and precision (%) (Figure 8a, top) of top classifications of 13 long sequence sets

2.4.3. Speed performance

In this article, the DFT-based spectral analysis of a numerical sequence is computed using the FFT. In general, the computational complexity of a 4-sequence numerical representation (such as the Voss representation [17]) is always four times that of a 1-sequence numerical representation if either windowing or non-windowing is used in the comparison. The FFT computational complexity for a numerical sequence of length L is L × log2(L) and for a windowed sequence of WL = N is N × log2(N). As each consecutive window is right-shifted by three bases, the number of windows NW = 1 + fix((L-N)/3) where fix(X) rounds the elements of X to the nearest integer. For a non-windowed 4-sequence numerical representation of SL = L, the computational complexity is 4L × log2(L) whereas for a windowed 1-sequence numerical representation of SL = L, the computational complexity is NW × N × log2(N). The ratio of 4L × log2(L) over NW × N × log2(N) gives a speed improvement ratio. A plot of the speed improvement ratio against SL for WL = [9,15,24] bases is shown in Figure 10. As seen from Figure 10, it can be observed that a windowed 1-sequence representation with WL = [9,15,24] bases offers a possible speed advantage over a non-windowed 4-sequence representation and the speed improvements increases as SL increases. Also shown in Figure 10, for WL = SL, N = L, the speed improvement ratio is four.

thumbnailFigure 10. Speed improvement ratio of 1-sequence representation over 4-sequence representation for window length = 9, 15, 24, SL bases in FFT spectral analysis.

2.5. Interpolated winner threshold value

With the use of the winner threshold value described in Section 2.3, a top precision for an unknown nucleotide sequence of length equals to any of the thirteen long or short sequence sets can be obtained. However, if the length of an unknown nucleotide sequence is neither any of the thirteen short sequence sets nor any of the thirteen long sequence set, an interpolated winner threshold value can be determined as described in this section. For illustration, let us consider the K-Quaternary Code I but the interpolated winner threshold methodology also applies to other numerical representations. For a set of exon and intron training sequences, the winner threshold value, Tw(P3), is first chosen among the three threshold values, Tm(P3), Tp(P3), and Tc(P3) that yields a top precision during testing. For each window length, the winner threshold value, Tw(P3), versus sequence length of the set of thirteen short sequences with SL = 50 to 650 bases is plotted in Figure 11 and similarly in Figure 12 for the set of thirteen long sequences with SL = 650 to 1250 bases. From Figures 11 and 12, given an unknown nucleotide sequence of arbitrary length (within the length of trained sequences), an interpolated winner threshold value, Ti(P3), at any of the four specified window lengths can be obtained. Using an interpolated winner threshold value, Ti(P3), if the unknown nucleotide sequence has a period-3 power spectral value greater than or equal Ti(P3), the unknown nucleotide sequence is classified as an exon sequence; otherwise it is classified as an intron sequence.

thumbnailFigure 11. Winner threshold value Tw(P3) versus SL of 13 short sequence sets with WL = 9, 15, 24, SL bases using K-Quaternary Code I.

thumbnailFigure 12. Winner threshold value Tw(P3) versus SL of 13 long sequence sets with WL = 9, 15, 24, SL bases using K-Quaternary Code I.

2.5.1. Short sequences

For each WL, an interpolated threshold value for an arbitrary sequence length can be estimated using cubic spline interpolation from the corresponding discrete winner threshold values shown in Figure 11. For each of the four WL, the interpolated threshold values and the corresponding precisions obtained on testing each of the twelve untrained sequence sets (under the short sequence group II of Table 3) of sequence lengths [90, 120, 180, 210, 270, 330, 390, 420, 480, 510, 570, 630] bases, are displayed in Figures 13 and 14 using thin red bars. For comparison, the winner threshold values and the corresponding precisions of the thirteen short sequence sets for each of the four WL are displayed in Figures 13 and 14 using thick blue bars. From Figures 13 and 14, it can be observed that the interpolated winner threshold values and their precisions perform well and follow the general trends.

thumbnailFigure 13. Interpolated winner threshold values (thin red bar) of 12 untrained short sequence sets obtained from winner threshold values (thick blue bar) of 13 short sequence sets using K-Quaternary Code I.

thumbnailFigure 14. Precisions (thin red bar) of 12 untrained short sequence sets (obtained using interpolated winner threshold values) versus precisions (thick blue bar) of 13 short sequence sets (obtained using winner threshold values), both using K-Quaternary Code I.

2.5.2. Long sequences

In a similar manner to the short sequence sets, for each WL, an interpolated winner threshold value for an arbitrary sequence length can be determined using cubic spline interpolation from the corresponding discrete winner threshold values of the thirteen long sequence sets shown in Figure 12. For each of the four WL, twelve untrained sequence sets (under the long sequence group II of Table 5) of sequence lengths [690, 720, 780, 810, 870, 930, 990, 1020, 1080, 1110, 1170, 1230] bases are tested, the interpolated winner threshold values and the corresponding precisions obtained are displayed in Figures 15 and 16 using thin red bars. For comparison, the winner threshold values and the corresponding precisions for each of the four WL of the thirteen long sequence sets are displayed in Figures 15 and 16 using thick blue bars. From Figures 15 and 16, it can be observed that the interpolated winner threshold values and their precisions work well and follow the general trends.

thumbnailFigure 15. Interpolated winner threshold values (thin red bar) of 12 untrained long sequence sets obtained from winner threshold values (thick blue bar) of 13 long sequence sets using K-Quaternary Code I.

thumbnailFigure 16. Precisions (thin red bar) of 12 untrained long sequence sets (obtained using interpolated winner threshold values) versus precisions (thick blue bar) of 13 long sequence sets (obtained using winner threshold values), both using K-Quaternary Code I.

2.6. Classification performances of 12 organisms

To demonstrate the efficiency of the numerical representations and thresholding methods when applied to other model organisms, the UCSC genomes [26-29] of twelve organisms (including the human) listed in Table 7 are tested. Each of the genomes of the twelve organisms is trained and tested with identical SL of 300 bases, four window lengths, and four threshold values, the results are compared based on the best of the Codes 1-16 (dark blue), the Code 10 (light blue), the Code 13 (yellow), and the Code 17 (red) [17] (as described in Section 2.1) as shown in Figures 17-18. For the twelve organisms, Figures 17(a-b) plot precision (a, top) and the corresponding code index (b, bottom); and Figure 18(a-c) plot window length index (a, top), threshold index (b, middle), and threshold value (c, bottom) of the top classifications obtained. The results summarized in Table 8 indicate that top classifications are obtained by the Code 13 in seven organisms; followed by the Code 17 with four top classifications (with the Code 13 ranks 2nd in 2 cases and the Code 10 ranks 2nd in 1 case, each of these 3 cases is shown in Table 8 on the row following that of the top performer); and the Code 6 with one top classification. The results also indicate that the Code 13 with (a) WL = 24 bases and Tp thresholding is the top choice for organisms 1, 3, 4; (b) WL = 24 bases and T4 thresholding is the top choice for organism 7; (c) WL = 15 bases and Tc thresholding is the top choice for organisms 11 and 12; and (d) WL = 300 bases and Tm thresholding is the top choice for organism 2.

Table 7. UCSC genome of 12 organisms

thumbnailFigure 17. (a-b) Precision (a, top) and code index (b, bottom) of top classifications of 12 organisms obtained by best of Codes 1-16 (dark blue), Code 10 (light blue), Code 13 (yellow), and Code 17 (red).

thumbnailFigure 18. (a-c) WL index (a, top), threshold index (b, middle), and threshold value (c, bottom) of top classifications of 12 organisms obtained by best of Codes 1-16 (dark blue), Code 10 (light blue), Code 13 (yellow), and Code 17 (red).

Table 8. Code index (Figure 17b, bottom), WL (bases) (Figure 18a, top), threshold method (Figure 18b, middle), and precision (%) (Figure 17a, top) of top classifications of 12 organisms (SL = 300 bases)

3. Discussion

In this article, the ability of the K-Quaternary Code I (the Code 13) through the use of the discrete Fourier transform to capture the periodicities of an exon sequence or an intron sequence across the entire spectrum at a resolution defined by the window length is shown in Figures 3 and 4. Such a spectral tracking ability is shared among all of the sixteen numerical representations. As seen from Figures 3 and 4, there are three prominent peaks located at frequency equal to 0, 2π/3, and 4π/3. The peak at 2π/3 corresponds to the period-3 power spectral value, and the peak at 0 corresponds to the power spectral value at DC which usually exhibits the highest value within a spectrum. The power spectral values at other frequencies are lower and different but all serve to reflect their actual power spectral properties across the spectrum. For the Codes 9-16, their numerical represented sequences x[n] are complex; therefore, DFT spectrum shows non-symmetrical peaks at frequency equal to 2π/3 and 4π/3. However, for the Codes 1-8 in which their numerical represented sequences x[n] are real and therefore symmetrical peaks at frequency equal to 2π/3 and 4π/3 are obtained from DFT analysis.

The interpolated winner threshold values, Ti(P3), and their corresponding precisions shown in Figures 13 and 14 for short sequences, and in Figures 15 and 16 for long sequences of the human genome indicate that Ti(P3) obtained using cubic spline interpolation from either Figure 11 or Figure 12 can yield a similar precision as compared to that of the winner threshold value, Tw(P3), of its nearby SL. It should be noted that each of Tm(P3), Tp(P3), and Tc(P3) required to determine Tw(P3) has to be computed directly from the training portion of each short/long sequence set whereas Ti(P3) requires minimal computation but can achieve a comparable precision.

Given an unknown human nucleotide sequence of an arbitrary length L, if L is equal to the length of any of the thirteen short sequence sets or any of the thirteen long sequence sets, the choice of a suitable set of code, WL, and threshold can be obtained as a table-look-up from either Table 4 or 6. If L is not equal to the length of any of these thirteen short or long sequence sets, the closest SL, its code, and WL can also be obtained from either Table 4 or 6. For the latter case, once the code and WL are determined, its threshold value can then be determined using the interpolated winner threshold described in Section 2.5. Besides the human genome, the methodologies described in this article can be applied to the genome of other model organisms as verified by the results shown in Figures 17(a-b) and 18 (a-c) and Tables 7 and 8.

4. Conclusions

In this article, two methods for determining threshold values have been defined, and together with the cumulative distribution threshold value, determine the winner threshold value for classifying an unknown nucleotide sequence of a fixed length. An interpolated winner threshold value has also been introduced to classify an unknown nucleotide sequence of an arbitrary length with a comparable performance to that obtained by the winner threshold value of its nearby SL (in classifying an unknown nucleotide sequence of a fixed length). In general, precision increases as sequence length increases, and classification performance depends on a suitable choice of numerical representation and window length. Sixteen 1-sequence numerical representations have been presented and compared to classify untrained exon and intron sequences in the spectral domain, in which the K-Quaternary Code I yields attractive performance. When comparing each of the sixteen windowed 1-sequence numerical representations using WL = [9,15,24] bases to a non-windowed 4-sequence numerical representation (such as the Voss representation), the speed improvement ratio increases as SL increases which favours long nucleotide sequence analysis. The results obtained indicate the methodologies introduced in this article for exon and intron sequence classification are applicable to the genomes of the human and other model organisms. Overall, the study has developed novel methodologies in numerical representation for improved nucleotide to numeric mapping, spectral analysis for effective period-3 spectral value computation, and thresholding for more accurate classification of unknown exon and intron sequences of fixed and arbitrary length.

Competing interests

The authors declare that they have no competing interests.

References

  1. PP Vaidyanathan, Genomics and proteomics: A signal processor's tour. IEEE Circ. Syst. Mag. 4th Q, 6–29 (2004)

  2. D Anastassiou, Genomic signal processing. IEEE Signal Process. Mag. 18, 8-20 (2001)

  3. HK Kwan, SB Arniker, Numerical representation of DNA sequences, 307–310 in Proceedings of IEEE International Conference on Electro/Information Technology (EIT), Windsor, Ontario, Canada, 7-9 June 2009

  4. SB Arniker, HK Kwan, Graphical representation of DNA sequences, 311–314 in Proceedings of IEEE International Conference on Electro/Information Technology (EIT), Windsor, Ontario, Canada, 7-9 June 2009

  5. PD Cristea, in BIOS’02: Genetic Signal Representation and Analysis, SPIE International Conference on Biomedical Optics Symposium, Molecular Analysis and Informatics, San Jose, CA, USA, 21-24 January 2002. Proceedings of Society of Photo-Optical Instrumentation Engineers (SPIE) Conference, vol. 4623 (SPIE, 2002), 77–84

  6. M Akhtar, J Epps, E Ambikairajah, On DNA numerical representations for period-3 based exon prediction. 4 pages in Proceedings of IEEE Workshop on Genomic Signal Processing and Statistics (GENSIPS), Tuusula, Finland, 10-12 June 2007

  7. GL Rosen, Signal Processing for Biologically-inspired Gradient Source Localization and DNA Sequence Analysis (PhD dissertation, Georgia Institute of Technology, Atlanta, 2006)

  8. M Akhtar, J Epps, E Ambikairajah, Signal processing in sequence analysis: Advances in eukaryotic gene prediction. IEEE J Sel Top Signal Process 2, 310–321 (2008)

  9. T Holden, R Subramaniam, R Sullivan, E Cheng, C Sneider, G Tremberger Jr., A Flamholz, DH Leiberman, TD Cheung, ATCG nucleotide fluctuation of Deinococcus radiodurans radiation genes. in Instruments, Methods, and Missions for Astrobiology X. Proceedings of Society of Photo-Optical 2 Instrumentation Engineers (SPIE) Conference, vol. 6694 (SPIE, 12 September 2007), pp. 669417-1–669417-10

  10. HE Stanley, SV Buldyrev, AL Goldberger, ZD Goldberger, S Havlin, SM Ossadnik, C-K Peng, M Simmons, Statistical mechanics in biology: how ubiquitous are long-range correlations? Physica A 205, 214–253 (1994). PubMed Abstract | Publisher Full Text OpenURL

  11. AS Nair, SS Pillai, A coding measure scheme employing electron-ion interaction pseudo potential (EIIP). Bioinformation 1, 197–202 (2006). PubMed Abstract | PubMed Central Full Text OpenURL

  12. I Cosic, Macromolecular Bioactivity: is it resonant interaction between macromolecules? Theory and applications. IEEE Trans Biomed Eng 41, 1101–1114 (1994). PubMed Abstract | Publisher Full Text OpenURL

  13. N Chakravarthy, A Spanias, LD Lasemidis, K Tsakalis, Autoregressive modeling and feature analysis of DNA sequences. EURASIP Journal of Genomic Signal Processing 1, 13–28 (2004)

  14. PD Cristea, Conversion of nucleotides sequences into genomic signals. J Cell Mol Med 6, 279–303 (2002). PubMed Abstract | Publisher Full Text OpenURL

  15. PD Cristea, Representation and analysis of DNA sequences. in Genomic Signal Processing and Statistics, EURASIP Book Series in Signal Processing and Communications, volume 2 (Hindawi Publishing Corporation, 2005), , ed. by ER Dougherty, I Shmulevich, J Chen, ZJ Wang, pp. 15–65

  16. AK Brodzik, O Peters, Symbol-balanced Quaternionic periodicity transform for latent pattern detection in DNA sequences. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 5, Philadelphia, USA, March 2005, 373–376

  17. S Tiwari, S Ramachandran, A Bhattacharya, S Bhattacharya, R Ramaswamy, Prediction of probable genes by Fourier analysis of genomic sequences. Bioinformatics (CABIOS) 13(3), 263–270 (1997). Publisher Full Text OpenURL

  18. S Datta, A Asif, A fast DFT based gene prediction algorithm for identification of protein coding regions. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol 5, Philadelphia, USA, March 2005, 653–656

  19. BYM Kwan, JYY Kwan, HK Kwan, R Atwal, OT Shen, 4 pages in Proceedings of TENCON, Hong Kong, China, 2006

  20. JA Berger, SK Mitra, M Carli, A Neri, New approaches to genome sequence analysis based on digital signal processing. 4 pages in Proceedings of IEEE Workshop on Genomic Signal Processing and Statistics (GENSIPS), Raleigh, North Carolina, October 2002

  21. EA Cheever, DB Searls, W Karunaratne, GC Overton, Using signal processing techniques for DNA sequence comparison, 173–174 in Proceedings of the 1989 Fifteenth Annual Northeast Bioengineering Conference, Boston, MA, 27-28 March 1989

  22. S Rajasekaran, H Nick, PM Pardalos, S Sahni, G Shaw, Efficient algorithms for local alignment search. J Comb Optim 5, 117–124 (2001). Publisher Full Text OpenURL

  23. GD Avenio, M Grigioni, G Orefici, R Creti, SWIFT (sequence-wide investigation with Fourier transform): a software tool for identifying proteins of a given class from the unannotated genome sequence. Bioinformatics 21(13), 2943–2949 (2005). PubMed Abstract | Publisher Full Text OpenURL

  24. CC Yin, SS-T Yau, Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence. J Theor Biol 247, 687–694 (2007). PubMed Abstract | Publisher Full Text OpenURL

  25. J Tuqan, A Rushdi, A DSP approach for finding the codon bias in DNA sequences. IEEE J Sel Topics Signal Process 2(3), 343–356 (2008)

  26. D Karolchik, AS Hinrichs, TS Furey, KM Roskin, CW Sugnet, D Haussler, WJ Kent, The UCSC Table Browser data retrieval tool. Nucleic Acids Res, D493–496 (2004). | OpenURL

  27. J Goecks, A Nekrutenko, J Taylor, The Galaxy Team, Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology 11(8) Article R86 BioMed Central Full Text OpenURL

  28. D Blankenberg, G Von Kuster, N Coraor, G Ananda, R Lazarus, M Mangan, A Nekrutenko, J Taylor, Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol Chapter 19(Unit 19.10), 1–21 (2010)

  29. B Giardine, C Riemer, RC Hardison, R Burhans, L Elnitski, P Shah, Y Zhang, D Blankenberg, I Albert, J Taylor, W Miller, WJ Kent, A Nekrutenko, Galaxy: a platform for interactive large-scale genome analysis. Genome Res 15(10), 1451–1455 (2005). PubMed Abstract | Publisher Full Text | PubMed Central Full Text OpenURL