A New Quantum Cuckoo Search Algorithm for Multiple Sequence Alignment

Widad Kartous; Abdesslem Layeb; Salim Chikhi

doi:10.1515/jisys-2013-0052

Open Access Published by De Gruyter June 17, 2014

From the journal Journal of Intelligent Systems

Abstract

Multiple sequence alignment (MSA) is one of the major problems encountered in the bioinformatics field. MSA consists of aligning a set of biological sequences to extract the similarities between them. Unfortunately, this problem has been shown to be NP-hard. In this article, a new algorithm is proposed to deal with this problem; it is based on a quantum-inspired cuckoo search algorithm. The other feature of the proposed approach is the use of a randomized progressive alignment method based on a hybrid global/local pairwise algorithm to construct the initial population. The results obtained by this hybridization are very encouraging and show the feasibility and effectiveness of the proposed solution.

Keywords: Bioinformatics; multiple sequence alignment; cuckoo search algorithm; quantum computing; hybrid algorithms

1 Introduction

During the last decade, the importance of biological data increased in an exceptional way, and this has led researchers to focus on new and more efficient techniques for solving different problems encountered in bioinformatics. Major research efforts in the bioinformatics field include multiple sequence alignments (MSAs), phylogeny construction, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein–protein interactions, genome-wide association studies, and the modeling of evolution [4].

MSA is among the most important and challenging tasks in bioinformatics, which makes MSA tools essential for day-to-day sequence analyses [18]. An alignment exposes the similarities between biological sequences, and biologists use these similarities to identify functional, structural, or evolutionary relationships. Unfortunately, finding an accurate MSA has been shown to be NP-hard [1], which is why several methods have been proposed in the literature to tackle this problem [30]. MSA methods can be divided into three classes according to the search strategy used: exact, progressive, and iterative methods. The first class includes exact methods, which use a generalization of the Needleman–Wunsch algorithm [25] to align all the sequences simultaneously [13, 20, 21, 23, 25–29, 31, 33, 35, 36, 39]. Although exact methods give optimal solutions, their main shortcoming is their complexity, which becomes even more critical as the number of sequences increases. The second class contains methods based on a progressive approach [4, 11, 13–21, 23, 25–29, 31, 33, 35, 36, 39, 40, 42]. In these methods, the MSA is built gradually according to a given order of sequences, starting with the alignment of two sequences and then adding the remaining sequences one by one to the preceding alignment. Progressive methods are simple and fast and generally give alignments of good quality. The most popular progressive program is Clustal [42]. However, the major disadvantage of progressive methods is the problem of local minima, which can lead to poor-quality solutions. To overcome this problem, several strategies have been added to the main progressive methods to minimize alignment errors. For example, Zhou and Zhou [49] proposed SPEM, one of the most competitive MSA tools, which uses structural information in the construction of the MSA. Meanwhile, the iterative methods of the third class have been shown to be promising. The basic idea is to start with an initial alignment and improve it through a series of refinement steps called iterations. The process is repeated as long as a number of criteria are satisfied. Iterative methods can be deterministic or stochastic, depending on the strategy used to improve the alignment. The first stochastic iterative algorithm proposed in the literature uses simulated annealing [16]. However, it is very slow and is appropriate only as an alignment improver. Later, several other iterative algorithms using various strategies, such as genetic algorithms (GAs) [2, 6, 19, 28, 31, 36, 48] and tabu search [35], were proposed. The deterministic iterative methods [45] extract the sequences one by one from the multiple alignment and realign them to the remaining sequences. The process is repeated until no further improvement is possible [14]. Although the iterative methods generally give more accurate alignments than the progressive methods, their major disadvantage is their high execution time. Recently, several methods have combined the progressive strategy with iterative refinement. MUSCLE [10], ProbCons [9], and T-Coffee [23] are among the most powerful MSA tools in the literature.

The above methods can also be divided into two classes, global and local, depending on the pairwise alignment algorithm used to create the MSA. The global methods align sequences from beginning to end. These methods are very useful for sequences with global homology. In contrast, local methods attempt to identify the most conserved motifs common to all sequences. They perform better than the global programs when the goal is to align a set of sequences with local homologies. Recently, several hybrid methods that combine global and local alignment features have been developed, such as DIALIGN-T [40], T-Coffee [27, 29, 31], and GRASPALINE [14].

Outside the bioinformatics field, growing theoretical and practical interest is being devoted to research on merging evolutionary computation and quantum computing (QC) [1–3, 6, 9–11, 14, 15, 24, 30]. The aim is to benefit from QC capabilities to enhance both the efficiency and the speed of classical evolutionary algorithms (EAs). This has led to the design of quantum-inspired evolutionary algorithms that have been reported to perform better than conventional EAs on some optimization problems. QC is a new research field that encompasses investigations on quantum mechanical computers and quantum algorithms. QC relies on the principles of quantum mechanics, such as qubit representation and superposition of states, and is capable of processing huge numbers of quantum states simultaneously. QC brings a new philosophy to optimization because of its underlying concepts [15]. Recently, the principles of QC were successfully integrated by Layeb [17] into the cuckoo search algorithm (CSA), which was developed in 2009 by Yang and Deb [46, 47]. CSA is a new metaheuristic algorithm imitating animal behavior [46]; it is based on the obligate brood parasitic behavior of some cuckoo species and on the Lévy flight principle. Preliminary studies show that CSA is very promising and could outperform existing algorithms [4, 7, 8, 32, 34, 38, 41, 46–49].

The article is organized as follows: in Section 2, a formulation of the tackled problem is given; in Section 3, the principles of QC are briefly presented; in Section 4, an overview of the concepts of CSA is given; Section 5 is devoted to the proposed approach; Section 6 discusses the experimental results; the conclusion and a discussion of future work are given in Section 7.

2 Problem Formulation

The MSA problem refers to the alignment of three or more biological sequences having generally different lengths. Figure 1 shows a small example of an MSA; three situations are possible for a position given in the alignment. In “match,” the characters are the same; in “mismatch,” the characters are not the same; and in “insertion/deletion,” one of the positions is a gap.

Figure 1

A Multiple Alignment of AGCTA, ACAGT, and GCATA.

Formally, the problem of MSA can be formulated as follows [19]: let S={s₁, s₂, …, s_n} be a set of n sequences with n≥2. Each sequence s_i is a string defined over an alphabet Λ. The lengths of the sequences are not necessarily the same. The problem of MSA can be defined by implicitly specifying a pair (Ω, C), where Ω is the set of all feasible solutions that are potential alignments and C is a mapping Ω→R called score of the alignment. Each potential alignment is viewed as a set S′={s′1, s′2, . . . , s′n} satisfying the following criteria:

Each sequence s′i is an extension of s_i and is defined over the alphabet Λ′=Λ∪{–}. The symbol “–” is a dash denoting a gap. Gaps are added to s_i in a way the deletion of gaps from s′i leaves s_i.
For all i, j, length (s′i)=length (s′j).

A score of an alignment S′ denoted by C(S′) is defined as

$$C(S') = \sum_{i} \sum_{j} \mathrm{sim}(s'_i, s'_j), \tag{1}$$

where sim(s′i, s′j) denotes some similarity between each pair of sequences s′i and s′j.

Let us now define the optimum value of C as C_best={max C(S′)/S′∈Ω} and the set of optima Ω_best={S′∈Ω/C(S′)=C_best}. The optimal alignment can be then defined by S′*, such that

$$S'^{*} = \arg\max_{S' \in \Omega} C(S'). \tag{2}$$
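
As an illustration of the formulation above, the following minimal Python sketch checks the two feasibility criteria and evaluates C(S′) for a candidate alignment. The similarity function `sim`, its match/mismatch/gap values, and the candidate alignment itself are assumptions chosen for the example (the alignment is not the one shown in Figure 1), and the double sum of Equation (1) is taken over unordered pairs of rows.

```python
from itertools import combinations

GAP = "-"

def is_feasible(original, aligned):
    """Check both criteria: removing gaps from s'_i restores s_i,
    and all aligned rows have the same length."""
    if len(original) != len(aligned):
        return False
    same_length = len({len(row) for row in aligned}) <= 1
    restores = all(row.replace(GAP, "") == seq
                   for seq, row in zip(original, aligned))
    return same_length and restores

def sim(row_a, row_b, match=1, mismatch=0, gap=-1):
    """Hypothetical pairwise similarity: column-wise match/mismatch/gap scores."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == GAP and y == GAP:
            continue  # a column holding two gaps is ignored
        if x == GAP or y == GAP:
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

def alignment_score(aligned):
    """C(S') as the sum of sim over all unordered pairs of rows."""
    return sum(sim(a, b) for a, b in combinations(aligned, 2))

sequences = ["AGCTA", "ACAGT", "GCATA"]
candidate = ["A-GCTA", "ACAG-T", "-GCATA"]  # one hypothetical alignment
print(is_feasible(sequences, candidate), alignment_score(candidate))
```

Finding S′* then amounts to searching Ω for the feasible alignment that maximizes `alignment_score`, which is the part that the heuristic methods discussed in this article approximate.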

The addressed task is clearly a combinatorial optimization problem. Although the available computing power has been increasing steadily at a rapid rate, it is still practically impossible to find globally optimal solutions to large instances of such problems. The main reason is that the required computation grows exponentially with the size of the problem. Therefore, it is often desirable to find near-optimal solutions instead. Efficient heuristic algorithms, such as the progressive methods, offer a good alternative for reaching this goal. In the next section, we discuss the first set of concepts used in our approach.

3 QC Principles

A quantum computer is a computational device that makes direct use of quantum mechanical phenomena such as superposition and entanglement to perform computing operations on data. Quantum computers are different from digital computers based on transistors. Whereas digital computers require data to be encoded into binary digits (bits), quantum computation uses quantum properties to represent data and perform operations on these data [13]. Quantum computers share theoretical similarities with nondeterministic and probabilistic computers; one example is the ability to be in more than one state simultaneously.

Whereas a classical computer manipulates simple bits, each equal to 0 or 1, QC uses a different carrier of information: the qubit. A qubit is a superposition of two basis states, denoted by convention |0〉 and |1〉. The state of a qubit can be represented using the bra-ket notation:

$$|\psi\rangle = a|0\rangle + b|1\rangle, \tag{3}$$

where |ψ〉 denotes a vector ψ in some vector space, |0〉 and |1〉 represent the classical bit values 0 and 1, respectively, and a and b are complex numbers such that

$$|a|^{2} + |b|^{2} = 1, \tag{4}$$

where a and b specify the probability amplitudes of the corresponding states. When we measure the qubit’s state, we obtain “0” with probability |a|² and “1” with probability |b|². A system of m qubits can represent 2^m states at the same time. Quantum computers can perform computations on all these values at the same time. It is this exponential growth of the state space with the number of particles that suggests an exponential speed-up of computation on quantum computers over classical computers. Each quantum operation deals with all the states present within the superposition in parallel. When a quantum state is observed, it collapses to a single state among those states [15].
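
To make the measurement rule concrete, the following minimal Python sketch simulates a single qubit on a classical machine. The function names `normalize` and `measure` are hypothetical; the sketch only illustrates Equations (3) and (4) and the measurement probabilities |a|² and |b|².

```python
import math
import random

def normalize(a, b):
    """Scale the amplitudes so that |a|^2 + |b|^2 = 1 (Equation (4))."""
    norm = math.sqrt(abs(a) ** 2 + abs(b) ** 2)
    return a / norm, b / norm

def measure(a, b, rng):
    """Simulated measurement: returns 1 with probability |b|^2, otherwise 0."""
    return 1 if rng.random() < abs(b) ** 2 else 0

rng = random.Random(0)
a, b = normalize(1.0, 1.0)          # equal superposition of |0> and |1>
samples = [measure(a, b, rng) for _ in range(10_000)]
print(sum(samples) / len(samples))  # empirical frequency of "1", near |b|^2
```

Because the amplitudes are ordinary numbers here, measuring does not destroy them, which is what allows quantum-inspired algorithms to reuse the same superposition across iterations.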

A quantum algorithm consists of applying a succession of quantum operations to quantum systems. Quantum operations are performed using quantum gates and quantum circuits, yet a powerful quantum machine is still under construction. Until such a machine is available, research is being conducted to benefit from the QC field. Since the late 1990s, merging quantum computation and evolutionary computation has proven to be a productive approach to complex problems. The purpose of this combination is for each of the two areas to benefit by inspiring the other. Like any other evolutionary algorithm, a quantum genetic algorithm (QGA) relies on the representation of the individual, the evaluation function, and the population dynamics. The particularity of QGAs stems from the quantum representation they adopt, which allows representing the superposition of all potential solutions for a given problem, and from the quantum operators they use to evolve the entire population through generations. QGAs have been proposed to solve many difficult combinatorial optimization problems, and the experimental results demonstrated that these algorithms were far superior to conventional GAs [4, 7, 8, 12, 32, 34, 38, 41–49].

4 Cuckoo Search Algorithm

The CSA is a new metaheuristic algorithm inspired by the obligate brood parasitism of some cuckoo species that lay their eggs in the nests of host birds. Some cuckoos have evolved in such a way that female parasitic cuckoos can imitate the colors and patterns of the eggs of a few chosen host species. This reduces the probability of the eggs being abandoned and therefore increases their reproductivity [33]. It is worth mentioning that several host birds engage in direct conflict with intruding cuckoos. In this case, if host birds discover that the eggs are not their own, they will either throw them away or simply abandon their nests and build new ones elsewhere. Parasitic cuckoos often choose a nest where the host bird has just laid its own eggs. CSA models such breeding behavior and can thus be applied to various optimization problems. Yang and Deb [46, 47] discovered that the performance of the CSA can be improved using Lévy flights instead of a simple random walk.

Each egg in a nest represents a solution, and a cuckoo egg represents a new solution. The aim is to use the new and potentially better solutions (cuckoos) to replace not-so-good solutions in the nests. In the simplest form, each nest has one egg. The algorithm can be extended to more complicated cases in which each nest has multiple eggs representing a set of solutions [46, 47].

The CSA is based on three idealized rules:

1. Each cuckoo lays one egg at a time and dumps it in a randomly chosen nest.
2. The best nests with a high quality of eggs (solutions) will carry over to the next generations.
3. The number of available host nests is fixed, and a host can discover an alien egg with probability pa ∈ [0, 1]. In this case, the host bird can either throw the egg away or abandon the nest to build a completely new nest in a new location [46].

The basic steps of the CSA can be summarized as follows:
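
A minimal Python sketch of the three rules, assuming a continuous minimization problem over a box. The names `levy_step` and `cuckoo_search`, the parameter values, the use of Mantegna's algorithm to draw Lévy steps, and the `sphere` test function are all assumptions made for this illustration; rule 3 is approximated by rebuilding a fraction pa of the worst nests at each iteration.

```python
import math
import random

def levy_step(rng, beta=1.5):
    """Draw one heavy-tailed step length using Mantegna's algorithm."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma = (num / den) ** (1 / beta)
    u = rng.gauss(0, sigma)
    v = rng.gauss(0, 1) or 1e-12  # guard against an exact zero
    return u / abs(v) ** (1 / beta)

def cuckoo_search(objective, dim, n_nests=15, pa=0.25, alpha=0.1,
                  iterations=200, bounds=(-5.0, 5.0), seed=0):
    """Minimise `objective` over a box; returns (best_nest, best_value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    nests = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_nests)]
    values = [objective(n) for n in nests]
    for _ in range(iterations):
        # Rule 1: a cuckoo lays one egg, generated by a Levy flight from a
        # random nest, and dumps it in a randomly chosen nest.
        source = nests[rng.randrange(n_nests)]
        egg = [min(hi, max(lo, x + alpha * levy_step(rng))) for x in source]
        egg_value = objective(egg)
        target = rng.randrange(n_nests)
        if egg_value < values[target]:
            nests[target], values[target] = egg, egg_value
        # Rule 3: a fraction pa of the worst nests is abandoned and rebuilt.
        # Rule 2: the best nests are never in that fraction, so they carry over.
        order = sorted(range(n_nests), key=values.__getitem__)
        for idx in order[n_nests - int(pa * n_nests):]:
            nests[idx] = [rng.uniform(lo, hi) for _ in range(dim)]
            values[idx] = objective(nests[idx])
    best = min(range(n_nests), key=values.__getitem__)
    return nests[best], values[best]

def sphere(x):
    """Hypothetical test objective with its minimum at the origin."""
    return sum(xi * xi for xi in x)

best_nest, best_value = cuckoo_search(sphere, dim=2)
print(best_nest, best_value)
```

Since an egg replaces a nest only when it is better and the best nests are never abandoned, the best value found cannot get worse from one iteration to the next.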

The applications of CSA to engineering optimization problems have shown promising efficiency. For example, for both spring design and welded beam design problems, CSA obtained better solutions than those previously reported in the literature. A promising discrete CSA was recently proposed to solve a nurse-scheduling problem [41]. An efficient computation approach based on CSA has been proposed for data fusion in wireless sensor networks [7, 8]. A new quantum-inspired CSA was developed to solve knapsack problems and showed its effectiveness [17]. CSA can also be used to efficiently generate independent test paths for structural software testing [38] and for test data generation [34].

5 Quantum-Inspired CSA for MSA

To solve the MSA problem, we propose a new algorithm based on the quantum CSA. The proposed approach consists of two phases (Figure 2). The first is a construction phase, based on a randomized progressive MSA, which builds the initial solutions; the second is a refinement phase, based on the quantum cuckoo algorithm dynamics, which improves the solutions produced in the first phase.

Figure 2

Flowchart of the Proposed Algorithm.

5.1 Construction Phase

The construction phase is an essential step in any two-phase algorithm to obtain good solutions. Generally, a semi-greedy algorithm is used to construct a feasible solution progressively. In the MSA field, there are two main classes of progressive algorithms used to construct the MSA: global alignment and local alignment. The global progressive methods are generally more accurate than the local ones, except in cases involving large N/C-terminal extensions or internal insertions. The major disadvantage of global methods in these cases is due to the difference between the lengths of the sequences. Global alignment methods attempt to align the sequences over their entire length, whereas local programs search only for the most conserved motifs [43]. To improve the accuracy of the global methods on sequences with large N/C-terminal extensions without decreasing their accuracy in the other cases, we have introduced in the construction phase the idea of GRASPALINE, which uses both the local and the global pairwise algorithms to align two sequences within the general progressive algorithm. GRASPALINE [18] uses an algorithm similar to the one proposed by Feng and Doolittle [11] to build an MSA progressively. The main novelty of this construction phase is its ability to produce a diverse set of good solutions; this behavior is achieved by inserting some randomness into the progressive algorithm. Consequently, a sequence can be selected with some probability even if it is not the closest one to the already aligned sequences. In more detail, the proposed algorithm for MSA can be described as follows:

The following example shows how to apply the construction phase on a BAliBASE test 1ckaA of reference 4. The test 1ckaA is composed of the sequences {1ckaA, 1ycsB, 1aboA, 1griA, 1fmk, 1awx, 1ark, 1hsq, 1pht, EPS8_MOUSE}. The multiple alignment of the above set of sequences is built as follows:

First, we create the guide tree using the neighbor-joining method (Figure 3). In the first step, we begin with the alignment of the sequences 1ckaA and 1ark. We first compute the difference between their lengths. This difference is <120 (Table 1), so we use a global pairwise algorithm to build the alignment of 1ckaA and 1ark. According to the tree, the third sequence to be aligned is 1griA; note, however, that because the first phase of our method is a greedy randomized procedure, a random sequence could be selected from the unaligned ones instead. At this stage, we select the closest sequence to 1griA in the previous alignment, which is 1ckaA. As in the preceding stage, we compute the difference between the lengths of 1griA and 1ckaA; this difference is <120, so we again use a global pairwise algorithm to align 1griA and 1ckaA. Afterward, the old gaps in the previous alignment and the new ones generated in the alignment of 1griA and 1ckaA are propagated to all three sequences. At the end of this stage, the alignment contains the sequences 1ckaA, 1ark, and 1griA. In the same manner, we perform the alignment of 1hsq, 1awx, and 1pht. In contrast, the sequences 1ycsB, 1fmk, 1aboA, and EPS8_MOUSE are aligned using a local pairwise algorithm because the difference between their lengths and those of the closest sequences is >120 (Table 1). For example, the closest sequence to EPS8_MOUSE is 1ckaA, and the difference between their lengths is 765. Therefore, the use of a local pairwise algorithm is more effective than the global one.

Figure 3

The Guide Tree of the Sequences.

Table 1

The Progression of the Sequence Alignments.

Sequence	Length	Aligned to the sequence	Length difference	Pairwise algorithm
1ckaA	56	–	–	–
1ark	60	1ckaA	4	Global
1griA	94	1ckaA	38	Global
1hsq	71	1ckaA	15	Global
1awx	67	1ckaA	11	Global
1ycsB	193	1awx	126	Local
1fmk	258	1ckaA	202	Local
1aboA	58	1fmk	200	Local
EPS8_MOUSE	821	1ckaA	765	Local
1pht	80	1ckaA	24	Global
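
The choice between the global and the local pairwise algorithm in Table 1 can be sketched in Python as follows. The function name `choose_pairwise` is hypothetical, and treating a difference of exactly 120 as "Global" is an assumption, since the text only specifies the cases <120 and >120.

```python
THRESHOLD = 120  # length-difference threshold used in the worked example

def choose_pairwise(length_new, length_partner, threshold=THRESHOLD):
    """Return "Local" when the two lengths differ by more than the threshold."""
    return "Local" if abs(length_new - length_partner) > threshold else "Global"

# Sequence lengths from Table 1
lengths = {"1ckaA": 56, "1ark": 60, "1griA": 94, "1hsq": 71, "1awx": 67,
           "1ycsB": 193, "1fmk": 258, "1aboA": 58, "EPS8_MOUSE": 821, "1pht": 80}

# (sequence, sequence it is aligned to), in the order of Table 1
steps = [("1ark", "1ckaA"), ("1griA", "1ckaA"), ("1hsq", "1ckaA"),
         ("1awx", "1ckaA"), ("1ycsB", "1awx"), ("1fmk", "1ckaA"),
         ("1aboA", "1fmk"), ("EPS8_MOUSE", "1ckaA"), ("1pht", "1ckaA")]

choices = {seq: choose_pairwise(lengths[seq], lengths[partner])
           for seq, partner in steps}
for seq, partner in steps:
    diff = abs(lengths[seq] - lengths[partner])
    print(f"{seq}\t{partner}\t{diff}\t{choices[seq]}")
```

Run on the lengths of Table 1, this rule reproduces the "Length difference" and "Pairwise algorithm" columns of the table.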

It should be noted that in population-based metaheuristics, the construction of the initial population is very important. Indeed, there is a compromise to respect: the population should be diverse (i.e., contain both good and bad solutions) to obtain good results and to avoid traps or local minima. On the one hand, starting from a pure population of non-aligned solutions is time-consuming. On the other hand, starting from a population created by a strong algorithm such as MUSCLE can cause the algorithm to become trapped in local minima. The Clustal algorithm constitutes a better choice for initializing population-based metaheuristics for MSA because it is based on a greedy strategy and generally gives suboptimal solutions. Clustal was used as the starting point of a quantum evolutionary algorithm in the work of Meshoul et al. [22], but the improvements were not significant enough. To avoid all the previous problems, our algorithm uses a progressive procedure similar to the Clustal algorithm to construct the initial population. Unlike Clustal, however, the procedure is randomized to obtain diverse solutions. Finally, the impact of using different MSA algorithms to generate the initial population will be evaluated in our forthcoming work.

5.2 Refinement Phase

The solutions produced in the construction phase are not necessarily optimal. To enhance their quality, we apply an iterative refinement method based on the quantum CSA operators (Figure 4). We next describe how the representation scheme is embedded within a CSA, resulting in a hybrid stochastic algorithm that performs the MSA. First, a swarm of p host nests is created at random positions to represent all possible solutions. The algorithm then progresses through a number of generations. During each iteration, the following main tasks are performed. A new cuckoo is built using the Lévy flights operator, followed by one type of quantum mutation applied with some probability Pr. The next step is to evaluate the current cuckoo: we apply the measure operation to obtain a binary solution that represents a potential alignment. After this step, we apply the interference operation according to the current best element. Some of the worst nests are replaced by the current cuckoo if it is better, or by new random nests generated by the Lévy flight. Finally, the global best solution is updated if a better one is found, and the whole process is repeated until a stopping criterion is reached. Each new alignment is evaluated using an objective function and is accepted if it improves the alignment score.

Figure 4

Architecture of Refinement Phase.

The particularity of quantum-inspired CSA stems from the quantum representation it adopts, which allows representing the superposition of all potential solutions for a given problem. Moreover, the position of a nest depends on the probability amplitudes a and b of the qubit function. The probabilistic nature of the quantum measure offers a good diversity to the CSA, whereas the interference operation helps to intensify the search around the good solutions. The results obtained are discussed in the next section.

The components of the proposed algorithm are explained in detail in the following subsections.

5.2.1 Quantum Representation of MSA

Two representations are used in this algorithm. The first is a binary matrix (BM) in which each row represents a sequence in the alignment: value 1 is assigned to any element corresponding to a nucleic base and value 0 to any element corresponding to a gap. The second, in terms of QC, is a quantum matrix (Figure 5) that represents a potential solution. Each sequence is represented as a quantum register, and each column represents only one qubit and corresponds to an element of the alphabet {A, C, G, U, –}. The amplitudes a_i and b_i of each qubit are real values satisfying |a_i|²+|b_i|²=1. For each qubit, a binary value is calculated according to the probabilities |a_i|² and |b_i|² and the number of nucleic bases in each sequence [23]; as a consequence, all potential alignments can be represented by a quantum matrix (Figure 5) [23] containing a superposition of all possible configurations. This quantum matrix can be viewed as a probabilistic representation of all possible alignments.

Figure 5

Quantum Representation of MSA.

5.2.2 Quantum Operations

We define now the following operations we have applied on the quantum representation:

Measurement operation: This operation allows extracting from the quantum matrix one solution among all those present in the superposition without destroying all other configurations as it is done in pure quantum systems. This has the advantage of preserving the superposition for the next iterations knowing that we operate on a conventional computer. The result of this operation is a BM (Figure 6) [19]. The value of a qubit is calculated according to its probabilities |a_i|², |b_i|² and the number of nucleic bases in each sequence. The obtained BM is then translated into an alphabetical matrix (Figure 7) by respecting the order of appearance of nucleic bases in the sequences to be aligned [23].
Quantum interference: This operation aims to increase the probability that a good alignment is extracted as a result of the measure operation. It mainly consists of moving the state of each qubit in the direction of the value of the corresponding bit in the current best solution. The interference operation is useful for intensifying the search around the best solution [19]. The value of the rotation angle δθ (Table 2) is chosen so as to avoid premature convergence. It is set experimentally, and its direction is determined as a function of the values of a_i, b_i (Figure 8) and the corresponding element’s value in the BM [19].
Mutation operation: This operator allows for the exploration of new solutions and thus enhances the diversification capacities of the search process. Three kinds of mutation are considered [19].
1. Single qubit mutation: in this mutation, we alter some qubits taken randomly (Figure 9).
2. Qubit register mutation: a random set of consecutive qubits are moved (Figure 10).
3. Qubit bloc mutation: a random bloc of qubits is moved (Figure 11).
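
The measurement operation and the translation from the binary to the alphabetical matrix can be sketched in Python for a single row. The text states only that the binary values depend on the probabilities |a_i|², |b_i|² and on the number of nucleic bases; the sampling scheme below (drawing exactly that many positions without replacement, each favoured in proportion to |b_i|²) is an assumption made for this illustration, as are the function names.

```python
import random

def measure_row(b_amplitudes, n_bases, rng):
    """Draw a binary row with exactly n_bases ones.
    Position i is favoured in proportion to |b_i|^2 (assumed scheme)."""
    remaining = list(range(len(b_amplitudes)))
    # A tiny offset keeps the weights valid even if every |b_i|^2 is zero.
    weights = [abs(b) ** 2 + 1e-12 for b in b_amplitudes]
    row = [0] * len(b_amplitudes)
    for _ in range(n_bases):
        pick = rng.choices(range(len(remaining)),
                           weights=[weights[i] for i in remaining])[0]
        row[remaining.pop(pick)] = 1
    return row

def to_alphabetic(binary_row, sequence):
    """Replace each 1 by the next base of the sequence, in order, and each 0 by a gap."""
    bases = iter(sequence)
    return "".join(next(bases) if bit else "-" for bit in binary_row)

rng = random.Random(0)
sequence = "AGCTA"
b_row = [0.9, 0.2, 0.8, 0.7, 0.3, 0.9, 0.6]   # hypothetical amplitudes b_i
binary = measure_row(b_row, len(sequence), rng)
print(binary, to_alphabetic(binary, sequence))
```

The amplitudes are left untouched by `measure_row`, which mirrors the point made above that the superposition is preserved for the next iterations.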

Table 2

Lookup Table of the Rotation Angle.

a	b	Reference bit value	Angle
>0	>0	1	+δθ
>0	>0	0	–δθ
>0	<0	1	–δθ
>0	<0	0	+δθ
<0	>0	1	–δθ
<0	>0	0	+δθ
<0	<0	1	+δθ
<0	<0	0	–δθ
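
The lookup in Table 2 can be sketched in Python as follows. The sign rule is taken from the table; applying the angle through a standard 2×2 rotation of the amplitude pair (a, b), the default value of `delta`, and the function names are assumptions made for this illustration. Amplitudes equal to zero are not covered by Table 2 and fall into the "different signs" branch here.

```python
import math

def rotation_angle(a, b, ref_bit, delta):
    """Signed angle from Table 2: +delta when (a and b share a sign) matches
    (reference bit is 1), and -delta otherwise."""
    same_sign = a * b > 0
    return delta if same_sign == (ref_bit == 1) else -delta

def interfere(a, b, ref_bit, delta=0.05 * math.pi):
    """Rotate the amplitude pair (a, b) toward the reference bit."""
    theta = rotation_angle(a, b, ref_bit, delta)
    c, s = math.cos(theta), math.sin(theta)
    return c * a - s * b, s * a + c * b

a = b = 1 / math.sqrt(2)
print(interfere(a, b, 1))   # |b| grows: measuring "1" becomes more likely
print(interfere(a, b, 0))   # |b| shrinks: measuring "0" becomes more likely
```

A rotation preserves a² + b², so the normalization condition on the amplitudes still holds after interference.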

Figure 6

Measure of Quantum Matrix.

Figure 7

Equivalence Between Binary and Alphabetic Matrices.

Figure 8

Quantum Interference.

Figure 9

Mutation Operators: Single Qubit Mutation.

Figure 10

Mutation Operators: Register of Qubit Mutation.

Figure 11

Mutation Operators: Bloc of Qubit Mutation.

5.3 Scoring Systems and Objective Functions

In the case of MSA, biologically speaking, there is no exact scoring scheme that can be used to find the optimal solution. Indeed, the optimum of a mathematical objective function does not always coincide with the biological optimum, because the mathematical score depends on a set of parameters such as the substitution matrix and gap penalties. For these reasons, several scoring schemes have been proposed in the literature [26]. The most popular scoring scheme is the sum of all pairwise alignment scores, i.e., the sum-of-pairs score (SPS) [Equation (5)], and its variant, the weighted sum-of-pairs score (WSPS) [Equation (6)]. The weighted variant was chosen for its ability to better assess divergent sequences. Its score is calculated using weight values provided by phylogenetic trees. The use of weights avoids penalizing sequences that are not too distant from each other. This objective function therefore requires the construction of a phylogenetic tree to determine the weights between sequences.

$$\mathrm{SPS} = \sum_{i=1}^{n-1} \sum_{j=i}^{n} sc(S_i, S_j). \tag{5}$$

$$\mathrm{WSPS} = \sum_{i=1}^{n-1} \sum_{j=i}^{n} W_{ij}\, sc(S_i, S_j). \tag{6}$$
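
A minimal Python sketch of the weighted sum-of-pairs score. The pairwise score `sc` (a flat match/mismatch/gap scheme rather than a real substitution matrix), its parameter values, and the weight matrix are assumptions for illustration; in the approach described here, the weights would come from a phylogenetic tree. The sketch sums over pairs with i < j, so a sequence is never scored against itself.

```python
from itertools import combinations

GAP = "-"

def sc(row_a, row_b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Hypothetical pairwise score of two aligned rows."""
    total = 0.0
    for x, y in zip(row_a, row_b):
        if x == GAP and y == GAP:
            continue  # a column holding two gaps is ignored
        if x == GAP or y == GAP:
            total += gap
        else:
            total += match if x == y else mismatch
    return total

def wsps(aligned, weights):
    """Weighted sum-of-pairs score; weights[i][j] plays the role of W_ij."""
    return sum(weights[i][j] * sc(aligned[i], aligned[j])
               for i, j in combinations(range(len(aligned)), 2))

alignment = ["A-GCTA", "ACAG-T", "-GCATA"]   # hypothetical alignment
uniform = [[1.0] * 3 for _ in range(3)]       # W_ij = 1 gives the plain SPS
tree_like = [[0.0, 0.5, 1.0],                 # hypothetical tree-derived weights
             [0.5, 0.0, 1.0],
             [1.0, 1.0, 0.0]]
print(wsps(alignment, uniform), wsps(alignment, tree_like))
```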

In the iterative refinement phase of our approach, we have used the WSPS score to evaluate each alignment. Unfortunately, WSPS depends on a set of parameters such as the substitution matrix and the gap penalty scheme. Consequently, it would be interesting to analyze the effect of other objective functions, such as the consistency-based objective function Coffee, column entropy scores, and consensus functions, on the performance of the quantum-inspired CSA.

6 Results, Analysis, and Discussion

The proposed approach is implemented in MATLAB 7.7 and was tested on an i3 PC with 4 GB of memory. To demonstrate the effectiveness of our approach, called QICSAL, we evaluated it on the BAliBASE 2 benchmark [3]. BAliBASE is divided into eight classes of reference sets, but we used only the first five classes in our experiment. BAliBASE tests can be viewed at the web site http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE2/ and can be downloaded from http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE2/index.html. Reference 1 contains the alignments of equidistant sequences; all the sequences are of similar length, with no large insertions or extensions. Reference 2 aligns up to three “orphan” sequences (<25% identical) from reference 1 with a family of at least 15 closely related sequences. Reference 3 consists of up to four subgroups, with <25% residue identity between sequences from different groups. The alignments are constructed by adding homologous family members to the more distantly related sequences in reference 1. Reference 4 is divided into two subcategories containing alignments of up to 20 sequences including N/C-terminal extensions (up to 400 residues) and insertions (up to 100 residues) [3, 4, 6, 9–11, 13–21, 23, 25–31, 33, 35, 36, 39, 40, 42–44]. Finally, the sequences of reference 5 contain large internal insertions. All reference alignments were refined manually by the BAliBASE authors.

To assess alignment accuracy, we use the BAli-score program, which computes two performance metrics. The first is the SPS of an alignment, which is the ratio between the number of correctly aligned residue pairs found in the test alignment and the total number of aligned residue pairs in the core blocks of the reference alignment [43]. The second is the column score (CS), defined as the number of correctly aligned columns found in the test alignment divided by the total number of columns in the core blocks of the reference alignment. The closer these scores are to 1.0, the better the alignment. Finally, a Friedman test was carried out to test the significance of the differences in accuracy between the methods. The results of this experiment are summarized in Table 3, where our method is compared with the most competitive and popular MSA tools in the literature: ClustalW 1.83 [42], MUSCLE [10], ProbCons [9], T-Coffee [29], SPEM [49], PRALINE [37], IMSA [5], and GRASPALINE [18]. We do not report the runtimes of the different programs because a fair comparison would require all the programs to be implemented and run under the same conditions. Most programs used in this experiment are written in C, which is very fast compared with MATLAB, the implementation language of our algorithm. Generally, the runtime depends on the length and the number of sequences to be aligned. Although MATLAB offers fast manipulation of multidimensional arrays, the use of the C language and parallel machines could significantly increase the speed of our algorithm.
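
The two metrics can be sketched in Python as follows. This is a simplified illustration, not the BAli-score program: it scores every column of the reference alignment rather than only the core blocks, and the function names and the toy alignments are assumptions.

```python
from itertools import combinations

GAP = "-"

def residue_columns(aligned):
    """For each column, record per row the index of the residue it holds
    (None for a gap)."""
    counters = [0] * len(aligned)
    columns = []
    for c in range(len(aligned[0])):
        column = []
        for r, row in enumerate(aligned):
            if row[c] == GAP:
                column.append(None)
            else:
                column.append(counters[r])
                counters[r] += 1
        columns.append(tuple(column))
    return columns

def aligned_pairs(aligned):
    """Set of residue pairs (row i, residue p, row j, residue q) sharing a column."""
    pairs = set()
    for column in residue_columns(aligned):
        for (i, p), (j, q) in combinations(enumerate(column), 2):
            if p is not None and q is not None:
                pairs.add((i, p, j, q))
    return pairs

def sps(test, reference):
    """Fraction of the reference residue pairs that the test alignment recovers."""
    ref_pairs = aligned_pairs(reference)
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)

def column_score(test, reference):
    """Fraction of the reference columns reproduced exactly in the test alignment."""
    test_columns = set(residue_columns(test))
    ref_columns = residue_columns(reference)
    return sum(col in test_columns for col in ref_columns) / len(ref_columns)

reference = ["AC-G", "ACTG"]   # toy reference alignment
test = ["ACG-", "ACTG"]        # toy test alignment of the same sequences
print(sps(test, reference), column_score(test, reference))
```

Because CS requires a whole column to be reproduced, it is the stricter of the two metrics, which is consistent with the CS values in Table 3 being lower than the corresponding SPS values.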

Table 3

BAli-score SPS and CS Results of MSA (in %).

Program	Ref. 1	(82)	Ref. 2	(23)	Ref. 3	(12)	Ref. 4	(12)	Ref. 5	(12)
	SPS	CS	SPS	CS	SPS	CS	SPS	CS	SPS	CS
SPEM	90.8	83.9	93.4	57.3	81.4	56.9	97.4	90.8	97.4	92.3
MUSCLE	90.3	84.7	64.4	60.9	82.2	61.9	91.8	74.8	98.1	92.1
ProbCons	90.0	83.9	94.0	62.6	82.3	63.1	90.9	73.6	98.1	91.7
T-Coffee	86.8	80.0	93.9	58.5	76.7	54.8	92.1	76.8	94.6	86.1
PRALINE	90.4	83.9	94.0	61.0	76.4	55.8	79.9	53.9	81.8	68.6
ClustalW	85.8	78.3	93.3	59.3	72.3	48.1	83.4	62.3	85.8	63.4
IMSA	83.4	65.3	92.1	41.3	78.6	36.2	73.0	31.9	83.6	56.9
GRASPALINE	88.7	82.0	90.1	52.2	73.6	44.4	94.3	79.3	93.5	85.4
QICSAL	88.3	81.8	90.1	56.8	74.0	45.9	92.5	78.3	93.6	79.0

The experimental results show that our approach gives solutions that are competitive with those of existing programs. In particular, the effectiveness of our algorithm is visible on references 3, 4, and 5 when compared with global alignment programs such as ClustalW or IMSA, which is due to the use of a global/local method in the construction phase. The hybridization of the Needleman–Wunsch (global alignment) and Smith–Waterman (local alignment) algorithms helps to handle references that contain sequences with large N/C-terminal extensions, for example, reference 4 (Table 3, column 5). As a result, our program provides better overall results than ClustalW and IMSA. Statistically, ClustalW and IMSA are not successful in the SPS Friedman test (Figure 12) compared with QICSAL. However, in the CS Friedman test (Figure 13), IMSA is the only unsuccessful program, whereas SPEM and ProbCons are the most powerful MSA tools in this experiment.

Figure 12

Friedman Test for SPS.

Figure 13

Friedman Test for CS.
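To make the ranking idea behind the Friedman test concrete, the following Python sketch ranks the programs within each reference set and computes the classical statistic, chi-square = 12 / (n k (k + 1)) * sum of squared rank sums - 3 n (k + 1), for n blocks and k programs. It is only a sketch under stated assumptions: it uses the five set-level SPS values printed in Table 3 as blocks (the test reported in the figures may have been run on different data), ranks the highest score as 1, gives tied scores their average rank, and applies no tie correction. The function names are hypothetical.

```python
def average_ranks(values):
    """Rank values so that the largest gets rank 1; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos+1..end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks

def friedman(scores):
    """scores maps each program to its per-block scores (one block per reference set).
    Returns (mean rank per program, Friedman statistic without tie correction)."""
    names = list(scores)
    k = len(names)
    n = len(scores[names[0]])
    rank_sums = [0.0] * k
    for b in range(n):
        block = [scores[name][b] for name in names]
        for j, r in enumerate(average_ranks(block)):
            rank_sums[j] += r
    statistic = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    mean_ranks = {name: rank_sums[j] / n for j, name in enumerate(names)}
    return mean_ranks, statistic

# SPS columns of Table 3 (Ref. 1 to Ref. 5), copied as printed.
sps_table = {
    "SPEM":       [90.8, 93.4, 81.4, 97.4, 97.4],
    "MUSCLE":     [90.3, 64.4, 82.2, 91.8, 98.1],
    "ProbCons":   [90.0, 94.0, 82.3, 90.9, 98.1],
    "T-Coffee":   [86.8, 93.9, 76.7, 92.1, 94.6],
    "PRALINE":    [90.4, 94.0, 76.4, 79.9, 81.8],
    "ClustalW":   [85.8, 93.3, 72.3, 83.4, 85.8],
    "IMSA":       [83.4, 92.1, 78.6, 73.0, 83.6],
    "GRASPALINE": [88.7, 90.1, 73.6, 94.3, 93.5],
    "QICSAL":     [88.3, 90.1, 74.0, 92.5, 93.6],
}

mean_ranks, statistic = friedman(sps_table)
for name, rank in sorted(mean_ranks.items(), key=lambda item: item[1]):
    print(f"{name:<11} mean rank {rank:.2f}")
print(f"Friedman statistic (no tie correction): {statistic:.2f}")
```

A lower mean rank indicates a program that scores higher across the reference sets. Turning the statistic into a p-value requires the chi-square distribution with k - 1 degrees of freedom, which is left out here to keep the sketch dependency-free.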

The use of the quantum CSA in the refinement phase has improved the solutions found in the first phase. The main advantages of quantum encoding are a smaller population size and greater diversity. Moreover, the history of the individuals is not lost over the iterations, thanks to the interference operation, which also helps to intensify the search around good solutions. Finally, the probabilistic nature of quantum measurement gives the CSA good diversity. The algorithm obtains results close to those of powerful programs such as T-Coffee, MUSCLE, and ProbCons; in addition, our results are clearly better than those of Clustal and of other iterative algorithms (IMSA, GRASPALINE).
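The roles of measurement and interference described above can be illustrated with a generic quantum-inspired sketch in Python. It is not the encoding used in this article: it assumes that each qubit is stored as an angle theta in [0, pi/2], that measurement yields 1 with probability sin(theta)^2, and that interference rotates each angle by a fixed step `delta` toward the corresponding bit of the best solution found so far. The names `measure`, `interfere`, and `delta` are hypothetical.

```python
import math
import random

def measure(angles, rng):
    """Collapse each qubit to a bit: 1 with probability sin(theta)^2."""
    return [1 if rng.random() < math.sin(t) ** 2 else 0 for t in angles]

def interfere(angles, best_bits, delta=0.05 * math.pi):
    """Rotate each qubit by at most delta toward the matching bit of the best solution."""
    new_angles = []
    for t, b in zip(angles, best_bits):
        target = math.pi / 2 if b == 1 else 0.0
        if t < target:
            t = min(t + delta, target)
        elif t > target:
            t = max(t - delta, target)
        new_angles.append(t)
    return new_angles

rng = random.Random(0)
angles = [math.pi / 4] * 8        # equal superposition: P(1) = 0.5 for every qubit
best = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical best solution found so far

# Three interference steps bias the individual toward the best solution
# while leaving some probability of sampling a different bit string.
for _ in range(3):
    angles = interfere(angles, best)
sample = measure(angles, rng)
print([round(math.sin(t) ** 2, 3) for t in angles])
print(sample)
```

Because the angles accumulate the effect of every past rotation, a single individual carries the search history, and each measurement can still return a different bit string, which is the source of diversity mentioned above. In this sketch, repeated interference eventually saturates the angles, at which point measurement always returns the best solution; practical variants usually bound the rotation to avoid this.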

7 Conclusion

In this article, we have presented a new algorithm to solve the MSA problem. The proposed algorithm is a hybrid progressive/iterative algorithm. The progressive phase constructs the initial solutions using a new randomized progressive strategy that builds diverse MSAs. The second phase is an improvement method based on the hybridization of CSA and QC. The optimization process applies the CSA dynamics (Lévy flights) enhanced by quantum operations such as interference, quantum mutation, and measurement. To evaluate our approach, we have used the BAliBASE benchmark. Compared with the most popular and competitive MSA tools, the obtained results are encouraging. The main strength of our approach is its ability to provide good results, which is due to the particular hybridization of a randomized progressive algorithm with the quantum-inspired CSA. The intrinsic parallelism of the algorithm could be exploited to enhance its performance significantly. Finally, the proposed framework provides an extensible platform for evaluating different objective functions and different startup MSA algorithms; studying these options is left as ongoing work.

Corresponding author: Widad Kartous, MISC Laboratory, Department of Computer Science and its applications, University of Constantine 2, Nouvelle Ville Ali Mendjeli - BP: 67A, 25000, Algeria, e-mail: kartous.widad@gmail.com

Bibliography

[1] S. F. Altschul, Gap costs for multiple sequence alignment, J. Theor. Biol. 138 (1989), 297–309. doi:10.1016/S0022-5193(89)80196-1

[2] L. A. Anbarasu, P. Narayanasamy and V. Sundararajan, Multiple molecular sequence alignment by island parallel genetic algorithm, Curr. Sci. 78 (2000), 858–863.

[3] A. Bahr, J. D. Thompson, J. C. Thierry and O. Poch, BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations, Nucleic Acids Res. 29 (2001), 323–326. doi:10.1093/nar/29.1.323

[4] A. D. Baxevanis and B. F. F. Ouellette (eds.), Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (3rd ed.), Wiley-Liss, 2005.

[5] V. Cutello, G. Nicosia, M. Pavone and I. Prizzi, Protein multiple sequence alignment by hybrid bio-inspired algorithms, Nucleic Acids Res. 39 (2010), 1980–1992. doi:10.1093/nar/gkq1052

[6] F. Da Silva, P. Pulido, M. Rodríguez, J. Gómez and M. Vega, AlineaGA – a genetic algorithm with local search optimization for multiple sequence alignment, Appl. Intell. 32 (2010), 164–172. doi:10.1007/s10489-009-0189-4

[7] M. Dhivya and M. Sundarambal, Cuckoo search for data gathering in wireless sensor networks, Int. J. Mobile Commun. 9 (2011), 642–656. doi:10.1504/IJMC.2011.042781

[8] M. Dhivya, M. Sundarambal and L. N. Anand, Energy efficient computation of data fusion in wireless sensor networks using cuckoo based particle approach (CBPA), Int. J. Commun. Network Syst. Sci. 4 (2011), 249–255. doi:10.4236/ijcns.2011.44030

[9] C. B. Do, M. S. P. Mahabhashyam, M. Brudno and S. Batzoglou, ProbCons: probabilistic consistency-based multiple sequence alignment, Genome Res. 15 (2005), 330–340. doi:10.1101/gr.2821705

[10] R. C. Edgar, Muscle: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res. 32 (2004), 1792–1797. doi:10.1093/nar/gkh340

[11] D. F. Feng and R. F. Doolittle, Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol. 25 (1987), 351–360. doi:10.1007/BF02603120

[12] P. Gardner, A. Wilm and S. Washietl, A benchmark of multiple sequence alignment programs upon structural RNAs, Nucleic Acids Res. 33 (2005), 2433–2439. doi:10.1093/nar/gki541

[13] N. Gershenfeld and I. L. Chuang, Quantum computing with molecules, Sci. Am. Mag. 1998, 1 page. doi:10.1038/scientificamerican0698-66

[14] O. Gotoh, Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments, J. Mol. Biol. 264 (1996), 823–838. doi:10.1006/jmbi.1996.0679

[15] K. H. Han and J. H. Kim, Quantum-inspired evolutionary algorithms with a new termination criterion, He gate, and two phase scheme, IEEE Trans. Evol. Comput. 8 (2004), 156–169. doi:10.1109/TEVC.2004.823467

[16] J. Kim, S. Pramanik and M. J. Chung, Multiple sequence alignment using simulated annealing, Comput. Appl. Biosci. 10 (1994), 419–426. doi:10.1093/bioinformatics/10.4.419

[17] A. Layeb, A novel quantum inspired cuckoo search for knapsack problems, Int. J. Bio-Inspired Comput. 3 (2011), 150–163. doi:10.1504/IJBIC.2011.042260

[18] A. Layeb, M. Selmane and M. Bencheikh Elhoucine, A new greedy randomized adaptive search procedure for multiple sequence alignment, Int. J. Bioinform. Res. Appl. 9 (2013), 323–335. doi:10.1504/IJBRA.2013.054695

[19] A. Layeb, S. Meshoul and M. Batouche, Multiple sequence alignment by quantum genetic algorithm, in: 7th International Workshop on Parallel and Distributed Scientific and Engineering Computing of the 20th International Parallel and Distributed Processing Symposium, pp. 1–8, 2006. doi:10.1109/IPDPS.2006.1639617

[20] D. J. Lipman, S. F. Altschul and J. D. Kececioglu, A tool for multiple sequence alignment, Proc. Natl. Acad. Sci. USA 86 (1989), 4412–4415. doi:10.1073/pnas.86.12.4412

[21] M. A. McClure, T. K. Vasi and W. M. Fitch, Comparative analysis of multiple protein-sequence alignment methods, Mol. Biol. Evol. 11 (1994), 571–592.

[22] S. Meshoul, A. Layeb and M. Batouche, A quantum evolutionary algorithm for effective multiple sequence alignment, in: Progress in Artificial Intelligence, pp. 260–271, Springer, Berlin, 2005. doi:10.1007/11595014_26

[23] S. Meshoul, A. Layeb and M. Batouche, Quantum genetic algorithm for multiple RNA structural alignment, modeling and simulation, in: Second Asia International Conference (AICMS ′08), pp. 873–878, 2008.

[24] A. Narayanan and M. Moore, Quantum-inspired genetic algorithms, Proc. IEEE Int. Conf. Evol. Comput. (1996), 61–66.

[25] S. B. Needleman and C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol. 48 (1970), 443–453. doi:10.1016/0022-2836(70)90057-4

[26] K. D. Nguyen and P. Yi, A reliable metric for quantifying multiple sequence alignment, in: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, pp. 788–795, 2007. doi:10.1109/BIBE.2007.4375650

[27] K. D. Nguyen and P. Yi, Multiple sequence alignment based on dynamic weighted guidance tree, Int. J. Bioinform. Res. Appl. 7 (2011), 168–182. doi:10.1504/IJBRA.2011.040095

[28] H. D. Nguyen, I. Yoshihara, K. Yamamori and M. Yasunaga, Aligning multiple protein sequences by parallel hybrid genetic algorithm, Genome Inform. 13 (2002), 123–132.

[29] C. Notredame, Recent progresses in MSA: a survey, Pharmacogenomics 3 (2002), 1–14. doi:10.1517/14622416.3.1.131

[30] C. Notredame and D. G. Higgins, SAGA: sequence alignment by genetic algorithm, Nucleic Acids Res. 24 (1996), 1515–1524. doi:10.1093/nar/24.8.1515

[31] C. Notredame, D. G. Higgins and J. Heringa, T-Coffee: a novel method for fast and accurate multiple sequence alignment, J. Mol. Biol. 302 (2000), 205–217. doi:10.1006/jmbi.2000.4042

[32] I. Pavlyukevich, Lévy flight, non-local search and simulated annealing, J. Comput. Phys. 226 (2007), 1830–1844. doi:10.1016/j.jcp.2007.06.008

[33] R. B. Payne, M. D. Sorenson and K. Klitz, The cuckoos, Oxford University Press, Oxford, 2005.

[34] K. Perumal, J. M. Ungati, G. Kumar, N. Jain, R. Gaurav and P. R. Srivastava, Test data generation: a hybrid approach using cuckoo and tabu search, in: Swarm, Evolutionary, and Memetic Computing (SEMCCO2011), Lect. Notes Comput. Sci. 7077, pp. 46–54, 2011.

[35] T. Riaz, Y. Wang and K. B. Li, Multiple sequence alignment using tabu search, in: Proceedings of the 2nd Conference on Asia-Pacific Bioinformatics, pp. 223–232, Australian Computer Society, New Zealand, 2004.

[36] C. Shyu, L. Sheneman and J. A. Foster, Multiple sequence alignment with evolutionary computation, Genet. Prog. Evol. Mach. 5 (2004), 121–144. doi:10.1023/B:GENP.0000023684.05565.78

[37] V. A. Simossis and J. Heringa, PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information, Nucleic Acids Res. 33 (2005), 289–294. doi:10.1093/nar/gki390

[38] P. R. Srivastava, M. Chis, S. Deb and X. S. Yang, An efficient optimization algorithm for structural software testing, Int. J. Artif. Intell. 9 (2012), 68–77.

[39] J. Stoye, S. W. Perrey and A. W. M. Dress, Improving the divide-and-conquer approach to sum-of-pairs multiple sequence alignment, Appl. Math. Lett. 10 (1997), 67–73. doi:10.1016/S0893-9659(97)00013-X

[40] A. R. Subramanian, J. W. Menkhoff, M. Kaufmann and B. Morgenstern, DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment, BMC Bioinform. 6 (2005), 66.

[41] L. H. Tein and R. Ramli, Recent advancements of nurse scheduling models and a potential path, in: Proceedings of the 6th IMT-GT Conference on Mathematics, Statistics and its Applications (ICMSA 2010), pp. 395–409, 2010.

[42] J. D. Thompson, D. G. Higgins and T. J. Gibson, Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res. 22 (1994), 4673–4680. doi:10.1093/nar/22.22.4673

[43] J. D. Thompson, F. Plewniak and O. Poch, A comprehensive comparison of multiple sequence alignment programs, Nucleic Acids Res. 27 (1999), 2682–2690. doi:10.1093/nar/27.13.2682

[44] J. D. Thompson, F. Plewniak and O. Poch, BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs, Bioinformatics 15 (1999), 87–88. doi:10.1093/bioinformatics/15.1.87

[45] I. M. Wallace, O. O’Sullivan and D. G. Higgins, Evaluation of iterative alignment algorithms for multiple alignment, Bioinformatics 21 (2005), 1408–1414. doi:10.1093/bioinformatics/bti159

[46] X. S. Yang and S. Deb, Cuckoo search via Lévy flights, in: Proceedings of World Congress on Nature and Biologically Inspired Computing, pp. 210–214, India, 2009. doi:10.1109/NABIC.2009.5393690

[47] X. S. Yang and S. Deb, Engineering optimisation by cuckoo search, Int. J. Math. Modell. Numer. Optimisation 1 (2010), 330–343. doi:10.1504/IJMMNO.2010.035430

[48] C. Zhang and A. K. Wong, A genetic algorithm for multiple molecular sequence alignment, Comput. Appl. Biosci. 13 (1997), 565–581. doi:10.1093/bioinformatics/13.6.565

[49] H. Zhou and Y. Zhou, SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures, Bioinformatics 21 (2005), 3615–3621. doi:10.1093/bioinformatics/bti582

Published Online: 2014-6-17

Published in Print: 2014-9-1

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
