Multiple Sequence Alignments

11. Multiple Sequence Alignments#

11.1. Introduction#

While pairwise alignments enable comparisons between two sequences, many biological analyses require a more comprehensive approach: the alignment of multiple sequences simultaneously. Multiple sequence alignments (MSAs) offer a holistic view of sequence similarities and differences, illuminating functional and evolutionary relationships across diverse genomes and proteomes.

11.1.1. Exponential Time Complexity#

In principle, one can use dynamic programming to form MSAs. We can archive such systems using the same ideas as for the pairwise alignments, but just expanding the dynamic programming matrix into a tensor of the same dimension as the number of sequences. However, a major drawback is its exponential time complexity. Aligning \(T\) sequences each of length \(N\) can result in a time complexity of \(O(N^T)\), making it impractical for datasets with more than a few sequences.

11.1.2. Iterative Approaches as a Practical Alternative#

Given the inefficiencies of dynamic programming for MSAs, bioinformaticians have turned to heuristics that mimic the behaviour of full dynamic programming. Particularly iterative approaches, such as progressive alignment. These methods simplify the alignment process by breaking it down into manageable steps. Progressive alignment, for instance, first evaluate the pairwise similarities between the sequences by computing their pairwise alignments, then iteratively aligns sequences according to the witch sequences that are most similar. This reduces the computational burden and allows for the processing of large datasets in a reasonable time frame.

11.2. Progressive Alignments#

While dynamic programming offers an optimal solution for pairwise alignments, its inefficiency for MSAs necessitates a balance between accuracy and computational efficiency. Iterative and heuristic methods like progressive alignment and consistency-based algorithms strike this balance, providing near-optimal solutions in significantly less time.

A key method for constructing MSAs is progressive alignment, a stepwise approach that builds upon pairwise alignments to create a broader, more nuanced comparison. The progressive alignment method begins by generating a distance matrix, which quantifies the pairwise similarity between all sequences in the dataset.

Progressive alignment iteratively aligning sequences in order of their proximity on the tree. This process begins with the closest pair of sequences, which are aligned directly. Subsequent sequences are added progressively, with each new sequence being aligned to the growing consensus. This stepwise process creates an alignment that reflects both sequence similarities and evolutionary relationships.

The progressive alignment method offers several advantages. It efficiently handles large datasets, making it well-suited for genomics and proteomics research. By utilizing an evolutionary framework, it provides deeper insights into the conserved and divergent features of sequences, aiding in the identification of functionally important regions and evolutionary patterns.

In this chapter, we delve into the mechanics of progressive alignment, exploring its implementation and applications in bioinformatics. By understanding this approach, we gain valuable insights into the biological relationships and evolutionary trajectories that shape our world.

11.2.1. Progressive Alignment Steps#

The process builds alignments in a stepwise manner, starting with the most similar sequences and gradually incorporating others. Here’s a detailed look at how to create a progressive alignment. Let’s assume you start with \(T\) sequences, \(C=\{A_1, \ldots, A_T\}\).

Compute Pairwise Alignments: Initially, calculate all pairwise alignments between each pair of sequences in \(C\) using a suitable algorithm such as Needleman-Wunsch or Smith-Waterman. This provides a matrix of scores or distances that indicate how similar or different each pair of sequences is.
Use the pairwise distances to identify the two sequences in \(C\), \(A_i\) and \(A_j\) that are closest (those with the highest similarity score or lowest distance), and form an alignment of those two sequences. Remove \(A_i\) and \(A_j\) from \(C\). We call the alignment of those two sequences \(B\).
Add \(B\) to \(C\). Set the distance from \(B\) to each of the other sequences in \(C\) to the average of their distance from \(A_i\) and \(A_j\).
If there are more sequences than one in \(C\), then go to step 2.

11.2.2. Progressive Alignment Steps#

Input sequences

IAMAPEPTIDE  
IAMPEPTIDE
IAMPEPPED
IAMAPEPPERD

These are the sequences we want to align.

Seq1 and Seq2

IAMAPEPTIDE
IAM-PEPTIDE

We form a pairwise alignment of those two most similar sequences. We then pool the resulting sequences with the other two sequences.

Seq3 and Seq4.

IAM-PEPPE-D
IAMAPEPPERD

These sequences were the most similar among remaining ones.

(Seq1,Seq2) and (Seq3,Seq4).

IAMAPEPTID-E
IAM-PEPTID-E
IAM-PEPP-E-D
IAMAPEPP-ERD

There are two generalized sequences left. We align those two, and we are then done.

11.2.3. Applications of Progressive Alignments#

Modern bioinformatics tools, such as MUSCLE [Edg04], MAFFT [KMKM02], T-Coffee [NHH00], and Kalign [LS05] incorporate iterative refinement as a core feature, allowing researchers to achieve high-quality alignments with minimal manual intervention. These tools use sophisticated algorithms to iteratively refine alignments, often integrating additional information such as structural data or evolutionary constraints. This integration enhances the overall robustness of the alignment process, making these tools highly effective for a wide range of applications in genomics, proteomics, and evolutionary biology.

11.4. References#

[Edg04] (1,2)

Robert C Edgar. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research, 32(5):1792–1797, 2004.

[KMKM02] (1,2)

Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma, and Takashi Miyata. Mafft: a novel method for rapid multiple sequence alignment based on fast fourier transform. Nucleic acids research, 30(14):3059–3066, 2002.

[LS05]

Timo Lassmann and Erik LL Sonnhammer. Kalign–an accurate and fast multiple sequence alignment algorithm. BMC bioinformatics, 6:1–9, 2005.

[NHH00]

Cédric Notredame, Desmond G Higgins, and Jaap Heringa. T-coffee: a novel method for fast and accurate multiple sequence alignment. Journal of molecular biology, 302(1):205–217, 2000.

Multiple Sequence Alignments

Contents

11. Multiple Sequence Alignments#

11.1. Introduction#

11.1.1. Exponential Time Complexity#

11.1.2. Iterative Approaches as a Practical Alternative#

11.2. Progressive Alignments#

11.2.1. Progressive Alignment Steps#

11.2.2. Progressive Alignment Steps#

11.2.3. Applications of Progressive Alignments#

11.3. Iterative Refinement of MSAs#

11.3.1. The Need for Iterative Refinement#

11.3.2. Iterative Refinement Techniques#

11.3.3. Applications and Advantages of Iterative Refinement#

11.4. References#