4. Code: Global pairwise alignments#

We will now to make use of our definitions of a Needleman-Wunsch alignment to see how the algorithm transforms to actual code. The sections below will walk you through how this is done.

To facilitate the the reasoning in the subsequent cells, we first we define a couple of service functions that we will need later, for formating and printing alignments. It is not important that you understand what these functions do, for now.

Hide code cell source
import numpy as np

# Print 2 sequences on top of each other
def print_alignment(seqA,seqB):
    print(seqA)
    print(seqB)

# Print a dynamic programming score matrix
# together with its sequences
def print_dynamic(seqA,seqB,dpm,trace):
    seqA,seqB = "-" + seqA, "-" + seqB
    m,n = len(seqA),len(seqB)
    print('{:^5}'.format(" "), end=""),
    for j in range(n):
        print('{:^5}'.format(seqB[j]), end="")
    print()
    for i in range(m):
        print ('{:^5}'.format(seqA[i]), end="")
        for j in range(n):
            print ('{:5.1f}'.format(dpm[i,j]), end="")
        print()
    print()

# Format an alignment by inserting gaps in sequences
def format_alignment(seqA,seqB,S,trace):
    print("Best score: " + str(S[-1,-1]))
    outA,outB = "",""
    i,j = len(seqA),len(seqB)
    while i>0 or j>0:
        di,dj = trace[i,j]
        i += int(di)
        j += int(dj)
        if di == 0:
            outA = "-" + outA
        else:
            outA = seqA[i] + outA
        if dj == 0:
            outB = "-" + outB
        else:
            outB = seqB[j] + outB
    return outA,outB

4.1. Scoring system for DNA sequences#

We setup the scoring system we need for the alignment of DNA sequences. Here we use a score system where gaps score -2 and miss matches are scored -1 and matches get a score of 3. You should absolutly try to change the scoring function at some point to see how the values affects the alignments.

def gap_penalty():
    return -2.0

def match_score(letterA,letterB):
    if letterA == '-' or letterB == '-':
        return gap_penalty()
    elif letterA == letterB:
        return 3.0
    else:
        return -1.0

4.2. Global alignments by Needleman-Wunsch#

Here follows an implementation of the Needleman-Wunsch pairwise alignment method. We want to align two sequences \(a=a_1,\ldots,a_{M}\) and \(b=b_1,\ldots,b_{N}\).

As said in last chapter, the dynamic programming matrix \(S\) is initiated as: \(S_{i0}=g \cdot i, \forall i,\) \(S_{0j}=g \cdot j, \forall j\)

This translates easy into two for loops filling in the upper row and the leftmost column.

Here we initiate pointers in the form of a trace matrix. Each element in the trace matrix contains the difference in index between the current cell and the optimal step that lead to the current cell.

# Initiating dynamic programming matrices, S and trace
def initiate_global_dp(m,n):
    S = np.zeros((m,n))       # An m*n matrix, initiated with 0's
    trace = np.zeros((m,n,2)) # An m*n matrix, initiated with (0,0)'s
    S[0,0] = 0.
    trace[0,0,:] = (0.,0.)
    for i in range(1,m):
        S[i,0] = i * gap_penalty()
        trace[i,0,:] =(-1,0)
    for j in range(1,n):
        S[0,j] = j * gap_penalty()
        trace[0,j,:] =(0,-1)
    return S,trace

Subsequently, the rest of \(S\) is filled as: \(S_{ij}=\max\left\{ \begin{array}{ll} S_{i-1,j-1} & +d(a_i,b_j)\\ S_{i-1,j} & +d(a_i,-)\\ S_{i,j-1} & +d(-,b_j) \end{array} \right.\)

This recursion is easily transformed into for-loops. We are free to select the order the matrix is filled in so we select to fill in row per row, i.e. rows become the inner loop, the columns the outer loop.

Again we keep track of the move that lead to a certain position by filling in the trace matrix.

def global_align(seqA,seqB,print_dynamic_matrix = False):
    # Initiating variables
    m, n = len(seqA)+1, len(seqB)+1
    S,trace = initiate_global_dp(m,n)
    # Fill in the rest of the dynamic programming matrix
    for i in range(1,m):
        for j in range(1,n):
            # Note the subtraction of 1 for the sequence position
            # In python sequences are indexed from 0 to len-1
            match = S[i-1,j-1] + match_score(seqA[i-1],seqB[j-1]) 
            delete = S[i-1,j] + match_score(seqA[i-1],'-') 
            insert = S[i,j-1] + match_score('-',seqB[j-1]) 
            S[i,j] = max(match,delete,insert)
            if match >= max(insert,delete):
                trace[i,j,:] = (-1,-1)
            elif delete >= insert:
                trace[i,j,:] = (-1,0)
            else:
                trace[i,j,:] = (0,-1)
    if print_dynamic_matrix:
        print_dynamic(seqA,seqB,S,trace)
    return format_alignment(seqA,seqB,S,trace)

Now everything is set. We can try the code for any of our favorite sequences. One can toggle the printout of the dynamic programming matrix by a boolean flag as a third argument.

seqA,seqB = global_align("GATTA","GCTAC",True)
print_alignment(seqA,seqB)
       -    G    C    T    A    C  
  -    0.0 -2.0 -4.0 -6.0 -8.0-10.0
  G   -2.0  3.0  1.0 -1.0 -3.0 -5.0
  A   -4.0  1.0  2.0  0.0  2.0  0.0
  T   -6.0 -1.0  0.0  5.0  3.0  1.0
  T   -8.0 -3.0 -2.0  3.0  4.0  2.0
  A  -10.0 -5.0 -4.0  1.0  6.0  4.0

Best score: 4.0
GATTA-
G-CTAC

I add a couple of extra alignments, check them manually as an excercise before the exam. Also try to run a couple of them yourself.

seqA,seqB = global_align("TGCATTA","GCATTAC",True)
print_alignment(seqA,seqB)
       -    G    C    A    T    T    A    C  
  -    0.0 -2.0 -4.0 -6.0 -8.0-10.0-12.0-14.0
  T   -2.0 -1.0 -3.0 -5.0 -3.0 -5.0 -7.0 -9.0
  G   -4.0  1.0 -1.0 -3.0 -5.0 -4.0 -6.0 -8.0
  C   -6.0 -1.0  4.0  2.0  0.0 -2.0 -4.0 -3.0
  A   -8.0 -3.0  2.0  7.0  5.0  3.0  1.0 -1.0
  T  -10.0 -5.0  0.0  5.0 10.0  8.0  6.0  4.0
  T  -12.0 -7.0 -2.0  3.0  8.0 13.0 11.0  9.0
  A  -14.0 -9.0 -4.0  1.0  6.0 11.0 16.0 14.0

Best score: 14.0
TGCATTA-
-GCATTAC
seqA,seqB = global_align("CTATCTCGCTATCCA","CTACGCTATTTCA",True)
print_alignment(seqA,seqB)
       -    C    T    A    C    G    C    T    A    T    T    T    C    A  
  -    0.0 -2.0 -4.0 -6.0 -8.0-10.0-12.0-14.0-16.0-18.0-20.0-22.0-24.0-26.0
  C   -2.0  3.0  1.0 -1.0 -3.0 -5.0 -7.0 -9.0-11.0-13.0-15.0-17.0-19.0-21.0
  T   -4.0  1.0  6.0  4.0  2.0  0.0 -2.0 -4.0 -6.0 -8.0-10.0-12.0-14.0-16.0
  A   -6.0 -1.0  4.0  9.0  7.0  5.0  3.0  1.0 -1.0 -3.0 -5.0 -7.0 -9.0-11.0
  T   -8.0 -3.0  2.0  7.0  8.0  6.0  4.0  6.0  4.0  2.0  0.0 -2.0 -4.0 -6.0
  C  -10.0 -5.0  0.0  5.0 10.0  8.0  9.0  7.0  5.0  3.0  1.0 -1.0  1.0 -1.0
  T  -12.0 -7.0 -2.0  3.0  8.0  9.0  7.0 12.0 10.0  8.0  6.0  4.0  2.0  0.0
  C  -14.0 -9.0 -4.0  1.0  6.0  7.0 12.0 10.0 11.0  9.0  7.0  5.0  7.0  5.0
  G  -16.0-11.0 -6.0 -1.0  4.0  9.0 10.0 11.0  9.0 10.0  8.0  6.0  5.0  6.0
  C  -18.0-13.0 -8.0 -3.0  2.0  7.0 12.0 10.0 10.0  8.0  9.0  7.0  9.0  7.0
  T  -20.0-15.0-10.0 -5.0  0.0  5.0 10.0 15.0 13.0 13.0 11.0 12.0 10.0  8.0
  A  -22.0-17.0-12.0 -7.0 -2.0  3.0  8.0 13.0 18.0 16.0 14.0 12.0 11.0 13.0
  T  -24.0-19.0-14.0 -9.0 -4.0  1.0  6.0 11.0 16.0 21.0 19.0 17.0 15.0 13.0
  C  -26.0-21.0-16.0-11.0 -6.0 -1.0  4.0  9.0 14.0 19.0 20.0 18.0 20.0 18.0
  C  -28.0-23.0-18.0-13.0 -8.0 -3.0  2.0  7.0 12.0 17.0 18.0 19.0 21.0 19.0
  A  -30.0-25.0-20.0-15.0-10.0 -5.0  0.0  5.0 10.0 15.0 16.0 17.0 19.0 24.0

Best score: 24.0
CTATCTCGCTA-TCCA
CTA---CGCTATTTCA