6. Code: Local Pairwise alignment#

To facilitate the the reasoning in the subsequent cells, we first we define a couple of functions that we will need later, for formating and printing alignments. It is not important that you understand what these functions do, for now.

Hide code cell source
import numpy as np

# Print 2 sequences on top of each other
def print_alignment(seqA,seqB):
    print(seqA)
    print(seqB)

# Print a dynamic programming score matrix
# together with its sequences
def print_dynamic(seqA,seqB,dpm):
    seqA,seqB = "-" + seqA, "-" + seqB
    m,n = len(seqA),len(seqB)
    print('{:^5}'.format(" "), end=""),
    for j in range(n):
        print('{:^5}'.format(seqB[j]), end="")
    print()
    for i in range(m):
        print ('{:^5}'.format(seqA[i]), end="")
        for j in range(n):
            print ('{:5.1f}'.format(dpm[i,j]), end="")
        print()
    print()

# Format an alignment by inserting gaps in sequences
def format_alignment(seqA,seqB,S,trace):
    print("Best score: " + str(S[-1,-1]))
    outA,outB = "",""
    i,j = len(seqA),len(seqB)
    while i>0 or j>0:
        di,dj = trace[i,j]
        i += int(di)
        j += int(dj)
        if di == 0:
            outA = "-" + outA
        else:
            outA = seqA[i] + outA
        if dj == 0:
            outB = "-" + outB
        else:
            outB = seqB[j] + outB
    return outA,outB

6.1. Scoring system for DNA sequences#

We setup the scoring system we need for the alignment of DNA sequences. Here we use a score system where gaps score -2 and miss matches are scored -1 and matches get a score of 3.

def gap_penalty():
    return -2.0

def match_score(letterA,letterB):
    if letterA == '-' or letterB == '-':
        return gap_penalty()
    elif letterA == letterB:
        return 3.0
    else:
        return -1.0

6.2. Local alignments using Smith-Waterman#

Smith-Waterman alignments are similar to the ones of Needleman-Wunsch. The difference sits in the initiation of the dynamic programming matrix, and how we trace the most optimal alignment. We will implement these difference by redefining some functions.

First the initiation of the dynamic programming matrix \(S\): \(S_{i0}=0, \forall i,\) \(S_{0j}=0, \forall j\)

# Initiating dynamic programming matrices, S and trace
def initiate_local_dp(m,n):
    S = np.zeros((m,n))
    trace = np.zeros((m,n,2))
    S[0,0] = 0.
    trace[0,0,:] = (0.,0.)
    for i in range(1,m):
        S[i,0] = 0
        trace[i,0,:] =(0,0)
    for j in range(1,n):
        S[0,j] = 0
        trace[0,j,:] =(0,0)
    return S,trace

Subsequently, the rest of \(S\) is filled as: \(S_{ij}=\max\left\{ \begin{array}{ll} S_{i-1,j-1} & +d(a_i,b_j)\\ S_{i-1,j} & +d(a_i,-)\\ S_{i,j-1} & +d(-,b_j)\\ 0 \end{array} \right.\)

def local_align(seqA,seqB,print_dynamic_matrix = False):
    # Initiating variables
    m, n = len(seqA)+1, len(seqB)+1
    S,trace = initiate_local_dp(m,n)
    # Fill in the rest of the dynamic programming matrix
    for i in range(1,m):
        for j in range(1,n):
            match = S[i-1][j-1] + match_score(seqA[i-1],seqB[j-1])
            delete = S[i-1,j] + match_score(seqA[i-1],'-') 
            insert = S[i,j-1] + match_score('-',seqB[j-1]) 
            S[i,j] = max(match, delete, insert, 0.)
            if match >= max(delete,insert,0.):
                trace[i,j,:] = (-1,-1.)
            elif delete >= max(insert,0.):
                trace[i,j,:] = (-1.,0.)
            elif insert >= 0.:
                trace[i,j,:] = (0.,-1.)
            else:
                trace[i,j,:] = (0.,0.)
    if print_dynamic_matrix:
        print_dynamic(seqA,seqB,S)
    return format_local_alignment(seqA,seqB,S,trace)

We also need a slightly different method to trace the the alignment. It is not essential you understand this code.

Hide code cell source
def format_local_alignment(seqA,seqB,S,trace):
    outA,outB = "",""
    i,j = np.unravel_index(S.argmax(),S.shape)
    print("Best score: " + str(S[i,j]))
    while min(trace[i,j])<0:
        di,dj = trace[i,j]
        i += int(di)
        j += int(dj)
        if di == 0:
            outA = "-" + outA
        else:
            outA = seqA[i] + outA
        if dj == 0:
            outB = "-" + outB
        else:
            outB = seqB[j] + outB
    return outA,outB
seqA,seqB = local_align("GATTA","GCTAC",True)
print_alignment(seqA,seqB)
       -    G    C    T    A    C  
  -    0.0  0.0  0.0  0.0  0.0  0.0
  G    0.0  3.0  1.0  0.0  0.0  0.0
  A    0.0  1.0  2.0  0.0  3.0  1.0
  T    0.0  0.0  0.0  5.0  3.0  2.0
  T    0.0  0.0  0.0  3.0  4.0  2.0
  A    0.0  0.0  0.0  1.0  6.0  4.0

Best score: 6.0
GATTA
G-CTA
seqA,seqB = local_align("GCGATTA","GCTTAC",True)
print_alignment(seqA,seqB)
       -    G    C    T    T    A    C  
  -    0.0  0.0  0.0  0.0  0.0  0.0  0.0
  G    0.0  3.0  1.0  0.0  0.0  0.0  0.0
  C    0.0  1.0  6.0  4.0  2.0  0.0  3.0
  G    0.0  3.0  4.0  5.0  3.0  1.0  1.0
  A    0.0  1.0  2.0  3.0  4.0  6.0  4.0
  T    0.0  0.0  0.0  5.0  6.0  4.0  5.0
  T    0.0  0.0  0.0  3.0  8.0  6.0  4.0
  A    0.0  0.0  0.0  1.0  6.0 11.0  9.0

Best score: 11.0
GATTA
GCTTA
seqA,seqB = local_align("CTATCTCGCTATCCA","CTACGCTATTTCA",True)
print_alignment(seqA,seqB)
       -    C    T    A    C    G    C    T    A    T    T    T    C    A  
  -    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
  C    0.0  3.0  1.0  0.0  3.0  1.0  3.0  1.0  0.0  0.0  0.0  0.0  3.0  1.0
  T    0.0  1.0  6.0  4.0  2.0  2.0  1.0  6.0  4.0  3.0  3.0  3.0  1.0  2.0
  A    0.0  0.0  4.0  9.0  7.0  5.0  3.0  4.0  9.0  7.0  5.0  3.0  2.0  4.0
  T    0.0  0.0  3.0  7.0  8.0  6.0  4.0  6.0  7.0 12.0 10.0  8.0  6.0  4.0
  C    0.0  3.0  1.0  5.0 10.0  8.0  9.0  7.0  5.0 10.0 11.0  9.0 11.0  9.0
  T    0.0  1.0  6.0  4.0  8.0  9.0  7.0 12.0 10.0  8.0 13.0 14.0 12.0 10.0
  C    0.0  3.0  4.0  5.0  7.0  7.0 12.0 10.0 11.0  9.0 11.0 12.0 17.0 15.0
  G    0.0  1.0  2.0  3.0  5.0 10.0 10.0 11.0  9.0 10.0  9.0 10.0 15.0 16.0
  C    0.0  3.0  1.0  1.0  6.0  8.0 13.0 11.0 10.0  8.0  9.0  8.0 13.0 14.0
  T    0.0  1.0  6.0  4.0  4.0  6.0 11.0 16.0 14.0 13.0 11.0 12.0 11.0 12.0
  A    0.0  0.0  4.0  9.0  7.0  5.0  9.0 14.0 19.0 17.0 15.0 13.0 11.0 14.0
  T    0.0  0.0  3.0  7.0  8.0  6.0  7.0 12.0 17.0 22.0 20.0 18.0 16.0 14.0
  C    0.0  3.0  1.0  5.0 10.0  8.0  9.0 10.0 15.0 20.0 21.0 19.0 21.0 19.0
  C    0.0  3.0  2.0  3.0  8.0  9.0 11.0  9.0 13.0 18.0 19.0 20.0 22.0 20.0
  A    0.0  1.0  2.0  5.0  6.0  7.0  9.0 10.0 12.0 16.0 17.0 18.0 20.0 25.0

Best score: 25.0
CT-CGCTA-TCCA
CTACGCTATTTCA