Code: Semi-global pairwise alignment

8. Code: Semi-global pairwise alignment#

To facilitate the the reasoning in the subsequent cells, we first we define a couple of functions that we will need later, for formating and printing alignments. It is not important that you understand what these functions do, for now.

Hide code cell source
import numpy as np

# Print 2 sequences on top of each other
def print_alignment(seqA,seqB):
    print(seqA)
    print(seqB)

# Print a dynamic programming score matrix
# together with its sequences
def print_dynamic(seqA,seqB,dpm):
    seqA,seqB = "-" + seqA, "-" + seqB
    m,n = len(seqA),len(seqB)
    print('{:^5}'.format(" "), end=""),
    for j in range(n):
        print('{:^5}'.format(seqB[j]), end="")
    print()
    for i in range(m):
        print ('{:^5}'.format(seqA[i]), end="")
        for j in range(n):
            print ('{:5.1f}'.format(dpm[i,j]), end="")
        print()
    print()

# Format an alignment by inserting gaps in sequences
def format_alignment(seqA,seqB,S,trace):
    print("Best score: " + str(S[-1,-1]))
    outA,outB = "",""
    i,j = len(seqA),len(seqB)
    while i>0 or j>0:
        di,dj = trace[i,j]
        i += int(di)
        j += int(dj)
        if di == 0:
            outA = "-" + outA
        else:
            outA = seqA[i] + outA
        if dj == 0:
            outB = "-" + outB
        else:
            outB = seqB[j] + outB
    return outA,outB

8.1. Scoring system for DNA sequences#

We setup the scoring system we need for the alignment of DNA sequences. Here we use a score system where gaps score -2 and miss matches are scored -1 and matches get a score of 3.

def gap_penalty():
    return -2.0

def match_score(letterA,letterB):
    if letterA == '-' or letterB == '-':
        return gap_penalty()
    elif letterA == letterB:
        return 3.0
    else:
        return -1.0

8.2. Semi-global alignments#

Yet another type of alignments are the semi-global alignments. Here we will able to reuse the code for the global score, but we will initiate the dynamic programming matrix as for the local alignment. We also need to redefine how to read the alignment.

First we initiate the dynamic programming matrix \(S\): \(S_{i0}=0, \forall i,\) \(S_{0j}=0, \forall j\)

# Initiating dynamic programming matrices, S and trace
def initiate_semiglobal_dp(m,n):
    S = np.zeros((m,n))
    trace = np.zeros((m,n,2))
    S[0,0] = 0.
    trace[0,0,:] = (0.,0.)
    for i in range(1,m):
        S[i,0] = 0
        trace[i,0,:] =(-1,0)
    for j in range(1,n):
        S[0,j] = 0
        trace[0,j,:] =(0,-1)
    return S,trace
def format_semiglobal_alignment(seqA,seqB,S,trace):
    outA,outB = "",""
    m,n = S.shape[0]-1, S.shape[1]-1
    i,j = S[:,n].argmax(), S[m,:].argmax()
    if S[i,n] > S[m,j]:
        print("Best score: " + str(S[i,n]))
        j = n
        # point the alignmnént from (m,n) to here
        for ix in range(i+1,m+1):
            print(ix,n)
            trace[ix,n,:] = (-1,0)
    else:
        print("Best score: " + str(S[m,j]))
        i = m    
        # point the alignmnént from (m,n) to here
        for ix in range(j+1,n+1):
            print(m,ix)
            trace[m,ix,:] = (0,-1)
    i,j = m,n
    while min(trace[i,j])<0:
        di,dj = trace[i,j]
        i += int(di)
        j += int(dj)
        if di == 0:
            outA = "-" + outA
        else:
            outA = seqA[i] + outA
        if dj == 0:
            outB = "-" + outB
        else:
            outB = seqB[j] + outB
    return outA,outB

Subsequently, the rest of \(S\) is filled as:

\( S_{ij}=\max\left\{ \begin{array}{ll} S_{i-1,j-1} & +d(a_i,b_j)\\ S_{i-1,j} & +d(a_i,-)\\ S_{i,j-1} & +d(-,b_j) \end{array} \right. \)

def semiglobal_align(seqA,seqB,print_dynamic_matrix = False):
    # Initiating variables
    m, n = len(seqA)+1, len(seqB)+1
    S,trace = initiate_semiglobal_dp(m,n)
    # Fill in the rest of the dynamic programming matrix
    for i in range(1,m):
        for j in range(1,n):
            match = S[i-1][j-1] + match_score(seqA[i-1],seqB[j-1])
            delete = S[i-1,j] + match_score(seqA[i-1],'-') 
            insert = S[i,j-1] + match_score('-',seqB[j-1]) 
            S[i,j] = max(match, delete, insert)
            if match >= max(delete,insert):
                trace[i,j,:] = (-1,-1.)
            elif delete >= insert:
                trace[i,j,:] = (-1.,0.)
            else:
                trace[i,j,:] = (0.,-1.)
    if print_dynamic_matrix:
        print_dynamic(seqA,seqB,S)
    return format_semiglobal_alignment(seqA,seqB,S,trace)
seqA,seqB = semiglobal_align("CTATCTATCTCGCTATCCAA","CTACGCTATTTCA",True)
print_alignment(seqA,seqB)
       -    C    T    A    C    G    C    T    A    T    T    T    C    A  
  -    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
  C    0.0  3.0  1.0 -1.0  3.0  1.0  3.0  1.0 -1.0 -1.0 -1.0 -1.0  3.0  1.0
  T    0.0  1.0  6.0  4.0  2.0  2.0  1.0  6.0  4.0  2.0  2.0  2.0  1.0  2.0
  A    0.0 -1.0  4.0  9.0  7.0  5.0  3.0  4.0  9.0  7.0  5.0  3.0  1.0  4.0
  T    0.0 -1.0  2.0  7.0  8.0  6.0  4.0  6.0  7.0 12.0 10.0  8.0  6.0  4.0
  C    0.0  3.0  1.0  5.0 10.0  8.0  9.0  7.0  5.0 10.0 11.0  9.0 11.0  9.0
  T    0.0  1.0  6.0  4.0  8.0  9.0  7.0 12.0 10.0  8.0 13.0 14.0 12.0 10.0
  A    0.0 -1.0  4.0  9.0  7.0  7.0  8.0 10.0 15.0 13.0 11.0 12.0 13.0 15.0
  T    0.0 -1.0  2.0  7.0  8.0  6.0  6.0 11.0 13.0 18.0 16.0 14.0 12.0 13.0
  C    0.0  3.0  1.0  5.0 10.0  8.0  9.0  9.0 11.0 16.0 17.0 15.0 17.0 15.0
  T    0.0  1.0  6.0  4.0  8.0  9.0  7.0 12.0 10.0 14.0 19.0 20.0 18.0 16.0
  C    0.0  3.0  4.0  5.0  7.0  7.0 12.0 10.0 11.0 12.0 17.0 18.0 23.0 21.0
  G    0.0  1.0  2.0  3.0  5.0 10.0 10.0 11.0  9.0 10.0 15.0 16.0 21.0 22.0
  C    0.0  3.0  1.0  1.0  6.0  8.0 13.0 11.0 10.0  8.0 13.0 14.0 19.0 20.0
  T    0.0  1.0  6.0  4.0  4.0  6.0 11.0 16.0 14.0 13.0 11.0 16.0 17.0 18.0
  A    0.0 -1.0  4.0  9.0  7.0  5.0  9.0 14.0 19.0 17.0 15.0 14.0 15.0 20.0
  T    0.0 -1.0  2.0  7.0  8.0  6.0  7.0 12.0 17.0 22.0 20.0 18.0 16.0 18.0
  C    0.0  3.0  1.0  5.0 10.0  8.0  9.0 10.0 15.0 20.0 21.0 19.0 21.0 19.0
  C    0.0  3.0  2.0  3.0  8.0  9.0 11.0  9.0 13.0 18.0 19.0 20.0 22.0 20.0
  A    0.0  1.0  2.0  5.0  6.0  7.0  9.0 10.0 12.0 16.0 17.0 18.0 20.0 25.0
  A    0.0 -1.0  0.0  5.0  4.0  5.0  7.0  8.0 13.0 14.0 15.0 16.0 18.0 23.0

Best score: 25.0
20 13
CTATCTATCT-CGCTA-TCCAA
--------CTACGCTATTTCA-