6. Code: Local Pairwise alignment#
To facilitate the the reasoning in the subsequent cells, we first we define a couple of functions that we will need later, for formating and printing alignments. It is not important that you understand what these functions do, for now.
Show code cell source
import numpy as np
# Print 2 sequences on top of each other
def print_alignment(seqA,seqB):
print(seqA)
print(seqB)
# Print a dynamic programming score matrix
# together with its sequences
def print_dynamic(seqA,seqB,dpm):
seqA,seqB = "-" + seqA, "-" + seqB
m,n = len(seqA),len(seqB)
print('{:^5}'.format(" "), end=""),
for j in range(n):
print('{:^5}'.format(seqB[j]), end="")
print()
for i in range(m):
print ('{:^5}'.format(seqA[i]), end="")
for j in range(n):
print ('{:5.1f}'.format(dpm[i,j]), end="")
print()
print()
# Format an alignment by inserting gaps in sequences
def format_alignment(seqA,seqB,S,trace):
print("Best score: " + str(S[-1,-1]))
outA,outB = "",""
i,j = len(seqA),len(seqB)
while i>0 or j>0:
di,dj = trace[i,j]
i += int(di)
j += int(dj)
if di == 0:
outA = "-" + outA
else:
outA = seqA[i] + outA
if dj == 0:
outB = "-" + outB
else:
outB = seqB[j] + outB
return outA,outB
6.1. Scoring system for DNA sequences#
We setup the scoring system we need for the alignment of DNA sequences. Here we use a score system where gaps score -2 and miss matches are scored -1 and matches get a score of 3.
def gap_penalty():
return -2.0
def match_score(letterA,letterB):
if letterA == '-' or letterB == '-':
return gap_penalty()
elif letterA == letterB:
return 3.0
else:
return -1.0
6.2. Local alignments using Smith-Waterman#
Smith-Waterman alignments are similar to the ones of Needleman-Wunsch. The difference sits in the initiation of the dynamic programming matrix, and how we trace the most optimal alignment. We will implement these difference by redefining some functions.
First the initiation of the dynamic programming matrix \(S\): \(S_{i0}=0, \forall i,\) \(S_{0j}=0, \forall j\)
# Initiating dynamic programming matrices, S and trace
def initiate_local_dp(m,n):
S = np.zeros((m,n))
trace = np.zeros((m,n,2))
S[0,0] = 0.
trace[0,0,:] = (0.,0.)
for i in range(1,m):
S[i,0] = 0
trace[i,0,:] =(0,0)
for j in range(1,n):
S[0,j] = 0
trace[0,j,:] =(0,0)
return S,trace
Subsequently, the rest of \(S\) is filled as: \(S_{ij}=\max\left\{ \begin{array}{ll} S_{i-1,j-1} & +d(a_i,b_j)\\ S_{i-1,j} & +d(a_i,-)\\ S_{i,j-1} & +d(-,b_j)\\ 0 \end{array} \right.\)
def local_align(seqA,seqB,print_dynamic_matrix = False):
# Initiating variables
m, n = len(seqA)+1, len(seqB)+1
S,trace = initiate_local_dp(m,n)
# Fill in the rest of the dynamic programming matrix
for i in range(1,m):
for j in range(1,n):
match = S[i-1][j-1] + match_score(seqA[i-1],seqB[j-1])
delete = S[i-1,j] + match_score(seqA[i-1],'-')
insert = S[i,j-1] + match_score('-',seqB[j-1])
S[i,j] = max(match, delete, insert, 0.)
if match >= max(delete,insert,0.):
trace[i,j,:] = (-1,-1.)
elif delete >= max(insert,0.):
trace[i,j,:] = (-1.,0.)
elif insert >= 0.:
trace[i,j,:] = (0.,-1.)
else:
trace[i,j,:] = (0.,0.)
if print_dynamic_matrix:
print_dynamic(seqA,seqB,S)
return format_local_alignment(seqA,seqB,S,trace)
We also need a slightly different method to trace the the alignment. It is not essential you understand this code.
Show code cell source
def format_local_alignment(seqA,seqB,S,trace):
outA,outB = "",""
i,j = np.unravel_index(S.argmax(),S.shape)
print("Best score: " + str(S[i,j]))
while min(trace[i,j])<0:
di,dj = trace[i,j]
i += int(di)
j += int(dj)
if di == 0:
outA = "-" + outA
else:
outA = seqA[i] + outA
if dj == 0:
outB = "-" + outB
else:
outB = seqB[j] + outB
return outA,outB
seqA,seqB = local_align("GATTA","GCTAC",True)
print_alignment(seqA,seqB)
- G C T A C
- 0.0 0.0 0.0 0.0 0.0 0.0
G 0.0 3.0 1.0 0.0 0.0 0.0
A 0.0 1.0 2.0 0.0 3.0 1.0
T 0.0 0.0 0.0 5.0 3.0 2.0
T 0.0 0.0 0.0 3.0 4.0 2.0
A 0.0 0.0 0.0 1.0 6.0 4.0
Best score: 6.0
GATTA
G-CTA
seqA,seqB = local_align("GCGATTA","GCTTAC",True)
print_alignment(seqA,seqB)
- G C T T A C
- 0.0 0.0 0.0 0.0 0.0 0.0 0.0
G 0.0 3.0 1.0 0.0 0.0 0.0 0.0
C 0.0 1.0 6.0 4.0 2.0 0.0 3.0
G 0.0 3.0 4.0 5.0 3.0 1.0 1.0
A 0.0 1.0 2.0 3.0 4.0 6.0 4.0
T 0.0 0.0 0.0 5.0 6.0 4.0 5.0
T 0.0 0.0 0.0 3.0 8.0 6.0 4.0
A 0.0 0.0 0.0 1.0 6.0 11.0 9.0
Best score: 11.0
GATTA
GCTTA
seqA,seqB = local_align("CTATCTCGCTATCCA","CTACGCTATTTCA",True)
print_alignment(seqA,seqB)
- C T A C G C T A T T T C A
- 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C 0.0 3.0 1.0 0.0 3.0 1.0 3.0 1.0 0.0 0.0 0.0 0.0 3.0 1.0
T 0.0 1.0 6.0 4.0 2.0 2.0 1.0 6.0 4.0 3.0 3.0 3.0 1.0 2.0
A 0.0 0.0 4.0 9.0 7.0 5.0 3.0 4.0 9.0 7.0 5.0 3.0 2.0 4.0
T 0.0 0.0 3.0 7.0 8.0 6.0 4.0 6.0 7.0 12.0 10.0 8.0 6.0 4.0
C 0.0 3.0 1.0 5.0 10.0 8.0 9.0 7.0 5.0 10.0 11.0 9.0 11.0 9.0
T 0.0 1.0 6.0 4.0 8.0 9.0 7.0 12.0 10.0 8.0 13.0 14.0 12.0 10.0
C 0.0 3.0 4.0 5.0 7.0 7.0 12.0 10.0 11.0 9.0 11.0 12.0 17.0 15.0
G 0.0 1.0 2.0 3.0 5.0 10.0 10.0 11.0 9.0 10.0 9.0 10.0 15.0 16.0
C 0.0 3.0 1.0 1.0 6.0 8.0 13.0 11.0 10.0 8.0 9.0 8.0 13.0 14.0
T 0.0 1.0 6.0 4.0 4.0 6.0 11.0 16.0 14.0 13.0 11.0 12.0 11.0 12.0
A 0.0 0.0 4.0 9.0 7.0 5.0 9.0 14.0 19.0 17.0 15.0 13.0 11.0 14.0
T 0.0 0.0 3.0 7.0 8.0 6.0 7.0 12.0 17.0 22.0 20.0 18.0 16.0 14.0
C 0.0 3.0 1.0 5.0 10.0 8.0 9.0 10.0 15.0 20.0 21.0 19.0 21.0 19.0
C 0.0 3.0 2.0 3.0 8.0 9.0 11.0 9.0 13.0 18.0 19.0 20.0 22.0 20.0
A 0.0 1.0 2.0 5.0 6.0 7.0 9.0 10.0 12.0 16.0 17.0 18.0 20.0 25.0
Best score: 25.0
CT-CGCTA-TCCA
CTACGCTATTTCA