16. Code for BLAST-styled queries#
Here is some code demonstrating some of the basic building blocks of the BLAST algorithm: the database and query indexing, some basic joining of matches, and the display of the results.
Note that the code just tries to give the outline of some of the sapects of the algorithm, and the real BLAST contains code for extending matches, calculating scores and extensions etc. which is not part of this notebook.
build_index
function: This function creates an index from a given sequence for subsequences (k-mers) of length k. Each k-mer is mapped to all positions where it appears in the sequence, here given as a tuple of the sequence number and its position within the sequence.
This step of the processing could be precalculated for each database prior to encountering any query.
def build_index(sequences, k):
"""Builds indexes from multiple sequences for subsequences of length k."""
index = {}
for seq_id, sequence in enumerate(sequences):
# seq_id will be a number, 0 for first sequence, 1 for second, etc
for i in range(len(sequence) - k + 1):
k_mer = sequence[i:i+k]
if k_mer in index:
index[k_mer].append((seq_id,i))
else:
index[k_mer] = [(seq_id,i)]
return index
query_index
function: This function takes a query sequence and finds all k-mers in the query that are present in the index, along with their positions in the original sequence.
def query_index(query, index, k):
"""Queries multiple indexes and finds positions where the query's k-mers match the k-mers in the indexes."""
matches = {}
for pos_in_query in range(len(query) - k + 1):
k_mer = query[pos_in_query:pos_in_query+k]
if k_mer in index:
# we found a match. Store the position_in_query
matches[k_mer] = pos_in_query
return matches
The merge_matches
function consolidates all matching k-mers from the query into continuous sequences for each sequence in the database. It first rearranges the matches by sequence index and position, then sorts and merges the k-mers into longer, contiguous segments that represent potential alignments between the query and the database sequences.
def merge_matches(matches, query, index, k):
"""Merges all k-mer matches into continuous sequences for each database sequence."""
# First rearange the matches so that they are grouped by their matched database sequence
rearanged_matches = {}
for k_mer, pos_query in matches.items():
index_matches = index[k_mer]
for seqix, pos_in_seq in index_matches:
if seqix not in rearanged_matches:
rearanged_matches[seqix] = []
rearanged_matches[seqix].append((pos_query, pos_in_seq, k_mer))
matches = []
# Walk through each matched sequence and arrange all matched k-tuples
for seqix, list_of_pos2_kmer in rearanged_matches.items():
list_of_pos2_kmer.sort() # sort by query position
match = "-" * len(query)
previous_pos_i, start_of_match = -1, None
for pos_q, pos_i, k_mer in list_of_pos2_kmer:
if not start_of_match:
start_of_match = pos_i - pos_q
# Check that the match does not come after the qurrent match
if previous_pos_i<pos_i:
match = match[:pos_q] + k_mer + match[pos_q+k:]
previous_pos_i = pos_i
matches.append((match, seqix, start_of_match))
return matches
Now we have all the code we need for running a test run of our code.
# Example sequences in the database
sequences = ["APEPTIDE", "PEPTIDEA", "DIFFERENT", "TIDEAPEP", "REPTILE"]
# Query sequence
query = "PEPTID"
# Build index for a k-mer length of 3 from multiple sequences
k = 3
index = build_index(sequences, k)
#print("Indexes:", index)
# Query the indexes
matches = query_index(query, index, k)
#print("Matches:", matches)
# Merge matches into continuous sequences
merged_alignments = merge_matches(matches, query, index, k)
print("Alignments:")
for seq_match, seq_ix, start_of_match in merged_alignments:
print("DB: " + sequences[seq_ix] )
print("MA: " + '-'*start_of_match + seq_match + '-'*(len(sequences[seq_ix])-start_of_match-len(seq_match)))
print("QU: " + ' '*start_of_match + query)
print()
Alignments:
DB: APEPTIDE
MA: -PEPTID-
QU: PEPTID
DB: PEPTIDEA
MA: PEPTID--
QU: PEPTID
DB: TIDEAPEP
MA: -----PEP---
QU: PEPTID
DB: REPTILE
MA: -EPTI--
QU: PEPTID
The algoritm finds matches to the first, second, fourth and fifth sequences