Code for BLAST-styled queries

16. Code for BLAST-styled queries#

Here is some code demonstrating some of the basic building blocks of the BLAST algorithm: the database and query indexing, some basic joining of matches, and the display of the results.

Note that the code just tries to give the outline of some of the sapects of the algorithm, and the real BLAST contains code for extending matches, calculating scores and extensions etc. which is not part of this notebook.

build_index function: This function creates an index from a given sequence for subsequences (k-mers) of length k. Each k-mer is mapped to all positions where it appears in the sequence, here given as a tuple of the sequence number and its position within the sequence. This step of the processing could be precalculated for each database prior to encountering any query.

def build_index(sequences, k):
    """Builds indexes from multiple sequences for subsequences of length k."""
    index = {}
    for seq_id, sequence in enumerate(sequences):
        # seq_id will be a number, 0 for first sequence, 1 for second, etc
        for i in range(len(sequence) - k + 1):
            k_mer = sequence[i:i+k]
            if k_mer in index:
                index[k_mer].append((seq_id,i))
            else:
                index[k_mer] = [(seq_id,i)]
    return index

query_index function: This function takes a query sequence and finds all k-mers in the query that are present in the index, along with their positions in the original sequence.

def query_index(query, index, k):
    """Queries multiple indexes and finds positions where the query's k-mers match the k-mers in the indexes."""
    matches = {}
    for pos_in_query in range(len(query) - k + 1):
        k_mer = query[pos_in_query:pos_in_query+k]
        if k_mer in index:
            # we found a match. Store the position_in_query 
            matches[k_mer] = pos_in_query
    return matches

The merge_matches function consolidates all matching k-mers from the query into continuous sequences for each sequence in the database. It first rearranges the matches by sequence index and position, then sorts and merges the k-mers into longer, contiguous segments that represent potential alignments between the query and the database sequences.

def merge_matches(matches, query, index, k):
    """Merges all k-mer matches into continuous sequences for each database sequence."""
    # First rearange the matches so that they are grouped by their matched database sequence
    rearanged_matches = {}
    for k_mer, pos_query in matches.items():
        index_matches = index[k_mer]
        for seqix, pos_in_seq in index_matches:
            if seqix not in rearanged_matches:
                rearanged_matches[seqix] = []
            rearanged_matches[seqix].append((pos_query, pos_in_seq, k_mer))
    matches = []
    
    # Walk through each matched sequence and arrange all matched k-tuples
    for seqix, list_of_pos2_kmer in rearanged_matches.items():
        list_of_pos2_kmer.sort() # sort by query position
        match = "-" * len(query)
        previous_pos_i, start_of_match = -1, None
        for pos_q, pos_i, k_mer in list_of_pos2_kmer:
            if not start_of_match:
                start_of_match = pos_i - pos_q
            # Check that the match does not come after the qurrent match
            if previous_pos_i<pos_i:
                match = match[:pos_q] + k_mer + match[pos_q+k:]
                previous_pos_i = pos_i
        matches.append((match, seqix, start_of_match))
    return matches

Now we have all the code we need for running a test run of our code.

# Example sequences in the database
sequences = ["APEPTIDE", "PEPTIDEA", "DIFFERENT", "TIDEAPEP", "REPTILE"]

# Query sequence
query = "PEPTID"

# Build index for a k-mer length of 3 from multiple sequences
k = 3
index = build_index(sequences, k)
#print("Indexes:", index)

# Query the indexes
matches = query_index(query, index, k)
#print("Matches:", matches)

# Merge matches into continuous sequences
merged_alignments = merge_matches(matches, query, index, k)

print("Alignments:")
for seq_match, seq_ix, start_of_match in merged_alignments:
    print("DB:  " + sequences[seq_ix] )
    print("MA:  " + '-'*start_of_match + seq_match + '-'*(len(sequences[seq_ix])-start_of_match-len(seq_match)))
    print("QU:  " + ' '*start_of_match + query)
    print()
Alignments:
DB:  APEPTIDE
MA:  -PEPTID-
QU:   PEPTID

DB:  PEPTIDEA
MA:  PEPTID--
QU:  PEPTID

DB:  TIDEAPEP
MA:  -----PEP---
QU:       PEPTID

DB:  REPTILE
MA:  -EPTI--
QU:  PEPTID

The algoritm finds matches to the first, second, fourth and fifth sequences