Changes between Version 1 and Version 2 of Ticket #7358, comment 10


Timestamp:
Aug 12, 2022, 4:23:41 PM (3 years ago)
Author:
Tom Goddard

  • Ticket #7358, comment 10

    v1 v2  
     1**Fast k-mer search**
     2
    13I tested an idea for replacing BLAT in the alphafold match command, so that a high identity sequence can be found quickly while using only modest amounts of server memory.  The method makes a table of all 5-mers in the AlphaFold database, mapping each 5-mer to the database sequences that contain it.  To search for a query sequence it looks at every 5-mer in the query, counts how many of those 5-mers are found in each database sequence, and takes the database sequence with the most matching 5-mers.  In tests on the 1 million sequence AlphaFold database version 2 it was able to find a best match sequence with 40% identity for a query sequence of length 1137, with 60% identity for a query sequence of length 230, and with 80% identity for a query sequence of length 50.  Shorter sequences require more identity because it does not take many mutations before the highest number of matching 5-mers falls to around 5, at which point poor matches are obtained.  The best sequences are found in roughly 0.1 seconds with the k-mer map in memory.  However, the k-mer map is about 3 times the size of the database FASTA file, so it would be about 300 GB for the 100 GB AlphaFold v3 database.  The method also runs with the k-mer map on disk, taking about 0.5 seconds in the above tests on the v2 database.  In that mode total memory use should be about 1 GB, I think (a 32-bit integer for each of 214 million sequences to count query k-mer matches).
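The counting scheme described above can be illustrated with a small in-memory sketch. This is not the code that was tested for the ticket. The function names (`kmers`, `build_index`, `best_match`) are hypothetical, the index is a plain Python dictionary rather than an on-disk map, and counting each distinct query 5-mer once is an assumption.

```python
from collections import defaultdict

K = 5  # k-mer length mentioned in the comment


def kmers(seq, k=K):
    """Return the set of distinct k-mers in seq (assumption: duplicates count once)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(sequences, k=K):
    """Map each k-mer to the list of database sequence indices that contain it."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for kmer in kmers(seq, k):
            index[kmer].append(seq_id)
    return index


def best_match(query, index, num_sequences, k=K):
    """Return (seq_id, shared k-mer count) for the best database sequence, or None."""
    counts = [0] * num_sequences  # one counter per database sequence
    for kmer in kmers(query, k):
        for seq_id in index.get(kmer, ()):
            counts[seq_id] += 1
    best = max(range(num_sequences), key=counts.__getitem__, default=None)
    if best is None or counts[best] == 0:
        return None  # no database sequence shares any k-mer with the query
    return best, counts[best]


if __name__ == "__main__":
    # Made-up toy sequences, for illustration only.
    database = ["MKTAYIAKQRQISFVKSHFSRQ", "MSTNPKPQRKTKRNTNRRPQDV", "GSHMKTAYIAKQR"]
    index = build_index(database)
    print(best_match("KTAYIAKQRQISF", index, len(database)))
```

The per-sequence counter list is the only structure whose size grows with the number of database sequences at query time, which is why keeping the k-mer map on disk leaves memory use modest.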
    24
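The size estimates in the comment can be checked with a few lines of arithmetic. This is a rough sketch using only the figures stated above (214 million sequences, 32-bit counters, a k-mer map about 3 times the FASTA size). The helper names are hypothetical, and 1 GB is taken as 10^9 bytes.

```python
def counter_memory_gb(num_sequences, bytes_per_counter=4):
    """Memory for one integer counter per database sequence, in GB (10^9 bytes)."""
    return num_sequences * bytes_per_counter / 1e9


def kmer_map_size_gb(fasta_size_gb, ratio=3):
    """Estimated k-mer map size, assuming it is about `ratio` times the FASTA size."""
    return fasta_size_gb * ratio


if __name__ == "__main__":
    print(f"counters: {counter_memory_gb(214_000_000):.2f} GB")  # counters: 0.86 GB
    print(f"k-mer map: {kmer_map_size_gb(100)} GB")              # k-mer map: 300 GB
```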