| | 7 | Ran some tests with the 253 Gbyte index file on the BeeGFS file system (wynton nobackup): a length 1137 sequence took 20 seconds to find the best match down to 40% sequence identity, length 230 took 4 seconds down to 60% identity, and length 50 took about 1.2 seconds down to 80% identity. The average number of sequences per kmer is 2000, so about 8 Kbytes are read per kmer, and for a sequence of length N we read N-4 kmers randomly positioned in the 253 Gbyte file. I tried increasing the read speed by multithreading and got about a 5-fold speedup using 8 threads reading 1000 blocks of size 16 KB, so that simple optimization could cut the lookup time by a large factor. The total disk read bandwidth was only about 20 Mbytes/second in this test with 8 threads, which is pretty dismal compared to reading a large contiguous file (~500 Mbytes/second on wynton), but it is a network file system. Should try on a local spinning drive. Would not expect threads to help on an SSD, but should test that too. |