| | 7 | Ran some tests with the 253 Gbyte index file on the BeeGFS file system (wynton nobackup): a length 1137 sequence took 20 seconds to find the best match down to 40% sequence identity, length 230 took 4 seconds down to 60% identity, and length 50 took about 1.2 seconds down to 80% identity. The average number of sequences per kmer is 2000, so about 8 Kbytes are read per kmer, and for a sequence of length N we read N-4 kmers randomly positioned in the 253 Gbyte file. I tried increasing the read speed by multithreading and got about a 5-fold speedup using 8 threads reading 1000 blocks of size 16 KB, so that simple optimization could cut the lookup time by a large factor. The total disk read bandwidth was only about 20 Mbytes/second in this test with 8 threads, which is pretty dismal compared to reading a large contiguous file (~500 Mbytes/second on wynton), but it is a network file system. Should try on a local spinning drive. Would not expect threads to help on an SSD, but should test that too. |