Changes between Version 9 and Version 10 of Ticket #7358, comment 3


Timestamp: Aug 3, 2022, 11:40:21 AM (3 years ago)
Author: Tom Goddard

  • Ticket #7358, comment 3

    v9 → v10

    The index for the database is split across several files based on the amount of memory.  On crick, with 376 GB of memory, the 214 million sequence index is 6 files totaling 574 GB, and the 100 million sequence index is 5 files totaling 269 GB.  On minsky the 100 million sequence index is 11 files with a total size of 336 GB -- strange that the total size is so much larger.

    Removed (v9): **Minsky and Plato disk speed**
    Added (v10): **Minsky disk speed**

    The disk read speed on minsky is slower than I thought.  It is a SATA drive, a Samsung 870 QVO 4 TB, and reads at only 500 Mbytes/sec.  I wrote some simple C code (a sketch of this kind of read test follows the diff) that gave 0.52 GB/sec, reading the first 100 million sequences of the AlphaFold database, 44 GB, in 84 seconds.  Reading the 336 GB mmseqs2 index for the first 100 million sequences would take 643 seconds at that speed.

    Added (v10): **Plato disk speed**

    Disk speed on watson (plato) was 0.35 GB/sec (124 seconds for the 44 GB file alphafold100M.fasta).  It appears beegfs uses some file compression, because the block size (du -h) of this file is 36 GB.  Reading alphafold214M.fasta, size 93 GB (block size 77 GB), took 146 seconds, or 0.63 GB/sec.  Maybe this file was partially in cache since I copied it an hour earlier.  On crick it took 810 seconds to do an mmseqs2 search, which presumably read the whole 574 GB index, so at least 0.71 GB/sec, but again those index files were recently written and might have been cached (crick has 384 GB of memory).
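The "simple C code" mentioned above is not included in the ticket; the sketch below is a minimal read-speed test of that kind, assuming a plain fread() loop and a made-up 16 MB buffer size.  The default file name and the 336 GB index figure are taken from the numbers above; everything else (file names, buffer size) is an assumption, not the actual test program, and results will be inflated if the file is already in the page cache, as noted for the crick and plato timings.

```c
/* Hypothetical sequential read-speed test (readspeed.c); not the original program. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1 ? argv[1] : "alphafold100M.fasta");
    size_t buf_size = 16 * 1024 * 1024;          /* 16 MB read buffer (assumed) */
    char *buf = malloc(buf_size);
    if (buf == NULL)
        return 1;

    FILE *f = fopen(path, "rb");
    if (f == NULL) {
        perror(path);
        return 1;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* Read the whole file sequentially, counting bytes. */
    double total_bytes = 0;
    size_t n;
    while ((n = fread(buf, 1, buf_size, f)) > 0)
        total_bytes += n;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    fclose(f);
    free(buf);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gb = total_bytes / 1e9;
    double gb_per_sec = gb / secs;
    printf("Read %.1f GB in %.1f seconds: %.2f GB/sec\n", gb, secs, gb_per_sec);

    /* Estimated time to read a 336 GB mmseqs2 index at the measured rate. */
    printf("Estimated time for a 336 GB index: %.0f seconds\n", 336.0 / gb_per_sec);
    return 0;
}
```

Built and run with something like `gcc -O2 readspeed.c -o readspeed && ./readspeed alphafold100M.fasta`, a measured 0.52 GB/sec would give an estimate of roughly 640-650 seconds for the 336 GB index, consistent with the 643-second figure quoted above.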