Opened 3 years ago
Closed 3 years ago
#7358 closed enhancement (fixed)
Update AlphaFold database sequence searches for July 2022 release of 200 million structures
| Reported by: | Tom Goddard | Owned by: | Tom Goddard |
|---|---|---|---|
| Priority: | moderate | Milestone: | |
| Component: | Structure Prediction | Version: | |
| Keywords: | | Cc: | |
| Blocked By: | | Blocking: | |
| Notify when closed: | tic20, zjp, meng | Platform: | all |
| Project: | ChimeraX | | |
Description
The EBI released an update to the AlphaFold database increasing from 1 million structures to 200 million structures.
https://www.embl.org/news/technology-and-innovation/alphafold-200-million/
We should update the ChimeraX alphafold match and search commands to use the new sequence database. But the searches will be very slow, maybe 30 minutes or an hour using our current BLAT and BLAST search methods. To handle 200 million sequences we will probably need to use mmseqs2. It is claimed that mmseqs2 can run about 100 times faster than BLAST with the same sensitivity, and up to 10000 times faster at reduced sensitivity.
It is going to be hard to achieve fast searches. The 200 million sequences would take 70 Gbytes uncompressed if the average sequence length is 350 (a guess). The mmseqs2 docs indicate that it wants to hold the entire index in memory for the fast searches it reports, and that index takes 7 times the uncompressed sequence size, so about 500 Gbytes of memory. For ColabFold, which is by the mmseqs2 authors, they use a server with 48 cores and 768 Gbytes of memory to allow fast searches of the few billion sequences used for AlphaFold prediction. mmseqs2 can also run with the index loaded in blocks using less memory.
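The memory arithmetic behind these estimates can be sketched as follows (the mean sequence length of 350 is the guess stated above):

```python
# Back-of-envelope estimate of mmseqs2 index memory for the v3 database.
n_sequences = 200_000_000   # July 2022 AlphaFold database release
mean_length = 350           # guessed average residues per sequence
index_factor = 7            # in-memory index size relative to raw sequence size

raw_bytes = n_sequences * mean_length          # 1 byte per residue
index_bytes = raw_bytes * index_factor

print(raw_bytes / 1e9)     # 70.0 GB of uncompressed sequences
print(index_bytes / 1e9)   # 490.0 GB of index, roughly the 500 GB above
```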
At any rate, the server infrastructure and maintenance may be beyond our resources. It would be better if we did not have to maintain the sequence database and server.
Possibly we can use the EBI sequence search which includes the AlphaFold database.
Another possibility is to use the mmseqs2 web server, https://search.mmseqs.com. Unfortunately that server's web page says it is not available now because they don't have enough resources to run both it and ColabFold. But it says "MMseqs2 search will be back".
Change History (14)
comment:1 by , 3 years ago
comment:2 by , 3 years ago
EBI sequence search on the AlphaFold database of 214 million structures took 10 minutes with FASTA and 13 minutes with BLAST for a 139 residue sequence. EBI does not offer mmseqs2 search. Details follow.
The EBI AlphaFold database suggests using their general sequence search page
https://www.ebi.ac.uk/Tools/sss/fasta/
to search for AlphaFold models. This uses the FASTA algorithm for searching (similar speed to BLAST) and offers about 10 other older tools, but not mmseqs2. I started a test run searching the AlphaFold DB using their service to see how long it takes. It took 10 minutes using the 139 residue sequence of 7n7g, producing 50 hits, and it gave a description for each hit (e.g. "Fosfomycin resistance protein FosB UA=A0A286KC67 UI=A0A286KC67_ENTAV OS=Enterococcus avium OX=33945 GN=fosB"). The FASTA output reported these stats
139 residues in 1 query sequences
69001092509 residues in 214684311 library sequences
Tcomplib [36.3.8h May, 2020] (32 proc in memory [15G])
start: Fri Jul 29 22:52:04 2022  done: Fri Jul 29 23:01:45 2022
Total Scan time: 2141.080  Total Display time: 0.050
Function used was FASTA [36.3.8h May, 2020]
using this command
cat fasta-E20220729-235004-0236-19789909-p1m.sequence | $APPBIN/fasta:36.3.8h /fasta36 -l $DATA_CURRENT/fastacfg/fasta3db -L -T 32 -p -s BL50 -f -10 -g -2 -E "10.0 -1.0" -F 0.0 -b 50 -d 50 -m "F9B fasta-E20220729-235004-0236-19789909-p1m.m9" -m "F10 fasta-E20220729-235004-0236-19789909-p1m.m10" -z 1 \@:1- +afdb+ 2
I tried the same search of AFDB using their BLAST server, took 13 minutes.
Effective search space used: 1687989905136
Database: afdb
Posted date: Jul 21, 2022 5:16 PM
Number of letters in database: 69,001,092,509
Number of sequences in database: 214,684,311
using command
$APPBIN/ncbi-blast-2.9.0+/bin/blastp -db "afdb" -query ncbiblast-E20220730-000943-0597-32032673-p2m.sequence -num_threads 32 -outfmt 11 -out ncbiblast-E20220730-000943-0597-32032673-p2m.archive -matrix BLOSUM62 -max_target_seqs 50 -evalue 10 -max_hsps 100 -gapopen 11 -gapextend 1 -seg no -word_size 6 -comp_based_stats F
I tried the same search on UniRef90 using FASTA. It probably won't be any faster, but the AlphaFold DB mostly covers UniRef90, and possibly EBI has the UniRef90 search more heavily optimized. The search was over 147 million sequences versus 214 million for AFDB. It took 7.5 minutes.
139 residues in 1 query sequences
50157842458 residues in 147407377 library sequences
Tcomplib [36.3.8h May, 2020] (32 proc in memory [15G])
start: Fri Jul 29 23:36:13 2022  done: Fri Jul 29 23:43:42 2022
Total Scan time: 1400.270  Total Display time: 0.050
Function used was FASTA [36.3.8h May, 2020]
using command
cat fasta-E20220730-003038-0088-44777607-p1m.sequence | $APPBIN/fasta:36.3.8h /fasta36 -l $DATA_CURRENT/fastacfg/fasta3db -L -T 32 -p -s BL50 -f -10 -g -2 -E "10.0 -1.0" -F 0.0 -b 50 -d 50 -m "F9B fasta-E20220730-003038-0088-44777607-p1m.m9" -m "F10 fasta-E20220730-003038-0088-44777607-p1m.m10" -z 1 \@:1- +uniref90+ 2
comment:3 by , 3 years ago
mmseqs2 speed on 1 million sequences
I timed mmseqs2 searching the 139 residue sequence of 7n7g against the 1 million sequence AlphaFold database on various UCSF computers: my 2022 MacBook Pro (M1 Max CPU, 32 GB memory, mmseqs2 installed from the homebrew arm64 executable); minsky, an Ubuntu 20 alphafold machine (64 GB memory, Intel Core i9-10850K CPU @ 3.60GHz with 10 cores) with a fast NVMe drive; and crick (376 GB memory, plato CentOS 7, 2 Intel Xeon Gold 6132 CPUs @ 2.60GHz, each with 14 cores), run in a home directory on the beegfs file system. For both Linux systems I used the binary distribution of mmseqs2. The AlphaFold sequences fasta file is 500 Mb. I computed an mmseqs index which is 3.9 Gbytes (*.idx file). I ran multiple times so the index should be cached in memory.
Mac 0.49 seconds
Minsky Ubuntu 0.15 seconds
Crick CentOS 1.70 seconds
The first run on crick took 13 seconds. The Mac run without the index took 13 seconds. If all 3 computers are caching the database index, it is not clear why they differ by a factor of 10 in speed. The very slow crick time is probably due to the beegfs file system.
mmseqs2 speed on 214 million sequences
Next I am going to try the search of the 214 million sequence AlphaFold database on minsky and on crick. Minsky probably does not have enough disk space: the index will take about 800 Gbytes and only 700 Gbytes are free because the AlphaFold databases take up most of the 4 TB NVMe drive. I could try reducing to 100 million sequences for the test on minsky.
On crick, the 214 million sequence search took 810 seconds (13.5 minutes) on the first run and 568 seconds (9.5 minutes) on the second. Sensitivity was 5.7. Running with sensitivity 1 took 915 seconds on the first run, 597 seconds on the second; strange that lower sensitivity is slower. A search on 100 million sequences with default sensitivity (5.7) took 315 seconds on the first run, 304 seconds on the second.
mmseqs2 speed on 100 million sequences
On minsky with 100 million sequences the search took 659 seconds (11 minutes) on the first run and 651 seconds (11 minutes) on the second, with default sensitivity.
mmseqs2 index file size
The index for the database is split across several files based on the amount of memory. On crick with 376 GB of memory, the 214 million sequence index is 6 files totaling 574 GB, and the 100 million sequence index is 5 files totaling 269 GB. On minsky the 100 million sequence index is 11 files totaling 336 GB -- strange that the total size is so much larger.
Minsky disk speed
The disk read speed on minsky is slower than I thought. It is a SATA drive, a Samsung 870 QVO 4 TB, and reads at only 500 Mbytes/sec. I wrote some simple C code that gave 0.52 GB/sec, reading the first 100 million sequences of the AlphaFold database (44 GB) in 84 seconds. Reading the 336 GB mmseqs2 index for the first 100 million sequences would take 643 seconds at that speed.
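The sequential-read measurement could look roughly like this (a Python sketch; the original test was simple C code, and the path would be the fasta file being timed). Note the result includes OS cache effects unless the page cache is dropped first:

```python
import time

def read_speed_gb_per_sec(path, block_size=16 * 1024 * 1024):
    """Sequentially read a file and return the bandwidth in GB/sec."""
    total = 0
    start = time.time()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total += len(block)
    elapsed = time.time() - start
    return total / elapsed / 1e9

# Example: 44 GB in 84 seconds is ~0.52 GB/sec; at that rate the
# 336 GB index takes roughly 336 / 0.52 seconds, about 11 minutes, to read.
```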
Plato disk speed
Disk speed on watson (plato) was 0.35 GB/sec (124 seconds for the 44 GB file alphafold100M.fasta). It appears beegfs uses some file compression, because the block size (du -h) of this file is 36 GB. I tried reading alphafold214M.fasta, size 93 GB (block size 77 GB); it took 146 seconds, or 0.63 GB/sec. Maybe this file was partially in cache since I copied it an hour earlier. On crick it took 810 seconds to do an mmseqs2 search which presumably read the whole 574 GB index, so at least 0.71 GB/sec, but again those index files were recently written and might have been cached (crick has 384 Gbytes of memory).
comment:4 by , 3 years ago
The July 2022 web logs show the AlphaFold database CGI script was run 597 times by 290 unique IP addresses. So about 20 runs per day. Not very many. Each run can have multiple sequences (it is run from the alphafold match command which looks for matches of all chains of an experimental structure).
The July 2022 webservices log shows 606 blast jobs from 301 unique IP addresses. The jobs are submitted with an https POST, so the log does not indicate whether these were blasting PDB, AlphaFold, UniRef, or NR. So BLAST is getting at most 20 AlphaFold searches per day; probably 10 per day is a better estimate.
So we have only about 30 AlphaFold database sequence searches per day.
comment:5 by , 3 years ago
| Cc: | removed |
|---|---|
| Notify when closed: | → tic20, zjp, meng |
BLAST with 214 million sequence AlphaFold database
I tested BLAST on the 214 million sequence database with a length 167 sequence (UniProt Q2YDE7) on watson. It took 18 minutes (17:53); I did not give the num_threads option, so the default is 1 thread. A second run took 11 minutes 15 seconds (probably faster due to memory-cached database files) and was using 65 Gbytes of memory at 11 minutes, equal in size to the *.psq files of the blast database. The same test with 8 threads took 1 minute 44 seconds, reached 65 GB, and ran at 800% CPU (measured by top). A longer sequence, length 727 (UniProt Q69ZS6), took 35 minutes 17 seconds with 1 thread, using 64 GB at 34 minutes; with 8 threads it took 8 minutes 44 seconds. Possibly blast queries for available memory and uses as much as is available. I tried running with bash ulimit -Sv 8000000 to limit virtual memory to 8 GB; blast crashed when it reached 8 GB. Same with ulimit -Sm to limit resident memory. Does it speed up with a smaller evalue limit? I tried the option "-evalue 1e-50" with 8 threads on the length 167 sequence; it took 1 minute 43 seconds, the same as with the default evalue limit of 10.
Conclusions. BLAST takes a lot of memory, as much as the database (65 GBytes for the 214 million sequence AlphaFold database). With 1 thread it is pretty slow, 18 minutes for the length 167 sequence and 35 minutes for the length 727 sequence, but it is considerably faster with 8 threads (2 minutes and 9 minutes).
comment:6 by , 3 years ago
BLAT with 214 million sequence AlphaFold database
I tried BLAT on franklin with the 214 million sequence database and a length 132 sequence (PDB 7u0u chain B); it took 59 minutes (58:53) using the current CGI script (it was using 98 GB after 36 minutes and 95 GB at 45 minutes; the database fasta file is 93 GB). Online documentation indicates BLAT has no multi-threading, but there is a newer multithreaded pblat.
Conclusions. BLAT uses a lot of memory, about 93 GBytes, equal to the size of the fasta database file, and is very slow, taking 1 hour for a short sequence of length 132.
comment:7 by , 3 years ago
mmseqs2 memory and thread use
I ran tests on the length 139 7n7g sequence on watson; the search took 16 minutes, using 290 GB of memory and 56 threads.
$ \rm -rf resultDB* tmp/*
$ time ~/mmseqs/bin/mmseqs search queryDB alphafold214MDB resultDB tmp >& result.out &
Trying one thread took 10 minutes; I confirmed only 1 thread was used with ps -fL.
$ \rm -rf resultDB* tmp/*
$ time ~/mmseqs/bin/mmseqs search queryDB alphafold214MDB resultDB tmp --threads=1 >& result.out &
Trying to limit memory use to 8 GB gave the error "Cannot fit databases into 7G. Please use a computer with more main memory." The 5 index files are each 118 GB, built based on the 384 GB memory of the machine.
$ \rm -rf resultDB* tmp/*
$ time ~/mmseqs/bin/mmseqs search queryDB alphafold214MDB resultDB tmp --threads 1 --split-memory-limit 8G >& result.out &
Next I tried rebuilding smaller index files (there is also an option to compress them). Building 16G index files gave the error "Cannot fit databases into 11G. Please use a computer with more main memory." Removing the compressed option gave the same error. Requesting 32G index files worked.
time ~/mmseqs/bin/mmseqs createindex alphafold214MDB tmp --split-memory-limit 32G --compressed 1
This mmseqs2 bug report discussed high memory use for mmseqs search despite splitting the index, and it describes mmap() failures. But apparently the problem there is that SGE only allows limiting vmem instead of real memory (Dec 2020); mmseqs2 still uses large vmem but supposedly lower real memory.
comment:8 by , 3 years ago
BLAST of NR database, slow and 180 GB memory use
Our current ChimeraX BLAST search of the NR (non-redundant) database brings 179 GB of blast database files (3 times larger than AlphaFold DB version 3) into memory on the server and takes 50 minutes, ticket #7388. I measured this by doing an NR sequence search in ChimeraX and monitoring it on the server (franklin).
comment:9 by , 3 years ago
BLAST test with AlphaFold v3 database
I tested ChimeraX BlastProtein searching the 214 million sequence AlphaFold database. UniProt K0F9C8 with 451 amino acids took about 30 minutes and 64 Gbytes of memory on the server. I updated the BlastProtein code to handle parsing the AlphaFold version 3 fasta file title lines, which have a different format than version 2. The changes allow searching either the version 3 or version 2 database, although there is currently no way to select the version; it is hard coded in blastprotein/job.py. A blastprotein command argument could be added to allow searching either. This code is ready to check in. But I haven't checked it in, because it seems like a step backwards to go from a K0F9C8 search that took 10 seconds in version 2 to 30 minutes (about 200 times longer) in version 3 to find the same structure. Needs some more thought.
comment:10 by , 3 years ago
Fast k-mer search
I tested an idea for replacing the alphafold match command's use of BLAT, to quickly find a high identity sequence while using modest amounts of server memory. It makes a table of all 5-mers in the AlphaFold database, mapping each 5-mer to the sequences in the database containing it. To search for a query sequence it looks at every 5-mer in the query, counts how many of those 5-mers are found in each database sequence, and takes the database sequence with the most matching 5-mers.

In tests on the 1 million sequence AlphaFold database version 2, it was able to find a best match sequence with 40% identity for a query of length 1137, with 60% identity for a query of length 230, or with 80% identity for a query of length 50. Shorter sequences require more identity because it does not take many mutations before the highest number of matching 5-mers is around 5 and poor matches are obtained.

The best sequences are found in roughly 0.1 seconds with the kmer map in memory. But the kmer map is about 3 times the database fasta file size, so it would be about 300 GB for the 100 GB AlphaFold v3 database. The method also runs with the kmer map on disk, taking maybe about 0.5 seconds in the above tests on the v2 database. In that mode total memory use should be about a Gbyte, I think (a 32-bit integer for each of 214 million sequences to count query kmer matches).
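A minimal in-memory sketch of the 5-mer counting idea (all names here are hypothetical; the actual prototype keeps the k-mer map on disk and uses per-sequence 32-bit counters):

```python
from collections import defaultdict

K = 5  # k-mer length used by the prototype

def build_kmer_map(sequences):
    """Map each K-mer to the set of database sequence indices containing it."""
    kmer_map = defaultdict(set)
    for seq_id, seq in enumerate(sequences):
        for i in range(len(seq) - K + 1):
            kmer_map[seq[i:i + K]].add(seq_id)
    return kmer_map

def best_match(query, kmer_map, n_sequences):
    """Return the database sequence index sharing the most query K-mers."""
    counts = [0] * n_sequences  # one counter per database sequence
    for i in range(len(query) - K + 1):
        for seq_id in kmer_map.get(query[i:i + K], ()):
            counts[seq_id] += 1
    return max(range(n_sequences), key=counts.__getitem__)

# Toy example: the first sequence shares all six 5-mers with the query.
db = ["MKTAYIAKQR", "GSHMLEDPAA", "MKTAYIAKQQ"]
kmap = build_kmer_map(db)
print(best_match("MKTAYIAKQR", kmap, len(db)))  # prints 0
```

Because a single mutation destroys up to five overlapping 5-mers, short queries need high identity before the top count stands out from the noise, matching the behavior described above.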
I'm making the kmer map for the v3 database on plato, expected to take about half a day. All this prototype code is in Python.
Not sure how fast the search will be using the kmer map on disk since about N blocks have to be read from that 300 GB file for querying a sequence of length N. If that is slow due to spinning disk seek time it could perhaps be sped up by running say 8 threads all reading different blocks of the file at the same time so the disk scheduler can minimize seeking.
I ran some tests with the index file, 253 Gbytes, on a beegfs file system (wynton nobackup). A length 1137 sequence took 20 seconds to find a best match down to 40% sequence identity, length 230 took 4 seconds down to 60% identity, and length 50 took about 1.2 seconds down to 80% identity. The average number of sequences per kmer is 2000, so about 8 Kbytes are read for each kmer, and for a sequence of length N we read N-4 kmers randomly positioned in the 253 Gbyte file. I tried increasing the read speed by multithreading and got about a 5-fold speedup using 8 threads when reading 1000 blocks of size 16KB. So that simple optimization could decrease the lookup time by a big factor. The total disk read bandwidth is only about 20 Mbytes/second in this test with 8 threads, which is pretty dismal compared to reading a large contiguous file (~500 Mbytes/sec on wynton). But it is a network file system. I should try a local spinning drive. I would not expect threads to help on an SSD drive, but should test that too.
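The multithreaded random-read test could look roughly like this (a Python sketch; path and parameters are placeholders). os.pread takes an explicit offset, so all threads can share one file descriptor without racing on the file position:

```python
import os
import random
from concurrent.futures import ThreadPoolExecutor

def random_read_test(path, n_blocks=1000, n_threads=8, block=16 * 1024):
    """Read n_blocks randomly placed blocks of the given size using
    n_threads threads, returning the total number of bytes read."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        offsets = [random.randrange(max(1, size - block + 1))
                   for _ in range(n_blocks)]
        # Issuing many reads concurrently lets the disk scheduler
        # reorder them to minimize seeking on a spinning drive.
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            blocks = pool.map(lambda off: os.pread(fd, block, off), offsets)
            return sum(len(b) for b in blocks)
    finally:
        os.close(fd)
```

Timing this function with n_threads=1 versus 8 would reproduce the comparison described above.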
It is unclear how valuable a fast search is if it may not return the best matching sequence when there are several matches of similar quality. The method ranks all database sequences, so it should be able to see when there are many nearly equally good matches, and could provide all of those if desired. If there is no sequence with high enough identity, a much longer, more sensitive blast search may be needed.
comment:11 by , 3 years ago
BLAST using 4 threads
I tested blastp on watson with ChimeraX defaults on a length 268 sequence (7u0u chain A): 19 minutes with 1 thread, 5 minutes with 4 threads, using the blastp -num_threads 4 option. It seems worth using 4 threads for the AlphaFold version 3 database and NR database ChimeraX web services.
Zach switched the ChimeraX BLAST server to use 4 threads (ticket #7435), and the length 268 test time run from ChimeraX took 4 minutes 45 seconds, with top on franklin showing 400% CPU usage the whole time.
comment:12 by , 3 years ago
Random read speed, SSD is 50 times faster than plato beegfs spinning disks
The k-mer search method described in comment 10 is slow due to random disk access on the plato beegfs network spinning drives. I wrote a test program to see how long reading 1024 randomly placed 16K blocks takes, trying it on the 93 GB fasta file of 214 million sequences, on plato in /wynton/group/ferrin/nobackup/goddard (beegfs file system). With 1 thread it took 17 seconds, with 8 threads 4 seconds, and with 16 threads 3 seconds.
The random reads on an SSD are *much* faster. The 44 GB alphafold100M.fasta file on minsky, with a Samsung 870 QVO 4 TB 2.5" SATA III SSD that reads sequentially at 550 Mbytes/second, took 0.3 seconds for reading 1024 randomly located 16K blocks, 50 times faster. Surprisingly, with 8 threads the SSD read got down to 0.064 seconds, about 4 times faster than 1 thread. I bet an NVMe PCIe drive would be hundreds of times faster than plato in this test.
This should make length 1100 sequence searches take 0.5 seconds instead of 20 seconds using 1 thread, or 0.1 seconds using 8 threads. I don't have the disk space on minsky to test this. I need to get a fast NVMe drive for work; a 2 TB Fantom Thunderbolt 3 drive (~$500, 2.7 Gbyte/sec read speed, about 5 times faster than the minsky SSD) would be a good choice. This would also allow making the k-mer index faster (it took 18 hours on plato), although that would require C code, since the time is currently dominated by a Python CPU bottleneck.
comment:13 by , 3 years ago
Added version option to blastprotein and "alphafold search" commands
I added a "version" option to the blastprotein and "alphafold search" commands that specifies the AlphaFold database version to search. The default version is still 2, but this option allows users to search version 3 if they want. The version option could conceivably be used to specify alternate versions of other databases (PDB, NR, UniRef), but currently alternate versions are only available for AlphaFold, and I don't envision needing them for the other databases. Even for the AlphaFold DB I would prefer not to have alternate versions, but I am still deciding how to balance AlphaFold DB v2 searches, which took a few seconds, against AlphaFold DB v3 searches, which take 5 minutes.
I also updated the blast results gui code to correctly parse the new AlphaFold version 3 sequence description to extract uniprot id, species, title, ....
Another issue, if we host the sequence search on our servers, is that the FASTA AlphaFold database sequence file does not have useful species and description information for the sequences, so we currently replace the description of each sequence with info from UniProt. For the last AlphaFold database update (1 million sequences) I did this by hand, submitting 3 batches to the EMBL UniProt server. This would have to be automated for 200 million sequences.