#8041 closed enhancement (fixed)
Add ESMFold database to BlastProtein
Reported by: | Tom Goddard | Owned by: | Zach Pearson
---|---|---|---
Priority: | moderate | Milestone: |
Component: | Sequence | Version: |
Keywords: | | Cc: |
Blocked By: | | Blocking: | 7970
Notify when closed: | | Platform: | all
Project: | ChimeraX | |
Description
The ESMFold metagenomic atlas is very similar to the AlphaFold database. I would like the blast protein command and tool to be able to search this database. I have created the blast database on plato at
/wynton/group/ferrin/databases/mol/ESMFold/v1
This is 600 million sequences, considerably larger than the AlphaFold database, which has about 200 million sequences. I think we are using 4 threads in our blast searches now, and it would be good to increase to 8 threads.
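For illustration, the thread increase could be expressed as follows when building a `blastp` command line. This is a minimal sketch, not the actual backend code: the function name, the query file name, and the database base name `esmfold` inside the directory are assumptions, while `-query`, `-db`, `-num_threads`, `-evalue`, `-max_target_seqs`, and `-outfmt` are standard BLAST+ options.

```python
import shlex


def blastp_command(query_fasta, db_path, threads=8, evalue=1e-3, max_hits=100):
    """Build a blastp argument list (hypothetical helper, not ChimeraX code)."""
    return [
        "blastp",
        "-query", query_fasta,
        "-db", db_path,
        "-num_threads", str(threads),       # raised from 4 to 8 as suggested
        "-evalue", str(evalue),
        "-max_target_seqs", str(max_hits),
        "-outfmt", "6",                     # tab-separated hit table
    ]


# The base name "esmfold" is an assumption; only the directory is given in the ticket.
cmd = blastp_command(
    "query.fasta",
    "/wynton/group/ferrin/databases/mol/ESMFold/v1/esmfold",
)
print(shlex.join(cmd))
```

Keeping the thread count as a parameter would let the larger ESMFold database use 8 threads without changing the settings for other databases.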
The ESMFold sequence titles contain just the MGnify sequence database identifier, for example MGYP000000000040 in the following entry:
>MGYP000000000040
MCGVYQSATFQATFFQYSYILHETLADIVVPDTIGGKIRKLRHSLNLAQMQFAKSIHRGFTTVTKWEQELTTTSEKALTNIIEIYKLQENYFDK
So the Blast tool output table of search results will be very simple, no species or descriptive name, just the identifier. The esmfold fetch command can fetch the associated predicted structure. I can help with that code.
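Because the title holds only the identifier, extracting it is simple. The sketch below reads FASTA records and takes the first whitespace-delimited token of each title as the MGnify id. The function names are hypothetical and this is not the BlastProtein implementation.

```python
def read_fasta(lines):
    """Return a list of (title, sequence) tuples from FASTA-formatted lines."""
    records = []
    title, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


def mgnify_id(title):
    """First token of the title; for ESMFold entries this is the whole title."""
    parts = title.split()
    return parts[0] if parts else ""


entry = [
    ">MGYP000000000040",
    "MCGVYQSATFQATFFQYSYILHETLADIVVPDTIGGKIRKLRHSLNLAQMQFAKSIHRGFTTVTKWEQELTTTSEKALTNIIEIYKLQENYFDK",
]
records = read_fasta(entry)
ids = [mgnify_id(title) for title, _ in records]
print(ids)
```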
Change History (10)
comment:1 by , 3 years ago
comment:2 by , 3 years ago
Was sidetracked last week by getting statistics from the backend and just finished (pending Elaine's review) the DICOM tickets. I'll enable and test this tomorrow.
comment:4 by , 3 years ago
For some reason the rq workers on webservices-test are not completing jobs (probably because I am calling them in a different way to get them to write log files), and there's an error in my sudo permissions on Plato that prevents me from starting a service. I can restart a running service, but not start a stopped one. Apache does not agree with me that these are equivalent operations, and complains that it cannot restart a stopped service instead of starting it.
comment:5 by , 3 years ago
Blocking: | → 7970 |
---|
The ESMFold tools are done, except that they cannot yet do a BLAST search. This is just like the AlphaFold database BLAST search, so I am surprised that it is difficult to implement.
comment:6 by , 3 years ago
Blocking: | 7970 |
---|
comment:7 by , 3 years ago
Blocking: | → 7970 |
---|
comment:8 by , 3 years ago
I renamed the ESMFold blast database directory from v1 to v0 just now
/wynton/group/ferrin/databases/mol/ESMFold/v0
since the ESM Atlas developers told me that the database is at version 0, not version 1.
I put in a symbolic link from v1 to v0 in case you are working on it, but the v1 link will be removed in the future.
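The rename plus temporary compatibility link can be sketched as below. This runs in a throwaway temporary directory rather than on the real database path, the helper name is hypothetical, and it assumes a platform where creating symbolic links is permitted.

```python
import tempfile
from pathlib import Path


def rename_with_compat_link(parent, old_name, new_name):
    """Rename parent/old_name to parent/new_name and leave a symlink behind."""
    old = parent / old_name
    new = parent / new_name
    old.rename(new)
    # Relative target, so the link keeps working if the parent directory moves.
    old.symlink_to(new_name, target_is_directory=True)
    return old


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "v1").mkdir()
    (root / "v1" / "db.txt").write_text("placeholder")

    link = rename_with_compat_link(root, "v1", "v0")
    is_link = link.is_symlink()
    resolved_name = link.resolve().name
    content = (link / "db.txt").read_text()  # old path still reaches the data
```

Code that still opens the `v1` path keeps working until the link is removed, after which it must use `v0`.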
comment:9 by , 3 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Thanks for your patience. As you said, it was an easy change, especially because the ESMFold bundle's API is similar to the AlphaFold bundle's. I was surprised that ESMFold results can go straight through the PDB parser, but I was disappointed that I wasn't able to find a programmatic API for getting additional data like the one Conrad used for PDB results. I tried https://www.ebi.ac.uk/metagenomics/, thinking that there might be a 1:1 correspondence between MGYP- and MGYS-prefixed entries (P for Predicted?), but simply replacing P with S does not pull up valid hits on that website.
That's all to say that the data that comes right out of the ESMFold database is a little sparse. If you know of a resource for more detailed data that we would want to show the user, let me know.
follow-up: 10 comment:10 by , 3 years ago
Thanks! The "name" field in the blast esmfold results table is some integer. It looks like it might be the index of the sequence in the file. If that is what it is, it should be changed, since that number is meaningless and will just cause confusion. Best would probably be to remove the "name" field entirely for esmfold. If that is difficult for some reason, the name could be the same as the MGnify id and hidden in the results by default. The MGnify database is from metagenomic sequencing, meaning they randomly sequence DNA found in a sample. They don't even know the organism in many cases, so metadata for the sequences is usually lacking. It is therefore acceptable that the blast results show very little. At least the e-value scores give the similarity of the matching sequence, and fetching the structures would be the main use.
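One way to picture a results table without the "name" field is to build each row from only the hit identifier and its scores. The sketch below parses BLAST+ tabular output (`-outfmt 6`, default columns) into such rows. It is not the ChimeraX results-table code, the function name is hypothetical, and the sample lines use made-up scores and a placeholder second identifier purely for illustration.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(tabular_text):
    """Return rows with only id, e-value, and identity, sorted by e-value."""
    rows = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        hit = dict(zip(FIELDS, line.split("\t")))
        rows.append({
            "id": hit["sseqid"],              # MGnify id; no separate "name"
            "evalue": float(hit["evalue"]),
            "identity": float(hit["pident"]),
        })
    rows.sort(key=lambda row: row["evalue"])  # best (smallest) e-value first
    return rows


# Made-up example lines; the scores and the second id are not real results.
sample = "\n".join([
    "query\tPLACEHOLDER_ID\t60.0\t90\t36\t0\t1\t90\t1\t90\t1e-10\t80",
    "query\tMGYP000000000040\t98.0\t95\t2\t0\t1\t95\t1\t95\t1e-50\t190",
])
rows = parse_hits(sample)
for row in rows:
    print(row["id"], row["evalue"], row["identity"])
```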
Any progress on this?