#8041 closed enhancement (fixed)

Add ESMFold database to BlastProtein

Reported by:	Tom Goddard	Owned by:	Zach Pearson
Priority:	moderate	Milestone:
Component:	Sequence	Version:
Keywords:		Cc:
Blocked By:		Blocking:	7970
Notify when closed:		Platform:	all
Project:	ChimeraX

Description

The ESMFold metagenomic atlas is very similar to the AlphaFold database. I would like the blast protein command and tool to be able to search this database. I have created the blast database on plato at

/wynton/group/ferrin/databases/mol/ESMFold/v1

This is 600 million sequences, so considerably larger than the AlphaFold database, which is about 200 million sequences. I think we are using 4 threads in our blast searches now, and it would be good to increase that to 8 threads.
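As a rough illustration of the thread increase, the sketch below builds a `blastp` command line with `-num_threads` set to 8. It is not the actual web service code. The database base name `esmfold_db`, the query file name, and the e-value cutoff are hypothetical placeholders; only the directory comes from this ticket.

```python
import shlex


def build_blastp_command(query_fasta, database, threads=8, evalue=1e-3):
    """Return a blastp argument list; nothing is executed here."""
    return [
        "blastp",
        "-query", query_fasta,
        "-db", database,
        "-num_threads", str(threads),  # 8 instead of the current 4
        "-evalue", str(evalue),        # hypothetical cutoff
        "-outfmt", "6",                # tabular output
    ]


# The directory is from the ticket; "esmfold_db" is an assumed base name.
cmd = build_blastp_command(
    "query.fasta",
    "/wynton/group/ferrin/databases/mol/ESMFold/v1/esmfold_db",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if it is later passed to `subprocess.run`.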

The ESMFold sequence titles contain just the MGnify sequence database identifier, for example MGYP000000000040 in the following entry

>MGYP000000000040
MCGVYQSATFQATFFQYSYILHETLADIVVPDTIGGKIRKLRHSLNLAQMQFAKSIHRGFTTVTKWEQELTTTSEKALTNIIEIYKLQENYFDK

So the Blast tool output table of search results will be very simple, no species or descriptive name, just the identifier. The esmfold fetch command can fetch the associated predicted structure. I can help with that code.
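Because each title line holds only the MGnify identifier, pulling identifiers out of the FASTA data needs very little parsing. The following is a minimal sketch, not the code used by the Blast tool; the function name `parse_fasta` is made up for illustration.

```python
def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The title is just the identifier; split() also discards
            # any trailing padding after it.
            fields = line[1:].split()
            identifier, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


example = """>MGYP000000000040
MCGVYQSATFQATFFQYSYILHETLADIVVPDTIGGKIRKLRHSLNLAQMQFAKSIHRGFTTVTKWEQELTTTSEKALTNIIEIYKLQENYFDK
"""

entries = list(parse_fasta(example))
print(entries[0][0])
```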

Change History (10)

comment:1 by Tom Goddard, 3 years ago

Any progress on this?

comment:2 by Zach Pearson, 3 years ago

Was sidetracked last week by getting statistics from the backend and just finished (pending Elaine's review) the DICOM tickets. I'll enable and test this tomorrow.

comment:3 by Tom Goddard, 3 years ago

Thanks!

comment:4 by Zach Pearson, 3 years ago

For some reason the rq workers on webservices-test are not completing jobs (probably because I am calling them in a different way to get them to write log files), and there's an error in my sudo permissions on Plato that prevents me from starting a service. I can restart a running service, but not start a stopped one. Apache does not agree with me that these are equivalent operations, and complains that it cannot restart a stopped service instead of starting it.

comment:5 by Tom Goddard, 3 years ago

Blocking:	→ 7970

The ESMFold tools are done except for BLAST search. This is just like the AlphaFold database BLAST search, so I am surprised it is proving difficult to implement.

comment:6 by Tom Goddard, 3 years ago

Blocking:	7970

comment:7 by Tom Goddard, 3 years ago

Blocking:	→ 7970

comment:8 by Tom Goddard, 3 years ago

I renamed the ESMFold blast database directory from v1 to v0 just now

/wynton/group/ferrin/databases/mol/ESMFold/v0

since the ESM Atlas developers told me that the database is at version 0, not version 1.

https://github.com/facebookresearch/esm/issues/384

I put in a symbolic link from v1 to v0 in case you are working on it, but the v1 link will be removed in the future.

comment:9 by Zach Pearson, 3 years ago

Resolution:	→ fixed
Status:	assigned → closed

Thanks for your patience. As you said, it was an easy change, especially because the ESMFold bundle's API is similar to the AlphaFold bundle's. I was surprised that ESMFold results can go straight through the PDB parser, but I was disappointed that I wasn't able to find a programmatic API for getting additional data, like the one Conrad used for PDB results. I tried https://www.ebi.ac.uk/metagenomics/, thinking that there might be a 1:1 correspondence between MGYP- and MGYS-prefixed entries (P for Predicted?), but simply replacing P with S does not pull up valid hits on that website.

All of which is to say that the data that comes right out of the ESMFold database is a little sparse. If you know of a resource with more detailed data that we should show the user, let me know.

comment:10 by goddard@…, 3 years ago

Thanks!

The "name" field in the results table of the blast esmfold output is some integer. It looks like it might be the index of the sequence in the file. If so, it should be changed, since that number is meaningless and will just cause confusion. Best would probably be to remove the "name" field entirely for esmfold. If that is difficult for some reason, the name could be the same as the MGnify ID and hidden in the results by default.
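The two options above could be sketched roughly as follows. This assumes the results are available as a list of dictionaries with hypothetical keys `name` and `mgnify_id`; the real Blast tool data structures may differ, and hiding a column by default is a GUI setting that this sketch does not cover.

```python
def clean_esmfold_rows(rows, drop_name=True):
    """Return copies of result rows without the meaningless integer name.

    drop_name=True removes the "name" key entirely (the preferred option);
    drop_name=False instead overwrites it with the MGnify identifier.
    """
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not modified
        if drop_name:
            row.pop("name", None)
        else:
            row["name"] = row["mgnify_id"]
        cleaned.append(row)
    return cleaned


# Hypothetical row; the e-value is made up for illustration.
rows = [{"name": "17", "mgnify_id": "MGYP000000000040", "evalue": 1e-20}]
without_name = clean_esmfold_rows(rows)
with_id_as_name = clean_esmfold_rows(rows, drop_name=False)
```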

The MGnify database is from metagenomic sequencing, meaning they randomly sequence DNA found in a sample. In many cases they don't even know the organism, so metadata for the sequences is usually lacking. It is therefore acceptable that the blast results show very little. At least the e-values indicate the similarity of the matching sequence, and fetching the structures would be the main use.
