Opened 3 years ago

Closed 3 years ago

#7970 closed enhancement (fixed)

Allow running ESMFold structure prediction and kmer and blast search in ChimeraX

Reported by: Tom Goddard Owned by: Tom Goddard
Priority: moderate Milestone:
Component: Structure Prediction Version:
Keywords: Cc: Elaine Meng
Blocked By: 8041 Blocking:
Notify when closed: Platform: all
Project: ChimeraX

Description

Twitter comment

https://twitter.com/Guillawme/status/1588088180481351681

suggested adding ESMFold structure prediction from sequence to ChimeraX. Apparently Meta has a web service to do it. There is a PyMol plugin that does it here

https://github.com/JinyuanSun/PymolFold

Change History (18)

comment:1 by Tom Goddard, 3 years ago

Here are the web service details

https://esmatlas.com/about#api

comment:2 by Tom Goddard, 3 years ago

I tried running a 2gbp prediction using the following and it ran in a few minutes and produced a structure very similar to the experimental model.

curl -X POST --silent --data "ADTRIGVTIYKYDDNFMSVVRKAIEQDAKAAPDVQLLMNDSQNDQSKQNDQIDVLLAKGVKALAINLVDPAAAGTVIEKARGQNVPVVFFNKEPSRKALDSYDKAYYVGTDSKESGIIQGDLIAKHWAANQGWDLNKDGQIQFVLLKGEPGHPDAEARTTYVIKELNDKGIKTEQLQLDTAMWDTAQAKDKMDAWLSGPNANKIEVVIANNDAMAMGAVEALKAHNKSSIPVFGVDALPEALALVKSGALAGTVLNDANNQAKATFDLAKNLADGKGAADGTNWKIDNKVVRVPYVGVDKDNLAEFSKK" https://api.esmatlas.com/foldSequence/v1/pdb/ > esm_2gbp.pdb
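For use from Python, the same POST could be sketched with the standard library as below. This is only an illustration of the curl call above, not the ChimeraX implementation; the names FOLD_URL, build_fold_request and fold_sequence are made up for this example, and the whitespace stripping and timeout value are assumptions.

```python
import urllib.request

# URL copied from the curl command above.
FOLD_URL = "https://api.esmatlas.com/foldSequence/v1/pdb/"

def build_fold_request(sequence: str) -> urllib.request.Request:
    """Build (but do not send) a POST request whose body is the raw sequence."""
    seq = "".join(sequence.split()).upper()  # drop spaces and newlines
    if not seq.isalpha():
        raise ValueError("sequence must be non-empty and contain only letters")
    return urllib.request.Request(FOLD_URL, data=seq.encode("ascii"), method="POST")

def fold_sequence(sequence: str, timeout: float = 300.0) -> str:
    """Send the request and return the response body as text (a PDB file on success)."""
    with urllib.request.urlopen(build_fold_request(sequence), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling fold_sequence("ADTRIGVTIYKYDD...") and writing the result to esm_2gbp.pdb would mirror the shell redirect above.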

comment:3 by Tom Goddard, 3 years ago

Implemented a basic "esmfold predict" command. It is pretty fast. PDB 7fr0 chain A, 169 residues, 3.7 seconds. PDB 7sd9, 309 residues, 19.6 seconds.

comment:4 by Tom Goddard, 3 years ago

The esmatlas server only allows sequence lengths less than or equal to 400. I found this when trying PDB 7soi chain A, 839 residues. It returns the message "Sequence is longer than 400." instead of a PDB file.

The web page interface (https://esmatlas.com/resources?action=fold) also limits sequence length to 400.

I could allow predicting subsequences, or even automatically chunking to length 400 with a specified overlap. It would be nice if "esmfold predict /A:1-400" worked, but currently the sequence specifiers do not allow specifying a subsequence. This is because the specifier returns a Chain instance (which allows aligning to the residues), and it does not look like our Chain class can be instantiated from Python. I could do it in a slightly hacky way with another option, "subsequence 1,400", and for chunking, "chunk 400 overlap 10".
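A rough sketch of what "chunk 400 overlap 10" could mean in terms of residue ranges is below. The function name chunk_ranges and the convention that each chunk repeats the last `overlap` residues of the previous one are assumptions for illustration, not a settled design.

```python
def chunk_ranges(length, chunk=400, overlap=10):
    """Return 1-based inclusive (start, end) residue ranges covering 1..length.

    Consecutive ranges share `overlap` residues; the last range may be shorter.
    """
    if chunk <= 0 or not 0 <= overlap < chunk:
        raise ValueError("need chunk > 0 and 0 <= overlap < chunk")
    ranges = []
    start = 1
    while start <= length:
        end = min(start + chunk - 1, length)
        ranges.append((start, end))
        if end == length:
            break
        start = end - overlap + 1  # next chunk begins `overlap` residues back
    return ranges

# The 839-residue 7soi chain A would split into three requests.
print(chunk_ranges(839))  # [(1, 400), (391, 790), (781, 839)]
```

Each range could then be passed as a "subsequence start,end" value, so every request stays within the server's 400 residue limit.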

Last edited 3 years ago by Tom Goddard

comment:5 by Tom Goddard, 3 years ago

Trying a length 400 sequence (PDB 7soi with the "subsequence 1,400" option) returned an error, apparently as a JSON dictionary

{"message": "Endpoint request timed out"}

Running the same prediction a second time completed in 0.8 seconds with a valid PDB model. It appears the server caches previous predictions. I saw this in the earlier PDB tests too, where the second run took under a second.

Several other structure predictions with 400 residues (using subsequence option) completed in 20 seconds. Only saw the time out error one time.
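Since the server can return a plain-text message ("Sequence is longer than 400.") or a JSON dictionary ({"message": "Endpoint request timed out"}) instead of a PDB file, the reply needs checking before it is opened as a model. A minimal sketch follows; FoldError and check_fold_response are hypothetical names, and treating only the timeout as worth retrying is an assumption based on the single timeout seen above.

```python
import json

class FoldError(Exception):
    """Raised when the server reply is not a PDB model."""
    def __init__(self, message, retryable=False):
        super().__init__(message)
        self.retryable = retryable

def check_fold_response(text):
    """Return text unchanged if it looks like a PDB file, else raise FoldError."""
    body = text.strip()
    if body.startswith("{"):
        # JSON error dictionary, e.g. {"message": "Endpoint request timed out"}
        try:
            message = json.loads(body).get("message", body)
        except ValueError:
            message = body
        # Assumption: a timeout may succeed on a second try, other errors will not.
        raise FoldError(message, retryable="timed out" in str(message).lower())
    if any(line.startswith(("ATOM", "HETATM")) for line in body.splitlines()):
        return text
    # Plain-text error such as "Sequence is longer than 400."
    raise FoldError(body or "empty response")
```

A caller could catch FoldError and resubmit once when retryable is true, since a repeated request for the same sequence appeared to be answered quickly from the server's cache.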

Last edited 3 years ago by Tom Goddard

comment:6 by Tom Goddard, 3 years ago

I added "esmfold predict", "esmfold fetch" and "esmfold pae" commands.

Might add blast and kmer search commands -- still deciding if ESMFold is useful enough to be worth the effort.

comment:7 by Tom Goddard, 3 years ago

Cc: Elaine Meng added

I made a web page giving an example of the ChimeraX esmfold commands and tweeted it.

https://www.rbvi.ucsf.edu/chimerax/data/esmfold-nov2022/esmfold.html

https://twitter.com/UCSFChimeraX/status/1590520170459848704

comment:8 by Tom Goddard, 3 years ago

Would be nice to allow blast search and fast kmer search of ESM Metagenomic Atlas. It does not appear that they provide the sequences, just the MGnify accession codes and scores for the predictions in a Pandas data frame (7 GB) here

https://dl.fbaipublicfiles.com/esmatlas/v0/stats.parquet

with 577944949 rows and 5 columns

id is the MGnify ID
ptm is the predicted TM score
plddt is the predicted average lddt
num_conf is the number of residues with plddt > 0.7
len is the total residues in the protein

as described on the ESM github here

https://github.com/facebookresearch/esm/tree/main/scripts/atlas

Then I can get the MGnify 2022_05 sequences as described here

https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2022_05/README.txt

split into 25 files each about 10 GB

https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2022_05/mgy_proteins_N.fa.gz

I am downloading these files to plato

/wynton/group/ferrin/databases/mol/ESMFold/

It will take 10 days to get the 250 Gbytes of MGnify compressed sequence files, apparently bottlenecked by EBI providing only 1 Mbit/sec. I think that will be the 2.5 billion sequences. I can then filter down to just the 200 million with ESMFold predictions better than 0.7 average plddt and better than 0.7 pTM.
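The filtering step could look roughly like the sketch below. It assumes the stats rows are available as dictionaries keyed by the column names listed above (for example after loading stats.parquet with pandas), and that plddt and ptm are both on a 0 to 1 scale. The function name and the sample ids are made up for illustration.

```python
def high_quality_ids(rows, min_plddt=0.7, min_ptm=0.7):
    """Return the set of ids whose average plddt and pTM both exceed the cutoffs."""
    return {row["id"] for row in rows
            if row["plddt"] > min_plddt and row["ptm"] > min_ptm}

# Hypothetical rows using the stats.parquet column names.
rows = [
    {"id": "example_1", "ptm": 0.85, "plddt": 0.91, "num_conf": 180, "len": 200},
    {"id": "example_2", "ptm": 0.40, "plddt": 0.75, "num_conf": 60, "len": 120},
    {"id": "example_3", "ptm": 0.72, "plddt": 0.65, "num_conf": 30, "len": 90},
]
print(high_quality_ids(rows))  # {'example_1'}
```

With hundreds of millions of rows, a Python set of id strings would need a lot of memory, so applying the same comparison directly on the data frame columns is probably the more practical route.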

Last edited 3 years ago by Tom Goddard

comment:9 by Tom Goddard, 3 years ago

I requested that the ESM Metagenomic Atlas provide a fasta file of the sequences that have been predicted

https://github.com/facebookresearch/esm/issues/366

They responded the same day, giving links to the mgnify90.fasta (112 Gbytes) and highquality_clust30.fasta (36997632 sequences, 7.2 Gbytes, 8045941352 bytes) sequence files

https://github.com/facebookresearch/esm/issues/341#issuecomment-1311254996

which I downloaded on plato with aria2c taking less than an hour

~/aria2/bin/aria2c --http-accept-gzip=true https://dl.fbaipublicfiles.com/esmatlas/v0/full/mgnify90.fasta

Last edited 3 years ago by Tom Goddard

comment:10 by Tom Goddard, 3 years ago

The ESM Atlas pages don't give any way to download the PAE json data. So I made a github issue requesting the atlas pages add a pae download.

https://github.com/facebookresearch/esm/issues/369

ChimeraX users will probably have ChimeraX fetch the PAE, but it is useful to have other ways to get it. I also made an issue asking that the PAE data be downloadable for predictions run on Meta's ESMFold server

https://github.com/facebookresearch/esm/issues/370

comment:11 by Tom Goddard, 3 years ago

Summary: changed from "Allow running ESMFold structure prediction in ChimeraX" to "Allow running ESMFold structure prediction and kmer and blast search in ChimeraX"

I've built the ESM atlas kmer and blast search databases, but I used the mgnify90.fasta database, which has 623796864 sequences, while the ESM Atlas has fewer, 577944949. This is possibly because they only took sequences of length between 20 and 1024, or maybe some sequences were culled for other reasons. I requested in github issue

https://github.com/facebookresearch/esm/issues/366

that they provide the actual sequences predicted as a fasta file. May not hear from them for a while, so might want to filter the sequences myself using the stats.parquet mgnify id list.
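Filtering the fasta file myself could be done by streaming it and keeping only records whose id is in the stats.parquet id list. The sketch below assumes the id is the first whitespace-delimited word of each header line, and the 20 to 1024 length range is only my guess at the atlas criteria. The function names read_fasta and filter_fasta are made up for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif line and header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)

def filter_fasta(lines, keep_ids, min_len=20, max_len=1024):
    """Yield records whose id is in keep_ids and whose length is within range."""
    for header, seq in read_fasta(lines):
        words = header.split()
        # Assumption: the first word of the header is the MGnify id.
        if words and words[0] in keep_ids and min_len <= len(seq) <= max_len:
            yield header, seq
```

Passing an open file object as lines keeps memory use to one record at a time, which matters for a fasta file of more than 100 Gbytes. Only the set of ids has to fit in memory.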

comment:12 by Tom Goddard, 3 years ago

I added the esmfold match command. Also made all the esmfold commands reuse common code from the alphafold commands. That involved changing a lot of alphafold command code but I believe I've tested on all the main alphafold and esmfold uses and they are working.

Still waiting for #8041 to get blast search of ESM database working.

comment:13 by Tom Goddard, 3 years ago

Cc: Elaine Meng removed

Tom Sercu has provided the full ESM Atlas fasta file

https://dl.fbaipublicfiles.com/esmatlas/v0/full/atlas.fasta

as described here

https://github.com/facebookresearch/esm/issues/366#issuecomment-1323962462

I have downloaded it to plato and am rebuilding the kmer and blast databases using it.

comment:14 by Tom Goddard, 3 years ago

Cc: Elaine Meng added

I added ESMFold and ESMFold Error Plot tools very similar to AlphaFold.

I added esmfold search and esmfold contacts commands, nearly the same as AlphaFold. The search command is not yet working, waiting for the BLAST web service to be updated #8041.

Once the BLAST search of the ESMFold database is working, the ESM commands and GUIs should be about the same as for AlphaFold. They are extremely similar, the main difference being that the AlphaFold database uses UniProt identifiers as the accession codes while the ESM atlas uses MGnify identifiers.

comment:15 by Tom Goddard, 3 years ago

Blocked By: 8041

comment:16 by Tom Goddard, 3 years ago

Blocked By: 8041

Once blast is working, I plan to make an ESMFold tutorial video that shows prediction, database searching, and error plots, and explains the pros and cons relative to AlphaFold.

comment:17 by Tom Goddard, 3 years ago

Blocked By: 8041

comment:18 by Tom Goddard, 3 years ago

Resolution: fixed
Status: assigned → closed

Done.

Hooked up the ESMFold gui search button which runs a blast search.
