Opened 3 years ago
Closed 3 years ago
#7970 closed enhancement (fixed)
Allow running ESMFold structure prediction and kmer and blast search in ChimeraX
Reported by: | Tom Goddard | Owned by: | Tom Goddard |
---|---|---|---|
Priority: | moderate | Milestone: | |
Component: | Structure Prediction | Version: | |
Keywords: | Cc: | Elaine Meng | |
Blocked By: | 8041 | Blocking: | |
Notify when closed: | Platform: | all | |
Project: | ChimeraX |
Description
Twitter comment
suggested adding ESMFold structure prediction from sequence to ChimeraX. Apparently Meta has a web service to do it. There is a PyMol plugin that does it here
Change History (18)
comment:1 by , 3 years ago
comment:2 by , 3 years ago
I tried running a 2gbp prediction using the following and it ran in a few minutes and produced a structure very similar to the experimental model.
curl -X POST --silent --data "ADTRIGVTIYKYDDNFMSVVRKAIEQDAKAAPDVQLLMNDSQNDQSKQNDQIDVLLAKGVKALAINLVDPAAAGTVIEKARGQNVPVVFFNKEPSRKALDSYDKAYYVGTDSKESGIIQGDLIAKHWAANQGWDLNKDGQIQFVLLKGEPGHPDAEARTTYVIKELNDKGIKTEQLQLDTAMWDTAQAKDKMDAWLSGPNANKIEVVIANNDAMAMGAVEALKAHNKSSIPVFGVDALPEALALVKSGALAGTVLNDANNQAKATFDLAKNLADGKGAADGTNWKIDNKVVRVPYVGVDKDNLAEFSKK" https://api.esmatlas.com/foldSequence/v1/pdb/ > esm_2gbp.pdb
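The same request can be made from Python using only the standard library. This is a minimal sketch, not the ChimeraX implementation: the function names `build_fold_request` and `fold_sequence` and the `timeout` default are assumptions for illustration, and the endpoint URL is the one from the curl command above.

```python
import urllib.request

# Endpoint taken from the curl command above.
ESMFOLD_URL = "https://api.esmatlas.com/foldSequence/v1/pdb/"

def build_fold_request(sequence, url=ESMFOLD_URL):
    """Build a POST request whose body is the raw one-letter sequence."""
    data = sequence.strip().encode("ascii")
    return urllib.request.Request(url, data=data, method="POST")

def fold_sequence(sequence, timeout=300):
    """Send the sequence to the server and return the response body as text."""
    request = build_fold_request(sequence)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fold_sequence(seq)` and writing the returned text to a file would correspond to the `> esm_2gbp.pdb` redirect in the curl example.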
comment:3 by , 3 years ago
Implemented a basic "esmfold predict" command. It is pretty fast. PDB 7fr0 chain A, 169 residues, 3.7 seconds. PDB 7sd9, 309 residues, 19.6 seconds.
comment:4 by , 3 years ago
The esmatlas server only allows sequence lengths less than or equal to 400. Found this trying PDB 7soi chain A, 839 residues. It returns the message "Sequence is longer than 400." instead of a PDB file.
The web page interface (https://esmatlas.com/resources?action=fold) also limits sequence length to 400.
I could allow predicting subsequences or even automatically chunking to length 400 with a specified overlap. It would be nice if "esmfold predict /A:1-400" worked, but currently the sequence specifiers do not allow specifying a subsequence. This is because a specifier returns a Chain instance (which allows aligning to the residues), and it does not look like our Chain instance can be created from Python. I could do it in a slightly hacky way with another option "subsequence 1,400" and for chunking "chunk 400 overlap 10".
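The chunking idea ("chunk 400 overlap 10") can be sketched as follows. This only illustrates how the residue ranges might be computed; the function name `chunk_ranges` is hypothetical and the proposed option was not necessarily implemented this way.

```python
def chunk_ranges(length, chunk=400, overlap=10):
    """Return 1-based inclusive (start, end) ranges covering residues 1..length.

    Consecutive ranges share `overlap` residues.
    """
    if chunk <= 0 or not 0 <= overlap < chunk:
        raise ValueError("need chunk > 0 and 0 <= overlap < chunk")
    if length <= 0:
        return []
    ranges = []
    start = 1
    while True:
        end = min(start + chunk - 1, length)
        ranges.append((start, end))
        if end >= length:
            break
        # The next chunk starts so that it repeats the last `overlap` residues.
        start = end - overlap + 1
    return ranges
```

For the 839-residue chain A of 7soi, `chunk_ranges(839)` gives three ranges: 1-400, 391-790 and 781-839.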
comment:5 by , 3 years ago
Trying a length 400 sequence (PDB 7soi with the subsequence 1,400 option) returned an error that appears to be a JSON dictionary:
{"message": "Endpoint request timed out"}
Running the same prediction a second time completed in 0.8 seconds with a valid PDB model. It appears the server caches previous predictions. I also saw this with previous PDB tests, where the second run took under a second.
Several other structure predictions with 400 residues (using subsequence option) completed in 20 seconds. Only saw the time out error one time.
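Since the server can return a plain-text message or a JSON dictionary in place of a PDB file, the client has to check the response before opening it as a model. Here is a minimal sketch of such a check. The function name `parse_fold_response` is hypothetical, and deciding that a response is a PDB file by looking for ATOM or HETATM records is an assumption, not a documented server contract.

```python
import json

def parse_fold_response(text):
    """Return (pdb_text, error_message); exactly one of the two is None."""
    stripped = text.strip()
    if stripped.startswith("{"):
        # Looks like a JSON error such as {"message": "Endpoint request timed out"}.
        try:
            message = json.loads(stripped).get("message", stripped)
        except (ValueError, AttributeError):
            message = stripped
        return None, message
    # Assumption: a valid prediction contains at least one coordinate record.
    if any(line.startswith(("ATOM", "HETATM")) for line in stripped.splitlines()):
        return text, None
    # Anything else, e.g. "Sequence is longer than 400.", is treated as an error.
    return None, stripped or "empty response"
```

Because the timed-out request succeeded when repeated, a caller might reasonably retry once when the error message is a timeout.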
comment:6 by , 3 years ago
I added "esmfold predict", "esmfold fetch" and "esmfold pae" commands.
Might add blast and kmer search commands -- still deciding if ESMFold is useful enough to be worth the effort.
comment:7 by , 3 years ago
Cc: | added |
---|
I made a web page giving an example of the ChimeraX esmfold commands and tweeted it.
https://www.rbvi.ucsf.edu/chimerax/data/esmfold-nov2022/esmfold.html
https://twitter.com/UCSFChimeraX/status/1590520170459848704
comment:8 by , 3 years ago
Would be nice to allow blast search and fast kmer search of ESM Metagenomic Atlas. It does not appear that they provide the sequences, just the MGnify accession codes and scores for the predictions in a Pandas data frame (7 GB) here
with 577944949 rows and 5 columns
id is the MGnify ID
ptm is the predicted TM score
plddt is the predicted average lddt
num_conf is the number of residues with plddt > 0.7
len is the total residues in the protein
as described on the ESM github here
https://github.com/facebookresearch/esm/tree/main/scripts/atlas
Then I can get the MGnify 2022_05 sequences as described here
https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2022_05/README.txt
split into 25 files each about 10 GB
https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2022_05/mgy_proteins_N.fa.gz
I am downloading these files to plato
/wynton/group/ferrin/databases/mol/ESMFold/
It will take 10 days to get the 250 Gbytes of MGnify compressed sequence files, apparently bottlenecked by EBI providing only 1 Mbit/sec. I think that will be the 2.5 billion sequences. I can then filter down to just the 200 million with ESMFold predictions better than 0.7 average plddt and better than 0.7 pTM.
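The filtering step could look roughly like this. It is a sketch in plain Python over rows represented as dictionaries with the column names listed above. Reading the actual 7 GB data frame would need a parquet reader, which is not shown, and the function names and the example ids in the usage note are made up.

```python
def is_high_quality(row, min_plddt=0.7, min_ptm=0.7):
    """True if a prediction exceeds both the average plddt and pTM thresholds."""
    return row["plddt"] > min_plddt and row["ptm"] > min_ptm

def high_quality_ids(rows, min_plddt=0.7, min_ptm=0.7):
    """Return the MGnify ids of the rows that pass both thresholds."""
    return [row["id"] for row in rows if is_high_quality(row, min_plddt, min_ptm)]
```

For example, given rows with ids "X1" (plddt 0.85, ptm 0.80) and "X2" (plddt 0.90, ptm 0.50), only "X1" is kept, because "X2" fails the pTM threshold.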
comment:9 by , 3 years ago
I requested that the ESM Metagenomic Atlas provide a fasta file of the sequences that have been predicted.
They responded same day giving links to the mgnify90.fasta (112 Gbytes) and highquality_clust30.fasta (36997632 sequences, 7.2 Gbytes, 8045941352 bytes) sequence files
https://github.com/facebookresearch/esm/issues/341#issuecomment-1311254996
which I downloaded on plato with aria2c taking less than an hour
~/aria2/bin/aria2c --http-accept-gzip=true https://dl.fbaipublicfiles.com/esmatlas/v0/full/mgnify90.fasta
comment:10 by , 3 years ago
The ESM Atlas pages don't give any way to download the PAE json data. So I made a github issue requesting the atlas pages add a pae download.
ChimeraX users will probably have ChimeraX fetch the PAE but it is useful to have other ways to get it. I also made an issue to allow the predictions run on Meta's ESMFold server to download the PAE data
comment:11 by , 3 years ago
Summary: | Allow running ESMFold structure prediction in ChimeraX → Allow running ESMFold structure prediction and kmer and blast search in ChimeraX |
---|
I've built the ESM atlas kmer and blast search databases, but I used the mgnify90.fasta database, which has 623796864 sequences. The ESM Atlas has fewer sequences, 577944949, possibly because they only took sequences of length between 20 and 1024, or maybe some sequences were just culled for other reasons. I requested in a github issue
that they provide the actual sequences predicted as a fasta file. May not hear from them for a while, so might want to filter the sequences myself using the stats.parquet mgnify id list.
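Filtering the fasta file against the id list could be done in a single streaming pass, so that the 112 Gbyte file never has to fit in memory. The sketch below assumes that the sequence id is the first whitespace-delimited token after ">" on each header line; the actual header layout of mgnify90.fasta was not checked here, and the function name `filter_fasta` is hypothetical.

```python
def filter_fasta(lines, keep_ids):
    """Yield the lines of FASTA records whose id is in keep_ids.

    `lines` is any iterable of text lines (for example an open file) and
    `keep_ids` is a set of ids, such as those listed in stats.parquet.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Assumption: the id is the first token of the header line.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_ids
        if keep:
            yield line
```

Used as `out.writelines(filter_fasta(open_input, ids))`, this writes only the records that have predictions; a set keeps each membership test cheap even with hundreds of millions of ids, though the set itself needs a lot of memory at that size.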
comment:12 by , 3 years ago
I added the esmfold match command. Also made all the esmfold commands reuse common code from the alphafold commands. That involved changing a lot of alphafold command code but I believe I've tested on all the main alphafold and esmfold uses and they are working.
Still waiting for #8041 to get blast search of ESM database working.
comment:13 by , 3 years ago
Cc: | removed |
---|
Tom Sercu has provided the full ESM Atlas fasta file
https://dl.fbaipublicfiles.com/esmatlas/v0/full/atlas.fasta
as described here
https://github.com/facebookresearch/esm/issues/366#issuecomment-1323962462
I have downloaded it to plato and am rebuilding the kmer and blast databases using it.
comment:14 by , 3 years ago
Cc: | added |
---|
I added ESMFold and ESMFold Error Plot tools very similar to AlphaFold.
I added esmfold search and esmfold contacts commands, nearly the same as AlphaFold. The search command is not yet working; it is waiting for the BLAST web service to be updated (#8041).
Once the BLAST search of the ESMFold database is working, the ESM commands and GUIs should be about the same as for AlphaFold. They are extremely similar, the main difference being that the AlphaFold database uses UniProt identifiers as the accession codes while the ESM atlas uses MGnify identifiers.
comment:15 by , 3 years ago
Blocked By: | → 8041 |
---|
comment:16 by , 3 years ago
Blocked By: | 8041 |
---|
I plan to make an ESMFold tutorial video once blast is working, that shows prediction, database searching, error plots, and explains pros and cons relative to AlphaFold.
comment:17 by , 3 years ago
Blocked By: | → 8041 |
---|
comment:18 by , 3 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Done.
Hooked up the ESMFold gui search button which runs a blast search.
Here are the web service details