Opened 4 years ago
Closed 4 years ago
#5755 closed enhancement (fixed)
Make AlphaFold search use latest version of AlphaFold database
Reported by: | Tom Goddard | Owned by: | Tom Goddard |
---|---|---|---|
Priority: | moderate | Milestone: | |
Component: | Structure Prediction | Version: | |
Keywords: | | Cc: | Zach Pearson, Elaine Meng |
Blocked By: | | Blocking: | |
Notify when closed: | | Platform: | all |
Project: | ChimeraX | | |
Description
The AlphaFold database just updated from version 1 to version 2 with twice as many structures (now including most of SwissProt). The version 2 files on the ftp site where ChimeraX fetches the models have suffix _v2 instead of _v1. Older ChimeraX releases will therefore keep reading only version 1 files, because the _v1 suffix is hard-coded in the ChimeraX code.
The aim is to have future ChimeraX versions use the latest AlphaFold database version without requiring updating ChimeraX.
Change History (7)
comment:1 by , 4 years ago
comment:2 by , 4 years ago
I got the AlphaFold sequences fasta file from Mihaly Varadi at EBI from here
https://alphafold.ebi.ac.uk/files/sequences.fasta
But the title lines look like
AFDB:AF-A0A1I9LN13-F1 RING/U-box superfamily protein
whereas before from Uniprot we were using
tr|A0A1I9LN13|A0A1I9LN13_ARATH RING/U-box superfamily protein OS=Arabidopsis thaliana OX=3702 GN=At3g58720 PE=4 SV=1
So we are missing the uniprot name and species name in the new file. I'll ask Mihaly if they can provide the more detailed info. If not, we could process the file ourselves and get the info from uniprot, or we can live with the missing info.
comment:3 by , 4 years ago
I replaced the fasta file sequence titles with the uniprot titles that include uniprot name and species. I did this by downloading all the AlphaFold sequences from uniprot and using a Python script to replace the titles. This is described in
bundles/alphafold/src/README
and the Python script is
bundles/alphafold/src/fix_seq_titles.py
The README also describes how to make the BLAST database indices. The files were then installed on plato in
/databases/mol/AlphaFold/v2
and ChimeraX blast and blat for AlphaFold are now using those sequences.
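For reference, the title replacement described in this comment can be sketched roughly as below. This is not the actual fix_seq_titles.py. It is a minimal illustration that assumes FASTA header lines start with ">" and that the accession can be parsed from the two title formats quoted in comment 2. The function names are hypothetical.

```python
def accession_from_afdb_title(title):
    """'AFDB:AF-A0A1I9LN13-F1 RING/U-box ...' -> 'A0A1I9LN13'."""
    ident = title.split()[0]                      # 'AFDB:AF-A0A1I9LN13-F1'
    return ident.split(':', 1)[1].split('-')[1]   # middle field of 'AF-<acc>-F1'


def accession_from_uniprot_title(title):
    """'tr|A0A1I9LN13|A0A1I9LN13_ARATH ...' -> 'A0A1I9LN13'."""
    return title.split('|')[1]


def replace_titles(afdb_lines, uniprot_titles):
    """Return the AlphaFold FASTA lines with uniprot titles swapped in.

    afdb_lines: lines of the AlphaFold sequences file.
    uniprot_titles: uniprot title strings without the leading '>'.
    Titles with no uniprot match are kept unchanged.
    """
    by_accession = {accession_from_uniprot_title(t): t for t in uniprot_titles}
    out = []
    for line in afdb_lines:
        line = line.rstrip('\n')
        if line.startswith('>'):
            title = line[1:].strip()
            accession = accession_from_afdb_title(title)
            out.append('>' + by_accession.get(accession, title))
        else:
            out.append(line)                      # sequence line, unchanged
    return out
```

Keeping the original title when no uniprot entry matches is a choice made for this sketch; the real script may handle that case differently.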
comment:4 by , 4 years ago
Fetching and searching the v2 AlphaFold database requires a ChimeraX 1.4 daily build. ChimeraX 1.3 still uses AlphaFold database v1 files and sequences.
The new code will allow updating to new AlphaFold database versions without updating ChimeraX. ChimeraX checks what version of the database to use by querying a file
https://www.rbvi.ucsf.edu/chimerax/data/status/alphafold_database.json
that lists the version of the database. It also includes the URL to fetch database files, so if EBI changes that or we want to host the files at UCSF we can update what ChimeraX uses by updating this file. The check is done if an AlphaFold search or fetch is tried and the previous check was more than one day earlier. The source for this JSON file is in the alphafold bundle
bundles/alphafold/src/alphafold_database.json
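A rough sketch of the once-per-day check described above, for illustration only. It is not the code in the alphafold bundle. The class name, the injectable fetch and clock arguments, and the choice to keep the old settings when the query fails are all assumptions of this sketch. The JSON contents are treated as an opaque dictionary because the key names are not given here.

```python
import json
import time
from urllib.request import urlopen

STATUS_URL = 'https://www.rbvi.ucsf.edu/chimerax/data/status/alphafold_database.json'
ONE_DAY = 24 * 60 * 60  # seconds


def fetch_status(url=STATUS_URL, timeout=10):
    """Download and parse the status file."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


class DatabaseStatus:
    """Caches the status file and refreshes it at most once per max_age."""

    def __init__(self, fetch=fetch_status, clock=time.time, max_age=ONE_DAY):
        self._fetch = fetch
        self._clock = clock
        self._max_age = max_age
        self._settings = None
        self._checked_at = None

    def settings(self):
        """Call this when an AlphaFold search or fetch is attempted."""
        now = self._clock()
        stale = (self._checked_at is None
                 or now - self._checked_at > self._max_age)
        if stale:
            try:
                self._settings = self._fetch()
            except (OSError, ValueError):
                # Network or parse failure: keep the old settings if we
                # have any (assumed behavior), otherwise report the error.
                if self._settings is None:
                    raise
            self._checked_at = now
        return self._settings
```

Passing the fetch function and clock in as arguments is only there so the refresh logic can be exercised without a network connection.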
comment:5 by , 4 years ago
These changes will be in tomorrow's daily build. Should post on twitter that a ChimeraX daily build is needed to access the new AlphaFold version 2 database and that ChimeraX 1.3 only gets the older database files.
comment:6 by , 4 years ago
Cc: | added |
---|
Zach and I added support for AlphaFold fetch and search and match to use version 2 of the AlphaFold database that includes most of the SwissProt sequences in addition to the 21 organism proteomes from version 1. This increased the total predicted sequences from 360,000 to 800,000 and added a lot more diversity.
ChimeraX 1.3 and earlier only use the version 1 AlphaFold database. ChimeraX 1.4 daily builds starting tomorrow will use the version 2 database (and will use newer AlphaFold database versions automatically without having to update ChimeraX). Would be good to explain this in the docs since the version 2 database is a big improvement over version 1 and needs a daily build to use it.
comment:7 by , 4 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Fixed.
Elaine has documented which version of the database is being used.
I tweeted about it so users will know they need a daily build to get version 2 of the database.
The blast and blat alphafold searches run on plato and can easily be updated to use a new AlphaFold sequence database. They return uniprot accession codes to ChimeraX. The question is how ChimeraX knows what version of the database, i.e. what file suffix, to use to fetch the structure.
There are a few options. ChimeraX could make a separate query asking what is the latest AlphaFold database version. Another approach would be to have the blast and blat services return not only the uniprot identifiers but also the database version.
Will the latest database version always have all the structure files or could it omit structures that were available in earlier versions? It gets more complicated if different sequence hits might belong to different versions of the database. From the current evidence with versions 1 and 2 it appears that all version 1 files are replaced by version 2 files. So we only need to look at _v2 structure files. It is unclear if that will remain true for future versions.
Having ChimeraX make a separate query for the current database version will slow things down a bit. Even a direct fetch given a uniprot id would need this query, although the version will probably change only a few times per year. Another approach is for ChimeraX to remember the most recent version of the database it has seen and always try the next higher version first, then fall back to the remembered version. That also slows things down. Having the blast and blat servers provide the version avoids the extra server round-trip. If the user simply fetches by uniprot id without using the blast or blat servers, we can try the next higher version and then the most recent version. This is the scenario when trying to fetch the uniprot ids for an existing PDB structure. For that case it seems we have to at least sometimes do an extra fetch to get the database version. We could limit that check to once per day, or once per ChimeraX session.
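The "remember the last version seen, try the next higher one, then fall back" option could look something like the sketch below. It is only an illustration of that option, not what was implemented. The file name pattern AF-<uniprot id>-F1-model_v<N>.cif is an assumption based on the identifiers and _v1/_v2 suffixes mentioned in this ticket. The file_exists callback is a hypothetical stand-in for whatever request checks that a file is on the server.

```python
def model_file_name(uniprot_id, version):
    # Assumed naming pattern, for illustration only.
    return f'AF-{uniprot_id}-F1-model_v{version}.cif'


def find_model(uniprot_id, last_seen_version, file_exists):
    """Try version last_seen_version + 1 first, then last_seen_version.

    file_exists: callable taking a file name and returning True if the
    server has that file (hypothetical; it costs one request per call).
    Returns (version, file_name), or None if neither file is available.
    """
    for version in (last_seen_version + 1, last_seen_version):
        name = model_file_name(uniprot_id, version)
        if file_exists(name):
            return version, name
    return None


# Example with an in-memory set standing in for the server.
available = {'AF-P12345-F1-model_v2.cif'}
print(find_model('P12345', 1, available.__contains__))
```

The cost noted above shows up here directly: when the database has not been updated, every fetch makes one failed request for the higher version before the request that succeeds.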