Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#3515 closed enhancement (fixed)

Fetch biological assemblies from PDB for NIH 3D

Reported by: phil.cruz@… Owned by: Tom Goddard
Priority: major Milestone:
Component: Higher-Order Structure Version:
Keywords: Cc: philip.macmenamin@…, Elaine Meng
Blocked By: Blocking:
Notify when closed: Platform: all
Project: ChimeraX

Description

Phil Cruz and Darrell Hurt say the NIH 3D pipeline needs ChimeraX to be able to fetch biological assemblies from the PDB as Chimera does. Need to check if this can utilize mmCIF.

Change History (15)

comment:1 by Tom Goddard, 5 years ago

Reporter: changed from Tom Goddard to phil.cruz@…

I looked at the RCSB web site to see what format the biological assembly files come in. Some entries provide mmCIF format from the Downloads menu on an entry page and some provide the old PDB format, apparently for older entries. Here are URLs that the Download menu fetches.

https://files.rcsb.org/download/2BFU.pdb1.gz

https://files.rcsb.org/download/6ZM3-assembly1.cif
https://files.rcsb.org/download/6ZM3-assembly2.cif

ChimeraX can readily generate the assemblies from a standard mmCIF asymmetric unit file if it knows which mmCIF assembly defined in the file to use. But there does not appear to be any information in the mmCIF files indicating which assemblies are biological assemblies. For example, 1ej6 (virus capsid) has 6 assemblies but only two are the full capsid.

It is not hard to fetch the assemblies using the above URLs: first try to get mmCIF, and if that fails try *.pdb1.gz. But I have sent an email to the RCSB help desk asking them to clarify the situation. Since they are phasing out PDB format, I would think all assemblies should be available in mmCIF.
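
The mmCIF-first, PDB-fallback fetch described above could be sketched roughly as follows. This is only an illustration, not the ChimeraX implementation: the function names and the `fetch` hook are hypothetical, and the URL patterns are simply generalized from the example URLs listed above.

```python
import urllib.error
import urllib.request

RCSB_DOWNLOAD = "https://files.rcsb.org/download"


def candidate_urls(pdb_id, assembly=1):
    """Return RCSB assembly URLs to try: mmCIF first, then legacy PDB format."""
    pid = pdb_id.upper()
    return [
        f"{RCSB_DOWNLOAD}/{pid}-assembly{assembly}.cif",
        f"{RCSB_DOWNLOAD}/{pid}.pdb{assembly}.gz",
    ]


def fetch_assembly(pdb_id, assembly=1, fetch=None):
    """Return (url, data) for the first candidate URL that downloads.

    `fetch` is a hypothetical hook taking a URL and returning bytes. It
    defaults to a plain urllib download and exists so the fallback logic
    can be exercised without network access.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as response:
                return response.read()
    errors = []
    for url in candidate_urls(pdb_id, assembly):
        try:
            return url, fetch(url)
        except urllib.error.URLError as err:  # HTTPError is a subclass
            errors.append(f"{url}: {err}")
    raise LookupError("no assembly file found; tried " + "; ".join(errors))
```

Trying mmCIF first means the fallback becomes dead code once every entry has an mmCIF assembly file, without any change to callers.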

For the NIH 3D pipeline, is only the first assembly desired in cases where there is more than one assembly? I believe there are cases where multiple assemblies actually have different binding interfaces between the components, but I don't know if NIH 3D plans to offer a choice among those.

comment:2 by Tom Goddard, 5 years ago

Cc: philip.macmenamin@… Elaine Meng added

I added fetching of biological assemblies from RCSB using "open 2d2n from pdbbio". But the mmCIF and PDB files they provide have multiple subunit models, not a single model. Chimera used PQS files from PDBe, which provided a single model with all subunits. But I think PQS has not been used for years, and none of the models from the last several years that I tried had a PQS assembly. PDBe web pages provide assemblies that are in a single file, for example

http://www.ebi.ac.uk/pdbe/static/entry/download/2d2n-assembly-1.cif.gz

These might be computed from PISA. But it is not clear how any of the biological assemblies are being computed. I sent the PDB help desk a query about why some biological assemblies from RCSB are in mmCIF format and some in PDB format and whether they plan to provide all in mmCIF. Have not yet heard back.

Current ChimeraX code uses RCSB but since each assembly is multiple models this may not be what is desired for NIH 3D.

in reply to:  3 ; comment:3 by Elaine Meng, 5 years ago

Aren't the assemblies listed in PDBe identical to those in the RCSB PDB?  I would have thought that assembly information would be consistent (mirrored) between the two, but it sounds like your experience is that PDBe is missing data.  

ChimeraX does not yet have a "combine" command for merging models.

comment:4 by Tom Goddard, 5 years ago

PDBe has two types of biological assemblies. The old type, made by a program called PQS, is the one Chimera uses; I could not find any reference to it on their web site and it does not seem to be available for any structures released in recent years. Many years ago PDBe switched to a program called PISA to predict assemblies, and I believe that is what they offer on their web pages in mmCIF format, with a single model for each biological assembly. Then the RCSB has download links that sometimes give old PDB format (even for structures from 2019 or 2020) and sometimes give mmCIF format, but in both cases they are multi-model files where one assembly is made from multiple models. Their files are multi-model even in cases where the subunits are not the same protein (6zm3 is an example), which I think the PDB and mmCIF formats technically don't allow, since they only allow multiple models of the same protein.

So in summary there appear to be 4 different biological assembly sources, 2 from PDBe and 2 from RCSB, and it is not certain how any of them are calculated.

in reply to:  5 ; comment:5 by phil.cruz@…, 5 years ago

The current 3D Print Exchange pipeline only fetches the ".1" version of the biological assembly, for example:

open biounitID:2OM2.1

I think this corresponds to mmCIF Assembly 1 in ChimeraX, so maybe we have what we need, other than the discontinuity between PDBe and RCSB, and old vs. new (which is a different wrinkle).

Phil

comment:6 by Tom Goddard, 5 years ago

There are more options than I thought. Chimera fetches two different kinds of biological assemblies.

open biounitID:2d2n.1

gets

ftp://ftp.wwpdb.org/pub/pdb/data/biounit/coordinates/all/2d2n.pdb1.gz

which opens as 6 models 0.1-0.6,

and

open pqsID:2d2n

gets

ftp://ftp.ebi.ac.uk/pub/databases/msd/pqs/macmol/2d2n.mmol

which opens as a single model, but it is a much smaller assembly than the biounitID one and is a PDB format file with the unusual file suffix .mmol.

If the current NIH Print Exchange is using biounitID then I guess it is handling the multiple model assemblies.

If I try a structure deposited Dec 2019, neither biounitID nor pqsID gets anything, but the RCSB 6ts0 page has the assembly

https://files.rcsb.org/download/6TS0-assembly1.cif

which opens as 24 separate subunit models. And PDBe has a different assembly file linked on its web pages

http://www.ebi.ac.uk/pdbe/static/entry/download/6ts0-assembly-1.cif.gz

that opens as a single model containing the 24 subunits. And also PDBe has an "atoms only" version of the assembly that just contains the atom_site table, not sure why.

http://www.ebi.ac.uk/pdbe/static/entry/download/6ts0-assembly-1_atom_site.cif.gz

Both of these PDBe assembly files, when downloaded by clicking the link in Safari on the PDBe web site, produce a *.cif file without the .gz suffix that is nevertheless still compressed, so it won't open in ChimeraX unless I first change the suffix to .cif.gz and decompress by hand. This is probably something broken about the MIME type specification given in the PDBe web page HTML. ChimeraX can easily deal with that since it won't go through the web page.

comment:7 by Tom Goddard, 5 years ago

The bottom line of the previous long list of possible biological assemblies available from RCSB and PDBe is that there are some old assemblies (fetched by Chimera) that don't seem to be provided for new PDB entries. And there are new assemblies offered on the RCSB and PDBe web sites that are available for newer entries, but these are either multi-model (RCSB) or single-model (PDBe) assemblies, possibly computed by different software.

We of course want to use files that the PDB is providing for new entries. Probably the single-model PDBe assemblies are easier to handle. The PDBe 6ts0 assembly spewed dozens of warnings when opened in ChimeraX, while the RCSB 6ts0 multi-model assembly gave no warnings.

comment:8 by Tom Goddard, 5 years ago

I have replaced the pdbbio database name with rcsb_bio and pdbe_bio for fetching biological assemblies, e.g. "open 6ts0 from rcsb_bio", because RCSB and PDBe provide different files. I think NIH 3D will be best off using the PDBe biological assemblies because they are single models instead of the multiple subunit models in one file that the RCSB provides. To request just the first assembly when more than one is available, use "open 2d2n from rcsb_bio maxAssemblies 1".

Currently the PDBe biological assembly fetch fails because PDBe is incorrectly double compressing the .gz files. I have emailed the PDBe help desk about this -- I would be surprised if they did not know, because downloading from their web site gives an unusable file that has one .gz suffix but has been compressed with gzip two times. It looks like a web server misconfiguration where the server adds a second gzip compression but does not indicate that in the HTTP reply headers.

Begin forwarded message:

From: "PDBe helpdesk via RT" <pdbehelp@…>
Subject: [PDBe #435715] AutoReply: Biological assembly files compressed twice
Date: July 30, 2020 at 5:48:01 PM PDT
To: goddard
Reply-To: pdbehelp@…

Thank you for contacting the Protein Data Bank in Europe (PDBe) help desk.

This message has been automatically generated in response to the creation of a ticket regarding:

"Biological assembly files compressed twice",

a summary of which appears below.

Your query has been assigned an ID of [PDBe #435715] and a member of our support team will get back to you as soon as possible. There is no need to reply to this message right now. However, please include the string [PDBe #435715] in the subject line of any future correspondence about this issue.

Best regards,
The PDBe Support Team


When I download a biological assembly file from PDBe using the download menu on an entry web page the file I get is gzip compressed twice making it unusable.

Here is an example. Using the Downloads menu on page

https://www.ebi.ac.uk/pdbe/entry/pdb/6ts0

and choosing Assembly 1 (mmcif; gz), the downloaded file is ~/Downloads/6ts0-assembly-1.cif.gz using the current Firefox web browser. I then gunzip this file

gunzip ~/Downloads/6ts0-assembly-1.cif.gz

producing ~/Downloads/6ts0-assembly-1.cif. But that file is also gzip compressed. If I uncompress it I get the plain text file. Using the Safari browser to download the file I directly get ~/Downloads/6ts0-assembly-1.cif since Safari automatically decompresses when there is a .gz suffix. But the resulting file is still gzip compressed and cannot be used by applications expecting a plain text mmCIF. The PDBe download menu entry is using

http://www.ebi.ac.uk/pdbe/static/entry/download/6ts0-assembly-1.cif.gz

Using this URL directly in the browser also produces a double compressed file. If I use wget to fetch this file, which does not request the web server to compress (or possibly requests it not to compress),

wget http://www.ebi.ac.uk/pdbe/static/entry/download/6ts0-assembly-1.cif.gz

then I get 6ts0-assembly-1.cif.gz and it is only compressed once as it should be.

So it appears your web server is misconfigured and somehow the server adds a second level of compression to the already compressed file and maybe also sends wrong http reply headers claiming it has not been compressed. Not sure. But definitely the downloaded biological assembly files are not usable.

Tom Goddard
ChimeraX developer
UC San Francisco
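
A client-side workaround for the double compression described above could look something like the sketch below, which keeps decompressing while the data still starts with the gzip magic bytes. This is only an illustration and not what ChimeraX does; the function name and the `max_layers` limit are hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def gunzip_fully(data, max_layers=4):
    """Strip gzip layers until the payload no longer looks gzip compressed.

    Returns (payload, number_of_layers_removed). `max_layers` is an
    arbitrary safety limit against pathological input.
    """
    layers = 0
    while data[:2] == GZIP_MAGIC:
        if layers == max_layers:
            raise ValueError("too many nested gzip layers")
        data = gzip.decompress(data)
        layers += 1
    return data, layers
```

Since a plain text mmCIF file begins with a "data_" block header rather than the gzip magic bytes, the loop stops as soon as the real file contents are reached.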

in reply to:  9 ; comment:9 by Elaine Meng, 5 years ago

Yes, the assemblies from RCSB are less convenient because (as Tom said) even a single assembly can be split up into multiple submodels.  It is not even one submodel per subunit (chain) but instead per copy of the asymmetric unit, which might itself contain multiple chains. Aside to Tom: perhaps "subunit" in the submodel name should be something else like "copy" if that isn't too pedantic ... In the 2D2N example, assembly 1 is a 24-mer, but you get 6 submodels, each a copy of the asymmetric unit heterotetramer.  When the biological assembly is less than the asymmetric unit (1HV4), only one model is fetched per assembly, but there can still be multiple assemblies.  

comment:10 by Tom Goddard, 5 years ago

I think the PDBe single model biological assemblies are going to be the most useful. I sent their help desk email about the files being doubly compressed but have not heard back from them.

comment:11 by Tom Goddard, 5 years ago

Rachel Green from the RCSB help desk reports that RCSB will have all biological assemblies in mmCIF format in 2021, and also that mmCIF files currently do not annotate which assemblies are biological, although there is a pdbx_struct_assembly_auth_evidence category used in 20% of entries that apparently indicates that a specified assembly is believed to be biological.

Begin forwarded message:

From: Rachel Kramer Green <kramer@…>
Subject: Re: [JIRA] (HELP-15577) Are biological assemblies identified in mmCIF files?
Date: August 4, 2020 at 6:03:24 AM PDT
To: goddard
Cc: info <info@…>
Reply-To: info <info@…>

Dear Tom,
The short answer to your first question is no, the mmCIF file does not indicate which assemblies are biological. However, we have recently implemented a field (rcsb_candidate_assembly (y/n); available through the API at data.rcsb.org) to indicate that an assembly is biological. This information may eventually be incorporated in a new data item in the mmCIF.
Also, evidence for functional assemblies is being collected in pdbx_struct_assembly_auth_evidence; the coverage for this recently added category is now 20%.
We plan to provide biological assembly files in mmCIF format in 2021.
Best wishes,
Rachel


RACHEL KRAMER GREEN, PH.D.
Scientific Support & Customer Service Lead
RCSB Protein Data Bank

comment:12 by Tom Goddard, 5 years ago

I added a fix for the double compression problem for fetching PDBe biological assemblies, for example, "open 6p8k from pdbe_bio" now works. The fix was for pdbe_bio fetch to make the http request not allow gzip compression.
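
For anyone scripting the download outside ChimeraX, one way to ask a server not to add its own compression is to send an "Accept-Encoding: identity" request header. The sketch below shows that with the Python standard library. It is an assumption that this matches the actual ChimeraX fix or that the PDBe server honors the header; the helper name is hypothetical.

```python
import urllib.request


def identity_request(url):
    """Build a request asking the server not to apply any content encoding."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "identity"})


request = identity_request(
    "http://www.ebi.ac.uk/pdbe/static/entry/download/6ts0-assembly-1.cif.gz"
)
# urllib.request.urlopen(request) would then perform the actual download;
# the .cif.gz file itself is still gzip compressed once and needs gunzip.
```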

comment:13 by Tom Goddard, 5 years ago

Biological assemblies can now be fetched from PDBe or RCSB, e.g.

open 6p8k from pdbe_bio maxAssemblies 1
open 2d2n from rcsb_bio maxAssemblies 1

The PDBe fetch gives a single model for each assembly while the RCSB fetch has each assembly composed of multiple models. The maxAssemblies option limits fetching to only the first N available assemblies. Without that option all available assemblies are fetched.
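
For a pipeline that generates these commands, a small helper along the following lines could build the command string. The helper and its parameter names are hypothetical; only the command syntax and the database names pdbe_bio and rcsb_bio come from this ticket.

```python
def open_assembly_command(pdb_id, source="pdbe_bio", max_assemblies=None):
    """Build a ChimeraX "open" command that fetches biological assemblies."""
    if source not in ("pdbe_bio", "rcsb_bio"):
        raise ValueError(f"unknown assembly source: {source}")
    command = f"open {pdb_id} from {source}"
    if max_assemblies is not None:
        if max_assemblies < 1:
            raise ValueError("max_assemblies must be at least 1")
        command += f" maxAssemblies {max_assemblies}"
    return command


# Inside ChimeraX, where a `session` object is available, the string could be
# executed with chimerax.core.commands.run(session, command).
command = open_assembly_command("6p8k", max_assemblies=1)
```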

comment:14 by Tom Goddard, 5 years ago

Resolution: fixed
Status: assigned → closed

comment:15 by Tom Goddard, 5 years ago

Rachel Green at RCSB says the mmCIF assemblies they will provide in 2021 will be a single model for each assembly.

Begin forwarded message:

From: Rachel Kramer Green <kramer@…>
Subject: Re: [JIRA] (HELP-15577) Are biological assemblies identified in mmCIF files?
Date: August 6, 2020 at 5:54:21 AM PDT
To: Tom Goddard
Cc: info <info@…>
Reply-To: info <info@…>

Hi Tom,

Yes, the next generation of mmCIF assembly files will be in a single model.

Best regards,

Rachel


RACHEL KRAMER GREEN, PH.D.
Scientific Support & Customer Service Lead
RCSB Protein Data Bank
