Opened 7 years ago
Closed 7 years ago
#1180 closed defect (fixed)
mmCIF reader forms no chains
| Reported by: | | Owned by: | Greg Couch |
|---|---|---|---|
| Priority: | blocker | Milestone: | 0.7 |
| Component: | Input/Output | Version: | |
| Keywords: | | Cc: | chimera-programmers |
| Blocked By: | | Blocking: | |
| Notify when closed: | | Platform: | all |
| Project: | ChimeraX | | |
Description
At PDBe (EMBL-EBI) we are working on improving the presentation of small molecules and their binding. I am looking into using ChimeraX to add hydrogens to protein structures. However, I found that the protonated structure written as mmCIF differs from the input structure in more than the added hydrogens.
 
I attach both the input (1tqn-assembly-1.cif) and the output files (1tqn_h.cif). It appears that ChimeraX modifies the following mmCIF fields:
 
_atom_site.group_PDB -> all the atoms become HETATM in the output file
_atom_site.label_asym_id -> has been changed from A - protein, B - HEM, C - HOH to T - protein, U - HEM, HOH - water
 
I’m not sure whether this is intended behaviour or a bug. I used ChimeraX version 0.7 (2018-06-30).
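A minimal sketch of how such a difference could be spotted, assuming a naive whitespace tokenizer (no quoted values) and toy two-atom tables rather than the attached 1tqn files. The helper names `read_atom_site` and `changed_columns` are hypothetical and are not part of ChimeraX:

```python
def read_atom_site(text):
    """Return atom_site rows as dicts (naive parser: assumes no quoted values)."""
    columns, rows = [], []
    for line in text.splitlines():
        s = line.strip()
        if s.startswith("_atom_site."):
            columns.append(s.split(".", 1)[1])
        elif columns:
            # The table ends at a blank line, comment, new tag, loop or data block.
            if not s or s.startswith(("#", "_", "loop_", "data_")):
                break
            rows.append(dict(zip(columns, s.split())))
    return rows


def changed_columns(before, after):
    """Map column name -> (values before, values after) where the value sets differ."""
    changes = {}
    for name in before[0]:
        a = sorted({row[name] for row in before})
        b = sorted({row[name] for row in after})
        if a != b:
            changes[name] = (a, b)
    return changes


HEADER = """loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
"""
# Toy data only: one protein atom and one heme atom.
original = HEADER + "ATOM   1 N  ALA A\nHETATM 2 FE HEM B\n#\n"
rewritten = HEADER + "HETATM 1 N  ALA T\nHETATM 2 FE HEM U\n#\n"

changes = changed_columns(read_atom_site(original), read_atom_site(rewritten))
for name, (a, b) in changes.items():
    print(name, a, "->", b)
```

On real files, a proper mmCIF parser would be needed, since values may be quoted or span multiple lines.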
Attachments (2)
Change History (9)
by , 7 years ago
| Attachment: | 1tqn-assembly-1.cif added | 
|---|
comment:2 by , 7 years ago
| Status: | assigned → accepted | 
|---|
Thanks for looking at mmCIF output.  It is much easier to work on a feature that someone uses.
Both of those problems are bugs: the HETATM problem and the atom_site.label_asym_id assignment not starting with A.  As an aside, note that the atom_site.label_asym_id's are regenerated when the mmCIF file is written and are not guaranteed to correspond to what was in the original file, while the author fields, atom_site.auth_*, are preserved.  I'll look into what it would take to preserve the label fields too.
And since I have your ear, please consider giving unique label_seq_id's to waters.  If they were unique, then each line of the atom_site table would be uniquely keyed by its label fields (a database table best practice).  
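The keying point can be illustrated with a short sketch. The choice of four label fields as the key is an assumption made for illustration (alternate locations are ignored), the rows are toy data, and `duplicate_label_keys` is a hypothetical helper:

```python
from collections import Counter

# Assumed key for this sketch; alternate locations (label_alt_id) are ignored.
LABEL_KEY = ("label_asym_id", "label_comp_id", "label_seq_id", "label_atom_id")


def duplicate_label_keys(rows, key=LABEL_KEY):
    """Return the label-field tuples shared by more than one atom_site row."""
    counts = Counter(tuple(row[k] for k in key) for row in rows)
    return [k for k, n in counts.items() if n > 1]


# Two waters in one chain, both with label_seq_id "." (toy data).
waters = [
    {"label_asym_id": "C", "label_comp_id": "HOH", "label_seq_id": ".",
     "label_atom_id": "O", "auth_seq_id": "601"},
    {"label_asym_id": "C", "label_comp_id": "HOH", "label_seq_id": ".",
     "label_atom_id": "O", "auth_seq_id": "602"},
]
print(duplicate_label_keys(waters))

# Giving each water its own label_seq_id makes the label fields a unique key.
numbered = [dict(row, label_seq_id=str(i)) for i, row in enumerate(waters, start=1)]
print(duplicate_label_keys(numbered))
```

With "." in label_seq_id the two waters can only be told apart by their author fields, which is the workaround described later in this ticket.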
comment:3 by , 7 years ago
Hello,
Thank you for looking into this. Once you know more, please let me know. Regarding your other points: the regeneration of atom_site.label_asym_id is a problem; what is the motivation behind it? It would be excellent if the information from the source file were preserved. Next, the issue with the standard mmCIF disseminated by the wwPDB is that _atom_site.label_seq_id is not present for any HETATM records, so by default none of the archive structures downloaded from any wwPDB partner site will have it. At PDBe we map PDB structures to their relevant targets in UniProt using SIFTS, and we handle the issue above by introducing _atom_site.pdbe_label_seq_id. This is provided in all assembly files distributed by PDBe as well as in the polished archive. Examples can be found here (http://www.ebi.ac.uk/pdbe/static/entry/1tqn_updated.cif) and here (http://www.ebi.ac.uk/pdbe/static/entry/download/1tqn-assembly-1.cif.gz). Do you think it would make sense to use this identifier instead when available?
Should you have any further questions, we are more than happy to help.
Best,
Lukas
comment:4 by , 7 years ago
So the daily ChimeraX build can't open 1tqn-assembly-1.cif because it is confused by the absence of an entity_poly_seq table when entity_poly is present.  I've fixed that.  Note that the entity_poly_seq table is needed to reliably find gaps; entity_poly is not enough.
HETATM is now generated correctly for all non-standard residues.  ChimeraX doesn't look at the field on input, and I would like to drop it from the mmCIF output, but it seems to be liked.
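A rough sketch of that rule, assuming "standard" means the 20 common amino acids plus the usual RNA and DNA nucleotides. The actual residue list ChimeraX uses is not given in this ticket, so `STANDARD_RESIDUES` and `group_pdb` are hypothetical:

```python
# Assumed set of standard residues; the list ChimeraX really uses may differ.
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
    "A", "C", "G", "U", "DA", "DC", "DG", "DT",
}


def group_pdb(comp_id):
    """Return the _atom_site.group_PDB value for a residue name."""
    return "ATOM" if comp_id in STANDARD_RESIDUES else "HETATM"


for name in ("ALA", "HEM", "HOH"):
    print(name, group_pdb(name))
```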
Still looking into maintaining label_seq_id when reasonable.  It should be possible if there isn't any major structural editing.
As for pdbe_label_seq_id, that is a possibility.  I don't know the details, but I heard that the RCSB's IHM effort is coming across the same problem of uniquely identifying solvent residues, so I hope there will be one solution for everyone to use.  Right now, ChimeraX's solution is to look at the auth_seq_id.  My preference is for actually using label_seq_id :-)
And an FYI: ChimeraX already has some support for PDBe's updated mmCIF files. In the open command, you can say "open 3fx2 from pdbe_updated" and it will fetch the updated file.  In particular, we use the chem_comp_bond table.  It would be better if the chem_comp_bond table actually had bonds for every atom.  The CCD files are not complete templates, and I believe they are for a neutral pH rather than physiological pH.  So hydrogens are frequently missing; an example is the missing HO5' for DT in the chem_comp_bond table in 3rec.
follow-up: 3 comment:5 by , 7 years ago
Hello,
At PDBe we are going to update the 'assembly files' and 'updated mmcif files' very soon, so that additional value-added data will be present compared with the wwPDB archive version. For example, the 'entity_poly_seq' table is going to be added to the molecular assembly files, so hopefully no further problems are to be expected from that resource. If there are any other fields or tables that you would find beneficial to retrieve along with the structural data from the PDBe resource, please let us know so we can consider adding them. Having unique identification of ligands and solvent molecules in the archive files would certainly be beneficial; however, this is subject to discussion with all the wwPDB partners, and these discussions can take rather a long time. That's why we are providing our own field, 'pdbe_label_seq_id'.
As for the lack of a 'chem_comp_bond' table in the archive mmcif files, we provide this information for all the building blocks present in a given entry as part of the 'updated mmcif files', and this information is going to be added to the assembly files as well. So you might find it beneficial to retrieve structures from that resource until the 'wwPDB solution' is established. If you have other queries on the data side, let me know. In the meantime, I'll continue testing ChimeraX.
Best,
Lukas
comment:6 by , 7 years ago
Hi Lukas,
Will the updated mmCIF files also make it possible to distinguish the ligands and solvent by auth_seq_id?  That's the workaround we're currently using in ChimeraX.  Additions, like pdbe_label_seq_id, need to be coded.  And while adding pdbe_label_seq_id would not be hard, it would slow down the mmCIF reader.
The point I was making about the chem_comp_bond table in updated mmCIF files was that it doesn't always include all of the bonds.  My guess is that the tables are just copies of what is in the CCD entries.  So my hope is that, in the future, the protocol for including the chem_comp_bond table in PDBe's updated mmCIF entries will be updated to include all of the bonds.
Another request for the updated mmCIF files.  Please output the atom_site and atom_site_anisotrop tables in fixed width column format, with the data left-justified in each column.  ChimeraX can parse the mmCIF file much faster if it knows that the columns are fixed width (close to 4 times faster!).  I've added support for a chimerax_audit_syntax category for explicitly giving that information, and I've put in a request to the CIF folks for an official audit_syntax category.  See https://www.cgl.ucsf.edu/chimerax/docs/devel/mmcif.html for more details.
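To illustrate why fixed-width, left-justified columns help, here is a small sketch (not ChimeraX's actual reader). Once the column offsets are known from the first row, every later row can be cut at the same positions instead of being tokenized. The rows are toy data, and `column_starts` and `slice_row` are hypothetical names:

```python
import re


def column_starts(first_row):
    """Offsets at which each whitespace-separated token begins in the first row."""
    return [m.start() for m in re.finditer(r"\S+", first_row)]


def slice_row(row, starts):
    """Cut a row at precomputed offsets; valid only for fixed-width columns."""
    ends = starts[1:] + [None]
    return [row[s:e].rstrip() for s, e in zip(starts, ends)]


# Toy atom_site rows, left-justified in fixed-width columns.
rows = [
    "ATOM   1  N  ALA A 1  ",
    "ATOM   2  CA ALA A 1  ",
    "HETATM 10 FE HEM B .  ",
]
starts = column_starts(rows[0])
for row in rows:
    print(slice_row(row, starts))
```

Slicing gives the same fields as general tokenizing here, but without scanning each row for token boundaries, which is where a compiled reader can save time.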
It would also be nice if the updated mmCIF files had an audit_conform table in them.  I consider it a deficiency of the original CIF standard that the audit_conform table is not required since the .cif file suffix is insufficient for telling what kind of data is in the file (ChimeraX does not understand small molecule CIF files).  That leads to the fact that the mmCIF file suffix really should be .mmcif (ChimeraX accepts both .cif and .mmcif).  The current practice of using .cif for both is like using .gif for both GIF and PNG files.
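As a sketch of how a reader could use that table, the following looks for the standard `_audit_conform.dict_name` item. It only handles the simple key-value form (not a `loop_`), assumes unquoted values, and `audit_conform_dict_name` is a hypothetical helper:

```python
def audit_conform_dict_name(text):
    """Return the _audit_conform.dict_name value, or None if it is absent."""
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "_audit_conform.dict_name":
            return parts[1]
    return None


with_table = "data_example\n_audit_conform.dict_name mmcif_pdbx.dic\n"
without_table = "data_example\n_cell.length_a 10.0\n"

print(audit_conform_dict_name(with_table))
print(audit_conform_dict_name(without_table))
```

When the item is missing, the file contents alone do not say which dictionary applies, which is the ambiguity described above.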
-- Greg
follow-up: 5 comment:7 by , 7 years ago
| Resolution: | → fixed | 
|---|---|
| Status: | accepted → closed | 
mmCIF chain id is now preserved if possible in mmCIF output.