Opened 6 years ago

Closed 6 years ago

#1969 closed defect (fixed)

mmCIF write/restore transforms pseudobonds into bonds

Reported by: Tristan Croll Owned by: pett
Priority: moderate Milestone:
Component: Input/Output Version:
Keywords: Cc: pett
Blocked By: Blocking:
Notify when closed: Platform: all
Project: ChimeraX

Description

Sorry to do this so close to the 0.9 release, but I spent yesterday working on fitting a new cryo-EM map for a colleague, and it threw up a number of bugs in both PDB and mmCIF I/O (some of which I would swear had been fixed last year). I'll attach the example model - the provenance of this is that it started as the "biological assembly 1" PDB file for 5u8r from the RCSB website. That's provided as the two relevant copies of the asymmetric unit between MODEL...ENDMDL cards. I hacked the starting PDB file together with a text editor - just cut out the second model, changed the chain IDs and put it back, deleting the MODEL and ENDMDL cards. Of course, I forgot to duplicate the SSBOND and LINK records for the second copy... my bad. Anyway, here is a short list of what went wrong from that point:

  • adding the missing bonds (disulfides and Asn-NAG glycosidic bonds) by deleting the extraneous hydrogens and then using AtomicStructure.new_bond(atom1, atom2) works fine for the session, but does not update the SSBOND and LINK records when saving a PDB so next time the file is loaded those bonds are gone.
  • (possibly the cause of the above) the PDB metadata is silently read-only. If I do:
md = model.metadata
ssbonds = md['SSBOND']
ssbonds.extend(new_ss_bonds)
md['SSBOND']=ssbonds

... there is no error message, but the metadata doesn't actually change.

  • I can't replicate this now, but in one instance after saving and reloading (can't remember if mmCIF or PDB - I've tried both and can't find it) I ended up with a spurious bond between the two oxygens in an acid sidechain. Deleting the CONECT records seemed to get rid of that - but since I can't replicate it I'm not sure if that was actually the solution.
  • Saving as mmCIF *does* save the new bonds, but the pseudobonds between chain breaks are treated as bona fide bonds on reloading. But if I save *that* as a PDB file and reload, then everything seems fine - all the necessary SSBONDs and LINKs are there, and the chain breaks are back to being breaks.
  • Somewhere in the middle of that, I accidentally entered "save structure.cif" without specifying a model when I had three structures opened. The resulting .cif file caused a segmentation fault when I attempted to open it.

Attachments (2)

starting_coords.pdb (3.0 MB ) - added by Tristan Croll 6 years ago.
bug1969.py (367 bytes ) - added by Greg Couch 6 years ago.

Change History (16)

by Tristan Croll, 6 years ago

Attachment: starting_coords.pdb added

comment:1 by Greg Couch, 6 years ago

The mmCIF handling of multiple models is buggy (i.e, the equivalent of the PDB format's MODEL/ENDMDL) and might cause segmentation faults. It is a problem for NMR ensembles and expanded assemblies.

The only pseudobonds the mmCIF writer outputs as bonds are metal coordination bonds and hydrogen bonds, and they are labelled as such. Other types of pseudobonds are ignored. So it's unclear to me how the pseudobonds between chain breaks became bona fide bonds.

comment:2 by Tristan Croll, 6 years ago

Another issue I hadn't appreciated: saving as .cif, opening that and saving as .pdb loses *all* the metadata - CRYST1, SSBOND, LINK, etc.

comment:3 by pett, 6 years ago

Cc: Greg Couch added
Owner: changed from pett, gregc to pett
Summary: Various structure I/O issuesBetter metadata handling

comment:4 by pett, 6 years ago

The metadata isn't exactly "silently" read only. model.metdata returns a regular Python dictionary which you are free to change in any way you like of course. If after your modifications you had then tried:

model.metadata = md

you would have gotten an error, noting that metadata is a read-only attribute. Clearly, metadata handling needs work.

Currently, HELIX and SHEET records in the metadata are ignored and (poorly) generated from the residue is_helix and is_strand attributes. Improving that is ticket #1021. It would seem that SSBOND and LINK records should also be ignored and generated directly from the structure.

I am going to open two new tickets, one for generating correct SSBOND/LINK records and one for read/write metadata and assign myself those, then reassign this one to Greg for the "pseudobonds change to actual bonds" thing.

comment:5 by pett, 6 years ago

Also, not sure _what_ to do about the PDB→mmCIF→PDB metadata wipe. The two readers put differently formatted information into the metadata attribute, so they don't "see" the metadata coming from the other format. Fixing that will/would be a ton of work.

comment:6 by pett, 6 years ago

Cc: pett added; Greg Couch removed
Owner: changed from pett to Greg Couch
Priority: majormoderate
Summary: Better metadata handlingmmCIF write/restore transforms pseudobonds into bonds

comment:7 by Greg Couch, 6 years ago

Tristan, do you have an example where pseudobonds are turned into bonds?

in reply to:  9 comment:8 by tic20@…, 6 years ago

The PDB file I attached with the bug report. If I save it as a .cif and reopen, the chain fragments within each chain are connected.
 

 


by Greg Couch, 6 years ago

Attachment: bug1969.py added

comment:9 by Greg Couch, 6 years ago

Owner: changed from Greg Couch to pett

This a bug with the internal Chain representation. Chain.residues is not returning None for the missing residues in chain B. As you can see in the output below.

open bug1969.py
# of missing segments: 10 Counter({'A': 5, 'B': 5})
Chain: A 885 residues
missing: [37, 38, 39, 154, 155, 156, 157, 158, 159, 160, 508, 509, 510, 511, 512, 513, 514, 515, 642, 643, 644, 645, 646, 647, 648, 649, 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675, 676, 677, 678, 679, 680, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 877, 878, 879, 880, 881, 882, 883, 884]
Chain: B 801 residues
missing: []
executed bug1969.py

comment:10 by Greg Couch, 6 years ago

Milestone: 0.9

Should be considered for fixing in the 0.9 release.

comment:11 by Greg Couch, 6 years ago

Milestone: 0.9

This won't be fixed for 0.9. The original pdb file is bad. It is missing the SEQRES records for chain B. Eric is unwilling :-) to put the missing structure guesses in to the Chain object because the size of the gaps is unknown. The proposal is for a method to be added to the Chain class that would make a new object Chain with best guesses incorporated. Then the mmcif writing code would look at the chain's from_seqres attribute and if it is false, use the guess.

in reply to:  14 comment:12 by tic20@…, 6 years ago

OK, that makes sense. The genesis of this file is that it started as a “biological assembly” PDB file downloaded from the RCSB. Those are provided with the extra copies of the asymmetric unit defined as separate models with the same chain IDs. The quick-and-dirty approach to merging into something more useful is to just use a text editor to rename the chains for the second copy, and remove the MODEL/ENDMDL cards. I just hadn’t appreciated how much ChimeraX makes use of the surrounding metadata.

The elegant way to handle this would of course be to provide methods to merge models (including renaming chains where necessary) within ChimeraX itself - but that’s a separate issue and probably not too urgent in the grand scheme of things.


comment:13 by Greg Couch, 6 years ago

Put in a workaround. The mmCIF writer now omits the sequence if it wasn't given in the original data. And the mmCIF reader now decides whether to guess the connectivity on a per-chain basis. Not sure that generating bad output and accepting bad input is the best idea, but I expect ChimeraX to see lots of bad mmCIF files in the future.

comment:14 by Greg Couch, 6 years ago

Resolution: fixed
Status: assignedclosed

Closing for now since there is a workaround. Long term we may want a way to sanctify a chain's sequence.

Note: See TracTickets for help on using tickets.