Opened 5 years ago

Closed 5 years ago

#3168 closed enhancement (fixed)

Support CHARMM/X-PLOR format PDB files

Reported by: Tristan Croll
Owned by: pett
Priority: moderate
Milestone:
Component: Input/Output
Version:
Keywords:
Cc:
Blocked By:
Blocking:
Notify when closed:
Platform: all
Project: ChimeraX

Description

Like it or not, these are still a favourite with many in the molecular dynamics world - probably because it was the oldest and least-hacky way to support really large models within the PDB format. It essentially ignores the chain ID, instead using the SEGID field (4 characters) to separate chains. As a consequence, many such files will have multiple residues with identical chain ID and number, but different SEGID. Other adaptations that I know of are in the atom numbering: beyond 99999, VMD/NAMD first switch to hexadecimal and, once the numbering gets beyond the limit of that, just fill the atom number field with * from that point on.

One approach to supporting it that I think would be "easy": simply replace chain ID with SEGID throughout the model. To trigger it you could ask the user to explicitly specify "format xplor" when opening, or perhaps watch for the case where SEGIDs are used and a duplicate chain ID/resnum is encountered (although I imagine that would slow things down a lot). I wouldn't worry about *saving* to the format - the result would be perfectly compatible with mmCIF.

Entirely understand if you don't want to support it - but if there's an easy way, it might be worthwhile.
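
A minimal sketch of the SEGID-as-chain idea described above, assuming the standard fixed-column PDB layout (chain ID in column 22, residue number in columns 23-26, legacy SEGID in columns 73-76). The function names, the `make_atom` demo helper and the segment names PROA/CARA are made up for illustration; this is not ChimeraX's actual implementation.

```python
from collections import defaultdict


def chain_label(line, use_segid=False):
    """Return the chain label for one ATOM/HETATM record.

    Falls back to the ordinary chain ID (column 22) when SEGID use is
    switched off or the SEGID columns (73-76) are blank.
    """
    segid = line[72:76].strip()
    if use_segid and segid:
        return segid
    return line[21]


def has_segid_conflict(lines):
    """True if one (chain ID, residue number) pair occurs under several SEGIDs."""
    segids_seen = defaultdict(set)
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            key = (line[21], line[22:27])  # chain ID, resSeq + insertion code
            segids_seen[key].add(line[72:76].strip())
    return any(len(segids) > 1 for segids in segids_seen.values())


def make_atom(serial, name, resname, chain, resseq, segid):
    """Hypothetical helper: build a fixed-column ATOM record for the demo."""
    return (f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}      {segid:<4}")


# Two residues sharing chain A / number 1 but living in different segments.
protein = make_atom(1, " N", "ALA", "A", 1, "PROA")
glycan = make_atom(2, " C1", "NAG", "A", 1, "CARA")

print(chain_label(protein))                  # ordinary chain ID
print(chain_label(glycan, use_segid=True))   # SEGID used as the chain label
print(has_segid_conflict([protein, glycan]))
```

The conflict check needs a pass over every atom record, which is the slowdown worried about above; an explicit user flag avoids that cost.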

Will attach an example (protein chains are uniquely named/numbered relative to each other, but glycans share the same chain IDs as their proteins and are numbered from 1). This is different from the one I mentioned yesterday - turns out the X-PLOR format had nothing to do with its issues on loading into ChimeraX, and everything to do with the fact that beyond chain A every single atom came with its own TER... honestly one of the stranger things I've come across:

ATOM  20374 H5_5 F2A A1179      26.834 -12.437 -82.219  1.00  0.00           H  
ATOM  20375 H5_6 F2A A1179      26.003  -8.493 -86.927  1.00  0.00           H  
ATOM  20376 H5_7 F2A A1179      28.310 -25.830 -77.706  1.00  0.00           H  
ATOM  20377 H5_8 F2A A1179      34.986 -24.667 -81.395  1.00  0.00           H  
ATOM  20378 H5_9 F2A A1179      40.367 -26.088 -78.035  1.00  0.00           H  
TER   20379      ACE B  26
ATOM  20380  C   ACE B  26     -36.536  40.158  14.798  1.00  0.00           C  
TER   20381      ACE B  26
ATOM  20382  O   ACE B  26     -36.454  38.949  14.564  1.00  0.00           O  
TER   20383      ACE B  26
ATOM  20384  CH3 ACE B  26     -36.819  40.560  16.168  1.00  0.00           C  
TER   20385      ACE B  26
ATOM  20386 HH31 ACE B  26     -37.494  39.806  16.573  1.00  0.00           H  
TER   20387      ACE B  26
ATOM  20388 HH32 ACE B  26     -35.891  40.639  16.734  1.00  0.00           H  
TER   20389      ACE B  26
ATOM  20390 HH33 ACE B  26     -37.349  41.511  16.235  1.00  0.00           H  
TER   20391      ALA B  27
ATOM  20392  N   ALA B  27     -36.215  41.146  13.953  1.00  0.00           N  


Attachments (1)

6vsb_1_1_1.pdb (5.8 MB ) - added by Tristan Croll 5 years ago.
Added by email2trac

Change History (5)

in reply to:  1 ; comment:1 by Tristan Croll, 5 years ago

Here's the example PDB.

On 2020-05-08 18:21, ChimeraX wrote:

6vsb_1_1_1.pdb

by Tristan Croll, 5 years ago

Attachment: 6vsb_1_1_1.pdb added

Added by email2trac

comment:2 by Tristan Croll, 5 years ago

*Residue* numbering gets a little (much) trickier. Looking at the DE Shaw model after removing all the TER lines, they actually don't use the SEGID but do use the numbering hack for both atoms and residue numbers. In particular, all the waters are chain W. At residue 9999 it looks like this:

ATOM  276fa  HWS SOL W9999      65.903 -14.617 -69.589  1.00  0.00           H  
ATOM  276fc  OWS SOL W2710     -76.113  -2.234 -54.124  1.00  0.00           O  

(2710 in hex is 10000 in decimal)... and even worse, when it gets to the end of what hex can handle:

ATOM  7855c  HWS SOL Wffff     -37.637  24.129  89.330  1.00  0.00           H  
ATOM  7855e  OWS SOL W****     -41.279  -0.995 -74.321  1.00  0.00           O  

Wouldn't blame you at all for throwing your hands up at that. The only way to proceed beyond that point would be to parse and use their separate topology file, which is a *much* bigger job.
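
To make the ambiguity concrete, here is a rough sketch of decoding the residue-number column as it appears in the lines above. It assumes the fields arrive in file order for a single chain, that numbering is decimal up to 9999 and hexadecimal from then on, and that a field of asterisks is simply unrecoverable. The function name is hypothetical and the switch-over rule is a guess from the example, not a documented convention.

```python
def decode_resnums(fields):
    """Decode successive 4-character residue-number fields from one chain.

    Decimal is assumed until 9999 has been seen; the first field after
    that which is not "9999" is taken as the start of hexadecimal
    numbering. Fields made only of "*" decode to None.
    """
    decoded = []
    mode = "decimal"
    at_limit = False  # True while the latest decimal value is 9999
    for field in fields:
        text = field.strip()
        if text and set(text) == {"*"}:
            decoded.append(None)  # overflowed field, number is lost
            continue
        if mode == "decimal" and at_limit and text != "9999":
            mode = "hex"  # e.g. "2710" right after "9999"
        if mode == "decimal":
            value = int(text)
            at_limit = value == 9999
        else:
            value = int(text, 16)
        decoded.append(value)
    return decoded


print(decode_resnums(["9999", "2710", "ffff", "****"]))
```

Note that "2710" is only read as 10000 because of what came before it; taken in isolation the field is indistinguishable from decimal 2710, which is why a stateless per-line parser cannot handle this scheme.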

comment:3 by pett, 5 years ago

Status: assigned → accepted

I am willing to offer a flag to PDB input where the chain ID is gleaned from the SEGID columns rather than the normal columns, but that latter monstrosity where it drops to what looks like decimal but is actually hex, and then eventually goes to asterisks -- um, no. BTW, VMD/NAMD (and ChimeraX) use H36, not hex. H36 supports >87 million atoms and >2.4 million residues.
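
For reference, a small sketch of hybrid-36 (H36) decoding that reproduces the capacity figures quoted above. It assumes the commonly described scheme: plain decimal first, then full-width base-36 fields starting with an upper-case letter, then the same again starting with a lower-case letter. The function name is hypothetical and input validation is omitted.

```python
def hy36_decode(width, field):
    """Decode a hybrid-36 field of the given column width to an integer."""
    text = field.strip()
    first = text[0]
    if first.isdigit() or first == "-":
        return int(text)  # ordinary decimal range
    if first.isupper():
        # Upper-case block starts right after the largest decimal value.
        return int(text, 36) - 10 * 36 ** (width - 1) + 10 ** width
    # Lower-case block follows the 26 * 36**(width - 1) upper-case values.
    return int(text, 36) + 16 * 36 ** (width - 1) + 10 ** width


print(hy36_decode(5, "A0000"))  # first atom serial past 99999
print(hy36_decode(5, "zzzzz"))  # largest 5-column atom serial
print(hy36_decode(4, "zzzz"))   # largest 4-column residue number
```

Under these assumptions the largest values are 87,440,031 for the 5-column atom serial and 2,436,111 for the 4-column residue number, consistent with the ">87 million" and ">2.4 million" figures.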

comment:4 by pett, 5 years ago

Resolution: fixed
Status: accepted → closed

"Fixed" in that segment IDs can be used instead of chain IDs if "segidChains true" is given in the open command.
