Opened 5 years ago
Closed 5 years ago
#3168 closed enhancement (fixed)
Support CHARMM/X-PLOR format PDB files
| Reported by: | Tristan Croll | Owned by: | pett |
|---|---|---|---|
| Priority: | moderate | Milestone: | |
| Component: | Input/Output | Version: | |
| Keywords: | | Cc: | |
| Blocked By: | | Blocking: | |
| Notify when closed: | | Platform: | all |
| Project: | ChimeraX | | |
Description
Like it or not, these are still a favourite with many in the molecular dynamics world, probably because this was the oldest and least hacky way to support really large models within the PDB format. The format essentially ignores the chain ID, instead using the 4-character SEGID field to separate chains. As a consequence, many such files will have multiple residues with identical chain ID and residue number, but different SEGIDs. The other adaptations I know of are in the atom numbering: beyond 99999, VMD/NAMD first switch to hexadecimal, and once the count passes what hexadecimal can represent in that field, they just fill the atom number field with * from that point on.
One approach to supporting it that I think would be "easy": simply replace chain ID with SEGID throughout the model. To trigger it you could ask the user to explicitly specify "format xplor" when opening, or perhaps watch for the case where SEGIDs are used and a duplicate chain ID/resnum is encountered (although I imagine that would slow things down a lot). I wouldn't worry about *saving* to the format - the result would be perfectly compatible with mmCIF.
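For illustration, the two ideas above (use SEGID as the chain ID, and watch for duplicate chain ID/residue number pairs under different SEGIDs) could be sketched roughly as follows. This is a minimal sketch and not ChimeraX's reader: it assumes the classic fixed-width layout with the chain ID in column 22, the residue number in columns 23-26 and the SEGID in columns 73-76, and the function names `atom_fields`, `segid_collisions` and `effective_chain` are hypothetical.

```python
from collections import defaultdict


def atom_fields(line):
    """Slice chain ID, residue number field and SEGID from a fixed-width record.

    Assumed columns (1-based): chain ID 22, residue number 23-26, SEGID 73-76.
    """
    padded = line.rstrip("\n").ljust(80)  # short lines get blank fields
    return {
        "chain": padded[21],
        "resseq": padded[22:26].strip(),
        "segid": padded[72:76].strip(),
    }


def segid_collisions(lines):
    """Map (chain, resseq) to its SEGIDs wherever more than one SEGID is seen."""
    seen = defaultdict(set)
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            fields = atom_fields(line)
            seen[(fields["chain"], fields["resseq"])].add(fields["segid"])
    return {key: segs for key, segs in seen.items() if len(segs) > 1}


def effective_chain(line, use_segid):
    """Return the SEGID as the chain ID when requested and present, else the chain ID."""
    fields = atom_fields(line)
    if use_segid and fields["segid"]:
        return fields["segid"]
    return fields["chain"]
```

A non-empty result from `segid_collisions` would be the signal that the file needs SEGID-based chains. The scan keeps one set per residue, so it needs a full pass over the file before any decision can be made, which is the cost alluded to above.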
Entirely understand if you don't want to support it - but if there's an easy way, it might be worthwhile.
Will attach an example (protein chains are uniquely named/numbered relative to each other, but glycans share the same chain IDs as their proteins and are numbered from 1). This is different from the one I mentioned yesterday - turns out the X-PLOR format had nothing to do with its issues on loading into ChimeraX, and everything to do with the fact that beyond chain A every single atom came with its own TER... honestly one of the stranger things I've come across:
ATOM 20374 H5_5 F2A A1179 26.834 -12.437 -82.219 1.00 0.00 H
ATOM 20375 H5_6 F2A A1179 26.003 -8.493 -86.927 1.00 0.00 H
ATOM 20376 H5_7 F2A A1179 28.310 -25.830 -77.706 1.00 0.00 H
ATOM 20377 H5_8 F2A A1179 34.986 -24.667 -81.395 1.00 0.00 H
ATOM 20378 H5_9 F2A A1179 40.367 -26.088 -78.035 1.00 0.00 H
TER 20379 ACE B 26
ATOM 20380 C ACE B 26 -36.536 40.158 14.798 1.00 0.00 C
TER 20381 ACE B 26
ATOM 20382 O ACE B 26 -36.454 38.949 14.564 1.00 0.00 O
TER 20383 ACE B 26
ATOM 20384 CH3 ACE B 26 -36.819 40.560 16.168 1.00 0.00 C
TER 20385 ACE B 26
ATOM 20386 HH31 ACE B 26 -37.494 39.806 16.573 1.00 0.00 H
TER 20387 ACE B 26
ATOM 20388 HH32 ACE B 26 -35.891 40.639 16.734 1.00 0.00 H
TER 20389 ACE B 26
ATOM 20390 HH33 ACE B 26 -37.349 41.511 16.235 1.00 0.00 H
TER 20391 ALA B 27
ATOM 20392 N ALA B 27 -36.215 41.146 13.953 1.00 0.00 N
Attachments (1)
Change History (5)
follow-up: 1 comment:1 by , 5 years ago
comment:2 by , 5 years ago
*Residue* numbering gets a little (much) trickier. Looking at the DE Shaw model after removing all the TER lines, they actually don't use the SEGID but do use the numbering hack for both atoms and residue numbers. In particular, all the waters are chain W. At residue 9999 it looks like this:
ATOM 276fa HWS SOL W9999 65.903 -14.617 -69.589 1.00 0.00 H
ATOM 276fc OWS SOL W2710 -76.113 -2.234 -54.124 1.00 0.00 O
(2710 in hex is 10000 in decimal)... and even worse, when it gets to the end of what four hex digits can handle:
ATOM 7855c HWS SOL Wffff -37.637 24.129 89.330 1.00 0.00 H
ATOM 7855e OWS SOL W**** -41.279 -0.995 -74.321 1.00 0.00 O
Wouldn't blame you at all for throwing your hands up at that. The only way to proceed beyond that point would be to parse and use their separate topology file, which is a *much* bigger job.
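To make the ambiguity concrete, here is a rough sketch of what decoding such a field would involve, assuming the scheme is exactly as described above (decimal up to the field limit, then hex, then asterisks). The name `decode_field` and its `previous`/`decimal_max` parameters are hypothetical. Because "2710" is valid in both bases, the only way to tell them apart is to track the previous value in file order.

```python
def decode_field(text, previous, decimal_max):
    """Decode a number written as decimal, then hex past decimal_max, then asterisks.

    previous is the last decoded value in file order (None at the start of a
    run); returns an int, or None when the field is blank or all asterisks.
    """
    text = text.strip()
    if not text or set(text) == {"*"}:
        return None  # nothing recoverable from the field itself
    # Assumption: once the decimal limit has been reached, every later field is hex.
    in_hex = previous is not None and previous >= decimal_max
    if not in_hex:
        try:
            return int(text)
        except ValueError:
            pass  # contains a-f, so it can only be hex
    return int(text, 16)


# Residue numbers from the water chain shown above.
print(decode_field("9999", None, 9999))  # decimal
print(decode_field("2710", 9999, 9999))  # hex, because the previous value hit the limit
print(decode_field("****", 65535, 9999))  # unrecoverable
```

The caller would have to reset `previous` at each chain boundary, and nothing here recovers the numbers hidden behind the asterisks; that still needs the topology file.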
comment:3 by , 5 years ago
Status: | assigned → accepted |
---|
I am willing to offer a flag to PDB input where the chain ID is gleaned from the SEGID columns rather than the normal columns, but that latter monstrosity where it drops to what looks like decimal but is actually hex, and then eventually goes to asterisks -- um, no. BTW, VMD/NAMD (and ChimeraX) use H36, not hex. H36 supports >87 million atoms and >2.4 million residues.
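For reference, the H36 (hybrid-36) limits quoted above can be checked with a short decoder. This is a sketch of the scheme as I understand it from the published hybrid-36 description, not the ChimeraX or VMD source; `hy36_decode` is an illustrative name. Plain decimal is used while the number fits the field, then base-36 with upper-case letters, then base-36 with lower-case letters.

```python
import string

UPPER_DIGITS = string.digits + string.ascii_uppercase
LOWER_DIGITS = string.digits + string.ascii_lowercase


def _decode_pure(digits, text):
    """Plain base-36 decode; raises ValueError on a character outside digits."""
    value = 0
    for ch in text:
        value = value * 36 + digits.index(ch)
    return value


def hy36_decode(width, text):
    """Decode a hybrid-36 field of the given width (5 for atoms, 4 for residues)."""
    if len(text) != width:
        raise ValueError("field has the wrong width")
    first = text[0]
    if first in " -" or first.isdigit():
        # Ordinary decimal range; an all-blank field counts as 0.
        return int(text) if text.strip() else 0
    if first in string.ascii_uppercase:
        # Upper-case range starts right after the largest decimal value.
        return _decode_pure(UPPER_DIGITS, text) - 10 * 36 ** (width - 1) + 10 ** width
    if first in string.ascii_lowercase:
        # Lower-case range continues where the upper-case range ends.
        return _decode_pure(LOWER_DIGITS, text) + 16 * 36 ** (width - 1) + 10 ** width
    raise ValueError("invalid hybrid-36 literal")


print(hy36_decode(5, "A0000"))  # first atom serial past 99999
print(hy36_decode(5, "zzzzz"))  # largest 5-character atom serial
print(hy36_decode(4, "zzzz"))   # largest 4-character residue number
```

Under this scheme a field such as "276fa" (leading digit followed by letters) is not a valid literal, which is one way to tell H36 files apart from the hex variant shown in comment 2.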
comment:4 by , 5 years ago
Resolution: | → fixed |
---|---|
Status: | accepted → closed |
"Fixed" in that segment IDs can be used instead of chain IDs if "segidChains true" is given in the open command.
6vsb_1_1_1.pdb