Opened 8 years ago
Closed 8 years ago
#1014 closed defect (fixed)
mmCIF not assigning ss_id correctly
| Reported by: | Elaine Meng | Owned by: | Eric Pettersen |
|---|---|---|---|
| Priority: | moderate | Milestone: | |
| Component: | Input/Output | Version: | |
| Keywords: | | Cc: | Greg Couch |
| Blocked By: | | Blocking: | |
| Notify when closed: | | Platform: | all |
| Project: | ChimeraX | | |
Description
If 2gbp is read from mmCIF (but not PDB), ss_id = 1 refers to the second strand and the first helix, ss_id = 2 refers to the first strand and second helix. The correct behavior, as from PDB input, is for ss_id = 1 to refer to the first strand and first helix, 2 for the second strand and second helix, etc. Example commands:
open 2gbp
rainbow
sel ::ss_id=1
Change History (11)
comment:1 by , 8 years ago
| Cc: | added |
|---|---|
| Owner: | changed from to |
| Summary: | residue ss_id assigned incorrectly from mmCIF → residue ss_id assigned incorrectly from PDB |
comment:2 by , 8 years ago
It’s not a bug in the PDB reader if the PDB reader gets it right and the mmCIF reader gets it wrong. The ss_id is an index from N-term to C-term (sequence order). The order in the file is something different. It may be the spatial order of the strand in the sheet since the second strand is one edge of the sheet.
comment:3 by , 8 years ago
It's a bug in the PDB reader, when it changes the order of the strands. The PDB file clearly identifies the order. On 1/23/2018 3:07 PM, Elaine Meng wrote:
comment:4 by , 8 years ago
ss_id is not reflecting the structure of the input file — it is reflecting the structure of the data. ss_id 1 is the first secondary structure element of that type in N->C order, regardless of how the input file has decided to order its listing. ss_id is also set by DSSP, which has no “input file” per se.
comment:5 by , 8 years ago
| Summary: | residue ss_id assigned incorrectly from PDB → mmCIF not assigning ss_id correctly |
|---|
Instead of having each reader figure out ss_id assignment itself, I could implement an "assign_ss_ids" call that both readers could use...
comment:6 by , 8 years ago
Belay that. I can't write an "assign_ss_ids" call because the main purpose of ss_ids is to distinguish between sequence-adjacent secondary structure elements of the same type, and the input (or calculation, in the case of DSSP) is needed for that.
--Eric
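To illustrate the problem: a helper that sees only per-residue secondary structure types has to infer element boundaries from changes in type, so two sequence-adjacent strands collapse into one. A minimal sketch, where `naive_ss_ids` is a hypothetical name and the residue data is made up (this is not ChimeraX code):

```python
from itertools import groupby

def naive_ss_ids(ss_types):
    """Number elements by runs of identical type (hypothetical helper).

    ss_types: per-residue types in N->C order: 'H', 'S', or 'C' (coil).
    Returns one ss_id per residue: 0 for coil, else a per-type counter.
    """
    counters = {"H": 0, "S": 0}
    ids = []
    for ss_type, run in groupby(ss_types):
        n = len(list(run))
        if ss_type == "C":
            ids.extend([0] * n)
        else:
            counters[ss_type] += 1
            ids.extend([counters[ss_type]] * n)
    return ids

# Two strands of three residues each, directly adjacent in sequence.
# The run-based pass cannot see the boundary and reports a single strand.
adjacent = ["S"] * 6
print(naive_ss_ids(adjacent))   # [1, 1, 1, 1, 1, 1]
```

Recovering the boundary needs the element listing from the input file (or from DSSP), which is why the numbering cannot be derived from residue types alone.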
comment:7 by , 8 years ago
That's a shame assign_ss_ids won't work. The ss_id assignment rules are needed for the MMTF reader too.
The ss_ids from the mmCIF reader also distinguish between sequence-adjacent secondary structure elements of the same type. The PDB reader could be fixed to follow the numbering in the PDB file too. In the 2gbp case, there is a second strand 1 in sheet II. Perhaps ss_id=1 should get all three secondary structure elements: 1 helix and 2 strands.
If the DSSP code won't number strands the way PDB numbers strands, then isn't the bug in the DSSP code? Or the code that uses the DSSP output?
comment:8 by , 8 years ago
ss_id has nothing to do with the input file. You seem to be hung up on that. ss_id starts at 1 for each type and increases separately for each type from N->C terminus. PDB and DSSP assign ss_ids in exactly the same way. We can discuss this tomorrow.
--Eric
comment:10 by , 8 years ago
I'm getting hung up on the ss_id that is used in atom specs not being the published secondary structure identifier. Once DSSP is run, who cares.
comment:11 by , 8 years ago
| Resolution: | → fixed |
|---|---|
| Status: | assigned → closed |
ss_ids are now normalized before first use: 0 for coil, helix/strand elements count separately from 1 in N->C order, per chain.
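For readers who want the rule spelled out, here is a rough sketch of that kind of normalization. It is only an illustration under assumptions, not the actual ChimeraX implementation: `normalize_ss_ids` and the tuple layout are hypothetical, and the reader-assigned ids are assumed to differ between sequence-adjacent elements of the same type.

```python
def normalize_ss_ids(residues):
    """Renumber reader-assigned ss_ids (hypothetical helper).

    residues: list of (chain_id, ss_type, raw_id) in N->C order per chain,
              with ss_type 'H', 'S', or 'C' (coil) and raw_id as assigned
              by the file reader or DSSP.
    Returns normalized ids: 0 for coil; helices and strands are counted
    separately from 1 in N->C order, restarting for each chain.
    """
    counters = {}   # (chain_id, ss_type) -> last id handed out
    prev = None     # (chain_id, ss_type, raw_id) of the previous residue
    out = []
    for chain_id, ss_type, raw_id in residues:
        if ss_type == "C":
            out.append(0)
        else:
            # A change in chain, type, or raw id marks a new element.
            if prev != (chain_id, ss_type, raw_id):
                key = (chain_id, ss_type)
                counters[key] = counters.get(key, 0) + 1
            out.append(counters[(chain_id, ss_type)])
        prev = (chain_id, ss_type, raw_id)
    return out

# Toy data shaped like the 2gbp case: the first strand in sequence
# carries raw id 2 because the file lists it second.
toy = ([("A", "S", 2)] * 2 + [("A", "C", 0)] + [("A", "H", 1)] * 2
       + [("A", "C", 0)] + [("A", "S", 1)] * 2)
print(normalize_ss_ids(toy))   # [1, 1, 0, 1, 1, 0, 2, 2]
```

Because element boundaries come from changes in the raw id, two adjacent strands with raw ids 5 and 6 still end up as two separate elements.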
Not sure what to do here. The strand selected is the first strand in the mmCIF file. In the PDB file, /A:33-39 is also the first strand listed, so it's unclear why the second strand, /A:3-9, is selected. I believe this is a bug in the PDB reader.
FYI, there is a deficiency in how strands (and helices) are referred to in the ChimeraX atom specs and the internal data structures: there is only a single number to identify the secondary structure element. In mmCIF files, the helix identifiers are strings, in this case HELX_P1, ..., HELX_P10, and the strand identifiers are SHEET.STRAND 2-tuples, i.e., I.1, ..., I.6, II.1, ..., II.6. In PDB format files, the helix identifiers are H01, ..., H10, and the strand identifiers are the same as in mmCIF. The numbers for the helices in PDB are the serial numbers of the HELIX records.
If we are going to keep using a single number to identify a secondary structure element, then the helix and strand numbers should be disjoint, so that one could easily refer to a single secondary structure element (or a range). But it would be better to be a little fancier; for example, knowing which sheet a strand is in would be very useful.
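As a rough sketch of what that could look like, the record below keeps the published identifier and the sheet alongside a single number that is shared by helices and strands. Everything here is hypothetical: `SSElement`, `disjoint_numbers`, the sheet/strand labels, and the helix residue range are placeholders for illustration, and only a single chain is handled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SSElement:
    kind: str                        # 'helix' or 'strand'
    start: int                       # first residue number
    end: int                         # last residue number
    file_id: str                     # identifier as published in the file
    sheet_id: Optional[str] = None   # e.g. 'I' or 'II'; None for helices

def disjoint_numbers(elements):
    """Give every element one number, counted from 1 in N->C order,
    so that helix and strand numbers never collide."""
    ordered = sorted(elements, key=lambda e: e.start)
    return {e: i for i, e in enumerate(ordered, start=1)}

# Strand ranges are the two mentioned above; the labels and the helix
# range are placeholders, not values read from 2gbp.
elements = [
    SSElement("strand", 33, 39, "1", sheet_id="I"),
    SSElement("strand", 3, 9, "2", sheet_id="I"),
    SSElement("helix", 16, 30, "HELX_P1"),
]
for element, number in sorted(disjoint_numbers(elements).items(),
                              key=lambda item: item[1]):
    print(number, element.kind, element.file_id, element.sheet_id)
```

With a record like this, an atom spec could in principle select by the shared number, by the published identifier, or by sheet.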