Opened 8 years ago

Closed 8 years ago

#1014 closed defect (fixed)

mmCIF not assigning ss_id correctly

Reported by: Elaine Meng Owned by: Eric Pettersen
Priority: moderate Milestone:
Component: Input/Output Version:
Keywords: Cc: Greg Couch
Blocked By: Blocking:
Notify when closed: Platform: all
Project: ChimeraX

Description

If 2gbp is read from mmCIF (but not PDB), ss_id = 1 refers to the second strand and the first helix, ss_id = 2 refers to the first strand and second helix. The correct behavior, as from PDB input, is for ss_id = 1 to refer to the first strand and first helix, 2 for the second strand and second helix, etc. Example commands:

open 2gbp
rainbow
sel ::ss_id=1

Change History (11)

comment:1 by Greg Couch, 8 years ago

Cc: Greg Couch added
Owner: changed from Greg Couch to Eric Pettersen
Summary: changed from "residue ss_id assigned incorrectly from mmCIF" to "residue ss_id assigned incorrectly from PDB"

Not sure what to do here. The strand selected is the first strand in the mmCIF file. In the PDB file, /A:33-39 is also the first strand listed, so it's unclear why the second strand, /A:3-9 is selected. I believe this is a bug in the PDB reader.

FYI, there is a deficiency in how the ChimeraX atom specs and the internal data structures refer to strands (and helices): there is only a single number to identify the secondary structure element. In mmCIF files, the helix identifiers are strings, in this case HELX_P1, ..., HELX_P10, and the strands are SHEET.STRAND 2-tuples, i.e., I.1, ..., I.6, II.1, ..., II.6. In PDB format files, the helix identifiers are H01, ..., H10, and the strand identifiers are the same as mmCIF. The numbers for the helices in PDB are the serial numbers of the HELIX records.

If we are going to keep using a single number to identify a secondary structure element, then the helix and strand numbers should be disjoint, so that one could easily refer to a single secondary structure element (or a range). But it would be better to be a little fancier; for example, knowing which sheet a strand is in would be very useful.
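As a rough illustration of the "fancier" identifier suggested above, here is a minimal Python sketch. The `SSElementId` class and the `strand_id`/`helix_id` helpers are hypothetical names invented for this example, not part of ChimeraX, and the "SHEET.STRAND" string is just the notation used in this comment, not a file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SSElementId:
    """Hypothetical identifier for one secondary structure element."""
    kind: str                    # "helix" or "strand"
    label: str                   # published identifier, e.g. "HELX_P1" or "1"
    sheet: Optional[str] = None  # sheet identifier, strands only


def strand_id(sheet_dot_strand: str) -> SSElementId:
    """Build a strand identifier from the 'SHEET.STRAND' notation, e.g. 'II.1'."""
    sheet, _, strand = sheet_dot_strand.partition(".")
    return SSElementId("strand", strand, sheet)


def helix_id(label: str) -> SSElementId:
    """Build a helix identifier from its published label, e.g. 'HELX_P1'."""
    return SSElementId("helix", label)
```

With an identifier like this, strand 1 of sheet I and strand 1 of sheet II compare as different elements, and a helix can never collide with a strand.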

comment:2 by Elaine Meng, 8 years ago

It’s not a bug in the PDB reader if the PDB reader gets it right and the mmCIF reader gets it wrong.  The ss_id is an index from N-term to C-term (sequence order).  The order in the file is something different. It may be the spatial order of the strand in the sheet since the second strand is one edge of the sheet.

comment:3 by Greg Couch, 8 years ago

It's a bug in the PDB reader, when it changes the order of the strands.  
The PDB file clearly identifies the order.


On 1/23/2018 3:07 PM, Elaine Meng wrote:

comment:4 by Eric Pettersen, 8 years ago

ss_id is not reflecting the structure of the input file — it is reflecting the structure of the data. ss_id 1 is the first secondary structure element of that type in N->C order, regardless of how the input file has decided to order its listing. ss_id is also set by DSSP, which has no “input file” per se.

comment:5 by Eric Pettersen, 8 years ago

Summary: changed from "residue ss_id assigned incorrectly from PDB" to "mmCIF not assigning ss_id correctly"

Instead of having each reader figure out ss_id assignment itself, I could implement an "assign_ss_ids" call that both readers could use...

comment:6 by Eric Pettersen, 8 years ago

Belay that. I can't write an "assign_ss_ids" call because the main purpose of ss_ids is to distinguish between sequence-adjacent secondary structure elements of the same type, and the input (or calculation, in the case of DSSP) is needed for that.
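To make the point above concrete, here is a small Python sketch of why the per-residue type alone is not enough. The function names and the toy data are made up for illustration; this is not ChimeraX code.

```python
from itertools import groupby


def runs_by_type(ss_types):
    """Split residues into elements using only the per-residue type."""
    return [(t, len(list(group))) for t, group in groupby(ss_types)]


def runs_by_type_and_id(ss_types, input_ids):
    """Split residues into elements using the type plus the identifier
    supplied by the input file (or by the DSSP calculation)."""
    pairs = zip(ss_types, input_ids)
    return [(t, len(list(group))) for (t, _), group in groupby(pairs)]


# Two helices that are adjacent in sequence, with no coil residue between them.
types = ["H"] * 6
ids = [1, 1, 1, 2, 2, 2]

merged = runs_by_type(types)             # the boundary between helices is lost
split = runs_by_type_and_id(types, ids)  # the boundary is preserved
```

Here `merged` contains a single six-residue helix, while `split` contains two three-residue helices, which is the distinction ss_ids exist to record.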

--Eric

comment:7 by Greg Couch, 8 years ago

That's a shame assign_ss_ids won't work. The ss_id assignment rules are needed for the MMTF reader too.

The ss_ids from the mmCIF reader also distinguish between sequence-adjacent secondary structure elements of the same type. The PDB reader could be fixed to follow the numbering in the PDB file too. In the 2gbp case, there is a second strand 1 in sheet II. Perhaps ss_id=1 should get all three secondary structure elements: 1 helix and 2 strands.

If the DSSP code won't number strands the way PDB numbers strands, then isn't the bug in the DSSP code? Or the code that uses the DSSP output?

comment:8 by Eric Pettersen, 8 years ago

ss_id has nothing to do with the input file. You seem to be hung up on that. ss_id starts at 1 for each type and increases separately for each type from N->C terminus. PDB and DSSP assign ss_ids in exactly the same way. We can discuss this tomorrow.

--Eric

comment:9 by Eric Pettersen, 8 years ago

"PDB" in the last comment means "the PDB reader" not "the PDB file"

comment:10 by Greg Couch, 8 years ago

I'm getting hung up on the ss_id that is used in atom specs not being the published secondary structure identifier. Once DSSP is run, who cares.

comment:11 by Eric Pettersen, 8 years ago

Resolution: fixed
Status: changed from assigned to closed

ss_ids are now normalized before first use: 0 for coil, helix/strand elements count separately from 1 in N->C order, per chain.
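A minimal Python sketch of the normalization rule described above, for illustration only. The `normalize_ss_ids` function and the `(chain_id, ss_type, raw_ss_id)` tuple layout are assumptions made for this example and do not reflect the actual ChimeraX implementation. It also assumes residues arrive in N->C order within each chain, and that a change in the reader-supplied id (or an intervening coil residue) marks the start of a new element.

```python
COIL, HELIX, STRAND = "coil", "helix", "strand"


def normalize_ss_ids(residues):
    """Renumber ss_ids: 0 for coil; helices and strands are counted
    separately from 1 in N->C order, restarting for each chain.

    residues: iterable of (chain_id, ss_type, raw_ss_id) tuples, in
    N->C order within each chain (hypothetical layout for this sketch).
    """
    normalized = []
    counters = {}    # (chain_id, ss_type) -> number of elements seen so far
    previous = None  # (chain_id, ss_type, raw_ss_id) of the current element
    for chain_id, ss_type, raw_id in residues:
        if ss_type == COIL:
            normalized.append(0)
            previous = None
            continue
        current = (chain_id, ss_type, raw_id)
        if current != previous:
            # A new element starts: bump the counter for this chain and type.
            key = (chain_id, ss_type)
            counters[key] = counters.get(key, 0) + 1
            previous = current
        normalized.append(counters[(chain_id, ss_type)])
    return normalized


# Toy input loosely modeled on this ticket: the first strand in sequence
# carries raw id 2 because the file lists it second.
example = [
    ("A", STRAND, 2), ("A", STRAND, 2), ("A", COIL, 0),
    ("A", HELIX, 1), ("A", HELIX, 1), ("A", COIL, 0),
    ("A", STRAND, 1), ("A", HELIX, 2), ("A", HELIX, 3),
]
print(normalize_ss_ids(example))  # [1, 1, 0, 1, 1, 0, 2, 2, 3]
```

Note that the last two helix residues are sequence-adjacent but keep distinct ids, because the reader-supplied ids differ.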
